Research Reference Datasets

Large-scale Clinical Data Reference Dataset

Instructors often struggle to find readily available “big data” resources suitable for informatics instruction. To address this issue, Dr. Kathy Bobay collaborated with the Informatics and Clinical Research (ICR) team to develop a curated, fully de-identified “big data” resource for use in instructional activities. This resource (link to dataset documentation provided below) was completed in April 2020.

The “Big Data Clinical Reference Dataset” consists of a curated set of fully de-identified longitudinal clinical data. One unique aspect of the dataset is that, beyond the usual structured data elements, it also contains a full set of National Library of Medicine (NLM) Unified Medical Language System (UMLS) concept unique identifiers (CUIs) and semantic type identifiers (TUIs). These identifiers were produced through large-scale natural language processing (NLP) of the clinical reports associated with the dataset’s structured elements.
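
As a rough illustration of how NLP-derived concept identifiers might be used alongside structured elements in a classroom exercise, the Python sketch below links a few made-up encounter rows to made-up annotation rows. The field names (encounter_id, patient_id, report_id, cui, tui) and both helper functions are hypothetical and are not taken from the dataset documentation; the actual schema may differ. The CUIs and TUI shown are standard UMLS identifiers used only as sample values.

```python
from collections import defaultdict

# Hypothetical structured rows; field names are illustrative assumptions,
# not the dataset's actual schema.
encounters = [
    {"encounter_id": 1, "patient_id": "P001", "report_id": "R10"},
    {"encounter_id": 2, "patient_id": "P002", "report_id": "R11"},
]

# Hypothetical NLP output: one row per concept found in a clinical report.
# C0011849 = Diabetes Mellitus, C0020538 = Hypertensive disease,
# T047 = Disease or Syndrome (UMLS semantic type).
annotations = [
    {"report_id": "R10", "cui": "C0011849", "tui": "T047"},
    {"report_id": "R10", "cui": "C0020538", "tui": "T047"},
    {"report_id": "R11", "cui": "C0011849", "tui": "T047"},
]


def concepts_by_encounter(encounters, annotations):
    """Map each encounter_id to the sorted CUIs found in its linked report."""
    by_report = defaultdict(set)
    for row in annotations:
        by_report[row["report_id"]].add(row["cui"])
    return {
        e["encounter_id"]: sorted(by_report.get(e["report_id"], set()))
        for e in encounters
    }


def encounters_with_cui(encounters, annotations, cui):
    """Return the encounter_ids whose linked report mentions the given CUI."""
    linked = concepts_by_encounter(encounters, annotations)
    return [eid for eid, cuis in linked.items() if cui in cuis]
```

In an exercise, students could call encounters_with_cui(encounters, annotations, "C0020538") to find the encounters whose reports mention hypertensive disease, then compare that cohort with one built from structured fields alone.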

Note: This dataset is intended for instructional activities only and is NOT suited for actual clinical research. The underlying data are a selected subset and may contain some synthetic components.

Common uses of the resource:

This reference dataset is intended for instructional activities related to clinical informatics, biostatistics, and data science.

Resource available to the following users:

The dataset is available for use by Loyola University Chicago faculty and students.

Requests for access require:

Use is contingent upon execution of an Institutional Review Board (IRB) application and data use agreement.

Current resources:

National Library of Medicine (NLM)
Unified Medical Language System (UMLS)

Reference dataset contacts:

For information about this resource or to request access, please contact Dr. Kathy Bobay of the Parkinson School of Health Sciences and Public Health.

Last Modified: Tue, November 7, 2023 12:00 PM CST