One of the main obstacles to NLP research in the clinical domain is data access. On this page, we will assemble links to existing data sets (both raw and annotated) that are currently available to the general public.
MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with more than 40,000 critical care patients. In addition to structured clinical data (demographics, vital signs, laboratory tests, medications, etc.), it contains over 2 million free-text notes from nurses, physicians, specialists, and more.
In an effort to provide annotated data for a variety of NLP tasks in the clinical domain, the i2b2 (Informatics for Integrating Biology and the Bedside) project has organized a yearly series of shared tasks, starting in 2006. Each year, several hundred fully deidentified notes from the Research Patient Data Repository at Partners HealthCare are annotated for that year's task and released to the research community. To date, these efforts have covered a variety of tasks, including de-identification, named entity and relation extraction, negation and modality, co-reference resolution, temporal information extraction, and others. The notes from each i2b2 shared task are released under the appropriate data use agreements to the research community at large on the one-year anniversary of the task's completion. The data from previous shared tasks up through 2014 are available as i2b2 NLP Research Data Sets from the i2b2 project website.
A number of teams have worked with the i2b2/n2c2 data to create annotations for other NLP tasks. These will be made available as i2b2 spin-off tasks from n2c2. The first spin-off task focuses on [concept normalization](https://n2c2.dbmi.hms.harvard.edu/track3) and expands the scope of previous clinical normalization tasks (ShARe/CLEF eHealth 2013 Task 1, SemEval-2014 Task 7 and SemEval-2015 Task 14).
The Sharing Annotated Resources (ShARe) / Conference and Labs of the Evaluation Forum (CLEF) eHealth evaluation labs included shared tasks on disease/disorder named entity recognition, normalization of named entities to the Unified Medical Language System (UMLS), and disease/disorder template filling.
Several shared tasks in the clinical domain have been organized as a part of the yearly SemEval competitions. These include:
The data for SemEval shared tasks typically becomes available after the tasks are completed.
Medical Natural Language Processing for Clinical Document (MedNLPDoc) has run three shared tasks on processing Japanese clinical records. The tasks included named entity recognition, term normalization, and International Classification of Diseases (ICD) disease name identification.
This is a corpus of 43,985 clinical patient notes (PNs) written by 35,156 examinees during the high-stakes USMLE® Step 2 Clinical Skills examination. In this exam, examinees interact with standardized patients: people trained to portray simulated scenarios called clinical cases. For each encounter, an examinee writes a PN, which is then scored by physician raters using a rubric of clinical concepts, expressions of which should be present in the PN. The corpus features PNs from 10 clinical cases, as well as the clinical concepts from the case rubrics. A subset of 2,840 PNs was annotated by 10 physician experts such that all 143 concepts from the case rubrics (e.g., shortness of breath) were mapped to 34,660 PN phrases (e.g., dyspnea, difficulty breathing). The corpus is available via a data sharing agreement with NBME and can be requested at https://www.nbme.org/services/data-sharing.
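To make the concept-to-phrase mapping described above more concrete, here is a minimal Python sketch of how such annotations might be represented and grouped. It is illustrative only: the released file format is not described here, so the `PhraseAnnotation` class, its field names, the `phrases_by_concept` helper, and the toy notes are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseAnnotation:
    """One annotated span linking a rubric concept to text in a patient note.

    All field names are hypothetical; they are not the corpus's actual schema.
    """
    note_id: str   # identifier of the patient note (PN)
    case_id: int   # identifier of the clinical case
    concept: str   # rubric concept, e.g. "shortness of breath"
    start: int     # character offset where the phrase begins (inclusive)
    end: int       # character offset where the phrase ends (exclusive)


def phrases_by_concept(notes, annotations):
    """Collect the distinct surface phrases annotated for each rubric concept."""
    grouped = defaultdict(set)
    for ann in annotations:
        text = notes[ann.note_id]
        grouped[ann.concept].add(text[ann.start:ann.end].lower())
    return {concept: sorted(phrases) for concept, phrases in grouped.items()}


# Invented toy notes, not taken from the corpus.
notes = {
    "pn-1": "Pt reports dyspnea on exertion.",
    "pn-2": "Complains of difficulty breathing at night.",
}
annotations = [
    PhraseAnnotation("pn-1", 1, "shortness of breath", 11, 18),
    PhraseAnnotation("pn-2", 1, "shortness of breath", 13, 33),
]

print(phrases_by_concept(notes, annotations))
```

Storing character offsets rather than copied strings keeps each annotation tied to its exact position in the note, which is a common choice for span-level annotation but only an assumption about this corpus.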
Recent success in neural language models has led to the creation of language models for the clinical domain.