One of the main obstacles to NLP research in the clinical domain is data access. On this page, we will
assemble links to existing data sets (both raw and annotated) that are currently available to the general
MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with >40,000 critical care patients. In addition to structured clinical data (demographics, vital signs, laboratory tests, medications, etc.), it contains over 2 million free-text notes from nurses, physicians, specialists, and more.
In an effort to provide annotated data for a variety of NLP tasks in the clinical domain, the i2b2 (Informatics for Integrating Biology and the Bedside) project has organized a yearly series of shared tasks, starting in 2006. Each year, several hundred fully deidentified notes from the Research Patient Data Repository at Partners HealthCare are annotated for that year's task and released to the research community. To date, these efforts have covered a variety of tasks, including de-identification, named entity and relation extraction, negation and modality, co-reference resolution, and temporal information extraction, among others. The notes from each i2b2 shared task are released to the research community at large, under the appropriate data use agreements, on the one-year anniversary of the task's completion. The data from previous shared tasks up through 2014 are available as i2b2 NLP Research Data Sets from the i2b2 project website.
A number of teams have worked with the i2b2/n2c2 data to create annotations for other NLP tasks. These will be made available as i2b2 spin-off tasks from n2c2. The first spin-off task focuses on concept normalization and expands the scope of previous clinical normalization tasks (ShARe/CLEF eHealth 2013 Task 1, SemEval-2014 Task 7 and SemEval-2015 Task 14).
The Sharing Annotated Resources (ShARe) / Conference and Labs of the Evaluation Forum (CLEF) eHealth labs included shared tasks on disease/disorder named entity recognition, normalization of named entities to the Unified Medical Language System (UMLS), and disease/disorder template filling.
Several shared tasks in the clinical domain have been organized as a part of the yearly SemEval competitions. These include:
The data for SemEval shared tasks typically becomes available after the tasks are completed.
Medical Natural Language Processing for Clinical Document (MedNLPDoc) has run three shared tasks on processing Japanese clinical records. The tasks included named entity recognition, term normalization, and International Classification of Diseases (ICD) disease name identification.