18 Open Healthcare and Medical Datasets for Machine Learning

Article by Rei Morikawa | March 12, 2019

We’re continuing our series of articles on open datasets for machine learning. If you missed the previous articles, check out our finance and economics datasets, natural language processing datasets, and more.

This article features life sciences, healthcare, and medical datasets. Machine learning has many potential applications in healthcare, and is already being used to provide economical solutions and medical diagnosis software. At a time when many first-world countries are facing aging and declining populations, machine learning could help us provide better care for the elderly.


General Life Sciences, Healthcare and Medical Datasets

HealthData.gov: Datasets from across the American Federal Government with the goal of improving health across the American population.

Big Cities Health Inventory Data Platform: Health data from 26 cities, for 34 health indicators, across 6 demographic indicators.

Chronic Disease Data: Data on chronic disease indicators throughout the US.

Human Mortality Database: Mortality and population data for over 35 countries.

MHealth (Mobile Health) Dataset: Body motion and vital signs recordings for ten volunteers with diverse profiles, captured while they performed physical activities.

Medicare Provider Utilization and Payment Data: Data on services and procedures that physicians and other healthcare professionals provided to Medicare beneficiaries.

Life Science Database Archive: An archive that maintains datasets generated by life scientists in Japan in a long-term and stable state as national public goods. It makes it easier to search datasets by metadata in a unified format, and to access and download the datasets under clear terms of use.


Image Datasets for Life Sciences, Healthcare and Medicine

OASIS: The Open Access Series of Imaging Studies (OASIS) is a project that compiles neuroimaging datasets of the brain and distributes them freely to the scientific community, with the hope of aiding future discoveries in basic and clinical neuroscience.

OpenfMRI: Magnetic resonance imaging (MRI) datasets openly available to the research community.

ADNI: Alzheimer’s Disease Neuroimaging Initiative (ADNI) researchers collect several types of data from volunteer study participants. The data is available for free to authorized investigators, but requires an application and prior approval.


Genome Datasets

GEO Datasets: This database stores curated gene expression datasets, as well as original series and platform records in the Gene Expression Omnibus (GEO) repository.

1000 Genomes Project: The 1000 Genomes Project is an international collaboration which has established the most detailed catalog of human genetic variation. The final phase of the project sequenced over 2,500 individuals from 26 different populations around the world.

Genome in a Bottle: Dataset includes several reference genomes to enable translation of whole human genome sequencing to clinical practice.


Hospital Datasets

Medicare Hospital Quality: Official datasets used on the Medicare.gov Hospital Compare Website provided by the Centers for Medicare & Medicaid Services. These data allow you to compare the quality of care at over 4,000 Medicare-certified hospitals across the country.

Healthcare Cost and Utilization Project (HCUP): Datasets contain encounter-level information on inpatient stays, emergency department visits, and ambulatory surgery in US hospitals.

MIMIC Critical Care Database: MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising de-identified health data associated with approximately 40,000 critical care patients. The dataset includes demographics, vital signs, laboratory tests, medications, and more.
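As a rough illustration of what working with encounter-level hospital data can look like, here is a minimal Python sketch that counts encounters and averages patient age per care setting. The sample rows are synthetic, and the column names (encounter_id, setting, age) are assumptions made for this example; they are not the actual schema of HCUP, MIMIC, or any other dataset listed here.

```python
import csv
import io
from collections import Counter

# Synthetic stand-in for an encounter-level extract.
# The column names below are hypothetical, not a real HCUP or MIMIC schema.
SAMPLE_CSV = """encounter_id,setting,age
1,inpatient,71
2,emergency,34
3,inpatient,65
4,ambulatory_surgery,52
5,emergency,80
"""


def count_by_setting(csv_text):
    """Count how many encounters fall into each care setting."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["setting"] for row in reader)


def mean_age_by_setting(csv_text):
    """Compute the mean patient age for each care setting."""
    reader = csv.DictReader(io.StringIO(csv_text))
    totals = Counter()
    counts = Counter()
    for row in reader:
        totals[row["setting"]] += int(row["age"])
        counts[row["setting"]] += 1
    return {setting: totals[setting] / counts[setting] for setting in counts}


if __name__ == "__main__":
    print(count_by_setting(SAMPLE_CSV))
    print(mean_age_by_setting(SAMPLE_CSV))
```

With a real extract, you would replace the sample string with the contents of the downloaded file and adjust the column names to match that dataset's documentation.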


Cancer Datasets

SEER cancer incidence: Cancer incidence data segmented by demographic groups such as age, race, and gender, provided by the US government.

Broad Institute Cancer Program Datasets: Data categorized by project, such as brain cancer, leukemia, and melanoma.

CT Medical Images: This dataset contains a small set of CT scan images of cancer patients. The images are annotated with age, modality, and contrast tags.

Still can’t find what you need? Lionbridge AI can provide you with a custom machine learning dataset that fits your needs exactly. We have over 500,000 contributors, and Lionbridge AI manages the entire process from designing a custom workflow to sourcing qualified workers for your project.

The Author
Rei Morikawa

Rei writes content for Lionbridge’s website, blog articles, and social media. Born and raised in Tokyo, but also studied abroad in the US. A huge people person, and passionate about long-distance running, traveling, and discovering new music on Spotify.

