00.Datasets01 (A C) - sporedata/researchdesigneR GitHub Wiki

Select datasets

This section focuses on a few select healthcare datasets that hold special value in patient-centered research:

  • 200ks_Med_ResPaper_Abstracts (USA) #public - Consists of 200,000 abstracts for NLP and Sequential Sentence Classification Problems.
  • 4DN-DP (USA) #public - The 4D Nucleome Data Portal is a publicly accessible repository for genomic and microscopic datasets related to nuclear architecture.
  • AARP (USA) - The American Association of Retired Persons (AARP)'s Livability Index assesses the livability of a community.
  • ACAG (USA) - The Atmospheric Composition Analysis Group applies satellite observations, global models, and in situ measurements to improve understanding of the processes controlling air quality, climate, and biogeochemical cycling.
  • AccessClinicalData (USA) - Enables access to and sharing of data sets and reports from NIAID COVID-19 and other sponsored clinical trials.
  • AC-MRI (USA) - Consists of 2,888 clinical MRIs of patients admitted with acute or early subacute stroke, including diverse protocols and MRI modalities with typical clinical resolution.
  • ACRD (USA) - Archived Clinical Research Datasets is a data repository that houses studies and trials in neurological areas funded by the NINDS Division of Clinical Research (DCR).
  • ACS (USA) - The American Community Survey (ACS) is the most comprehensive source of demographic and housing data for the United States.
  • ADDEP (USA) - The Archive of Data on Disability to Enable Policy and research.
  • ADDI (USA) - The Alzheimer’s Disease Data Initiative (AD Data Initiative or ADDI) is a cloud-based platform for the management and sharing of human or human-derived Alzheimer’s and related dementia data.
  • ADI (USA) - The Area Deprivation Index (ADI) ranks neighborhoods in a particular region based on socioeconomic disadvantage, considering social determinants of health (SDoH).
  • ADKP (USA) - The AD Knowledge Portal is an NIH-recognized repository and the primary distribution hub for multi-omic data derived from human samples and model systems.
  • AgingResearchBiobank (USA) - Designed to provide a state-of-the-art inventory system for storing and maintaining biospecimens and associated data on aging and clinical trials, and for sharing them with the broader scientific community.
  • AHA (USA) #private - The American Hospital Association (AHA) data are hospital-level data designed to supplement the data elements in the SID, SASD, and SEDD databases.
  • AHRF (USA) - The Area Health Resources Files (AHRF) comprises data on economics, environment, health care professions, health facilities, health professions training, hospital expenditures, hospital use, and population demographics.
  • AHRQ (USA) - The Agency for Healthcare Research and Quality (AHRQ) is the primary federal agency charged with improving the quality and safety of the American healthcare system.
  • Air-Quality-and-Meteorological-Information-of-Chile (Chile) - Compiles air quality data from the National Air Quality System (S.I.N.C.A.).
  • ALFA (USA) - The NCBI Allele Frequency Aggregator (ALFA) was designed to make allele frequency datasets from dbGaP studies the largest, free, and most complete aggregated variant datasets available.
  • AllOfUs (USA) - The All of Us Research Program stands as a major biomedical data resource of unparalleled scale.
  • AMDS (Netherlands) - Amsterdam's Medical Data Science contains data from the clinical patient data management system of the Amsterdam University Medical Center's intensive care unit (ICU).
  • AMP-PD (USA) - The Accelerating Medicines Partnership – Parkinson's Disease is focused on discovering and validating the most promising biological targets for drug development.
  • ANA (Brazil) - Brazil National Water Agency
  • APCDs (USA) - The All-Payer Claims Databases comprise data from multiple payer sources, thus leveraging data from within the insurance claims and reimbursement system.
  • AphasiaBank (USA) - A collaborative repository containing multimedia interactions aimed at researching communication in individuals with aphasia.
  • ARS (Italy) - Agenzia Regionale di Sanità della Toscana is the technical and scientific consultant agency to the regional council and government of Tuscany.
  • ARSA (USA) - The Atlas of Rural and Small-Town America (ARSA) provides data by broad categories of socioeconomic factors: county classification, income, jobs, people, and veterans.
  • ASCQ-Me (USA) - The Adult Sickle Cell Quality of Life Measurement developed a set of self-report measures for use with adults with sickle cell disease.
  • AYA-HOPE (USA) - The Adolescent & Young Adult Health Outcomes & Patient Experience Study (AYA HOPE).
  • Base de Datos de Facil Acceso del Censo 2017 de Chile (Chile) - 2017 Chilean Census Easy Access Database provides convenient access to more than 17 million records from the 2017 Census database.
  • BDC (USA) - The BioData Catalyst (BDC) is designed to be agile and responsive to the ever-changing biomedical science and data community conditions.
  • Bengali_Medical_Dataset (Bangladesh, India) #public - The Bengali Medical Dataset
  • BIFAP (Spain) - The Base de Datos para la Investigación Farmacoepidemiológica en Atención Primaria is the data resource for pharmacoepidemiological research in Spain.
  • BIL (USA) - The Brain Image Library is a national public resource that enables researchers to deposit, analyze, interact, mine, and share large brain image datasets.
  • BindingDB (USA) - BindingDB is an open, centralized, web-based repository primarily focused on cataloging measured binding affinities.
  • BioASQ (USA) - BioASQ Challenge Data comes from BioASQ, a challenge on large-scale biomedical semantic indexing and question answering (QA).
  • BioBERT_QA_Model (USA) #public - BioBERT-based extractive question-and-answering model, finetuned on SQuAD 2.0.
  • BioLINCC (USA) - The Biologic Specimen and Data Repository Information Coordinating Center.
  • BioPortal (USA) - BioPortal is the most expansive integrated repository of global biomedical ontologies and controlled terminologies.
  • BioSystics-AP (USA) - The BioSystics Analytics Platform is designed to store, analyze, and share complex multimodal datasets from in vitro 2D- and 3D-models.
  • BioVU (USA) - Vanderbilt’s de-identified DNA data bank
  • BossDB (USA) - The Brain Observatory Storage Service & Database (BossDB) is a scalable, open-data ecosystem for storing, accessing, and processing multidimensional and volumetric 3D and 4D neuroscience datasets.
  • BRFSS (USA) - Behavioral Risk Factor Surveillance System is the primary system of health-related telephone surveys that collects data on healthcare access related to chronic diseases and injury, health-related risk behaviors, preventive health practices, and social determinants of health (SDoH).
  • Broadband Deployment Data (USA) - Used to develop broadband networks or infrastructure through which broadband services can be delivered.
  • BV-BRC (USA) - The Bacterial and Viral Bioinformatics Resource Center is an information system designed to support the biomedical research community’s work on bacterial and viral infectious diseases via the integration of vital pathogen information.
  • CAHPS(R) Database (USA) - The Consumer Assessment of Healthcare Providers and Systems Database is AHRQ's data repository for selected CAHPS surveys, aimed at facilitating comparisons of CAHPS survey results by and among survey users.
  • caNanoLab (USA) - The cancer Nanotechnology Laboratory (caNanoLab) portal is a data sharing platform created to facilitate information sharing within the global community of biomedical nanotechnology researchers.
  • CAPER (USA) - The Comprehensive Ambulatory Professional Encounter Record (CAPER) is the only source of direct care outpatient clinical data and is the replacement for the Standard Ambulatory Data Record (SADR) dataset.
  • CART (USA) - The Clinical Assessment Reporting and Tracking (CART) contains patient and facility procedural information on those undergoing invasive cardiac procedures at VA facilities.
  • Caserta (Italy) - Contains individual claims databases with information on NHS-covered healthcare services.
  • CASI (USA) - The Clinical Abbreviation Sense Inventory (CASI) is a dataset for medical term disambiguation.
  • CCDI-CCDC (USA) - The Childhood Cancer Data Initiative's Childhood Cancer Data Catalog (CCDI-CCDC) is an inventory of pediatric oncology data resources.
  • CCDI-MTP (USA) - The Childhood Cancer Data Initiative's Molecular Targets Platform (CCDI-MTP) is a tool that supports the identification and prioritization of molecular targets expressed in childhood cancers.
  • CCHMC (USA) #Text - Cincinnati Children’s Hospital Medical Center (CCHMC) ICD-9 radiology corpus.
  • CDE (USA) - The Crime Data Explorer (CDE) collects information on violent and property crimes.
  • CDI (USA) - The Chronic Disease Indicators (CDI) is the sole comprehensive and integrated source for complete access to a wide range of indicators for state-level surveillance of chronic diseases, conditions, and risk factors, including overarching conditions that are social determinants of health (SDoH).
  • CDS (USA) - The Cancer Data Service (CDS) is home to both controlled and open access data and provides data storage and sharing capabilities for NCI-funded studies.
  • CDW (USA) #private - Veterans Health Administration’s (VHA) Corporate Data Warehouse (CDW) comprising data from the VHA information infrastructure and the Veterans Information Systems Technology Architecture.
  • CEDCD (USA) #public - Cancer Epidemiology Descriptive Cohort Database (CEDCD) is a searchable database that contains biospecimen information, cancer sites, general study information, number of participants diagnosed with cancer, and the type of data collected at baseline.
  • Census Tract (USA) - The Social Determinants of Health by US Census Tract comprises social determinants of health (SDoH) constructs for each US census tract as defined by 2010 census tract boundaries.
  • censusIncarceration (USA) - Shows the number of people incarcerated across the United States, per the 2000, 2010, and 2020 Decennial Census.
  • CFDE (USA) - The Common Fund Data Ecosystem (CFDE) is designed to enable researchers to access and interact with various data sets from multiple Common Fund (CF) programs within a cloud-based digital environment.
  • CHARLS (China) - China Health and Retirement Longitudinal Study
  • Chatbot (USA) #private - Contains information about general-purpose university inquiries, including a list of intents with patterns, responses, tags, and context sets.
  • CHILDES (USA) - Child Language Data Exchange System is mainly used for analyzing young children's language and adult speech directed to children.
  • CHSI (USA) - CDC's Community Health Status Indicators (CHSI) provides public health profiles for all 3,143 counties in the United States.
  • CIBMTR (USA) #public - The Center for International Blood and Marrow Transplant Research is a collaborative resource of data and experts supporting research in cellular therapies to improve patient outcomes.
  • CIL (USA) #public - The Cell Image Library (CIL) serves as a communal platform dedicated to collecting, storing, managing, and sharing extensive microscopy data.
  • CIMRD (USA) #private - The California Independent Medical Review Dataset (CIMRD) is derived from the California Department of Managed Health Care (DMHC).
  • CKD-SS (USA) - The Chronic Kidney Disease Surveillance System (CKD-SS) is a comprehensive, interactive, and systematic surveillance system that tracks the burden of chronic kidney disease (CKD) over time.
  • ClinicalTrials (USA) - ClinicalTrials.gov is a register and results database of clinical studies conducted worldwide and funded by public and private sources.
  • ClinVar (USA) - A publicly available archive of reports on the associations between human variations and phenotypes, along with supporting data.
  • CMDKP (USA) - Common Metabolic Diseases Knowledge Portal
  • CMS claims data (USA) - Centers for Medicare and Medicaid Services
  • CMS Cost Reports (USA) - Comprises files with comprehensive cost center information from 2010-2021.
  • CMS Hospital Compare (USA) - Displays hospital performance data in a consistent, unified manner to ensure the availability of credible information about the care delivered in the nation’s hospitals.
  • CMS–Physician Compare (USA) - Provides useful information about the physicians and other healthcare professionals currently enrolled in Medicare.
  • COMETS (USA) #public - The Consortium of Metabolomics Studies represents a collaborative effort between extramural and intramural entities, fostering cooperation among prospective cohort studies.
  • Connect (USA) #public - The Connect for Cancer Prevention Study (“Connect”) is a prospective cohort of 200,000 adult patients aged 40-65 years.
  • Copernicus (EU) - The most ambitious Earth observation program headed by the European Commission (EC) and the European Space Agency (ESA).
  • CORD-19 (USA) #public - COVID-19 Open Research Dataset Challenge dataset comprises over a million scholarly articles, including over four hundred thousand with full text, on COVID-19, SARS-CoV-2, and related coronaviruses.
  • CORD-19_corpus-Mining (USA) #public - Mining CORD-19 corpus for biomedical associations dataset captures associations between different entities in the provided Kaggle corpus.
  • CoreNLP (USA) #public - CoreNLP is a comprehensive solution for natural language processing in Java.
  • COUGHVID-3 (USA) #private - An expert-labeled cough dataset that can be applied to a wide range of cough audio classification tasks.
  • County Health Rankings (USA) - County Health Rankings and Roadmaps measures important health characteristics in nearly every county in the United States.
  • COVID-19 vaccinations (USA) - COVID-19 Vaccinations in the United States, County
  • COVID-19_Graphs (USA) #public - Contains a 4D sequence encoding of SARS-CoV-2 sequences.
  • COVID-19_Translations #public - Contains translations of COVID-19 related documents.
  • COVID-19_Tweets (USA) #public - Contains tweets with hashtags associated with the coronavirus.
  • COVID-19_Tweets_India (India) #public - Contains day-wise aggregated tweets from the onset of the outbreak through July 30, 2020.
  • COVID-19_Xray #public - Contains manually drawn pixel-level lung segmentations, with and without COVID.
  • CPRD (UK) - Clinical Practice Research Datalink is the source of the largest research database in the UK with longitudinal, representative primary care data linked to data from other healthcare settings.
  • CRE (USA) - The Community Resilience Estimates (CRE) shows how vulnerable each community in the United States is to disasters.
  • CRISP (USA) - Chesapeake Regional Information System for our Patients
  • CVRG (USA) - The CardioVascular Research Grid.