Clinical Dataset for Machine Learning - clumsyspeedboat/Decision-Tree-Neo4j GitHub Wiki

Clinical Data

Clinical data is the key resource in health and medical research, and it is collected and managed in several ways. Below we briefly summarize four major types that are relevant to our research.

Electronic Health Records

These are records maintained at individual medical institutions (hospitals, clinics, etc.). They include administrative and demographic information, diagnoses, treatments, prescription drugs, laboratory tests, physiologic monitoring data, hospitalizations, patient insurance, etc. Such data is generally not available to outsiders for open research, but larger collaborators may obtain mediated or collaborative access.

Patient Disease Registries

For widespread chronic conditions such as cancer, diabetes, heart disease, asthma, and Alzheimer's disease, digital registries are sometimes kept to study a narrow aspect of the condition. They contain information that is critical for patient management.

Health Surveys

To properly assess the public health infrastructure of a population, surveys are often conducted by national health institutes and other associated departments. Non-governmental and private organizations also conduct surveys on population groups of interest and may make the data accessible for research purposes.

Clinical Trials & Clinical Research Datasets

Clinical research data may be available through national or discipline-specific organizations. Access is likely to be restricted but can be obtained through proper channels. Proprietary research data may also be available through individual agreements with private companies.

source: "Data Resources in Health Science", Health Sciences Library, https://guides.lib.uw.edu/hsl/data/findclin


Kaggle

Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

Kaggle's services

  • Machine learning competitions: this was Kaggle's first product. Companies post problems and machine learners compete to build the best algorithm.
  • Kaggle Kernels: a cloud-based workbench for data science and machine learning. Allows data scientists to share code and analysis in Python and R. Over 150K "kernels" (code snippets) have been shared on Kaggle covering everything from sentiment analysis to object detection.
  • Public datasets platform: community members share datasets. Has datasets on everything from bone x-rays to results from boxing bouts.
  • Kaggle Learn: for short-form AI education.
  • Jobs board: employers post machine learning and AI jobs.

source: "Kaggle", Wikipedia, https://en.wikipedia.org/wiki/Kaggle


Our preliminary dataset prerequisites:

  • Clinical diagnosis (patient data, pathogenic diagnosis data)
  • Versatile variables (different types of values in each column)
  • At least 100 meaningful instances (rows)
  • An interpretable class variable for classification
  • Hidden relationships interpretable as edges on a graph database
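
The measurable prerequisites above can be screened automatically before a candidate dataset is examined by hand. The following is a minimal sketch, not the project's actual tooling. It assumes the dataset is available as CSV text, that a row counts as "meaningful" when none of its cells are empty, that "versatile variables" means the feature columns include both numeric and categorical ones, and that an interpretable class variable has between 2 and `max_classes` distinct values. The function name `check_prerequisites` and the `max_classes` parameter are hypothetical. Whether a dataset is truly clinical, and whether it holds hidden relationships that can be modelled as graph edges, still needs manual review.

```python
import csv
import io

MIN_ROWS = 100  # "at least 100 meaningful instances"


def is_number(value):
    """Return True if a raw CSV cell parses as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def check_prerequisites(csv_text, class_column, max_classes=10):
    """Run simple pass/fail checks on a candidate dataset (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []

    # Assumption: a row is "meaningful" when every column has a value.
    rows = [r for r in reader if all(r.get(c) not in (None, "") for c in columns)]

    # A column is numeric only if every remaining value parses as a number.
    kinds = {}
    for c in columns:
        values = [r[c] for r in rows]
        all_numeric = bool(values) and all(is_number(v) for v in values)
        kinds[c] = "numeric" if all_numeric else "categorical"

    features = [c for c in columns if c != class_column]
    classes = {r[class_column] for r in rows} if class_column in columns else set()

    return {
        "enough_rows": len(rows) >= MIN_ROWS,
        # Assumption: "versatile" means more than one kind of feature column.
        "mixed_types": len({kinds[c] for c in features}) > 1,
        # Assumption: a usable class variable has a small number of labels.
        "usable_class": 2 <= len(classes) <= max_classes,
    }
```

Calling `check_prerequisites(text, "diagnosis")` on the contents of a downloaded CSV file returns a dictionary of booleans, so a dataset can be discarded early if any check fails. The column name `diagnosis` is only an example.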