#00.Datasets - sporedata/researchdesigneR GitHub Wiki
The following sections describe datasets frequently used in clinical research, indices that might be derived from them, and a selection of data simulation packages.
00.Datasets01 (A-C)
00.Datasets02 (D-L)
00.Datasets03 (M-R)
00.Datasets04 (S-Z)
Indices that can be created from existing datasets
A score can be calculated only if all of its component variables are present in the dataset, unless the score description explicitly says otherwise.
- 5-Factor Modified Frailty Index - see New 5-Factor Modified Frailty Index Using American College of Surgeons NSQIP Data
- Association of claims-based assessment of care process to function and survival - see Association of Claims-Based Quality of Care Measures with Outcomes among Community-Dwelling Vulnerable Elders
- Frailty and quality of end-of-life care - see Preoperative Frailty Status and Intensity of End-of-Life Care Among Older Adults After Emergency Surgery
- Frailty index - frailty is a state of increased vulnerability to adverse outcomes; the underlying principle of the index is to count deficits in health. See Measuring Frailty in Medicare Data: Development and Validation of a Claims-Based Frailty Index
- Function index based on ICD-codes - see Beyond comorbidity: expanding the definition and measurement of complexity among older adults using administrative claims data
- Invasive Procedure Complexity Matrix (IPCM) - see Facility procedure complexity designation requirements to perform invasive procedures in any clinical setting
- Palliative quality of care for seriously ill surgical patients - see Defining Serious Illness Among Adult Surgical Patients
- Patient Function, Long-term Survival, and Use of Surgery for Kidney Cancer - see Patient Function, Long-term Survival, and Use of Surgery for Kidney Cancer
- PhenX Toolkit - see Patient Function, Long-term Survival, and Use of Surgery for Kidney Cancer
- Post hospitalization quality of care for high-acuity emergency general geriatric surgery - see Loss of Community-Dwelling Status Among Survivors of High-Acuity Emergency General Surgery Disease
- Quality of care - see Creating Unidimensional Global Measures of Physician Practice Quality Based on Health Insurance Claims Data
- Real-world effectiveness of geriatric oncology therapies - see Geriatric oncology health services research: Cancer and Aging Research Group Infrastructure Core
- SF-12 to QALYs - see The estimation of a preference-based measure of health from the SF-12
- Use of machine learning to generate mental health scores - see Forecasting Mental Distress using Healthcare Claims Data
- VA Frailty index - see Development and Initial Validation of the Risk Analysis Index for Measuring Frailty in Surgical Populations
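The rule that every component variable must be present, and the deficit-counting principle behind frailty indices, can be illustrated with a minimal Python sketch. The component names and the unweighted proportion-of-deficits formula below are assumptions chosen for illustration; they are not the scoring rules of any of the indices listed above, each of which defines its own components and weights in the cited paper.

```python
from typing import Iterable, List, Mapping


def missing_components(record: Mapping[str, object], components: Iterable[str]) -> List[str]:
    """Return the component variables that are absent (or None) in a record."""
    return [name for name in components if record.get(name) is None]


def deficit_index(record: Mapping[str, object], components: Iterable[str]) -> float:
    """Proportion of deficits present: 0.0 means none, 1.0 means all.

    Raises ValueError when any component variable is missing, mirroring the
    rule that a score needs all of its components.
    """
    components = list(components)
    missing = missing_components(record, components)
    if missing:
        raise ValueError(f"cannot compute score; missing components: {missing}")
    return sum(1 for name in components if record[name]) / len(components)


# Hypothetical component variables (1 = deficit present, 0 = absent).
COMPONENTS = ["diabetes", "copd", "heart_failure", "hypertension", "dependent_function"]

patient = {
    "diabetes": 1,
    "copd": 0,
    "heart_failure": 0,
    "hypertension": 1,
    "dependent_function": 0,
}

print(deficit_index(patient, COMPONENTS))  # prints 0.4 (2 of 5 deficits present)
```

Failing loudly on a missing component, rather than silently treating it as "no deficit", avoids scores that look valid but are biased toward zero.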
Simulated datasets - packages and libraries
- ar.matrix: Simulate Auto-Regressive Data from Precision Matricies - used to simulate data that are auto-regressive.
- cpsurvsim: Simulating Survival Data from Change-Point Hazard Distributions - simulates time-to-event data with type I right censoring using the inverse CDF and memoryless methods.
- dgmb: Simulating Data for PLS Mode B Structural Models - performs Monte Carlo simulations on structural models with formative constructs and interaction and nonlinear effects.
- fabricatr - simulates hierarchical data structures and correlated data, either from random number generators or by resampling from existing data sources.
- fakeR: Simulates Data from a Data Frame of Different Variable Types - simulates time-independent and time-dependent data, from a dataset including different variable types.
- glmdm: R Code for Simulation of GLMDM - fits generalized linear mixed Dirichlet models by simulating from the posterior.
- GlmSimulatoR: Creates Ideal Data for Generalized Linear Models - creates ideal data for conventional and novel generalized linear models.
- graphsim: Simulate Expression Data from 'igraph' Networks - functions to develop simulated continuous data (e.g., gene expression) from a sigma covariance matrix.
- hapsim: Haplotype Data Simulation - used for haplotype-based genotype simulations.
- holodeck: A Tidy Interface for Simulating Multivariate Data - used to create simulated multivariate data sets with groups of variables with different degrees of variance, covariance, and effect size.
- InterSIM: Simulation of Inter-Related Genomic Datasets - used to generate inter-related genomic datasets of methylation, gene expression, and protein expression.
- LOST: Missing Morphometric Data Simulation and Estimation - used to simulate and estimate missing morphometric data.
- Mediana: Clinical Trial Simulations - has a general framework for clinical trial simulations based on the Clinical Scenario Evaluation (CSE) approach.
- missDeaths: Simulating and Analyzing Time to Event Data in the Presence of Population Mortality - provides a nonparametric risk adjustment, a data imputation method, and object-oriented survival data simulation functions.
- MixSim: Simulating Data to Study Performance of Clustering Algorithms - used to simulate mixtures of Gaussian distributions with different levels of overlap between mixture components.
- mlxR: Simulation of Longitudinal Data - used for the simulation and visualization of complex models for longitudinal data.
- pensim: Simulation of High-Dimensional Data and Parallelized Repeated Penalized Regression - used for the simulation of continuous data correlated with time to an event.
- PermAlgo: Permutational Algorithm to Simulate Survival Data - used to obtain a dataset in which event and censoring times are conditional on a user-specified list of covariates.
- sdglinkage: Synthetic Data Generation for Linkage Methods Development - used for linkage method development.
- simcdm: Simulate Cognitive Diagnostic Model ('CDM') Data - used to simulate cognitive diagnostic model data for the Deterministic Input, Noisy "And" Gate (DINA) model and the reduced Reparameterized Unified Model (rRUM).
- SimCorrMix: Simulation of Correlated Data with Multiple Variable Types Including Continuous and Count Mixture Distributions - used to simulate data sets that mimic real-world clinical or genetic data sets.
- simExam: Generate Simulated Data for IRT-Enabled Exams - used to generate binary test data based on Item Response Theory using the two-parameter logistic model.
- simITS: Analysis via Simulation of Interrupted Time Series (ITS) Data - used to create prediction intervals for post-policy outcomes in interrupted time series (ITS) designs.
- SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types - used to simulate datasets that mimic real-world situations (i.e., clinical or genetic).
- simPop: Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information - used to simulate populations for surveys based on auxiliary data.
- simrel: Simulation of Multivariate Linear Model Data - simulates linear model data with a wide range of properties using few tuning parameters; useful for model comparison, testing, and many other purposes.
- SimSCRPiecewise: Simulates Univariate and Semi-Competing Risks Data Given Covariates and Piecewise Exponential Baseline Hazards - simulates survival data from piecewise exponential hazards with a proportional hazard adjustment for covariates.
- simstudy: Simulation of Study Data - simulates data sets to explore modeling techniques or better understand data generating processes.
- simsurv: Simulate Survival Data - simulates survival times from standard parametric survival distributions, 2-component mixture distributions, or a user-defined hazard or log cumulative hazard function.
- simTargetCov: Data Transformation or Simulation with Empirical Covariance Matrix - simulates data with a target empirical covariance matrix supplied by the user.
- SimTimeVar: Simulate Longitudinal Dataset with Time-Varying Correlated Covariates - flexibly simulates a dataset with time-varying covariates (normal or binary, and optionally static) with user-specified exchangeable correlation structures across and within clusters.
- skimr: A frictionless, pipeable approach to dealing with summary statistics - useful for creating data dictionaries; follows the principle of least surprise by showing summary statistics that the user can scan quickly to understand their data.
- synthACS: Synthetic Microdata and Spatial MicroSimulation Modeling for ACS Data - used to build synthetic micro-datasets at any user-specified geographic level, conduct spatial microsimulation modeling (SMSM), and provide functionality for data-extensibility of micro-datasets.
- synthpop: R package for generating synthetic versions of sensitive microdata for statistical disclosure control - used to create synthetic versions of confidential individual-level data for use by researchers interested in making inferences about the population that the data represent.
- Synthea: Synthetic Patient Generation - for generation of simulated EHR data
- SynthNotes: a clinical note generation tool - for the generation of simulated free text medical notes
- TimeGAN: Time-series Generative Adversarial Networks - synthetic time-series data generation
- viridis color maps: R package to make pretty plots to facilitate readability - a user-friendly package that provides a series of color maps designed to improve graph readability for readers with color vision deficiency, including color blindness.
- Wakefield - quickly generate random data sets - a user-friendly package for generating multiple types of datasets.
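Several of the survival-data packages above rely on the inverse CDF method combined with type I right censoring. The Python sketch below shows that idea for the simplest case, a single constant (exponential) hazard. It is an illustrative assumption-laden example, not the API or implementation of cpsurvsim, simsurv, or any other listed package; the function name and parameters are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple


def simulate_exponential_survival(
    n: int, rate: float, censor_time: float, seed: Optional[int] = None
) -> List[Tuple[float, int]]:
    """Simulate (observed_time, event) pairs under a constant hazard `rate`.

    Inverse CDF method: for U ~ Uniform(0, 1), T = -log(1 - U) / rate has an
    exponential distribution with the given rate.
    Type I right censoring: follow-up stops at the fixed time `censor_time`,
    so any T beyond it is recorded as censored (event = 0) at `censor_time`.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random()                # u in [0, 1), so 1 - u is in (0, 1]
        t = -math.log(1.0 - u) / rate   # invert the exponential CDF
        event = int(t <= censor_time)   # 1 = event observed, 0 = censored
        rows.append((min(t, censor_time), event))
    return rows


# Hypothetical settings: hazard 0.1 per month, 12 months of follow-up.
data = simulate_exponential_survival(n=1000, rate=0.1, censor_time=12.0, seed=42)
event_fraction = sum(event for _, event in data) / len(data)
print(f"observed events: {event_fraction:.1%}")
```

With these settings the expected event fraction is 1 - exp(-0.1 × 12), roughly 70%; the simulated fraction varies around that value with the seed. Change-point or piecewise hazards extend the same idea by inverting a cumulative hazard that changes slope at the chosen time points.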
Data processing methods
De-identification (anonymization)
Reproducible research
- Introduction to renv
- Style guide
- JLV and Research Part 1: Getting Started
- JLV and Research Part 2: A Deeper Look
- Updated Guidance on the Reporting of Race and Ethnicity in Medical and Science Journals
- All of Us Research Program
- mturk