Reproducible research - sparklabnyc/resources GitHub Wiki
Intro to reproducible research
Overview
Scientific findings and evidence are strengthened if they can be replicated and confirmed by other researchers. Transparency in research, through documentation and open access, allows others to reproduce and add to your results. New researchers can reuse data or code from prior projects to verify old findings or develop new analyses.
Definitions
Reproducible: A study is reproducible when independent researchers can use the same data and code provided by the original authors to obtain the same results.
Replicable: A study is replicable when independent researchers collect new data and follow the same experimental or analytical procedures as the original study and still arrive at the same findings.
Why does reproducibility matter
Reproducibility isn't just about being able to rerun code. It's about being able to build on the foundation that the study establishes. Reproducibility promotes scientific integrity, collaboration, and equity.
Benefits and risks
There are many benefits to having your study be reproducible:
- Verification: others can independently check results and methods
- Reuse: code and data can be used to support new studies
- Peer access: reviewers and other researchers can evaluate the work directly, which can lead to faster peer review and more citations
- Educational utility: trains students and emerging researchers
But it also comes with some drawbacks:
- Time and effort: writing reproducible code takes more time and planning, and documentation requires extra effort and guidance
- Data sensitivity: not all data can be shared, due to restrictions and guidelines
- Fear of error exposure: some researchers may hesitate to share code that could reveal mistakes
Open research
Open research is the practice of making research data and code publicly accessible. It aims to make research more reproducible, transparent, and collaborative.
To achieve this, each element of the research process should:
- Be publicly available: it is difficult to use and benefit from knowledge hidden behind barriers, so make the code and data accessible.
- Be reusable: research methods need to be licensed appropriately, so that prospective users know any limitations on reuse.
- Be transparent: provide clear statements of how the research findings were produced and what they contain.
Open research encourages collaboration and the continuation of existing work, and documenting your work as you go is a good habit. It spans several practices:
- Open data: documenting and sharing research data openly.
- Open source Software: documenting research code and methods.
- Open hardware: documenting designs, materials, and other relevant information related to hardware.
- Open access: making all published outputs accessible for use and impact.
- Open notebooks: an emerging practice in which researchers make their entire research process publicly available.
Version control
The work of many contributors in a group needs to be managed into a single set of shared working documents. The management of changes or additions made to a file or project is called versioning. Reproducibility requires providing the code and data used to create a figure. In practice, the data and code change regularly, so you must record when these changes are made and why. Version control is a method for recording changes made to a file so you and your collaborators can track its history and review any modifications. Git, hosted on a platform such as GitHub, is a powerful tool for version control and is recommended because it allows contributors to see versions and changes, as well as make additions.
Version control is essential for collaborative projects where many people work on the same data or code at the same time and build on each other's work. With a version control system, the changes made by different people can be tracked and often merged automatically, saving a significant amount of manual effort. Your findings will be easier to reproduce and build upon. Additionally, version control hosting services like GitHub, GitLab, and others offer structured ways to communicate and collaborate, such as through pull requests, code reviews, and issues.
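The day-to-day Git workflow can even be driven from R itself. The sketch below uses the gert package, which is our own illustrative choice rather than something this wiki prescribes; the equivalent command-line git commands or the RStudio Git pane work just as well.
# Minimal sketch of a version-control workflow using the gert package
# (an illustrative choice; command-line git or the RStudio Git pane do the same)
library(gert)
git_status()                                                # see which files have changed
git_add("02_code/2a_data_prep/a_03_add_lags_to_floods.R")   # stage a changed script
git_commit("Add lagged flood exposure variables")           # record what changed and why
git_push()                                                  # share the change with collaborators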
Code reproducibility
Introduction
Code reproducibility exemplars
This guide provides an in-depth look at how to make research code reproducible using GitHub. We provide a real-world exemplar drawn from the USA floods mortality GitHub repository, which exemplifies best practices in reproducible research, from environment setup and code organization to documentation.
Reproducible environments
Ensuring that others can run your code in the same environment is key to reproducibility. A reproducible environment keeps package versions and system dependencies consistent, so your code runs the same way on different machines and does not break when package versions change.
#1.Load packages on CRAN
#1a.Add new packages here, as necessary
list.of.packages = c('acs','BiocManager','dlnm','dplyr','ecm','Epi','fiftystater','foreign', 'fst','ggpubr','ggplot2', 'graph','graticule',
'haven','here', 'janitor','lubridate', 'mapproj','maptools','mapview','MetBrewer','pipeR','raster',
'RColorBrewer','readxl', 'rgdal', 'rgeos','rnaturalearth','rnaturalearthdata','scales', 'sf','sp','sqldf', 'survival','splines',
'table1', 'tidycensus', 'tidyverse', 'totalcensus', 'usmap','zipcodeR','zoo', 'INLA', 'Rgraphviz','fmesher')
#1b.Check if list of packages is installed. If not, it will install ones not yet installed
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) invisible(install.packages(new.packages,repos = "https://cloud.r-project.org"))
#1c.load packages
invisible(lapply(list.of.packages, require, character.only = TRUE, quietly=TRUE))
#devtools::install_github("wmurphyrd/fiftystater")
This is a good example of a reproducible environment setup: the script automatically checks for and installs the required R packages, so anyone running the code works with the same set of packages (provided those packages remain available on CRAN).
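The repository does not pin exact package versions, so a complementary approach (our suggestion, not something the repository does) is to record them with the renv package, which writes a lock file that collaborators can restore from.
# Sketch of a lock-file workflow with the renv package (not used by the repository itself)
install.packages("renv")   # if not already installed
renv::init()               # create a project-specific package library
renv::snapshot()           # record exact package versions in renv.lock
# A collaborator who clones the project can then run:
renv::restore()            # reinstall the exact versions listed in renv.lock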
The repository does not include a step-by-step guide on how to run the project, so here is a simple environment setup guide for researchers unfamiliar with RStudio (a console sketch follows the list):
- Open RStudio
- Go to File > New Project > Version Control > Git, then paste the repository URL to clone the project
- Save the repository to your local machine and run packages_to_load.R to set up the environment and dependencies
- Run the scripts to rerun the analysis and reproduce the outputs
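The same setup can also be done from the R console. The sketch below describes one typical workflow and is an assumption on our part rather than instructions from the repository; the paths are taken from the directory listing later on this page.
# Hypothetical console workflow (assumes the usethis package is installed and the repository is public)
usethis::create_from_github("vdl2103/usa_floods_mortality")   # clone the repository into a new project
# From inside the newly created project:
source("create_folder_structure.R")      # create any output folders that do not yet exist
source("02_code/packages_to_load.R")     # install and load the required packages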
Code
For code reproducibility, the code should be clearly presented and accompanied by comments that help users understand what each chunk contributes to the output.
Take as an example usa_floods_mortality_2024/02_code/2a_data_prep/a_00_process_nchs_mortality.R. In the code below, each chunk is clearly labeled so that users can understand exactly what the code does.
rm(list = ls())
#0a.Declare root directory, folder location and load essential stuff
project.folder = paste0(print(here::here()),'/')
source(paste0(project.folder,'create_folder_structure.R'))
source(paste0(functions.folder,'script_initiate.R'))
#1a.Load flood and temperature data
flood_data <- read_csv(paste0(exposure.data.folder, "flood_pop_data_by_thresh_type.csv")) |> #flood_pop_data_by_thresh_overall
dplyr::select(c(year,month,geoid,dfo_began,dfo_id,state,flood_cat,n,expo_type,expo_threshold,flood_cat,flood_occur)) |>
mutate(fips = geoid)
temp_data <- read_csv(paste0(exposure.data.folder, "median_max_temp_fips_month_year.csv")) %>%
dplyr::select(-c(meteo_var, med_temp)) %>%
mutate_at(c("month"), as.numeric)
#1b.Set categories to iterate over
expo_types <- c("pop_expo")
expo_thresholds <- c("any", "1_pert","25_pert", "50_pert", "75_pert")
#1c.Generate dataset of all possible fips, year, month combinations
year <- c(2001:2018)
month <- c(1:12)
fips <- c(unique(flood_data$fips))
flood_lag_grid <- expand_grid(year, month, fips)
#2.Run function to add lags
flood_with_lags <- create_flood_lags(expo_types, expo_thresholds)
colMeans(is.na(flood_with_lags))
n_distinct(flood_with_lags$dfo_id)
flood_data_with_lags_temp <- flood_with_lags %>%
mutate_at("fips", as.character) %>%
mutate_at("month", as.numeric) %>%
left_join(temp_data) %>%
filter(!is.na(max_mean_temp))
colMeans(is.na(flood_data_with_lags_temp))
n_distinct(flood_data_with_lags_temp$dfo_id)
flood_data_with_lags_temp %>%
write_csv(paste0(exposure.data.folder, "flood_pop_data_with_lags_.csv"))
The code should also include notes that explain what it does and why, conveying the author's reasoning. Such notes can also flag when a chunk should or should not be run, as seen below.
#1.Set groups for model comparison; can make this iterative, but seems unnecessary
#as we aren't trying to find the marginally best fitting model for each specific model. We want to identify the best
#fitting model that works for all subgroups and subcauses - for this reason, stick with 'overall' and 'any'.
subcauses <- causes[1]
subgroups <- "overall"
types <- 'pop_expo'
thresholds <- 'any'
#1.DO NOT RUN in this project: Load and join year-specific monthly population data from Robbie Parks' CDC
#Monthly Population Inference project. File structure and data are from his project.
dat_all = data.frame()
for(year_selected in years_analysis){
dat_year <- read_csv(paste0(population.5year.processed.folder, 'vintage_2020/pop_monthly_5_year_age_groups_',year_selected,'.csv'))
dat_all <- data.table::rbindlist(list(dat_all, dat_year))
rm(dat_year)
}
Moreover, a short snippet at the end of a script can direct outputs to a designated folder, which helps keep the project organized.
model_results <- run_model_function_all(groups,expo_types,expo_thresholds,flood_types) |>
rownames_to_column()
#model_results %>% write_csv(paste0(model.output.folder, "model_results_all_mort_causes.csv"))
The code below, which prepares and plots model output, is a good example of well-documented code that supports reproducibility: every part has a comment telling the user what it does for the larger project.
rm(list = ls())
#0a.Declare root directory, folder location and load essential stuff
project.folder = paste0(print(here::here()),'/')
source(paste0(project.folder,'create_folder_structure.R'))
source(paste0(functions.folder,'script_initiate.R'))
#0a.Load datasets
model_results_flood_spf <- read_csv(paste0(model.output.folder, "model_results_flood_spf.csv"))
model_results_non_spf <- read_csv(paste0(model.output.folder, "model_results_non_spf.csv")) %>%
mutate(group = "over_non_spf")
#1.Join datasets
overall_results <- bind_rows(model_results_flood_spf, model_results_non_spf)
#2.Load and prepare model output data
tidy_plot_data <- overall_results %>%
filter(str_detect(rowname, "lag_")) %>%
mutate(across(all_of("rowname"), str_remove,pattern = "\\...*")) %>%
mutate(plot_group = case_when(
group == 'overall' ~ 'overall',
group == 'over_non_spf' ~ 'overall_non_spf',
group %in% c("1","2") ~ 'sex',
group %in% c("64", "66") ~ 'age'
)) %>%
#mutate(across(where(is.numeric), round,3)) %>%
mutate(rowname = case_when(rowname == 'lag_0' ~ '0',
rowname == 'lag_1' ~ '1',
rowname == 'lag_2' ~ '2',
rowname == 'lag_3' ~ '3')) %>%
mutate(flood_cat = case_when(flood_cat == "Snowmelt" ~ "Snowmelt",
flood_cat == "Heavy rain" ~ "Heavy rain",
flood_cat == "Tropical cyclones" ~ "Tropical cyclone",
flood_cat == "Ice jams and dam breaks" ~ "Ice jam or dam break",
flood_cat == "all_floods" ~ "All floods"))
#2.Set labels, colors, etc. for figures
sex.labs <- c("Male", "Female")
causes.labs <- c("Injuries", "Cardiovascular diseases","Respiratory diseases","Cancers", "Infectious and\nparasitic diseases","Neuropsychiatric\nconditions" )
floodtypes.labs <- c("Heavy rain", "Snowmelt", "Tropical cyclone", "Ice jam or dam break")
age.labs <- c("Age 0-64", "Age 65+")
names(sex.labs) <- c("1", "2")
names(causes.labs) <- c("Injuries", "Cardiovascular diseases","Respiratory diseases","Cancers", "Infectious and parasitic diseases", "Neuropsychiatric conditions")
names(floodtypes.labs) <- c("Heavy rain", "Snowmelt", "Tropical cyclone", "Ice jam or dam break")
names(age.labs) <- c("64", "66")
#3a.Set plots to run
plot_groups <- c("overall_non_spf","overall","sex","age")
expo_types <- c("pop_expo")
expo_thresholds <- c("1_pert", "25_pert", "50_pert", "75_pert")
#3b.Make all severity plots
plot_all_flood_severity(plot_groups, expo_types)
Data
Some of the datasets used in this study are sensitive and cannot be shared directly. However, wherever possible, we provide links and references for users to obtain the original data from the appropriate sources. Each dataset should include a short description of its origin and how it is used in the study.
## 1. Data
1a_exposure_data: flood data from Global Flood Database
1b_outcome_data: mortality data - only on local computer
1c_supportive_datasets: files used to help analysis (e.g. population weights, fips-to-state etc.)
Both raw and processed data should be well documented. CSV files should contain clearly labeled headers and be logically structured, as in the example below.
Flood cause | 25th percentile | 50th percentile | 75th percentile |
---|---|---|---|
All floods | 0.0033 | 0.021 | 0.1142 |
Heavy rain | 0.0029 | 0.0183 | 0.0984 |
Tropical cyclone | 0.0043 | 0.0271 | 0.1287 |
Snowmelt | 0.0066 | 0.0446 | 0.2268 |
Ice jam or dam break | 0.0028 | 0.0161 | 0.1005 |
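One lightweight way to document a processed CSV (our own sketch, not a practice taken from the repository) is to ship a small data dictionary alongside it. The column names below come from the data preparation code above; the types and descriptions are illustrative guesses.
# Hypothetical data dictionary for a processed dataset
library(readr)
library(tibble)
data_dictionary <- tibble(
  column      = c("year", "month", "fips", "flood_cat", "expo_threshold"),
  type        = c("integer", "integer", "character", "character", "character"),
  description = c("Calendar year of observation",
                  "Calendar month (1-12)",
                  "County FIPS code",
                  "Flood cause category",
                  "Population exposure threshold used to define flood exposure")
)
write_csv(data_dictionary, "data_dictionary.csv")   # store next to the dataset it describes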
Outputs
Outputs generated by the code should include clear labels and accompanying notes that explain what each figure represents.
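For example, figures can be written to the project's output folder with descriptive file names. The snippet below is an illustrative sketch assuming ggplot2; the folder and file names echo the directory listing later on this page and are not code from the repository.
# Hypothetical example: save a figure with a descriptive name into the output folder
library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()                 # placeholder figure
dir.create("03_output/3a_eda_output", recursive = TRUE, showWarnings = FALSE)
ggsave("03_output/3a_eda_output/example_scatter_weight_vs_mpg.jpeg",
       plot = p, width = 7, height = 5, dpi = 300)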
README
A clear README file is essential to code reproducibility: it helps others understand the project and explains each component's purpose in context. A README should include a brief description of the project and an overview of what each program file does. It can also provide further information, such as data availability and links to related files, and it should offer clear instructions or tips on environment setup and how to reproduce the analysis.
# Large floods drive changes in cause-specific mortality in the United States
Victoria D Lynch, Johnathan Sullivan, Aaron Flores, Sarika Aggarwal, Rachel C Nethery, Marianthi-Anna Kioumourtzoglou, Anne E Nigra, Robbie M Parks. Nature Medicine. 2025
## Project description
This dataset and code is used for the paper
https://www.nature.com/articles/s41591-024-03358-z
This section provides the basic information about the project and its related publications.
## 2.Code
### 2a.Data prep code
a_00_get_county_pop: code to read-in county- and month-specific population estimates from Robbie Parks' CDC Monthly Population Inference project; also includes alternative approach for annual population estimates from US Census Bureau and SEER data. Output saved in 1c. supportive datasets folder; no need to run
a_00_process_nchs_mortality: do not run locally; code to process mortality data
a_01_prep_exposure_data: add flood type to Global Flood Database data; identify floods missing from GFD
a_02_create_exposure_variables: create flood exposure variables by population thresholds
a_03_add_lags_to_floods: create lagged flood exposure variables
### 2b.Data exploration code
b_00_compare_flood_datasets: identify flood events in DFO and NCEI that are not currently in GFD; no need to run unless specifically comparing flood datasets
b_01_flood_event_eda: barplots and histograms to assess the number and duration of flood events by flood cause
b_02_flood_eda_maps: maps of flood count by county and by flood cause (cyclonic storms, heavy rain, rain and snow, ice jams and dam breaks)
b_03_mortality_eda: code for mortality plots
b_04_exposure_histogram: code to make exposure threshold histogram
b_05_manuscript_values: code to show how specific values in manuscript are calculated
This part of the README describes each file, its purpose, and how it fits into the project workflow.
## 5. Figures
All figures for manuscript and supplement
note: please run create_folder_structure.R first to create folders which may not be there when first loaded.
The README should also include notes on how to set up the environment and run specific scripts so that the code executes properly.
## Data Availability
Flood data used in this analysis are available via https://github.com/vdl2103/usa_floods_mortality/tree/main/01_data/1a_exposure_data
Mortality data is available from https://www.cdc.gov/nchs/nvss/bridged_race.htm
The README can also tell users where the data were sourced, which further enhances the project's reproducibility and transparency.
Directory
Organizing files is a clear cornerstone of reproducible research. A well-structured directory makes it easier for collaborators to understand, navigate, and run the code.
.
├── 01_data
│   ├── 1a_exposure_data
│   │   ├── FloodArchive.csv
│   │   ├── dfo_usa_county_panel_20220629.csv
│   │   ├── floods_not_in_gfd.csv
│   │   ├── gfd_county_panel.csv
│   │   ├── gfd_usa_county_panel_20230224.csv
│   │   ├── gfd_with_flood_type.csv
│   │   └── ncei_usa_county_panel_20230222.csv
│   ├── 1b_outcome_data
│   │   └── mortality_cs_fips_sex_age_2001_2018.csv
│   ├── 1c_supportive_datasets
│   │   └── fips_to_state.csv
│   ├── map_objects.R
│   └── objects.R
├── 02_code
│   ├── 20_functions
│   │   ├── 01_data_processing_functions.R
│   │   ├── 02_eda_functions.R
│   │   ├── 03_model_development_functions.R
│   │   ├── 04_model_functions.R
│   │   ├── 05_model_plotting_functions.R
│   │   └── script_initiate.R
│   ├── 2a_data_prep
│   │   ├── a_00_get_county_pop.R
│   │   ├── a_00_process_nchs_mortality.R
│   │   ├── a_01_prep_exposure_data.R
│   │   ├── a_02_create_exposure_variables.R
│   │   ├── a_03_add_lags_to_floods.R
│   │   └── load_data.R
│   ├── 2b_data_exploration
│   │   ├── b_00_compare_flood_datasets.R
│   │   ├── b_01_floods_eda.R
│   │   ├── b_02_flood_eda_maps.R
│   │   ├── b_03_mortality_eda.R
│   │   ├── b_04_exposure_histogram.R
│   │   └── b_05_manuscript_values.R
│   ├── 2c_models
│   │   ├── c_00_create_subcause_group_datasets.R
│   │   ├── c_01_model_comparison.R
│   │   └── c_02_run_model.R
│   ├── 2d_model_plotting
│   │   └── d_01_plot_model_output.R
│   └── packages_to_load.R
├── 03_output
│   ├── 3a_eda_output
│   │   ├── maps_flood_exposure_area.jpeg
│   │   ├── maps_flood_exposure_pop.jpeg
│   │   ├── maps_flood_type_any_pop_expo.jpeg
│   │   ├── seasonal_mortality.jpeg
│   │   └── ts_data.jpeg
│   └── 3b_model_output
│       └── model_comparison_table.csv
├── 04_tables
├── 05_figures
├── 06_literature
└── 07_drafts
Here is an example of a well-organized project directory. Clearly labeled folder and file names give users an understanding of what each component contains and does at a glance. For instance, all files under 01_data contain the data used, and 05_figures contains the figures. Each file has an appropriate extension, such as .csv or .R, which makes it easy to identify the file type. Descriptive filenames such as mortality_cs_fips_sex_age_2001_2018.csv and b_00_compare_flood_datasets.R help users understand what a file contains and how it is used without opening it. This layout improves transparency and accessibility, allowing collaborators to quickly locate files. In summary, a reproducible research directory should be logically structured and clearly labeled, making it easier for anyone to navigate an unfamiliar project.
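The repository builds this layout with a create_folder_structure.R script, which the README asks users to run first. The snippet below is our own illustrative sketch of such a script, not the repository's actual code.
# Hypothetical folder-structure script; folder names mirror the listing above
project.folder <- paste0(here::here(), '/')
folders <- c('01_data', '02_code', '03_output/3a_eda_output',
             '03_output/3b_model_output', '04_tables', '05_figures')
for (folder in folders) {
  dir.create(paste0(project.folder, folder), recursive = TRUE, showWarnings = FALSE)
}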
Licensing
Licensing is a critical component of reproducible and open research. It determines how others may use, adapt, and share your work. A good licensing choice protects your rights while enabling collaborative reuse. Without a license, others cannot legally use your work.
What is licensing
Licensing governs the legal use of intellectual property, including:
- Copyright: granted to authors of original works (code, papers, images, datasets).
- Patents: protect inventions or processes, often not applicable to most open research outputs.
- Trademarks: protect logos and names used for branding a tool, software, or service.
Types of licenses
Type | Description |
---|---|
Proprietary | Grants users the right to use software while the developer retains ownership and control over the software's intellectual property (default if no license is provided) |
Permissive | Allows users to freely use, modify, and distribute the software with minimal restrictions (e.g., MIT, BSD, Apache 2.0). |
Copyleft | Allows users to freely use, modify, and distribute the software, as well as any derivative works, but with the condition that these freedoms are preserved in any redistributed versions |
Restrictive | Limits reuse for specific purposes only (e.g., non-commercial, research only). |
For stored data and code, you can place a LICENSE file in your GitHub repo.
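For an R project like the exemplar above, the usethis package can generate a standard LICENSE file. This is a sketch assuming usethis is installed and run from the project root; the license choice itself is up to the authors.
# Sketch: add a permissive MIT license file to the current project
usethis::use_mit_license(copyright_holder = "Your Name")
# Copyleft or data-oriented alternatives have similar helpers, e.g.:
# usethis::use_gpl3_license()
# usethis::use_ccby_license()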
Citations and attribution
Citations and attribution are crucial in academic and research settings for acknowledging the intellectual contributions of others.
In reproducible research, proper attribution ensures that all contributors receive appropriate credit, increases transparency, and allows others to build on your work and verify its findings.
To achieve this, use licenses that require attribution, acknowledge the work of contributors, use a citation manager such as Zotero, and make all research objects (data, code, and documentation) available and citable.
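For R projects specifically, citation entries for the packages used can be generated directly from the console. A brief sketch follows; dlnm is just an example taken from the package list above and must be installed for the call to work.
# Print the recommended citation for a package used in the analysis
citation("dlnm")
# Convert it to a BibTeX entry for a reference manager such as Zotero
toBibtex(citation("dlnm"))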