Reproducible research - sparklabnyc/resources GitHub Wiki

Intro to reproducible research

Overview

Scientific findings and evidence are strengthened if they can be replicated and confirmed by other researchers. Transparency in research, through documentation and open access, allows others to reproduce and add to your results. New researchers can reuse data or code from prior projects to verify old findings or develop new analyses.

Definitions

Reproducible: A study is reproducible when independent researchers can use the same data and code provided by the original authors to obtain the same results.

Replicable: A study is replicable when independent researchers collect new data and follow the same experimental or analytical procedures as the original study and still arrive at the same findings.

Why does reproducibility matter?

Reproducibility isn't just about being able to rerun code. It's about being able to build on the foundation that the study establishes. Reproducibility promotes scientific integrity, collaboration, and equity.

Benefits and risks

There are many benefits to having your study be reproducible:

  • Verification: others can independently check results and methods
  • Reuse: code and data can be used to support new studies
  • Peer access: open materials can lead to faster peer review and more citations
  • Educational utility: trains students and emerging researchers

But it also comes with some drawbacks:

  • Time and effort: writing reproducibly takes more time and planning, and documentation takes longer and requires more care
  • Data sensitivity: not all data can be shared due to restrictions and guidelines.
  • Fear of error exposure: some researchers may hesitate to share code that could reveal mistakes

Open research

Open research is the practice of making research data and code publicly accessible. It aims to transform research by making it more reproducible, transparent, and collaborative.

To achieve this, each element of the research process should:

  • Be publicly available: it is difficult to use and benefit from knowledge hidden behind barriers, so make the code and data accessible.
  • Be reusable: research methods need to be licensed appropriately, so that prospective users know any limitations on reuse.
  • Be transparent: provide clear statements of how the research findings were produced and what they contain.

Open research encourages collaboration and continuation, and it builds the habit of documenting your work. Common open practices include:

  • Open data: documenting and sharing research data openly.
  • Open source Software: documenting research code and methods.
  • Open hardware: documenting designs, materials, and other relevant information related to hardware.
  • Open access: making all published outputs accessible for use and impact.
  • Open notebooks: an emerging practice in which researchers make their entire research process publicly available.

Version control

In a collaborative project, the work of many contributors must be merged into a single set of shared working documents. The management of changes or additions made to a file or project is called versioning. Reproducibility requires providing the code and data used to create each figure; in practice, code and data change regularly, so you must record when changes are made and why. Version control is a method for recording changes to a file so you and your collaborators can track its history and review any modifications. GitHub is a powerful, recommended platform for version control because it lets contributors see versions and changes, as well as make additions.

Version control is essential for collaborative projects where many people work on the same data or code at the same time and build on each other’s work. With a version control system, the changes made by different people can be tracked and often merged automatically, saving a significant amount of manual effort. Your findings will be easier to reproduce and build upon. Additionally, version control hosting services like GitHub, GitLab, and others offer structured ways to communicate and collaborate, such as through pull requests, code reviews, and issues.
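As a minimal, hypothetical sketch of this workflow (git must be installed; the repository and file names here are invented for illustration):

```shell
# Minimal version-control sketch; "demo-analysis" and its file are hypothetical.
mkdir demo-analysis
git -C demo-analysis init --quiet

# Record an initial version of an analysis script
echo 'print("hello")' > demo-analysis/analysis.R
git -C demo-analysis add analysis.R
git -C demo-analysis -c user.name="Demo" -c user.email="demo@example.com" \
    commit --quiet -m "Add initial analysis script"

# Record a change, with a message explaining why it was made
echo 'print("hello, revised")' > demo-analysis/analysis.R
git -C demo-analysis add analysis.R
git -C demo-analysis -c user.name="Demo" -c user.email="demo@example.com" \
    commit --quiet -m "Revise script output for clarity"

# The history of what changed, and why, is now reviewable
git -C demo-analysis log --oneline
```

Each commit message records why a change was made, which is exactly the history that collaborators and later readers need to reproduce and build on your work.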

Code reproducibility

Introduction

Code reproducibility exemplars

This guide provides an in-depth look at how to make research code reproducible using GitHub, with a real-world exemplar drawn from the USA floods mortality GitHub repository. This repository exemplifies best practices in reproducible research, from environment setup and code organization to documentation.

Reproducible environments

Ensuring that others can run your code in the same environment is key to reproducibility.

A reproducible environment ensures that your code runs the same way on different machines by keeping the versions of packages and system dependencies consistent. This prevents errors from version changes.

#1.Load packages on CRAN

#1a.Add new packages here, as necessary 
list.of.packages = c('acs','BiocManager','dlnm','dplyr','ecm','Epi','fiftystater','foreign', 'fst','ggpubr','ggplot2', 'graph','graticule',
                     'haven','here', 'janitor','lubridate', 'mapproj','maptools','mapview','MetBrewer','pipeR','raster',
                     'RColorBrewer','readxl', 'rgdal', 'rgeos','rnaturalearth','rnaturalearthdata','scales', 'sf','sp','sqldf', 'survival','splines',
                     'table1', 'tidycensus', 'tidyverse', 'totalcensus', 'usmap','zipcodeR','zoo', 'INLA', 'Rgraphviz','fmesher')

#1b.Check if list of packages is installed. If not, it will install ones not yet installed
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) invisible(install.packages(new.packages,repos = "https://cloud.r-project.org"))

#1c.load packages
invisible(lapply(list.of.packages, require, character.only = TRUE, quietly=TRUE))

#devtools::install_github("wmurphyrd/fiftystater")

This is a good example of a reproducible environment script: it automatically checks for, installs, and loads the R packages the program requires, so anyone running the code gets the same execution environment (provided the packages are still available on CRAN).

The repository does not include a guide on how to run the project, so here is a simple environment setup guide for researchers new to RStudio:

  1. Open RStudio
  2. Go to File > New Project > Version Control > Git, then paste the repository URL to clone the project
  3. Save the repository to your local files and run packages_to_load.R to set up the environment and dependencies
  4. Run the analysis scripts to reproduce the outputs

Code

For code reproducibility, the code should be clearly presented and paired with comments that help users understand the role of each chunk in producing the output.

Take an example from usa_floods_mortality_2024/02_code/2a_data_prep/a_00_process_nchs_mortality.R. The code below has each chunk clearly labeled so that users can understand exactly what the code does.

rm(list = ls())
#0a.Declare root directory, folder location and load essential stuff
project.folder = paste0(print(here::here()),'/')
source(paste0(project.folder,'create_folder_structure.R'))
source(paste0(functions.folder,'script_initiate.R'))

#1a.Load flood and temperature data 
flood_data <- read_csv(paste0(exposure.data.folder, "flood_pop_data_by_thresh_type.csv")) |> #flood_pop_data_by_thresh_overall
  dplyr::select(c(year,month,geoid,dfo_began,dfo_id,state,flood_cat,n,expo_type,expo_threshold,flood_cat,flood_occur)) |> 
  mutate(fips = geoid)
temp_data <- read_csv(paste0(exposure.data.folder, "median_max_temp_fips_month_year.csv")) %>% 
  dplyr::select(-c(meteo_var, med_temp)) %>% 
  mutate_at(c("month"), as.numeric)

#1b.Set categories to iterate over
expo_types <- c("pop_expo")
expo_thresholds <- c("any", "1_pert","25_pert", "50_pert", "75_pert")

#1c.Generate dataset of all possible fips, year, month combinations 
year <- c(2001:2018)
month <- c(1:12)
fips <- c(unique(flood_data$fips))
flood_lag_grid <- expand_grid(year, month, fips)

#2.Run function to add lags
flood_with_lags <- create_flood_lags(expo_types, expo_thresholds)
colMeans(is.na(flood_with_lags))
n_distinct(flood_with_lags$dfo_id)

flood_data_with_lags_temp <- flood_with_lags %>% 
  mutate_at("fips", as.character) %>% 
  mutate_at("month", as.numeric) %>% 
  left_join(temp_data) %>% 
  filter(!is.na(max_mean_temp))
colMeans(is.na(flood_data_with_lags_temp))
n_distinct(flood_data_with_lags_temp$dfo_id)

flood_data_with_lags_temp %>% 
  write_csv(paste0(exposure.data.folder, "flood_pop_data_with_lags_.csv"))

The code should also include notes that explain what it does and why, capturing the author's thought process. Comments can additionally indicate when a chunk should or should not be run, as seen below.


#1.Set groups for model comparison; can make this iterative, but seems unnecessary 
#as we aren't trying to find the marginally best fitting model for each specific model. We want to identify the best
#fitting model that works for all subgroups and subcauses - for this reason, stick with 'overall' and 'any'. 
subcauses <- causes[1]
subgroups <- "overall"
types <- 'pop_expo'
thresholds <- 'any'

#1.DO NOT RUN in this project: Load and join year-specific monthly population data from Robbie Parks' CDC
#Monthly Population Inference project. File structure and data are from his project. 
dat_all = data.frame()
for(year_selected in years_analysis){
  dat_year <- read_csv(paste0(population.5year.processed.folder, 'vintage_2020/pop_monthly_5_year_age_groups_',year_selected,'.csv'))
  dat_all <- data.table::rbindlist(list(dat_all, dat_year))
  rm(dat_year)
}

Moreover, we can add a short snippet at the end of a script to direct outputs to a designated folder, which helps maintain better organization.


model_results <- run_model_function_all(groups,expo_types,expo_thresholds,flood_types) |> 
  rownames_to_column()
#model_results %>% write_csv(paste0(model.output.folder, "model_results_all_mort_causes.csv"))

The plotting code below is a very good example of well-documented code for reproducibility: every part carries a comment explaining what it contributes to the larger project.


rm(list = ls())
#0a.Declare root directory, folder location and load essential stuff
project.folder = paste0(print(here::here()),'/')
source(paste0(project.folder,'create_folder_structure.R'))
source(paste0(functions.folder,'script_initiate.R'))

#0a.Load datasets
model_results_flood_spf <- read_csv(paste0(model.output.folder, "model_results_flood_spf.csv"))
model_results_non_spf <- read_csv(paste0(model.output.folder, "model_results_non_spf.csv")) %>% 
  mutate(group = "over_non_spf") 

#1.Join datasets
overall_results <- bind_rows(model_results_flood_spf, model_results_non_spf)

#2.Load and prepare model output data
tidy_plot_data <- overall_results %>% 
  filter(str_detect(rowname, "lag_")) %>% 
  mutate(across(all_of("rowname"), str_remove,pattern = "\\...*")) %>% 
  mutate(plot_group = case_when(
    group == 'overall' ~ 'overall',
    group == 'over_non_spf' ~ 'overall_non_spf',
    group %in% c("1","2") ~ 'sex',
    group %in% c("64", "66") ~ 'age'
  )) %>% 
  #mutate(across(where(is.numeric), round,3)) %>%
  mutate(rowname = case_when(rowname == 'lag_0' ~ '0', 
                             rowname == 'lag_1' ~ '1', 
                             rowname == 'lag_2' ~ '2', 
                             rowname == 'lag_3' ~ '3')) %>% 
  mutate(flood_cat = case_when(flood_cat == "Snowmelt" ~ "Snowmelt",
                               flood_cat == "Heavy rain" ~ "Heavy rain",
                               flood_cat == "Tropical cyclones" ~ "Tropical cyclone",
                               flood_cat == "Ice jams and dam breaks" ~ "Ice jam or dam break",
                               flood_cat == "all_floods" ~ "All floods")) 

#2.Set labels, colors, etc. for figures
sex.labs <- c("Male", "Female")
causes.labs <- c("Injuries", "Cardiovascular diseases","Respiratory diseases","Cancers", "Infectious and\nparasitic diseases","Neuropsychiatric\nconditions" )
floodtypes.labs <- c("Heavy rain", "Snowmelt", "Tropical cyclone", "Ice jam or dam break")
age.labs <- c("Age 0-64", "Age 65+")
names(sex.labs) <- c("1", "2")
names(causes.labs) <- c("Injuries", "Cardiovascular diseases","Respiratory diseases","Cancers", "Infectious and parasitic diseases", "Neuropsychiatric conditions")
names(floodtypes.labs) <- c("Heavy rain", "Snowmelt", "Tropical cyclone", "Ice jam or dam break")
names(age.labs) <- c("64", "66")

#3a.Set plots to run
plot_groups <- c("overall_non_spf","overall","sex","age")
expo_types <- c("pop_expo")
expo_thresholds <- c("1_pert", "25_pert", "50_pert", "75_pert")

#3b.Make all severity plots
plot_all_flood_severity(plot_groups, expo_types)

Data

Some of the datasets used in this study are sensitive and cannot be shared directly. However, wherever possible, we provide links and references for users to obtain the original data from the appropriate sources. Each dataset should include a short description of its origin and how it is used in the study.


## 1. Data
1a_exposure_data: flood data from Global Flood Database 

1b_outcome_data: mortality data - only on local computer 

1c_supportive_datasets: files used to help analysis (e.g. population weights, fips-to-state etc.)

Both raw and processed data should be well documented: CSV files should contain clearly labeled headers and be logically structured.

| Flood cause | 25th percentile | 50th percentile | 75th percentile |
| --- | --- | --- | --- |
| All floods | 0.0033 | 0.021 | 0.1142 |
| Heavy rain | 0.0029 | 0.0183 | 0.0984 |
| Tropical cyclone | 0.0043 | 0.0271 | 0.1287 |
| Snowmelt | 0.0066 | 0.0446 | 0.2268 |
| Ice jam or dam break | 0.0028 | 0.0161 | 0.1005 |

Outputs

Outputs generated by the code should include clear labels and accompanying notes that explain what each figure represents.

Figure S2

Figure S5

README

A clear README file is essential to code reproducibility: it helps others understand the project and each component's purpose within it. A README should include a brief description of the project and an overview of each program file's function. It can also provide further information, such as data availability and links to related files, and should offer clear instructions on environment setup and how to reproduce the analysis.

# Large floods drive changes in cause-specific mortality in the United States
Victoria D Lynch, Johnathan Sullivan, Aaron Flores, Sarika Aggarwal, Rachel C Nethery, Marianthi-Anna Kioumourtzoglou, Anne E Nigra, Robbie M Parks. Nature Medicine. 2025

## Project description

This dataset and code is used for the paper

https://www.nature.com/articles/s41591-024-03358-z

This section provides the basic information about the project and its related publications.


## 2.Code 

### 2a.Data prep code
a_00_get_county_pop: code to read-in county- and month-specific population estimates from Robbie Parks' CDC Monthly Population Inference project; also includes alternative approach for annual population estimates from US Census Bureau and SEER data. Output saved in 1c. supportive datasets folder; no need to run 

a_00_process_nchs_mortality: do not run locally; code to process mortality data 

a_01_prep_exposure_data: add flood type to Global Flood Database data; identify floods missing from GFD

a_02_create_exposure_variables: create flood exposure variables by population thresholds 

a_03_add_lags_to_floods: create lagged flood exposure variables 

### 2b.Data exploration code
b_00_compare_flood_datasets: identify flood events in DFO and NCEI that are not currently in GFD; no need to run unless specifically comparing flood datasets

b_01_flood_event_eda: barplots and histograms to assess the number and duration of flood events by flood cause 

b_02_flood_eda_maps: maps of flood count by county and by flood cause (cyclonic storms, heavy rain, rain and snow, ice jams and dam breaks)

b_03_mortality_eda: code for mortality plots

b_04_exposure_histogram: code to make exposure threshold histogram

b_05_manuscript_values: code to show how specific values in manuscript are calculated

This part of the README describes each file, its purpose, and how it fits into the project workflow.


## 5. Figures
All figures for manuscript and supplement 

note: please run create_folder_structure.R first to create folders which may not be there when first loaded.

The README should also include notes on how to set up the environment and run specific scripts so that the code executes properly.


## Data Availability 
Flood data used in this analysis are available via https://github.com/vdl2103/usa_floods_mortality/tree/main/01_data/1a_exposure_data

Mortality data is available from https://www.cdc.gov/nchs/nvss/bridged_race.htm

The README can also tell users where the data was sourced, which further enhances the project's reproducibility and transparency.

Directory

Organizing files is a cornerstone of reproducible research. A well-structured directory makes it easier for collaborators to understand, navigate, and run the code.

.
├── 01_data
│   ├── 1a_exposure_data
│   │   ├── FloodArchive.csv
│   │   ├── dfo_usa_county_panel_20220629.csv
│   │   ├── floods_not_in_gfd.csv
│   │   ├── gfd_county_panel.csv
│   │   ├── gfd_usa_county_panel_20230224.csv
│   │   ├── gfd_with_flood_type.csv
│   │   └── ncei_usa_county_panel_20230222.csv
│   ├── 1b_outcome_data
│   │   └── mortality_cs_fips_sex_age_2001_2018.csv
│   ├── 1c_supportive_datasets
│   │   └── fips_to_state.csv
│   ├── map_objects.R
│   └── objects.R
├── 02_code
│   ├── 20_functions
│   │   ├── 01_data_processing_functions.R
│   │   ├── 02_eda_functions.R
│   │   ├── 03_model_development_functions.R
│   │   ├── 04_model_functions.R
│   │   ├── 05_model_plotting_functions.R
│   │   └── script_initiate.R
│   ├── 2a_data_prep
│   │   ├── a_00_get_county_pop.R
│   │   ├── a_00_process_nchs_mortality.R
│   │   ├── a_01_prep_exposure_data.R
│   │   ├── a_02_create_exposure_variables.R
│   │   ├── a_03_add_lags_to_floods.R
│   │   └── load_data.R
│   ├── 2b_data_exploration
│   │   ├── b_00_compare_flood_datasets.R
│   │   ├── b_01_floods_eda.R
│   │   ├── b_02_flood_eda_maps.R
│   │   ├── b_03_mortality_eda.R
│   │   ├── b_04_exposure_histogram.R
│   │   └── b_05_manuscript_values.R
│   ├── 2c_models
│   │   ├── c_00_create_subcause_group_datasets.R
│   │   ├── c_01_model_comparison.R
│   │   └── c_02_run_model.R
│   ├── 2d_model_plotting
│   │   └── d_01_plot_model_output.R
│   └── packages_to_load.R
├── 03_output
│   ├── 3a_eda_output
│   │   ├── maps_flood_exposure_area.jpeg
│   │   ├── maps_flood_exposure_pop.jpeg
│   │   ├── maps_flood_type_any_pop_expo.jpeg
│   │   ├── seasonal_mortality.jpeg
│   │   └── ts_data.jpeg
│   ├── 3b_model_output
│   └── model_comparison_table.csv
├── 04_tables
├── 05_figures
├── 06_literature
└── 07_drafts

Here is an example of a well-organized project directory. Clearly labeled folder and file names give users an understanding of what each component contains and its purpose at a glance: all files under 01_data contain the data used, and 05_figures likewise contains the figures. Each file also carries an appropriate extension, such as .csv or .R, which makes the file type easy to identify. Descriptive filenames such as mortality_cs_fips_sex_age_2001_2018.csv and b_00_compare_flood_datasets.R convey a file's contents and usage without the need to open it. This type of layout improves transparency and accessibility, allowing collaborators to quickly locate files. In summary, a reproducible research directory should be logically structured and clearly labeled, making it easy for anyone to navigate an unfamiliar project.
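A listing like the one above can be generated with the `tree` command. A small sketch (the folder names here are hypothetical, echoing the layout above; `find` is used as a fallback where `tree` is not installed):

```shell
# Create a toy project layout (hypothetical folder names)
mkdir -p demo_project/01_data demo_project/02_code demo_project/03_output
# Print the layout two levels deep; fall back to `find` if `tree` is absent
tree -L 2 demo_project 2>/dev/null || find demo_project -maxdepth 2 | sort
```

Including such a listing in your README gives new collaborators an immediate map of the project.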

Licensing

Licensing is a critical component of reproducible and open research. It determines how others may use, adapt, and share your work. A good licensing choice protects your rights while enabling collaborative reuse. Without a license, others cannot legally use your work.

What is licensing

Licensing governs the legal use of intellectual property, including:

  • Copyright: granted to authors of original works (code, papers, images, datasets).
  • Patents: protect inventions or processes, often not applicable to most open research outputs.
  • Trademarks: protect logos and names used for branding a tool, software, or service.

Types of licenses

| Type | Description |
| --- | --- |
| Proprietary | Grants users the right to use the software while the developer retains ownership and control of its intellectual property (the default if no license is provided). |
| Permissive | Allows users to freely use, modify, and distribute the software with minimal restrictions (e.g., MIT, BSD, Apache 2.0). |
| Copyleft | Allows users to freely use, modify, and distribute the software and any derivative works, on the condition that these freedoms are preserved in any redistributed versions (e.g., GPL). |
| Restrictive | Limits reuse to specific purposes only (e.g., non-commercial, research only). |

For stored data and code, you can place a LICENSE file in your GitHub repo.

Citations and attribution

Citations and attribution are crucial in academic and research settings: they acknowledge the intellectual contributions of others, give every contributor appropriate credit, and increase transparency. Proper attribution also makes it possible for others to verify your findings and build on your work.

To support attribution, use licenses that require it, acknowledge the work of contributors, use citation managers such as Zotero, and make all research objects (data, code, and publications) available and citable.
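One concrete way to support attribution on GitHub is to add a CITATION.cff file at the repository root, which GitHub renders as a "Cite this repository" button. A minimal sketch, in which every value is a placeholder rather than a detail of the exemplar project:

```yaml
# CITATION.cff (minimal sketch; all values below are placeholders)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example flood analysis code"
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: "1.0.0"
date-released: "2025-01-01"
```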