Data processing details - borenstein-lab/microbiome-metabolome-curated-data GitHub Wiki

Metagenomics processing notes

This dataset collection includes microbiome data from both whole-genome shotgun sequencing and 16S rRNA amplicon sequencing. We re-processed the raw data with the appropriate computational tools (detailed below), using the Genome Taxonomy Database (GTDB v207) [1](#references) as our reference database because it is specifically designed to provide a consistent and comprehensive taxonomy for bacterial genomes. Further details about the processing of each data type are provided below.

Shotgun data

For studies with shotgun metagenomic data, we obtained raw fastq files and applied quality filtering, adapter trimming and deduplication using fastp v0.23.2 [2](#references):

fastp --in1 $FASTQ_FWD --in2 $FASTQ_REV --length_required 60 --dedup --thread $N_THREADS --out1 $FASTQ_FWD_CLEAN --out2 $FASTQ_REV_CLEAN

If paired-end reads could not be merged, we concatenated the forward and reverse reads into a single fastq file. We then filtered out host DNA using Bowtie 2 v2.3.5 [3](#references), aligning reads to the human reference genome (Genome Reference Consortium Human Build 38, GRCh38):

bowtie2 -U $FASTQ_CONCAT -x $HOST_REF --sensitive | samtools fastq -f 4 -c 9 - | gzip > $output

Next, we ran Kraken2 v2.1.1 and Bracken v2.8 for taxonomic classification and species abundance estimation [4-5](#references):

# Run kraken2
kraken2 --db $kraken2_GTDB --output "$output".krkn --use-names --report "$output".rep --gzip-compressed --paired --memory-mapping "$input"_1.filtered.fastq.gz "$input"_2.filtered.fastq.gz 
# Run bracken - species abundances
bracken -i "$output".rep -o "$output".brkn.sp -d $bracken_GTDB -l S
# Run bracken - genera abundances
bracken -i "$output".rep -o "$output".brkn.ge -d $bracken_GTDB -l G

Samples with fewer than 50,000 reads were discarded. Instead of the default reference database, we used a recently published version of GTDB (v207) formatted for Kraken2 [6](#references). Species-level abundance profiles were saved in the species tables, and genus-level abundance profiles were saved in the genera tables.

Statistics about the number of reads at each processing step can be found in supplementary table S2.
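The 50,000-read cutoff could be applied to a per-sample read-count table along the following lines. This is a minimal sketch rather than the script used for this collection: the function name `filter_low_depth_samples`, the two-column tab-separated layout (sample ID, read count, one header line) and the sample IDs are all assumptions made for illustration, as is the choice of which processing step's read count to test.

```shell
# Hypothetical helper: keep samples with at least MIN_READS reads (default 50000).
# Input (stdin): tab-separated "sample<TAB>reads" with one header line (assumed layout).
filter_low_depth_samples() {
    awk -F'\t' -v min="${1:-50000}" 'NR == 1 || ($2 + 0) >= min'
}

# Example with made-up counts: S2 has fewer than 50,000 reads and is dropped,
# while S3 sits exactly on the cutoff and is kept.
printf 'sample\treads\nS1\t1200000\nS2\t49999\nS3\t50000\n' | filter_low_depth_samples 50000
```

The header line is always passed through so that the filtered table keeps its column names.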

Note: We used Kraken2 rather than the popular MetaPhlAn3 because GTDB reference databases are available for Kraken2 [6](#references).

16S rRNA data

Raw 16S rRNA gene sequencing data were processed using QIIME2 (version 2019-1) [7](#references) as follows:

- When raw data were multiplexed, we demultiplexed them using QIIME2’s demux plugin.
- We used DADA2 [8](#references) to denoise the data and extract ASVs. DADA2 was also used to merge paired-end reads when applicable.
- To assign ASVs to the GTDB taxonomy (genus level), we used the assignTaxonomy function from the DADA2 R package with the 16S GTDB database (version 5, database file name: gtdb-sbdi-sativa.r07rs207.1genome.assignTaxonomy.fna.gz) [9](#references).

Further study-specific parameters and details about the 16S data processing of each dataset are detailed in supplementary tables S1 and S2.

Additional processing of taxonomic profiles

In all datasets, we removed non-bacterial taxa and re-calculated relative abundances. Unclassified bacteria were labeled as "Unclassified".
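As an illustration of this step, the sketch below drops non-bacterial rows from a single-sample abundance table and rescales the remaining values so that they sum to 1. It is not the code used for this collection. The function name `renormalize_bacteria`, the two-column tab-separated layout, and the assumption that taxa are written as GTDB-style lineages beginning with the domain (`d__Bacteria`, `d__Archaea`) are all hypothetical.

```shell
# Hypothetical helper: drop non-bacterial rows, then rescale the rest to sum to 1.
# Input (stdin): "taxon<TAB>abundance" with a header line; taxa are assumed to be
# GTDB-style lineages whose first rank is the domain, e.g. "d__Bacteria;p__...".
renormalize_bacteria() {
    awk -F'\t' '
        NR == 1 { header = $0; next }
        $1 ~ /^d__Bacteria/ { taxon[++n] = $1; value[n] = $2; total += $2 }
        END {
            print header
            if (total == 0) exit    # no bacterial signal, nothing to rescale
            for (i = 1; i <= n; i++)
                printf "%s\t%.4f\n", taxon[i], value[i] / total
        }'
}

# Example with made-up values: the archaeal row is removed and the two
# bacterial rows (30 and 10) are rescaled to 0.7500 and 0.2500.
printf 'taxon\tabundance\nd__Bacteria;p__A\t30\nd__Archaea;p__B\t60\nd__Bacteria;p__C\t10\n' | renormalize_bacteria
```

Rows are buffered until the end of the input because the new denominator is only known once every bacterial row has been read.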

Metabolomics processing notes

The metabolomics datasets included in this collection were obtained from different studies and generated via diverse technologies. Specifically, different datasets may have been generated by different metabolomics platforms (e.g. NMR, LC-MS, GC-MS, etc.), in either a targeted or untargeted approach. Different studies may also have used different control and normalization procedures, and their data were provided in different formats and with different compound identifier schemes.

To consolidate the metabolite identifiers in this collection while still providing the entire original metabolome data, we performed the following processing:

- Compound identifiers in each mtb table are listed as provided by the authors (i.e. not unified or modified in any way). Occasionally, a few different fields from the original data were concatenated to ensure unique compound identifiers. For example, if a dataset contained both an NMR-based metabolic profile and an LC-MS untargeted profile, then the unique compound names are a concatenation of the metabolomics method name and the metabolite identifier within that method (e.g. "NMR_Lactate" or "LC-MS_Glycocholic acid"). Further details can be found in supplementary table S3 or in the dataset-specific scripts.
- We created a metabolite-metadata table per dataset (namely mtb.map) where additional details are provided for each metabolite in the mtb table. The mtb.map table includes:
  - Any original information per metabolite as provided by the authors;
  - Mappings to KEGG and HMDB identifiers wherever possible. These were either provided by the authors or obtained using the conversion utility from MetaboAnalystR (version 3.2) [10](#references). Additional mappings were added manually when possible;
  - A High.Confidence.Annotation boolean field that marks cases where the identification of the metabolite in the original publication, or its mapping to HMDB/KEGG IDs, was made with lower confidence. In particular, this field is set to FALSE in any of the following cases:
    - The metabolite had an ambiguous identification in the original table (e.g. "fructose/glucose");
    - The metabolite was identified with low confidence by the authors of the original publication;
    - The metabolite name had a typo (which resulted in a manual mapping to HMDB/KEGG based on the supposedly correct metabolite name and additional metabolite information, if provided);
    - The KEGG/HMDB ID provided by the authors conflicted with the ID returned by MetaboAnalyst based on the metabolite name;
    - More than one metabolite was mapped to the same KEGG or HMDB ID.
- Metabolite values were kept as is (including missing values where present).
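Two of the criteria listed above for a FALSE High.Confidence.Annotation value are mechanical enough to sketch in code: an ambiguous name and an identifier shared by more than one metabolite. The example below is an illustration under stated assumptions, not the procedure used to build the mtb.map tables. The function name `flag_low_confidence`, the three-column layout (Compound, KEGG, HMDB), the use of a "/" in the name as a proxy for ambiguity, and the placeholder IDs are all hypothetical. The remaining criteria (author-reported low confidence, typos, conflicts with MetaboAnalyst) require information this sketch does not have.

```shell
# Hypothetical helper: append a High.Confidence.Annotation column to a mapping table.
# Input (stdin): "Compound<TAB>KEGG<TAB>HMDB" with a header line (assumed layout);
# an empty ID field means "no mapping" and is never counted as a duplicate.
flag_low_confidence() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { header = $0; next }
        {
            row[++n] = $0; name[n] = $1; kegg[n] = $2; hmdb[n] = $3
            if ($2 != "") kegg_count[$2]++
            if ($3 != "") hmdb_count[$3]++
        }
        END {
            print header, "High.Confidence.Annotation"
            for (i = 1; i <= n; i++) {
                ok = "TRUE"
                if (index(name[i], "/") > 0) ok = "FALSE"                    # ambiguous name (heuristic)
                if (kegg[i] != "" && kegg_count[kegg[i]] > 1) ok = "FALSE"   # shared KEGG ID
                if (hmdb[i] != "" && hmdb_count[hmdb[i]] > 1) ok = "FALSE"   # shared HMDB ID
                print row[i], ok
            }
        }'
}

# Example with placeholder IDs (not real KEGG/HMDB accessions): only the first
# row stays TRUE; the second is ambiguous and the last two share KEGG_2.
printf 'Compound\tKEGG\tHMDB\nNMR_Lactate\tKEGG_1\tHMDB_1\nfructose/glucose\t\t\nLC-MS_Glycocholic acid\tKEGG_2\tHMDB_2\nLC-MS_Glycocholate\tKEGG_2\tHMDB_3\n' | flag_low_confidence
```

All rows are read before any flag is written, since a duplicate ID can only be detected after the whole table has been seen.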

:pushpin: Note: searching for metabolite identifiers by metabolite names may lead to inaccurate/partial mappings [11](#references).

Sample metadata files

Sample metadata files (metadata) include information about each sample/subject as provided in the original publication, typically in its supplementary information (see supplementary table S1 for the metadata source per dataset). Note that additional details per sample/subject may be available in the original publications. For example, the iHMP study further provides family history of IBD, answers to dietary questionnaires, medication logs, etc., all of which are available in the originally published metadata table.

We specifically unified the names of the following fields:

Field name	Description
Sample	Sample identifier. Corresponds to sample names in feature tables
Subject	Subject identifier. Some studies have multiple samples per subject
Study.Group	Study group as named in original study (typically one of the groups would be named 'control' or 'healthy' and the other will be named after the studied disease/condition)
Age	The subject's age, if available
Age.Units	One of: `Years`,`Months`,`Days`
Gender	One of: `Male`,`Female`,`Other`
BMI	The subject's BMI, if available
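A metadata file can be checked against the two controlled vocabularies in the table above with a short script. The sketch below is illustrative only: the function name `check_metadata_vocab` is hypothetical, the file is assumed to be tab-separated with a header line, and treating empty or `NA` cells as acceptable missing values is an assumption of this example.

```shell
# Hypothetical helper: report metadata rows whose Age.Units or Gender value falls
# outside the controlled vocabularies. Empty and "NA" cells are treated as
# missing and accepted (an assumption of this sketch).
# Input (stdin): tab-separated metadata table with a header line.
# Exit status: 0 if all values are valid, 1 otherwise, 2 if a column is missing.
check_metadata_vocab() {
    awk -F'\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Age.Units" in col) || !("Gender" in col)) {
                print "missing Age.Units or Gender column" > "/dev/stderr"
                bad = 2; exit
            }
            next
        }
        {
            units = $(col["Age.Units"]); gender = $(col["Gender"])
            if (units != "" && units != "NA" && units !~ /^(Years|Months|Days)$/) {
                printf "line %d: invalid Age.Units \"%s\"\n", NR, units; bad = 1
            }
            if (gender != "" && gender != "NA" && gender !~ /^(Male|Female|Other)$/) {
                printf "line %d: invalid Gender \"%s\"\n", NR, gender; bad = 1
            }
        }
        END { exit bad + 0 }'
}

# Example with made-up rows: "years" (lower case) and "F" are reported.
printf 'Sample\tAge.Units\tGender\ns1\tYears\tMale\ns2\tyears\tFemale\ns3\tNA\tF\n' \
    | check_metadata_vocab || echo "metadata needs fixing"
```

Columns are located by header name rather than by position, so the check does not depend on the column order of a particular dataset.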

In addition, we added the following three study-related fields to each metadata file:

Field name	Description
Dataset	The dataset's name, formatted as follows: `<First author>_<Short cohort description>_<Year of publication>`
DOI	Publication DOI
Publication.Name	Publication name
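The Dataset naming pattern can be expressed as a small helper. This is a sketch only: the function name `make_dataset_name` is hypothetical, and both the four-digit year check and the replacement of spaces with underscores are assumptions of this example rather than documented rules of the collection.

```shell
# Hypothetical helper: assemble a Dataset value from first author, short cohort
# description and publication year. Spaces inside a part are replaced with
# underscores (an assumption of this sketch).
make_dataset_name() {
    case "$3" in
        [0-9][0-9][0-9][0-9]) ;;
        *) echo "year must have four digits: $3" >&2; return 1 ;;
    esac
    printf '%s_%s_%s\n' "$1" "$2" "$3" | tr ' ' '_'
}

# Made-up values; prints "Doe_Example_cohort_2020".
make_dataset_name "Doe" "Example cohort" 2020
```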

References

1. Parks, Donovan H., et al. "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy." Nucleic acids research 50.D1 (2022): D785-D794.
2. Chen, Shifu, et al. "fastp: an ultra-fast all-in-one FASTQ preprocessor." Bioinformatics 34.17 (2018): i884-i890.
3. Langmead, Ben, and Steven L. Salzberg. "Fast gapped-read alignment with Bowtie 2." Nature methods 9.4 (2012): 357-359.
4. Wood, Derrick E., Jennifer Lu, and Ben Langmead. "Improved metagenomic analysis with Kraken 2." Genome biology 20.1 (2019): 1-13.
5. Lu, Jennifer, et al. "Bracken: estimating species abundance in metagenomics data." PeerJ Computer Science 3 (2017): e104.
6. Youngblut, Nicholas D., and Ruth E. Ley. "Struo2: efficient metagenome profiling database construction for ever-expanding microbial genome datasets." PeerJ 9 (2021): e12198.
7. Bolyen, Evan, et al. "Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2." Nature biotechnology 37.8 (2019): 852-857.
8. Callahan, Benjamin J., et al. "DADA2: high-resolution sample inference from Illumina amplicon data." Nature methods 13.7 (2016): 581-583.
9. Swedish Biodiversity Infrastructure (SBDI; 2021). SBDI Sativa curated 16S GTDB database. https://doi.org/10.17044/scilifelab.14869077
10. Pang, Zhiqiang, et al. "MetaboAnalystR 3.0: toward an optimized workflow for global metabolomics." Metabolites 10.5 (2020): 186.
11. Pham, Nhung, et al. "Consistency, inconsistency, and ambiguity of metabolite names in biochemical databases used for genome-scale metabolic modelling." Metabolites 9.2 (2019): 28.