wms fun 2022 - quadram-institute-bioscience/gmh-sops GitHub Wiki

Functional profiling of WMS data

Methods

Raw sequences have been quality checked with SeqFu 1.8.5 (Telatin 2021) and filtered using Fastp (Chen 2018) to remove sequences with any bases below Phred 15. Human reads have been filtered using Kraken 2 (Wood 2019) against the Human genome (release hg19). The integrated pre-processing pipeline is available at https://github.com/telatin/cleanup.

Functional profiling of the samples has been performed using the Humann 3.0 package from the Biobackery suite (McIver 2018; Beghini 2021), which was run storing the intermediate taxonomic annotation provided by Metaphlan 3.0.

The functional profile of each sample i.e. Genefamilies and Pathways were joined using the utility script human_join_tables into the respective summary files (i.e. Genefamilies.tsv and pathabundance.tsv).

In-house script (./scripts/extract_genefamilies.py and ./scripts/extract_pathabundance.py) was used to estimate the top N Genefamilies/pathways by filtering the taxa level contribution and UNMAPPED/UNINTEGRATED from the summary files. The scripts and the input files can be found here Functional_profiling.

A QC plot and the top N genefamilies/pathways abundance across the samples was visualized. The Rscript to generate the plots can be found here Functional_profiling.

The summary file were used for the estimation of top 10 genefamilies/pathways followed by visualization. The workflow for this as follows:

QC plot:
- First the UNMAPPED reads were extracted for each sample were obtained from GeneFamilies.tsv.
- MAPPED reads per sample = Total number of reads per sample - UNMAPPED reads per sample.
- Final data can be found in inputs/QC_Abundance/qc_plot.csv.
- The qc_plot.csv were used as input to plot the QC plot using scripts/QC_abdunace.Rmd.
- Note: The same plot also indicates the reads mapped to different pathways. Since, the genes found were used to detect the pathways.
Estimation of top 10 Genefamilies/pathways:
- In whole experiments
  - GeneFamilies.tsv or PathAbundance.tsv was used as input for the python scripts.
  - The top 10 Genefamilies/Pathways were estimated at community level (that means species-level abundance were discarded).
  - The 'UNMAPPED' and 'UNINTEGRATED' rows were filtered from the analysis.
  - First the total abundance of a genefamily/pathway was estimated by summing up their abundance across the samples.
  - Then total abundance were sorted in decreasing order and the top 10 genefamilies/pathways were extracted.
  - The abundance of top 10 genefamilies/pathways per sample were extracted and all other genefamilies/pathways are grouped into 'others'.
  - Relative abundance of top 10 genefamilies/pathways within a sample were estimated by: For e.g., Relative abundance of genefamily or pathway A in sample 1 = (abundance of genefamily or pathway A in sample 1)/(total abundance of genefamilies or pathways in sample 1)
  - The same principles were implemented in the scripts /scripts/extract_genefamilies.py and /scripts/extact_pathabundance.py
  - Top 10 Genefamilies:
    - cmd: python3 /scripts/extract_genefamilies.py -i GeneFamilies.tsv -o top10gf -t 10
    - output: top10gf-rel.csv can be found in inputs/QC_Abundance/
  - Top 10 Pathways:
    - cmd: python3 /scripts/extract_pathabundance.py -i PathAbundance.tsv -o top10pa -t 10
    - output: top10pa-rel.csv can be found in inputs/QC_Abundance/
- In different comparisons:
- A) “all” microbes from ME/CFS patients compared to “all” microbes from matched controls:
  - The samples "all" microbes from ME/CFS patients and matched controls were selected and rest samples are filltered off from the GeneFamilies.tsv/PathAbundance.tsv.
  - The relative abundance of top 10 genefamilies/pathways are re-estimated using the same python scripts as described above.
  - Top 10 Genefamilies:
    - cmd : python3 /scripts/extract_genefamilies.py -i GeneFamilies.tsv -o top10gf_A -t 10 -s GN101 GP101 GN104 GN105 GN107 GP107 GN108 GP108 GN109 GP109 GN110 GP110 GN102 GP102 GN103 GP103 GP104 GP105 GN106 GP106
    - Output: top10gf_A-rel.csv found in ./inputs/genefamilies_analysis/
  - Top 10 Pathways:
    - cmd : python3 /scripts/extract_pathabundance.py -i PathAbundance.tsv -o paA_patient_control -t 10 -s GN101 GP101 GN104 GN105 GN107 GP107 GN108 GP108 GN109 GP109 GN110 GP110 GN102 GP102 GN103 GP103 GP104 GP105 GN106 GP106
    - Output: top10gf_A-rel.csv found in ./inputs/pathways_analysis/
- B) “IgG+” microbes vs “IgG-“ microbes in ME/CFS patients:
  - The samples of IgG+/IgG- mircobes from patients were selected and rest of the samples were discarded.
  - The relative abundance of top 10 genefamilies/pathways are re-estimated using the same python scripts as described above. \
  - Top 10 Genefamilies:
    - cmd : python3 /scripts/extract_genefamilies.py -i GeneFamilies.tsv -o top10gf_PN_patient -t 10 -s GA104 GN104 GA106 GA108 GN108 GP108 GA110 GN110 GP110 GA102 GN102 GP102 GP104 GN106 GP106 GA101 GA103 GA105 GA107 GA109
    - Output: top10gf_PN_patient-rel.csv found in ./inputs/genefamilies_analysis/
  - Top 10 Pathways:
    - cmd : python3 /scripts/extract_pathabundance.py -i PathAbundance.tsv -o top10paPN_patient -t 10 -s GA104 GN104 GA106 GA108 GN108 GP108 GA110 GN110 GP110 GA102 GN102 GP102 GP104 GN106 GP106 GA101 GA103 GA105 GA107 GA109
    - Output: top10paPN_patient-rel.csv found in ./inputs/pathways_analysis/
- C) “IgG+” microbes vs “IgG-“ microbes in Controls:
  - The relative abundance of top 10 genefamilies/pathways are re-estimated using the same python scripts as described above.
  - Top 10 Genefamilies:
    - cmd : python3 /scripts/extract_genefamilies.py -i GeneFamilies.tsv -o top10gf_PN_control -t 10 -s GA101 GN101 GP101 GA103 GA105 GN105 GN107 GP107 GN109 GP109 GN103 GP103 GP105 GA107 GA109 GA104 GA106 GA108 GA110 GA102
    - Output: top10gf_PN_control-rel.csv found in ./inputs/genefamilies_analysis/
  - Top 10 Pathways:
    - cmd : python3 /scripts/extract_pathabundance.py -i PathAbundance.tsv -o top10gf_PN_control -t 10 -s GA101 GN101 GP101 GA103 GA105 GN105 GN107 GP107 GN109 GP109 GN103 GP103 GP105 GA107 GA109 GA104 GA106 GA108 GA110 GA102
    - Output: top10gf_PN_control-rel.csv found in ./inputs/pathways_analysis/
- D) Patient “IgG+” microbes to control “IgG+” microbes:
  - The relative abundance of top 10 genefamilies/pathways are re-estimated as descrbed in comparison A.
  - Top 10 Genefamilies:
    - cmd : python3 /scripts/extract_genefamilies.py -i GeneFamilies.tsv -o top10gf_P_patient_control -t 10 -s GA101 GN101 GA103 GA104 GN104 GA105 GN105 GA106 GN107 GA108 GN108 GN109 GA110 GN110 GA102 GN102 GN103 GN106 GA107 GA109
    - Output: top10gf_P_patient_control.csv found in ./inputs/genefamilies_analysis/
  - Top 10 Pathways:
    - cmd : python3 /scripts/extract_pathabundance.py -i PathAbundance.tsv -o top10pa_P_patient_control -t 10 -s GA101 GN101 GP101 GA103 GA105 GN105 GN107 GP107 GN109 GP109 GN103 GP103 GP105 GA107 GA109 GA104 GA106 GA108 GA110 GA102
    - Output: top10pa_P_patient_control-rel.csv found in ./inputs/pathways_analysis/
Visualization of results:
- All the plots for the analaysis were made in R. Please refer to the R markdown scripts for further reference.
- There were 3 R markdown scritpts and are stored in /scripts/ dir:
  - 1. QC_Abundance.Rmd: Consist of stacked bar plot representing the QC plots and two bubble plots showing the relative abundance of top 10 genefamilies/pathways abundance in whole experiment.
  - 1. GeneFamililes.Rmd: Consist of bubble plots showing the relative abundance of top 10 genefamilies in different comparison.
  - 1. Pathways.Rmd: Consist of bubble plots showing the relative abundance of top 10 pathways in different comparison.

Reports of the results have been generated using MultiQC (Ewels 2016). All the tools have been retrieved from the BioConda (Grüning 2018) repository.

References

Telatin A, Fariselli P, Birolo G. SeqFu: A Suite of Utilities for the Robust and Reproducible Manipulation of Sequence Files. Bioengineering (Basel, Switzerland). 2021 May;8(5). DOI: 10.3390/bioengineering8050059. PMID: 34066939
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics (Oxford, England). 2018 Sep;34(17):i884-i890. DOI: 10.1093/bioinformatics/bty560. PMID: 30423086; PMCID: PMC6129281.
Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biology. 2019 Nov;20(1):257. DOI: 10.1186/s13059-019-1891-0. PMID: 31779668; PMCID: PMC6883579.
Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics (Oxford, England). 2016 Oct;32(19):3047-3048. DOI: 10.1093/bioinformatics/btw354. PMID: 27312411; PMCID: PMC5039924.
Grüning B, Dale R, Sjödin A, et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods. 2018 Jul;15(7):475-476. DOI: 10.1038/s41592-018-0046-7. PMID: 29967506.
McIver LJ, Abu-Ali G, Franzosa EA, et al. bioBakery: a meta'omic analysis environment. Bioinformatics (Oxford, England). 2018 Apr;34(7):1235-1237. DOI: 10.1093/bioinformatics/btx754. PMID: 29194469; PMCID: PMC6030947.
Beghini F, McIver LJ, Blanco-Míguez A, et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. Elife. 2021 May;10. DOI: 10.7554/elife.65088. PMID: 33944776; PMCID: PMC8096432.