pb‐16S‐nf pipeline with internal Sequel IIe data - Nucleomics-VIB/16S_analysis_pipeline GitHub Wiki

(last edit: 2022-12-21)

Table of Contents


Aim

This page reports the analysis of internal read-sets for the Zymo community obtained from several projects and looks for outlier samples and therefore suspicious runs.

Note: The present analysis is purely explorative and does not provide guidelines to classify a run as 'good' or 'failed'.

Mock community

The amplicon sequencing data used in this project was extratcted from several Nucleomics Core projects where the same Zymo control DNA (aka Zymo mock community (D6305)) was amplified and sequenced using the Pacbio protocol and various combinations of barcodes primers (depending on each project).

In order to validate the workflow, we collected barcoded positive control samples from several Nucleomics Core 16S Sequel-IIe experiments and used them to compare pb-16S-nf results to the theoretical distribution present in the Zymo mock community

Note: OTU vs ASV

  • the term OTU (operating taxonomic unit) is often associated with a single organism in former metagenomic studies (based on short read data).

  • the term ASV (amplicon specific variant) was introduced more recently with full length high quality long-read sequencing, and defines a unique 16S sequence found in one or more samples.

A given ASV often correspond to a single 16S copy present in a given cell but since most bacteria (and likely other organisms too) have multiple distinct copies of the rRNA genes, each harboring sequence variations (hence the name ASV), several ASV's can in fact originate from the same unicellular organism. For this reason, summing up the ASV copies will not give the count of individual organism but instead produce counts larger than the true cellular abundance (further discussed here).

Note: Adding to the complexity of 16S analysis, the copy number of rDNA genes in bacteria can vary a lot and contribute to 16S counts deferentially affected depending on the bacterial host. This is well documented in the Zymo protocol document for the mock community used in this experiment.

Note: the two yeast samples present in the Zymo DNA are not amplified by the 16S-specific primers (they do no have a 16S rDNA gene but a 18S instead), and hence not expected in the results data.

[back-to-top]

Sequencing DATA

The FastQ read data for the Zymo community (D6305) sample included in several project was collected and labelled by the project internal code.

You can find Nucleomics Core Zymo community (D6305) sequencing data on the L: drive under: /ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/reads. For each new projects including the Zymo positive control sample, the corresponding FastQ will be added in that folder (currently containing 6+ samples).

# copy the data locally to repeat this study
cd <path-to-your-working-folder>
mkdir -p pb-16S-nf_data && cd pb-16S-nf_data
rsync -avr /mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/reads .
rsync -avr /mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/run*.tsv .

Two text files were produced to describe and annotate the data and are reproduced below.

Note: The columns shown here are required (and their names should be set as shown)

  • run_samples.tsv
sample-id absolute-file-path
4170_bc1005--bc1096 /mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/reads/4170_bc1005--bc1096.fastq.gz
4285_bc1022--bc1107 /mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/reads/4285_bc1022--bc1107.fastq.gz
4296_bc1022--bc1060 /mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/reads/4296_bc1022--bc1060.fastq.gz
4112_bc1008--bc1075 /mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/reads/4112_bc1008--bc1075.fastq.gz
4128_bc1005--bc1107 /mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/reads/4128_bc1005--bc1107.fastq.gz
4356_bc1005--bc1112 /mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data/reads/4356_bc1005--bc1112.fastq.gz

Note: The full path to each FastQ file in run_samples.tsv should be adapted to your own local copy for the code to work.

# adapth the path to your local path in run_samples.tsv
curpath=$PWD
sed -i 's#/mnt/nuc-data/ResearchDev/_NC_Long-Read_DATA/PacBio/pb-16S-nf_data#'${curpath}'#' run_samples.tsv
  • run_metadata.tsv
sample_name condition
4170_bc1005--bc1096 control
4285_bc1022--bc1107 control
4296_bc1022--bc1060 control
4112_bc1008--bc1075 control
4128_bc1005--bc1107 control
4356_bc1005--bc1112 control

Note: additional metadata columns can be added to the right of the metadata file and will be fowarded in the QIIME objects for later use in grouping and labelling of the results.

[back-to-top]

Method

We describe here a new nextflow pipeline **pb-16S-nf developed by Khi Pin, Chua (@proteinosome) as part of the Pacbio open code hosted on github and which can be used to analyze data obtained with the **Pacbio 16S method. Khi Pin is actively developing this package further and was very helpful in developing this code and correcting a few issues.

Note: This Nextflow pipeline is designed to process PacBio HiFi full-length 16S data into high quality amplicon sequence variants (ASVs) using QIIME2 and DADA2. It provides a set of visualization through the QIIME 2 framework for interactive plotting. The pipeline generates a HTML report for the important statistics and top taxonomies>> (taken from the github page).

Several published tools were used with PacBio 16S long read amplicons. These tools often use DADA2 for the processing of the amplicon data and SILVA as database for the classification of the obtained OTU (Operational taxonomic unit) or ASV (amplicon sequence variant) entities. The latest version of SILVA (v138.1) 1 is currently chosen for classification as it represents the mosty up-to-date normalized set of bacterial references.

[back-to-top]

The pb-16S-nf nextflow pipeline depends on two text files and a matching folder of demultiplexed HiFi fastq files produced by the SMRTLink platform.

The pipeline performs a number pre-processing steps followed by DADA2 and Qiime2 commands. All of it integrated and standardized for ease of use.

The general workflow is shown in the next figure

[back-to-top]

Nextflow install and setup

The nextflow pipeline is available from github and should be cloned locally on the analysis server.

git clone https://github.com/PacificBiosciences/pb-16S-nf.git
cd pb-16S-nf

# run With docker
nextflow run main.nf --download_db -profile docker

During the first run (adding resources to the local server), the docker components (N=3) will be downloaded and installed in the default Docker cache on the server

  • kpinpb/pb-16s-nf-tools
  • kpinpb/pb-16s-nf-qiime
  • kpinpb/pb-16s-vis

In addition, the classification databases are also copied from the web and stored locally.

[back-to-top]

Results

Running the nextflow pipeline is very simple and only requires a few arguments to define inputs and outputs and tune the run parameters.

Note: By default, the rarefaction depth is computed automatically to include 80% of the samples. In circumstances where all samples need to be included, an arbitrary depth based on the smaller sample(s) can be given as argument in the run command --rarefaction_depth <observed smaller depth> using the read QC to define the cutoff value.

The reference databases used in this pipeline for the Naive Bayes classifier are:

  • SILVA: nr99_v138.1_wSpecies_train_set

  • GTDB: GTDB_bac120_arc53_ssu_r207_fullTaxo

  • RefSeq_16S_6-11-20_RDPv16_fullTaxo

For VSEARCH, the dabase used is GTDB_ssu_all_r207

show command arguments

nextflow run main.nf --help
N E X T F L O W  ~  version 22.10.1
Launching `main.nf` [fabulous_brahmagupta] DSL2 - revision: 6990708c9f

  Usage:
  This pipeline takes in the standard sample manifest and metadata file used in
  QIIME 2 and produces QC summary, taxonomy classification results and visualization.

  For samples TSV, two columns named "sample-id" and "absolute-filepath" are
  required. For metadata TSV file, at least two columns named "sample_name" and
  "condition" to separate samples into different groups.

  nextflow run main.nf --input samples.tsv --metadata metadata.tsv \
    --dada2_cpu 8 --vsearch_cpu 8

  By default, sequences are first trimmed with cutadapt. If adapters are already trimmed, you can skip 
  cutadapt by specifying "--skip_primer_trim".

  Other important options:
  --front_p    Forward primer sequence. Default to F27. (default: AGRGTTYGATYMTGGCTCAG)
  --adapter_p    Reverse primer sequence. Default to R1492. (default: AAGTCGTAACAAGGTARCY)
  --filterQ    Filter input reads above this Q value (default: 20).
  --max_ee    DADA2 max_EE parameter. Reads with number of expected errors higher than
              this value will be discarded (default: 2)
  --minQ    DADA2 minQ parameter. Reads with any base lower than this score 
            will be removed (default: 0)
  --min_len    Minimum length of sequences to keep (default: 1000)
  --max_len    Maximum length of sequences to keep (default: 1600)
  --pooling_method    QIIME 2 pooling method for DADA2 denoise see QIIME 2 
                      documentation for more details (default: "pseudo", alternative: "independent") 
  --maxreject    max-reject parameter for VSEARCH taxonomy classification method in QIIME 2
                 (default: 100)
  --maxaccept    max-accept parameter for VSEARCH taxonomy classification method in QIIME 2
                 (default: 100)
  --min_asv_totalfreq    Total frequency of any ASV must be above this threshold
                         across all samples to be retained. Set this to 0 to disable filtering
                         (default 5)
  --min_asv_sample    ASV must exist in at least min_asv_sample to be retained. 
                      Set this to 0 to disable. (default 1)
  --vsearch_identity    Minimum identity to be considered as hit (default 0.97)
  --rarefaction_depth    Rarefaction curve "max-depth" parameter. By default the pipeline
                         automatically select a cut-off above the minimum of the denoised 
                         reads for >80% of the samples. This cut-off is stored in a file called
                         "rarefaction_depth_suggested.txt" file in the results folder
                         (default: null)
  --dada2_cpu    Number of threads for DADA2 denoising (default: 8)
  --vsearch_cpu    Number of threads for VSEARCH taxonomy classification (default: 8)
  --cutadapt_cpu    Number of threads for primer removal using cutadapt (default: 16)
  --outdir    Output directory name (default: "results")
  --vsearch_db	Location of VSEARCH database (e.g. silva-138-99-seqs.qza can be
                downloaded from QIIME database)
  --vsearch_tax    Location of VSEARCH database taxonomy (e.g. silva-138-99-tax.qza can be
                   downloaded from QIIME database)
  --silva_db   Location of Silva 138 database for taxonomy classification 
  --gtdb_db    Location of GTDB r202 for taxonomy classification
  --refseq_db    Location of RefSeq+RDP database for taxonomy classification
  --skip_primer_trim    Skip all primers trimming (switch off cutadapt and DADA2 primers
                        removal) (default: trim with cutadapt)
  --skip_nb    Skip Naive-Bayes classification (only uses VSEARCH) (default: false)
  --colorby    Columns in metadata TSV file to use for coloring the MDS plot
               in HTML report (default: condition)
  --run_picrust2    Run PICRUSt2 pipeline. Note that pathway inference with 16S using PICRUSt2
                    has not been tested systematically (default: false)
  --download_db    Download databases needed for taxonomy classification only. Will not
                   run the pipeline. Databases will be downloaded to a folder "databases"
                   in the Nextflow pipeline directory.
  --version    Output version

An example command is provided next with some custome arguments including an arbitrary rarefaction limit.

# full path to where your pulled the github repo
tooldir="/opt/biotools/pb-16S-nf"

# move to the pb-16s-nf repo folder, as we will work from there and read/write to the project working folder
cd ${tooldir}

# use >= 32 cpu for good performance
cpu=32

# edit the next variable to the the full path to your working folder
infolder=<full_path_to_your_working_folder>

# give a name for the pipeline result folder
outfolder="${infolder}/nextflow_results"

# define the full path to the two input files
sample_file=${infolder}/run_samples.tsv
metadata_file=${infolder}/run_metadata.tsv

# when necessary, define the rarefaction value based on the smallest sample
rarefaction=10000

# opt: add rarefaction value to the result folder name
outfolder=${outfolder}_${rarefaction}

# pipeline default (run with -help to discover more default valmues)
#  --min_asv_totalfreq    Total frequency of any ASV must be above this threshold
#                         across all samples to be retained. Set this to 0 to disable filtering
#                         (default 5)
#  --min_asv_sample    ASV must exist in at least min_asv_sample to be retained. 
#                      Set this to 0 to disable. (default 1)

minasvfreq=5
minasv=1 

nextflow run main.nf \
  --input ${sample_file} \
  --metadata ${metadata_file} \
  --outdir ${outfolder} \
  --min_asv_totalfreq ${minasvfreq} \
  --min_asv_sample ${minasv} \
  --dada2_cpu ${cpu} \
  --vsearch_cpu ${cpu} \
  --cutadapt_cpu ${cpu} \
  --rarefaction_depth ${rarefaction} \
  -profile docker

The run outputs progress information as shown next

Launching `main.nf` [confident_fermat] DSL2 - revision: 6990708c9f

  Parameters set for pb-16S-nf pipeline for PacBio HiFi 16S
  =========================================================
  Number of samples in samples TSV: 6
  Filter input reads above Q: 20
  Trim primers with cutadapt: Yes
  Forward primer: AGRGTTYGATYMTGGCTCAG
  Reverse primer: AAGTCGTAACAAGGTARCY
  Minimum amplicon length filtered in DADA2: 1000
  Maximum amplicon length filtered in DADA2: 1600
  maxEE parameter for DADA2 filterAndTrim: 2
  minQ parameter for DADA2 filterAndTrim: 0
  Pooling method for DADA2 denoise process: pseudo
  Minimum number of samples required to keep any ASV: 1
  Minimum number of reads required to keep any ASV: 5
  Taxonomy sequence database for VSEARCH: /opt/biotools/pb-16S-nf/databases/GTDB_ssu_all_r207.qza
  Taxonomy annotation database for VSEARCH: /opt/biotools/pb-16S-nf/databases/GTDB_ssu_all_r207.taxonomy.qza
  Skip Naive Bayes classification: false
  SILVA database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/silva_nr99_v138.1_wSpecies_train_set.fa.gz
  GTDB database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/GTDB_bac120_arc53_ssu_r207_fullTaxo.fa.gz
  RefSeq + RDP database for Naive Bayes classifier: /opt/biotools/pb-16S-nf/databases/RefSeq_16S_6-11-20_RDPv16_fullTaxo.fa.gz
  VSEARCH maxreject: 100
  VSEARCH maxaccept: 100
  VSEARCH perc-identity: 0.97
  QIIME 2 rarefaction curve sampling depth: 10000
  Number of threads specified for cutadapt: 80
  Number of threads specified for DADA2: 80
  Number of threads specified for VSEARCH: 80
  Script location for HTML report generation: /opt/biotools/pb-16S-nf/scripts/visualize_biom.Rmd
  Container enabled via docker/singularity: true
  Version of Nextflow pipeline: 0.4

executor >  Local (35)
[16/961acc] process > pb16S:write_log                      [100%] 1 of 1
[18/96f98f] process > pb16S:QC_fastq (3)                   [100%] 6 of 6
[fd/90ad41] process > pb16S:cutadapt (6)                   [100%] 6 of 6
[87/d217c6] process > pb16S:QC_fastq_post_trim (6)         [100%] 6 of 6
[3b/1d189f] process > pb16S:collect_QC                     [100%] 1 of 1
[77/b05b00] process > pb16S:prepare_qiime2_manifest        [100%] 1 of 1
[05/7b5f13] process > pb16S:import_qiime2                  [100%] 1 of 1
[ea/578c10] process > pb16S:demux_summarize                [100%] 1 of 1
[bb/3dc94d] process > pb16S:dada2_denoise                  [100%] 1 of 1
[f3/8da1e4] process > pb16S:filter_dada2                   [100%] 1 of 1
[1d/f0b337] process > pb16S:dada2_qc (1)                   [100%] 1 of 1
[76/9c0007] process > pb16S:qiime2_phylogeny_diversity (1) [100%] 1 of 1
[2c/44c38b] process > pb16S:dada2_rarefaction (1)          [100%] 1 of 1
[02/5a6f6a] process > pb16S:class_tax                      [100%] 1 of 1
[1d/f3a1b4] process > pb16S:dada2_assignTax                [100%] 1 of 1
[4c/051df4] process > pb16S:export_biom                    [100%] 1 of 1
[14/a4333e] process > pb16S:barplot_nb (1)                 [100%] 1 of 1
[0c/f2be75] process > pb16S:barplot (1)                    [100%] 1 of 1
[86/8cbd2e] process > pb16S:html_rep (1)                   [100%] 1 of 1
[ea/771e6b] process > pb16S:krona_plot                     [100%] 1 of 1
Completed at: 20-Dec-2022 14:08:30
Duration    : 13m 45s
CPU hours   : 17.3
Succeeded   : 35

The main output folder has the following standard structure:

cutadapt_summary
dada2
filtered_input_FASTQ
import_qiime
nb_tax
parameters.txt
results
summary_demux
trimmed_primers_FASTQ

Note: The results folder contains symbolic links to all final key files and 4 folders which can all be forwarded to the customer as-is (!follow symlinks during copy to include the subfolder phylogeny_diversity/core-metrics-diversity which is not copied by a regular ftp transfer!)

alpha-rarefaction-curves.qzv
best_tax_merged_freq_tax.tsv
best_taxonomy.tsv
best_taxonomy_withDB.tsv
best_tax.qza
dada2_qc.tsv
dada2_stats.qzv
dada2_table.qzv
feature-table-tax.biom
feature-table-tax_vsearch.biom
krona.qzv
merged_freq_tax.qzv
phylogeny_diversity/
phylogeny_diversity/core-metrics-diversity/
rarefaction_depth_suggested.txt
reads_QC/
samplefile.txt
stats.tsv
tax_export/
taxonomy_barplot_nb.qzv
taxonomy_barplot_vsearch.qzv
taxonomy.vsearch.qza
visualize_biom.html
vsearch_merged_freq_tax.tsv

The results of this run are shared next to this report in the Zymo-SequelIIe-Hifi_results_local folder to allow more exploration of this typical data.

All files ending with .qzv are QIIME2 visualization files that can be fed to the online QIIME2-Viewer (https://view.qiime2.org/) to create and customize plots or tables.

Files with extension .qza are QIIME2 objects that can be reloaded in QIIME2 to proceed in the analysis while files with extension .tsv are data files that can be used for further analysis (eg. in R).

Files with .nwk extension can be uploaded to iTol to visualize the corresponding trees.

The main file present in the results folder is the RMarkdown converted document visualize_biom.html reporting all QC metrics and key findings through the user browser. Tables in that file are live and can be filtered.

The pipeline also exports .biom objects which can be loaded using R package like Phyloseq to proceed with deeper analysis of the data. Please refer to Phyloseq vignettes for a starter.

[back-to-top]

Discussion

The effect of rarefaction on the number of retained samples can be seen by plotting the file alpha-rarefaction-curves.qzv.

With a rarefaction value of 10000, all samples are included in the analysis while when runnnig the pipeline in automatic mode, one of the samples would be excluded (80% inclusion aimed by default).

As other illustration we show below two of the plots produced by the pipeline and showing multi dimensional principal component analyses results

  • Bray Curtis (phylogeny_diversity/core-metrics-diversity/bray_curtis_emperor.qzv)

  • Weighted unifrac (phylogeny_diversity/core-metrics-diversity/weighted_unifrac_emperor.qzv)

The viewer can also produce pictures for other results present in the results folder among which the two classification outcomes produced by the workflow for our Zymo samples. The Naive Bayes Classifier is based on multiple search results and may be more complete but may sometimes include absent species (see doc on github) while the Vsearch classification is sometimes more accurate and is based only on Vsearch best hits.

[back-to-top]

  • Taxonomy plot made from the Naive Bayes results taxonomy_barplot_nb.qzv

  • Taxonomy plot made from the Vsearch results taxonomy_barplot_vsearch.qzv

When compared to the theoretical distribution shown next, experimental results are nicely concordant except for some species name differences.

[back-to-top]

Additional plots can be produced from the Krona classification results saved in krona_html as shown next for sample bc1008-bc1075

[back-to-top]

Note: Various tables (.qzv) can be converted to pretty tables in the viewer as well.

As example, the content of phylogeny_diversity/dada2_stats.qzv (this table is also present in the html report)

[back-to-top]

A phylogeny tree can be converted to a picture using TreeViewer \footnote{\url{https://github.com/arklumpus/TreeViewer}} or iTol \footnote{\url{https://itol.embl.de/}} and the file phylotree_mafft_rooted.nwk (note that, in the current version of the pipeline, the the tree shows the ASV labels rather than the stain names). The picture below has been edited to show the classification on the tips using a custom script.

REM: iTol outputs are shown next

[back-to-top]

Conclusion

The pb-16s-nf pipeline is fast and stable and allows analysis of multiple barcoded samples obtained from PacBio Sequel-IIe runs with minimal user setup (preparing both samples.tsv and metadata.tsv).

Note: Additional metadata columns can be passed to the tool and will be saved in the qiime2 objects, allowing advanced graphics production and follow-up analyses.

In this particular mini-project (synthetic community), multiple ASV of the same genus+species suggest different 16S copies with different sequences present in the same bacterium present in the mock community.

The height expected species were found back, although with slightly different names as compared to the Zymo documents. Some with a unique ASV (Pseudomonas.aeruginosa and Enterococcus.faecalis) while others were represented by multiple ASVs (eg. Escherichia.Coli and Bacilus.subtilis [intestinalis] both with 6 ASV detected).

Local alignmenent of the three Listeria ASV sequences show only a single nucleotide difference between any two of them, highlighting the sensitivity and precision of the Pacbio 16S sequencing approach.

Rem: Only the first three lines of the multiple alignment are shown here to highlight SNVs.

The pipeline produces a variety of outputs that can be fed to other Qiime2 methods or imported into R for advanced analysis.

For more advanced follow-up analyzes, the user can use the many tools present in the Qiime2 package and get support from the Qiime2 Forum as well as import the objects in R and create custom plots or downstream analyses there using the qiime2R R package (see tutorial on the qiime2 pages). Deeper analysis of the data can also be done using Phyloseq and other R packages.

[back-to-top]

References

1: SILVA database v138.1 (link)

[back-to-top]

⚠️ **GitHub.com Fallback** ⚠️