PacBio CCS Amplicon SOP v1 (qiime2) - LangilleLab/microbiome_helper GitHub Wiki

This standard operating procedure (SOP) is based on QIIME2 and is meant for users who want to quickly run their PacBio CCS amplicon data through the Microbiome Helper virtual box image and for internal use. Note that this workflow simply adapts our current Illumina amplicon workflows by altering the first steps to be compatible with the PacBio datatype, therefore you will need to use and familiarize yourself with the Illumina workflow of your choice on the right menu bar in order to complete processing.

If you use this workflow make sure to keep track of the commands you use locally as this page will be updated over time (see "revisions" above for earlier versions).

Requirements

This workflow assumes that you have QIIME2 installed in a conda environment (the appropriate version matching the Illumina workflow you will continue) and that you also have the seqtk toolset installed in this environment.

This workflow also assumes that the input is raw Circular Consensus Sequencing (CCS) PacBio data in demultiplexed FASTQ format located within a folder called raw_data. The filenames can be almost anything you wish (contrary to most QIIME2 importing) since you are going to use a "manifest file" to list each file.

1. First steps

Steps 1.1 to 1.4

As per the Illumina amplicon SOP.

1.5 Resolve orientation problems

As PacBio sequencing does not produce reads in a consistent F or R direction, unlike Illumina, the resulting raw FASTQ files have reads in both the 5'-3' plus 3'-5' directions...these "mixed orientation" files then break most pipelines in that many tools cannot read both orientations at the same time (ie: some tools can search the other orientation, but would assume all reads would be reversed if it found any). The easiest way to resolve this problem, without resorting to complex scripting, is to simply make a copy of the raw files, reverse complement them, concatenate them together (so now 2x the original size/content), then run this "doubled" file through cutadapt below.

gzip -d raw_data/*.gz
mkdir raw_data_rc/
mkdir raw_data_cat/
parallel -j 4 'seqtk seq -r {} > raw_data_rc/{/.}_rc.fastq' ::: raw_data/*.fastq
parallel -j 4 --link 'cat {1} {2} > raw_data_cat/{1/.}_cat.fastq' ::: raw_data/*.fastq ::: raw_data_rc/*_rc.fastq

1.6 Trim primers with cutadapt

We now do the standard screening out of reads that do not begin+end with the correct primer sequences and remove those sequences from reads using cutadapt. As we specify just the "proper" 5'-3' orientation here, all the extra reads of the above "doubled-up" files are now removed and we are left with only the same orientation for all reads. The below primers correspond to the full-length 16S (Bacteria-specific primer set).

mkdir trimmed_reads/

parallel -j 4 'cutadapt -g AGRGTTYGATYMTGGCTCAG...AAGTCGTAACAAGGTARCY \
               --discard-untrimmed --no-indels -j 1 -m 1200 -M 1800 \
               -o trimmed_reads/{/.}_trim.fastq {}' \
              ::: raw_data_cat/*_cat.fastq

Note we are using linked anchored adapters here to ensure only reads having the primers at the extremities are retained and a size-range of 1300-1800 nt to allow some indels (there should be relatively few in actuality), but prevent amplicon dimers (we've observed ~1% in PacBio CCS data) from passing through, since the size of the 16S amplicon here is ~1500 nt.

If using our full-length 16S Archaea-specific primers (currently in testing), use the following command:

parallel -j 4 'cutadapt -g TCCGGTTGATCCYGCCGG...TGCYCCTTGYACWCACYG \
               --discard-untrimmed --no-indels -j 1 -m 1200 -M 1800 \
               -o trimmed_reads/{/.}_trim.fastq {}' \
              ::: raw_data_cat/*_cat.fastq

If using our full-length 18S primers, use the following command (note a much wider size-range, around the ~1800 nt normal size, due to the fact some species of Eukaryotes have much larger indels in their rRNA than Bacteria do):

parallel -j 4 'cutadapt -g CTGGTTGATYCTGCCAGT...GTAGGTGAACCTGCAGAAGGATCA \
               --discard-untrimmed --no-indels -j 1 -m 1000 -M 3000 \
               -o trimmed_reads/{/.}_trim.fastq {}' \
              ::: raw_data_cat/*_cat.fastq

If using our full-length fungal ITS primers, use the following command (again, size-range is modified for the proper range to keep variation, but hopefully remove larger dimers; ~600 nt average size):

parallel -j 4 'cutadapt -g TAGAGGAAGTAAAAGTCGTAA...GCAWAWCAAWAAGCGGAGGA \
                --discard-untrimmed --no-indels -j 1 -m 300 -M 900 \
                -o trimmed_reads/{/.}_trim.fastq {}' \
               ::: raw_data_cat/*_cat.fastq

If you see substantial losses of read numbers after this step (ie: your file-sizes are now much smaller than the original "raw" CCS files), then make absolutely certain you are using the correct primer sequences (in the correct linked orientations) and have not filtered out most of your reads due to the size-range restrictions - this would indicate a problem in matching your fragment with the correct parameters (especially important if using custom primers and adjusting all the above values).

1.7 Import FASTQs as QIIME 2 artifact

The trimmed reads can now be imported into the QIIME 2 "artifact" file format (with the extension QZA). The slight difference here compared to standard Illumina file importing is that you need to use a "manifest" file - consult the QIIME2 documentation about preparing it, but essentially it is just a tab-delimited text file containing the sample names + absolute path to each file in the trimmed_reads/ folder we just made above.

mkdir reads_qza

qiime tools import \
    --type SampleData[SequencesWithQuality] \
    --input-path PacBioCCSmanifest.tsv \
    --output-path reads_qza/trimmed_reads.qza \
    --input-format SingleEndFastqManifestPhred33V2

1.8 Summarize trimmed FASTQs

You can run the demux summarize command after trimming the reads to get a report of the number of reads per sample and quality distribution across the reads. This generates a more basic output compared to FASTQC/MultiQC, but is sufficient for this step.

qiime demux summarize \
   --i-data reads_qza/trimmed_reads.qza \
   --o-visualization reads_qza/trimmed_reads_summary.qzv

2. Denoising the reads into amplicon sequence variants

At this stage, the main 2 pipelines you can use are based on either deblur or DADA2. Below we will describe the commands for running DADA2, which now supports PacBio reads, as Deblur may not work correctly since it does not currently have a PacBio mode.

2.1 Running DADA2

Run the DADA2 workflow to correct reads and get amplicon sequence variants (ASVs). Note that we are not doing any form of initial quality filtering (unlike our Illumina SOP) or truncation in the below command due to the fact the CCS reads are already of very high quality when produced (99.9% consensus accuracy). You will probably also want to increase the number of threads used below to the maximum your system has available. If your PacBio data is not CCS in nature or you have used a lower consensus threshold (ie: <99%), then we would suggest you add in a quality-filter step.

qiime dada2 denoise-single \
   --i-demultiplexed-seqs reads_qza/trimmed_reads.qza \
   --p-trunc-len 0 \
   --p-max-ee 3 \
   --p-n-threads 4 \
   --p-n-reads-learn 100000 \
   --output-dir dada2_output

2.2 Summarizing DADA2 output

Once a denoising pipeline has been run you can summarize the output table with the below command, which will create a visualization artifact for you to view.

qiime feature-table summarize \
   --i-table dada2_output/table.qza \
   --o-visualization dada2_output/dada2_table_summary.qzv

You should also take a look at the read count table to see how many reads were retained at each step of the DADA2 pipeline:

qiime tools export --input-path dada2_output/denoising_stats.qza --output-path dada2_output
mv dada2_output/stats.tsv dada2_output/dada2_stats.tsv

Steps 3 and Onward

Continue on with the standard Illumina amplicon SOP, keeping in mind the following notes:

Remember to modify the expected filenames and folders from "deblur..." to "dada2..." throughout.
There is no "bleed-through" phenomenon in PacBio sequencing (addressed in Step 4.1) and depth is much shallower than Illumina, therefore you may have to adjust the levels to which you filter out rare sequences (although some filtering to remove noise/singletons is probably still recommended).
Remember to use the full-length versions of the taxonomic reference files when identifying ASVs.