Tutorial_kraken2 - UPHL-BioNGS/Cecret GitHub Wiki

Tutorial : Using a Kraken2 Database with SARS-CoV-2

The original use-case for running Cecret was for SARS-CoV-2 consensus creation and lineage determination, and this remains the default purpose for this workflow.

A fun fact was that when SARS-CoV-2 became prominent in the early 2020s, there was not yet a SARS-CoV-2 reference in one of the standard kraken2 databases. This is why kraken2 analysis will always be optional is included only as a QC metric.

Steps in this tutorial:

Step 1. Downloading some sample fastq files

This step is going to create a directory called reads, enter that directory, and then download 6 test files for 3 test samples.

mkdir reads
cd reads
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test1_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test1_2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test2_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test2_2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test3_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test3_2.fastq.gz
cd ../

These reads or files are paired-end, meaning there are two files for every one sample. This is the typical case for most sequencing projects.

These reads were subsampled from samples SRR32571082, SRR32571051, and SRR32571045, which were created using Illumina COVIDSeq Artic v5.3.2 primers.

Step 2. Downloading the kraken2 database

There are a lot of kraken2 databases out there, and they get continually updated. The most recent kraken2 databases can be found at https://benlangmead.github.io/aws-indexes/k2.

Since we're using some SARS-CoV-2 test files, the standard viral database should be sufficient.

This step is going to create a directory called kraken2_db, enter it, download a standard kraken2 database, decompress it, and then exit the directory.

mkdir kraken2_db
cd kraken2_db
wget https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20241228.tar.gz
tar -zxvf k2_viral_20241228.tar.gz
cd ../

Step 3. Start the workflow

Now the workflow can be run using the following command

nextflow run UPHL-BioNGS/Cecret -profile docker --reads reads --kraken2_db kraken2_db

Step 4. Look through the results

The summary file

A summary file can be found at cecret/cecret_results.csv.

Basically it looks like this:

sample_id,sample,pangolin_lineage,nextclade_clade,vadr_p/f,fasta_line,fastqc_raw_reads_1,fastqc_raw_reads_2,num_N,num_total,seqyclean_PairsKept,seqyclean_Perc_Kept,top_organism,percent_reads_top_organism,%_human_reads,num_pos_100X,insert_size_after_trimming,bcftools_variants_identified,samtools_meandepth_after_trimming,samtools_per_1X_coverage_after_trimming,vadr_model,vadr_alerts,nextclade_clade_who,nextclade_qc.overallScore,nextclade_qc.overallStatus,pangolin_conflict,pangolin_ambiguity_score,pangolin_scorpio_call,pangolin_scorpio_support,pangolin_scorpio_conflict,pangolin_scorpio_notes,pangolin_version,pangolin_pangolin_version,pangolin_scorpio_version,pangolin_constellation_version,pangolin_is_designated,pangolin_qc_status,pangolin_qc_notes,pangolin_note,pango_aliasor_lineage,pango_aliasor_unaliased_lineage,freyja_summarized,Cecret version,seqyclean,bwa,ivar,ivar consensus
test1,test1,XEC.2.1,24F,PASS,test1,55820.0,55820.0,296,29694,53557.0,95.9459,Severe acute respiratory syndrome-related coronavirus,99.62,0,29398,141.8,149,386.763,99.3746,NC_045512,-,,0.694444,good,0.0,,Omicron (BA.2-like),0.87,0.02,scorpio call: Alt alleles 54; Ref alleles 1; Amb alleles 2; Oth alleles 5,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.03,Usher placements: XEC.2.1(1/1); scorpio lineage BA.2 conflicts with inference lineage XEC.2.1,XEC.2.1,XEC.2.1,[('XEC* (XEC.X)'  0.9999999988066752)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test2,test2,LP.8.1.1,25A,PASS,test2,54688.0,54688.0,76,29767,52729.0,96.4179,Severe acute respiratory syndrome-related coronavirus,99.61,0,29691,144.7,147,386.836,99.6121,NC_045512,-,Omicron,21.006944,good,0.0,,Omicron (BA.2-like),0.87,0.05,scorpio call: Alt alleles 54; Ref alleles 3; Amb alleles 0; Oth alleles 5,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.02,Usher placements: LP.8.1.1(1/1),LP.8.1.1,B.1.1.529.2.86.1.1.11.1.1.1.3.8.1.1,[('LP.8.1* (LP.8.1.X)'  0.9999999997448206)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test3,test3,LB.1.3.1,24A,PASS,test3,34523.0,34523.0,4310,29686,34130.0,98.8616,Severe acute respiratory syndrome-related coronavirus,99.87,0,25341,186.4,147,286.386,99.1773,NC_045512,-,Omicron,220.577503,bad,0.0,,Omicron (BA.2-like),0.87,0.03,scorpio call: Alt alleles 54; Ref alleles 2; Amb alleles 3; Oth alleles 3,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.16,Usher placements: LB.1.3.1(1/1),LB.1.3.1,B.1.1.529.2.86.1.1.9.2.1.3.1,[('LB.1* (LB.1.X)'  0.9999999998669523)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3

Although it might be easier to see a rendered version of the csv at this link.

Explanation of columns:

sample_id : the sample id for the sample which is often parsed from 'sample'
sample : this name of the sample
pangolin_lineage : lineage assigned by pangolin
nextclade_clade : clade assigned by nextclade
vadr_p/f : whether or not vadr "passes" or "fails" the consensus sequence
fasta_line : header of generated consensus fasta for sample
fastqc_raw_reads_1 : number of reads in R1 as determined by fastqc
fastqc_raw_reads_2 : number of reads in R2 as determined by fastqc
num_N : number of "N"s or ambiguous bases in generated consensus sequence (lower is better)
num_total : the total number of base predictions in consensus sequence (higher is better)
seqyclean_PairsKept : the number of pairs kept by seqyclean (higher is better)
seqyclean_Perc_Kept : the percentage of pairs kept by seqyclean (higher is better)
top_organism : the organism with the most representation
percent_reads_top_organism : the corresponding percentage of reads assigned to that organism
%_human_reads : the number of reads predicted to be human (only useful if the kraken2 database contains a human reference)
num_pos_100X : the number of positions with 100+ depth / with 100+ reads covering that base (higher is better)
insert_size_after_trimming : should make sense for the library prep kit used
bcftools_variants_identified : the number of differences identified via bcftools
samtools_meandepth_after_trimming : the mean depth across the reference
samtools_per_1X_coverage_after_trimming : percentage of bases in the reference with at least 1X depth
vadr_* : from vadr output
nextclade_* : from nextclade output
pangolin_* : from pangolin output
pango_aliasor_lineage : lineage assigned by pangolin
pango_aliasor_unaliased_lineage : full lineage path of pangolin lineage
freyja_summarized : output from freyja for sample simplified to one cell value (wastewater is recommended to use freyja files)
Cecret version : version of Cecret run on samples
seqyclean : version of seqylcean used in workflow
bwa : version of seqylcean used in workflow
ivar : version of seqylcean used in workflow
ivar consensus : version of seqylcean used in workflow

top_organism, percent_reads_top_organism, and %_human_reads are kraken2 specific and these columns will not populate if no kraken2 database is supplied.

The consensus fasta sequences

The consensus sequences generated via Cecret can be used for submission to GISAID, GENBANK, and other repositories.

They are located in cecret/consensus

$ ls cecret/consensus/
test1.consensus.fa  test2.consensus.fa  test3.consensus.fa

This tutorial produced three fasta files: test1.consensus.fa, test2.consensus.fa, and test3.consensus.fa

The inside of the files look something like this

$ head cecret/consensus/test1.consensus.fa 
>test1
AAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGT
TGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGC
ACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACA
CGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGT
CTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCA
ACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCT
GGTAGCAGAACTCGAAGGCATTCAGTACGGTTGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGA
AATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGC
CGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACAC

These fasta files have unique headers, so files can be combined into a multifasta with something like

cat cecret/consensus/* > combined.fasta

There are more features to this workflow