Tutorial_kraken2 - UPHL-BioNGS/Cecret GitHub Wiki

Tutorial : Using a Kraken2 Database with SARS-CoV-2

The original use-case for running Cecret was for SARS-CoV-2 consensus creation and lineage determination, and this remains the default purpose for this workflow.

A fun fact was that when SARS-CoV-2 became prominent in the early 2020s, there was not yet a SARS-CoV-2 reference in one of the standard kraken2 databases. This is why kraken2 analysis will always be optional is included only as a QC metric.

Steps in this tutorial:

Step 1. Downloading some sample fastq files

This step is going to create a directory called reads, enter that directory, and then download 6 test files for 3 test samples.

mkdir reads
cd reads
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test1_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test1_2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test2_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test2_2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test3_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test3_2.fastq.gz
cd ../

These reads or files are paired-end, meaning there are two files for every one sample. This is the typical case for most sequencing projects.

These reads were subsampled from samples SRR32571082, SRR32571051, and SRR32571045, which were created using Illumina COVIDSeq Artic v5.3.2 primers.

Step 2. Downloading the kraken2 database

There are a lot of kraken2 databases out there, and they get continually updated. The most recent kraken2 databases can be found at https://benlangmead.github.io/aws-indexes/k2.

Since we're using some SARS-CoV-2 test files, the standard viral database should be sufficient.

This step is going to create a directory called kraken2_db, enter it, download a standard kraken2 database, decompress it, and then exit the directory.

mkdir kraken2_db
cd kraken2_db
wget https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20241228.tar.gz
tar -zxvf k2_viral_20241228.tar.gz
cd ../

Step 3. Start the workflow

Now the workflow can be run using the following command

nextflow run UPHL-BioNGS/Cecret -profile docker --reads reads --kraken2_db kraken2_db

Step 4. Look through the results

The summary file

A summary file can be found at cecret/cecret_results.csv.

Basically it looks like this:

sample_id,sample,pangolin_lineage,nextclade_clade,vadr_p/f,fasta_line,fastqc_raw_reads_1,fastqc_raw_reads_2,num_N,num_total,seqyclean_PairsKept,seqyclean_Perc_Kept,top_organism,percent_reads_top_organism,%_human_reads,num_pos_100X,insert_size_after_trimming,bcftools_variants_identified,samtools_meandepth_after_trimming,samtools_per_1X_coverage_after_trimming,vadr_model,vadr_alerts,nextclade_clade_who,nextclade_qc.overallScore,nextclade_qc.overallStatus,pangolin_conflict,pangolin_ambiguity_score,pangolin_scorpio_call,pangolin_scorpio_support,pangolin_scorpio_conflict,pangolin_scorpio_notes,pangolin_version,pangolin_pangolin_version,pangolin_scorpio_version,pangolin_constellation_version,pangolin_is_designated,pangolin_qc_status,pangolin_qc_notes,pangolin_note,pango_aliasor_lineage,pango_aliasor_unaliased_lineage,freyja_summarized,Cecret version,seqyclean,bwa,ivar,ivar consensus
test1,test1,XEC.2.1,24F,PASS,test1,55820.0,55820.0,296,29694,53557.0,95.9459,Severe acute respiratory syndrome-related coronavirus,99.62,0,29398,141.8,149,386.763,99.3746,NC_045512,-,,0.694444,good,0.0,,Omicron (BA.2-like),0.87,0.02,scorpio call: Alt alleles 54; Ref alleles 1; Amb alleles 2; Oth alleles 5,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.03,Usher placements: XEC.2.1(1/1); scorpio lineage BA.2 conflicts with inference lineage XEC.2.1,XEC.2.1,XEC.2.1,[('XEC* (XEC.X)'  0.9999999988066752)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test2,test2,LP.8.1.1,25A,PASS,test2,54688.0,54688.0,76,29767,52729.0,96.4179,Severe acute respiratory syndrome-related coronavirus,99.61,0,29691,144.7,147,386.836,99.6121,NC_045512,-,Omicron,21.006944,good,0.0,,Omicron (BA.2-like),0.87,0.05,scorpio call: Alt alleles 54; Ref alleles 3; Amb alleles 0; Oth alleles 5,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.02,Usher placements: LP.8.1.1(1/1),LP.8.1.1,B.1.1.529.2.86.1.1.11.1.1.1.3.8.1.1,[('LP.8.1* (LP.8.1.X)'  0.9999999997448206)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test3,test3,LB.1.3.1,24A,PASS,test3,34523.0,34523.0,4310,29686,34130.0,98.8616,Severe acute respiratory syndrome-related coronavirus,99.87,0,25341,186.4,147,286.386,99.1773,NC_045512,-,Omicron,220.577503,bad,0.0,,Omicron (BA.2-like),0.87,0.03,scorpio call: Alt alleles 54; Ref alleles 2; Amb alleles 3; Oth alleles 3,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.16,Usher placements: LB.1.3.1(1/1),LB.1.3.1,B.1.1.529.2.86.1.1.9.2.1.3.1,[('LB.1* (LB.1.X)'  0.9999999998669523)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3

Although it might be easier to see a rendered version of the csv at this link.

Explanation of columns:

  • sample_id : the sample id for the sample which is often parsed from 'sample'
  • sample : this name of the sample
  • pangolin_lineage : lineage assigned by pangolin
  • nextclade_clade : clade assigned by nextclade
  • vadr_p/f : whether or not vadr "passes" or "fails" the consensus sequence
  • fasta_line : header of generated consensus fasta for sample
  • fastqc_raw_reads_1 : number of reads in R1 as determined by fastqc
  • fastqc_raw_reads_2 : number of reads in R2 as determined by fastqc
  • num_N : number of "N"s or ambiguous bases in generated consensus sequence (lower is better)
  • num_total : the total number of base predictions in consensus sequence (higher is better)
  • seqyclean_PairsKept : the number of pairs kept by seqyclean (higher is better)
  • seqyclean_Perc_Kept : the percentage of pairs kept by seqyclean (higher is better)
  • top_organism : the organism with the most representation
  • percent_reads_top_organism : the corresponding percentage of reads assigned to that organism
  • %_human_reads : the number of reads predicted to be human (only useful if the kraken2 database contains a human reference)
  • num_pos_100X : the number of positions with 100+ depth / with 100+ reads covering that base (higher is better)
  • insert_size_after_trimming : should make sense for the library prep kit used
  • bcftools_variants_identified : the number of differences identified via bcftools
  • samtools_meandepth_after_trimming : the mean depth across the reference
  • samtools_per_1X_coverage_after_trimming : percentage of bases in the reference with at least 1X depth
  • vadr_* : from vadr output
  • nextclade_* : from nextclade output
  • pangolin_* : from pangolin output
  • pango_aliasor_lineage : lineage assigned by pangolin
  • pango_aliasor_unaliased_lineage : full lineage path of pangolin lineage
  • freyja_summarized : output from freyja for sample simplified to one cell value (wastewater is recommended to use freyja files)
  • Cecret version : version of Cecret run on samples
  • seqyclean : version of seqylcean used in workflow
  • bwa : version of seqylcean used in workflow
  • ivar : version of seqylcean used in workflow
  • ivar consensus : version of seqylcean used in workflow

top_organism, percent_reads_top_organism, and %_human_reads are kraken2 specific and these columns will not populate if no kraken2 database is supplied.

The consensus fasta sequences

The consensus sequences generated via Cecret can be used for submission to GISAID, GENBANK, and other repositories.

They are located in cecret/consensus

$ ls cecret/consensus/
test1.consensus.fa  test2.consensus.fa  test3.consensus.fa

This tutorial produced three fasta files: test1.consensus.fa, test2.consensus.fa, and test3.consensus.fa

The inside of the files look something like this

$ head cecret/consensus/test1.consensus.fa 
>test1
AAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGT
TGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGC
ACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACA
CGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGT
CTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCA
ACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCT
GGTAGCAGAACTCGAAGGCATTCAGTACGGTTGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGA
AATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGC
CGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACAC

These fasta files have unique headers, so files can be combined into a multifasta with something like

cat cecret/consensus/* > combined.fasta

There are more features to this workflow