Tutorial_sample_sheet - UPHL-BioNGS/Cecret GitHub Wiki

Tutorial : Default settings with SARS-CoV-2 using a sample sheet

The original use-case for running Cecret was for SARS-CoV-2 consensus creation and lineage determination, and this remains the default purpose for this workflow.

Steps in this tutorial:

Step 1. Downloading some sample fastq files

This step is going to create a directory called reads, enter that directory, and then download 6 test files for 3 test samples.

mkdir reads
cd reads
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test1_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test1_2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test2_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test2_2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test3_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test3_2.fastq.gz
cd ../

These reads are paired-end, meaning there are two fastq files for each sample. This is the typical case for most sequencing projects.

These reads were subsampled from samples SRR32571082, SRR32571051, and SRR32571045, which were created using Illumina COVIDSeq Artic v5.3.2 primers.
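Since every sample needs both files, it can be worth confirming that the download is complete before moving on. The following is a minimal sketch, assuming the _1.fastq.gz / _2.fastq.gz naming used in this tutorial; the function name check_pairs is hypothetical and not part of Cecret.

```shell
# check_pairs DIR: report any *_1.fastq.gz file in DIR that lacks a
# non-empty *_2.fastq.gz mate. Returns 0 only when every pair is complete.
check_pairs() {
  local dir="$1" missing=0 r1 r2
  for r1 in "$dir"/*_1.fastq.gz; do
    # When the glob matches nothing, r1 is the literal pattern
    [ -e "$r1" ] || { echo "no *_1.fastq.gz files in $dir" >&2; return 1; }
    r2="${r1%_1.fastq.gz}_2.fastq.gz"
    if [ ! -s "$r2" ]; then
      echo "missing or empty mate for $r1" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

From the directory that contains reads, this could be run as check_pairs reads. It prints nothing when all three pairs are present.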

Step 2. Creating the sample sheet

Sample sheets are required in several use cases, including:

When nextflow cannot correctly parse the names of fastq files, resulting in mis-paired fastq files or mis-labeled result files.
When using a nextflow-compatible cloud service.

Sample sheets for Cecret are not hard to create, but the formatting must be followed closely.

The header for the sample sheet must be sample,fastq_1,fastq_2.

sample : the name of the sample (e.g. SRR32571082 or 98731-UT-19)
fastq_1 : the first file (read 1) of a paired-end fastq file pair
fastq_2 : the second file (read 2) of a paired-end fastq file pair

The paths in fastq_1 and fastq_2 may need to be full paths on some systems. For example, when using an AWS-based service, each path must start with the s3 location: s3://<full path to fastq file>. For this tutorial, relative paths are sufficient.

Example sample_sheet.csv

sample,fastq_1,fastq_2
test1,reads/test1_1.fastq.gz,reads/test1_2.fastq.gz
test2,reads/test2_1.fastq.gz,reads/test2_2.fastq.gz
test3,reads/test3_1.fastq.gz,reads/test3_2.fastq.gz

This file can be downloaded with:

wget https://raw.githubusercontent.com/erinyoung/cecret_test_data/refs/heads/main/data/sarscov2/sample_sheet.csv
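For your own data, a sheet in the same format could be generated from the files in a directory instead of being written by hand. The sketch below assumes files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz, as in this tutorial; the function name make_sample_sheet is hypothetical, and other naming schemes would need a different pattern.

```shell
# make_sample_sheet DIR: print a sample sheet (sample,fastq_1,fastq_2)
# for every *_1.fastq.gz file found in DIR.
make_sample_sheet() {
  local dir="$1" r1 r2 sample
  echo "sample,fastq_1,fastq_2"
  for r1 in "$dir"/*_1.fastq.gz; do
    [ -e "$r1" ] || continue                  # glob matched nothing
    r2="${r1%_1.fastq.gz}_2.fastq.gz"         # expected mate file
    sample="$(basename "$r1" _1.fastq.gz)"    # sample name from file name
    echo "$sample,$r1,$r2"
  done
}
```

Running make_sample_sheet reads > sample_sheet.csv writes relative paths like the example above. Passing "$PWD/reads" instead writes full paths, which some systems require.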

Step 3. Start the workflow

nextflow run UPHL-BioNGS/Cecret -profile docker --sample_sheet sample_sheet.csv
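If the workflow stops because it cannot find its input, the sample sheet is a good first thing to check. The following is a rough sketch of such a check, not something Cecret provides: check_sample_sheet is a hypothetical helper that compares the header against the required one and tests that each local file exists. Paths starting with s3:// are skipped because they cannot be tested locally, and the sketch assumes no commas inside sample names or paths.

```shell
# check_sample_sheet FILE: verify the header line and that every
# local fastq path listed in the sheet exists. Returns 0 when all is well.
check_sample_sheet() {
  local sheet="$1" status=0 header sample fq1 fq2 fq
  header="$(head -n 1 "$sheet" | tr -d '\r')"
  if [ "$header" != "sample,fastq_1,fastq_2" ]; then
    echo "unexpected header: $header" >&2
    return 1
  fi
  while IFS=, read -r sample fq1 fq2; do
    [ -n "$sample" ] || continue        # skip blank lines
    for fq in "$fq1" "$fq2"; do
      case "$fq" in
        s3://*) continue ;;             # remote path, not checked here
      esac
      if [ ! -e "$fq" ]; then
        echo "$sample: file not found: $fq" >&2
        status=1
      fi
    done
  done <<EOF
$(tail -n +2 "$sheet" | tr -d '\r')
EOF
  return "$status"
}
```

It would be called as check_sample_sheet sample_sheet.csv from the directory where nextflow is launched, so that relative paths resolve the same way.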

Step 4. Look through the results

The summary file

A summary file can be found at cecret/cecret_results.csv.

It looks like this:

sample_id,sample,pangolin_lineage,nextclade_clade,vadr_p/f,fasta_line,fastqc_raw_reads_1,fastqc_raw_reads_2,num_N,num_total,seqyclean_PairsKept,seqyclean_Perc_Kept,num_pos_100X,insert_size_after_trimming,bcftools_variants_identified,samtools_meandepth_after_trimming,samtools_per_1X_coverage_after_trimming,vadr_model,vadr_alerts,nextclade_clade_who,nextclade_qc.overallScore,nextclade_qc.overallStatus,pangolin_conflict,pangolin_ambiguity_score,pangolin_scorpio_call,pangolin_scorpio_support,pangolin_scorpio_conflict,pangolin_scorpio_notes,pangolin_version,pangolin_pangolin_version,pangolin_scorpio_version,pangolin_constellation_version,pangolin_is_designated,pangolin_qc_status,pangolin_qc_notes,pangolin_note,pango_aliasor_lineage,pango_aliasor_unaliased_lineage,freyja_summarized,Cecret version,seqyclean,bwa,ivar,ivar consensus
test1,test1,XEC.2.1,24F,PASS,test1,55820.0,55820.0,296,29694,53557.0,95.9459,29398,141.8,149,386.763,99.3746,NC_045512,-,,0.694444,good,0.0,,Omicron (BA.2-like),0.87,0.02,scorpio call: Alt alleles 54; Ref alleles 1; Amb alleles 2; Oth alleles 5,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.03,Usher placements: XEC.2.1(1/1); scorpio lineage BA.2 conflicts with inference lineage XEC.2.1,XEC.2.1,XEC.2.1,[('XEC* (XEC.X)'  0.9999999988066752)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test2,test2,LP.8.1.1,25A,PASS,test2,54688.0,54688.0,76,29767,52729.0,96.4179,29691,144.7,147,386.836,99.6121,NC_045512,-,Omicron,21.006944,good,0.0,,Omicron (BA.2-like),0.87,0.05,scorpio call: Alt alleles 54; Ref alleles 3; Amb alleles 0; Oth alleles 5,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.02,Usher placements: LP.8.1.1(1/1),LP.8.1.1,B.1.1.529.2.86.1.1.11.1.1.1.3.8.1.1,[('LP.8.1* (LP.8.1.X)'  0.9999999997448206)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test3,test3,LB.1.3.1,24A,PASS,test3,34523.0,34523.0,4310,29686,34130.0,98.8616,25341,186.4,147,286.386,99.1773,NC_045512,-,Omicron,220.577503,bad,0.0,,Omicron (BA.2-like),0.87,0.03,scorpio call: Alt alleles 54; Ref alleles 2; Amb alleles 3; Oth alleles 3,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.16,Usher placements: LB.1.3.1(1/1),LB.1.3.1,B.1.1.529.2.86.1.1.9.2.1.3.1,[('LB.1* (LB.1.X)'  0.9999999998669523)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3

A rendered version of the csv may be easier to read at this link.

Explanation of columns:

sample_id : the sample id for the sample which is often parsed from 'sample'
sample : the name of the sample
pangolin_lineage : lineage assigned by pangolin
nextclade_clade : clade assigned by nextclade
vadr_p/f : whether or not vadr "passes" or "fails" the consensus sequence
fasta_line : header of generated consensus fasta for sample
fastqc_raw_reads_1 : number of reads in R1 as determined by fastqc
fastqc_raw_reads_2 : number of reads in R2 as determined by fastqc
num_N : number of "N"s or ambiguous bases in generated consensus sequence (lower is better)
num_total : the total number of base predictions in consensus sequence (higher is better)
seqyclean_PairsKept : the number of pairs kept by seqyclean (higher is better)
seqyclean_Perc_Kept : the percentage of pairs kept by seqyclean (higher is better)
num_pos_100X : the number of positions with 100+ depth / with 100+ reads covering that base (higher is better)
insert_size_after_trimming : the insert size after trimming, which should make sense for the library prep kit used
bcftools_variants_identified : the number of differences identified via bcftools
samtools_meandepth_after_trimming : the mean depth across the reference
samtools_per_1X_coverage_after_trimming : percentage of bases in the reference with at least 1X depth
vadr_* : from vadr output
nextclade_* : from nextclade output
pangolin_* : from pangolin output
pango_aliasor_lineage : lineage assigned by pangolin
pango_aliasor_unaliased_lineage : full lineage path of pangolin lineage
freyja_summarized : output from freyja for sample simplified to one cell value (for wastewater samples, using the freyja output files directly is recommended)
Cecret version : version of Cecret run on samples
seqyclean : version of seqyclean used in workflow
bwa : version of bwa used in workflow
ivar : version of ivar used in workflow
ivar consensus : version of ivar used for the consensus step in workflow
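With this many columns, it is often easier to pull out just the few of interest. The sketch below is one way to do that with awk, looking columns up by header name; summary_columns is a hypothetical helper. It assumes column names without spaces and fields without embedded commas, which holds for the example rows above but is not guaranteed for every run.

```shell
# summary_columns FILE COL...: print only the named columns of a csv,
# in the order given, selecting them by their header names.
summary_columns() {
  local file="$1"
  shift
  awk -F',' -v cols="$*" '
    BEGIN { n = split(cols, want, " ") }
    NR == 1 {
      # Map each header name to its column number
      for (i = 1; i <= NF; i++) idx[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(want[j] in idx)) {
          print "no such column: " want[j] > "/dev/stderr"
          exit 1
        }
      }
    }
    {
      for (j = 1; j <= n; j++) {
        printf "%s%s", $(idx[want[j]]), (j < n ? "," : "\n")
      }
    }
  ' "$file"
}
```

For example, summary_columns cecret/cecret_results.csv sample pangolin_lineage nextclade_clade num_N would print a four-column csv with one row per sample, which makes it easy to spot that test3 has far more Ns than the other two samples.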

The consensus fasta sequences

The consensus sequences generated via Cecret can be used for submission to GISAID, GenBank, and other repositories.

They are located in cecret/consensus:

$ ls cecret/consensus/
test1.consensus.fa  test2.consensus.fa  test3.consensus.fa

This tutorial produced three fasta files: test1.consensus.fa, test2.consensus.fa, and test3.consensus.fa.

The contents of each file look something like this:

$ head cecret/consensus/test1.consensus.fa 
>test1
AAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGT
TGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGC
ACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACA
CGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGT
CTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCA
ACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCT
GGTAGCAGAACTCGAAGGCATTCAGTACGGTTGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGA
AATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGC
CGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACAC

These fasta files have unique headers, so they can be combined into a multifasta with something like:

cat cecret/consensus/* > combined.fasta
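After combining, it can be useful to confirm that the multifasta holds the expected number of records and that no header appears twice. A small sketch follows; fasta_header_report is a hypothetical helper that uses only grep, sort, uniq, and sed.

```shell
# fasta_header_report FILE: print the number of fasta records in FILE,
# followed by one line for each header that occurs more than once.
fasta_header_report() {
  local fasta="$1"
  echo "records: $(grep -c '^>' "$fasta")"
  grep '^>' "$fasta" | sort | uniq -d | sed 's/^/duplicate header: /'
}
```

Running fasta_header_report combined.fasta on the three tutorial samples should report three records and no duplicate headers.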

There are more features to this workflow