Tutorial_sample_sheet - UPHL-BioNGS/Cecret GitHub Wiki
Tutorial : Default settings with SARS-CoV-2 using a sample sheet
The original use case for Cecret was SARS-CoV-2 consensus creation and lineage determination, and this remains the default purpose of the workflow.
Steps in this tutorial:
- Downloading some sample fastq files
- Creating the sample sheet
- Start the workflow
- Look through the results
Step 1. Downloading some sample fastq files
This step creates a directory called reads, enters that directory, and then downloads 6 test files for 3 test samples.
mkdir reads
cd reads
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test1_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test1_2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test2_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test2_2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test3_1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/sarscov2/test3_2.fastq.gz
cd ../
These files contain paired-end reads, meaning there are two files for every sample. This is the typical case for most sequencing projects.
These reads were subsampled from samples SRR32571082, SRR32571051, and SRR32571045, which were created using Illumina COVIDSeq Artic v5.3.2 primers.
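Because each sample needs both of its files, it can help to confirm that every download arrived with its mate before moving on. The following is a minimal sketch, not part of Cecret: the function name check_pairs is hypothetical, and it assumes the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming used in this tutorial. It only walks the _1 files, so an orphaned _2 file would go unnoticed.

```shell
# Report any sample in a directory whose paired files are missing or empty.
# Hypothetical helper; assumes <sample>_1.fastq.gz / <sample>_2.fastq.gz names.
check_pairs() {
  local dir="$1" status=0 r1 r2
  for r1 in "$dir"/*_1.fastq.gz; do
    # An unmatched glob stays literal, so stop if nothing exists.
    [ -e "$r1" ] || { echo "no *_1.fastq.gz files in $dir" >&2; return 1; }
    r2="${r1%_1.fastq.gz}_2.fastq.gz"
    if [ ! -s "$r1" ] || [ ! -s "$r2" ]; then
      echo "missing or empty file in pair: $r1 / $r2" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage: check_pairs reads && echo "all pairs present"
```

The function returns a non-zero status when any pair is incomplete, so it can gate later steps in a script.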
Step 2. Creating the sample sheet
Sample sheets are required in several use cases, including:
- When nextflow cannot correctly parse the names of fastq files, resulting in mis-paired fastq files or mis-labeled result files
- When using a nextflow-compatible cloud service
Sample sheets for Cecret are not hard to create, but the formatting can be particular.
The header for the sample sheet must be sample,fastq_1,fastq_2.
- sample : the name of the sample (e.g. SRR32571082 or 98731-UT-19)
- fastq_1 : the first file (read 1) of a paired-end fastq file pair
- fastq_2 : the second file (read 2) of a paired-end fastq file pair
The paths to fastq_1 and fastq_2 may need to be full paths on some systems. For example, if using an AWS-based service, the filename must start with the s3 location s3://<full path to fastq file>. For this tutorial, relative paths are sufficient.
Example sample_sheet.csv
sample,fastq_1,fastq_2
test1,reads/test1_1.fastq.gz,reads/test1_2.fastq.gz
test2,reads/test2_1.fastq.gz,reads/test2_2.fastq.gz
test3,reads/test3_1.fastq.gz,reads/test3_2.fastq.gz
This file can also be downloaded directly:
wget https://raw.githubusercontent.com/erinyoung/cecret_test_data/refs/heads/main/data/sarscov2/sample_sheet.csv
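For larger runs, typing the sheet by hand gets tedious. As a rough sketch, the rows can be generated from the contents of the reads directory. The function name make_sample_sheet is hypothetical, and the approach assumes every sample follows the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming pattern used here; other naming schemes would need a different suffix.

```shell
# Write a sample sheet to stdout with one row per paired sample in a directory.
# Hypothetical helper; assumes <sample>_1.fastq.gz / <sample>_2.fastq.gz names.
make_sample_sheet() {
  local dir="${1%/}" r1 r2 sample
  echo "sample,fastq_1,fastq_2"
  for r1 in "$dir"/*_1.fastq.gz; do
    [ -e "$r1" ] || continue   # skip the literal pattern when nothing matches
    r2="${r1%_1.fastq.gz}_2.fastq.gz"
    [ -e "$r2" ] || { echo "skipping $r1: no mate found" >&2; continue; }
    sample="$(basename "$r1" _1.fastq.gz)"
    echo "$sample,$r1,$r2"
  done
}

# Usage: make_sample_sheet reads > sample_sheet.csv
```

Run from the directory that contains reads, this should write relative paths in the same form as the example sheet above.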
Step 3. Start the workflow
nextflow run UPHL-BioNGS/Cecret -profile docker --sample_sheet sample_sheet.csv
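A mistyped header or path is easier to catch before the run than after it. The sketch below is one way to pre-check a sheet ahead of the command above; validate_sample_sheet is a hypothetical helper, not something Cecret provides. It assumes local paths with no embedded commas, so it is not suitable for s3:// locations.

```shell
# Check a sample sheet: exact header, three fields per row, readable fastq files.
# Hypothetical helper; only meaningful for local paths (not s3:// URIs).
validate_sample_sheet() {
  local sheet="$1" header
  header="$(head -n 1 "$sheet" | tr -d '\r')"
  if [ "$header" != "sample,fastq_1,fastq_2" ]; then
    echo "unexpected header: $header" >&2
    return 1
  fi
  # Read the data rows in a subshell and hand its exit status back to the caller.
  tail -n +2 "$sheet" | tr -d '\r' | (
    status=0
    while IFS=, read -r sample fq1 fq2 extra || [ -n "$sample" ]; do
      [ -n "$sample$fq1$fq2" ] || continue   # ignore blank lines
      if [ -z "$sample" ] || [ -z "$fq1" ] || [ -z "$fq2" ] || [ -n "$extra" ]; then
        echo "malformed row for sample '$sample'" >&2
        status=1
        continue
      fi
      for fq in "$fq1" "$fq2"; do
        [ -s "$fq" ] || { echo "$sample: cannot read $fq" >&2; status=1; }
      done
    done
    exit "$status"
  )
}

# Usage: validate_sample_sheet sample_sheet.csv && echo "sample sheet looks fine"
```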
Step 4. Look through the results
The summary file
A summary file can be found at cecret/cecret_results.csv.
It looks like this:
sample_id,sample,pangolin_lineage,nextclade_clade,vadr_p/f,fasta_line,fastqc_raw_reads_1,fastqc_raw_reads_2,num_N,num_total,seqyclean_PairsKept,seqyclean_Perc_Kept,num_pos_100X,insert_size_after_trimming,bcftools_variants_identified,samtools_meandepth_after_trimming,samtools_per_1X_coverage_after_trimming,vadr_model,vadr_alerts,nextclade_clade_who,nextclade_qc.overallScore,nextclade_qc.overallStatus,pangolin_conflict,pangolin_ambiguity_score,pangolin_scorpio_call,pangolin_scorpio_support,pangolin_scorpio_conflict,pangolin_scorpio_notes,pangolin_version,pangolin_pangolin_version,pangolin_scorpio_version,pangolin_constellation_version,pangolin_is_designated,pangolin_qc_status,pangolin_qc_notes,pangolin_note,pango_aliasor_lineage,pango_aliasor_unaliased_lineage,freyja_summarized,Cecret version,seqyclean,bwa,ivar,ivar consensus
test1,test1,XEC.2.1,24F,PASS,test1,55820.0,55820.0,296,29694,53557.0,95.9459,29398,141.8,149,386.763,99.3746,NC_045512,-,,0.694444,good,0.0,,Omicron (BA.2-like),0.87,0.02,scorpio call: Alt alleles 54; Ref alleles 1; Amb alleles 2; Oth alleles 5,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.03,Usher placements: XEC.2.1(1/1); scorpio lineage BA.2 conflicts with inference lineage XEC.2.1,XEC.2.1,XEC.2.1,[('XEC* (XEC.X)' 0.9999999988066752)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test2,test2,LP.8.1.1,25A,PASS,test2,54688.0,54688.0,76,29767,52729.0,96.4179,29691,144.7,147,386.836,99.6121,NC_045512,-,Omicron,21.006944,good,0.0,,Omicron (BA.2-like),0.87,0.05,scorpio call: Alt alleles 54; Ref alleles 3; Amb alleles 0; Oth alleles 5,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.02,Usher placements: LP.8.1.1(1/1),LP.8.1.1,B.1.1.529.2.86.1.1.11.1.1.1.3.8.1.1,[('LP.8.1* (LP.8.1.X)' 0.9999999997448206)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test3,test3,LB.1.3.1,24A,PASS,test3,34523.0,34523.0,4310,29686,34130.0,98.8616,25341,186.4,147,286.386,99.1773,NC_045512,-,Omicron,220.577503,bad,0.0,,Omicron (BA.2-like),0.87,0.03,scorpio call: Alt alleles 54; Ref alleles 2; Amb alleles 3; Oth alleles 3,PUSHER-v1.32,4.3.1,0.3.19,v0.1.12,False,pass,Ambiguous_content:0.16,Usher placements: LB.1.3.1(1/1),LB.1.3.1,B.1.1.529.2.86.1.1.9.2.1.3.1,[('LB.1* (LB.1.X)' 0.9999999998669523)],v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
A rendered version of the csv, which may be easier to read, is available at this link.
Explanation of columns:
- sample_id : the id for the sample, which is often parsed from 'sample'
- sample : the name of the sample
- pangolin_lineage : lineage assigned by pangolin
- nextclade_clade : clade assigned by nextclade
- vadr_p/f : whether the consensus sequence passes or fails vadr
- fasta_line : header of generated consensus fasta for sample
- fastqc_raw_reads_1 : number of reads in R1 as determined by fastqc
- fastqc_raw_reads_2 : number of reads in R2 as determined by fastqc
- num_N : number of "N"s or ambiguous bases in generated consensus sequence (lower is better)
- num_total : the total number of base predictions in consensus sequence (higher is better)
- seqyclean_PairsKept : the number of pairs kept by seqyclean (higher is better)
- seqyclean_Perc_Kept : the percentage of pairs kept by seqyclean (higher is better)
- num_pos_100X : the number of positions with a depth of at least 100X, i.e. with 100 or more reads covering that base (higher is better)
- insert_size_after_trimming : should make sense for the library prep kit used
- bcftools_variants_identified : the number of differences identified via bcftools
- samtools_meandepth_after_trimming : the mean depth across the reference
- samtools_per_1X_coverage_after_trimming : percentage of bases in the reference with at least 1X depth
- vadr_* : from vadr output
- nextclade_* : from nextclade output
- pangolin_* : from pangolin output
- pango_aliasor_lineage : lineage assigned by pangolin
- pango_aliasor_unaliased_lineage : full lineage path of pangolin lineage
- freyja_summarized : output from freyja for the sample, simplified to one cell value (for wastewater samples, using the full freyja output files is recommended)
- Cecret version : version of Cecret run on samples
- seqyclean : version of seqyclean used in workflow
- bwa : version of bwa used in workflow
- ivar : version of ivar used in workflow
- ivar consensus : version of ivar used to generate the consensus
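With this many columns, it is often easier to pull out only the handful of interest. The following sketch selects columns by header name with awk. The function name summary_cols is hypothetical, and the comma splitting is naive: it assumes no field contains a quoted, embedded comma, which holds for the rows shown above but may not hold for every run.

```shell
# Print selected CSV columns by header name, in the order requested.
# Hypothetical helper; naive comma splitting (no quoted, embedded commas).
summary_cols() {
  local file="$1" cols="$2"
  awk -F',' -v OFS=',' -v want="$cols" '
    NR == 1 {
      n = split(want, names, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
          print "no such column: " names[j] > "/dev/stderr"
          exit 1
        }
        idx[j] = pos[names[j]]
      }
    }
    {
      line = $(idx[1])
      for (j = 2; j <= n; j++) line = line OFS $(idx[j])
      print line
    }
  ' "$file"
}

# Usage:
# summary_cols cecret/cecret_results.csv "sample,pangolin_lineage,nextclade_clade,num_N,num_pos_100X"
```

Passing the column names as one comma-separated string keeps names that contain spaces, such as Cecret version, intact.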
The consensus fasta sequences
The consensus sequences generated via Cecret can be used for submission to GISAID, GenBank, and other repositories.
They are located in cecret/consensus:
$ ls cecret/consensus/
test1.consensus.fa test2.consensus.fa test3.consensus.fa
This tutorial produced three fasta files: test1.consensus.fa, test2.consensus.fa, and test3.consensus.fa.
The contents of each file look something like this:
$ head cecret/consensus/test1.consensus.fa
>test1
AAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGT
TGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGC
ACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACA
CGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGT
CTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCA
ACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCT
GGTAGCAGAACTCGAAGGCATTCAGTACGGTTGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGA
AATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGC
CGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACAC
These fasta files have unique headers, so they can be combined into a multifasta with something like:
cat cecret/consensus/* > combined.fasta
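Before combining or submitting, a quick look at each sequence's length and N content can be useful, as can confirming that the headers really are unique. The two helpers below are a sketch under simple assumptions: the names fasta_n_counts and dup_headers are hypothetical, and both expect plain-text FASTA files with Unix line endings.

```shell
# Print "<header><TAB><length><TAB><N count>" for each record in the given FASTA files.
fasta_n_counts() {
  awk '
    function emit() { if (name != "") printf "%s\t%d\t%d\n", name, len, n }
    /^>/ { emit(); name = substr($0, 2); len = 0; n = 0; next }
    { seq = $0; len += length(seq); n += gsub(/[Nn]/, "", seq) }
    END { emit() }
  ' "$@"
}

# Print any FASTA header that appears more than once across the given files.
dup_headers() {
  grep -h '^>' "$@" | sort | uniq -d
}

# Usage:
# fasta_n_counts cecret/consensus/*.fa
# dup_headers cecret/consensus/*.fa   # no output means every header is unique
```

The N counts printed here can be compared by eye against the num_N column of the summary file.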
There are more features to this workflow