Tutorial_new_virus - UPHL-BioNGS/Cecret GitHub Wiki
New virus tutorial
Cecret
is meant to be flexible. In early 2020, there needed to be a workflow where a primer bedfile existed and a reference existed, but everything else was still lagging behind.
For the purpose of this tutorial, assume that vampirism has suddenly emerged. A reference genome and primer scheme exist for the causative virus, but no additional containerized tools for lineage determination or other analyses are available. The following steps outline the process for analyzing viral samples.
Steps in this tutorial:
- Downloading some sample fastq files
- Download the reference files
- Start the workflow
- Look through the results
Step 1. Downloading some sample fastq files
This step is going to create a directory called reads
, enter that directory, and then download 6 test files for 3 test samples.
mkdir reads
cd reads
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test1_R1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test1_R2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test2_R1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test2_R2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test3_R1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test3_R2.fastq.gz
cd ../
These reads or files are paired-end, meaning there are two files for every one sample. This is the typical case for most sequencing projects.
Step 2. Download the reference files
# the reference sequence
wget https://raw.githubusercontent.com/erinyoung/cecret_test_data/refs/heads/main/data/vampirism/reference.fasta
# the primer bedfile
wget https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/primer.bed
The reference file
Cecret
only allows the use of one reference file. This is a fasta file for the entire genome that you are hoping to align to.
This is the head of that file
>NOCTIS_REF
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
CCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTAC
GTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGG
CTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGAT
GCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCT
Please note that the header of this sequence is >NOCTIS_REF
, which will need to coincide with any bedfiles.
The primer schema
The primer bedfile looks like this
NOCTIS_REF 47 78 NOCTIS_SANGUIS_400_1_LEFT_1 1 +
NOCTIS_REF 419 447 NOCTIS_SANGUIS_400_1_RIGHT_1 1 -
NOCTIS_REF 344 366 NOCTIS_SANGUIS_400_2_LEFT_0 2 +
NOCTIS_REF 707 732 NOCTIS_SANGUIS_400_2_RIGHT_0 2 -
NOCTIS_REF 638 661 NOCTIS_SANGUIS_400_3_LEFT_1 1 +
NOCTIS_REF 1018 1047 NOCTIS_SANGUIS_400_3_RIGHT_0 1 -
NOCTIS_REF 970 995 NOCTIS_SANGUIS_400_4_LEFT_0 2 +
NOCTIS_REF 1340 1370 NOCTIS_SANGUIS_400_4_RIGHT_0 2 -
NOCTIS_REF 1292 1320 NOCTIS_SANGUIS_400_5_LEFT_0 1 +
NOCTIS_REF 1660 1692 NOCTIS_SANGUIS_400_5_RIGHT_0 1 -
NOCTIS_REF 1574 1596 NOCTIS_SANGUIS_400_6_LEFT_1 2 +
NOCTIS_REF 1945 1972 NOCTIS_SANGUIS_400_6_RIGHT_1 2 -
NOCTIS_REF 1882 1905 NOCTIS_SANGUIS_400_7_LEFT_2 1 +
NOCTIS_REF 2259 2284 NOCTIS_SANGUIS_400_7_RIGHT_2 1 -
NOCTIS_REF 2229 2252 NOCTIS_SANGUIS_400_8_LEFT_0 2 +
NOCTIS_REF 2603 2629 NOCTIS_SANGUIS_400_8_RIGHT_0 2 -
NOCTIS_REF 2533 2563 NOCTIS_SANGUIS_400_9_LEFT_0 1 +
NOCTIS_REF 2900 2933 NOCTIS_SANGUIS_400_9_RIGHT_0 1 -
NOCTIS_REF 2854 2880 NOCTIS_SANGUIS_400_10_LEFT_0 2 +
NOCTIS_REF 3233 3254 NOCTIS_SANGUIS_400_10_RIGHT_0 2 -
Note that the first column matches the name of the reference, which is NOCTIS_REF
. The second column indicates where that primer starts in the reference, and the third column is where it ends. The fourth column is the name of that primer. Some of the tools in this workflow require the LEFT
and RIGHT
text strings to be present, so I recommend adjusting any bedfile accordingly.
More information about bedfiles can be found here : ()[]
Step 4. Start the workflow
nextflow run UPHL-BioNGS/Cecret -profile docker --reads reads --reference_genome reference.fasta --primer_bed primer.bed --species other
Step 5. Look through the results
The summary file
A summary file can be found at cecret/cecret_results.csv
.
Basically it looks like this:
sample_id,sample,fasta_line,fastqc_raw_reads_1,fastqc_raw_reads_2,num_N,num_total,seqyclean_PairsKept,seqyclean_Perc_Kept,num_pos_100X,insert_size_after_trimming,bcftools_variants_identified,samtools_meandepth_after_trimming,samtools_per_1X_coverage_after_trimming,Cecret version,seqyclean,bwa,ivar,ivar consensus
test1,test1,test1,5764.0,5764.0,0,3222,5681.0,98.56,3222,138.2,6,373.412,97.6364,v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test2,test2,test2,5726.0,5726.0,76,3298,5637.0,98.4457,3222,140.3,8,372.162,99.9091,v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test3,test3,test3,2989.0,2989.0,1097,3213,2958.0,98.9629,2116,178.1,10,222.273,95.8485,v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
Although it might be easier to see a rendered version of the csv at this link.
Explanation of columns:
- sample_id : the sample id for the sample which is often parsed from 'sample'
- sample : this name of the sample
- fasta_line : header of generated consensus fasta for sample
- fastqc_raw_reads_1 : number of reads in R1 as determined by fastqc
- fastqc_raw_reads_2 : number of reads in R2 as determined by fastqc
- num_N : number of "N"s or ambiguous bases in generated consensus sequence (lower is better)
- num_total : the total number of base predictions in consensus sequence (higher is better)
- seqyclean_PairsKept : the number of pairs kept by seqyclean (higher is better)
- seqyclean_Perc_Kept : the percentage of pairs kept by seqyclean (higher is better)
- num_pos_100X : the number of positions with 100+ depth / with 100+ reads covering that base (higher is better)
- insert_size_after_trimming : should make sense for the library prep kit used
- bcftools_variants_identified : the number of differences identified via bcftools
- samtools_meandepth_after_trimming : the mean depth across the reference
- samtools_per_1X_coverage_after_trimming : percentage of bases in the reference with at least 1X depth
- Cecret version : version of Cecret run on samples
- seqyclean : version of seqylcean used in workflow
- bwa : version of bwa used in workflow
- ivar : version of ivar used in workflow
- ivar consensus : version of ivar used in workflow
The consensus fasta sequences
The consensus sequences generated via Cecret can be used for submission to GISAID, GENBANK, and other repositories.
They are located in cecret/consensus
$ ls cecret/consensus/
test1.consensus.fa test2.consensus.fa test3.consensus.fa
This tutorial produced three fasta files: test1.consensus.fa, test2.consensus.fa, and test3.consensus.fa
The inside of the files look something like this
$ head cecret/consensus/test1.consensus.fa
>test1
AAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGT
TGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGC
ACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACA
CGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGT
CTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCA
ACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCT
GGTAGCAGAACTCGAAGGCATTCAGTACGGTTGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGA
AATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGC
CGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACAC
These fasta files have unique headers, so files can be combined into a multifasta with something like
cat cecret/consensus/* > combined.fasta
There are more features to this workflow
Side note: I kept trying to find real samples to do this with, but then I kept finding datasets in nextclade, so I ended up having to create my own by butchering some SARS-CoV-2 data.