Tutorial_new_virus - UPHL-BioNGS/Cecret GitHub Wiki

New virus tutorial

Cecret is meant to be flexible. In early 2020, there needed to be a workflow where a primer bedfile existed and a reference existed, but everything else was still lagging behind.

For the purpose of this tutorial, assume that vampirism has suddenly emerged. A reference genome and primer scheme exist for the causative virus, but no additional containerized tools for lineage determination or other analyses are available. The following steps outline the process for analyzing viral samples.

Steps in this tutorial:

Step 1. Downloading some sample fastq files

This step is going to create a directory called reads, enter that directory, and then download 6 test files for 3 test samples.

mkdir reads
cd reads
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test1_R1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test1_R2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test2_R1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test2_R2.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test3_R1.fastq.gz
wget -q https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/test3_R2.fastq.gz
cd ../

These reads or files are paired-end, meaning there are two files for every one sample. This is the typical case for most sequencing projects.

Step 2. Download the reference files

# the reference sequence
wget https://raw.githubusercontent.com/erinyoung/cecret_test_data/refs/heads/main/data/vampirism/reference.fasta
# the primer bedfile
wget https://github.com/erinyoung/cecret_test_data/raw/refs/heads/main/data/vampirism/primer.bed

The reference file

Cecret only allows the use of one reference file. This is a fasta file for the entire genome that you are hoping to align to.

This is the head of that file

>NOCTIS_REF
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
TTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTC
CCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTAC
GTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGG
CTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGAT
GCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCT

Please note that the header of this sequence is >NOCTIS_REF, which will need to coincide with any bedfiles.

The primer schema

The primer bedfile looks like this

NOCTIS_REF	47	78	NOCTIS_SANGUIS_400_1_LEFT_1	1	+
NOCTIS_REF	419	447	NOCTIS_SANGUIS_400_1_RIGHT_1	1	-
NOCTIS_REF	344	366	NOCTIS_SANGUIS_400_2_LEFT_0	2	+
NOCTIS_REF	707	732	NOCTIS_SANGUIS_400_2_RIGHT_0	2	-
NOCTIS_REF	638	661	NOCTIS_SANGUIS_400_3_LEFT_1	1	+
NOCTIS_REF	1018	1047	NOCTIS_SANGUIS_400_3_RIGHT_0	1	-
NOCTIS_REF	970	995	NOCTIS_SANGUIS_400_4_LEFT_0	2	+
NOCTIS_REF	1340	1370	NOCTIS_SANGUIS_400_4_RIGHT_0	2	-
NOCTIS_REF	1292	1320	NOCTIS_SANGUIS_400_5_LEFT_0	1	+
NOCTIS_REF	1660	1692	NOCTIS_SANGUIS_400_5_RIGHT_0	1	-
NOCTIS_REF	1574	1596	NOCTIS_SANGUIS_400_6_LEFT_1	2	+
NOCTIS_REF	1945	1972	NOCTIS_SANGUIS_400_6_RIGHT_1	2	-
NOCTIS_REF	1882	1905	NOCTIS_SANGUIS_400_7_LEFT_2	1	+
NOCTIS_REF	2259	2284	NOCTIS_SANGUIS_400_7_RIGHT_2	1	-
NOCTIS_REF	2229	2252	NOCTIS_SANGUIS_400_8_LEFT_0	2	+
NOCTIS_REF	2603	2629	NOCTIS_SANGUIS_400_8_RIGHT_0	2	-
NOCTIS_REF	2533	2563	NOCTIS_SANGUIS_400_9_LEFT_0	1	+
NOCTIS_REF	2900	2933	NOCTIS_SANGUIS_400_9_RIGHT_0	1	-
NOCTIS_REF	2854	2880	NOCTIS_SANGUIS_400_10_LEFT_0	2	+
NOCTIS_REF	3233	3254	NOCTIS_SANGUIS_400_10_RIGHT_0	2	-

Note that the first column matches the name of the reference, which is NOCTIS_REF. The second column indicates where that primer starts in the reference, and the third column is where it ends. The fourth column is the name of that primer. Some of the tools in this workflow require the LEFT and RIGHT text strings to be present, so I recommend adjusting any bedfile accordingly.

More information about bedfiles can be found here : ()[]

Step 4. Start the workflow

nextflow run UPHL-BioNGS/Cecret -profile docker --reads reads --reference_genome reference.fasta --primer_bed primer.bed --species other

Step 5. Look through the results

The summary file

A summary file can be found at cecret/cecret_results.csv.

Basically it looks like this:

sample_id,sample,fasta_line,fastqc_raw_reads_1,fastqc_raw_reads_2,num_N,num_total,seqyclean_PairsKept,seqyclean_Perc_Kept,num_pos_100X,insert_size_after_trimming,bcftools_variants_identified,samtools_meandepth_after_trimming,samtools_per_1X_coverage_after_trimming,Cecret version,seqyclean,bwa,ivar,ivar consensus
test1,test1,test1,5764.0,5764.0,0,3222,5681.0,98.56,3222,138.2,6,373.412,97.6364,v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test2,test2,test2,5726.0,5726.0,76,3298,5637.0,98.4457,3222,140.3,8,372.162,99.9091,v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3
test3,test3,test3,2989.0,2989.0,1097,3213,2958.0,98.9629,2116,178.1,10,222.273,95.8485,v3.26.25063,1.10.09_(2018-10-16),0.7.18-r1243-dirty,1.4.3,1.4.3

Although it might be easier to see a rendered version of the csv at this link.

Explanation of columns:

  • sample_id : the sample id for the sample which is often parsed from 'sample'
  • sample : this name of the sample
  • fasta_line : header of generated consensus fasta for sample
  • fastqc_raw_reads_1 : number of reads in R1 as determined by fastqc
  • fastqc_raw_reads_2 : number of reads in R2 as determined by fastqc
  • num_N : number of "N"s or ambiguous bases in generated consensus sequence (lower is better)
  • num_total : the total number of base predictions in consensus sequence (higher is better)
  • seqyclean_PairsKept : the number of pairs kept by seqyclean (higher is better)
  • seqyclean_Perc_Kept : the percentage of pairs kept by seqyclean (higher is better)
  • num_pos_100X : the number of positions with 100+ depth / with 100+ reads covering that base (higher is better)
  • insert_size_after_trimming : should make sense for the library prep kit used
  • bcftools_variants_identified : the number of differences identified via bcftools
  • samtools_meandepth_after_trimming : the mean depth across the reference
  • samtools_per_1X_coverage_after_trimming : percentage of bases in the reference with at least 1X depth
  • Cecret version : version of Cecret run on samples
  • seqyclean : version of seqylcean used in workflow
  • bwa : version of bwa used in workflow
  • ivar : version of ivar used in workflow
  • ivar consensus : version of ivar used in workflow

The consensus fasta sequences

The consensus sequences generated via Cecret can be used for submission to GISAID, GENBANK, and other repositories.

They are located in cecret/consensus

$ ls cecret/consensus/
test1.consensus.fa  test2.consensus.fa  test3.consensus.fa

This tutorial produced three fasta files: test1.consensus.fa, test2.consensus.fa, and test3.consensus.fa

The inside of the files look something like this

$ head cecret/consensus/test1.consensus.fa 
>test1
AAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGT
TGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGC
ACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACA
CGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGT
CTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCA
ACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCT
GGTAGCAGAACTCGAAGGCATTCAGTACGGTTGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGA
AATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGC
CGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACAC

These fasta files have unique headers, so files can be combined into a multifasta with something like

cat cecret/consensus/* > combined.fasta

There are more features to this workflow

Side note: I kept trying to find real samples to do this with, but then I kept finding datasets in nextclade, so I ended up having to create my own by butchering some SARS-CoV-2 data.