Home - UPHL-BioNGS/Cecret GitHub Wiki

Welcome to the Cecret wiki!

Consensus Extraction and Contig Reconstruction using Enriched libraries against a Template (CECRET)

---
Cecret
---
flowchart LR
fastq --> cleaning
cleaning --> A[alignment to reference]
A --> B[primer trimming according to primer schema]
B --> consensus

Introduction

This workflow is intended for amplicon-based NGS libraries and a single reference. There are options to skip primer removal, but there are no options to skip alignment to a reference.

Several references and primer schemes are supplied with this workflow and are listed in their corresponding subspecies workflow. More can be added if the reference is small. Please submit an issue to let us know what else we should include.

The primer scheme and reference fasta file may also be supplied by the end user.

It is possible to use this workflow to simply annotate fastas generated from any workflow or downloaded from GISAID or NCBI. There are also options for multiple sequence alignment (MSA) and phylogenetic tree creation from the fasta files.
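The tool list on this page notes that mafft, iqtree2, and snp-dists only run when relatedness is set to "true". A minimal sketch of switching that on from the command line follows. It assumes the parameter is named `relatedness` and is passed as `--relatedness true`; `with_relatedness` is a hypothetical helper name and is not part of Cecret. The helper only prints the command so it can be checked before running.

```shell
# Hypothetical helper: append the relatedness option to a base command.
# Assumption: the parameter is named `relatedness`, as the tool list implies.
with_relatedness() {
  # "$*" joins the base command words with single spaces.
  printf '%s --relatedness true\n' "$*"
}

# Print the resulting command instead of running it.
with_relatedness nextflow run UPHL-BioNGS/Cecret -profile singularity --reads reads
```

Check nextflow_schema.json for the exact parameter name and default before relying on this.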

Cecret is also part of the staphb-toolkit.

History

Cecret was originally developed by @erinyoung at the Utah Public Health Laboratory for SARS-CoV-2 sequencing with the artic/Illumina hybrid library prep workflow for MiSeq data with protocols here and here. This nextflow workflow, however, is flexible for many additional organisms and primer schemes as long as the reference genome is "small" and "good enough." In 2022, @tives82 added contributions for Monkeypox virus, including converting IDT's primer scheme to NC_063383.1 coordinates. We are grateful to everyone who has contributed to this repo.

Library preparation considerations

The library preparation method greatly impacts which bioinformatic tools are recommended for creating a consensus sequence. For example, amplicon-based library preparation methods will require primer trimming and an elevated minimum depth for base-calling. Some bait-derived library preparation methods have a PCR amplification step, and PCR duplicates will need to be removed. This has added complexity and several (admittedly confusing) options to this workflow. Please submit an issue if/when you run into problems.

Dependencies

Nextflow
Singularity or Docker - set the profile as singularity or docker during runtime

Quickstart

The default usage of Cecret is to run on fastq files for SARS-CoV-2 sequencing.

nextflow run UPHL-BioNGS/Cecret -profile singularity --reads reads
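The same command works with either container profile listed under Dependencies. The sketch below builds the quickstart command for a chosen profile and prints it rather than running it. `build_cecret_cmd` is a hypothetical helper name, not part of Cecret, and the sketch assumes the reads directory name contains no spaces.

```shell
# Hypothetical helper: build the quickstart command for a container profile.
# $1 = profile ("singularity" or "docker"), $2 = directory of fastq files.
build_cecret_cmd() {
  case "$1" in
    singularity|docker) ;;
    *) echo "unsupported profile: $1" >&2; return 1 ;;
  esac
  printf 'nextflow run UPHL-BioNGS/Cecret -profile %s --reads %s\n' "$1" "$2"
}

# Print the command for the docker profile so it can be inspected first.
build_cecret_cmd docker reads
```

Once the printed command looks right, it can be copied into the terminal and run directly.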

There are many ways to adjust this workflow. Cecret includes 100+ parameters, which is a lot of text to read through. We've divided this wiki into sections, but please create an issue if something is unclear.

A selection of wiki pages:

A complete list of all params with their default values can be found in [Cecret/nextflow_schema.json](https://github.com/UPHL-BioNGS/Cecret/blob/master/nextflow_schema.json).

References

Cecret is a nextflow workflow that strings together a variety of tools, and would not be possible without them.

aci - for depth estimation over amplicons
artic network - for aligning and consensus creation of nanopore reads
bbnorm - for normalizing reads prior to alignment
bwa - for aligning reads to the reference
fastp - for cleaning reads; optional, faster alternative to seqyclean
fastqc - for QC metrics
freyja - for multiple SARS-CoV-2 lineage classifications
heatcluster - for visualization of a SNP matrix
igv-reports - for creating igv-reports for each suspected variant
iqtree2 - for phylogenetic tree generation (optional, relatedness must be set to "true")
ivar - for calling variants and creating a consensus fasta; optional primer trimmer
kraken2 - for read classification
mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
minimap2 - an alternative to bwa
multiqc - summary of results
nextclade - for SARS-CoV-2 clade classification
pangolin - for SARS-CoV-2 lineage classification
pango_aliasor - to identify parent pangolin lineages
phytreeviz - for visualization of the phylogenetic tree
samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files; optional duplication marking
seqyclean - for cleaning reads
snp-dists - for relatedness determination (optional, relatedness must be set to "true")
vadr - for annotating fastas like NCBI