Home - UPHL-BioNGS/Cecret GitHub Wiki

Welcome to the Cecret wiki!

Consensus Extraction and Contig Reconstruction using Enriched libraries against a Template (CECRET)

---
Cecret
---
flowchart LR
fastq --> cleaning
cleaning --> A[alignment to reference]
A --> B[primer trimming according to primer schema]
B --> consensus

Introduction

This workflow is intended for amplicon-based NGS libraries and a single reference. There are options to skip primer removal, but there are no options to skip alignment to a reference.

Several references and primer schemes are supplied with this workflow; they are listed in their corresponding subspecies workflow. More can be added if the reference is small. Please submit an issue to let us know what else we should include.

The primer scheme and reference fasta file may also be supplied by the end user.
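As a rough sketch, a user-supplied reference and primer scheme could be set in a small Nextflow config file and passed to the run with Nextflow's `-c` option. The parameter names `reference_genome` and `primer_bed` and both file paths are assumptions for illustration; check `nextflow_schema.json` for the exact names and for any related settings that may also be needed.

```groovy
// custom_reference.config (hypothetical file name)
// Assumed parameter names; verify them against nextflow_schema.json.
params {
    // Path to the user's own reference fasta (placeholder path)
    reference_genome = 'my_reference.fasta'

    // Path to the matching primer scheme in bed format (placeholder path)
    primer_bed       = 'my_primers.bed'
}
```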

It is possible to use this workflow to simply annotate fastas generated from any workflow or downloaded from GISAID or NCBI. There are also options for multiple sequence alignment (MSA) and phylogenetic tree creation from the fasta files.
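A minimal config sketch for this fasta-only use case might look like the following. The `relatedness` parameter is the one named in the tool list on this page; the `fastas` parameter name and the directory path are assumptions and should be checked against `nextflow_schema.json`.

```groovy
// fastas_only.config (hypothetical file name)
params {
    // Assumed parameter: directory of consensus fastas to annotate (placeholder path)
    fastas      = 'fastas'

    // Turns on multiple sequence alignment and phylogenetic tree creation
    relatedness = true
}
```

A file like this would be supplied with Nextflow's `-c` option alongside the usual `nextflow run UPHL-BioNGS/Cecret` command.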

Cecret is also part of the staphb-toolkit.

History

Cecret was originally developed by @erinyoung at the Utah Public Health Laboratory for SARS-CoV-2 sequencing with the artic/Illumina hybrid library prep workflow for MiSeq data with protocols here and here. This Nextflow workflow, however, is flexible enough for many additional organisms and primer schemes as long as the reference genome is "small" and "good enough." In 2022, @tives82 added contributions for Monkeypox virus, including converting IDT's primer scheme to NC_063383.1 coordinates. We are grateful to everyone who has contributed to this repo.

Library preparation considerations

The library preparation method greatly impacts which bioinformatic tools are recommended for creating a consensus sequence. For example, amplicon-based library preparation methods will require primer trimming and an elevated minimum depth for base-calling. Some bait-derived library preparation methods have a PCR amplification step, and PCR duplicates will need to be removed. Supporting both has added complexity and several (admittedly confusing) options to this workflow. Please submit an issue if you run into problems.
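To illustrate how these library-preparation choices might map onto workflow settings, here is a hedged config sketch. The parameter names `minimum_depth` and `markdup` and the values shown are assumptions, not confirmed defaults; consult `nextflow_schema.json` for the real names and defaults before using them.

```groovy
// library_prep.config (hypothetical file name)
// Assumed parameter names and example values; verify against nextflow_schema.json.
params {
    // Amplicon-based libraries: require more reads per position before calling a base
    minimum_depth = 100

    // Bait-derived libraries with a PCR step: mark PCR duplicates
    markdup       = true
}
```

In practice only the setting that matches the library preparation method in use would be changed.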

Dependencies

Quickstart

The default usage of Cecret is to run on fastq files for SARS-CoV-2 sequencing.

nextflow run UPHL-BioNGS/Cecret -profile singularity --reads reads 

This workflow can be adjusted in many ways. Cecret includes 100+ parameters, which is a lot of text to read through. We've divided this wiki into sections to make that easier, but please create an issue if something is unclear.

A selection of wiki pages:

A complete list of all params with their default values can be found in [Cecret/nextflow_schema.json](https://github.com/UPHL-BioNGS/Cecret/blob/master/nextflow_schema.json).

References

Cecret is a nextflow workflow that strings together a variety of tools, and would not be possible without them.

  • aci - for depth estimation over amplicons
  • artic network - for aligning and consensus creation of nanopore reads
  • bbnorm - for normalizing reads prior to alignment
  • bwa - for aligning reads to the reference
  • fastp - for cleaning reads; optional, faster alternative to seqyclean
  • fastqc - for QC metrics
  • freyja - for multiple SARS-CoV-2 lineage classifications
  • heatcluster - for visualization of a SNP matrix
  • igv-reports - for creating igv-reports for each suspected variant
  • iqtree2 - for phylogenetic tree generation (optional, relatedness must be set to "true")
  • ivar - for calling variants and creating a consensus fasta; optional primer trimmer
  • kraken2 - for read classification
  • mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
  • minimap2 - an alternative to bwa
  • multiqc - summary of results
  • nextclade - for SARS-CoV-2 clade classification
  • pangolin - for SARS-CoV-2 lineage classification
  • pango_aliasor - to identify parent pangolin lineages
  • phytreeviz - for visualization of the phylogenetic tree
  • samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files; optional duplication marking
  • seqyclean - for cleaning reads
  • snp-dists - for relatedness determination (optional, relatedness must be set to "true")
  • vadr - for annotating fastas like NCBI