FAQ - UPHL-BioNGS/Cecret GitHub Wiki

Frequently Asked Questions (FAQ)

What do I do if I encounter an error?

TELL US ABOUT IT!!!

Open a GitHub issue
Send a message to someone from UPHL on Slack

Be sure to include the command that was used, what config file was used, and what the nextflow error was.
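The three items requested above can be bundled into one text file before opening an issue. The following is only a sketch: `collect_cecret_report` and the default output name `cecret_error_report.txt` are hypothetical and not part of Cecret. It assumes it is run from the directory where nextflow was launched, which is where nextflow writes '.nextflow.log'.

```shell
# Hypothetical helper (not part of Cecret): gather the command, the config
# file, and the tail of .nextflow.log into a single report file.
collect_cecret_report() {
  local cmd="$1"                                # the exact nextflow command that was run
  local config="$2"                             # the config file that was used (may be empty)
  local out="${3:-cecret_error_report.txt}"     # assumed output file name
  {
    echo "## Command"
    echo "$cmd"
    echo
    echo "## Config file: ${config:-none}"
    if [ -n "$config" ] && [ -f "$config" ]; then cat "$config"; fi
    echo
    echo "## Last 100 lines of .nextflow.log"
    if [ -f .nextflow.log ]; then
      tail -n 100 .nextflow.log
    else
      echo "(no .nextflow.log found in $(pwd))"
    fi
  } > "$out"
  echo "$out"                                   # print the name of the report file
}
```

For example, `collect_cecret_report "nextflow run UPHL-BioNGS/Cecret -profile docker -c my.config" my.config` writes the report and prints its file name, which can then be attached to the issue.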

What is the MultiQC report?

The MultiQC report aggregates data across your samples into one file. Open the 'cecret/multiqc/multiqc_report.html' file with your preferred browser. There, tables and graphs are generated for 'General Statistics', 'Samtools stats', 'Samtools flagstats', 'FastQC', 'iVar', 'SeqyClean', 'Fastp', 'Pangolin', and 'Kraken2'.

Example fastqc graph

Example kraken2 graph

Example pangolin graph

What if I want to test the workflow?

In the history of this repository, there actually was an attempt to store fastq files here that the End User could use to test out this workflow. This made the repository very large and difficult to download.

There are several test profiles. These download fastq files from the ENA to use in the workflow. This does not always work due to local internet connectivity issues, but may work fine for everyone else.

nextflow run UPHL-BioNGS/Cecret -profile {docker or singularity},test

Another great resource is SARS-CoV-2 datasets, an effort of the CDC to provide a benchmark dataset for validating bioinformatic workflows. Fastq files from the nonvoivoc, voivoc, and failed projects were downloaded from the SRA, put through this workflow, and tested locally before releasing a new version.

The expected amount of time to run this workflow with 250 G RAM and 48 CPUs, 'params.maxcpus = 8', and 'params.medcpus = 4' is ~42 minutes. This corresponded with 25.8 CPU hours.

What if I just want to annotate some SARS-CoV-2 fastas with pangolin, freyja, nextclade and vadr?

# for a collection of fastas
nextflow run UPHL-BioNGS/Cecret -profile singularity --fastas <directory with fastas>

# for a collection of fastas and multifastas
nextflow run UPHL-BioNGS/Cecret -profile singularity --fastas <directory with fastas> --multifastas <directory with multifastas>

How do I compare a bunch of sequences? How do I create a phylogenetic tree?

The End User can also run mafft, snpdists, and iqtree on a collection of fastas with:

nextflow run UPHL-BioNGS/Cecret -profile singularity --relatedness true --fastas <directory with fastas> --multifastas <directory with multifastas>

The End User can combine paired-end reads, single-end reads, and fastas into one analysis.

nextflow run UPHL-BioNGS/Cecret -profile singularity --relatedness true --fastas <directory with fastas> --multifastas <directory with multifastas> --reads <directory with paired-end reads> --single_reads <directory with single-end reads>

Is there a way to determine if certain amplicons are failing?

There are two ways to do this.

With ACI :

ACI is disabled by default because of how long it takes to run. It can be enabled by setting params.aci to true.

nextflow run UPHL-BioNGS/Cecret <everything else that would normally be run> --aci

cecret/aci has two files: amplicon_depth.csv and amplicon_depth.png. There is a row for each sample in 'amplicon_depth.csv', and a column for each primer in the amplicon bedfile. The values include only reads that map to the region specified in the amplicon bedfile and exclude reads that do not. A boxplot of these values is shown in amplicon_depth.png.

With samtools ampliconstats :

cecret/samtools_ampliconstats has a file for each sample.

Row number 126 (FDEPTH) has a column for each amplicon (also without a header). To get this row for all of the samples, grep the keyword "FDEPTH" from each sample.

grep "^FDEPTH" cecret/samtools_ampliconstats/* > samtools_ampliconstats_all.tsv
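Once the FDEPTH rows are collected, failing amplicons can be listed with a small awk filter. This is a sketch under stated assumptions: that each FDEPTH row is tab-delimited with the sample name in the second column and one depth per amplicon after it, and that columns follow the order of the amplicon bedfile. The function name `low_depth_amplicons` and the default cutoff of 20 are hypothetical choices, not Cecret settings.

```shell
# Hypothetical helper: print sample, amplicon position, and depth for every
# amplicon whose FDEPTH value is below a cutoff.
low_depth_amplicons() {
  local tsv="$1"               # file produced by the grep command above
  local min_depth="${2:-20}"   # assumed cutoff; choose what fits your assay
  awk -F'\t' -v min="$min_depth" '
    # grep prefixes each row with "filename:" when several files are searched,
    # so match FDEPTH at the end of the first field rather than the whole field
    $1 ~ /FDEPTH$/ {
      for (i = 3; i <= NF; i++)
        if ($i + 0 < min)
          printf "%s\tamplicon_%d\t%s\n", $2, i - 2, $i
    }' "$tsv"
}
```

Running `low_depth_amplicons samtools_ampliconstats_all.tsv 20` prints one line per sample and amplicon that falls below the cutoff, where `amplicon_N` is the Nth depth column.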

There are corresponding images in cecret/samtools_plot_ampliconstats for each sample.

Sample samtools plot ampliconstats depth graph

What is the difference between `params.amplicon_bed` and `params.primer_bed`?

The primer bedfile is the file with the start and stop of each primer sequence.

$ head configs/artic_V3_nCoV-2019.primer.bed
MN908947.3	30	54	nCoV-2019_1_LEFT	nCoV-2019_1	+
MN908947.3	385	410	nCoV-2019_1_RIGHT	nCoV-2019_1	-
MN908947.3	320	342	nCoV-2019_2_LEFT	nCoV-2019_2	+
MN908947.3	704	726	nCoV-2019_2_RIGHT	nCoV-2019_2	-
MN908947.3	642	664	nCoV-2019_3_LEFT	nCoV-2019_1	+
MN908947.3	1004	1028	nCoV-2019_3_RIGHT	nCoV-2019_1	-
MN908947.3	943	965	nCoV-2019_4_LEFT	nCoV-2019_2	+
MN908947.3	1312	1337	nCoV-2019_4_RIGHT	nCoV-2019_2	-
MN908947.3	1242	1264	nCoV-2019_5_LEFT	nCoV-2019_1	+
MN908947.3	1623	1651	nCoV-2019_5_RIGHT	nCoV-2019_1	-

The amplicon bedfile is the file with the start and stop of each intended amplicon.

$ head configs/artic_V3_nCoV-2019.insert.bed
MN908947.3	54	385	1	1	+
MN908947.3	342	704	2	2	+
MN908947.3	664	1004	3	1	+
MN908947.3	965	1312	4	2	+
MN908947.3	1264	1623	5	1	+
MN908947.3	1595	1942	6	2	+
MN908947.3	1897	2242	7	1	+
MN908947.3	2205	2568	8	2	+
MN908947.3	2529	2880	9	1	+
MN908947.3	2850	3183	10	2	+

Due to the many varieties of primer bedfiles, it is best if the End User supplies this file for custom primer sequences.
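In the two ARTIC V3 files shown above, each amplicon row starts where the LEFT primer ends and stops where the RIGHT primer begins (for amplicon 1: 54 and 385). A custom amplicon bedfile can be sketched from a primer bedfile with that rule. This is not how Cecret builds its bundled files. It assumes primer names of the form PREFIX_N_LEFT and PREFIX_N_RIGHT, a pool name ending in _POOLNUMBER, and no alternate primers. `primer_bed_to_insert_bed` is a hypothetical name.

```shell
# Hypothetical helper: derive amplicon (insert) coordinates from a primer bedfile.
primer_bed_to_insert_bed() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    n = split($4, name, "_")            # nCoV-2019_1_LEFT -> name[n-1] = 1, name[n] = LEFT
    amp = name[n - 1]; side = name[n]
    m = split($5, pool, "_")            # nCoV-2019_1 -> pool[m] = 1
    if (!(amp in seen)) { order[++count] = amp; seen[amp] = 1 }
    chrom[amp] = $1; poolnum[amp] = pool[m]
    if (side == "LEFT")  start[amp] = $3   # insert starts where the LEFT primer ends
    if (side == "RIGHT") stop[amp]  = $2   # insert stops where the RIGHT primer starts
  }
  END {
    # print amplicons in the order they first appeared, skipping incomplete pairs
    for (i = 1; i <= count; i++) {
      amp = order[i]
      if ((amp in start) && (amp in stop))
        print chrom[amp], start[amp], stop[amp], amp, poolnum[amp], "+"
    }
  }' "$1"
}
```

Applied to the primer rows shown above, `primer_bed_to_insert_bed configs/artic_V3_nCoV-2019.primer.bed` reproduces the first rows of the insert bedfile shown above. Schemes with alternate primers or different naming would need the name parsing adjusted.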

What if I am using an amplicon-based library that is not SARS-CoV-2?

First of all, this is a great thing! Let us know if tools specific for your organism should be added to this workflow. There are already options for 'mpx' and 'other' species.

In a config file, change the following relevant parameters:

params.reference_genome
params.primer_bed
params.species = 'other'
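As a minimal sketch, the three parameters above can be written to a config file from the shell. The file names `my_organism.config`, `my_reference.fasta`, and `my_primers.bed` are placeholders for illustration only.

```shell
# Write a small config file; replace the placeholder paths with your own files.
cat > my_organism.config <<'EOF'
params.reference_genome = 'my_reference.fasta'
params.primer_bed       = 'my_primers.bed'
params.species          = 'other'
EOF

# The config file would then be passed to nextflow with -c, for example:
# nextflow run UPHL-BioNGS/Cecret -profile singularity -c my_organism.config
```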

How do I fix the "paired reads have different names" error?

This error is from bwa. It looks something like this.

ERROR ~ Error executing process > 'CECRET:CONSENSUS:BWA (ERR5743893)'

Caused by:
  Process `CECRET:CONSENSUS:BWA (ERR5743893)` terminated with an error exit status (1)


Command executed:

  mkdir -p bwa logs/CECRET:CONSENSUS:BWA
  log=logs/CECRET:CONSENSUS:BWA/ERR5743893.101b24c2-b3f5-4bae-bb0e-fe153c0aa747.log
  
  # time stamp + capturing tool versions
  date > $log
  echo "bwa $(bwa 2>&1 | grep Version )" >> $log
  bwa_version=$(bwa 2>&1 | grep Version | awk '{print $NF}')
  
  # index the reference fasta file
  bwa index MN908947.3.fasta
  
  # bwa mem command
  bwa mem        -t 12       MN908947.3.fasta       ERR5743893_clean_PE1.fastq.gz ERR5743893_clean_PE2.fastq.gz       > bwa/ERR5743893.sam
  
  cat <<-END_VERSIONS > versions.yml
  "CECRET:CONSENSUS:BWA":
    bwa: $(bwa 2>&1 | grep Version | awk '{print $NF}')
    container: staphb/bwa:0.7.19
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [bwa_index] Pack FASTA... 0.00 sec
  [bwa_index] Construct BWT for the packed sequence...
  [bwa_index] 0.01 seconds elapse.
  [bwa_index] Update BWT... 0.00 sec
  [bwa_index] Pack forward-only FASTA... 0.00 sec
  [bwa_index] Construct SA from BWT and Occ... 0.00 sec
  [main] Version: 0.7.19-r1273
  [main] CMD: bwa index MN908947.3.fasta
  [main] Real time: 0.018 sec; CPU: 0.011 sec
  [M::bwa_idx_load_from_disk] read 0 ALT contigs
  [M::process] read 333450 sequences (65032307 bp)...
  [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 161952, 81, 9)
  [M::mem_pestat] skip orientation FF as there are not enough pairs
  [M::mem_pestat] analyzing insert size distribution for orientation FR...
  [M::mem_pestat] (25, 50, 75) percentile: (381, 387, 393)
  [M::mem_pestat] low and high boundaries for computing mean and std.dev: (357, 417)
  [M::mem_pestat] mean and std.dev: (389.66, 10.23)
  [M::mem_pestat] low and high boundaries for proper pairs: (345, 431)
  [M::mem_pestat] analyzing insert size distribution for orientation RF...
  [M::mem_pestat] (25, 50, 75) percentile: (3275, 7351, 8711)
  [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 19583)
  [M::mem_pestat] mean and std.dev: (5738.27, 3096.21)
  [M::mem_pestat] low and high boundaries for proper pairs: (1, 25019)
  [M::mem_pestat] skip orientation RR as there are not enough pairs
  [M::mem_pestat] skip orientation RF
  [mem_sam_pe] paired reads have different names: "ERR5743893.198327", "ERR5743893.198566"

Work dir:
  /Volumes/NGS_2/Bioinformatics/eriny/testing_cecret/2025-06-24/work/75/d0edc6e410e2b2d18ab5dc00f2a12e

Container:
  staphb/bwa:0.7.19

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

There is not an option to fix this in the workflow, but reads can be pre-processed with something like BBTools' repair.sh.

repair.sh in1=broken1.fq in2=broken2.fq out1=fixed1.fq out2=fixed2.fq outs=singletons.fq repair

Then the fixed files can be used as input.

This is generally due to how the fastq files were created. We have found it to be an uncommon occurrence, but if this is something normal, please submit an issue and we'll add this step by default.
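To check whether a pair of fastq files has this problem before or after repair, the read names in both files can be compared record by record. The sketch below is an assumption-laden check, not part of Cecret: `read_names` and `check_pairs` are hypothetical names, four-line fastq records are assumed, and a trailing /1 or /2 on the read name is ignored.

```shell
# Hypothetical helpers: compare read names between R1 and R2 fastq files.
read_names() {
  # print the first word of every header line, without "@" and without /1 or /2;
  # gzip -cdf decompresses gzipped input and passes plain text through unchanged
  gzip -cdf "$1" | awk 'NR % 4 == 1 { name = substr($1, 2); sub(/\/[12]$/, "", name); print name }'
}

check_pairs() {
  local tmp status=0
  tmp=$(mktemp -d)
  read_names "$1" > "$tmp/r1.names"
  read_names "$2" > "$tmp/r2.names"
  if cmp -s "$tmp/r1.names" "$tmp/r2.names"; then
    echo "OK: read names match"
  else
    # report the first record where the two files disagree
    paste "$tmp/r1.names" "$tmp/r2.names" |
      awk -F'\t' '$1 != $2 { printf "first mismatch at record %d: \"%s\" vs \"%s\"\n", NR, $1, $2; exit }'
    status=1
  fi
  rm -rf "$tmp"
  return "$status"
}
```

For example, `check_pairs fixed1.fq fixed2.fq` returns a non-zero exit status and prints the first mismatching pair of names if the files are still out of sync.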

Alternatively, switching to using minimap2 instead of bwa will bypass this.

nextflow run UPHL-BioNGS/Cecret <normal flags and params> --aligner minimap2

What if I need to filter out human reads or I only want reads that map to my reference?

Although not perfect, if 'params.filter = true', then only the reads that were mapped to the reference are returned. This should eliminate all human contamination (as long as human is not part of the supplied reference) and all "problematic" incidental findings.

This workflow has too many bells and whistles. I really only care about generating a consensus fasta. How do I get rid of all the extras?

Change the parameters in a config file and set most of them to false.

params.species = 'none'
params.fastqc = false
params.ivar_variants = false
params.samtools_stats = false
params.samtools_coverage = false
params.samtools_depth = false
params.samtools_flagstat = false
params.samtools_ampliconstats = false
params.samtools_plot_ampliconstats = false
params.aci = false
params.pangolin = false
params.freyja = false
params.nextclade = false
params.vadr = false
params.multiqc = false

And, yes, this means I added some bells and whistles so the End User could turn off the bells and whistles. /irony

Can I get images of my SNPs and indels?

Yes! Set igv-reports to true.

Where did the SAM files go?

Never fear, they are still in nextflow's work directory if the End User really needs them. They are no longer included in publishDir because of size issues. The BAM files are still included in publishDir, and most analyses for SAM files can be done with BAM files.
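If the SAM files are needed, they can be located in the work directory with find. This sketch assumes the default work directory name 'work' and that the work directory has not been cleaned. `find_sam_files` is a hypothetical helper name.

```shell
# Hypothetical helper: list SAM files left in nextflow's work directory.
find_sam_files() {
  find "${1:-work}" -type f -name '*.sam'
}

# A published BAM file can also be converted back to SAM with samtools, e.g.:
# samtools view -h -o sample.sam sample.bam
```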

Where did the *err files go?

Personally, we liked having stderr saved to a file because some of the tools used in this workflow print to stderr instead of stdout. We have found, however, that this puts all the error text into a file, which a lot of new-to-nextflow users had a hard time finding. It was easier to assist and troubleshoot with End Users when stderr was printed normally.

What is in the works to get added to 'Cecret'?

Currently Cecret is not under active development. It is still maintained, but no new features are currently in the works.