3. Results - DKFZ-ODCF/AlignmentAndQCWorkflows GitHub Wiki

Output is written to the output directory with the following general structure:

Directory File types
./ _commonBam_wroteQcSummary.txt, _qualitycontrol.json
./fastqc/ if if runFastQC=true: .fastqc.zip, fastq_qcpass_status.txt
./alignment/ .bam, .bai, .md5, .dupmark_metrics.txt
./coverage/ .DepthOfCoverage_Genome.txt, _readCoverage_1kb_windows.txt, _readCoverage_1kb_windows_coveragePlot_${chr}.png
./fingerprinting/ if runFingerprinting=true: .fp
./flagstats/ _flagstats.txt, _extendedFlagstats.txt
./insertsize_distribution/ _insertsize_plot.png, _insertsize_plot.png_qcValues.txt, _insertsizes.txt
./structural_variation/ _DiffChroms.txt, _DiffChroms.png, _DiffChroms.png_qcValues.txt
./correctedGcBias/ ACEseq QC files if runACEseqQc=true: _windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png
./roddyExecutionStore/ Execution metadata, logs, configurations. Please refer to the Roddy documentation for details
./methylationcalling/ Only produced for the WGBS workflow

File contents

Many of the following files are produced by the workflow as one for each lane, optionally for each library -- e.g. with tagmentation WGBS data --, and for the final merged BAM. Some files are also per chromosome.

fastqc.zip, fastq_qcpass_status.txt

FastQC output file and a QC-status derived from the FastQC output by fastqcClassify. The status categories are

Category Meaning Formula
PASS high quality reads mean(q1) > 28
WARN reads quality compromised (20 < mean(q1) < = 28) or (PASS and one median below 20)
FAIL low quality reads mean(q1) < = 20

The Phred score is related to the probability of producing a sequencing error, and each sequenced nucleotide is associated to such a value. As a reference, a Phred score of 20 is a probability of 0.01 of having an error and a score of 30 a probability of 0.001. The reported value corresponds to the average of the first quantile of the Phred score in each position. A value above 28 is OK and below or equal to 20 is of bad quality. Anything in between is dubious. Also notice, in case there is one single position where the median score goes below 20 the sample will be flagged as dubious in case it was considered OK.

.bam, .bai, .md5

Position-sorted BAM file with associated BAI index and MD5 sum of the BAM. The workflow does not remove duplicates but only marks them as duplicates. Therefore, unless you do adapter trimming, the BAM contains all reads also contained in the input FASTQs and in principle the full set of reads is recovered in the files, except for the FASTQ comments, which are dropped. See the BamToFastqPlugin for an performant workflow to reconstitute (almost) original FASTQs (without FASTQ comments and of course not in the original read order).

.dupmark_metrics.txt

Standard output of the duplication marking program. Please refer to the documentation of the biobamba, picard or sambamba tool.

.DepthOfCoverage_Genome.txt

A TSV produced by coverageQc with the following header. An important number is that of the "QC bases":

QC Bases

Bases are filtered for quality based on the following per-base and per-read criteria:

  1. Read must be mapped (obvious)
  2. Read must not be marked as duplicate. Note that the duplication marker leaves one of the duplicates unmarked.
  3. Mapping quality > 0 (pretty strong filtering criterion with BWA)
  4. Count only alignment match (M), insert to the reference (I), sequence match (=), and sequence mismatch (X) CIGAR entries. Consequenctly, the following read-bases are not considered based on the CIGAR string: soft-clipped (S), skipped in reference (N), deleted in reference (D), hard-clipped (H), padded (P).
  5. Length of remaining bases >= min length 36 bp. If this is false, the whole read won't be counted!
  6. Average quality score of remaining bases >= 0 (Phred score) (Note: For WES the default cutoff is 25). If this false, the whole read won't be counted!

Note that overall, only reads on chromosomes listed in the CHROM_SIZES_FILE are considered by the workflow.

The "QC bases" are used in many of the statistics reported by coverageQc-tool.

Overview

Some of the following statistical measures are parametrized by one of the following parameters:

  • base quality cutoff: Defined in the variable BASE_QUALITY_CUTOFF. For WGS and WGBS data the current default is 0 (defined in the COWorkflowsBasePlugin). For WES this default is overwritten to be 25 (see analysisExome.xml).
  • minimum read length: Only the default from coverageQc is used, which is 36.

Furthermore, note that in all these values, except the duplicate-count, reads marked as PCR duplicates are ignored and not counted. Also usually bases that are softclipped (unaligning read-ends) are not counted.

Full details (and the only true reference) can be found in coverageQC.d function addRead().

Column Description Format
interval Contig or chromosome name. "all" is the overall value across all considered contigs.
coverage QC bases Coverage based on QC bases and total bases (including all non-{A,T,C,G}-bases). \d+.\d+x
#QC bases/#total bases Number of QC bases "/" number of total bases (including all non-{A,T,C,G}-bases). \d+/\d+
mapq=0 read{1,2} Number of reads with BWA mapping quality 0 for read 1 and 2. \d+
mapq>0,readlength<minlength read{1,2} Number of reads with mapping quality >0 but shorter than the minimum read length \d+
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} Number of reads with mapping quality >0 and base-quality mean < base-quality cutoff. Note this is the mean base quality, not the median! It is calculated using non-softclipped bases only. \d+
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} Number of reads with mapping quality >0 and base-quality mean >= base-quality cutoff. Note this is the mean, not the median! It is calculated using non-softclipped bases only. \d+
%incorrect PE orientation Number of read pairs with unexpected/incorrect alignment orientation. \d+
#incorrect proper pair Non-duplicate read pairs marked as properly paired and both aligned in reverse orientation. \d+
#duplicates read{1,2} Reads marked as PCR duplicates by e.g. sambamba. \d+
genome_w/o_N coverage QC bases Coverage based on QC bases and {A,T,C,G}-bases. \d+
#QC bases/#total not_N bases Number QC bases "/" number of {A,T,C,G} bases (i.e. excluding 'N's). \d+/\d+

_readCoverage_1kb_windows.txt, _readCoverage_1kb_windows_coveragePlot_${chr}.png.

Three-column TSV produced by genomeCoverage and a streamed through filter_readbins.pl, such that only chromosomes of interest are kept. The columns are the chromosome, the 0-based start index of the window on the chromosome, and the number of reads covering the window. Window size is 1 kb.

The genomeCoverage tool only counts reads not marked as duplicate reads and -- in "countReads" mode -- having a mapping quality of at least 1.

.fp

The file is generated by the bsnp.py script and produces base counts for reference and alternative alleles as well a betavalue-like allele frequency for in silico genotyping (fingerprinting). The columns are

  1. input file basename
  2. SNP chromosome
  3. SNP position on chromosome
  4. always '*'
  5. SNP name (dbSNP)
  6. reference allele count (c_ref); the 'ref' base in the input fingerprints file
  7. alternative allele count (c_alt); the 'alt' base in the input fingerprints file
  8. relative frequency of allele 'ref' allele [c_ref / (c_ref + c_alt)]

The input fingerprinting file should be a TSV with the following columns:

  1. chrom
  2. chromStart
  3. chromEnd
  4. name (dbSNP)
  5. avHet
  6. strand: '+' or '-'
  7. ref: reference base
  8. alt: alternative base

Note that the columns' base-assignment to the ref or alt columns may not reflecting annotations or population frequencies, but may be for instance alphabetical order.

If you want to do some downstream analysis with the fingerprints, e.g. cluster by fingerprint profiles or sample swap detection, please see the methylCtools-fingerprint repository.

_flagstats.txt

Please refer to the documentation of samtools flagstats for details.

_extendedFlagstats.txt

This file is produced by flags_isizes_PEaberrations.pl and produces the following values

  1. total alignments, i.e. all reads on the chromosomes of interest (the "real" chromosomes, usually chr1-22+X+Y).
  2. (such) non-duplicate, non-secondary, non-supplementary reads
  3. such with mapping quality >=1
  4. such on regarded chromosomes
  5. such with both reads on regarded chromosomes
  6. such mapping on different chromosomes
  7. (such) proper pairs read 1

Note that the "such" always indicates that the set of reads is incrementally reduced based on the previous selection. This means these are successively applied filters, and the numbers represent always a subset of the previous filtering step.

_insertsizes.txt

The files suffixed by _insertsizes.txt are produced by flags_isizes_PEaberrations.pl. It is a TSV file with insert size (column 1) and count (column 2).

Using this as input the script insertsizePlot.R plots the _insertsize_plot.png and writes the estimated distribution parameters for convenience into the _insertsize_plot.png_qcValues.txt file in three rows:

  1. median
  2. standard deviation / median
  3. standard deviation

DiffChroms

The _DiffChroms.txt file and the _DiffChroms.png_qcValues.txt are produced by flags_isizes_PEaberrations.pl and are used for producing the , _DiffChroms.png file by chrom_diff.r. Compare extendedFlagstats.txt.

The percentage of reads mapping to different chromosomes in _DiffChroms.png_qcValues.txt is determined from the number of mapped, non-duplication-marked reads with non-supplementary, non-secondary alignments that have a mapping quality larger than the cutoff and where both reads are mapped to different chromosomes. This number is then divided by the number of paired reads from the same class (mapped, non-duplication-marked, non-supplementary, non-secondary) but with both reads mapped to the same or different chromosome.

Like in extendedFlagstats.txt the statistic again is only for the "real" chromosomes, as defined by the CHROM_SIZES_FILE.

ACEseq QC

The files in the ./correctGcBias/ directory (_windows.anno.txt, _sex.txt, _windows.filtered.txt.gz, _windows.filtered.corrected.txt.gz, _qc_gc_corrected.json, _qc_gc_corrected.slim.tsv, _gc_corrected.png) are only produced if runACEseqQC was set to true and should be used only with WGS data.

Please refer to the documentation of the ACEseq workflow for extensive information about the contents of these files.

FWHM

The full width half maximum (FWHM) parameter indicates whether there are problems with the coverage over the genome. It is based on the coverage distribution after GC-bias correction. It most likely also affects the quality of the SNV and Indel calls. Severe cases should be considered for exclusion from the study. _gcBias_corrected.png

GC-bias mean slope

Slope of curve fitted through coverage vs. GC-content points. Large values could indicate less sensitivity and specificity SNV and INDEL in some regions of the genome due to lower coverage. This may be caused by library preparation (e.g. NEB bead selection, leading to enrichment of GC-rich regions).

GC-bias absolute curvature

Absolute curvature of curve fitted through coverage vs. GC-content points. Large values could indicate less sensitivity and specificity for SNV and INDEL in some regions of the genome due to lower coverage.

_commonBam_wroteQcSummary.txt

Summary TSV file with contents collected from the other files produced by writeQCsummary.pl. The file only contains a single line for the overall (all chromosomes) statistics.

Column Description
PID patient ID; could also be cell-line, or what other information you encode here
SAMPLE_TYPE usually tumor01, blood2, etc.
RUN_ID Name of the $run/sequence/ directory containing the FASTQs.
LANE Lane identifier.
TOTAL_READ_COUNT (flagstat) from flagstat
%TOTAL_READ_MAPPED_BWA (flagstat) from flagstat
%properly_paired (flagstat) from flagstat
%singletons (flagstat) from flagstat
ALIGNED_READ_COUNT TBD
%DUPLICATES (Picard metrics file) from .dupmark_metrics.txt file
ESTIMATED_LIBRARY_SIZE (Picard metrics file) from .dupmark_metrics.txt file
%PE_reads_on_diff_chromosomes (mapq>0) from DiffChroms.txt
%sd_PE_insertsize (mapq>0) TBD
PE_insertsize (mapq>0) TBD
coverage QC bases w/o N from _DepthOfCoverage_Genome.txt
QC bases/ total bases w/o N from _DepthOfCoverage_Genome.txt
coverage QC bases from _DepthOfCoverage_Genome.txt
QC bases/ total bases from _DepthOfCoverage_Genome.txt
mapq=0 read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,readlength<minlength read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,BaseQualityMedian<basequalCutoff read{1,2} from _DepthOfCoverage_Genome.txt
mapq>0,BaseQualityMedian>=basequalCutoff read{1,2} from _DepthOfCoverage_Genome.txt
%incorrect PE orientation from _DepthOfCoverage_Genome.txt
#incorrect proper pair from _DepthOfCoverage_Genome.txt
#duplicates read{1,2} (excluded from coverage analysis from _DepthOfCoverage_Genome.txt
ChrX coverage QC bases from _DepthOfCoverage_Genome.txt
ChrY coverage QC bases from _DepthOfCoverage_Genome.txt

_qualitycontrol.json

Summary JSON file with contents collected from the other files. Produced by qcJson.pl. The JSON contains short entries per chromosome and a more extensive entry for the "all" chromosome summing all results per chromosome.

Variable Description
chromosome chromosome identifier
referenceLength length of chromosomes
qcBasesMapped See QC Bases from _DepthOfCoverage_Genome.txt.
coverageQcBases qcBasesMapped / referenceLength
genomeWithoutNReferenceLength the "#total not_N bases" in "#QC bases/#total not_N bases"; same as length not counting 'N's from CHROM_SIZES_FILE
genomeWithoutNQcBasesMapped the "#QC bases" in "#QC bases/#total not_N bases"
genomeWithoutNCoverageQcBases genomeWithoutNQcBasesMapped / genomeWithoutNReferenceLength
insertSizeMedian from _insertsize_plot.png_qcValues.txt-file; line 1.
insertSizeSD Standard deviation of the insert size distribution; from _insertsize_plot.png_qcValues.txt-file; line 3.
insertSizeCV insertSizeSD / insertSizeMedian. From from _insertsize_plot.png_qcValues.txt-file; line 2.
singletons from flagstats
withItselfAndMateMapped from flagstats
withMateMappedToDifferentChr from flagstats
properlyPaired from flagstats
pairedRead{1,2} flagstats read 1, 2
qcFailedReads flagstats qc-failed
pairedInSequencing from flagstats
totalReadCounter from flagstats total
duplicates from flagstats duplicates
totalMappedReadCounter from flagstats mapped
percentageMatesOnDifferentChr from flagstats
withMateMappedToDifferentChrMaq from flagstats mapq >= 5

Whole-Exome Sequencing (WES)

For WES data, a target region file (TARGET_REGION_FILE parameter) is provided to the workflow and reads are kept if at least one base overlap exists to at least one of these regions. The QC statistics are than calculated like before, e.g. with QC bases mapped produced by the coverageQc.d programm on a base level, but only for the pre-selected reads.

Note that the default for BASE_QUALITY_CUTOFF is 25 for the WES workflow, but 0 for the WGS workflow (as defined in the COWorkflowsBasePlugin). This default has historical reasons and you may want to change it.

The WES-QC files are suffixed by target or variants of it (unfortunately no consistent naming scheme was applied, but changing this would required downstream workflows of researchers to get changed, so this won't happen with version 1 of the plugin anymore).

WARNING: The file ${sampleName}_${pid}_targetExtract.rmdup.bam.DepthOfCoverage_Target.txt is produced by the coverageQc D program and contains the mentioned statistics on coverage for QC-bases. You may wonder, why the "all" line, which contains a summary statistic of the whole genome, may have widely different values for "genome_w/o_N coverage QC bases" and "#QC bases/#total not_N bases". The reason is that while the per-chromosome lines contain the statistic using effective chromosome size as reference size (denominators in "#QC bases/#total not_N bases"), the "all" line gives the (average) coverage using the target region size as reference. See here.

There is one additional file produced.

_TargetsWithCov.txt

The file is produced by resources/analysisTools/exomePipeline/targetcov.pl that processes the output of coverageBed -d -abam $targetRegionRestrictedBam -b $targetRegionsBed (bedtools), which reports the reports the depth at each position in each target feature. The targetcov.pl output is in TSV format with one line per target region each with the following columns:

  1. chromosome
  2. target region start; 0-based, right-open (i.e. [0, $intervalLength) coordinates like in BED format)
  3. target region end; 0-based, right-open
  4. reads: estimated number of reads covering the region. This is not an exact count, but attempts to count the read-starts in the region. However, dependent on split-reads (one read with multiple subalignments) and end-to-start arrangements of adjacent alignment positions of different reads the number may deviate from the real number of reads.
  5. covered bases: number of positions with non-zero coverage
  6. length: the length of the interval
  7. ratio: proportion of covered bases (covered bases / length)
  8. average: average coverage

Whole-Genome Bisulphite Sequencing (WGBS)

The method uses methylCtools and was used for Hovestadt et al. (2014).

methylation/merged/methylationCalling/*.bed

These files are produced by the bcall_2012Nov27.py or bcall_tagmentation_2012Nov25.py scripts (the version of methylCtools is the date tag in the scripts), for non-tagmentation or tagmentation-calling, respectively. The columns of these BED files are:

  • chromosome name
  • coordinate
  • strand
  • context/genomic sequence
  • snp rate
  • unconverted cytosines (=methylated Cs)
  • converted cytosines (=non-methylated Cs)

The "snp rate" column here refers to the rate of SNPs, because SNPs interfere with the bisulfite calling. The bcall scripts give slightly more information

methylCtools examines the available sequencing information to evaluate if a cytosine position differs from the reference (argument -x). for example, if in a sample the true genotype of a reference CpG would be TpG, it would likely appear as an unmethylated CpG since they cannot be distinguished from bisulfite sequencing data. however, on the opposite strand there would be clear evidence for this SNV. the algorithm examines all positions within the context (in this case CG, or similarly C/CH/CHG/CHH) on both strands and reports on the maximum rate of evidence from any of these positions.

positive evidence for a SNV would be:

  • for C: sense A/G, antisense A/T/G
  • for G: sense A/C/T, antisense C/T
  • for H: sense G, antisense G