Gemini 5.2.10 Design Document - Illumina/Pisces GitHub Wiki

Overview

Essentially, Gemini takes in a messy bam, and returns an error-corrected variant-calling-ready bam, simultaneously doing the jobs of read-stitching and indel realigning. When reads are trivially stitchable (compatible cigar string), it stitches. When reads disagree, it attempts indel realignment with the help of the mate. Gemini has numerous tricks up its sleeve to reconstruct alignments, which we describe in more detail below. Note that GeminiMulti is the multiprocess wrapper for Gemini. Customers (internal and external) should use GeminiMulti.dll instead of calling Gemini.dll directly. (Gemini.dll as a single process is for debugging only). Gemini wraps Samtools to do sorting and indexing of the output bam.

Maximizing stitching benefits using Pisces: Pisces uses the XD tags outputted by read stitching in Gemini (and formerly Stitcher) to assess per-base directional support. Currently, when using Gemini to do stitching, please make sure to set -UseStitchedXD to true for Pisces, so Pisces knows to expect stitched directions in the reads. (Defaults may change, we are currently testing to see if this improves performance)

Configuration

Gemini shall accept command line arguments as a whitespace-separated list of name and value pairs. For example:

Format: GeminiMulti.dll [-options]

Example:

dotnet GeminiMulti.dll -bam <bam path> -genome <genome path> -samtools <samtools path> --outFolder <output path> -numProcesses 10

Required Arguments

Argument Name	Type	Default Value	Description
-bam	string	none	path to bam. Gemini only takes in one bam at a time.
-genome	string	none	path to genome folder
-exePath	string	none	path to Gemin subprocess executable
-samtools	string	none	path to Samtools binary
-numProcesses	int	1	number of Gemini subprocesses to run simultaneously
-outfolder	string	none	directory in which to create the new bam and logging outputs

Optional Arguments

System Configuration

Argument Name	Type	Default Value	Description
samtoolsoldstyle	bool	false	Whether the provided samtools executable is the old version that uses an output prefix rather than an explicit '-o' output option for the output file name (http://www.htslib.org/doc/samtools-1.1.html).

Classification-based Handling

Argument Name	Type	Default Value	Description
categoriestorealign	comma-separated list	ImperfectStitched,FailStitch,UnstitchIndel,Split,Disagree	Category names that should be attempted to realign
categoriestosnowball	comma-separated list	None	Category names that should be attempted to snowball
pairawareeverything	bool	false	Whether to pass everything through pair aware realignment, or just the expected categories.
deferIndelStitch	bool	false *Recommend using true	Whether to stitch non-disagreeing indel reads to start, or wait til realignment. Waiting for realignment allows the reads to be realigned first, allowing for more successful and accurate stitching (for example, if mate 1 has an indel and mate 2 has a softclip which is actually hiding the same indel, it is preferable to realign mate 2 and confidently stitch rather than handle it as an overlapping softclip).
filterduplicates	bool	true	whether reads marked as duplicates shall be filtered
filterpairunmapped	bool	false	whether read pairs should be filtered when one or both reads are not mapped
filterforproperpairs	bool	false	whether reads makred as not proper pairs shall be filtered

Stitching Options

Argument Name	Type	Default Value	Description
minbasecallquality	int	0	cutoff for which, in case of a stitching conflict, bases with qscore less than this value
will automatically be disregarded in favor of the mate's bases.
nifydisagreement	bool	false	whether to turn high-quality disagreeing overlap bases to Ns
dontstitchrepeatoverlap	bool	true	whether to not stitch read pairs whose overlap is a repeating sequence. The risk here when true is that true-positive insertions / deletions might have a perfect stitch with the reference. The risk when false is that true-positive reference reads will not stitch, thus count twice towards coverage, and inflate reference count

Realignment Options

Argument Name	Type	Default Value	Description
allowRescoringOrigZero	bool	true	Option to allow setting mapq of perfectly realigned reads (0 mismatch) to 40 even if original mapq was 0. If false, perfectly realigned reads with original mapq between 1-20 are still assigned mapq of 40, but those with 0 are left at 0.
maskpartialinsertion	bool	false	Option to softclip a partial insertion at the end of a realigned read (a complete but un-anchored insertion is allowed).
minimumunanchoredinsertionlength	int	0	inimum length of an unanchored insertion (-i.e. no flanking reference base on one side) allowed in a realigned read. Insertions shorter than the specified length will be softclipped. Default value is 0, i.e. allowing unanchored insertions of any length.
checksoftclipsformismatches	bool	false	Whether to count mismatches in softclips toward total mismatches.
trackMismatches	bool	false	Whether to track and compare mismatches when realigning
softclipUnknownIndels	bool	false	Whether to softclip out "unknown" indels. If the original alignment contained an indel, and realignment was unsuccessful (couldn't confirm the existing indel, or confidently rearrange around a different/additional indel), softclip the read starting at the point of the original indel.

Indel Filtering Options

Argument Name	Type	Default Value	Description
minpreferredsupport	int	3	Instances of a found variant before it can be considered to realign around.
minpreferredanchor	int	1	Minimum anchor around indel to count an observation toward good evidence
minrequiredindelsupport	int	0	Strict cutoff on indel support - don't allow otherwise strong indels that we attempt to rescue in if they have num observations below this
minrequiredanchor	int	0	Strict cutoff on anchor length - don't allow otherwise strong indels that we attempt to rescue in if they have min anchor below this.
maxmessthreshold	int	20	Don't allow indels with average "mess" (i.e. average NM value of reads containing the indel, less the length of the indel(s)) above this value. Lower number prevents potentially low-quality indels from being realigned around.
binsize	int	0	Size of bin within which to compare indels to determine if a region is too messy to realign around. Default: 0 (do not clean up).

Softclip Treatment

Argument Name	Type	Default Value	Description
keepBothSideSoftclips	bool	false	Whether to trust that both-side softclips are probe and should stay softclipped
trustSoftclips	bool	false	Whether to trust softclips. If true, having softclips doesn't automatically trigger indel realignment. Also, won't try to stitch the softclips.
keepProbe	bool	false	Whether to trust that probe-side softclips are probe and should stay softclipped

Performance

Argument Name	Type	Default Value	Description
numthreads	int	1	number of threads to allow for each process
maxreadlength	int	1024	maximum expected length of individual reads, used to determine the maximum expected stitched read length (2len - 1). For optimal performance, set as low as appropriate (i.e. the actual single-read length) for your data. Default: 1024. This default 1024 is an optimization choice. The bigger this gets, the slower the app will get. The maximum expected length of individual reads is used to determine the maximum expected stitched read length (2len - 1). For targeted applications, the default here is usually fine. However, the stitcher will STOP processing and throw an exception once it finds a read that exceeds it max length. This is so that big variants do not go un-noticed by the user. (If you need to turn off this exception-throwing behavior, you can use ignorereadsabovemaxlength at your own risk.) The exception it would throw is: System.Exception: Unable to process: Stitching failed for read [read name]: Combined length of reads is greater than the expected maximum stitched read length of 2047)
ignorereadsabovemaxlength	bool	false	whether to passively ignore read pairs that would be above the max stitched length (e.g. extremely long deletions). This is an optimization choice. Ignoring long reads is risky (risk=lost variants and very annoyed customers) but breaking on too-long reads stops the program (annoying, and the user has to then change their settings and raise the max read length). Whether to passively ignore read pairs that would be above the max stitched length (e.g. extremely long deletions). For targeted applications, the default here (false) is usually fine. However, for whole genome sequencing or when using an aligner likely to open up very big deletions, these "long" (counting deleted bases) reads become more common. Finding a read that (when stitched) is greater than the max read length will cause the stitcher to stop with the following error message: "System.Exception: Unable to process: Stitching failed for read [read name]: Combined length of reads is greater than the expected maximum stitched read length of 2047 ". If only reporting small variants is desired: Turning this setting "on" will skip these reads, and they will be lost to downstream variant calling. If reporting big deletions is desired: Leave this setting off, and increase the MaxReadLength as needed.
maxindelsize	int	100	Maximum allowed indel size for realignment

Debug / Other

Argument Name	Type	Default Value	Description
-help,-h	none	none	Displays the help menu
-version, -v	none	none	Displays the version
-debug	bool	false	whether to run in debug (verbose) mode
-debugSummary	bool	false	whether to track various category and event counts to output in a summary at the end of execution

Input

Gemini requires as input one BAM file and the path to a genome build. The BAM file is assumed to be sorted, i.e. alignment reads should be sorted by mapped reference position. This assumption allows Gemini to process BAMs in one pass without a large memory footprint. Most standard aligners produce sorted BAM files. Positions in BAM files are expected to use 0-based coordinate system.

For performance, Gemini requires each BAM file to have an accompanying BAI index file. The index file allows Gemini to efficiently jump to chromosomes it is configured to call.

Output

Gemini outputs a new sorted, stitched, indel realigned bam and the bai file.

Limitations

Gemini puts the read direction for stitched reads in the XD tag of the read. If the BAM has some pre-existing XD tag info, this may have unexpected or undesirable consequences for Gemini and Pisces.

Detailed Design

Gemini follows the following high-level process flow:

Assess Input Reads
Collect evidence on indels observed in the reads
Categorize read pairs into buckets for further processing
Stitch any read pairs that are stitchable and do not need further processing
Assess & Filter Candidate Indels
Build Evidence For Candidate Indels
Realign Eligible Reads around strong candidate indels
Combine bucketed alignments
Finalize and output sorted, indexed bam file

This is performed on a per-chromosome level. For optimal performance, each chromosome is analyzed in a separate process, and the multi-process coordinator combines the per-chromosome results in a final merging step.

Key Features

Bucketing of Reads

Gemini buckets reads from the input BAM into various categories that impact how the read is assessed downstream. This has positive impacts for computational performance (reduces re-reading of alignments that are already known not to be candidates for realignment) and for analytical performance (allows for further judgment of quality of evidence collected within the read). For example: (image)

Candidate Indel Scoring, Filtering and Prioritization

Individual Scoring and Filtering

To reduce the introduction of false positive indels and reduce time spent on unnecessary realignment around poor-quality indels, the observed indels are judged and prioritized based on the evidence collected in the prior "Assess Input Reads" stage. The following information is gathered during read assessment and ultimately used to filter and prioritize candidate indels for realignment:

Number of Observations
Average Left Anchor
Average Right Anchor
Average surrounding "mess" (NM - indel length)
Average Base Quality for the indel (for insertions, in the inserted bases, for deletions, the bases on either side)
Number of Observations on Forward Strand
Number of Observations on Reverse Strand
Number of Observations in Stitched Reads
Number of Observations from "reputable" reads - i.e. stitched, or high-quality non-disagreeing paired reads (UnstitchIndel)

The evidence collected is used to score and filter the indels based on the following:

Filtering:

Remove indels with number of observations below configured threshold
Remove indels with minimum average anchor size below configured threshold
Remove indels with average mess above configured threshold

Scoring and Boosting:

Indels with high observation count, high average anchoring, and low average mess are treated more favorably
Indels with high directional balance (i.e. read support is similar from forward and reverse directions, and/or with stitched support) are treated more favorably
Indels with high anchoring balance (i.e. average anchor on left and right of indel are similar) are treated more favorably
Indels with high reputable support fraction (i.e. number of observations from reputable reads / number of total observations) are treated more favorably

Neighborhood Filtering

Indel Pruning

Some genomic regions are prone to a high concentration of variable indels. In some of these cases, there is a clear indel that is stronger than the rest and the others are simply artifacts (ex: a well-anchored indel may have smaller indels nearby as a result of reads being too short to cover the entire event, or mismatches within the indel causing it to look like multiple events with an intervening reference base, etc). While prioritization alone will help reduce false positives generated from amplifying these spurious indels, in some cases reads will realign better around the bad ones than they were able to around the true one. To prevent that, if there are weak indels near to a strong indel (within a configured bin-size), the weak indels can be removed from further consideration as a candidate indel to realign against.

Collapsing Similar Insertions

It has been observed that for some long insertions, there is a tendency for one or more extremely similar, weaker insertions to be observed at the same position with the same length. Often, they are the same as the true insertion, but have a one base mismatch in the inserted sequence. To prevent amplification of such insertions, and to support the true insertion as much as possible, these additional insertions are essentially collapsed down to the true insertion (the weak insertions are removed from consideration and their support is partially added to the true insertion).

Indel Support Snowballing

To help amplify initial support for good candidate indels, an optional "snowballing" step may be performed on a subset of the reads slated for realignment. A portion of reads from read categories that can be expected to have a high success rate in indel realignment (for example, UnstitchIndel reads) can be analyzed to recalibrate the candidate indel prioritization and filtering before the rest of the alignments go through realignment. This can have the beneficial effect of boosting initially hard-to-call indels so that they can pass filtering requirements, while still allowing strict filtering and prioritization to be used with the rest of the processing (saving time and preventing introduction of low-quality indels into less confident situations).

Pair-Aware Indel Realignment

In many cases, an indel is seen with high confidence in one mate of a pair, and is missed in its mate due to, for example, the mate not being long enough to nicely anchor the indel or not covering the entire length of an insertion. In these cases, it is beneficial to have the mate information, and stitching ahead of time would not have helped as they could not have been successfully stitched. To improve accuracy in paired read realignment, we search for confident indels in each of the mates, and attempt to selectively realign around those. We are also able to be slightly more forgiving to realignment than we would in the individual read case because we have the extra boost of confidence in the indel.

Features still to add:

Special handling of pair-aware partial insertions in stitching

Pair-awareness when no initial indels were seen, but indel realignment creates a very confident indel in one of the reads (Maybe) Pair-awareness to recognize that the non-indel-containing mate is actually the right one, and softclip out a junky indel from its mate (is this necessary? seems like it would be much less common than the opposite) Handling Multiple Coexisting Indels The previous indel realignment algorithm already supported introducing multiple indels into a single read if they could coexist and if they were previously seen together (the latter is an update in 5.2.7 to improve precision). In pair-aware indel realignment, we can further improve this feature by handling cases where one indel was seen on Mate 1 and a second was seen on Mate 2 (and in particular, when one indel was seen and the aligner refused to open a second one nearby because of penalty of opening a second gap, leaving it as mismatches), and when successfully stitched together they become a multi-indel read. If this is encountered during snowballing, this multi-indel evidence can be used to boost detection of such situations in the rest of the read population.

Experimental Features

Basecall quality consideration in indel realignment scoring

When comparing the original read against the realignment, one of the key metrics we look at is the number of mismatches in the two reads. However, sometimes a mismatch is just a sequencing artifact and should not count so strongly against the alignment. To try to suss this out, we look at the basecall qualities of the mismatching bases.

Softclipping of unrealignable reads with weak indels

If reads contain indels but these indels were not included in our scored and filtered candidate indels, and the read was not able to realign around any others, optionally softclip the read starting at the position of the weak indel. This should help remove FPs that don't correspond to a true indel. While it does have a risk of producing indel FNs, as long as the initial indel filtering is not too aggressive it should not do so.

Allowing imperfect matches in realignment around long insertions

During read realignment, we allow reads that look to realign very nicely with a large insertion but have a small number of mismatches in the proposed inserted sequence (as compared to the true insertion's inserted sequence) to attempt to be realigned and evaluated as if they were the exact insertion (and insert an "N" at the disagreeing bases). This is related to the problem discussed in Collapsing Similar Insertions above. As it stands, this should already increase accuracy somewhat (reducing potential SNP or neighboring indel FPs by allowing the realignment/insertion to be in the right place), but still is potentially problematic for variant calling as the N-adjusted inserted sequence still won't collapse to the true insertion to boost its support (and at high numbers could show also show up as its own FP). This requires further work and discussion to maximize the benefits.