To align nearly identical or short query & target - pb-cdunn/blasr GitHub Wiki
When aligning short reads, or a query sequence to a nearly identical target sequence, blasr may fail to return alignments with the default parameters. It may also return asymmetric results when query and target are switched (e.g., blasr may return more hits after swapping a nearly identical query and target).
To work around this problem, you may want to:
- specify maxMatch (e.g. -maxMatch 15, or equivalently -maxLCPLength 15, which is an alias of maxMatch), or
- use a reduced minMatch (e.g. -minMatch 8)
For example, the following command usually works:
$ blasr query.fa ref.fa -minMatch 8 -maxMatch 15
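Because the results can differ when query and target are switched, it can help to run both orientations with the same settings and compare the hits. The following is a minimal sketch that only builds the two command lines; the helper name `blasr_command` is hypothetical, and the single-dash flags are copied from the command above (check the help output of your blasr version, since option spelling may differ between versions).

```python
import shlex

def blasr_command(query, target, min_match=8, max_match=15):
    """Build the blasr argument list used on this page (hypothetical helper)."""
    return [
        "blasr", query, target,
        "-minMatch", str(min_match),
        "-maxMatch", str(max_match),
    ]

# Same settings, both orientations.
forward = blasr_command("query.fa", "ref.fa")
swapped = blasr_command("ref.fa", "query.fa")

print(shlex.join(forward))  # blasr query.fa ref.fa -minMatch 8 -maxMatch 15
print(shlex.join(swapped))  # blasr ref.fa query.fa -minMatch 8 -maxMatch 15
```

To actually run a command, the list can be passed to `subprocess.run(forward, check=True)` on a machine where blasr is installed.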
Further reading, if you are interested in the reasons behind this:
Blasr is designed to find optimal hits for PacBio reads (which are long and ~85% accurate) against references in a reasonable time, so short and nearly identical query & target are not what blasr usually deals with.
In the first step, blasr finds perfectly matching seeds at each position between query and target, then merges overlapping seeds, and then uses non-overlapping seeds as anchors to bound possible alignment candidates. Next, blasr pre-filters these alignment candidates based on the number of seeds, the minimum seed size (minMatch), the length of the reference and some tuple counting statistics, and finally aligns 'promising' alignment candidates using expensive dynamic programming.
The default value of minMatch is 12; this is the minimum length of seeds to search for. The default value of maxMatch is infinite; this is the seed length at which blasr stops walking along a seed.
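The roles of the two parameters can be illustrated with a minimal sketch of extending a single seed. This is not blasr's actual code: the function `extend_seed` and its arguments are hypothetical, and it assumes that a seed is grown character by character from a start position in the query and the target, is capped at maxMatch, and is discarded if it ends up shorter than minMatch.

```python
def extend_seed(query, target, q_pos, t_pos, min_match=12, max_match=None):
    """Length of the exact match starting at (q_pos, t_pos), or 0 if too short.

    max_match=None stands in for the 'infinite' default.
    """
    length = 0
    while (
        q_pos + length < len(query)
        and t_pos + length < len(target)
        and query[q_pos + length] == target[t_pos + length]
        and (max_match is None or length < max_match)
    ):
        length += 1
    return length if length >= min_match else 0

ref = "ACGTACGTACGTACGTACGT"            # 20 bases
read = ref[:10] + "T" + ref[11:]        # one mismatch at position 10

print(extend_seed(ref, ref, 0, 0))                  # 20: walks to the end
print(extend_seed(ref, ref, 0, 0, max_match=15))    # 15: capped by max_match
print(extend_seed(read, ref, 0, 0))                 # 0: 10-base match < min_match 12
print(extend_seed(read, ref, 0, 0, min_match=8))    # 10: accepted with reduced min_match
```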
When aligning a 100% identical short query to a target, blasr finds perfectly matching seeds starting from every single position and walks along every single seed until the end (since query & target match perfectly). Then, in the overlapping-seed merging step, blasr merges all completely overlapping seeds into one single long seed which covers the whole alignment (e.g., seeds [0:N), [1:N) and [2:N) are merged into one seed [0:N)). Next, in the pre-filter phase, blasr evaluates whether an alignment candidate with one single seed is promising enough.
The evaluated goodness of such an alignment candidate can be borderline, because blasr expects to align a PacBio read to a reference, and a single anchor in an alignment candidate does not look very promising.
Setting a non-infinite maxMatch value can be very helpful, because in the seed searching phase blasr stops walking along a seed once the seed length reaches maxMatch. These seeds do not completely overlap with each other and hence cannot be merged into a single long seed (e.g., with maxMatch 15, seeds [0:15), [1:16), ..., [N-15:N) won't be merged by blasr). Thus, all these seeds are considered anchors in an alignment candidate and are counted in the pre-filter evaluation. As a result, this alignment candidate looks 'good' and is reported.
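The effect on the number of anchors can be reproduced with a small sketch. It is a simplified model and not blasr's implementation: the helpers `diagonal_seeds` and `merge_contained` are hypothetical, the query and target are assumed to be identical sequences of length n, and "merging" is modelled as dropping every seed that lies completely inside another seed.

```python
def diagonal_seeds(n, min_match=12, max_match=None):
    """Seeds (start, end) for two identical sequences of length n.

    max_match=None stands in for the 'infinite' default.
    """
    seeds = []
    for start in range(n):
        end = n if max_match is None else min(start + max_match, n)
        if end - start >= min_match:      # seeds shorter than min_match are ignored
            seeds.append((start, end))
    return seeds

def merge_contained(seeds):
    """Drop every seed that is completely covered by another seed."""
    kept, max_end = [], -1
    # Sort by start, longest first, so a covering seed is always seen first.
    for start, end in sorted(seeds, key=lambda s: (s[0], -s[1])):
        if end > max_end:
            kept.append((start, end))
            max_end = end
    return kept

unbounded = merge_contained(diagonal_seeds(50))
capped = merge_contained(diagonal_seeds(50, max_match=15))

print(len(unbounded), unbounded)            # 1 [(0, 50)]
print(len(capped), capped[0], capped[-1])   # 36 (0, 15) (35, 50)
```

With the unbounded setting the model leaves a single anchor, while with a cap of 15 it leaves 36 anchors for a 50-base sequence, which is the difference the pre-filter sees.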
In addition, reducing minMatch can also help by increasing the chance that the alignment is considered 'promising'. That is why, with a reduced minMatch, blasr sometimes returns more alignments between a nearly identical query & target.