Sources of Error - ababaian/serratus GitHub Wiki
Leaky Alignments
Source of False Positives. If a known virus A is present at a high read-count, then sequencing error, biological artifacts, and mis-mapping will result in a small fraction of reads being assigned to related, but not ideal, sequences (B and C). The distance (in nt- or aa-substitutions) from the virus in the sequencing library may be in the "known range" to virus A, and in the "unknown range" to virus B and C.
Often this leaked signal falls well below the level of "noise", but in libraries with high viral read-counts (tens of thousands of reads), it may produce an appreciable signal for neighboring viruses.
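The scaling can be illustrated with a short sketch. The leak fraction and the noise threshold below are hypothetical values chosen only for illustration; they are not measured Serratus parameters.

```python
def expected_leaked_reads(primary_reads: int, leak_fraction: float) -> float:
    """Expected number of reads mis-assigned to neighboring references."""
    return primary_reads * leak_fraction


LEAK_FRACTION = 0.001   # hypothetical: 0.1% of virus A reads land on B or C
NOISE_THRESHOLD = 10    # hypothetical: read count treated as a real signal

for reads in (100, 1_000, 50_000):
    leaked = expected_leaked_reads(reads, LEAK_FRACTION)
    status = "above" if leaked >= NOISE_THRESHOLD else "below"
    print(f"{reads:>6} reads for A: ~{leaked:.1f} leaked reads ({status} threshold)")
```

Under these assumed values, only the library with tens of thousands of reads leaks enough to look like a hit on a neighboring virus.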
The best way to mitigate this issue is to consider a higher level of the hierarchy when locating novel viruses. For instance, instead of asking "Find me a novel PCV2-related sequence", you first ask "Find a novel Circovirus sequence" and then sub-set those results with "Which of those libraries have PCV2 as the best-available match?"
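The two-step query described above can be sketched as follows. This is a minimal illustration, not the Serratus query interface: the record layout, the library names, the read counts, and the `min_group_reads` cutoff are all hypothetical.

```python
from collections import defaultdict

# Hypothetical summary records: (library, group, reference, reads)
HITS = [
    ("LIB_A", "Circovirus", "PCV2", 12000),
    ("LIB_A", "Circovirus", "PCV1", 35),   # plausible leak from PCV2
    ("LIB_B", "Circovirus", "PCV1", 900),
    ("LIB_B", "Circovirus", "PCV2", 4),
    ("LIB_C", "Circovirus", "PCV2", 3),
]


def group_totals(hits, group):
    """Step 1: sum reads per library across every reference in the group."""
    totals = defaultdict(int)
    for library, grp, _reference, reads in hits:
        if grp == group:
            totals[library] += reads
    return dict(totals)


def best_reference(hits, library, group):
    """Reference with the most reads for one library within the group."""
    candidates = [
        (reads, reference)
        for lib, grp, reference, reads in hits
        if lib == library and grp == group
    ]
    return max(candidates)[1]


def libraries_best_matching(hits, group, target, min_group_reads):
    """Step 2: keep libraries passing the group cutoff whose top match is target."""
    totals = group_totals(hits, group)
    return sorted(
        lib
        for lib, total in totals.items()
        if total >= min_group_reads and best_reference(hits, lib, group) == target
    )


print(libraries_best_matching(HITS, "Circovirus", "PCV2", min_group_reads=100))
print(libraries_best_matching(HITS, "Circovirus", "PCV1", min_group_reads=100))
```

In this toy data, `LIB_A` is not returned for the PCV1 query even though it carries 35 PCV1 reads, because its best-available match within the group is PCV2.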
Scattered Alignments
Source of False Negatives. Alignment scatter occurs when a library sequence falls "between" the reference sequences of two operational taxonomic units (OTUs). When summary statistics are reported at the level of OTU/Family, this in effect "dilutes" divergent reads across categories. A virus may be sufficiently abundant to warrant further investigation yet be reported as rare or incomplete. An interesting, but probably hard-to-detect, case would be chimeric sequences.
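The dilution effect can be shown with a small sketch. The OTU names, read counts, and reporting threshold are hypothetical and serve only to illustrate how a per-OTU cutoff can hide a signal that is visible once reads are pooled.

```python
# Hypothetical: 150 reads from one divergent virus, scattered over three OTUs
otu_reads = {"OTU_1": 60, "OTU_2": 50, "OTU_3": 40}
THRESHOLD = 100  # hypothetical minimum reads to flag a hit for follow-up

# Per-OTU view: every category looks rare on its own
per_otu_calls = {otu: reads >= THRESHOLD for otu, reads in otu_reads.items()}

# Pooled view: summing across the related OTUs recovers the signal
family_total = sum(otu_reads.values())
family_call = family_total >= THRESHOLD

print("per-OTU calls:", per_otu_calls)
print("pooled reads:", family_total, "flagged:", family_call)
```

Pooling is only one possible check, and it assumes the scattered OTUs are already known to belong to the same higher-level group.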