Sources of Error - ababaian/serratus GitHub Wiki

Leaky Alignments

Alignment Leakage

Alignment leakage is a source of false positives. If a known virus A is present at a high read count, then sequencing errors, biological artifacts, and mis-mapping will cause a small fraction of reads to be assigned to related sequences (B and C) rather than to the best-matching one. The distance (in nt or aa substitutions) of the virus in the sequencing library may then fall in the "known range" relative to virus A, but in the "unknown range" relative to viruses B and C, so the leaked reads can be mistaken for a novel relative of B or C.

The leaked signal often falls well below the level of "noise", but in libraries with high viral read counts (tens of thousands of reads), it may produce an appreciable signal for neighboring viruses.
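One way to screen a library for this effect is to compare every hit against the dominant hit in the same library. The following is a minimal sketch, not part of Serratus: the function name `flag_possible_leak`, the input layout (a dict of read counts per reference), and the 1% cutoff are all assumptions chosen for illustration.

```python
def flag_possible_leak(read_counts, max_fraction=0.01):
    """Return references whose read count is a small fraction of the top hit.

    read_counts:  dict mapping reference name -> aligned read count
                  for a single library (hypothetical input layout).
    max_fraction: illustrative cutoff, not a Serratus parameter. Hits at or
                  below this fraction of the dominant hit are flagged.
    """
    if not read_counts:
        return []
    top_name = max(read_counts, key=read_counts.get)
    top_count = read_counts[top_name]
    if top_count == 0:
        return []
    return sorted(
        name
        for name, count in read_counts.items()
        if name != top_name and count > 0 and count / top_count <= max_fraction
    )


# Virus A dominates; B and C receive a trickle of reads, D is a real co-hit.
counts = {"A": 50000, "B": 120, "C": 35, "D": 4000}
print(flag_possible_leak(counts))
```

A flagged reference is only a candidate for leakage. A low ratio to the dominant hit does not by itself prove that the reads were mis-assigned.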

The best way to mitigate this issue is to search at a higher level of the taxonomic hierarchy when looking for novel viruses. For instance, instead of asking "Find me a novel PCV2-related sequence", you first ask "Find a novel Circovirus sequence" and then subset those results by asking "In which of those libraries is PCV2 the best-available match?"
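The two-step query described above can be sketched as follows. This is an illustration under assumed inputs, not the Serratus query interface: the function `libraries_with_best_match` and the tuple layout `(library, group, reference, reads)` are hypothetical.

```python
from collections import defaultdict


def libraries_with_best_match(hits, group, target):
    """Two-step query: filter by the higher-level group, then by best match.

    hits:   iterable of (library, group, reference, reads) tuples
            (hypothetical layout for this sketch).
    group:  higher-level label to query first, e.g. "Circovirus".
    target: reference that must be the top hit within that group.
    """
    per_library = defaultdict(lambda: defaultdict(int))
    for library, hit_group, reference, reads in hits:
        if hit_group == group:  # step 1: query at the higher level
            per_library[library][reference] += reads

    # Step 2: keep libraries whose highest-count reference is the target.
    return sorted(
        library
        for library, counts in per_library.items()
        if max(counts, key=counts.get) == target
    )


hits = [
    ("lib1", "Circovirus", "PCV2", 40000),
    ("lib1", "Circovirus", "PCV1", 150),   # likely leak from PCV2
    ("lib2", "Circovirus", "PCV1", 30000),
    ("lib2", "Circovirus", "PCV2", 90),    # likely leak from PCV1
    ("lib3", "Other", "X", 500),
]
print(libraries_with_best_match(hits, "Circovirus", "PCV2"))
```

Here `lib2` is excluded even though it has PCV2 reads, because PCV1 is its best match within the group.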

PCV1 and PCV2 Leak

Scattered Alignments

Alignment Scatter

Alignment scatter is a source of false negatives. It occurs when a sequence in the library lies "between" the sequences of two operational taxonomic units (OTUs). When summary statistics are reported at the OTU or family level, this in effect "dilutes" the divergent reads across categories, so a virus that is abundant enough to warrant further investigation may be reported as rare or incomplete. Chimeric sequences would be an interesting, but probably hard to detect, case.
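One hedged way to look for scatter is to pool read counts at a coarser grouping and compare the pooled total with the largest single category. The sketch below assumes per-OTU read counts and an OTU-to-family mapping are available as dicts. The function `summarize_scatter` and these inputs are hypothetical, and pooling only helps when the scattered categories share a parent group.

```python
from collections import defaultdict


def summarize_scatter(otu_counts, otu_to_family):
    """Return {family: (total_reads, largest_single_otu_reads)}.

    otu_counts:    dict mapping OTU name -> read count for one library.
    otu_to_family: dict mapping OTU name -> family name.
    Both inputs are assumed layouts for this illustration.
    """
    totals = defaultdict(int)
    largest = defaultdict(int)
    for otu, reads in otu_counts.items():
        family = otu_to_family[otu]
        totals[family] += reads
        largest[family] = max(largest[family], reads)
    return {family: (totals[family], largest[family]) for family in totals}


otu_counts = {"OTU_1": 300, "OTU_2": 280, "OTU_3": 10}
otu_to_family = {"OTU_1": "FamX", "OTU_2": "FamX", "OTU_3": "FamY"}
print(summarize_scatter(otu_counts, otu_to_family))
```

In this example FamX has 580 reads in total but no single OTU exceeds 300, which is the pattern expected when reads from one divergent virus are split between two OTUs.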