Design of nucleotide summarizer scoring and depth estimates - ababaian/serratus GitHub Wiki

Score

There is a score for each virus family and each reference sequence (score=xxx field). Collectively, families and reference sequences are "targets" of interest. The design goals for the score are as follows.

Range 0 to 100 for intuitive interpretation.
Can be interpreted as classifier confidence score that the virus family (or a closely related reference sequence) is present in the reads.
Zero if no alignments, greater than zero if at least one target alignment is found.
Conceptually, the score design should maximize area under a ROC curve (AUC), though this cannot be measured in practice. This leaves some flexibility, because any monotonic re-scaling of the score has the same AUC. This flexibility can be used to up-weight the most interesting / assemblable candidates and down-weight what is most likely junk.

Heuristics for scoring

Coverage across the pan-genome or reference sequence is the best signal for presence / absence.
Very low coverage and/or very low depth should be down-weighted because presence has less value if assembly is not possible.
Low identity should be up-weighted because coverage and depth may be under-estimated due to reduced sensitivity of alignments.

Raw score

Divide the reference sequence or pan-genome into 25 equal-sized bins. Count the number of alignments in each bin.

Cvg8 = number of bins with at least 8 alignments.

Cvg1 = number of bins with 1 to 7 alignments.

raw_score = Cvg8*4 + Cvg1.

The raw_score value is in range 0 to 100. It is always > 0 if there is an alignment. Maximum value is 100 when all bins have at least 8 alignments.

Identity weighting

identity_weight = (pctid/100)^3.

The identity_weight value is <= 1.0, decreasing with lower pctid. The decrease is slow at identities close to 100%, fast below 80%id. This is a simple model of the sensitivity of bowtie2 as a function of identity. For example, at 90%id identity_weight = 0.7 and at 80%id identity_weight = 0.5.

The inverse of identity_weight is used to up-weight raw_score at lower identities. For example, at 90%id 1/identity_weight = 1.4 and at 80%id 1/identity_weight = 2.0.

Score calculation

score = raw_score / identity_weight, pegged at 100 (i.e., if the value is > 100 set to 100).

Depth estimate

The goal is to estimate typical depth over a recoverable contig that is long enough to be useful. We don't want to require that the entire target can be assembled because a long contig often has value. We don't want mean or median depth over the complete target sequence because of outlier bins. Commonly observed outliers include (a) empty bins due to low coverage or missing gene, and (b) highly abundant bins due to false-positive alignments and expression artifacts (e.g., leaders).

typical_depth is calculated by discarding the two most abundant and two least abundant bins, and taking the mean number of reads from the remaining 21 bins. Similarly to coverage, depth will be under-estimated at low identities.

The final estimate is depth = typical_depth * identity_weight.