Consensus Construction - stuckyb/seqtrace GitHub Wiki

How SeqTrace constructs consensus sequences

Introduction

Sanger sequencing projects often produce both "forward" and "reverse" sequencing traces, representing the two complementary strands of the target DNA. These traces provide two sources of information about the target DNA, and when considered together, they should provide a more accurate assessment of the true DNA sequence than either would individually. Thus, an important feature of SeqTrace is its ability to construct a single consensus sequence from matching forward and reverse sequencing traces. This page describes how SeqTrace builds these consensus sequences. If you are looking for a short explanation, here it is: since version 0.9.0, SeqTrace uses Bayesian inference to combine the information from aligned sequencing traces in order to produce a single, high-quality consensus sequence that meets minimum base quality score requirements. That might be all you need to know, but if not, read on.

In general, there are many approaches one could take to constructing a single consensus sequence from two or more aligned sequencing traces, but the most robust methods rely on numerical estimates of the accuracy of each base call on the trace data. High-quality accuracy estimates first became available with the introduction of the software program Phred (Ewing and Green 1998), which uses characteristics of the sequencing trace to estimate the probability that each base is called correctly. These probability estimates are log-scaled, resulting in a final quality score for each base call. Experimental evidence has shown that Phred-type accuracy probabilities are highly reliable, and they have since become the standard for DNA sequencing. Because these quality scores are essential to producing the most accurate consensus sequence possible, SeqTrace requires that all trace data include base calls with Phred-type quality scores. All modern Sanger sequencing platforms of which I am aware meet this requirement.

Those who are interested in learning more about alternative approaches to constructing consensus sequences, as well as the historical development of this problem, should check out the papers by Churchill and Waterman (1992), Lawrence and Solovyev (1994), and Bonfield and Staden (1995). The documentation of the sequence-assembly program Gap4, which is part of the open-source Staden Package, also provides helpful background reading.

Consensus building with Bayes

The key idea that SeqTrace uses to construct a consensus sequence using base call quality scores is straightforward: Because the quality score for a called base gives us the probability that the base was called correctly, if we have two calls for a single base from matching forward and reverse traces, we can use statistical methods to combine these probabilities in order to infer the most likely "true" base. In this way, the information from both sequencing traces and their associated quality scores allows us to derive a final consensus sequence.

Bayesian statistical inference provides an elegant means to achieve this objective. In the rest of this section, I will try to provide a conceptual, high-level explanation of how SeqTrace implements this approach, but I will not take time to discuss much of the mathematics behind it or the details of the actual implementation. If some of the terminology and ideas used here are unfamiliar to you, I suggest reading some of the vast information available on the Web about Bayes' theorem and Bayesian inference. Curious readers will also want to take a look at SeqTrace's source code to see exactly what is going on.

The algorithm begins by assuming that we initially know nothing about the true underlying DNA sequence. Because we have no information about the true sequence, the best we can do is decide that the four nucleotides (A, T, G, and C) have an equal chance of occurring at any position in the strand. This defines a discrete prior probability distribution where each of the four bases occurs with probability 0.25 (in the language of Bayesian statistics, this is known as an "uninformative prior").

Once we acquire a forward sequencing trace with base calls and confidence scores, we have new information about the real DNA sequence. The confidence scores give us the probability that each base was called correctly, and this allows us to calculate the conditional probabilities of observing the base call data given a particular "true" base. Using Bayes' theorem, we can use these conditional probabilities along with the prior (uninformative) probability distribution of nucleotides to compute a new, posterior probability distribution of nucleotides for each position in the strand. These posterior distributions reflect our new knowledge of the true DNA sequence and tell us which sequence of bases is most likely, given the observed sequencing data.

Suppose we then acquire a second, reverse sequencing trace that covers the same DNA strand as the forward trace discussed in the preceding paragraph. We now have even more information about the true, underlying DNA sequence. The posterior probability distributions from our initial analysis now become the prior distributions, and we again use Bayes' theorem to calculate posterior distributions that reflect the additional information obtained from the new sequencing data. These posterior distributions summarize the information from both the forward and reverse traces and allow us to choose a consensus sequence that best represents both sets of trace data.

As an example, consider a single position in a DNA strand for which the true nucleotide is unknown. We want to use sequencing data to infer the true nucleotide at this position. Because we know nothing about the true sequence, we begin by assuming that all bases are equally likely at this position; that is, P(A) = P(T) = P(G) = P(C) = 0.25.

Now, imagine that we have obtained forward sequencing trace data for this position, with a base call of "A" and a Phred-type quality score of 5. This means that, given the trace data, the probability of a correct base call is approximately 0.7. We can now compute the conditional probabilities of observing a base call of "A" given the different possibilities for the true underlying base. For example, if the underlying base is actually an "A", then the probability of observing a correct base call of "A" is 0.7. If the underlying base is a "T", then the probability of observing an incorrect call of "A" is 0.1 ([1-0.7] / 3 to account for the three incorrect possibilities). After running these conditional probabilities and our prior distribution through Bayes' theorem, we arrive at the following posterior probability distribution of nucleotides at this position: P(A) = 0.7, P(T) = P(G) = P(C) = 0.1. At this point, we might conclude that the true base is likely to be an A, but that the evidence for this is not very strong.

Next, suppose that we acquire reverse sequencing trace data for this position, with a base call of "T" and a Phred-type quality score of 15. This quality score equates to a probability of a correct base call of approximately 0.97. We now calculate conditional probabilities exactly as before, take the posterior distribution from the preceding paragraph as our new prior probability distribution, and use Bayes' theorem again to derive a new posterior probability distribution of nucleotides for this position. This time, we get P(A) = 0.066, P(T) = 0.915, and P(G) = P(C) = 0.009. The reverse trace data provided much stronger evidence than the forward trace data, as indicated by the posterior probabilities. Considering all available information, we would conclude that the true base is probably a T (although the evidence is still rather weak!).

Even if you do not fully understand the details of the methodology, you can easily see (I hope!) that this result makes sense. If you were to simply look at these data and mentally choose a consensus nucleotide, you would almost certainly arrive at the same conclusion: that the true underlying base is probably a T. The Bayesian framework provides a formal, rigorous setting for this intuition and allows for the precise calculation of the relevant probabilities.

An additional benefit of this approach is that each base of the consensus sequence has an associated probability. These probabilities allow SeqTrace to generate confidence scores for the entire consensus sequence. SeqTrace then uses a user-defined minimum quality score (30 by default) to decide which consensus base calls to accept for the finished sequence.

As a final comment on SeqTrace's consensus algorithm, I should mention that in the case where two or more bases end up sharing the maximum probability (which is generally very unlikely) , SeqTrace will simply choose one as the consensus base. Note, though, that this can only happen if the maximum probability is 0.5 or less, meaning it can only affect the consensus sequence if the minimum quality score cutoff is less than 4! Consequently, it should be of no practical concern.

Working with ambiguous bases

As of version 0.9.0, SeqTrace supports all 11 IUPAC nucleotide ambiguity codes (e.g., W, S, M, etc.). When processing single trace files (or with the legacy consensus algorithm), these codes are treated in exactly the same way as the standard nucleotide codes. If the quality score meets the minimum quality threshold, then the code is retained in the final sequence.

The Bayesian consensus algorithm calculates base probabilities for ambiguous nucleotide codes by properly distributing the probability across all nucleotides that the code represents. For example, if a base call of "W" is given, which means either A or T, then the probability of a correct call is divided evenly between A and T. The algorithm then proceeds exactly as described in the preceding section.

Please note that even though the Bayesian consensus algorithm properly interprets ambiguous base calls, ambiguous bases are never assigned to the consensus sequence. That is, the algorithm always attempts to pick a single nucleotide that best represents the data. The only exception is for regions of the alignment where there are only data from one trace file. In these cases, the single-sequence algorithm is followed, which defers to the decisions of the original base caller.

The primary reason for this behavior is that the vast majority of times when the probabilities for two or more bases are nearly equal, it indicates little more than low quality sequencing data. In my opinion, assigning partial ambiguity codes in such cases has little, if any, utility and risks obscuring the poor quality of the data. The clearest approach is to choose a single base if the data support it and assign an "N" otherwise. If you believe you have DNA that is truly polymorphic at some position, then you can visually inspect the traces (which you should always do anyway) and add the ambiguity code if you want to.

As a final comment in this section, it might be worth pointing out that base-calling software seems to be really bad at assigning partial ambiguity codes. I have made no systematic study of this, but most of the time when I've seen partial ambiguity codes in a trace file, they range from nearly useless to nonsensical. For instance, it is quite common to see a 2-base ambiguity code, such as "W", with a quality score of 1. What does this actually mean? Well, "W" represents either an A or a T, and if you work out the probabilities from the quality score of 1, you find that either G or C are actually more likely than either A or T! To me, this makes no sense at all. At the very least, partial ambiguities in trace files should be viewed with healthy skepticism unless you know your base calling software does a good job of assigning them.

What about alignment gaps?

Alignments of forward and reverse traces will almost always contain one or more gaps (in addition to the end gaps needed to achieve alignment). These gaps pose a problem for the algorithm outlined in the previous section because they are a result of the alignment and do not have any associated quality score information.

Considering the underlying "true" DNA, gaps can only represent one of two possibilities. First, a gap can be caused by a nucleotide that actually does exist in the true sequence but was not detected in one of the sequencing traces. The second possibility is that one of the sequencing traces resulted in the base caller identifying a spurious nucleotide that does not actually exist in the underlying DNA sequence. In the first case, the gap is "real" because a base was missing in one of the traces; in the second case, the gap is merely an artifact caused by a non-existent base and it should be eliminated from the final consensus sequence.

To adjudicate between these two possibilities, SeqTrace uses the quality score of the base call from the sequencing trace that created the gap in the alignment as well as the quality scores of the two bases that flank the gap in the other sequencing trace. If the quality score of the gap-introducing base is less than the user-specified minimum quality threshold, and the mean quality score of the flanking bases in the other trace exceeds the minimum, then SeqTrace considers the base call to represent a spurious nucleotide and deletes it from the finished consensus sequence. Otherwise, the gap-introducing base is retained in the finished consensus sequence. This is an intentionally conservative algorithm that keeps all positions in the alignment unless there is strong evidence to delete a base position from one of the traces.

In practice, this algorithm seems to work quite well, but it is not perfect. It is always a good idea to visually inspect any internal gaps in forward/reverse trace alignments and decide whether to make any additional edits by hand. However, with reasonably rigorous quality trimming of the sequence ends, manual editing should rarely be necessary.

Consensus construction prior to SeqTrace 0.9.0

The Bayesian approach to building consensus sequences was a new feature introduced in SeqTrace version 0.9.0. Prior to that, SeqTrace used a much less complicated consensus algorithm, which worked by examining the quality scores and called bases for each position in the forward/reverse trace alignment and then using simple rules to make a decision about the consensus sequence, as follows: If both quality scores at a given position exceed or equal the user-specified minimum and both base calls agree, then use the base call in the consensus sequence. If both quality scores exceed or equal the minimum but the base calls disagree, then place an "N" in the consensus sequence. If one quality score exceeds or equals the minimum but the other does not, use the high-quality base call in the consensus sequence. If both quality scores are less than the minimum, then place an "N" in the consensus sequence.

This is a perfectly serviceable consensus algorithm, but it does not use the quality score information as effectively as the more sophisticated algorithm based on Bayesian inference. Consequently, I recommend that all users of SeqTrace process their sequencing data using the new algorithm. However, the old algorithm is still available as an option in case users need it to exactly reproduce previous analyses.