MSA Cleaner - TUM-CBR/pymol-plugins GitHub Wiki

MSA cleaner is an interactive tool which helps with the task of eliminating problematic sequences from a multiple sequence alignment (MSA). Unlike other tools which tackle this task, this tool intends to be interactive, so the researcher can focus on building an MSA with the attributes that interest her.

Getting Started

To use this tool, you must first create a multiple sequence alignment. There are many tools, such as Clustal, for this purpose. For this tutorial, we will use a reference MSA file which can be found at: https://raw.githubusercontent.com/TUM-CBR/pymol-plugins/feature/msacleaner-vectorized/ui_demo/MsaCleaner/1/FormylTransferase.aln.fa.

Once you have the file available on your computer, open the "MSA Cleaner" application. You should be greeted with the following screen:

Proceed by using the "Load MSA" button and select the MSA file you wish to work with. After selecting the sample file, the user interface will update to:

The application contains the following sections:

Sequence Names Table: A table of sequences and penalties (more about penalties later).
Metrics Table: A table of performance metrics of the current pruning.
Alignment Overview: Displays the multiple sequence alignment from a top-level perspective.
Cleaner Widgets: The widgets used to run cleaning algorithms.
General Controls: Controls to manipulate and load MSA files.

Green/Red Lists

It is often the case that you are building an alignment around one or more specific sequences. For this purpose, the app offers the ability to "green-list". If a sequence is green-listed, it will not be removed from the alignment regardless of what you do. In a similar fashion, you can red-list a sequence, which guarantees its removal.
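The effect of the two lists on a pruning proposal can be sketched as set arithmetic. This is only an illustration of the behavior described above, not the plugin's code; the function name `apply_lists` and the decision to reject a sequence that appears in both lists are assumptions.

```python
def apply_lists(proposed_removals, greenlist, redlist):
    """Return the names that would actually be removed (hypothetical helper)."""
    greenlist, redlist = set(greenlist), set(redlist)
    conflict = greenlist & redlist
    if conflict:
        # Assumption: a sequence cannot be on both lists at once.
        raise ValueError(f"sequences on both lists: {sorted(conflict)}")
    # Green-listed sequences are never removed; red-listed ones always are.
    return (set(proposed_removals) - greenlist) | redlist


removed = apply_lists(
    proposed_removals={"seq_a", "Formyl_Transferase"},
    greenlist={"Formyl_Transferase"},
    redlist={"seq_b"},
)
print(sorted(removed))
```

Here the green-listed "Formyl_Transferase" survives even though a cleaner proposed removing it, while "seq_b" is removed even though no cleaner flagged it.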

For this example, the sequence named "Formyl_Transferase" is our sequence of interest, so we proceed to green-list it by ticking the checkbox under the "Greenlist" column next to the name in the "Sequence Names Table":

Your First Pruning

The program works by iteratively pruning out sequences that contribute to gaps in the alignment. The most obvious way to reduce gaps is to eliminate long sequences. Let's start there. Click on the "Sequence Length" tab, which can be found in the "Cleaner Widgets" area of the application, to see the respective tool:

The tool displays two sliders to control the minimum/maximum length of sequences that will be preserved. Above the sliders, you can see a graph showing the distribution of sequence lengths in your alignment. It is clear that the MSA has many sequences of inappropriate length, especially considering that our target sequence is only 204 residues long. Let's move the "Maximum length" slider to a more appropriate value like 450. After doing that, two important updates happen in the application:

The "Alignment Overview" will now contain dark spots. These correspond to the regions of the alignment that are targeted for elimination. Note that this tool, unlike other tools, only removes whole sequences. The dark columns therefore appear because removing those sequences eliminates gaps and makes the alignment shorter:

You may need to scroll in order to see the different regions of the alignment.

The "Metrics Table" will update its values:

Most of the metrics are self-explanatory, but one important metric needs to be explained: Efficiency. This metric captures how good or bad the current proposal is. It is defined as Efficiency = (Number of gaps removed)/(Number of sequences removed), in other words, the number of gaps that get eliminated per sequence that is removed. The higher this number, the better. With the current setup, we achieve an efficiency of 179, as we delete 11 sequences and reduce the length of the alignment by 1970 positions.
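The proposal and its efficiency can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's implementation: the helper names, the toy alignment, and the zero-removal convention are assumptions, and "gaps removed" is taken to mean the number of alignment columns that disappear, which matches the figures quoted above (1970 / 11 ≈ 179).

```python
GAP = "-"


def ungapped_length(aligned_seq):
    """Number of residues in an aligned sequence, ignoring gaps."""
    return sum(1 for c in aligned_seq if c != GAP)


def propose_length_prune(alignment, min_len, max_len, greenlist=()):
    """alignment: dict of name -> aligned string, all of equal length."""
    removed = {
        name for name, seq in alignment.items()
        if name not in greenlist
        and not (min_len <= ungapped_length(seq) <= max_len)
    }
    kept = [seq for name, seq in alignment.items() if name not in removed]
    # Columns that contain only gaps once the removed sequences are gone.
    empty_columns = sum(
        1 for column in zip(*kept) if all(c == GAP for c in column)
    )
    return removed, empty_columns


def efficiency(gaps_removed, sequences_removed):
    # Assumption: report 0.0 when the proposal removes nothing.
    return gaps_removed / sequences_removed if sequences_removed else 0.0


toy = {
    "target": "MKT--LV",
    "short":  "MRT--LV",
    "long":   "MKTAALV",
}
removed, empty_columns = propose_length_prune(toy, 0, 5, greenlist={"target"})
print(removed, empty_columns, efficiency(empty_columns, len(removed)))
print(round(efficiency(1970, 11)))  # the figures from the tutorial
```

In the toy alignment, dropping the single over-long sequence frees two columns, so the proposal has an efficiency of 2.0.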

At this point, the application is only making a proposal (no changes to the alignment have happened yet). You can move the sliders to different values and explore different proposals, which in turn will give you different efficiencies. To make the proposal final, click the "Prune" button in the "General Controls" section. Do this before proceeding to the next section. As you can see, after pruning, the length distribution has changed to something more reasonable:

The Second Pruning

Eliminating long sequences is useful, but is also very limited. We now have sequences of similar lengths with lots of huge gaps all over the place. You can see that the alignment is 577 positions long while the longest sequence is only 342 residues long. More advanced methods are needed. One such method is the "Gap Divergence" algorithm which you can open by selecting the respective tab in the "Cleaner Widgets" section:

Before we proceed, let's briefly explain how the cleaning routines of this application work. Every sequence is analyzed in the context of all sequences in order to compute a penalty at every position (including gaps) of every sequence. The penalties of all the positions of a sequence are then summed to get a penalty per sequence, which is used to rank the sequences from lowest to highest penalty. The idea is that the higher the penalty, the less desirable a sequence is.

This brings us to the threshold slider, which controls the penalty cutoff: sequences that rank above the threshold get pruned. Since all sequences are ranked by their penalty score, setting the threshold to 0.8 always means keeping the 80% of sequences with the lowest penalty according to one criterion (such as "Gap Divergence").
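The ranking and cutoff described above can be sketched as follows. The function name, the example penalties, and the rounding of the number of sequences to keep are assumptions made for illustration; the application may resolve ties and fractional counts differently.

```python
def sequences_to_prune(penalties, threshold):
    """penalties: dict of name -> total penalty.
    threshold: fraction (0 to 1) of lowest-penalty sequences to keep."""
    ranked = sorted(penalties, key=penalties.get)  # lowest penalty first
    # Assumption: round to the nearest whole number of sequences.
    keep_count = int(round(threshold * len(ranked)))
    return ranked[keep_count:]


penalties = {"a": 0.1, "b": 2.5, "c": 0.7, "d": 9.0, "e": 1.2}
print(sequences_to_prune(penalties, 0.8))
```

With five sequences and a threshold of 0.8, the four lowest-penalty sequences are kept and only "d", the one with the highest penalty, is proposed for pruning.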

Try moving the "Gap Divergence" slider to 0.8. The application will update to show what would be removed:

This yields an efficiency of 2.16, still a very good value! Let's proceed and prune again!

Scoping

As explained before, each position is assigned a penalty and the penalties are added up to compute the overall sequence penalty. However, in some cases we would like to focus our efforts on specific parts of the alignment, instead of the whole alignment. This is where scoping comes in handy. It essentially restricts the positions that will be considered in the overall score.

The widgets in the "Cleaner Widgets" section offer two sliders to control the scope, but the easiest way to do this is to use the "Alignment Overview" for this purpose. Simply click and hold over the representation of the alignments to select the scope:

Try scoping different parts of the alignment and playing with the threshold control. You will see that the behavior changes, as the penalization is restricted to different parts of the alignment.
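In terms of the per-position penalties, scoping amounts to summing over a sub-range of positions instead of the whole alignment. The sketch below assumes a half-open range `[scope_start, scope_end)` and made-up penalty values; both are illustrative choices, not taken from the application.

```python
def scoped_penalty(position_penalties, scope_start, scope_end):
    """Sum the penalties of positions scope_start .. scope_end - 1 only."""
    return sum(position_penalties[scope_start:scope_end])


position_penalties = [0, 3, 1, 0, 5, 2]
print(scoped_penalty(position_penalties, 0, len(position_penalties)))  # whole alignment
print(scoped_penalty(position_penalties, 1, 4))                        # scoped
```

A sequence whose penalties sit mostly outside the scope ranks much better once the scope is applied, which is why the proposal changes as you move the selection.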

About Penalization

In the previous sections we used one of the penalization approaches: "Gap Divergence". There are two more methods, namely "Eliminate Long Inserts" and "Eliminate Large Gaps". This section briefly describes how these methods compute penalties.

Gap Divergence

The idea of this method is to assign, at every position where the sequence has a residue, a value proportional to the number of gaps at that position in the alignment (zero if the position is a gap). The intuition is that the more gaps occur where the sequence has a residue, the worse that position is for the alignment.
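A minimal sketch of this idea, assuming the proportionality constant is 1/N (so each residue is penalized by the fraction of sequences with a gap in its column). The function name and the toy alignment are illustrative; the plugin's actual scaling is not specified here.

```python
GAP = "-"


def gap_divergence_penalties(alignment):
    """alignment: list of equal-length aligned strings.
    Returns one list of per-position penalties per sequence."""
    n = len(alignment)
    # Fraction of sequences that have a gap in each column.
    gap_fraction = [
        sum(c == GAP for c in column) / n for column in zip(*alignment)
    ]
    return [
        [0.0 if c == GAP else gap_fraction[i] for i, c in enumerate(seq)]
        for seq in alignment
    ]


alignment = ["MK-LV", "MKALV", "M--LV", "MK-LV"]
totals = [sum(row) for row in gap_divergence_penalties(alignment)]
print(totals)
```

The second sequence is the only one with a residue in the mostly-gap third column, so it receives the highest total penalty.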

Eliminate Large Gaps

This method penalizes sequences that have residues in regions considered as gaps. The penalization is proportional to the size of the gap and the number of residues the sequence has at that region. The first step is to define what is considered a gap. That's determined by the "continuity threshold". This defines what proportion of sequences must have gaps at a position for that position to be considered a gap. A value of 0.8 means that a position is considered a gap if at least 80% of the sequences have a gap in that position.
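The two steps can be sketched as follows: first mark gap columns using the continuity threshold, then penalize residues that fall inside runs of gap columns. Taking the penalty for a region as (region size) × (residues in that region) is an assumption about what "proportional to" means here, and all names are hypothetical.

```python
GAP = "-"


def gap_columns(alignment, continuity_threshold):
    """True for each column where enough sequences have a gap."""
    n = len(alignment)
    return [
        sum(c == GAP for c in column) / n >= continuity_threshold
        for column in zip(*alignment)
    ]


def gap_regions(is_gap):
    """Group consecutive gap columns into half-open (start, end) regions."""
    regions, start = [], None
    for i, flag in enumerate(is_gap):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(is_gap)))
    return regions


def large_gap_penalty(seq, regions):
    # Assumed form: region size times residues the sequence has in it.
    return sum(
        (end - start) * sum(c != GAP for c in seq[start:end])
        for start, end in regions
    )


alignment = ["MK---LV", "MKA--LV", "MKAGSLV", "MK---LV", "MK---LV"]
regions = gap_regions(gap_columns(alignment, 0.8))
print(regions, [large_gap_penalty(seq, regions) for seq in alignment])
```

With a continuity threshold of 0.8, only the two columns where four of the five sequences have a gap form a gap region, and only the third sequence has residues there.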

Eliminate Long Inserts

This method works like the "Eliminate Large Gaps" method but differs in that the penalty given is proportional to the number of contiguous residues that a sequence has in a gap.
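One way to read this is that a sequence is penalized by the length of its longest unbroken run of residues inside gap columns. The sketch below takes that interpretation as an assumption; the function name and the example gap mask are hypothetical, and the plugin may combine runs differently.

```python
GAP = "-"


def longest_insert_run(seq, is_gap_column):
    """Length of the longest run of consecutive residues in gap columns."""
    longest = current = 0
    for residue, in_gap in zip(seq, is_gap_column):
        if in_gap and residue != GAP:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # run broken by a gap or a non-gap column
    return longest


is_gap_column = [False, False, True, True, True, True, False]
for seq in ("MK----V", "MKAG-SV", "MKAGSSV"):
    print(seq, longest_insert_run(seq, is_gap_column))
```

A sequence with one long insert therefore scores worse than a sequence with the same number of residues scattered across the gap region.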