Displaying BLAST scores - arklumpus/TreeViewer GitHub Wiki

TreeViewer can be used to display BLAST scores on a phylogenetic tree. This can be useful to contextualise the results of a BLAST search and assess whether the search results are true orthologs of the query sequence or not.

This example will show how to display BLAST scores on a tree containing the results of a BLAST search for sequence homologs of the hpnP gene. The hpnP gene encodes a C-2 hopanoid methylase that is involved in the production of 2-methylhopanoids in various groups of bacteria (in particular, in some cyanobacteria). Another interesting feature of this gene is that it has a closely related (but uncharacterised) homolog, which often turns up in BLAST searches; since matches for this homolog can have an e-value lower than 10-150 and a score higher than 400, it is sometimes difficult to distinguish between them and actual hpnP orthologs by just looking at the BLAST results. However, combining a phylogeny of the sequences with their BLAST scores can significantly help to determine whether a match is a “true ortholog” or just a false positive.

Note that as more data is added to the NCBI database, the results of the BLAST search may differ from the ones shown in this example. Nevertheless, the same principles should apply; if you wish to use exactly the same data as in this example, you can download the attached files rather than performing the BLAST search yourself.

Performing the BLAST search

The first step in this example is to run a BLAST search for the gene of interest. This can be done e.g. using the BLAST online interface to run a blastp analysis, using the accession number B3QHD1 as a query sequence (this is the accession number of the hpnP gene from Rhodopseudomonas palustris, an alphaproteobacterium). For simplicity, the search should be restricted to the Synechococcales group (taxid 1890424), by entering this name in the Organism box.

Once the BLAST search finishes, the usual interface showing the best 100 matching sequences should appear; from this interface, it is possible to download a tree of these sequences by clicking on Distance tree of results (which will open a tree view page) and then, in the tree view interface, on Tools, then Download, and finally ASN text file (or ASN binary file).

The resulting tree file (also available here) can be opened and inspected in TreeViewer; clicking on a tip label will open the selection panel and, if you click on the "tag" button, you will see the many attributes associated with that tip, such as the accession number, organism name and more. The tree (as drawn by default by TreeViewer) should look similar to the following image:

Adding the scores to the tree

Unfortunately, the tree file that we downloaded from the BLAST interface does not contain information about the scores. These can be obtained by going back to the BLAST results page, clicking on Download, and then on Hit Table (CSV). This will download a file (also available here) that contains a comma-separated table with the information from the BLAST search.

The contents of this file should look similar to the following:

B3QHD1.1,WP_190805177.1,58.801,517,212,1,7,522,25,541,0.0,679,77.95
B3QHD1.1,WP_099799097.1,58.621,522,214,2,7,526,26,547,0.0,668,76.25
B3QHD1.1,WP_011056615.1,58.915,516,210,2,7,520,26,541,0.0,668,77.13
B3QHD1.1,WP_126985133.1,58.915,516,210,2,7,520,26,541,0.0,668,77.13
B3QHD1.1,ATS19577.1,58.621,522,214,2,7,526,17,538,0.0,667,76.25
B3QHD1.1,WP_024125695.1,58.915,516,210,2,7,520,26,541,0.0,667,76.94
...

Each row contains:

  • The query accession number
  • The subject (i.e. matched sequence) accession number
  • The % identity
  • The alignment length
  • The number of mismatches
  • The number of gap opens
  • The position in the query where the alignment starts
  • The position in the query where the alignment ends
  • The position in the subject where the alignment starts
  • The position in the subject where the alignment end
  • The e-value
  • The bit score
  • The percentage of "positives" (i.e. amino acids that are identical between the query and the subject or have similar chemical properties)

This list of fields can be obtained by downloading the Hit Table (text) file. The CSV file can be added to the tree as an "Attachment" by clicking on the Add attachment button in the Attachments tab and selecting the file. To associate the information to the tips of the tree, the Parse node states module can be used.

To enable enable this module, click on the Further transformations button in the Modules tab (which opens the modules panel) and then on the Add module button, finally select the module. Open the options for the new module and set the value of the Data file parameter to the attachment containing the CSV file. Then, change the Separator to a comma , (which is what is used in the file). Since the table does not have a header row, the Column header(s) need to be specified manually, therefore you should enter the following in the text box:

query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score, % positives

Finally, the Match column should be set to 2 (i.e. the column in the data file that contains the accession number of the subject sequences) and, correspondingly, the Match attribute should be set to accession-nbr (this attribute was included in the ASN tree file that we downloaded). You can now click on the Preview button to check that everything is OK; then, click on Apply. If you click on a tip of the tree you should now see that the new attributes have been added; in particular, you should see that there is a bit score attribute.

At this stage, a bit score has been associated to the terminal branches of the tree, but the internal nodes do not have this attribute. The bit score can be propagated up to the root of the tree by using the Propagate attribute module. Enable this module by clicking on the Add module button in the Further transformations, then expand its options and set the Attribute to bit score, then click on the Apply button. This will ensure that all the branches in the tree have a bit score attribute and will be drawn in the tree.

It is now possible to highlight the bit scores on the tree; this can be done easily by clicking on the Branch scores button in the Actions tab. You should then set the Attribute to bit score (which should be the default) and change the Maximum and Minimum (as the program suggests, a minimum of 0 and a maximum of 600 should be appropriate).

When you click on OK, the program will:

  • Set up the Coordinates module so that the tree is drawn in an "unrooted" style.

  • Add 10 Branches modules. Each module is associated to a threshold, and draws only branches whose bit score is higher than that threshold. For example, the first Branches module draws all branches whose score is greater than 0 (i.e. all of them), the second module draws branches with a score higher than 60 (i.e. 600 * 0.1), the third module draws scores higher than 120 (i.e. 600 * 0.2) and so on, until the 10th module draws branches with a score higher than 540 (i.e. 600 * 0.9). Since the branches drawn by all these modules are stacked upon each other, this means that branches with a higher score are drawn in all the colours from the gradient, while branches with lower scores miss the top layers.

  • Add another Branches module; this module draws the whole tree in white, on top of the coloured branches drawn by the other modules.

  • Add a Legend module, which draws a legend below the tree (it will also add an attachment called ScoreLegend, which contains an SVG image with the gradient shown in the legend).

The tree should now look similar to the following figure:

This plot makes it easy to recognise the part of the tree corresponding to sequences with a high score (i.e., the yellow branches at the top), which are the "true orthologs", and the part of the tree corresponding to sequences with a lower score (the greenish-blue branches at the bottom), which are false positives. It would have been hard to distinguish between the two by only looking at the BLAST scores: since the lowest-scoring sequence still has a score of 306 and an e-value of about 10-98, we cannot use a "pre-baked" e-value or bit score threshold.

In particular, take note of the sequence corresponding to Leptolygbya sp. FACHB-36 (the blue-ish branch at the top of the tree, in the middle of the yellow branches): this sequence has a relatively low score (386), but this is due to the fact that only a partial sequence is available for this organism, and not to the fact that this is not a true ortholog of the gene we are looking for. Indeed, this is clear from the tree, because the sequence clusters together with the other high-scoring true orthologs; if we were to only look at the scores, we might conclude that Leptolyngbya sp. FACHB-36 does not possess this gene, while it actually does.

As an example of the opposite problem, Cornejo-Castillo & Zehr (2019) mistakenly marked many marine cyanobacteria as possessing the hpnP gene (see Fig. 1), while in reality they do not have it.

In conclusion, when assessing BLAST matches one always needs to look at both the scores/e-values and the tree of the matched sequences; with TreeViewer, both these factors can be assessed at a glance.

Highlighting the "true orthologs"

Now that we are able to identify the "true orthologs" on the tree, it might be a good idea to highlight them on the tree.

The first step is obtaining a list of these orthologs, which can be done by clicking on the Lasso selection button in the Actions tab. This will display a message on top of the tree plot, stating that the "lasso selection" is currently active. To select the taxa, you need to click on the tree plot in order to draw a shape around the "true ortholog" sequences (see the first screnshot below). Once you reach the final point, double-click in order to close the selection.

Now, a window will open, which lets you choose one of the attributes on the nodes that were within the selection shape. Select the accession-nbr attribute and click on OK. This will cause the branches within the selection to become highlighted in a light-blue colour (see the second screenshot below) and their accession numbers will be copied to the system clipboard.

You can now add the list of the selected taxa to the tree as an attachment. To do this, click on the bottom part of the Add attachment button in the Attachments tab and select Open text editor. This will open a text editor window in which you can paste the text from the clipboard (either right-click and choose Paste, or press CTRL+V on Windows and Linux or CMD+V on macOS). The text you paste should look similar to the following:

WP_099799097.1
ATS19577.1
WP_181496272.1
WP_011056615.1
WP_126985133.1
WP_024125695.1
...

You should now click on OK to add the text file as an attachment, and set the name of the attachment to something like orthologs.

After the attachment has been added, we need to create a new attribute telling TreeViewer that the sequences whose accession numbers are contained in the attachment are true orthologs. To do this, we can use the Add attribute module. You can enable this attribute by clicking on the Add module button in the Further transformations and then chosing the appropriate module (note that there are two modules called Add attribute - the one we are interested in is the one whose description says "Adds an attribute based on an attachment").

After adding the module, expand its options and set the Taxon list to the new attachment containing the true orthologs. Then, set the Match attribute to accession-nbr, the Attribute to Ortholog, the Attribute type to Number and the New value to 30. This module works by searching the tree for nodes whose accession-nbr is contained in the attachment that we selected; when it finds a matching node, it applies the specified attribute (i.e. Ortholog with value 30) to it. We will use this numeric attribute to draw a node shape only for the "true orthologs". Finally, click on the Apply button. Now, if you click on one of the tips of the tree representing the true orthologs, you will notice that the new attribute has been added to them.

To draw the shapes highlighting the true orthologs, we need to add an instance of the Node shapes Plot action module. This can be done by clicking on the Add module button in the Plot elements section and choosing the appropriate module. By default, this will draw a star at each of the tips of the tree, including the false positives at the bottom (the stars will be a bit small, so you may need to zoom in to actually see them). To make the stars only appear at the true orthologs, click on the wrench button next to the Size parameter: this will open a Number formatter window, in which you can set the Attribute name to Ortholog (i.e. the same attribute name that we used earlier) and the Default value to 0. This way, the size of the star will be 0 for all nodes except for the ones where this is overridden by the Ortholog attribute that we defined earlier. The tree should now look similar to the following figure:

Right now, each star is coloured in a random colour. To change this, uncheck the Auto fill colour by node check box and then change the Fill colour to an orange hue. If you wish to make the stars stand out even more, you can increase the Stroke thickness to 2 and set the Stroke colour to white. The tree should now look similar to the following:

Highlighting the query sequence

If you have a keen eye, you might notice that there is a branch close to the true orthologs that is only drawn in white and not in any of the colours corresponding to the various bit scores (see screenshot below).

This branch corresponds to the query sequence: this sequence (obviously) does not have a bit score, because it is not a result of the BLAST search. It might be relevant to highlight this sequence in the tree: for example, in this case it is nice that this sequence appears to be an outgroup to the other sequences, since the query sequence comes from an Alphaproteobacterium and the BLAST matches are all from a small group of Cyanobacteria.

The first step is to make sure that the branch corresponding to this sequence is actually visible in the tree. To do this, we need to assign an "artificial" BLAST score to the sequence. There are multiple ways in which this can be accomplished; a way that does not require us to click on the almost-invisible white branch is to use the Replace attribute module. You can add a new instance of this module by clicking on the Add module button under the Further transformations; then, you can expand the options for this module and set the search Attribute to node-info and the Value to query. Then, you can set the replacement Attribute to bit score (the Attribute type will automatically update itself to Number) and the Value to an arbitrarily high number like 1000. This will cause the module to search for a node in the tree that has a node-info attribute whose value is query (i.e. our query sequence) and to set the bit score attribute of that node to 1000. In this way, we can "pretend" that the query sequence had a bit score of 1000.

This should cause the branch corresponding to the query sequence to become visible on the tree; the tree should now look similar to the following figure:

To draw a shape at the node corresponding the query sequence, we will follow an approach similar to the one that we have used to highlight the "true orthologs": we will set an attribute on the branch corresponding to the query sequence, and then use that attribute to determine the size of the node shape.

Now that the branch corresponding to the query sequence is clearly visible, you can select it by clicking on it. This will open the Selection panel to the right; here, in the attributes section, you can click on the Add attribute button below the attributes to add a new attribute. A new window will open, in which you can set the Attribute name to Query, the Attribute type to Number and the Attribute value to 50. This will add an instance of the Add attribute module that adds the new attribute to the selected node. Note that you could have obtained the same result by using another instance of the Replace attribute module with similar settings to those that we have used above, but this way is quicker.

You can now add a new instance of the Node shapes module; click on the three dots next to the Size parameter to open the Number formatter window and set the Attribute name to Query and Default value to 0 (like before). Finally, uncheck the Auto fill colour by node, increase the Stroke thickness to 2 and set the Stroke colour to white.

The tree should now look similar to the following figure:

Updating the legend

The final step is to update the legend to explain what the differently coloured stars mean. To do this, open the options for the Legend module that was added by the Branch scores action and click on the button to edit the Markdown source of the legend. In the Markdown editor window, you can replace the source with the following:

# **Legend**

### Bit score ![](attachment://ScoreLegend)

### ![](star://11,11,#D55E00) Orthologs

### ![](star://11,11,#00A2E8) Query sequence

This will add a title to the legend and add an explanation for the orange stars and the blue star (make sure to use the same colour that you have chosen for the node shapes in your plot). See the readme for the Legend module for more information about the "special image" syntax used to draw the coloured stars.

Finally, there is a lot of unused space at the top-left of the plot, so there is no real reason to keep the legend at the bottom, where it occupies even more space. To re-position the legend, in the options for the Legend module you can change both the Anchor and Alignment to Top-left, and set the X and Y components of the Position to 0.

The final tree figure should be similar to the following:

You can now save the tree file or the plot as a PDF or SVG file using the items from the File menu. You can also download the BLAST.tbi tree file, which contains the tree along with all the modules. The finished tree file is also available in the Examples section of TreeViewer's welcome page in the File page.

Tips

  • The tree that you use as a starting point to create this kind of plot does not need to come from the online BLAST interface. You can use any kind of tree on whose tips you can assign a score.

  • Another option is to run a BLAST search on a genomic database on your local server and then extract both the sequences and the BLAST scores. You can then use the sequences to build a phylogenetic tree, on which you can add the scores and plot it following this tutorial.

  • Highlighting the branch scores on a tree in this way comes at a significant performance cost: since we are using 11 Branches modules, it is as if we were plotting the tree 11 times every time it is updated. If the tree contains thousands of taxa, this could take too long: in this case, the command-line interface of TreeViewer can be used to produce essentially the same plot, without overloading the computer.

  • The scores that are highlighted on the tree do not need to be BLAST scores - they can come from any kind of analysis, as long as they can be added as attributes on the tree.

References

Francisco M. Cornejo-Castillo, Jonathan P. Zehr, Hopanoid lipids may facilitate aerobic nitrogen fixation in the ocean, Proceedings of the National Academy of Sciences, Sep 2019, 116 (37) 18269-18271; DOI: 10.1073/pnas.1908165116

⚠️ **GitHub.com Fallback** ⚠️