08a. Visualizing and Understanding Assembly Graphs - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

NOW, WHERE WERE WE?

HPCC is offline right now. This is a modified version of the exercise I usually do. Fortunately, much of it doesn't rely on HPCC.

Given its importance to human health, the H. pylori genome has been sequenced many, many times. You can find many of those assemblies on the NCBI databases. The easiest way to find a reference Helicobacter pylori assembly is to just search for it in the search field at the top of the page. At the time I wrote this, there were 3,404 assemblies in the database. Click the link in the second sentence of this paragraph. That number may have gone up since mid-August, 2023. For this exercise, I want to use a particular assembly for reasons that will become apparent later. So, go directly to it using this URL: https://www.ncbi.nlm.nih.gov/nuccore/AE000511.1?report=genbank.

On the resulting page, you can find out tons of information on this genome assembly. The first bit of information I want you to notice is how long the genome is. You can find that on the very top line, 1,667,867 bp.

This will be under Assignment 8 - Assembly graphs.

How is the reference assembly different from yours with regard to length? Before answering, there are several things to consider.

First of all, which assembly should we compare to? We have three to choose from, 'hp_pe_only', 'hp_long_low', and 'hp_long_high'. As discussed in class, I hope, a good assembly is sometimes considered one that has high N50 and low contig/scaffold numbers. Based on those criteria and the table from the last exercise, we should go with 'hp_long_high'.

So, how do you figure out how long that assembly is? There are tons of ways but we're going to figure it out using only tools that you've already learned. And fortunately, Unicycler puts the length of each contig in the header. And there's only one contig making up this assembly. So, do the following:

[!WARNING]
Because HPCC is down, you can't do these steps.

cd hp_long_high

head assembly.fasta

You would have seen:

>1 length=1645852 depth=1.00x circular=true
ATGGATACCAACAATAATATTGAAAAAGAAATCTTGGCGCTAGTCAAACAAAATCCTAAAGTTAGCCTCA
TAGAGTATGAAAATTACTTGAGCCAACTCAAATACAACCCTAACGCAAGCAAGAGCGATATTGCCTTTTT
TTATGCCCCCAACAAATTCTTATGCACCACGATTACGGCTAAATACGGTGCGTTGCTCAAAGAAATTTTA
AGCCAGAATAAAGTCGGCATGCATTTAGCCCACAGCGTGGATGTGCGTATTGAAGTAGCACCTAAAATCC
AGGTTAATGCCCAATCTAATATCAATTACAAAGCCACAAAAACGAGCGTCAAAGACTCTTACACTTTTGA
AAATTTTGTCGTAGGCTCATGCAACAACACCGTTTATGAAATCGCTAAAAAAGTCGCCCAAAGCGATACC
CCCCCTTATAACCCGGTGCTTTTTTATGGCGGCACAGGGTTAGGCAAAACGCACATTTTAAACGCTATCG
GTAACCATGCCCTAGAAAAGCATAAAAAAGTCGTGTTAGTCACTTCAGAAGATTTTTTGACAGACTTTCT
AAAGCATTTAGACAACAAAAACATGGATTCTTTTAAAAAAAAATACCGCCATTGCGACTTTTTCTTGTTA

So, 1,645,852 bp. Now answer question 1.

What are some genetic/molecular events that could explain how these two assemblies of the same species could be so different in size?

On the Genbank page showing the H. pylori genome, you are viewing the features of the genome. We need to switch to seeing the actual DNA sequence. You can do that by clicking the 'FASTA' link at the top of the page. Do so. You should immediately see a repeat in the reference genome sequence when viewing it in a web browser.

If you assume a k-mer of 7, count how many times the k-mer ‘TGATTAG’ occurs in the first few lines of the reference assembly from NCBI?
Could this k-mer occur more than that number of times in the entire assembly?

Assuming we assembled this genome correctly using the short and long reads (with high coverage), we should find something similar in our versions of the genome assembly. So, let's look for four copies of this repeat in the short read-only version de Bruijn graphs. We can use grep to search for it.

[!WARNING]
Because HPCC is down, you can't do these steps.

grep CACTAATCACTAATCAC assembly.fasta

[!NOTE] On HPCC, you would have seen.

grep ACTAATCACTAATCACTAATCACTAATC assembly.fasta
CTATAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAAT
CACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAAT
CACTAATCACTAATCACTAATCACTAATCACTAAAAAAGGGGGTTGGTGGGATTTGAACGGTGGTAGGAA

So, yes, that repeated region is found in the final assembly when we used long and short reads.

What about when we use only short reads? First, recall that the paired-end-only assembly was fragmented, unlike the hp_long_high assembly.

If you moved to the hp_pe_only directory and performed the same search, you would see this.

cd ../hp_pe_only
grep CACTAATCACTAATCAC assembly.fasta
CACTAATCACTAATCACTAATCACTAATCACTAATCACTAAAAAAGGGGGTTGGTGGGATTTGAACGGTG

Notice that the output is much shorter, only a few instances of the repeat.

We can't do that so we're going to us a program called Notepad++ instead. It should be installed on your PC in the classroom.

Using Notepad ++, search the hp_pe_only assembly file linked here for the sequence CACTAATCACTAATCAC.

You should see:

So, the sequence is there on at least one contig but maybe two. Because this assembly is fragmented it might be nice to know on which contig(s) it's located and where on that contig (in the middle or at one end?).

If you search for other instances of this sequence, you won't find it. However, you should keep in mind that DNA is double-stranded. Maybe the sequence is on the other strand. If so, it would be the reverse complement of the one we searched for. Try to find GATTAGTGATTAGTGATTAGTG using Notepad++.

You will see this:

Assuming the scaffolds in the fasta file are in numerical order (which they are), it looks like there's one instance on scaffold 1 and another instance on scaffold 10.

How would the computer attempt to put these scaffolds together to make a de Bruijn graph of the genome?

To investigate the answer to this question, we can use the .gfa files. Let's start with the one for the hp_pe_only assembly linked here.

[!WARNING] Still can't use HPCC. This is what you would see if you could.

head assembly.gfa
S       1       AAAAACAAAGAATTTAAAAACAAAGAATTTAAAAACAAGGCTTCAAAACAAAACCCCCCCAAAAAAAAAGAGAGTTTAAAAGAAAGACTTTAAAAACAAAGACTTCAAAACAAAAACCCCCAAAAAAAAGAGTTCAAAAAATCAAAAAAAAGAAATCCCTTAAAAAGGGAGTTTCACAAAGGGTTTTAACGCTTTAGTAAGCGAACACATAATTCAAATACACGCTATACAATCTTCGGTATTGGAGTTGAGTGCCTAGCAAAGAATAGTAATTCGTGTTGATCGTGGGGATCTTCACGCCTAATTCCACGCCATGCTGAGCCGCATGATCGCTCGCTTTTTTCTTATTCTTAGCGAGGTTCATTCTCAAGCCTAAATTGAACAAGAATTGGAAATTCGCCACATTCATTTTAGCGCTATAGAAATTATTGAAGGTCGCTAAATTCACATACTGGGAATTCAGCCATGAAGTGCCAGCTAACGCAATCCCCCCAAACACCCCAAAAGAAATCTTGCTGTTTTTGGTGGTTTTATCATTGATAAAGTTATAGAGGACATCTGTCCCTACCCCATAAGTGAACACATCAGAAGCCGAATTGAAAAAGCTAGATTTGATATAAGCATGGTTATAGTCAAAAAAGCCGTAATACCTTAACCCCCACCTTCTCTTTTCACCAAAAAATTGTTTGTAGCCCACTTGCACACCGATCCCATTCATGGCACCATTGTTGGTTTGAGAGCTGATTAATCCAACGCGTCTGAAAGGGTTATGCCCTAATTCTTGCGTGGCGCTTAATAATTGCGAATAAGCGCTTTGGTTGACTTGATAAACGGCCTGTAAGCCCCCGGGGTTA....

tail assembly.gfa
L       92      +       63      +       0M
L       92      +       92      +       0M
L       1       +       34      +       0M
L       4       +       31      +       0M
L       3       +       31      +       0M
L       13      +       37      +       0M
L       50      +       14      -       0M
L       27      +       40      +       0M
L       59      +       47      -       0M
L       59      -       47      +       0M

Since you can't use HPCC, open the file with Notepad++. Scroll through it a bit to see what it looks like. You should see regions similar to what are pictured above.

We don't need to get into the weeds about the details of this file format but you can see right away that there are at least two types of lines. 'S' refers to segments, which have DNA sequence data. 'L' refers to links, which tells you how those segments are connected. For example, there is a link between segments 3 and 59, which are both in the forward orientation (+) compared to the sequence that is stored in the file. More on this later. If you want to learn more about .gfa files, go to https://peerj.com/articles/2681/.

Let's look for the repeat (GATTAGTGATTAGTGATTAGTGA) in the .gfa file using Notepad++. You'll see.

Notice that, just like in the .fasta file, we find it at the end of S 10. So, at the end of segment 10. Any guess as to where we'll find the reverse complement?

Yep, at the end of S 1. Search for CACTAATCACTAATCAC. You will see:

Using the information discussed in class, explain why this sequence was found at the ends of the contigs and not somewhere in the middle of them. You'd think that these two contigs would overlap with one another and therefore should have been joined into a single contig. What is an explanation as to why they weren't they joined?

We'll investigate this two ways, using the text of the .gfa and using a visualization program.

First, let's track down the contigs in question in the text of the .gfa file. It should still be open in Notepad++. If not, repoen it. If you go to the bottom of that file, there are several lines of interest.

Circled at the top is our repeat, segment 73. It's 7 nucleotides long, just as we found earlier. Below that are two circled lines indicating that that segment 73 is linked to segment 1 and also to itself. That second part is important. In being linked to itself, that means that the segment is repeated at least once. In our case, multiple times. If you scroll down a bit farther, you'll find that segment 73 is also linked (you guessed it) to segment 10, just as we found in our investigation above.

Now download and install the appropriate version of Bandage for your computer (https://github.com/rrwick/Bandage/releases/). This package allows you to visualize assembly graphs.

Download and open the assembly.gfa file for hp_pe_only (make sure it's the .gfa file, not the .fasta) using Bandage and click ‘Draw graph’.

Locate contigs 1 and 10 on the graph by clicking on the colored lines and looking at the box to the right (selected node). How do they link to one another in the graph? Do they link together directly or through another contig? What's the name of that short contig? How long is that short contig and is it visited more than once, as indicated by a looping line?

Also, remember earlier when we viewed the tail end of the .gfa file and saw this line?

tail assembly.gfa
L       3       +       59      +       0M

Here's what that looks like in bandage. The curved blue line is the link connecting segment 3 (in red) and segment 59 (in light blue).

The curved blue line is the link connecting segment 3 (in red) and segment 59 (in light blue).

In a separate Bandage instance, open the assembly.gfa from the hp_long_high and hp_long_low runs. How does the shape of these graphs compare? Explain the differences.
Two final questions. These graphs have no ends. They all circle back on themselves. Why is that?

So, if the genome is circular, that (TGATTAG)N repeat should have flanking sequences on both sides. In other words, the linear sequence depicted on the NCBI webpage really loops back on itself.

Compare the two ends of that linear sequence on the NCBI webpage to the flanks we extracted earlier. What similarities and differences do you observe and what biological explanation best explains those trends?

There are nine questions. Please upload your answers in a single Word document to Assignment 8 - Assembly graphs.

Acknowledgements

Shaun Jackman (https://sjackman.ca/) was helpful in generating this tutorial.