08. Visualizing and Understanding Assembly Graphs - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

NOW, WHERE WERE WE?

First things first, let's get back to where we were working in the previous exercise...

cd /lustre/scratch/[eraider]/gge2023/unicycler

Given its importance to human health, the H. pylori genome has been sequenced many, many times. You can find many of those assemblies on the NCBI databases. The easiest way to find a reference Helicobacter pylori assembly is to just search for it in the search field at the top of the page. At the time I wrote this, there were 3,404 assemblies in the database. Click the link in the second sentence of this paragraph. That number may have gone up since mid-August, 2023. For this exercise, I want to use a particular assembly for reasons that will become apparent later. So, go directly to it using this URL: https://www.ncbi.nlm.nih.gov/nuccore/AE000511.1?report=genbank.

On the resulting page, you can find out tons of information on this genome assembly. The first bit of information I want you to notice is how long the genome is. You can find that on the very top line, 1667867 bp.

This will be under Assignment 8 - Assembly graphs.

How is the reference assembly different from yours with regard to length? Before answering, there are several things to consider.

First of all, which assembly should we compare to? We have three to choose from, 'hp_pe_only', 'hp_long_low', and 'hp_long_high'. As discussed in class, I hope, a good assembly is sometimes considered one that has high N50 and low contig/scaffold numbers. Based on those criteria and the table from the last exercise, we should go with 'hp_long_high'.

So, how do you figure out how long that assembly is? There are tons of ways but we're going to figure it out using only tools that you've already learned. And fortunately, Unicycler puts the length of each contig in the header. And there's only one contig making up this assembly. So, do the following:

cd hp_long_high

head assembly.fasta

You should see:

>1 length=1645852 depth=1.00x circular=true
ATGGATACCAACAATAATATTGAAAAAGAAATCTTGGCGCTAGTCAAACAAAATCCTAAAGTTAGCCTCA
TAGAGTATGAAAATTACTTGAGCCAACTCAAATACAACCCTAACGCAAGCAAGAGCGATATTGCCTTTTT
TTATGCCCCCAACAAATTCTTATGCACCACGATTACGGCTAAATACGGTGCGTTGCTCAAAGAAATTTTA
AGCCAGAATAAAGTCGGCATGCATTTAGCCCACAGCGTGGATGTGCGTATTGAAGTAGCACCTAAAATCC
AGGTTAATGCCCAATCTAATATCAATTACAAAGCCACAAAAACGAGCGTCAAAGACTCTTACACTTTTGA
AAATTTTGTCGTAGGCTCATGCAACAACACCGTTTATGAAATCGCTAAAAAAGTCGCCCAAAGCGATACC
CCCCCTTATAACCCGGTGCTTTTTTATGGCGGCACAGGGTTAGGCAAAACGCACATTTTAAACGCTATCG
GTAACCATGCCCTAGAAAAGCATAAAAAAGTCGTGTTAGTCACTTCAGAAGATTTTTTGACAGACTTTCT
AAAGCATTTAGACAACAAAAACATGGATTCTTTTAAAAAAAAATACCGCCATTGCGACTTTTTCTTGTTA

So, 1645852 bp. Now answer question 1.

What are some genetic/molecular events that could explain how these two assemblies of the same species could be so different in size?

On the Genbank page showing the H. pylori genome, you are viewing the features of the genome. We need to switch to seeing the actual DNA sequence. You can do that by clicking the 'FASTA' link at the top of the page. Do so. You should immediately see a repeat in the reference genome sequence when viewing it in a web browser.

If you assume a k-mer of 7, what is the multiplicity of the k-mer ‘TGATTAG’ in the reference assembly from NCBI?

Assuming we assembled this genome correctly, we should find something similar in our versions of the genome assembly. So, let's look for four copies of this repeat in the short read-only version de Bruijn graphs. We can use grep to search for it.

grep ACTAATCACTAATCACTAATCACTAATC assembly.fasta

grep ACTAATCACTAATCACTAATCACTAATC assembly.fasta
CTATAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAAT
CACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAAT
CACTAATCACTAATCACTAATCACTAATCACTAAAAAAGGGGGTTGGTGGGATTTGAACGGTGGTAGGAA

Keep in mind that DNA has two strands. So, it's possible that any of the other contigs could also have this sequence but it may have been assembled from the complementary strand, and would therefore consist of the reverse complement DNA sequence. The reverse complement of 'TGATTAGTGATTAGTGATTAGTGATTAG' is 'CTAATCACTAATCACTAATCACTAATCA'. So, if any of your searches fail, consider using the reverse complement sequence.

The end results should be that yes, that repeated region is found in the final assembly when we use long and short reads.

What about when we use only short reads? First, recall that the paired-end-only assembly was fragmented, unlike the hp_long_high assembly.

Move to the hp_pe_only directory and perform the same search.

cd ../hp_pe_only
grep ACTAATCACTAATCACTAATCACTAATC assembly.fasta
CACTAATCACTAATCACTAATCACTAATCACTAATCACTAAAAAAGGGGGTTGGTGGGATTTGAACGGTG

If you get no result, remember that depending on the assembly, it may be on the opposite strand. So,

grep GATTAGTGATTAGTGATTAGTGATTAGT assembly.fasta
ATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTATAGCATCATTTTTTAAAATCATGT

So, the sequence is there on at least one contig but maybe two. Because this assembly is fragmented it might be nice to know on which contig(s) it's located and where on that contig (in the middle or at one end?).

For that we can take advantage of the graph file (.gfa). Before searching for the sequence, let's see what the actual file looks like. The file is quite long but we can get an idea using head and tail, which would print the first and last ten lines, respectively, to the screen.

Still working in the hp_pe_only directory...

head assembly.gfa
S       1       AAAAACAAAGAATTTAAAAACAAAGAATTTAAAAACAAGGCTTCAAAACAAAACCCCCCCAAAAAAAAAGAGAGTTTAAAAGAAAGACTTTAAAAACAAAGACTTCAAAACAAAAACCCCCAAAAAAAAGAGTTCAAAAAATCAAAAAAAAGAAATCCCTTAAAAAGGGAGTTTCACAAAGGGTTTTAACGCTTTAGTAAGCGAACACATAATTCAAATACACGCTATACAATCTTCGGTATTGGAGTTGAGTGCCTAGCAAAGAATAGTAATTCGTGTTGATCGTGGGGATCTTCACGCCTAATTCCACGCCATGCTGAGCCGCATGATCGCTCGCTTTTTTCTTATTCTTAGCGAGGTTCATTCTCAAGCCTAAATTGAACAAGAATTGGAAATTCGCCACATTCATTTTAGCGCTATAGAAATTATTGAAGGTCGCTAAATTCACATACTGGGAATTCAGCCATGAAGTGCCAGCTAACGCAATCCCCCCAAACACCCCAAAAGAAATCTTGCTGTTTTTGGTGGTTTTATCATTGATAAAGTTATAGAGGACATCTGTCCCTACCCCATAAGTGAACACATCAGAAGCCGAATTGAAAAAGCTAGATTTGATATAAGCATGGTTATAGTCAAAAAAGCCGTAATACCTTAACCCCCACCTTCTCTTTTCACCAAAAAATTGTTTGTAGCCCACTTGCACACCGATCCCATTCATGGCACCATTGTTGGTTTGAGAGCTGATTAATCCAACGCGTCTGAAAGGGTTATGCCCTAATTCTTGCGTGGCGCTTAATAATTGCGAATAAGCGCTTTGGTTGACTTGATAAACGGCCTGTAAGCCCCCGGGGTTA....

tail assembly.gfa
L       92      +       63      +       0M
L       92      +       92      +       0M
L       1       +       34      +       0M
L       4       +       31      +       0M
L       3       +       31      +       0M
L       13      +       37      +       0M
L       50      +       14      -       0M
L       27      +       40      +       0M
L       59      +       47      -       0M
L       59      -       47      +       0M

We don't need to get into the weeds about the details of this file format but you can see right away that there are at least two types of lines. 'S' refers to segments, which have DNA sequence data. 'L' refers to links, which tells you how those segments are connected. For example, there is a link between segments 3 and 59, which are both in the forward orientation (+) compared to the sequence that is stored in the file. More on this later. If you want to learn more about .gfa files, go to https://peerj.com/articles/2681/.

Let's look for the repeat in the .gfa file.

grep GATTAGTGATTAGTGATTAGTGATTAGT assembly.gfa

By scrolling up and down results in the terminal, you should see that this sequence was found at the ends of contig 1. You should get the same answer if all went well but it's possible that you'll get a different result.

Take note of the DNA sequence that flanks the repeated region.

On contig 1:

ATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTAGTGATTATAGCATCATTTTTT

Remember, that the reverse complement could also exist in the assembly and the graph. So,

grep ACTAATCACTAATCACTAATCACTAATC assembly.gfa

You should find it on on contig 11:

CTAATCACTAATCACTAATCACTAATCACTAATCACTAATCACTAAAAAAGGGGG

Using the information discussed in class, explain why this sequence was found at the ends of the contigs and not somewhere in the middle of them. You'd think that these two contigs would overlap with one another and therefore should have been joined into a single contig. What is an explanation as to why they weren't they joined?

We'll investigate this two ways, using the text of the .gfa and using a visualization program.

First, let's track down the contigs in question in the text of the .gfa file. Open it in a text editor. If you're on the computers in 405, you can use Notepad++. If you go to the bottom of that file, there are several lines of interest.

Circled at the top is our repeat, segment 73. It's 7 nucleotides long, just as we found earlier. Below that are two circled lines indicating that that segment 73 is linked to segment 1 and also to itself. That second part is important. In being linked to itself, that means that the segment is repeated at least once. In our case, multiple times. If you scroll down a bit farther, you'll find that segment 73 is also linked (you guessed it) to segment 10, just as we found in our investigation above.

Now download and install the appropriate version of Bandage for your computer (https://github.com/rrwick/Bandage/releases/). This package allows you to visualize assembly graphs.

Download and open the assembly.gfa file for hp_pe_only (make sure it's the .gfa file, not the .fasta) using Bandage and click ‘Draw graph’.

Locate those same contigs on the graph. How do they link to one another in the graph? Do they link together directly or through another contig? How long is that contig and is it visited more than once? You can investigate this by clicking on the colored segments and seeing which nodes are selected in the 'selected node' box on the right side of the window.

Also, remember earlier when we viewed the tail end of the .gfa file and saw this line?

tail assembly.gfa
L       3       +       59      +       0M

Here's what that looks like in bandage. The curved blue line is the link connecting segment 3 (in red) and segment 59 (in light blue).

The curved blue line is the link connecting segment 3 (in red) and segment 59 (in light blue).

In a separate Bandage instance, open the assembly.gfa from the hp_long_high and hp_long_low runs. How does the shape of these graphs compare? Explain the differences.
Two final questions. These graphs have no ends. They all circle back on themselves. Why is that?

So, if the genome is circular, that (TGATTAG)N repeat should have flanking sequences on both sides. In other words, the linear sequence depicted on the NCBI webpage really loops back on itself.

Compare the two ends of that linear sequence on the NCBI webpage to the flanks we extracted earlier. What similarities and differences do you observe and what biological explanation best explains those trends?

There are eight questions. Please upload your answers in a single Word document to Assignment 8 - Assembly graphs.

Acknowledgements

Shaun Jackman (https://sjackman.ca/) was helpful in generating this tutorial.