11. Read Mapping and pileups - davidaray/Genomes-and-Genome-Evolution GitHub Wiki
Conceptually, you know that in the previous exercises the reads were put together essentially like puzzle pieces to generate the assembly. However, it brings the point home to actually see those puzzle pieces in the context of the final assembly. That’s what this exercise is about.
We will use two software packages. Tablet (http://bioinf.scri.ac.uk/tablet/) will need to be installed on your own computer or you can use the computers in the classroom. Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is installed on HPCC but to ensure that we're all using the same version, please use the version installed on our singularity container.
What is bowtie2? Stealing blatantly from the bowtie2 site, "Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s of characters to relatively long (e.g. mammalian) genomes." In other words, it's a 'read mapper'. It will take your raw sequencing reads and find the place in an existing genome assembly they best match. You can use this tool to find polymorphisms in the data or in data from another individual of the same or closely related species. You can also use it to find out how well your assembly is by comparing it to the raw data. For example, if you have a really bad assembly relatively few of the reads, which came directly from the organism, will map well. If it's a good assembly, most of them will map well. We will be taking advantage of read mapping later in the semester as you learn how to identify variants in a genome.
To set up this exercise, we need a directory to work in, a copy of the genome assembly, and some reads to map to the assembly.
Before you get started though. get an interactive session so that you're not using login node resources. You should know how to do this.
mkdir -p /lustre/scratch/[eraider]/gge2024/unicycler/read_mapping
cd /lustre/scratch/[eraider]/gge2024/unicycler/read_mapping
To save space, we’re just going to link the files we need rather than copying them to this new location. Note that we’re going to be using the H. pylori data once again for this example. Specifically, we’ll use the high quality hybrid assembly from the long and short reads. Note the use of relative paths in the commands below. Be aware that you could also just use the same path to direct bowtie2 to the reads when you run bowtie2 later in this exercise. But, because I want to do some other things, we're going to link the data to from it's actual location to here.
ln -s ../../data/helicobacterpylori/short_reads_1.fastq.gz
ln -s ../../data/helicobacterpylori/short_reads_2.fastq.gz
ln -s ../../unicycler/hp_long_high/assembly.fasta hp_long_high_assembly.fasta
Just so you can see, run the following command, ls -lhrt
. This will show you the current contents of your directory. You should see something that looks like this in the list that appears.
Notice the far left column. That initial 'l' indicates that the files listed don't actually exist at this path but, rather, are representations of real files that exist at another location. That actual location is all the way to the right.
Just as a reminder of what these files are, you can peek at your reads by using:
zcat short_reads_1.fastq.gz | head
and/or head hp_long_high_assembly.fasta
To accomplish our mapping, the first thing to do is index the genome assembly. But, as always we need to set up to use our singularity container.
WORKDIR=/lustre/scratch/[eraider]/gge2024
GGE_CONTAINER=/lustre/scratch/[eraider]/gge2024/container
Now,
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif bowtie2-build hp_long_high_assembly.fasta hp_long_high_assembly.index
While this runs, think about what indexing is. Indexing a genome can be explained similar to indexing a book. If you want to know on which page a certain word appears or a chapter begins, it is much more efficient/faster to look it up in a pre-built index than going through every page of the book until you found it. Same goes for alignments. Indices allow the aligner to narrow down the potential origin of a query sequence within the genome, saving both time and memory.
Next, we’ll map all of the short reads to the assembly. Note the -p option. This should match the number of processors you snagged for your interactive session, in this case, 1.
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif bowtie2 -p 1 -x hp_long_high_assembly.index -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -S hp_long_high_assembly.sam
First question. Upload your answers in a single Word document to Assignment 13 - Read mapping.
- Use your google-fu to figure out what each of those options in the command line mean. For example, -p does what? -x tells the program what?
If all goes well, in about 10 minutes (how could you make it go faster?) you will get output that looks something like this:
$ bowtie2 -p 1 -x hp_long_high_assembly.index -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -S hp_long_high_assembly.sam
2000001 reads; of these:
2000001 (100.00%) were paired; of these:
136812 (6.84%) aligned concordantly 0 times
1819673 (90.98%) aligned concordantly exactly 1 time
43516 (2.18%) aligned concordantly >1 times
----
136812 pairs aligned concordantly 0 times; of these:
91519 (66.89%) aligned discordantly 1 time
----
45293 pairs aligned 0 times concordantly or discordantly; of these:
90586 mates make up the pairs; of these:
55907 (61.72%) aligned 0 times
27947 (30.85%) aligned exactly 1 time
6732 (7.43%) aligned >1 times
98.60% overall alignment rate
What does it all mean? Some is obvious. We already knew we had 2,000,001 reads that were paired with one another. What about 'concordant' and 'discordant'? Concordant pairs match expectations based on the assembly. In other words, two reads are concordant if they map within the expected insert size range and in the proper orientation relative to one another. The fact that ~91% of our reads mapped concordantly is good news for our assembly. High rates of discordant reads suggests an assembly that isn't put together particularly well. But may also indicate an important source of evolutionary novelty, structural variation. This can be important, specially if you're using reads from a different individual.
What about that ~2% that aligned concordantly more than once? Remember that repeat region. Because it's larger than the read length, there will be multiple reads that will map there multiple times and in the right orientation but still fall within the insert size range.
~7% (136,768) of the pairs didn't map concordantly but ~67% of them did align at least once, just outside of the expectations (probably farther apart or closer together than expected.
That leaves 90,490 reads that wouldn't map correctly with their presumed partner. That makes them 'singletons'. For 55,907 of them Bowtie2 was never able to find a place to put them. This may suggest contamination of the sample used to create the library. It may also suggest that the assembly is missing some regions of the genome. This could happen if the assembly isn't complete. In other words, there may be complicated regions that were difficult to assemble and the assembly left them out. The rest were able to be mapped as singletons at least once. That brings us to an overall rate of successful mapping of over 98%.
Note that the output is a .sam file. If you ls -lhrt
that directory, you’ll note that it’s a HUGE file. After all, it’s storing the reads and their quality scores but also all sorts of information about every single read from each of the two .fastq files. You'll recall from the first tutorial that each of these files consist of ~4 million reads. So given that the .sam file is storing information on 4 million reads, it would be expected to be pretty large. To learn more about this format, you can go here: https://en.wikipedia.org/wiki/SAM_(file_format) but here are the basics.
$ head -7 hp_long_high_assembly.sam
@HD VN:1.0 SO:unsorted
@SQ SN:1 LN:1646095
@PG ID:bowtie2 PN:bowtie2 VN:2.3.5.1 CL:"/home/daray/conda/bin/bowtie2-align-s --wrapper basic-0 -x assembly.index -S assembly.sam -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz"
DGL970R1:205:C6WR4ACXX:6:1101:4001:22265 99 1 574285 42 101M = 574461 277 TCCATGCATGCTCTTATGGGGAGGTCAAATCTTTAGAGCCGATCATTCACGCTTTAAAAGAGCCGATTTTAATCAGCGTTACCACCCATACCGGCTTTGAA @@CFFDFFHHHHHJJJJJJFIIIIDGIJJJJJJIAGGJJJJJJJJJJJIJJJIJICIGIJGHHHF>CCDEECDD@@CDDDDDDDCBDDDCDDDDBDDDDCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP
DGL970R1:205:C6WR4ACXX:6:1101:4001:22265 147 1 574461 42 101M = 574285 -277 TGGGAAAAAAACTTAAAACGCTTGAAAACTTTGGTGGTTACAGAAGCGGAATTGTGGTTTAACGTGTTTGATACGGCTCAAAAATTAGGGGCAAAAACCAT CDCCB@?CAB@C@CDDDBDDCCDDDDC>CAACCFEEFFHHGHHHHE@;IHFHGIGGGBCHGGGHJJJIHEGD?GIHDCDIGJIIEGIGFHGGHFDDDFCBB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP
DGL970R1:205:C6WR4ACXX:6:1101:4059:22317 83 1 1399321 42 101M = 1399099 -323 AATTACCTTATATTTACAACGCCCACCCTTTCATGGCCAATCAAAAACGCTTCATCCCCAACCAACAAGAAATTATCATGCTGCATTTTTATTTTGTAGGG >:>>4CCDEEDDADCBBBBDDDB?=@ECCCCEDFFFFFFDGGIIGCCHG@DJJJJIHGJHIIJIIIIDJJJIHEFAEHEJGIIHIGDIHHHHFFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP
DGL970R1:205:C6WR4ACXX:6:1101:4059:22317 163 1 1399099 42 101M = 1399321 323 CTCGCCAAAAATTTTCCCTCTATGAGCATTACCTTAAAGAAAAAGACATGCAAATCATTTGTGAAAACCACTATAACGTTGGGTTTTTAATCGTGAATTTA CCCFFFFFHHGD>FGEABHHH<EEBGIGGIIIJGHG9DDGGCHDGEHI<B?FEHHGHGGIIIIFHE;CG=A>AHFFBBBCDDD??BBB<:C3<?B>@@C>> AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP
The first three lines are the header and just provide some information on what this file contains and how it was generated. Every line after that provides details on the reads that were mapped. For each line:
Col | Field | Type | Brief description | My notes |
---|---|---|---|---|
1 | QNAME | String | Query template NAME | The name of the read given by the sequencer |
2 | FLAG | Int | bitwise FLAG | See information below |
3 | RNAME | String | References sequence NAME | The name of the contig or scaffold to which the read mapped |
4 | POS Int | 1- based leftmost mapping POSition | Self explanatory | |
5 | MAPQ | Int | MAPping Quality | A log score indicating the mapping score (probability of mapping correctly) |
6 | CIGAR | String | CIGAR String | Concise Idiosyncratic Gapped Alignment Report (CIGAR) string (not important for now but, https://genome.sph.umich.edu/wiki/SAM#What_is_a_CIGAR.3F |
7 | RNEXT | String | Ref. name of the mate/next read | For us, it's '=', meaning that the next read has the same name (remember, paired reads) |
8 | PNEXT | Int | Position of the mate/next read | Indicates where the next read maps |
9 | TLEN | Int | observed Template LENgth | It's a little more complicated but essentially, this is how far apart the paired reads mapped |
10 | SEQ | String | segment SEQuence | The actual sequence of the read |
11 | QUAL | String | ASCII of Phred-scaled base QUALity+33 | The quality information for the read |
These are followed by a series of optional fields that provide additional information. |
Flags are indicators that tell us about the reads themselves in a simple format. To interpret the flags, you can use this site.
Here comes your second question.
- What are the flags for the first two reads and what do they mean, according to the site I linked? If you were to translate that information into something your grandmother could understand, how would you say it?
Now, modify the commands above to map your reads to the two lower quality assemblies, the 'hp_low_long' assembly and the 'hp_pe_only' assembly. As before, the steps are to 1) link the assembly using a unique name, index it, then map the reads.
- Compare the alignment rate from when we mapped to the 'good' assembly to the other two. How do you explain any difference in the mapping rates?
The next step is to generate an index of the mapped reads file using samtools. We will use samtools
to convert the mapped reads in .sam format to .bam format (a binary format to save space), sort the .bam file, and index the .bam file.
Create another index of the genome using samtools
.
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif samtools faidx hp_long_high_assembly.fasta
Each mapping tool has their own way to index. the index from samtools is just one file, 'assembly.fasta.fai'. The index from bowtie2 consists of several files, all ending in '.bt2'. For example, if you head
the samtools-based index, you get something pretty simple. It's basically just the number of the contig, the length of the contig and some other numbers that we don't need to get into. But if you're interested, http://www.htslib.org/doc/faidx.html.
$ head hp_long_high_assembly.fasta.fai
1 1646095 44 70 71
If you head
the 1.bt2 file of the bowtie2-based index, you can't even read it because it's a binary file that is not human-readable.
Convert the .sam file to a .bam file using the view command. -b = convert to bam format.
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif samtools view -b hp_long_high_assembly.sam > hp_long_high_assembly.bam
Next, sort the .bam file. Sorting does just that. By default, the sort command arranges the file such that reads are aligned by leftmost coordinates. In other words, the read that maps farthest to the left on the assembly if we were to look at it on a page will be the first one listed. There are other ways to sort but this will work for us.
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif samtools sort hp_long_high_assembly.bam >hp_long_high_assembly.sort.bam
The sorted .bam also needs to be indexed.
singularity exec -B $WORKDIR:$WORKDIR $GGE_CONTAINER/gge_container_v4.sif samtools index hp_long_high_assembly.sort.bam
For the next step, you’ll need to download four files to your local computer to use in Tablet: 'hp_long_high_assembly.sort.bam', 'hp_long_high_assembly.fasta' 'hp_long_high_assembly.sort.bam.bai', and 'hp_long_high_assembly.fasta.fai'. You should be able to do this easily using BigVise on Windows or FileZilla on a (slight gagging noise) Mac. Make sure you know where the files you're downloading will be saved on your local computer.
In Tablet, click 'Open Assembly' and use the .bam file as your primary assembly file and the .fasta file as your Reference/Consensus file. (See the figure below.) Also, these are large files. It will take a minute or two to download.
So, after all that you can now see the way the reads work together to generate the assembly by clicking on Contig 1 in the window to the left of the screen. Each read is put together with other reads and they overlap, just like we've discussed in class. Further, the paired reads are X bp apart from one another as you will see in a few minutes. A few things to notice. First, the reads pile up on one another indicating that each base has been sequenced multiple times. Also notice that while most of the reads covering any give base give the same nucleotide identity, some of them have a different base.
-
Find a read in the assembly with at least two erroneous base calls (mismatches to the consensus) and note where the errors likely are. You can make them show up better by playing with the 'Variants' slider in the top panel. You can get the name of the read by hovering over it with your mouse. What is the name of that read?
-
How long are the reads that were used to assemble this genome? How did you figure this out?
Look at the top of the screen and find the Read Packing button (Circled in red in the picture below). Change it to “Packed (Paired End)”. Also change the view by choosing the “Visible reads” tab in the panel to the right (Circled in blue in the picture below).
- For me, there is a single read that ends at exactly position 2968 in the assembly. The same should be true for you. What is the name of that read?
Choose 20 random reads that were 'properly paired' and record their sequence IDs and the sequence ID of their paired read in a spreadsheet. You can do this by scrolling up and down in the left panel or left to right in the right panel. Hover over one of them and record the start and end positions of each member of the pair. Also record the insert sizes for each randomly chosen set.
-
Does the insert size encompass the reads themselves or just the space between them? How do you know this?
-
Determine the mean insert size for your set.
-
How does the mean compare to what you saw in the unicycler log for the hp_long_high_assembly from a couple of weeks ago?
-
Why are the insert sizes not all the same?
-
One last thing. Use the 'Jump to Base' function (along the top row of the Tablet window) to jump to base 1562677. Examine this region and region to the left and right. We've talked about this part of this genome quite a bit. What is it? How long is this region and why is that important for this assembly?