BMA231 IV: Alignment visualization - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)


The aim of these exercises is to introduce you to the Integrative Genomics Viewer (IGV) so you can visualize sequencing data as well as some common statistics.



Data

For this practical, these are the files you will be using:

* Exome.bam
* Exome.bam.bai
* RNAseq.bam
* RNAseq.bam.bai
* RNAseq.bam.tdf

You can find these files in CANVAS under the R intro module:

  • Exome.zip
  • RNAseq.zip

Integrative Genomics Viewer - IGV

IGV is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data.

Download IGV and open it on your local computer. The IGV documentation is available in case you need it.

  • In the Tool bar make sure hg19 is loaded.
  • In the Menu bar go to File -> Load from file and choose Exome.bam file.

In the Search box, type chr10:76,150,047-76,163,384. This will zoom in to that specific position:

As you may remember, the Exome.bam track displays all the reads aligned to this position in the hg19 genome, and the Exome.bam coverage track displays the coverage (the number of reads covering each position).
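To make the idea of coverage concrete, here is a minimal sketch that counts how many reads overlap each position of a region. The read coordinates are made up for illustration, and the function name `per_base_coverage` is hypothetical; IGV computes this for you from the BAM file.

```python
def per_base_coverage(reads, region_start, region_end):
    """Count how many reads overlap each position of a region.

    `reads` is a list of (start, end) tuples, 1-based and inclusive.
    Returns one depth value per position from region_start to region_end.
    """
    depth = [0] * (region_end - region_start + 1)
    for start, end in reads:
        # Clip the read to the region so the indices stay in range.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi + 1):
            depth[pos - region_start] += 1
    return depth


# Three made-up reads over a 10 bp region (not taken from Exome.bam).
reads = [(1, 5), (3, 8), (4, 10)]
coverage = per_base_coverage(reads, 1, 10)
print(coverage)  # [1, 1, 2, 3, 3, 2, 2, 2, 1, 1]
```

The highest values (3) fall where all three reads overlap, which is what the tallest bars in the coverage track represent.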

Q1. Which gene are we looking at? Specifically, which exons? Include a screenshot.

Hover over or click on one of the reads. You will see a window with read information including the read name, mapping quality (MAPQ), the position where the read is mapped, how different parts of the read are mapped (important when looking for structural variants), information about where the mate is mapped, etc. You can toggle this feature on and off by clicking the yellow speech bubble.

If you accidentally load a different version of the human genome, you will see a lot of colored bases, which represent mismatches between the reads and the reference. Remember to always use the same version of the genome across all analyses. For example, if your analysis was done using GRCh37/hg19 and you load the alignment using hg38, you will see the following:

Alignment Customization

By right clicking in the alignment track you will find different options:

- Color alignment by:
   - insert size
   - read strand
   - first-of-pair strand

- View as pairs

Move around the alignment as you try these options, since not all the characteristics can be seen in the same location:

  • insert size - the length of DNA/RNA between the adapters; it includes read1, read2 and the inner distance.
    red reads indicate an inferred insert size larger than expected.
    blue reads indicate an inferred insert size smaller than expected.

  • read strand - the mapping orientation of the read
    red/pink indicates reads on the positive strand (5'->3'), and
    blue/purple reads on the negative strand.

  • first-of-pair strand - useful for directional RNA libraries.
    red/pink shows reads or read pairs with the forward read first and
    blue/purple shows reads or read pairs with the reverse read first.

  • view as pairs - displays the two reads of a pair linked together by a line connecting them.
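The coloring options above are derived from fields stored with every alignment: the SAM FLAG bits and the template length (TLEN). The sketch below decodes them with plain Python. The flag bit values come from the SAM specification; the function names and the `min_expected`/`max_expected` thresholds are assumptions chosen for illustration and may differ from the values your IGV installation uses.

```python
# FLAG bits defined in the SAM specification.
REVERSE = 0x10        # read is aligned to the reverse strand
MATE_REVERSE = 0x20   # mate is aligned to the reverse strand
FIRST_IN_PAIR = 0x40  # read is the first read of the pair


def read_strand(flag):
    """Strand of this read: '+' (forward) or '-' (reverse)."""
    return "-" if flag & REVERSE else "+"


def first_of_pair_strand(flag):
    """Strand of the first read of the pair, whichever mate this record is."""
    if flag & FIRST_IN_PAIR:
        return read_strand(flag)
    # This record is the second read, so the first read is its mate.
    return "-" if flag & MATE_REVERSE else "+"


def insert_size_class(tlen, min_expected=50, max_expected=1000):
    """Classify an insert size; the thresholds are hypothetical defaults."""
    size = abs(tlen)  # TLEN is negative for the rightmost read of a pair
    if size > max_expected:
        return "larger than expected"
    if size < min_expected:
        return "smaller than expected"
    return "as expected"


# Flags 99 and 147 are a typical properly paired read and its mate.
print(read_strand(99), first_of_pair_strand(99))    # + +
print(read_strand(147), first_of_pair_strand(147))  # - +
print(insert_size_class(-2500))                     # larger than expected
```

Note that both reads of the pair report the same first-of-pair strand even though they map to opposite strands, which is why this coloring is useful for directional libraries.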

Q2. Find a region with a read displaying an insert size larger than expected. Take a screenshot with the read pair linked. Hint: Remember to browse the alignment.

Single Nucleotide Variants

As you have noted, you need to zoom in quite a bit in order to see any alignment/coverage. To overcome this, let's generate a coverage track:

  • Go to Tools -> Run igvtools
  • Select count
  • Select Exome.bam
  • Click Run

This will create a tdf file, a pre-processed file for faster display. Load Exome.bam.tdf once you have it.

Now if you zoom out, you will be able to identify regions with coverage.
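Conceptually, a pre-processed coverage track stores coverage summarized over windows rather than per read, which is why it can be drawn at any zoom level. The following is a minimal sketch of that idea, averaging per-base depth over fixed windows. The function name `windowed_mean`, the window size of 25 and the depth values are assumptions for illustration; the actual tdf format is more elaborate.

```python
def windowed_mean(depth, window=25):
    """Average per-base depth over consecutive windows of `window` bases."""
    means = []
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]  # the last chunk may be shorter
        means.append(sum(chunk) / len(chunk))
    return means


# Made-up depth: 50 uncovered bases, then 25 bases at 30x and 25 at 10x.
depth = [0] * 50 + [30] * 25 + [10] * 25
summary = windowed_mean(depth)
print(summary)  # [0.0, 0.0, 30.0, 10.0]
```

One hundred per-base values are reduced to four window averages, and the covered region is still clearly distinguishable from the uncovered one.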



Besides the reads, you can also view nucleotides. When zooming in (a lot!), you can see the reference sequence at the bottom of the window. If any base in a read does not match the reference sequence, it will be highlighted in a different color. By clicking on a bar in the coverage track where you can see coloured nucleotides, a window pops up with information about how many reads are mapped to this position and which nucleotides are represented:



As we will see later, mismatching nucleotides between the reference and the sample account for one type of genetic variation: Single Nucleotide Variants or SNVs. These can be heterozygous if both alleles are different or homozygous if both alleles are identical. This can be easily visualized as follows:
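The base counts shown in the coverage pop-up can be turned into a rough heterozygous/homozygous call by looking at the fraction of reads that differ from the reference. The sketch below illustrates that reasoning only; the function names and the thresholds (`het_range`, `hom_min`, `ref_max`) are assumptions picked for illustration, not the rules of any variant caller.

```python
def alt_allele_fraction(base_counts, ref_base):
    """Fraction of reads whose base differs from the reference base."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    return (total - base_counts.get(ref_base, 0)) / total


def classify_site(base_counts, ref_base,
                  het_range=(0.3, 0.7), hom_min=0.9, ref_max=0.1):
    """Rough call from base counts; all thresholds are hypothetical."""
    frac = alt_allele_fraction(base_counts, ref_base)
    if frac >= hom_min:
        return "homozygous SNV"
    if het_range[0] <= frac <= het_range[1]:
        return "heterozygous SNV"
    if frac <= ref_max:
        return "reference"
    return "ambiguous"


# Made-up counts of the kind shown in the coverage pop-up.
print(classify_site({"A": 10, "G": 11}, "A"))  # heterozygous SNV
print(classify_site({"G": 20}, "A"))           # homozygous SNV
print(classify_site({"A": 20}, "A"))           # reference
```

With low coverage the fraction becomes unreliable, so positions with only a handful of reads are better inspected by eye.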



Q3. Browse the alignments and identify a heterozygous SNV, where about half of the nucleotides differ from the reference. Include a screenshot.

Insertions and deletions

Another type of small genetic variation is insertions and deletions (indels). These can also be visualized in IGV:

  • Insertions - are displayed as a purple I. If you hover over or click the I you can see the inserted bases:


Q4. Identify any insertion and take a screenshot displaying the inserted bases.

  • Deletions - are displayed with a black bar spanning the deleted region:


Q5. Identify any deletion and take a screenshot. What sequence has been deleted?
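In the alignment file, insertions and deletions are recorded in each read's CIGAR string as the operations I and D, which is also what the pop-up window shows when you click a read. Here is a minimal sketch that locates them from a read's start position and CIGAR string. The function name `find_indels` and the example read are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def find_indels(pos, cigar):
    """Return (kind, reference_position, length) for each I or D operation.

    `pos` is the 1-based leftmost mapped position of the read.
    An insertion is reported at the reference base it follows;
    a deletion is reported at its first deleted reference base.
    """
    events = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            events.append(("insertion", ref_pos - 1, length))
        elif op == "D":
            events.append(("deletion", ref_pos, length))
        # Only these operations advance along the reference.
        if op in "MDN=X":
            ref_pos += length
    return events


# Made-up read: 10 matches, 2 inserted bases, 5 matches, 3 deleted bases.
indels = find_indels(100, "10M2I5M3D20M")
print(indels)  # [('insertion', 109, 2), ('deletion', 115, 3)]
```

The deleted sequence itself is not stored in the read; it is read off the reference at the reported positions, which is how you can answer Q5.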

Splice variants

Alternative splicing is a process during gene expression that allows a single gene to code for multiple proteins. The resulting transcripts may differ in the presence or absence of one or more exons, in the length of an exon, etc. To visualize the different splice variants:

  • right-click on the RefSeq Genes track
  • Select Expanded


Q6. Browse the alignment and identify a gene where not all of its splice variants are targeted (that is, some are not sequenced). Include a screenshot.

Q7. Are microRNAs or long non-coding RNAs targeted? Hint: Remember that microRNAs are annotated as mir-.. and lncRNA are annotated as lnc.... You could of course google some names. Include a screenshot.

Gapped alignment

Now load RNAseq.bam and RNAseq.bam.tdf.

Adjust the data range of the tdf tracks so that they share the same scale; this is useful when you want to compare the coverage of several samples:

  • Select both tracks
  • right-click on one of them
  • Select Set data range
  • Set a Max value

Q8. What differences do you see between the Exome.bam and the RNAseq.bam alignments? Include a screenshot

Q9. Are there genes targeted in the exome data that are not expressed? Take a snapshot of one of these regions.

Q10. What about expressed regions not covered by the exome data? Take a snapshot of one of these regions.

RNAseq and splice junctions

You will probably see a lot of light blue lines connecting parts of reads in the RNAseq alignment track. This indicates that one part of a read maps to one location in the reference genome and the other part of the read maps to another location. For an RNAseq alignment, this is typically seen when a read spans the junction between two exons.



Let's look at the splice junctions track (if you don't already see it):

  • right click in the RNAseq alignment track
  • Click Show Splice Junction track

This track is a visual representation of breaks in read coverage due to splice junctions:

  • red junctions are on the plus strand and
  • blue on the negative strand.

The thickness is proportional to the number of reads spanning a given splice junction. Click on a specific splice junction to see how many reads span it.
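In a spliced alignment the skipped intron is recorded in the CIGAR string as an N operation, so junctions and their read support can be tallied directly from the reads. Below is a minimal sketch of that counting; the function name `junctions` and the three reads are made up for illustration and do not come from RNAseq.bam.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions(pos, cigar):
    """Return (intron_start, intron_end) for every N operation in a read.

    `pos` is the 1-based leftmost mapped position; coordinates are
    1-based and inclusive.
    """
    found = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            found.append((ref_pos, ref_pos + length - 1))
        # Only these operations advance along the reference.
        if op in "MDN=X":
            ref_pos += length
    return found


# Made-up spliced reads given as (position, CIGAR).
reads = [
    (100, "50M200N50M"),
    (120, "30M200N70M"),
    (140, "10M200N40M500N50M"),
]
junction_counts = Counter()
for pos, cigar in reads:
    junction_counts.update(junctions(pos, cigar))

print(junction_counts)
# Counter({(150, 349): 3, (390, 889): 1})
```

The junction supported by three reads would be drawn thicker than the one supported by a single read.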



To see all splice junctions covering the same region:

  • right click in the Splice junction track
  • Select Expanded

Make sure you have the RefSeq track expanded.

In the figure below, you can see an example of data expressing several isoforms of the same gene. The splice junction in the blue box indicates that we have an isoform joining exon2 to exon4 (skipping exon3), which suggests that we have variant A or E. The splice junctions shown in the red box indicate that we have variant C or D. To know which one, you need to inspect the whole gene.
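The reasoning in this example, matching an observed junction against the annotated isoforms, can be sketched as a small lookup. The exon composition of variants A to E below is entirely hypothetical, since it depends on the gene in the figure; only the logic is meant to carry over. The function name `isoforms_with_junction` is also made up.

```python
# Hypothetical exon composition of five annotated isoforms.
# Replace these with the exons you see in the expanded RefSeq track.
isoforms = {
    "A": [1, 2, 4, 5],
    "B": [1, 2, 3, 4, 5],
    "C": [1, 2, 3, 5],
    "D": [1, 3, 5],
    "E": [1, 2, 4],
}


def isoforms_with_junction(isoforms, exon_a, exon_b):
    """Names of isoforms in which exon_a is directly followed by exon_b."""
    hits = []
    for name, exons in isoforms.items():
        # Pair each exon with the next one to list the isoform's junctions.
        if (exon_a, exon_b) in zip(exons, exons[1:]):
            hits.append(name)
    return sorted(hits)


print(isoforms_with_junction(isoforms, 2, 4))  # ['A', 'E']
print(isoforms_with_junction(isoforms, 3, 5))  # ['C', 'D']
```

A single junction usually leaves more than one candidate, so the remaining junctions of the gene are needed to tell the candidates apart.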



Q11. Identify a region where you can find several isoforms of the same gene (similar to the one above). Take a screenshot and write a short description of what variants you found.

Well done! Now you can visualize and inspect mapped NGS data!



Developed by Marcela Dávila, 2017. Modified by Vanja Börjesson, 2021
