B2 I: IGV - bcfgothenburg/HT23 GitHub Wiki

Course: HT23 Bioinformatics 2 (SC00038)

The aim of these exercises is to introduce you to the Integrative Genome Viewer (IGV) so you can visualize sequencing data as well as understand some common statistics.

Data

For this practical, these are the files you will be using:

File type	DNAseq	RNAseq
`Alignment`	Exome.bam	RNAseq.bam
`Alignment index`	Exome.bam.bai	RNAseq.bam.bai
`QC html report`	Exome_fastqc.html	RNAseq_fastqc.html
`QC stats`	Exome.bam.flagstat, Exome.bam.idxstats	RNAseq.bam.flagstat, RNAseq.bam.idxstats
`Coverage binary file`		RNAseq.bam.tdf

Download the files from the Visualization folder, as indicated in the course web platform

Sequencing quality with FastQC

As we know, FastQC provides us with some statistics to do some quality control checks on raw sequencing data. Fortunately, it also can process BAM files.

Start the FastQC analysis of the Exome.bam file in your computer, this will take a while. In the meantime have a look at Exome_fastqc.html, we processed the same file using the following CMD line:

# generating QC summary
fastqc -o . Exome.bam

Q1. What do you think about the quality of the sample? Focus on these statistics: Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels and Adapter Content

Have look at RNAseq_fastqc.html

Q2. What do you think about the quality of the sample? Focus on these statistics: Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels and Adapter Content

Once your analysis finishes, check that your html file is the same as the one provided.

Statistics with samtools

Samtools is a suite of programs for interacting with high-throughput sequencing data. It is one of the most common tools used to generate some statistics from BAM files. Since it is a command line tool, we processed the Exome.bam file using this code:

# indexing to generate Exome.bam.bai 
samtools index Exome.bam

# calculating and printing statistics
samtools flagstat Exome.bam > Exome.bam.flagstat

# retrieving and printing stats from the index file 
samtools idxstats Exome.bam > Exome.bam.idxstats

Open Exome.bam.flagstat and Exome.bam.idxstats in a text editor. Here you can find some help to understand what each number/column means

Q3. Which chromosome has more sequencing data?

Q4. What does "properly paired" mean?

Investigate RNAseq.bam.flagstat and RNAseq.bam.idxstats. Although there are better measures for RNAseq data, we still have a nice overview of the alignment process

Now let's visualize the alignment files!

Integrative Genome Viewer - IGV

IGV is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data.

Open IGV in your local computer Here you can find some documentation in case you need it.
- In the Tool bar make sure hg19 is loaded
- In the Menu bar go to File -> Load from file and choose Exome.bam file NOTE: Remember that you need the indexed BAM file in the same location!
In the the Search box, type chr10:76,150,047-76,163,384
Click Go This will zoom in to that specific position

As you can remember the Exome.bam track displays all the reads aligned to this position in the hg19 genome, while the Exome.bam coverage track displays the coverage (the sum of all the reads).

Q5. Which gene are we looking at? specifically which exons?

Hover over or click on one of the reads. You will see a window with read information including read name, mapping quality (MAPQ), position where the read is mapped, how different parts of the read is mapped (important when looking for structural variants), information where the mate is mapped etc. You can toggle off an on this feature by clicking the yellow speech bubble (in the Tool bar).

Alignment Customization

By right clicking in the alignment track you will find different options:

- Color alignment by: 
   - insert size
   - read strand
   - first-of-pair strand

- View as pairs

Move around the alignment as you try these options, not all the characteristics can be seen in the same location:

insert size - the length of DNA/RNA between the adapters, includes read1, read2 and the inner distance
- red reads indicates inferred insertion site larger than expected
- blue read indicates inferred insertion site smaller than expected
read strand - the mapping orientation of the read
- red/pink indicates reads in positive strand 5'->3', and
- blue/purple the negative strand
first-of-pair stand- useful for directional RNA libraries
- red/pink shows the read or read pair with forward read first and
- blue/purple for read and read pairs with reverse read first
view as pairs- Link pairs together displayed with a band connecting them

Q6. Find a region with a read displaying an insert size larger than expected. Take a screenshot with the read pair linked. Hint: Remember to browse the alignment.

Single Nucleotide Variants

As you have noted, you need to zoom in quite a bit in order to see any alignment/coverage. To overcome this, let's generate a coverage track:

Go to Tools -> Run igvtools
Select Count
Select Exome.bam
Click Run

This will create a tdf file, a pre-processed file for faster display.

Load Exome.bam.tdf once you have it

Now if you zoom out, you will be able to identify regions with coverage.

Besides the reads, you can also view nucleotides. When zooming in (a lot!), you can see the reference sequence at the bottom of the window. If any base in the reads do not match the reference sequence it will be highlighted in a different color. By clicking on the stack in coverage track where you can see coloured nucleotides, a window pops up including information about how many reads are mapped to this position and what nucleotides are represented:

As we will see later, mismatching nucleotides between the reference and the sample area, account for one type of genetic variation: Single Nucleotide Variants or SNVs. These can be heterozygous if both alleles are different or homozygous if both alleles are identical. This can be easily visualized as follows:

Q7. Browse the alignments and identify any heterozygous SNV, where about half of the nucleotides differ from the reference. Take a screenshot.

Insertions and deletions

Another type of SNVs are insertions and deletions, these can also be visualized in IGV:

Insertions - are viewed as an purple I. If you hover over or click the I you can see the inserted bases:

Q8. Identify any insertion and take a screenshot displaying the inserted bases.

Deletions - are displayed with a black bar spanning the deleted region:

Q9. Identify any deletion and take a screenshot. What sequence has been deleted?

Splice variants

Alternative splicing is a process during gene expression that allows a single gene to code for multiple proteins. They may differ in the presence or absence of one or more exons, in the length of an exon, etc. To visualize the different splice variants:

right-click on the RefSeq Genes track
Select Expanded

Q10. Browse the alignment and identify a gene, where not all its splice variants are targeted (not sequenced)?

Q11. Are microRNAs or other long non-coding RNAs targeted? Hint: Remember that microRNAs are annotated as mir-.. and lncRNA are annotated as lnc.... You could of course google some names.

Gapped alignment

Now load RNAseq.bam and RNAseq.bam.tdf
Adjust the height for the tdf tracks This is useful when you want to compare the coverage of several samples:
- Select both tracks
- right-click on one of them
- Select Set data range
- Set a Max value

Q12. What differences do you see between the Exome.bam and the RNAseq.bam alignments?

Q13. Are there genes targeted in the exome data that are not expressed? Take a snapshot of one of these regions.

Q14. What about expressed regions not covered by the exome data? Take a snapshot of one of these regions.

RNAseq and splice junctions

You probably see a lot of light blue lines connecting the reads in the RNAseq alignment track. This indicates that one part of a read maps to one location in the reference genome and the other part of the read maps to another location. For an RNAseq alignment, this is typically seen when one read expands between exons.

Now we are interested to see what gene splice variants this RNA is coding for. Let's look a the splice junctions track (if you don't already see it):

right click in the RNAseq alignment track
Click Show Splice Junction track

This track is a visual representation of breaks in read coverage due to splice junctions:

red junctions are on the plus strand and
blue on the negative strand

The thickness is proportional with the number of reads spanning a given splice junction. Click on a specific splice junction to see information about how many reads spanning this specific junction.

To see all splice junctions covering the same region:

right click in the Splice junction track
Select Expanded

Make sure you have the RefSeq track expanded.

In the figure below, you can see an example of data expressing several isoforms of the same gene. The splice junction in the blue box indicates that we have an isoform with exon2-exon4 (skipping exon3), that suggest that we have the variant A or E. The splice junctions shown in the red box indicates that we have variants C or D. To know which one, you need to inspect the whole gene.

Q15. Identify a region where you can find several isoforms of the same gene (similar to the one above). Take a screenshot and write a short description of what variants you found.

Well done! now you can visualize and inspect NGS sequencing data!

Home: Bioinformatics 2

Developed by Marcela Dávila, 2017. Modified by Vanja Börjesson, 2021 Modified by Marcela Dávila, 2023