B2 I: IGV - bcfgothenburg/HT23 GitHub Wiki
Course: HT23 Bioinformatics 2 (SC00038)
The aim of these exercises is to introduce you to the Integrative Genome Viewer (IGV) so you can visualize sequencing data as well as understand some common statistics.
Data
For this practical, these are the files you will be using:
| File type | DNAseq | RNAseq |
|---|---|---|
Alignment |
Exome.bam | RNAseq.bam |
Alignment index |
Exome.bam.bai | RNAseq.bam.bai |
QC html report |
Exome_fastqc.html | RNAseq_fastqc.html |
QC stats |
Exome.bam.flagstat, Exome.bam.idxstats | RNAseq.bam.flagstat, RNAseq.bam.idxstats |
Coverage binary file |
RNAseq.bam.tdf |
- Download the files from the
Visualizationfolder, as indicated in the course web platform
Sequencing quality with FastQC
As we know, FastQC provides us with some statistics to do some quality control checks on raw sequencing data. Fortunately, it also can process BAM files.
- Start the
FastQCanalysis of theExome.bamfile in your computer, this will take a while. In the meantime have a look atExome_fastqc.html, we processed the same file using the following CMD line:
# generating QC summary
fastqc -o . Exome.bam
Q1. What do you think about the quality of the sample? Focus on these statistics: Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels and Adapter Content
- Have look at
RNAseq_fastqc.html
Q2. What do you think about the quality of the sample? Focus on these statistics: Per base sequence quality, Per sequence quality scores, Per base sequence content, Per base N content, Sequence Length Distribution, Sequence Duplication Levels and Adapter Content
Once your analysis finishes, check that your html file is the same as the one provided.
Statistics with samtools
Samtools is a suite of programs for interacting with high-throughput sequencing data. It is one of the most common tools used to generate some statistics from BAM files. Since it is a command line tool, we processed the Exome.bam file using this code:
# indexing to generate Exome.bam.bai
samtools index Exome.bam
# calculating and printing statistics
samtools flagstat Exome.bam > Exome.bam.flagstat
# retrieving and printing stats from the index file
samtools idxstats Exome.bam > Exome.bam.idxstats
- Open
Exome.bam.flagstatandExome.bam.idxstatsin a text editor. Here you can find some help to understand what each number/column means
Q3. Which chromosome has more sequencing data?
Q4. What does "properly paired" mean?
- Investigate
RNAseq.bam.flagstatandRNAseq.bam.idxstats. Although there are better measures for RNAseq data, we still have a nice overview of the alignment process
Now let's visualize the alignment files!
Integrative Genome Viewer - IGV
IGV is a high-performance, easy-to-use, interactive tool for the visual exploration of genomic data.
-
Open
IGVin your local computer Here you can find some documentation in case you need it.- In the Tool bar make sure
hg19is loaded - In the Menu bar go to
File -> Load from fileand chooseExome.bamfile NOTE: Remember that you need the indexed BAM file in the same location!
- In the Tool bar make sure
-
In the the Search box, type
chr10:76,150,047-76,163,384 -
Click
GoThis will zoom in to that specific position
As you can remember the Exome.bam track displays all the reads aligned to this position in the hg19 genome, while the Exome.bam coverage track displays the coverage (the sum of all the reads).
Q5. Which gene are we looking at? specifically which exons?
Hover over or click on one of the reads. You will see a window with read information including read name, mapping quality (MAPQ), position where the read is mapped, how different parts of the read is mapped (important when looking for structural variants), information where the mate is mapped etc. You can toggle off an on this feature by clicking the yellow speech bubble (in the Tool bar).
Alignment Customization
By right clicking in the alignment track you will find different options:
- Color alignment by:
- insert size
- read strand
- first-of-pair strand
- View as pairs
Move around the alignment as you try these options, not all the characteristics can be seen in the same location:
-
insert size- the length of DNA/RNA between the adapters, includes read1, read2 and the inner distance- red reads indicates inferred insertion site larger than expected
- blue read indicates inferred insertion site smaller than expected
-
read strand- the mapping orientation of the read- red/pink indicates reads in positive strand 5'->3', and
- blue/purple the negative strand
-
first-of-pair stand- useful for directional RNA libraries- red/pink shows the read or read pair with forward read first and
- blue/purple for read and read pairs with reverse read first
-
view as pairs- Link pairs together displayed with a band connecting them
Q6. Find a region with a read displaying an insert size larger than expected. Take a screenshot with the read pair linked. Hint: Remember to browse the alignment.
Single Nucleotide Variants
As you have noted, you need to zoom in quite a bit in order to see any alignment/coverage. To overcome this, let's generate a coverage track:
- Go to
Tools -> Run igvtools - Select
Count - Select
Exome.bam - Click
Run
This will create a tdf file, a pre-processed file for faster display.
- Load
Exome.bam.tdfonce you have it
Now if you zoom out, you will be able to identify regions with coverage.
Besides the reads, you can also view nucleotides. When zooming in (a lot!), you can see the reference sequence at the bottom of the window. If any base in the reads do not match the reference sequence it will be highlighted in a different color. By clicking on the stack in coverage track where you can see coloured nucleotides, a window pops up including information about how many reads are mapped to this position and what nucleotides are represented:
As we will see later, mismatching nucleotides between the reference and the sample area, account for one type of genetic variation: Single Nucleotide Variants or SNVs. These can be heterozygous if both alleles are different or homozygous if both alleles are identical. This can be easily visualized as follows:
Q7. Browse the alignments and identify any heterozygous SNV, where about half of the nucleotides differ from the reference. Take a screenshot.
Insertions and deletions
Another type of SNVs are insertions and deletions, these can also be visualized in IGV:
- Insertions - are viewed as an
purple I. If you hover over or click theIyou can see the inserted bases:
Q8. Identify any insertion and take a screenshot displaying the inserted bases.
- Deletions - are displayed with a black bar spanning the deleted region:
Q9. Identify any deletion and take a screenshot. What sequence has been deleted?
Splice variants
Alternative splicing is a process during gene expression that allows a single gene to code for multiple proteins. They may differ in the presence or absence of one or more exons, in the length of an exon, etc. To visualize the different splice variants:
right-clickon the RefSeq Genes track- Select
Expanded
Q10. Browse the alignment and identify a gene, where not all its splice variants are targeted (not sequenced)?
Q11. Are microRNAs or other long non-coding RNAs targeted? Hint: Remember that microRNAs are annotated as
mir-..and lncRNA are annotated aslnc.... You could of course google some names.
Gapped alignment
- Now load
RNAseq.bamandRNAseq.bam.tdf - Adjust the height for the tdf tracks
This is useful when you want to compare the coverage of several samples:
- Select both tracks
right-clickon one of them- Select
Set data range - Set a
Maxvalue
Q12. What differences do you see between the
Exome.bamand theRNAseq.bamalignments?
Q13. Are there genes targeted in the exome data that are not expressed? Take a snapshot of one of these regions.
Q14. What about expressed regions not covered by the exome data? Take a snapshot of one of these regions.
RNAseq and splice junctions
You probably see a lot of light blue lines connecting the reads in the RNAseq alignment track. This indicates that one part of a read maps to one location in the reference genome and the other part of the read maps to another location. For an RNAseq alignment, this is typically seen when one read expands between exons.
Now we are interested to see what gene splice variants this RNA is coding for. Let's look a the splice junctions track (if you don't already see it):
right clickin the RNAseq alignment track- Click
Show Splice Junction track
This track is a visual representation of breaks in read coverage due to splice junctions:
- red junctions are on the plus strand and
- blue on the negative strand
The thickness is proportional with the number of reads spanning a given splice junction. Click on a specific splice junction to see information about how many reads spanning this specific junction.
To see all splice junctions covering the same region:
right clickin the Splice junction track- Select
Expanded
Make sure you have the RefSeq track expanded.
In the figure below, you can see an example of data expressing several isoforms of the same gene. The splice junction in the blue box indicates that we have an isoform with exon2-exon4 (skipping exon3), that suggest that we have the variant A or E. The splice junctions shown in the red box indicates that we have variants C or D. To know which one, you need to inspect the whole gene.
Q15. Identify a region where you can find several isoforms of the same gene (similar to the one above). Take a screenshot and write a short description of what variants you found.
Well done! now you can visualize and inspect NGS sequencing data!
Home: Bioinformatics 2
Developed by Marcela Dávila, 2017. Modified by Vanja Börjesson, 2021 Modified by Marcela Dávila, 2023