PostAlignment Visualization - Bioinformatics-Institute/transcriptomics_WBC GitHub Wiki
2-ii. Alignment Visualization
Before we can view our alignments in the IGV browser we need to index our BAM files. We will use samtools index for this purpose.
# Loop over each sample directory under tophat/ (a glob is safer than parsing ls output)
for dir in tophat/*/
do
    samtools index "${dir}accepted_hits.bam" "${dir}accepted_hits.bam.bai"
done
OPTIONAL ALTERNATIVE - Index HISAT2 BAM files
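Create the corresponding index files for the HISAT2 alignments:
# Loop over the sorted HISAT2 BAM files directly (a glob is safer than parsing ls output)
for bam in hisat2/*_sorted.bam
do
    samtools index "$bam" "$bam.bai"
done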
END OF OPTIONAL ALTERNATIVE - Index HISAT2 BAM files
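Before opening IGV it can be useful to confirm that every BAM file actually has an index next to it. The following is a minimal sketch, not part of the original tutorial: the function name `missing_bam_indexes` is hypothetical, and it assumes the index sits beside the BAM file and is named either `file.bam.bai` or `file.bai`.

```shell
# List BAM files under a directory that have no index file next to them.
# Assumption: the index is named either <name>.bam.bai or <name>.bai.
missing_bam_indexes() {
    dir="${1:-.}"
    find "$dir" -type f -name '*.bam' | sort | while IFS= read -r bam
    do
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]
        then
            printf '%s\n' "$bam"
        fi
    done
}
```

For example, `missing_bam_indexes tophat` or `missing_bam_indexes hisat2` should print nothing once the indexing loops above have finished.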
Visualize alignments
We can view the alignment files alongside the genome in a genome browser. Several genome browsers are available; for a list, see "Genome Browsers" on Wikipedia. We will use IGV because it is small and fast to run. To open a genome and one or more alignment files, use the following syntax:
#igv -g YOUR_GENOME_HERE BAMFILE1.bam,BAMFILE2.bam etc
As a specific example, let's load the genome and the first replicate of each condition.
igv -g fasta/chr22_ERCC92.fa tophat/HBR_R1/accepted_hits.bam,tophat/UHR_R1/accepted_hits.bam
Remember that a .bam file can only be loaded if it has been indexed (some genome browsers do this for you automatically). You can use either the HISAT2 or the TopHat alignments. Separate the .bam files with a comma only; do not put spaces between them.
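When there are more than a couple of files, typing the comma-separated list by hand is error-prone. The sketch below builds it from a list of paths; the helper name `join_with_commas` is hypothetical and not part of IGV or samtools.

```shell
# Join all arguments with commas and no spaces, the form the igv command above expects.
# The subshell keeps the IFS change from leaking into the calling shell.
join_with_commas() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Possible usage (assumes the tophat/ layout used in this tutorial):
# igv -g fasta/chr22_ERCC92.fa "$(join_with_commas tophat/*/accepted_hits.bam)"
```

This relies on the shell rule that `"$*"` joins the arguments using the first character of `IFS`.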
Go to an example gene locus on chr22:
- e.g. EIF3L, NDUFA6, and RBX1 have nice coverage
- e.g. SULT4A1 and GTSE1 are differentially expressed. Are they up-regulated or down-regulated in the brain (HBR) compared to cancer cell lines (UHR)?
- Mouse over some reads and use the read group (RG) tag to determine which replicate the reads come from. What other details can you learn about each read and its alignment to the reference genome?
igv -g fasta/chr22_ERCC92.fa tophat/HBR_R1/accepted_hits.bam,tophat/UHR_R1/accepted_hits.bam,annotation/genes_chr22_ERCC92.gtf
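The read groups that IGV shows on mouse-over are declared in the `@RG` lines of each BAM header, so they can also be listed from the command line. The following is a rough sketch: the function name `list_read_groups` is hypothetical, and it assumes the header is piped in as text, for example from `samtools view -H`.

```shell
# Print the ID and SM fields of every @RG header line read from standard input.
# Fields missing from a header line are printed as empty strings.
list_read_groups() {
    awk -F'\t' '$1 == "@RG" {
        id = ""; sm = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^ID:/) id = substr($i, 4)
            if ($i ~ /^SM:/) sm = substr($i, 4)
        }
        print id "\t" sm
    }'
}

# Possible usage:
# samtools view -H tophat/HBR_R1/accepted_hits.bam | list_read_groups
```

The output is one tab-separated line per read group, which can be compared with the RG value shown for a read in IGV.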
Exercise
Try to find a variant position in the RNA-seq data:
- HINT: DDX17 is a highly expressed gene with several variants in its 3' UTR.
- Other highly expressed genes you might explore are: NUP50, CYB5R3, and EIF3L (all have at least one transcribed variant).
- Are these variants previously known (e.g., present in dbSNP)?
- How should we interpret the allele frequency of each variant? Remember that we have rather unusual samples here in that they are actually pooled RNAs corresponding to multiple individuals (genotypes).
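To put a number on the allele frequency at a position, the read counts for the reference and alternate base (for example, as read from the IGV coverage track pop-up) can be turned into a fraction: alt / (ref + alt). This is an illustrative sketch only; the function name `allele_fraction` is hypothetical and the counts in the example are made up.

```shell
# Fraction of reads supporting the alternate allele: alt / (ref + alt).
# Prints NA when no reads cover the position.
allele_fraction() {
    awk -v ref="$1" -v alt="$2" 'BEGIN {
        total = ref + alt
        if (total == 0) { print "NA"; exit }
        printf "%.3f\n", alt / total
    }'
}

# Example with made-up counts: 30 reference reads, 10 alternate reads
# allele_fraction 30 10    # prints 0.250
```

Because these samples are pooled RNA from several individuals, and because expression can differ between alleles, a fraction such as 0.25 need not match the 0.5 or 1.0 expected for a heterozygous or homozygous variant in a single genome.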