PostAlignment Visualization - Bioinformatics-Institute/transcriptomics_WBC GitHub Wiki

RNA-seq Flowchart - Module 3

2-ii. Alignment Visualization

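Before we can view our alignments in the IGV browser, we need to index our BAM files. An index lets the browser jump straight to the reads in a region instead of scanning the whole file. We will use samtools index for this purpose.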

# Index each TopHat alignment; the .bai index is written next to its BAM file
for dir in tophat/*/
do
    samtools index "${dir}accepted_hits.bam" "${dir}accepted_hits.bam.bai"
done
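samtools index expects coordinate-sorted input, so if indexing fails it can help to check how a BAM file was sorted. The sketch below is one way to do that, under the assumption that the BAM header carries an @HD line with an SO field. The function name is_coordinate_sorted is hypothetical; it only reads header text from standard input.

```shell
# Succeeds (exit status 0) if the SAM header on stdin declares coordinate sorting.
# Hypothetical helper name; it only inspects the @HD header line.
is_coordinate_sorted() {
    awk -F '\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++) if ($i == "SO:coordinate") found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Example (requires samtools; not run here):
#   samtools view -H tophat/HBR_R1/accepted_hits.bam | is_coordinate_sorted && echo "sorted"
```

If the check fails, sorting the file with samtools sort before indexing is the usual remedy.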

OPTIONAL ALTERNATIVE - Index HISAT2 BAM files

Create comparable index files for the HISAT2 alignments:

# Index each sorted HISAT2 BAM file; the glob avoids parsing the output of ls
for i in hisat2/*_sorted.bam
do
    samtools index "$i" "$i.bai"
done

END OF OPTIONAL ALTERNATIVE - Index HISAT2 BAM files
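Before opening IGV, it can be worth confirming that every BAM file has an index beside it. The following is a minimal sketch, not part of the original workflow; missing_indexes is a hypothetical helper name, and it assumes the index is stored next to the BAM file as either file.bam.bai or file.bai.

```shell
# Print every BAM file given as an argument that has no index beside it.
# Accepts either naming convention: file.bam.bai or file.bai.
missing_indexes() {
    for bam in "$@"
    do
        [ -e "$bam" ] || continue          # skip unmatched glob patterns
        if [ ! -e "$bam.bai" ] && [ ! -e "${bam%.bam}.bai" ]; then
            printf '%s\n' "$bam"
        fi
    done
}

# Example:
#   missing_indexes tophat/*/accepted_hits.bam hisat2/*_sorted.bam
```

An empty result means every file that was found already has an index.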

Visualize alignments

We can view the mapping files and the genome together in a genome browser. Several genome browsers are available; for a list, see the Wikipedia article on genome browsers. We will use IGV, as it is small and fast to run. To open a genome and one or more mapping files, use the following syntax:

    #igv -g YOUR_GENOME_HERE BAMFILE1.bam,BAMFILE2.bam etc

As a specific example, let's load the genome and the first replicate of each condition.

igv -g fasta/chr22_ERCC92.fa tophat/HBR_R1/accepted_hits.bam,tophat/UHR_R1/accepted_hits.bam

Remember that you can only load .bam files that have been indexed (some genome browsers build the index for you automatically). You can use either the HISAT2 or the TopHat alignments. Separate the .bam files with commas only; a space would split the list into separate arguments.
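When loading many files, typing the comma-separated list by hand is error-prone. As a rough sketch, the list can be assembled from a glob instead. The helper name join_by_comma is an assumption made for illustration and is not part of IGV or samtools.

```shell
# Join all arguments with commas and no spaces (hypothetical helper name).
# The function body runs in a subshell so the IFS change does not leak out.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# Example (requires IGV; not run here):
#   igv -g fasta/chr22_ERCC92.fa "$(join_by_comma tophat/*/accepted_hits.bam)"
```

Quoting the command substitution keeps the whole list as a single argument, even if a path contains a space.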

Go to an example gene locus on chr22:

e.g. EIF3L, NDUFA6, and RBX1 have nice coverage
e.g. SULT4A1 and GTSE1 are differentially expressed. Are they up-regulated or down-regulated in the brain (HBR) compared to cancer cell lines (UHR)?
Mouse over some reads and use the read group (RG) tag to determine which replicate the reads come from. What other details can you learn about each read and its alignment to the reference genome?
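The read groups that IGV reports can also be listed from the command line. The sketch below assumes the alignments were produced with read group information, so that the BAM header contains @RG lines. list_read_groups is a hypothetical helper name; it prints the ID and SM (sample) fields of each read group found in header text on standard input.

```shell
# Print "ID<TAB>SM" for every @RG line in a SAM header read from stdin.
# Hypothetical helper name; fields missing from a line are printed as empty.
list_read_groups() {
    awk -F '\t' '
        $1 == "@RG" {
            id = ""; sm = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                if ($i ~ /^SM:/) sm = substr($i, 4)
            }
            print id "\t" sm
        }
    '
}

# Example (requires samtools; not run here):
#   samtools view -H tophat/HBR_R1/accepted_hits.bam | list_read_groups
```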

To see the gene models alongside the alignments, add the annotation GTF file to the list:

igv -g fasta/chr22_ERCC92.fa tophat/HBR_R1/accepted_hits.bam,tophat/UHR_R1/accepted_hits.bam,annotation/genes_chr22_ERCC92.gtf

Exercise

Try to find a variant position in the RNA-seq data:

HINT: DDX17 is a highly expressed gene with several variants in its 3' UTR.
Other highly expressed genes you might explore are: NUP50, CYB5R3, and EIF3L (all have at least one transcribed variant).
Are these variants previously known (e.g., present in dbSNP)?
How should we interpret the allele frequency of each variant? Remember that we have rather unusual samples here in that they are actually pooled RNAs corresponding to multiple individuals (genotypes).
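When thinking about allele frequency, it may help to put a number on it. Clicking the coverage track in IGV at a variant position reports how many reads support each base. The sketch below turns two such counts into the fraction of reads carrying the alternate allele. allele_fraction is a hypothetical helper name, and the counts in the example are made up for illustration.

```shell
# Fraction of reads supporting the alternate allele at one position.
# Arguments: reference-allele read count, alternate-allele read count.
# Prints "NA" when there is no coverage. Hypothetical helper name.
allele_fraction() {
    awk -v ref="$1" -v alt="$2" 'BEGIN {
        depth = ref + alt
        if (depth == 0) { print "NA"; exit }
        printf "%.3f\n", alt / depth
    }'
}

# Example with made-up counts: 30 reference reads, 10 alternate reads
#   allele_fraction 30 10
```

In pooled samples such as these, the fraction need not sit near 0, 0.5, or 1 as it would for a single diploid individual.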
