3D. Day2 Read mapping and Post mapping analysis - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Short Read Alignment

Lecture_Read_Aligment.pdf

Once the raw read quality has been assessed and controlled sufficiently, the reads can be aligned to a reference. This process is an extremely active field of research and novel aligners are frequently published. There is, no “silver bullet” and the choice of aligners will be dependent on the reference to be used (genome or transcriptome), the data type (short vs. longer reads) and the available computational power, among other factors.

Most recent aligners use either BWT (Burrows M 1994) (Burrows-Wheeler transformation) or MEM (Khan et al. 2009) (Maximum Exact Matches) to perform alignment. Older generation alignment algorithms relied on a spliced-seed approach (Heng Li and Homer 2010). The numerous implementations of these different strategies all come with a myriad of options that may significantly affect the alignment outcome. Selecting the most accurate aligner and determining the optimal parameter set for a project can often represent a small project in itself.

Hence, in the following, we exemplify using aligners that we have incorporated in our processing pipeline based on internal benchmarking for our most common experimental setup: tree genome / transcriptome, Illumina HiSeq 2500, 101bp PE sequencing. The aligner of choice varies based on the type of reference available for the project: For genome based alignment of RNA-Seq data we use STAR, a MEM based aligner - it actually uses MMP (maximum mappable prefix, a variation of MEM); for alignment of RNA-Seq data to a reference transcriptome (Dobin et al. 2013) we use either bowtie (version 1, BWT FM-index based, (Langmead et al. 2009)) or the BWT FM-index or MEM implementations of BWA(H. Li and Durbin 2010), (H. Li and Durbin 2009).

Alignment to the genome /Transcriptome

1. Burrows - Wheeler transform (BWT): suffix/prefix tries based mapper bowtie/bowtie2

When using other aligners, you will need to multi-map the reads back to the transcriptome. In this example I will use bowtie. The bowtie option --all means that all alignments are reported. If you have a large dataset, you might like to limit the number of reported alignment to a large (but finite) number, as this will be faster and the resulting bam files will be smaller. In bowtie you could do this with -k 40 (40 reported alignments).

In the paper, and this example, we map the reads back as pair-end, meaning that if only one read end maps to a transcript and not both, neither is reported. There may be some disadvantages in this, but we have not experimented with it. Feel free to try different mapping options, such as single-end and let us know if it makes a difference!


#1.  Build the bowtie index:
bowtie-build Reference_seq index.name

#2. Reap mapping 
bowtie -q -v 0 -k 10  -t <index.name> R1.fq  -S <out.sam>
bowtie --all  <index.name> -1 R1.fq -2 R2.fq -S Sample.sam  

-q :input is a fastq file
-v 0: no mismatches are allowed
-k 10: max 10 hits are reported for each read 
-S: output is a sam file
-t: time the mapping took is printed to the screen

# In practice
mv ../Day3/*fastq.gz ./
mkdir -p Mapping/Bowtie

nice -n 1 bowtie -p 1 --all ~/Workshop/Day4/Genome/Indices/PhyKer -1 Pk24_R1.fastq.gz  -2 Pk24_R2.fastq.gz\
-S Mapping/Bowtie/Pk24.sam -t

# Recommended for space saving and inter-conversion is possible SAM <-> BAM
nice -n 1 samtools view -S -b Mapping/Bowtie/Pk24.sam -o Mapping/Bowtie/Pk24.bam
rm Mapping/Bowtie/Pk24.sam

# Here is a short bash script for doing batch job

FILES=`ls Pk*_R1.fastq.gz | sed 's/_R1.fastq.gz//g'`
for F in $FILES ; do
R1=${F}_R1.fastq.gz
R2=${F}_R2.fastq.gz
nice -n 1 bowtie --all -S Index -1 $R1 -2 $R2 > ${F}.sam  
nice -n 1 samtools view -S -b ${F}.sam > ${F}.bam
done

Bowtie2

Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to few 100s. Bowtie2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.

# 1. create the index by executing
bowtie2-build Reference_seq index.name

#Mapping SE reads with Bowtie2 
mkdir -p Mapping/Bowtie2
bowtie2 -q -p 1 -x ~/Workshop/Day4/Genome/Indices/PhyKer -U Pk24_R1.fastq.gz -S Mapping/Bowtie2/Pk24_SE.sam --no-unal     

#Mapping SE reads with Bowtie2 
nice -n 1 bowtie2 -q -p 1 -x ~/Workshop/Day4/Genome/Indices/PhyKer -1 Pk24_R1.fastq.gz -2 Pk24_R2.fastq.gz -S Mapping/Bowtie2/Pk24_PE.sam -t

# convert SAM to BAM
# --no-unal: ignore reads that failed to align
Exercise: 
use --very-sensitive and --very-sensitive-local option and find difference in results due to difference in algorithm 


2. Star Algorithms

Maximal mappable pre-fix search in STAR algorithms

1. Indexing the genome

First, the genome needs to be indexed. This is performed using the following command:

STAR --runThreadN 1 --runMode genomeGenerate --genomeDir Genome/PK_Index --genomeFastaFiles Genome/Genome_seq.fn \
 --sjdbGTFfile Genome/Genome.gtf --sjdbOverhang 100

genomeDir: parameter specifies the output directory for the index, genomeFastaFiles: specifies the genome FASTA file path and sjdbGTFfile the file path of the gene annotation file

It can be retrieved from EnsEMBL/UCSC in gff3 format, to be converted in gtf format. We also provide an additional option that would need to be edited depending on your sequencing read length like sjdbOverhang 100; we selected 100 as our longest reads are 101bp long - see the STAR manual for the rationale behind this.

2. Performing the alignment

Once the genome index is built, we can align our sample reads to it. This is achieved as follows:


STAR --genomeDir PK_Index --readFilesIn $R1 $R2 --runThreadN 24 --sjdbGTFfile \ 
Genome/Genome.gff --outFileNamePrefix Mapping/filename \
--outFilterMismatchNoverLmax 0.02 --outFilterMatchNmin 17 --outFilterScoreMinOverLread 0.5 \
--alignIntronMax 1 --sjdbGTFtagExonParentTranscript Parent --outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate --outWigType bedGraph --quantMode GeneCounts --seedSearchStartLmax 18

where there are a number of additional parameters:

--alignIntronMax is important to specify so that STAR does not try to align split reads across a distance greater than --outFileNamePrefix sets the path and prefix to where the results will be written --outSAMtype BAM SortedByCoordinate

We provide a few additional parameters that may require adjustment based on your data: As you can see, we provide many more arguments, which you can lookup in the STAR manual; STAR is highly configurable!

For your practice

mkdir Mapping/Star
cd Mapping/Star

nice -n 1 STAR --genomeDir ~/Workshop/Day4/Genome/Star_Indices --readFilesIn Pk24_R1.fastq.gz Pk24_R2.fastq.gz \ 
--runThreadN 1 --sjdbGTFfile ~/Workshop/Day4/Genome/PhyKer_genomic.fna --outFileNamePrefix Mapping/Star/Pk24 \
--outFilterMismatchNoverLmax 0.02 --outFilterMatchNmin 17 --outFilterScoreMinOverLread 0.5 \
--alignIntronMax 1 --sjdbGTFtagExonParentTranscript Parent --outFilterMultimapNmax 1 --readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate --outWigType bedGraph --quantMode GeneCounts --seedSearchStartLmax 18


Post-processing the alignment result files

STAR returns a number of result files:

  1. Log.final.out: contain details the alignment rate, percentage of uniquely/multiple aligning reads, rate of
  2. *STARAligned.sortedByCoord.out.bam: SAM file that contains the alignment in SAM format (Li et al, 2009).
  3. Two FASTQ files containing the forward and reverse unmapped reads: mismatches, INDELs identified in the reads, etc
  4. SJ.out.tab: information in tabular format, all the EEJs identified by STAR and whether these exist in the provided gff3 file or if they are novel. This is an extremely useful resource that can be used to identify possible new transcript splice variants. One needs to keep in mind that transcription, as all biological processes, is a stochastic process and as such, there will be miss-spliced transcripts present at a low frequency in any RNA-Seq sample that has been sequenced to adequate depth. Hence novel identified junctions might represent low-frequency genuine transcription as well as noise.

The BAM format is very efficient for computers to work with, but as a binary file format, it is unreadble to humans. If you would like to take a look at the alignments, you can do so using samtools:

The sorted BAM file is then indexed
samtools view Pk24.bam | head

Exercise: 
Try to understand read mapping specification in BAM file
Hint: samtools view Pk24Aligned.sortedByCoord.out.bam|cut -f 2|sort|uniq
Put these number in [Decoding of SAM flags](https://broadinstitute.github.io/picard/explain-flags.html)

Merging of Star result into MultiQC results

MultiQC understands STAR log files and can be used to plot the data nicely.

mkdir ~/star_logs
cp ~/star/*Log.final.out ~/star_logs
cd ~/star_logs
multiqc

3 Pseudo-alignment and transcriptome read quantification with Kallisto

Kallisto is a fairly recent program that makes heavy use of k-mer hash tables and De-Bruijn graphs to “pseudo-align” reads (Bray et al. 2016). This means that Kallisto does not map a read to a given position in a genome/transcriptome, but rather to an abstract representation of a transcript. Kallisto then performs a likelihood estimation and expectation maximization algorithms to calculate a count and - at request - bootstraps a confidence value.

Step 1 is to create the Kallisto index from a transcriptome:

mkdir -p Mapping/Kallisto
cd Mapping/Kallisto

nice -n 1 kallisto index --index=Phyker ~/Workshop/Day4/Transcript.fn
nice -n 1 kallisto quant -i Phyker -b 100  -t 1 ~/Workshop/Day4/Pk72_R1.fq ~/Workshop/Day4/Pk72_R2.fq -o ./



Batch Script:

    find ../Trim -name "*trim_[12].fq.gz" | sort | head -n 4 | while read FW_READ
    do
      read RV_READ
      FILEBASE=$(basename "${FW_READ%_1.fq.gz}")
      kallisto quant -i ../Potri03-synthetic-transcripts.idx -b 100 \
      -o . -t 1 "$FW_READ" "$RV_READ"

      # Kallisto doesn't let us specify an output filename so we rename all 
      # output files
      mv "abundance.tsv" $FILEBASE-"abundance.tsv"
      mv "abundance.h5" $FILEBASE-"abundance.h5"
      mv "run_info.json" $FILEBASE-"run_info.json"
    done

The option -b 100 performs a bootstrap confidence estimation using 100 iterations.

Kallisto outputs several files:

abundance.tsv: This file is the main file of interest as it contains transcript counts, tpm, etc. abundance.h5: This file contains the bootstrap values so confidence intervals can be estimated for the gene counts run_info.json: This is a file containing the parameters supplied to Kallisto

We could modify the k-mer size using the -k option. This is - however - often a very sensitive parameter in k-mer/hash based applications. Lower values will provide higher sensitivity, but lower specificity and vice versa for higher values of k. The optimal choice for a k-mer would be the longest read length without a single mismatch due to natural polymorphisms or technical artefacts. In practise, this is very difficult to estimate unless you have access to a large pool of population genomics data.

With salmon

Salmon is a fast aligner for quantifying transcript abundances. For the purposes of corset, we can use one of salmon's outputs called equivalence classes to cluster transcripts and provide gene-level abundances.

1. To run salmon, start by indexing the assembled transcriptome:

cd ..
mkdir -p Mapping/Salmon

salmon index --index Phyker --transcripts ~/Workshop/Day4/Transcript.fn
salmon quant --index Phyker --libType IU --dumpEq -1 ~/Workshop/Day4/Pk72_R1.fq -2 ~/Workshop/Day4/Pk72_R2.fq \
--output Pk24.out


The library type string consists of three parts: the relative orientation of the reads, the strandedness of the library, and the directionality of the reads.

The first part of the library string (relative orientation) is only provided if the library is paired-end. The possible options are:

I = inward
O = outward
M = matching

The second part of the read library string specifies whether the protocol is stranded or unstranded; the options are:

S = stranded
U = unstranded


Then quantify the reads (it's very important to specify the --dumpEq flag, so the equivalence classes are output

Salmon outputs several files:

quant.sf: This file is the main file of interest as it contains transcript counts, tpm, etc.

# Batch script
FILES=`ls Pk*_R1.fq | sed 's/_R1.fq//g'`
for F in $FILES ; do
R1=${F}_R1.fq
R2=${F}_R2.fq
salmon quant --index Phyker --libType IU --dumpEq -1 $R1 -2 $R1 --output ${F}.out
done

The relevant files will be called eq_classes.txt in the subdirectory aux_info. You can adjust the salmon option for your own dataset, for example the library type or the number of threads.

=========================================

Post Mapping analysis

=======================================

1. View Mapped SAM/BAM file

samtools view test.sam/test.bam

2. SAM <-> BAM file

samtools view -bT reference.fa test.sam > test.bam samtools -h test.bam > test.sam

3. Sort BAM file

samtools sort test.bam test_sorted samtools view -bS file.sam | samtools sort - file_sorted

4. Filtering out unmapped reads in BAM files

samtools view -h -F 4 blah.bam > blah_only_mapped.sam

5. Extracting SAM entries mapping to a specific loci

#index the bam file first
samtools index test.bam<br> samtools view test.bam chr1:200000-500000<br> #all reads mapping on chr1 as another bam<br> samtools view -b test.bam chr1 > test_chr1.bam<br>

6. BAM to Fastq and fasta

samtools fastq SAMPLE.bam > SAMPLE.fastq <br> samtools fasta SAMPLE.bam > SAMPLE.fasta

7. Count number of reads

#number of reads
samtools idxstats in.bam | awk '{s+=$3+$4} END {print s}' <br> #number of mapped reads
samtools idxstats in.bam | awk '{s+=$3} END {print s}' <br>

8. Merge/concatenate BAMs

samtools cat bam1.bam bam2.bam >bam12.bam

9. Bam stats

samtools stats bam_file.bam

⚠️ **GitHub.com Fallback** ⚠️