3D. Day2 Read mapping and Post mapping analysis - bioinfokushwaha/Livestock_Genomics GitHub Wiki
Once the raw read quality has been assessed and controlled sufficiently, the reads can be aligned to a reference. This process is an extremely active field of research and novel aligners are frequently published. There is, no “silver bullet” and the choice of aligners will be dependent on the reference to be used (genome or transcriptome), the data type (short vs. longer reads) and the available computational power, among other factors.
Most recent aligners use either BWT (Burrows M 1994) (Burrows-Wheeler transformation) or MEM (Khan et al. 2009) (Maximum Exact Matches) to perform alignment. Older generation alignment algorithms relied on a spliced-seed approach (Heng Li and Homer 2010). The numerous implementations of these different strategies all come with a myriad of options that may significantly affect the alignment outcome. Selecting the most accurate aligner and determining the optimal parameter set for a project can often represent a small project in itself.
Hence, in the following, we exemplify using aligners that we have incorporated in our processing pipeline based on internal benchmarking for our most common experimental setup: tree genome / transcriptome, Illumina HiSeq 2500, 101bp PE sequencing. The aligner of choice varies based on the type of reference available for the project: For genome based alignment of RNA-Seq data we use STAR, a MEM based aligner - it actually uses MMP (maximum mappable prefix, a variation of MEM); for alignment of RNA-Seq data to a reference transcriptome (Dobin et al. 2013) we use either bowtie (version 1, BWT FM-index based, (Langmead et al. 2009)) or the BWT FM-index or MEM implementations of BWA(H. Li and Durbin 2010), (H. Li and Durbin 2009).
When using other aligners, you will need to multi-map the reads back to the transcriptome. In this example I will use bowtie. The bowtie option --all means that all alignments are reported. If you have a large dataset, you might like to limit the number of reported alignment to a large (but finite) number, as this will be faster and the resulting bam files will be smaller. In bowtie you could do this with -k 40 (40 reported alignments).
In the paper, and this example, we map the reads back as pair-end, meaning that if only one read end maps to a transcript and not both, neither is reported. There may be some disadvantages in this, but we have not experimented with it. Feel free to try different mapping options, such as single-end and let us know if it makes a difference!
#1. Build the bowtie index:
bowtie-build Reference_seq index.name
#2. Reap mapping
bowtie -q -v 0 -k 10 -t <index.name> R1.fq -S <out.sam>
bowtie --all <index.name> -1 R1.fq -2 R2.fq -S Sample.sam
-q :input is a fastq file
-v 0: no mismatches are allowed
-k 10: max 10 hits are reported for each read
-S: output is a sam file
-t: time the mapping took is printed to the screen
# In practice
mv ../Day3/*fastq.gz ./
mkdir -p Mapping/Bowtie
nice -n 1 bowtie -p 1 --all ~/Workshop/Day4/Genome/Indices/PhyKer -1 Pk24_R1.fastq.gz -2 Pk24_R2.fastq.gz\
-S Mapping/Bowtie/Pk24.sam -t
# Recommended for space saving and inter-conversion is possible SAM <-> BAM
nice -n 1 samtools view -S -b Mapping/Bowtie/Pk24.sam -o Mapping/Bowtie/Pk24.bam
rm Mapping/Bowtie/Pk24.sam
# Here is a short bash script for doing batch job
FILES=`ls Pk*_R1.fastq.gz | sed 's/_R1.fastq.gz//g'`
for F in $FILES ; do
R1=${F}_R1.fastq.gz
R2=${F}_R2.fastq.gz
nice -n 1 bowtie --all -S Index -1 $R1 -2 $R2 > ${F}.sam
nice -n 1 samtools view -S -b ${F}.sam > ${F}.bam
done
Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to few 100s. Bowtie2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
# 1. create the index by executing
bowtie2-build Reference_seq index.name
#Mapping SE reads with Bowtie2
mkdir -p Mapping/Bowtie2
bowtie2 -q -p 1 -x ~/Workshop/Day4/Genome/Indices/PhyKer -U Pk24_R1.fastq.gz -S Mapping/Bowtie2/Pk24_SE.sam --no-unal
#Mapping SE reads with Bowtie2
nice -n 1 bowtie2 -q -p 1 -x ~/Workshop/Day4/Genome/Indices/PhyKer -1 Pk24_R1.fastq.gz -2 Pk24_R2.fastq.gz -S Mapping/Bowtie2/Pk24_PE.sam -t
# convert SAM to BAM
# --no-unal: ignore reads that failed to align
Exercise:
use --very-sensitive and --very-sensitive-local option and find difference in results due to difference in algorithm
Maximal mappable pre-fix search in STAR algorithms
First, the genome needs to be indexed. This is performed using the following command:
STAR --runThreadN 1 --runMode genomeGenerate --genomeDir Genome/PK_Index --genomeFastaFiles Genome/Genome_seq.fn \
--sjdbGTFfile Genome/Genome.gtf --sjdbOverhang 100
genomeDir: parameter specifies the output directory for the index, genomeFastaFiles: specifies the genome FASTA file path and sjdbGTFfile the file path of the gene annotation file
It can be retrieved from EnsEMBL/UCSC in gff3 format, to be converted in gtf format. We also provide an additional option that would need to be edited depending on your sequencing read length like sjdbOverhang 100; we selected 100 as our longest reads are 101bp long - see the STAR manual for the rationale behind this.
Once the genome index is built, we can align our sample reads to it. This is achieved as follows:
STAR --genomeDir PK_Index --readFilesIn $R1 $R2 --runThreadN 24 --sjdbGTFfile \
Genome/Genome.gff --outFileNamePrefix Mapping/filename \
--outFilterMismatchNoverLmax 0.02 --outFilterMatchNmin 17 --outFilterScoreMinOverLread 0.5 \
--alignIntronMax 1 --sjdbGTFtagExonParentTranscript Parent --outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate --outWigType bedGraph --quantMode GeneCounts --seedSearchStartLmax 18
where there are a number of additional parameters:
--alignIntronMax is important to specify so that STAR does not try to align split reads across a distance greater than --outFileNamePrefix sets the path and prefix to where the results will be written --outSAMtype BAM SortedByCoordinate
We provide a few additional parameters that may require adjustment based on your data: As you can see, we provide many more arguments, which you can lookup in the STAR manual; STAR is highly configurable!
mkdir Mapping/Star
cd Mapping/Star
nice -n 1 STAR --genomeDir ~/Workshop/Day4/Genome/Star_Indices --readFilesIn Pk24_R1.fastq.gz Pk24_R2.fastq.gz \
--runThreadN 1 --sjdbGTFfile ~/Workshop/Day4/Genome/PhyKer_genomic.fna --outFileNamePrefix Mapping/Star/Pk24 \
--outFilterMismatchNoverLmax 0.02 --outFilterMatchNmin 17 --outFilterScoreMinOverLread 0.5 \
--alignIntronMax 1 --sjdbGTFtagExonParentTranscript Parent --outFilterMultimapNmax 1 --readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate --outWigType bedGraph --quantMode GeneCounts --seedSearchStartLmax 18
STAR returns a number of result files:
- Log.final.out: contain details the alignment rate, percentage of uniquely/multiple aligning reads, rate of
- *STARAligned.sortedByCoord.out.bam: SAM file that contains the alignment in SAM format (Li et al, 2009).
- Two FASTQ files containing the forward and reverse unmapped reads: mismatches, INDELs identified in the reads, etc
- SJ.out.tab: information in tabular format, all the EEJs identified by STAR and whether these exist in the provided gff3 file or if they are novel. This is an extremely useful resource that can be used to identify possible new transcript splice variants. One needs to keep in mind that transcription, as all biological processes, is a stochastic process and as such, there will be miss-spliced transcripts present at a low frequency in any RNA-Seq sample that has been sequenced to adequate depth. Hence novel identified junctions might represent low-frequency genuine transcription as well as noise.
The BAM format is very efficient for computers to work with, but as a binary file format, it is unreadble to humans. If you would like to take a look at the alignments, you can do so using samtools:
The sorted BAM file is then indexed
samtools view Pk24.bam | head
Exercise:
Try to understand read mapping specification in BAM file
Hint: samtools view Pk24Aligned.sortedByCoord.out.bam|cut -f 2|sort|uniq
Put these number in [Decoding of SAM flags](https://broadinstitute.github.io/picard/explain-flags.html)
MultiQC understands STAR log files and can be used to plot the data nicely.
mkdir ~/star_logs
cp ~/star/*Log.final.out ~/star_logs
cd ~/star_logs
multiqc
Kallisto is a fairly recent program that makes heavy use of k-mer hash tables and De-Bruijn graphs to “pseudo-align” reads (Bray et al. 2016). This means that Kallisto does not map a read to a given position in a genome/transcriptome, but rather to an abstract representation of a transcript. Kallisto then performs a likelihood estimation and expectation maximization algorithms to calculate a count and - at request - bootstraps a confidence value.
Step 1 is to create the Kallisto index from a transcriptome:
mkdir -p Mapping/Kallisto
cd Mapping/Kallisto
nice -n 1 kallisto index --index=Phyker ~/Workshop/Day4/Transcript.fn
nice -n 1 kallisto quant -i Phyker -b 100 -t 1 ~/Workshop/Day4/Pk72_R1.fq ~/Workshop/Day4/Pk72_R2.fq -o ./
Batch Script:
find ../Trim -name "*trim_[12].fq.gz" | sort | head -n 4 | while read FW_READ
do
read RV_READ
FILEBASE=$(basename "${FW_READ%_1.fq.gz}")
kallisto quant -i ../Potri03-synthetic-transcripts.idx -b 100 \
-o . -t 1 "$FW_READ" "$RV_READ"
# Kallisto doesn't let us specify an output filename so we rename all
# output files
mv "abundance.tsv" $FILEBASE-"abundance.tsv"
mv "abundance.h5" $FILEBASE-"abundance.h5"
mv "run_info.json" $FILEBASE-"run_info.json"
done
The option -b 100 performs a bootstrap confidence estimation using 100 iterations.
Kallisto outputs several files:
abundance.tsv: This file is the main file of interest as it contains transcript counts, tpm, etc. abundance.h5: This file contains the bootstrap values so confidence intervals can be estimated for the gene counts run_info.json: This is a file containing the parameters supplied to Kallisto
We could modify the k-mer size using the -k option. This is - however - often a very sensitive parameter in k-mer/hash based applications. Lower values will provide higher sensitivity, but lower specificity and vice versa for higher values of k. The optimal choice for a k-mer would be the longest read length without a single mismatch due to natural polymorphisms or technical artefacts. In practise, this is very difficult to estimate unless you have access to a large pool of population genomics data.
Salmon is a fast aligner for quantifying transcript abundances. For the purposes of corset, we can use one of salmon's outputs called equivalence classes to cluster transcripts and provide gene-level abundances.
1. To run salmon, start by indexing the assembled transcriptome:
cd ..
mkdir -p Mapping/Salmon
salmon index --index Phyker --transcripts ~/Workshop/Day4/Transcript.fn
salmon quant --index Phyker --libType IU --dumpEq -1 ~/Workshop/Day4/Pk72_R1.fq -2 ~/Workshop/Day4/Pk72_R2.fq \
--output Pk24.out
The library type string consists of three parts: the relative orientation of the reads, the strandedness of the library, and the directionality of the reads.
The first part of the library string (relative orientation) is only provided if the library is paired-end. The possible options are:
I = inward
O = outward
M = matching
The second part of the read library string specifies whether the protocol is stranded or unstranded; the options are:
S = stranded
U = unstranded
Then quantify the reads (it's very important to specify the --dumpEq flag, so the equivalence classes are output
Salmon outputs several files:
quant.sf: This file is the main file of interest as it contains transcript counts, tpm, etc.
# Batch script
FILES=`ls Pk*_R1.fq | sed 's/_R1.fq//g'`
for F in $FILES ; do
R1=${F}_R1.fq
R2=${F}_R2.fq
salmon quant --index Phyker --libType IU --dumpEq -1 $R1 -2 $R1 --output ${F}.out
done
The relevant files will be called eq_classes.txt in the subdirectory aux_info. You can adjust the salmon option for your own dataset, for example the library type or the number of threads.
=========================================
=======================================
samtools view test.sam/test.bam
samtools view -bT reference.fa test.sam > test.bam samtools -h test.bam > test.sam
samtools sort test.bam test_sorted samtools view -bS file.sam | samtools sort - file_sorted
samtools view -h -F 4 blah.bam > blah_only_mapped.sam
#index the bam file first
samtools index test.bam<br> samtools view test.bam chr1:200000-500000<br> #all reads mapping on chr1 as another bam<br> samtools view -b test.bam chr1 > test_chr1.bam<br>
samtools fastq SAMPLE.bam > SAMPLE.fastq <br> samtools fasta SAMPLE.bam > SAMPLE.fasta
#number of reads
samtools idxstats in.bam | awk '{s+=$3+$4} END {print s}' <br>
#number of mapped reads
samtools idxstats in.bam | awk '{s+=$3} END {print s}' <br>
samtools cat bam1.bam bam2.bam >bam12.bam
samtools stats bam_file.bam