3E. Day2 Read coutning strategies - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Transcript quantification

Exercise 1: RSEM

RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides an user-friendly interface, supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, It can generate BAM and Wiggle files in both transcript-coordinate and genomic-coordinate. Genomic-coordinate files can be visualized by both UCSC Genome browser and Broad Institute’s Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV. RSEM also has its own scripts to generate transcript read depth plots in pdf format. The unique feature of RSEM is, the read depth plots can be stacked, with read depth contributed to unique reads shown in black and contributed to multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.

#### Prepare reference

a) Go to ur Name directory like Sandeep

b) cp r ~/Workshop/Day4 ./

c) mkdir RSEM && cd RSEM

d) prepare reference 

nice -n 1 rsem-prepare-reference --gtf ~/Workshop/Day4/Genome/PhyKer_genomic.gtf --star -p 1 \
~/Workshop/Day4/Genome/PhyKer_genomic.fna PhyKer.rsem

e) calculate expression
nice -n 1 rsem-calculate-expression --paired-end -p 1 --star --output-genome-bam  \
~/Workshop/Day4/Data/Nb_Pk24_Rep1_PE_R1.fq.gz ~/Workshop/Day4/Data/Nb_Pk24_Rep1_PE_R2.fq.gz PhyKer.rsem Pk24 --star-gzipped-read-file

#Repeat exercise

nice -n 1 rsem-calculate-expression --paired-end -p 1 --star --output-genome-bam  \
~/Workshop/Day4/Data/Nb_Pk48_Rep1_PE_R1.fq.gz ~/Workshop/Day4/Data/Nb_Pk48_Rep1_PE_R2.fq.gz PhyKer.rsem Pk24 --star-gzipped-read-file

#Answer following Question
Q1. Total number of transcripts 
Q2. Total number of transcripts and genes with expression
Q3. Find plus and minus strand gene expression


Exercise 2: FeatureCount

FeatureCounts takes as input SAM/BAM files and an annotation file including chromosomal coordinates of features. It outputs numbers of reads assigned to features (or meta-features). It also outputs stat info for the overall summrization results, including number of successfully assigned reads and number of reads that failed to be assigned due to various reasons.

FeatureCounts can count reads at either feature level or at meta-feature level. When summarizing reads at meta-feature level, read counts obtained for features included in the same meta-feature will be added up to yield the read count for the corresponding meta-feature.

** Overlap between reads and features**

A read is said to overlap a feature if at least one read base is found to overlap the feature. For paired-end data, a fragment (or template) is said to overlap a feature if any of the two reads from that fragment is found to overlap the feature.

By default, featureCounts does not count reads overlapping with more than one feature (or more than one meta-feature when summarizing at meta-feature level). Users can use the -O option to instruct featureCounts to count such reads (they will be assigned to all their overlapping features or meta-features).

Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.

#Abundance estimation by FeatureCount
### Come out from RSEM directories
cd ..
nice -n 0 featureCounts -g gene_id -T 1 -O -s 1 -M -a ~/Workshop/Day4/Genome/PhyKer_genomic.gtf -o Star_FC RSEM/*.bam

-s: Perform strand-specific read counting.
-M: Multi-mapping reads will also be counted
-O: Assign reads to all their overlapping meta-features (or features if -f is specified).
-g: Specify attribute type in GTF annotation. 'gene_id'
-t: Specify feature type in GTF annotation. 'exon' by default. 
-a: Name of an annotation file. GTF/GFF format by default
-o: Name of the output file including read counts.

Exercise 3: htseq

The script htseq-count is a tool for RNA-Seq data analysis: Given a SAM/BAM file and a GTF or GFF file with gene models, it counts for each gene how many aligned reads overlap its exons. These counts can then be used for gene-level differential expression analyses using methods such as DESeq2 (Love et al., 2014) or edgeR (Robinson et al., 2010).

As the script is designed specifically for differential expression analysis, only reads mapping unambiguously to a single gene are counted, whereas reads aligned to multiple positions or overlapping with more than one gene are discarded.

To see why this is desirable, consider two genes with some sequence similarity, one of which is differentially expressed while the other one is not. A read that maps to both genes equally well should be discarded, because if it were counted for both genes, the extra reads from the differentially expressed gene may cause the other gene to be wrongly called differentially expressed, too.

Another design choice made with the downstream application of differential expression testing in mind is to count fragments, not reads, in case of paired-end data. This is because the two mates originating from the same fragment provide only evidence for one cDNA fragment and should hence be counted only once.

counting rule

#Abundance estimation by htseq

mkdir Htseq && cd Htseq

htseq-count RSEM/*.bam ~/Workshop/Day4/Genome/PhyKer_genomic.gtf -f bam -m union >htSeq_count_union


htseq vs featureCount

Exercise

Compare htseq and featureCount results

Hint: Take read count >1 and extract the genes and gene counts