Generalized Linear Model with Replicates - williamritchie/IRFinder GitHub Wiki

Another way to investigate differential IR between two conditions with replicates can be achieved through fitting a generalize linear model (GLM). This method is used to test the fold change of IR between two conditions, instead of testing the absolute changes of IR. Biological replicates of each condition are REQUIRED.

To start, we need to create TWO files:

  1. A file contains paths of all the IRFinder results, say filePaths.txt. Each line MUST be a FULL path of a IRFinder result. The file structure looks like:
/…/…/Sample1/IRFinder-IR-dir.txt  
/…/…/Sample2/IRFinder-IR-dir.txt  
/…/…/Sample3/IRFinder-IR-dir.txt  
/…/…/Sample4/IRFinder-IR-dir.txt  
  1. A file describe the condition and/or other design of each sample, say experiment.txt. The first column MUST be sample names. And the order of the sample names MUST MATCH the order in “filePath.txt”. The other columns can be any information, usually those ones used to build GLM model. A file header is optional, but strongly recommended. The file structure looks like:
SampleNames	Condition  
Sample1	WT  
Sample2	WT  
Sample3	KO  
Sample4	KO  

NOTE: The following process will pair the lines in the above two files. So please make sure each line in the second file describes the same line in the first file.

After generating these two files, using filePaths.txt and experiment.txt as an example, we can passing IRFinder result to DESeq2 in R interface:

library(DESeq2)
source("/SOME_PATH/IRFinder/bin/DESeq2Constructor.R")  #Load IRFinder-related function

results = read.table("filePaths.txt")
paths = as.vector(results$V1)                                            # File names must be saved in a vector
experiment = read.table("experiment.txt",header=T)                       
experiment$Condition=factor(experiment$Condition,levels=c("WT","KO"))    # Set WT as the baseline in the analysis
rownames(experiment)=NULL                                                # Force removing rownames

# WARNING: make sure the rownames of `experiment` is set to NULL. 
# WARNING: users MUST check if the order of files in the `path` matches the order of samples in `experiment` before continue  

metaList=DESeqDataSetFromIRFinder(filePaths=paths, designMatrix=experiment, designFormula=~1)
# The above line generate a meta list contains four slots
# First slot is a DESeq2 Object that can be directly pass to DESeq2 analysis.  
# Second slot is a matrix for trimmed mean of intron depth
# Third slot  is a matrix for correct splicing depth flanking introns
# Fourth slot is a matrix for maximum splicing reads at either ends of introns
# We build a “null” regression model on the interception only. 
# A “real” model can be assigned either here directly, or in the downstream step. See below

dds = metaList$DESeq2Object                       # Extract DESeq2 Object with normalization factors ready
colData(dds)                                      # Check design of matrix

DataFrame with 8 rows and 3 columns
    SampleNames    Condition    IRFinder
    <factor>        <factor>    <factor>
1   Sample1               WT          IR
2   Sample2               WT          IR
3   Sample3               KO          IR
4   Sample4               KO          IR
5   Sample1               WT          Splice
6   Sample2               WT          Splice
7   Sample3               KO          Splice
8   Sample4               KO          Splice

# Please note that sample size has been doubled and one additional column "IRFinder" has been added.
# This is because IRFinder considers each sample has two sets of counts: one for reads inside intronic region and one for reads at splice site, indicating by "IR" and "Splice" respectively.
# "IRFinder" is considered as an additional variable in the GLM model.
# Please also be aware that size factors have been set to 1 for all samples. Re-estimation of size factors is NOT recommended and is going to bias the result.
# More details at the end of the instruction.

design(dds) = ~Condition + Condition:IRFinder     # Build a formula of GLM. Read below for more details. 
dds = DESeq(dds)                                  # Estimate parameters and fit to model
resultsNames(dds)                                 # Check the actual variable name assigned by DESeq2

[1] "Intercept"                   "Condition_KO_vs_WT"   
[3] "ConditionWT.IRFinderIR" "ConditionKO.IRFinderIR" 

res.WT = results(dds, name = "ConditionWT.IRFinderIR")
# This tests if the number of IR reads are significantly different from normal spliced reads, in the WT samples.
# We might only be interested in the "log2FoldChange" column, instead of the significance.
# This is because "log2FoldChange" represents log2(number of intronic reads/number of normal spliced reads).
# So we the value of (intronic reads/normal spliced reads) by
    
WT.IR_vs_Splice=2^res.WT$log2FoldChange
    
# As IR ratio is calculated as (intronic reads/(intronic reads+normal spliced reads))
# We can easily convert the above value to IR ratio by
    
IRratio.WT = WT.IR_vs_Splice/(1+WT.IR_vs_Splice)
    
# Similarly, we can get IR ratio in the KO samples
res.KO = results(dds, name = "ConditionKO.IRFinderIR")
KO.IR_vs_Splice=2^res.KO$log2FoldChange
IRratio.KO = KO.IR_vs_Splice/(1+KO.IR_vs_Splice)

# Finally we can test the difference of (intronic reads/normal spliced reads) ratio between WT and KO
res.diff = results(dds, contrast=list("ConditionKO.IRFinderIR","ConditionWT.IRFinderIR"))  
    
# We can plot the changes of IR ratio with p values
# In this example we defined significant IR changes as
# 1) IR changes no less than 10% (both direction) and 
# 2) with adjusted p values less than 0.05

IR.change = IRratio.KO - IRratio.WT
plot(IR.change,col=ifelse(res.diff$padj < 0.05 & abs(IR.change)>=0.1, "red", "black"))

# PLEASE NOTE    
# You might see dots at the same horizontal level (same IR change) either marked as significant (red) or not (black)
# This is because the Wald test applied above is testing on fold changes instead of absolute changes
# For example, both changes from 0.01 to 0.11 and from 0.8 to 0.9 have absolute changes of 0.1.
# But the fold changes for them are 11 folds and 1.1 folds respectively, and lead to different p values
⚠️ **GitHub.com Fallback** ⚠️