Summarize ISOseq isoforms - wdlingit/cop GitHub Wiki
This page describes programs and terminologies applied in the paper An improved repertoire of splicing variants and their potential roles in Arabidopsis photomorphogenic development (PMID: 35139889). The programs and scripts applied are aimed to:
- compare isoforms detected in ISOseq reads against database gene models and report alternative splicing events in reads
- summarize detected existing and novel isoforms for comparing samples in terms of expressed isoforms
In order to identify isoforms, a system of noting a read alignment against a gene model was invented: square brackets ([]
) were used to represent exons and parentheses (()
) were used to represent alignment blocks of a read. For example, (-6[1])-([2])-([3]+19)
is representing that the alignment is agreeing intron junctions of the gene model composed of three exons, i.e., [1]
, [2]
, and [3]
, and 6bp shrinking (19bp extension) at the 5' (3') end. For another example, (-6[1]=[2])-(+3[3]+19)
is representing that the first intron was retained and the second donor was altered with 3bp extension. With this noting system, two alignments were considered belonging to the same isoform if (i) they were mapped to the same gene model, and (ii) their alignment representing strings were exactly the same except the two numbers that the 5' and 3' ends.
The above described terminology would be consistent if a representative gene model is selected exactly for one gene locus. In so doing, long reads aligning to the same gene would be noted relative to the same reference. Accordingly, computing representative gene models for gene locus is an important task.
In this example, we start with genome annotation GFF3 file of IRGSP-1.0 from the RAP database. Note that their GFF3 files were separated for gene and non-gene (mRNA, exon, ... etc) records. We will have to merge them and make some correction. The merge and correction steps should be non-necessary if dealing with an all-in-one GFF3 file.
ubuntu@ubuntu20:~$ wget https://rapdb.dna.affrc.go.jp/download/archive/irgsp1/IRGSP-1.0_representative_2023-03-15.tar.gz
ubuntu@ubuntu20:~$ wget https://rapdb.dna.affrc.go.jp/download/archive/irgsp1/IRGSP-1.0_predicted_2023-03-15.tar.gz
ubuntu@ubuntu20:~$ tar -zxvf IRGSP-1.0_representative_2023-03-15.tar.gz
IRGSP-1.0_representative/
IRGSP-1.0_representative/locus.gff
IRGSP-1.0_representative/transcripts.gff
IRGSP-1.0_representative/transcripts_exon.gff
ubuntu@ubuntu20:~$ tar -zxvf IRGSP-1.0_predicted_2023-03-15.tar.gz
IRGSP-1.0_predicted/
IRGSP-1.0_predicted/locus.gff
IRGSP-1.0_predicted/transcripts.gff
IRGSP-1.0_predicted/transcripts_exon.gff
By some simple observation, it was noticed that IRGSP-1.0_predicted
is containing predicted genome annotation of genes much fewer than those in IRGSP-1.0_representative
. As we decide to include genome annotation from the two directories, we also noticed that locus.gff
contains only gene
records and transcripts.gff
contains features like mRNA
, five_prime_UTR
, three_prime_UTR
, and CDS
. As there is no connections between mRNA
and gene
records in the two files, we will have to take care about this when merging all the files.
ubuntu@ubuntu20:~$ cat IRGSP-1.0_representative/transcripts.gff | perl -ne 'chomp; @t=split(/\t/); $t[8].=";"; $t[8]=~s/Locus_id=/Parent=/; print join("\t",@t)."\n"' > merged.gff
ubuntu@ubuntu20:~$ cat IRGSP-1.0_predicted/transcripts.gff | perl -ne 'chomp; @t=split(/\t/); $t[8].=";"; $t[8]=~s/Locus_id=/Parent=/; print join("\t",@t)."\n"' >> merged.gff
ubuntu@ubuntu20:~$ cat IRGSP-1.0_representative/locus.gff IRGSP-1.0_predicted/locus.gff >> merged.gff
The perl one-liner in the first two lines are for replacing attributes Locus_id
of mRNA
records (in the source gff files) into Parent
attributes so that mRNA
would have a Parent
as some gene
.
Next, do some feature hierarchy check using the rackj toolkit. Refer the environment setting if needed.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I merged.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): merged.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
ubuntu@ubuntu20:~$ cat merged.gff.features
GffRoot
gene*
mRNA*
five_prime_UTR
CDS
three_prime_UTR
exon
So it is confirmed that merged.gff
contains correct gene-mRNA relationships so that they have parent-child relationship in the feature hierarchy tree.
For some genome annotation database, there is no suggestion of representative gene modes for multi-model genes. In this case, it is reasonable to take longest gene models as representative.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -O rice.test
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): merged.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA, exon:CDS:five_prime_UTR:three_prime_UTR]]
intron preserving merged model (-IP)): false
representative list (-rep): null
output prefix (-O): rice.test
In so doing, we will have a file named rice.test.model
that contains exon coordinates and gene IDs for all gene models.
ubuntu@ubuntu20:~$ cat rice.test.model | head
>Os01t0100100-01 chr01 2983 10815 + Os01g0100100
2983 3268
3354 3616
4357 4455
5457 5560
7136 7944
8028 8150
8232 8320
8408 8608
9210 9615
So the following perl one-liner helps us to find the longest gene model for each gene.
ubuntu@ubuntu20:~$ cat rice.test.model | perl -ne 'chomp; @t=split; if(/^>(\S+)/){ $mID=$1; $gmHash{$t[-1]}{$mID}=1; }else{ $lenHash{$mID}+=($t[1]-$t[0]+1) } if(eof){ for $g (sort keys %gmHash){ @m=keys %{$gmHash{$g}}; @m=sort {$lenHash{$b}<=>$lenHash{$a}} @m; print "$g\t$m[0]\n"; } }' > rice.test.repMap
And we run the same java program misc.ModelCGFF
again, but with the representative information.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -O rice.RAP -rep rice.test.repMap
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): merged.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA, exon:CDS:five_prime_UTR:three_prime_UTR]]
intron preserving merged model (-IP)): false
representative list (-rep): rice.test.repMap
output prefix (-O): rice.RAP
In so doing, we will have a rice.RAP.hybridCgff
. This file would contain the following information for each gene:
- exon coordinates of the representative model
- exon coordinates of non-representative models that are not overlapping the representative
- representative exons numbered 1, 2, ... in gene orientation
- non-representative exons numbered as x^y for y-th additional exon behind x-th representative exon
>Os01t0103600-01 chr01 186250 190904 -
186250 186771 3^1
187345 189715 3
189841 189990 2
190087 190904 1
>Os01t0103650-00 chr01 187545 188586 +
187545 188020 1
188060 188385 2
188455 188586 3
In other words, in some sense, rice.RAP.hybridCgff
contains all exon coordinates of all genes in the database. We used to call these synthesized gene models as hybrid representative models because they are hybrids of representatives and non-representatives. This file will not be used in the following commands but it can help us to understand the coordinates.
A few long reads were faked for two gene locus. Their alignments are saved in file example_align.bam
. Java program special.ReadASEvent
is for comparing isoforms in long reads against representative gene models in the database.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.ReadASEvent -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -GPC gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -rep rice.test.repMap -M SAM example_align.bam -O example_align
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ReadASevent
GFF3 filename (-GFF3): merged.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA, exon:CDS:five_prime_UTR:three_prime_UTR]]
gene-prot-cds feature triples (-GPC): [[gene, mRNA, exon:CDS:five_prime_UTR:three_prime_UTR]]
hybrid CGFF replacement (-hybrid): null
representative list (-rep): rice.test.repMap
mapping method/filename (-M):
SAM : example_align.bam
gene matcher method (-matcher): ISOseq
accepted record filename (-A): null
output prefix (-O): example_align
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): true
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
care mapping direction (-direction): false
verify fit models only by junction (-junction): true
Option meanings:
-
-GFF3
: the genome annotation GFF3 file -
-GRE
: gene-rna-exon feature triples to be extracted from the GFF3 file -
-GPC
: gene-protein-cds feature triples to be extracted from the GFF3 file. When the input BAM file contains multiple alignments for one read, alignments will be compared to transcript regions (defined by-GRE
) for picking the alignment that best fitting some transcript. When there are two or more alignments equally best fitting some transcripts, the alignment picking process will be repeated again using regions defined by-GPC
in terms for coding regions. -
-rep
: a gene-representative mapping tab-delimited file of all genes. This is for computing the coordinate information in the aforementionedrice.RAP.hybridCgff
. -
-M SAM example_align.bam
: use SAM format to readexample_align.bam
. For BAM files containing multiple alignments for one read, must make sure they are sorted by names. -
-matcher ISOseq
: apply the aforementionedGRE-GPC
alignment picking strategy. -
-A
: the file to save accepted alignments. Would be useful if multiple alignments for one read. -
-O
: output file prefix. The key report file would be a.xls
file. -
-J
: connect gaps if the insertion/deletion size is smaller than the specified size -
-ID
: alignment identity threshold (matches/read length) -
-exon
: use exon regions (or false for gene regions) for considering a read is hitting a gene -
-contain
: based on regions defined by-exon
, consider hit by containment or intersection -
-min
: minimum size of containment or intersection -
-ALL
: need all read alignment block satisfying the above hitting criteria? -
-direction
: should the alignment in the same orientation as the hitting gene? -
-junction
: use only junction information for deciding fitting gene models. Note that single exon genes got no junctions (introns).
The output .xls
file contains many columns. Meaning of the columns:
- readID: the read ID
- representative: representative model ID of the hitting gene
- repLength: length of the representative model
- measurement: precision of the read alignment against the hybrid representative model, i.e., number of overlapping bases between the alignment and the model divided by the alignment length
- recall: recall of the read alignment against the hybrid representative model, i.e., number of overlapping bases between the alignment and the model divided by the model length
- fitModels: a list of gene models of the hitting genes, where the gene models have no junctions disagreeing the alignment
- IR: a list of introns been retained, marked as preceding exons
- altD: a list of exons expressed with alternative donor
- altA: a list of exons expressed with alternative accepter
- casseteE: a list of expressed cassette exons (exons inside introns), taking the hybrid representative as the reference, marked as preceding exons
- crypticI: a list of expressed cryptic introns (introns inside exons), taking the hybrid representative as the reference, marked as located exons
- ES: a list of skipped exons
- altTSE: alternative transcription start exon, compared the representative model. The answer could be in two forms: +/-N for N more/less expressed exon compared to the representative, or the staring exon notation in the hybrid representative model
- altTEE: alternative transcription end exon. The answer is similar with that of altTSE.
- combination: a combination string of the above alternative splicing events
- eventCnt: alternative splicing event count
- headDist: relative position of read start to the hybrid representative start
- rearDist: relative position of read stop to the hybrid representative stop
- alignment2representative: the notation of the read alignment against the hybrid representative model
In the example example_align.bam
. We got two long reads for the first gene locus, one with exactly the same coordinates with the only gene model, and the other got one intron retained.
The second gene locus got two gene models, where Os01t0103600-01
is the representative.
Accordingly, the output file example_align.xls has the following outputs:
- For the first gene locus, the exact matching read has alignment string
([1])-([2])-([3])-([4])-([5])-([6])-([7])-([8])-([9])-([10])-([11])-([12])
. It also has the gene model in the fitModels column - For the first gene locus, the other read has alignment string
([1])-([2]=[3])-([4])-([5])-([6])-([7])-([8])-([9])-([10])-([11])-([12])
showing that the intron between exons 2 and 3 was retained. Under the IR column there is a "2". - For the second gene locus, the first read's junction combination is not agreeing with any of the existing models so its fitModels column is empty. It has alignment string
(+50[1])-([2])-([3])-([3^1]+50)
because it expressed all exons of the two existing models with 50bp extensions at the two ends. - For the second gene locus, the second read is very similar with the first read and its alignment string is
(+50[1])-([2])-([3])-([3^1])
. - For the second gene locus, the third read is actually the model Os01t0103600-02. Its alignment string is
(-642[1])-([2])-([3]-2262)-([3^1])
, meaning 2262bp shrink of exon 3 (compared to the representative) and expression of exon3^1
. fitModels listed Os01t0103600-02. - For the second gene locus, the last read is actually the representative Os01t0103600-01. So we have alignment string
([1])-([2])-([3])-[3^1]
and Os01t0103600-01 listed in fitModels.
Some special care regarding insertions and deletions should be paid to real PacBio ISOseq reads and nanopore DRS reads. The following perl one-liner makes simple statistics on insertion/deletion sizes inside the BAM file provided by a local PacBio service company.
ubuntu@ubuntu20:~$ samtools view sampleA.bam | perl -ne '@t=split; while($t[5]=~/(\d+)[I|D]/g){ $hash{$1}++ } if(eof){ for $x (sort {$a<=>$b} keys %hash){ print "$x\t$hash{$x}\n" } }' | head -30
1 10030206
2 1464022
3 442603
4 173860
5 85722
6 56669
7 30839
8 24067
9 16149
10 10090
11 7487
12 6489
13 4631
14 4195
15 4039
16 2873
17 2155
18 1860
19 1574
20 1292
21 1369
22 1157
23 960
24 1086
25 775
26 714
27 752
28 635
29 4201
30 558
It is observed that many insertions/deletions shorter than 10bps and they can be considered as noises. Accordingly, setting -J 10
or -J 20
is reasonable when running special.ReadASEvent
. Lowering -ID
to 0.7 or even 0.3 is also reasonable in case that some adapters or noises could be still in the reads.
ubuntu@ubuntu20:~$ ls sample*.bam | perl -ne 'chomp; /(.+?)\./; $cmd="java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.ReadASEvent -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -GPC gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -rep rice.test.repMap -J 20 -ID 0.7 -direction true -junction false -M SAM $_ -O $1"; print "\nCMD: $cmd\n"; #system $cmd'
CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.ReadASEvent -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -GPC gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -rep rice.test.repMap -J 20 -ID 0.7 -direction true -junction false -M SAM sampleA.bam -O sampleA
CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.ReadASEvent -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -GPC gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -rep rice.test.repMap -J 20 -ID 0.7 -direction true -junction false -M SAM sampleB.bam -O sampleB
(of course we remove the sharp mark for actual running the commands)
The real datasets give us up to millions of reads so the output .xls files would be large.
ubuntu@ubuntu20:~$ wc -l sample*.xls
1208647 sampleA.xls
1047889 sampleB.xls
2256536 total
It is reasonable to do some filtering based on the output measurements of precision and recall. For example, the following perl one-liner saves records with recall>=0.8 ($t[4]>=0.8
) to .sel.xls files.
ubuntu@ubuntu20:~$ ls sample*.xls | perl -ne 'chomp; /(.+?)\./; print "!!!FILE: $1\n"; system "cat $_"' | perl -ne 'if(/^!!!FILE: (\S+)/){ open(FILE,">$1.sel.xls") }else{ @t=split(/\t/); print FILE if $t[4]>=0.8 || /^#/ }'
ubuntu@ubuntu20:~$ wc -l sample*.xls
994209 sampleA.sel.xls
1208647 sampleA.xls
936972 sampleB.sel.xls
1047889 sampleB.xls
4187717 total
The isoformSummary.pl
from this repository can be used for counting reads of the same isoforms, as well as merged thus-made data from multiple samples.
ubuntu@ubuntu20:~$ git https://github.com/wdlingit/cop.git
ubuntu@ubuntu20:~$ ./cop/scripts/isoformSummary.pl
isoformSummary.pl [<alias> <filename>]+
ubuntu@ubuntu20:~$ ./cop/scripts/isoformSummary.pl sampleA sampleA.sel.xls sampleB sampleB.sel.xls > summary.xls
ubuntu@ubuntu20:~$ wc -l summary.xls
129168 summary.xls
The isoformSummary.pl
script simply does two things:
- Collect alignment strings in source tables (made by
special.ReadASEvent
) of the same isoforms. For example, two long reads hitting the same gene locus with alignment strings(+50[1])-([2])-([3])-([3^1])
and(+50[1])-([2])-([3])-([3^1]+50)
are considered from the same isoform. - Collect read counts of isoforms from source tables and summarize them into the output.
So the headers of summary.xls
are
ubuntu@ubuntu20:~$ head -1 summary.xls | perl -ne 'chomp; @t=split(/\t/); for($i=0;$i<@t;$i++){ print "$i\t$t[$i]\n" }'
0 #representative
1 repLength
2 measurement
3 recall
4 fitModels
5 IR
6 altD
7 altA
8 casseteE
9 crypticI
10 ES
11 altTSE
12 altTEE
13 combination
14 eventCnt
15 alignment2representative
16 sampleA
17 sampleB
where the last two columns are read counts from sampleA and sampleB.