Summarize ISOseq isoforms - wdlingit/cop GitHub Wiki

This page describes programs and terminologies applied in the paper An improved repertoire of splicing variants and their potential roles in Arabidopsis photomorphogenic development (PMID: 35139889). The programs and scripts applied are aimed to:

compare isoforms detected in ISOseq reads against database gene models and report alternative splicing events in reads
summarize detected existing and novel isoforms for comparing samples in terms of expressed isoforms

The terminology

In order to identify isoforms, a system of noting a read alignment against a gene model was invented: square brackets ([]) were used to represent exons and parentheses (()) were used to represent alignment blocks of a read. For example, (-6[1])-([2])-([3]+19) is representing that the alignment is agreeing intron junctions of the gene model composed of three exons, i.e., [1], [2], and [3], and 6bp shrinking (19bp extension) at the 5' (3') end. For another example, (-6[1]=[2])-(+3[3]+19) is representing that the first intron was retained and the second donor was altered with 3bp extension. With this noting system, two alignments were considered belonging to the same isoform if (i) they were mapped to the same gene model, and (ii) their alignment representing strings were exactly the same except the two numbers that the 5' and 3' ends.

Getting representative gene models

The above described terminology would be consistent if a representative gene model is selected exactly for one gene locus. In so doing, long reads aligning to the same gene would be noted relative to the same reference. Accordingly, computing representative gene models for gene locus is an important task.

In this example, we start with genome annotation GFF3 file of IRGSP-1.0 from the RAP database. Note that their GFF3 files were separated for gene and non-gene (mRNA, exon, ... etc) records. We will have to merge them and make some correction. The merge and correction steps should be non-necessary if dealing with an all-in-one GFF3 file.

ubuntu@ubuntu20:~$ wget https://rapdb.dna.affrc.go.jp/download/archive/irgsp1/IRGSP-1.0_representative_2023-03-15.tar.gz
ubuntu@ubuntu20:~$ wget https://rapdb.dna.affrc.go.jp/download/archive/irgsp1/IRGSP-1.0_predicted_2023-03-15.tar.gz

ubuntu@ubuntu20:~$ tar -zxvf IRGSP-1.0_representative_2023-03-15.tar.gz
IRGSP-1.0_representative/
IRGSP-1.0_representative/locus.gff
IRGSP-1.0_representative/transcripts.gff
IRGSP-1.0_representative/transcripts_exon.gff

ubuntu@ubuntu20:~$ tar -zxvf IRGSP-1.0_predicted_2023-03-15.tar.gz
IRGSP-1.0_predicted/
IRGSP-1.0_predicted/locus.gff
IRGSP-1.0_predicted/transcripts.gff
IRGSP-1.0_predicted/transcripts_exon.gff

By some simple observation, it was noticed that IRGSP-1.0_predicted is containing predicted genome annotation of genes much fewer than those in IRGSP-1.0_representative. As we decide to include genome annotation from the two directories, we also noticed that locus.gff contains only gene records and transcripts.gff contains features like mRNA, five_prime_UTR, three_prime_UTR, and CDS. As there is no connections between mRNA and gene records in the two files, we will have to take care about this when merging all the files.

ubuntu@ubuntu20:~$ cat IRGSP-1.0_representative/transcripts.gff | perl -ne 'chomp; @t=split(/\t/); $t[8].=";"; $t[8]=~s/Locus_id=/Parent=/; print join("\t",@t)."\n"' > merged.gff
ubuntu@ubuntu20:~$ cat IRGSP-1.0_predicted/transcripts.gff | perl -ne 'chomp; @t=split(/\t/); $t[8].=";"; $t[8]=~s/Locus_id=/Parent=/; print join("\t",@t)."\n"' >> merged.gff
ubuntu@ubuntu20:~$ cat IRGSP-1.0_representative/locus.gff IRGSP-1.0_predicted/locus.gff >> merged.gff

The perl one-liner in the first two lines are for replacing attributes Locus_id of mRNA records (in the source gff files) into Parent attributes so that mRNA would have a Parent as some gene.

Next, do some feature hierarchy check using the rackj toolkit. Refer the environment setting if needed.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I merged.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): merged.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

ubuntu@ubuntu20:~$ cat merged.gff.features
GffRoot
  gene*
    mRNA*
      five_prime_UTR
      CDS
      three_prime_UTR
      exon

So it is confirmed that merged.gff contains correct gene-mRNA relationships so that they have parent-child relationship in the feature hierarchy tree.

For some genome annotation database, there is no suggestion of representative gene modes for multi-model genes. In this case, it is reasonable to take longest gene models as representative.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -O rice.test
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): merged.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA, exon:CDS:five_prime_UTR:three_prime_UTR]]
intron preserving merged model (-IP)): false
representative list (-rep): null
output prefix (-O): rice.test

In so doing, we will have a file named rice.test.model that contains exon coordinates and gene IDs for all gene models.

ubuntu@ubuntu20:~$ cat rice.test.model | head
>Os01t0100100-01        chr01   2983    10815   +       Os01g0100100
2983    3268
3354    3616
4357    4455
5457    5560
7136    7944
8028    8150
8232    8320
8408    8608
9210    9615

So the following perl one-liner helps us to find the longest gene model for each gene.

ubuntu@ubuntu20:~$ cat rice.test.model | perl -ne 'chomp; @t=split; if(/^>(\S+)/){ $mID=$1; $gmHash{$t[-1]}{$mID}=1; }else{ $lenHash{$mID}+=($t[1]-$t[0]+1) } if(eof){ for $g (sort keys %gmHash){ @m=keys %{$gmHash{$g}}; @m=sort {$lenHash{$b}<=>$lenHash{$a}} @m; print "$g\t$m[0]\n"; } }' > rice.test.repMap

And we run the same java program misc.ModelCGFF again, but with the representative information.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -O rice.RAP -rep rice.test.repMap
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): merged.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA, exon:CDS:five_prime_UTR:three_prime_UTR]]
intron preserving merged model (-IP)): false
representative list (-rep): rice.test.repMap
output prefix (-O): rice.RAP

In so doing, we will have a rice.RAP.hybridCgff. This file would contain the following information for each gene:

exon coordinates of the representative model
exon coordinates of non-representative models that are not overlapping the representative
representative exons numbered 1, 2, ... in gene orientation
non-representative exons numbered as x^y for y-th additional exon behind x-th representative exon

>Os01t0103600-01        chr01   186250  190904  -
186250  186771  3^1
187345  189715  3
189841  189990  2
190087  190904  1
>Os01t0103650-00        chr01   187545  188586  +
187545  188020  1
188060  188385  2
188455  188586  3

In other words, in some sense, rice.RAP.hybridCgff contains all exon coordinates of all genes in the database. We used to call these synthesized gene models as hybrid representative models because they are hybrids of representatives and non-representatives. This file will not be used in the following commands but it can help us to understand the coordinates.

Compute AS/isoform information of long reads, setup examples

A few long reads were faked for two gene locus. Their alignments are saved in file example_align.bam. Java program special.ReadASEvent is for comparing isoforms in long reads against representative gene models in the database.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.ReadASEvent -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -GPC gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -rep rice.test.repMap -M SAM example_align.bam -O example_align
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ReadASevent
GFF3 filename (-GFF3): merged.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA, exon:CDS:five_prime_UTR:three_prime_UTR]]
gene-prot-cds feature triples (-GPC): [[gene, mRNA, exon:CDS:five_prime_UTR:three_prime_UTR]]
hybrid CGFF replacement (-hybrid): null
representative list (-rep): rice.test.repMap
mapping method/filename (-M):
  SAM : example_align.bam
gene matcher method (-matcher): ISOseq
accepted record filename (-A): null
output prefix (-O): example_align
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): true
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
care mapping direction (-direction): false
verify fit models only by junction (-junction): true

Option meanings:

-GFF3: the genome annotation GFF3 file
-GRE: gene-rna-exon feature triples to be extracted from the GFF3 file
-GPC: gene-protein-cds feature triples to be extracted from the GFF3 file. When the input BAM file contains multiple alignments for one read, alignments will be compared to transcript regions (defined by -GRE) for picking the alignment that best fitting some transcript. When there are two or more alignments equally best fitting some transcripts, the alignment picking process will be repeated again using regions defined by -GPC in terms for coding regions.
-rep: a gene-representative mapping tab-delimited file of all genes. This is for computing the coordinate information in the aforementioned rice.RAP.hybridCgff.
-M SAM example_align.bam: use SAM format to read example_align.bam. For BAM files containing multiple alignments for one read, must make sure they are sorted by names.
-matcher ISOseq: apply the aforementioned GRE-GPC alignment picking strategy.
-A: the file to save accepted alignments. Would be useful if multiple alignments for one read.
-O: output file prefix. The key report file would be a .xls file.
-J: connect gaps if the insertion/deletion size is smaller than the specified size
-ID: alignment identity threshold (matches/read length)
-exon: use exon regions (or false for gene regions) for considering a read is hitting a gene
-contain: based on regions defined by -exon, consider hit by containment or intersection
-min: minimum size of containment or intersection
-ALL: need all read alignment block satisfying the above hitting criteria?
-direction: should the alignment in the same orientation as the hitting gene?
-junction: use only junction information for deciding fitting gene models. Note that single exon genes got no junctions (introns).

The output .xls file contains many columns. Meaning of the columns:

readID: the read ID
representative: representative model ID of the hitting gene
repLength: length of the representative model
measurement: precision of the read alignment against the hybrid representative model, i.e., number of overlapping bases between the alignment and the model divided by the alignment length
recall: recall of the read alignment against the hybrid representative model, i.e., number of overlapping bases between the alignment and the model divided by the model length
fitModels: a list of gene models of the hitting genes, where the gene models have no junctions disagreeing the alignment
IR: a list of introns been retained, marked as preceding exons
altD: a list of exons expressed with alternative donor
altA: a list of exons expressed with alternative accepter
casseteE: a list of expressed cassette exons (exons inside introns), taking the hybrid representative as the reference, marked as preceding exons
crypticI: a list of expressed cryptic introns (introns inside exons), taking the hybrid representative as the reference, marked as located exons
ES: a list of skipped exons
altTSE: alternative transcription start exon, compared the representative model. The answer could be in two forms: +/-N for N more/less expressed exon compared to the representative, or the staring exon notation in the hybrid representative model
altTEE: alternative transcription end exon. The answer is similar with that of altTSE.
combination: a combination string of the above alternative splicing events
eventCnt: alternative splicing event count
headDist: relative position of read start to the hybrid representative start
rearDist: relative position of read stop to the hybrid representative stop
alignment2representative: the notation of the read alignment against the hybrid representative model

In the example example_align.bam. We got two long reads for the first gene locus, one with exactly the same coordinates with the only gene model, and the other got one intron retained. ReadASEvent_ex01

The second gene locus got two gene models, where Os01t0103600-01 is the representative. ReadASEvent_ex02

Accordingly, the output file example_align.xls has the following outputs:

For the first gene locus, the exact matching read has alignment string ([1])-([2])-([3])-([4])-([5])-([6])-([7])-([8])-([9])-([10])-([11])-([12]). It also has the gene model in the fitModels column
For the first gene locus, the other read has alignment string ([1])-([2]=[3])-([4])-([5])-([6])-([7])-([8])-([9])-([10])-([11])-([12]) showing that the intron between exons 2 and 3 was retained. Under the IR column there is a "2".
For the second gene locus, the first read's junction combination is not agreeing with any of the existing models so its fitModels column is empty. It has alignment string (+50[1])-([2])-([3])-([3^1]+50) because it expressed all exons of the two existing models with 50bp extensions at the two ends.
For the second gene locus, the second read is very similar with the first read and its alignment string is (+50[1])-([2])-([3])-([3^1]).
For the second gene locus, the third read is actually the model Os01t0103600-02. Its alignment string is (-642[1])-([2])-([3]-2262)-([3^1]), meaning 2262bp shrink of exon 3 (compared to the representative) and expression of exon 3^1. fitModels listed Os01t0103600-02.
For the second gene locus, the last read is actually the representative Os01t0103600-01. So we have alignment string ([1])-([2])-([3])-[3^1] and Os01t0103600-01 listed in fitModels.

Compute AS/isoform information of long reads, real examples

Some special care regarding insertions and deletions should be paid to real PacBio ISOseq reads and nanopore DRS reads. The following perl one-liner makes simple statistics on insertion/deletion sizes inside the BAM file provided by a local PacBio service company.

ubuntu@ubuntu20:~$ samtools view sampleA.bam | perl -ne '@t=split; while($t[5]=~/(\d+)[I|D]/g){ $hash{$1}++ } if(eof){ for $x (sort {$a<=>$b} keys %hash){ print "$x\t$hash{$x}\n" } }' | head -30
1       10030206
2       1464022
3       442603
4       173860
5       85722
6       56669
7       30839
8       24067
9       16149
10      10090
11      7487
12      6489
13      4631
14      4195
15      4039
16      2873
17      2155
18      1860
19      1574
20      1292
21      1369
22      1157
23      960
24      1086
25      775
26      714
27      752
28      635
29      4201
30      558

It is observed that many insertions/deletions shorter than 10bps and they can be considered as noises. Accordingly, setting -J 10 or -J 20 is reasonable when running special.ReadASEvent. Lowering -ID to 0.7 or even 0.3 is also reasonable in case that some adapters or noises could be still in the reads.

ubuntu@ubuntu20:~$ ls sample*.bam | perl -ne 'chomp; /(.+?)\./; $cmd="java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.ReadASEvent -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -GPC gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -rep rice.test.repMap -J 20 -ID 0.7 -direction true -junction false -M SAM $_ -O $1"; print "\nCMD: $cmd\n"; #system $cmd'

CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.ReadASEvent -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -GPC gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -rep rice.test.repMap -J 20 -ID 0.7 -direction true -junction false -M SAM sampleA.bam -O sampleA

CMD: java -classpath /home/ubuntu/Tools/rackJ/rackj.jar:/home/ubuntu/Tools/sam-1.89.jar special.ReadASEvent -GFF3 merged.gff -GRE gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -GPC gene mRNA exon:CDS:five_prime_UTR:three_prime_UTR -rep rice.test.repMap -J 20 -ID 0.7 -direction true -junction false -M SAM sampleB.bam -O sampleB

(of course we remove the sharp mark for actual running the commands)

The real datasets give us up to millions of reads so the output .xls files would be large.

ubuntu@ubuntu20:~$ wc -l sample*.xls
  1208647 sampleA.xls
  1047889 sampleB.xls
  2256536 total

It is reasonable to do some filtering based on the output measurements of precision and recall. For example, the following perl one-liner saves records with recall>=0.8 ($t[4]>=0.8) to .sel.xls files.

ubuntu@ubuntu20:~$ ls sample*.xls | perl -ne 'chomp; /(.+?)\./; print "!!!FILE: $1\n"; system "cat $_"' | perl -ne 'if(/^!!!FILE: (\S+)/){ open(FILE,">$1.sel.xls") }else{ @t=split(/\t/); print FILE if $t[4]>=0.8 || /^#/ }'

ubuntu@ubuntu20:~$ wc -l sample*.xls
   994209 sampleA.sel.xls
  1208647 sampleA.xls
   936972 sampleB.sel.xls
  1047889 sampleB.xls
  4187717 total

The isoformSummary.pl from this repository can be used for counting reads of the same isoforms, as well as merged thus-made data from multiple samples.

ubuntu@ubuntu20:~$ git https://github.com/wdlingit/cop.git

ubuntu@ubuntu20:~$ ./cop/scripts/isoformSummary.pl
isoformSummary.pl [<alias> <filename>]+

ubuntu@ubuntu20:~$ ./cop/scripts/isoformSummary.pl sampleA sampleA.sel.xls sampleB sampleB.sel.xls > summary.xls

ubuntu@ubuntu20:~$ wc -l summary.xls
129168 summary.xls

The isoformSummary.pl script simply does two things:

Collect alignment strings in source tables (made by special.ReadASEvent) of the same isoforms. For example, two long reads hitting the same gene locus with alignment strings (+50[1])-([2])-([3])-([3^1]) and (+50[1])-([2])-([3])-([3^1]+50) are considered from the same isoform.
Collect read counts of isoforms from source tables and summarize them into the output.

So the headers of summary.xls are

ubuntu@ubuntu20:~$ head -1 summary.xls | perl -ne 'chomp; @t=split(/\t/); for($i=0;$i<@t;$i++){ print "$i\t$t[$i]\n" }'
0       #representative
1       repLength
2       measurement
3       recall
4       fitModels
5       IR
6       altD
7       altA
8       casseteE
9       crypticI
10      ES
11      altTSE
12      altTEE
13      combination
14      eventCnt
15      alignment2representative
16      sampleA
17      sampleB

where the last two columns are read counts from sampleA and sampleB.