GFF3 handling, Araport11 AtRTD2 merge - wdlingit/cop GitHub Wiki
This document describes my process of dealing with Araport11 and AtRTD2 annotation files. All based on 2016 versions. Refer the environment setting if needed.
Download the AtRTD2 GTF file and some short examination on types, attributes, and chromosomes.
ubuntu@ubuntu:~$ wget https://ics.hutton.ac.uk/atRTD/RTD2/AtRTDv2_QUASI_19April2016.gtf
ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gtf | perl -ne '@t=split(/\t/); $hash{$t[2]}++; if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
exon 567987
ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gtf | perl -ne 'chomp; @t=split(/\t/,$_); @s=$t[8]=~/(\S+)/g; for $x (@s){ $hash{$x}=1 if not $x=~/"/ } if(eof){ for $k (sort keys %hash){ print "$k\n" } }'
Note
gene_id
transcript_id
ubuntu@ubuntu:~$ wc -l AtRTDv2_QUASI_19April2016.gtf
567987 AtRTDv2_QUASI_19April2016.gtf
ubuntu@ubuntu:~$ grep Note AtRTDv2_QUASI_19April2016.gtf | wc -l
195557
ubuntu@ubuntu:~$ grep gene_id AtRTDv2_QUASI_19April2016.gtf | wc -l
567987
ubuntu@ubuntu:~$ grep transcript_id AtRTDv2_QUASI_19April2016.gtf | wc -l
567987
ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gtf | perl -ne '@t=split(/\t/); $hash{$t[0]}++; if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
Chr1 150952
Chr2 87993
Chr3 109892
Chr4 87021
Chr5 131811
ChrC 157
ChrM 161
It was found that every line in this GTF file is an exon
record, and every line contais gene_id
and transcript_id
attributes. The following perl one-liner was used to transfer the GTF file into a GFF3 file AtRTDv2_QUASI_19April2016.gff3
simply by translating the exon records and adding gene/transcript records.
ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gtf | perl -ne 'chomp; @t=split(/\t/); next if $t[2] ne "exon"; ($geneID)=$t[8]=~/gene_id "(.+?)"/; ($transID)=$t[8]=~/transcript_id "(.+?)"/; push @{$additive{$geneID}},$t[3],$t[4]; push @{$additive{$transID}},$t[3],$t[4]; $parent{$transID}=$geneID; @{$info{$geneID}}=($t[0],$t[6],"gene"); @{$info{$transID}}=($t[0],$t[6],"transcript"); print join("\t",@t[0..7])."\tParent=$transID\n"; if(eof){ for $k (sort keys %additive){ @p=sort {$a<=>$b} @{$additive{$k}}; print "$info{$k}[0]\tunknown\t$info{$k}[2]\t$p[0]\t$p[-1]\t.\t$info{$k}[1]\t.\t"; print "ID=$k"; print ";Parent=$parent{$k}" if exists $parent{$k}; print "\n" } }' > AtRTDv2_QUASI_19April2016.gff3
Some minor check on the GFF3 file. Note that the .features
file contains just a simple hierarchy of gene
-transcript
-exon
.
ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I AtRTDv2_QUASI_19April2016.gff3
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): AtRTDv2_QUASI_19April2016.gff3
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gff3.features
GffRoot
gene*
transcript*
exon
Extracting intervals of gene-level objects and transcript-level objects. They were saved in files AtRTD2.cgff
and AtRTD2.model
, respectively.
ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 AtRTDv2_QUASI_19April2016.gff3 -GRE gene transcript exon -O AtRTD2 -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): AtRTDv2_QUASI_19April2016.gff3
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, transcript, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): AtRTD2
The following command parses the structure of the Araport11 GFF3 file and save it in the .features
file. It can be found that records were assigned with different features.
ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
ubuntu@ubuntu:~$ cat Araport11_GFF3_genes_transposons.201606.gff.features
GffRoot
gene*
mRNA*
five_prime_UTR*
exon*
CDS*
three_prime_UTR*
protein*
lnc_RNA*
exon*
antisense_lncRNA*
exon*
miRNA_primary_transcript*
miRNA*
transcript_region*
exon*
tRNA*
exon*
antisense_RNA*
exon*
snoRNA*
exon*
ncRNA*
exon*
snRNA*
exon*
uORF*
rRNA*
exon*
transposable_element*
transposon_fragment*
pseudogene*
pseudogenic_transcript*
pseudogenic_exon*
pseudogenic_tRNA*
pseudogenic_exon*
transposable_element_gene*
mRNA*
exon*
Applied the same command as previously described for extracting intervals of gene-level objects and transcript-level objects.
ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.201606.gff -O Araport11 -IP true -GRE gene mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA five_prime_UTR:exon:CDS:three_prime_UTR -GRE gene miRNA_primary_transcript miRNA_primary_transcript -GRE pseudogene pseudogenic_transcript:pseudogenic_tRNA pseudogenic_exon -GRE transposable_element_gene mRNA exon
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA, five_prime_UTR:exon:CDS:three_prime_UTR], [gene, miRNA_primary_transcript, miRNA_primary_transcript], [pseudogene, pseudogenic_transcript:pseudogenic_tRNA, pseudogenic_exon], [transposable_element_gene, mRNA, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): Araport11
Numbers of gene-level objects.
ubuntu@ubuntu:~$ grep ">" AtRTD2.cgff | wc -l
34212
ubuntu@ubuntu:~$ grep ">" Araport11.cgff | wc -l
38194
Numbers of transcript-level objects.
ubuntu@ubuntu:~$ grep ">" AtRTD2.model | wc -l
82190
ubuntu@ubuntu:~$ grep ">" Araport11.model | wc -l
59777
Applied the same command as previously described for extracting feature definitions of gene-level objects of Araport11.
ubuntu@ubuntu:~$ cat second.features
mRNA
miRNA_primary_transcript
tRNA
antisense_lncRNA
snoRNA
lnc_RNA
antisense_RNA
transcript_region
snRNA
ncRNA
rRNA
pseudogenic_transcript
pseudogenic_tRNA
ubuntu@ubuntu:~$ cat Araport11_GFF3_genes_transposons.201606.gff | perl -ne 'if($.==1){ open(FILE,"<second.features"); while($line=<FILE>){ chomp $line; $hash{$line}=1 } close FILE; } chomp; @t=split(/\t/); if(exists $hash{$t[2]}){ $t[8].=";"; $t[8]=~/Parent=(.+?);/ || $t[8]=~/Derives_from=(.+?);/; $feature{$1}{$t[2]}=1 } if($t[2] eq "transposable_element_gene"){ $t[8].=";"; $t[8]=~/ID=(.+?);/; $teHash{$1}=1 } if(eof){ for $k (sort keys %feature){ @features=keys %{$feature{$k}}; if(exists $teHash{$k}){ print "$k\ttransposable_element_gene\n" }else{ print "$k\t@features\n" } } }' > Araport11.gene.features
By attaching feature definitions from Araport11 to genes in AtRTD2 GFF3, diff
command and a perl one-liner was applied to see what kinds of genes are exclusively in Araport11 but not AtRTD2.
ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gff3 | perl -ne 'if($.==1){ open(FILE,"<Araport11.gene.features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $type{$s[0]}=$s[1]; } close FILE } chomp; @t=split(/\t/); if($t[2] eq "gene"){ $t[8].=";"; ($id)=$t[8]=~/ID=(.+?);/; print "$id\t$type{$id}\n" }' | sort > AtRTD2.gene.features
ubuntu@ubuntu:~$ diff Araport11.gene.features AtRTD2.gene.features | perl -ne 'chomp; @t=split; if($t[0] eq "<"){ $hash{$t[2]}++ } if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
antisense_RNA 75
antisense_lncRNA 820
lnc_RNA 2440
mRNA 89
miRNA_primary_transcript 154
ncRNA 55
pseudogenic_tRNA 17
pseudogenic_transcript 7
snRNA 69
snoRNA 217
tRNA 11
transcript_region 508
transposable_element_gene 19
The following commands were used to extract a list of Araport11-exclusive coding genes and a list of AtRTD2-exclusive genes.
ubuntu@ubuntu:~$ diff Araport11.gene.features AtRTD2.gene.features | perl -ne 'chomp; @t=split; print "$_\n" if ($t[0] eq "<") && ($t[2] eq "mRNA")' > Araport11Only.codingGenes.txt
ubuntu@ubuntu:~$ diff Araport11.gene.features AtRTD2.gene.features | perl -ne 'chomp; @t=split; print "$_\n" if ($t[0] eq ">") && ($t[2] eq "")' > AtRTD2Only.genes.txt
ubuntu@ubuntu:~$ wc -l Araport11Only.codingGenes.txt AtRTD2Only.genes.txt
89 Araport11Only.codingGenes.txt
499 AtRTD2Only.genes.txt
588 total
The two files Araport11.model
and AtRTD2.model
are containing intervals (of exon-level objects) of transcript-level objects. The perl one-liner was applied to extract their gene-level parents so that we used model2GFF3.pl
to merge and translate the two .model
files into a GFF3 file that contains definitions from Araport11 and AtRTD2.
ubuntu@ubuntu:~$ ls *.model
Araport11.model AtRTD2.model
ubuntu@ubuntu:~$ cat *.model | perl -ne 'chomp; if(/^>(\S+)/){ @t=split; print "$1\t$t[5]\n" }' | perl -ne '$hash{$_}=1; if(eof){ for $k (sort keys %hash){ print $k} }' > rnaGene.merged
ubuntu@ubuntu:~$ cat Araport11.model AtRTD2.model | model2GFF3.pl /dev/stdin rnaGene.merged > Araport11-AtRTD2.fake.gff3
Some simple parse fo checking. Note that the .features
file contains just a simple hierarchy of gene
-transcript
-exon
. This is because it was translated from the two .model
files that contains only interval information.
ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11-AtRTD2.fake.gff3
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11-AtRTD2.fake.gff3
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
ubuntu@ubuntu:~$ cat Araport11-AtRTD2.fake.gff3.features
GffRoot
gene*
transcript*
exon
It could be desirable to restore features to the merged GFF3 at some certain level. The following perl one-liner was applied to restore features to transcript-level objects. Note that the feature structure is not exactly the same as the one of Araport11 GFF3.
ubuntu@ubuntu:~$ cat Araport11-AtRTD2.fake.gff3 | perl -ne 'if($.==1){ open(FILE,"<Araport11.gene.features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $hash{$s[0]}=$s[1] } close FILE } chomp; @t=split(/\t/); $t[8].=";" if not $t[8]=~/;$/; $t[8]=~/Parent=(.+?);/; if(exists $hash{$1}){ $t[2]=$hash{$1} } print join("\t",@t)."\n"' > Araport11-AtRTD2.restored.gff3
ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11-AtRTD2.restored.gff3
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11-AtRTD2.restored.gff3
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
ubuntu@ubuntu:~$ cat Araport11-AtRTD2.restored.gff3.features
GffRoot
gene*
mRNA*
exon
miRNA_primary_transcript*
exon
transcript*
exon
tRNA*
exon
antisense_lncRNA*
exon
pseudogenic_transcript*
exon
transposable_element_gene*
exon
snoRNA*
exon
lnc_RNA*
exon
antisense_RNA*
exon
transcript_region*
exon
snRNA*
exon
ncRNA*
exon
pseudogenic_tRNA*
exon
rRNA*
exon