GFF3 handling, Araport11 example - wdlingit/cop GitHub Wiki
This document describes my process of handling Araport11 GFF3 file at date 20230223. Refer the environment setting if needed.
ubuntu@ubuntu20:~$ date
Thu Feb 23 03:27:39 UTC 2023
ubuntu@ubuntu20:~$ wget https://www.arabidopsis.org/download_files/Genes/Araport11_genome_release/Araport11_GFF3_genes_transposons.current.gff.gz
ubuntu@ubuntu20:~$ gzip -d Araport11_GFF3_genes_transposons.current.gff.gz
Applied misc.GffTree
to check and parse the GFF3 file.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.current.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.current.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
error line: of Maize Ac (hAT-element) (Zea mays)(source:TAIR10);Note=transposable element gene;symbol=None;curator_summary=None
java.util.NoSuchElementException
at java.base/java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
at misc.GffTree.parseGFF(GffTree.java:55)
at misc.GffTree.<init>(GffTree.java:23)
at misc.GffTree.main(GffTree.java:321)
It says that misc.GffTree
cannot correctly parse the line "of Maize Ac (hAT-element) (Zea mays)(source:TAIR10);Note=transposable element gene;symbol=None;curator_summary=None". Actually there are some more lines with similar mistakes if we extract lines without first token matching regular expression /Chr./
.
ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.current.gff | perl -ne 'next if /^#/; @t=split; print if $t[0]!~/Chr./' | less
of Maize Ac (hAT-element) (Zea mays)(source:TAIR10);Note=transposable element gene;symbol=None;curator_summary=None
of Maize Ac (hAT-element) (Zea mays)(source:TAIR10)
of Maize Ac (hAT-element) (Zea mays)(source:TAIR10)
of Maize Ac (hAT-element) (Zea mays)(source:TAIR10)
After some manual fix using a text editor. The GFF3 file seems OK and we got its feature tree.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.current.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.current.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.current.gff.features
GffRoot
gene*
mRNA*
CDS*
exon*
five_prime_UTR*
protein*
three_prime_UTR*
miRNA_primary_transcript*
miRNA*
tRNA*
exon*
antisense_lncRNA*
exon*
snoRNA*
exon*
lnc_RNA*
exon*
antisense_RNA*
exon*
transcript_region*
exon*
snRNA*
exon*
ncRNA*
exon*
uORF*
rRNA*
exon*
pseudogene*
pseudogenic_transcript*
pseudogenic_exon*
pseudogenic_tRNA*
pseudogenic_exon*
transposable_element_gene*
mRNA*
exon*
transposable_element*
transposon_fragment*
Next, composed a file named Araport11.subFeatures
that lists parent-child relationships to be checked for interval consistency.
ubuntu@ubuntu20:~$ cat Araport11.subFeatures
gene mRNA:miRNA_primary_transcript:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:uORF:rRNA
pseudogene pseudogenic_transcript:pseudogenic_tRNA
transposable_element_gene mRNA
transposable_element transposon_fragment
mRNA CDS:exon:five_prime_UTR:three_prime_UTR
miRNA_primary_transcript miRNA
tRNA exon
antisense_lncRNA exon
snoRNA exon
lnc_RNA exon
antisense_RNA exon
transcript_region exon
snRNA exon
ncRNA exon
uORF uORF
rRNA exon
pseudogenic_transcript pseudogenic_exon
pseudogenic_tRNA pseudogenic_exon
transposable_element transposon_fragment
The following perl one-liner was used to iterate Araport11.subFeatures
and to call GFF3checker.pl
for checking interval consistency of every listed parent-child relationship.
ubuntu@ubuntu20:~$ cat Araport11.subFeatures | perl -ne 'chomp; @t=split; next if $t[0] eq $t[1]; $cmd="cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature $t[0] -expanding $t[1] | head -3"; $msg=`$cmd`; chomp $msg; print "\nCMD: $cmd\n$msg\n" if length($msg)>0'
CMD: cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature gene -expanding mRNA:miRNA_primary_transcript:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:uORF:rRNA | head -3
non-expanding: AT2G07845 : (9664200,9666686) => (9665550,9666686)
non-expanding: AT2G07850 : (9732464,9733623) => (9732473,9733328)
non-expanding: AT2G08345 : (12639500,12640959) => (12640366,12640959)
CMD: cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature pseudogene -expanding pseudogenic_transcript:pseudogenic_tRNA | head -3
non-expanding: ATMG00095 : (20041,20134) => (,)
non-expanding: ATMG00215 : (184459,184760) => (,)
non-expanding: ATMG00685 : (346567,346660) => (,)
CMD: cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature miRNA_primary_transcript -expanding miRNA | head -3
non-expanding: AT1G01046.1 : (28500,28706) => (28635,28655)
non-expanding: AT1G01183.1 : (78927,79037) => (78932,79030)
non-expanding: AT1G04027.1 : (234006,234159) => (234017,234146)
CMD: cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature tRNA -expanding exon | head -3
non-expanding: ATMG00030.1 : (149673,149743) => (,)
non-expanding: ATMG00460.1_tRNA : (250307,250380) => (,)
The first section of the report indicated some interval inconsistency between geen records and its children. These were left untouched as they seem minor. The second and the fourth sections of the report means some pseudogene/tRNA records have no corresponding child records. These were temporarily fixed by adding corresponding fake records as they are all short enough. The third section of the report says that we got interval inconsistency for miRNA_primary_transcript-miRNA. By some checking, it was found that miRNA_primary_transcript records were rather like miRNA precursors and miRNA records for matured miRNA. For example, the two interval of "AT1G01046.1 : (28500,28706) => (28635,28655)" is just like the following picture. This means that we better pick correct regions for read counting if we are doing NGS data analysis.
Accordingly, some manual corrections were applied.
ubuntu@ubuntu20:~$ diff Araport11_GFF3_genes_transposons.current.gff Araport11_GFF3_genes_transposons.fixed.gff
927188a927189
> ChrM Araport11 exon 149673 149743 . + . Parent=ATMG00030.1
927201a927203,927204
> ChrM Araport11 pseudogenic_transcript 20041 20134 . + . ID=ATMG00095.1;Parent=ATMG00095
> ChrM Araport11 pseudogenic_exon 20041 20134 . + . Parent=ATMG00095.1
927226a927230,927231
> ChrM Araport11 pseudogenic_transcript 184459 184760 . - . ID=ATMG00215.1;Parent=ATMG00215
> ChrM Araport11 pseudogenic_exon 184459 184760 . - . Parent=ATMG00215.1
927261c927266,927267
< ChrM Araport11 tRNA 250307 250380 . + . ID=ATMG00460.1_tRNA;Parent=ATMG00460;Name=ATMG00460.1;id2=id-tRNAI(CAT);parent2=gene-tRNAI(CAT);NEWNote=None;exception=RNA editing;gbkey=tRNA;gene=tRNAI(CAT);product=tRNA-Ile;symbol=TRNI;computational_description=pre-tRNA tRNA-Ile (anticodon: CAT)(source:Araport11, TAIR10);curator_summary=tRNA-Ile
---
> ChrM Araport11 tRNA 250307 250380 . + . ID=ATMG00460.1;Parent=ATMG00460;Name=ATMG00460.1;id2=id-tRNAI(CAT);parent2=gene-tRNAI(CAT);NEWNote=None;exception=RNA editing;gbkey=tRNA;gene=tRNAI(CAT);product=tRNA-Ile;symbol=TRNI;computational_description=pre-tRNA tRNA-Ile (anticodon: CAT)(source:Araport11, TAIR10);curator_summary=tRNA-Ile
> ChrM Araport11 exon 250307 250380 . + . Parent=ATMG00460.1
927302a927309,927310
> ChrM Araport11 pseudogenic_transcript 346567 346660 . + . ID=ATMG00685.1;Parent=ATMG00685
> ChrM Araport11 pseudogenic_exon 346567 346660 . + . Parent=ATMG00685.1
Redid the interval consistency checking again with updated Araport11.subFeatures
and got only report for gene records.
ubuntu@ubuntu20:~$ cat Araport11.subFeatures
gene mRNA:miRNA_primary_transcript:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:uORF:rRNA
pseudogene pseudogenic_transcript:pseudogenic_tRNA
transposable_element_gene mRNA
transposable_element transposon_fragment
mRNA CDS:exon:five_prime_UTR:three_prime_UTR
miRNA_primary_transcript miRNA_primary_transcript
tRNA exon
antisense_lncRNA exon
snoRNA exon
lnc_RNA exon
antisense_RNA exon
transcript_region exon
snRNA exon
ncRNA exon
uORF uORF
rRNA exon
pseudogenic_transcript pseudogenic_exon
pseudogenic_tRNA pseudogenic_exon
transposable_element transposon_fragment
ubuntu@ubuntu20:~$ cat Araport11.subFeatures | perl -ne 'chomp; @t=split; next if $t[0] eq $t[1]; $cmd="cat Araport11_GFF3_genes_transposons.fixed.gff | GFF3checker.pl -feature $t[0] -expanding $t[1] | head -3"; $msg=`$cmd`; chomp $msg; print "\nCMD: $cmd\n$msg\n" if length($msg)>0'
CMD: cat Araport11_GFF3_genes_transposons.fixed.gff | GFF3checker.pl -feature gene -expanding mRNA:miRNA_primary_transcript:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:uORF:rRNA | head -3
non-expanding: AT2G07845 : (9664200,9666686) => (9665550,9666686)
non-expanding: AT2G07850 : (9732464,9733623) => (9732473,9733328)
non-expanding: AT2G08345 : (12639500,12640959) => (12640366,12640959)
Since there are gene records that might be mRNA, rRNA, lncRNA... etc (at second level of the feature tree). It is desirable to classify gene categories following the GFF3 file. Firstly, second-level features were listed in second.features
.
ubuntu@ubuntu20:~$ cat second.features
mRNA
miRNA_primary_transcript
tRNA
antisense_lncRNA
snoRNA
lnc_RNA
antisense_RNA
transcript_region
snRNA
ncRNA
uORF
rRNA
pseudogenic_transcript
pseudogenic_tRNA
One direct way to do this is to look these records of second-level features and assign these features to parents. However, it was found that some genes were associated with two kinds of second-level features.
ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<second.features"); while($line=<FILE>){ chomp $line; $hash{$line}=1 } close FILE } chomp; @t=split(/\t/); if(exists $hash{$t[2]}){ $t[8].=";"; $t[8]=~/Parent=(.+?);/ || $t[8]=~/Derives_from=(.+?);/; $feature{$1}{$t[2]}=1 } if(eof){ for $k (sort keys %feature){ @features=keys %{$feature{$k}}; print "$k\t@features\n" if @features>1 } }' | head
AT1G06150 uORF mRNA
AT1G16860 mRNA uORF
AT1G19850 mRNA uORF
AT1G23150 uORF mRNA
AT1G25470 mRNA uORF
AT1G29950 mRNA uORF
AT1G30270 mRNA uORF
AT1G30330 uORF mRNA
AT1G36730 uORF mRNA
AT1G48600 uORF mRNA
Some further checking showed that a set of genes were assigned with mRNA children and uORF children, and no genes was assigned with only uORF children. Considering that assigning a gene as mRNA gene would be rather meaningful than uORF gene. uORF was removed from second.features
. Also, transposable_element_gene records have mRNA children and we want them keep being classified as transposable_element_gene.
ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<second.features"); while($line=<FILE>){ chomp $line; $hash{$line}=1 } close FILE; } chomp; @t=split(/\t/); if(exists $hash{$t[2]}){ $t[8].=";"; $t[8]=~/Parent=(.+?);/ || $t[8]=~/Derives_from=(.+?);/; $feature{$1}{$t[2]}=1 } if($t[2] eq "transposable_element_gene"){ $t[8].=";"; $t[8]=~/ID=(.+?);/; $teHash{$1}=1 } if(eof){ for $k (sort keys %feature){ @features=keys %{$feature{$k}}; if(exists $teHash{$k}){ print "$k\ttransposable_element_gene\n" }else{ print "$k\t@features\n" } } }' > gene.features
Applied misc.ModelCGFF
for interval coordinate extraction. Note that the -GRE
triple for miRNA_primary_transcript is -GRE gene miRNA_primary_transcript miRNA_primary_transcript
.
ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.fixed.gff -GRE gene mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA five_prime_UTR:exon:CDS:three_prime_UTR -GRE gene miRNA_primary_transcript miRNA_primary_transcript -GRE pseudogene pseudogenic_transcript:pseudogenic_tRNA pseudogenic_exon -GRE transposable_element_gene mRNA exon -O Araport11 -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.fixed.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA, five_prime_UTR:exon:CDS:three_prime_UTR], [gene, miRNA_primary_transcript, miRNA_primary_transcript], [pseudogene, pseudogenic_transcript:pseudogenic_tRNA, pseudogenic_exon], [transposable_element_gene, mRNA, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): Araport11
ubuntu@ubuntu20:~$ head Araport11.cgff
>AT1G01010 Chr1 3631 5899 +
3631 3913
3996 4276
4486 4605
4706 5095
5174 5326
5439 5899
>AT1G01020 Chr1 6788 9130 -
6788 7069
7157 7232
ubuntu@ubuntu20:~$ head Araport11.model
>AT1G01010.1 Chr1 3631 5899 + AT1G01010
3631 3913
3996 4276
4486 4605
4706 5095
5174 5326
5439 5899
>AT1G01020.2 Chr1 6788 8737 - AT1G01020
6788 7069
7157 7450
File features
lists interested features and their level in the feature tree.
ubuntu@ubuntu20:~$ cat features
1 gene
1 pseudogene
1 transposable_element_gene
2 mRNA
2 lnc_RNA
2 antisense_lncRNA
2 miRNA_primary_transcript
2 transcript_region
2 tRNA
2 antisense_RNA
2 snoRNA
2 ncRNA
2 snRNA
2 rRNA
2 pseudogenic_transcript
2 pseudogenic_tRNA
Numbers of attributes associated to level-1 features and level-2 features were observed.
ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $feature{$s[1]}=$s[0]; } close FILE; } chomp; @t=split(/\t/); if($feature{$t[2]}==1){ $t[8].=";" if $t[8]!~/;$/; $tmpstr=$t[8]; while($tmpstr=~/^([^,;=]+?)=(.+)$/){ $key=$1; $hash{$key}++; $tmpstr=$2; if($tmpstr=~/(.+?);\s*([\S^=]+?=.+)/){ $val=$1; $tmpstr=$2 }else{ ($val)=$tmpstr=~/(.+);/; $tmpstr=""; } } } if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
Alias 1
Dbxref 9049
ID 38089
Name 38065
Note 13829
computational_description 13538
curator_summary 8059
description 1069
exception 1
full_name 9854
gbkey 57
gene 57
gene_synonym 8
id2 57
locus 23699
locus_type 37927
nochangenat-description 44
part 1
partial 1
pseudo 3
start_range 1
symbol 10135
ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $feature{$s[1]}=$s[0]; } close FILE; } chomp; @t=split(/\t/); if($feature{$t[2]}==2){ $t[8].=";" if $t[8]!~/;$/; $tmpstr=$t[8]; while($tmpstr=~/^([^,;=]+?)=(.+)$/){ $key=$1; $hash{$key}++; $tmpstr=$2; if($tmpstr=~/(.+?);\s*([\S^=]+?=.+)/){ $val=$1; $tmpstr=$2 }else{ ($val)=$tmpstr=~/(.+);/; $tmpstr=""; } } } if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
Dbxref 31869
ID 59672
NEWNote 3
Name 59666
Note 59652
Parent 59672
UniProt 1
computational_description 59146
conf_class 34463
conf_rating 34148
curator_summary 59392
description 6
exception 2
full_name 17956
gbkey 54
gene 12710
gene_synonym 6
id2 54
index 4901
locus_type 29
parent2 25
part 1
partial 1
product 25
start_range 1
symbol 59360
For this time, it was decided to extract attributes from level-2 features.
ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $feature{$s[1]}=$s[0]; } close FILE; } chomp; @t=split(/\t/); if($feature{$t[2]}==2){ $t[8].=";" if $t[8]!~/;$/; ($id)=$t[8]=~/ID=(.+?);/; $tmpstr=$t[8]; while($tmpstr=~/^([^,;=]+?)=(.+)$/){ $key=$1; $tmpstr=$2; if($tmpstr=~/(.+?);\s*([\S^=]+?=.+)/){ $val=$1; $tmpstr=$2 }else{ ($val)=$tmpstr=~/(.+);/; } $hash{$id}{$key}=$val; } } if(eof STDIN){ print "#transcript\tNote\tcomputational_description\tcurator_summary\n"; for $id (sort keys %hash){ print "$id"; for $k ("Note","computational_description","curator_summary"){ if(exists $hash{$id}{$k}){ print "\t$hash{$id}{$k}" } else{ print "\t" } } print "\n" } }' > Araport11.description.txt
The following process is the my current best practice to fix unprocessable text. Firstly, visually confirm the following code decodes those encoded text, as there might be some unprocessable spots.
ubuntu@ubuntu20:~$ cat Araport11.description.txt | perl -ne 'print if /\%/' | perl -MHTML::Entities -Xne 'while(/(\S+\s+)/g){ $x=$1; $x=~s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; print decode_entities($x) }' | less
Some of them really cannot be processed by the above code. The following perl one-liner identified them (by /[^[:ascii:]]/
) and some manual operations were involved. You might have to use a text editor that can edit in hex mode.
ubuntu@ubuntu20:~$ cp Araport11.description.txt Araport11.description.fix0.txt
ubuntu@ubuntu20:~$ cat Araport11.description.fix0.txt | perl -MHTML::Entities -Xne 'while(/(\S+\s+)/g){ $x=$1; $x=~s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; print decode_entities($x) }' | perl -ne 'print if /[^[:ascii:]]/' | less
After manual corrections on Araport11.description.fix0.txt
, the following perl one-liner was applied to get a final version.
ubuntu@ubuntu20:~$ cat Araport11.description.fix0.txt | perl -MHTML::Entities -Xne 'while(/(\S+\s+)/g){ $x=$1; $x=~s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; print decode_entities($x) }' > Araport11.description.fix1.txt