GFF3 handling, Araport11 example - wdlingit/cop GitHub Wiki

This document describes my process of handling Araport11 GFF3 file at date 20230223. Refer the environment setting if needed.

ubuntu@ubuntu20:~$ date
Thu Feb 23 03:27:39 UTC 2023
ubuntu@ubuntu20:~$ wget https://www.arabidopsis.org/download_files/Genes/Araport11_genome_release/Araport11_GFF3_genes_transposons.current.gff.gz
ubuntu@ubuntu20:~$ gzip -d Araport11_GFF3_genes_transposons.current.gff.gz

Initial parse

Applied misc.GffTree to check and parse the GFF3 file.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.current.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.current.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

error line: of Maize Ac (hAT-element) (Zea mays)(source:TAIR10);Note=transposable element gene;symbol=None;curator_summary=None
java.util.NoSuchElementException
		at java.base/java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
		at misc.GffTree.parseGFF(GffTree.java:55)
		at misc.GffTree.<init>(GffTree.java:23)
		at misc.GffTree.main(GffTree.java:321)

It says that misc.GffTree cannot correctly parse the line "of Maize Ac (hAT-element) (Zea mays)(source:TAIR10);Note=transposable element gene;symbol=None;curator_summary=None". Actually there are some more lines with similar mistakes if we extract lines without first token matching regular expression /Chr./.

ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.current.gff | perl -ne 'next if /^#/; @t=split; print if $t[0]!~/Chr./' | less
of Maize Ac (hAT-element) (Zea mays)(source:TAIR10);Note=transposable element gene;symbol=None;curator_summary=None
of Maize Ac (hAT-element) (Zea mays)(source:TAIR10)
of Maize Ac (hAT-element) (Zea mays)(source:TAIR10)
of Maize Ac (hAT-element) (Zea mays)(source:TAIR10)

After some manual fix using a text editor. The GFF3 file seems OK and we got its feature tree.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.current.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.current.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.current.gff.features
GffRoot
  gene*
	mRNA*
	  CDS*
	  exon*
	  five_prime_UTR*
	  protein*
	  three_prime_UTR*
	miRNA_primary_transcript*
	  miRNA*
	tRNA*
	  exon*
	antisense_lncRNA*
	  exon*
	snoRNA*
	  exon*
	lnc_RNA*
	  exon*
	antisense_RNA*
	  exon*
	transcript_region*
	  exon*
	snRNA*
	  exon*
	ncRNA*
	  exon*
	uORF*
	rRNA*
	  exon*
  pseudogene*
	pseudogenic_transcript*
	  pseudogenic_exon*
	pseudogenic_tRNA*
	  pseudogenic_exon*
  transposable_element_gene*
	mRNA*
	  exon*
  transposable_element*
	transposon_fragment*

Interval consistency

Next, composed a file named Araport11.subFeatures that lists parent-child relationships to be checked for interval consistency.

ubuntu@ubuntu20:~$ cat Araport11.subFeatures
gene    mRNA:miRNA_primary_transcript:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:uORF:rRNA
pseudogene      pseudogenic_transcript:pseudogenic_tRNA
transposable_element_gene       mRNA
transposable_element    transposon_fragment
mRNA    CDS:exon:five_prime_UTR:three_prime_UTR
miRNA_primary_transcript        miRNA
tRNA    exon
antisense_lncRNA        exon
snoRNA  exon
lnc_RNA exon
antisense_RNA   exon
transcript_region       exon
snRNA   exon
ncRNA   exon
uORF    uORF
rRNA    exon
pseudogenic_transcript  pseudogenic_exon
pseudogenic_tRNA        pseudogenic_exon
transposable_element    transposon_fragment

The following perl one-liner was used to iterate Araport11.subFeatures and to call GFF3checker.pl for checking interval consistency of every listed parent-child relationship.

ubuntu@ubuntu20:~$ cat Araport11.subFeatures | perl -ne 'chomp; @t=split; next if $t[0] eq $t[1]; $cmd="cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature $t[0] -expanding $t[1] | head -3"; $msg=`$cmd`; chomp $msg; print "\nCMD: $cmd\n$msg\n" if length($msg)>0'

CMD: cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature gene -expanding mRNA:miRNA_primary_transcript:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:uORF:rRNA | head -3
non-expanding: AT2G07845 : (9664200,9666686) => (9665550,9666686)
non-expanding: AT2G07850 : (9732464,9733623) => (9732473,9733328)
non-expanding: AT2G08345 : (12639500,12640959) => (12640366,12640959)

CMD: cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature pseudogene -expanding pseudogenic_transcript:pseudogenic_tRNA | head -3
non-expanding: ATMG00095 : (20041,20134) => (,)
non-expanding: ATMG00215 : (184459,184760) => (,)
non-expanding: ATMG00685 : (346567,346660) => (,)

CMD: cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature miRNA_primary_transcript -expanding miRNA | head -3
non-expanding: AT1G01046.1 : (28500,28706) => (28635,28655)
non-expanding: AT1G01183.1 : (78927,79037) => (78932,79030)
non-expanding: AT1G04027.1 : (234006,234159) => (234017,234146)

CMD: cat Araport11_GFF3_genes_transposons.current.gff | GFF3checker.pl -feature tRNA -expanding exon | head -3
non-expanding: ATMG00030.1 : (149673,149743) => (,)
non-expanding: ATMG00460.1_tRNA : (250307,250380) => (,)

The first section of the report indicated some interval inconsistency between geen records and its children. These were left untouched as they seem minor. The second and the fourth sections of the report means some pseudogene/tRNA records have no corresponding child records. These were temporarily fixed by adding corresponding fake records as they are all short enough. The third section of the report says that we got interval inconsistency for miRNA_primary_transcript-miRNA. By some checking, it was found that miRNA_primary_transcript records were rather like miRNA precursors and miRNA records for matured miRNA. For example, the two interval of "AT1G01046.1 : (28500,28706) => (28635,28655)" is just like the following picture. This means that we better pick correct regions for read counting if we are doing NGS data analysis.

Araport11_AT1G01046

Accordingly, some manual corrections were applied.

ubuntu@ubuntu20:~$ diff Araport11_GFF3_genes_transposons.current.gff  Araport11_GFF3_genes_transposons.fixed.gff
927188a927189
> ChrM  Araport11       exon    149673  149743  .       +       .       Parent=ATMG00030.1
927201a927203,927204
> ChrM  Araport11       pseudogenic_transcript  20041   20134   .       +       .       ID=ATMG00095.1;Parent=ATMG00095
> ChrM  Araport11       pseudogenic_exon        20041   20134   .       +       .       Parent=ATMG00095.1
927226a927230,927231
> ChrM  Araport11       pseudogenic_transcript  184459  184760  .       -       .       ID=ATMG00215.1;Parent=ATMG00215
> ChrM  Araport11       pseudogenic_exon        184459  184760  .       -       .       Parent=ATMG00215.1
927261c927266,927267
< ChrM  Araport11       tRNA    250307  250380  .       +       .       ID=ATMG00460.1_tRNA;Parent=ATMG00460;Name=ATMG00460.1;id2=id-tRNAI(CAT);parent2=gene-tRNAI(CAT);NEWNote=None;exception=RNA editing;gbkey=tRNA;gene=tRNAI(CAT);product=tRNA-Ile;symbol=TRNI;computational_description=pre-tRNA tRNA-Ile (anticodon: CAT)(source:Araport11, TAIR10);curator_summary=tRNA-Ile
---
> ChrM  Araport11       tRNA    250307  250380  .       +       .       ID=ATMG00460.1;Parent=ATMG00460;Name=ATMG00460.1;id2=id-tRNAI(CAT);parent2=gene-tRNAI(CAT);NEWNote=None;exception=RNA editing;gbkey=tRNA;gene=tRNAI(CAT);product=tRNA-Ile;symbol=TRNI;computational_description=pre-tRNA tRNA-Ile (anticodon: CAT)(source:Araport11, TAIR10);curator_summary=tRNA-Ile
> ChrM  Araport11       exon    250307  250380  .       +       .       Parent=ATMG00460.1
927302a927309,927310
> ChrM  Araport11       pseudogenic_transcript  346567  346660  .       +       .       ID=ATMG00685.1;Parent=ATMG00685
> ChrM  Araport11       pseudogenic_exon        346567  346660  .       +       .       Parent=ATMG00685.1

Redid the interval consistency checking again with updated Araport11.subFeatures and got only report for gene records.

ubuntu@ubuntu20:~$ cat Araport11.subFeatures
gene    mRNA:miRNA_primary_transcript:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:uORF:rRNA
pseudogene      pseudogenic_transcript:pseudogenic_tRNA
transposable_element_gene       mRNA
transposable_element    transposon_fragment
mRNA    CDS:exon:five_prime_UTR:three_prime_UTR
miRNA_primary_transcript        miRNA_primary_transcript
tRNA    exon
antisense_lncRNA        exon
snoRNA  exon
lnc_RNA exon
antisense_RNA   exon
transcript_region       exon
snRNA   exon
ncRNA   exon
uORF    uORF
rRNA    exon
pseudogenic_transcript  pseudogenic_exon
pseudogenic_tRNA        pseudogenic_exon
transposable_element    transposon_fragment

ubuntu@ubuntu20:~$ cat Araport11.subFeatures | perl -ne 'chomp; @t=split; next if $t[0] eq $t[1]; $cmd="cat Araport11_GFF3_genes_transposons.fixed.gff | GFF3checker.pl -feature $t[0] -expanding $t[1] | head -3"; $msg=`$cmd`; chomp $msg; print "\nCMD: $cmd\n$msg\n" if length($msg)>0'

CMD: cat Araport11_GFF3_genes_transposons.fixed.gff | GFF3checker.pl -feature gene -expanding mRNA:miRNA_primary_transcript:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:uORF:rRNA | head -3
non-expanding: AT2G07845 : (9664200,9666686) => (9665550,9666686)
non-expanding: AT2G07850 : (9732464,9733623) => (9732473,9733328)
non-expanding: AT2G08345 : (12639500,12640959) => (12640366,12640959)

Second-level feature extraction

Since there are gene records that might be mRNA, rRNA, lncRNA... etc (at second level of the feature tree). It is desirable to classify gene categories following the GFF3 file. Firstly, second-level features were listed in second.features.

ubuntu@ubuntu20:~$ cat second.features
mRNA
miRNA_primary_transcript
tRNA
antisense_lncRNA
snoRNA
lnc_RNA
antisense_RNA
transcript_region
snRNA
ncRNA
uORF
rRNA
pseudogenic_transcript
pseudogenic_tRNA

One direct way to do this is to look these records of second-level features and assign these features to parents. However, it was found that some genes were associated with two kinds of second-level features.

ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<second.features"); while($line=<FILE>){ chomp $line; $hash{$line}=1 } close FILE } chomp; @t=split(/\t/); if(exists $hash{$t[2]}){ $t[8].=";"; $t[8]=~/Parent=(.+?);/ || $t[8]=~/Derives_from=(.+?);/; $feature{$1}{$t[2]}=1 } if(eof){ for $k (sort keys %feature){ @features=keys %{$feature{$k}}; print "$k\t@features\n" if @features>1 } }' | head
AT1G06150       uORF mRNA
AT1G16860       mRNA uORF
AT1G19850       mRNA uORF
AT1G23150       uORF mRNA
AT1G25470       mRNA uORF
AT1G29950       mRNA uORF
AT1G30270       mRNA uORF
AT1G30330       uORF mRNA
AT1G36730       uORF mRNA
AT1G48600       uORF mRNA

Some further checking showed that a set of genes were assigned with mRNA children and uORF children, and no genes was assigned with only uORF children. Considering that assigning a gene as mRNA gene would be rather meaningful than uORF gene. uORF was removed from second.features. Also, transposable_element_gene records have mRNA children and we want them keep being classified as transposable_element_gene.

ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<second.features"); while($line=<FILE>){ chomp $line; $hash{$line}=1 } close FILE; } chomp; @t=split(/\t/); if(exists $hash{$t[2]}){ $t[8].=";"; $t[8]=~/Parent=(.+?);/ || $t[8]=~/Derives_from=(.+?);/; $feature{$1}{$t[2]}=1 } if($t[2] eq "transposable_element_gene"){ $t[8].=";"; $t[8]=~/ID=(.+?);/; $teHash{$1}=1 } if(eof){ for $k (sort keys %feature){ @features=keys %{$feature{$k}}; if(exists $teHash{$k}){ print "$k\ttransposable_element_gene\n" }else{ print "$k\t@features\n" } } }' > gene.features

Interval extraction

Applied misc.ModelCGFF for interval coordinate extraction. Note that the -GRE triple for miRNA_primary_transcript is -GRE gene miRNA_primary_transcript miRNA_primary_transcript.

ubuntu@ubuntu20:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.fixed.gff -GRE gene mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA five_prime_UTR:exon:CDS:three_prime_UTR -GRE gene miRNA_primary_transcript miRNA_primary_transcript -GRE pseudogene pseudogenic_transcript:pseudogenic_tRNA pseudogenic_exon -GRE transposable_element_gene mRNA exon -O Araport11 -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.fixed.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA, five_prime_UTR:exon:CDS:three_prime_UTR], [gene, miRNA_primary_transcript, miRNA_primary_transcript], [pseudogene, pseudogenic_transcript:pseudogenic_tRNA, pseudogenic_exon], [transposable_element_gene, mRNA, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): Araport11

ubuntu@ubuntu20:~$ head Araport11.cgff
>AT1G01010      Chr1    3631    5899    +
3631    3913
3996    4276
4486    4605
4706    5095
5174    5326
5439    5899
>AT1G01020      Chr1    6788    9130    -
6788    7069
7157    7232
ubuntu@ubuntu20:~$ head Araport11.model
>AT1G01010.1    Chr1    3631    5899    +       AT1G01010
3631    3913
3996    4276
4486    4605
4706    5095
5174    5326
5439    5899
>AT1G01020.2    Chr1    6788    8737    -       AT1G01020
6788    7069
7157    7450

Gene info extraction

File features lists interested features and their level in the feature tree.

ubuntu@ubuntu20:~$ cat features
1       gene
1       pseudogene
1       transposable_element_gene
2       mRNA
2       lnc_RNA
2       antisense_lncRNA
2       miRNA_primary_transcript
2       transcript_region
2       tRNA
2       antisense_RNA
2       snoRNA
2       ncRNA
2       snRNA
2       rRNA
2       pseudogenic_transcript
2       pseudogenic_tRNA

Numbers of attributes associated to level-1 features and level-2 features were observed.

ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $feature{$s[1]}=$s[0]; } close FILE; } chomp; @t=split(/\t/); if($feature{$t[2]}==1){ $t[8].=";" if $t[8]!~/;$/; $tmpstr=$t[8]; while($tmpstr=~/^([^,;=]+?)=(.+)$/){ $key=$1; $hash{$key}++; $tmpstr=$2; if($tmpstr=~/(.+?);\s*([\S^=]+?=.+)/){ $val=$1; $tmpstr=$2 }else{ ($val)=$tmpstr=~/(.+);/; $tmpstr=""; } } } if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
Alias   1
Dbxref  9049
ID      38089
Name    38065
Note    13829
computational_description       13538
curator_summary 8059
description     1069
exception       1
full_name       9854
gbkey   57
gene    57
gene_synonym    8
id2     57
locus   23699
locus_type      37927
nochangenat-description 44
part    1
partial 1
pseudo  3
start_range     1
symbol  10135
ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $feature{$s[1]}=$s[0]; } close FILE; } chomp; @t=split(/\t/); if($feature{$t[2]}==2){ $t[8].=";" if $t[8]!~/;$/; $tmpstr=$t[8]; while($tmpstr=~/^([^,;=]+?)=(.+)$/){ $key=$1; $hash{$key}++; $tmpstr=$2; if($tmpstr=~/(.+?);\s*([\S^=]+?=.+)/){ $val=$1; $tmpstr=$2 }else{ ($val)=$tmpstr=~/(.+);/; $tmpstr=""; } } } if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
Dbxref  31869
ID      59672
NEWNote 3
Name    59666
Note    59652
Parent  59672
UniProt 1
computational_description       59146
conf_class      34463
conf_rating     34148
curator_summary 59392
description     6
exception       2
full_name       17956
gbkey   54
gene    12710
gene_synonym    6
id2     54
index   4901
locus_type      29
parent2 25
part    1
partial 1
product 25
start_range     1
symbol  59360

For this time, it was decided to extract attributes from level-2 features.

ubuntu@ubuntu20:~$ cat Araport11_GFF3_genes_transposons.fixed.gff | perl -ne 'if($.==1){ open(FILE,"<features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $feature{$s[1]}=$s[0]; } close FILE; } chomp; @t=split(/\t/); if($feature{$t[2]}==2){ $t[8].=";" if $t[8]!~/;$/; ($id)=$t[8]=~/ID=(.+?);/; $tmpstr=$t[8]; while($tmpstr=~/^([^,;=]+?)=(.+)$/){ $key=$1; $tmpstr=$2; if($tmpstr=~/(.+?);\s*([\S^=]+?=.+)/){ $val=$1; $tmpstr=$2 }else{ ($val)=$tmpstr=~/(.+);/; } $hash{$id}{$key}=$val; } } if(eof STDIN){ print "#transcript\tNote\tcomputational_description\tcurator_summary\n"; for $id (sort keys %hash){ print "$id"; for $k ("Note","computational_description","curator_summary"){ if(exists $hash{$id}{$k}){ print "\t$hash{$id}{$k}" } else{ print "\t" } } print "\n" } }' > Araport11.description.txt

The following process is the my current best practice to fix unprocessable text. Firstly, visually confirm the following code decodes those encoded text, as there might be some unprocessable spots.

ubuntu@ubuntu20:~$ cat Araport11.description.txt | perl -ne 'print if /\%/' | perl -MHTML::Entities -Xne 'while(/(\S+\s+)/g){ $x=$1; $x=~s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; print decode_entities($x) }' | less

Some of them really cannot be processed by the above code. The following perl one-liner identified them (by /[^[:ascii:]]/) and some manual operations were involved. You might have to use a text editor that can edit in hex mode.

ubuntu@ubuntu20:~$ cp Araport11.description.txt Araport11.description.fix0.txt

ubuntu@ubuntu20:~$ cat Araport11.description.fix0.txt | perl -MHTML::Entities -Xne 'while(/(\S+\s+)/g){ $x=$1; $x=~s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; print decode_entities($x) }' | perl -ne 'print if /[^[:ascii:]]/' | less

After manual corrections on Araport11.description.fix0.txt, the following perl one-liner was applied to get a final version.

ubuntu@ubuntu20:~$ cat Araport11.description.fix0.txt | perl -MHTML::Entities -Xne 'while(/(\S+\s+)/g){ $x=$1; $x=~s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg; print decode_entities($x) }' > Araport11.description.fix1.txt
⚠️ **GitHub.com Fallback** ⚠️