GFF3 handling, Araport11 AtRTD2 merge - wdlingit/cop GitHub Wiki

This document describes my process of dealing with Araport11 and AtRTD2 annotation files. All based on 2016 versions. Refer the environment setting if needed.

Parse AtRTD2 GTF file

Download the AtRTD2 GTF file and some short examination on types, attributes, and chromosomes.

ubuntu@ubuntu:~$ wget https://ics.hutton.ac.uk/atRTD/RTD2/AtRTDv2_QUASI_19April2016.gtf

ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gtf | perl -ne '@t=split(/\t/); $hash{$t[2]}++; if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
exon    567987

ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gtf | perl -ne 'chomp; @t=split(/\t/,$_); @s=$t[8]=~/(\S+)/g; for $x (@s){ $hash{$x}=1 if not $x=~/"/ } if(eof){ for $k (sort keys %hash){ print "$k\n" } }'
Note
gene_id
transcript_id

ubuntu@ubuntu:~$ wc -l AtRTDv2_QUASI_19April2016.gtf
567987 AtRTDv2_QUASI_19April2016.gtf
ubuntu@ubuntu:~$ grep Note AtRTDv2_QUASI_19April2016.gtf | wc -l
195557
ubuntu@ubuntu:~$ grep gene_id AtRTDv2_QUASI_19April2016.gtf | wc -l
567987
ubuntu@ubuntu:~$ grep transcript_id AtRTDv2_QUASI_19April2016.gtf | wc -l
567987

ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gtf | perl -ne '@t=split(/\t/); $hash{$t[0]}++; if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
Chr1    150952
Chr2    87993
Chr3    109892
Chr4    87021
Chr5    131811
ChrC    157
ChrM    161

It was found that every line in this GTF file is an exon record, and every line contais gene_id and transcript_id attributes. The following perl one-liner was used to transfer the GTF file into a GFF3 file AtRTDv2_QUASI_19April2016.gff3 simply by translating the exon records and adding gene/transcript records.

ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gtf | perl -ne 'chomp; @t=split(/\t/); next if $t[2] ne "exon"; ($geneID)=$t[8]=~/gene_id "(.+?)"/; ($transID)=$t[8]=~/transcript_id "(.+?)"/; push @{$additive{$geneID}},$t[3],$t[4]; push @{$additive{$transID}},$t[3],$t[4]; $parent{$transID}=$geneID; @{$info{$geneID}}=($t[0],$t[6],"gene"); @{$info{$transID}}=($t[0],$t[6],"transcript"); print join("\t",@t[0..7])."\tParent=$transID\n"; if(eof){ for $k (sort keys %additive){ @p=sort {$a<=>$b} @{$additive{$k}}; print "$info{$k}[0]\tunknown\t$info{$k}[2]\t$p[0]\t$p[-1]\t.\t$info{$k}[1]\t.\t"; print "ID=$k"; print ";Parent=$parent{$k}" if exists $parent{$k}; print "\n" } }' > AtRTDv2_QUASI_19April2016.gff3

Some minor check on the GFF3 file. Note that the .features file contains just a simple hierarchy of gene-transcript-exon.

ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I AtRTDv2_QUASI_19April2016.gff3
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): AtRTDv2_QUASI_19April2016.gff3
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gff3.features
GffRoot
  gene*
	transcript*
	  exon

Extracting intervals of gene-level objects and transcript-level objects. They were saved in files AtRTD2.cgff and AtRTD2.model, respectively.

ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 AtRTDv2_QUASI_19April2016.gff3 -GRE gene transcript exon -O AtRTD2 -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): AtRTDv2_QUASI_19April2016.gff3
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, transcript, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): AtRTD2

Parse Araport11 GFF3 file

The following command parses the structure of the Araport11 GFF3 file and save it in the .features file. It can be found that records were assigned with different features.

ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

ubuntu@ubuntu:~$ cat Araport11_GFF3_genes_transposons.201606.gff.features
GffRoot
  gene*
	mRNA*
	  five_prime_UTR*
	  exon*
	  CDS*
	  three_prime_UTR*
	  protein*
	lnc_RNA*
	  exon*
	antisense_lncRNA*
	  exon*
	miRNA_primary_transcript*
	  miRNA*
	transcript_region*
	  exon*
	tRNA*
	  exon*
	antisense_RNA*
	  exon*
	snoRNA*
	  exon*
	ncRNA*
	  exon*
	snRNA*
	  exon*
	uORF*
	rRNA*
	  exon*
  transposable_element*
	transposon_fragment*
  pseudogene*
	pseudogenic_transcript*
	  pseudogenic_exon*
	pseudogenic_tRNA*
	  pseudogenic_exon*
  transposable_element_gene*
	mRNA*
	  exon*

Applied the same command as previously described for extracting intervals of gene-level objects and transcript-level objects.

ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.201606.gff -O Araport11 -IP true -GRE gene mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA five_prime_UTR:exon:CDS:three_prime_UTR -GRE gene miRNA_primary_transcript miRNA_primary_transcript -GRE pseudogene pseudogenic_transcript:pseudogenic_tRNA pseudogenic_exon -GRE transposable_element_gene mRNA exon
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.201606.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA, five_prime_UTR:exon:CDS:three_prime_UTR], [gene, miRNA_primary_transcript, miRNA_primary_transcript], [pseudogene, pseudogenic_transcript:pseudogenic_tRNA, pseudogenic_exon], [transposable_element_gene, mRNA, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): Araport11

Comparing Araport11 vs AtRTD2

Numbers of gene-level objects.

ubuntu@ubuntu:~$ grep ">" AtRTD2.cgff | wc -l
34212
ubuntu@ubuntu:~$ grep ">" Araport11.cgff | wc -l
38194

Numbers of transcript-level objects.

ubuntu@ubuntu:~$ grep ">" AtRTD2.model | wc -l
82190
ubuntu@ubuntu:~$ grep ">" Araport11.model | wc -l
59777

Applied the same command as previously described for extracting feature definitions of gene-level objects of Araport11.

ubuntu@ubuntu:~$ cat second.features
mRNA
miRNA_primary_transcript
tRNA
antisense_lncRNA
snoRNA
lnc_RNA
antisense_RNA
transcript_region
snRNA
ncRNA
rRNA
pseudogenic_transcript
pseudogenic_tRNA

ubuntu@ubuntu:~$ cat Araport11_GFF3_genes_transposons.201606.gff | perl -ne 'if($.==1){ open(FILE,"<second.features"); while($line=<FILE>){ chomp $line; $hash{$line}=1 } close FILE; } chomp; @t=split(/\t/); if(exists $hash{$t[2]}){ $t[8].=";"; $t[8]=~/Parent=(.+?);/ || $t[8]=~/Derives_from=(.+?);/; $feature{$1}{$t[2]}=1 } if($t[2] eq "transposable_element_gene"){ $t[8].=";"; $t[8]=~/ID=(.+?);/; $teHash{$1}=1 } if(eof){ for $k (sort keys %feature){ @features=keys %{$feature{$k}}; if(exists $teHash{$k}){ print "$k\ttransposable_element_gene\n" }else{ print "$k\t@features\n" } } }' > Araport11.gene.features

By attaching feature definitions from Araport11 to genes in AtRTD2 GFF3, diff command and a perl one-liner was applied to see what kinds of genes are exclusively in Araport11 but not AtRTD2.

ubuntu@ubuntu:~$ cat AtRTDv2_QUASI_19April2016.gff3 | perl -ne 'if($.==1){ open(FILE,"<Araport11.gene.features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $type{$s[0]}=$s[1]; } close FILE } chomp; @t=split(/\t/); if($t[2] eq "gene"){ $t[8].=";"; ($id)=$t[8]=~/ID=(.+?);/; print "$id\t$type{$id}\n" }' | sort > AtRTD2.gene.features

ubuntu@ubuntu:~$ diff Araport11.gene.features AtRTD2.gene.features | perl -ne 'chomp; @t=split; if($t[0] eq "<"){ $hash{$t[2]}++ } if(eof){ for $k (sort keys %hash){ print "$k\t$hash{$k}\n" } }'
antisense_RNA   75
antisense_lncRNA        820
lnc_RNA 2440
mRNA    89
miRNA_primary_transcript        154
ncRNA   55
pseudogenic_tRNA        17
pseudogenic_transcript  7
snRNA   69
snoRNA  217
tRNA    11
transcript_region       508
transposable_element_gene       19

The following commands were used to extract a list of Araport11-exclusive coding genes and a list of AtRTD2-exclusive genes.

ubuntu@ubuntu:~$ diff Araport11.gene.features AtRTD2.gene.features | perl -ne 'chomp; @t=split; print "$_\n" if ($t[0] eq "<") && ($t[2] eq "mRNA")' > Araport11Only.codingGenes.txt

ubuntu@ubuntu:~$ diff Araport11.gene.features AtRTD2.gene.features | perl -ne 'chomp; @t=split; print "$_\n" if ($t[0] eq ">") && ($t[2] eq "")' > AtRTD2Only.genes.txt

ubuntu@ubuntu:~$ wc -l Araport11Only.codingGenes.txt AtRTD2Only.genes.txt
  89 Araport11Only.codingGenes.txt
 499 AtRTD2Only.genes.txt
 588 total

Merge Araport11 and AtRTD2 GFF3 files

The two files Araport11.model and AtRTD2.model are containing intervals (of exon-level objects) of transcript-level objects. The perl one-liner was applied to extract their gene-level parents so that we used model2GFF3.pl to merge and translate the two .model files into a GFF3 file that contains definitions from Araport11 and AtRTD2.

ubuntu@ubuntu:~$ ls *.model
Araport11.model  AtRTD2.model

ubuntu@ubuntu:~$ cat *.model  | perl -ne 'chomp; if(/^>(\S+)/){ @t=split; print "$1\t$t[5]\n" }' | perl -ne '$hash{$_}=1; if(eof){ for $k (sort keys %hash){ print $k} }' > rnaGene.merged

ubuntu@ubuntu:~$ cat Araport11.model AtRTD2.model | model2GFF3.pl /dev/stdin rnaGene.merged > Araport11-AtRTD2.fake.gff3

Some simple parse fo checking. Note that the .features file contains just a simple hierarchy of gene-transcript-exon. This is because it was translated from the two .model files that contains only interval information.

ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11-AtRTD2.fake.gff3
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11-AtRTD2.fake.gff3
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

ubuntu@ubuntu:~$ cat Araport11-AtRTD2.fake.gff3.features
GffRoot
  gene*
	transcript*
	  exon

It could be desirable to restore features to the merged GFF3 at some certain level. The following perl one-liner was applied to restore features to transcript-level objects. Note that the feature structure is not exactly the same as the one of Araport11 GFF3.

ubuntu@ubuntu:~$ cat Araport11-AtRTD2.fake.gff3 | perl -ne 'if($.==1){ open(FILE,"<Araport11.gene.features"); while($line=<FILE>){ chomp $line; @s=split(/\s+/,$line); $hash{$s[0]}=$s[1] } close FILE } chomp; @t=split(/\t/); $t[8].=";" if not $t[8]=~/;$/; $t[8]=~/Parent=(.+?);/; if(exists $hash{$1}){ $t[2]=$hash{$1} } print join("\t",@t)."\n"' > Araport11-AtRTD2.restored.gff3

ubuntu@ubuntu:~$ java -classpath /home/ubuntu/Tools/rackJ/rackj.jar misc.GffTree -I Araport11-AtRTD2.restored.gff3
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11-AtRTD2.restored.gff3
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

ubuntu@ubuntu:~$ cat Araport11-AtRTD2.restored.gff3.features
GffRoot
  gene*
	mRNA*
	  exon
	miRNA_primary_transcript*
	  exon
	transcript*
	  exon
	tRNA*
	  exon
	antisense_lncRNA*
	  exon
	pseudogenic_transcript*
	  exon
	transposable_element_gene*
	  exon
	snoRNA*
	  exon
	lnc_RNA*
	  exon
	antisense_RNA*
	  exon
	transcript_region*
	  exon
	snRNA*
	  exon
	ncRNA*
	  exon
	pseudogenic_tRNA*
	  exon
	rRNA*
	  exon
⚠️ **GitHub.com Fallback** ⚠️