Riboseq multi stage mapping and filtering, part 0 - wdlingit/cop GitHub Wiki
Araport11 (2016 version) and the TAIR10 genome were adopted in this example. For any other species, the key is to get a genome annotation GFF3 file and a genome assembly file, where the annotation and assembly are matching to each other, including sequence (chromosome) names.
The current version of Araport11 genome annotation in the TAIR website is not exactly the same with the original one. To obtain the 2016 version Araport11 genome annotation, it is needed to go to the archive page and download file Araport11_GFF3_genes_transposons.Jun2016.gff.gz
. Its MD5 check sum is 6eb0c6219aa15e73f76937bafc4fd833
.
wdlin@comp02:somewhere$ md5sum Araport11_GFF3_genes_transposons.Jun2016.gff.gz
6eb0c6219aa15e73f76937bafc4fd833 Araport11_GFF3_genes_transposons.Jun2016.gff.gz
wdlin@comp02:somewhere$ gzip -d Araport11_GFF3_genes_transposons.Jun2016.gff.gz
wdlin@comp02:somewhere$ head Araport11_GFF3_genes_transposons.Jun2016.gff | cut -f 1-6
##gff-version 3
Chr1 Araport11 gene 3631 5899 .
Chr1 Araport11 mRNA 3631 5899 .
Chr1 Araport11 five_prime_UTR 3631 3759 .
Chr1 Araport11 exon 3631 3913 .
Chr1 Araport11 CDS 3760 3913 .
Chr1 Araport11 exon 3996 4276 .
Chr1 Araport11 CDS 3996 4276 .
Chr1 Araport11 exon 4486 4605 .
Chr1 Araport11 CDS 4486 4605 .
Note that chromosome names in this Araport11 GFF3 file is in format Chr1
, Chr2
, ..., Chr5
, ChrM
, and ChrC
.
The Araport11 genome annotation is based on the original TARI10 genome assembly. The current version of TAIR10 genome assembly in the TAIR website is actually TAIR10.1
but not TAIR10
(the difference is mitochondria). To obtain original TAIR10 genome assembly for matching the original Araport11 genome annotation, we download the genome assembly from EnsemblPlants.
wdlin@comp02:somewhere$ wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
wdlin@comp02:somewhere$ gzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
wdlin@comp02:somewhere$ grep ">" Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
>1 dna:chromosome chromosome:TAIR10:1:1:30427671:1 REF
>2 dna:chromosome chromosome:TAIR10:2:1:19698289:1 REF
>3 dna:chromosome chromosome:TAIR10:3:1:23459830:1 REF
>4 dna:chromosome chromosome:TAIR10:4:1:18585056:1 REF
>5 dna:chromosome chromosome:TAIR10:5:1:26975502:1 REF
>Mt dna:chromosome chromosome:TAIR10:Mt:1:366924:1 REF
>Pt dna:chromosome chromosome:TAIR10:Pt:1:154478:1 REF
Note that sequence names in the assembly file were not the same as those in the GFF3 file. So we manually modified sequence names in the assembly file and save it in another file named TAIR10_chr_all.fas
.
wdlin@comp02:somewhere$ grep ">" TAIR10_chr_all.fas
>Chr1
>Chr2
>Chr3
>Chr4
>Chr5
>ChrM
>ChrC
For convenience, all above files were moved into a DB
folder`. Our next step was to observe the organized feature hierarchy inside the GFF3 file and extract genomic coordinates of the database transcriptome.
wdlin@comp02:somewhere/DB$ java -classpath rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.Jun2016.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.Jun2016.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
wdlin@comp02:somewhere/DB$ cat Araport11_GFF3_genes_transposons.Jun2016.gff.features
GffRoot
gene*
mRNA*
five_prime_UTR*
exon*
CDS*
three_prime_UTR*
protein*
lnc_RNA*
exon*
antisense_lncRNA*
exon*
miRNA_primary_transcript*
miRNA*
transcript_region*
exon*
tRNA*
exon*
antisense_RNA*
exon*
snoRNA*
exon*
ncRNA*
exon*
snRNA*
exon*
uORF*
rRNA*
exon*
transposable_element*
transposon_fragment*
pseudogene*
pseudogenic_transcript*
pseudogenic_exon*
pseudogenic_tRNA*
pseudogenic_exon*
transposable_element_gene*
mRNA*
exon*
Since basic continuous blocks in the genome are features like exon, we applied misc.ModelCGFF
to extract coordinates of these basic blocks and collect them under individual transcripts.
wdlin@comp02:somewhere/DB$ java -classpath rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.Jun2016.gff \
-GRE gene mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA five_prime_UTR:exon:CDS:three_prime_UTR \
-GRE gene miRNA_primary_transcript miRNA_primary_transcript \
-GRE pseudogene pseudogenic_transcript:pseudogenic_tRNA pseudogenic_exon \
-GRE transposable_element_gene mRNA exon \
-O Araport11.strand -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.Jun2016.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA, five_prime_UTR:exon:CDS:three_prime_UTR], [gene, miRNA_primary_transcript, miRNA_primary_transcript], [pseudogene, pseudogenic_transcript:pseudogenic_tRNA, pseudogenic_exon], [transposable_element_gene, mRNA, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): Araport11.strand
The most important option of the above command is the -GRE
option, it controls which intervals to be extracted as E
xons, which features to be considered as transcripts (R
NAs), and which features to be considered as G
enes. In so doing, we can collect exon intervals of transcripts, and identify gene locations. The above command outputted two files prefixed by Araport11.strand
: Araport11.strand.model
for coordinates of transcripts and Araport11.strand.cgff
for coordinates of gene loci.
wdlin@comp02:somewhere/DB$ head -20 Araport11.strand.model
>AT1G01010.1 Chr1 3631 5899 + AT1G01010
3631 3913
3996 4276
4486 4605
4706 5095
5174 5326
5439 5899
>AT1G01020.2 Chr1 6788 8737 - AT1G01020
6788 7069
7157 7450
7564 7649
7762 7835
7942 7987
8236 8325
8417 8464
8571 8737
>AT1G01020.6 Chr1 6788 8737 - AT1G01020
6788 7069
7157 7450
7564 7649
wdlin@comp02:somewhere/DB$ head -20 Araport11.strand.cgff
>AT1G01010 Chr1 3631 5899 +
3631 3913
3996 4276
4486 4605
4706 5095
5174 5326
5439 5899
>AT1G01020 Chr1 6788 9130 -
6788 7069
7157 7232
7384 7450
7564 7649
7762 7835
7942 7987
8236 8325
8417 8464
8571 9130
>AT1G03987 Chr1 11101 11372 +
11101 11372
>AT1G01030 Chr1 11649 13714 -
Since Araport11.strand.model
contains coordinates of exon intervals of all individual transcripts, we used SeqGen.pl
to generate all transcript sequences based on these coordinates and the genome assembly file.
wdlin@comp02:somewhere/DB$ SeqGen.pl
Usage: SeqGen.pl [-n <lineLength>] <outputFasta> <genomeFasta> <CGFF> [<strandFile>]
wdlin@comp02:somewhere/DB$ SeqGen.pl Araport11.strand.model.fasta TAIR10_chr_all.fas Araport11.strand.model
wdlin@comp02:somewhere/DB$ grep ">" Araport11.strand.model.fasta | head
>AT1G01010.1
>AT1G01020.2
>AT1G01020.6
>AT1G01020.1
>AT1G01020.3
>AT1G01020.4
>AT1G01020.5
>AT1G03987.1
>AT1G01030.1
>AT1G01030.2
Since Riboseq reads are considered as fragments of transcripts that were protected by the ribosome, it is reasonable to believe that, for a species with enriched genome annotation, the transcriptome database should cover a large part of Riboseq reads. Accordingly, we built Bowtie2 index on the above Araport11.strand.model.fasta
so that we can map Riboseq reads to this database.
wdlin@comp02:somewhere/DB$ bowtie2-build Araport11.strand.model.fasta Araport11.strand.model