Riboseq multi stage mapping and filtering, part 0 - wdlingit/cop GitHub Wiki

Database construction

Araport11 (2016 version) and the TAIR10 genome were adopted in this example. For any other species, the key is to get a genome annotation GFF3 file and a genome assembly file, where the annotation and assembly are matching to each other, including sequence (chromosome) names.

Download Araport11 GFF file (2016 version) and TAIR10 genome assembly

The current version of Araport11 genome annotation in the TAIR website is not exactly the same with the original one. To obtain the 2016 version Araport11 genome annotation, it is needed to go to the archive page and download file Araport11_GFF3_genes_transposons.Jun2016.gff.gz. Its MD5 check sum is 6eb0c6219aa15e73f76937bafc4fd833.

wdlin@comp02:somewhere$ md5sum Araport11_GFF3_genes_transposons.Jun2016.gff.gz
6eb0c6219aa15e73f76937bafc4fd833  Araport11_GFF3_genes_transposons.Jun2016.gff.gz

wdlin@comp02:somewhere$ gzip -d Araport11_GFF3_genes_transposons.Jun2016.gff.gz

wdlin@comp02:somewhere$ head Araport11_GFF3_genes_transposons.Jun2016.gff | cut -f 1-6
##gff-version   3
Chr1    Araport11       gene    3631    5899    .
Chr1    Araport11       mRNA    3631    5899    .
Chr1    Araport11       five_prime_UTR  3631    3759    .
Chr1    Araport11       exon    3631    3913    .
Chr1    Araport11       CDS     3760    3913    .
Chr1    Araport11       exon    3996    4276    .
Chr1    Araport11       CDS     3996    4276    .
Chr1    Araport11       exon    4486    4605    .
Chr1    Araport11       CDS     4486    4605    .

Note that chromosome names in this Araport11 GFF3 file is in format Chr1, Chr2, ..., Chr5, ChrM, and ChrC.

The Araport11 genome annotation is based on the original TARI10 genome assembly. The current version of TAIR10 genome assembly in the TAIR website is actually TAIR10.1 but not TAIR10 (the difference is mitochondria). To obtain original TAIR10 genome assembly for matching the original Araport11 genome annotation, we download the genome assembly from EnsemblPlants.

wdlin@comp02:somewhere$ wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

wdlin@comp02:somewhere$ gzip -d Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

wdlin@comp02:somewhere$ grep ">" Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
>1 dna:chromosome chromosome:TAIR10:1:1:30427671:1 REF
>2 dna:chromosome chromosome:TAIR10:2:1:19698289:1 REF
>3 dna:chromosome chromosome:TAIR10:3:1:23459830:1 REF
>4 dna:chromosome chromosome:TAIR10:4:1:18585056:1 REF
>5 dna:chromosome chromosome:TAIR10:5:1:26975502:1 REF
>Mt dna:chromosome chromosome:TAIR10:Mt:1:366924:1 REF
>Pt dna:chromosome chromosome:TAIR10:Pt:1:154478:1 REF

Note that sequence names in the assembly file were not the same as those in the GFF3 file. So we manually modified sequence names in the assembly file and save it in another file named TAIR10_chr_all.fas.

wdlin@comp02:somewhere$ grep ">" TAIR10_chr_all.fas
>Chr1
>Chr2
>Chr3
>Chr4
>Chr5
>ChrM
>ChrC

Extract transcriptome coordinates from the genome annotation GFF3

For convenience, all above files were moved into a DB folder`. Our next step was to observe the organized feature hierarchy inside the GFF3 file and extract genomic coordinates of the database transcriptome.

wdlin@comp02:somewhere/DB$ java -classpath rackj.jar misc.GffTree -I Araport11_GFF3_genes_transposons.Jun2016.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): Araport11_GFF3_genes_transposons.Jun2016.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]

wdlin@comp02:somewhere/DB$ cat Araport11_GFF3_genes_transposons.Jun2016.gff.features
GffRoot
  gene*
    mRNA*
      five_prime_UTR*
      exon*
      CDS*
      three_prime_UTR*
      protein*
    lnc_RNA*
      exon*
    antisense_lncRNA*
      exon*
    miRNA_primary_transcript*
      miRNA*
    transcript_region*
      exon*
    tRNA*
      exon*
    antisense_RNA*
      exon*
    snoRNA*
      exon*
    ncRNA*
      exon*
    snRNA*
      exon*
    uORF*
    rRNA*
      exon*
  transposable_element*
    transposon_fragment*
  pseudogene*
    pseudogenic_transcript*
      pseudogenic_exon*
    pseudogenic_tRNA*
      pseudogenic_exon*
  transposable_element_gene*
    mRNA*
      exon*

Since basic continuous blocks in the genome are features like exon, we applied misc.ModelCGFF to extract coordinates of these basic blocks and collect them under individual transcripts.

wdlin@comp02:somewhere/DB$ java -classpath rackj.jar misc.ModelCGFF -GFF3 Araport11_GFF3_genes_transposons.Jun2016.gff \
                           -GRE gene mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA five_prime_UTR:exon:CDS:three_prime_UTR \
                           -GRE gene miRNA_primary_transcript miRNA_primary_transcript \
                           -GRE pseudogene pseudogenic_transcript:pseudogenic_tRNA pseudogenic_exon \
                           -GRE transposable_element_gene mRNA exon \
                           -O Araport11.strand -IP true
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: ModelCGFF
GFF3 filename (-GFF3): Araport11_GFF3_genes_transposons.Jun2016.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
gene-rna-exon feature triples (-GRE): [[gene, mRNA:tRNA:antisense_lncRNA:snoRNA:lnc_RNA:antisense_RNA:transcript_region:snRNA:ncRNA:rRNA, five_prime_UTR:exon:CDS:three_prime_UTR], [gene, miRNA_primary_transcript, miRNA_primary_transcript], [pseudogene, pseudogenic_transcript:pseudogenic_tRNA, pseudogenic_exon], [transposable_element_gene, mRNA, exon]]
intron preserving merged model (-IP)): true
representative list (-rep): null
output prefix (-O): Araport11.strand

The most important option of the above command is the -GRE option, it controls which intervals to be extracted as Exons, which features to be considered as transcripts (RNAs), and which features to be considered as Genes. In so doing, we can collect exon intervals of transcripts, and identify gene locations. The above command outputted two files prefixed by Araport11.strand: Araport11.strand.model for coordinates of transcripts and Araport11.strand.cgff for coordinates of gene loci.

wdlin@comp02:somewhere/DB$ head -20 Araport11.strand.model
>AT1G01010.1    Chr1    3631    5899    +       AT1G01010
3631    3913
3996    4276
4486    4605
4706    5095
5174    5326
5439    5899
>AT1G01020.2    Chr1    6788    8737    -       AT1G01020
6788    7069
7157    7450
7564    7649
7762    7835
7942    7987
8236    8325
8417    8464
8571    8737
>AT1G01020.6    Chr1    6788    8737    -       AT1G01020
6788    7069
7157    7450
7564    7649

wdlin@comp02:somewhere/DB$ head -20 Araport11.strand.cgff
>AT1G01010      Chr1    3631    5899    +
3631    3913
3996    4276
4486    4605
4706    5095
5174    5326
5439    5899
>AT1G01020      Chr1    6788    9130    -
6788    7069
7157    7232
7384    7450
7564    7649
7762    7835
7942    7987
8236    8325
8417    8464
8571    9130
>AT1G03987      Chr1    11101   11372   +
11101   11372
>AT1G01030      Chr1    11649   13714   -

The transcriptome database

Since Araport11.strand.model contains coordinates of exon intervals of all individual transcripts, we used SeqGen.pl to generate all transcript sequences based on these coordinates and the genome assembly file.

wdlin@comp02:somewhere/DB$ SeqGen.pl
Usage: SeqGen.pl [-n <lineLength>] <outputFasta> <genomeFasta> <CGFF> [<strandFile>]

wdlin@comp02:somewhere/DB$ SeqGen.pl Araport11.strand.model.fasta TAIR10_chr_all.fas Araport11.strand.model

wdlin@comp02:somewhere/DB$ grep ">" Araport11.strand.model.fasta | head
>AT1G01010.1
>AT1G01020.2
>AT1G01020.6
>AT1G01020.1
>AT1G01020.3
>AT1G01020.4
>AT1G01020.5
>AT1G03987.1
>AT1G01030.1
>AT1G01030.2

Since Riboseq reads are considered as fragments of transcripts that were protected by the ribosome, it is reasonable to believe that, for a species with enriched genome annotation, the transcriptome database should cover a large part of Riboseq reads. Accordingly, we built Bowtie2 index on the above Araport11.strand.model.fasta so that we can map Riboseq reads to this database.

wdlin@comp02:somewhere/DB$ bowtie2-build Araport11.strand.model.fasta Araport11.strand.model
⚠️ **GitHub.com Fallback** ⚠️