Preparing genomes - erhard-lab/gedi GitHub Wiki
In general, for each genome (e.g. some human genome version or some virus genome), you will have to prepare a reference. This means that if you are analyzing virus infected cells, you will prepare two references, and supply both their names for later steps.
Without any additional parameters, the IndexGenome
program will (among other things) create bowtie, kallisto and STAR indices for your genome (use the -nobowtie
, -nokallisto
and -nostar
parameters, if you are sure that you don't need them).
There are four ways to automatically generate references:
-
Let the tool download the fasta and gtf file automatically from Ensembl
gedi -e IndexGenome -organism homo_sapiens -version 75 -p
The
-organism
parameter must be an ensembl organism name (check the ftp-server for the organism names of the current version). For-version
use a current release number of ensembl. This will create a reference with name homo_sapiens.75. -
Using a fasta file and a gtf file (as e.g. obtained from Ensembl)
gedi -e IndexGenome -s Homo_sapiens.GRCh38.dna.primary_assembly.fa -a Homo_sapiens.GRCh38.86.gtf -n homo_ens86
The
-n
parameter specifies a short name to access this reference (otherwise it would be called Homo_sapiens.GRCh38.86). -
Using a genbank file
gedi -e IndexGenome -gb NC_006273.gb -gblabel gene
The
-gblabel
argument determines the name of the genbank feature that is used as the gene name (as there is no standard). -
Using only a fasta file.
gedi -e IndexGenome -s rRNA.fa
This is particularly useful for a compiled rRNA database. All indexing steps refering to an annotation are skipped.
By default, an oml file referencing all generated files is put into $HOME/.gedi/genomic
. All files within this folder can be accessed by name (instead of a full path) whenever a prepared reference is needed.
If you opt for not downloading the GTF file from Ensembl, make sure your GTF file contains at least these minimal data:
- Lines with
exon
in the feature_type field (column 3) - Either lines with
CDS
as feature_type, or withstart_codon
andstop_codon
to indicate protein coding parts of transcripts. -
gene_id
andtranscript_id
attributes (for compatibility with old Ensembl files, there may be an arbitrary prefix, e.g.ensembl_gene_id
).
A minimal test GTF looks like this:
1 test exon 100 200 . + . gene_id "g1"; transcript_id "t1"
1 test exon 300 400 . + . gene_id "g1"; transcript_id "t1"
1 test CDS 110 200 . + . gene_id "g1"; transcript_id "t1"
1 test CDS 300 330 . + . gene_id "g1"; transcript_id "t1"
1 test exon 1100 1200 . + . gene_id "g1"; transcript_id "t2"
1 test exon 1300 1400 . + . gene_id "g1"; transcript_id "t2"
1 test start_codon 1110 1112 . + . gene_id "g1"; transcript_id "t2"
1 test stop_codon 1328 1330 . + . gene_id "g1"; transcript_id "t2"
1 test exon 2100 2200 . + . gene_id "g2"; transcript_id "t3"
1 test exon 2300 2400 . + . gene_id "g2"; transcript_id "t3"
This means that we have three transcripts from two genes. t1 and t2 are protein coding transcripts, t3 is non-coding.
You can check your GTF file by entering:
gedi Nashorn -e 'new GtfFileReader("test.gtf","exon").readIntoMemoryThrowOnNonUnique().ei().print()'
This will ouput the three transcripts like this:
1+:99-200|299-400 Transcript [geneId=g1, transcriptId=t1, codingStart=109, codingEnd=330]
1+:1099-1200|1299-1400 Transcript [geneId=g1, transcriptId=t2, codingStart=1109, codingEnd=1330]
1+:2099-2200|2299-2400 Transcript [geneId=g2, transcriptId=t3, codingStart=-1, codingEnd=-1]
For many purposes, Gedi needs to prepare a reference, i.e. a genome sequence with corresponding annotation:
- The reference sequence is indexed for random access (i.e. to quickly access the sequence in the region from X to Y on chromosome Z from a genome). This is for instance used in the genome browser to display the genomic sequence or during ORF inference to identify stop codons.
- The annotation is indexed in an interval tree data structure for quick queries such as "give me all genes overlapping position X on chromosome Z". This is used in the genome browser to display annotated genes and transcripts and also to annotate inferred ORFs.
- Mapping tables are extracted from the annotation (e.g. from gene symbol to another gene identifier, or from a gene identifier to the gene's biotype)
- A data structure for efficient access to all gene symbols with a given prefix is filled. This is for instance used in the search field of the genome browser.
- The transcriptome is extracted from the genomic sequence based on the annotation and indexed as well. It is for instance used as an additional reference for read mapping, which is important to map reads across splice junctions with bowtie.
- Bowtie and STAR indices are build for both genome and the extracted transcriptome.
The complete usage of this tool is:
gedi -e IndexGenome <Options>
Options:
-s <fasta-file> Fasta file containing chromosomes
-a <gtf-file> Gtf file containing annotation
-organism <name> Organism name to download data from ensembl (e.g. homo_sapiens)
-version <version> Ensembl version to download (e.g. 75)
-gb <genbank-file> Genbank file containing annotation and sequence
-gblabel <label> Which genbank entry to take as gene and transcript label (default: label)
-f <folder> Output folder (Default: next to Fasta and Gtf / genbank)
-n <name> Name of the genome for later use (Default: file name of gtf/genbank-file)
-o <file> Specify output file (Default: ~/.gedi/genomic/${name}.oml)
-nobowtie Do not create bowtie indices
-nostar Do not create STAR indices
-p Show progress
-h Show this message
-D Output debugging information