Preparing genomes - erhard-lab/gedi GitHub Wiki

In general, for each genome (e.g. some human genome version or some virus genome), you will have to prepare a reference. This means that if you are analyzing virus infected cells, you will prepare two references, and supply both their names for later steps.

Without any additional parameters, the IndexGenome program will (among other things) create bowtie, kallisto and STAR indices for your genome (use the -nobowtie, -nokallisto and -nostar parameters, if you are sure that you don't need them).

There are four ways to automatically generate references:

  1. Let the tool download the fasta and gtf file automatically from Ensembl

    gedi -e IndexGenome -organism homo_sapiens -version 75 -p 

    The -organism parameter must be an ensembl organism name (check the ftp-server for the organism names of the current version). For -version use a current release number of ensembl. This will create a reference with name homo_sapiens.75.

  2. Using a fasta file and a gtf file (as e.g. obtained from Ensembl)

    gedi -e IndexGenome -s Homo_sapiens.GRCh38.dna.primary_assembly.fa -a Homo_sapiens.GRCh38.86.gtf -n homo_ens86

    The -n parameter specifies a short name to access this reference (otherwise it would be called Homo_sapiens.GRCh38.86).

  3. Using a genbank file

    gedi -e IndexGenome -gb NC_006273.gb -gblabel gene

    The -gblabel argument determines the name of the genbank feature that is used as the gene name (as there is no standard).

  4. Using only a fasta file.

    gedi -e IndexGenome -s rRNA.fa

    This is particularly useful for a compiled rRNA database. All indexing steps refering to an annotation are skipped.

By default, an oml file referencing all generated files is put into $HOME/.gedi/genomic. All files within this folder can be accessed by name (instead of a full path) whenever a prepared reference is needed.

What has to be in the GTF file?

If you opt for not downloading the GTF file from Ensembl, make sure your GTF file contains at least these minimal data:

  1. Lines with exon in the feature_type field (column 3)
  2. Either lines with CDS as feature_type, or with start_codon and stop_codon to indicate protein coding parts of transcripts.
  3. gene_id and transcript_id attributes (for compatibility with old Ensembl files, there may be an arbitrary prefix, e.g. ensembl_gene_id).

A minimal test GTF looks like this:

1	test	exon	100	200	.	+	.	gene_id "g1"; transcript_id "t1"
1	test	exon	300	400	.	+	.	gene_id "g1"; transcript_id "t1"
1	test	CDS	110	200	.	+	.	gene_id "g1"; transcript_id "t1"
1	test	CDS	300	330	.	+	.	gene_id "g1"; transcript_id "t1"
1	test	exon	1100	1200	.	+	.	gene_id "g1"; transcript_id "t2"
1	test	exon	1300	1400	.	+	.	gene_id "g1"; transcript_id "t2"
1	test	start_codon	1110	1112	.	+	.	gene_id "g1"; transcript_id "t2"
1	test	stop_codon	1328	1330	.	+	.	gene_id "g1"; transcript_id "t2"
1	test	exon	2100	2200	.	+	.	gene_id "g2"; transcript_id "t3"
1	test	exon	2300	2400	.	+	.	gene_id "g2"; transcript_id "t3"

This means that we have three transcripts from two genes. t1 and t2 are protein coding transcripts, t3 is non-coding.

You can check your GTF file by entering:

gedi Nashorn -e 'new GtfFileReader("test.gtf","exon").readIntoMemoryThrowOnNonUnique().ei().print()'

This will ouput the three transcripts like this:

1+:99-200|299-400       Transcript [geneId=g1, transcriptId=t1, codingStart=109, codingEnd=330]
1+:1099-1200|1299-1400  Transcript [geneId=g1, transcriptId=t2, codingStart=1109, codingEnd=1330]
1+:2099-2200|2299-2400  Transcript [geneId=g2, transcriptId=t3, codingStart=-1, codingEnd=-1]

What is it good for?

For many purposes, Gedi needs to prepare a reference, i.e. a genome sequence with corresponding annotation:

  • The reference sequence is indexed for random access (i.e. to quickly access the sequence in the region from X to Y on chromosome Z from a genome). This is for instance used in the genome browser to display the genomic sequence or during ORF inference to identify stop codons.
  • The annotation is indexed in an interval tree data structure for quick queries such as "give me all genes overlapping position X on chromosome Z". This is used in the genome browser to display annotated genes and transcripts and also to annotate inferred ORFs.
  • Mapping tables are extracted from the annotation (e.g. from gene symbol to another gene identifier, or from a gene identifier to the gene's biotype)
  • A data structure for efficient access to all gene symbols with a given prefix is filled. This is for instance used in the search field of the genome browser.
  • The transcriptome is extracted from the genomic sequence based on the annotation and indexed as well. It is for instance used as an additional reference for read mapping, which is important to map reads across splice junctions with bowtie.
  • Bowtie and STAR indices are build for both genome and the extracted transcriptome.

The complete usage of this tool is:

gedi -e IndexGenome <Options>

Options:
 -s <fasta-file>        Fasta file containing chromosomes
 -a <gtf-file>          Gtf file containing annotation
 -organism <name>       Organism name to download data from ensembl (e.g. homo_sapiens)
 -version <version>     Ensembl version to download (e.g. 75)
 -gb <genbank-file>     Genbank file containing annotation and sequence
 -gblabel <label>       Which genbank entry to take as gene and transcript label (default: label)
 -f <folder>            Output folder (Default: next to Fasta and Gtf / genbank)
 -n <name>              Name of the genome for later use (Default: file name of gtf/genbank-file)
 -o <file>              Specify output file (Default: ~/.gedi/genomic/${name}.oml)
 -nobowtie              Do not create bowtie indices
 -nostar                Do not create STAR indices
 -p                     Show progress
 -h                     Show this message
 -D                     Output debugging information
⚠️ **GitHub.com Fallback** ⚠️