Genome reference data sources - pcingola/SnpEff GitHub Wiki

SnpEff genome databases are built from genomic data sources such as Ensembl, RefSeq, NCBI, and UCSC.

Sometimes the data source is recorded in the snpEff.config file, under the genome_name.reference entry.

Example 1: GRCh37.75

If you are looking for the source of the GRCh37.75 genome, you can search for its entry in the snpEff.config file:

$ grep -A 1 GRCh37.75 snpEff.config
GRCh37.75.genome : Homo_sapiens
GRCh37.75.reference : ftp://ftp.ensembl.org/pub/release-75/gtf/

As you can see, the genome data is from Ensembl, release 75 (as expected).
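The lookup above can also be wrapped in a small helper. This is a minimal sketch, not part of SnpEff: the function name genome_reference is hypothetical, and it assumes entries use the "key : value" layout shown above. Unlike a plain grep, where each dot in GRCh37.75 matches any character, it compares the key exactly.

```shell
# Hypothetical helper: print the value of <genome>.reference from a config file.
# $1 = genome name (e.g. GRCh37.75), $2 = path to snpEff.config
genome_reference() {
  # Match the first field exactly, then strip everything up to the first colon.
  awk -v key="$1.reference" '$1 == key { sub(/^[^:]*:[ \t]*/, ""); print; exit }' "$2"
}

# Usage (assumes snpEff.config is in the current directory):
#   genome_reference GRCh37.75 snpEff.config
```

If the genome has no reference entry (as with hg19 in the next example), the function prints nothing.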

Example 2: hg19

If you are looking for the source of the hg19 genome, you can also search for its entry in the snpEff.config file:

$ grep -i hg19.genome snpEff.config 
hg19.genome : Homo_sapiens (USCS)
...

In this case, there is no hg19.reference entry, but the genome name clearly states that the database was retrieved from UCSC (with RefSeq annotations). Which exact sub-version of hg19 is this? Unfortunately, UCSC does not keep track of sub-versions. A rule of thumb is that the data is retrieved just before the database is built, so you can look at the date and time of the SnpEff database file:

$ ls -al data/hg19/snpEffectPredictor.bin
-rw-r--r-- 1 pcingola pcingola 52630202 Mar 19 08:27 data/hg19/snpEffectPredictor.bin

So this hg19 database was retrieved from UCSC on or around March 19th.
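Note that the ls output above does not show the year. As a sketch, assuming GNU coreutils is available, stat can print the full modification timestamp. The function name db_mtime is hypothetical.

```shell
# Hypothetical helper: print the full modification time (including the year)
# of a SnpEff database file.
# Assumes GNU stat; on macOS/BSD the equivalent is: stat -f '%Sm' FILE
db_mtime() {
  stat -c '%y' "$1"
}

# Usage (assumes the hg19 database is installed under data/):
#   db_mtime data/hg19/snpEffectPredictor.bin
```

The timestamp reflects when the file was last written on your system, so it only approximates the retrieval date if the file kept its original modification time when it was downloaded or copied.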

Example 3: Salmonella_enterica

Sometimes the information in the genome's reference entry is not enough to determine which exact version was used, but the snpEff.config file provides some additional information in the comments. For example, let's say we'd like to find the data source for the Salmonella_enterica genome.

If we open snpEff.config and find the entry for Salmonella_enterica, we see something like this:

Salmonella_enterica.genome : Salmonella_enterica
Salmonella_enterica.reference : ftp.ensemblgenomes.org

OK, it is from Ensembl, but which version? If you scroll up in the config file, you'll see a comment like this:

#---
# ENSEMBL BFMPP release 32
#---

Here ENSEMBL BFMPP stands for Ensembl Bacteria, Fungi, Metazoa, Plants and Protists. So the comment indicates that this is Ensembl's release 32.
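Instead of scrolling up by hand, the nearest preceding comment can be found with awk. This is a rough sketch under two assumptions: section headers are comment lines like the one shown above, and no unrelated comment sits between the header and the genome entry. The function name genome_section_comment is hypothetical.

```shell
# Hypothetical helper: print the last non-empty comment that appears
# before <genome>.genome in the config file.
# $1 = genome name, $2 = path to snpEff.config
genome_section_comment() {
  awk -v key="$1.genome" '
    /^#/ {
      # Strip the leading "#", spaces and dashes; skip pure separator lines.
      text = $0
      sub(/^#[ \t-]*/, "", text)
      if (text != "") last = text
      next
    }
    # Stop at the genome entry and report the most recent comment seen.
    $1 == key { print last; exit }
  ' "$2"
}

# Usage (assumes snpEff.config is in the current directory):
#   genome_section_comment Salmonella_enterica snpEff.config
```

Treat the result as a hint and check it against the config file, since any other comment placed after the section header would be reported instead.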