Genome reference data sources - pcingola/SnpEff GitHub Wiki
SnpEff genome databases are built from genomic data sources, such as Ensembl, RefSeq, NCBI, UCSC, etc.
Sometimes, the data source is recorded in the snpEff.config file, under the genome_name.reference entry.
Example 1: GRCh37.75
If you are looking for the GRCh37.75 genome, you can search for its entry in the snpEff.config file:
$ grep -A 1 GRCh37.75 snpEff.config
GRCh37.75.genome : Homo_sapiens
GRCh37.75.reference : ftp://ftp.ensembl.org/pub/release-75/gtf/
As you can see, the genome data is from Ensembl, release 75 (as expected).
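The lookup above can be wrapped in a small helper. This is only a sketch: the function name genome_reference is hypothetical (it is not part of SnpEff), and it assumes each entry starts at the beginning of a line in the "key : value" form shown above. Unlike a plain grep, it matches the genome name literally, so the dots in GRCh37.75 are not treated as wildcards.

```shell
# Hypothetical helper (not part of SnpEff): print the value of <genome>.reference.
# Usage: genome_reference GENOME [CONFIG_FILE]
genome_reference() {
  genome="$1"
  config="${2:-snpEff.config}"   # defaults to snpEff.config in the current directory
  # Match the key literally at the start of the line, then strip everything
  # up to and including the first colon (the URL itself may contain colons).
  awk -v key="${genome}.reference" '
    index($0, key) == 1 { sub(/^[^:]*:[ \t]*/, ""); print; exit }
  ' "$config"
}
```

With the entry shown above, genome_reference GRCh37.75 would print the Ensembl FTP URL. It prints nothing when the genome has no reference entry.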
Example 2: hg19
If you are looking for the hg19 genome, you can also search for its entry in the snpEff.config file:
$ grep -i hg19.genome snpEff.config
hg19.genome : Homo_sapiens (USCS)
...
In this case, there is no hg19.reference entry, but the genome name clearly states that the database was retrieved from UCSC (which provides the RefSeq annotations).
Which exact sub-version is this hg19?
Well, unfortunately, UCSC does not keep track of sub-versions.
A rule of thumb is that the data is retrieved shortly before the database is built, so you can look at the date/time of the snpEff database file:
$ ls -al data/hg19/snpEffectPredictor.bin
-rw-r--r-- 1 pcingola pcingola 52630202 Mar 19 08:27 data/hg19/snpEffectPredictor.bin
So this hg19 database was retrieved from UCSC on or shortly before March 19th.
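Since ls may omit the year for recent files, it can help to print the full modification date. The sketch below assumes GNU coreutils, where date -r FILE reports the file's last modification time; the function name db_build_date is hypothetical. Keep in mind that the timestamp is only a hint: if the database file was downloaded or copied rather than built locally, its modification time may reflect the copy and not the original build.

```shell
# Hypothetical helper: print the last-modification time of a SnpEff database
# file, including the year. Assumes GNU date, where -r takes a file name.
# Usage: db_build_date data/hg19/snpEffectPredictor.bin
db_build_date() {
  date -r "$1" '+%Y-%m-%d %H:%M'
}
```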
Example 3: Salmonella_enterica
Sometimes the information in the genome's reference entry is not enough to determine which exact version was used, but the snpEff.config file provides some additional information in the comments.
For example, let's say we'd like to find the data source for the Salmonella_enterica genome.
If we open the snpEff.config file and find the entry for Salmonella_enterica, we see something like this:
Salmonella_enterica.genome : Salmonella_enterica
Salmonella_enterica.reference : ftp.ensemblgenomes.org
OK, it is from Ensembl, but which version? If you scroll up in the config file, you'll see a comment like this:
#---
# ENSEMBL BFMPP release 32
#---
Here ENSEMBL BFMPP stands for Ensembl Bacteria, Fungi, Metazoa, Plants and Protists.
So the comment is indicating that this is Ensembl's release 32.
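Instead of scrolling up by hand, the nearest comment above a genome's entry can be printed with awk. This is a sketch under assumptions: the function name section_comment is hypothetical, section headers are assumed to be "#" comment lines like the one shown above, and separator lines made only of dashes are skipped. If an unrelated comment sits between the section header and the entry, that comment is printed instead.

```shell
# Hypothetical helper: print the closest non-separator comment line that
# appears above <genome>.genome in the config file.
# Usage: section_comment GENOME [CONFIG_FILE]
section_comment() {
  genome="$1"
  config="${2:-snpEff.config}"   # defaults to snpEff.config in the current directory
  awk -v key="${genome}.genome" '
    # Remember the text of each comment, ignoring empty ones and "#---" separators.
    /^#/ {
      text = $0
      sub(/^#[ \t]*/, "", text)
      if (text != "" && text !~ /^-+$/) last = text
    }
    # Stop at the first line that starts with the literal key.
    index($0, key) == 1 { print last; exit }
  ' "$config"
}
```

For the excerpt above, section_comment Salmonella_enterica would print "ENSEMBL BFMPP release 32".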