GenericDb - GeneMANIA/pipeline GitHub Wiki

GENERIC DB

GENERIC_DB is a tabular text-only format for processed organism data. The data is organized as a set of files under the result/generic_db/ folder, described below.

These files may be useful for ad-hoc analysis and troubleshooting of the build process, since they are plain text, unlike the binary formats of the final data files used by the GeneMANIA application. They are also used as inputs to the programs that compute those final binary products. However, they are intended as an internal intermediate format and we reserve the right to change it, so it isn't suitable for external dependencies.

Historically, this data populated a SQL database, with each file (apart from the bulk interaction data) represented by a table. The organization of the fields in the text files is reminiscent of this SQL structure, for example in fields that resemble foreign keys. A SQL database is no longer used, however, having been replaced with an index built with Apache Lucene.

The files are organized as follows:

generic_db/
├── ATTRIBUTE_GROUPS.txt
├── ATTRIBUTES
│   └── {I}.txt
├── ATTRIBUTES.txt
├── GENE_DATA.txt
├── GENE_NAMING_SOURCES.txt
├── GENES.txt
├── GO_CATEGORIES
│   ├── {N}.annos.txt
│   ├── {N}_BP.txt
│   ├── {N}_CC.txt
│   └── {N}_MF.txt
├── INTERACTIONS
│   └── {N}.{J}.txt
├── NETWORK_GROUPS.txt
├── NETWORK_METADATA.txt
├── NETWORKS.txt
├── NETWORK_TAG_ASSOC.txt
├── NODES.txt
├── ONTOLOGIES.txt
├── ONTOLOGY_CATEGORIES.txt
├── ORGANISMS.txt
├── SCHEMA.txt
├── STATISTICS.txt
└── TAGS.txt

Here I is an attribute group id, J is a network id, and N is an organism id. An essentially arbitrary number of attribute groups, networks, and organisms may be represented.
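
As a small illustration of the naming scheme above, the per-group, per-organism, and per-network paths could be built as follows. The helper names are hypothetical and not part of the pipeline; `root` is wherever your build wrote the generic_db/ folder.

```python
from pathlib import Path


def interactions_path(root, organism_id, network_id):
    """Path of the interaction file for one network: INTERACTIONS/{N}.{J}.txt."""
    return Path(root) / "INTERACTIONS" / f"{organism_id}.{network_id}.txt"


def attributes_path(root, attribute_group_id):
    """Path of the attribute file for one attribute group: ATTRIBUTES/{I}.txt."""
    return Path(root) / "ATTRIBUTES" / f"{attribute_group_id}.txt"


def go_category_paths(root, organism_id):
    """Paths of the four GO_CATEGORIES files for one organism."""
    base = Path(root) / "GO_CATEGORIES"
    suffixes = [".annos.txt", "_BP.txt", "_CC.txt", "_MF.txt"]
    return [base / f"{organism_id}{suffix}" for suffix in suffixes]
```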

Each individual file is plain text, UTF-8 encoded. Inconveniently, there is no header row; instead, the column names are recorded in the SCHEMA.txt file.
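
Because the files carry no header row, a reader has to supply the column names itself. The sketch below is one way to do that in Python. It rests on two assumptions that this page does not confirm: that fields are tab-delimited, and that the column order matches the field list given for ORGANISMS.txt further down (SCHEMA.txt is the authoritative source). `read_table` and `ORGANISM_COLUMNS` are hypothetical names, and the sample row is made up.

```python
import csv
import io

# Assumed column order, copied from the ORGANISMS.txt field list on this page.
ORGANISM_COLUMNS = ["ID", "NAME", "DESCRIPTION", "ALIAS", "ONTOLOGY_ID", "TAXONOMY_ID"]


def read_table(handle, columns, delimiter="\t"):
    """Yield one dict per row, pairing each value with its column name."""
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}: {row!r}")
        yield dict(zip(columns, row))


# In-memory stand-in for a file; the values are illustrative only.
sample = io.StringIO("1\tS. cerevisiae\tbaker's yeast\tSaccharomyces cerevisiae\t1\t4932\n")
organisms = list(read_table(sample, ORGANISM_COLUMNS))
print(organisms[0]["NAME"], organisms[0]["TAXONOMY_ID"])
```

To read a real file, open it with `encoding="utf-8"` and `newline=""` and pass the handle to `read_table`. All values come back as strings, so ids need an explicit `int()` if numeric comparison matters.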

ATTRIBUTES.txt

Contains a record for each individual attribute in the dataset. The attribute name is required for display in query results.

ID: internal GeneMANIA attribute id
ORGANISM_ID: internal GeneMANIA organism id
ATTRIBUTE_GROUP_ID: the attribute group to which this attribute belongs, refers to a record in the ATTRIBUTE_GROUPS table
EXTERNAL_ID: database identifier for the attribute in the external data source
NAME: attribute name, may be the same as EXTERNAL_ID
DESCRIPTION: additional descriptive text where available

ATTRIBUTE_GROUPS.txt

Lists all the attribute groups in the dataset. Each group belongs to an organism, and is associated with multiple individual attributes listed in the ATTRIBUTES table.

ID: internal GeneMANIA attribute group id
ORGANISM_ID: internal GeneMANIA organism id
NAME: attribute group name, e.g. InterPro
CODE: not used
DESCRIPTION: description of the attribute group, e.g. Protein Domain Families
LINKOUT_LABEL: link text for display in individual attribute linkouts, e.g. to a descriptive page in an external resource for a particular protein domain
LINKOUT_URL: URL for individual attribute linkouts, with the pattern {1} where the attribute's external_id is to be interpolated
DEFAULT_SELECTED: 1 or 0, if the attribute is to be selected by default for use in GeneMANIA queries
PUBLICATION_NAME: Descriptive text for a linkout to publication reference for this attribute group
PUBLICATION_URL: link to publication reference for this attribute group
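
The LINKOUT_URL interpolation described above can be sketched as a plain string substitution. The function name and the example URL are hypothetical, and whether the application URL-encodes the external id before inserting it is not stated here, so this sketch inserts it unchanged.

```python
def attribute_linkout(linkout_url, external_id):
    """Fill the {1} placeholder of a LINKOUT_URL with an attribute's EXTERNAL_ID."""
    return linkout_url.replace("{1}", external_id)


# Hypothetical URL pattern and id, for illustration only.
url = attribute_linkout("https://example.org/entry/{1}", "IPR000001")
```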

GENES.txt

Contains all the gene symbols in the system. Individual unique genes are stored in the NODES table (yes this is confusing, sorry). Each unique gene (NODES table) can be associated with multiple gene symbols (in the GENES table) and with a description in the GENE_DATA table. In addition, each gene symbol is associated with a record in GENE_NAMING_SOURCES (which records the source or type of symbol, e.g. Entrez Gene ID, etc).

ID: internal GeneMANIA gene id
SYMBOL: gene symbol, e.g. BRCA2
SYMBOL_TYPE: not used?
NAMING_SOURCE_ID: refers to a record in the GENE_NAMING_SOURCES table that identifies the symbol source (type)
NODE_ID: refers to a record in the NODES table, to which this gene symbol belongs
ORGANISM_ID: internal GeneMANIA organism id
DEFAULT_SELECTED: 1 or 0, if the gene symbol is to be displayed in the organism's default query
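
To make the GENES/NODES relationship concrete, the following sketch groups gene symbols by the unique gene (node) they belong to. It assumes the GENES rows have already been loaded as dicts keyed by the field names above; the function name and the sample ids are made up.

```python
from collections import defaultdict


def symbols_by_node(gene_rows):
    """Map each NODE_ID to the list of gene symbols that refer to it."""
    grouped = defaultdict(list)
    for row in gene_rows:
        grouped[row["NODE_ID"]].append(row["SYMBOL"])
    return dict(grouped)


# Illustrative rows only; real ids come from GENES.txt.
genes = [
    {"ID": "1", "SYMBOL": "BRCA2", "NODE_ID": "100"},
    {"ID": "2", "SYMBOL": "675", "NODE_ID": "100"},
    {"ID": "3", "SYMBOL": "TP53", "NODE_ID": "101"},
]
grouped = symbols_by_node(genes)
```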

GENE_DATA.txt

ID: internal GeneMANIA gene data id
DESCRIPTION: description of the corresponding unique gene (nodes table)
EXTERNAL_ID: not used?
LINKOUT_SOURCE_ID: not used?

GENE_NAMING_SOURCES.txt

Lists all the gene naming sources in the system, e.g. Entrez Gene ID etc.

ID: internal GeneMANIA naming source id
NAME: naming source name, used in gene linkouts
RANK: integer indicating display preference; when a gene is displayed in query results, the symbol from the naming source with the highest value available for that gene is used
SHORT_NAME: not used
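
The RANK rule can be sketched as follows: among the symbols attached to one node, pick the one whose naming source has the highest RANK. This is a reading of the description above rather than the application's actual code. The function name and the rank values in the sample data are invented, and tie-breaking between equal ranks is left to `max`, which keeps the first candidate it sees.

```python
def preferred_symbol(gene_rows, naming_source_rows, node_id):
    """Return the symbol of `node_id` whose naming source has the highest RANK."""
    rank_by_source = {row["ID"]: int(row["RANK"]) for row in naming_source_rows}
    candidates = [row for row in gene_rows if row["NODE_ID"] == node_id]
    if not candidates:
        return None
    best = max(candidates, key=lambda row: rank_by_source[row["NAMING_SOURCE_ID"]])
    return best["SYMBOL"]


# Illustrative data; the ranks are invented for the example.
sources = [
    {"ID": "1", "NAME": "Entrez Gene ID", "RANK": "1"},
    {"ID": "2", "NAME": "Gene Name", "RANK": "10"},
]
genes = [
    {"SYMBOL": "675", "NAMING_SOURCE_ID": "1", "NODE_ID": "100"},
    {"SYMBOL": "BRCA2", "NAMING_SOURCE_ID": "2", "NODE_ID": "100"},
]
```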

NETWORKS.txt

Lists all networks in the system. While each network is associated with a particular organism, that association must be determined via the NETWORK_GROUPS table.

ID: internal GeneMANIA network id
NAME: network name
METADATA_ID: id of a record in the NETWORK_METADATA table containing publication reference etc
DESCRIPTION: optional descriptive text for display
DEFAULT_SELECTED: 1 or 0, depending on if the network is to be included in searches with default parameters
GROUP_ID: refers to a record in the NETWORK_GROUPS table, indicating to which network group this network belongs, e.g. Co-expression

NETWORK_GROUPS.txt

Contains a record for each network group available for each organism, e.g. Co-expression, Genetic Interaction, etc

ID: internal GeneMANIA network group id
NAME: network group name
CODE: internal mnemonic code for the network group, e.g. spd for Shared Protein Domains
DESCRIPTION: not used
ORGANISM_ID: internal GeneMANIA id of the organism to which this network group belongs
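
Since NETWORKS.txt has no ORGANISM_ID column, finding a network's organism takes one hop through NETWORK_GROUPS. A minimal sketch, assuming both files are already loaded as lists of dicts keyed by the field names above (the function name and sample ids are hypothetical):

```python
def organism_for_network(network_rows, group_rows, network_id):
    """Resolve a network's organism via its GROUP_ID in NETWORK_GROUPS."""
    organism_by_group = {row["ID"]: row["ORGANISM_ID"] for row in group_rows}
    for row in network_rows:
        if row["ID"] == network_id:
            return organism_by_group[row["GROUP_ID"]]
    raise KeyError(f"no network with ID {network_id}")


# Illustrative rows only.
groups = [{"ID": "5", "NAME": "Co-expression", "CODE": "coexp", "ORGANISM_ID": "1"}]
networks = [{"ID": "42", "NAME": "Example network", "GROUP_ID": "5"}]
```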

NETWORK_METADATA.txt

Descriptive metadata for each network.

ID: internal id
source: name of data source, e.g. GEO
reference: external id of the network, e.g. GSE10502 for a GEO network
pubmedId:
authors: comma delimited list of author last names
publicationName
yearPublished
processingDescription: e.g. Pearson Correlation
networkType: redundant but must contain the network group name like 'Co-expression'
alias: not used
interactionCount: number of interactions in the network
dynamicRange: not used
edgeWeightDistribution: not used
accessStats: not used
comment: additional text to be added to the network description, e.g. '1 of 2 datasets produced from this publication'
other: extra generic network labelling text, e.g. 'Small-scale studies','Affinity Capture', is this used?
title: name of paper
url: reference linkout, typically pubmed
sourceUrl: linkout to data source, e.g. http://thebiogrid.org/; can include a string interpolation location marked with '%s', in which case source_id is inserted at that location in the URL
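
The optional '%s' interpolation in sourceUrl could be handled along these lines. The function name is hypothetical, and which value the application actually passes as the id is not spelled out on this page, so it is left as a parameter. The sketch uses `str.replace` rather than the `%` operator so that other percent signs in a URL (for example percent-encoded characters) do not raise formatting errors.

```python
def source_linkout(source_url, source_id):
    """Insert source_id at the first '%s' in source_url, if one is present."""
    if "%s" in source_url:
        return source_url.replace("%s", source_id, 1)
    return source_url
```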

NETWORK_TAG_ASSOC.txt

Associates networks to descriptive network tags. Not currently used.

ID: internal record id
NETWORK_ID: refers to a record in the NETWORKS table
TAG_ID: refers to a record in the TAGS table

NODES.txt

Contains a record for each unique gene (graph node) in the system. Each of these genes can be identified by multiple symbols in the GENES table.

ID: internal GeneMANIA node id
NAME: node name, not used
GENE_DATA_ID: refers to a record in the GENE_DATA table containing a description of the gene for display
ORGANISM_ID: the GeneMANIA organism id to which the gene belongs

ONTOLOGIES.txt

Lists all the sets of functional annotations available for use in enrichment analysis. Currently there is only 1 per organism.

ID: internal GeneMANIA ontology id
NAME: not used

ONTOLOGY_CATEGORIES.txt

Lists the names of all the individual functional categories belonging to the various sets of functional annotations (one set per organism, currently). Used in display of functional enrichment results.

ID: internal GeneMANIA ontology category id
ONTOLOGY_ID: refers to a record in the ONTOLOGIES table to which this particular category belongs
NAME: the external id of the category, e.g. GO:0005509
DESCRIPTION: description of the category, e.g. calcium ion binding

ORGANISMS.txt

Contains a record for each organism in the dataset.

Fields:

ID: internal GeneMANIA organism id, e.g. 1
NAME: e.g. S. cerevisiae
DESCRIPTION: e.g. baker's yeast
ALIAS: e.g. Saccharomyces cerevisiae
ONTOLOGY_ID: the id corresponding to a record in ONTOLOGIES.txt, specifies a set of functional categories to be used in enrichment analysis
TAXONOMY_ID: NCBI taxonomy id, e.g. 4932 for yeast
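
The link from an organism to its functional categories runs through ONTOLOGY_ID. A minimal sketch of that lookup, assuming ORGANISMS.txt and ONTOLOGY_CATEGORIES.txt rows are already loaded as dicts keyed by the field names on this page (the function name and sample ids are hypothetical):

```python
def categories_for_organism(organism_row, category_rows):
    """Map category NAME (external id) to DESCRIPTION for one organism's ontology."""
    ontology_id = organism_row["ONTOLOGY_ID"]
    return {
        row["NAME"]: row["DESCRIPTION"]
        for row in category_rows
        if row["ONTOLOGY_ID"] == ontology_id
    }


# Illustrative rows only.
yeast = {"ID": "1", "NAME": "S. cerevisiae", "ONTOLOGY_ID": "7"}
categories = [
    {"ID": "1", "ONTOLOGY_ID": "7", "NAME": "GO:0005509", "DESCRIPTION": "calcium ion binding"},
    {"ID": "2", "ONTOLOGY_ID": "8", "NAME": "GO:0005509", "DESCRIPTION": "calcium ion binding"},
]
```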

STATISTICS.txt

Contains a single record summarizing the # of organisms, networks, and interactions in the dataset.

The fields are:

ID: record id, typically 1.
organisms: # of organisms
networks: total # of networks summed across all organisms
interactions: total # of interactions in all networks across all organisms
genes: total number of unique genes represented (not gene symbols, which would be larger)
predictions: not used
date: production date of dataset, YYYY-MM-DD
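
As an example of consuming this record, the sketch below converts the count fields to integers and the date to a `datetime.date`. It assumes tab-delimited fields in the order listed above, which SCHEMA.txt should confirm; the function name is hypothetical and the numbers in the sample line are invented.

```python
from datetime import date

# Assumed column order, copied from the field list on this page.
STATISTICS_COLUMNS = ["ID", "organisms", "networks", "interactions", "genes", "predictions", "date"]


def parse_statistics(line, delimiter="\t"):
    """Parse the single STATISTICS.txt record into typed values."""
    values = line.rstrip("\r\n").split(delimiter)
    record = dict(zip(STATISTICS_COLUMNS, values))
    for key in ("organisms", "networks", "interactions", "genes"):
        record[key] = int(record[key])
    # The date field is documented as YYYY-MM-DD, which fromisoformat accepts.
    record["date"] = date.fromisoformat(record["date"])
    return record


# Invented values, for illustration only.
stats = parse_statistics("1\t2\t300\t45000\t6000\t0\t2014-08-12\n")
```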

TAGS.txt

A list of descriptive tags, similar to MeSH terms, used to describe networks. The terms are not organism specific; a single set of terms is available to all organisms.

Tags were supported in the past but are no longer used. The file must exist but will be empty. Support may be re-added in the future.

The columns are:

ID
NAME