GenericDb - GeneMANIA/pipeline GitHub Wiki

GENERIC DB

GENERIC_DB is a tabular text-only format for processed organism data. The data is organized as a set of files under the result/generic_db/ folder, described below.

These files may be useful for ad-hoc analysis and for troubleshooting the build process, since they are plain text, unlike the binary formats of the final data files used by the GeneMANIA application. They are also used as inputs to the programs that compute those final binary products. However, they are intended as an internal intermediate format: we reserve the right to change it, so external tools should not depend on it.

Historically, this data populated a SQL database, with each file (apart from the bulk interaction data) represented by a table. The organization of the fields in the text files is reminiscent of this SQL structure; for example, some fields act like foreign keys. A SQL database is no longer used, however, having been replaced by an index built with Apache Lucene.

The files are organized as follows:

generic_db/
├── ATTRIBUTE_GROUPS.txt
├── ATTRIBUTES
│   └── {I}.txt
├── ATTRIBUTES.txt
├── GENE_DATA.txt
├── GENE_NAMING_SOURCES.txt
├── GENES.txt
├── GO_CATEGORIES
│   ├── {N}.annos.txt
│   ├── {N}_BP.txt
│   ├── {N}_CC.txt
│   └── {N}_MF.txt
├── INTERACTIONS
│   └── {N}.{J}.txt
├── NETWORK_GROUPS.txt
├── NETWORK_METADATA.txt
├── NETWORKS.txt
├── NETWORK_TAG_ASSOC.txt
├── NODES.txt
├── ONTOLOGIES.txt
├── ONTOLOGY_CATEGORIES.txt
├── ORGANISMS.txt
├── SCHEMA.txt
├── STATISTICS.txt
└── TAGS.txt

Where I is an attribute group id, J is a network id, and N is an organism id. An essentially arbitrary number of attribute groups, networks, and organisms can be represented.
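
As a minimal sketch of how the naming scheme above maps ids to paths, the helpers below build the per-organism and per-network file names. The function names are hypothetical and not part of the pipeline; only the directory and file name patterns come from the listing above.

```python
from pathlib import Path


def interactions_file(root: Path, organism_id: int, network_id: int) -> Path:
    """Path of INTERACTIONS/{N}.{J}.txt (N = organism id, J = network id)."""
    return root / "INTERACTIONS" / f"{organism_id}.{network_id}.txt"


def attributes_file(root: Path, attribute_group_id: int) -> Path:
    """Path of ATTRIBUTES/{I}.txt (I = attribute group id)."""
    return root / "ATTRIBUTES" / f"{attribute_group_id}.txt"


def go_category_files(root: Path, organism_id: int) -> dict:
    """Paths of the four GO_CATEGORIES files for one organism."""
    base = root / "GO_CATEGORIES"
    files = {"annos": base / f"{organism_id}.annos.txt"}
    for branch in ("BP", "CC", "MF"):
        files[branch] = base / f"{organism_id}_{branch}.txt"
    return files


# The ids 1 and 12 are made-up values for illustration.
root = Path("result/generic_db")
network_path = interactions_file(root, organism_id=1, network_id=12)
go_paths = go_category_files(root, organism_id=1)
```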

Each individual file is plain text, UTF-8 encoded. Inconveniently, the files have no header row; the column names are instead recorded in the SCHEMA.txt file.
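
Because the files carry no header row, a reader has to supply the column names itself. The sketch below pairs each row with a list of names. It makes two assumptions that this page does not confirm: that fields are tab-delimited, and that the column order matches the field lists on this page. The layout of SCHEMA.txt is not described here, so the names are passed in by hand rather than parsed from it. The function name and the sample ONTOLOGY_ID value are illustrative only.

```python
import csv
import io


def read_records(handle, columns, delimiter="\t"):
    """Yield one dict per non-empty row, keyed by the supplied column names."""
    # QUOTE_NONE: treat quote characters as ordinary data (assumed, not confirmed).
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield dict(zip(columns, row))


# Column order assumed to follow the ORGANISMS.txt field list on this page.
ORGANISM_COLUMNS = ["ID", "NAME", "DESCRIPTION", "ALIAS", "ONTOLOGY_ID", "TAXONOMY_ID"]

# In-memory sample standing in for ORGANISMS.txt.
sample = io.StringIO(
    "1\tS. cerevisiae\tbaker's yeast\tSaccharomyces cerevisiae\t1\t4932\n"
)
organisms = list(read_records(sample, ORGANISM_COLUMNS))
```

For a file on disk, the same function could be called with `open(path, encoding="utf-8", newline="")` as the handle. All values come back as strings; any numeric conversion is left to the caller.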

ATTRIBUTES.txt

Contains a record for each individual attribute in the dataset. The attribute name is required for display in query results.

  • ID: internal GeneMANIA attribute id
  • ORGANISM_ID: internal GeneMANIA organism id
  • ATTRIBUTE_GROUP_ID: the attribute group to which this attribute belongs, refers to a record in the ATTRIBUTE_GROUPS table
  • EXTERNAL_ID: database identifier for the attribute in the external data source
  • NAME: attribute name, may be the same as EXTERNAL_ID
  • DESCRIPTION: additional descriptive text where available

ATTRIBUTE_GROUPS.txt

Lists all the attribute groups in the dataset. Each group belongs to an organism, and is associated with multiple individual attributes listed in the ATTRIBUTES table.

  • ID: internal GeneMANIA attribute group id
  • ORGANISM_ID: internal GeneMANIA organism id
  • NAME: attribute group name, e.g. InterPro
  • CODE: not used
  • DESCRIPTION: description of the attribute group, e.g. Protein Domain Families
  • LINKOUT_LABEL: link text for display in individual attribute linkouts, e.g. to a descriptive page in an external resource for a particular protein domain
  • LINKOUT_URL: URL for individual attribute linkouts, with the pattern {1} where the attribute's external_id is to be interpolated
  • DEFAULT_SELECTED: 1 or 0, if the attribute is to be selected by default for use in GeneMANIA queries
  • PUBLICATION_NAME: Descriptive text for a linkout to publication reference for this attribute group
  • PUBLICATION_URL: link to publication reference for this attribute group
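
The LINKOUT_URL interpolation described above can be sketched as a plain string substitution of the `{1}` placeholder. The function name, the example URL, and the example id are hypothetical, and whether the real application URL-encodes the id is not stated here, so this sketch inserts it verbatim.

```python
def attribute_linkout(linkout_url: str, external_id: str) -> str:
    """Replace the {1} placeholder with the attribute's EXTERNAL_ID."""
    return linkout_url.replace("{1}", external_id)


# Hypothetical template and id, for illustration only.
url = attribute_linkout("https://example.org/entry/{1}", "IPR000001")
```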

GENES.txt

Contains all the gene symbols in the system. Individual unique genes are stored in the NODES table (yes, this is confusing, sorry). Each unique gene (NODES table) can be associated with multiple gene symbols (in the GENES table) and descriptions in the GENE_DATA table. In addition, each gene symbol is associated with a record in the GENE_NAMING_SOURCES table, which records the source or type of the symbol, e.g. Entrez Gene ID.

  • ID: internal GeneMANIA gene id
  • SYMBOL: gene symbol, e.g. BRCA2
  • SYMBOL_TYPE: not used?
  • NAMING_SOURCE_ID: refers to a record in the GENE_NAMING_SOURCES table that identifies the symbol source (type)
  • NODE_ID: refers to a record in the NODES table, to which this gene symbol belongs
  • ORGANISM_ID: internal GeneMANIA organism id
  • DEFAULT_SELECTED: 1 or 0, if the gene symbol is to be displayed in the organism's default query

GENE_DATA.txt

  • ID: internal GeneMANIA gene data id
  • DESCRIPTION: description of the corresponding unique gene (nodes table)
  • EXTERNAL_ID: not used?
  • LINKOUT_SOURCE_ID: not used?

GENE_NAMING_SOURCES.txt

Lists all the gene naming sources in the system, e.g. Entrez Gene ID etc.

  • ID: internal GeneMANIA naming source id
  • NAME: naming source name, used in gene linkouts
  • RANK: integer indicating display preference; when a gene is displayed in query results, the symbol whose naming source has the highest rank available for that gene is used
  • SHORT_NAME: not used
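
The RANK rule can be illustrated by joining GENES to GENE_NAMING_SOURCES and keeping, for each node, the symbol with the highest-ranked source. This is a sketch under the assumption that records have already been loaded as dicts keyed by the field names on this page; the function name, the "Gene Name" source, and the rank values are made up. How ties are broken in the real application is not stated, so this version simply keeps the first symbol seen.

```python
def preferred_symbols(genes, naming_sources):
    """Map each NODE_ID to the symbol whose naming source has the highest RANK."""
    rank_by_source = {s["ID"]: int(s["RANK"]) for s in naming_sources}
    best = {}  # NODE_ID -> (rank, symbol)
    for gene in genes:
        rank = rank_by_source[gene["NAMING_SOURCE_ID"]]
        node_id = gene["NODE_ID"]
        # Strict comparison: on a tie the earlier symbol is kept.
        if node_id not in best or rank > best[node_id][0]:
            best[node_id] = (rank, gene["SYMBOL"])
    return {node_id: symbol for node_id, (_, symbol) in best.items()}


# Made-up sample records; only the field names come from this page.
naming_sources = [
    {"ID": "1", "NAME": "Entrez Gene ID", "RANK": "1"},
    {"ID": "2", "NAME": "Gene Name", "RANK": "10"},
]
genes = [
    {"ID": "100", "SYMBOL": "675", "NAMING_SOURCE_ID": "1", "NODE_ID": "7"},
    {"ID": "101", "SYMBOL": "BRCA2", "NAMING_SOURCE_ID": "2", "NODE_ID": "7"},
]
preferred = preferred_symbols(genes, naming_sources)
```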

NETWORKS.txt

Lists all networks in the system. While each network is associated with a particular organism, that association must be determined via the NETWORK_GROUPS table.

  • ID: internal GeneMANIA network id
  • NAME: network name
  • METADATA_ID: id of a record in the NETWORK_METADATA table containing the publication reference etc.
  • DESCRIPTION: optional descriptive text for display
  • DEFAULT_SELECTED: 1 or 0, depending on if the network is to be included in searches with default parameters
  • GROUP_ID: refers to a record in the NETWORK_GROUPS table, indicating to which network group this network belongs, e.g. Co-expression

NETWORK_GROUPS.txt

Contains a record for each network group available for each organism, e.g. Co-expression, Genetic Interaction, etc

  • ID: internal GeneMANIA network group id
  • NAME: network group name
  • CODE: internal mnemonic code for the network group, e.g. spd for Shared Protein Domains
  • DESCRIPTION: not used
  • ORGANISM_ID: internal GeneMANIA id of the organism to which this network group belongs
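
Since NETWORKS.txt has no organism column, finding a network's organism takes one hop through NETWORK_GROUPS. A minimal sketch, assuming both files have been loaded as lists of dicts keyed by the field names on this page (the function name and the sample records are hypothetical):

```python
def organism_for_network(network_id, networks, network_groups):
    """Return the ORGANISM_ID of a network, resolved via its network group."""
    group_by_id = {group["ID"]: group for group in network_groups}
    for network in networks:
        if network["ID"] == network_id:
            return group_by_id[network["GROUP_ID"]]["ORGANISM_ID"]
    raise KeyError(f"no network with id {network_id!r}")


# Made-up sample records for illustration.
network_groups = [
    {"ID": "3", "NAME": "Co-expression", "CODE": "coexp", "ORGANISM_ID": "1"},
    {"ID": "4", "NAME": "Co-expression", "CODE": "coexp", "ORGANISM_ID": "2"},
]
networks = [
    {"ID": "12", "NAME": "Example network A", "GROUP_ID": "3"},
    {"ID": "13", "NAME": "Example network B", "GROUP_ID": "4"},
]
organism_id = organism_for_network("13", networks, network_groups)
```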

NETWORK_METADATA.txt

Descriptive metadata for each network.

  • ID: internal id
  • source: name of data source, e.g. GEO
  • reference: external id of the network, e.g. GSE10502 for a GEO network
  • pubmedId:
  • authors: comma delimited list of author last names
  • publicationName
  • yearPublished
  • processingDescription: e.g. Pearson Correlation
  • networkType: redundant but must contain the network group name like 'Co-expression'
  • alias: not used
  • interactionCount: number of interactions in the network
  • dynamicRange: not used
  • edgeWeightDistribution: not used
  • accessStats: not used
  • comment: additional text to be added to the network description, e.g. '1 of 2 datasets produced from this publication'
  • other: extra generic network labelling text, e.g. 'Small-scale studies','Affinity Capture', is this used?
  • title: name of paper
  • url: reference linkout, typically pubmed
  • sourceUrl: linkout to the data source, e.g. http://thebiogrid.org/. May include a '%s' string interpolation marker; if present, the source_id is inserted at that location in the URL
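
The '%s' handling for sourceUrl might look like the following sketch. It uses a literal replace rather than Python's `%` operator so that URLs without a marker, or with other percent-encoded characters, pass through untouched. The function name and the example.org URL are hypothetical, and which field supplies the inserted id is left to the caller.

```python
def source_linkout(source_url: str, source_id: str) -> str:
    """Insert source_id at the first '%s' marker, if the URL has one."""
    if "%s" in source_url:
        return source_url.replace("%s", source_id, 1)
    return source_url


# A URL without a marker is returned unchanged.
plain = source_linkout("http://thebiogrid.org/", "GSE10502")
# Hypothetical URL with a marker, for illustration only.
filled = source_linkout("https://example.org/dataset/%s", "GSE10502")
```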

NETWORK_TAG_ASSOC.txt

Associates networks to descriptive network tags. Not currently used.

  • ID: internal record id
  • NETWORK_ID: refers to a record in the NETWORKS table
  • TAG_ID: refers to a record in the TAGS table

NODES.txt

Contains a record for each unique gene (graph node) in the system. Each of these genes can be identified by multiple symbols in the GENES table.

  • ID: internal GeneMANIA node id
  • NAME: node name, not used
  • GENE_DATA_ID: refers to a record in the GENE_DATA table containing a description of the gene for display
  • ORGANISM_ID: the GeneMANIA organism id to which the gene belongs

ONTOLOGIES.txt

Lists all the sets of functional annotations available for use in enrichment analysis. Currently there is only one per organism.

  • ID: internal GeneMANIA ontology id
  • NAME: not used

ONTOLOGY_CATEGORIES.txt

Lists the names of all the individual functional categories belonging to the various sets of functional annotations (one set per organism, currently). Used in the display of functional enrichment results.

  • ID: internal GeneMANIA ontology category id
  • ONTOLOGY_ID: refers to a record in the ONTOLOGIES table to which this particular category belongs
  • NAME: the external id of the category, e.g. GO:0005509
  • DESCRIPTION: description of the category, e.g. calcium ion binding

ORGANISMS.txt

Contains a record for each organism in the dataset.

Fields:

  • ID: internal GeneMANIA organism id, e.g. 1
  • NAME: e.g. S. cerevisiae
  • DESCRIPTION: e.g. baker's yeast
  • ALIAS: e.g. Saccharomyces cerevisiae
  • ONTOLOGY_ID: the id corresponding to a record in ONTOLOGIES.txt, specifies a set of functional categories to be used in enrichment analysis
  • TAXONOMY_ID: NCBI taxonomy id, e.g. 4932 for yeast
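
To show how ONTOLOGY_ID ties an organism to its functional categories, the sketch below filters ONTOLOGY_CATEGORIES records by the organism's ONTOLOGY_ID. It assumes records are already loaded as dicts keyed by the field names on this page; the function name, record ids, and ontology ids are made up.

```python
def categories_for_organism(organism, ontology_categories):
    """Map category NAME (external id) to DESCRIPTION for one organism's ontology."""
    return {
        category["NAME"]: category["DESCRIPTION"]
        for category in ontology_categories
        if category["ONTOLOGY_ID"] == organism["ONTOLOGY_ID"]
    }


# Made-up sample records for illustration.
organism = {"ID": "1", "NAME": "S. cerevisiae", "ONTOLOGY_ID": "1"}
ontology_categories = [
    {"ID": "10", "ONTOLOGY_ID": "1", "NAME": "GO:0005509",
     "DESCRIPTION": "calcium ion binding"},
    {"ID": "11", "ONTOLOGY_ID": "2", "NAME": "GO:0005524",
     "DESCRIPTION": "ATP binding"},
]
categories = categories_for_organism(organism, ontology_categories)
```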

STATISTICS.txt

Contains a single record summarizing the number of organisms, networks, and interactions in the dataset.

The fields are:

  • ID: record id, typically 1.
  • organisms: # of organisms
  • networks: total # of networks summed across all organisms
  • interactions: total # of interactions in all networks across all organisms
  • genes: total number of unique genes represented (not gene symbols, which would be larger)
  • predictions: not used
  • date: production date of dataset, YYYY-MM-DD
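
As a rough illustration, the single STATISTICS record could be parsed into typed values as follows. The tab delimiter and the column order (taken from the field list above) are assumptions, and the numbers in the sample line are invented, including the 0 used as a placeholder for the unused predictions field.

```python
from datetime import date

# Column order assumed to follow the field list on this page.
STATISTICS_COLUMNS = [
    "ID", "organisms", "networks", "interactions", "genes", "predictions", "date",
]


def parse_statistics(line: str, delimiter: str = "\t") -> dict:
    """Parse the STATISTICS record, converting counts to int and date to a date."""
    values = line.rstrip("\n").split(delimiter)
    record = dict(zip(STATISTICS_COLUMNS, values))
    for key in ("organisms", "networks", "interactions", "genes"):
        record[key] = int(record[key])
    record["date"] = date.fromisoformat(record["date"])  # expects YYYY-MM-DD
    return record


# Invented sample line for illustration.
stats = parse_statistics("1\t9\t2000\t500000000\t160000\t0\t2014-08-12\n")
```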

TAGS.txt

A list of descriptive tags, similar to MeSH terms, used to describe networks. The terms are not organism-specific; a single set of terms is available to all organisms.

Tags were supported in the past but are no longer used. The file must exist but will be empty. Support may be re-added in the future.

The columns are:

  • ID
  • NAME