DataLayout - GeneMANIA/pipeline GitHub Wiki

Data Layout

The data/ folder collects all gene and interaction data input files. Its top-level organization is:

data/
├── attributes
├── functions
├── identifiers
├── networks
├── organism.cfg
└── metadata_fixes.txt

data/organism.cfg

Organism description configuration file.

Example:

name = Saccharomyces cerevisiae
short_name = S. cerevisiae
common_name = baker's yeast
gm_organism_id = 6
ncbi_taxonomy_id = 4932
default_genes = MRE11, RAD54, RAD52, RAD10, XRS2, CDC27, APC4, APC2, APC5, APC11

The field gm_organism_id is an internal numeric identifier and is assigned automatically if not given. It may be specified explicitly to keep the identifier stable between builds.
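As an illustration only, the file can be read with Python's standard library. This is a minimal sketch that assumes the file is a flat list of key = value pairs with no section header, as in the example above. The helper name read_organism_cfg and the dummy section name are hypothetical and not part of the pipeline.

```python
import configparser


def read_organism_cfg(text):
    """Parse flat 'key = value' text into a dict (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    # configparser requires a section header, so prepend a dummy one.
    parser.read_string("[organism]\n" + text)
    cfg = dict(parser["organism"])
    # default_genes is a comma-separated list of gene symbols.
    raw_genes = cfg.get("default_genes", "")
    cfg["default_genes"] = [g.strip() for g in raw_genes.split(",") if g.strip()]
    return cfg


example = """\
name = Saccharomyces cerevisiae
short_name = S. cerevisiae
gm_organism_id = 6
ncbi_taxonomy_id = 4932
default_genes = MRE11, RAD54, RAD52
"""
cfg = read_organism_cfg(example)
print(cfg["name"], cfg["default_genes"])
```

All values are returned as strings, so numeric fields such as gm_organism_id would need an explicit int() conversion.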

data/metadata_fixes.txt

Optional input file to specify and record changes to network metadata. It should not normally be needed, since metadata edits can be made directly in the individual network .cfg files. The intended use case is:

Performing a bulk import of many networks with similar parameters, e.g. 100 networks from GEO, all as co-expression.
We want to reclassify a particular network as co-localization, but the import script doesn't lend itself to such customization, so we specify the change in metadata_fixes.txt.
We expect a subsequent import to overwrite the .cfg files, but the reclassification is not lost because it is recorded in a separate file.

The file has three tab-delimited columns and no header. The first column is the path, relative to the data folder, of the network .cfg file to be changed. The second and third columns are the name of the variable to change and its new value.

Example: change the group of a network to co-localization

networks/direct/geo/GSE123.cfg	group	coloc
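A minimal sketch of reading this file, assuming exactly three tab-delimited columns per record as described above. The function name read_metadata_fixes is hypothetical, and how the pipeline itself applies the fixes is not shown here.

```python
import csv
import io


def read_metadata_fixes(handle):
    """Return {cfg_path: {variable: value}} from tab-delimited records."""
    fixes = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue  # skip blank lines
        path, name, value = row  # exactly three columns are expected
        fixes.setdefault(path, {})[name] = value
    return fixes


sample = io.StringIO("networks/direct/geo/GSE123.cfg\tgroup\tcoloc\n")
fixes = read_metadata_fixes(sample)
print(fixes)
```

Grouping by path makes it easy to apply every override for one network after its .cfg file has been loaded.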

data/identifiers/

Text files containing gene identifiers and their descriptions, organized into subfolders.

data/identifiers/
├── descriptions
├── mixed_table
└── symbols

data/identifiers/symbols/{filename}.txt

Gene identifiers in id/symbol/source triplets, one per line. Multiple files may be provided; their symbols will be aggregated.

There can be multiple records for each 'id'. The 'id' is used only to group records together and can be any string; it will not be used to identify the genes externally. 'symbol' and 'source' are text strings such as 'HRA1' and 'Gene Name' respectively.

No header, tab-delimited records.

Example:

1334149 Q0060   Ensembl Gene ID
1334149 854595  Entrez Gene ID
1334149 AI3     Gene Name
1334149 NP_009308       RefSeq Protein ID
1334149 NM_001184355    RefSeq mRNA ID
1334149 I-SceIII        Synonym
1334149 P03877  Uniprot ID
1334149 SCE3_YEAST      Uniprot ID
1334150 Q0065   Ensembl Gene ID
1334150 854596  Entrez Gene ID
1334150 AI4     Gene Name
1334150 NP_009307       RefSeq Protein ID
1334150 NM_001184356    RefSeq mRNA ID
1334150 I-SceII Synonym
1334150 P03878  Uniprot ID
1334150 SCE2_YEAST      Uniprot ID
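The grouping role of the 'id' column can be sketched as follows. This is an illustration assuming tab-delimited triplets as described above; group_symbols is a hypothetical name, not a pipeline function.

```python
from collections import defaultdict


def group_symbols(lines):
    """Group (symbol, source) pairs by the internal id in column 1."""
    genes = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        gene_id, symbol, source = line.split("\t")
        genes[gene_id].append((symbol, source))
    return dict(genes)


records = [
    "1334149\tQ0060\tEnsembl Gene ID\n",
    "1334149\tAI3\tGene Name\n",
    "1334150\tQ0065\tEnsembl Gene ID\n",
]
genes = group_symbols(records)
print(genes["1334149"])
```

Lines from several symbol files can be chained into one call, which mirrors the aggregation of multiple files.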

data/identifiers/descriptions/{filename}.txt

Gene descriptions in id/description pairs, one per line.

No header, tab-delimited records.

Example:

1334149 Endonuclease I-SceIII, encoded by a mobile group I intron within the mitochondrial COX1 gene [Source:SGD;Acc:S000007263]
1334150 Endonuclease I-SceII, encoded by a mobile group I intron within the mitochondrial COX1 gene; intron is normally spliced by the BI4p maturase

data/identifiers/mixed_table/{filename}.txt

For compatibility with identifier files from a previous system (avoid it if you can). This file contains multiple identifier sources, multiple identifiers per source, gene descriptions, and gene biotype information, organized into one row per gene.

Tab-delimited table with a header row. Some fields contain multiple values, delimited by semicolons.

Example:

GMID    Ensembl Gene ID Protein Coding  Gene Name       Ensembl Transcript ID   Ensembl Protein ID      Uniprot ID      Entrez Gene ID  RefSeq mRNA ID  RefSeq Protein ID       Synonyms        Definition
1334136 15S_rRNA        rRNA    15S_RRNA        15S_rRNA                N/A     N/A     N/A     N/A     15S_RRNA_2;14s rRNA     Ribosomal RNA of the small mitochondrial ribosomal subunit; MSU1 allele suppresses ochre stop mutations in mitochondrial protein-coding genes [Source:SGD;Acc:S000007287]
1334137 21S_rRNA        rRNA    21S_RRNA        21S_rRNA                N/A     N/A     N/A     N/A     21S_rRNA_4;21S_rRNA_3   Mitochondrial 21S rRNA; intron encodes the I-SceI DNA endonuclease [Source:SGD;Acc:S000007288]
1334138 HRA1    ncRNA   HRA1    HRA1            N/A     N/A     N/A     N/A     N/A     Non-protein-coding RNA, substrate of RNase P, possibly involved in rRNA processing, specifically maturation of 20S precursor into the mature 18S rRNA [Source:SGD;Acc:S000119380]
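A rough sketch of reading such a table follows. It assumes that only the Synonyms column should be split on semicolons (the Definition text also contains semicolons) and that 'N/A' marks a missing value. Both are assumptions drawn from the example rows, and read_mixed_table is a hypothetical name.

```python
import csv
import io


def read_mixed_table(handle, multi_fields=("Synonyms",)):
    """Yield one dict per gene, splitting the named fields on ';'."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for field in multi_fields:
            values = [v.strip() for v in (row.get(field) or "").split(";")]
            # Assumption: empty strings and 'N/A' both mean "no value".
            row[field] = [v for v in values if v and v != "N/A"]
        yield row


sample = io.StringIO(
    "GMID\tGene Name\tSynonyms\tDefinition\n"
    "1334136\t15S_RRNA\t15S_RRNA_2;14s rRNA\tRibosomal RNA; MSU1 allele\n"
)
rows = list(read_mixed_table(sample))
print(rows[0]["Synonyms"])
```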

data/networks/

Interaction networks are specified by data files and optional configuration files, organized by the type of processing required to convert the data into networks.

data/networks/
├── direct
│   ├── {collection2}
│   └── {collection3}
├── profile
│   └── {collection3}
└── sharedneighbour
    └── {collection4}

Collections are subfolders that organize networks for ease of management, e.g. by data source, and can be given any name by the user. Note that a collection is different from the Network Group displayed for a network in the application: collection names are for internal organization and are not shown to the user.

data/networks/direct/{collection}/{filename}.txt

Direct networks are given in text files where each record contains a gene-symbol/gene-symbol/interaction-weight triplet.

EAF1    YPI1    1
EAF1    BNI4    1
EAF1    GIP2    1
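A minimal sketch of loading such a file, assuming whitespace-separated fields and gene symbols that contain no spaces. The delimiter is not stated for this format, so treat that as an assumption; read_direct_network is a hypothetical name.

```python
def read_direct_network(lines):
    """Return a list of (gene_a, gene_b, weight) edges."""
    edges = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene_a, gene_b, weight = fields  # exactly three fields per record
        edges.append((gene_a, gene_b, float(weight)))
    return edges


edges = read_direct_network(["EAF1\tYPI1\t1", "EAF1\tBNI4\t1", "EAF1\tGIP2\t1"])
print(edges)
```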

data/networks/direct/{collection}/{filename}.cfg

Each network can have a configuration file specifying the network name, group, and providing reference and other metadata. The file should have the exact same name as the corresponding network data file, but ending in '.cfg' instead of '.txt'.

Example:

group = gi
default_selected = 1
name = ""
description = ""
pubmed_id = 21984913
source = BIOGRID
source_id = ""

Network names and descriptions are optional and will be generated automatically from the publication record retrieved from PubMed when available.

data/networks/profile/{collection}/{filename}.txt

Networks are computed from profile data, where each record contains a gene identifier followed by a series of numeric measurements.

Example:

YAL001C 0.629   0.209   0.141   1.001   1.492   0.102
YAL002W 0.011   0.06    0.301   0.243   -0.046  0.14
YAL003W -0.522  -0.117  0.721   0.595   -0.402  0.315
YAL004W -0.4079 0.063   0.267   0.269   -0.627  0.276
YAL005C -0.195  0.009   1.304   2.426   -0.642  0.328
YAL007C -0.633  -0.222  -0.28   0.091   -0.447  0.267
YAL008W -0.303  -0.115  0.214   1.912   -0.598  0.223
YAL009W -0.012  0.096   -0.108  0.106   0.43    0.08
YAL010C 0.159   0.098   0.536   0.212   0.076   0.169
YAL011W 0.263   0.498   0.482   0.215   -0.045  0.32

Network metadata is specified in a corresponding .cfg file as for direct networks.
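To illustrate how a profile file could yield interaction weights, the sketch below parses the records and scores a gene pair with Pearson correlation. The choice of Pearson correlation is an assumption made for illustration; this page does not state which similarity measure the pipeline uses. The function names are hypothetical.

```python
import math


def read_profiles(lines):
    """Return {gene: [measurements]} from whitespace-separated records."""
    profiles = {}
    for line in lines:
        fields = line.split()
        if fields:
            profiles[fields[0]] = [float(x) for x in fields[1:]]
    return profiles


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant profile has no defined correlation; report 0.0 here.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


profiles = read_profiles([
    "YAL001C 0.629 0.209 0.141 1.001 1.492 0.102",
    "YAL002W 0.011 0.06 0.301 0.243 -0.046 0.14",
])
print(pearson(profiles["YAL001C"], profiles["YAL002W"]))
```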

data/networks/sharedneighbour/{collection}/{filename}.txt

Networks are computed from sparse binary profiles, where each record contains a gene identifier followed by the names (or ids) of the binary features it possesses.

YBR218C IPR000089       IPR005479       IPR003379       IPR005481       IPR005482       IPR000891
YBR221C IPR005476       IPR005475

Network metadata is specified in a corresponding .cfg file as for direct networks.
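As a sketch of the underlying data structure, each gene can be mapped to the set of features it possesses, after which the features shared by two genes are a set intersection. How the pipeline turns shared features into a weight is not described here, so only the intersection is shown. The function names are hypothetical, and merging repeated records for one gene is an assumption.

```python
def read_binary_profiles(lines):
    """Return {gene: set of feature names} from whitespace-separated records."""
    profiles = {}
    for line in lines:
        fields = line.split()
        if fields:
            # Assumption: repeated records for a gene are merged.
            profiles.setdefault(fields[0], set()).update(fields[1:])
    return profiles


def shared_features(profiles, gene_a, gene_b):
    """Features possessed by both genes."""
    return profiles.get(gene_a, set()) & profiles.get(gene_b, set())


profiles = read_binary_profiles([
    "YBR218C IPR000089 IPR005479 IPR003379",
    "YBR221C IPR005476 IPR005475",
])
print(shared_features(profiles, "YBR218C", "YBR221C"))
```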

data/functions/

Functional annotations for network combination and enrichment analysis. These are currently specified in a single tabular text file in a legacy format. Simpler formats will be supported in the future (issue #2).

The file contains 11 columns of tab-delimited data, preceded by a pair of comment rows starting with '#'.

# go db: 2014-03-08 assocdb None
# genus 'Saccharomyces' species 'cerevisiae' taxonomy id 4932
organellar small ribosomal subunit      cellular_component      GO:0000314      mitochondrial small ribosomal subunit   cellular_component      GO:0005763      15S_RRNA        SGD     S000007287      ISS     1
mitochondrial ribosome  cellular_component      GO:0005761      mitochondrial small ribosomal subunit   cellular_component      GO:0005763      15S_RRNA        SGD     S000007287      ISS     1
mitochondrial part      cellular_component      GO:0044429      mitochondrial small ribosomal subunit   cellular_component      GO:0005763      15S_RRNA        SGD     S000007287      ISS     1

The relevant columns are 1, 2, 3, and 7, containing the category name, GO branch, category id, and gene name. Transitive annotations must be provided in addition to direct ones, and any other desired filtering, such as by evidence code, should already have been applied when preparing this file. Downstream filtering of this file will be by gene symbol and category size. Redundant annotations are allowed and will be removed.
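The column selection and removal of redundant annotations can be sketched as follows, assuming tab-delimited rows and '#' comment rows as described above. read_annotations is a hypothetical name, and the gene symbol and category size filters are not shown.

```python
def read_annotations(lines):
    """Return unique (category, branch, category_id, gene) tuples."""
    seen = set()
    records = []
    for line in lines:
        if line.startswith("#"):
            continue  # comment rows at the top of the file
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed rows
        # Columns 1, 2, 3 and 7 (1-based) are the relevant ones.
        record = (fields[0], fields[1], fields[2], fields[6])
        if record not in seen:  # redundant annotations are dropped
            seen.add(record)
            records.append(record)
    return records


row = "\t".join([
    "mitochondrial ribosome", "cellular_component", "GO:0005761",
    "mitochondrial small ribosomal subunit", "cellular_component",
    "GO:0005763", "15S_RRNA", "SGD", "S000007287", "ISS", "1",
])
annotations = read_annotations(["# go db: 2014-03-08 assocdb None", row, row])
print(annotations)
```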

data/attributes/

Gene attributes are collections of binary features treated as networks by representing each feature as a clique connecting all the genes that possess that attribute.

data/attributes/
├── attrib-gene-list
│   └── {collection1}
└── gene-attrib-list
    ├── {collection2}
    └── {collection3}

data/attributes/attrib-gene-list/{collection}/{filename}.txt

Attributes are specified by a text file containing an attribute name followed by a list of genes that possess the attribute.

DB03307      CDK2
DB02059     GAPDH   SIRT3   SIRT5   EEF2

Multiple records for the same gene are allowed, so the attribute list can be flattened into a column.

Network metadata is specified in a corresponding .cfg file as for direct networks.
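The clique representation described at the top of this section can be sketched as follows: every pair of genes sharing an attribute becomes an edge. This is an illustration assuming whitespace-separated records; attribute_cliques is a hypothetical name, and the pipeline may represent attributes differently internally.

```python
from itertools import combinations


def attribute_cliques(lines):
    """Return {attribute: list of gene pairs forming its clique}."""
    members = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # need an attribute name and at least one gene
        members.setdefault(fields[0], set()).update(fields[1:])
    # Each attribute connects every pair of genes that possess it.
    return {
        attr: list(combinations(sorted(genes), 2))
        for attr, genes in members.items()
    }


cliques = attribute_cliques([
    "DB03307 CDK2",
    "DB02059 GAPDH SIRT3 SIRT5 EEF2",
])
print(len(cliques["DB02059"]))
```

An attribute held by n genes yields n(n-1)/2 edges, so an attribute with a single gene contributes none.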

data/attributes/attrib-gene-list/{collection}/{filename}.desc

A text file containing a descriptive string for each attribute, to include in user display.

DB03307 4-[(6-Amino-4-Pyrimidinyl)Amino]Benzenesulfonamide
DB02059 Adenosine-5-Diphosphoribose

data/attributes/gene-attrib-list/{collection}/{filename}.txt

Attributes are specified by a text file containing a gene symbol followed by a list of attributes the gene possesses.

ENSDARG00000000086      SSF48065      SSF49562      SSF50044      SSF50729
ENSDARG00000000102      SSF48726
ENSDARG00000000102      SSF56112
ENSDARG00000000102      SSF57440

Multiple records for the same gene are allowed, so the attribute list can be flattened into a column.

Network metadata is specified in a corresponding .cfg file as for direct networks.

Attribute descriptions are specified in a corresponding .desc file, as for attrib-gene-list.
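The two attribute layouts carry the same information. As an illustration, a gene-attrib-list file, including flattened records with one attribute per line, can be inverted into the attrib-gene-list orientation. This sketch assumes whitespace-separated records, and invert_gene_attributes is a hypothetical name.

```python
def invert_gene_attributes(lines):
    """Return {attribute: set of genes} from gene-first records."""
    by_attribute = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # need a gene and at least one attribute
        gene = fields[0]
        for attribute in fields[1:]:
            by_attribute.setdefault(attribute, set()).add(gene)
    return by_attribute


by_attribute = invert_gene_attributes([
    "ENSDARG00000000086 SSF48065 SSF49562",
    "ENSDARG00000000102 SSF48726",
    "ENSDARG00000000102 SSF56112",
])
print(sorted(by_attribute))
```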

[TODO:isn't there .gmt format support also?]