Data dictionary for GMS Schema - IntegratedBreedingPlatform/Documentation GitHub Wiki

Introduction

The ICIS GMS database stores and manages information on genesis, genealogy, nomenclature and chronology of germplasm for a particular crop. It operates on the principle of unique identification of germplasm through system assigned Germplasm IDs (GIDs).

In this page we will focus on describing the table structure and related key concepts only. For detailed information about the conceptual background please refer to

GERMPLASM TABLE (GERMPLSM)

Main details of germplasm its parents, genesis methods and other metadata.

Columns Short Name (Name) Description Type Size (bytes)
GERMPLSM_ID (GID) Unique germplasm identifier. GID is automatically generated by the system each time a new record is added. Links to GID in PROGNTRS, NAMES and ATRIBUTS tables. Number (Integer) 4
NO_PROGENITORS (GNPGS) Defines type of genesis and number of progenitors. For a derivative process GNPGS =-1 and then GPID1 contains the germplasm group ID and GPID2 the source germplasm ID. For a generative process GNPGS contains the number of specified parents. (The number of parents required by a method is recorded by NPRGN on the METHODS TABLE). If GNPGS=1 or 2 then the IDs of the progenitors are contained in the GPID1 and GPID2 fields on the GERMPLSM table. If GNPGS>2 then further IDs are stored on the PROGNTRS table. GNPGS = 0 for land race or wild species collections or if none of the parents is known. GNPGS <= NPRGN, but some of the GNPGS specified parents may be unknown in which case the corresponding GPIDs are MISSING (0). For example in a simple cross with only male parent known, GNPGS would have to be 2 with GPID1=0 and GPD2 set to GID of the known male parent. Number (Integer) 4
METHOD_NO (METHN) Number that identifies the method of genesis for the germplasm. Details of the method are on the METHODS table and specific parameters may be stored on the ATTRIBUTES table (see MATTR on the METHODS table). Links to MID in METHODS table. Number (Integer). 4
PROGENITOR_ID1(GPID1) If GNPGS is -1 then GPID1 is the group ID, which is the GID of last source germplasm, produced by a generative process or the GID of the last source of unknown origin in the development path of the current germplasm. This ID defines a group of germplasm in which all members are derived from the same generic parent. If GNPGS >0 then GPID1 points to the first progenitor in a generative process (usually the female parent). GPID1=0 if unknown. See gpid diagram Number (Integer) 4
PROGENITOR_ID2 (GPID2) If GNPGS is -1 then GPID2 points to the immediate source from which the current germplasm is derived otherwise it points to the second progenitor of the germplasm in a generative process (usually the male parent). GPID2=0 if unknown. See gpid diagram. Number (Integer) 4
MANAGEMENT_GID (MGID) If the current germplasm is a managed sample then MGID contains the GID of the germplasm at the root of the management tree, else 0. Number (Integer) 4
GERMPLASM_USER_ID (GERMUID) (renamed to created_by) ID of the user entering details of the current germplasm. UID in the USERS table Number (Integer) 4
LOCATION_NO (GLOCN) Identifier of the location where the germplasm was created as a distinct unit of management with a new GID. LOCN in LOCATION table. GLOCN = MISSING or 0 if location is unknown. For germplasm that acquires a GID because it is a newly-harvested seed sample (for example, as part of a breeding programme, or genebank multiplication programme, or harvested direct from a farmer’s field), this is the location of the plot where the seed were harvested. For a sample of germplasm that is newly collected from a farm store, market place etc, this is the location from which the sample was collected. For germplasm that acquires a GID because it is transferred from one organization or organizational unit to another (for example, germplasm transfer between genebanks, or from farm or market place to genebank, so that it enters a new management regime and becomes the founding sample of a new management neighbourhood), this is the location of the receiving organizational unit. All GIDs created on transfer of germplasm to a new management regime should also have a GID representing the "parental" germplasm sample held by the donating organization. They should have method type import, and GPID2 pointing to the donor's sample. N.B. GIDs derived only by maintenance methods from a sample collected from a farm, market place etc, should have their GPID1 point to the GID of the original collected sample, and GLOCN of the collected sample is the collecting location. Number (Integer) 4
GERMPLASM_DATE (GDATE) Date of germplasm genesis, i.e. the date on which the germplasm was created as a distinct unit of management with a new GID. (YYYYMMDD). GDATE = 0 if unknown. For germplasm that acquires a GID because it is a newly-harvested batch of seed (for example, as part of a breeding programme, or genebank multiplication programme, or harvested direct from a farmer’s field), this is the date the seed were harvested. For a sample of germplasm that is newly collected from a farm store, market place etc, this is the date on which the sample was collected. For germplasm that acquires a GID because it is transferred from one organizational unit to another (for example, germplasm transfer between genebanks, or from farm or market place to genebank), this is the date the receiving organization acquired the germplasm and so started managing the germplasm. N.B. GIDs derived only by maintenance methods from a sample collected from a farm, market place etc, should also have an attribute COLLDATE which is the collection date. GPID1 should point to the GID of the original collected sample, and COLLDATE of the accession should equal the GDATE of the collected sample. Number (Long) 4
GERMPLASM_REFERENCE (GREF) A number that identifies a bibliographic reference from where the germplasm data was retrieved. GREF is missing if the source is unpublished or unknown Number (Integer) 4
GERMPLASM_REPLACE (GRPLCE) Records deletion or replacement for the current GERMPLASM record. 0 for unchanged, own GID for deleted (Deprecated. use deleted), and replacement GID for replaced Number (Integer) 4
cid Number (Integer) 4
sid Number (Integer) 4
gchange Number (Integer) 4
deleted 1 or 0 (supersedes GRPLCE = GID). Added index: germplsm_idx13 to improve performance (mainly total count) in the new germplasm search module Number (Tinyint) 1
germplsm_uuid Germplasm unique identifier Text 60

gpid diagram

gpid

OTHER_PROGENITORS TABLE (PROGNTRS)

Stores IDs of progenitors for generative methods that require more than two parents.

Columns Short Name (Name) Description Type Size (bytes)
ID (ID) Table Primary Key Number (Integer) 4
GERMPLASM_ID (GID) Identifier for the germplasm. GID in the GERMPLSM table Number (Integer) 4
PROGENITOR_NO (PNO) Progenitor number 3, 4, ... Number (Integer) 4
PROGENITOR_ID (PID) Germplasm identifier for progenitor number PNO. GID in the GERMPLSM table. PID = 0 if the progenitor is unknown Number (Integer) 4

NAME DATA

Germplasm collects a multitude of labels during the development and release process. These are all tracked through the NAMES table of GMS. Names are classified by name types which are defined in the USER_DEFINED_FIELDS table. An abbreviation may be associated with each name type, they have a maximum length of eight characters and are stored in the NAME field of the NAMES table like any other name. They are defined in the USER_DEFINED_FIELDS table with USER_FIELD_TYPE = ABBREVIATION but with the same name type as the name Abbreviations inherit the NAME_TYPE, NAME_USER, NAME_LOCATION and NAME_REFERENCE values from the name being abbreviated. This allows abbreviations to be matched to names. Unique matching requires the restriction that different names cannot have all these fields identical. Abbreviations take on their own NAME_DATE so that multiple abbreviations of the same name can be distinguished. One name must be defined as the preferred name, one abbreviation may also be declared as preferred (this need not be an abbreviation of the preferred name). Names may contain imbedded information, and this can be made accessible to application programs for specific name types by specifying a format for the imbedded information in the USER_DEFINED_FIELDS table. All names for local germplasm must appear in the local names table, but some names for germplasm in the central database can appear in the local names table (with positive GID values). One name in the central must be preferred, but central germplasm may also have one name set as preferred in the local names table. This name takes precedence over the preferred name in the central database whenever the particular local is being used. This allows local users to change preferred names or specify local preferred names.

The Names Table (NAMES)

Stores all germplasm names, abbreviations and naming details.

Columns Short Name (Name) Description Type Size (bytes)
NAME_ID (NID) Unique identifier for the name Number (Integer) 4
GERMPLASM_ID (GID) Identifier for the germplasm associated with the name. GID in GERMPLSM table Number (Integer) 4
NAME_TYPE (NTYPE) Number that identifies the type of name or abbreviation. FLDNO in UDFLDS table Number (Integer) 4
NAME_STATUS(NSTAT) Number indicating the storage type and status of the name: 1 – Preferred name (must be ASCII), 8 - Preferred ID, 2 - Preferred abbreviations (must be ASCII), 3 - Chinese-GBK (GD) DBCS names, 4 - Chinese Big 5, 5 - Japanese 6 – Korean, 10 - UNICODE names which are not preferred, 9 - the name is marked as deleted Number (Integer) 4
NAME_USER_ID (NUID) (renamed to created_by) ID of the user naming the germplasm. UID in the USERS table Number (Integer) 4
NAME_VALUE (NVAL) Germplasm name. For ASCII names each character occupies one byte, but for UNICODE names (eg Chinese), each character occupies 2 bytes Text 5000
NAME_LOCATION_NO (NLOCN) Identifier for the location where the name was first assigned to the maintenance neighbourhood of which this GID is a member. LOCN in LOCATION table. NLOCN = 0 if the location is unknown. If the name was newly created for this GID, then NLOCN=GLOCN. If the name was inherited from the source GID (implying a management method of germplasm creation), then NLOCN is inherited from the same name of the source GID. For example, suppose IRRI’s genebank holds an accession donated to it by USDA, identified in USDA with a PI accession number created by USDA. IRIS will contain two GIDs, one (GID-a) to represent the original accession held by USDA, and one (GID-b) to represent the copy held by IRRI, and GPID2 of GID-b will point to GID-a. IRIS will also hold two instances of the corresponding PI number: one as name type ACCNO associated with GID-a, and one as name type FACCN associated with GID-b. Both instances of the PI number must have the same NLOCN, and it must point to USDA. Number (Integer) 4
NAME_DATE (NDATE) Date on which the name was first assigned to the maintenance neighbourhood of which this GID is a member. (YYYYMMDD). NDATE = 0 if the date is unknown. If the name was newly created for this GID, then NDATE=GDATE. If the name was inherited from the source GID (implying a management method of germplasm creation), then NDATE is inherited from the same name of the source GID. For example, suppose IRRI’s genebank holds an accession donated to it by USDA, identified in USDA with a PI accession number created by USDA. IRIS will hold two instances of the corresponding PI number (see NLOCN). Both instances of the PI number must have the same NDATE, and that must be the date that USDA assigned that PI number to its accession Number (Integer) 4
NAME_REFERENCE(NREF) A number that identifies a bibliographic reference for the name. NREF is missing if the source is unpublished or unknown Number (Integer) 4

Name Standardization

A major problem with identifying germplasm is the detection of minor variants of a name. For example some users hyphenate foreign language names while others use spaces or capital letters within a name string of small case letters, some separate a character prefix from a number by a space, others do not. Another problem is that different DBMS and JDBC drivers implement the same SQL search differently with respect to case sensitivity. For these reasons, a set of name standardization has been devised with the objective of producing the same standardized name from as wide a range of variants as possible. These rules are applied in order as follows:

(L= any letter; ^= space; N= any numeral, S= any of {-,',[,],+,.})
a)  Capitalize all letters          Khao-Dawk-Mali105 becomes KHAO-DAWK-MALI105
b)  L( becomes L^( and )L becomes )^L   IR64(BPH) becomes IR64 (BPH)
c)  N( becomes N^( and )N becomes )^N   IR64(5A) becomes IR64 (5A)
d)  L. becomes L^               IR 63 SEL. becomes IR 64 SEL
e)  LN becomes L^N EXCEPT SLN       MALI105 becomes MALI 105 but MALI-F4 IS unchanged
f)  NL becomes N^L EXCEPT SNL       B 533A-1 becomes B 533 A-1 but B 533 A-4B is unchanged
g)  LL-LL becomes LL^LL         KHAO-DAWK-MALI 105 becomes KHAO DAWK MALI 105
h)  ^0N becomes ^N              IRTP 00123 becomes IRTP 123
i)  ^^ becomes ^
j)  REMOVE LEADING OR TRAILING ^
k)  ^) becomes ) and (^ becomes (
l)  L-N becomes L^N when there is only one ‘–‘ in the name and L is not preceded by a space
m)  ^/ becomes / and /^ becomes /

When a name string is supplied for searching, it is searched as supplied and after standardization. Routines, which write names to the database, do not apply the standardization and users must specifically call a standardization routine if they wish to standardize names before storing them. Names in the database are not required to be standardized. It is hoped that the resulting standardized name is acceptable for presentation, but there will be cases where users insist on having a name, which violates the rules. For example CIMMYT prefers that abbreviations contain no spaces. This is allowed but the rule violation should not be used to distinguish genotypes, and the standardized name should be given as a synonym. Users will always be able to search for names that violate the rules and the correct germplasm should be found.

DEFINITION OF METHODS

Method definitions are stored in the METHODS table, and complete documentation, including bibliographic references, is expected. One anticipated use of the methods table is to provide information to applications which calculate the degree of relationship between germplasm, or which trace gene flows through germplasm development. Some methods depend on parameters, which may vary each time the method is used, such as the number and mixing proportions of parents in the generation of a population. These parameters can be defined by assigning an attribute to the method with the METHOD_ATTRIBUTE field. Values for each instance of the associated method are recorded on an attribute record linked to the GID of the germplasm being produced. Attributes are flexible user definable data fields.

The METHODS TABLE (METHODS)

Details of methods of germplasm genesis

Columns Short Name (Name) Description Type Size (bytes)
METHOD_ID (MID) Identifier for the method of germplasm development. METHN on the GERMPLSM table. Number (Integer) 4
METHOD_TYPE (MTYPE) Method type: GEN for generative, DER for derivative, MAN for maintenance methods Text 3
METHOD_GROUP (MGRP) Defines the breeding system to which the method applies: S – self fertilizing, O – cross pollinating, C clonally propagating and G – all systems Text 3
METHOD_CODE (MCODE) Mnemonic description of method Text 8
METHOD_NAME (MNAME) Name of method Text 50
METHOD_REFERENCE (MREF) Number of a reference to the method. Links to the Reference number in the Bibliography Table. May be missing Number (Integer) 4
METHOD_DESCRIPTION (MDESC) Description of the method. Text 255
NO_PROGENITORS (MPRGN) MPRGN =-1 if the method is derivative or management. Otherwise MPRGN is the number of progenitors required for the generative process. MISSING (0) if the number of progenitors is variable Number (Integer) 4
NO_FEMALE_PARENTS (MFPRG) Number of female parents for a generative process, these are specified first in the list of parents. MPRGN = MISSING if the method is derivative or the number of female parents is variable (if variable it can be recorded as an attribute for each instance, see MATTR). Number (Integer) 4
METHOD_ATTRIBUTE (MATTR) Identification number of an attribute associated with this method. The attribute is defined in the UDFLDS table. FLDNO in UDFLDS table and ATYPE in ATRIBUTS table. MISSING if no attribute is associated with the current method. Number (Integer) 4
GENETIC_EQUIVALENCE (GENEQ) METHN of a basic method which has equivalent genetic relationship between progenitors and offspring for the purpose of computing coefficients of parentage Number(Integer) 4
METHOD_USER (MUID) ID of the user defining the method Number (Integer) 4
LOCAL_METHOD_ID (LMID) Local method number from the installation where it was defined Number (Integer) 4
METHOD_DATE (MDATE) Date of method definition Number (Integer) 4
SEPARATOR Separator for naming rule configuration Text 255
PREFIX Prefix for naming rule configuration Text 255
COUNT Count for naming rule configuration Text 255
SUFFIX Suffix for naming rule configuration Text 255
PROGRAM_UUID Unique ID of the program record in workbench_project table of the Workbench database. If null, the method record is accessible in all programs of this crop. When specific program UUID is set it indicates that program's specific breeding method. Text 36

GERMPLASM ATTRIBUTES

Attributes are text variables used to store information about the genesis, genealogy, nomenclature or chronology of germplasm. Attribute types are defined and described on the USER_DEFINED_FIELDS table. Like names, attributes may contain imbedded information in the form of sub-fields or variables within the attribute text. The structure and format of such attributes is defined through the USER_FIELD_FMT of the USER_DEFINED_FIELDS table, as is the format of structured names.

The ATTRIBUTES TABLE (ATRIBUTS)

Table of germplasm attributes used to store information about germplasm genesis, genealogy and nomenclature, which is not stored in other tables.

Columns Short Name (Name) Description Type Size (bytes)
ATTRIBUTE_ID (AID) ID of the Attribute (unique) Number (Long) 4
GERMPLASM _ID (GID) ID of germplasm with which the attribute is associated. GID in GERMPLASM table. Number (Integer) 4
ATTRIBUTE_TYPE (ATYPE) Number that identifies the attribute. FLDNO in UDFLDS table. Use 999 to mark the attribute as deleted. Number (Integer). 4
ATTRIBUTE_USER_ID (AUID) (renamed to created_by) ID of the user giving the attribute value. UID in USERS table Number (Integer) 4
ATTRIBUTE _VALUE (AVAL) Contains the attribute value. Text 5000
ATTRIBUTE_LOCATION_NO (ALOCN) Identifier for the location where the attribute value was set. LOCN on the LOCATION table. ALOCN = MISSING if the location is unknown. Number (Integer) 4
ATTRIBUTE_REFERENCE(AREF) Key to the reference for the attribute data. MISSING if the source is unknown Number (Integer) 4
ATTRIBUTE_DATE (ADATE) Date of value assignation. (YYYYMMDD) Number (Integer) 4

USER DEFINED CONSTANTS AND DATA FIELDS

The ICIS structure allows considerable flexibility in the type and scope of information that can be stored. It does this by storing structural information (metadata) in the USER DEFINED FIELDS TABLE (UDFLDS). Some of this structural information is standard across implementations and is pre-loaded, some is specific to particular crops and is loaded by the central database administrator, and some is specific to local installations and can be loaded or modified by individual users. Types of metadata range from the definition of constants, where only the range or scope of allowable values is defined, to definition of fields where the type, scope and structure are defined. Allowable user or reference types are examples of constants and attribute or location descriptors are fields since they include both description and structure. In some cases both types are possible. For example name types may also have a format describing the structure of information imbedded in a name of a particular type.

The USER_DEFINED_FIELDS TABLE (UDFLDS)

Table of definitions of user defined fields and constants

Columns Short Name (Name) Description Type Size (bytes)
USER_FIELD_NO (FLDNO) Unique identifier for the user defined field. NTYPE in NAMES table, ATYPE in the ATRIBUTS table and MATTR in the METHODS table, DTYPE in the LOCDES table and to the table and field specified in FTABLE and FTYPE for constant definitions Integer 4
USER_FIELD_TABLE (FTABLE) Short Name of the ICIS data table to contain the user defined field:NAMES, ATRIBUTS or LOCDES or the table containing the constant link Text 24
USER_FIELD_TYPE (FTYPE) Type of defined field: NAME, ATTRIBUTE, ABBREVIATION, or DESCRIPTOR or field name for definition of constants Text 12
USER_FIELD_CODE (FCODE) Short name for the user defined field or a permissible values for a constant Text 50
USER_FIELD_NAME (FNAME) Long name for the user defined field or descriptive value for a constant Text 50
USER_FIELD_FMT (FFMT) Definition of sub-fields or variables contained in the user defined field. These variables can have multiple elements, and may contain text, integer or real valued data. If FFMT is blank then the user-defined field contains a single, unnamed text variable with variable length such as a name. Text 255
USER_FIELD_DESCRIPTION (FDESC) Full description of the user defined field, which must describe all sub-fields specified in FFMT and may contain a bibliographic reference. Text 255
LOCAL_FIELD_NO(LFLDNO) Local field number Integer 4
FIELD_USER_ID(FUID) ID of the user defining the field Integer 4
FIELD_DATE(FDATE) Date of field definition Integer 4
SCALE_ID(SCALEID) Identification number of its scale. The valid values can be stored in the SCALEDIS or SCALECON table Integer 4

The KEY_SEQUENCE_REGISTER TABLE (key_sequence_register)

Columns Short Name (Name) Description Type Size (bytes)
ID DB Record Id Integer 4
KEY_PREFIX Prefix for naming rule configuration Text 5000
LAST_USED_SEQUENCE Last sequence used for naming rule configuration Integer 4
OPTIMISTIC_LOCK_NUMBER Optimistic locking as ultimate mechanism to ensure that two threads never end up getting the same sequence number from registry. Integer 4

The SEARCH_REQUEST TABLE

Used by BMSAPI, DMS, Germplasm and Inventory

Columns Short Name (Name) Description Type Size (bytes)
REQUEST_ID DB record Id Integer 4
PARAMETERS Filters for Search Medium Text

The EXTERNAL_REFERENCE TABLE

Stores external references specified when using BraPI POST /germplasm

Columns Short Name (Name) Description Type Size (bytes)
ID Primary key. DB Record id Integer 4
GID Foreign Key to germplasm table Integer 4
REFERENCE_ID Reference Id Text 2000
REFERENCE_SOURCE Reference Source Text 255
CREATED_DATE Created Date TimeStamp 4