Data dictionary for GMS Schema - IntegratedBreedingPlatform/Documentation GitHub Wiki
Introduction
The ICIS GMS database stores and manages information on genesis, genealogy, nomenclature and chronology of germplasm for a particular crop. It operates on the principle of unique identification of germplasm through system assigned Germplasm IDs (GIDs).
In this page we will focus on describing the table structure and related key concepts only. For detailed information about the conceptual background please refer to
GERMPLASM TABLE (GERMPLSM)
Main details of germplasm its parents, genesis methods and other metadata.
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
GERMPLSM_ID (GID) | Unique germplasm identifier. GID is automatically generated by the system each time a new record is added. Links to GID in PROGNTRS, NAMES and ATRIBUTS tables. | Number (Integer) | 4 |
NO_PROGENITORS (GNPGS) | Defines type of genesis and number of progenitors. For a derivative process GNPGS =-1 and then GPID1 contains the germplasm group ID and GPID2 the source germplasm ID. For a generative process GNPGS contains the number of specified parents. (The number of parents required by a method is recorded by NPRGN on the METHODS TABLE). If GNPGS=1 or 2 then the IDs of the progenitors are contained in the GPID1 and GPID2 fields on the GERMPLSM table. If GNPGS>2 then further IDs are stored on the PROGNTRS table. GNPGS = 0 for land race or wild species collections or if none of the parents is known. GNPGS <= NPRGN, but some of the GNPGS specified parents may be unknown in which case the corresponding GPIDs are MISSING (0). For example in a simple cross with only male parent known, GNPGS would have to be 2 with GPID1=0 and GPD2 set to GID of the known male parent. | Number (Integer) | 4 |
METHOD_NO (METHN) | Number that identifies the method of genesis for the germplasm. Details of the method are on the METHODS table and specific parameters may be stored on the ATTRIBUTES table (see MATTR on the METHODS table). Links to MID in METHODS table. | Number (Integer). | 4 |
PROGENITOR_ID1(GPID1) | If GNPGS is -1 then GPID1 is the group ID, which is the GID of last source germplasm, produced by a generative process or the GID of the last source of unknown origin in the development path of the current germplasm. This ID defines a group of germplasm in which all members are derived from the same generic parent. If GNPGS >0 then GPID1 points to the first progenitor in a generative process (usually the female parent). GPID1=0 if unknown. See gpid diagram | Number (Integer) | 4 |
PROGENITOR_ID2 (GPID2) | If GNPGS is -1 then GPID2 points to the immediate source from which the current germplasm is derived otherwise it points to the second progenitor of the germplasm in a generative process (usually the male parent). GPID2=0 if unknown. See gpid diagram. | Number (Integer) | 4 |
MANAGEMENT_GID (MGID) | If the current germplasm is a managed sample then MGID contains the GID of the germplasm at the root of the management tree, else 0. | Number (Integer) | 4 |
created_by ) |
ID of the user entering details of the current germplasm. UID in the USERS table | Number (Integer) | 4 |
LOCATION_NO (GLOCN) | Identifier of the location where the germplasm was created as a distinct unit of management with a new GID. LOCN in LOCATION table. GLOCN = MISSING or 0 if location is unknown. For germplasm that acquires a GID because it is a newly-harvested seed sample (for example, as part of a breeding programme, or genebank multiplication programme, or harvested direct from a farmer’s field), this is the location of the plot where the seed were harvested. For a sample of germplasm that is newly collected from a farm store, market place etc, this is the location from which the sample was collected. For germplasm that acquires a GID because it is transferred from one organization or organizational unit to another (for example, germplasm transfer between genebanks, or from farm or market place to genebank, so that it enters a new management regime and becomes the founding sample of a new management neighbourhood), this is the location of the receiving organizational unit. All GIDs created on transfer of germplasm to a new management regime should also have a GID representing the "parental" germplasm sample held by the donating organization. They should have method type import, and GPID2 pointing to the donor's sample. N.B. GIDs derived only by maintenance methods from a sample collected from a farm, market place etc, should have their GPID1 point to the GID of the original collected sample, and GLOCN of the collected sample is the collecting location. | Number (Integer) | 4 |
GERMPLASM_DATE (GDATE) | Date of germplasm genesis, i.e. the date on which the germplasm was created as a distinct unit of management with a new GID. (YYYYMMDD). GDATE = 0 if unknown. For germplasm that acquires a GID because it is a newly-harvested batch of seed (for example, as part of a breeding programme, or genebank multiplication programme, or harvested direct from a farmer’s field), this is the date the seed were harvested. For a sample of germplasm that is newly collected from a farm store, market place etc, this is the date on which the sample was collected. For germplasm that acquires a GID because it is transferred from one organizational unit to another (for example, germplasm transfer between genebanks, or from farm or market place to genebank), this is the date the receiving organization acquired the germplasm and so started managing the germplasm. N.B. GIDs derived only by maintenance methods from a sample collected from a farm, market place etc, should also have an attribute COLLDATE which is the collection date. GPID1 should point to the GID of the original collected sample, and COLLDATE of the accession should equal the GDATE of the collected sample. | Number (Long) | 4 |
GERMPLASM_REFERENCE (GREF) | A number that identifies a bibliographic reference from where the germplasm data was retrieved. GREF is missing if the source is unpublished or unknown | Number (Integer) | 4 |
GERMPLASM_REPLACE (GRPLCE) | Records deletion or replacement for the current GERMPLASM record. 0 for unchanged, own GID for deleted (Deprecated. use deleted ), and replacement GID for replaced |
Number (Integer) | 4 |
cid | Number (Integer) | 4 | |
sid | Number (Integer) | 4 | |
gchange | Number (Integer) | 4 | |
deleted | 1 or 0 (supersedes GRPLCE = GID). Added index: germplsm_idx13 to improve performance (mainly total count) in the new germplasm search module | Number (Tinyint) | 1 |
germplsm_uuid | Germplasm unique identifier | Text | 60 |
gpid diagram
OTHER_PROGENITORS TABLE (PROGNTRS)
Stores IDs of progenitors for generative methods that require more than two parents.
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
ID (ID) | Table Primary Key | Number (Integer) | 4 |
GERMPLASM_ID (GID) | Identifier for the germplasm. GID in the GERMPLSM table | Number (Integer) | 4 |
PROGENITOR_NO (PNO) | Progenitor number 3, 4, ... | Number (Integer) | 4 |
PROGENITOR_ID (PID) | Germplasm identifier for progenitor number PNO. GID in the GERMPLSM table. PID = 0 if the progenitor is unknown | Number (Integer) | 4 |
NAME DATA
Germplasm collects a multitude of labels during the development and release process. These are all tracked through the NAMES table of GMS. Names are classified by name types which are defined in the USER_DEFINED_FIELDS table. An abbreviation may be associated with each name type, they have a maximum length of eight characters and are stored in the NAME field of the NAMES table like any other name. They are defined in the USER_DEFINED_FIELDS table with USER_FIELD_TYPE = ABBREVIATION but with the same name type as the name Abbreviations inherit the NAME_TYPE, NAME_USER, NAME_LOCATION and NAME_REFERENCE values from the name being abbreviated. This allows abbreviations to be matched to names. Unique matching requires the restriction that different names cannot have all these fields identical. Abbreviations take on their own NAME_DATE so that multiple abbreviations of the same name can be distinguished. One name must be defined as the preferred name, one abbreviation may also be declared as preferred (this need not be an abbreviation of the preferred name). Names may contain imbedded information, and this can be made accessible to application programs for specific name types by specifying a format for the imbedded information in the USER_DEFINED_FIELDS table. All names for local germplasm must appear in the local names table, but some names for germplasm in the central database can appear in the local names table (with positive GID values). One name in the central must be preferred, but central germplasm may also have one name set as preferred in the local names table. This name takes precedence over the preferred name in the central database whenever the particular local is being used. This allows local users to change preferred names or specify local preferred names.
The Names Table (NAMES)
Stores all germplasm names, abbreviations and naming details.
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
NAME_ID (NID) | Unique identifier for the name | Number (Integer) | 4 |
GERMPLASM_ID (GID) | Identifier for the germplasm associated with the name. GID in GERMPLSM table | Number (Integer) | 4 |
NAME_TYPE (NTYPE) | Number that identifies the type of name or abbreviation. FLDNO in UDFLDS table | Number (Integer) | 4 |
NAME_STATUS(NSTAT) | Number indicating the storage type and status of the name: 1 – Preferred name (must be ASCII), 8 - Preferred ID, 2 - Preferred abbreviations (must be ASCII), 3 - Chinese-GBK (GD) DBCS names, 4 - Chinese Big 5, 5 - Japanese 6 – Korean, 10 - UNICODE names which are not preferred, 9 - the name is marked as deleted | Number (Integer) | 4 |
created_by ) |
ID of the user naming the germplasm. UID in the USERS table | Number (Integer) | 4 |
NAME_VALUE (NVAL) | Germplasm name. For ASCII names each character occupies one byte, but for UNICODE names (eg Chinese), each character occupies 2 bytes | Text | 5000 |
NAME_LOCATION_NO (NLOCN) | Identifier for the location where the name was first assigned to the maintenance neighbourhood of which this GID is a member. LOCN in LOCATION table. NLOCN = 0 if the location is unknown. If the name was newly created for this GID, then NLOCN=GLOCN. If the name was inherited from the source GID (implying a management method of germplasm creation), then NLOCN is inherited from the same name of the source GID. For example, suppose IRRI’s genebank holds an accession donated to it by USDA, identified in USDA with a PI accession number created by USDA. IRIS will contain two GIDs, one (GID-a) to represent the original accession held by USDA, and one (GID-b) to represent the copy held by IRRI, and GPID2 of GID-b will point to GID-a. IRIS will also hold two instances of the corresponding PI number: one as name type ACCNO associated with GID-a, and one as name type FACCN associated with GID-b. Both instances of the PI number must have the same NLOCN, and it must point to USDA. | Number (Integer) | 4 |
NAME_DATE (NDATE) | Date on which the name was first assigned to the maintenance neighbourhood of which this GID is a member. (YYYYMMDD). NDATE = 0 if the date is unknown. If the name was newly created for this GID, then NDATE=GDATE. If the name was inherited from the source GID (implying a management method of germplasm creation), then NDATE is inherited from the same name of the source GID. For example, suppose IRRI’s genebank holds an accession donated to it by USDA, identified in USDA with a PI accession number created by USDA. IRIS will hold two instances of the corresponding PI number (see NLOCN). Both instances of the PI number must have the same NDATE, and that must be the date that USDA assigned that PI number to its accession | Number (Integer) | 4 |
NAME_REFERENCE(NREF) | A number that identifies a bibliographic reference for the name. NREF is missing if the source is unpublished or unknown | Number (Integer) | 4 |
Name Standardization
A major problem with identifying germplasm is the detection of minor variants of a name. For example some users hyphenate foreign language names while others use spaces or capital letters within a name string of small case letters, some separate a character prefix from a number by a space, others do not. Another problem is that different DBMS and JDBC drivers implement the same SQL search differently with respect to case sensitivity. For these reasons, a set of name standardization has been devised with the objective of producing the same standardized name from as wide a range of variants as possible. These rules are applied in order as follows:
(L= any letter; ^= space; N= any numeral, S= any of {-,',[,],+,.})
a) Capitalize all letters Khao-Dawk-Mali105 becomes KHAO-DAWK-MALI105
b) L( becomes L^( and )L becomes )^L IR64(BPH) becomes IR64 (BPH)
c) N( becomes N^( and )N becomes )^N IR64(5A) becomes IR64 (5A)
d) L. becomes L^ IR 63 SEL. becomes IR 64 SEL
e) LN becomes L^N EXCEPT SLN MALI105 becomes MALI 105 but MALI-F4 IS unchanged
f) NL becomes N^L EXCEPT SNL B 533A-1 becomes B 533 A-1 but B 533 A-4B is unchanged
g) LL-LL becomes LL^LL KHAO-DAWK-MALI 105 becomes KHAO DAWK MALI 105
h) ^0N becomes ^N IRTP 00123 becomes IRTP 123
i) ^^ becomes ^
j) REMOVE LEADING OR TRAILING ^
k) ^) becomes ) and (^ becomes (
l) L-N becomes L^N when there is only one ‘–‘ in the name and L is not preceded by a space
m) ^/ becomes / and /^ becomes /
When a name string is supplied for searching, it is searched as supplied and after standardization. Routines, which write names to the database, do not apply the standardization and users must specifically call a standardization routine if they wish to standardize names before storing them. Names in the database are not required to be standardized. It is hoped that the resulting standardized name is acceptable for presentation, but there will be cases where users insist on having a name, which violates the rules. For example CIMMYT prefers that abbreviations contain no spaces. This is allowed but the rule violation should not be used to distinguish genotypes, and the standardized name should be given as a synonym. Users will always be able to search for names that violate the rules and the correct germplasm should be found.
DEFINITION OF METHODS
Method definitions are stored in the METHODS table, and complete documentation, including bibliographic references, is expected. One anticipated use of the methods table is to provide information to applications which calculate the degree of relationship between germplasm, or which trace gene flows through germplasm development. Some methods depend on parameters, which may vary each time the method is used, such as the number and mixing proportions of parents in the generation of a population. These parameters can be defined by assigning an attribute to the method with the METHOD_ATTRIBUTE field. Values for each instance of the associated method are recorded on an attribute record linked to the GID of the germplasm being produced. Attributes are flexible user definable data fields.
The METHODS TABLE (METHODS)
Details of methods of germplasm genesis
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
METHOD_ID (MID) | Identifier for the method of germplasm development. METHN on the GERMPLSM table. | Number (Integer) | 4 |
METHOD_TYPE (MTYPE) | Method type: GEN for generative, DER for derivative, MAN for maintenance methods | Text | 3 |
METHOD_GROUP (MGRP) | Defines the breeding system to which the method applies: S – self fertilizing, O – cross pollinating, C clonally propagating and G – all systems | Text | 3 |
METHOD_CODE (MCODE) | Mnemonic description of method | Text | 8 |
METHOD_NAME (MNAME) | Name of method | Text | 50 |
METHOD_REFERENCE (MREF) | Number of a reference to the method. Links to the Reference number in the Bibliography Table. May be missing | Number (Integer) | 4 |
METHOD_DESCRIPTION (MDESC) | Description of the method. | Text | 255 |
NO_PROGENITORS (MPRGN) | MPRGN =-1 if the method is derivative or management. Otherwise MPRGN is the number of progenitors required for the generative process. MISSING (0) if the number of progenitors is variable | Number (Integer) | 4 |
NO_FEMALE_PARENTS (MFPRG) | Number of female parents for a generative process, these are specified first in the list of parents. MPRGN = MISSING if the method is derivative or the number of female parents is variable (if variable it can be recorded as an attribute for each instance, see MATTR). | Number (Integer) | 4 |
METHOD_ATTRIBUTE (MATTR) | Identification number of an attribute associated with this method. The attribute is defined in the UDFLDS table. FLDNO in UDFLDS table and ATYPE in ATRIBUTS table. MISSING if no attribute is associated with the current method. | Number (Integer) | 4 |
GENETIC_EQUIVALENCE (GENEQ) | METHN of a basic method which has equivalent genetic relationship between progenitors and offspring for the purpose of computing coefficients of parentage | Number(Integer) | 4 |
METHOD_USER (MUID) | ID of the user defining the method | Number (Integer) | 4 |
LOCAL_METHOD_ID (LMID) | Local method number from the installation where it was defined | Number (Integer) | 4 |
METHOD_DATE (MDATE) | Date of method definition | Number (Integer) | 4 |
SEPARATOR | Separator for naming rule configuration | Text | 255 |
PREFIX | Prefix for naming rule configuration | Text | 255 |
COUNT | Count for naming rule configuration | Text | 255 |
SUFFIX | Suffix for naming rule configuration | Text | 255 |
PROGRAM_UUID | Unique ID of the program record in workbench_project table of the Workbench database. If null, the method record is accessible in all programs of this crop. When specific program UUID is set it indicates that program's specific breeding method. | Text | 36 |
GERMPLASM ATTRIBUTES
Attributes are text variables used to store information about the genesis, genealogy, nomenclature or chronology of germplasm. Attribute types are defined and described on the USER_DEFINED_FIELDS table. Like names, attributes may contain imbedded information in the form of sub-fields or variables within the attribute text. The structure and format of such attributes is defined through the USER_FIELD_FMT of the USER_DEFINED_FIELDS table, as is the format of structured names.
The ATTRIBUTES TABLE (ATRIBUTS)
Table of germplasm attributes used to store information about germplasm genesis, genealogy and nomenclature, which is not stored in other tables.
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
ATTRIBUTE_ID (AID) | ID of the Attribute (unique) | Number (Long) | 4 |
GERMPLASM _ID (GID) | ID of germplasm with which the attribute is associated. GID in GERMPLASM table. | Number (Integer) | 4 |
ATTRIBUTE_TYPE (ATYPE) | Number that identifies the attribute. FLDNO in UDFLDS table. Use 999 to mark the attribute as deleted. | Number (Integer). | 4 |
created_by ) |
ID of the user giving the attribute value. UID in USERS table | Number (Integer) | 4 |
ATTRIBUTE _VALUE (AVAL) | Contains the attribute value. | Text | 5000 |
ATTRIBUTE_LOCATION_NO (ALOCN) | Identifier for the location where the attribute value was set. LOCN on the LOCATION table. ALOCN = MISSING if the location is unknown. | Number (Integer) | 4 |
ATTRIBUTE_REFERENCE(AREF) | Key to the reference for the attribute data. MISSING if the source is unknown | Number (Integer) | 4 |
ATTRIBUTE_DATE (ADATE) | Date of value assignation. (YYYYMMDD) | Number (Integer) | 4 |
USER DEFINED CONSTANTS AND DATA FIELDS
The ICIS structure allows considerable flexibility in the type and scope of information that can be stored. It does this by storing structural information (metadata) in the USER DEFINED FIELDS TABLE (UDFLDS). Some of this structural information is standard across implementations and is pre-loaded, some is specific to particular crops and is loaded by the central database administrator, and some is specific to local installations and can be loaded or modified by individual users. Types of metadata range from the definition of constants, where only the range or scope of allowable values is defined, to definition of fields where the type, scope and structure are defined. Allowable user or reference types are examples of constants and attribute or location descriptors are fields since they include both description and structure. In some cases both types are possible. For example name types may also have a format describing the structure of information imbedded in a name of a particular type.
The USER_DEFINED_FIELDS TABLE (UDFLDS)
Table of definitions of user defined fields and constants
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
USER_FIELD_NO (FLDNO) | Unique identifier for the user defined field. NTYPE in NAMES table, ATYPE in the ATRIBUTS table and MATTR in the METHODS table, DTYPE in the LOCDES table and to the table and field specified in FTABLE and FTYPE for constant definitions | Integer | 4 |
USER_FIELD_TABLE (FTABLE) | Short Name of the ICIS data table to contain the user defined field:NAMES, ATRIBUTS or LOCDES or the table containing the constant link | Text | 24 |
USER_FIELD_TYPE (FTYPE) | Type of defined field: NAME, ATTRIBUTE, ABBREVIATION, or DESCRIPTOR or field name for definition of constants | Text | 12 |
USER_FIELD_CODE (FCODE) | Short name for the user defined field or a permissible values for a constant | Text | 50 |
USER_FIELD_NAME (FNAME) | Long name for the user defined field or descriptive value for a constant | Text | 50 |
USER_FIELD_FMT (FFMT) | Definition of sub-fields or variables contained in the user defined field. These variables can have multiple elements, and may contain text, integer or real valued data. If FFMT is blank then the user-defined field contains a single, unnamed text variable with variable length such as a name. | Text | 255 |
USER_FIELD_DESCRIPTION (FDESC) | Full description of the user defined field, which must describe all sub-fields specified in FFMT and may contain a bibliographic reference. | Text | 255 |
LOCAL_FIELD_NO(LFLDNO) | Local field number | Integer | 4 |
FIELD_USER_ID(FUID) | ID of the user defining the field | Integer | 4 |
FIELD_DATE(FDATE) | Date of field definition | Integer | 4 |
SCALE_ID(SCALEID) | Identification number of its scale. The valid values can be stored in the SCALEDIS or SCALECON table | Integer | 4 |
The KEY_SEQUENCE_REGISTER TABLE (key_sequence_register)
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
ID | DB Record Id | Integer | 4 |
KEY_PREFIX | Prefix for naming rule configuration | Text | 5000 |
LAST_USED_SEQUENCE | Last sequence used for naming rule configuration | Integer | 4 |
OPTIMISTIC_LOCK_NUMBER | Optimistic locking as ultimate mechanism to ensure that two threads never end up getting the same sequence number from registry. | Integer | 4 |
The SEARCH_REQUEST TABLE
Used by BMSAPI, DMS, Germplasm and Inventory
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
REQUEST_ID | DB record Id | Integer | 4 |
PARAMETERS | Filters for Search | Medium Text |
The EXTERNAL_REFERENCE TABLE
Stores external references specified when using BraPI POST /germplasm
Columns Short Name (Name) | Description | Type | Size (bytes) |
---|---|---|---|
ID | Primary key. DB Record id | Integer | 4 |
GID | Foreign Key to germplasm table | Integer | 4 |
REFERENCE_ID | Reference Id | Text | 2000 |
REFERENCE_SOURCE | Reference Source | Text | 255 |
CREATED_DATE | Created Date | TimeStamp | 4 |