A. Database description - ufarrell/sgp_phase2 GitHub Wiki

Navigate through database in full here: https://ufarrell.github.io/sgp_phase2/

Simplified diagram of the SGP database

Sample

Sample identifiers

Sample identifiers in SGP include

Original sample name/number (original_num) - not necessarily unique
SGP numeric identifier (sample_id) - automatically generated and unique within SGP
Universally unique identifier (UUID) - unique, generated by the database
IGSN (International Generic Sample Number) (igsn) - see https://www.geosamples.org/
Alternate number(s) (alternate_num) - alternate versions of the original sample name/number, not necessarily unique

Original sample names/numbers are not necessarily unique but often carry some useful information about the sample, e.g. it is common practice to use a combination of section name and height/depth in a section. These names do not always remain stable: they may have been deliberately renamed or accidentally mis-typed in publications, between different labs, or in different generations of a spreadsheet (see also Lehnart et al. 2000). For this reason we also store alternate versions of the original identifier (alternate_num).
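Because neither original_num nor alternate_num is guaranteed to be unique, a lookup by name can return several sample_id values. The following is a minimal sketch of such a lookup, not SGP code: the function name, the dictionary layout and the sample records are hypothetical, and lower-casing names before comparison is an assumption made for illustration.

```python
from collections import defaultdict


def build_name_index(samples):
    """Map each normalized name to the set of sample_id values that use it."""
    index = defaultdict(set)
    for sample in samples:
        # Original and alternate names are indexed together (assumed layout).
        names = [sample["original_num"], *sample.get("alternate_num", [])]
        for name in names:
            index[name.strip().lower()].add(sample["sample_id"])
    return index


# Hypothetical records: two samples share an original name,
# and a third lists a similar string as an alternate name.
samples = [
    {"sample_id": 1, "original_num": "AB-12.5", "alternate_num": ["AB 12.5"]},
    {"sample_id": 2, "original_num": "AB-12.5", "alternate_num": []},
    {"sample_id": 3, "original_num": "CD-3", "alternate_num": ["ab 12.5"]},
]
index = build_name_index(samples)
```

Only sample_id (or the UUID) identifies a single sample here; a name lookup is best treated as a candidate list to be checked by hand.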

In the geological community the importance of sample tracing and physical sample preservation is increasingly recognised, with the System for Earth Sample Registration (SESAR) leading the way. IGSNs (International Generic Sample Numbers) are also stored in SGP, in a column on the sample table, and linked to a table with all IGSN metadata - both added since Phase 1. However, it remains the case that for the majority of geochemical samples global identifiers are not yet available, especially in the case of unpublished or legacy data. Samples with IGSNs stored in SGP are primarily those from Geoscience Australia's OZCHEM database.

Sample attributes

The primary sample-level details in the database are

lithology
color
is_bioturbated (t/f)

Lithology is separated into basic rock type (e.g. shale, limestone) with the option of adding textural (e.g. silty, sandy) and compositional (e.g. calcareous, carbonaceous) descriptors. Samples can also be associated with fossils and sedimentary structures. Our lithological dictionary primarily comes from Macrostrat (https://macrostrat.org/api/defs/lithologies?all), with some minor additional entries to accommodate diverse data from different sources.

Fossils are stored verbatim at whatever level of detail the contributors provide (this could be anything from ‘shells’ to a full species level identification). Verbatim names can be linked to a dictionary with a formal taxonomic name and Paleobiology Database (PBDB) identifiers. However, given the rarity of fossil data, and the geochemical focus of this database, this formal link has not yet been made for most of our samples. The verbatim fossil information is available on the website.

Interpreted Age

Interpreted age is a numerical estimate for the age of each sample in millions of years (Ma). This age estimate is necessary for many of the research goals of the SGP consortium. Whenever possible the original authors, who are most familiar with the samples and sections, are asked to provide the interpreted age. They can use whatever method they feel most comfortable with. For example, ages may be estimated based on assumed sedimentation rates and linear interpolation, or groups of samples can be assigned one age based on proximity to any available time markers. A justification is required for each age provided, which may be used in the future to refine ages further.
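One of the approaches mentioned above, estimating an age by linear interpolation between dated horizons under a constant sedimentation rate, can be sketched as follows. This is an illustration only and not a method prescribed by SGP; the function name, tie points and heights are hypothetical.

```python
def interpolate_age(height_m, lower_tie, upper_tie):
    """Linearly interpolate an age (Ma) for a sample at height_m.

    lower_tie and upper_tie are (height_m, age_ma) pairs that bracket the
    sample; a constant sedimentation rate between them is assumed.
    """
    (h0, a0), (h1, a1) = lower_tie, upper_tie
    if h0 == h1:
        raise ValueError("tie points must be at different heights")
    if not min(h0, h1) <= height_m <= max(h0, h1):
        raise ValueError("sample height lies outside the tie points")
    fraction = (height_m - h0) / (h1 - h0)
    return a0 + fraction * (a1 - a0)


# Hypothetical section: base (0 m) dated at 540 Ma, top (100 m) at 530 Ma.
age = interpolate_age(25.0, (0.0, 540.0), (100.0, 530.0))
```

The two tie-point ages could also serve as the maximum and minimum ages for every sample between them.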

Maximum and minimum ages can also be stored, and indeed, are critical for the type of re-weighted bootstrap analyses employed by many SGP Working Groups. The formal geological age name is recorded separately (see geological context below).

Geographical context

Sites in the context of SGP are currently classified into

core
outcrop
cuttings
modern_freshwater
modern_marine

The majority of sites are single outcrop sections or cores. The same site may be sampled multiple times and therefore sites are linked to samples through the collecting event table, which stores the date of collection (if available) and is linked in turn to collectors.

This section of the database is primarily modeled on Specify6, with the addition of two tables which are relevant to the geology of the location: basin and craton_terrane. The former records details of the sedimentary basin, including basin type and name, and the latter records the craton or terrane where the site is located (e.g. Laurentia, Avalonia).

Basin types were originally based on the Richardson-Tellus classification but, after observing the types of basins submitted by Collaborative Team members, and in discussion with SGP team members with expertise in basin analysis, they were simplified to the following list:

back-arc
fore-arc
peripheral foreland
retro-arc foreland
intracratonic sag
passive margin
rift
wrench

Sites have a locality description, higher level geography (country, state/province, county) and coordinates, which are stored verbatim and as decimal latitude and decimal longitude. If coordinates are not provided then the site is georeferenced according to georeferencing best practices.
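As an illustration of keeping the verbatim coordinates alongside decimal values, the sketch below converts degrees, minutes and seconds to decimal degrees. The function and the dictionary keys are hypothetical and do not reflect the actual SGP column names.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in [0, 60)")
    decimal = abs(degrees) + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal notation.
    return -decimal if hemisphere in ("S", "W") else decimal


# Hypothetical site record: the verbatim string is kept unchanged
# next to the derived decimal values.
site = {
    "verbatim_coordinates": "45°30'00\" N, 122°15'00\" W",
    "latitude_decimal": dms_to_decimal(45, 30, 0, "N"),
    "longitude_decimal": dms_to_decimal(122, 15, 0, "W"),
}
```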

Sites are also categorized primarily into three low-grade metamorphic bins, which are roughly based on metapelite zones:

Diagenetic zone: under mature, preserved biomarkers, KI>0.42, CAI<=3, Ro<2.0, zeolite-subgreenschist facies, diagenesis to very low grade
Anchizone: over mature, no preserved biomarkers, CAI=4, Ro 2-4, subgreenschist facies, very low grade
Epizone: Ro>4, CAI=5, KI<0.25, greenschist facies, low-grade

Two additional higher-grade categories were added in Phase 2; these are used for only a small minority of samples:

Amphibolite
Catazone (added to accommodate samples from Wang et al. 2024 EPSL)
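The Ro (vitrinite reflectance) thresholds given for the three low-grade bins can be expressed as a simple classifier. This is a sketch only: the function name is hypothetical, the assignment of the exact boundary values (Ro = 2.0 and Ro = 4.0) to the Anchizone is an assumption, and in practice a bin would be chosen from several indicators (KI, CAI, facies) rather than Ro alone.

```python
def metamorphic_bin_from_ro(ro_percent):
    """Return the low-grade metamorphic bin implied by Ro (%) alone."""
    if ro_percent < 0:
        raise ValueError("Ro cannot be negative")
    if ro_percent < 2.0:
        return "Diagenetic zone"
    if ro_percent <= 4.0:  # boundary handling is an assumption
        return "Anchizone"
    return "Epizone"


bins = [metamorphic_bin_from_ro(ro) for ro in (0.8, 3.1, 5.0)]
```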

Geological context

The geological context brings together details of the lithostratigraphical unit (e.g. formation name), the geological age, the depositional environment and biostratigraphy. Geological context is related to the site via the sample. Multiple samples from a site, or closely related sites, may have the same geological context.

Geological Age

The geological age is stored verbatim, in whatever system, local or international, that the author or contributor has used. All ages are also converted to international ages, with names constrained using a dictionary based on the International Commission on Stratigraphy (ICS) International Chronostratigraphic Chart.

Lithostratigraphy

Lithostratigraphical names were imported from Macrostrat in 2015 (see Peters et al. 2018, section 2.5 Lithostratigraphic Names and Hierarchies for details). New names are added in the same format, and where possible identifiers are stored from national stratigraphic databases such as the British Geological Survey Lexicon of Named Stratigraphic Units, Canadian Weblex and Australian Stratigraphic Units database. Verbatim lithostratigraphic names are also stored, and may contain additional details, in particular in the case of USGS-NGDB samples.

Biostratigraphy

The biostratigraphical information provided by authors is recorded verbatim, and if possible linked to a table of formal biozone names.

Depositional environment

Contributors in Phase 1 were asked to classify their samples into one of the following depositional environmental bins, with 1-3 based on Canfield et al. 2008 and subsequently Sperling et al. 2015:

Inner Shelf (marine): Shale interbedded with abundant shallow-water indicators. This includes clastic beds with wave-generated sedimentary structures as well as shallow-water carbonates such as stromatolites, oolites, and rip-up conglomerates. Evidence of exposure—i.e. mudcracks, karsting, teepee structures—are often in relatively close stratigraphic proximity on the meters to 10s of meters scale.
Outer Shelf (marine): Shale from sequences that generally show little wave activity, but with occasional evidence for storm and/or wave activity, such as hummocky cross-stratified sands encased in shales. Evidence for exposure is not in close stratigraphic proximity.
Basinal (marine): Shale from successions with no evidence for any storm and/or wave activity for an appreciable (i.e. >50 m) stratigraphic distance. Generally located considerably basin-ward of shallower-water facies.
Lacustrine
Fluvial

In Phase 2, two extra categories were added to accommodate specific samples; they are very rarely used:

Estuarine
Igneous (sill)

These bins and descriptions are shale-centric - in keeping with the Phase 1 goals. In 2021 a secondary dictionary (dic_env_detail) was added to cater for the addition of more carbonate samples, although it covers siliciclastic environments as well. The dictionary is based on Macrostrat environments: see https://macrostrat.org/api/defs/environments?all. It allows for distinctions to be made between, for example, different kinds of inner shelf environments e.g. peritidal, reef, shoal. The original environmental bins remain unchanged, and are still one of the key pieces of information that collaborators are asked to contribute.

In addition to the defined dictionary terms, contributors also have the opportunity to provide any other details that may be relevant in a notes field.

Analytical data

The BGS model for analytical methods and geochemical results has been adopted almost without modification. In SGP we currently store only the raw analytical data, and do not standardize the results to any given unit (see Watson for discussion). This has advantages (faster data entry, more accurate reflection of published studies) but also makes retrieval of data more complex. On the SGP website users can access data in the original format (Analyses Search), or data that has been standardized, that is, converted to a given unit, with replicates averaged and oxides converted to elements (Simple Search and Detailed Search).
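The three standardization steps named above (unit conversion, oxide-to-element conversion, averaging of replicates) can be sketched as follows. This is not the SGP website's implementation: the function names, the tuple layout, the choice of ppm as the target unit and the small lookup tables are assumptions made for illustration. The oxide factors are derived from rounded standard atomic masses.

```python
from collections import defaultdict
from statistics import mean

# Rounded standard atomic masses (g/mol).
ATOMIC_MASS = {"Al": 26.982, "Fe": 55.845, "O": 15.999}

# Hypothetical lookup: oxide -> (element, atoms of element, atoms of oxygen).
OXIDES = {"Al2O3": ("Al", 2, 3), "Fe2O3": ("Fe", 2, 3)}

# Multipliers to convert a mass fraction to ppm (assumed target unit).
TO_PPM = {"wt%": 10_000.0, "ppm": 1.0, "ppb": 0.001}


def oxide_to_element_factor(oxide):
    """Return (element, mass fraction of that element in the oxide)."""
    element, n_element, n_oxygen = OXIDES[oxide]
    element_mass = n_element * ATOMIC_MASS[element]
    total_mass = element_mass + n_oxygen * ATOMIC_MASS["O"]
    return element, element_mass / total_mass


def standardize(results):
    """Convert (sample_id, analyte, value, unit) rows to mean element ppm."""
    grouped = defaultdict(list)
    for sample_id, analyte, value, unit in results:
        ppm = value * TO_PPM[unit]
        if analyte in OXIDES:
            analyte, factor = oxide_to_element_factor(analyte)
            ppm *= factor
        grouped[(sample_id, analyte)].append(ppm)
    # Replicates of the same sample and element are averaged.
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical raw rows: two Al2O3 replicates and one Fe measurement.
raw = [
    (1, "Al2O3", 10.0, "wt%"),
    (1, "Al2O3", 10.2, "wt%"),
    (1, "Fe", 2.0, "wt%"),
]
standardized = standardize(raw)
```

Because the raw rows are left untouched, the same stored data can be standardized to a different unit later without re-entry.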

See C. Analyses for more discussion.

Batch

Samples are grouped into batches, which are analyzed at a particular place on a particular date, or within a range of dates. Data from published papers may be grouped into batches based on published tables, or in groups of related analytes, likely measured together, even if the location of the analysis is not known. Data from external data sources is grouped based on the characteristics of the source - e.g. by original table names for USGS-NGDB, by publications for USGS-CMIBS. The same sample may be part of several batches. Commercial analytical labs usually have their own identifiers (e.g. VAN numbers for results from Bureau Veritas) which are also stored in this table.

Analysis

Each batch may include one or more kinds of analysis e.g. isotopic and trace element analysis. In detail, each analysis includes three types of information about the experimental method:

preparation method
experimental method
analytical method

The preparation method usually describes the way the rock sample was crushed to a powder for analysis. In SGP we are most concerned with the material the sample is powdered with e.g. tungsten carbide shatterbox. The experimental method provides details of any treatment applied to the sample before it is measured e.g. a three-acid digestion. The analytical method provides details of how the sample is analyzed e.g. ICP:MS. Controlled vocabularies were initially based on EarthChem lists, but additions were required in order to take into account the level of detail needed for SGP research. See C. Analyses.

Some of the external data sources (USGS-NGDB, USGS-CMIBS, OZCHEM, AGS) have methodological information. Where possible we match to existing SGP method codes. The original codes (if any) are stored in the lab_method_code column of the analysis table (also used to store codes from commercial laboratories).

In some cases, however, new method codes are added to the SGP dictionary. Additions have been made in cases where a new method was genuinely needed i.e. something not yet seen in SGP data. On the other hand, some were added for methods which are likely to be the same or similar to existing SGP codes, but could not be confidently/definitively matched, or could be matched only with some loss of information. This latter category includes cases where the external data source was less detailed (e.g. USGS-NGDB 'total digest' and 'partial digest'), but also cases where SGP was less detailed (e.g. ED-XRF for CMIBS, XRF only for SGP).

Some codes include the data source in brackets (e.g. (CMIBS method)), indicating that the method code is likely specific to, or originated from, that data source. In some cases new codes entered from an external data source have subsequently been adopted for SGP data. In the future, some codes might be merged.

Analyses may be run by and/or provided by a person. The former concept is useful locally for tracing data sheets to the person who ran the experiment. The latter is useful for assigning ownership to unpublished data. This is one example where the database is designed to deal with information at two scales: firstly, tracing results within a lab, where fine level detail is available (e.g. lab experiments producing data on a day-to-day basis in the Stanford Historical Geobiology Lab), and secondly, dealing with legacy data (e.g. a collaborator might provide a batch of unpublished data produced at a lab over a longer time period, where the specifics of the lab personnel running the experiments are no longer available).

Results

Each analysis can produce data for a suite of analytes with different detection limits. Lower and upper detection limits and the analyte measured are stored in the analyte_determination_limits table. Reference standards for isotopes are also stored in this table. Finally, the raw results are stored in the analyte_determination table, with their associated units.

Uncertainty details, such as standard error and standard deviation, are also stored in this table, if available. Additional (and less structured) information about precision and accuracy can be found in the analysis notes, on the analysis table. These are often copied directly from the methods section of a paper.

If the results are published, they are linked directly to a reference work on an individual basis so a fine level distinction can be made between published and related unpublished data from the same samples. In addition, the same data can often appear in related published papers: samples can be linked to multiple papers, whereas individual results are linked to the publication they first appeared in.

In SGP we make every effort not to include the same result twice. However, replicates may legitimately be added if the same sample has undergone analysis for the same analyte more than once.

In Phase 1 we did not assign sample identifiers to sub-samples. In Phase 2 a parent-child relationship was added to accommodate carbonate samples e.g. a brachiopod sample from within a grainstone sample. However, thus far this capability has rarely been used.

Abundance can only be empty (NULL) if the measurement is explicitly recorded as below or above detection limits.
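The rule on empty abundances can be written as a small validation check. The sketch below is illustrative: the class and field names (below_detection, above_detection) are hypothetical and are not the actual columns of the analyte_determination table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Determination:
    """One raw result for an analyte (hypothetical field names)."""
    analyte: str
    abundance: Optional[float]
    unit: str
    below_detection: bool = False
    above_detection: bool = False


def is_valid(result):
    """An empty abundance is allowed only with a detection-limit flag."""
    if result.abundance is None:
        return result.below_detection or result.above_detection
    return True


rows = [
    Determination("Mo", 12.4, "ppm"),
    Determination("Mo", None, "ppm", below_detection=True),
    Determination("Mo", None, "ppm"),  # violates the rule
]
validity = [is_valid(row) for row in rows]
```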

Proxy

A data table often includes calculated values in addition to directly measured results, for example Ca-carb, Mg-carb and Ca-carb/Mg-carb. The published calculated values are stored in the proxy table. On the search website, the calculations are made from the components first, and the published value is only presented if the components are not available.
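The preference for component-based calculation over published proxy values can be sketched for a simple ratio. This is not the website's code: the function name and return labels are hypothetical, and treating a zero denominator as "components not available" is an assumption.

```python
def ratio_for_display(numerator, denominator, published=None):
    """Prefer a ratio calculated from components; else use the published value."""
    if numerator is not None and denominator not in (None, 0):
        return numerator / denominator, "calculated"
    if published is not None:
        return published, "published"
    return None, "unavailable"


# Hypothetical values for a Ca-carb/Mg-carb style ratio.
from_components = ratio_for_display(30.0, 10.0, published=2.9)
from_published = ratio_for_display(None, 10.0, published=2.9)
missing = ratio_for_display(None, None)
```

Returning the source label with the value lets a user see whether a number was recalculated or taken directly from the publication.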

Projects

Samples can be grouped into projects, and projects can be associated with zero, one or more publications. Projects can be searched on the SGP website, and provide one way to access related data, in particular any data associated with a given publication, whether original to that publication or linked to the samples from that publication.