3. USGS‐CMIBS - ufarrell/sgp_phase2 GitHub Wiki

USGS-CMIBS includes predominantly Phanerozoic shale data from all continents. Detailed metadata can be found here: https://www.sciencebase.gov/catalog/item/59de7494e4b05fe04ccd3a47

USGS-CMIBS includes 12276 samples with 384000 results from 2295 sites.

Geography

The samples are from 44 countries, with 42% from Canada, 24% from the United States and 14% from Australia.

Lithology

The majority of samples are fine-grained siliciclastic sediments (64% shale, mudstone, siltstone or argillite).

Age

48% of samples are from the Palaeozoic, 20% from the Mesozoic, 5% from the Cenozoic and 21% are Precambrian. The figure below is based on interpreted age in Ma, and does not represent all samples - see Completeness and Data Collection/Processing below for more details.

Data

Data was entered in batches based on the PUBL_ID (i.e. grouped by publication). A summary of the batches (sample counts, result counts and analyte lists) is available here. The categories below are based on those used on our search website (http://sgp-search.io/).

Completeness

Data Collection/Processing

USGS-CMIBS overlapped considerably with USGS-NGDB, so as a first step we removed all USGS-NGDB samples (those with the PUBL_ID ‘NGDB_2013’ or ‘NGDB_2014’).
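
As a minimal sketch of this first filter, assuming the CMIBS export has been read into a list of records that each carry a `PUBL_ID` field (the records and `CMIBS_ID` values below are made up for illustration, and `drop_ngdb_samples` is a hypothetical helper, not part of any SGP tooling):

```python
# Publication IDs that mark samples duplicated from USGS-NGDB (taken from the text above).
NGDB_PUBL_IDS = {"NGDB_2013", "NGDB_2014"}


def drop_ngdb_samples(samples):
    """Return only the samples whose PUBL_ID is not one of the NGDB batches."""
    return [s for s in samples if s.get("PUBL_ID") not in NGDB_PUBL_IDS]


# Hypothetical records; the CMIBS_ID values are placeholders, not real samples.
samples = [
    {"CMIBS_ID": "A1", "PUBL_ID": "NGDB_2013"},
    {"CMIBS_ID": "A2", "PUBL_ID": "Algeo_2004"},
    {"CMIBS_ID": "A3", "PUBL_ID": "NGDB_2014"},
]
kept = drop_ngdb_samples(samples)
print([s["CMIBS_ID"] for s in kept])
```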

Second, samples were excluded if the sample type included lithologies indicative of ore, or if the title of the study indicated that the authors were primarily concerned with mineralized deposits, ore deposits, or the effect of metamorphism on shales (e.g. the effect of sill emplacement on shales). A comparison between the remaining verbatim lithologies and the SGP-matched terms for CMIBS samples can be seen here.

Interpreted age

CMIBS samples have been associated with Macrostrat continuous-time age models where possible.

CMIBS samples that did not match with Macrostrat were given age information by SGP team members (Erik Sperling, Judi Sclafani, Paul Hoffman, Swapan Sahoo) where possible (5995 samples). Note that in some cases this was done without direct knowledge of the formations or the studies, although ages were only coded where we were reasonably confident in the assignment. The justification and logic for the age call was recorded in every case (‘interpreted age justification’), so a user can see exactly why a sample was given a specific age.

3109 samples remain without ages. As absolute age information is the currency necessary to conduct analyses of geochemical trends through time, filling in these ages remains a clear target for the next phase.

Phase2 Updates

In Phase 2, CMIBS samples with interpreted ages from Macrostrat were updated by Daven Quinn, using the following process:

  • Samples are linked to a Macrostrat stratigraphic column footprint that contains them
  • The search window is expanded to adjacent columns, recognizing that Macrostrat's notion of "column footprint" is fuzzy
  • Priority is given to units within the column directly underlying the sample
  • Adjacent column matching can be turned off with a flag
  • The units within the matched column(s) are used to establish a semantic window for linking
  • All stratigraphic names are extracted, and Macrostrat's lexicon is traversed to extract parent and child units (e.g., members and groups) established in Macrostrat's lexicon
  • Concepts and synonyms encompassing synonymous stratigraphic names are also linked
  • Strat_name_footprints, which are computed from map units as well as stratigraphic column units, are also used to match. These are taken as a fallback as ages are usually less well-established
  • Matches are attempted first by exact matching and then by substring matching
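
The matching order described in the list above (containing column first, optional expansion to adjacent columns, exact matching before substring matching) can be sketched roughly as follows. This is an illustration only and not the code used for Phase 2: the function names, the `use_adjacent` flag name and the unit names are all hypothetical, and the lexicon traversal and strat_name_footprints fallback are left out.

```python
def normalize(name):
    """Lower-case and collapse whitespace so comparisons ignore formatting."""
    return " ".join(name.lower().split())


def match_unit(verbatim_name, candidate_names):
    """Try exact matching first, then substring matching.

    Returns (matched_name, method), or (None, None) when nothing matches.
    """
    target = normalize(verbatim_name)
    if not target:
        return None, None
    # Pass 1: exact match on the normalized name.
    for cand in candidate_names:
        if normalize(cand) == target:
            return cand, "exact"
    # Pass 2: substring match in either direction.
    for cand in candidate_names:
        c = normalize(cand)
        if c and (c in target or target in c):
            return cand, "substring"
    return None, None


def link_sample(verbatim_name, containing_units, adjacent_units, use_adjacent=True):
    """Search the containing column first, then (optionally) adjacent columns."""
    name, method = match_unit(verbatim_name, containing_units)
    if name is not None:
        return name, method, "containing"
    if use_adjacent:
        name, method = match_unit(verbatim_name, adjacent_units)
        if name is not None:
            return name, method, "adjacent"
    return None, None, None


# Made-up unit names standing in for the units of the matched columns.
containing = ["Alpha Shale", "Beta Group"]
adjacent = ["Gamma Shale"]
print(link_sample("alpha  shale", containing, adjacent))
print(link_sample("Gamma", containing, adjacent))
print(link_sample("Gamma", containing, adjacent, use_adjacent=False))
```

Searching the containing column before the adjacent ones mirrors the stated priority for units directly underlying the sample, and `use_adjacent=False` plays the role of the flag that turns adjacent-column matching off.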

In keeping with CARE (Collective benefit, Authority to control, Responsibility, and Ethics) principles we have removed 215 CMIBS samples from 16 sites with 4059 results, where the decimal latitude and longitude intersect with Native-held land (identified using public TIGER/Line shapefiles provided by the U.S. Census Bureau - accessed through QGIS via https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/AIANNHA/MapServer).
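
The screening itself was done in QGIS. Purely to illustrate the kind of point-in-polygon test involved, here is a plain-Python ray-casting sketch. It treats coordinates as planar, ignores holes and multi-part polygons, and uses a made-up rectangle and made-up sites rather than any real boundary; `point_in_polygon` and `split_sites` are hypothetical helpers.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices; the ring is closed implicitly.
    Points lying exactly on an edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def split_sites(sites, polygons):
    """Separate sites into (kept, removed) by intersection with any polygon."""
    kept, removed = [], []
    for site in sites:
        hit = any(point_in_polygon(site["lon"], site["lat"], p) for p in polygons)
        (removed if hit else kept).append(site)
    return kept, removed


# Made-up rectangle and sites, for illustration only.
area = [(-110.0, 35.0), (-108.0, 35.0), (-108.0, 37.0), (-110.0, 37.0)]
sites = [
    {"site_id": 1, "lon": -109.0, "lat": 36.0},
    {"site_id": 2, "lon": -107.0, "lat": 36.0},
]
kept, removed = split_sites(sites, [area])
print([s["site_id"] for s in kept], [s["site_id"] for s in removed])
```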

Full references have been added (based on CMIBS PUBL_ID), and samples have been grouped into separate projects accordingly. Project names include the publication name, e.g. Abre et al. 2011 (CMIBS). See reference list.

Sample notes have been updated to include both CMIBS PUBL_ID and DATA_SOURCE.

CMIBS samples (and associated geological context and data) from PUBL_ID Algeo_2004 were replaced with an SGP version coded directly by Dr. Thomas Algeo. This included 306 samples and their associated sites (listed below).

| site_id | section_name | site_type | country | site_desc |
| 15922 | CMIBS-1659 | core | United States | KGS Orville Edmonds No. 1A study core, eastern KS |
| 15924 | CMIBS-1660 | core | United States | Edmonds study core, eastern KS |
| 15925 | CMIBS-1661 | core | United States | Ermal study core, eastern KS |
| 15926 | CMIBS-1662 | core | United States | Heilman study core, eastern KS |
| 15927 | CMIBS-1663 | core | United States | Mitchellson study core, eastern KS |
| 15928 | CMIBS-1664 | core | United States | Womelsdorf study core, eastern KS |

Data Entry - CMIBS vs SGP

Each sample in CMIBS has a ‘best value’ result per analyte, where multiple values were originally available (Granitto et al. 2017). The choice of ‘best value’ was made using a rubric which included consideration of the sample weight, the sample ‘decomposition’ (e.g. full vs. partial acid digestion), the instruments used in the analysis and the detection limits (Granitto et al. 2013). The values are included as both elemental data and as oxides, where relevant (e.g. Al and Al2O3). Both were incorporated into the database. This is in contrast to SGP data, which only includes the original measured value.
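
Because CMIBS carries both the elemental and the oxide form of some analytes, it can be useful to see how the two relate. The sketch below uses ordinary stoichiometry (rounded standard atomic masses) to convert Al2O3 to Al. It is not taken from the CMIBS documentation, the 15 wt% input is a made-up value, and `oxide_to_element_factor` is a hypothetical helper.

```python
# Standard atomic masses in g/mol, rounded.
ATOMIC_MASS = {"Al": 26.9815, "O": 15.999}


def oxide_to_element_factor(element, n_element, n_oxygen):
    """Mass fraction of `element` in an oxide with the given stoichiometry."""
    m_element = n_element * ATOMIC_MASS[element]
    m_oxide = m_element + n_oxygen * ATOMIC_MASS["O"]
    return m_element / m_oxide


factor = oxide_to_element_factor("Al", 2, 3)  # Al2O3 -> Al
al2o3_wt_pct = 15.0                           # hypothetical oxide value in wt%
al_wt_pct = al2o3_wt_pct * factor             # equivalent elemental Al in wt%
print(round(factor, 4), round(al_wt_pct, 2))
```

A check of this kind can help spot rows where the elemental and oxide values for the same sample are not the same measurement expressed twice.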

An effort was made to match most USGS CMIBS columns to SGP columns (see table below), but in some cases compromises were required, e.g. concatenating data into one SGP column. In most cases where a column was omitted, it did not contain any values (all NULL). Where information was particularly important (e.g. stratigraphic names), the data was cleaned so that it could be matched to the existing dictionaries, although the verbatim value was also included.

| CMIBS column | CMIBS column description | SGP table_name.column_name(s) | Notes |
| ADDL_ATTR | Additional attributes used to modify PRIMARY_CLASS, SECONDARY_CLASS, or SPECIFIC_NAME; derived from sample codes in fields of original databases that do not have equivalent fields in the NGDB. | NOT IMPORTED | All NULL for included samples |
| ALTERATION | An indication of the presence or type of alteration noted in the sample by the submitter. | NOT IMPORTED | All NULL for included samples |
| COORDINATES_COMMENT | Comment regarding precision of geospatial coordinates. | site.notes | Included in site_notes along with coordinate_qual e.g. |
| COUNTRY | Country or marine body of water from where the sample was collected. | site.country | |
| DATA_SOURCE | Identifier for other source of data; database, publication, individual. | sample.sample_notes | Included in sample_notes along with PUBL_ID |
| DATE_SUBMITTED | Date sample was submitted to Sample Control for initial database processing prior to sample prep and analysis; estimated for non-USGS samples. | NOT IMPORTED | |
| DATUM | Reference datum, when recorded, for the latitude and longitude coordinates of the sample site. | site.datum_original | |
| DEPTH | Depth from the surface at which the sample was collected; units are specified by the submitter. | sample.height_depth_m, sample.min_depth, sample.max_depth, sample.sample_notes | Values converted to meters. Ranges of depths are added to max and min depth. Verbatim stored in sample_notes. |
| FIELD_ID | Field identifier assigned by the sample collector of sample submitted for analysis, possibly corrected by data renovator due to truncation of data entry. | alternate_num.alternate_num | |
| GEOLOGIC_AGE | Age or range of ages from the Geological Time Scale for the collected sample. | geol_age.verbatim_age | |
| JOB_ID | Laboratory batch identifier assigned by the Sample Control Officer of the analytical laboratory that received the samples as a batch. | NOT IMPORTED | All NULL for included samples |
| LAB_ID | Unique identifier assigned to each submitted sample by the Sample Control Officer of the analytical laboratory that received the sample. | alternate_num.alternate_num | |
| LATITUDE | Latitude coordinate of sample site, reported in decimal degrees; see metadata for further datum and spheroid information. | site.lat_original | |
| LOCATE_DESC | Geographic information relating to the location of the sample site. | site.site_desc | |
| LONGITUDE | Longitude coordinate of sample site, reported in decimal degrees; there are sites on both sides of the International Date Line; see metadata for further datum and spheroid information. | site.long_original | |
| METALLOGENY | Metallogeny of stratigraphic unit; CAMIRO database field; exact description not provided. | NOT IMPORTED | |
| METHOD_COLLECTED | Sample collection method: Single grab, composite, or channel. | NOT IMPORTED | |
| MINERALIZATION | An indication of mineralization or mineralization types as provided by the sample submitter. | NOT IMPORTED | All NULL for included samples |
| PREVIOUS_JOB_ID | Original NGDB batch number (JOB_ID) of a USGS resubmitted sample that has been given a new batch number upon resubmittal for further analysis. | NOT IMPORTED | All NULL for included samples |
| PREVIOUS_LAB_ID | Original NGDB LAB_ID of a USGS resubmitted sample that has been given a new lab number upon resubmittal for further analysis. | NOT IMPORTED | All NULL for included samples |
| PRIMARY_CLASS | Primary classification of sample media. | NOT IMPORTED | |
| PROJECT_NAME | Project name, at times derived from a project account number, of work group funded for the collection and analysis of submitted samples. | NOT IMPORTED | |
| QUAD | Name of 1:250,000-scale quadrangle (1°x2° or 1°x3°) in which sample was collected. | NOT IMPORTED | All NULL for included samples |
| REGIONAL_GEOLOGY | Regional geologic setting of stratigraphic unit; CAMIRO database field; exact description not provided. | basin.basin_name, craton_terrane.ct_name | Used to populate basin_name and ct_name where unambiguous names were available (e.g. "Sedimentary basin; Appalachian Basin"). |
| SAMPLE_COMMENT | Attribute used to modify PRIMARY_CLASS, SECONDARY_CLASS, or SPECIFIC_NAME; data is not derived from sample codes. | sample.lith_notes | |
| SAMPLE_SOURCE | Physical setting or environment from which the sample was collected. | site.site_type | |
| SECONDARY_CLASS | Secondary classification or subclass of sample media; attribute of PRIMARY_CLASS. | SEE NOTES | Higher level rock type e.g. sedimentary, metamorphic. Not imported, but available in dic_lithology, with the lithology type. |
| SOURCE_TERRAIN | Tectonic terrain as source of deposition for stratigraphic unit; CAMIRO database field; exact description not provided. | basin.basin_notes | |
| SPECIFIC_NAME | Specific name for the sample media collected; attribute of PRIMARY_CLASS and/or SECONDARY_CLASS. | SEE NOTES | Used to populate lithology fields - lithology type, lithology composition, lithology texture. Specific_name entered as verbatim_lith (Added after data freeze, will be available in Phase 2). |
| SPHEROID | Reference spheroid or ellipsoid, when recorded, for the latitude and longitude coordinates of the sample site. | NOT IMPORTED | All NULL for included samples |
| STATE_PROVINCE | Abbreviation of state from where the sample was collected. | site.state_province | State abbreviations translated to full state name (e.g. CO = Colorado) |
| STRAT_GRP | Summary field of formation, group or supergroup for entries in STRATIGRAPHY, used for query analysis. | SEE NOTES | Used in combination with strat_sort, and stratigraphy to match to dic_lithostrat values, or to create new dic_lithostrat values where necessary. |
| STRAT_SORT | Summary field for entries in STRATIGRAPHY, used for query analysis. | SEE NOTES | Used in combination with strat_sort, and stratigraphy to match to dic_lithostrat values, or to create new dic_lithostrat values where necessary. |
| STRATIGRAPHY | Name of the stratigraphic unit from which the sample was collected. When present, values are as given by the sample submitter and may represent either a formal name, an informal name, or geologic map unit abbreviation. | lithostrat.verbatim_strat | Used in combination with strat_sort, and stratigraphy to match to dic_lithostrat values, or to create new dic_lithostrat values where necessary. |
| SUBMITTER | Name of the individual(s) who submitted the sample in a batch to the laboratory for analysis; not necessarily the sample collector. | NOT IMPORTED | |
| TECTONIC_SETTING | Tectonic setting for deposition of stratigraphic unit; CAMIRO database field; exact description not provided. | basin.basin_type | Information relevant to basin_type was extracted; tectonic_setting values that had a question mark, or were otherwise unclear, were not used. |
| CMIBS_ID | Unique identifier assigned to each sample entered in the CMIBS geochemical database. | sample.original_num | |
| PUBL_ID | Identifier for publication that is the source of data, or where data is cited; link to CMIBS_Geol table. | sample.sample_notes | Included in sample_notes along with DATA_SOURCE, and used to populate full references in reference_work table. |
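
The DEPTH row in the table above describes converting submitter-specified units to meters, splitting ranges into minimum and maximum depth, and keeping the verbatim text. A minimal sketch of that kind of parsing is shown below. The input formats ("N unit" and "N-M unit"), the unit spellings, and the choice to leave `height_depth_m` empty for ranges are all assumptions made for illustration; the actual CMIBS values and import rules may differ, and `parse_depth` is a hypothetical helper.

```python
import re

# Conversion factors to meters; the accepted unit spellings are assumptions.
TO_METERS = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}

# Matches "N unit" or "N-M unit", e.g. "10 ft" or "100-200 ft".
DEPTH_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(?:-\s*(\d+(?:\.\d+)?))?\s*([a-z]+)\s*$",
    re.IGNORECASE,
)


def parse_depth(verbatim):
    """Parse a verbatim depth string into depths in meters.

    Returns a dict with height_depth_m, min_depth, max_depth and the
    verbatim string, or None when the text does not fit the pattern.
    """
    m = DEPTH_RE.match(verbatim)
    if not m or m.group(3).lower() not in TO_METERS:
        return None
    k = TO_METERS[m.group(3).lower()]
    first = float(m.group(1)) * k
    if m.group(2) is None:
        # Single value: stored as the sample depth.
        return {"height_depth_m": first, "min_depth": None,
                "max_depth": None, "verbatim": verbatim}
    # Range: stored as min/max, ordered so that min <= max.
    low, high = sorted((first, float(m.group(2)) * k))
    return {"height_depth_m": None, "min_depth": low,
            "max_depth": high, "verbatim": verbatim}


print(parse_depth("10 ft"))
print(parse_depth("100-200 ft"))
print(parse_depth("unknown"))
```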