
ESGF Publisher Internals

Introduction

The purpose of the ESGF publisher is twofold:

  • Make data accessible to ESGF users
  • Make associated metadata searchable by ESGF users.

The publisher interfaces with three of the main ESGF components:

  • Postgres database
    • Store all metadata
  • THREDDS Data Server (TDS)
    • Define metadata in a standard, web-accessible form: TDS catalogs
    • Make data accessible
  • SOLR index
    • Make metadata searchable

The publisher is implemented in Python, and is organized and distributed in the esgcet package.

Data Model

[Figure: dataset_illus.png]

Also reference the publisher [database schema](http://www-pcmdi.llnl.gov/esgcet/esgcet_schema.png).

The central data objects in the publisher are:

  • Datasets and datasetVersions
    • Conceptually a datasetVersion is a collection of data stored in one or more physical data files (called fileVersions here). Typically the files are in a self-describing data format such as netCDF.
    • As the above figure illustrates, a datasetVersion may be thought of in either of two equivalent ways:
      • as a collection of fileVersions, or
      • as a collection of variables.
    • When one or more files in a datasetVersion are modified, the publisher creates a new datasetVersion with a version number larger than the previous one. The sequence of related datasetVersion objects is called a dataset (see the example following this list).
    • A datasetVersion is the unit of publication in ESGF, meaning that the typical action of the publisher is to publish one or more datasetVersions.
    • We often use the term dataset instead of datasetVersion when there is no risk of confusion.
    • A dataset is associated with exactly one project, model, and experiment.
  • Variables and fileVariables
    • A variable is a multidimensional array, with named dimensions and associated attributes, for example: sea_level_pressure(lat, lon, time). A variable may span multiple files, as illustrated above.
    • A fileVariable is the portion of a variable in a single fileVersion.
    • Part of the publisher workflow is to combine fileVariables from related files into variables. This process is called aggregation. In the THREDDS context, a variable is called an aggregation. Aggregation is done on a single dimension, typically time. Variables are aggregated only for the last datasetVersion (highest version number).
  • Files and fileVersions
    • A fileVersion is a single physical file, containing a collection of multidimensional fileVariables.
    • If a fileVersion is replaced by publishing a different file of the same base name, a new fileVersion is created. The sequence of fileVersions is called a file. All fileVersions associated with a file have the same name, but may exist in different directories.
    • A given file belongs to one and only one dataset.
  • Project
    • A project is a collection of datasets, together with associated metadata. For example, 'CMIP5'.
  • Models
    • A model is a computer program that generates data. A model is associated with exactly one project. For example, 'HadGEM2-AO'.
  • Experiments
    • An experiment is the configuration (boundary conditions, input, etc.) of a model that generates a dataset. For example, 'historical'.
    • An experiment is associated with exactly one project.
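
As a concrete illustration of the versioning rules above, consider a hypothetical dataset (all names are invented for illustration) in which one of two files is republished:

dataset: example-project.EXAMPLE-MODEL.historical.r1
    datasetVersion 1
        fileVersion 1 of psl_EXAMPLE.nc
        fileVersion 1 of tas_EXAMPLE.nc
    datasetVersion 2        (created when psl_EXAMPLE.nc was replaced)
        fileVersion 2 of psl_EXAMPLE.nc
        fileVersion 1 of tas_EXAMPLE.nc

Here psl_EXAMPLE.nc and tas_EXAMPLE.nc are the two files of the dataset. Republishing psl_EXAMPLE.nc adds a second fileVersion of that file and triggers creation of datasetVersion 2, which becomes the latest version; only that latest version has its fileVariables aggregated into variables.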

Code Layout

The esgcet publisher package contains the following subpackages:

  • config: Defines the data handlers that are used to extract metadata, and configuration functions for reading the esg.ini configuration file.
  • model: Defines the data objects, and their mappings to the SQL database.
  • publish: Functionality for extracting metadata, generating THREDDS catalogs, and publishing metadata.
  • query: Metadata querying for the local SQL database and remote SOLR indexes.
  • schema_migration: Definition of SQL database tables; versioned control of database schemas.
  • ui: GUI interface.

See the autogenerated [esgcet package documentation](http://www-pcmdi.llnl.gov/esgcet/doc).

Workflow

[Figure: publisher_workflow.png]

The above diagram shows the workflow when a dataset is published. Unpublishing essentially follows the reverse direction.

  • Scan data and save to SQL database:
    • The function esgcet.publish.extract.extractFromDataset scans the files in the dataset and populates the associated SQL database tables for:
      • datasets, datasetVersions
      • files, fileVersions
      • fileVariables (partially)
  • Aggregate fileVariables into variables:
    • The function esgcet.publish.extract.aggregateVariables creates variables, and populates the SQL database for:
      • fileVariables
      • variables
  • Generate THREDDS catalogs
    • The function esgcet.publish.thredds.generateThredds writes THREDDS catalogs.
    • updateThreddsMasterCatalog and updateThreddsRootCatalog write the master and root THREDDS catalogs.
    • reinitializeThredds reinitializes the TDS, making the catalog visible.
  • Publish THREDDS catalogs to the SOLR index.
    • The function esgcet.publish.publish.publishDataset contacts the index node through either a RESTful or a Hessian API. The index node harvests the THREDDS catalog and makes the dataset metadata searchable.
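
As a rough sketch of how these steps fit together (the argument lists and the module locations of the catalog helpers are assumptions for illustration, not the actual signatures; the real publisher driver also passes configuration, database session, and project handler objects):

# Hypothetical driver sketch; argument lists are simplified assumptions, not the real signatures.
# datasetName, fileIterator, Session, handler, catalogFile, and service stand for objects
# that the real driver sets up from the configuration.
from esgcet.publish.extract import extractFromDataset, aggregateVariables
from esgcet.publish.thredds import (generateThredds, updateThreddsMasterCatalog,
                                    updateThreddsRootCatalog, reinitializeThredds)  # locations of the last three assumed
from esgcet.publish.publish import publishDataset

# 1. Scan the files and populate the dataset, file, and fileVariable tables.
dset = extractFromDataset(datasetName, fileIterator, Session, handler)

# 2. Combine fileVariables into variables (aggregation).
aggregateVariables(datasetName, Session)

# 3. Write the per-dataset, master, and root THREDDS catalogs, then reinitialize the TDS.
generateThredds(datasetName, Session, catalogFile, handler)
updateThreddsMasterCatalog(Session)
updateThreddsRootCatalog()
reinitializeThredds()

# 4. Ask the index node to harvest the catalog (RESTful or Hessian API).
publishDataset(datasetName, Session, service)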

Database Interface

The database interface is managed by the SQLAlchemy object-relational system. Persistent data classes are defined and mapped to SQL data tables. Setting and referencing data instances generates corresponding database reads and writes. The O-R model dramatically simplifies coding by largely removing the need for explicit SQL calls.

All esgcet persistent data objects and SQL tables are defined in package esgcet.model.

For example, the Dataset class is defined by:

class Dataset(object):

    def __init__(self, name, project, model, experiment, run_name, calendar=None, aggdim_name=None, aggdim_units=None, status_id=None, offline=False, masterGateway=None):
        self.name = name
        self.project = project
        self.model = model
        self.experiment = experiment
        self.run_name = run_name
        self.calendar = calendar
        self.aggdim_name = aggdim_name
        self.aggdim_units = aggdim_units
        self.status_id = status_id
        self.offline = offline
        self.master_gateway = masterGateway

The corresponding database table is defined by:

datasetTable = Table('dataset', metadata,
                     Column('id', types.Integer, primary_key=True, autoincrement=True),
                     Column('name', types.String(255), unique=True),
                     Column('project', types.String(64)),
                     Column('model', types.String(64)),
                     Column('experiment', types.String(64)),
                     Column('run_name', types.String(64)),
                     Column('calendar', types.String(32)),
                     Column('aggdim_name', types.String(64)),
                     Column('aggdim_units', types.String(64)),
                     Column('status_id', types.String(64)),
                     Column('offline', types.Boolean),
                     Column('master_gateway', types.String(64)),
                     ForeignKeyConstraint(['model', 'project'], ['model.name', 'model.project']),
                     ForeignKeyConstraint(['experiment', 'project'], ['experiment.name', 'experiment.project']),
                     mysql_engine='InnoDB',
                     )

and the relationship between the class and table is defined by:

mapper(Dataset, datasetTable, properties={...})

Subsequent creation of Dataset instances results in records being created in Postgres. For example, in esgcet.publish.extract.extractFromDataset, the following creates a Dataset instance that is persisted in the esgcet dataset table:

        dset = Dataset(datasetName, context.get('project', None), context.get('model', None), context.get('experiment', None), context.get('run_name', None), offline=offline, masterGateway=masterGateway)
        session.add(dset)

Subsequent changes to dset will be persisted when the database session is committed.
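
The same mapping works in the other direction: queries against the mapped classes read from Postgres. A minimal sketch, assuming an open SQLAlchemy session bound to the esgcet database (the import path of Dataset within the model package is assumed):

from esgcet.model import Dataset   # import path assumed

# Look up an existing dataset by name; SQLAlchemy issues the SELECT behind the scenes.
dset = session.query(Dataset).filter_by(name=datasetName).first()

# Attribute changes on a mapped instance are written back when the session is committed.
dset.calendar = 'gregorian'
session.commit()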

Internal APIs

Cdunif: data file I/O

Cdunif is a part of CDMS, but is a simpler, lower-level interface than cdms2. Cdunif is available to handlers for file I/O. As part of cdms2, Cdunif provides access to the formats that CDMS supports, including netCDF and GrADS/GRIB.

The main calls are:

  • from cdms2 import Cdunif: Import the Cdunif module.
  • f = Cdunif.CdunifFile(path, mode): Create or open a file. mode is 'r' (read), 'r+' (read/write), 'a' (read/write, possibly existing), or 'w' (new).
  • dir(f): List the global attributes.
  • f.someatt: Read the global attribute 'someatt'.
  • f.someatt = 'New value': Write a global attribute.
  • f.dimensions: Dictionary of dimension lengths.
  • f.variables: Dictionary of file variables.
  • d = f.createDimension(name, len): Create a dimension. 'name' is the dimension name; 'len' is the length, or None for an unlimited dimension.
  • v = f.createVariable(name, type, dimensions): Create a new variable in the file. 'name' is the variable name, 'type' is a string type identifier, and 'dimensions' is a tuple of dimensions created by createDimension.
  • f.sync(): Sync modifications to the file.
  • f.close(): Close the file.
  • v = f.variables['name']: Get a variable object.
  • dir(v): List the variable's attributes.
  • v.someatt: Read the variable attribute 'someatt'.
  • v.someatt = 'New value': Write a variable attribute.
  • ar = v[i:j, k:l]: Read variable data.
  • v[i:j, k:l] = ar: Write variable data.
  • ar = v.getValue(): Read the entire variable data array.
  • v.assignValue(ar): Write the entire variable data array.
  • v.shape: Variable dimension lengths.
  • v.typecode(): Variable type, as a string.
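
Putting these calls together, a short read-only example (the file name and variable name are hypothetical):

from cdms2 import Cdunif

# Hypothetical file and variable names, for illustration only.
f = Cdunif.CdunifFile('psl_example.nc', 'r')
print(f.variables.keys())                # names of the file variables
if hasattr(f, 'Conventions'):
    print(f.Conventions)                 # read a global attribute

v = f.variables['psl']
print(v.shape)                           # dimension lengths
print(v.typecode())                      # type, as a string
data = v[0:1]                            # read the first slice along the leading dimension
f.close()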

FormatHandler Interface

The FormatHandler interface (esgcet.config.format.FormatHandler) is the publisher's internal I/O interface. All format handlers must inherit from FormatHandler and implement these methods:

  • open: Open a file.
  • getFormatDescription: Get a string description of the format.
  • close: Close a file.
  • getAttribute: Get a global or variable attribute.
  • getVariable: Get variable data (only used for coordinate variables).
  • hasAttribute: Inquire if a global or variable attribute exists.
  • hasVariable: Inquire if a variable exists.
  • inquireAttributeList: Get a list of attributes.
  • inquireVariableDimensions: Get a list of variable dimensions.
  • inquireVariableList: Get a list of variables.
  • inquireVariableShape: Get the shape of a variable, as a tuple.

The only built-in format handler is the CdunifFormatHandler (esgcet.config.netcdf_handler.CdunifFormatHandler). This handler calls the Cdunif interface to perform file I/O.
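
A new format handler would subclass FormatHandler and supply these methods. The skeleton below is only a sketch: the method signatures are assumptions inferred from the list above, not the actual abstract declarations in esgcet.config.format.

from esgcet.config.format import FormatHandler

class MyFormatHandler(FormatHandler):
    # Signatures are assumed for illustration; consult FormatHandler for the real ones.

    @staticmethod
    def getFormatDescription():
        return 'MYFORMAT'

    @staticmethod
    def open(path, mode='r'):
        # Open 'path' and return a MyFormatHandler instance wrapping the open file object.
        raise NotImplementedError

    def close(self):
        raise NotImplementedError

    def inquireVariableList(self):
        # Return the list of variable names in the open file.
        raise NotImplementedError

    def inquireVariableShape(self, variableName):
        # Return the shape of 'variableName' as a tuple.
        raise NotImplementedError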

MetadataHandler Interface

The MetadataHandler interface (esgcet.config.metadata.MetadataHandler) is the internal interface to metadata conventions and time value logic. The only implementation is CFHandler, which implements the CF metadata convention, and uses the cdtime module in CDMS for time and calendar functions.
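
For example, cdtime converts between relative times (a value plus units) and component times (calendar dates) under a specified calendar. A minimal illustration, not taken from CFHandler itself:

import cdtime

# A CF-style relative time: 45 days since 2000-01-01.
r = cdtime.reltime(45, 'days since 2000-1-1')
print(r.tocomp())                      # component time in the default (mixed) calendar
print(r.tocomp(cdtime.Calendar360))    # the same value interpreted with a 360-day calendar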

Project Handlers

Project handlers encapsulate the logic for obtaining metadata associated with a particular project. The API is defined in the ProjectHandler class (esgcet.config.project.ProjectHandler). For example, the CMIP5 project handler (esgcet.config.ipcc5_handler.IPCC5Handler) deals with CMOR tables, DRS, determination of the product field, and so on. The built-in project handlers are:

  • BasicHandler: A basic handler for generic projects.
  • IPCC5Handler: The CMIP5 project handler.
  • IPCC4Handler: The CMIP3 project handler.

Project handlers encapsulate an important data structure, the project context. The context is a dictionary of name-value pairs that contains the file metadata discovered at any given point in processing. The metadata may be obtained from:

  • command line
  • directory names
  • file global attributes
  • dataset IDs specified in a mapfile
  • project-specific internally generated metadata, such as the CMIP5 product field.

All project handlers inherit from the abstract base class ProjectHandler. The most important methods that project handlers must implement are:

  • readContext: populate the project context, typically by reading metadata from a file
  • getContext: return the project context

For example, the BasicHandler implementation reads standard netCDF header information:

    def getContext(self, context):
        ProjectHandler.getContext(self, context)
        if self.context.get('creation_time', '')=='':
            self.context['creation_time'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        if self.context.get('format', '')=='':
            self.context['format'] = self.formatHandlerClass.getFormatDescription()
            conventions = self.context.get('Conventions')
            if conventions is not None:
                self.context['format'] += ', %s'%conventions
        return self.context

    def readContext(self, cdfile):
        """Get a dictionary of key/value pairs from an open file.
           cdfile is an instance of CdunifFormatHandler.
           cdfile.file is a Cdunif file.
           This handler sets basic file descriptive metadata.
           """
        f = cdfile.file
        result = {}
        if hasattr(f, 'title'):
            result['title'] = f.title
        if hasattr(f, 'Conventions'):
            result['Conventions'] = f.Conventions
        if hasattr(f, 'source'):
            result['source'] = f.source
        if hasattr(f, 'history'):
            result['history'] = f.history
        return result
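
A project-specific handler typically subclasses one of the built-in handlers and overrides readContext to capture additional metadata. The sketch below is hypothetical: the handler name and the 'institute_id' attribute are invented, and the import path for BasicHandler is an assumption.

from esgcet.config.netcdf_handler import BasicHandler   # import path assumed

class MyProjectHandler(BasicHandler):
    """Hypothetical handler for a project whose files carry an 'institute_id' global attribute."""

    def readContext(self, cdfile):
        result = BasicHandler.readContext(self, cdfile)
        f = cdfile.file
        if hasattr(f, 'institute_id'):
            result['institute'] = f.institute_id
        return result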