ESGF Publisher Internals
Introduction
The purpose of the ESGF publisher is twofold:
- Make data accessible to ESGF users
- Make associated metadata searchable by ESGF users.
The publisher interfaces with three of the main ESGF components:
- Postgres database
- Store all metadata
- THREDDS Data Server (TDS)
- Define metadata in a standard, web-accessible form: TDS catalogs
- Make data accessible
- SOLR index
- Make metadata searchable
The publisher is implemented in Python, and is organized and distributed in the esgcet package.
Data Model
See also the publisher [database schema](http://www-pcmdi.llnl.gov/esgcet/esgcet_schema.png).
The central data objects in the publisher are:
- Datasets and datasetVersions
- Conceptually a datasetVersion is a collection of data stored in one or more physical data files (called fileVersions here). Typically the files are in a self-describing data format such as netCDF.
- As the above figure illustrates, a datasetVersion may be thought of in either of two equivalent ways:
- as a collection of fileVersions, or
- as a collection of variables.
- When one or more files in a datasetVersion are modified, the publisher creates a new datasetVersion whose version number is larger than that of the previous version. The sequence of related datasetVersion objects is called a dataset.
- A datasetVersion is the unit of publication in ESGF, meaning that the typical action of the publisher is to publish one or more datasetVersions.
- We often use the term dataset instead of datasetVersion when there is no risk of confusion.
- A dataset is associated with exactly one project, model, and experiment.
- Variables and fileVariables
- A variable is a multidimensional array, with named dimensions and associated attributes, for example: sea_level_pressure(lat, lon, time). A variable may span multiple files, as illustrated above.
- A fileVariable is the portion of a variable in a single fileVersion.
- Part of the publisher workflow is to combine fileVariables from related files into variables. This process is called aggregation. In the THREDDS context, a variable is called an aggregation. Aggregation is done on a single dimension, typically time. Variables are aggregated only for the last datasetVersion (highest version number).
- Files and fileVersions
- A fileVersion is a single physical file, containing a collection of multidimensional fileVariables.
- If a fileVersion is replaced by publishing a different file of the same base name, a new fileVersion is created. The sequence of fileVersions is called a file. All fileVersions associated with a file have the same name, but may exist in different directories.
- A given file belongs to one and only one dataset.
- Project
- A project is a collection of datasets, together with associated metadata. For example, 'CMIP5'.
- Models
- A model is a computer program that generates data. A model is associated with exactly one project. For example, 'HadGEM2-AO'.
- Experiments
- An experiment is the configuration (boundary conditions, input, etc.) of a model that generates a dataset. For example, 'historical'.
- An experiment is associated with exactly one project.
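The versioning rules above can be pictured with a small, self-contained sketch. It is not the esgcet model code: the class names `DatasetSketch`, `DatasetVersion` and `FileVersion`, the `publish` method, the dataset name, and the file paths are all hypothetical, and the choice to start version numbers at 1 and increment by 1 is an assumption (the text only requires each new number to be larger than the last).

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileVersion:
    """One physical file on disk (hypothetical stand-in)."""
    location: str  # full path; all versions of a file share the base name


@dataclass
class DatasetVersion:
    """The unit of publication: a numbered collection of fileVersions."""
    version: int
    file_versions: List[FileVersion]


@dataclass
class DatasetSketch:
    """A sequence of datasetVersions tied to one project, model and experiment."""
    name: str
    project: str
    model: str
    experiment: str
    versions: List[DatasetVersion] = field(default_factory=list)
    # A "file" is the sequence of fileVersions sharing one base name.
    files: Dict[str, List[FileVersion]] = field(default_factory=dict)

    def publish(self, paths: List[str]) -> DatasetVersion:
        """Create a new datasetVersion numbered above the previous one."""
        # Assumption: versions start at 1 and grow by 1.
        next_version = self.versions[-1].version + 1 if self.versions else 1
        file_versions = [FileVersion(p) for p in paths]
        for fv in file_versions:
            base = os.path.basename(fv.location)
            self.files.setdefault(base, []).append(fv)
        dsv = DatasetVersion(next_version, file_versions)
        self.versions.append(dsv)
        return dsv


# Republishing a file of the same base name from a different directory
# yields a second fileVersion and a second datasetVersion.
ds = DatasetSketch("example.dataset", "CMIP5", "HadGEM2-AO", "historical")
ds.publish(["/data/v1/psl_1990.nc"])
latest = ds.publish(["/data/v2/psl_1990.nc"])
```

In the real publisher these relationships live in SQL tables rather than in Python lists; the sketch only mirrors the terminology.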
Code Layout
The esgcet package contains the following subpackages:
- config: Defines the data handlers that are used to extract metadata, and configuration functions for reading the esg.ini configuration file.
- model: Defines the data objects, and their mappings to the SQL database.
- publish: Functionality for extracting metadata, generating THREDDS catalogs, and publishing metadata.
- query: Metadata querying for the local SQL database and remote SOLR indexes.
- schema_migration: Definition of SQL database tables; versioned control of database schemas.
- ui: Graphical user interface.
See the autogenerated [esgcet package documentation](http://www-pcmdi.llnl.gov/esgcet/doc).
Workflow
The above diagram shows the workflow when a dataset is published. Unpublishing essentially follows the reverse direction.
- Scan data and save to SQL database:
- The function esgcet.publish.extract.extractFromDataset scans the files in the dataset and populates the associated SQL database tables for:
- datasets, datasetVersions
- files, fileVersions
- fileVariables (partially)
- Aggregate fileVariables into variables:
- The function esgcet.publish.extract.aggregateVariables creates variables, and populates the SQL database for:
- fileVariables
- variables
- Generate THREDDS catalogs
- The function esgcet.publish.thredds.generateThredds writes THREDDS catalogs.
- updateThreddsMasterCatalog and updateThreddsRootCatalog write the master and root THREDDS catalogs.
- reinitializeThredds reinitializes the TDS, making the catalog visible.
- Publish THREDDS catalogs to the SOLR index.
- The function esgcet.publish.publish.publishDataset contacts the index node through either a RESTful or a Hessian API. The index node harvests the THREDDS catalog and makes the dataset metadata searchable.
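The aggregation step in the workflow can be illustrated with a minimal sketch: group fileVariables by name, order each group along the aggregation dimension, and record the combined length. This is not the code of esgcet.publish.extract.aggregateVariables, which works against the SQL database and presumably performs more checks. The `FileVariable` tuple, the `aggregate` function, and the file names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class FileVariable(NamedTuple):
    """The portion of a variable held in one file (hypothetical stand-in)."""
    name: str          # e.g. "sea_level_pressure"
    path: str          # file containing this portion
    first_time: float  # first value of the aggregation dimension in the file
    length: int        # number of steps along the aggregation dimension


def aggregate(file_variables: List[FileVariable]) -> Dict[str, Tuple[List[str], int]]:
    """Return {variable name: (file paths ordered by time, total length)}."""
    groups = defaultdict(list)
    for fv in file_variables:
        groups[fv.name].append(fv)
    result = {}
    for name, parts in groups.items():
        # Aggregation is done on a single dimension, here time.
        parts.sort(key=lambda fv: fv.first_time)
        result[name] = ([fv.path for fv in parts], sum(fv.length for fv in parts))
    return result


# Two files holding consecutive decades of the same variable, given out of order.
parts = [
    FileVariable("sea_level_pressure", "psl_2000.nc", 3650.0, 365),
    FileVariable("sea_level_pressure", "psl_1990.nc", 0.0, 365),
]
variables = aggregate(parts)
```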
Database Interface
The database interface is managed by the SQLAlchemy object-relational system. Persistent data classes are defined and mapped to SQL data tables. Setting and referencing data instances generates the corresponding database reads and writes. The object-relational model simplifies coding by largely removing the need for explicit SQL calls.
All esgcet persistent data objects and SQL tables are defined in package esgcet.model.
For example, the Dataset class is defined by:
    class Dataset(object):

        def __init__(self, name, project, model, experiment, run_name,
                     calendar=None, aggdim_name=None, aggdim_units=None,
                     status_id=None, offline=False, masterGateway=None):
            self.name = name
            self.project = project
            self.model = model
            self.experiment = experiment
            self.run_name = run_name
            self.calendar = calendar
            self.aggdim_name = aggdim_name
            self.aggdim_units = aggdim_units
            self.status_id = status_id
            self.offline = offline
            self.master_gateway = masterGateway
The corresponding database table is defined by:
    datasetTable = Table('dataset', metadata,
        Column('id', types.Integer, primary_key=True, autoincrement=True),
        Column('name', types.String(255), unique=True),
        Column('project', types.String(64)),
        Column('model', types.String(64)),
        Column('experiment', types.String(64)),
        Column('run_name', types.String(64)),
        Column('calendar', types.String(32)),
        Column('aggdim_name', types.String(64)),
        Column('aggdim_units', types.String(64)),
        Column('status_id', types.String(64)),
        Column('offline', types.Boolean),
        Column('master_gateway', types.String(64)),
        ForeignKeyConstraint(['model', 'project'], ['model.name', 'model.project']),
        ForeignKeyConstraint(['experiment', 'project'], ['experiment.name', 'experiment.project']),
        mysql_engine='InnoDB',
    )
The relationship between the class and the table is defined by:

    mapper(Dataset, datasetTable, properties={...})
Subsequent creation of Dataset instances results in records being created in Postgres. For example, in esgcet.publish.extract.extractFromDataset, the following creates a Dataset instance that is persisted in the esgcet dataset table:
    dset = Dataset(datasetName,
                   context.get('project', None),
                   context.get('model', None),
                   context.get('experiment', None),
                   context.get('run_name', None),
                   offline=offline,
                   masterGateway=masterGateway)
    session.add(dset)
Subsequent changes to dset will be persisted when the database session is committed.
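The composite foreign keys in the dataset table are what tie a dataset to a model within a single project. The sketch below shows that constraint in isolation, using plain SQL through the standard-library sqlite3 module rather than SQLAlchemy and Postgres. The layout of the `model` table (a primary key on name and project) is an assumption inferred from the foreign key, only a few dataset columns are reproduced, and the dataset names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE model (
    name    VARCHAR(64),
    project VARCHAR(64),
    PRIMARY KEY (name, project)          -- assumed layout of the model table
);
CREATE TABLE dataset (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    name    VARCHAR(255) UNIQUE,
    project VARCHAR(64),
    model   VARCHAR(64),
    FOREIGN KEY (model, project) REFERENCES model (name, project)
);
""")

conn.execute("INSERT INTO model VALUES ('HadGEM2-AO', 'CMIP5')")

# A dataset whose (model, project) pair exists is accepted.
conn.execute(
    "INSERT INTO dataset (name, project, model) VALUES (?, ?, ?)",
    ("example.dataset", "CMIP5", "HadGEM2-AO"),
)

# The same model name under a different project is rejected by the constraint.
try:
    conn.execute(
        "INSERT INTO dataset (name, project, model) VALUES (?, ?, ?)",
        ("bad.dataset", "CMIP3", "HadGEM2-AO"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

conn.commit()
dataset_count = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
```

With SQLAlchemy the same rule is enforced by the database when the session is flushed or committed, so an inconsistent Dataset instance would surface as an error at that point rather than at construction time.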
Internal APIs
Cdunif: data file I/O
Cdunif is part of CDMS, but is a simpler, lower-level interface than cdms2. Cdunif is available to handlers for file I/O. As part of cdms2, Cdunif provides access to the formats that CDMS supports, including netCDF and GrADS/GRIB.
Function | Description |
---|---|
from cdms2 import Cdunif; f = Cdunif.CdunifFile(path, mode) | Create or open a file. mode = 'r' (read), 'r+' (read-write), 'a' (read/write, possibly existing), 'w' (new) |
dir(f) | Global attributes |
f.someatt | Read global attribute 'someatt' |
f.someatt = 'New value' | Write a global attribute |
f.dimensions | Dictionary of dimension lengths |
f.variables | Dictionary of file variables |
d = f.createDimension(name, len) | Create a dimension. 'name' is the name of the dimension, 'len' is the length, or None for an unlimited dimension |
v = f.createVariable(name, type, dimensions) | Create a new variable in the file. 'name' is the name of the variable. 'type' is a string type identifier. 'dimensions' is a tuple of dimensions created by createDimension |
f.sync() | Sync modifications to the file |
f.close() | Close the file |
v = f.variables['name'] | Get a variable object |
dir(v) | Get variable attributes |
v.someatt | Read variable attribute 'someatt' |
v.someatt = 'New value' | Write a variable attribute |
ar = v[i:j, k:l] | Read variable data |
v[i:j, k:l] = ar | Write variable data |
ar = v.getValue() | Read the entire variable data array |
v.assignValue(ar) | Write the entire variable data array |
v.shape | Variable dimension lengths |
v.typecode() | Variable type (string) |
FormatHandler Interface
The FormatHandler interface (esgcet.config.format.FormatHandler) is the publisher's internal I/O interface. All format handlers must inherit from FormatHandler and implement these methods:
- open: Open a file.
- getFormatDescription: Get a string description of the format.
- close: Close a file.
- getAttribute: Get a global or variable attribute.
- getVariable: Get variable data (only used for coordinate variables).
- hasAttribute: Inquire if a global or variable attribute exists.
- hasVariable: Inquire if a variable exists.
- inquireAttributeList: Get a list of attributes.
- inquireVariableDimensions: Get a list of variable dimensions.
- inquireVariableList: Get a list of variables.
- inquireVariableShape: Get the shape of a variable, as a tuple.
The only built-in format handler is the CdunifFormatHandler (esgcet.config.netcdf_handler.CdunifFormatHandler). This handler calls the Cdunif interface to perform file I/O.
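To make the shape of the interface concrete, here is a toy handler that answers the same questions from an in-memory dictionary instead of a file. It is only a sketch: the method names come from the list above, but every signature (argument names, the optional `variable` argument, `open` as a class method) is an assumption, and the class deliberately does not inherit from the real FormatHandler so that it runs without esgcet installed. Consult esgcet.config.format.FormatHandler for the actual signatures.

```python
class DictFormatHandler:
    """Toy format handler over an in-memory dict; all signatures are assumed."""

    def __init__(self, store):
        self.store = store

    @classmethod
    def open(cls, store):
        """'Open' the store and return a handler instance."""
        return cls(store)

    @staticmethod
    def getFormatDescription():
        return "in-memory dict"

    def close(self):
        self.store = None

    def _attrs(self, variable):
        # Global attributes when no variable is named, else that variable's.
        if variable is None:
            return self.store["attributes"]
        return self.store["variables"][variable]["attributes"]

    def hasAttribute(self, attname, variable=None):
        return attname in self._attrs(variable)

    def getAttribute(self, attname, variable=None):
        return self._attrs(variable)[attname]

    def inquireAttributeList(self, variable=None):
        return list(self._attrs(variable))

    def hasVariable(self, name):
        return name in self.store["variables"]

    def getVariable(self, name):
        return self.store["variables"][name]["data"]

    def inquireVariableList(self):
        return list(self.store["variables"])

    def inquireVariableDimensions(self, name):
        return list(self.store["variables"][name]["dimensions"])

    def inquireVariableShape(self, name):
        dims = self.store["variables"][name]["dimensions"]
        return tuple(self.store["dimensions"][d] for d in dims)


store = {
    "attributes": {"title": "example"},
    "dimensions": {"time": 12, "lat": 2, "lon": 3},
    "variables": {
        "sea_level_pressure": {
            "dimensions": ("time", "lat", "lon"),
            "attributes": {"units": "Pa"},
            "data": None,
        }
    },
}
handler = DictFormatHandler.open(store)
shape = handler.inquireVariableShape("sea_level_pressure")
```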
MetadataHandler Interface
The MetadataHandler interface (esgcet.config.metadata.MetadataHandler) is the internal interface to metadata conventions and time value logic. The only implementation is CFHandler, which implements the CF metadata convention and uses the cdtime module in CDMS for time and calendar functions.
Project Handlers
Project handlers encapsulate the logic for obtaining metadata associated with a particular project. The API is defined in the ProjectHandler class (esgcet.config.project.ProjectHandler). For example, the CMIP5 project handler (esgcet.config.ipcc5_handler.IPCC5Handler) deals with CMOR tables, DRS, determination of the product field, and so on. The built-in project handlers are:
- BasicHandler: A basic handler for generic projects.
- IPCC5Handler: CMIP5 project handler.
- IPCC4Handler: CMIP3 project handler.
Project handlers encapsulate an important data structure, the project context. The context is a dictionary of name-value pairs that contains the file metadata discovered at any given point in processing. The metadata may be obtained from:
- command line
- directory names
- file global attributes
- dataset IDs specified in a mapfile
- project-specific internally generated metadata, such as the CMIP5 product field.
All project handlers inherit from the abstract base class ProjectHandler. The most important methods that project handlers must implement are:
- readContext: populate the project context, typically by reading metadata from a file
- getContext: return the project context
For example, the BasicHandler implementation reads standard netCDF header information:
    def getContext(self, context):
        ProjectHandler.getContext(self, context)
        if self.context.get('creation_time', '')=='':
            self.context['creation_time'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        if self.context.get('format', '')=='':
            self.context['format'] = self.formatHandlerClass.getFormatDescription()
        conventions = self.context.get('Conventions')
        if conventions is not None:
            self.context['format'] += ', %s'%conventions
        return self.context

    def readContext(self, cdfile):
        """Get a dictionary of key/value pairs from an open file.

        cdfile is an instance of CdunifFormatHandler.
        cdfile.file is a Cdunif file.

        This handler sets basic file descriptive metadata.
        """
        f = cdfile.file
        result = {}
        if hasattr(f, 'title'):
            result['title'] = f.title
        if hasattr(f, 'Conventions'):
            result['Conventions'] = f.Conventions
        if hasattr(f, 'source'):
            result['source'] = f.source
        if hasattr(f, 'history'):
            result['history'] = f.history
        return result
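The two methods can be exercised without esgcet or CDMS by running the same logic against a stand-in file object. The sketch below is an approximation under stated assumptions: `read_context` and `fill_defaults` are hypothetical free functions that mirror readContext and getContext, the file is faked with `types.SimpleNamespace`, the format description "netCDF" is a placeholder, and merging the result into the context with `dict.update` is an assumption about how the publisher combines the sources.

```python
from types import SimpleNamespace

GLOBAL_ATTRIBUTES = ("title", "Conventions", "source", "history")


def read_context(f):
    """Mirror of readContext: copy the global attributes that are present."""
    return {name: getattr(f, name) for name in GLOBAL_ATTRIBUTES if hasattr(f, name)}


def fill_defaults(context, format_description):
    """Mirror of getContext: fill 'format' if empty, then append Conventions."""
    if context.get("format", "") == "":
        context["format"] = format_description
    conventions = context.get("Conventions")
    if conventions is not None:
        context["format"] += ", %s" % conventions
    return context


# Stand-in for cdfile.file: only 'title' and 'Conventions' are defined.
fake_file = SimpleNamespace(title="Example run", Conventions="CF-1.4")

context = {"project": "CMIP5"}           # e.g. supplied on the command line
context.update(read_context(fake_file))  # assumed merge step
context = fill_defaults(context, "netCDF")
```

Attributes missing from the file, such as 'source' and 'history' here, are simply left out of the context rather than set to empty values.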