Ingestion Packages - ge-semtk/semtk GitHub Wiki

Ingestion Packages

A complete dataset can be packaged as a single ZIP file (see sample here) containing various files with embedded ingestion instructions. Ingestion packages make it easy to construct reproducible datasets that can be shared with other users or reloaded at a later date.

Loading an ingestion package

An ingestion package may be loaded into SemTK using UtilityClient.execLoadIngestionPackage(...).

This function requires the parameters defaultModelGraph and defaultDataGraph. These graphs are used wherever the ingestion package contents do not specify a model or data graph.

Ingestion package contents

An ingestion package ZIP file must contain a top-level manifest file named manifest.yaml. This file is the entry point for the ingestion process and orchestrates the loading of all other components in the ZIP file.
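As a rough illustration of this layout rule, the Python sketch below assembles a package in memory and refuses to build one that lacks a top-level manifest.yaml. The helper name build_package and the sample file names and contents are hypothetical, chosen only for illustration; this is not part of SemTK.

```python
import io
import zipfile

MANIFEST_NAME = "manifest.yaml"  # required top-level entry point


def build_package(files: dict[str, str]) -> bytes:
    """Return ZIP bytes for an ingestion package (hypothetical helper).

    `files` maps archive-relative paths to text content. Only the single
    rule stated above is enforced: manifest.yaml must sit at the top level.
    """
    if MANIFEST_NAME not in files:
        raise ValueError(f"package must contain a top-level {MANIFEST_NAME}")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, content in files.items():
            archive.writestr(path, content)
    return buffer.getvalue()


# Placeholder contents; a real package would carry real OWL and YAML files.
package_bytes = build_package({
    "manifest.yaml": 'name: "demo"\nsteps:\n  - model: model-manifest.yaml\n',
    "model-manifest.yaml": "files:\n  - demo.owl\n",
    "demo.owl": "<!-- OWL content goes here -->\n",
})
```

The resulting bytes can be written to a .zip file and then passed to whatever loading mechanism is in use.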

Manifest file

Manifest files are stored in YAML format. A manifest file contains:

Name and description information
A list of ingestion steps
The model and data graphs used by the components of the manifest
Extra instructions for completing the ingestion process

name: "short name"
description: "optional long package description"
footprint:
  model-graphs:
    - "http://rack001/model"
  data-graphs:
    - "http://rack001/data"
steps:
  - manifest: another.yaml
  - model: model-manifest.yaml
  - data: data-manifest.yaml
  - nodegroups: nodegroup-directory
  - copygraph:
      from-graph: "http://source/graph"
      to-graph: "http://destination/graph"
copy-to-graph: "http://target/graph"
perform-entity-resolution: "http://target/graph"

The name and description fields are informational.

The footprint section is optional and lists the model and data graphs loaded by this ingestion package. If provided, these graphs may be used in three ways: 1) by the API to create a connection string (e.g., to view the new data in SPARQLgraph); 2) to be cleared before loading, if requested in the API call; and 3) to be copied via copy-to-graph (see below).

The steps section is required. It describes the sequential process of loading this ingestion package. This section must be a list of singleton maps. There are five step types:

manifest: points to a sub-manifest YAML file to process
model: points to a model ingestion YAML file to process
data: points to a data ingestion YAML file to process
nodegroups: points to a directory of nodegroups to process
copygraph: executes a data copy from one graph to another. Specify source and target graphs with the keys from-graph and to-graph.
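The shape rules for the steps section can be sketched as a small check. The Python below assumes the manifest has already been parsed into plain lists and dicts (for example by a YAML library); the function name check_steps and its messages are hypothetical, and SemTK's own validation may differ.

```python
STEP_TYPES = {"manifest", "model", "data", "nodegroups", "copygraph"}


def check_steps(steps):
    """Return a list of problems found in a parsed steps section."""
    if not isinstance(steps, list):
        return ["steps must be a list"]
    problems = []
    for index, step in enumerate(steps):
        # Each step must be a singleton map: exactly one key.
        if not isinstance(step, dict) or len(step) != 1:
            problems.append(f"step {index}: expected a map with exactly one key")
            continue
        kind, value = next(iter(step.items()))
        if kind not in STEP_TYPES:
            problems.append(f"step {index}: unknown step type {kind!r}")
        elif kind == "copygraph":
            # copygraph takes a nested map naming source and target graphs.
            needed = {"from-graph", "to-graph"}
            if not isinstance(value, dict) or not needed <= set(value):
                problems.append(f"step {index}: copygraph needs from-graph and to-graph")
    return problems


steps = [
    {"manifest": "another.yaml"},
    {"model": "model-manifest.yaml"},
    {"copygraph": {"from-graph": "http://source/graph",
                   "to-graph": "http://destination/graph"}},
]
problems = check_steps(steps)  # empty list: every step is well formed
```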

The copy-to-graph field is optional. When provided, the footprint graphs are copied to the specified target graph.

The perform-entity-resolution field is optional. When provided, entity resolution is performed within the specified graph.

All file paths are resolved relative to the location of the manifest YAML file.
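A minimal sketch of this resolution rule, assuming archive-style paths with forward slashes. The function name resolve is hypothetical, and the sketch does not handle edge cases such as absolute paths or ".." segments.

```python
from pathlib import PurePosixPath


def resolve(manifest_path: str, referenced: str) -> str:
    """Resolve `referenced` against the directory holding `manifest_path`."""
    return str(PurePosixPath(manifest_path).parent / referenced)


# A file named by the top-level manifest stays at the top level.
top_level = resolve("manifest.yaml", "model-manifest.yaml")
# A file named by a sub-manifest is looked up next to that sub-manifest.
nested = resolve("sub/another.yaml", "data-manifest.yaml")
```

Here top_level is "model-manifest.yaml" and nested is "sub/data-manifest.yaml", so a sub-manifest in a subdirectory can refer to its neighbours without repeating the directory name.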

Model manifests

Models are indexed by a YAML file using the following top-level keys:

model-graphs: a model graph to be used for ingestion (currently supports only a single model graph)
files: a required list of the OWL files that constitute the model.

Data manifests

Data files are indexed by a YAML file using the following top-level keys:

model-graphs: model graph(s) to be used in the ingestion connection
data-graph: a data graph to be used for ingestion
extra-data-graphs: read-only data graphs used for lookups
ingestion-steps: specifies data to load, following one of these examples:
- {class: "http://class/uri", csv: "activities.csv"} (load CSV file using the specified class URI, via automatically generated ingestion nodegroup)
- {nodegroup: "Nodegroup 1", csv: "activities.csv"} (load CSV file using the specified nodegroup)
- owl: data.owl (load OWL file as data)
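The three ingestion-steps shapes listed above can be told apart by their keys. The Python sketch below works on an already-parsed data manifest; the function name classify_ingestion_step and the returned labels are hypothetical, and it assumes each entry carries exactly the keys shown in the examples, which may be stricter than SemTK itself.

```python
def classify_ingestion_step(step: dict) -> str:
    """Name which of the three documented shapes an entry follows."""
    keys = set(step)
    if keys == {"class", "csv"}:
        return "csv-by-class"       # CSV loaded via an auto-generated nodegroup
    if keys == {"nodegroup", "csv"}:
        return "csv-by-nodegroup"   # CSV loaded via the named nodegroup
    if keys == {"owl"}:
        return "owl"                # OWL file loaded as data
    raise ValueError(f"unrecognized ingestion step keys: {sorted(keys)}")


# Parsed form of a data manifest that uses all three shapes.
data_manifest = {
    "data-graph": "http://rack001/data",
    "ingestion-steps": [
        {"class": "http://class/uri", "csv": "activities.csv"},
        {"nodegroup": "Nodegroup 1", "csv": "activities.csv"},
        {"owl": "data.owl"},
    ],
}
kinds = [classify_ingestion_step(s) for s in data_manifest["ingestion-steps"]]
```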

Nodegroup directories

A nodegroup directory contains nodegroup JSON files, with their metadata in store_data.csv. Such a directory can be generated using SemTK's export nodegroup functionality.