REST Services - ge-semtk/semtk GitHub Wiki

The following is a basic overview of the SemTK REST Services. They are all built on Spring Boot.

Core SemTK Services:

SemTK Services for EDC and FDC:

SemTK Services that wrap external data stores:

See the diagrams at the bottom for a view of how these services interact.

Query Service

The Query Service wraps calls to triple stores (e.g. Virtuoso). It handles parameter naming for inputs, enforces a consistent output format, and provides basic error handling. This allows for predictable interactions with different triple stores, which often vary in behavior, particularly in default return encapsulation and error handling. Routing all triple store queries through a single point also allows us to provide useful utility methods, letting users perform certain actions without writing SPARQL queries.
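As a sketch of what a Query Service request might look like, the snippet below bundles a SPARQL query with connection details into one payload. The field names (query, serverAndPort, dataset, serverType, resultType) are illustrative assumptions, not the service's documented parameters.

```python
import json

# A sketch of packaging a SPARQL query for the Query Service. The field
# names here are illustrative assumptions, not the documented API.
def build_query_payload(sparql, server_url, dataset, server_type="virtuoso"):
    """Bundle a SPARQL query with triple-store connection details so the
    Query Service can route it and normalize the response format."""
    return {
        "query": sparql,
        "serverAndPort": server_url,
        "dataset": dataset,
        "serverType": server_type,
        "resultType": "TABLE",  # request a uniform tabular result
    }

payload = build_query_payload(
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10",
    "http://localhost:8890",
    "http://my/dataset")
print(json.dumps(payload, indent=2))
```

Keeping connection details in the payload (rather than in the caller's SPARQL) is what lets one service front several triple stores with differing behavior.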

Ingestion Service

The Ingestion Service simplifies the insertion of triples into the triple store. The data for these triples can currently come from either an ODBC connection or a CSV file. The Ingestion Service uses a JSON template to define the structure of the sub-graph to be populated and the transformations needed to map the input values to the output triples.

This service supports both a direct mode (which inserts each record independently) and a pre-check mode (which inserts data only if every input record is verified to be error-free).
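The two modes above can be sketched as a choice of endpoint when assembling an ingestion request. The endpoint paths and field names below are illustrative assumptions, not the documented API; see the SemTK ingestion documentation for the real interface.

```python
import json

# A sketch of pairing a JSON ingestion template with CSV data and
# choosing between the two modes. Endpoint paths and field names are
# illustrative assumptions only.
def build_ingestion_request(template_json, csv_text, precheck=True):
    """Pre-check mode inserts nothing unless every record is error-free;
    direct mode inserts each record independently."""
    endpoint = "/ingestion/fromCsvPrecheck" if precheck else "/ingestion/fromCsv"
    body = {"template": json.dumps(template_json), "data": csv_text}
    return endpoint, body

template = {"importSpec": {"nodes": []}}   # stand-in for a real template
csv_text = "name,pressure\npump1,97.2\n"

endpoint, body = build_ingestion_request(template, csv_text, precheck=True)
print(endpoint)
```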

More information is here

Ontology Info Service

The Ontology Info Service provides information about the ontology (semantic model). Among other things, it is used to populate the lefthand pane in SparqlGraph, which features a hierarchical view of the ontology.

Nodegroup Service

The Nodegroup Service allows creation and editing of nodegroups. The general flow is for a caller to provide nodegroup JSON and connection information for the ontology, and receive new nodegroup JSON in return.

Note the following:

  • the jsonRenderedNodeGroup endpoint parameter may be the JSON from a SparqlGraphJson (including the connection information, NodeGroup, and ImportSpec) or simply a NodeGroup. Endpoints that require the ontology via connection JSON will throw an error if the connection is missing from the jsonRenderedNodeGroup, while those that don't need the ontology will process either form of the parameter.

  • endpoints that use a conn or jsonRenderedNodeGroup connection to retrieve an ontology will cache the ontology for future calls to the Nodegroup Service. If the ontology changes between calls, it may take several minutes before the service sees the new version. Calling /clearCachedOntology will clear the cache so that changes to the ontology take effect immediately.
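After changing the ontology, a client can force a refresh by hitting the /clearCachedOntology endpoint noted above. A minimal sketch, assuming a local host and port (the base URL is an assumption for illustration; the real POST is left commented out):

```python
from urllib import request

# Base URL for the Nodegroup Service; host and port are assumptions.
NODEGROUP_SERVICE = "http://localhost:12059/nodeGroup"

def clear_cached_ontology(base_url):
    """Build (and, when uncommented, send) a POST to /clearCachedOntology,
    forcing the service to re-fetch the ontology on its next call."""
    url = base_url + "/clearCachedOntology"
    # request.urlopen(request.Request(url, method="POST"))  # real call
    return url

print(clear_cached_ontology(NODEGROUP_SERVICE))
```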

Nodegroup Store Service

The Nodegroup Store Service allows for storage and retrieval of nodegroups. Nodegroups store information about how to generate semantic queries about a subgraph of interest. Nodegroups may also store information such as a SPARQL connection and/or data loading specifications.

More information is here

Nodegroup Execution Service

The Nodegroup Execution Service executes a nodegroup to retrieve data programmatically, as an alternative to using the SparqlGraph UI.

More information is here

Dispatch Service

The Dispatch Service fulfills incoming queries by initiating SPARQL queries to retrieve data and sending the results to the Results Service. It updates the job's status and percent completion using the Status Service.

This service uses the dispatcher class com.ge.research.semtk.sparqlX.asynchronousQuery.AsynchronousNodeGroupDispatcher by default (see the .env file), which can fulfill plain SPARQL queries by sending them to the triple store for execution.

The default dispatcher class may be replaced by a custom dispatcher when needed. For example, to use the EDC feature, the dispatch service must be configured to use com.ge.research.semtk.sparqlX.dispatch.EdcDispatcher, which accepts and processes a SPARQL query that may include data external to the triple store.

Results Service

The Results Service enables query results to be written to a cache and subsequently retrieved using a job ID.

Status Service

The Status Service keeps track of in-progress and completed tasks. Given a job ID, it provides the status and percent complete for the task, and accepts status updates. Together with the Results Service, it enables jobs to be completed asynchronously.
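The asynchronous pattern these two services enable can be sketched as: poll percent-complete by job ID until 100, then fetch the cached results. The two fetch functions below are local stubs standing in for real HTTP calls to the Status and Results Services.

```python
import time

def get_percent_complete(job_id, _state={"pct": 0}):
    """Stub for a Status Service call; advances 50% per poll."""
    _state["pct"] = min(100, _state["pct"] + 50)
    return _state["pct"]

def get_results(job_id):
    """Stub for a Results Service call returning cached result rows."""
    return [{"pressure": 97.2}]

def wait_for_results(job_id, poll_seconds=0.01):
    """Poll status until the job reports 100%, then retrieve results."""
    while get_percent_complete(job_id) < 100:
        time.sleep(poll_seconds)
    return get_results(job_id)

rows = wait_for_results("job-123")
print(rows)
```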

Utility Service

Provides endpoints for managing SemTK. Currently the only utility endpoints available are for configuring EDC (external data connections), as shown below. In the future, other types of utility endpoints may be added.

EDC Query Generation Service

Generates queries to run against an external data store. Endpoints are described below.

Endpoint to generate time series queries for Hive or AWS Athena

Generates queries to be run against a relational database containing time-coherent time series data. This service assumes that the time series data has a timestamp column of type double, containing a Unix timestamp in seconds (e.g. 1497015968.556 for Fri, 09 Jun 2017 13:46:08). The name of the column is configurable in SemTK. Example queries (for Hive):

  • select cast(timestamp_col AS timestamp) as timestamp, pressure2 AS PRESSURE, diameter2 AS DIAMETER, flow2 AS FLOW from my_database.my_table order by timestamp
  • select cast(timestamp_col AS timestamp) as timestamp, pressure3 AS PRESSURE, diameter3 AS DIAMETER, flow3 AS FLOW from my_database.my_table where ( (pressure3 > 80) OR (diameter3 > 40000 AND diameter3 < 70000) ) and ( ( unix_timestamp(to_utc_timestamp(timestamp_col,'Etc/GMT+0'), 'yyyy-MM-dd hh:mm:ss') >= unix_timestamp('10/08/2014 10:00:00 AM','MM/dd/yyyy hh:mm:ss a') ) AND ( unix_timestamp(to_utc_timestamp(timestamp_col,'Etc/GMT+0'), 'yyyy-MM-dd hh:mm:ss') <= unix_timestamp('10/08/2014 11:00:00 AM','MM/dd/yyyy hh:mm:ss a') ) ) order by timestamp

Flags may be used to omit column aliases, or to return the timestamps in their stored format (rather than converting to a human-readable timestamp).
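To illustrate the shape of the first example query above, the sketch below assembles it from a table name, a timestamp column, and column-to-alias mappings. This is an illustration of the generated query's structure, not the service's actual generation code.

```python
# Sketch: assemble a Hive time series query in the shape shown above.
def build_hive_query(table, ts_col, columns):
    """columns: {stored_name: ALIAS}. Casts the double Unix-seconds
    timestamp column to a readable timestamp and aliases each variable."""
    selects = [f"cast({ts_col} AS timestamp) as timestamp"]
    selects += [f"{col} AS {alias}" for col, alias in columns.items()]
    return f"select {', '.join(selects)} from {table} order by timestamp"

query = build_hive_query(
    "my_database.my_table", "timestamp_col",
    {"pressure2": "PRESSURE", "diameter2": "DIAMETER", "flow2": "FLOW"})
print(query)
```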

Endpoint to generate time series queries for KairosDB

Generates queries to be run against KairosDB. Example query:

  • {"start_relative":{"value":10,"unit":"YEARS"},"cacheTime":0,"metrics":[{"name":"PRESSURE","tags":{},"group_by":[],"aggregators":[]},{"name":"DIAMETER","tags":{},"group_by":[],"aggregators":[]}]}
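The JSON payload above can be sketched as a function of the metric names and a relative start window; this mirrors the example's structure and is not the service's generation code.

```python
import json

# Sketch: build a KairosDB query payload in the shape shown above.
def build_kairos_query(metrics, value, unit):
    """metrics: list of metric names; value/unit: relative start window."""
    return {
        "start_relative": {"value": value, "unit": unit},
        "cacheTime": 0,
        "metrics": [{"name": m, "tags": {}, "group_by": [], "aggregators": []}
                    for m in metrics],
    }

q = build_kairos_query(["PRESSURE", "DIAMETER"], 10, "YEARS")
print(json.dumps(q))
```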

Endpoint to generate "queries" to retrieve binary files from a file store

Generates "queries" containing the information needed to retrieve a file from a file store (e.g. HDFS). Example: hdfs://test1-98231834.img###test1.img (file location, then file name, separated by ###).
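A consumer of such a "query" splits the string back into its two parts, as in this minimal sketch:

```python
# Sketch: unpack a file-store "query" of the form location###name.
def parse_file_query(query):
    """Split on the ### delimiter into (file location, file name)."""
    location, name = query.split("###", 1)
    return location, name

loc, name = parse_file_query("hdfs://test1-98231834.img###test1.img")
print(loc, name)
```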

Service Interaction Diagrams

The following diagrams show the interactions between various SemTK services.