EDC (External Data)

Should I use EDC or FDC?

SemTK EDC and SemTK FDC both enable retrieval of data from sources outside of the knowledge graph. The following diagram shows the differences between them.

EDC Overview

SemTK provides functionality for Ontology-based Data Access (OBDA), enabling semantics-based access to data external to the triple store. In SemTK, this functionality is referred to as EDC ("external data connection").

As shown below, EDC is enabled by placing metadata into the triple store describing how to access external data stores. When a user submits a query, the EDC functionality transparently retrieves the data from the triple store or, if necessary, from external stores as well. Thus, the user has a single point of access for retrieving data from multiple stores, in domain terms.

Here are the steps to set up a new EDC connection. Details for each step are below.

  1. Identify the external store to connect to (e.g. a Hive table, an HDFS folder)
  2. Identify a Query Generator and Executor pair to write and execute queries against this external store. A new generator and/or executor may need to be created, depending on what external store is being targeted.
  3. Modify your use-case specific ontology (*.sadl/owl) to incorporate the EDC ontology (sparqlEdc.sadl/owl), and load it to the semantic store.
  4. Create instance data (.csv) and corresponding loading templates (.json), and load them to the semantic store. This instance data will include metadata about the external store (e.g. database connection information, table names, column names).
  5. Add a new mnemonic to the EDC configuration, to specify what classes should trigger EDC, what query generator/executor to use, and other items.
  6. Query for EDC data

1. Identify the external store

This may be any queryable location: for example, Apache Hive for time series data, or an HDFS folder for image files.

2. Query Generators & Query Executors

EDC requires 1) a query generation service to generate queries to be run against an external store and 2) a query execution service to execute the queries. When connecting to new external stores, new query generation and/or execution services may need to be written, conforming to the specifications below.

Query Generation Service

This is a service that generates a set of queries. It must have a client that extends QueryGenClient, implementing this method:

public abstract TableResultSet execute(Table locationAndValueInfoTable, JSONObject constraintsJson, JSONArray flagsJsonArray) throws Exception;

The query generation service inputs are:

  • Table locationAndValueInfoTable: a table with 1) a UUID column and 2) other columns holding parameters required by the query generator. Different query generators use different sets of columns. Each parameter (e.g. TableName below) must be specified in the EDC configuration (as a Paramsparqlid); the EDC Dispatcher populates the corresponding table column using the Paramclass and Paramkeyname, also specified in the EDC configuration. Example table (for generating time series queries against a relational store):
UUID,DatabaseServer,Database,TableName,TimeStampColumnList,VarName,TagName
db873eeb-1c7a-4016-a681-ad199e3f7500,jdbc:hive2://myserver:10000,mydatabase,table1,ts_utc,PRESSURE1,p_31
db873eeb-1c7a-4016-a681-ad199e3f7500,jdbc:hive2://myserver:10000,mydatabase,table1,ts_utc,PRESSURE2,p_32
db873eeb-1c7a-4016-a681-ad199e3f7500,jdbc:hive2://myserver:10000,mydatabase,table1,ts_utc,PRESSURE3,p_33
0c333f93-a9cf-4a36-bea2-11d9d5e40e8c,jdbc:hive2://myserver:10000,mydatabase,table2,ts_utc,PRESSURE1,p_41
0c333f93-a9cf-4a36-bea2-11d9d5e40e8c,jdbc:hive2://myserver:10000,mydatabase,table2,ts_utc,PRESSURE2,p_42
0c333f93-a9cf-4a36-bea2-11d9d5e40e8c,jdbc:hive2://myserver:10000,mydatabase,table2,ts_utc,PRESSURE3,p_43
  • JSONObject constraintsJson: JSON representing query constraints. The query generator must know how to parse and apply these. Example constraints for a time series query:
{
  "@constraintSet": {
    "@op": "AND",
    "@constraints": [
      {
        "@var": "PRESSURE1",
        "@operator1": ">",
        "@value1": { "@value": "100", "@type": "number" }
      },
      {
        "@var": "PRESSURE2",
        "@operator1": ">",
        "@value1": { "@value": "50", "@type": "number" },
        "@operator2": "<",
        "@value2": { "@value": "60", "@type": "number" }
      }
    ]
  },
  "@timeConstraint": {
    "@var": "_Time_",
    "@operator1": ">=",
    "@value1": { "@value": "04/07/2016 2:00:00 AM", "@type": "datetime" }
  }
}
  • JSONArray flagsJsonArray: a json array of flags. A given query generator may or may not need flags. Example: ["RDB_QUERYGEN_OMIT_ALIASES","RDB_QUERYGEN_RAW_TIMESTAMP"]

The query generation service output is a TableResultSet containing these three columns:

  • UUID: a unique id provided by the EDC Dispatcher for fusing the external query results with semantic results
  • Query: the query to be run on the external store
  • ConfigJSON: json configuration needed to instantiate the query execution client. If no configuration is needed, use empty json: {}

Here is an example TableResultSet. Note that the same UUID may appear multiple times in the table, with different queries.

UUID,Query,ConfigJSON
75866089-2d1c-4d47-9143-144d20acd28b,"SELECT cast(ts_utc AS timestamp) AS `timestamp`, p_41 AS `PRESSURE1`, p_42 AS `PRESSURE2` FROM mydatabase.table2 WHERE (p_41 > 100) ORDER BY `timestamp`","{""database"":""mydatabase"",""port"":""10000"",""host"":""myserver""}"
41628f59-cf61-48ea-a5fd-beebe0d4d62c,"SELECT cast(ts_utc AS timestamp) AS `timestamp`, p_32 AS `PRESSURE2` FROM mydatabase.table1 ORDER BY `timestamp`","{""database"":""mydatabase"",""port"":""10000"",""host"":""myserver""}"

An example client is RDBTimeCoherentTimeSeriesQueryGenClient.
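
To make the generator's job concrete, below is a plain-Java sketch of the core transform for the time series example above: group the input rows by UUID and emit one (UUID, Query, ConfigJSON) row per external table. This is not SemTK source; the class and method names are hypothetical, constraint and flag handling is omitted, and ConfigJSON is left empty (the real Hive generator also emits connection information). A real client would extend QueryGenClient and use SemTK's Table and TableResultSet types.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the core transform of a time series query generator.
// Input rows have the columns UUID, DatabaseServer, Database, TableName,
// TimeStampColumnList, VarName, TagName (see the example table above).
public class TimeSeriesQueryGenSketch {

    // Returns one {UUID, Query, ConfigJSON} triple per distinct UUID.
    public static List<String[]> generate(List<Map<String, String>> inputRows) {
        // Group rows by UUID, preserving input order
        Map<String, List<Map<String, String>>> byUuid = new LinkedHashMap<>();
        for (Map<String, String> row : inputRows) {
            byUuid.computeIfAbsent(row.get("UUID"), k -> new ArrayList<>()).add(row);
        }

        List<String[]> results = new ArrayList<>();
        for (List<Map<String, String>> group : byUuid.values()) {
            Map<String, String> first = group.get(0);
            // One SELECT per external table: the timestamp column plus one
            // "tag AS variable" projection per input row
            StringBuilder sql = new StringBuilder("SELECT cast(")
                    .append(first.get("TimeStampColumnList"))
                    .append(" AS timestamp) AS `timestamp`");
            for (Map<String, String> row : group) {
                sql.append(", ").append(row.get("TagName"))
                   .append(" AS `").append(row.get("VarName")).append("`");
            }
            sql.append(" FROM ").append(first.get("Database"))
               .append(".").append(first.get("TableName"))
               .append(" ORDER BY `timestamp`");
            results.add(new String[]{first.get("UUID"), sql.toString(), "{}"});
        }
        return results;
    }
}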

Query Execution Service

This can be any query service that has a corresponding QueryExecuteClient, implementing this method:

public TableResultSet execute(String query) throws Exception

The client will be instantiated using 1) the EDC mnemonic Exec* fields (Execclienttype, Execprotocol, Exechost, Execport, Execendpoint) and 2) the ConfigJSON provided by the query generator.

The service endpoint has to take a @RequestBody (not @RequestParam arguments), because the client sends the parameters together in a single JSON object.
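
For illustration, a conforming endpoint might look like the sketch below. The controller, path, and request field name are hypothetical, not SemTK source; the point is that the whole payload binds to one object via @RequestBody.

import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical query execution endpoint: all parameters arrive together
// in a single JSON object, so they are bound with @RequestBody rather
// than as individual @RequestParam values.
@RestController
public class ExampleQueryExecutionController {

    // Hypothetical request shape; the field name is illustrative only
    public static class QueryRequest {
        public String query;  // the query produced by the query generator
    }

    @PostMapping("/exampleService/executeQuery")
    public String executeQuery(@RequestBody QueryRequest request) {
        // run the query against the external store and return the
        // results as a JSON-serialized table
        return runAgainstExternalStore(request.query);
    }

    private String runAgainstExternalStore(String query) {
        // placeholder for store-specific execution logic
        throw new UnsupportedOperationException("store-specific logic goes here");
    }
}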

An example client is com.ge.research.semtk.edc.client.HiveClient. In this case, the EDC mnemonic Exec* fields point to the SemTK Hive service. The ConfigJSON contains the Hive JDBC host/port, as well as the Hive database name.
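
To sketch the client half, below is a standalone illustration (hypothetical class and JSON field name, with naive string escaping) of an execute(String) implementation that forwards a generated query to such an endpoint as a single JSON body. A real client would extend QueryExecuteClient, reuse SemTK's REST plumbing, and parse the response into a TableResultSet.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of a query execution client: POST the generated
// query to the execution service as one JSON object (hence @RequestBody
// on the service side) and return the raw JSON response.
public class ExampleQueryExecuteSketch {

    private final String endpointUrl;  // built from the mnemonic Exec* fields

    public ExampleQueryExecuteSketch(String protocol, String host, int port, String endpoint) {
        this.endpointUrl = protocol + "://" + host + ":" + port + "/" + endpoint;
    }

    public String execute(String query) throws Exception {
        // naive escaping, sufficient for a sketch only
        String body = "{\"query\": \"" + query.replace("\"", "\\\"") + "\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpointUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();  // JSON-serialized table results
    }
}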

Future: if a query execution client needed to connect directly to various service locations, a single set of EDC mnemonic Exec* fields would not be sufficient. In this case, the client could use ConfigJSON to override the EDC mnemonic Exec* fields (e.g. Execprotocol, Exechost, Execport).

3. Using the EDC ontology

The EDC ontology (sparqlEdc.sadl) describes generic kinds of external data connections. Domain-specific ontologies should import the EDC ontology as a starting point for describing things like time series data (TimeSeriesDataset) or external file systems (ExternalFileSystem), as illustrated below.

For example, here is a snippet of the EDC ontology that describes a time series dataset. The time series dataset has a mapping of measurements and a database location.

...
// "ghost class" with no instance data - EDC will retrieve from external source
ExternalDataValue is a class.

// extend in project-specific ontology to allow multiple mnemonics for same ExternalDataValue 
// (e.g. to trigger TimeSeriesValue differently for different projects)
ExternalDataSource is a class.

TimeSeriesValue is a type of ExternalDataValue,
	described by hasDateTime with a single value of type time,
	described by hasValue with a single value of type float.

TimeSeriesDataset is a type of ExternalDataSource,
 	described by map with a single value of type TimeSeriesGeneratorMap,
 	described by location with a single value of type Location.
TimeSeriesRowDataset is a type of TimeSeriesDataset.

TimeSeriesGeneratorMap is a class,                                        
 	described by mapName with a single value of type string,         
 	described by hasMeasurement with values of type TimeSeriesGenerator.

TimeSeriesGenerator is a type of EDCValueGenerator,
	described by tag with a single value of type string,
	described by variableName with a single value of type string,
	described by hasValues with values of type TimeSeriesValue.

Location is a type of EDCLocation,
	described by databaseServer with a single value of type string,
	described by database with a single value of type string,
	described by tableName with a single value of type string,
	described by dateTimeMarker with values of type DateTimeMarker.	
...

A domain-specific ontology may import the EDC ontology so that it inherits the concepts above, and may then extend them with additional domain-specific attributes. For example, the snippet below from a use-case-specific ontology extends TimeSeriesGenerator with a Measurement class that adds a description and units. The domain-specific ontology will most often also link the EDC data to related data in the semantic store. In the example below, the EDC data (Table) is linked to information about a test id and engineer.

import "http://research.ge.com/kdl/sparqlgraph/externalDataConnection" as sparqlEdc.
...
Measurement is a type of sparqlEdc:TimeSeriesGenerator, 
 	described by timeseries:hasDescription with a single value of type string,                  
 	described by timeseries:hasUnits with a single value of type string. 

Table is a type of sparqlEdc:TimeSeriesRowDataset.

EquipmentTest is a top-level class,
	described by id with a single value of type int,
	described by engineer with a single value of type string,
	described by measurementData with a single value of type timeseries:Table.
...

4. Creating EDC instance data

EDC instance data will vary widely based on the details of the external data and its source. Here we present an example where time series data is being retrieved from a relational database.

Here is some sample instance data describing a set of external database tables. This instance data might be used to populate the Location object above.

test_num,database,database_server,table_name,timestamp_column
11,databaseA,jdbc:hive2://myserver:10000,test11_data,ts_utc
22,databaseB,jdbc:hive2://myserver:10000,test22_data,ts_utc

Here is some sample instance data describing data columns found in the external database tables. This instance data might be used to populate the Measurement object above.

test_num,column name,variable name,description,type,units
11,c_1,AMBIENT_PRESSURE,Ambient Pressure,Float,psi
11,c_2,AMBIENT_TEMP,Ambient Temperature,Float,degF
11,c_3,AIR_FLOW,Air Flow,Float,litersPerSec
22,p_1,AMBIENT_PRESSURE,Ambient Pressure,Float,psi
22,p_2,AMBIENT_TEMP,Ambient Temperature,Float,degF

5. Managing EDC configurations

Configuration for EDC is stored in the triple store, in a dedicated "service configuration" dataset (http://research.ge.com/semtk/services), using the semantic model sparqlEdcServices.owl. This dataset contains EDC configuration information (e.g. when to trigger EDC, what service to use to generate queries, what service to use to execute queries, etc.).

An EDC mnemonic is a named configuration for a set of EDC behaviors. EDC mnemonics can be listed, inserted, and deleted using the Utility Service. To add a new mnemonic to configure your new EDC connection, use the edc/insertEdcMnemonic endpoint - sample input can be found here, and a sample call is sketched below. The rest of this section describes what the mnemonic configuration means.
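
In the sketch, myserver:port points at the Utility Service (the exact path prefix depends on your deployment) and mnemonic.json stands in for the sample input referenced above.

curl -X POST http://myserver:port/edc/insertEdcMnemonic --header "Content-Type: application/json" --data @mnemonic.json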

If the EDC Dispatcher receives a query that contains the trigger class value, then it will attempt to retrieve external data, provided that a trigger class source (data source) can also be found for the given query. In the example below, if the EDC Dispatcher receives a query with TimeSeriesValue, then it will trigger retrieval of external data, provided that a Table can also be found.

Mnemonic Name           SampleMnemonic
Triggerclassvalue       http://research.ge.com/kdl/sparqlgraph/externalDataConnection#TimeSeriesValue
Triggerclasssource      http://research.ge.com/project/timeseries#Table

The EDC Dispatcher will use the following service to generate queries against the external data store. In this example, the edcQueryGenerationService is used, which generates Hive time series queries like this:

select cast(ts_utc AS timestamp) as timestamp, c_1 AS AMBIENT_PRESSURE, c_2 AS AMBIENT_TEMP, c_3 AS AIR_FLOW from mydatabase.test11_data order by timestamp

Genclienttype           com.ge.research.semtk….RDBTimeCoherentTimeSeriesQueryGenClient
Genprotocol             http
Genurl                  myserver
Genport                 12054
Genendpoint             edcQueryGeneration/hive/generateTimeSeriesQueries

The EDC Dispatcher will use the following service to execute the queries on the external data store. In this example, the external queries are executed on Hive.

Execclienttype          com.ge.research.semtk.edc.client.HiveClient
Execprotocol            http
Execurl                 myserver
Execport                12055
Execendpoint            hiveService/queryHive

The configuration below tells the UI which constraint components to display. In this example, the UI should allow a user to constrain time series variables (e.g. retrieve data only where PRESSURE < 100).

Constraintname          timeSeriesRow
Constraintvarclass      http://research.ge.com/kdl/sparqlgraph/externalDataConnection#TimeSeriesGenerator
Constraintvarkeyname    variableName

Below are the parameters needed to generate the queries. The EDC Dispatcher will automatically query the triple store for these parameters and provide them to the query generator. The set of Paramsparqlids must match the list of column names expected by the query generator (in this example, they match the columns of the locationAndValueInfoTable shown in step 2). Note that these will never be returned as columns, even if they are selected in the original query, as EDC will replace them with the retrieved ExternalDataValues. In this example, in order to generate the proper Hive time series queries, we need tag name, variable name, database server, database name, table name, and timestamp column. These values are passed to the edcQueryGenerationService, which uses them to generate the Hive queries.

Paramclass              http://research.ge.com/kdl/sparqlgraph/externalDataConnection#TimeSeriesGenerator
Paramkeyname            tag
Paramsparqlid           TagName

Paramclass              http://research.ge.com/kdl/sparqlgraph/externalDataConnection#TimeSeriesGenerator
Paramkeyname            variableName
Paramsparqlid           VarName

Paramclass              http://research.ge.com/kdl/sparqlgraph/externalDataConnection#Location
Paramkeyname            databaseServer
Paramsparqlid           DatabaseServer

Paramclass              http://research.ge.com/kdl/sparqlgraph/externalDataConnection#Location
Paramkeyname            database
Paramsparqlid           Database

Paramclass              http://research.ge.com/kdl/sparqlgraph/externalDataConnection#Location
Paramkeyname            tableName
Paramsparqlid           TableName

Paramclass              http://research.ge.com/kdl/sparqlgraph/externalDataConnection#DateTimeMarker
Paramkeyname            timeStampColumnList
Paramsparqlid           TimeStampColumnList

Below are restrictions placed on the external query. If the query does not conform, the EDC Dispatcher will reject it. In this example, we want to reject the query if it asks for data from more than 50 (Hive) locations, so as not to overwhelm the Hive server with queries.

restrictclass           http://research.ge.com/kdl/sparqlgraph/externalDataConnection#Location
restrictoperator        <=
restrictoperand         50

6. Query EDC data

General information about storing and executing nodegroups is here.

For nodegroups that include external data connections (EDC), it is possible to add EDC-related constraints to a nodegroup execution. For example, an EDC constraint may specify thresholds for returning externally-stored time series data, such as returning only rows where the value in column PRESSURE exceeds 50.

EDC constraints and their UI components are customizable per external data type.

EDC constraints are currently supported in the SparqlForm UI but not in the SparqlGraph UI. A SparqlForm UI element appears when an EDC element is added to the query.

Using EDC constraints via service endpoints

Below is an example of specifying EDC constraints as inputs to a service call to /nodeGroupExecution/dispatchById.

{
  "nodeGroupId": "GetMachineTestInfo",
  "sparqlConnection": "...",
  "externalDataConnectionConstraints":"{\"@constraintSet\":{\"@op\":\"AND\",\"@constraints\":[{\"@var\":\"PRESSURE\",\"@operator1\":\">\",\"@value1\":{\"@value\":\"50\",\"@type\":\"number\"}},{\"@var\":\"TEMPERATURE\",\"@operator1\":\"=\",\"@value1\":{\"@value\":\"11\",\"@type\":\"number\"}}]},\"@timeConstraint\":{\"@var\":\"_Time_\",\"@operator1\":\">=\",\"@value1\":{\"@value\":\"02/02/2018 4:00:00 AM\",\"@type\":\"datetime\"}}}"
}

Below is a sample curl call using EDC constraints.

curl -X POST http://myserver:port/nodeGroupExecution/dispatchById --header "Content-Type: application/json" --header "Accept: */*" --data '{ "nodeGroupId": "GetMachineTestInfo", "runtimeConstraints":"[{\"SparqlID\":\"?personnel\",\"Operator\":\"MATCHES\",\"Operands\":[\"Bob Smith\"]}]", "sparqlConnection": "{\"name\": \"%NODEGROUP%\",\"domain\": \"%NODEGROUP%\",\"model\": [],\"data\": []}", "externalDataConnectionConstraints": "{\"@timeConstraint\":{\"@var\":\"_Time_\",\"@operator1\":\">=\",\"@value1\":{\"@value\":\"03/29/2017 04:00:00 PM\",\"@type\":\"datetime\"},\"@operator2\":\"<=\",\"@value2\":{\"@value\":\"03/29/2017 07:00:00 PM\",\"@type\":\"datetime\"}}}" }'