FDC (Federated Data) - ge-semtk/semtk GitHub Wiki

Should I use EDC or FDC?

SemTK EDC and SemTK FDC both enable retrieval of data from sources outside of the knowledge graph. The following diagram shows the differences between them.

FDC

SemTK's FDC capability allows for ingesting external data into the semantic store for federated query.

SemTK has two FDC capabilities:

  • FDC Caches: ingest data from various sources into the semantic store, building a graph which can be queried for as long as needed
  • FDC Queries: execute a single thin-thread query, automatically ingesting data into the semantic store for this query only

Both FDC Caches and FDC Queries use FDC Data Generators.

FDC Data Generators

An FDC Data Generator is a REST service that takes one or more tables of input, and generates a table of output. These services are used for FDC Caches and FDC Queries.

Typical uses of FDC Data Generators include:

  • calculating a value
  • retrieving information like weather from a third-party REST service
  • running queries against relational data sources

For sample code in Java, see the FdcSampleService module. General steps are:

  • check the right number of tables have been sent in
  • check each table has the expected columns
  • query, generate, or calculate new data based on the inputs
  • return a table of output
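The general steps above can be sketched in plain Java as follows. This is only an illustration, not the code of the FdcSampleService module: the class name GeneratorStepsSketch, the list-of-maps table representation, the column names, and the trivial "hemisphere" calculation are all assumptions made for this example. A real service would expose this logic through a REST endpoint and would likely use SemTK's own table classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the four general steps of an FDC Data Generator. */
final class GeneratorStepsSketch {

    /** A table is modeled here as a list of rows, each row mapping column name to value. */
    static List<Map<String, String>> generate(List<List<Map<String, String>>> inputTables) {
        // Step 1: check that the right number of tables was sent in (one, in this sketch).
        if (inputTables.size() != 1) {
            throw new IllegalArgumentException("expected 1 input table, got " + inputTables.size());
        }
        Set<String> expected = Set.of("latitude", "longitude"); // assumed column names
        List<Map<String, String>> output = new ArrayList<>();
        for (Map<String, String> row : inputTables.get(0)) {
            // Step 2: check that each row carries the expected columns.
            if (!row.keySet().containsAll(expected)) {
                throw new IllegalArgumentException("missing columns, expected " + expected);
            }
            // Step 3: calculate new data from the inputs (a placeholder calculation).
            double latitude = Double.parseDouble(row.get("latitude"));
            Map<String, String> outRow = new LinkedHashMap<>(row);
            outRow.put("hemisphere", latitude >= 0 ? "N" : "S");
            output.add(outRow);
        }
        // Step 4: return a table of output.
        return output;
    }
}
```

Rejecting malformed input early keeps a bad request from producing a partially filled output table.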

In Java, these services typically extend NodegroupProviderRestController in order to implement a /getNodegroup endpoint, which serves nodegroups by id out of the jar's /src/main/resources/nodegroups/.json

FDC Cache

An FDC cache uses FDC Data Generators to load a complex set of data from multiple data sources.

An FDC cache is loaded by repeating these steps:

  • run a query on the graph (or bootstrap the first time)
  • send the resulting table to an FDC Data Generator
  • ingest the results
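The loop described above can be sketched as follows. This is not the FdcCacheService implementation: the class CacheLoadSketch, its Step type, and the modeling of tables and the graph as plain lists of strings are assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of the FDC cache loading loop; not the FdcCacheService implementation. */
final class CacheLoadSketch {

    /** One cache step. Tables and the graph are modeled as plain lists of strings. */
    static final class Step {
        final Function<List<String>, List<String>> inputQuery; // stands in for running the input nodegroup
        final Function<List<String>, List<String>> generator;  // stands in for calling an FDC Data Generator

        Step(Function<List<String>, List<String>> inputQuery,
             Function<List<String>, List<String>> generator) {
            this.inputQuery = inputQuery;
            this.generator = generator;
        }
    }

    /** Runs the steps in list order and returns the resulting graph contents. */
    static List<String> load(List<Step> steps, List<String> bootstrapTable) {
        List<String> graph = new ArrayList<>(); // stands in for the semantic store
        for (int i = 0; i < steps.size(); i++) {
            Step step = steps.get(i);
            // The first step is bootstrapped; later steps query what has been ingested so far.
            List<String> input = (i == 0) ? bootstrapTable : step.inputQuery.apply(graph);
            List<String> results = step.generator.apply(input); // send the table to the generator
            graph.addAll(results);                               // ingest the results
        }
        return graph;
    }
}
```

Because each later step queries the graph built so far, the output of one generator can feed the input of the next.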

Configuring an FDC Cache

The following example demonstrates the information that must be loaded into the SemTK service graph to specify an FDC Cache. It is written in SADL and is called an "FDC Cache Spec". The SADL and OWL for the parent class can be found at sparqlGraphLibrary/src/main/resources/semantics.

import "http://research.ge.com/semtk/fdcCacheSpec".

SampleSpec is a FdcCacheSpec,
	with id "sampleSpec",
	
	with step ( a FdcCacheStep 
		with retrieval( a Retrieval 
			with inputNodegroupId "sampleInputNg1",
			with serviceURL "http://localhost:12066/fdcSample/aircraftLocation",
			with ingestNodeGroupId "fdcSampleAircraftLocation"
		)
		with sequence 1
	),
	with step ( a FdcCacheStep 
		with retrieval( a Retrieval 
			with inputNodegroupId "fdcSampleCacheGetLocations",
			with serviceURL "http://localhost:12066/fdcSample/elevation",
			with ingestNodeGroupId "fdcSampleElevation"
		)
		with sequence 2
	).

This example imports fdcCacheSpec, whose SADL and OWL files can be found in the distribution.

This FdcCacheSpec contains

  • its id: "sampleSpec"
  • two steps (any number of steps may be used)

Each FdcCacheStep contains

  • a retrieval
  • a sequence number starting with 1 and incrementing by one, indicating the order in which the steps are executed

Each Retrieval contains

  • the name of the input nodegroup which is run to generate input (except for the first step). This nodegroup may be in the nodegroup store, or returned by the FDC Data Generator.
  • the URL for the FDC Data Generator
  • the name of the ingestion nodegroup, which also may be in the nodegroup store or returned by the FDC Data Generator. Note that unlike FDC Queries, this ingestion step may ingest arbitrarily complex sets of triples.
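To make the ordering rule concrete, the sketch below models the steps of a cache spec in memory and sorts them by sequence number. The class CacheSpecSketch and its method are hypothetical and not part of SemTK. Rejecting a spec whose sequence numbers are not exactly 1, 2, ..., n is an assumption of this example; the text above does not say how the service handles such a spec.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory model of an FDC Cache Spec's steps; not a SemTK class. */
final class CacheSpecSketch {

    /** One FdcCacheStep together with the fields of its Retrieval. */
    static final class Step {
        final int sequence;
        final String inputNodegroupId;
        final String serviceURL;
        final String ingestNodegroupId;

        Step(int sequence, String inputNodegroupId, String serviceURL, String ingestNodegroupId) {
            this.sequence = sequence;
            this.inputNodegroupId = inputNodegroupId;
            this.serviceURL = serviceURL;
            this.ingestNodegroupId = ingestNodegroupId;
        }
    }

    /** Returns the steps in execution order, rejecting sequences other than 1, 2, ..., n (assumed rule). */
    static List<Step> executionOrder(List<Step> steps) {
        List<Step> sorted = new ArrayList<>(steps);
        sorted.sort(Comparator.comparingInt((Step s) -> s.sequence));
        for (int i = 0; i < sorted.size(); i++) {
            if (sorted.get(i).sequence != i + 1) {
                throw new IllegalArgumentException(
                    "step sequence numbers must run from 1 to " + sorted.size());
            }
        }
        return sorted;
    }
}
```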

To load an FDC Cache Spec, generate OWL from the SADL file and load it directly into the SemTK service graph.

To get a list of loaded FDC Cache Specs, use the UtilityService endpoint /utility/fdc/getFdcCacheSpecList.

To delete an FDC Cache Spec, use the UtilityService endpoint /utility/fdc/deleteFdcCacheSpec.

Running an FDC Cache

A cache may be established through the FdcCacheService.

Parameters to the endpoint /cacheUsingTableBootstrap include:

  • specId of the cache being run
  • conn json (see below)
  • bootstrapTableJsonStr which is a string of json representing the table to be sent to the first step's FDC Data Generator instead of running a query (see the swagger endpoint for the latest example format).
  • maxAgeSeconds such that if the exact same cache has been loaded to this connection within this number of seconds, the endpoint will succeed without performing any new load

The conn json

This connection's model must contain all the model information required by the nodegroups used by this cache spec.

The connection's first data endpoint will be used to hold the cached data.

  • this must be either empty, or a previous load of the exact same specId with the same bootstrap table
  • if it contains old data (determined by maxAgeSeconds) then the graph will be cleared and a new load will take place
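The rules for the first data endpoint, combined with maxAgeSeconds, amount to a small decision table, sketched below. The class CacheReuseSketch and its Action values are hypothetical. Two points are assumptions rather than documented behavior: a load whose age exactly equals maxAgeSeconds is treated as still fresh, and a graph holding data from a different specId or bootstrap table is rejected.

```java
/** Hypothetical sketch of the cache reuse decision; not the FdcCacheService implementation. */
final class CacheReuseSketch {

    enum Action { LOAD, SKIP, CLEAR_AND_RELOAD, REJECT }

    /**
     * @param graphEmpty           true if the first data endpoint holds no data
     * @param sameSpecAndBootstrap true if the existing data came from the same specId and bootstrap table
     * @param ageSeconds           seconds since the existing data was loaded
     * @param maxAgeSeconds        the maxAgeSeconds parameter of the request
     */
    static Action decide(boolean graphEmpty, boolean sameSpecAndBootstrap,
                         long ageSeconds, long maxAgeSeconds) {
        if (graphEmpty) {
            return Action.LOAD;              // nothing cached yet: perform the load
        }
        if (!sameSpecAndBootstrap) {
            return Action.REJECT;            // assumed: foreign data is an error, not silently cleared
        }
        if (ageSeconds <= maxAgeSeconds) {
            return Action.SKIP;              // recent enough: succeed without a new load
        }
        return Action.CLEAR_AND_RELOAD;      // old data: clear the graph and load again
    }
}
```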

FDC Queries

FDC queries use FDC Data Generators to retrieve FDC Types, one class at a time. The FDC query dispatcher walks through the nodegroup and automatically builds sub-nodegroups to generate input tables for the FDC Data Generators, and ingests their results.

Creating an FDC Type

An FDC type consists of:

  • an FDC class
  • an FDC Data Generation service

An FDC class represents data that will be retrieved or generated by the dispatcher early in the SPARQL query execution process. The FDC Data Generation service retrieves or generates the data.

Creating a new FDC Class

An FDC Class is an OWL/RDF class that is a subclass of FDCData.

FDCData is defined in federatedDataConnection, which should first be synced to your triplestore for importing. Then a new FDC data type might look like this in SADL:

uri "http://research.ge.com/semtk/demo/fdcDistance" alias fdcDistance. 

import "http://research.ge.com/semtk/federatedDataConnection".
 
FDCDistance is a type of fdc:FDCData,
	described by distanceNm with a single value of type float,
	described by location1 with a single value of type Location,
	described by location2 with a single value of type Location.
	
Location is a top-level class,
	described by latitude with a single value of type float,
	described by longitude with a single value of type float.

In this example, note that FDCDistance will be retrieved or calculated, while Location is a class necessary for this calculation, but not an FDC data type itself.
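As an illustration of the kind of value a data generator might calculate for distanceNm, the sketch below computes a great-circle distance in nautical miles from two latitude/longitude pairs using the haversine formula. The formula used by the actual sample service is not stated here, so the choice of haversine, the class name DistanceSketch, and the spherical Earth radius of 6371 km are all assumptions.

```java
/** Hypothetical distance calculation for an FDCDistance-style value; not SemTK code. */
final class DistanceSketch {

    /** Mean Earth radius of about 6371 km, converted at 1.852 km per nautical mile. */
    static final double EARTH_RADIUS_NM = 6371.0 / 1.852;

    /** Great-circle distance in nautical miles between two points given in decimal degrees. */
    static double distanceNm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        // Haversine term: squared half-chord length between the two points.
        double a = sinLat * sinLat
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2)) * sinLon * sinLon;
        return 2 * EARTH_RADIUS_NM * Math.asin(Math.sqrt(a));
    }
}
```

On this spherical model, one degree of longitude along the equator comes out at roughly 60 nautical miles, which is a quick sanity check for the units.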

This type's OWL file must be synced with the triplestore so that it is accessible via OWL import.

Configuring an FDC Type

This example shows the information which must be in the services graph in order to configure an FDC Type.

import "http://research.ge.com/semtk/fdcServices".

DistanceConfig is a FdcConfig,
	with fdcClass fdcSampleTest:Distance,
	with serviceURL "http://localhost:12066/fdcSample/distance",
	with ingestNodegroupId "fdcSampleDistance",
	with input DistanceInput1,
	with input DistanceInput2.

DistanceInput1 is a FdcInput,
	with inputIndex 1,
	with subgraphLink (a FdcInputSubgraphLink 
		with subjectClass fdcSampleTest:Distance,
		with predicateProp location1,
		with objectClass fdcSampleTest:Location
		)
	with param (a FdcParam with columnName "latitude1", with classURI fdcSampleTest:Location, with propertyURI fdcSampleTest:latitude),
	with param (a FdcParam with columnName "longitude1", with classURI fdcSampleTest:Location, with propertyURI fdcSampleTest:longitude),
	with param (a FdcParam with columnName "location1", with classURI fdcSampleTest:Location).

DistanceInput2 is a FdcInput,
	with inputIndex 2,
	with subgraphLink (a FdcInputSubgraphLink 
		with subjectClass fdcSampleTest:Distance, 
		with predicateProp location2,
		with objectClass fdcSampleTest:Location
		)
	with param (a FdcParam with columnName "latitude2", with classURI fdcSampleTest:Location, with propertyURI fdcSampleTest:latitude),
	with param (a FdcParam with columnName "longitude2", with classURI fdcSampleTest:Location, with propertyURI fdcSampleTest:longitude),
	with param (a FdcParam with columnName "location2", with classURI fdcSampleTest:Location).

This example imports fdcServices, whose SADL and OWL files can be found in the distribution.

DistanceConfig is an FdcConfig with the following

  • the class being configured
  • the URL of the FDC Data Generator
  • the name of a nodegroup (in the nodegroup store, or returned by the FDC Data Generator) which can be used to ingest the results. This nodegroup must only ingest the named class and its properties.
  • a list of inputs

FdcInputs each contain

  • an integer index starting at 1 and incrementing by 1
  • a subgraph link where either the subjectClass or the objectClass is the FdcType, and the predicateProp and remaining class describe a link in the nodegroup from the FdcType node to another node. When building an input nodegroup to generate data, the FDC dispatcher will break this link and use the side not containing the FdcType as the input nodegroup.
  • a list of params. Each represents a column in the input table sent to the FDC data generator. Each is specified by the column name, and a classURI / propertyURI which describes where in the input nodegroup the value can be found.
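The subgraph link rule can be illustrated with a short sketch: given the FDC class and the two ends of a link, pick the end that is not the FDC class, since that side becomes the input nodegroup. The class InputLinkSketch and its method are hypothetical and only mirror the description above; the real dispatcher works on nodegroups rather than on class names.

```java
/** Hypothetical sketch of choosing the input side of an FdcInputSubgraphLink; not SemTK code. */
final class InputLinkSketch {

    /**
     * Returns the class on the side of the link that does not contain the FDC type.
     * If both ends name the FDC class, this sketch simply returns the object side.
     */
    static String inputSideClass(String fdcClass, String subjectClass, String objectClass) {
        if (fdcClass.equals(subjectClass)) {
            return objectClass;   // link runs from the FDC type to the input node
        }
        if (fdcClass.equals(objectClass)) {
            return subjectClass;  // link runs from the input node to the FDC type
        }
        throw new IllegalArgumentException("neither end of the link is the FDC class " + fdcClass);
    }
}
```

For DistanceInput1 in the example configuration, the FDC class is fdcSampleTest:Distance and the link's object class is fdcSampleTest:Location, so the Location side would supply the latitude1, longitude1, and location1 columns.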