ingest - nsip/n3 GitHub Wiki

Background

First some basics on how n3 manages data.

n3 works only with data in JSON format.

As part of the wider n3 ecosystem we provide transformers that convert other common formats, such as CSV and XML (both generic XML and SIF XML), into JSON.

All data is assigned to a context in n3.

A context is an arbitrary container that will hold data from multiple sources and potentially in multiple formats.

Contexts should be created according to the business needs of the user. Typical patterns for us in handling education data could be:

  • a context for all data about a teaching-group
  • a context for all data about a school
  • a context for all data of a particular type (such as SIF)
  • a context for data with a particular use (such as NAPLAN results)

Data can only be ingested by n3 into an existing context, so the starting point for any data import is to create a context that will act as the destination for the data.

Creating a Context

Once you've started an instance of the n3w server, it exposes an endpoint for creating contexts. The endpoint can be found here:

    http://localhost:1323/admin/newdemocontext

Note: in the current distribution this method creates only contexts with a status of demo.

Calling this endpoint with a payload that includes a user-id (userName) and a name for the context (contextName) will create the context in n3.

Note: userName can be any valid string. contextName is used to create a streaming channel on which messages are received, so it must conform to the underlying nats-streaming-server requirement for channel names: only the characters a-z, A-Z, 0-9 and underscore may be used in the name.
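The naming rule above can be checked before calling the endpoint. The following is a minimal sketch, assuming the rule is exactly "one or more of a-z, A-Z, 0-9 or underscore"; the function name is hypothetical and is not part of n3.

```python
import re

# Assumption: a valid name is one or more of a-z, A-Z, 0-9 or underscore,
# as described in the note above. Other channel-name limits are not covered.
_CONTEXT_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_context_name(name: str) -> bool:
    """Return True if name uses only the characters allowed for context names."""
    return _CONTEXT_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("demoSchool1", "my_school", "my-school", "school 1"):
        print(candidate, is_valid_context_name(candidate))
```

Checking the name on the client side gives a clearer error than waiting for the streaming layer to reject the channel.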

Using curl to create a context:

curl -s  -X POST http://localhost:1323/admin/newdemocontext -d userName=yourusername -d contextName=yourcontextname

e.g.

curl -s  -X POST http://localhost:1323/admin/newdemocontext -d userName=nsip2 -d contextName=demoSchool1

This method will respond with:

{
  "message": "Context created and activated successfully. Use this token to publish: /n3/publish and query: /n3/graphql data. Token must be provided in Autorization: Bearer header.",
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJkZW1vIiwiY25hbWUiOiJkZW1vU2Nob29sMSIsInVuYW1lIjoibnNpcDIifQ.ApE7iD1WKjE8Qb9YLTGTEa1YaBob1eaGgC3ehRVT7TU"
}
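For users who prefer a script to curl, the same call can be sketched in Python using only the standard library. This assumes the endpoint accepts the form-encoded fields that `curl -d` sends; the function names are hypothetical, and `create_context` needs a running n3w server.

```python
import json
import urllib.parse
import urllib.request

NEW_CONTEXT_URL = "http://localhost:1323/admin/newdemocontext"


def build_new_context_request(user_name, context_name, url=NEW_CONTEXT_URL):
    """Build (but do not send) the POST request that creates a demo context."""
    # curl -d sends the fields form-encoded, so the body is encoded the same way.
    body = urllib.parse.urlencode(
        {"userName": user_name, "contextName": context_name}
    ).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def create_context(user_name, context_name):
    """Send the request and return the token from the JSON reply."""
    request = build_new_context_request(user_name, context_name)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["token"]


# Usage (requires a running server):
#   token = create_context("nsip2", "demoSchool1")
```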

The token returned is for this user in the specified context. Keep the token, since it is needed for publishing and querying data in n3: it must be provided in the 'Authorization: Bearer' header of any web call made to the publish or query endpoints.

In the case of publishing data, this token identifies the user who is submitting new data to the system, and the context that the data is to be submitted to.
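The sample token has the shape of a JSON Web Token (three base64url segments separated by dots), so its middle segment can be decoded to see the user and context it carries. The sketch below only decodes the payload and does not verify the signature; the claim names it prints are simply what this sample token contains, not a documented contract.

```python
import base64
import json

# The token returned in the example response above.
SAMPLE_TOKEN = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJhdWQiOiJkZW1vIiwiY25hbWUiOiJkZW1vU2Nob29sMSIsInVuYW1lIjoibnNpcDIifQ."
    "ApE7iD1WKjE8Qb9YLTGTEa1YaBob1eaGgC3ehRVT7TU"
)


def token_claims(token: str) -> dict:
    """Decode the payload segment of a JWT-shaped token (no signature check)."""
    _header, payload, _signature = token.split(".")
    # base64url segments omit padding, so restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


if __name__ == "__main__":
    print(token_claims(SAMPLE_TOKEN))
    # {'aud': 'demo', 'cname': 'demoSchool1', 'uname': 'nsip2'}
```

This is only useful for debugging, for example to confirm which context a stored token belongs to.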

Configuring a Context

Now that you have a valid context that can receive data, the context needs to be configured.

When a context is created it creates the following directory structure in the n3w folder:

├── contexts // - root contexts folder
   ├── nsip2 // - user of the context
   │   └── demoSchool1 // - context name
   │       ├── crdt // - crdt management folder
   │       │   ├── config // *config file here - datatypes.toml
   │       │   ├── recv
   │       │   └── send
   │       └── d6 // - data store folder
   │           └── config // *config file here - datatypes.toml

and creates default context configuration files in the locations shown above.

The content of these files is identical. Just remember, once you have updated the configuration, to put a copy in both locations.

Note: yup, we know this is annoying, the dev roadmap has a task to put configs in a single location to avoid this duplication, will update docs when this is released.
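Until the configs live in one place, a small helper can keep the two copies in step. This is a minimal sketch, assuming the folder layout shown above and that you edit the copy under crdt/config; the function name is hypothetical.

```python
import shutil
from pathlib import Path

CONFIG_NAME = "datatypes.toml"


def sync_context_config(context_dir, source="crdt"):
    """Copy datatypes.toml from the edited folder to the other config folder.

    context_dir is the context folder itself, e.g. contexts/<user>/<context>.
    source is the folder that holds the edited copy: "crdt" or "d6".
    """
    context_dir = Path(context_dir)
    target = "d6" if source == "crdt" else "crdt"
    src = context_dir / source / "config" / CONFIG_NAME
    dst = context_dir / target / "config" / CONFIG_NAME
    shutil.copyfile(src, dst)  # overwrites the stale copy
    return dst


# Usage, run from the n3w folder:
#   sync_context_config("contexts/nsip2/demoSchool1")
```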

N3 tries hard to involve the user as little as possible in having to maintain data schemas and relationships, but some configuration is always necessary to provide useful meta-data for the system about the data it will see in a context.

So, in a text editor, open one of the config files (datatypes.toml), and set up the necessary configuration.

By default the configs created for a new context contain entries for SIF, XAPI and a number of other sample json data types which can be used as a basis. So if you are dealing with SIF and XAPI data there are probably no changes required, but if you want to add your own data formats then the following guide explains the different sections of the config file.

Configuration in detail

n3 will make sense of any object it ingests, by breaking it down into its component nodes (for JSON or XML), expressing those nodes as tuples, and constructing a graph database for those tuples. That graph database can then be queried dynamically. This breakdown of objects presumes that (a) each object has a unique identifier, and (b) each object has attributes that form links on a graph (i.e. if two attributes in different objects have the same value, and are identified as link attributes, then a graph link is formed between the two objects).
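To make the idea of "expressing nodes as tuples" concrete, here is a minimal sketch that flattens a decoded JSON object into (object id, path, value) tuples. The tuple shape and the path notation are assumptions chosen for illustration; they are not taken from the n3 source.

```python
def flatten(node, path=""):
    """Yield (path, value) pairs for every leaf of a decoded JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, node


def to_tuples(object_id, obj):
    """Express an object as (object id, attribute path, value) tuples."""
    return [(object_id, path, value) for path, value in flatten(obj)]


if __name__ == "__main__":
    student = {"StudentPersonal": {"RefId": "A1", "LocalId": "s1"}}
    for row in to_tuples("A1", student):
        print(row)
```

Once every object is reduced to tuples like these, two objects that share a value in a link attribute can be joined without either one declaring a schema.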

In order to make the best sense of the objects it is provided, n3 has a configuration file which can be edited, to classify the objects it ingests into distinct data models. This classification involves:

  1. working out what data model the object belongs to, based on its attributes
  2. assigning an attribute of the data model as its unique identifier. (If no single attribute is present, a combination of attributes can be nominated instead.)
  3. identifying the attributes of the model to be used as links in the overall graph

If an object is not classified as belonging to a data model, it is assigned to the JSON fallback data model: a unique ID is generated and automatically assigned to the object, and it is stored and made accessible through GraphQL. However, the object will not be represented in the object graph for the node, so it will not be available for traversal queries (see the queries section), although it is available to other query types.

If a given object has no field identified as a unique identifier, then a system-generated unique id will be created for it.

For example, xAPI statements do not require an id to be provided. xAPI will often represent a stream of facts rather than entities that need to be updated - once a fact is asserted it won't need to be updated in a traditional database sense so the need for a unique entity id is limited. Each statement still needs an individual id within the datastore, however, and so it's legitimate for the system to create one. System generated ids are not presented in query output since they weren't in the original data.

Currently, the default configuration of the data classifier is in code: https://github.com/nsip/n3-deep6/blob/master/config.go; this is used to create the start-up configs when a new context is created, but as config is refactored it will exist as a default config file in future releases.

All attributes are named using JSON Paths.

To illustrate: the data classifier included with n3 to classify the sample data differentiates SIF data and xAPI data as follows:

[[classifier]]
data_model = "SIF"
required_paths = ["*.RefId"]
n3id = "*.RefId"
links = ["RefId","LocalId"]

[[classifier]]
data_model = "XAPI"
required_paths = ["actor.name", "actor.mbox", "object.id", "verb.id"]
n3id = "id"
links = ["actor.mbox","actor.name","object.id","object.definition.name"]

[[classifier]]
data_model = "Syllabus"
required_paths = ["learning_area", "subject", "stage"]
n3id = "id"
links = ["learning_area", "subject", "stage"]
unique = ["subject","stage"]

  • required_paths
    • Any object with an attribute RefId one level down from the root is considered to belong to the SIF data model (e.g. {"StudentPersonal": {"RefId": ... } })
      • An alternate approach would be to nest all SIF objects in a {"SIF": ...} wrapper. In that case, the required path would be ["SIF"]
    • Any object with all of the attributes actor.name, actor.mbox, object.id and verb.id is considered to belong to the xAPI data model (e.g. { "actor": { "name": ..., "mbox": ... }, "object" : { "id": ... }, "verb" : { "id": ... } })
    • Any object with the attributes learning_area, subject, stage is considered to belong to the Syllabus data model
  • n3id, unique
    • Any attribute RefId one level down from the root is used as the identifier of objects under the SIF data model
    • The attribute id is used as the identifier of objects under the xAPI and Syllabus data models
  • links, unique
    • Any attribute that contains in its pathname RefId or LocalId is a candidate link under the SIF data model
    • Any attribute that contains in its pathname the strings actor.mbox, actor.name, object.id, object.definition.name is a candidate link under the xAPI data model
    • Any attribute that contains in its pathname the strings learning_area, subject, stage is a candidate link under the Syllabus data model
    • A candidate link with value VAL in object S is converted to a graph link ("references") between objects S and O if:
      • VAL is the same as the unique identifier of O (an id-based link)
      • VAL is the value of any other attribute contained in O (a "property link")
    • Two objects with compound identifiers (unique) are linked if their identifiers are identical (e.g. subject + year level for syllabus and for lesson)
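The classification rules in this list can be sketched in a few lines of Python. This is an illustration only, not the n3-deep6 implementation: it assumes that "*" matches any single key at that level, that classifiers are tried in order and the first match wins, and that an object with no matching classifier falls back to "JSON". The function names are hypothetical.

```python
def get_path(obj, path):
    """Return all values found at a dotted path; "*" matches any key."""
    nodes = [obj]
    for part in path.split("."):
        next_nodes = []
        for node in nodes:
            if not isinstance(node, dict):
                continue
            if part == "*":
                next_nodes.extend(node.values())
            elif part in node:
                next_nodes.append(node[part])
        nodes = next_nodes
    return nodes


# Mirrors the required_paths and n3id entries of the sample config.
CLASSIFIERS = [
    {"data_model": "SIF", "required_paths": ["*.RefId"], "n3id": "*.RefId"},
    {"data_model": "XAPI",
     "required_paths": ["actor.name", "actor.mbox", "object.id", "verb.id"],
     "n3id": "id"},
    {"data_model": "Syllabus",
     "required_paths": ["learning_area", "subject", "stage"], "n3id": "id"},
]


def classify(obj, classifiers=CLASSIFIERS):
    """Return (data model, identifier or None) for a decoded JSON object."""
    for rule in classifiers:
        if all(get_path(obj, path) for path in rule["required_paths"]):
            ids = get_path(obj, rule["n3id"])
            # None stands for "no id supplied, the system would generate one".
            return rule["data_model"], (ids[0] if ids else None)
    return "JSON", None


if __name__ == "__main__":
    print(classify({"StudentPersonal": {"RefId": "A1"}}))
    print(classify({"title": "unclassified"}))
```

An xAPI statement without an id classifies as XAPI with an identifier of None in this sketch, which corresponds to the case where the system generates an id.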

As a result, explicit SIF-style links based on GUIDs are constructed (the value of StudentPersonalRefId matches a recognised instance of StudentPersonal.RefId). However, any attribute value given as a LocalId of a SIF object is also searched for in every other object and, if found, is used as the basis of a link. The same applies to the email address of a student under xAPI (actor.mbox): this is used as the basis of a link with any object containing the same email address as an attribute (e.g. StudentPersonal.PersonInfo.EmailList.Email), despite the fact that the SIF data model and xAPI data model do not share any student identifiers.
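The two kinds of link described here (id-based and property-based) can be illustrated with a small sketch. It is a simplification under stated assumptions: candidate link values are passed in already extracted, every leaf value of the target object counts as a possible match, and the sample data and function names are invented for the example.

```python
def leaf_values(node):
    """Yield every scalar value held anywhere inside a decoded JSON value."""
    if isinstance(node, dict):
        for child in node.values():
            yield from leaf_values(child)
    elif isinstance(node, list):
        for child in node:
            yield from leaf_values(child)
    else:
        yield node


def find_links(objects, candidates):
    """Return a set of (source id, target id, kind) links.

    objects:    {object id: decoded JSON object}
    candidates: {object id: [values of that object's link attributes]}
    """
    links = set()
    for source_id, values in candidates.items():
        for value in values:
            for target_id, target in objects.items():
                if target_id == source_id:
                    continue
                if value == target_id:
                    links.add((source_id, target_id, "id"))
                elif any(value == leaf for leaf in leaf_values(target)):
                    links.add((source_id, target_id, "property"))
    return links


if __name__ == "__main__":
    objects = {
        "sp-1": {"StudentPersonal": {"RefId": "sp-1", "Email": "a@example.com"}},
        "enrol-1": {"Enrollment": {"RefId": "enrol-1", "StudentPersonalRefId": "sp-1"}},
        "stmt-1": {"actor": {"mbox": "a@example.com"}},
    }
    candidates = {"enrol-1": ["sp-1"], "stmt-1": ["a@example.com"]}
    print(sorted(find_links(objects, candidates)))
```

Here the enrolment links to the student by identifier, while the xAPI statement links to the same student only because both contain the same email value.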

Link attributes can generate an excess of useless links unless they are chosen carefully.

It is quite valid to provide no links for a given data-model. The hexastore in n3 makes no distinction between forward and backward links, so explicit linking is not mandatory for all data-models.

SIF data could be added with no linking specification. When xAPI data is added that does have links specified, the linker will search for SIF objects that have candidate value-links and connect them even when the SIF specification does not call out explicit links.

To add new data-models to the classifier, edit the sections as described above to provide the necessary hints to the graph engine. New default config files are generated whenever a new context is created. Reminder: for any n3 context, the classifier config files can be found in the following locations (the folder directly under contexts is the user that owns the context, n3demo in this example):

[n3-installation-folder]/contexts/n3demo/[your-context-name]/crdt/config
[n3-installation-folder]/contexts/n3demo/[your-context-name]/d6/config

Publishing data

For these examples we'll publish data to the default demonstration context and user. (This context and a set of sample data can be imported to n3 by the bundled load(.exe) tool - this is simply a convenience wrapper around the calls that will be described below for users who do not have access to curl.)

Once a context is configured, the process to publish data to it is relatively simple.

Let's create the default demo context, which will be automatically assigned a copy of the default config.

curl -s  -X POST http://localhost:1323/admin/newdemocontext -d userName=n3Demo -d contextName=mySchool

The server will respond with:

 {"message":"Context created and activated successfully. Use this token to publish: /n3/publish and query: /n3/graphql data. Token must be provided in Autorization: Bearer header.","token":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJkZW1vIiwiY25hbWUiOiJteVNjaG9vbCIsInVuYW1lIjoibjNEZW1vIn0.VTD8C6pwbkQ32u-vvuHnxq3xijdwNTd54JAyt1iLF3I"}

With this token we can now submit data to the context.

To submit some sample data (we'll use an existing data file from those bundled with n3), issue the following command (assuming that you run curl in the installation folder of n3):

curl -X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhdWQiOiJkZW1vIiwiY25hbWUiOiJteVNjaG9vbCIsInVuYW1lIjoibjNEZW1vIn0.VTD8C6pwbkQ32u-vvuHnxq3xijdwNTd54JAyt1iLF3I" \
--data-binary @sample_data/lessons/lessons.json \
http://localhost:1323/n3/publish

Here http://localhost:1323/n3/publish is the API endpoint to send data to, and the payload is the sample lessons data file.

On success the server will respond with:

"data published ok"

or will provide an error description if there are any issues.
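The publish call can also be scripted. The sketch below mirrors the curl command using Python's standard library; the function names are hypothetical, and `publish_file` needs a running n3w server and a valid token.

```python
import urllib.request

PUBLISH_URL = "http://localhost:1323/n3/publish"


def build_publish_request(token, payload: bytes, url=PUBLISH_URL):
    """Build (but do not send) a publish request carrying the bearer token."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


def publish_file(token, path):
    """Post the raw bytes of a JSON file and return the server's reply text."""
    with open(path, "rb") as handle:
        request = build_publish_request(token, handle.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (requires a running server and the token for your context):
#   print(publish_file(token, "sample_data/lessons/lessons.json"))
```

Reading the file as bytes matches curl's --data-binary, which sends the file unchanged.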

Once data is successfully imported to the system you can move on to the query section of this wiki to find out how to retrieve data and make queries.