Data Verification - ge-high-assurance/RACK GitHub Wiki

Verification Methodology

The RACK tooling provides multiple methods for verifying the data that is loaded into RACK:

The check tool in the ASSIST toolset
- Example run of check against RACK-in-a-box image (run this locally or within the container/VM):
```
check -m http://localhost:3030/
```
  Note that the above requires SWI Prolog. It is already installed in the RACK box image, so the command can be run there via docker exec:
```
$ docker container ls
CONTAINER ID   IMAGE                            ...   NAME
1ab3beb878d7   gehighassurance/rack-box:...           festive_einstein
...
$ docker exec -it {CONTAINER_ID_or_NAME} /home/ubuntu/RACK/assist/bin/check -m http://localhost:3030/
```
  It is also possible to run check locally if SWI Prolog is installed. The check script is a Unix script whose shebang line runs the swipl (SWI Prolog) command. In non-Unix environments, invoke it through swipl directly:
```
swipl -s assist/bin/check -- -m http://localhost:3030/
```
  If you receive a "Connection refused" message when running it locally, make sure the RACK box is running with port 3030 enabled. You can confirm this by visiting http://localhost:3030/ in your browser; that URL should serve an "Apache Jena Fuseki" page.
  
  See https://github.com/ge-high-assurance/RACK/tree/master/assist#assist-dv----data-verification and https://github.com/ge-high-assurance/RACK/tree/master/assist/bin#command-line-usage for more information.
SemTK ingestion, which verifies the data against both the model and the nodegroup ingestion rules. It checks data types and qualified cardinality.
The SemTK cardinality checker, which checks cardinality counts (see the reports/section/cardinality wiki page).
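
The "Connection refused" troubleshooting step described for the check tool can also be scripted. The following is a minimal sketch, assuming `curl` is installed; the function name `fuseki_reachable` is hypothetical and not part of RACK or ASSIST. It only confirms that an HTTP server answers at the given URL, not that the data behind it is valid.

```shell
#!/bin/sh
# Hypothetical helper (not part of RACK/ASSIST): succeed if an HTTP server
# answers at the given URL, fail otherwise.
fuseki_reachable() {
    url="${1:-http://localhost:3030/}"
    # --silent: no progress output
    # --fail: non-zero exit status on HTTP errors (status >= 400)
    # --max-time: give up after 5 seconds instead of hanging
    curl --silent --fail --max-time 5 --output /dev/null "$url"
}

# Usage: only run check when the endpoint answers.
# if fuseki_reachable http://localhost:3030/; then
#     check -m http://localhost:3030/
# else
#     echo "Nothing answered on port 3030; is the RACK box running?" >&2
# fi
```

Running the guard first separates a connectivity problem from a genuine verification failure.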

Which methodology to use

SemTK verification is performed at ingestion time, when data is loaded via a nodegroup.

The check tool verifications can be run at any time on either local OWL files or a live RACK database to perform verification on the existing data.

There is significant overlap between the two methods: many issues are detected by either one (see the table below). SemTK verification is explicitly defined and can be extended by creating another nodegroup query in the SemTK UI. The check tool provides both automatic verification and explicit verification (the latter by editing the assist/bin/checks files).

A common approach is to create or update query nodegroups during interactive exploration, and to use the check tool, which requires no user interaction, in automated processes such as CI testing. Nodegroups created through the SemTK process can be migrated into the check process once they prove to be viable and desirable verifications.
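
For the automated case, a CI step might wrap the check invocation as sketched below. This assumes that check exits with a non-zero status when it reports problems, which this page does not state and which should be confirmed against the ASSIST documentation. The function name `run_verification` and the log file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical CI helper: run a verification command, keep its output in a
# log file, and propagate its exit status so the CI job can fail.
# ASSUMPTION: the command signals problems through a non-zero exit status.
run_verification() {
    log_file="$1"
    shift
    status=0
    # Run the remaining arguments as the command, capturing stdout and stderr.
    "$@" > "$log_file" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        echo "verification failed (exit $status); see $log_file" >&2
    fi
    return "$status"
}

# Usage in a CI script (same check invocation as shown earlier on this page):
# run_verification check.log check -m http://localhost:3030/ || exit 1
```

Passing the command as arguments keeps the wrapper independent of how check is launched, so the docker exec or swipl forms can be substituted without changing the function.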

Types of Verification

There are two types of verification:

- validity: does the data map to valid ontology elements and have valid values?
- consistency: does the data integrate properly with existing data?

The following is a summary of the various checks currently available:

| Check | Tool | Validity? | Consistency? | Description |
|---|---|---|---|---|
| missing notes | check | ✅ | | Each defined item should have a description of that item. This can be provided via a SADL `note` field or the corresponding CSV column entry. |
| ontology basis | check | ✅ | | All defined items should have an inheritance from one of the items defined in the `PROV-S` set of base classes. Items which do not inherit from one of these are likely to be data islands and not integrated properly with the rest of the data. |
| instance types | check | | ✅ | Verifies that an object isn't declared to be an instance of multiple separate ontology classes. At present, the RACK ontology does not utilize multiple inheritance (although RDF itself allows this). |
| cardinality | check, semtk cardinality | ✅ | ✅ | Verifies that object property relations conform to the cardinality restrictions defined in the ontology (e.g. must be one, must be more than 0, etc.). In RDF/OWL terms, a "Restriction". |
| multiple optional | check | ✅ | ✅ | Verifies that object properties marked as optional are either not specified or specified only once (conceptually a subset of cardinality, but in RDF/OWL terms, a "FunctionalProperty"). |
| invalid enum value | check, ingest | ✅ | | Specification of a property value that is not one of the valid enumerated set of values. |
| wrong type | check, ingest | ✅ | | Providing a value of the wrong type, based on the ontology (e.g. specifying a string where an object is needed). |
| value range exceeded | check, ingest | ✅ | | Specification of a value outside the defined range for values of that property (when defined). |

SemTK ingest templates:

- optionally contain validation steps that are independent of the model, such as non-empty columns (see data validation)
- translate input strings into typed values using the rules explained at ingestion type handling

Additional checks may be provided in future versions of RACK.