Ingesting CDM

The DARPA Transparent Computing program defined a Common Data Model (CDM) to represent data provenance and information flow. SPADE's CDM reporter ingests provenance emitted by SPADE's CDM storage, in either Avro binary or JSON format, that conforms to the schema in the cfg/spade.storage.CDM.avsc file.

Configuring CDM reporting

The CDM reporter requires at least one argument, inputFile, which is the path of the file containing the CDM records. If the file is in JSON format, its name must include the .json extension. If the file is in Avro binary format, its name must include the .bin extension. The reporter must be added in the SPADE controller (after the SPADE server has been started):

-> add reporter CDM inputFile=/tmp/cdm.json
Adding reporter CDM... done
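
The naming rule above can be checked before the reporter is added. The following Python sketch is only an illustration: the function name detect_cdm_format is hypothetical, and how the reporter itself inspects the file name is not stated here, so the check assumes that any .json or .bin component in the name is sufficient.

```python
from pathlib import Path


def detect_cdm_format(input_file):
    """Guess the CDM encoding from the file name.

    Assumption: a '.json' or '.bin' component anywhere in the name is
    enough, since the documentation only says the name must include
    the extension. This is not taken from SPADE's implementation.
    """
    for suffix in Path(input_file).suffixes:
        if suffix == ".json":
            return "json"
        if suffix == ".bin":
            return "avro"
    raise ValueError(
        f"{input_file}: name must include the .json or .bin extension"
    )


def add_reporter_command(input_file):
    """Build the controller command after validating the file name."""
    detect_cdm_format(input_file)  # raises ValueError on a bad name
    return f"add reporter CDM inputFile={input_file}"
```

For example, add_reporter_command("/tmp/cdm.json") yields the command shown above, while a name such as /tmp/cdm.txt is rejected before it reaches the controller.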

The waitForLog=false option can be used to ensure that ingestion stops when the reporter is removed. By default, the reporter continues to process all records even after it is removed.

-> add reporter CDM inputFile=/tmp/cdm.json waitForLog=false
Adding reporter CDM... done

Collection ingestion

If the CDM records are stored in a collection of files, they can be ingested together by using the rotate option. If rotate=true is specified, the inputFile is processed first. Next, files with the same name but .1, .2, ... extensions are processed in ascending order. For example, /tmp/cdm.json, /tmp/cdm.json.1, /tmp/cdm.json.2, and /tmp/cdm.json.3 can be ingested with the command:

-> add reporter CDM inputFile=/tmp/cdm.json rotate=true
Adding reporter CDM... done
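
To preview which files a rotate=true ingestion would cover, the ordering described above can be sketched in Python. This is an approximation written for illustration, not SPADE's own code: the function name rotation_order is hypothetical, and the sketch assumes that the numeric extensions are compared as integers and that gaps in the numbering are tolerated, neither of which is confirmed here.

```python
from pathlib import Path


def rotation_order(input_file):
    """List inputFile followed by its numbered siblings, ascending.

    Assumptions (not confirmed by the SPADE documentation): extensions
    are compared numerically, so .10 sorts after .2, and missing
    numbers in the sequence are simply skipped.
    """
    base = Path(input_file)
    prefix = base.name + "."
    numbered = []
    for candidate in base.parent.iterdir():
        if not candidate.name.startswith(prefix):
            continue
        extension = candidate.name[len(prefix):]
        # Keep only purely numeric extensions such as .1, .2, .3
        if extension.isdecimal():
            numbered.append((int(extension), candidate))
    numbered.sort()

    # The inputFile itself comes first, as described above.
    ordered = [base] if base.exists() else []
    ordered.extend(path for _, path in numbered)
    return ordered
```

Calling rotation_order("/tmp/cdm.json") in a directory that holds the four files from the example returns them in the order listed there; files such as cdm.json.bak are ignored.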

The reporter can be deactivated using the following command in the SPADE controller:

-> remove reporter CDM
Shutting down reporter CDM... done