Next Steps
Document Generation
The IKM currently generates one flat JSON or XML file per row of the source query. The RECORD_IDENTIFIER option can be set to the name of the column that holds a unique identifier for a group of rows (the source query should be sorted by this column). When set, the generated document becomes an array of JSON objects, one per row in the group.
Example Source
RECORD_ID | PERSON_ID | NAME | ADDRESS | PHONE |
---|---|---|---|---|
1 | 328 | Frank | 123 N. 1st Ave | |
2 | 328 | Frank | | 555.555.5555 |
3 | 328 | Frank | | 123.456.7890 |
4 | 454 | Robert | 456 N. 2nd Ave | |
Example Output Document
The generated document, with the RECORD_IDENTIFIER option set to "PERSON_ID", would appear as follows:
```json
[
{
"RECORD_ID": "1",
"PERSON_ID": "328",
"NAME": "Frank",
"ADDRESS": "123 N. 1st Ave",
"PHONE": ""
},{
"RECORD_ID": "2",
"PERSON_ID": "328",
"NAME": "Frank",
"ADDRESS": "",
"PHONE": "555.555.5555"
},{
"RECORD_ID": "3",
"PERSON_ID": "328",
"NAME": "Frank",
"ADDRESS": "",
"PHONE": "123.456.7890"
}
]
```
Note that record 4 is not part of this document, since it has a different PERSON_ID; it would be written to its own document.
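A minimal sketch of this grouping pass, in Java (the IKM and the DMSDK are both Java). The class and method names are illustrative, not the IKM's actual code; it assumes the source result set arrives pre-sorted by the identifier column, as the option requires:

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.UUID;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class RowGrouper {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Reads a result set sorted by identifierColumn and emits one JSON array per group. */
    public static void groupRows(ResultSet rs, String identifierColumn) throws Exception {
        ResultSetMetaData meta = rs.getMetaData();
        ArrayNode group = MAPPER.createArrayNode();
        String currentId = null;
        while (rs.next()) {
            String id = rs.getString(identifierColumn);
            if (currentId != null && !currentId.equals(id)) {
                emit(group);                        // identifier changed: flush the group
                group = MAPPER.createArrayNode();
            }
            currentId = id;
            ObjectNode row = group.addObject();     // one object per source row
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                String value = rs.getString(i);
                row.put(meta.getColumnName(i), value == null ? "" : value); // nulls become ""
            }
        }
        if (group.size() > 0) emit(group);          // flush the final group
    }

    private static void emit(ArrayNode doc) {
        // The IKM would hand the document to the DMSDK WriteBatcher under a
        // UUID-based URI; printing stands in for that step here.
        System.out.println("/" + UUID.randomUUID() + ".json:\n" + doc.toPrettyString());
    }
}
```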
Next Steps
Logic needs to be created in MarkLogic to convert the flat JSON file into a nested (multidimensional) JSON/XML document. This should be user-configurable, but should include a baseline behavior (sketched after the list below):
- Empty/Null values should be ignored.
- The presence of more than a single value for a given key should cause the final element to be an array (or a node with "value" children for XML).
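A minimal sketch of that baseline merge, written as illustrative Java over Jackson trees. The real logic would run inside MarkLogic (e.g. as server-side code), so treat this as executable pseudocode for the behavior, not the implementation:

```java
import java.util.Iterator;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class BaselineMerge {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /**
     * Collapses an array of flat row objects into one object. Empty/null values
     * are dropped; a key seen with more than one distinct value becomes an array
     * (identical repeats, like NAME in the example, collapse to a single scalar).
     */
    public static ObjectNode merge(JsonNode rows) {
        ObjectNode out = MAPPER.createObjectNode();
        for (JsonNode row : rows) {
            for (Iterator<Map.Entry<String, JsonNode>> it = row.fields(); it.hasNext(); ) {
                Map.Entry<String, JsonNode> field = it.next();
                JsonNode v = field.getValue();
                if (v.isNull() || v.asText().isEmpty()) continue; // baseline: ignore empties
                String key = field.getKey();
                String value = v.asText();
                JsonNode existing = out.get(key);
                if (existing == null) {
                    out.put(key, value);                          // first value: scalar
                } else if (existing.isArray()) {
                    ((ArrayNode) existing).add(value);            // already an array: append
                } else if (!existing.asText().equals(value)) {
                    ArrayNode arr = MAPPER.createArrayNode();     // second distinct value:
                    arr.add(existing.asText());                   // promote scalar to array
                    arr.add(value);
                    out.set(key, arr);
                }
            }
        }
        return out;
    }
}
```

Applied to the example document above, PERSON_ID and NAME stay scalars while RECORD_ID and PHONE become arrays.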
Configuration
The data steward should be able to configure relational aspects of the final generated document, including row relations. This can be done via a configuration file or programmatically through the source queries.
Configuration File
An XML/JSON file can be used to define how rows relate within a document based on identifiers present in the row. Take the following source query result as an example:
DOCUMENT_TYPE | ENTITY_TYPE | RECORD_ID | PERSON_ID | NAME | EVENT_ID | START | END |
---|---|---|---|---|---|---|---|
person | name | 1 | 328 | Frank | | | |
person | event | 2 | 328 | | 333 | | |
person | event_start | 2 | 328 | | 333 | 08:30 | |
person | event_end | 2 | 328 | | 333 | | 14:00 |
The configuration should be keyed to the DOCUMENT_TYPE "person" and describe the ENTITY_TYPE values and their relationships to each other; it could also identify datatypes. In the above example, we want configuration that defines (probably as a default) "name" and "event" as top-level elements of the document, then defines "event_start" and "event_end" as children of "event."
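One hypothetical shape for such a configuration file, in JSON; every field name below is invented for illustration, not an implemented schema:

```json
{
  "documentType": "person",
  "entities": {
    "name":        { "parent": null },
    "event":       { "parent": null },
    "event_start": { "parent": "event", "datatype": "time" },
    "event_end":   { "parent": "event", "datatype": "time" }
  }
}
```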
Programmatic Configuration
An alternative to the use of configuration files would be to include configuration options in the source query (and therefore the document that is ingested into MarkLogic). Take the following example source query:
DOCUMENT_TYPE | ENTITY_TYPE | PARENT_ENTITY | RECORD_ID | PERSON_ID | NAME | EVENT_ID | START | END |
---|---|---|---|---|---|---|---|---|
person | name | | 1 | 328 | Frank | | | |
person | event | | 2 | 328 | | 333 | | |
person | event_start | event | 2 | 328 | | 333 | 08:30 | |
person | event_end | event | 2 | 328 | | 333 | | 14:00 |
The transformation logic in MarkLogic would use known column names to identify each row's parent entity. In this example, the first two records have a null PARENT_ENTITY, which places them at the root of the generated XML/JSON, while the value "event" in the last two rows causes those records to be inserted as children of the "event" element. Recursively processing each row builds a basic map of where the data belongs, without the need for external configuration.
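A sketch of that assembly, with the same caveats as above (illustrative Java, not MarkLogic code). It assumes parent rows precede their children in the sorted source, and that empty strings mark null values, as in the generated documents:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class ParentEntityAssembler {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    // Structural columns that drive nesting but are not copied into the output.
    private static final Set<String> META = Set.of("DOCUMENT_TYPE", "ENTITY_TYPE", "PARENT_ENTITY");

    /**
     * Builds a nested document from flat rows. Rows whose PARENT_ENTITY is empty
     * land at the root; the rest nest under the entity named by PARENT_ENTITY.
     */
    public static ObjectNode assemble(JsonNode rows) {
        ObjectNode root = MAPPER.createObjectNode();
        Map<String, ObjectNode> byEntity = new HashMap<>();
        for (JsonNode row : rows) {
            String entity = row.path("ENTITY_TYPE").asText();
            String parent = row.path("PARENT_ENTITY").asText();
            ObjectNode target = parent.isEmpty() ? root : byEntity.get(parent);
            ObjectNode node = target.putObject(entity);
            row.fields().forEachRemaining(f -> {
                String value = f.getValue().asText();
                if (!value.isEmpty() && !META.contains(f.getKey())) {
                    node.put(f.getKey(), value);   // copy only populated data columns
                }
            });
            byEntity.put(entity, node);            // register so children can attach here
        }
        return root;
    }
}
```

For the example above, this yields "name" and "event" objects at the root, with "event_start" and "event_end" nested inside "event".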
Logging/Exception Handling
Logging is currently handled via IKM options that enable logging to an external file for status, success, and failures.
Next Steps
- Using ODI's internal logging would be ideal. Initial research suggests the ODI console is not available to IKMs for logging, but there may be a better target for log output.
- Batch success logging could record the elapsed milliseconds since the previous success/failure event, providing basic benchmarking (see the sketch after this list).
- Batch failure logging currently provides a failed batch number (the document name carries no meaning in context, since it is a UUID), which can be looked up to determine which documents failed. Aggregating identifying information from the source query into this log entry would be better.
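The batch listeners the DMSDK already exposes look like a natural place for both ideas. A sketch, with the logging target left as plain stdout for illustration:

```java
import java.util.concurrent.atomic.AtomicLong;

import com.marklogic.client.datamovement.WriteBatcher;

public class BatchTiming {
    /**
     * Attaches success/failure listeners that log the elapsed milliseconds since
     * the previous batch event, giving a rough throughput benchmark alongside
     * the batch number already used for failure lookups.
     */
    public static void instrument(WriteBatcher batcher) {
        AtomicLong lastEvent = new AtomicLong(System.currentTimeMillis());
        batcher.onBatchSuccess(batch -> {
            long now = System.currentTimeMillis();
            long elapsed = now - lastEvent.getAndSet(now);
            System.out.println("Batch " + batch.getJobBatchNumber()
                    + " succeeded, " + elapsed + " ms since last event");
        });
        batcher.onBatchFailure((batch, throwable) -> {
            long now = System.currentTimeMillis();
            long elapsed = now - lastEvent.getAndSet(now);
            // Identifying info aggregated from the source query would go here.
            System.out.println("Batch " + batch.getJobBatchNumber()
                    + " failed after " + elapsed + " ms: " + throwable.getMessage());
        });
    }
}
```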
Misc
- Basic code cleanup and refactoring would help future development. An additional IKM task could hold shared helper functions to clean the code up a bit.