Next Steps - nlaprell/odi-ikm-sql-to-marklogic-dmsdk GitHub Wiki

Document Generation

The IKM currently generates a flat JSON or XML file for each row in the source query. The RECORD_IDENTIFIER option can be set to the column name that holds a unique identifier for a group of rows (the source query should be sorted by this field). When the option is set, the generated document becomes an array of JSON objects, one per row in the group.

Example Source

RECORD_ID  PERSON_ID  NAME    ADDRESS         PHONE
1          328        Frank   123 N. 1st Ave
2          328        Frank                   555.555.5555
3          328        Frank                   123.456.7890
4          454        Robert  456 N. 2nd Ave

Example Output Document

With the RECORD_IDENTIFIER option set to "PERSON_ID", the generated document would appear as follows:

[
  {
    "RECORD_ID": "1",
    "PERSON_ID": "328",
    "NAME": "Frank",
    "ADDRESS": "123 N. 1st Ave",
    "PHONE": ""
  },{
    "RECORD_ID": "2",
    "PERSON_ID": "328",
    "NAME": "Frank",
    "ADDRESS": "",
    "PHONE": "555.555.5555"
  },{
    "RECORD_ID": "3",
    "PERSON_ID": "328",
    "NAME": "Frank",
    "ADDRESS": "",
    "PHONE": "123.456.7890"
  }
]

Note that record 4 is not part of the document because it has a different identifier; it would be written to its own document.
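The grouping behavior can be sketched as follows. This is a minimal Python illustration of the concept, not the IKM's actual code (which runs inside ODI); the column and option names mirror the example above.

```python
import itertools
import json

def group_rows(rows, record_identifier):
    """Group rows (already sorted by record_identifier) and emit one
    JSON array document per group, mirroring the IKM's behavior."""
    docs = {}
    for key, group in itertools.groupby(rows, key=lambda r: r[record_identifier]):
        docs[key] = json.dumps(list(group))
    return docs

rows = [
    {"RECORD_ID": "1", "PERSON_ID": "328", "NAME": "Frank", "ADDRESS": "123 N. 1st Ave", "PHONE": ""},
    {"RECORD_ID": "2", "PERSON_ID": "328", "NAME": "Frank", "ADDRESS": "", "PHONE": "555.555.5555"},
    {"RECORD_ID": "3", "PERSON_ID": "328", "NAME": "Frank", "ADDRESS": "", "PHONE": "123.456.7890"},
    {"RECORD_ID": "4", "PERSON_ID": "454", "NAME": "Robert", "ADDRESS": "456 N. 2nd Ave", "PHONE": ""},
]

# Two documents result: one for person 328 (three rows), one for person 454.
docs = group_rows(rows, "PERSON_ID")
```

Because `itertools.groupby` only groups adjacent rows, the source query must be sorted by the identifier column, as noted above.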

Next Steps

Logic needs to be created in MarkLogic to convert the flat JSON file into a nested (multidimensional) JSON or XML document. This should be user configurable, but should include a baseline behavior.

  • Empty/Null values should be ignored.
  • The presence of more than a single value for a given key should cause the final element to be an array (or a node with "value" children for XML).
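The baseline merge could be sketched as below. This is an illustrative Python sketch of the two rules above; the real logic would live in MarkLogic (e.g. server-side JavaScript or XQuery), and the function name is hypothetical.

```python
def merge_rows(rows):
    """Baseline merge: drop empty/null values; when a key occurs with
    more than one distinct value, collect the values into a list."""
    merged = {}
    for row in rows:
        for key, value in row.items():
            if value in (None, ""):
                continue  # empty/null values are ignored
            if key not in merged:
                merged[key] = value
            elif isinstance(merged[key], list):
                if value not in merged[key]:
                    merged[key].append(value)
            elif merged[key] != value:
                merged[key] = [merged[key], value]
    return merged

# The three grouped rows from the example document above:
rows = [
    {"RECORD_ID": "1", "PERSON_ID": "328", "NAME": "Frank", "ADDRESS": "123 N. 1st Ave", "PHONE": ""},
    {"RECORD_ID": "2", "PERSON_ID": "328", "NAME": "Frank", "ADDRESS": "", "PHONE": "555.555.5555"},
    {"RECORD_ID": "3", "PERSON_ID": "328", "NAME": "Frank", "ADDRESS": "", "PHONE": "123.456.7890"},
]
merged = merge_rows(rows)
# NAME collapses to a single "Frank"; PHONE becomes a two-element array.
```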

Configuration

The data steward should be able to configure relational aspects of the final generated document, including row relations. This can be done via a configuration file, or programmatically through the source queries.

Configuration File

An XML/JSON file can be used to define how rows relate within a document based on identifiers present in the row. Take the following source query result as an example:

DOCUMENT_TYPE  ENTITY_TYPE  RECORD_ID  PERSON_ID  NAME   EVENT_ID  START  END
person         name         1          328        Frank
person         event        2          328               333
person         event_start  2          328               333       08:30
person         event_end    2          328               333              14:00

The configuration should be set for the DOCUMENT_TYPE "person" and reference the ENTITY_TYPE values and their relationship to each other. It could also identify datatypes. In the above example, we want to set configuration that defines (probably as a default) name and event as top level elements of the document, then defines "event_start" and "event_end" as children of "event."
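No schema has been defined for this file yet; one hypothetical JSON shape for the "person" configuration might be the following, where every key name is illustrative:

```json
{
  "documentType": "person",
  "entities": {
    "name":        { "parent": null },
    "event":       { "parent": null },
    "event_start": { "parent": "event", "datatype": "time" },
    "event_end":   { "parent": "event", "datatype": "time" }
  }
}
```

A null (or absent) "parent" marks a top-level element, matching the default described above, while "datatype" covers the optional datatype identification.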

Programmatic Configuration

An alternative to the use of configuration files would be to include configuration options in the source query (and therefore the document that is ingested into MarkLogic). Take the following example source query:

DOCUMENT_TYPE  ENTITY_TYPE  PARENT_ENTITY  RECORD_ID  PERSON_ID  NAME   EVENT_ID  START  END
person         name                        1          328        Frank
person         event                       2          328               333
person         event_start  event          2          328               333       08:30
person         event_end    event          2          328               333              14:00

The transformation logic in MarkLogic would use known column names to identify each row's parent entity. In this example, the null PARENT_ENTITY in the first two rows indicates they belong at the root of the generated XML/JSON, while the value "event" in the last two rows would cause those records to be inserted as children of the "event" element. Recursively processing each row would yield a basic map of where the data belongs, without the need for external configuration.
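A simplified sketch of that nesting step follows. It is a Python illustration under two assumptions not stated in the wiki: rows arrive parent-first (so a single pass suffices instead of true recursion), and the metadata columns are exactly DOCUMENT_TYPE, ENTITY_TYPE, and PARENT_ENTITY.

```python
def build_tree(rows):
    """Nest each row's payload under its PARENT_ENTITY; rows with a
    null/empty parent land at the root of the document."""
    root = {}
    nodes = {}  # entity name -> node, for parent lookups
    meta = ("DOCUMENT_TYPE", "ENTITY_TYPE", "PARENT_ENTITY")
    for row in rows:
        payload = {k: v for k, v in row.items() if k not in meta and v not in (None, "")}
        entity, parent = row["ENTITY_TYPE"], row.get("PARENT_ENTITY")
        target = nodes[parent] if parent else root
        node = target.setdefault(entity, {})
        node.update(payload)
        nodes[entity] = node
    return root

rows = [
    {"DOCUMENT_TYPE": "person", "ENTITY_TYPE": "name", "PARENT_ENTITY": "",
     "RECORD_ID": "1", "PERSON_ID": "328", "NAME": "Frank"},
    {"DOCUMENT_TYPE": "person", "ENTITY_TYPE": "event", "PARENT_ENTITY": "",
     "RECORD_ID": "2", "PERSON_ID": "328", "EVENT_ID": "333"},
    {"DOCUMENT_TYPE": "person", "ENTITY_TYPE": "event_start", "PARENT_ENTITY": "event",
     "RECORD_ID": "2", "PERSON_ID": "328", "EVENT_ID": "333", "START": "08:30"},
    {"DOCUMENT_TYPE": "person", "ENTITY_TYPE": "event_end", "PARENT_ENTITY": "event",
     "RECORD_ID": "2", "PERSON_ID": "328", "EVENT_ID": "333", "END": "14:00"},
]

# "name" and "event" become top-level elements; event_start/event_end
# nest under "event".
tree = build_tree(rows)
```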

Logging/Exception handling

Logging is currently handled via IKM options that enable logging to an external file for status, success, and failures.

Next Steps

  • Use of internal Oracle logging would be ideal. Initial research suggested that the ODI console is not available to IKMs for logging. There may be a better place for it to go.
  • Batch success logging could record the number of milliseconds elapsed since the previous success/failure, providing basic benchmarking.
  • Batch failure logging currently records a failed batch number (the document name carries no meaning in this context since it is a UUID), which can be looked up to determine the failed documents. Aggregating identifying information from the source query here would be more useful.

Misc

  • Basic code cleanup and refactoring would help future development. An additional IKM task could hold shared functions to tidy the code.