Next Steps
Document Generation
The IKM currently generates one flat JSON or XML file per row of the source query. The RECORD_IDENTIFIER option can be set to the name of the column that holds a unique identifier for a group of rows (the source query should be sorted by this column). When set, the generated document becomes an array of JSON objects, one per row in the group.
Example Source
RECORD_ID | PERSON_ID | NAME | ADDRESS | PHONE |
---|---|---|---|---|
1 | 328 | Frank | 123 N. 1st Ave | |
2 | 328 | Frank | | 555.555.5555 |
3 | 328 | Frank | | 123.456.7890 |
4 | 454 | Robert | 456 N. 2nd Ave | |
Example Output Document
The generated document, with the RECORD_IDENTIFIER option set to "PERSON_ID", would appear as follows:
```json
[
{
"RECORD_ID": "1",
"PERSON_ID": "328",
"NAME": "Frank",
"ADDRESS": "123 N. 1st Ave",
"PHONE": ""
},{
"RECORD_ID": "2",
"PERSON_ID": "328",
"NAME": "Frank",
"ADDRESS": "",
"PHONE": "555.555.5555"
},{
"RECORD_ID": "3",
"PERSON_ID": "328",
"NAME": "Frank",
"ADDRESS": "",
"PHONE": "123.456.7890"
}
]
```
Note that record 4 is not part of this document, since it has a different PERSON_ID; it would be written to its own document.
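A minimal sketch of this grouping pass, in Java (the IKM and the DMSDK are both Java). The class and method names are illustrative, not the IKM's actual code; it assumes the source result set arrives pre-sorted by the identifier column, as the option requires:

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.util.UUID;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class RowGrouper {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Reads a result set sorted by identifierColumn and emits one JSON array per group. */
    public static void groupRows(ResultSet rs, String identifierColumn) throws Exception {
        ResultSetMetaData meta = rs.getMetaData();
        ArrayNode group = MAPPER.createArrayNode();
        String currentId = null;
        while (rs.next()) {
            String id = rs.getString(identifierColumn);
            if (currentId != null && !currentId.equals(id)) {
                emit(group);                        // identifier changed: flush the group
                group = MAPPER.createArrayNode();
            }
            currentId = id;
            ObjectNode row = group.addObject();     // one object per source row
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                String value = rs.getString(i);
                row.put(meta.getColumnName(i), value == null ? "" : value); // nulls become ""
            }
        }
        if (group.size() > 0) emit(group);          // flush the final group
    }

    private static void emit(ArrayNode doc) {
        // The IKM would hand the document to the DMSDK WriteBatcher under a
        // UUID-based URI; printing stands in for that step here.
        System.out.println("/" + UUID.randomUUID() + ".json:\n" + doc.toPrettyString());
    }
}
```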
Next Steps
Logic needs to be created in MarkLogic to convert the flat JSON file into a nested (multidimensional) JSON/XML document. This should be user-configurable, but should include a baseline behavior (sketched after the list below):
- Empty/Null values should be ignored.
- The presence of more than a single value for a given key should cause the final element to be an array (or a node with "value" children for XML).
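A minimal sketch of that baseline merge, written as illustrative Java over Jackson trees. The real logic would run inside MarkLogic (e.g. as server-side code), so treat this as executable pseudocode for the behavior, not the implementation:

```java
import java.util.Iterator;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class BaselineMerge {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /**
     * Collapses an array of flat row objects into one object. Empty/null values
     * are dropped; a key seen with more than one distinct value becomes an array
     * (identical repeats, like NAME in the example, collapse to a single scalar).
     */
    public static ObjectNode merge(JsonNode rows) {
        ObjectNode out = MAPPER.createObjectNode();
        for (JsonNode row : rows) {
            for (Iterator<Map.Entry<String, JsonNode>> it = row.fields(); it.hasNext(); ) {
                Map.Entry<String, JsonNode> field = it.next();
                JsonNode v = field.getValue();
                if (v.isNull() || v.asText().isEmpty()) continue; // baseline: ignore empties
                String key = field.getKey();
                String value = v.asText();
                JsonNode existing = out.get(key);
                if (existing == null) {
                    out.put(key, value);                          // first value: scalar
                } else if (existing.isArray()) {
                    ((ArrayNode) existing).add(value);            // already an array: append
                } else if (!existing.asText().equals(value)) {
                    ArrayNode arr = MAPPER.createArrayNode();     // second distinct value:
                    arr.add(existing.asText());                   // promote scalar to array
                    arr.add(value);
                    out.set(key, arr);
                }
            }
        }
        return out;
    }
}
```

Applied to the example document above, PERSON_ID and NAME stay scalars while RECORD_ID and PHONE become arrays.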
Configuration
The data steward should be able to configure relational aspects of the final generated document, including row relations. This can be done via a configuration file or programmatically through the source queries.
Configuration File
An XML/JSON file can be used to define how rows relate within a document based on identifiers present in the row. Take the following source query result as an example:
DOCUMENT_TYPE | ENTITY_TYPE | RECORD_ID | PERSON_ID | NAME | EVENT_ID | START | END |
---|---|---|---|---|---|---|---|
person | name | 1 | 328 | Frank | | | |
person | event | 2 | 328 | | 333 | | |
person | event_start | 2 | 328 | | 333 | 08:30 | |
person | event_end | 2 | 328 | | 333 | | 14:00 |
The configuration should be keyed to the DOCUMENT_TYPE "person" and describe the ENTITY_TYPE values and their relationships to each other; it could also identify datatypes. In the above example, we want configuration that defines (probably as a default) "name" and "event" as top-level elements of the document, then defines "event_start" and "event_end" as children of "event."
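One hypothetical shape for such a configuration file, in JSON; every field name below is invented for illustration, not an implemented schema:

```json
{
  "documentType": "person",
  "entities": {
    "name":        { "parent": null },
    "event":       { "parent": null },
    "event_start": { "parent": "event", "datatype": "time" },
    "event_end":   { "parent": "event", "datatype": "time" }
  }
}
```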
Programmatic Configuration
An alternative to the use of configuration files would be to include configuration options in the source query (and therefore the document that is ingested into MarkLogic). Take the following example source query:
DOCUMENT_TYPE | ENTITY_TYPE | PARENT_ENTITY | RECORD_ID | PERSON_ID | NAME | EVENT_ID | START | END |
---|---|---|---|---|---|---|---|---|
person | name | | 1 | 328 | Frank | | | |
person | event | | 2 | 328 | | 333 | | |
person | event_start | event | 2 | 328 | | 333 | 08:30 | |
person | event_end | event | 2 | 328 | | 333 | | 14:00 |
The transformation logic in MarkLogic would use known column names to identify each row's parent entity. In this example, the first two records have a null PARENT_ENTITY, which places them at the root of the generated XML/JSON, while the value "event" in the last two rows causes those records to be inserted as children of the "event" element. Recursively processing each row builds a basic map of where the data belongs, without the need for external configuration.
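A sketch of that assembly, with the same caveats as above (illustrative Java, not MarkLogic code). It assumes parent rows precede their children in the sorted source, and that empty strings mark null values, as in the generated documents:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class ParentEntityAssembler {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    // Structural columns that drive nesting but are not copied into the output.
    private static final Set<String> META = Set.of("DOCUMENT_TYPE", "ENTITY_TYPE", "PARENT_ENTITY");

    /**
     * Builds a nested document from flat rows. Rows whose PARENT_ENTITY is empty
     * land at the root; the rest nest under the entity named by PARENT_ENTITY.
     */
    public static ObjectNode assemble(JsonNode rows) {
        ObjectNode root = MAPPER.createObjectNode();
        Map<String, ObjectNode> byEntity = new HashMap<>();
        for (JsonNode row : rows) {
            String entity = row.path("ENTITY_TYPE").asText();
            String parent = row.path("PARENT_ENTITY").asText();
            ObjectNode target = parent.isEmpty() ? root : byEntity.get(parent);
            ObjectNode node = target.putObject(entity);
            row.fields().forEachRemaining(f -> {
                String value = f.getValue().asText();
                if (!value.isEmpty() && !META.contains(f.getKey())) {
                    node.put(f.getKey(), value);   // copy only populated data columns
                }
            });
            byEntity.put(entity, node);            // register so children can attach here
        }
        return root;
    }
}
```

For the example above, this yields "name" and "event" objects at the root, with "event_start" and "event_end" nested inside "event".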
Logging/Exception Handling
Logging is currently handled via IKM options that enable logging to an external file for status, success, and failures.
Next Steps
- Using ODI's internal logging would be ideal. Initial research suggests the ODI console is not available to IKMs for logging, but there may be a better target for log output.
- Batch success logging could record the elapsed milliseconds since the previous success/failure event, providing basic benchmarking (see the sketch after this list).
- Batch failure logging currently provides a failed batch number (the document name carries no meaning in context, since it is a UUID), which can be looked up to determine which documents failed. Aggregating identifying information from the source query into this log entry would be better.
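The batch listeners the DMSDK already exposes look like a natural place for both ideas. A sketch, with the logging target left as plain stdout for illustration:

```java
import java.util.concurrent.atomic.AtomicLong;

import com.marklogic.client.datamovement.WriteBatcher;

public class BatchTiming {
    /**
     * Attaches success/failure listeners that log the elapsed milliseconds since
     * the previous batch event, giving a rough throughput benchmark alongside
     * the batch number already used for failure lookups.
     */
    public static void instrument(WriteBatcher batcher) {
        AtomicLong lastEvent = new AtomicLong(System.currentTimeMillis());
        batcher.onBatchSuccess(batch -> {
            long now = System.currentTimeMillis();
            long elapsed = now - lastEvent.getAndSet(now);
            System.out.println("Batch " + batch.getJobBatchNumber()
                    + " succeeded, " + elapsed + " ms since last event");
        });
        batcher.onBatchFailure((batch, throwable) -> {
            long now = System.currentTimeMillis();
            long elapsed = now - lastEvent.getAndSet(now);
            // Identifying info aggregated from the source query would go here.
            System.out.println("Batch " + batch.getJobBatchNumber()
                    + " failed after " + elapsed + " ms: " + throwable.getMessage());
        });
    }
}
```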
Misc
- Basic code cleanup and refactoring would help future development. An additional IKM task could hold shared helper functions to clean the code up a bit.