Tying DFDL to Data Format Specification Documents - OpenGridForum/DFDL GitHub Wiki
We get asked for a way to tie the contents of a DFDL schema back to the specification documents for the format.
Beyond mentioning these documents in block comments in the DFDL schema files, doing this at a more granular level has proven quite problematic. The paper specifications are often so incomplete, so incorrect, and so dependent on supplementary information that they cannot be referenced in a meaningful way.
This is a case study of development of a fairly large DFDL Schema for a message data format. The schema ultimately contains more than 100 ".dfdl.xsd" files.
We were given two specification documents, one older and one newer.
We were told they correspond to the R8 (Release 8) and R9 versions, though that terminology appears nowhere in the documents, their file names, or their titles.
We were given test data for perhaps 85% of the messages in these specs.
We were also given files derived from an OMG IDL data description. These are ostensibly machine readable, but they are not complete. They were consulted as a third source of information whenever a discrepancy between a spec and the de-facto test data was found.
Numerous clarifications about the specifications were provided in emails. Bugs in the specs came to light wherever the specs disagreed with the de-facto test data.
The R9 version was described as a superset of R8.
Very late in the project we were advised that there are a number of incompatibilities between R8 and R9, so that R9 is clearly not just a superset, and furthermore that processing of R8 data must reject R9-only messages.
The list of messages with deltas/incompatibilities between R8 and R9 was provided in an email message. This list, however, identified the messages by section numbers of a draft of the R9 spec that is different from (and likely older than) the one we were given. The section numbers of many messages had changed, making positive identification of the exact set of R9 message deltas impossible.
We worked out the correspondence of the messages between the spec we have for R9, and the ones described in the delta email by use of other information about the kinds of messages. This was submitted back to the customer for confirmation.
Test data was divided into R9 and R8 data, but this data was of different forms. The R8 data was provided as PCAP files of TCP data. The R9 data was provided in ".bin" files of individual messages.
The ".bin" files were in many cases corrupted by LF characters added at the end of the data. These are binary data files and should not have line endings.
Testing showed that some messages that were supposedly R9 actually matched the R8 specification.
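The stray trailing LF can be screened for mechanically. The following is a minimal sketch, not the procedure used on the project: the function names and the directory argument are hypothetical. A binary message can legitimately end in the byte 0x0A, so the check only flags candidates. Each one still has to be confirmed against the message length the format calls for.

```python
from pathlib import Path


def has_trailing_lf(data: bytes) -> bool:
    """True when the final byte is LF (0x0A). This marks a suspect, not proof of corruption."""
    return data.endswith(b"\n")


def strip_one_trailing_lf(data: bytes) -> bytes:
    """Remove exactly one trailing LF, leaving all other bytes untouched."""
    return data[:-1] if has_trailing_lf(data) else data


def find_suspect_files(directory):
    """List the .bin files in a (hypothetical) test-data directory that end in LF."""
    return sorted(
        path
        for path in Path(directory).glob("*.bin")
        if has_trailing_lf(path.read_bytes())
    )
```

Only one byte is removed per file, so a message whose real payload ends in LF and then gained a stray one is not over-trimmed.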
The above seems chaotic but is actually typical of data projects.
Given all the above, what’s a specification?
When we create a DFDL schema for cybersecurity use, system accreditors want to understand where in the "specification" the description of a particular DFDL element resides. But what is the specification in this case?
The contribution of DFDL here is significant. Once the de-facto data, the various specs, and the clarifications are brought together in the DFDL schema, it becomes a new reference for the data format.
When creating a DFDL schema from a specification, one should capture the prose of the specification as well as the format details. If the DFDL schema captures descriptive text about the elements of the format as well, then new documentation can be generated from the DFDL schema.
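As a rough illustration of generating documentation from descriptive text held in the schema, the sketch below assumes the prose is stored in standard XML Schema `xs:annotation`/`xs:documentation` children of named declarations. The sample schema, its element and type names, and the helper functions are all hypothetical. A real tool would also need to follow includes and imports across the many schema files.

```python
import xml.etree.ElementTree as ET

NS = {"xs": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical schema fragment; the names and prose are invented for illustration.
SAMPLE_SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="messageHeader">
    <xs:annotation>
      <xs:documentation>Fixed-length header that begins every message.</xs:documentation>
    </xs:annotation>
  </xs:element>
  <xs:simpleType name="trackNumber">
    <xs:annotation>
      <xs:documentation>Identifier assigned to a track.</xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:int"/>
  </xs:simpleType>
</xs:schema>
"""


def extract_docs(schema_text):
    """Return (kind, name, prose) for every named declaration that carries documentation."""
    root = ET.fromstring(schema_text)
    entries = []
    for node in root.iter():
        name = node.get("name")
        if name is None:
            continue
        doc = node.find("xs:annotation/xs:documentation", NS)
        if doc is not None and doc.text:
            kind = node.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            entries.append((kind, name, " ".join(doc.text.split())))
    return entries


def render(entries):
    """Render the entries as one plain-text line per declaration."""
    return "\n".join(f"{kind} {name}: {text}" for kind, name, text in entries)
```

Calling `render(extract_docs(SAMPLE_SCHEMA))` yields one line per documented declaration, which a fuller generator could turn into hyperlinked HTML.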
The DFDL schema is also operationalized and can be shown to be correct with de-facto data. The Apache Daffodil project has been promoting a standard schema project layout so as to package DFDL schemas with associated test data and test infrastructure so that all DFDL schemas have Built-in-Self-Test (BIST).
Corrections to the DFDL schema are intended to be robustly versioned: the schema is a formal language artifact, so it can be managed with configuration management systems just as software is. The ability to directly regression test the DFDL schemas with de-facto data, and to add more such data to cover fixes to discovered flaws, enables the DFDL schema to converge on correctness as versions are refined over time.
Human documents simply do not support this level of robust configuration management, and because they are not directly testable, a new version created to fix one error often introduces another flaw elsewhere.
Human documents are also notoriously bad at maintaining accurate cross references; hence, section content is often repeated where a single shared definition would leave no doubt that the exact same thing is intended. DFDL schemas can, and should, avoid repeating themselves. Shared DFDL type, group, and element declarations can be used to ensure this, and the documentation generated from the DFDL schema would naturally contain cross-references (or links) to shared definitions.
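The regression idea can be sketched independently of any particular DFDL processor. In the sketch below, `parse` is a hypothetical stand-in for whatever function wraps the processor (it is not a Daffodil API), and `toy_parse` is an invented one-byte-id format used only so that the example runs. Each fix to the schema would add a case, and every case is rerun on every schema version.

```python
def run_regression(cases, parse):
    """Return the names of the cases whose parse result differs from the expected result.

    cases: iterable of (name, input_bytes, expected_result) tuples.
    parse: callable taking bytes; a hypothetical wrapper around a DFDL processor.
    """
    failures = []
    for name, data, expected in cases:
        try:
            actual = parse(data)
        except Exception as exc:  # a parse error counts as a result to compare
            actual = f"error: {exc}"
        if actual != expected:
            failures.append(name)
    return failures


def toy_parse(data):
    """Invented stand-in format: one message-id byte followed by the payload."""
    if not data:
        raise ValueError("empty message")
    return {"id": data[0], "payload": data[1:].hex()}
```

An empty failure list means the current schema version still agrees with all of the de-facto data collected so far.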
DFDL is a standard; hence, it can be learned and an investment in learning it is usable over a career of working with data.
DFDL also has an open-source implementation, Apache Daffodil, so it costs nothing to use, and tools such as a DFDL-to-HTML documentation generator can start from the Apache Daffodil code base, which already knows how to navigate and assemble DFDL schemas. Tools for testing, data-format debugging, and so on are all feasible because the DFDL standard has longevity.
Finally, DFDL is not a programming language - its descriptive capabilities match what is needed to describe data formats. Because a DFDL schema only describes data and does not execute arbitrary logic, data formats described using DFDL are more secure than those described using a complete, fully powerful programming language.
Ultimately, a comprehensive DFDL schema should become the new reference for a data format standard. Over time it should displace the paper/pdf-style of specifications, replacing them with documentation generated from the DFDL schema which is hyperlinked to shared sections so it is not redundant.
The testability of a DFDL schema, the incorporation of test data into the DFDL schema project tree, and robust configuration management together ensure that the behavior of the DFDL schema and the documentation generated from it converge toward agreement.