Home - jacquesfauquex/DCKV GitHub Wiki
DCKV representation of DICOM dataset
Background
The DICOM binary (DICM) metadata format extends the basic key/value paradigm along three axes:
- the attributes are ordered within lists by the 4-byte tag that identifies each attribute,
- an attribute can be multivalued,
- an attribute can be the root of an array of enclosed ordered lists of attributes.
A translation of DICM into XML facilitates discrete access to any attribute of the root ordered list, or of the enclosed ones, by means of the XML tool XPath.
Another translation of the binary model into JSON simplifies the parsing of subsets of metadata in QIDO responses using ECMAScript (JavaScript) and the many other languages which support maps (also called associative arrays or dictionaries).
Both the XML and JSON translations are text-based representations derived from the explicit binary syntax. They replace the binary structuring glue with textual markup, which makes it possible to replicate:
- the association of multiple values to one key,
- the encapsulation of various items into one sequence,
- the nesting of the various attributes of a dataset into one item
Correctly parsing an attribute implies parsing the complete context preceding it, in order to discover its chain of encapsulation. This is burdensome in simple use cases. Even when the attribute of interest sits at the root level, any sequences serialized before it force the parser to dig through them before reaching it.
Our new representation incorporates the context into the key of the attribute, so that each attribute is fully defined by its key alone.
We call this new representation "Dicom Contextualized Key Value" (DCKV).
DCKV can still be translated to and from the existing DICM, XML and JSON representations of DICOM datasets. It has been designed to serialize easily to any one of the other three.
Implementation of the "ascending order" rule
In DICOM representations, the attributes shall be ordered in tag ascending order within the base dataset (and also within any encapsulated dataset).
A tag in its binary form is a sequence of two 2-byte words (group and unit). The order of the bytes within each word depends on the endianness of the encoding. Since big endian has been deprecated, the canonical little endian representation of the tag in DICOM binary is a sequence of four bytes, as follows:
- 0 group least significant byte (g)
- 1 group most significant byte (G)
- 2 unit least significant byte (u)
- 3 unit most significant byte (U)
This serialization makes ordering tags difficult, because the bytes of each tag must be permuted before the tags can be compared. The text representation of a tag performs exactly this permutation: it is a string of 8 hexadecimal characters (two consecutive characters represent one byte) in the order GgUu, ready for ordering.
So our internal model of a tag is a chain of 4 bytes ordered GgUu.
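The permutation described above can be sketched in a few lines of Python. The function name `tag_le_to_sortable` is hypothetical and only serves the illustration; the two tags are ordinary examples chosen because their raw little endian bytes sort in the wrong order.

```python
def tag_le_to_sortable(raw: bytes) -> bytes:
    """Permute a little endian tag (g G u U) into the sortable order G g U u."""
    if len(raw) != 4:
        raise ValueError("a DICOM tag is exactly 4 bytes long")
    g, G, u, U = raw
    return bytes((G, g, U, u))


# (0008,0100) and (0008,0020) as they appear in a little endian stream
raw_tags = [bytes.fromhex("08000001"), bytes.fromhex("08002000")]

# Sorting the raw bytes puts (0008,0100) before (0008,0020): wrong order
print([t.hex() for t in sorted(raw_tags)])

# Sorting the permuted bytes yields ascending tag order
sortable = sorted(tag_le_to_sortable(t) for t in raw_tags)
print([t.hex() for t in sortable])
```

Once the bytes are in GgUu order, a plain bytewise comparison is enough to sort tags, with no per-comparison swapping.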
Key format
### SQIT
A list of attributes (AT) is called an item (IT). A special attribute type, the sequence (SQ), contains a positional list of items (IT 1 comes before IT 2, which comes before IT 3, and so on).
SQIT*
Sequences may contain other sequences recursively.
When the attribute is buried under one or more levels of sequence encapsulation, a chain of sequence-item pairs is required to locate it fully.
ATRC
In order to interpret an attribute correctly, its value representation (R) and charset (C) need to be known. We make this information available in:
- R: two ASCII letters, the VR (value representation) datatype
- C: uint16 index of the charset defined in attribute (0008,0005) of the dataset. Index defined here
We label the attribute together with its complementary R and C properties ATRC (which is 8 bytes long).
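As a minimal sketch of the 8-byte ATRC layout, the snippet below packs a tag, a VR and a charset index. The function names are hypothetical, and the big endian byte order of the charset index is an assumption (the text above only states that it is a uint16).

```python
import struct


def pack_atrc(tag: int, vr: str, charset_index: int = 0) -> bytes:
    """Pack tag (4 bytes, GgUu), VR (2 ASCII letters) and charset index (uint16)."""
    if len(vr) != 2 or not vr.isascii():
        raise ValueError("VR must be two ASCII letters")
    # ">I" writes the tag as G g U u; ">H" (big endian index) is an assumption
    return struct.pack(">I2sH", tag, vr.encode("ascii"), charset_index)


def unpack_atrc(key: bytes):
    """Split an 8-byte ATRC back into (tag, vr, charset_index)."""
    tag, vr, charset_index = struct.unpack(">I2sH", key)
    return tag, vr.decode("ascii"), charset_index


key = pack_atrc(0x00100010, "PN")
print(len(key), key.hex().upper())  # 8 00100010504E0000
print(unpack_atrc(key))
```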
PREF
Root tags in the dataset of a DICOM instance are not prefixed by any item number, because the standard is built around instances.
But as far as DCKV is concerned, the study is the fundamental unit.
In order to keep together all the attributes of all the instances of a study, we add a series/instance prefix PREF which classifies the attributes by series/sop/frame/instances: PREF (SQIT)* ATRC
We call this prefixed DCKV format eDCKV (exam DCKV).
PREF, SQIT and ATRC are each 8 bytes long, which is a convenient size for 64-bit computing.
Sequences and items delimitations
To simplify serialization into binary DICOM, we also materialize each item start, item end and sequence end of the dataset as if they were attributes. To make this possible, we created 4 private value representations (VR).
- SQ start : R=0x0000 C=0x0000
- item start tag: AT=0x00000000 R=0x2B2B C=0x0000
- item end tag : AT=0xFFFFFFFF R=0x5F5F C=0x0000
- SQ end : R=0xFFFF C=0x0000
In ASCII, 0x2B2B is ++ and 0x5F5F is __ (two underscores).
Example: a sequence with an empty item and an item with contents:
- aaaaaaaa-00000000
- aaaaaaaa.00000001_00000000-2B2B0000
- aaaaaaaa.00000001_FFFFFFFF-5F5F0000
- aaaaaaaa.00000002_00000000-2B2B0000
- aaaaaaaa.00000002_bbbbbbbb-44410000
- aaaaaaaa.00000002_FFFFFFFF-5F5F0000
- aaaaaaaa-FFFF0000
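The example above can be reproduced with a short Python sketch. It assumes that each SQIT is the 4-byte sequence tag followed by a 4-byte item number, as the text notation suggests, and that all fields are stored most significant byte first. The helper names `atrc`, `sqit` and `render` are hypothetical.

```python
import struct


def atrc(tag: int, r: int, c: int = 0) -> bytes:
    """8-byte ATRC: tag, then R and C as two 16-bit fields."""
    return struct.pack(">IHH", tag, r, c)


def sqit(sequence_tag: int, item_number: int) -> bytes:
    """8-byte SQIT: sequence tag followed by item number (assumed layout)."""
    return struct.pack(">II", sequence_tag, item_number)


def render(key: bytes) -> str:
    """Text form: each SQIT as 'tag.item_', then the ATRC as 'tag-RRRRCCCC'."""
    words = [key[i:i + 4].hex().upper() for i in range(0, len(key), 4)]
    *chain, at, rc = words
    path = "".join(f"{chain[i]}.{chain[i + 1]}_" for i in range(0, len(chain), 2))
    return f"{path}{at}-{rc}"


SQ = 0xAAAAAAAA  # placeholder sequence tag, as in the example
keys = [
    atrc(SQ, 0x0000),                         # SQ start
    sqit(SQ, 1) + atrc(0x00000000, 0x2B2B),   # item 1 start
    sqit(SQ, 1) + atrc(0xFFFFFFFF, 0x5F5F),   # item 1 end
    sqit(SQ, 2) + atrc(0x00000000, 0x2B2B),   # item 2 start
    sqit(SQ, 2) + atrc(0xBBBBBBBB, 0x4441),   # attribute with VR "DA"
    sqit(SQ, 2) + atrc(0xFFFFFFFF, 0x5F5F),   # item 2 end
    atrc(SQ, 0xFFFF),                         # SQ end
]

# A plain bytewise sort restores the serialization order of the delimiters
for key in sorted(reversed(keys)):
    print(render(key))
```

Under these assumptions the delimiters need no special handling: the values chosen for the private tags and VRs make the start markers sort before the contents and the end markers after them.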
Range selection
The ordered list of keys allows for range selection. A specific range selection type, the sharedPrefix one, is useful for the selection of:
- groups, for instance group 0002 or private groups (odd group numbers, with the exception of 0001, 0003, 0005, 0007 and FFFF)
- sequence contents (all the items). This also works for encapsulated sequences
- item contents
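A shared-prefix selection over a sorted list of binary keys can be sketched with two binary searches. This is an illustration only: the function names are hypothetical, and the keys are assumed to be 8-byte ATRC values with the tag in GgUu order followed by the VR letters and a zero charset index.

```python
from bisect import bisect_left
from typing import List, Optional


def prefix_upper_bound(prefix: bytes) -> Optional[bytes]:
    """Smallest byte string greater than every string that starts with prefix."""
    stripped = prefix.rstrip(b"\xff")
    if not stripped:
        return None  # prefix is all 0xFF: the range runs to the end of the list
    return stripped[:-1] + bytes([stripped[-1] + 1])


def select_shared_prefix(sorted_keys: List[bytes], prefix: bytes) -> List[bytes]:
    """Return the contiguous slice of keys that start with prefix."""
    lo = bisect_left(sorted_keys, prefix)
    upper = prefix_upper_bound(prefix)
    hi = len(sorted_keys) if upper is None else bisect_left(sorted_keys, upper)
    return sorted_keys[lo:hi]


keys = sorted(bytes.fromhex(h) for h in (
    "0008001855490000",  # (0008,0018) UI
    "0008002044410000",  # (0008,0020) DA
    "00100010504E0000",  # (0010,0010) PN
    "001000204C4F0000",  # (0010,0020) LO
))

# Select everything in group 0008
for key in select_shared_prefix(keys, bytes.fromhex("0008")):
    print(key.hex().upper())
```

The same call with a longer prefix (a sequence tag, or a sequence tag plus item number) would select the contents of a sequence or of a single item.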
Lazy parsing and value format
We register values as they are in the binary DICM representation (including padding), that is, as a byte chain.
Our parsing is lazy: value parsing is deferred until the value needs to be represented. This optimizes the parsing process, since many attributes will never be represented, which implies that their values never need to be parsed.
Another benefit of lazy parsing is that serialization of the values is merely a copy operation.
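The idea can be sketched as follows. The class `LazyValue` is hypothetical and the decoding is deliberately simplified: it only handles the US VR and plain ASCII text, and ignores the charset index.

```python
import struct
from functools import cached_property


class LazyValue:
    """Keeps the raw bytes and decodes them only on first access."""

    def __init__(self, vr: str, raw: bytes):
        self.vr = vr
        self.raw = raw  # stored exactly as found, padding included

    @cached_property
    def parsed(self):
        if self.vr == "US":  # unsigned 16-bit integers, little endian
            return list(struct.unpack(f"<{len(self.raw) // 2}H", self.raw))
        # Simplified text decoding: ASCII only, values separated by backslash
        text = self.raw.decode("ascii").rstrip(" \x00")
        return text.split("\\")

    def serialize(self) -> bytes:
        """Serialization is a plain copy of the stored bytes."""
        return self.raw


value = LazyValue("LO", b"ABC ")
print(value.serialize())  # the padding space is preserved
print(value.parsed)       # decoding happens here, once
```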