Row compressed datg file - molgenis/systemsgenetics GitHub Wiki

The datg files are used to efficiently store a matrix of doubles. Each row is compressed as its own LZ4 block, which allows fast random access to specific rows. Rows are combined into shared blocks only when the number of columns is so small that row-based compression no longer makes sense.

Each datg file is accompanied by a rows.txt.gz and a cols.txt.gz file. These gzipped text files must contain exactly as many entries as the number of rows and columns, respectively, coded within the datg file. The files can be tab separated; the first column of each file is treated as the row names and column names, respectively, of the matrix stored in the datg file. The order of the entries in these files must therefore match the order of the rows and columns in the matrix.
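As a minimal sketch of how the companion files could be read, the Python snippet below takes the first tab-separated column of every line and checks the count against an expected size. The function names `read_names` and `check_names` are hypothetical, and the sketch assumes UTF-8 text with no header line, which this page does not state.

```python
import gzip


def read_names(path: str) -> list[str]:
    """Return the first tab-separated column of each line of a gzipped text file.

    Assumptions (not stated in the specification): UTF-8 text, no header line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\r\n").split("\t", 1)[0] for line in handle]


def check_names(names: list[str], expected: int, label: str) -> None:
    """Raise if the number of names differs from the count coded in the datg file."""
    if len(names) != expected:
        raise ValueError(
            f"{label}: found {len(names)} names but the datg file codes {expected}"
        )
```

A reader would call `read_names` once for rows.txt.gz and once for cols.txt.gz, then pass the row and column counts from the datg metadata to `check_names`.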

Within the datg file there is an index that allows random access to specific rows. It is also possible to efficiently stream through all rows in order.

Specification

The datg files start with the actual row-compressed matrix, followed by the row indices and finally some metadata.

Data

The double values of each row are encoded using the IEEE 754 floating-point bit layout, and each row (or sometimes multiple rows) is compressed into a single LZ4 block. The Java implementation uses: [https://github.com/lz4/lz4-java/blob/master/src/java/net/jpountz/lz4/LZ4BlockOutputStream.java]

Note: The maximum number of columns is 4,194,304 (2^22). A writer should stop and return an error when asked to create a datg file with more columns.
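The row encoding and the column limit can be illustrated with a short Python sketch that turns one row into raw bytes before compression. The names `encode_row` and `decode_row` are hypothetical. Big-endian byte order is an assumption based on the default of Java data streams; this page does not state the byte order. The LZ4 compression step itself is left out.

```python
import struct

MAX_COLUMNS = 4_194_304  # 2**22, the limit given in the note above


def encode_row(values: list[float]) -> bytes:
    """Pack one matrix row as consecutive IEEE 754 doubles (8 bytes each).

    Assumption: big-endian byte order, which is not stated in the specification.
    """
    if len(values) > MAX_COLUMNS:
        raise ValueError(
            f"{len(values)} columns exceeds the datg maximum of {MAX_COLUMNS}"
        )
    return struct.pack(f">{len(values)}d", *values)


def decode_row(raw: bytes, n_cols: int) -> list[float]:
    """Unpack the uncompressed bytes of one row back into doubles."""
    if len(raw) != 8 * n_cols:
        raise ValueError(f"expected {8 * n_cols} bytes, got {len(raw)}")
    return list(struct.unpack(f">{n_cols}d", raw))
```

Each row occupies 8 bytes per column before compression, so a row at the column limit is 32 MiB uncompressed.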

Indices

The indices are an ordered array of longs holding the byte index of each block in the file. This array is also compressed into an LZ4 block.
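As a rough sketch, once the index block has been decompressed it can be unpacked into a list of byte offsets, and a row number can be mapped to a block using the number of rows per compressed block stored in the metadata. The names `decode_block_offsets` and `locate_row` are hypothetical. Two assumptions are not confirmed by this page: the longs are big-endian, and block i holds the consecutive rows starting at i times the rows per block.

```python
import struct


def decode_block_offsets(raw: bytes) -> list[int]:
    """Interpret decompressed index bytes as an array of 8-byte signed longs.

    Assumption: big-endian byte order.
    """
    if len(raw) % 8 != 0:
        raise ValueError("index length is not a multiple of 8 bytes")
    count = len(raw) // 8
    return list(struct.unpack(f">{count}q", raw))


def locate_row(row: int, rows_per_block: int, offsets: list[int]) -> tuple[int, int]:
    """Return (byte offset of the block, position of the row inside that block).

    Assumption: blocks hold consecutive rows, rows_per_block rows each.
    """
    block, position = divmod(row, rows_per_block)
    if row < 0 or block >= len(offsets):
        raise IndexError(f"row {row} is outside the indexed blocks")
    return offsets[block], position
```

With one row per block, the common case described above, the position inside the block is always 0 and the offset list has one entry per row.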

Meta data

The meta data contains the following and is not compressed:

String with info of arbitrary size.

UTF8 string with dataset name
UTF8 string name of entities on rows
UTF8 string name of entities on columns

44 bytes of additional meta data

long with creation date of file as number of seconds since 1970-01-01T00:00:00Z
int number of rows
int number of columns
long file index of the compressed block with indices
long file index of start of metadata block (this is variable due to the strings)
int number of rows per compressed block
3 bytes reserved for future use; a reader should fail if these are not zero, to allow future updates of this format
1 byte with flags
- Bit 1-7: reserved, a current reader should fail if these are not zero.
- Bit 8: indicates the compression algorithm.
  - 0 for jpountz.lz4 blocks, deprecated
  - 1 for lz4 frames; all writers should use lz4 frames as these are more portable.
4 magic bytes 85, 77, 67, 71 (ASCII "UMCG")

The fixed-size meta data after the strings starts at the file length minus 44 bytes.
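The 44-byte layout can be illustrated with a minimal Python sketch that reads the end of a file and unpacks the fields. The names `DatgTrailer`, `parse_trailer` and `read_trailer` are hypothetical. The sketch assumes that the fields are stored in the order listed above and in big-endian byte order, neither of which is stated explicitly here. The flags byte is returned as-is, because the numbering of its bits is not defined on this page.

```python
import os
import struct
from typing import NamedTuple

TRAILER_SIZE = 44
MAGIC = bytes([85, 77, 67, 71])  # ASCII "UMCG"

# Assumed layout: long, int, int, long, long, int, 3 reserved bytes,
# 1 flags byte, 4 magic bytes; big-endian, no padding.
TRAILER_FORMAT = ">qiiqqi3sB4s"


class DatgTrailer(NamedTuple):
    creation_time: int    # seconds since 1970-01-01T00:00:00Z
    n_rows: int
    n_cols: int
    index_offset: int     # file index of the compressed block with indices
    metadata_offset: int  # file index of the start of the metadata block
    rows_per_block: int
    flags: int            # raw flags byte, left uninterpreted


def parse_trailer(raw: bytes) -> DatgTrailer:
    """Unpack the last 44 bytes of a datg file and validate them."""
    if len(raw) != TRAILER_SIZE:
        raise ValueError(f"expected {TRAILER_SIZE} bytes, got {len(raw)}")
    *fields, reserved, flags, magic = struct.unpack(TRAILER_FORMAT, raw)
    if magic != MAGIC:
        raise ValueError("magic bytes do not match, not a datg file")
    if reserved != b"\x00\x00\x00":
        raise ValueError("reserved bytes are not zero, unsupported format version")
    return DatgTrailer(*fields, flags)


def read_trailer(path: str) -> DatgTrailer:
    """Read the fixed-size meta data starting at file length minus 44."""
    with open(path, "rb") as handle:
        handle.seek(-TRAILER_SIZE, os.SEEK_END)
        return parse_trailer(handle.read(TRAILER_SIZE))
```

Checking the magic bytes and the reserved bytes before trusting any other field lets a reader reject files written in a newer version of the format.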

Reference implementation

The Java implementation can be found here:

[https://github.com/PatrickDeelen/systemsgenetics/blob/master/genetica-libraries/src/main/java/umcg/genetica/math/matrix2/DoubleMatrixDatasetRowCompressedWriter.java]
[https://github.com/PatrickDeelen/systemsgenetics/blob/master/genetica-libraries/src/main/java/umcg/genetica/math/matrix2/DoubleMatrixDatasetRowCompressedReader.java]