File format - nongli/Trevni GitHub Wiki

This file contains the file format description for Trevni.

Overview

Trevni is a nested, columnar, PAX-style format: rows are first partitioned horizontally, and each partition is then split into columns. Nested data is supported through the Dremel encoding (repetition and definition levels). The container format is fairly simple:

File/Column Header/MetaData
Column1
Column2
...

Each column looks like:

# of blocks, 
block 1 metadata,     
block 2 metadata,
... 
block n metadata,
block 1 data, 
block 2 data,
... 
block n data
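
Because all block metadata precedes the block data, a reader can locate any block without scanning the ones before it. The following is a minimal sketch of that idea, not part of the format itself. It assumes that blocks are stored back to back with no padding and that each block occupies exactly the compressed size recorded in its metadata; the function and parameter names are hypothetical.

```python
from typing import List


def block_data_offsets(first_block_offset: int, compressed_sizes: List[int]) -> List[int]:
    """Return the file offset at which each block's data starts.

    Assumption: blocks are contiguous, so block k starts where block k-1 ends.
    """
    offsets = []
    pos = first_block_offset
    for size in compressed_sizes:
        offsets.append(pos)
        pos += size  # the next block begins right after this one
    return offsets
```

For example, with the first block at offset 100 and compressed sizes 10, 20 and 30, the blocks would start at offsets 100, 110 and 130.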

Each horizontal partition is written as a separate HDFS file. HDFS should be configured so that each such file occupies exactly one HDFS block, and the block size should be relatively large. Larger blocks allow better I/O throughput (more large sequential reads) but reduce the available parallelism (for example, for MapReduce) and increase the memory footprint of the writer. Because each file is a single block and is completely standalone, no remote I/O is necessary to process the data. We recommend 1 GB per file/block, but this is adjustable.

Data types

Trevni defines the following data types used in the file format:

  • Null: requires zero bytes. Sometimes used in array columns.
  • Long: 64-bit signed values, represented in zig-zag format.
  • Int: like Long, but restricted to 32-bit signed values.
  • Fixed32: 32-bit values stored as four bytes, little-endian.
  • Fixed64: 64-bit values stored as eight bytes, little-endian.
  • Float: 32-bit IEEE floating point value, little-endian.
  • Double: 64-bit IEEE floating point value, little-endian.
  • String: encoded as a Long followed by that many bytes.
  • Bytes: encoded like String; may be used to encapsulate more complex objects.

Additional data encodings can be specified in the column metadata. This includes compression, block encodings, etc.
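
To make the Long and String encodings concrete, here is a minimal sketch. The text above only says that Long uses zig-zag format; the sketch additionally assumes that the zig-zag value is written as a variable-length base-128 integer (7 bits per byte, least significant group first), as in Avro and Protocol Buffers. The function names are hypothetical.

```python
from typing import Tuple


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def zigzag_decode(z: int) -> int:
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def write_long(n: int) -> bytes:
    """Serialize a Long (assumed: zig-zag value as a base-128 varint)."""
    z = zigzag_encode(n)
    out = bytearray()
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Deserialize a Long starting at pos; return (value, next position)."""
    z = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:  # continuation bit clear: last byte
            return zigzag_decode(z), pos
        shift += 7


def write_bytes(data: bytes) -> bytes:
    """Serialize Bytes (and String payloads): a Long length, then the raw bytes."""
    return write_long(len(data)) + data
```

Under these assumptions small magnitudes stay small on disk: `write_long(-1)` yields the single byte `0x01`, and `write_long(64)` yields the two bytes `0x80 0x01`.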

Format description

The following is a pseudo-BNF grammar for the Trevni file format. Comments are prefixed with dashes:

trevni ::=
  <file-header>
  <column>+

file-header ::=
  <file-version-header>
  <row-count>
  <column-count>
  <file-metadata>
  <column-metadata>+
  <column-start>+

file-version-header ::= Byte[4] {'T', 'r', 'v', 1}
-- Number of rows in this file
row-count ::= Fixed64
-- Number of columns in this file
column-count ::= Fixed32
-- offset into file for start of column
column-start ::= Fixed64

file-metadata ::= <metadata>

column-metadata ::= <metadata>

column ::=
  <block-count>
  <block-descriptor>+
  <block>+

-- number of blocks
block-count ::= Fixed32

block-descriptor ::=
  <block-row-count>
  <block-uncompressed-size>
  <block-compressed-size>
  [ <first-value> ]

block-row-count ::= Fixed32
block-uncompressed-size ::= Fixed32
block-compressed-size ::= Fixed32

first-value ::= -- First value in the block, in the type of the column

block ::=
  [ <column-checksum> ]
  [ <definition-array> ]
  [ <repetition-array> ]
  <column-values>*

definition-array ::= IntegerArray
repetition-array ::= IntegerArray
column-values ::= -- Serialized column values

-- A collection of key-value pairs defining metadata values for the file or for a column.
-- Keys are Strings; values are Bytes.
metadata ::=
  <meta-count>
  <metadata-pair>*

meta-count ::= Long

metadata-pair ::=
  String
  Bytes
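
As an illustration of the grammar, the following sketch reads the `file-header` production from a binary stream. It is a hedged example, not a reference reader, and rests on several assumptions the grammar does not state: the fourth magic byte is the numeric value 1, Fixed32/Fixed64 counts and offsets are read as signed little-endian integers, Long is a zig-zag base-128 varint, String payloads are UTF-8, and there is exactly one `column-metadata` and one `column-start` entry per column. All names are hypothetical.

```python
import io
import struct
from typing import BinaryIO, Dict, List, NamedTuple

MAGIC = b"Trv\x01"  # assumption: the version byte is the number 1, not ASCII '1'


class FileHeader(NamedTuple):
    row_count: int
    column_count: int
    file_metadata: Dict[str, bytes]
    column_metadata: List[Dict[str, bytes]]
    column_starts: List[int]


def read_exact(f: BinaryIO, n: int) -> bytes:
    data = f.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of file")
    return data


def read_long(f: BinaryIO) -> int:
    """Read a Long (assumed: zig-zag value stored as a base-128 varint)."""
    z = 0
    shift = 0
    while True:
        b = read_exact(f, 1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            return (z >> 1) ^ -(z & 1)
        shift += 7


def read_bytes(f: BinaryIO) -> bytes:
    """Read Bytes: a Long length followed by that many bytes."""
    return read_exact(f, read_long(f))


def read_metadata(f: BinaryIO) -> Dict[str, bytes]:
    """Read <meta-count> followed by that many String/Bytes pairs."""
    meta = {}
    for _ in range(read_long(f)):
        key = read_bytes(f).decode("utf-8")  # assumption: UTF-8 keys
        meta[key] = read_bytes(f)
    return meta


def read_file_header(f: BinaryIO) -> FileHeader:
    if read_exact(f, 4) != MAGIC:
        raise ValueError("not a Trevni file")
    (row_count,) = struct.unpack("<q", read_exact(f, 8))     # Fixed64
    (column_count,) = struct.unpack("<i", read_exact(f, 4))  # Fixed32
    file_meta = read_metadata(f)
    column_meta = [read_metadata(f) for _ in range(column_count)]
    column_starts = [
        struct.unpack("<q", read_exact(f, 8))[0] for _ in range(column_count)
    ]
    return FileHeader(row_count, column_count, file_meta, column_meta, column_starts)
```

A caller would open the file in binary mode, call `read_file_header`, and then seek to an entry of `column_starts` to read that column's block count and block descriptors.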

Predefined file metadata

These are metadata keys for the file:

  • Default compression codec (string, optional): default compression used for the columns. If this is set, columns inherit it unless they specify a compression codec in their own column metadata.
  • Default checksum algorithm (string, optional): default checksum algorithm for data blocks. Can be overridden in the column metadata. CRC32 must be supported.

Predefined column metadata

These are the metadata keys for each column:

  • Compression Codec/Checksum Algorithm: same as the file metadata keys, but specific to this column.
  • Name (string, required): the name of this column.
  • Type (enum, required): one of the types above.
  • HasInitialValue (bool, optional, defaults to false): if true, each block descriptor contains, at the end, the first value of that block. This can be used, for example, to skip blocks when the data is known to be sorted.
  • MaxRepetition (int, optional, defaults to 0): max repetition level.
  • MaxDefinition (int, optional, defaults to 0): max definition level.
  • ArrayWidth (int, optional, defaults to 0): specifies that the values should be interpreted as a row-major matrix with this fixed width.
  • Parent (string, optional): the name of the parent column, if there is one (nested only).
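
The block skipping that HasInitialValue enables can be sketched as follows. This is an illustrative example under the assumption that the column is sorted in ascending order, so that every value in block i lies between the first value of block i and the first value of block i+1. The function name `blocks_to_scan` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def blocks_to_scan(first_values: Sequence, target) -> range:
    """Return the indices of blocks that may contain target.

    first_values[i] is the first value of block i, taken from its block
    descriptor. Assumes the column is sorted in ascending order.
    """
    # Last block whose first value is <= target (-1 if there is none).
    hi = bisect_right(first_values, target) - 1
    # A block whose first value is < target may still end with target,
    # so start one block before the first block whose first value is >= target.
    lo = max(bisect_left(first_values, target) - 1, 0)
    return range(lo, hi + 1)
```

With first values 10, 20 and 30, a lookup for 25 only needs to read block 1, while a lookup for 20 needs blocks 0 and 1 because block 0 may end with 20. A lookup for 5 reads no blocks at all.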