Data - pwdlugosz/Rye GitHub Wiki

Cells

The fundamental unit of data storage in Rye is called a cell. A cell is a 16-byte C# value type that acts as a tagged union:

  • Bytes 0-7 hold Boolean, double, date-time, and integer values (64 bits); for string and byte[] values, these bytes store the hash code and length
  • Byte 8 holds the data type information
  • Byte 9 flags whether the value is null
  • Bytes 10 and 11 are not in use
  • Bytes 12-15 hold a pointer to either a string variable or a byte array
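
The byte layout above can be sketched with Python's `struct` module. This is only an illustration of the 16-byte arrangement, not Rye's C# code: the little-endian byte order, the `TYPE_INT` tag value, and the function names are all assumptions made for the example.

```python
import struct

# Assumed layout: 8-byte payload, 1-byte type tag, 1-byte null flag,
# 2 unused bytes, 4-byte reference slot. "<" means little-endian, no padding.
CELL_FORMAT = "<qBBHI"

TYPE_INT = 1  # illustrative tag value, not Rye's actual type code


def pack_int_cell(value, is_null=False):
    """Pack an integer cell into 16 bytes (reference slot left at zero)."""
    return struct.pack(CELL_FORMAT, value, TYPE_INT, int(is_null), 0, 0)


def unpack_cell(raw):
    """Split 16 raw bytes back into the fields described in the list above."""
    payload, type_tag, null_flag, _unused, ref = struct.unpack(CELL_FORMAT, raw)
    return {"payload": payload, "type": type_tag,
            "is_null": bool(null_flag), "ref": ref}


cell = pack_int_cell(42)
```

Here `cell[8]` is the type tag and `cell[9]` is the null flag, matching the byte positions listed above.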

The six fundamental data types in Rye are (.NET name in parentheses): BOOL (Boolean), INT (long), DOUBLE (double), DATE (DateTime), STRING (string), BLOB (byte[]).

Within the C# source code, developers can access additional fields: UINT (ulong), INTA (int at byte positions 0-3) and INTB (int at byte positions 4-7). UINT is used to quickly determine if two cells are not equal; INTA and INTB are used for dense data coding.
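
As a rough illustration of how two 32-bit integers and one 64-bit unsigned integer can be read from the same eight bytes, the following Python sketch reinterprets one payload three ways. It assumes little-endian byte order and that UINT spans bytes 0-7 (the text only gives positions for INTA and INTB); `int_views` is a hypothetical helper, not part of Rye.

```python
import struct


def int_views(payload_bytes):
    """Reinterpret the same 8 bytes as (UINT, INTA, INTB)."""
    (uint,) = struct.unpack("<Q", payload_bytes)      # one unsigned 64-bit value
    inta, intb = struct.unpack("<ii", payload_bytes)  # two signed 32-bit values
    return uint, inta, intb


def definitely_not_equal(payload_a, payload_b):
    """Cheap inequality test: different 64-bit patterns cannot be equal payloads."""
    return int_views(payload_a)[0] != int_views(payload_b)[0]


payload = struct.pack("<ii", 3, -1)  # INTA = 3, INTB = -1
uint, inta, intb = int_views(payload)
```

A single 64-bit comparison is enough to rule out equality, which is the kind of shortcut the UINT field is described as providing.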

Cell Matrices

A cell matrix is a multidimensional array of like-typed cells. Rye has built-in support for matrix operations such as addition, subtraction, scalar multiplication, true matrix multiplication, division, checked division, transposition, inversion, and calculating the matrix determinant. The code for matrix inversion was adapted from the NIST Java matrix library.
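
To make the distinction between scalar multiplication and true matrix multiplication concrete, here is a small Python sketch over nested lists. It is not Rye's implementation; the function names are hypothetical and plain Python numbers stand in for cells.

```python
def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def scalar_multiply(m, k):
    """Multiply every element by the same scalar."""
    return [[k * x for x in row] for row in m]


def matrix_multiply(a, b):
    """True matrix product: each entry is a row of a dotted with a column of b."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matrix_multiply(a, b)
```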

Records

A record is a one-dimensional collection of cells, possibly of different types. Unlike matrices, records are designed to live in collections...

Extents

An extent is a single collection of records. Extents have schemas that store the name, type and size of each field. When a record is added to an extent, the extent will, if needed, cast any mismatched values to the types in the schema. Extents are designed to live in memory and have a finite record capacity.
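
The cast-on-insert behaviour and the fixed capacity can be sketched as follows. This is a simplified Python model, not Rye's C# class: the `Extent` and `ExtentFullError` names are hypothetical, Python types stand in for Rye's cell types, and field sizes are left out.

```python
class ExtentFullError(Exception):
    """Raised when a record is added to an extent that is at capacity."""


class Extent:
    def __init__(self, schema, capacity):
        self.schema = schema      # list of (field name, Python type) pairs
        self.capacity = capacity  # maximum number of records held in memory
        self.records = []

    def add(self, record):
        if len(self.records) >= self.capacity:
            raise ExtentFullError("extent has reached its record capacity")
        if len(record) != len(self.schema):
            raise ValueError("record length does not match the schema")
        # Cast only the values whose type does not already match the schema.
        cast = [value if isinstance(value, ftype) else ftype(value)
                for value, (_name, ftype) in zip(record, self.schema)]
        self.records.append(cast)


extent = Extent([("id", int), ("score", float)], capacity=2)
extent.add(["7", 3])  # "7" is cast to int, 3 is cast to float
```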

Tables

A table is a collection of extents, where each extent is flushed to disk. Tables reside in a single binary file, with each extent allocated to a specific region. The table metadata resides in the first 1024 bytes of the file and contains the schema, information about how the table is sorted, and a page table that points to each page's location on the disk.

Abstract Data Structures

Volumes

A volume exists behind the scenes in C#, and is really a sub-section of a table. Suppose you want to run a query over a table with 25 extents, but you want to spread the load over 4 threads. To do this, you'd create four volumes from the table, with 7, 6, 6 and 6 extents each. Each volume then gets passed to the query processor to run. If you just wanted to execute on one thread, you'd create a single volume with all 25 extents. Rye will do all this automatically as long as the user supplies the total thread count.
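
The 7, 6, 6, 6 split can be reproduced with a simple even-partition routine. The Python sketch below is one way to do it, assuming extents are assigned to volumes in contiguous runs; `split_into_volumes` is a hypothetical name and Rye's actual assignment strategy may differ.

```python
def split_into_volumes(extent_ids, thread_count):
    """Divide extents into thread_count contiguous volumes of near-equal size."""
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    base, extra = divmod(len(extent_ids), thread_count)
    volumes, start = [], 0
    for i in range(thread_count):
        # The first `extra` volumes each take one additional extent.
        size = base + (1 if i < extra else 0)
        volumes.append(extent_ids[start:start + size])
        start += size
    return volumes


volumes = split_into_volumes(list(range(25)), 4)
sizes = [len(v) for v in volumes]
```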

Heap

A heap is a C#-only data structure that acts a lot like Dictionary<string,T>, with the distinction that it allows the user to access values either by key (which is a string) or by position (Heap[int Index]). Within Rye, heaps are generally used in place of dictionaries.
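
The dual access pattern can be modelled with a list of values plus a key-to-position lookup. The following Python class is a minimal sketch under that assumption; it is not Rye's `Heap` type, and the `add` method name is invented for the example.

```python
class Heap:
    """Values addressable by string key or by insertion position."""

    def __init__(self):
        self._values = []  # values in insertion order
        self._index = {}   # key -> position in _values

    def add(self, key, value):
        if key in self._index:
            raise KeyError(f"duplicate key: {key}")
        self._index[key] = len(self._values)
        self._values.append(value)

    def __getitem__(self, key_or_index):
        if isinstance(key_or_index, int):
            return self._values[key_or_index]           # positional access
        return self._values[self._index[key_or_index]]  # access by key


heap = Heap()
heap.add("alpha", 10)
heap.add("beta", 20)
```

With this layout, `heap["beta"]` and `heap[1]` return the same value.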

Quack

Quacks are still experimental, but they are essentially a stack/queue hybrid (hence the name). A quack allows the user to switch between LIFO and FIFO. This is useful when querying priorities change over time. For example, Rye caches a finite number of extents, dumping certain extents back to disk when the cache is full, and certain algorithms may perform better when the cache dumps the extents back to disk in a LIFO or FIFO manner.
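
A stack/queue hybrid of this kind can be sketched on top of a double-ended queue. The Python class below is an illustration only: the `mode` attribute and the `push`/`pop` method names are assumptions, not Rye's API.

```python
from collections import deque


class Quack:
    """Container whose removal order can be switched between FIFO and LIFO."""

    def __init__(self, mode="FIFO"):
        self._items = deque()
        self.mode = mode

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if self.mode == "LIFO":
            return self._items.pop()      # newest item first
        if self.mode == "FIFO":
            return self._items.popleft()  # oldest item first
        raise ValueError(f"unknown mode: {self.mode}")


quack = Quack()
for extent_id in (1, 2, 3):
    quack.push(extent_id)
first = quack.pop()   # FIFO: the oldest item
quack.mode = "LIFO"
second = quack.pop()  # LIFO: the newest remaining item
```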

Keys

Keys are used mostly for sorting and joining. Each entry in a key holds two items: an integer reference to a field in a table/extent, and a flag (the affinity) saying whether that field sorts ascending or descending. A single entry in a key can be collapsed into a single cell using dense coding (the field offset is stored in INTA and the affinity is stored in INTB), while the entire key can be collapsed into a record.
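
The sketch below models a key as a list of (field offset, affinity) pairs, mirroring the INTA/INTB pairing described above, and uses it to sort records. It is written in Python for illustration; the `ASCENDING`/`DESCENDING` values and `sort_records` are hypothetical, and plain lists stand in for records.

```python
ASCENDING, DESCENDING = 0, 1  # illustrative affinity codes


def sort_records(records, key):
    """Sort by every key entry, with the first entry as the primary order."""
    ordered = list(records)
    # Python's sort is stable, so sorting from the last key entry to the
    # first leaves the first entry as the dominant ordering.
    for field_index, affinity in reversed(key):
        ordered.sort(key=lambda record: record[field_index],
                     reverse=(affinity == DESCENDING))
    return ordered


rows = [["b", 1], ["a", 2], ["a", 1]]
key = [(0, ASCENDING), (1, DESCENDING)]  # field 0 ascending, then field 1 descending
result = sort_records(rows, key)
```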

Compound Records

A compound record is an array of records, and is only used as temporary storage for aggregate function results. Certain aggregate functions, like average, require holding two fields (count and sum), which are held in a record. Using multiple aggregates requires multiple holding records, which form a compound record. Compound records can be flattened into vanilla records.
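
As a sketch of the idea, the Python snippet below keeps one holding record per aggregate (an average that needs a count and a sum, plus a plain count) and then flattens them into a single record. The function names and the choice of aggregates are assumptions for the example, not Rye's code.

```python
def new_compound_record():
    """One holding record per aggregate: average [count, sum] and count [count]."""
    return [[0, 0.0], [0]]


def accumulate(compound, value):
    """Fold one input value into every aggregate's holding record."""
    average_state, count_state = compound
    average_state[0] += 1
    average_state[1] += value
    count_state[0] += 1


def flatten(compound):
    """Concatenate the holding records into one plain record."""
    return [cell for record in compound for cell in record]


compound = new_compound_record()
for value in (2.0, 4.0):
    accumulate(compound, value)
flat = flatten(compound)
```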

Key Value Sets

Key value sets are designed only to be used in aggregate/grouping operations. The key is a record and the value is a compound record. Key value sets can be collapsed into extents in two ways:

  • An interim collapse, which flattens the compound record. This is used to save the key value set to disk if there are too many tuples to fit into memory.
  • A finalizing collapse, which converts the key value set to a vanilla extent. This is used when the aggregation is complete and the query needs to return the data to a table/extent.
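
The two collapses can be illustrated with a grouped average. In this Python sketch a dictionary stands in for the key value set, tuples stand in for key records, and each compound record holds a single [count, sum] record; all names are hypothetical and the rows are made-up sample data.

```python
def group_average(rows, key_index, value_index):
    """Build a key value set: key tuple -> compound record [[count, sum]]."""
    kv_set = {}
    for row in rows:
        compound = kv_set.setdefault((row[key_index],), [[0, 0.0]])
        compound[0][0] += 1
        compound[0][1] += row[value_index]
    return kv_set


def interim_collapse(kv_set):
    """Key followed by the flattened working fields (count, sum)."""
    return [list(key) + [cell for record in compound for cell in record]
            for key, compound in kv_set.items()]


def final_collapse(kv_set):
    """Key followed by the finished aggregate value (the average)."""
    return [list(key) + [compound[0][1] / compound[0][0]]
            for key, compound in kv_set.items()]


rows = [["a", 2.0], ["a", 4.0], ["b", 10.0]]
kv_set = group_average(rows, key_index=0, value_index=1)
```

The interim form keeps enough state (count and sum) to resume aggregation later, while the final form keeps only the result.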