Query - pwdlugosz/Rye GitHub Wiki

#Overview Rye queries are split into two steps: (1) node invoke phase, where an algorithm is run over a data volume and (2) the consolidation phase where the results of phase one are combined. Phase one can be run over multiple threads where phase two is done in a single thread.

Every query supports the WHERE clause which will filter the results prior to data processing, and supports running on more than one thread.

Sort

Rye sorts extents using the C#'s native intro-sort algorithm, which is a quick-sort / insertion-sort hybrid. When sorting tables, Rye sorts each extent then does an n x (n - 1) merge sort over each extent.

In the near future, the goal is to implement a Tim-Sort...

Read / Select

Read and Select are synonyms. This is akin to a SQL SELECT statement that reads on only one table and doesn't have a DISTINCT/GROUP BY expression. The twist is that it supports declarations of local variables that won't change as it traverses to the next row, it supports variable assignment and for/while loops in the query, and it allows inserting results into multiple tables. The read/select statement will create a memory structure called 'LOCAL' that the user can add scalars and matrixes to. The query processor will allocate five variables to 'LOCAL' and update them throughout the query run:

  • ROW_ID: the position of the current record ID being read
  • EXTENT_ID: The ID of the current extent being read
  • KEY_CHANGE: if the BY{} statement is used, this will be true if this is the next record will result in a key change.
  • IS_FIRST: true if the current record is the first one read, false otherwise (may get spurious results in a multi-thread run)
  • IS_LAST: true if the current record is the last one in the read, false otherwise (may get spurious results in a multi-thread run)

Aggregate

This is akin to a SQL SELECT with a GROUP BY or without a GROUP BY but with aggregates. It supports common aggregates, such as SUM, COUNT, COUNT_NULL, COUNT_ALL (COUNT(*) in SQL), MIN, MAX, AVG, STDEV, VAR, COVAR, CORR, SLOPE, INTERCEPT and FREQ. To use FREQ, pass it a Boolean expression and it will return the ratio of records that match that criteria. FREQ, AVG, STDEV, VAR, COVAR, CORR, SLOPE and INTERCPT can all have an optional final parameter that acts as a weight variable.

Immediately after the aggregate expression, the user can add a WHERE expression to filter variables passed to the aggregate. This feature exists for two reasons (1) this in lieu of the having a pivot table feature and (2) the designers have found it's common for data analysis to embed a SQL-like case statement in aggregates and this feature is just a better optimized version of that.

Example of an aggregate:

  • SUM(X == Y ? A : B) AS SUM_OF_AORB
  • AVG(X, W) WHERE A < B
  • FREQ(A == B) WHERE C == YEAR(NOW())

Aggregates can be implemented within Rye two different ways:

  • Hash Table: key (groupings) and values (working aggregates) are stored in hash tables. If a hash table becomes full, it is saved to disk and a new hash table is created. During the consolidation phase, all hash tables are merged (n ^ 2 - n operation). This algorithm is very quick if the cardinality is low (few distinct groupings), but very slow for many distinct groupings.
  • Ordered: the data is sorted by the grouping expression. When the keys change, the aggregate is saved to the final table and reset. This cannot be used if the grouping keys contain volatile expressions (GUID(), NOW(), TICKS()). This is much faster if the data is already sorted by the keys, but the actual cost of sorting can slow down the algorithm.

Rye's query process can detect if the data set is sorted by the keys used to group the data (or if it's sorted by a superset) and will choose the 'Ordered' algorithm, otherwise it uses the 'Hash Table' method. The user can override one method by using the HINT statement, where 'SORT' forces the 'Ordered' algorithm and 'HASH' forces the 'Hash Table' algorithm.

Join

This is akin to a simple SQL SELECT (without aggregates). Join can only join two tables. The query processor will decide between either a nested loop join or a sort merge join; note that if the processor chooses a sort merge join, it will permanently sort either or both of the tables by the merge key. The user has to specify the equality predicate separate from the other predicate of the join statement.

Currently, the query engine defaults to the sort merge algorithm, but better optimization is coming...

Update

This is akin to a SQL update, but it doesn't support having a join in the update statement.

Delete

This is akin to a SQL delete, but it doesn't support having a join in the delete statement.