High Cardinality Index - marcusbb/astyanax GitHub Wiki

Prior to native secondary indexes, folks were indexing their data themselves, either by using a wide row format with fully de-normalized data in each searchable column of that row, or by building column families (CFs) specifically designed to hold references back to the original data.

This utility is a generalization of the second form.

The rationale is that we can keep the index in a separate CF that serves as a reverse lookup. The index CF will have the general form:

    row key (composite) = CF_name:index_column:index_val
    col[]               = row keys: meta_data
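As a rough illustration of that layout, here is a minimal in-memory sketch. It does not use the Astyanax API: the class `IndexRowSketch`, the method `indexRowKey`, and the nested maps are hypothetical stand-ins, and the composite key is assumed to be a plain `:`-joined string (a real composite type would avoid ambiguity when a value itself contains `:`).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of one index row; not the Astyanax API.
class IndexRowSketch {

    // Builds the composite index row key: CF_name:index_column:index_val.
    // Assumption: none of the parts contains the ':' delimiter.
    static String indexRowKey(String cfName, String indexColumn, String indexValue) {
        return cfName + ":" + indexColumn + ":" + indexValue;
    }

    public static void main(String[] args) {
        // Index CF stand-in: index row key -> (primary row key -> metadata).
        Map<String, Map<String, String>> indexCf = new LinkedHashMap<>();

        String key = indexRowKey("users", "email", "a@example.com");

        // Each column name in the index row is a row key of the primary CF.
        indexCf.computeIfAbsent(key, k -> new LinkedHashMap<>()).put("user-42", "");

        System.out.println(key + " -> " + indexCf.get(key).keySet());
    }
}
```

Every distinct column/value pair gets its own index row, which is what spreads the index across the cluster.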

By default, a single column family holds every index. This is up to the user, however: you can instead define a separate index column family for each configured index.

Rationale: the index is very widely distributed. Each column/value pair is represented as its own row, so the index is spread very evenly around the cluster, with no hot spots. It is also the intention that the index CF is configured as a "row cached" column family, so a lookup should not need to go to disk.

Why at the client? I'm assuming that an implementation at the server would be significantly more complicated: it would require modifying the server code to hold the index coordination logic, in effect adding a coordination layer above the column family reads and writes. I am also assuming that both reads and writes from the application go through this client.

Ideally the index would live in the server (service), which would make it opaque to the client, but there are performance reasons why the client is a good place for this feature. Coordinating at the client means that the client performs a "join" of sorts, where the first read goes to the index CF and the second to the primary CF. Since these are 2 row-based reads they should be very fast, and the index CF should also be row cached so that disk seeks do not occur on an index read. Because the coordination happens at the client, the delta to an indexed column can be calculated efficiently, so a mutation to update the index is issued only when necessary.
However, where the native secondary index does win is on a general put. Because a new index value requires 2 writes to 2 separate nodes, the write is not as efficient as that of the native secondary index, which updates its copy of the index CF on the node where the row is stored.
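The two-step read and the two-write put described above can be sketched as follows. This is an in-memory illustration only, not the utility's actual code: `ClientJoinSketch`, `put`, and `findBy` are hypothetical names, and the two maps stand in for the index CF and the primary CF.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-ins for the two column families; not the Astyanax API.
class ClientJoinSketch {
    // Index CF: composite index row key -> (primary row key -> metadata).
    private final Map<String, Map<String, String>> indexCf = new HashMap<>();
    // Primary CF: row key -> (column name -> value).
    private final Map<String, Map<String, String>> primaryCf = new HashMap<>();
    private final String cfName;

    ClientJoinSketch(String cfName) {
        this.cfName = cfName;
    }

    private String indexRowKey(String column, String value) {
        return cfName + ":" + column + ":" + value;
    }

    // A put of a new indexed value costs two writes: the primary row and the index row.
    void put(String rowKey, String column, String value) {
        primaryCf.computeIfAbsent(rowKey, k -> new LinkedHashMap<>()).put(column, value);
        indexCf.computeIfAbsent(indexRowKey(column, value), k -> new LinkedHashMap<>()).put(rowKey, "");
    }

    // The client-side "join": one index row read, then reads of the referenced primary rows.
    List<Map<String, String>> findBy(String column, String value) {
        // Read 1: a single row from the index CF.
        Map<String, String> refs = indexCf.getOrDefault(
                indexRowKey(column, value), Collections.<String, String>emptyMap());

        // Read 2: fetch each referenced row from the primary CF by row key.
        List<Map<String, String>> rows = new ArrayList<>();
        for (String rowKey : refs.keySet()) {
            Map<String, String> row = primaryCf.get(rowKey);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }
}
```

For example, after `put("user-1", "city", "Toronto")` and `put("user-2", "city", "Toronto")`, a call to `findBy("city", "Toronto")` returns both rows. The sketch deliberately omits removal of stale index entries when a value changes.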

If Cassandra supports row key affinity across CFs then I will re-think this solution.

The second major principle of this feature, which will determine whether it is the right solution for you, is that you read from a CF and write to the same CF, using the same row key that you used for the read. Why? This feature also has a local cache to track the changes that are being made in transit for that particular transaction (unit of work). From that, HC Index makes a decision on how to maintain the index: either by inserting an index entry or by updating it. What is the difference between inserting the index and updating it? Because this is a reverse lookup facility, any change made to an indexed column of a row requires 2 operations:

1. Inserting a new index value that maps to the same key.
2. Removing the old value that previously pointed to this key.
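The insert-versus-update decision can be sketched roughly like this. It is a simplified illustration under stated assumptions, not the utility's real implementation: `IndexDeltaSketch`, `onRead`, and `onWrite` are hypothetical names, the local cache is a plain map scoped to one unit of work, and mutations are returned as descriptive strings rather than sent to Cassandra.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of the per-unit-of-work index delta; not the Astyanax API.
class IndexDeltaSketch {
    // Local cache of indexed values seen on read: "rowKey:column" -> value.
    private final Map<String, String> readCache = new HashMap<>();
    private final String cfName;

    IndexDeltaSketch(String cfName) {
        this.cfName = cfName;
    }

    // Remember what the indexed column held when the row was read.
    void onRead(String rowKey, String column, String value) {
        readCache.put(rowKey + ":" + column, value);
    }

    // Returns the index mutations needed when newValue is written to an indexed column.
    List<String> onWrite(String rowKey, String column, String newValue) {
        List<String> mutations = new ArrayList<>();
        String cacheKey = rowKey + ":" + column;
        String oldValue = readCache.get(cacheKey);

        // Unchanged value: no index mutation is necessary.
        if (Objects.equals(oldValue, newValue)) {
            return mutations;
        }
        // 1. Insert the new index value that maps to this row key.
        if (newValue != null) {
            mutations.add("INSERT " + cfName + ":" + column + ":" + newValue + " -> " + rowKey);
        }
        // 2. Remove the old index value that previously pointed to this row key.
        if (oldValue != null) {
            mutations.add("DELETE " + cfName + ":" + column + ":" + oldValue + " -> " + rowKey);
        }
        readCache.put(cacheKey, newValue);
        return mutations;
    }
}
```

With no prior read in the cache, a write produces only the insert. With a cached old value that differs, it produces the insert plus the removal. This is why reading and writing through the same row key matters: without the read, the old index entry cannot be identified.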

TODO: store a configurable set of columns with the reverse index, so that a read on the index CF won't force a "join" (a second read).