Cassandra - keshavbaweja-git/guides GitHub Wiki

A primary key uniquely identifies a row in a column family (table).
A primary key can be made up of more than one column.
A primary key in Cassandra is made up of a partition key and, optionally, a clustering key.
The partition key decides how data is distributed across the cluster.
The clustering key decides how rows are sorted on disk within a partition on a single node.
The partition key can also be made up of more than one column (compound/composite).
When there are multiple columns in a primary key, the first column is the partition key and the remaining columns form the clustering key, unless a composite partition key is specified explicitly with an inner pair of parentheses.
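
The rule above can be sketched as a small parser. This is an illustrative helper only: `split_primary_key` is a hypothetical name, not part of Cassandra or any driver, and it assumes the input is the text found inside `PRIMARY KEY (...)` of a CQL table definition.

```python
def split_primary_key(clause: str):
    """Split the text inside PRIMARY KEY (...) into
    (partition_key_columns, clustering_columns)."""
    clause = clause.strip()
    if clause.startswith("("):
        # Explicit composite partition key, e.g. PRIMARY KEY ((user_id, day), event_time)
        close = clause.index(")")
        partition = [c.strip() for c in clause[1:close].split(",")]
        rest = clause[close + 1:]
    else:
        # Default rule: the first column alone is the partition key
        first, _, rest = clause.partition(",")
        partition = [first.strip()]
    clustering = [c.strip() for c in rest.split(",") if c.strip()]
    return partition, clustering


# PRIMARY KEY (user_id, event_time, event_id)
print(split_primary_key("user_id, event_time, event_id"))
# PRIMARY KEY ((user_id, day), event_time)
print(split_primary_key("(user_id, day), event_time"))
```

The column names (`user_id`, `day`, `event_time`, `event_id`) are made up for the example.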

How data is distributed across a Cassandra cluster:
Each node is assigned a token (actually a range of tokens).
Tokens determine the rows stored on a node.
A token is generated by the partitioner, a hash function.
Single data center vs Multiple data centers
Properties of a node
- initial_token - the token assigned to a node. The node owns the token range from the previous node's token (exclusive) up to its own token (inclusive).
The same partitioner that defines the token range for nodes also generates a token, in that same range, for each partition key. Rows are then mapped to the node whose assigned token range includes the token generated for their partition key.
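
A minimal sketch of this mapping, under stated assumptions: it hashes keys with MD5 into a 0..2^127 token space (in the spirit of Random Partitioner, not a byte-exact reproduction of it), gives each node a single token, and treats a node as the owner of every token greater than the previous node's token and up to its own. The names `Ring`, `token_for` and the node names are hypothetical.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # illustrative token space


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (MD5-based, for illustration only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class Ring:
    def __init__(self, node_tokens):
        # Keep (token, node) pairs sorted so the ring can be searched by token
        self._entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def owner_of_token(self, token: int) -> str:
        # First node whose token is >= the given token; wrap to the start of the ring
        index = bisect.bisect_left(self._tokens, token)
        return self._entries[index % len(self._entries)][1]

    def owner(self, partition_key: str) -> str:
        return self.owner_of_token(token_for(partition_key))


ring = Ring({
    "node-a": 0,
    "node-b": RING_SIZE // 4,
    "node-c": RING_SIZE // 2,
    "node-d": 3 * RING_SIZE // 4,
})
print(ring.owner("user:42"))
```

Because the token depends only on the partition key, every row with the same partition key resolves to the same node.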

Replication Factor = number of copies of a row.
Replication Placement Strategy
Row data is not split across nodes. If a partition key exists on a node, all corresponding column value pairs are stored on that node.
Partition Key determines which rows will be hosted on a node.
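
As a rough illustration of replica placement, the sketch below assumes a SimpleStrategy-like rule: the first replica goes to the node that owns the token, and the remaining copies go to the next nodes clockwise on the ring. The function name, node names and token values are hypothetical, and data-center or rack awareness is ignored.

```python
import bisect


def replicas_for_token(ring, token, replication_factor):
    """ring is a list of (token, node) pairs sorted by token.
    Returns the nodes that hold a copy of the row with this token."""
    tokens = [t for t, _ in ring]
    # Owner: first node whose token is >= the row's token, wrapping around
    start = bisect.bisect_left(tokens, token) % len(ring)
    # There cannot be more distinct replicas than nodes
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


ring = [(0, "node-a"), (100, "node-b"), (200, "node-c"), (300, "node-d")]
print(replicas_for_token(ring, 150, 3))
```

Each replica stores the whole partition; the row is copied, never split.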

Random Partitioner - MD5 hashing algorithm.
Murmur3 Partitioner - default partitioner, Murmur3 hash algorithm, better performance than Random Partitioner.
These two don't allow range and aggregate queries over partition keys, because hashing does not preserve key order.
Byte Ordered Partitioner - deprecated, allows range and aggregate queries, orders rows by the hexadecimal representation of the partition key. Keys are thus sorted on a node and across nodes.
Ordered storage of data has a number of limitations. Load balancing is impacted, as a contiguous range of keys is stored on a single node. Hot spots - heavy load on the nodes that hold a popular key range.
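
The hot-spot effect can be shown with a toy comparison. This is a sketch under simplifying assumptions: four nodes, ordered placement decided by the first byte of the key only, and hashed placement using MD5 as a stand-in for a hashing partitioner. The key format `userNNNN` is made up.

```python
import hashlib
from collections import Counter

NODES = 4


def ordered_node(key: str) -> int:
    # Byte-ordered placement: split the first-byte space 0..255 into equal ranges
    return key.encode("utf-8")[0] * NODES // 256


def hashed_node(key: str) -> int:
    # Hashed placement: scale a 128-bit MD5 value down to a node index
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return (int.from_bytes(digest, "big") * NODES) >> 128


keys = [f"user{i:04d}" for i in range(1000)]
ordered_load = Counter(ordered_node(k) for k in keys)
hashed_load = Counter(hashed_node(k) for k in keys)

print("ordered:", dict(ordered_load))
print("hashed: ", dict(hashed_load))
```

With ordered placement every key shares the prefix `user`, so all of them land in one range and one node takes the full load, while hashing spreads the same keys over all four nodes.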

PK1:CK-a1:CK-b1:rest of the columns1
    CK-a2:CK-b2:rest of the columns2
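
The layout above can be modelled as a map from partition key to a list of rows kept sorted by clustering key. This is a simplified in-memory sketch, not Cassandra's storage engine; the class `Table` and its methods are hypothetical names.

```python
import bisect
from collections import defaultdict


class Table:
    def __init__(self):
        # partition key -> list of (clustering_key_tuple, columns), sorted by clustering key
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, columns):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, columns)  # same key: overwrite in this sketch
        else:
            rows.insert(i, (clustering_key, columns))  # keep clustering order

    def clustering_keys(self, partition_key):
        return [ck for ck, _ in self._partitions.get(partition_key, [])]

    def slice(self, partition_key, start, end):
        """Rows of one partition whose clustering key lies in [start, end]."""
        rows = self._partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        return rows[bisect.bisect_left(keys, start):bisect.bisect_right(keys, end)]


table = Table()
table.insert("PK1", ("a2", "b2"), {"value": 2})
table.insert("PK1", ("a1", "b1"), {"value": 1})
print(table.clustering_keys("PK1"))
```

Because rows within a partition are already in clustering order, a range over clustering keys is a contiguous slice.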

A secondary index is used to access data based on a column that is not part of the primary key.
A secondary index is stored in a separate column family.
It is best used when the indexed column does not have very high cardinality.
A secondary index is local to each node: a node indexes only the rows it stores, and the index is not distributed across the cluster by indexed value.
A query on a secondary index that does not also restrict the partition key is forwarded to all nodes in the cluster, and can be an expensive operation.
On a large cluster, secondary index queries can display poor performance.
A secondary index should not be used for a column family that is updated frequently.
A secondary index should not be defined for counter columns.
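
The cost of a secondary index query can be sketched with a toy model. It assumes each node keeps a local index over only the rows it stores, and that a query without a partition key must ask every node. `Node`, `query_by_index` and the sample rows are hypothetical.

```python
from collections import defaultdict


class Node:
    def __init__(self, name):
        self.name = name
        self.rows = {}                  # partition key -> row
        self.index = defaultdict(set)   # indexed value -> partition keys on this node

    def write(self, key, row, indexed_column):
        old = self.rows.get(key)
        if old is not None:
            # An update must also remove the stale index entry
            self.index[old[indexed_column]].discard(key)
        self.rows[key] = row
        self.index[row[indexed_column]].add(key)

    def lookup(self, value):
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


def query_by_index(nodes, value):
    """Scatter the query to every node and gather the matches."""
    contacted = 0
    results = []
    for node in nodes:
        contacted += 1
        results.extend(node.lookup(value))
    return results, contacted


nodes = [Node("node-a"), Node("node-b"), Node("node-c")]
nodes[0].write("u1", {"name": "Ann", "country": "IN"}, "country")
nodes[1].write("u2", {"name": "Bob", "country": "US"}, "country")
nodes[2].write("u3", {"name": "Cid", "country": "IN"}, "country")

matches, contacted = query_by_index(nodes, "IN")
print(len(matches), "matches from", contacted, "nodes")
```

Every node is contacted even when it holds no match, and each update pays for index maintenance as well, which is consistent with the cautions about large clusters and frequently updated column families.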