Cassandra - keshavbaweja-git/guides GitHub Wiki
Keys
- Primary Key uniquely identifies a row in a column family.
- Primary Key can be made up of more than one column.
- Primary Key in Cassandra is made up of a Partition Key and, optionally, a Clustering Key.
- Partition Key decides how data is distributed across the cluster.
- Clustering Key decides the order in which rows are stored within a partition on a single node.
- Partition Key can also be made up of more than one column (compound/composite).
- When there are multiple columns in a primary key, the first column is the partition key and the remaining columns form the clustering key, unless the partition key is specified explicitly with an extra pair of parentheses, as in PRIMARY KEY ((a, b), c).
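The default rule for splitting a primary key can be sketched in Python. `split_primary_key` is a hypothetical helper, not a Cassandra API. It assumes the primary key is passed as a list of column names in which a nested list stands for an explicitly parenthesised partition key.

```python
def split_primary_key(primary_key):
    """Split a primary key definition into (partition_key, clustering_key).

    Hypothetical model: a nested list as the first element stands for an
    explicitly declared composite partition key.
    """
    if not primary_key:
        raise ValueError("a primary key needs at least one column")
    first, rest = primary_key[0], list(primary_key[1:])
    if isinstance(first, (list, tuple)):
        # Explicit form, e.g. PRIMARY KEY ((a, b), c)
        return list(first), rest
    # Default form, e.g. PRIMARY KEY (a, b, c): only the first column
    # is the partition key, the rest are clustering columns.
    return [first], rest
```

For example, `split_primary_key(["a", "b", "c"])` treats `a` as the partition key, while `split_primary_key([["a", "b"], "c"])` treats `a` and `b` together as the partition key.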
Data distribution
- How is data distributed across Cassandra cluster.
- Each node is assigned a token (actually a range of tokens)
- Tokens determine the rows stored on a node
- Tokens are generated by the Partitioner, a hash function.
- Single data center vs Multiple data centers
- Properties of a node
- initial_token - the token assigned to a node; it bounds the token range the node owns.
- The same partitioner that generates tokens for nodes also generates a token for each partition key, in the same token range. Rows are then mapped to the node whose assigned token range includes the token generated for their partition key.
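The mapping from partition key to node can be sketched as a lookup on a sorted token ring. This is a minimal illustration, not Cassandra's implementation: `TokenRing`, `token_for`, the node names and `RING_SIZE` are all made up for the example, MD5 stands in for the partitioner, and each node is assumed to own the range from the previous node's token (exclusive) up to its own token (inclusive).

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed size of the token space for this sketch


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (MD5 used here as a stand-in)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class TokenRing:
    def __init__(self, node_tokens):
        # node_tokens: {node_name: token}
        # Assumption: each node owns (previous_token, token].
        self._entries = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._entries]

    def owner(self, partition_key: str) -> str:
        token = token_for(partition_key)
        # First node whose token is >= the key's token; wrap around
        # to the first node when the token is past the last one.
        idx = bisect.bisect_left(self._tokens, token) % len(self._tokens)
        return self._entries[idx][1]


ring = TokenRing({
    "node-a": 0,
    "node-b": RING_SIZE // 4,
    "node-c": RING_SIZE // 2,
    "node-d": 3 * RING_SIZE // 4,
})
print(ring.owner("user-42"))  # always the same node for the same key
```

Because the token depends only on the partition key, every row with the same partition key lands on the same node.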
Replication
- Replication Factor = number of copies of a row.
- Replication Placement Strategy
- Row data is not split across nodes. If a partition key exists on a node, all corresponding column value pairs are stored on that node.
- Partition Key determines which rows will be hosted on a node.
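A replication factor greater than one can be sketched by walking the ring clockwise from the owning node. The placement rule below is an assumed, simplified single data center strategy, and `replicas_for` and the small integer tokens are hypothetical values chosen for readability.

```python
import bisect


def replicas_for(token, node_tokens, replication_factor):
    """Return the nodes holding copies of the row with the given token.

    Assumed placement rule: the first replica goes to the node that owns
    the token, the remaining replicas to the next nodes clockwise.
    """
    entries = sorted((tok, node) for node, tok in node_tokens.items())
    if replication_factor > len(entries):
        raise ValueError("replication factor exceeds number of nodes")
    tokens = [tok for tok, _ in entries]
    start = bisect.bisect_left(tokens, token) % len(entries)
    return [entries[(start + i) % len(entries)][1]
            for i in range(replication_factor)]


nodes = {"node-a": 0, "node-b": 25, "node-c": 50, "node-d": 75}
print(replicas_for(60, nodes, 3))  # ['node-d', 'node-a', 'node-b']
```

Each replica holds the whole row, which matches the point above that row data is not split across nodes.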
Types of Partitioner
- Random Partitioner - MD5 hashing algorithm
- Murmur3 Partitioner, default partitioner, Murmur3 hash algorithm, better performance than Random Partitioner
- These two hash-based partitioners do not allow range and aggregate queries across partition keys.
- Byte Ordered Partitioner - deprecated, allows range and aggregate queries. It orders rows by the hexadecimal representation of the partition key, so keys are sorted on a node and across nodes.
- Ordered storage of data has a number of limitations. Load balancing is impacted because a contiguous range of keys is stored on a single node. This creates hot spots: heavy load on the nodes that hold a popular key range.
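The hot spot effect can be illustrated by placing the same keys on four nodes in two ways. Both placement functions are simplifications invented for this sketch: `ordered_node` splits the key space by the first byte of the key, and `hashed_node` reduces an MD5 hash to a node index, which is not how Cassandra assigns token ranges.

```python
import bisect
import hashlib
from collections import Counter

NODES = 4


def ordered_node(key: str) -> int:
    """Order-preserving placement: split the key space by first byte."""
    boundaries = [64, 128, 192]  # assumed even split of byte values 0..255
    return bisect.bisect_right(boundaries, key.encode("utf-8")[0])


def hashed_node(key: str) -> int:
    """Hash placement: MD5 of the key, reduced to a node index."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


keys = [f"user-{i:05d}" for i in range(1000)]
ordered_load = Counter(ordered_node(k) for k in keys)
hashed_load = Counter(hashed_node(k) for k in keys)

print(ordered_load)  # every key shares the prefix "user-", so one node gets all of them
print(hashed_load)   # keys are spread over the nodes
```

Keys with a common prefix sort next to each other, so ordered placement sends them all to one node, while hashing spreads them out at the cost of losing key order.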
Composite Primary Key
PK1:CK-a1:CK-b1:rest of the columns1
CK-a2:CK-b2:rest of the columns2
- Rows are sorted by values of clustering keys
- Rows for same partition key are stored together.
- Presence of Partition Key leads to some restrictions on query model.
- Look up by column values can be prohibitively expensive.
- Cassandra has restrictions on columns that can be used in lookups.
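The layout above can be modelled as a map from partition key to rows kept in clustering key order. `Table` is a toy in-memory model written for this note, not Cassandra's storage engine, and it ignores upserts, replication and disk layout.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, values):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        # Keep rows ordered by clustering key on insert (no upsert handling).
        position = bisect.bisect_right(keys, clustering_key)
        rows.insert(position, (clustering_key, values))

    def read_partition(self, partition_key):
        """Cheap lookup: one partition, rows already in clustering order."""
        return list(self._partitions.get(partition_key, []))

    def scan_by_value(self, column, value):
        """Expensive lookup: every row of every partition is checked."""
        return [(pk, ck, vals)
                for pk, rows in self._partitions.items()
                for ck, vals in rows
                if vals.get(column) == value]


table = Table()
table.insert("PK1", ("a2", "b2"), {"city": "Pune"})
table.insert("PK1", ("a1", "b1"), {"city": "Delhi"})
table.insert("PK2", ("a1", "b9"), {"city": "Pune"})
```

`read_partition("PK1")` touches a single partition and returns its rows already sorted, whereas `scan_by_value("city", "Pune")` has to visit every partition, which is why lookups by arbitrary column values are restricted.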
Restrictions on Partition Key
- All columns of a Partition Key should be restricted in a query
- Only IN and = operators are allowed on Partition Key
- Range and LIKE operators are not allowed on Partition Key
- ORDER BY is not supported on Partition Key
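The first three rules can be expressed as a small check over a query's restrictions. `check_partition_key_restrictions` is a hypothetical helper that encodes only the rules listed above; it is not part of any Cassandra driver, and the column names in the example are invented.

```python
ALLOWED_PARTITION_OPERATORS = {"=", "IN"}


def check_partition_key_restrictions(partition_key_columns, restrictions):
    """Return a list of problems with a query's partition key restrictions.

    restrictions: {column_name: operator}, e.g. {"user_id": "=", "day": "IN"}.
    """
    problems = []
    for column in partition_key_columns:
        operator = restrictions.get(column)
        if operator is None:
            problems.append(f"{column}: partition key column is not restricted")
        elif operator.upper() not in ALLOWED_PARTITION_OPERATORS:
            problems.append(f"{column}: operator {operator} is not allowed")
    return problems


print(check_partition_key_restrictions(["user_id", "day"], {"user_id": "="}))
# ['day: partition key column is not restricted']
```

An empty list means the query restricts every partition key column with an allowed operator.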
Restrictions on Clustering Key
Restrictions on Secondary index
- Secondary index is used to access data based on a column that is not part of the primary key.
- Secondary index is stored in a separate column family.
- Best used when indexed column does not have very high cardinality.
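- Secondary index is local to each node: a node indexes only the rows it stores, and the index is not partitioned across the cluster by the indexed value.
- Unless the partition key is also restricted, every query on a secondary index is forwarded to all nodes in the cluster, and can be an expensive operation.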
- On a large cluster, secondary index queries can display poor performance.
- Secondary index should not be used for a column family that is updated frequently, because every write must also update the index.
- Secondary index should not be defined for Counter columns
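The cost of a secondary index query can be sketched with a toy cluster. The model assumes each node keeps an index over only the rows it stores, so a lookup by indexed value has to ask every node. `Node` and `query_by_index` are hypothetical names for this illustration, and every row is assumed to contain the indexed column.

```python
from collections import defaultdict


class Node:
    """Toy node holding rows plus a local index on one column."""

    def __init__(self, name, indexed_column):
        self.name = name
        self.indexed_column = indexed_column
        self.rows = {}                       # primary key -> row dict
        self.local_index = defaultdict(set)  # indexed value -> primary keys

    def write(self, primary_key, row):
        old = self.rows.get(primary_key)
        if old is not None:
            # Every update must also maintain the index.
            self.local_index[old[self.indexed_column]].discard(primary_key)
        self.rows[primary_key] = row
        self.local_index[row[self.indexed_column]].add(primary_key)

    def lookup(self, value):
        return [self.rows[pk] for pk in sorted(self.local_index.get(value, ()))]


def query_by_index(nodes, value):
    """Fan out to every node, since any of them may hold matching rows."""
    contacted, results = 0, []
    for node in nodes:
        contacted += 1
        results.extend(node.lookup(value))
    return results, contacted


nodes = [Node("n1", "city"), Node("n2", "city"), Node("n3", "city")]
nodes[0].write(1, {"id": 1, "city": "Pune"})
nodes[2].write(2, {"id": 2, "city": "Pune"})
nodes[1].write(3, {"id": 3, "city": "Delhi"})
results, contacted = query_by_index(nodes, "Pune")
```

Here `contacted` equals the number of nodes no matter how few rows match, and each `write` performs extra index maintenance, which is the reason for the cautions about large clusters and frequently updated column families.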