Cassandra - keshavbaweja-git/guides GitHub Wiki

Keys

  • Primary Key uniquely identifies a row in a column family.
  • Primary Key can be made up of more than one column.
  • Primary Key in Cassandra is made up of a Partition Key and, optionally, a Clustering Key.
  • Partition Key decides how data is distributed across the cluster.
  • Clustering Key decides the order in which rows are stored on disk within a partition on a single node.
  • Partition Key can also be made up of more than one column (compound/composite).
  • When there are multiple columns in a primary key, the first column is the partition key and the remaining columns are clustering keys, unless the partition key is specified explicitly.
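The default rule for splitting a primary key can be sketched in Python. This is only an illustration: `split_primary_key` is a hypothetical helper, not a Cassandra API, and its nested-list input is an assumed stand-in for a CQL clause such as `PRIMARY KEY ((country, city), street)`.

```python
from typing import Sequence, Tuple, Union

# Hypothetical input shape: a bare string is one column, a nested tuple is an
# explicitly grouped (composite) partition key.
KeyPart = Union[str, Sequence[str]]


def split_primary_key(
    primary_key: Sequence[KeyPart],
) -> Tuple[Tuple[str, ...], Tuple[str, ...]]:
    """Return (partition_key_columns, clustering_columns)."""
    if not primary_key:
        raise ValueError("a primary key needs at least one column")
    first, rest = primary_key[0], primary_key[1:]
    # A bare first column is a single-column partition key; a nested group
    # means the partition key was specified explicitly.
    partition = (first,) if isinstance(first, str) else tuple(first)
    return partition, tuple(rest)


print(split_primary_key(["user_id", "created_at"]))
print(split_primary_key([("country", "city"), "street", "house_no"]))
```

In the first call only `user_id` is the partition key; in the second, `country` and `city` together form the partition key and the other columns are clustering keys.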

Data distribution

  • How is data distributed across a Cassandra cluster?
  • Each node is assigned a token (actually a range of tokens).
  • Tokens determine the rows stored on a node.
  • Tokens are generated by the Partitioner, a hash function.
  • Single data center vs Multiple data centers
  • Properties of a node
    • initial_token - the token assigned to a node, which bounds the token range the node is responsible for.
  • The same partitioner that generates tokens for nodes also generates a token for each partition key, in the same token range. Rows are then mapped to the node whose assigned token range includes the token generated for their partition key.
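The mapping from partition key to node can be sketched as a lookup on a sorted ring of tokens. This is a minimal sketch under stated assumptions: MD5 stands in for the partitioner, the node names and the token space size are made up, and a node is assumed to own the tokens up to and including its own token, with wrap-around past the largest token.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token space for this sketch (assumption)


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token; MD5 stands in for the partitioner."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


class Ring:
    def __init__(self, node_tokens: dict):
        # node_tokens maps a hypothetical node name to its assigned token.
        self._pairs = sorted((tok, name) for name, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._pairs]

    def owner_of_token(self, token: int) -> str:
        # First node whose token is >= the given token; tokens beyond the
        # largest node token wrap around to the first node.
        idx = bisect.bisect_left(self._tokens, token)
        return self._pairs[idx % len(self._pairs)][1]

    def owner(self, partition_key: str) -> str:
        return self.owner_of_token(token_for(partition_key))


ring = Ring({
    "node-a": 0,
    "node-b": RING_SIZE // 4,
    "node-c": RING_SIZE // 2,
    "node-d": 3 * RING_SIZE // 4,
})
for key in ("alice", "bob", "carol"):
    print(key, "->", ring.owner(key))
```

Because the hash is deterministic, the same partition key always resolves to the same node as long as the ring does not change.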

Replication

  • Replication Factor = number of copies of a row.
  • Replication Placement Strategy
  • A row's data is not split across nodes. If a partition key exists on a node, all of its column-value pairs are stored on that node.
  • Partition Key determines which rows will be hosted on a node.
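The notes above do not say which placement strategy is meant. As an illustration only, the sketch below assumes the simplest one: start at the node that owns the token and walk the ring clockwise until the replication factor is reached (similar in spirit to Cassandra's SimpleStrategy, with no rack or data center awareness). The ring contents and `replicas_for` are hypothetical.

```python
import bisect


def replicas_for(token: int, ring: list, replication_factor: int) -> list:
    """ring is a list of (token, node_name) pairs sorted by token."""
    tokens = [tok for tok, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    chosen = []
    # Walk clockwise from the owning node, skipping nodes already chosen.
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == replication_factor:
            break
    return chosen


ring = [(0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d")]
print(replicas_for(30, ring, 3))  # ['node-c', 'node-d', 'node-a']
```

Each replica holds the whole row, which matches the point that row data is never split across nodes. If the replication factor exceeds the number of nodes, this sketch simply returns every node once.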

Types of Partitioner

  • Random Partitioner - MD5 hashing algorithm.
  • Murmur3 Partitioner - default partitioner, Murmur3 hash algorithm, better performance than Random Partitioner.
  • These two don't allow range and aggregate queries on the partition key.
  • Byte Ordered Partitioner - deprecated, allows range and aggregate queries, orders rows by the hexadecimal representation of the partition key. Keys are thus sorted on a node and across nodes.
  • Ordered storage of data has a number of limitations. Load balancing is impacted as a range of keys is stored on a single node. Hot spots - heavy load on nodes for a particular range.
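The hot spot problem can be seen in a small simulation. This is a sketch, not Cassandra's implementation: MD5 stands in for a hashing partitioner, the order-preserving token is just the first four bytes of the key, and the four nodes and the 32-bit token space are assumptions.

```python
import bisect
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c", "node-d"]
SPACE = 2 ** 32  # illustrative token space (assumption)
# Upper bound of each node's token range, splitting the space into quarters.
TOKENS = [SPACE // 4 * (i + 1) - 1 for i in range(4)]


def owner(token: int) -> str:
    return NODES[bisect.bisect_left(TOKENS, token) % len(NODES)]


def hashed_token(key: str) -> int:
    # MD5 stands in for a hashing partitioner.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")


def ordered_token(key: str) -> int:
    # Order-preserving: the first four bytes of the key become the token.
    return int.from_bytes(key.encode()[:4].ljust(4, b"\0"), "big")


keys = [f"user{n:04d}" for n in range(1000)]
print("hashed :", Counter(owner(hashed_token(k)) for k in keys))
print("ordered:", Counter(owner(ordered_token(k)) for k in keys))
```

With the ordered token, every `user...` key lands on the same node because adjacent keys get adjacent tokens, while the hashed tokens spread the same keys across the ring.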

Composite Primary Key

PK1:CK-a1:CK-b1:rest of the columns1
    CK-a2:CK-b2:rest of the columns2
  • Rows are sorted by values of clustering keys
  • Rows for same partition key are stored together.
  • Presence of Partition Key leads to some restrictions on the query model.
  • Lookup by arbitrary column values can be prohibitively expensive.
  • Cassandra has restrictions on columns that can be used in lookups.
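The layout in the diagram above can be modeled as a map from partition key to a list of rows kept sorted by clustering key. This is a toy in-memory model, not how Cassandra stores data on disk; the `Table` class and its method names are hypothetical.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key: tuple, columns: dict):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        idx = bisect.bisect_left(keys, clustering_key)
        if idx < len(rows) and rows[idx][0] == clustering_key:
            rows[idx] = (clustering_key, columns)  # same primary key: overwrite
        else:
            rows.insert(idx, (clustering_key, columns))

    def range_scan(self, partition_key, start: tuple, end: tuple):
        """Rows of one partition with start <= clustering key < end."""
        rows = self._partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        return rows[lo:hi]


table = Table()
table.insert("PK1", ("a2", "b2"), {"value": "rest of the columns2"})
table.insert("PK1", ("a1", "b1"), {"value": "rest of the columns1"})
print(table.range_scan("PK1", ("a1",), ("a2",)))
```

A lookup that names the partition key touches one sorted list. A lookup by some other column value would have to walk every partition, which is the expensive case the bullets above warn about.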

Restrictions on Partition Key

  • All columns of a Partition Key must be restricted in a query.
  • Only IN and = operators are allowed on Partition Key.
  • Range and LIKE operators are not allowed on Partition Key.
  • ORDER BY is not supported on Partition Key.
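The first three rules above can be expressed as a small check over the restrictions in a WHERE clause. This is a sketch of the rules as listed here, not Cassandra's query validator; `check_partition_restrictions` and the `(column, operator)` input format are assumptions made for illustration.

```python
ALLOWED_PARTITION_OPERATORS = {"=", "IN"}


def check_partition_restrictions(partition_key_columns, restrictions):
    """restrictions: list of (column, operator) pairs from a WHERE clause.

    Returns a list of problems; an empty list means the rules are satisfied.
    """
    problems = []
    restricted = set()
    for column, operator in restrictions:
        if column not in partition_key_columns:
            continue  # rules for other columns are out of scope here
        restricted.add(column)
        if operator.upper() not in ALLOWED_PARTITION_OPERATORS:
            problems.append(
                f"operator {operator!r} not allowed on partition key column {column!r}"
            )
    for column in partition_key_columns:
        if column not in restricted:
            problems.append(f"partition key column {column!r} is not restricted")
    return problems


print(check_partition_restrictions(("country", "city"), [("country", "="), ("city", "IN")]))
print(check_partition_restrictions(("country", "city"), [("country", "=")]))
print(check_partition_restrictions(("id",), [("id", ">")]))
```

The first call passes, the second fails because `city` is left unrestricted, and the third fails because a range operator is used on the partition key.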

Restrictions on Clustering Key

Restrictions on Secondary index

  • Secondary index is used to access data based on a column that is not part of the primary key.
  • Secondary index is stored in a separate column family.
  • Best used when indexed column does not have very high cardinality.
  • Secondary index is local to each node: a node indexes only the rows it stores, and there is no cluster-wide index.
  • Every query on secondary index is forwarded to all nodes in the cluster, and can be an expensive operation.
  • On a large cluster, secondary index queries can display poor performance.
  • Secondary index should not be used for a column family that is updated frequently.
  • Secondary index should not be defined for Counter columns
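Why secondary index queries reach every node, and why frequent updates are costly, can be illustrated with a toy model. It assumes each node keeps an index over only the rows it stores; the `Node` class and `query_by_index` are hypothetical names, not Cassandra internals.

```python
from collections import defaultdict


class Node:
    def __init__(self, name):
        self.name = name
        self.rows = {}                  # primary key -> row dict
        self.index = defaultdict(set)   # indexed value -> primary keys

    def write(self, primary_key, row, indexed_column):
        old = self.rows.get(primary_key)
        if old is not None:
            # Every update must also remove the stale index entry.
            self.index[old[indexed_column]].discard(primary_key)
        self.rows[primary_key] = row
        self.index[row[indexed_column]].add(primary_key)

    def lookup(self, value):
        return [self.rows[pk] for pk in sorted(self.index.get(value, ()))]


def query_by_index(nodes, value):
    """Fan the lookup out to every node and merge the results."""
    contacted, results = 0, []
    for node in nodes:
        contacted += 1
        results.extend(node.lookup(value))
    return results, contacted


nodes = [Node("node-a"), Node("node-b"), Node("node-c")]
nodes[0].write(1, {"name": "ann", "city": "Oslo"}, "city")
nodes[1].write(2, {"name": "bo", "city": "Rome"}, "city")
nodes[2].write(3, {"name": "cy", "city": "Oslo"}, "city")

rows, contacted = query_by_index(nodes, "Oslo")
print(rows, "nodes contacted:", contacted)
```

The query contacts all three nodes even though only two hold matching rows, so the cost grows with cluster size. Each write also does extra index maintenance, which is one way to read the advice against indexing frequently updated data.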