Page 69 - Building Big Data Applications
Chapter 2 Infrastructure and technology 63
As we have learned so far, a keyspace provides the data structure in which Cassandra
stores the column families and their subgroups. To store the keyspace and the metadata
associated with it, Cassandra provides a cluster architecture, often referred to as the
“ring”. Cassandra distributes data to the nodes by arranging them in a ring that forms the
cluster.
Data partitioning
Data partitioning can be performed either by the client library or by any node of the
cluster, and the placement of a row can be calculated using different algorithms. Two
native algorithms are provided with Cassandra:
The first algorithm is the RandomPartitioner: a hash-based distribution in which the
keys are partitioned more evenly across the different nodes, providing better load
balancing. With this partitioner, each row and all the columns associated with its
row key are stored on the same physical node, and columns are sorted by
name.
The second algorithm is the OrderPreservingPartitioner: it creates partitions based
on the key itself, so data is grouped by key order. This boosts the performance of range
queries, since a query needs to hit fewer nodes to retrieve a whole range
of data.
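The difference between the two partitioners can be illustrated with a minimal Python sketch. It is not Cassandra code: the four node names, the token positions, and the helper functions `md5_token` and `owner` are assumptions made for illustration. The only detail taken from Cassandra is that the RandomPartitioner derives tokens from an MD5 hash of the key.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token space assumed for the hash-based ring


def md5_token(key):
    """Hash a row key to an integer token (MD5-based, in the spirit of RandomPartitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner(ring, token):
    """Return the node whose token is the first one >= `token`, wrapping around the ring."""
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token) % len(ring)
    return ring[index][1]


# Hypothetical four-node cluster; node names and token positions are made up.
NODES = ["node-a", "node-b", "node-c", "node-d"]
hash_ring = [(i * RING_SIZE // 4, n) for i, n in enumerate(NODES)]
ordered_ring = list(zip(["g", "n", "t", "z"], NODES))  # tokens are raw key values

# Ten adjacent row keys, as a range query might request them.
keys = ["user%03d" % i for i in range(100, 110)]

hashed_owners = {owner(hash_ring, md5_token(k)) for k in keys}
ordered_owners = {owner(ordered_ring, k) for k in keys}

print("hash-based placement touches:", sorted(hashed_owners))
print("order-preserving placement touches:", sorted(ordered_owners))
```

With the order-preserving ring, all ten adjacent keys fall between the tokens "t" and "z" and therefore land on a single node, which is why a range query touches fewer nodes. With the hash-based ring, adjacent keys receive unrelated tokens, so placement is scattered; that favors load balancing at the cost of range locality.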
Data sorting
When defining a column, you can specify how the columns will be sorted when results
are returned to the client. Columns are sorted by the “compare with” type defined on
their enclosing column family. You can specify a custom sort order; the options provided
by default are as follows:
BytesType: Simple sort by byte value. No validation is performed.
AsciiType: Similar to BytesType, but validates that the input can be parsed as US-ASCII.
UTF8Type: A string encoded as UTF-8.
LongType: A 64-bit long.
LexicalUUIDType: A 128-bit UUID, compared lexically (by byte value).
TimeUUIDType: A 128-bit version 1 UUID, compared by timestamp.
IntegerType: Faster than a long; supports shorter or longer lengths.
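To see why the choice of comparator matters, the sketch below sorts the same column names under different orderings. It is a simplified Python model, not Cassandra's implementation: the key functions and the `make_v1` helper are hypothetical names, and the time-based comparison looks only at the embedded timestamp, ignoring any tie-breaking rules.

```python
import struct
import uuid


def bytes_key(name):
    """BytesType-style ordering: compare the raw bytes."""
    return name


def long_key(name):
    """LongType-style ordering: decode 8 bytes as a signed big-endian integer."""
    return struct.unpack(">q", name)[0]


def lexical_uuid_key(u):
    """LexicalUUIDType-style ordering: compare the 16 raw bytes."""
    return u.bytes


def time_uuid_key(u):
    """TimeUUIDType-style ordering (simplified): compare the embedded timestamp."""
    return u.time


def make_v1(timestamp, node=0):
    """Build a version 1 UUID carrying a chosen 60-bit timestamp (illustrative helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version nibble = 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, node))


# The same three 64-bit column names sort differently as bytes and as longs.
names = [struct.pack(">q", v) for v in (1, -1, 256)]
by_bytes = [long_key(n) for n in sorted(names, key=bytes_key)]
by_long = [long_key(n) for n in sorted(names, key=long_key)]
print(by_bytes)  # [1, 256, -1]
print(by_long)   # [-1, 1, 256]

# Two version 1 UUIDs whose byte order disagrees with their time order.
earlier = make_v1(0xFFFFFFFF)
later = make_v1(0x100000000)
lexical_order = sorted([earlier, later], key=lexical_uuid_key)
time_order = sorted([later, earlier], key=time_uuid_key)
```

Byte order places -1 last because its two's-complement encoding begins with 0xFF, whereas the numeric comparison places it first. Likewise, a version 1 UUID stores the low bits of its timestamp in the leading bytes, so a lexical comparison can return `later` before `earlier`, while a timestamp comparison returns them chronologically.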
Consistency management
The architecture model for Cassandra is AP (availability and partition tolerance, in
CAP theorem terms) with eventual consistency. Cassandra’s consistency is measured by how
up to date and synchronized all replicas of a row of data are. Though the database is
built on an eventual consistency model, real-world applications will mandate consistency
for all read and write operations. In order to manage the