Introduction
Understanding Cassandra’s internal architecture helps you design better data models, troubleshoot issues, and optimize performance. This article explores how Cassandra achieves distributed, fault-tolerant storage.
Distributed Architecture
Peer-to-Peer Design
Cassandra uses a peer-to-peer distributed architecture:
┌─────────────────────────────────────────────┐
│              Cassandra Cluster              │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐  │
│  │ Node 1  │◄──►│ Node 2  │◄──►│ Node 3  │  │
│  └─────────┘    └─────────┘    └─────────┘  │
│       │              │              │       │
│       └──────────────┼──────────────┘       │
│                      │                      │
│               Gossip Protocol               │
└─────────────────────────────────────────────┘
No Single Point of Failure
Every node in Cassandra is identical:
- Any node can serve any request
- No master node
- Data replicated across multiple nodes
Gossip Protocol
How Gossip Works
Gossip is Cassandra’s peer-to-peer communication protocol:
// Simplified gossip process
// Each node periodically exchanges state with 1-3 random nodes
// State includes:
// - Node location
// - Load information
// - Schema version
// - Token range ownership
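The exchange described above can be sketched in a few lines of Python. This is an illustrative model only; the class and field names are invented and this is not Cassandra's actual implementation:

```python
import random

# Hypothetical sketch of gossip-style state exchange (not Cassandra's code).
# Each node keeps a versioned view of every peer; on each round it exchanges
# its view with a few random peers, and the higher heartbeat wins on merge.

class Node:
    def __init__(self, name):
        self.name = name
        self.heartbeat = 0
        # state: node name -> (heartbeat, metadata)
        self.state = {name: (0, {"status": "NORMAL"})}

    def tick(self):
        self.heartbeat += 1
        self.state[self.name] = (self.heartbeat, {"status": "NORMAL"})

    def gossip_with(self, peer):
        # Merge both views: for each node, keep the entry with the
        # higher heartbeat (the fresher observation).
        for name, (hb, meta) in list(self.state.items()):
            if name not in peer.state or peer.state[name][0] < hb:
                peer.state[name] = (hb, meta)
        for name, (hb, meta) in list(peer.state.items()):
            if name not in self.state or self.state[name][0] < hb:
                self.state[name] = (hb, meta)

nodes = [Node(f"10.0.0.{i}") for i in range(1, 4)]
for _round in range(5):
    for n in nodes:
        n.tick()
        for peer in random.sample([p for p in nodes if p is not n], 2):
            n.gossip_with(peer)

# After a few rounds, every node's view covers the whole cluster.
print(sorted(nodes[0].state))  # ['10.0.0.1', '10.0.0.2', '10.0.0.3']
```

Even though each node only talks to a couple of random peers per round, state spreads to the whole cluster in O(log N) rounds, which is what makes gossip cheap at scale.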
Gossip Properties
# View gossip state
nodetool gossipinfo

# Example output:
/10.0.0.1
generation: 1234567890
version: 42
heartbeat: 12345
DC: dc1
RACK: rack1
STATUS: NORMAL
RELEASE_VERSION: 4.0.0
Storage Engine
Write Path
When you write to Cassandra:
Write Request
      │
      ▼
┌──────────────────┐
│   Commit Log     │  (Durable write)
└──────────────────┘
      │
      ▼
┌──────────────────┐
│    Memtable      │  (In-memory)
└──────────────────┘
      │  flush
      ▼
┌──────────────────┐
│    SSTable       │  (On-disk, immutable)
└──────────────────┘
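The three steps above can be modeled with a toy store in Python. This is a simplified sketch of the ordering (durability first, then memory, then an immutable flush), not Cassandra's storage code; the class name and threshold are invented:

```python
# Hypothetical sketch of the write path: commit log -> memtable -> SSTable.

class TinyStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []          # durable, append-only (sequential writes)
        self.memtable = {}            # in-memory; sorted when flushed
        self.sstables = []            # immutable sorted runs on "disk"
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability first
        self.memtable[key] = value            # 2. then the memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. the memtable is written out as a sorted, immutable SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # segments covering flushed data can be recycled

store = TinyStore()
for k, v in [("c", 3), ("a", 1), ("b", 2)]:
    store.write(k, v)
print(store.sstables)  # [[('a', 1), ('b', 2), ('c', 3)]]
```

Note that the write itself never touches disk randomly: the commit log append is sequential and the memtable is in memory, which is why Cassandra writes are so fast.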
Commit Log
# Commit log stores every write for durability
# Appended sequentially, so it is very fast
# cassandra.yaml configuration:
# commitlog_directory: /var/lib/cassandra/commitlog
# commitlog_sync: periodic (default) or batch
Memtable
# Memtable: in-memory structure, sorted by partition key
# Flushed to an SSTable when its space threshold is reached
# cassandra.yaml configuration:
# memtable_heap_space_in_mb: on-heap memtable space threshold
# memtable_flush_writers: number of flush writer threads
SSTable (Sorted String Table)
SSTables are immutable, sorted files on disk:
SSTable Structure:
┌──────────────────────┐
│ Bloom Filter         │  (fast check whether a key might exist)
├──────────────────────┤
│ Partition Index      │  (partition key index)
├──────────────────────┤
│ Partition Summary    │  (in-memory sample of the index)
├──────────────────────┤
│ Data                 │  (the actual row data)
├──────────────────────┤
│ Compression Chunk    │  (compressed blocks)
└──────────────────────┘
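The Bloom filter is the component that makes reads cheap: a "no" answer is definitive, so Cassandra can skip SSTables that cannot contain the key. A minimal sketch in Python (illustrative only; Cassandra's actual filter is sized from `bloom_filter_fp_chance`, and the hashing here is simplified):

```python
import hashlib

# Minimal Bloom filter sketch. "might_contain" can return false positives,
# but never false negatives, so a negative answer safely skips an SSTable.

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = 0  # bit array packed into one int

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests (simplified).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))  # True
print(bf.might_contain("user:99"))  # almost certainly False
```

The trade-off is purely memory versus false-positive rate: more bits and hash functions mean fewer wasted SSTable reads.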
Compaction
Compaction merges SSTables:
-- Size-Tiered Compaction (STCS)
-- Merges similar-sized SSTables
-- Leveled Compaction (LCS)
-- Fixed-size levels, better for reads
-- Time-Window Compaction (TWCS)
-- For time-series data
Data Distribution
Virtual Nodes
Each physical node owns multiple token ranges (vnodes):
# cassandra.yaml
num_tokens: 256  # Vnodes per physical node (default before 4.0; 4.0 lowered the default to 16)
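Token assignment can be sketched as a sorted ring: each node contributes several tokens, and a key belongs to the node owning the first token at or after the key's hash, wrapping around. This sketch uses SHA-256 for brevity where Cassandra uses Murmur3, and the node/vnode names are invented:

```python
import bisect
import hashlib

# Illustrative token-ring sketch (Cassandra's Murmur3Partitioner replaced
# here with SHA-256 for simplicity).

def token(value, space=2**64):
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big") % space

# 3 nodes x 4 vnodes each (a real cluster would use num_tokens vnodes)
ring = sorted(
    (token(f"{node}-vnode-{i}"), node)
    for node in ("node1", "node2", "node3")
    for i in range(4)
)
tokens = [t for t, _ in ring]

def owner(key):
    # First token >= hash(key); wrap to the start of the ring past the end.
    idx = bisect.bisect_right(tokens, token(key)) % len(ring)
    return ring[idx][1]

print(owner("user:42"))  # deterministic, always the same node for this key
```

Because each physical node holds many small, scattered ranges, adding or removing a node rebalances load across the whole cluster instead of burdening one neighbor.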
Partition Key
The partition key determines data distribution:
-- Simple partition key
CREATE TABLE users (
user_id UUID,
name TEXT,
PRIMARY KEY (user_id)
);
-- Hash(user_id) determines node
-- Composite partition key
CREATE TABLE orders (
user_id UUID,
order_id TIMEUUID,
total DECIMAL,
PRIMARY KEY ((user_id), order_id)
);
-- Hash(user_id) determines primary node
Replication Strategy
-- SimpleStrategy (single datacenter)
CREATE KEYSPACE myapp
WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
-- NetworkTopologyStrategy (multi-datacenter)
CREATE KEYSPACE myapp
WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 3
};
Consistency
Tunable Consistency
Cassandra allows you to tune consistency per query:
-- Consistency levels:
-- ANY          -- Any node, including a stored hint (writes only)
-- ONE          -- One replica
-- TWO          -- Two replicas
-- THREE        -- Three replicas
-- QUORUM       -- Majority of replicas
-- LOCAL_ONE    -- One replica in the local DC
-- LOCAL_QUORUM -- Quorum in the local DC
-- EACH_QUORUM  -- Quorum in each DC (writes only)
-- ALL          -- All replicas
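The quorum arithmetic is simple and worth internalizing: with replication factor RF, a quorum is floor(RF/2) + 1, and reads are strongly consistent whenever the read and write replica counts overlap (R + W > RF). A quick check:

```python
# Quorum arithmetic: QUORUM = floor(RF/2) + 1.
# When R + W > RF, every read set intersects every write set on at
# least one replica, which is why QUORUM reads + QUORUM writes give
# strong consistency.

def quorum(rf):
    return rf // 2 + 1

rf = 3
r = w = quorum(rf)
print(quorum(3))   # 2
print(quorum(5))   # 3
print(r + w > rf)  # True: read and write sets must intersect
```

This is also why QUORUM on RF=3 tolerates one replica being down, while ALL tolerates none.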
-- Set consistency for the cqlsh session
CONSISTENCY QUORUM;

-- CQL has no per-statement consistency clause; in application code,
-- set it on the statement through the driver, e.g. with the Java driver:
-- statement.setConsistencyLevel(ConsistencyLevel.ONE)
Consistency vs Availability
CAP Theorem:
- Cassandra favors availability and partition tolerance (AP), with tunable consistency
- Per query, you trade one for the other:
  - Stronger consistency: more replicas must respond (QUORUM, ALL)
  - Higher availability: fewer replicas must respond (ONE, ANY)
Read Path
When you read from Cassandra:
Read Request
      │
      ▼
┌──────────────────┐
│    Memtable      │──► Check in-memory data first
└──────────────────┘
      │
      ▼
┌──────────────────┐
│  Bloom Filter    │──► If "no", skip that SSTable entirely
└──────────────────┘
      │
      ▼
┌──────────────────┐
│    SSTables      │──► Merge results (newest timestamp wins)
└──────────────────┘
      │
      ▼
┌──────────────────┐
│  Return Result   │
└──────────────────┘
Read Repair
Read repair ensures consistency:
-- On a read, the coordinator:
-- 1. Queries the number of replicas required by the consistency level
-- 2. Compares the data (or digests) returned by those replicas
-- 3. Pushes the newest version to any out-of-date replicas
-- 4. Returns the most recent data to the client
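Reconciliation is last-write-wins on the cell's write timestamp. A sketch of what the coordinator does (illustrative; the replica addresses and values are made up):

```python
# Last-write-wins reconciliation sketch, as used during read repair.
# Each replica returns (value, write_timestamp); the coordinator picks
# the newest value and notes which replicas need repairing.

def reconcile(replica_responses):
    # replica_responses: {replica_address: (value, timestamp)}
    _newest_replica, (newest_value, newest_ts) = max(
        replica_responses.items(), key=lambda item: item[1][1]
    )
    stale = [r for r, (_, ts) in replica_responses.items() if ts < newest_ts]
    return newest_value, stale  # value for the client, replicas to repair

value, to_repair = reconcile({
    "10.0.0.1": ("alice@new.example", 1700000005),
    "10.0.0.2": ("alice@old.example", 1700000001),
    "10.0.0.3": ("alice@new.example", 1700000005),
})
print(value)      # alice@new.example
print(to_repair)  # ['10.0.0.2']
```

Because reconciliation is timestamp-based, clock skew between clients can silently pick a "wrong" winner, which is one reason synchronized clocks matter in a Cassandra deployment.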
Anti-Entropy
Merkle Trees
Cassandra uses merkle trees for consistency checks:
# Run repair to compare merkle trees
nodetool repair mykeyspace
-- Repair process:
-- 1. Each replica builds a Merkle tree over its data for the token range
-- 2. Trees are compared between replicas
-- 3. Replicas stream only the differing ranges to each other
-- 4. All replicas end up consistent
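The point of the Merkle tree is that replicas can detect disagreement by comparing a single root hash, then drill down to find exactly which ranges differ. A minimal sketch (illustrative; Cassandra builds its trees over token ranges, not row lists):

```python
import hashlib

# Merkle-tree comparison sketch. Equal roots mean the data ranges agree;
# differing roots mean the replicas must walk down the tree and exchange
# only the mismatching subranges.

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(leaf.encode()) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd-sized levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = ["row1=v1", "row2=v2", "row3=v3", "row4=v4"]
replica_b = ["row1=v1", "row2=v2", "row3=STALE", "row4=v4"]

print(merkle_root(replica_a) == merkle_root(replica_a))  # True
print(merkle_root(replica_a) == merkle_root(replica_b))  # False -> repair needed
```

Comparing hashes instead of rows keeps the network cost of `nodetool repair` proportional to the amount of divergence, not the size of the dataset.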
Hinted Handoff
When a replica is down, Cassandra stores hints:
# Enable hinted handoff (cassandra.yaml)
# hinted_handoff_enabled: true
# Maximum window for storing hints
# max_hint_window_in_ms: 10800000  # 3 hours (default)
# Hint storage location
# hints_directory: /var/lib/cassandra/hints
Conclusion
Understanding Cassandra's internals, from the gossip protocol and storage engine to tunable consistency, helps you design better applications and troubleshoot issues. These fundamentals are essential for running Cassandra in production.
In the next article, we’ll explore Cassandra 5.0 features and the evolving ecosystem.