
Cassandra Internals: Storage Engine, Consistency, and Data Distribution

Introduction

Understanding Cassandra’s internal architecture helps you design better data models, troubleshoot issues, and optimize performance. This article explores how Cassandra achieves distributed, fault-tolerant storage.


Distributed Architecture

Peer-to-Peer Design

Cassandra uses a peer-to-peer distributed architecture:

┌───────────────────────────────────────────┐
│            Cassandra Cluster              │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐  │
│  │  Node 1 │◄─►│  Node 2 │◄─►│  Node 3 │  │
│  └─────────┘   └─────────┘   └─────────┘  │
│       │             │             │       │
│       └─────────────┼─────────────┘       │
│                     │                     │
│              Gossip Protocol              │
└───────────────────────────────────────────┘

No Single Point of Failure

Every node in a Cassandra cluster plays the same role:

  • Any node can act as the coordinator for any request
  • There is no master node
  • Data is replicated across multiple nodes

Gossip Protocol

How Gossip Works

Gossip is Cassandra’s peer-to-peer communication protocol:

// Simplified gossip process
// Each node periodically exchanges state with 1-3 random nodes

// State includes:
// - Node location
// - Load information
// - Schema version
// - Token range ownership
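The exchange described above can be sketched in a few lines of Python. The data shapes and the `gossip_round`/`merge` helpers are illustrative, not Cassandra's actual classes:

```python
import random

# Illustrative sketch of gossip: each node keeps a version-stamped view of
# every peer and merges in whatever is newer.

def merge(dst, src):
    """Keep the entry with the higher version for each endpoint."""
    for endpoint, (version, state) in src.items():
        if endpoint not in dst or dst[endpoint][0] < version:
            dst[endpoint] = (version, state)

def gossip_round(cluster, node_id):
    """Node `node_id` exchanges state with up to 3 random peers."""
    peers = [n for n in cluster if n != node_id]
    for peer in random.sample(peers, k=min(3, len(peers))):
        merge(cluster[peer], cluster[node_id])   # push our view
        merge(cluster[node_id], cluster[peer])   # pull their view

# Each node initially knows only about itself.
cluster = {n: {n: (1, "NORMAL")} for n in ("10.0.0.1", "10.0.0.2", "10.0.0.3")}
for _ in range(5):                 # a few rounds spread knowledge everywhere
    for node in cluster:
        gossip_round(cluster, node)
assert all(len(view) == 3 for view in cluster.values())
```

The key property this models is that knowledge spreads epidemically: no node ever needs a full membership broadcast.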

Gossip Properties

# View gossip state
nodetool gossipinfo

# Output example:
/10.0.0.1
  generation: 1234567890
  version: 42
  heartbeat: 12345
  DC: dc1
  RACK: rack1
  STATUS: NORMAL
  RELEASE_VERSION: 4.0.0

Storage Engine

Write Path

When you write to Cassandra:

Write Request
    │
    ▼
┌──────────────────┐
│  Commit Log      │  (Durable write)
└──────────────────┘
    │
    ▼
┌──────────────────┐
│   Memtable       │  (In-memory)
└──────────────────┘
    │
    ▼
┌──────────────────┐
│   SSTable        │  (On-disk, immutable; written on flush)
└──────────────────┘

Commit Log

-- The commit log records every write for durability
-- Appended sequentially, so writes are very fast

-- Configuration (cassandra.yaml)
-- commitlog_directory: /var/lib/cassandra/commitlog
-- commitlog_sync: periodic (default) or batch
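As a rough illustration of why an append-only log is fast and durable, here is a minimal write-ahead log in Python. The length-prefixed record format is an assumption for this sketch, not Cassandra's real segment layout:

```python
import os
import struct
import tempfile

# Minimal write-ahead-log sketch: append length-prefixed records
# sequentially, fsync, then acknowledge. Cassandra's actual commit log
# segment format is more involved.

class CommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, mutation: bytes):
        self.f.write(struct.pack(">I", len(mutation)) + mutation)
        self.f.flush()
        os.fsync(self.f.fileno())        # durable before acknowledging

    @staticmethod
    def replay(path):
        """Recover all mutations after a crash, in write order."""
        records = []
        with open(path, "rb") as f:
            while header := f.read(4):
                (n,) = struct.unpack(">I", header)
                records.append(f.read(n))
        return records

path = os.path.join(tempfile.mkdtemp(), "commitlog.bin")
log = CommitLog(path)
log.append(b"INSERT user:1")
log.append(b"INSERT user:2")
assert CommitLog.replay(path) == [b"INSERT user:1", b"INSERT user:2"]
```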

Memtable

-- Memtable: in-memory write-back structure
-- Rows kept sorted by partition key
-- Flushed to an SSTable when a memory threshold is reached

-- memtable_heap_space: heap memory budget for memtables
-- memtable_cleanup_threshold: fill fraction that triggers a flush
-- memtable_flush_writers: number of concurrent flush writers
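A memtable can be sketched as a sorted in-memory map that snapshots itself into an immutable table on flush. The `flush_threshold` of three rows here is purely illustrative:

```python
import bisect

# Illustrative memtable: rows kept sorted in memory, flushed to an
# immutable, sorted "SSTable" snapshot once a size threshold is crossed.

class Memtable:
    def __init__(self, flush_threshold=3):
        self.keys, self.rows = [], {}
        self.flush_threshold = flush_threshold
        self.sstables = []               # flushed, immutable snapshots

    def write(self, key, row):
        if key not in self.rows:
            bisect.insort(self.keys, key)     # keep keys sorted
        self.rows[key] = row                  # overwrite: newest wins
        if len(self.keys) >= self.flush_threshold:
            self.flush()

    def flush(self):
        sstable = [(k, self.rows[k]) for k in self.keys]  # already sorted
        self.sstables.append(sstable)
        self.keys, self.rows = [], {}

mt = Memtable()
for k, v in [("c", 1), ("a", 2), ("b", 3)]:
    mt.write(k, v)
assert mt.sstables == [[("a", 2), ("b", 3), ("c", 1)]]
```

Because the memtable is already sorted, a flush is a single sequential write, which is what makes SSTable creation cheap.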

SSTable (Sorted String Table)

SSTables are immutable, sorted files on disk:

SSTable Structure:
┌───────────────────────────────────┐
│  Bloom Filter                     │  (fast check: might the key exist?)
├───────────────────────────────────┤
│  Partition Index                  │  (partition key index)
├───────────────────────────────────┤
│  Partition Summary                │  (in-memory sample of the index)
├───────────────────────────────────┤
│  Data                             │  (the actual rows)
├───────────────────────────────────┤
│  Compression Chunks               │  (compressed blocks)
└───────────────────────────────────┘
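The Bloom filter at the top of that layout can be sketched as follows. The sizes (`m`, `k`) are toy values, and Cassandra's real implementation differs:

```python
import hashlib

# Tiny Bloom filter sketch: k hash functions set k bits per key. A lookup
# that finds any of its bits clear proves the key is absent, so whole
# SSTables can be skipped without touching disk.

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
assert bf.might_contain("user:42")   # never a false negative
# Absent keys are usually reported absent; false positives are possible
# but only cost an unnecessary disk read, never a wrong answer.
```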

Compaction

Compaction merges SSTables and purges overwritten or deleted data:

-- Size-Tiered Compaction (STCS, the default)
-- Merges SSTables of similar size; good for write-heavy workloads

-- Leveled Compaction (LCS)
-- Fixed-size SSTables organized into levels; better read latency

-- Time-Window Compaction (TWCS)
-- Groups SSTables by time window; suited to time-series data with TTLs
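Size-tiered bucketing can be approximated like this. Real STCS compares sizes against the bucket average, so grouping by a fixed ratio is a simplification:

```python
# Simplified size-tiered bucketing: group SSTables whose sizes are within a
# factor of each other, and compact any bucket with >= min_threshold members.

def compaction_candidates(sstable_sizes, ratio=1.5, min_threshold=4):
    groups = []
    for size in sorted(sstable_sizes):
        if groups and size <= groups[-1][0] * ratio:
            groups[-1].append(size)      # close enough in size: same bucket
        else:
            groups.append([size])        # start a new bucket
    return [g for g in groups if len(g) >= min_threshold]

# Four ~100 MB tables qualify for compaction; the lone 1000 MB table does not.
sizes = [100, 110, 120, 130, 1000]
assert compaction_candidates(sizes) == [[100, 110, 120, 130]]
```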

Data Distribution

Virtual Nodes

Each physical node owns multiple token ranges (vnodes):

# cassandra.yaml
num_tokens: 16  # Virtual nodes per physical node (default 16 since 4.0, formerly 256)
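A vnode token ring can be sketched like this. Cassandra actually uses Murmur3 tokens, so the MD5-based `token` helper here is only illustrative:

```python
import bisect
import hashlib

# Illustrative vnode ring: each physical node owns several pseudo-random
# tokens; a key belongs to the first token clockwise from its hash.
# (Cassandra uses Murmur3; MD5 here is just for the sketch.)

def token(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

def build_ring(nodes, vnodes=8):
    """Each node contributes `vnodes` (token, node) entries, sorted by token."""
    return sorted((token(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

def owner(ring, key):
    tokens = [t for t, _ in ring]
    i = bisect.bisect_right(tokens, token(key)) % len(ring)
    return ring[i][1]

ring = build_ring(["node1", "node2", "node3"])
# The mapping is deterministic, and adding a node only moves the keys
# falling into that node's new token ranges.
assert owner(ring, "user:42") == owner(ring, "user:42")
assert owner(ring, "user:42") in {"node1", "node2", "node3"}
```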

Partition Key

The partition key determines data distribution:

-- Single-column partition key
CREATE TABLE users (
    user_id UUID,
    name TEXT,
    PRIMARY KEY (user_id)
);
-- Hash(user_id) determines the owning node

-- Compound primary key: partition key plus clustering column
CREATE TABLE orders (
    user_id UUID,
    order_id TIMEUUID,
    total DECIMAL,
    PRIMARY KEY ((user_id), order_id)
);
-- Hash(user_id) still determines the node; order_id only orders rows within
-- the partition. A composite partition key would be ((user_id, order_id)).

Replication Strategy

-- SimpleStrategy (single datacenter)
CREATE KEYSPACE myapp 
WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
};

-- NetworkTopologyStrategy (multi-datacenter)
CREATE KEYSPACE myapp 
WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
};
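SimpleStrategy's placement rule, walking the ring clockwise and taking the next RF distinct nodes, can be sketched as:

```python
# SimpleStrategy sketch: starting from the token range that owns the key,
# walk the ring clockwise and collect the next `rf` *distinct* physical
# nodes (skipping vnodes that belong to a node already chosen).

def replicas(ring, start_index, rf):
    found = []
    for offset in range(len(ring)):
        node = ring[(start_index + offset) % len(ring)][1]
        if node not in found:
            found.append(node)
        if len(found) == rf:
            break
    return found

# A ring of (token, node) pairs; node1 appears twice (two vnodes).
ring = [(10, "node1"), (20, "node2"), (30, "node1"), (40, "node3")]
assert replicas(ring, 2, 2) == ["node1", "node3"]   # duplicate vnode skipped
```

NetworkTopologyStrategy follows the same walk but additionally prefers replicas in distinct racks within each datacenter.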

Consistency

Tunable Consistency

Cassandra allows you to tune consistency per query:

-- Consistency levels:
-- ANY          -- Any node may acknowledge, including via a hint (writes only)
-- ONE          -- One replica
-- TWO          -- Two replicas
-- THREE        -- Three replicas
-- QUORUM       -- A majority of replicas (RF/2 + 1)
-- LOCAL_ONE    -- One replica in the local DC
-- LOCAL_QUORUM -- Quorum within the local DC
-- EACH_QUORUM  -- Quorum in each DC (writes only)
-- ALL          -- All replicas

-- Set the consistency level for a cqlsh session
CONSISTENCY QUORUM;

-- Note: CQL has no per-statement CONSISTENCY clause; in application code,
-- the level is set per statement through the client driver

Consistency vs Availability

CAP Theorem:

  • Cassandra favors availability and partition tolerance (AP), with tunable consistency
  • Per operation, you choose the trade-off:
    • Stronger consistency: more replicas must respond (QUORUM, ALL)
    • Higher availability: fewer replicas must respond (ONE, ANY)
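The trade-off can be stated numerically: a read of R replicas and a write of W replicas are guaranteed to overlap on at least one replica whenever R + W > RF, which is why QUORUM reads combined with QUORUM writes behave consistently. A sketch:

```python
# Quorum arithmetic: by the pigeonhole principle, any read set of size R and
# write set of size W drawn from RF replicas intersect when R + W > RF.

def quorum(rf: int) -> int:
    return rf // 2 + 1

def read_sees_latest_write(r: int, w: int, rf: int) -> bool:
    return r + w > rf

RF = 3
assert quorum(RF) == 2
assert read_sees_latest_write(quorum(RF), quorum(RF), RF)  # QUORUM + QUORUM
assert not read_sees_latest_write(1, 1, RF)                # ONE + ONE can be stale
```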

Read Path

When you read from Cassandra:

Read Request
    │
    ▼
┌──────────────────┐
│  Memtable        │──► Check in-memory data first
└──────────────────┘
    │
    ▼
┌──────────────────┐
│  Bloom Filters   │──► Per SSTable: "no" means skip that SSTable
└──────────────────┘
    │
    ▼
┌──────────────────┐
│  SSTables        │──► Read remaining candidates, merge by timestamp
└──────────────────┘
    │
    ▼
┌──────────────────┐
│  Return Result   │
└──────────────────┘

Read Repair

Read repair ensures consistency:

-- On a read, the coordinator:
-- 1. Queries the number of replicas required by the consistency level
-- 2. Compares the responses (full data and digests)
-- 3. If they disagree, writes the newest version to the stale replicas
-- 4. Returns the most recent data to the client
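Steps 2 to 4 amount to a last-write-wins merge on write timestamps, which can be sketched like this (replica names and value shapes are illustrative):

```python
# Read-repair sketch: replicas return (value, write_timestamp); the
# coordinator picks the newest value and pushes it to any stale replica.

def read_repair(replicas):
    """replicas: dict of replica name -> (value, timestamp). Mutated in place."""
    newest = max(replicas.values(), key=lambda vt: vt[1])
    stale = [name for name, vt in replicas.items() if vt != newest]
    for name in stale:
        replicas[name] = newest          # repair write to the stale replica
    return newest[0], stale

state = {
    "replica1": ("alice@new.example", 200),   # highest timestamp wins
    "replica2": ("alice@old.example", 100),
    "replica3": ("alice@new.example", 200),
}
value, repaired = read_repair(state)
assert value == "alice@new.example"
assert repaired == ["replica2"]
assert len(set(state.values())) == 1     # all replicas now agree
```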

Anti-Entropy

Merkle Trees

Cassandra uses Merkle trees to detect divergent replicas efficiently:

# Run repair to compare Merkle trees
nodetool repair mykeyspace

# Repair process:
# 1. Each replica builds a Merkle tree over its data ranges
# 2. Trees are exchanged and compared between replicas
# 3. Only the ranges whose hashes differ are streamed
# 4. Replicas converge to a consistent state
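A minimal Merkle tree and range comparison might look like this; hashing each token range into a single leaf is a simplification of what repair actually does:

```python
import hashlib

# Merkle-tree sketch: hash leaf ranges, then hash pairs upward to a root.
# Two replicas compare roots, and recurse only into subtrees that differ,
# so matching data costs a handful of hash comparisons instead of a full scan.

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle(leaves):
    """Return the list of levels: leaves first, root last."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

a = merkle([b"range0", b"range1", b"range2", b"range3"])
b = merkle([b"range0", b"DIFFERS", b"range2", b"range3"])
assert a[-1] != b[-1]                    # roots differ: some range is stale
diff = [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]
assert diff == [1]                       # only range 1 needs streaming
```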

Hinted Handoff

When a replica is down, Cassandra stores hints:

-- Enable hinted handoff (cassandra.yaml)
-- hinted_handoff_enabled: true

-- Maximum window for storing hints
-- max_hint_window: 3h  (max_hint_window_in_ms in older versions)

-- Hint storage
-- hints_directory: /var/lib/cassandra/hints
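Hint storage and replay can be sketched like this; the helper names and the in-memory dict are illustrative, since real hints are persisted to the hints directory:

```python
import time

# Hinted-handoff sketch: writes destined for a down replica are stored as
# hints on the coordinator and replayed when the replica returns, unless
# the hint window (3 hours by default) has already expired.

HINT_WINDOW = 3 * 60 * 60    # seconds

def store_hint(hints, replica, mutation, now):
    hints.setdefault(replica, []).append((mutation, now))

def replay_hints(hints, replica, now):
    """Deliver unexpired hints to a replica that just came back up."""
    return [m for m, t in hints.pop(replica, []) if now - t <= HINT_WINDOW]

hints = {}
t0 = time.time()
store_hint(hints, "10.0.0.3", "INSERT user:1", t0)
store_hint(hints, "10.0.0.3", "INSERT user:2", t0 - 4 * 60 * 60)  # too old
assert replay_hints(hints, "10.0.0.3", t0) == ["INSERT user:1"]
assert "10.0.0.3" not in hints
```

Once the window expires, the hint is dropped and a full repair is needed to bring the replica back in sync, which is why long outages should be followed by `nodetool repair`.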

Conclusion

Understanding Cassandra’s internals, from the gossip protocol through the storage engine to tunable consistency, helps you design better applications and troubleshoot issues. These fundamentals are essential for running Cassandra in production.

In the next article, we’ll explore Cassandra 5.0 features and the evolving ecosystem.
