Apache Cassandra: The Complete Guide to Distributed NoSQL Database

Introduction

Apache Cassandra is a distributed, decentralized, elastically scalable, highly available, fault-tolerant NoSQL database with tunable consistency. Originally developed at Facebook to power its Inbox Search feature, Cassandra was open-sourced in 2008 and became an Apache top-level project in 2010.

In 2026, Cassandra continues to be a go-to choice for applications requiring high write throughput, linear scalability, and fault tolerance across multiple data centers. This guide covers installation, CQL, data modeling, consistency levels, indexes, materialized views, and Python integration.

What is Apache Cassandra?

Cassandra is a distributed NoSQL database designed for:

High Write Throughput: Optimized for write-heavy workloads
Linear Scalability: Add nodes to increase capacity
High Availability: No single point of failure
Tunable Consistency: Balance between consistency and availability
Multi-Datacenter Support: Cross-datacenter replication

Cassandra vs Traditional Databases

Feature	Cassandra	Traditional RDBMS
Data Model	Wide-column	Row-based
Scaling	Horizontal	Vertical + Limited Horizontal
Query Language	CQL (similar to SQL)	SQL
Consistency	Tunable	Strong
Joins	Not supported	Supported
Transactions	Limited	Full ACID
Schema	Flexible	Fixed

Installation

Docker Installation

# Start Cassandra container
docker run --name cassandra \
  -d cassandra:latest

# Connect to Cassandra
docker exec -it cassandra cqlsh

# Start with specific version
docker run --name cassandra-4.0 \
  -d cassandra:4.0

Package Installation

# Add Apache repository
echo "deb https://downloads.apache.org/cassandra/debian 40x main" | sudo tee /etc/apt/sources.list.d/cassandra.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra

# Start Cassandra
sudo service cassandra start

# Check status
nodetool status

Cassandra Query Language (CQL)

Keyspaces and Tables

-- Create keyspace
CREATE KEYSPACE myapp 
WITH REPLICATION = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
};

-- Or with NetworkTopologyStrategy for multi-DC
CREATE KEYSPACE myapp 
WITH REPLICATION = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2': 3
};

-- Use keyspace
USE myapp;

-- Create table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    username TEXT,
    email TEXT,
    created_at TIMESTAMP
);

-- Create table with a compound primary key (partition key user_id, clustering columns created_at and order_id)
CREATE TABLE orders (
    order_id UUID,
    user_id UUID,
    status TEXT,
    total DECIMAL,
    created_at TIMESTAMP,
    PRIMARY KEY ((user_id), created_at, order_id)
) WITH CLUSTERING ORDER BY (created_at DESC, order_id DESC);

Data Types

-- Basic types
TEXT            -- String
INT             -- 32-bit integer
BIGINT          -- 64-bit integer
UUID            -- Universally Unique Identifier
TIMEUUID        -- Time-based UUID
BOOLEAN         -- True/False
DECIMAL         -- Arbitrary precision
BLOB            -- Binary data
TIMESTAMP       -- Date and time

-- Collections
LIST<T>         -- Ordered list
SET<T>          -- Unique unordered set
MAP<K,V>        -- Key-value pairs

-- Examples
CREATE TABLE user_profiles (
    user_id UUID PRIMARY KEY,
    username TEXT,
    tags SET<TEXT>,
    preferences MAP<TEXT, TEXT>,
    phone_numbers LIST<TEXT>
);

CRUD Operations

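-- INSERT (upsert)
-- now() returns a TIMEUUID, so convert it for the TIMESTAMP column
INSERT INTO users (user_id, username, email, created_at)
VALUES (uuid(), 'john_doe', 'john_doe@example.com', toTimestamp(now()));

-- INSERT with TTL (expire after 3600 seconds)
-- (assumes a sessions table with session_id and data columns)
INSERT INTO sessions (session_id, data)
VALUES (uuid(), 'session_data')
USING TTL 3600;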

-- SELECT
SELECT * FROM users;
SELECT username, email FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- SELECT with LIMIT
SELECT * FROM orders LIMIT 100;

-- SELECT with ordering
SELECT * FROM orders 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
ORDER BY created_at DESC;

-- UPDATE
UPDATE users 
SET email = 'john.doe@example.com' 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- UPDATE collections (tags and preferences are defined on user_profiles)
UPDATE user_profiles 
SET tags = tags + {'developer'} 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

UPDATE user_profiles 
SET preferences['theme'] = 'dark' 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- DELETE
DELETE FROM users 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- DELETE a single column (here, the entire tags collection)
DELETE tags FROM user_profiles 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
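
The "upsert" label on INSERT means that writing to an existing primary key does not fail; it overwrites the columns named in the statement, and the write with the newest timestamp wins per column. The following is a toy Python model of that behavior, not the storage engine itself. The `upsert` and `read` helpers and the integer timestamps are hypothetical names chosen for illustration, and ties between equal timestamps are not modeled.

```python
# Toy model of upsert semantics: each cell stores (value, write_timestamp),
# and a newer write to the same cell replaces an older one.

def upsert(table, key, columns, timestamp):
    """Merge columns into the row for key, keeping the newest cell per column."""
    row = table.setdefault(key, {})
    for name, value in columns.items():
        current = row.get(name)
        if current is None or timestamp > current[1]:
            row[name] = (value, timestamp)
    return row


def read(table, key):
    """Return the currently visible column values for key."""
    return {name: cell[0] for name, cell in table.get(key, {}).items()}


users = {}
upsert(users, "u1", {"username": "john_doe", "email": "old@example.com"}, timestamp=1)
# A second write to the same key overwrites only the columns it names.
upsert(users, "u1", {"email": "new@example.com"}, timestamp=2)
# A write carrying an older timestamp loses to the newer cell.
upsert(users, "u1", {"email": "stale@example.com"}, timestamp=0)

print(read(users, "u1"))  # {'username': 'john_doe', 'email': 'new@example.com'}
```

The practical consequence is that INSERT and UPDATE cannot be used to detect whether a row already existed.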

Data Modeling

Partition Keys

The partition key determines data distribution across nodes:

-- Simple partition key
CREATE TABLE events (
    event_id TIMEUUID,
    event_type TEXT,
    data TEXT,
    PRIMARY KEY (event_type, event_id)
);

-- Composite partition key
CREATE TABLE user_events (
    user_id UUID,
    event_type TEXT,
    event_id TIMEUUID,
    data TEXT,
    PRIMARY KEY ((user_id, event_type), event_id)
);
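
To make "determines data distribution" concrete, here is a simplified Python sketch of a token ring: the partition key is hashed to a token, and replicas are the next nodes clockwise from that token. It rests on several assumptions: MD5 stands in for Cassandra's default partitioner, each node owns a single token, and the `Ring` class and node names are hypothetical. Real clusters usually assign many tokens per node and apply the keyspace's replication strategy.

```python
import bisect
import hashlib


def token(partition_key):
    """Map a partition key to an integer token (MD5 used as a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """A toy token ring with one token per node."""

    def __init__(self, nodes):
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key, replication_factor):
        """Walk clockwise from the key's token and collect distinct nodes."""
        start = bisect.bisect_left(self.tokens, token(partition_key))
        count = min(replication_factor, len(self.entries))
        size = len(self.entries)
        return [self.entries[(start + i) % size][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
for key in ["login", "purchase", "logout"]:
    # Every row with the same partition key lands on the same replicas.
    print(key, "->", ring.replicas(key, replication_factor=3))
```

Because placement depends only on the hash of the partition key, a key with few distinct values (such as event_type above) concentrates data on few replicas, which is one reason to consider a composite partition key.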

Clustering Columns

Clustering columns determine data ordering within a partition:

-- Clustering for time-series
CREATE TABLE sensor_data (
    sensor_id TEXT,
    timestamp TIMESTAMP,
    temperature DECIMAL,
    humidity DECIMAL,
    PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Multiple clustering columns
CREATE TABLE user_posts (
    user_id UUID,
    created_at TIMESTAMP,
    post_id TIMEUUID,
    title TEXT,
    content TEXT,
    PRIMARY KEY (user_id, created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id DESC);
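
The effect of clustering columns can be pictured as a sorted list per partition: rows are kept in clustering order on write, so "latest N readings for a sensor" becomes a short slice rather than a sort at read time. The Python sketch below illustrates the idea only; the `Partitions` class and the sample readings are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import datetime


class Partitions:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        self.rows = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # insort keeps each partition's list ordered as rows arrive.
        bisect.insort(self.rows[partition_key], (clustering_key, value))

    def latest(self, partition_key, limit):
        # Reading from the end mimics CLUSTERING ORDER BY (... DESC).
        return [value for _, value in reversed(self.rows[partition_key])][:limit]


store = Partitions()
store.insert("sensor-1", datetime(2026, 1, 1, 12, 0), 21.5)
store.insert("sensor-1", datetime(2026, 1, 1, 12, 10), 22.1)
store.insert("sensor-1", datetime(2026, 1, 1, 12, 5), 21.8)
store.insert("sensor-2", datetime(2026, 1, 1, 12, 0), 19.0)

print(store.latest("sensor-1", limit=2))  # [22.1, 21.8]
```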

Query-First Design

Cassandra requires you to design tables based on your query patterns:

-- Query 1: Get all orders for a user
CREATE TABLE orders_by_user (
    user_id UUID,
    order_id TIMEUUID,
    total DECIMAL,
    status TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);

-- Query 2: Get all orders by status (requires separate table)
CREATE TABLE orders_by_status (
    status TEXT,
    order_id TIMEUUID,
    user_id UUID,
    total DECIMAL,
    created_at TIMESTAMP,
    PRIMARY KEY (status, order_id)
);

-- Query 3: Get recent orders
CREATE TABLE recent_orders (
    year INT,
    month INT,
    order_id TIMEUUID,
    user_id UUID,
    total DECIMAL,
    PRIMARY KEY ((year, month), order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);
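
With one table per query, the application is responsible for writing each order to all three tables. A minimal Python sketch of that fan-out follows. It only builds the statements and parameters, in the `?` placeholder style used for prepared statements; the `order_writes` helper is a hypothetical name, and executing the writes (for example in a batch) is left out.

```python
import uuid
from datetime import datetime, timezone
from decimal import Decimal


def order_writes(user_id, order_id, total, status, created_at):
    """Return (cql, params) pairs, one per query table defined above."""
    return [
        (
            "INSERT INTO orders_by_user (user_id, order_id, total, status, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (user_id, order_id, total, status, created_at),
        ),
        (
            "INSERT INTO orders_by_status (status, order_id, user_id, total, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (status, order_id, user_id, total, created_at),
        ),
        (
            # The (year, month) bucket is derived from the order's creation time.
            "INSERT INTO recent_orders (year, month, order_id, user_id, total) "
            "VALUES (?, ?, ?, ?, ?)",
            (created_at.year, created_at.month, order_id, user_id, total),
        ),
    ]


writes = order_writes(
    user_id=uuid.UUID("123e4567-e89b-12d3-a456-426614174000"),
    order_id=uuid.uuid1(),  # version 1 UUID, matching the TIMEUUID column
    total=Decimal("100.00"),
    status="pending",
    created_at=datetime(2026, 3, 14, 9, 30, tzinfo=timezone.utc),
)
for cql, params in writes:
    print(cql.split("(")[0].strip(), "->", len(params), "values")
```

Keeping the fan-out in one function makes it harder for the tables to drift apart when a column is added later.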

Consistency Levels

-- Check current consistency (CONSISTENCY is a cqlsh command)
CONSISTENCY;

-- Set consistency level
CONSISTENCY QUORUM;

-- Consistency levels (how many replicas must acknowledge a request):
-- ANY           -- Writes only: any node, even if it only stores a hint
-- ONE           -- One replica
-- TWO           -- Two replicas
-- THREE         -- Three replicas
-- QUORUM        -- Majority of all replicas (floor(replication_factor / 2) + 1)
-- LOCAL_ONE     -- One replica in the local DC
-- LOCAL_QUORUM  -- Majority of replicas in the local DC
-- EACH_QUORUM   -- Majority of replicas in each DC
-- ALL           -- All replicas
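
The quorum arithmetic is easy to check by hand. The Python sketch below computes quorum sizes and tests the usual rule of thumb that a read is guaranteed to overlap the latest acknowledged write when read replicas plus write replicas exceed the replication factor. The function names are hypothetical, and the two-datacenter layout mirrors the NetworkTopologyStrategy example earlier in this guide.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_overlaps_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                                        # 2
print(read_overlaps_write(quorum(rf), quorum(rf), rf))   # True  (QUORUM + QUORUM)
print(read_overlaps_write(1, 1, rf))                     # False (ONE + ONE)
print(read_overlaps_write(1, rf, rf))                    # True  (read ONE, write ALL)

# Multi-DC: QUORUM counts replicas across all DCs, LOCAL_QUORUM only the local DC.
datacenters = {"dc1": 3, "dc2": 3}
print(quorum(sum(datacenters.values())))                 # 4
print({dc: quorum(n) for dc, n in datacenters.items()})  # {'dc1': 2, 'dc2': 2}
```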

Secondary Indexes

-- Create secondary index
CREATE INDEX idx_user_email ON users(email);

-- Create index on collection (tags is defined on user_profiles)
CREATE INDEX idx_user_tags ON user_profiles(tags);

-- Query using secondary index
SELECT * FROM users WHERE email = 'john_doe@example.com';

-- Note: Secondary indexes work best on low-cardinality columns;
-- for near-unique values such as email, a dedicated lookup table usually scales better

Materialized Views

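-- Create materialized view on the orders table from "Keyspaces and Tables"
-- (every primary key column of the base table must appear in the view's primary key;
-- materialized views are marked experimental and may need to be enabled in cassandra.yaml)
CREATE MATERIALIZED VIEW user_orders_view AS
SELECT * FROM orders
WHERE user_id IS NOT NULL
AND created_at IS NOT NULL
AND order_id IS NOT NULL
PRIMARY KEY (user_id, order_id, created_at);

-- Cassandra propagates base-table writes to the view
INSERT INTO orders (user_id, created_at, order_id, total, status)
VALUES (123e4567-e89b-12d3-a456-426614174000, toTimestamp(now()), uuid(), 100.00, 'pending');

-- The view includes the new row once the update has propagated
SELECT * FROM user_orders_view 
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;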

Python Integration

Using cassandra-driver

import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster

# Connect to Cassandra
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('myapp')

# Simple query (uuid() and now() are CQL functions evaluated by the server)
session.execute("""
    INSERT INTO users (user_id, username, email, created_at)
    VALUES (uuid(), 'john', 'john@example.com', toTimestamp(now()))
""")

# Prepared statement
prepared = session.prepare("""
    INSERT INTO users (user_id, username, email, created_at)
    VALUES (?, ?, ?, ?)
""")

# Bind Python values: uuid.uuid4() for UUID, a UTC datetime for TIMESTAMP
session.execute(prepared, [uuid.uuid4(), 'jane', 'jane@example.com', datetime.now(timezone.utc)])

# Select data
rows = session.execute("SELECT * FROM users")
for row in rows:
    print(row.username, row.email)

# Close connection
cluster.shutdown()

Using DataStax Driver

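# The DataStax Python driver is published on PyPI as cassandra-driver,
# so it is imported from the same cassandra package shown above
from cassandra.cluster import Cluster, NoHostAvailable

# Same API as in the previous section
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('myapp')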

Conclusion

Cassandra provides a powerful distributed database solution for applications requiring high write throughput and linear scalability. Understanding CQL, data modeling with partition keys, and clustering columns is essential for building efficient Cassandra applications.

In the next article, we’ll explore Cassandra operations: backup strategies, repair operations, cluster management, and monitoring.
