Introduction
Apache Cassandra is a distributed, decentralized, elastically scalable, highly available, fault-tolerant, and tunably consistent NoSQL database system. Originally developed at Facebook to power its Inbox Search feature, Cassandra was open-sourced in 2008 and became an Apache top-level project in 2010.
In 2026, Cassandra remains a common choice for applications that need high write throughput, linear scalability, and fault tolerance across multiple data centers. This guide covers installation, CQL, data modeling, consistency levels, secondary indexes, materialized views, and Python integration.
What is Apache Cassandra?
Cassandra is a distributed NoSQL database designed for:
- High Write Throughput: Optimized for write-heavy workloads
- Linear Scalability: Add nodes to increase capacity
- High Availability: No single point of failure
- Tunable Consistency: Balance between consistency and availability
- Multi-Datacenter Support: Cross-datacenter replication
Cassandra vs Traditional Databases
| Feature | Cassandra | Traditional RDBMS |
|---|---|---|
| Data Model | Wide-column | Row-based |
| Scaling | Horizontal | Vertical + Limited Horizontal |
| Query Language | CQL (similar to SQL) | SQL |
| Consistency | Tunable | Strong |
| Joins | Not supported | Supported |
| Transactions | Limited | Full ACID |
| Schema | Defined per table; columns can be added online | Fixed |
Installation
Docker Installation
# Start Cassandra container (publish the CQL port so local clients can connect)
docker run --name cassandra \
-p 9042:9042 \
-d cassandra:latest
# Connect to Cassandra
docker exec -it cassandra cqlsh
# Start with specific version
docker run --name cassandra-4.0 \
-d cassandra:4.0
Package Installation
# Add Apache repository
echo "deb https://downloads.apache.org/cassandra/debian 40x main" | sudo tee /etc/apt/sources.list.d/cassandra.list
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
sudo apt-get update
sudo apt-get install cassandra
# Start Cassandra
sudo service cassandra start
# Check status
nodetool status
Cassandra Query Language (CQL)
Keyspaces and Tables
-- Create keyspace
CREATE KEYSPACE myapp
WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 3
};
-- Or with NetworkTopologyStrategy for multi-DC
CREATE KEYSPACE myapp
WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'dc1': 3,
'dc2': 3
};
-- Use keyspace
USE myapp;
-- Create table
CREATE TABLE users (
user_id UUID PRIMARY KEY,
username TEXT,
email TEXT,
created_at TIMESTAMP
);
-- Create table with a compound primary key (partition key plus clustering columns)
CREATE TABLE orders (
order_id UUID,
user_id UUID,
status TEXT,
total DECIMAL,
created_at TIMESTAMP,
PRIMARY KEY ((user_id), created_at, order_id)
) WITH CLUSTERING ORDER BY (created_at DESC, order_id DESC);
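When keyspaces are created from deployment scripts, the replication map is often assembled in code rather than typed by hand. The following is a minimal sketch in plain Python of how the two strategies shown above could be rendered. The helper name `keyspace_ddl` and its parameters are hypothetical and not part of any driver, and the sketch does not validate or escape the names it is given.

```python
def keyspace_ddl(name, replication):
    """Build a CREATE KEYSPACE statement (hypothetical helper).

    `replication` is either an int (SimpleStrategy replication factor)
    or a dict mapping data center names to replica counts
    (NetworkTopologyStrategy).
    """
    if isinstance(replication, int):
        options = {"class": "SimpleStrategy", "replication_factor": replication}
    else:
        options = {"class": "NetworkTopologyStrategy", **replication}

    def render(value):
        # Strings are single-quoted in CQL map literals; numbers are bare.
        return f"'{value}'" if isinstance(value, str) else str(value)

    body = ", ".join(f"'{key}': {render(value)}" for key, value in options.items())
    return f"CREATE KEYSPACE IF NOT EXISTS {name} WITH REPLICATION = {{{body}}};"


print(keyspace_ddl("myapp", 3))
print(keyspace_ddl("myapp", {"dc1": 3, "dc2": 3}))
```

The generated string can be pasted into cqlsh or passed to a driver session. `IF NOT EXISTS` is added here so the statement can be re-run safely.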
Data Types
-- Basic types
TEXT -- String
INT -- 32-bit integer
BIGINT -- 64-bit integer
UUID -- Universally Unique Identifier
TIMEUUID -- Time-based UUID
BOOLEAN -- True/False
DECIMAL -- Arbitrary precision
BLOB -- Binary data
TIMESTAMP -- Date and time
-- Collections
LIST<T> -- Ordered list
SET<T> -- Unique unordered set
MAP<K,V> -- Key-value pairs
-- Examples
CREATE TABLE user_profiles (
user_id UUID PRIMARY KEY,
username TEXT,
tags SET<TEXT>,
preferences MAP<TEXT, TEXT>,
phone_numbers LIST<TEXT>
);
CRUD Operations
-- INSERT (upsert)
INSERT INTO users (user_id, username, email, created_at)
VALUES (uuid(), 'john_doe', 'john_doe@example.com', toTimestamp(now()));
-- INSERT with TTL (expire after 3600 seconds)
INSERT INTO sessions (session_id, data)
VALUES (uuid(), 'session_data')
USING TTL 3600;
-- SELECT
SELECT * FROM users;
SELECT username, email FROM users WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
-- SELECT with LIMIT
SELECT * FROM orders LIMIT 100;
-- SELECT with ordering
SELECT * FROM orders
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
ORDER BY created_at DESC;
-- UPDATE
UPDATE users
SET email = 'john.doe@example.com'
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
-- UPDATE collections (the collection columns live in user_profiles)
UPDATE user_profiles
SET tags = tags + {'developer'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
UPDATE user_profiles
SET preferences['theme'] = 'dark'
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
-- DELETE
DELETE FROM users
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
-- DELETE an entire collection column
DELETE tags FROM user_profiles
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
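The "upsert" behavior of INSERT and the effect of a TTL can be easier to reason about with a toy model. The sketch below is an in-memory illustration in plain Python, not how Cassandra is implemented: each cell keeps its value, write timestamp, and optional expiry, and the newer timestamp wins. The function names are hypothetical, timestamps are plain integers in seconds, and tie-breaking between equal timestamps is ignored.

```python
def apply_write(table, key, columns, write_ts, ttl=None):
    """Apply an INSERT/UPDATE to an in-memory table, one cell at a time."""
    row = table.setdefault(key, {})
    expires_at = None if ttl is None else write_ts + ttl
    for name, value in columns.items():
        current = row.get(name)
        # Keep the cell with the newer write timestamp (ties are ignored here).
        if current is None or write_ts > current[1]:
            row[name] = (value, write_ts, expires_at)


def read_row(table, key, now):
    """Return live column values for `key`, skipping cells whose TTL elapsed."""
    row = table.get(key, {})
    return {
        name: value
        for name, (value, _ts, expires_at) in row.items()
        if expires_at is None or now < expires_at
    }


users = {}
apply_write(users, "u1", {"username": "john_doe", "email": "old"}, write_ts=100)
# A second write to the same key overwrites only the columns it names.
apply_write(users, "u1", {"email": "new"}, write_ts=200)
# A write carrying an older timestamp loses to the newer cell.
apply_write(users, "u1", {"email": "stale"}, write_ts=150)
# A cell written with a TTL disappears once the TTL has elapsed.
apply_write(users, "u1", {"token": "abc"}, write_ts=200, ttl=3600)

print(read_row(users, "u1", now=300))
print(read_row(users, "u1", now=4000))
```

The first read still contains `token`; the second does not, because 4000 is past the cell's expiry at 3800.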
Data Modeling
Partition Keys
The partition key determines data distribution across nodes:
-- Simple partition key
CREATE TABLE events (
event_id TIMEUUID,
event_type TEXT,
data TEXT,
PRIMARY KEY (event_type, event_id)
);
-- Composite partition key
CREATE TABLE user_events (
user_id UUID,
event_type TEXT,
event_id TIMEUUID,
data TEXT,
PRIMARY KEY ((user_id, event_type), event_id)
);
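To see why the choice of partition key matters for distribution, it can help to simulate the placement. The sketch below is a simplified stand-in in plain Python: it hashes the partition key with SHA-256 and takes the result modulo a hypothetical three-node list, whereas Cassandra's default partitioner uses Murmur3 tokens on a ring. The node names and the `owner` function are assumptions made for illustration.

```python
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def owner(partition_key):
    """Map a partition key (tuple of values) to a node.

    Simplified stand-in for a partitioner: SHA-256 modulo node count
    instead of Murmur3 tokens on a ring.
    """
    raw = "|".join(str(part) for part in partition_key).encode("utf-8")
    token = int.from_bytes(hashlib.sha256(raw).digest()[:8], "big")
    return NODES[token % len(NODES)]


# Rows keyed like the events table: the partition key is event_type alone.
event_rows = [("login", i) for i in range(1000)]
simple = Counter(owner((event_type,)) for event_type, _ in event_rows)

# Rows keyed like user_events: the partition key is (user_id, event_type).
user_rows = [(f"user-{i}", "login") for i in range(1000)]
composite = Counter(owner((user_id, event_type)) for user_id, event_type in user_rows)

print("event_type only:      ", dict(simple))
print("(user_id, event_type):", dict(composite))
```

With `event_type` alone, all 1000 "login" rows land in one partition on one node, which is how hot spots form. The composite key spreads the same rows across all three nodes.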
Clustering Columns
Clustering columns determine data ordering within a partition:
-- Clustering for time-series
CREATE TABLE sensor_data (
sensor_id TEXT,
timestamp TIMESTAMP,
temperature DECIMAL,
humidity DECIMAL,
PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- Multiple clustering columns
CREATE TABLE user_posts (
user_id UUID,
created_at TIMESTAMP,
post_id TIMEUUID,
title TEXT,
content TEXT,
PRIMARY KEY (user_id, created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id DESC);
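The effect of clustering columns can be sketched as a sorted list per partition. The plain-Python model below mirrors the `user_posts` table under simplifying assumptions: integers stand in for TIMEUUID values, user names stand in for UUIDs, and `insert_post` is a hypothetical helper rather than a driver call.

```python
from collections import defaultdict
from datetime import datetime

# partition key -> rows; mirrors user_posts (user_id, created_at, post_id)
partitions = defaultdict(list)


def insert_post(user_id, created_at, post_id, title):
    """Store a row and keep its partition sorted by the clustering columns."""
    rows = partitions[user_id]
    rows.append((created_at, post_id, title))
    # Equivalent of CLUSTERING ORDER BY (created_at DESC, post_id DESC)
    rows.sort(key=lambda row: (row[0], row[1]), reverse=True)


insert_post("alice", datetime(2026, 1, 5), 2, "Second post")
insert_post("alice", datetime(2026, 1, 1), 1, "First post")
insert_post("alice", datetime(2026, 1, 5), 3, "Third post")
insert_post("bob", datetime(2026, 1, 3), 4, "Bob's post")

# Equivalent of: SELECT title FROM user_posts WHERE user_id = ... LIMIT 2;
latest = [title for _, _, title in partitions["alice"][:2]]
print(latest)  # ['Third post', 'Second post']
```

Because rows are already stored in the requested order, reading the newest posts for one user is a sequential read from the start of the partition, with no sort at query time.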
Query-First Design
Cassandra requires you to design tables based on your query patterns:
-- Query 1: Get all orders for a user
CREATE TABLE orders_by_user (
user_id UUID,
order_id TIMEUUID,
total DECIMAL,
status TEXT,
created_at TIMESTAMP,
PRIMARY KEY (user_id, order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);
-- Query 2: Get all orders by status (requires separate table)
CREATE TABLE orders_by_status (
status TEXT,
order_id TIMEUUID,
user_id UUID,
total DECIMAL,
created_at TIMESTAMP,
PRIMARY KEY (status, order_id)
);
-- Query 3: Get recent orders
CREATE TABLE recent_orders (
year INT,
month INT,
order_id TIMEUUID,
user_id UUID,
total DECIMAL,
PRIMARY KEY ((year, month), order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);
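Query-first design means the application writes each order to every table that serves a query. A minimal sketch of that fan-out in plain Python follows; `fan_out` and the sample values are hypothetical, and the statements are only printed, not executed against a cluster.

```python
from datetime import datetime, timezone


def fan_out(order):
    """Return (table name, row dict) pairs for the three tables above."""
    created = order["created_at"]
    base = {
        "order_id": order["order_id"],
        "user_id": order["user_id"],
        "total": order["total"],
    }
    return [
        ("orders_by_user", {**base, "status": order["status"], "created_at": created}),
        ("orders_by_status", {**base, "status": order["status"], "created_at": created}),
        # recent_orders is bucketed by (year, month), derived from created_at.
        ("recent_orders", {**base, "year": created.year, "month": created.month}),
    ]


order = {
    "order_id": "hypothetical-timeuuid",
    "user_id": "hypothetical-user-uuid",
    "total": "100.00",
    "status": "pending",
    "created_at": datetime(2026, 3, 14, 9, 30, tzinfo=timezone.utc),
}

for table, row in fan_out(order):
    columns = ", ".join(row)
    placeholders = ", ".join("?" for _ in row)
    print(f"INSERT INTO {table} ({columns}) VALUES ({placeholders});")
```

Each printed statement has the shape of a prepared statement whose values would be bound from the corresponding row dict. The cost of this design is extra writes, which suits Cassandra's write-optimized storage.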
Consistency Levels
-- Check current consistency (CONSISTENCY is a cqlsh command, not a CQL statement)
CONSISTENCY;
-- Set consistency level for the cqlsh session
CONSISTENCY QUORUM;
-- Consistency levels:
-- ANY -- At least one node, hints included (writes only)
-- ONE -- One replica
-- TWO -- Two replicas
-- THREE -- Three replicas
-- QUORUM -- Majority of replicas: floor(replication_factor / 2) + 1
-- LOCAL_ONE -- One replica in the local DC
-- LOCAL_QUORUM -- Quorum of replicas in the local DC
-- EACH_QUORUM -- Quorum of replicas in each DC
-- ALL -- All replicas
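The QUORUM formula, and the way read and write levels combine, can be checked with a few lines of arithmetic. The sketch below uses the common rule of thumb that a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number written exceeds the replication factor; the function names are hypothetical.

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1


def overlaps(read_replicas, write_replicas, replication_factor):
    """True when every read set intersects every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                            # 2
print(overlaps(quorum(rf), quorum(rf), rf))  # True: QUORUM reads + QUORUM writes
print(overlaps(1, 1, rf))                    # False: ONE reads + ONE writes
print(overlaps(1, rf, rf))                   # True: ONE reads + ALL writes
```

With a replication factor of 3, QUORUM on both reads and writes tolerates one unavailable replica while still overlapping, which is why it is a frequent default.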
Secondary Indexes
-- Create secondary index
CREATE INDEX idx_user_email ON users(email);
-- Create index on collection (the tags column lives in user_profiles)
CREATE INDEX idx_user_tags ON user_profiles(tags);
-- Query using secondary index
SELECT * FROM users WHERE email = 'john_doe@example.com';
-- Note: secondary indexes suit columns of moderate cardinality; for a near-unique
-- column such as email, a dedicated lookup table usually scales better
Materialized Views
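-- Create materialized view
-- (materialized views are flagged experimental and disabled by default in Cassandra 4.0+)
-- The view's primary key must contain every primary key column of the base table
CREATE MATERIALIZED VIEW user_orders_view AS
SELECT * FROM orders
WHERE user_id IS NOT NULL
AND order_id IS NOT NULL
AND created_at IS NOT NULL
PRIMARY KEY (user_id, order_id, created_at);
-- Cassandra propagates base-table writes to the view asynchronously
INSERT INTO orders (user_id, created_at, order_id, total, status)
VALUES (123e4567-e89b-12d3-a456-426614174000, toTimestamp(now()), uuid(), 100.00, 'pending');
-- The view includes the new row once the update has been applied
SELECT * FROM user_orders_view
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;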
Python Integration
Using cassandra-driver
import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster
# Connect to Cassandra
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('myapp')
# Simple query (uuid() and now() are evaluated server-side by CQL)
session.execute("""
    INSERT INTO users (user_id, username, email, created_at)
    VALUES (uuid(), 'john', 'john@example.com', toTimestamp(now()))
""")
# Prepared statement (values are bound to the ? markers in order)
prepared = session.prepare("""
    INSERT INTO users (user_id, username, email, created_at)
    VALUES (?, ?, ?, ?)
""")
session.execute(prepared, [uuid.uuid4(), 'jane', 'jane@example.com', datetime.now(timezone.utc)])
# Select data
rows = session.execute("SELECT * FROM users")
for row in rows:
    print(row.username, row.email)
# Close connection
cluster.shutdown()
Using DataStax Driver
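# The DataStax Python driver is distributed as the cassandra-driver package,
# so it is imported from the cassandra module; there is no separate datastax module
from cassandra import InvalidRequest
from cassandra.cluster import Cluster
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('myapp')
# Driver exceptions such as InvalidRequest are defined in the cassandra module
try:
    session.execute("SELECT * FROM users")
except InvalidRequest as exc:
    print(f"Query rejected: {exc}")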
Conclusion
Cassandra provides a powerful distributed database solution for applications requiring high write throughput and linear scalability. Understanding CQL, data modeling with partition keys, and clustering columns is essential for building efficient Cassandra applications.
In the next article, we’ll explore Cassandra operations: backup strategies, repair operations, cluster management, and monitoring.