Introduction
Understanding Neo4j’s internal architecture helps you design better graph models, optimize queries, and troubleshoot performance issues. While Neo4j presents a simple graph model to users, internally it employs sophisticated techniques for storage, indexing, and query execution. This article explores the key components that make Neo4j efficient at handling connected data.
Storage Architecture
Neo4j stores graph data across multiple files, each optimized for specific access patterns.
Store Files
Neo4j uses a custom storage engine with several file types:
data/databases/neo4j/
├── neostore # Main store
├── neostore.counts.db.a # Count store
├── neostore.counts.db.b
├── neostore.labeltokenstore.db # Label tokens
├── neostore.labeltokenstore.db.names
├── neostore.nodestore.db # Node records
├── neostore.nodestore.db.names
├── neostore.propertystore.db # Property records
├── neostore.propertystore.db.arrays
├── neostore.propertystore.db.index
├── neostore.propertystore.db.index.names
├── neostore.propertystore.db.strings
├── neostore.relationshipstore.db # Relationship records
├── neostore.relationshipstore.db.names
├── neostore.relationshipgroupstore.db
└── neostore.schemastore.db
Node Storage
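Each node is stored as a fixed-size record (15 bytes in the standard record format). The struct below is a conceptual simplification with wider fields, so its size does not match the on-disk layout:
#include <stdint.h>
// Simplified, conceptual node record (not the on-disk layout)
struct NodeRecord {
int64_t id; // 8 bytes - unique node ID (on disk, implied by the record's position)
int32_t relOffset; // 4 bytes - first relationship pointer
int32_t propOffset; // 4 bytes - first property pointer
int64_t labelBits; // 8 bytes - inline label storage
// Conceptual total: 24 bytes; the on-disk record packs this into 15 bytes
};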
The relationship pointer is the head of a linked list that chains together all of the node's relationships.
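Fixed-size records are what make lookup by ID cheap: the byte offset of a record is its ID multiplied by the record size. Here is a minimal Python sketch of that arithmetic. The field layout, `NODE_FORMAT`, and the helper names are assumptions made for illustration and are not Neo4j's actual on-disk format.

```python
import io
import struct

# Hypothetical layout for illustration only: in-use flag (1 byte),
# first relationship ID (4), first property ID (4), label bits (6).
NODE_FORMAT = "<BII6s"
RECORD_SIZE = struct.calcsize(NODE_FORMAT)  # 1 + 4 + 4 + 6 bytes


def write_node(store, node_id, first_rel, first_prop, labels=b"\x00" * 6):
    """Write a record at the offset implied by its ID."""
    store.seek(node_id * RECORD_SIZE)
    store.write(struct.pack(NODE_FORMAT, 1, first_rel, first_prop, labels))


def read_node(store, node_id):
    """Read a record with a single seek: no search and no index."""
    store.seek(node_id * RECORD_SIZE)
    in_use, first_rel, first_prop, _labels = struct.unpack(
        NODE_FORMAT, store.read(RECORD_SIZE)
    )
    return {"in_use": bool(in_use), "first_rel": first_rel, "first_prop": first_prop}


# An in-memory buffer stands in for the node store file.
store = io.BytesIO()
for node_id in range(3):
    write_node(store, node_id, first_rel=node_id * 10, first_prop=node_id * 100)

node = read_node(store, 2)
```

Because the offset is computed rather than searched for, the cost of `read_node` does not depend on how many records the store holds.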
Relationship Storage
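Relationships are also stored in fixed-size records (around 33 to 34 bytes, depending on the Neo4j version and record format). The struct below is conceptual rather than the on-disk layout:
#include <stdint.h>
// Simplified, conceptual relationship record (not the on-disk layout)
struct RelationshipRecord {
int64_t id; // 8 bytes - relationship ID (on disk, implied by the record's position)
int64_t firstNode; // 8 bytes - start node ID
int64_t secondNode; // 8 bytes - end node ID
int32_t relOffset; // 4 bytes - next relationship in the start node's chain
int32_t nextRelOffset; // 4 bytes - next relationship in the end node's chain
int32_t propOffset; // 4 bytes - first property
int16_t type; // 2 bytes - relationship type
// Conceptual total: 38 bytes; the on-disk record is more compact
};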
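Each relationship record therefore sits in two linked lists, one for its start node and one for its end node, which lets Neo4j traverse the relationship efficiently from either end.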
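To show how one record can belong to two chains at once, here is a small Python sketch. The `Rel` class, the `NO_REL` sentinel, and the field names are hypothetical. Neo4j's real records also hold "previous" pointers, which this sketch omits.

```python
from dataclasses import dataclass

NO_REL = -1  # sentinel meaning "end of chain"


@dataclass
class Rel:
    start: int
    end: int
    rel_type: str
    next_for_start: int = NO_REL  # next relationship in the start node's chain
    next_for_end: int = NO_REL    # next relationship in the end node's chain


def relationships_of(node_id, first_rel, rels):
    """Walk one node's chain, following the pointer that belongs to that node."""
    rel_id = first_rel[node_id]
    while rel_id != NO_REL:
        rel = rels[rel_id]
        yield rel_id
        rel_id = rel.next_for_start if rel.start == node_id else rel.next_for_end


# Nodes: 0 = Alice, 1 = Bob, 2 = Carol. List position is the relationship ID.
rels = [
    Rel(start=0, end=1, rel_type="KNOWS", next_for_start=1),  # rel 0
    Rel(start=0, end=2, rel_type="KNOWS", next_for_end=2),    # rel 1
    Rel(start=1, end=2, rel_type="KNOWS", next_for_start=0),  # rel 2
]
# Each node record points at the first relationship in its chain.
first_rel = {0: 0, 1: 2, 2: 1}

alice_chain = list(relationships_of(0, first_rel, rels))
bob_chain = list(relationships_of(1, first_rel, rels))
carol_chain = list(relationships_of(2, first_rel, rels))
```

Relationship 0 appears in both Alice's and Bob's chains, yet it is stored only once. Which pointer to follow depends on which node is doing the walking.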
Property Storage
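Properties are stored in their own store files, separate from nodes and relationships. Cypher does not expose the on-disk layout, but you can inspect the schema that sits on top of it:
// Visualize the labels and relationship types in the schema
CALL db.schema.visualization()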
Properties are stored in dedicated property blocks that support various data types, with string and array properties stored in separate overflow files.
Index Architecture
Neo4j maintains several index types for efficient lookups.
Label Indexes
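Neo4j keeps a label lookup index that maps each label to the nodes carrying it. On top of that, you can create property indexes scoped to a label:
// Create a property index on Person.name
CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name)
// The index is stored as a B-tree-style structure
// Lookup: O(log n) for exact matches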
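The O(log n) cost comes from searching a sorted structure instead of scanning every node. The sketch below imitates an exact-match index seek with binary search over a sorted list. The entries and the `index_seek` helper are made up for illustration; a real index is a disk-backed tree, not a Python list.

```python
import bisect

# Hypothetical index entries: (property value, node ID), kept sorted.
index = sorted([("Alice", 0), ("Bob", 1), ("Carol", 2), ("Alice", 7)])


def index_seek(index, value):
    """Return the node IDs whose indexed value equals `value`."""
    # Binary search for the first entry whose key is >= value: O(log n).
    i = bisect.bisect_left(index, (value,))
    node_ids = []
    # Collect the contiguous run of matching entries.
    while i < len(index) and index[i][0] == value:
        node_ids.append(index[i][1])
        i += 1
    return node_ids


alice_ids = index_seek(index, "Alice")
missing_ids = index_seek(index, "Dave")
```

The seek touches only the matching entries plus the handful of comparisons needed to find where they start.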
Relationship Type Indexes
Property indexes can also be defined on relationships of a given type:
// Index on a relationship property
CREATE INDEX knows_since IF NOT EXISTS
FOR ()-[r:KNOWS]-() ON (r.since)
Full-Text Indexes
For text search:
// Create a full-text index over two properties
CREATE FULLTEXT INDEX person_fulltext IF NOT EXISTS
FOR (p:Person) ON EACH [p.name, p.bio]
Index Caching
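Neo4j caches native index files in memory through the page cache, alongside the store files, so index performance depends on how the page cache is sized:
# The page cache holds store files and native index files
server.memory.pagecache.size=4G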
Relationship Traversal
The core of Neo4j’s power is efficient relationship traversal.
Native Graph Traversal
When you traverse relationships, Neo4j uses the native pointer structure:
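// This query uses native pointers
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(friend)
RETURN friend.name
Execution:
- Find Alice's node record (through an index on name if one exists, otherwise a label scan)
- Follow relOffset to the first relationship
- Check the relationship type (KNOWS) and direction
- Follow the pointer to the friend node, then move on to the next relationship in the chain
- No index lookup is needed for the hops themselves
This is why graph queries are fast: the database follows pointers rather than joining tables.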
Variable-Length Traversal
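For paths of variable length:
// Find all friends up to 3 hops away
MATCH (alice:Person {name: 'Alice'})-[:KNOWS*1..3]-(friend)
RETURN DISTINCT friend
// Execution: the pattern is expanded hop by hop up to the depth limit
// A relationship is never reused within a single path, so cycles cannot loop forever
Because only distinct end nodes are returned, the executor can prune the expansion and skip nodes it has already reached, which behaves much like a breadth-first search with a visited set.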
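A depth-limited breadth-first search can be sketched in a few lines of Python. This is a simplified model, not Neo4j's executor. It tracks visited nodes, whereas Cypher enforces uniqueness on relationships, so unlike the Cypher query it never returns the start node. The graph and the `neighbors_within` helper are hypothetical.

```python
from collections import deque


def neighbors_within(graph, start, max_hops):
    """Return the nodes reachable from `start` in 1..max_hops hops."""
    seen = {start}
    found = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:  # pruning: each node is expanded once
                seen.add(neighbor)
                found.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return found


# Undirected KNOWS edges, stored as adjacency lists in both directions.
graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["carol", "erin"],
    "erin": ["dave"],
}

friends = neighbors_within(graph, "alice", 3)
```

Here `erin` is four hops from `alice`, so she falls outside the `*1..3` range and is not returned.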
Pattern Matching Optimization
The Cypher planner chooses an efficient traversal order:
// The planner chooses the start node
MATCH (a:Person)-[:KNOWS]->(b)-[:WORKS_AT]->(c)
WHERE a.name = 'Alice'
RETURN b, c
The planner picks the most selective part of the pattern (here, the Person node named Alice) as the starting point and expands outward from it.
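The idea of starting from the most selective point can be illustrated with a toy cost comparison. The row estimates below are invented, and `pick_start` is a hypothetical helper. A real planner derives its estimates from database statistics and weighs many more alternatives.

```python
def pick_start(candidates):
    """Pick the pattern node with the lowest estimated row count."""
    return min(candidates, key=lambda c: c["estimated_rows"])


# Made-up estimates for the three variables in the pattern.
candidates = [
    {"variable": "a", "access": "index seek on Person(name)", "estimated_rows": 1},
    {"variable": "b", "access": "all nodes scan", "estimated_rows": 1_000_000},
    {"variable": "c", "access": "all nodes scan", "estimated_rows": 1_000_000},
]

start = pick_start(candidates)
```

Starting from `a` means the later expand steps only process rows reachable from one node, rather than filtering a million candidates after the fact.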
Query Execution Pipeline
Understanding the query pipeline helps optimize Cypher queries.
Parsing and Rewriting
// Original query
MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(friend)
RETURN friend.name
Pipeline:
- Parse → Abstract Syntax Tree (AST)
- Semantic Analysis → Validate types, labels, properties
- Rewrite → Simplify, apply rules
- Logical Plan → Algebraic representation
- Cost Planning → Choose execution strategy
- Physical Plan → Executable plan
Execution Plans
View query plans:
// Explain without executing
EXPLAIN MATCH (p:Person) RETURN count(p)
// Profile actual execution
PROFILE MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(f) RETURN f
Example plan:
Planner COST
Runtime PIPELINED
| +ProduceResults
| +Filter
| +Expand(ALL)
| +NodeIndexSeek
Operator Types
Key execution operators:
- NodeIndexSeek: Index lookup for specific values
- NodeIndexScan: Full index scan
- Expand: Relationship traversal
- Apply: Nested iteration
- NodeHashJoin: Equi-join on nodes using hash tables
- CartesianProduct: Cross product
- EagerAggregation: Grouping and aggregation
Caching
Neo4j uses multiple caches for performance.
Object Cache
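Neo4j 2.x kept an on-heap object cache of nodes and relationships. It was removed in Neo4j 3.0, and current versions rely on the page cache instead, so the setting below applies only to legacy installations:
# Legacy (Neo4j 2.x) object cache setting
cache_type=soft
Cache types:
- none: No caching
- strong: Keep everything that has been loaded
- soft: Evict based on GC pressure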
Query Cache
Caches compiled query plans:
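# Number of cached query plans per database (setting name as of Neo4j 5.0; it differs in other versions)
server.db.query_cache_size=1000
The query cache stores compiled execution plans, so a repeated query skips parsing and planning. Passing values as parameters rather than inlining literals makes it more likely that similar queries share a cached plan.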
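A plan cache can be modeled as a bounded map from query text to compiled plan. The sketch below uses least-recently-used eviction, which is an assumption: the article does not say which eviction policy Neo4j applies. `PlanCache` and the stand-in `planner` function are hypothetical.

```python
from collections import OrderedDict


class PlanCache:
    """Bounded cache of compiled plans, keyed by query text (LRU eviction)."""

    def __init__(self, capacity, planner):
        self.capacity = capacity
        self.planner = planner  # function that "compiles" a query
        self.plans = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query_text):
        if query_text in self.plans:
            self.hits += 1
            self.plans.move_to_end(query_text)  # mark as recently used
            return self.plans[query_text]
        self.misses += 1
        plan = self.planner(query_text)  # the expensive step the cache avoids
        self.plans[query_text] = plan
        if len(self.plans) > self.capacity:
            self.plans.popitem(last=False)  # evict the least recently used plan
        return plan


cache = PlanCache(capacity=2, planner=lambda q: f"plan for: {q}")

# The same parameterized text is planned once, then served from the cache.
parameterized = "MATCH (p:Person {name: $name}) RETURN p"
cache.get(parameterized)
cache.get(parameterized)

# Inlined literals produce distinct texts, so each one is planned separately.
cache.get("MATCH (p:Person {name: 'Alice'}) RETURN p")
cache.get("MATCH (p:Person {name: 'Bob'}) RETURN p")
```

In this model the two literal queries also push the parameterized plan out of the two-entry cache, which is one more reason to prefer parameters.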
Page Cache
Caches Neo4j store files:
# Page cache size
server.memory.pagecache.size=4G
The page cache is critical for read-heavy workloads: when the working set fits in it, reads are served from memory instead of disk.
Transaction Management
Neo4j ensures ACID properties.
Write Transaction Flow
// In cypher-shell, explicit transactions use :begin and :commit
:begin
// Statement 1: Add node
CREATE (p:Person {name: 'NewPerson'});
// Statement 2: Add relationship
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'NewPerson'})
CREATE (a)-[:KNOWS]->(b);
:commit
Steps:
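- Begin - Start the transaction
- Lock - Acquire write locks on the affected nodes and relationships
- Write - Record the changes in the transaction's in-memory state
- Commit - Append the changes to the transaction log, apply them to the store through the page cache, and release locks
- Checkpoint - Periodically flush modified pages to the store files so that older log entries can be pruned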
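The relationship between the transaction log and checkpoints can be sketched with a toy write-ahead log. This is a simplified model under stated assumptions: `TinyStore` and its fields are hypothetical, the log is assumed to survive a crash, and locking is left out entirely.

```python
class TinyStore:
    """Toy write-ahead log: commit appends to the log, checkpoint flushes state."""

    def __init__(self):
        self.log = []            # append-only transaction log (assumed durable)
        self.pages = {}          # in-memory state, like modified pages in a cache
        self.files = {}          # state that has been flushed to "store files"
        self.checkpoint_pos = 0  # log position covered by the last checkpoint

    def commit(self, changes):
        self.log.append(dict(changes))  # 1. make the change durable in the log
        self.pages.update(changes)      # 2. apply it to the in-memory state

    def checkpoint(self):
        self.files = dict(self.pages)   # flush in-memory state to the files
        self.checkpoint_pos = len(self.log)

    def crash_and_recover(self):
        # Memory is lost; rebuild from the files plus the log tail.
        self.pages = dict(self.files)
        for changes in self.log[self.checkpoint_pos:]:
            self.pages.update(changes)


db = TinyStore()
db.commit({"node:1": "Alice"})
db.checkpoint()
db.commit({"node:2": "NewPerson"})  # committed, but not yet checkpointed

files_before_crash = dict(db.files)
db.crash_and_recover()
```

The second commit never reached the store files, yet it survives the crash because recovery replays the log entries written after the last checkpoint.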
Lock Management
Neo4j uses fine-grained locking:
// SET takes a write lock on the matched node until the transaction ends
MATCH (p:Person {name: 'Alice'})
SET p.lastLogin = timestamp()
Lock types:
- Node locks: For node modifications
- Relationship locks: For relationship modifications
- Schema locks: For index/schema changes
Concurrency Control
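Neo4j's default isolation level is read committed, which provides:
- Read consistency: Queries see only committed data
- No blocking: Reads don't take locks, so readers don't wait for writers
- Isolation: Write locks are held until commit, so concurrent writers can't overwrite each other's uncommitted changes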
Memory Management
Understanding memory usage is crucial for performance.
Memory Pools
# Memory configuration
# JVM heap: query execution
server.memory.heap.initial_size=4G
# Upper limit for the heap, which grows as needed
server.memory.heap.max_size=8G
# Page cache: store file cache
server.memory.pagecache.size=4G
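Heap and page cache are separate allocations, so their sum has to fit in physical memory with room left for the operating system. The Python sketch below checks that arithmetic. The `parse_size` and `memory_headroom` helpers, the 16G machine, and the 2G operating system reserve are assumptions for illustration, not Neo4j recommendations.

```python
def parse_size(text):
    """Convert a size such as '4G' or '512M' to bytes."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    text = text.strip().upper()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def memory_headroom(total_ram, heap_max, page_cache, os_reserve="2G"):
    """Return the bytes left after heap, page cache, and an OS reserve."""
    return (
        parse_size(total_ram)
        - parse_size(heap_max)
        - parse_size(page_cache)
        - parse_size(os_reserve)
    )


# The values from the configuration above, on a hypothetical 16G machine.
headroom = memory_headroom(total_ram="16G", heap_max="8G", page_cache="4G")
fits = headroom >= 0
```

A negative result would mean the configuration overcommits memory and risks swapping, which hurts both the heap and the page cache.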
Memory Allocation
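// List running transactions (SHOW TRANSACTIONS replaces dbms.listTransactions() in recent versions)
SHOW TRANSACTIONS;
// List memory pools and their current usage
CALL dbms.listPools();
Shows:
- Heap memory used per pool
- Native (off-heap) memory used per pool
- Transaction memory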
Garbage Collection
GC tuning for production:
# JVM options in neo4j.conf
server.jvm.additional=-XX:+UseG1GC
server.jvm.additional=-XX:MaxGCPauseMillis=100
server.jvm.additional=-XX:+ParallelRefProcEnabled
Performance Characteristics
Understanding these performance characteristics helps with model design.
Complexity
| Operation | Time Complexity |
|---|---|
| Node lookup by ID | O(1) |
| Index lookup | O(log n) |
| Relationship traversal | O(1) per relationship followed |
| Pattern matching | Varies by query |
Scalability
Neo4j scales differently than relational databases:
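- Read scaling: Add read replicas to a cluster
- Write scaling: Writes to a database go through a single leader, so scaling writes means sharding data across databases (requires careful design)
- Storage: Scales horizontally by federating queries across databases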
Conclusion
Neo4j’s internal architecture enables efficient graph operations through carefully designed storage structures, indexes, and execution strategies. The native pointer-based relationship traversal is the key to its performance: the database follows pointers rather than performing expensive joins.
Understanding these internals helps you design better models, write more efficient queries, and troubleshoot performance issues. The graph model naturally lends itself to efficient traversal, making Neo4j ideal for connected data applications.
In the next article, we’ll explore recent Neo4j developments and trends for 2025-2026.