Neo4j Internals: Understanding the Graph Engine

Introduction

Understanding Neo4j’s internal architecture helps you design better graph models, optimize queries, and troubleshoot performance issues. While Neo4j presents a simple graph model to users, internally it employs sophisticated techniques for storage, indexing, and query execution. This article explores the key components that make Neo4j efficient at handling connected data.

Storage Architecture

Neo4j stores graph data across multiple files, each optimized for specific access patterns.

Store Files

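Neo4j’s record storage engine keeps each kind of data in its own store file. The exact set of files varies by version and store format; a typical layout looks like this:

data/databases/neo4j/
├── neostore                              # Main store (metadata)
├── neostore.counts.db.a                  # Count store
├── neostore.counts.db.b
├── neostore.labeltokenstore.db           # Label tokens (label IDs)
├── neostore.labeltokenstore.db.names     # Label names
├── neostore.nodestore.db                 # Node records
├── neostore.nodestore.db.labels          # Overflow label records
├── neostore.propertystore.db             # Property records
├── neostore.propertystore.db.arrays      # Overflow array values
├── neostore.propertystore.db.index       # Property key tokens
├── neostore.propertystore.db.index.names # Property key names
├── neostore.propertystore.db.strings     # Overflow string values
├── neostore.relationshipstore.db         # Relationship records
├── neostore.relationshiptypestore.db     # Relationship type tokens
├── neostore.relationshiptypestore.db.names
├── neostore.relationshipgroupstore.db    # Relationship groups for dense nodes
└── neostore.schemastore.db               # Index and constraint definitions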

Node Storage

Each node is stored as a fixed-size record (15 bytes):

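// Simplified node record structure (standard record format, packed on disk)
struct NodeRecord {
    uint8_t  inUse;        // 1 byte  - in-use flag and high bits of the pointers
    uint32_t relOffset;    // 4 bytes - first relationship pointer
    uint32_t propOffset;   // 4 bytes - first property pointer
    uint8_t  labelBits[5]; // 5 bytes - inline label storage
    uint8_t  extra;        // 1 byte  - flags such as the dense-node marker
    // Total: 15 bytes per node
    // The node ID is not stored; it is the record's position in the file
};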

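The relationship pointer is the head of a linked list that runs through all relationships of a node. Because records are fixed-size, a node ID translates directly into a file offset, so finding a record needs no index.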
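The effect of fixed-size records can be sketched in a few lines of Python. This is a minimal illustration, not Neo4j's actual on-disk format: the field layout (one flag byte, two 4-byte pointers, five label bytes, one extra byte) and the helper names `record_offset`, `write_node`, and `read_node` are assumptions chosen to add up to the 15-byte record mentioned above.

```python
import struct

# Assumed simplified layout: flag byte, two 4-byte pointers, 5 label bytes, 1 extra byte.
NODE_RECORD = struct.Struct(">BII5sB")
RECORD_SIZE = NODE_RECORD.size  # 15 bytes, no padding


def record_offset(node_id: int) -> int:
    """Fixed-size records turn an ID lookup into a single multiplication."""
    return node_id * RECORD_SIZE


def write_node(store: bytearray, node_id: int, first_rel: int, first_prop: int) -> None:
    """Pack one in-use node record into the store at its computed offset."""
    offset = record_offset(node_id)
    if len(store) < offset + RECORD_SIZE:
        store.extend(bytes(offset + RECORD_SIZE - len(store)))  # grow with zeroed records
    NODE_RECORD.pack_into(store, offset, 1, first_rel, first_prop, bytes(5), 0)


def read_node(store: bytearray, node_id: int) -> dict:
    """Read a node record back without scanning the file or consulting an index."""
    in_use, first_rel, first_prop, _labels, _extra = NODE_RECORD.unpack_from(
        store, record_offset(node_id)
    )
    return {"in_use": bool(in_use), "first_rel": first_rel, "first_prop": first_prop}


store = bytearray()
write_node(store, node_id=3, first_rel=42, first_prop=7)
print(record_offset(3), read_node(store, 3))
```

Reading node 3 touches exactly one 15-byte slice of the store, which is the property behind the O(1) lookup by ID.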

Relationship Storage

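Relationships are stored in fixed-size records (34 bytes in the standard record format):

// Simplified relationship record (standard record format, packed on disk)
struct RelationshipRecord {
    uint8_t  inUse;          // 1 byte  - in-use flag and high bits of the pointers
    uint32_t firstNode;      // 4 bytes - start node ID
    uint32_t secondNode;     // 4 bytes - end node ID
    uint32_t type;           // 4 bytes - relationship type token
    uint32_t prevRelOffset;  // 4 bytes - previous relationship (same firstNode)
    uint32_t relOffset;      // 4 bytes - next relationship (same firstNode)
    uint32_t prevRelOffset2; // 4 bytes - previous relationship (same secondNode)
    uint32_t nextRelOffset;  // 4 bytes - next relationship (same secondNode)
    uint32_t propOffset;     // 4 bytes - first property
    uint8_t  firstInChain;   // 1 byte  - first-in-chain markers
    // Total: 34 bytes per relationship
    // The relationship ID is the record's position in the file
};

Each relationship therefore belongs to two linked lists, one for its start node and one for its end node, which enables efficient traversal from both directions.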
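The following sketch models the relationship chains in memory to show how one record can serve both of its endpoint nodes. It is an illustration under simplifying assumptions: the class and field names (`TinyGraph`, `Rel`, `first_next`, `second_next`) are hypothetical, only the next pointers are modelled, and Python dictionaries stand in for the store files.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class Rel:
    first_node: int
    second_node: int
    rel_type: str
    first_next: Optional[int] = None   # next relationship in first_node's chain
    second_next: Optional[int] = None  # next relationship in second_node's chain


class TinyGraph:
    def __init__(self) -> None:
        self.first_rel: Dict[int, int] = {}  # node ID -> head of its relationship chain
        self.rels: Dict[int, Rel] = {}

    def add_rel(self, rel_id: int, start: int, end: int, rel_type: str) -> None:
        """Insert the relationship at the head of both endpoint chains."""
        self.rels[rel_id] = Rel(
            start, end, rel_type,
            first_next=self.first_rel.get(start),
            second_next=self.first_rel.get(end),
        )
        self.first_rel[start] = rel_id
        self.first_rel[end] = rel_id

    def rels_of(self, node: int) -> Iterator[int]:
        """Walk a node's chain, choosing the pointer that belongs to that node."""
        rel_id = self.first_rel.get(node)
        while rel_id is not None:
            yield rel_id
            rel = self.rels[rel_id]
            rel_id = rel.first_next if rel.first_node == node else rel.second_next


graph = TinyGraph()
graph.add_rel(0, start=1, end=2, rel_type="KNOWS")
graph.add_rel(1, start=1, end=3, rel_type="KNOWS")
graph.add_rel(2, start=3, end=2, rel_type="WORKS_WITH")
print(list(graph.rels_of(1)), list(graph.rels_of(2)), list(graph.rels_of(3)))
```

Relationship 0 appears in the chains of both node 1 and node 2, yet it is stored only once; the walk simply picks whichever pointer matches the node it is currently visiting.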

Property Storage

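Properties are stored as chains of fixed-size property records. Property key names live in a separate token store (the .index files):

// Inspect which property keys and value types each label uses
// (logical schema only; the on-disk layout is not exposed through Cypher)
CALL db.schema.nodeTypeProperties()

Each property record holds blocks that can store small values of various data types inline, while long strings and arrays are stored in separate overflow files (.strings and .arrays).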

Index Architecture

Neo4j maintains several index types for efficient lookups.

Label Indexes

Recent Neo4j versions automatically maintain a token lookup index that maps each label to the nodes carrying it. Indexes on properties are created explicitly:

// Create a range index on Person.name
CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name)

// The index is stored as a B+ tree
// Lookup: O(log n) for exact matches
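To see why an index lookup is O(log n) while a scan is O(n), the sketch below uses a sorted list with binary search as a stand-in for the leaf level of a tree index. The function names `build_index`, `index_seek`, and `label_scan` are hypothetical, and a real B+ tree adds paging and concurrency control that this ignores.

```python
import bisect
from typing import Dict, List, Tuple


def build_index(names: Dict[int, str]) -> List[Tuple[str, int]]:
    """Sort (property value, node ID) pairs, as the leaves of a tree index would be."""
    return sorted((name, node_id) for node_id, name in names.items())


def index_seek(index: List[Tuple[str, int]], value: str) -> List[int]:
    """Binary-search to the first match, then read neighbouring entries sequentially."""
    pos = bisect.bisect_left(index, (value,))  # (value,) sorts before any (value, id)
    matches = []
    while pos < len(index) and index[pos][0] == value:
        matches.append(index[pos][1])
        pos += 1
    return matches


def label_scan(names: Dict[int, str], value: str) -> List[int]:
    """Without an index, every node with the label has to be checked."""
    return [node_id for node_id, name in names.items() if name == value]


person_names = {0: "Bob", 1: "Alice", 2: "Carol", 3: "Alice"}
index = build_index(person_names)
print(index_seek(index, "Alice"), label_scan(person_names, "Alice"))
```

Both functions return the same node IDs; the difference is that the seek inspects a logarithmic number of entries before it starts reading matches.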

Relationship Type Indexes

Neo4j also keeps a lookup index from relationship type to relationships, and you can index relationship properties explicitly:

// Index on a relationship property
CREATE INDEX knows_since IF NOT EXISTS
FOR ()-[r:KNOWS]-() ON (r.since)

Full-Text Indexes

Full-text indexes are backed by Apache Lucene and support tokenized text search across several properties:

// Create full-text index
CREATE FULLTEXT INDEX person_fulltext IF NOT EXISTS
FOR (p:Person) ON EACH [p.name, p.bio]

Index Caching

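Native indexes are stored in files that are read through the page cache, so there is no separate index cache to size. Give the page cache enough room for both store files and index files:

# The page cache holds store files and native index files
server.memory.pagecache.size=4G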

Relationship Traversal

The core of Neo4j’s power is efficient relationship traversal.

Native Graph Traversal

When you traverse relationships, Neo4j uses the native pointer structure:

// This query follows native pointers once Alice has been found
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(friend)
RETURN friend.name

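Execution:

Find Alice’s node record (the only step that may use an index, here on Person.name)
Follow relOffset to the first relationship in her chain
Read the relationship type and keep the relationship if it is KNOWS
Follow the relationship to the friend node
Follow the next-relationship pointer and repeat until the chain ends

No index lookup is needed per hop. This is why graph traversals are fast: the database follows pointers rather than joining tables.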

Variable-Length Traversal

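For paths of variable length:

// Find all friends up to 3 hops away
MATCH (alice:Person {name: 'Alice'})-[:KNOWS*1..3]-(friend)
RETURN DISTINCT friend

// Execution: expand hop by hop, starting from Alice
// A path may use each relationship only once, which prevents endless cycling

Because only distinct end nodes are requested here, the planner can use a pruning variant of variable-length expansion (breadth-first in recent versions) that stops exploring paths which cannot produce a new result.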
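A bounded breadth-first search gives a feel for how distinct nodes within three hops can be collected. This is a sketch of the general technique, not Neo4j's operator: the function name `within_hops` and the adjacency-dictionary input are assumptions, and the visited set is a simplification that ignores the corner case where the start node is reached again through a cycle.

```python
from collections import deque
from typing import Dict, List, Set


def within_hops(adjacency: Dict[str, List[str]], start: str, max_hops: int) -> Set[str]:
    """Return the distinct nodes reachable from start in 1..max_hops hops."""
    visited = {start}
    found: Set[str] = set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # depth limit reached: do not expand this node further
        for neighbour in adjacency.get(node, []):
            if neighbour not in visited:  # pruning: a shorter path already found it
                visited.add(neighbour)
                found.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return found


# Undirected KNOWS relationships, listed from both sides.
knows = {
    "Alice": ["Bob"],
    "Bob": ["Alice", "Carol"],
    "Carol": ["Bob", "Dave"],
    "Dave": ["Carol", "Erin"],
    "Erin": ["Dave"],
}
print(sorted(within_hops(knows, "Alice", 3)))
```

Erin is four hops from Alice and is therefore not returned. Each node is expanded at most once, which is what keeps the search from growing with the number of possible paths.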

Pattern Matching Optimization

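The Cypher optimizer chooses an efficient traversal order:

// The optimizer chooses the start node
MATCH (a:Person)-[:KNOWS]->(b)-[:WORKS_AT]->(c)
WHERE a.name = 'Alice'
RETURN b, c

Using its statistics, the optimizer identifies the most selective part of the pattern (the Person named Alice) as the starting point and expands outward from there, rather than starting from the unconstrained nodes b or c.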
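The idea behind choosing a starting point can be reduced to comparing estimated row counts. The sketch below is deliberately simplistic, and every number in it is a hypothetical statistic invented for illustration; the real cost model considers far more than cardinality at the start node.

```python
from typing import Dict


def cheapest_start(estimated_rows: Dict[str, float]) -> str:
    """Pick the access path that is expected to produce the fewest rows."""
    return min(estimated_rows, key=estimated_rows.get)


# Hypothetical statistics: 1,000,000 nodes, 200,000 of them labelled Person.
total_nodes = 1_000_000
person_nodes = 200_000
name_selectivity = 1 / person_nodes  # assumption: names are roughly unique

estimates = {
    "index seek on a:Person(name)": person_nodes * name_selectivity,
    "label scan on a:Person": float(person_nodes),
    "all-nodes scan for b": float(total_nodes),
    "all-nodes scan for c": float(total_nodes),
}
print(cheapest_start(estimates))
```

Starting from roughly one row and expanding outward keeps every later step small, which is why a selective predicate on an indexed property matters so much for query performance.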

Query Execution Pipeline

Understanding the query pipeline helps optimize Cypher queries.

Parsing and Rewriting

// Original query
MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(friend)
RETURN friend.name

Pipeline:

Parse → Abstract Syntax Tree (AST)
Semantic Analysis → Validate types, labels, properties
Rewrite → Simplify, apply rules
Logical Plan → Algebraic representation
Cost Planning → Choose execution strategy
Physical Plan → Executable plan

Execution Plans

View query plans:

// Explain without executing
EXPLAIN MATCH (p:Person) RETURN count(p)

// Profile actual execution
PROFILE MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(f) RETURN f

Example plan:

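Planner COST
Runtime PIPELINED

+ProduceResults
|
+Expand(All)
|
+NodeIndexSeek

Read the plan from the bottom up: NodeIndexSeek finds Alice through the index on Person.name, Expand(All) follows her KNOWS relationships, and ProduceResults returns the rows.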

Operator Types

Key execution operators:

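NodeIndexSeek: Index lookup for a specific value or range
NodeIndexScan: Full index scan
Expand(All): Relationship traversal from a known node
Apply: Nested iteration (runs the right side once per left-side row)
NodeHashJoin: Equi-join on node variables using a hash table
CartesianProduct: Cross product of two independent row sets
EagerAggregation: Grouping and aggregation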
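The hash join operator listed above follows a classic build-and-probe pattern, sketched here on plain Python dictionaries. The function name `hash_join` and the row format are assumptions for illustration; the real operator works on the runtime's internal row representation.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def hash_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Equi-join two row sets on a shared key using a hash table."""
    table = defaultdict(list)
    for row in left:                  # build phase: hash one input by the join key
        table[row[key]].append(row)
    joined = []
    for row in right:                 # probe phase: look up each row of the other input
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


people = [{"b": 1, "name": "Bob"}, {"b": 2, "name": "Carol"}]
jobs = [{"b": 1, "company": "Acme"}, {"b": 3, "company": "Initech"}]
print(hash_join(people, jobs, "b"))
```

Each input is read once, so the cost grows with the sum of the input sizes rather than their product, which is what separates a hash join from a CartesianProduct followed by a filter.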

Caching

Neo4j uses multiple caches for performance.

Object Cache

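Neo4j versions before 2.3 kept an on-heap cache of nodes and relationships as objects. It was removed in 2.3, and current versions rely on the page cache instead, so the setting below applies to legacy installations only:

# Legacy setting (Neo4j 2.2 and earlier)
cache_type=soft

Legacy cache types:

none: No object caching
strong: Never evict loaded objects
soft: Evict based on GC pressure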

Query Cache

Caches compiled query plans:

# Query cache size (number of cached plans per database)
server.db.query_cache_size=1000

The query cache stores compiled execution plans, avoiding recompilation.
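A plan cache can be pictured as a bounded map from query text to compiled plan. The sketch below uses least-recently-used eviction, which is an assumption made for illustration and not a statement about Neo4j's actual eviction policy; `PlanCache` and the `compile_plan` callback are hypothetical names.

```python
from collections import OrderedDict
from typing import Callable


class PlanCache:
    """Bounded cache from query text to compiled plan (LRU eviction is an assumption)."""

    def __init__(self, capacity: int, compile_plan: Callable[[str], str]) -> None:
        self.capacity = capacity
        self.compile_plan = compile_plan
        self.plans: "OrderedDict[str, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_plan(self, query_text: str) -> str:
        if query_text in self.plans:
            self.plans.move_to_end(query_text)  # mark as most recently used
            self.hits += 1
            return self.plans[query_text]
        self.misses += 1
        plan = self.compile_plan(query_text)    # the expensive step the cache avoids
        self.plans[query_text] = plan
        if len(self.plans) > self.capacity:
            self.plans.popitem(last=False)      # evict the least recently used plan
        return plan


cache = PlanCache(capacity=2, compile_plan=lambda text: f"plan for: {text}")
query = "MATCH (p:Person {name: $name}) RETURN p"
for _ in range(3):
    cache.get_plan(query)  # parameter values are not part of the query text
print(cache.hits, cache.misses)
```

Only the first call pays for compilation. Passing values as parameters, as with $name here, keeps the query text identical across calls so the cached plan can be reused.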

Page Cache

Caches Neo4j store files:

# Page cache size
server.memory.pagecache.size=4G

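The page cache is critical for read-heavy workloads: ideally it is large enough to hold the store and index files, so that traversals are served from memory rather than from disk.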

Transaction Management

Neo4j ensures ACID properties.

Write Transaction Flow

:begin

// Statement 1: add a node
CREATE (p:Person {name: 'NewPerson'});

// Statement 2: add a relationship (still in the same transaction)
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'NewPerson'})
CREATE (a)-[:KNOWS]->(b);

:commit

Steps:

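Begin - Start the transaction
Lock - Acquire write locks on the affected nodes and relationships
Write - Collect the changes in the in-memory transaction state
Commit - Append the changes to the transaction log, apply them to the store through the page cache, release locks
Checkpoint - Periodically flush modified pages to the store files so that old transaction logs can be pruned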
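The ordering in the commit step, log first and store second, is what makes recovery possible after a crash. The sketch below shows the general write-ahead logging idea with in-memory stand-ins; `TinyStore`, its JSON log entries, and the `recover` method are hypothetical and much simpler than Neo4j's transaction log format.

```python
import json
from typing import Dict, List


class TinyStore:
    def __init__(self) -> None:
        self.log: List[str] = []        # stands in for the transaction log on disk
        self.data: Dict[str, str] = {}  # stands in for store pages in the page cache

    def commit(self, changes: Dict[str, str]) -> None:
        """Make the transaction durable in the log before touching the store."""
        self.log.append(json.dumps(changes, sort_keys=True))  # 1. append to the log
        self.data.update(changes)                             # 2. apply to the store

    @classmethod
    def recover(cls, log: List[str]) -> "TinyStore":
        """Rebuild the store after a crash by replaying the log in order."""
        store = cls()
        for entry in log:
            store.log.append(entry)
            store.data.update(json.loads(entry))
        return store


store = TinyStore()
store.commit({"node:1:name": "Alice"})
store.commit({"node:2:name": "NewPerson", "rel:1": "1-KNOWS->2"})

# Simulate a crash that loses the in-memory pages but keeps the log.
recovered = TinyStore.recover(store.log)
print(recovered.data == store.data)
```

A checkpoint, in this picture, is the moment when the store contents are known to be safely on disk, so that log entries before that point no longer need to be replayed.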

Lock Management

Neo4j uses fine-grained locking:

// SET implicitly takes a write lock on Alice's node until the transaction ends
MATCH (p:Person {name: 'Alice'})
SET p.lastLogin = timestamp()

Lock types:

Node locks: For node modifications
Relationship locks: For relationship modifications
Schema locks: For index/schema changes

Concurrency Control

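Neo4j does not rely on snapshot-style MVCC (Multi-Version Concurrency Control). It combines write locks with read-committed isolation by default, which provides:

Read committed: Queries see only committed data, although repeated reads within one transaction may return different results
No blocking: Readers take no locks by default, so they don’t wait for writers
Stronger isolation on demand: Explicitly acquiring write locks on the relevant nodes or relationships serializes access where needed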

Memory Management

Understanding memory usage is crucial for performance.

Memory Pools

# Memory configuration
# Heap: query execution and transaction state; grows from the initial to the max size as needed
server.memory.heap.initial_size=4G
server.memory.heap.max_size=8G
# Page cache: store and index file cache
server.memory.pagecache.size=4G

Memory Allocation

// View memory usage per pool
CALL dbms.listPools();

// View running transactions
SHOW TRANSACTIONS;

Shows:

Heap and native memory used per pool
Free and total memory per pool
Running transactions and their current queries

Garbage Collection

GC tuning for production:

# JVM options in neo4j.conf
server.jvm.additional=-XX:+UseG1GC
server.jvm.additional=-XX:MaxGCPauseMillis=100
server.jvm.additional=-XX:+ParallelRefProcEnabled

Performance Characteristics

Knowing the cost of each basic operation helps you design the data model.

Complexity

Operation	Time Complexity
Node lookup by ID	O(1)
Index lookup	O(log n)
Relationship traversal	O(1) per hop
Pattern matching	Varies by query

Scalability

Neo4j scales differently than relational databases:

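Read scaling: Add read replicas to a cluster
Write scaling: Writes to a database go through a single leader, so scaling writes means sharding data across databases (requires careful design)
Storage: Scales horizontally by federating several databases behind one query endpoint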

Conclusion

Neo4j’s internal architecture enables efficient graph operations through carefully designed storage structures, indexes, and execution strategies. The native pointer-based relationship traversal is the key to its performance—the database follows pointers rather than performing expensive joins.

Understanding these internals helps you design better models, write more efficient queries, and troubleshoot performance issues. The graph model naturally lends itself to efficient traversal, making Neo4j ideal for connected data applications.

In the next article, we’ll explore recent Neo4j developments and trends for 2025-2026.