Neo4j Internals: Understanding the Graph Engine

Introduction

Understanding Neo4j’s internal architecture helps you design better graph models, optimize queries, and troubleshoot performance issues. While Neo4j presents a simple graph model to users, internally it employs sophisticated techniques for storage, indexing, and query execution. This article explores the key components that make Neo4j efficient at handling connected data.

Storage Architecture

Neo4j stores graph data across multiple files, each optimized for specific access patterns.

Store Files

Neo4j uses a custom storage engine that keeps each kind of data in its own store file. The layout below applies to the record-based store formats:

data/databases/neo4j/
├── neostore                              # Store metadata
├── neostore.counts.db.a                  # Count store
├── neostore.counts.db.b
├── neostore.labeltokenstore.db           # Label tokens
├── neostore.labeltokenstore.db.names
├── neostore.nodestore.db                 # Node records
├── neostore.nodestore.db.labels          # Overflow for nodes with many labels
├── neostore.propertystore.db             # Property records
├── neostore.propertystore.db.arrays      # Array values
├── neostore.propertystore.db.index       # Property key tokens
├── neostore.propertystore.db.index.names
├── neostore.propertystore.db.strings     # Long string values
├── neostore.relationshipstore.db         # Relationship records
├── neostore.relationshiptypestore.db     # Relationship type tokens
├── neostore.relationshiptypestore.db.names
├── neostore.relationshipgroupstore.db    # Relationship groups for dense nodes
└── neostore.schemastore.db               # Index and constraint definitions

Node Storage

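Each node is stored as a fixed-size record (15 bytes in the standard record format). The node ID is not stored in the record; it is the record's position in the file, so node N starts at byte offset N × 15:

// Simplified node record structure
struct NodeRecord {
    uint8_t  inUse;        // 1 byte  - in-use flag plus high-order pointer bits
    uint32_t relOffset;    // 4 bytes - first relationship pointer
    uint32_t propOffset;   // 4 bytes - first property pointer
    uint8_t  labels[5];    // 5 bytes - inline labels or pointer to the label store
    uint8_t  extra;        // 1 byte  - flags (for example, dense-node marker)
    // Total: 15 bytes per node
};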

The relationship pointer forms a linked list through all relationships of a node.
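Because records are fixed-size, a node ID translates directly into a file offset. The following is a minimal Python sketch of that arithmetic, assuming a simplified 15-byte layout (1-byte in-use flag, two 4-byte pointers, 5 label bytes, 1 extra byte). The function names and the in-memory `store` are hypothetical, and the real format also packs high-order pointer bits into the first byte, which this sketch ignores.

```python
import struct

# Assumed simplified layout: in-use flag, first relationship pointer,
# first property pointer, 5 label bytes, 1 extra byte (big-endian, no padding).
NODE_RECORD = struct.Struct(">BII5sB")


def record_offset(node_id: int) -> int:
    """Byte offset of a node record; the ID is implied by position, not stored."""
    return node_id * NODE_RECORD.size


def pack_node(in_use: bool, first_rel: int, first_prop: int) -> bytes:
    """Encode one node record with empty label and extra bytes."""
    return NODE_RECORD.pack(int(in_use), first_rel, first_prop, b"\x00" * 5, 0)


def read_node(store: bytes, node_id: int) -> dict:
    """Decode the record for node_id straight from its computed offset."""
    in_use, first_rel, first_prop, _labels, _extra = NODE_RECORD.unpack_from(
        store, record_offset(node_id)
    )
    return {"in_use": bool(in_use), "first_rel": first_rel, "first_prop": first_prop}


# Hypothetical store holding three node records back to back.
store = pack_node(True, 7, 3) + pack_node(True, 12, 4) + pack_node(False, 0, 0)
node_1 = read_node(store, 1)
```

Looking up node 1 reads 15 bytes starting at offset 15, with no search involved, which is what makes lookup by ID constant-time.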

Relationship Storage

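Relationships are stored in fixed-size records (34 bytes in the standard record format). As with nodes, the relationship ID is the record's position in the file:

// Simplified relationship record
struct RelationshipRecord {
    uint8_t  inUse;          // 1 byte  - in-use flag plus high-order pointer bits
    uint32_t firstNode;      // 4 bytes - start node ID
    uint32_t secondNode;     // 4 bytes - end node ID
    uint32_t type;           // 4 bytes - relationship type token
    uint32_t firstPrevRel;   // 4 bytes - previous relationship of the start node
    uint32_t firstNextRel;   // 4 bytes - next relationship of the start node
    uint32_t secondPrevRel;  // 4 bytes - previous relationship of the end node
    uint32_t secondNextRel;  // 4 bytes - next relationship of the end node
    uint32_t propOffset;     // 4 bytes - first property
    uint8_t  firstInChain;   // 1 byte  - marks the first record in either chain
    // Total: 34 bytes per relationship
};

Each relationship sits in two doubly linked lists, one per endpoint, which enables efficient traversal from both directions.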
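To make the chain structure concrete, here is a small Python sketch of walking one node's relationship chain. It models only the "next" pointers and assumes no self-loops; the `Rel` class, the `NO_REL` sentinel, and the sample data are hypothetical illustrations, not Neo4j's actual classes.

```python
from dataclasses import dataclass
from typing import Dict, Iterator

NO_REL = -1  # hypothetical sentinel meaning "end of chain"


@dataclass
class Rel:
    first_node: int
    second_node: int
    rel_type: str
    first_next: int = NO_REL   # next relationship in first_node's chain
    second_next: int = NO_REL  # next relationship in second_node's chain


def relationships_of(node_id: int, first_rel: int, rels: Dict[int, Rel]) -> Iterator[int]:
    """Yield the IDs of all relationships attached to node_id by following pointers."""
    rel_id = first_rel
    while rel_id != NO_REL:
        yield rel_id
        rel = rels[rel_id]
        # Each record is part of two chains; pick the pointer belonging to this node.
        rel_id = rel.first_next if rel.first_node == node_id else rel.second_next


# Three nodes (0, 1, 2) and three KNOWS relationships: 0->1, 0->2, 1->2.
rels = {
    0: Rel(0, 1, "KNOWS", first_next=1, second_next=2),
    1: Rel(0, 2, "KNOWS", first_next=NO_REL, second_next=2),
    2: Rel(1, 2, "KNOWS"),
}
first_rel = {0: 0, 1: 0, 2: 1}  # each node's first relationship pointer

chain_of_node_1 = list(relationships_of(1, first_rel[1], rels))
```

Relationship 0 appears in the chains of both node 0 and node 1, so the same record serves traversal from either endpoint.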

Property Storage

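Properties are stored as a linked list of fixed-size property records. Each record holds a few property blocks with a key token and an inline value; property key names live in the neostore.propertystore.db.index files.

// Show the schema graph (labels, relationship types, and how they connect)
CALL db.schema.visualization()

Property blocks support various data types. Values that do not fit inline, such as long strings and arrays, are stored in the separate neostore.propertystore.db.strings and neostore.propertystore.db.arrays files.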

Index Architecture

Neo4j maintains several index types for efficient lookups.

Label Indexes

Neo4j automatically maintains a token lookup index that maps each label to the nodes carrying it. Indexes on properties are created explicitly:

// Create a range index on a node property
CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name)

// Range indexes are stored as B+trees
// Lookup: O(log n) for exact matches
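The O(log n) bound comes from searching sorted keys. As a rough illustration only, the Python sketch below uses binary search over a sorted list of (value, node ID) pairs as a stand-in for a tree-based index. It is not how Neo4j implements its indexes, and the `seek` function and sample entries are hypothetical.

```python
import bisect
from typing import List, Tuple

# Hypothetical index entries: (property value, node ID), kept in sorted order.
index: List[Tuple[str, int]] = sorted(
    [("Alice", 0), ("Bob", 1), ("Alice", 7), ("Carol", 2)]
)


def seek(entries: List[Tuple[str, int]], value: str) -> List[int]:
    """Return the node IDs whose indexed property equals value."""
    # Binary search finds the first candidate in O(log n) steps;
    # ("Alice",) sorts before ("Alice", 0), so we land on the first match.
    position = bisect.bisect_left(entries, (value,))
    matches = []
    while position < len(entries) and entries[position][0] == value:
        matches.append(entries[position][1])
        position += 1
    return matches


alice_ids = seek(index, "Alice")
```

After the logarithmic search, the cost is proportional to the number of matching entries, which is why selective predicates make good starting points.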

Relationship Type Indexes

Properties on relationships of a given type can be indexed in the same way:

// Index on a relationship property
CREATE INDEX knows_since IF NOT EXISTS
FOR ()-[r:KNOWS]-() ON (r.since)

Full-Text Indexes

For text search across one or more string properties:

// Create a full-text index (ON EACH is required before the property list)
CREATE FULLTEXT INDEX person_fulltext IF NOT EXISTS
FOR (p:Person) ON EACH [p.name, p.bio]

Index Caching

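Range and lookup indexes are stored in files that are read through the page cache, so there is no separate index cache to size. Account for index files when sizing the page cache:

# The page cache holds store files and native index files
server.memory.pagecache.size=4G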

Relationship Traversal

The core of Neo4j’s power is efficient relationship traversal.

Native Graph Traversal

When you traverse relationships, Neo4j follows the record pointers directly:

// After Alice is located, each hop follows stored pointers
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(friend)
RETURN friend.name

Execution:

  1. Find Alice’s node record (through an index on name, or a label scan)
  2. Follow relOffset to the first relationship
  3. Check the relationship type (KNOWS) and direction, then continue along the chain
  4. Follow the pointer to each friend node
  5. No index lookup is needed for any hop

This is why graph traversals are fast: the database follows pointers rather than joining tables.

Variable-Length Traversal

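For paths of variable length:

// Find everyone within 3 KNOWS hops of Alice
MATCH (alice:Person {name: 'Alice'})-[:KNOWS*1..3]-(friend)
RETURN DISTINCT friend

// Execution: expand hop by hop, starting from Alice
// A relationship is never traversed twice within one path

Cypher guarantees termination through relationship uniqueness rather than a visited-node set, so the same node can be reached along several paths. When only distinct end nodes are needed, as here, the planner can use a pruning variant of the expand operator that skips redundant paths.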
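A bounded variable-length expansion can be sketched in a few lines of Python. This is a simplified depth-first version that assumes the rule that a relationship is used at most once per path; it is an illustration, not Neo4j's actual operator, and the `expand` function and sample graph are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Undirected adjacency: node -> list of (relationship id, neighbour).
Graph = Dict[str, List[Tuple[int, str]]]


def expand(adj: Graph, start: str, min_hops: int, max_hops: int) -> Set[str]:
    """Collect distinct end nodes of paths between min_hops and max_hops long."""
    found: Set[str] = set()

    def walk(node: str, depth: int, used: FrozenSet[int]) -> None:
        if depth >= min_hops:
            found.add(node)
        if depth == max_hops:
            return
        for rel_id, other in adj[node]:
            # Relationship uniqueness: never reuse a relationship in one path.
            if rel_id not in used:
                walk(other, depth + 1, used | {rel_id})

    walk(start, 0, frozenset())
    return found


# Chain: alice - bob - carol - dave - erin
adj: Graph = {
    "alice": [(1, "bob")],
    "bob": [(1, "alice"), (2, "carol")],
    "carol": [(2, "bob"), (3, "dave")],
    "dave": [(3, "carol"), (4, "erin")],
    "erin": [(4, "dave")],
}
friends = expand(adj, "alice", 1, 3)
```

In this chain, `erin` is four hops away and is therefore excluded. The work grows with the number of paths explored, which is why tight hop bounds matter on densely connected graphs.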

Pattern Matching Optimization

The Cypher planner chooses an efficient traversal order:

// The planner picks the starting node
MATCH (a:Person)-[:KNOWS]->(b)-[:WORKS_AT]->(c)
WHERE a.name = 'Alice'
RETURN b, c

The planner identifies the most selective predicate (a.name = 'Alice') and starts there, then expands outward to b and c.
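The idea of picking the cheapest starting point can be sketched as a comparison of estimated row counts. The Python below is a toy model under made-up statistics; the counts, the selectivity, and the `choose_start` helper are all hypothetical, and a real cost model weighs far more than cardinality.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StartOption:
    operator: str
    estimated_rows: float


def choose_start(options: List[StartOption]) -> StartOption:
    """Pick the candidate expected to produce the fewest rows."""
    return min(options, key=lambda option: option.estimated_rows)


# Hypothetical statistics for illustration only.
total_nodes = 5_000_000
person_count = 1_000_000
name_selectivity = 1 / 500_000  # roughly two people per distinct name

options = [
    StartOption("AllNodesScan", total_nodes),
    StartOption("NodeByLabelScan(:Person)", person_count),
    StartOption("NodeIndexSeek(:Person(name))", person_count * name_selectivity),
]
best = choose_start(options)
```

Under these assumed numbers the index seek yields about two rows, so every later expansion starts from a handful of nodes instead of millions.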

Query Execution Pipeline

Understanding the query pipeline helps optimize Cypher queries.

Parsing and Rewriting

// Original query
MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(friend)
RETURN friend.name

Pipeline:

  1. Parse → Abstract Syntax Tree (AST)
  2. Semantic Analysis → Validate types, labels, properties
  3. Rewrite → Simplify, apply rules
  4. Logical Plan → Algebraic representation
  5. Cost Planning → Choose execution strategy
  6. Physical Plan → Executable plan

Execution Plans

View query plans:

// Explain without executing
EXPLAIN MATCH (p:Person) RETURN count(p)

// Profile actual execution
PROFILE MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(f) RETURN f

Example plan:

Planner COST
Runtime PIPELINED
|   +ProduceResults
|   +Filter
|   +Expand(ALL)
|   +NodeIndexSeek

Operator Types

Key execution operators:

  • NodeIndexSeek: Index lookup for specific values or ranges
  • NodeIndexScan: Full index scan
  • Expand: Relationship traversal
  • Apply: Nested iteration
  • NodeHashJoin: Equi-join on nodes using hash tables
  • CartesianProduct: Cross product
  • EagerAggregation: Grouping and aggregation

Caching

Neo4j uses multiple caches for performance.

Object Cache

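Neo4j 2.x cached nodes and relationships as Java objects on the heap. This cache was removed in Neo4j 3.0, and current versions rely on the page cache instead. In 2.x it was configured as follows:

# Neo4j 2.x only
cache_type=soft

Cache types:

  • none: No object caching
  • soft: Entries held through soft references, cleared under GC memory pressure
  • weak: Entries held through weak references, cleared at the next GC
  • strong: Entries are never evicted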

Query Cache

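Caches compiled query plans:

# Number of cached plans per database (Neo4j 5 setting name)
server.db.query_cache_size=1000

The query cache maps query text to its compiled execution plan, so repeated queries skip planning. Using parameters instead of literal values lets one cached plan serve many calls.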
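A plan cache keyed by query text behaves roughly like the Python sketch below. It is a hypothetical model that assumes least-recently-used eviction and ignores any rewriting of literals the server may do; the `PlanCache` class is not a Neo4j API. The point it illustrates is why parameterized queries reuse a plan while inlined literals do not.

```python
from collections import OrderedDict


class PlanCache:
    """Toy plan cache with least-recently-used eviction (an assumed policy)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.plans: "OrderedDict[str, str]" = OrderedDict()
        self.compilations = 0

    def get_plan(self, query_text: str) -> str:
        if query_text in self.plans:
            self.plans.move_to_end(query_text)  # mark as recently used
            return self.plans[query_text]
        self.compilations += 1
        plan = f"plan#{self.compilations}"  # stand-in for real compilation
        self.plans[query_text] = plan
        if len(self.plans) > self.capacity:
            self.plans.popitem(last=False)  # evict the least recently used plan
        return plan


cache = PlanCache(capacity=1000)
names = ["Alice", "Bob", "Carol"]

# Parameterized: the text is identical on every call, so it compiles once.
for _ in names:
    cache.get_plan("MATCH (p:Person {name: $name}) RETURN p")
parameterised = cache.compilations

# Literals: each distinct text is a new cache key and compiles again.
for name in names:
    cache.get_plan(f"MATCH (p:Person {{name: '{name}'}}) RETURN p")
literal = cache.compilations - parameterised
```

Three parameterized calls cost one compilation, while three literal variants cost three and take up three cache slots.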

Page Cache

Caches Neo4j store files:

# Page cache size
server.memory.pagecache.size=4G

The page cache is critical for read-heavy workloads: a traversal served entirely from cached pages never waits on disk.
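A rough lower bound for the page cache can be estimated from record counts. The Python sketch below assumes per-record sizes of 15, 34, and 41 bytes for nodes, relationships, and property records; these are assumptions about the standard record format, and the counts are made up. It leaves out indexes and the string and array stores, so real stores will be larger.

```python
# Assumed record sizes in bytes (standard record format); treat as estimates.
NODE_BYTES = 15
REL_BYTES = 34
PROP_BYTES = 41


def estimate_store_bytes(nodes: int, relationships: int, property_records: int) -> int:
    """Size of the three main record stores, excluding indexes and overflow stores."""
    return (
        nodes * NODE_BYTES
        + relationships * REL_BYTES
        + property_records * PROP_BYTES
    )


# Hypothetical graph: 10M nodes, 50M relationships, 30M property records.
store_bytes = estimate_store_bytes(10_000_000, 50_000_000, 30_000_000)
store_gib = store_bytes / 2**30
```

For this hypothetical graph the three record stores come to just under 3 GiB, so a 4G page cache would hold them with some room left for index files.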

Transaction Management

Neo4j ensures ACID properties.

Write Transaction Flow

// cypher-shell: both statements run inside one explicit transaction
:begin

// Statement 1: add a node
CREATE (p:Person {name: 'NewPerson'});

// Statement 2: add a relationship
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'NewPerson'})
CREATE (a)-[:KNOWS]->(b);

:commit

Steps:

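  1. Begin - Start the transaction
  2. Lock - Acquire write locks on affected nodes and relationships
  3. Write - Collect changes in transaction state (the store files are not yet modified)
  4. Commit - Append to the transaction log, apply changes to the store through the page cache, release locks
  5. Checkpoint - Periodically flush changed pages to the store files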
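The commit and checkpoint steps follow the general write-ahead logging pattern, which can be sketched as below. This is a deliberately tiny Python model with hypothetical names (`TinyStore`, `pages`, `disk`), not Neo4j's implementation; it only shows why logging before applying lets committed work survive a crash.

```python
from typing import Dict, List


class TinyStore:
    """Toy write-ahead-logging store: log first, apply in memory, flush later."""

    def __init__(self) -> None:
        self.log: List[Dict[str, str]] = []  # append-only transaction log (durable)
        self.pages: Dict[str, str] = {}      # cached pages (lost on crash)
        self.disk: Dict[str, str] = {}       # store files (durable)
        self.last_checkpoint = 0             # log position covered by the store files

    def commit(self, changes: Dict[str, str]) -> None:
        self.log.append(dict(changes))  # durable before the store is touched
        self.pages.update(changes)      # applied to cached pages only

    def checkpoint(self) -> None:
        self.disk.update(self.pages)    # flush changed pages to the store files
        self.last_checkpoint = len(self.log)

    def recover(self) -> Dict[str, str]:
        """Rebuild state after a crash by replaying log entries past the checkpoint."""
        state = dict(self.disk)
        for entry in self.log[self.last_checkpoint:]:
            state.update(entry)
        return state


store = TinyStore()
store.commit({"n1": "Alice"})
store.checkpoint()
store.commit({"n2": "NewPerson"})  # committed but not yet checkpointed
store.pages.clear()                # simulate a crash that loses cached pages
recovered = store.recover()
```

The second commit never reached the store files, yet recovery restores it from the log. Checkpoints exist to bound how much of the log has to be replayed.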

Lock Management

Neo4j uses fine-grained locking:

// SET takes a write lock on the matched node until the transaction ends
MATCH (p:Person {name: 'Alice'})
SET p.lastLogin = timestamp()

Lock types:

  • Node locks: For node modifications
  • Relationship locks: For relationship modifications
  • Schema locks: For index/schema changes

Concurrency Control

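Neo4j’s default isolation level is read committed, enforced with locks rather than full snapshot isolation:

  • Read consistency: A read sees only committed data, but a value can change between two reads in the same transaction
  • No blocking: Reads take no locks by default, so readers don’t wait for writers
  • Isolation: Stricter guarantees require acquiring write locks explicitly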

Memory Management

Understanding memory usage is crucial for performance.

Memory Pools

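# Memory configuration
# Heap: query execution and transaction state
# (Neo4j recommends setting initial and max to the same value)
server.memory.heap.initial_size=4G
server.memory.heap.max_size=8G
# Page cache: off-heap cache for store and index files
server.memory.pagecache.size=4G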

Memory Allocation

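// List running transactions, including per-transaction memory estimates
SHOW TRANSACTIONS;

// List memory pools and their usage
CALL dbms.listPools();

Shows:

  • Heap memory used per pool
  • Native (off-heap) memory used per pool
  • Transaction memory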

Garbage Collection

GC tuning for production:

# JVM options in neo4j.conf
server.jvm.additional=-XX:+UseG1GC
server.jvm.additional=-XX:MaxGCPauseMillis=100
server.jvm.additional=-XX:+ParallelRefProcEnabled

Performance Characteristics

Knowing the cost of each basic operation helps with model design.

Complexity

Operation                 Time Complexity
Node lookup by ID         O(1)
Index lookup              O(log n)
Relationship traversal    O(1) per hop
Pattern matching          Varies by query

Scalability

Neo4j scales differently than relational databases:

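  • Read scaling: Add read replicas (secondaries) to a cluster
  • Write scaling: Each database has one writer at a time, so spreading writes means sharding data across databases, which requires careful design
  • Storage: Sharded or federated databases can be queried together through composite databases (Fabric)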

Conclusion

Neo4j’s internal architecture enables efficient graph operations through carefully designed storage structures, indexes, and execution strategies. The native pointer-based relationship traversal is the key to its performance: the database follows pointers rather than performing expensive joins.

Understanding these internals helps you design better models, write more efficient queries, and troubleshoot performance issues. The graph model naturally lends itself to efficient traversal, making Neo4j ideal for connected data applications.

In the next article, we’ll explore recent Neo4j developments and trends for 2025-2026.
