Solr Internals: Lucene, Indexing, and Search

Introduction

Understanding Solr's internal architecture helps you optimize queries and troubleshoot issues. This article walks through the Lucene foundation, the inverted index, the indexing pipeline, segment merging, query execution, caching, and scoring.

Apache Lucene Foundation

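Solr is a search server built on the Apache Lucene library. Solr handles requests, query parsing, and response formatting; Lucene reads and writes the index: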

┌─────────────────────────────────────┐
│ Solr                                │
│  ┌───────────────────────────────┐  │
│  │ Request Handler               │  │
│  │ Query Parser                  │  │
│  │ Response Writer               │  │
│  └───────────────────────────────┘  │
│  ┌───────────────────────────────┐  │
│  │ Lucene Library                │  │
│  │  - IndexWriter                │  │
│  │  - IndexReader                │  │
│  │  - IndexSearcher              │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

Inverted Index

How It Works

Documents:
1. "Solr is fast"
2. "Solr is powerful"

Inverted Index:
┌────────────┬───────────────┐
│ Term       │ Doc IDs      │
├────────────┼───────────────┤
│ Solr       │ 1, 2         │
│ is         │ 1, 2         │
│ fast       │ 1             │
│ powerful   │ 2             │
└────────────┴───────────────┘
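
The table above can be reproduced with a few lines of Python. This is a minimal sketch, not how Lucene stores postings: the function name `build_inverted_index` is hypothetical, terms are split on whitespace, and no analysis (lowercasing, stemming) is applied so that the output matches the table.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():      # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Solr is fast", 2: "Solr is powerful"}
index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, ids)
```

Looking up a term is then a single dictionary access, which is why term queries stay fast even as the number of documents grows.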

Index Structure

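Index (one Lucene index per core/shard)
  ├── segments_N          (commit point listing the live segments)
  ├── segment_N.si        (segment info)
  ├── segment_N.fdt       (stored field data)
  ├── segment_N.fdx       (stored field index)
  ├── segment_N.tim       (term dictionary)
  ├── segment_N.tip       (term index)
  ├── segment_N.doc       (postings: doc IDs and term frequencies)
  ├── segment_N.nvd       (norms data)
  ├── segment_N.nvm       (norms metadata)
  └── write.lock

Here segment_N stands for the per-segment file prefix (on disk it looks like _0, _1, and so on). Exact file names vary with the Lucene version and codec.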

Document Indexing

Index Pipeline

Document
    │
    ▼
┌─────────────┐
│  Analyzer   │──► Tokenize, lowercase, stem
└─────────────┘
    │
    ▼
┌─────────────┐
│  IndexWriter│──► Build inverted index
└─────────────┘
    │
    ▼
┌─────────────┐
│  Segment   │──► Write to segment
└─────────────┘

Analysis Chain

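// Field type with analyzer (body of a Schema API "add-field-type" command).
// A field such as "title" then references it with "type": "text_en".
{
  "name": "text_en",
  "class": "solr.TextField",
  "analyzer": {
    "tokenizer": {
      "class": "solr.StandardTokenizerFactory"
    },
    "filters": [
      { "class": "solr.LowerCaseFilterFactory" },
      { "class": "solr.ASCIIFoldingFilterFactory" },
      { "class": "solr.PorterStemFilterFactory" }
    ]
  }
}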
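
To make the chain concrete, here is a rough Python sketch of the same steps: tokenize, lowercase, fold accents to ASCII, then stem. All function names are hypothetical, the tokenizer is a simple regex split rather than Lucene's standard tokenizer, and `toy_stem` is a deliberately crude suffix stripper, not the Porter algorithm.

```python
import re
import unicodedata

def tokenize(text):
    # Rough stand-in for a standard tokenizer: split on non-word characters.
    return [tok for tok in re.split(r"\W+", text) if tok]

def lowercase(tokens):
    return [tok.lower() for tok in tokens]

def ascii_fold(tokens):
    # Decompose accented characters and drop the combining marks.
    folded = []
    for tok in tokens:
        decomposed = unicodedata.normalize("NFKD", tok)
        folded.append("".join(ch for ch in decomposed if not unicodedata.combining(ch)))
    return folded

def toy_stem(tokens):
    # NOT the Porter algorithm: strips a few common suffixes for illustration only.
    out = []
    for tok in tokens:
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out

def analyze(text):
    tokens = tokenize(text)
    for token_filter in (lowercase, ascii_fold, toy_stem):  # order matters
        tokens = token_filter(tokens)
    return tokens

print(analyze("Café owners searched"))
```

The same chain normally runs at both index time and query time, so a query for "search" can match a document that contains "searched".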

Segment Merging

Merge Policy

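// TieredMergePolicy (default), set in the <indexConfig> section of solrconfig.xml
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>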

// Optimize (force merge)
curl "http://localhost:8983/solr/gettingstarted/update?optimize=true"
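
The effect of the two merge settings can be illustrated with a toy simulation. This is a heavily simplified sketch, not the real TieredMergePolicy (which also weighs segment sizes, deletes, and a maximum merged segment size): the function `merge_if_needed` is hypothetical, and segments are represented only by their size.

```python
def merge_if_needed(segment_sizes, segments_per_tier=10, max_merge_at_once=10):
    """Toy merge step: once there are more segments than segments_per_tier,
    merge the smallest ones (at most max_merge_at_once) into one segment."""
    if len(segment_sizes) <= segments_per_tier:
        return sorted(segment_sizes)
    ordered = sorted(segment_sizes)
    to_merge = ordered[:max_merge_at_once]
    remaining = ordered[max_merge_at_once:]
    return sorted(remaining + [sum(to_merge)])

segments = []
for flush in range(25):
    segments.append(1)                 # each flush adds a small segment of size 1
    segments = merge_if_needed(segments)
print(segments)
```

Even in this toy version the segment count stays bounded while the total amount of indexed data keeps growing, which is the point of background merging.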

Query Execution

Query Flow

Query
    │
    ▼
┌─────────────┐
│ QueryParser │──► Parse query syntax
└─────────────┘
    │
    ▼
┌─────────────┐
│   BooleanQuery│──► Build query plan
└─────────────┘
    │
    ▼
┌─────────────┐
│   IndexReader │──► Execute across segments
└─────────────┘
    │
    ▼
┌─────────────┐
│    Scorer  │──► Score documents
└─────────────┘
    │
    ▼
Results
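
The flow above can be sketched in Python for the simplest case, a conjunction of required terms. Everything here is an illustrative assumption: the two-segment toy index, the `parse_query` and `search` names, and the "a AND b" syntax. Scoring is omitted, so matches come back ordered by doc ID.

```python
# Toy index: two "segments", each mapping term -> set of doc IDs.
segments = [
    {"solr": {1, 2}, "is": {1, 2}, "fast": {1}, "powerful": {2}},
    {"solr": {3}, "is": {3}, "fast": {3}},
]

def parse_query(query_string):
    """Parse 'a AND b' style input into a list of required, lowercased terms."""
    return [part.strip().lower() for part in query_string.split(" AND ") if part.strip()]

def search(query_string):
    required_terms = parse_query(query_string)
    hits = set()
    if not required_terms:
        return []
    for segment in segments:           # execute against each segment in turn
        postings = [segment.get(term, set()) for term in required_terms]
        hits |= set.intersection(*postings)   # docs containing every required term
    return sorted(hits)

print(search("solr AND fast"))
```

Because each segment is searched independently and the per-segment results are combined, adding a new segment never requires rewriting the existing ones.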

Caching

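// Query result cache: ordered doc ID lists for a query, sort, and page
// (Solr 9 removed solr.LRUCache; on Solr 8.2 and earlier use class="solr.LRUCache")
<queryResultCache class="solr.CaffeineCache" size="10000"/>

// Filter cache: the set of matching doc IDs for each fq filter query
<filterCache class="solr.CaffeineCache" size="10000"/>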
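
The size attribute caps the number of cached entries, so something has to be evicted once the cache is full. The sketch below assumes plain least-recently-used eviction to show the idea; the class name `BoundedLRUCache` and the sample filter strings are hypothetical, and Solr's actual cache implementations differ in their eviction details.

```python
from collections import OrderedDict

class BoundedLRUCache:
    """Size-bounded cache that evicts the least recently used entry."""

    def __init__(self, size):
        self.size = size
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)          # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.size:
            self._entries.popitem(last=False)   # drop the least recently used entry

filter_cache = BoundedLRUCache(size=2)
filter_cache.put("category:books", {1, 4, 7})
filter_cache.put("in_stock:true", {1, 2, 3})
filter_cache.get("category:books")              # touch, so it is not the oldest
filter_cache.put("lang:en", {2, 7})             # evicts "in_stock:true"
print(filter_cache.get("in_stock:true"))
```

A filter that is reused across many queries stays in the cache, while one-off filters are pushed out first.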

TF-IDF Scoring

Formula

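Score(q,d) = sum over query terms t of: tf(t in d) * idf(t) * boost(t) * lengthNorm(field of t in d)

where:
- tf(t in d) = term frequency: how often t occurs in document d
- idf(t) = inverse document frequency: higher for terms that occur in fewer documents
- boost(t) = boost applied to the term's field or query clause
- lengthNorm = 1/sqrt(field length), so a match in a shorter field scores higher

This is the classic TF-IDF model (Lucene's ClassicSimilarity). Since Solr 6 the default similarity is BM25, which keeps the tf and idf ideas but saturates term frequency and normalizes field length differently.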
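
As a worked example, the formula can be evaluated for the two sample documents from the inverted index section. The concrete definitions used here are assumptions borrowed from Lucene's ClassicSimilarity (tf as the square root of the raw count, idf as 1 + ln((docCount + 1) / (docFreq + 1))); the function names are hypothetical and the boost defaults to 1.

```python
import math

def tf(freq):
    return math.sqrt(freq)                       # assumed: sqrt of raw count

def idf(doc_freq, doc_count):
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))

def length_norm(field_length):
    return 1.0 / math.sqrt(field_length)

def score(query_terms, doc_tokens, doc_freqs, doc_count, boost=1.0):
    total = 0.0
    norm = length_norm(len(doc_tokens))
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue                             # term absent: contributes nothing
        total += tf(freq) * idf(doc_freqs[term], doc_count) * boost * norm
    return total

docs = {1: ["solr", "is", "fast"], 2: ["solr", "is", "powerful"]}
doc_freqs = {"solr": 2, "is": 2, "fast": 1, "powerful": 1}
for doc_id, tokens in docs.items():
    print(doc_id, round(score(["fast"], tokens, doc_freqs, len(docs)), 4))
```

For the query "fast", only document 1 gets a non-zero score. For the query "solr", both documents score the same, because the term frequency, document frequency, and field length are identical.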

Conclusion

Understanding Solr's internals (Lucene, the inverted index, segment merging, and query execution) helps you design better schemas and optimize search performance.
