Introduction
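Understanding Solr’s internal architecture helps you optimize queries and troubleshoot issues. This article walks through the pieces behind Solr’s search: the Lucene foundation, the inverted index, the indexing pipeline, segment merging, query execution, caching, and scoring.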
Apache Lucene Foundation
Solr is built on Apache Lucene:
┌─────────────────────────────────────┐
│ Solr │
│ ┌─────────────────────────────────┐ │
│ │ Request Handler │ │
│ │ Query Parser │ │
│ │ Response Writer │ │
│ └─────────────────────────────────┘ │
│ ┌─────────────────────────────────┐ │
│ │ Lucene Library │ │
│ │ - IndexWriter │ │
│ │ - IndexReader │ │
│ │ - IndexSearcher │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────┘
Inverted Index
How It Works
Documents:
1. "Solr is fast"
2. "Solr is powerful"
Inverted Index:
┌────────────┬───────────────┐
│ Term │ Doc IDs │
├────────────┼───────────────┤
│ Solr │ 1, 2 │
│ is │ 1, 2 │
│ fast │ 1 │
│ powerful │ 2 │
└────────────┴───────────────┘
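The table above can be reproduced with a few lines of Python. This is a minimal sketch, not how Lucene stores postings on disk: the function name `build_inverted_index` is hypothetical, and it assumes whitespace tokenization plus lowercasing, which is why the keys come out as `solr` rather than `Solr`.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization and lowercasing only.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {1: "Solr is fast", 2: "Solr is powerful"}
index = build_inverted_index(docs)
print(index["solr"])  # [1, 2]
print(index["fast"])  # [1]
```

Looking up a term is then a single dictionary access, which is the property that makes term queries cheap regardless of how many documents are indexed.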
Index Structure
Index (shard)
├── segments_N (commit point: lists the live segments)
├── segment_N.fdt (stored field data)
├── segment_N.fdx (stored field index)
├── segment_N.tim (term dictionary)
├── segment_N.tip (term index)
├── segment_N.doc (postings: doc IDs and term frequencies)
├── segment_N.nvd (norms data)
├── segment_N.nvm (norms metadata)
├── segments.gen (legacy generation file, Lucene 4.x and earlier)
└── write.lock
Document Indexing
Index Pipeline
Document
│
▼
┌─────────────┐
│ Analyzer │──► Tokenize, lowercase, stem
└─────────────┘
│
▼
┌─────────────┐
│ IndexWriter│──► Build inverted index
└─────────────┘
│
▼
┌─────────────┐
│ Segment │──► Write to segment
└─────────────┘
Analysis Chain
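// Field type with analyzer (body of a Schema API "add-field-type" command)
{
  "name": "text_en",
  "class": "solr.TextField",
  "analyzer": {
    "tokenizer": {
      "class": "solr.StandardTokenizerFactory"
    },
    "filters": [
      { "class": "solr.LowerCaseFilterFactory" },
      { "class": "solr.ASCIIFoldingFilterFactory" },
      { "class": "solr.PorterStemFilterFactory" }
    ]
  }
}
// Field that uses the type (body of an "add-field" command)
{
  "name": "title",
  "type": "text_en"
}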
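To make the chain concrete, here is a rough Python sketch of a tokenizer followed by three filters applied in order. Every function name here is hypothetical, the regex tokenizer is only a stand-in for Lucene's standard tokenizer, and `toy_stem` is a deliberately naive suffix stripper, not the Porter algorithm.

```python
import re
import unicodedata


def standard_tokenize(text):
    # Stand-in tokenizer: split on anything that is not a word character.
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def ascii_fold(tokens):
    # Decompose accented characters, then drop the combining marks.
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]


def toy_stem(tokens):
    # NOT the Porter stemmer: strips one of a few suffixes if a stem of
    # at least three characters remains.
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed


def analyze(text):
    """Run the tokenizer, then each filter in the configured order."""
    tokens = standard_tokenize(text)
    for token_filter in (lowercase, ascii_fold, toy_stem):
        tokens = token_filter(tokens)
    return tokens


print(analyze("Searching Caf\u00e9s"))  # ['search', 'cafe']
```

Filter order matters: folding runs before stemming here so that the stemmer sees plain ASCII tokens. The same chain is normally applied at query time, so that a query for "search" matches a document containing "Searching".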
Segment Merging
Merge Policy
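<!-- TieredMergePolicy (default), configured in the <indexConfig> section of solrconfig.xml -->
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
</mergePolicyFactory>
# Optimize (force merge); expensive on large indexes, so use sparingly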
curl "http://localhost:8983/solr/gettingstarted/update?optimize=true"
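Conceptually, a merge reads several small immutable segments and writes one larger segment, dropping deleted documents along the way. The sketch below illustrates that idea only; `merge_segments` is a hypothetical helper, each segment is modeled as a plain dict of term to sorted doc IDs, and the segment selection logic of TieredMergePolicy is not modeled at all.

```python
import heapq


def merge_segments(segments, deleted=frozenset()):
    """Merge per-segment postings into one segment, skipping deleted docs."""
    merged = {}
    all_terms = set().union(*(seg.keys() for seg in segments))
    for term in sorted(all_terms):
        # Each posting list is already sorted, so a k-way merge keeps order.
        lists = [seg[term] for seg in segments if term in seg]
        doc_ids = [d for d in heapq.merge(*lists) if d not in deleted]
        if doc_ids:
            merged[term] = doc_ids
    return merged


seg_a = {"solr": [1, 2], "fast": [1]}
seg_b = {"solr": [3], "powerful": [3]}
print(merge_segments([seg_a, seg_b], deleted={1}))
# {'powerful': [3], 'solr': [2, 3]}
```

The term `fast` disappears from the merged segment because its only document was deleted, which is how merging reclaims the space held by deleted documents.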
Query Execution
Query Flow
Query
│
▼
┌─────────────┐
│ QueryParser │──► Parse query syntax
└─────────────┘
│
▼
┌─────────────┐
│ BooleanQuery│──► Build query plan
└─────────────┘
│
▼
┌─────────────┐
│ IndexReader │──► Execute across segments
└─────────────┘
│
▼
┌─────────────┐
│ Scorer │──► Score documents
└─────────────┘
│
▼
Results
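The "execute across segments" step can be sketched for a simple AND query: intersect the posting lists of the required terms inside each segment, then combine the per-segment hits. This is an illustration under simplifying assumptions, not Lucene's implementation: `intersect` and `search_and` are hypothetical names, segments are plain dicts, doc IDs are treated as globally unique, and scoring is left out.

```python
def intersect(a, b):
    """Intersect two sorted posting lists with a two-pointer walk."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_and(segments, terms):
    """Return doc IDs that contain every term, searching segment by segment."""
    hits = []
    for seg in segments:
        postings = [seg.get(term, []) for term in terms]
        if not postings:
            continue
        result = postings[0]
        for other in postings[1:]:
            result = intersect(result, other)
        hits.extend(result)
    return sorted(hits)


segments = [
    {"solr": [1, 2], "fast": [1]},
    {"solr": [3], "fast": [3], "powerful": [3]},
]
print(search_and(segments, ["solr", "fast"]))  # [1, 3]
```

Because each segment is searched independently, the cost of a query grows with the number of segments, which is one reason merging matters for search performance.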
Caching
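<!-- Both caches are configured in the <query> section of solrconfig.xml.
     solr.LRUCache applies to Solr 8.x and earlier; it was removed in Solr 9,
     where solr.CaffeineCache is used instead. -->
<!-- Query result cache: ordered doc ID lists for a query and sort -->
<queryResultCache class="solr.LRUCache" size="10000"/>
<!-- Filter cache: sets of doc IDs matching each filter query (fq) -->
<filterCache class="solr.LRUCache" size="10000"/>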
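The `size` attribute caps the number of entries; once the cache is full, the least recently used entry is evicted. A minimal Python sketch of that eviction behavior follows. The class here is a toy written for illustration and only shares a name with the Solr cache; the cache keys and values are made up.

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache with a fixed entry limit."""

    def __init__(self, size):
        self.size = size
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.size:
            # Evict the least recently used entry (front of the dict).
            self._entries.popitem(last=False)


cache = LRUCache(size=2)
cache.put("fq=type:book", [1, 2])
cache.put("fq=inStock:true", [2, 3])
cache.get("fq=type:book")           # touch: now most recently used
cache.put("fq=lang:en", [1, 3])     # evicts fq=inStock:true
print(cache.get("fq=inStock:true"))  # None
```

Filter queries that recur across many requests stay in the cache, while one-off filters fall out of it first.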
TF-IDF Scoring
Formula
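Score(q,d) = sum over terms t in q of: tf(t in d) * idf(t) * boost(t.field) * lengthNorm(t.field in d)
where:
- tf(t in d) = term frequency: how often t occurs in document d
- idf(t) = inverse document frequency: higher for terms that occur in fewer documents
- boost(t.field) = boost applied to the field that t is searched in
- lengthNorm = 1/sqrt(number of terms in the field)
Note: this is the classic TF-IDF model (Lucene's ClassicSimilarity). Since Solr 6, the default similarity is BM25, which builds on the same term frequency, document frequency, and field length signals.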
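As a worked example, the formula above can be evaluated for the two sample documents. This sketch makes several assumptions that the formula leaves open: `tf` is taken as the square root of the raw frequency and `idf` as `1 + ln((N + 1) / (df + 1))`, the forms used by Lucene's ClassicSimilarity; the boost defaults to 1.0; and the function name `score` and the whitespace tokenization are illustrative only.

```python
import math


def score(query_terms, doc_id, docs, boost=1.0):
    """Sum tf * idf * boost * lengthNorm over the query terms found in the doc."""
    tokens = docs[doc_id].lower().split()
    n_docs = len(docs)
    length_norm = 1.0 / math.sqrt(len(tokens))
    total = 0.0
    for term in query_terms:
        freq = tokens.count(term)
        if freq == 0:
            continue
        # Document frequency: number of docs containing the term.
        df = sum(1 for text in docs.values() if term in text.lower().split())
        tf = math.sqrt(freq)                           # assumed tf form
        idf = 1.0 + math.log((n_docs + 1) / (df + 1))  # assumed idf form
        total += tf * idf * boost * length_norm
    return total


docs = {1: "Solr is fast", 2: "Solr is powerful"}
print(score(["fast"], 1, docs))  # rare term: higher idf
print(score(["solr"], 1, docs))  # term in every doc: lower idf
```

The term `fast` occurs in one document and `solr` in both, so a match on `fast` contributes more to the score of document 1 than a match on `solr` does.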
Conclusion
Understanding Solr’s internals (Lucene, the inverted index, and query execution) helps you design better schemas and optimize search performance.