Introduction
Solr powers production search for thousands of organizations — from Fortune 500 e-commerce platforms to government intelligence systems. Its mature ecosystem of faceting, distributed indexing, real-time streaming, and pluggable relevance models makes it the go-to search platform when reliability and throughput matter.
This guide covers six major use cases with concrete schema patterns, query configurations, index design decisions, and performance trade-offs. Each section includes runnable examples and production-hardened advice.
E-commerce Search
Solr dominates e-commerce search because its core strengths — faceting, spell correction, autocomplete, and function-based boosting — directly map to the UX expectations of modern product catalogs.
Schema Pattern
{
"fields": [
{"name": "id", "type": "string", "indexed": true, "required": true},
{"name": "name", "type": "text_en", "indexed": true, "stored": true},
{"name": "description", "type": "text_en", "indexed": true, "stored": false},
{"name": "category", "type": "string", "indexed": true, "docValues": true},
{"name": "category_path", "type": "string", "indexed": true, "multiValued": true},
{"name": "brand", "type": "string", "indexed": true, "docValues": true},
{"name": "price", "type": "pfloat", "indexed": true, "docValues": true},
{"name": "in_stock", "type": "boolean", "indexed": true, "docValues": true},
{"name": "rating", "type": "pfloat", "indexed": true, "docValues": true},
{"name": "review_count", "type": "pint", "indexed": true, "docValues": true},
{"name": "tags", "type": "string", "multiValued": true, "indexed": true},
{"name": "sale_price", "type": "pfloat", "indexed": true, "docValues": true}
]
}
Key choices: Use docValues=true on every field used for faceting, sorting, or grouping. This stores data column-wise on disk, avoiding field cache memory pressure at query time. Mark description as stored=false — product listing pages rarely need the full description text.
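As a rough illustration of the docValues rule, the sketch below builds a Schema API `add-field` payload and switches on docValues only for the fields used for faceting, sorting, or grouping. The helper names (`field_def`, `add_field_payload`) and the `COLUMNAR_FIELDS` set are assumptions made for this example, not part of Solr.

```python
import json

# Fields the application facets, sorts, or groups on (taken from the schema above).
COLUMNAR_FIELDS = {"category", "brand", "price", "in_stock",
                   "rating", "review_count", "sale_price"}

def field_def(name, field_type, multi_valued=False, stored=True):
    """Build one field definition, enabling docValues where it is needed."""
    definition = {"name": name, "type": field_type,
                  "indexed": True, "stored": stored}
    if multi_valued:
        definition["multiValued"] = True
    if name in COLUMNAR_FIELDS:
        definition["docValues"] = True
    return definition

def add_field_payload(definitions):
    """Wrap definitions in the Schema API 'add-field' command."""
    return {"add-field": definitions}

payload = add_field_payload([
    field_def("name", "text_en"),
    field_def("description", "text_en", stored=False),
    field_def("price", "pfloat"),
    field_def("tags", "string", multi_valued=True),
])
print(json.dumps(payload, indent=2))
```

The printed JSON could be POSTed to the collection's /schema endpoint; keeping the docValues decision in one place makes it harder to forget on a new facet field.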
Advanced Faceting
Faceted navigation is the backbone of e-commerce drill-down. Beyond simple field faceting, Solr supports pivot, range, and interval faceting.
Pivot faceting builds hierarchical drill-downs — category then brand:
curl "http://localhost:8983/solr/products/select?q=*:*&rows=0&\
facet=true&\
facet.pivot=category,brand&\
facet.pivot.mincount=1"
The response contains nested counts, so you can render breadcrumb-style navigation.
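To make the nesting concrete, here is a minimal sketch that turns the entries found under `facet_counts.facet_pivot["category,brand"]` into a tree a frontend could walk. The sample data is hand-written for illustration, and `pivot_to_tree` is a hypothetical helper name.

```python
def pivot_to_tree(pivot_entries):
    """Convert pivot facet entries into {value: {"count": n, "children": {...}}}."""
    tree = {}
    for entry in pivot_entries:
        tree[entry["value"]] = {
            "count": entry["count"],
            # Leaf entries have no "pivot" key, so default to an empty list.
            "children": pivot_to_tree(entry.get("pivot", [])),
        }
    return tree

# Hand-written sample shaped like a category,brand pivot result
sample = [
    {"field": "category", "value": "Headphones", "count": 3, "pivot": [
        {"field": "brand", "value": "Sony", "count": 2},
        {"field": "brand", "value": "Bose", "count": 1},
    ]},
]
tree = pivot_to_tree(sample)
print(tree["Headphones"]["children"]["Sony"]["count"])
```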
Range faceting for price histograms:
curl "http://localhost:8983/solr/products/select?q=*:*&rows=0&\
facet=true&\
facet.range=price&\
facet.range.start=0&\
facet.range.end=1000&\
facet.range.gap=50"
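With the default JSON response format, range facet counts arrive as a flat list that alternates bucket start and count. The sketch below pairs them into histogram buckets; the sample list is invented for illustration and `range_buckets` is a hypothetical helper.

```python
def range_buckets(counts, gap):
    """Pair a flat [start, count, start, count, ...] list into bucket dicts."""
    buckets = []
    for i in range(0, len(counts), 2):
        start = float(counts[i])
        buckets.append({"from": start, "to": start + gap, "count": counts[i + 1]})
    return buckets

# Invented sample shaped like facet_counts.facet_ranges.price.counts
sample_counts = ["0.0", 12, "50.0", 7, "100.0", 0]
buckets = range_buckets(sample_counts, gap=50)
for bucket in buckets:
    print(bucket)
```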
Interval faceting for arbitrary bucketing:
curl -g "http://localhost:8983/solr/products/select?q=*:*&rows=0&\
fq=in_stock:true&\
facet=true&\
facet.interval=price&\
f.price.facet.interval.set=[0,50)&\
f.price.facet.interval.set=[50,100)&\
f.price.facet.interval.set=[100,200)&\
f.price.facet.interval.set=[200,*]"
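Performance tip: facet.range and facet.interval both read docValues, so neither touches the field cache. Interval faceting requires docValues on the field and returns the same counts as an equivalent set of facet.query range clauses; which one is faster depends on how well the filter cache performs for those queries, so benchmark both on your own index.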
Autocomplete and Suggest
Solr’s suggest module supports multiple lookup implementations. The analyzing infix suggester works well for product autocomplete:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">products_suggest</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">name</str>
<str name="contextField">category</str>
<str name="suggestAnalyzerFieldType">text_en</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
Query it with:
curl "http://localhost:8983/solr/products/suggest?\
suggest=true&\
suggest.dictionary=products_suggest&\
suggest.q=wire&\
suggest.count=10&\
suggest.cfq=category:Electronics"
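The suggest.cfq (context filter query) restricts suggestions to a category, which is critical for multi-department stores. It only takes effect when the suggester declares a contextField (category here) and uses DocumentDictionaryFactory with AnalyzingInfixLookupFactory or BlendedInfixLookupFactory.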
Spell Correction
No search experience survives typos. Solr provides multiple spell checker implementations:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">default</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<str name="field">name</str>
<str name="distanceMeasure">internal</str>
<str name="accuracy">0.7</str>
<str name="maxEdits">2</str>
<str name="minPrefix">1</str>
</lst>
</searchComponent>
Use spellcheck.extendedResults=true to get the index frequency of the original term and of each suggestion, so your frontend can decide whether to show “Did you mean?” or auto-correct:
curl "http://localhost:8983/solr/products/select?q=electroniks&\
spellcheck=true&\
spellcheck.extendedResults=true"
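One way a frontend might use those frequencies is sketched below. The decision rule and the `auto_ratio` threshold are assumptions chosen for illustration, not Solr behavior: auto-correct when the typed term never occurs in the index, and offer “Did you mean?” when a suggestion is far more common than the typed term.

```python
def correction_action(orig_freq, suggestions, auto_ratio=50):
    """Return (action, word) for one query term.

    orig_freq   -- index frequency of the term the user typed
    suggestions -- list of {"word": str, "freq": int}
    auto_ratio  -- hypothetical threshold; tune it against your own query logs
    """
    if not suggestions:
        return ("none", None)
    best = max(suggestions, key=lambda s: s["freq"])
    if orig_freq == 0:
        # The typed term does not exist in the index at all.
        return ("auto_correct", best["word"])
    if best["freq"] >= orig_freq * auto_ratio:
        # The typed term exists but is rare compared with the suggestion.
        return ("did_you_mean", best["word"])
    return ("none", None)

print(correction_action(0, [{"word": "electronics", "freq": 42}]))
```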
Relevance Tuning and Boosting
Raw text search rarely produces the best product ranking. Solr’s function queries let you blend text relevance with business signals:
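curl "http://localhost:8983/solr/products/select?q=wireless+headphones&\
defType=edismax&\
qf=name^4+description^1+tags^2&\
bf=product(rating,3)^2+if(in_stock,10,0)^5+recip(ms(NOW/HOUR,launch_date),3.16e-11,1,1)^3"
Breaking this down:
- qf assigns per-field boosts: name gets 4x, tags 2x, description 1x.
- bf (boost functions) adds three signals to the score: the rating multiplied by 3, a stock-availability bonus (+10 if in stock), and a recency boost using recip over the document age in milliseconds. The constant 3.16e-11 is roughly 1 / (milliseconds in a year), so a one-year-old product receives about half the recency boost of a brand-new one.
- launch_date is not part of the schema above; the query assumes a pdate field with that name. Rounding NOW to NOW/HOUR lets repeated queries hit the query result cache.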
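The recency curve is easier to judge with numbers. This sketch reimplements the recip formula, a / (m * x + b), in Python and evaluates it for a few product ages; `recency_boost` is a hypothetical helper, and the year length is an approximation.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # about 3.16e10 ms

def recip(x, m, a, b):
    """Same formula as Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)

def recency_boost(age_years):
    """Boost for a document of the given age, using m=3.16e-11, a=1, b=1."""
    return recip(age_years * MS_PER_YEAR, 3.16e-11, 1, 1)

for years in (0, 1, 5):
    print(years, round(recency_boost(years), 3))
```

Because m is close to the reciprocal of one year in milliseconds, the boost halves after roughly a year and keeps flattening after that.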
Merchandising and Personalization
Solr doesn’t have a built-in merchandising UI, but you can implement it with query elevation and sorting overrides:
# Elevate specific products for a category landing page
curl "http://localhost:8983/solr/products/select?q=*:*&\
fq=category:Headphones&\
enableElevation=true&\
forceElevation=true&\
elevateIds=PROMO-001,PROMO-002"
User-specific personalization is achievable with function queries and user profile data indexed as payloads or stored in auxiliary fields:
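# Boost by user's preferred brand. \$ub is escaped so the shell does not expand it,
# and the default of 0 leaves products from other brands unboosted.
curl "http://localhost:8983/solr/products/select?q=wireless&\
defType=edismax&\
qf=name^3+tags^1&\
bf=query(\$ub,0)&\
ub=brand:Sony"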
Enterprise Search
Enterprise document search demands content extraction from binary formats, permission-trimmed result sets, and federation across data silos.
Document Crawling with Apache Tika
Solr integrates with Apache Tika through the ExtractingRequestHandler (Solr Cell), which must be enabled for the collection before /update/extract is available. Once it is, PDFs, Word documents, and presentations are indexed as easily as HTML:
curl -X POST "http://localhost:8983/solr/docs/update/extract?\
literal.id=doc001&\
literal.author=CalmOps&\
fmap.content=body" \
-F "myfile=@/path/to/quarterly-report.pdf"
Tika auto-detects format, extracts text, metadata (author, title, page count), and language. Use fmap.content=body to map the extracted text into your schema’s body field.
Schema for Enterprise Documents
{
"fields": [
{"name": "id", "type": "string"},
{"name": "title", "type": "text_en", "indexed": true, "stored": true},
{"name": "body", "type": "text_en", "indexed": true, "stored": false},
{"name": "author", "type": "string", "indexed": true, "docValues": true},
{"name": "created_date", "type": "pdate", "indexed": true, "docValues": true},
{"name": "modified_date", "type": "pdate", "indexed": true, "docValues": true},
{"name": "doc_type", "type": "string", "indexed": true, "docValues": true},
{"name": "allowed_groups", "type": "string", "multiValued": true, "indexed": true},
{"name": "allowed_users", "type": "string", "multiValued": true, "indexed": true},
{"name": "denied_groups", "type": "string", "multiValued": true, "indexed": true},
{"name": "content_type", "type": "string", "indexed": true},
{"name": "file_size", "type": "plong", "indexed": true},
{"name": "page_count", "type": "pint", "indexed": true}
]
}
Permission-Based Access Control
Enterprise search result visibility must respect document-level permissions. The standard pattern indexes access control lists (ACLs) as multi-valued fields and filters at query time:
curl -g "http://localhost:8983/solr/docs/select?q=quarterly+report&\
fq={!tag=perms}(allowed_users:user123+OR+allowed_groups:executives)&\
fq=-denied_groups:contractors"
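The {!tag=perms} local parameter labels the filter so that a facet can exclude it with {!ex=perms}. Be deliberate about doing so: facet counts computed without the permission filter reveal how many documents exist outside the user’s visible set, so in most deployments the filter should stay applied to facets as well.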
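Performance note: ACL filters are answered from the inverted index and cached in the filterCache, so the ACL fields need indexed=true; docValues only matter if you also facet or sort on them. Build the filter string in a consistent order (for example, sorted group names) so identical permission sets produce identical cache keys. For organizations with thousands of groups, consider indexing a computed visibility_token that encodes the full access calculation at index time rather than combining many Boolean clauses at query time.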
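A minimal sketch of building those filters in application code follows. It assumes a document is hidden when any of the user's groups appears in denied_groups; `escape_term` and `acl_filters` are hypothetical helpers, and the escaping covers the characters the standard query parser treats specially.

```python
import re

# Characters with special meaning in the Lucene/Solr query syntax, plus whitespace.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')

def escape_term(value):
    """Backslash-escape query-syntax characters in a user or group name."""
    return _SPECIAL.sub(r"\\\1", value)

def acl_filters(user_id, groups):
    """Return the fq values that trim results to what this user may see."""
    ordered = sorted(groups)  # stable order gives stable filterCache keys
    allow = ["allowed_users:" + escape_term(user_id)]
    allow += ["allowed_groups:" + escape_term(g) for g in ordered]
    filters = ["{!tag=perms}(" + " OR ".join(allow) + ")"]
    if ordered:
        denied = " OR ".join(escape_term(g) for g in ordered)
        filters.append("-denied_groups:(" + denied + ")")
    return filters

for fq in acl_filters("user123", ["executives", "contractors"]):
    print(fq)
```

Each returned string would be sent as a separate fq parameter, URL-encoded by the HTTP client.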
Content Enrichment and Metadata Extraction
Use Solr’s update chain to enrich documents during ingestion:
<updateRequestProcessorChain name="enrich">
<processor class="solr.UUIDUpdateProcessorFactory">
<str name="fieldName">id</str>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
<str name="fieldName">body</str>
<str name="pattern">\b(\d{3}-\d{2}-\d{4})\b</str>
<str name="replacement">[SSN_REDACTED]</str>
</processor>
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">title</str>
<str name="dest">search_text</str>
</processor>
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">body</str>
<str name="dest">search_text</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
This chain assigns UUIDs, redacts PII (SSNs), and merges fields into a single search_text field for simple unified querying.
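Before deploying a redaction pattern, it helps to check what it does and does not match. The sketch below runs the same SSN expression through Python's re module; Java and Python regex dialects agree on this simple pattern, but that equivalence is an assumption worth re-checking for more complex expressions.

```python
import re

SSN_PATTERN = re.compile(r"\b(\d{3}-\d{2}-\d{4})\b")

def redact_ssns(text):
    """Replace anything shaped like NNN-NN-NNNN with a placeholder."""
    return SSN_PATTERN.sub("[SSN_REDACTED]", text)

print(redact_ssns("Employee 123-45-6789 filed the form on 2024-01-15."))
print(redact_ssns("Order 1234-56-7890 shipped."))
```

The word boundaries keep the pattern from firing inside longer digit runs such as order numbers, and ISO dates do not match because their digit groups have different lengths.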
Federated Search with Shard Configuration
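When enterprise data lives in separate Solr clusters (HR docs, engineering wikis, legal contracts), federate across them with the shards parameter, which lists every core or collection endpoint to query, including the local one. The schemas must share the same uniqueKey field, and recent Solr versions require remote hosts to be on the configured allow-list. Within a single SolrCloud cluster, the collection parameter (collection=hr,wiki,legal) achieves the same thing.
curl "http://localhost:8983/solr/hr/select?q=compliance&\
shards=localhost:8983/solr/hr,solr-wiki:8983/solr/wiki,solr-legal:8983/solr/legal&\
shards.info=true"
Add shards.info=true to see per-shard timing, which is essential for debugging slow federation targets.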
Site Search
Site search has different demands from e-commerce: sitemap-driven indexing, search analytics to identify content gaps, and A/B testing to measure search-driven conversions.
Sitemap-Based Incremental Indexing
For content-heavy sites, tie Solr indexing to sitemap changes rather than full reindexes:
#!/bin/bash
# Incremental site reindex from sitemap (grep -P requires GNU grep)
SITE_URL="https://example.com"
for url in $(curl -s "${SITE_URL}/sitemap.xml" | grep -oP '<loc>\K[^<]+'); do
curl -s "${url}" | python3 -c "
import sys, json
content = sys.stdin.read()
# Extract title, meta, body from HTML (simplified)
doc = {
'id': '${url}',
'title': '...',
'body': '...',
'last_modified': 'NOW',
'sitemap_priority': '0.8'
}
# /update expects a JSON array of documents
print(json.dumps([doc]))
" | curl -X POST "http://localhost:8983/solr/site/update?commitWithin=10000" \
-H "Content-Type: application/json" -d @-
done
Better approach: Use a sitemap parser that tracks the lastmod field from the sitemap and skips URLs that haven’t changed.
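A minimal sketch of that lastmod check, using only the standard library, might look like the following. The `changed_urls` helper and the sample sitemap are illustrative assumptions; entries without a lastmod are treated as changed so they are never skipped silently.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def changed_urls(sitemap_xml, last_run):
    """Return URLs whose <lastmod> is newer than last_run (or missing)."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            urls.append(loc)  # no date available: reindex to be safe
            continue
        modified = datetime.fromisoformat(lastmod.strip().replace("Z", "+00:00"))
        if modified.tzinfo is None:
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > last_run:
            urls.append(loc)
    return urls

SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-01-01T00:00:00Z</lastmod></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

last_run = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(changed_urls(SAMPLE, last_run))
```

Persisting the timestamp of the last successful run (for example in a small state file) is what turns this into an incremental crawl.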
Search Analytics
Solr’s Metrics API (/admin/metrics) and the QTime value in each query response expose performance data, but for usage analytics (what users searched for, zero-result queries), implement your own tracking, for example in a separate analytics collection with a schema like this:
{
"fields": [
{"name": "id", "type": "string"},
{"name": "query_string", "type": "string", "indexed": true, "docValues": true},
{"name": "num_results", "type": "pint", "indexed": true},
{"name": "clicked_rank", "type": "pint", "indexed": true},
{"name": "session_id", "type": "string", "indexed": true},
{"name": "timestamp", "type": "pdate", "indexed": true, "docValues": true}
]
}
Analyze zero-result queries to identify content gaps:
curl "http://localhost:8983/solr/analytics/select?q=num_results:0&\
rows=0&\
facet=true&\
facet.field=query_string&\
facet.limit=20&\
facet.sort=count&\
facet.mincount=1"
A/B Testing Search Results
Run search experiments by assigning users to control or treatment groups and sending each group's queries with a different parameter set (or through a different request handler that bakes those parameters in):
# Control
curl "http://localhost:8983/solr/products/select?q=wireless&defType=edismax&qf=name^4+description^1"
# Treatment: boosted by images/reviews (assumes an image_url field in the schema)
curl "http://localhost:8983/solr/products/select?q=wireless&defType=edismax&\
qf=name^4+description^1&\
bf=if(exists(image_url),5,0)^2+mul(review_count,0.1)^1"
Track per-group click-through rates in your analytics schema and confirm that the treatment's improvement is statistically significant before rolling out.
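Group assignment should be stable, so a user sees the same ranking on every visit. One common approach, sketched here under the assumption that a user or session ID is available, is to hash the ID together with an experiment name; `assign_group` and the experiment label are hypothetical.

```python
import hashlib

def assign_group(user_id, experiment="search-boost-v1", treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

for uid in ("user-1", "user-2", "user-3"):
    print(uid, assign_group(uid))
```

Including the experiment name in the hash means a new experiment reshuffles users instead of reusing the previous split.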
Synonyms and Stop Words
Manage synonyms via Solr’s synonym filter in your schema’s analysis chain:
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory"
synonyms="synonyms.txt"
expand="true"
ignoreCase="true"/>
<filter class="solr.StopFilterFactory"
words="stopwords.txt"
ignoreCase="true"/>
</analyzer>
In synonyms.txt:
laptop, notebook, ultrabook
cellphone, mobile, smartphone
tv, television, telly
Use SynonymGraphFilterFactory (not the deprecated SynonymFilterFactory) to ensure multi-word synonyms produce correct phrase query behavior.
Log and Time-Series Analytics
Solr’s streaming expressions and time-based rollups make it a viable alternative to dedicated analytics stores for moderate-scale log aggregation.
Time-Range Index Rollups
For operational logs with high ingest volume (millions/day), roll up old data into hourly summaries rather than retaining raw documents:
# Streaming expression that returns hourly ERROR counts for the last 7 days
curl "http://localhost:8983/solr/logs/stream" \
--data-urlencode 'expr=timeseries(logs,
q="severity:ERROR",
field="timestamp",
start="NOW-7DAYS",
end="NOW",
gap="+1HOUR",
count(*)
)'
This returns one tuple per hour with the error count, computed inside Solr rather than by streaming every raw log line to the client. Run the same expression with q="severity:WARN" for warnings. The --data-urlencode flag matters here: with plain -d, the + in +1HOUR would be decoded as a space.
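When the request is sent from application code instead of curl, the expression has to be form-encoded, and a date-math gap such as +1HOUR is easy to corrupt because a literal + decodes to a space. The sketch below uses the standard library to build the body; the expression string is an illustrative assumption and `stream_request_body` is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs

def stream_request_body(expr):
    """Form-encode a streaming expression for a POST to the /stream handler."""
    return urlencode({"expr": expr})

expr = ('timeseries(logs, q="severity:ERROR", field="timestamp", '
        'start="NOW-7DAYS", end="NOW", gap="+1HOUR", count(*))')
body = stream_request_body(expr)
print(body)

# Decoding the body recovers the original expression, plus sign included.
print(parse_qs(body)["expr"][0] == expr)
```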
Streaming Expressions for Real-Time Aggregation
Streaming expressions let you express analytics queries without MapReduce overhead:
expr=having(
facet(logs,
q="*:*",
buckets="endpoint",
bucketSorts="avg(response_time) desc",
bucketSizeLimit=10,
avg(response_time)),
gt(avg(response_time), 200)
)
This returns up to 10 of the slowest endpoints by average response time, keeping only those that average above 200 (in whatever unit response_time is stored). Streaming expressions execute server-side and can be embedded in dashboards.
SQL Aggregation via Parallel SQL
Solr exposes parallel SQL for JDBC-compatible analytics. The dialect is narrower than a full RDBMS: aggregations are limited to functions such as COUNT, SUM, AVG, MIN, and MAX grouped by indexed fields, so date-bucketing functions like TO_CHAR and percentile aggregates like PERCENTILE_DISC should not be assumed to work. Hourly buckets are better expressed with the timeseries streaming expression; SQL suits per-field aggregations such as:
SELECT endpoint, COUNT(*), AVG(response_time)
FROM logs
WHERE timestamp >= '2026-04-01T00:00:00Z'
GROUP BY endpoint
ORDER BY AVG(response_time) DESC
LIMIT 10
Run via the SQL handler:
curl --data-urlencode "stmt=SELECT endpoint, COUNT(*), AVG(response_time) FROM logs WHERE timestamp >= '2026-04-01T00:00:00Z' GROUP BY endpoint ORDER BY AVG(response_time) DESC LIMIT 10" \
"http://localhost:8983/solr/logs/sql?aggregationMode=facet"
For production, use the Solr JDBC driver instead of URL-encoded curl calls.
Geospatial Search
Solr’s spatial capabilities handle lat/lon queries, bounding box filters, and distance-based sorting.
Geospatial Schema
{
"fields": [
{"name": "store_id", "type": "string"},
{"name": "store_name", "type": "text_en"},
{"name": "location", "type": "location_rpt", "indexed": true, "multiValued": false},
{"name": "address", "type": "string"},
{"name": "hours", "type": "string"},
{"name": "rating", "type": "pfloat"}
]
}
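The location_rpt field type (SpatialRecursivePrefixTreeFieldType) can index shapes such as polygons in addition to points. The location type (LatLonPointSpatialField) handles point data only and is usually the better choice for plain lat/lon filtering and distance sorting.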
Bounding Box and Radius Queries
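# Find stores within 10 km of a point (-G with --data-urlencode handles the spaces and braces)
curl -G "http://localhost:8983/solr/stores/select" \
--data-urlencode 'q=*:*' \
--data-urlencode 'fq={!geofilt pt=37.7749,-122.4194 sfield=location d=10}'
# Bounding box filter
curl -G "http://localhost:8983/solr/stores/select" \
--data-urlencode 'q=*:*' \
--data-urlencode 'fq={!bbox pt=37.7749,-122.4194 sfield=location d=10}'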
geofilt does exact circle filtering (more accurate). bbox does rectangular filtering (faster, useful for map tile rendering).
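The difference shows up near the corners of the box. The sketch below is an approximation written for illustration, not Solr's implementation: it computes a haversine distance and a simple enclosing box, then checks a point that falls inside the box but outside the 10 km circle.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def in_bounding_box(lat, lon, center_lat, center_lon, d_km):
    """Approximate test against the box that encloses a circle of radius d_km."""
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(
        d_km / (EARTH_RADIUS_KM * math.cos(math.radians(center_lat))))
    return abs(lat - center_lat) <= dlat and abs(lon - center_lon) <= dlon

center = (37.7749, -122.4194)
# A point toward the north-east corner of the 10 km box.
corner = (center[0] + 0.08, center[1] + 0.10)

print("inside box:   ", in_bounding_box(*corner, *center, 10))
print("inside circle:", haversine_km(*center, *corner) <= 10)
```

A store at that corner would be returned by the bbox filter but not by geofilt, which is why bbox is often paired with a distance sort or a second, exact filter.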
Distance Sorting and Score Boosting
Sort results by distance for “nearest store” queries:
curl -G "http://localhost:8983/solr/stores/select" \
--data-urlencode 'q=*:*' \
--data-urlencode 'fq={!geofilt sfield=location pt=37.7749,-122.4194 d=25}' \
--data-urlencode 'sort=geodist(location,37.7749,-122.4194) asc'
Boost by distance in relevance scoring:
curl "http://localhost:8983/solr/stores/select?q=coffee&\
defType=edismax&\
qf=store_name^2&\
bf=recip(geodist(location,37.7749,-122.4194),1,1000,100)"
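With these arguments recip computes 1000 / (distance + 100), with distance in kilometers: the boost is 10 at the query point, 5 at 100 km, and decays smoothly toward 0 as distance grows.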
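A quick way to sanity-check the arguments before shipping them is to evaluate the curve offline. This sketch applies the recip formula, a / (m * x + b), to a few distances; `distance_boost` is a hypothetical helper that mirrors the arguments used in the query above.

```python
def distance_boost(distance_km, m=1, a=1000, b=100):
    """Solr's recip(x, m, a, b) = a / (m * x + b), applied to a distance in km."""
    return a / (m * distance_km + b)

for km in (0, 25, 100, 900):
    print(f"{km:>4} km -> boost {distance_boost(km):.2f}")
```

Raising b flattens the curve (nearby and distant stores score more alike), while raising a scales every boost up relative to the text relevance score.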
Publishing and Media Search
Media publishers use Solr for metadata search, tag-based navigation, and content recommendations across catalogs of millions of articles, videos, and audio files.
Schema for Media Assets
{
"fields": [
{"name": "id", "type": "string"},
{"name": "title", "type": "text_en"},
{"name": "caption", "type": "text_en"},
{"name": "tags", "type": "string", "multiValued": true, "indexed": true, "docValues": true},
{"name": "media_type", "type": "string", "indexed": true, "docValues": true},
{"name": "duration_secs", "type": "pint", "indexed": true},
{"name": "resolution", "type": "string", "indexed": true},
{"name": "license", "type": "string", "indexed": true},
{"name": "upload_date", "type": "pdate", "indexed": true, "docValues": true},
{"name": "view_count", "type": "plong", "indexed": true, "docValues": true},
{"name": "featured", "type": "boolean", "indexed": true, "docValues": true}
]
}
Tag-Based Navigation with Boosted Filtering
# Browse by tag with promoted content on top
curl "http://localhost:8983/solr/media/select?q=*:*&\
fq=tags:photography&\
sort=featured+desc,view_count+desc&\
facet=true&\
facet.field=tags&\
facet.limit=50"
Content Recommendations with MoreLikeThis
Solr’s MoreLikeThis (MLT) support generates content recommendations without a separate ML pipeline. The example below assumes a MoreLikeThisHandler registered at /mlt in solrconfig.xml:
curl "http://localhost:8983/solr/media/mlt?q=id:article-042&\
mlt.fl=title,caption,tags&\
mlt.mintf=2&\
mlt.mindf=1&\
mlt.minwl=3&\
mlt.maxwl=20&\
mlt.maxqt=25&\
rows=10&\
mlt.boost=true"
Parameters explained:
- mlt.mintf=2: a term must appear at least twice in the source document
- mlt.mindf=1: a term must appear in at least 1 document in the index
- mlt.minwl=3, mlt.maxwl=20: min/max word length for extracted terms
- mlt.maxqt=25: at most 25 terms are used in the generated query
- mlt.boost=true: boost each extracted query term by its relevance
For video platforms, use MLT on tags and captions to surface “watch next” recommendations.
Best Practices
Query Time vs. Index Time Decisions
| Decision | Index Time | Query Time | When To Choose |
|---|---|---|---|
| Field boosting | Copy to boosted field at index | qf parameter | Query time: iterating is easier |
| ACL computation | Pre-compute visibility token | fq with user/group | Index time: faster query (token is single term) |
| Synonym expansion | Expand at index (store all forms) | SynonymGraphFilter at query | Query time: smaller index, flexible |
| Text normalization | Lowercasing, stemming at index | Same analyzers applied to query | Both: index and query analysis must match for terms to line up |
Cache strategies:
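- queryResultCache: Enable with size=500 for common product listing pages. Monitor hit ratios through the Metrics API or the admin UI's Plugins / Stats page.
- filterCache: Use class=solr.CaffeineCache with size=1000 for faceted navigation filters (solr.FastLRUCache was deprecated in Solr 8.3 and removed in 9.0). High hit ratio is critical for drill-down UX.
- documentCache: Keep small (size=100-200). It caches the stored fields of recently returned documents, so it helps most when the same documents are fetched repeatedly.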
Replication factor choices in SolrCloud:
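- 2 replicas: Minimum for standard HA. One node failure still leaves a live replica for every shard.
- 3 replicas: Tolerates two concurrent replica failures per shard. Every NRT replica indexes each document itself, so extra replicas add query capacity rather than write capacity.
- numShards vs. replicationFactor: Prefer fewer shards with more replicas when the index fits comfortably on a node. For query-heavy workloads, a 2-shard, 3-replica cluster often outperforms a 6-shard, 1-replica cluster because every query fans out to all shards and the results must be merged.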
Index Design Rules
- Always use docValues=true on sort/facet/group fields. Never rely on field cache.
- Set stored=false on fields your application never reads back. This can shrink the index substantially when large text fields are involved.
- Use multiValued sparingly: multi-valued fields can’t use certain query-time features directly (e.g., sorting requires choosing a min or max value).
- Prefer text_en over text_general for English content. The stemming chain improves recall significantly.
- On Solr 8 and earlier, configure maxShardsPerNode to prevent uneven distribution; maxShardsPerNode=4 is a reasonable value for most workloads. The parameter was removed in Solr 9, where placement plugins handle replica distribution.
Resources
- Apache Solr Reference Guide
- SolrCloud Documentation
- Streaming Expressions Reference
- Apache Tika Integration
- Spatial Search in Solr
- MoreLikeThis Component
- Solr Performance Optimizations