Solr in Production: E-commerce, Enterprise Search

Introduction

Solr powers production search for thousands of organizations — from Fortune 500 e-commerce platforms to government intelligence systems. Its mature ecosystem of faceting, distributed indexing, real-time streaming, and pluggable relevance models makes it the go-to search platform when reliability and throughput matter.

This guide covers six major use cases with concrete schema patterns, query configurations, index design decisions, and performance trade-offs. Each section includes runnable examples and production-hardened advice.

E-commerce Search

Solr is widely used for e-commerce search because its core strengths (faceting, spell correction, autocomplete, and function-based boosting) map directly to the UX expectations of modern product catalogs.

Schema Pattern

{
  "fields": [
    {"name": "id", "type": "string", "indexed": true, "required": true},
    {"name": "name", "type": "text_en", "indexed": true, "stored": true},
    {"name": "description", "type": "text_en", "indexed": true, "stored": false},
    {"name": "category", "type": "string", "indexed": true, "docValues": true},
    {"name": "category_path", "type": "string", "indexed": true, "multiValued": true},
    {"name": "brand", "type": "string", "indexed": true, "docValues": true},
    {"name": "price", "type": "pfloat", "indexed": true, "docValues": true},
    {"name": "in_stock", "type": "boolean", "indexed": true, "docValues": true},
    {"name": "rating", "type": "pfloat", "indexed": true, "docValues": true},
    {"name": "review_count", "type": "pint", "indexed": true, "docValues": true},
    {"name": "tags", "type": "string", "multiValued": true, "indexed": true},
    {"name": "sale_price", "type": "pfloat", "indexed": true, "docValues": true}
  ]
}

Key choices: Use docValues=true on every field used for faceting, sorting, or grouping. This stores data column-wise on disk, avoiding field cache memory pressure at query time. Mark description as stored=false — product listing pages rarely need the full description text.
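
A quick way to apply the docValues rule is to check a field list before posting it to Solr. The sketch below is illustrative only: missing_docvalues is a hypothetical helper, and it checks only the explicit docValues attribute on each field (a field type may already enable docValues by default, which this does not see).

```python
# Hypothetical helper: flag fields that are faceted/sorted on but do not
# explicitly declare docValues=true in the field definition.
schema_fields = [
    {"name": "category", "type": "string", "indexed": True, "docValues": True},
    {"name": "brand", "type": "string", "indexed": True, "docValues": True},
    {"name": "tags", "type": "string", "multiValued": True, "indexed": True},
    {"name": "price", "type": "pfloat", "indexed": True, "docValues": True},
]


def missing_docvalues(fields, used_for_facet_or_sort):
    """Return names of facet/sort fields that lack an explicit docValues=true."""
    by_name = {f["name"]: f for f in fields}
    return [
        name
        for name in used_for_facet_or_sort
        if name in by_name and not by_name[name].get("docValues", False)
    ]


if __name__ == "__main__":
    print(missing_docvalues(schema_fields, ["category", "brand", "tags", "price"]))
```

Here tags is reported, which matches the schema above: it is a natural facet field but does not declare docValues.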

Advanced Faceting

Faceted navigation is the backbone of e-commerce drill-down. Beyond simple field faceting, Solr supports pivot, range, and interval faceting.

Pivot faceting builds hierarchical drill-downs — category then brand:

curl "http://localhost:8983/solr/products/select?q=*:*&rows=0&\
facet=true&\
facet.pivot=category,brand&\
facet.pivot.mincount=1"

Response contains nested counts so you can render breadcrumb-style navigation.

Range faceting for price histograms:

curl "http://localhost:8983/solr/products/select?q=*:*&rows=0&\
facet=true&\
facet.range=price&\
facet.range.start=0&\
facet.range.end=1000&\
facet.range.gap=50"

Interval faceting for arbitrary bucketing:

# -g stops curl from interpreting [ ] as URL globbing
curl -g "http://localhost:8983/solr/products/select?q=*:*&rows=0&\
fq=in_stock:true&\
facet=true&\
facet.interval=price&\
f.price.facet.interval.set=[0,50)&\
f.price.facet.interval.set=[50,100)&\
f.price.facet.interval.set=[100,200)&\
f.price.facet.interval.set=[200,*]"

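Performance tip: interval faceting uses docValues when the field has them, so it avoids field cache memory pressure. It tends to do well when many intervals are requested on the same field, while the equivalent facet.query buckets tend to do better where the filter cache is effective (for example, a mostly static index). Range faceting runs one range query per bucket by default (facet.range.method=filter); facet.range.method=dv iterates docValues instead. Test the alternatives on your own data.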
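
On the client side, range facet counts usually need reshaping before they can drive a price histogram. The sketch below assumes the default JSON layout, where counts arrive as a flat list alternating bucket start and count; the sample payload is made up for illustration and range_buckets is a hypothetical helper.

```python
def range_buckets(facet_range, gap):
    """Turn a flat [start, count, start, count, ...] list into (low, high, count) tuples."""
    counts = facet_range["counts"]
    pairs = zip(counts[0::2], counts[1::2])
    return [(float(start), float(start) + gap, n) for start, n in pairs]


# Made-up sample shaped like the facet_ranges.price section of a response.
sample = {"counts": ["0.0", 12, "50.0", 7, "100.0", 3], "gap": 50.0}

if __name__ == "__main__":
    for low, high, n in range_buckets(sample, sample["gap"]):
        print(f"{low:.0f}-{high:.0f}: {n}")
```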

Autocomplete and Suggest

Solr’s suggest module supports multiple lookup implementations. The analyzing infix suggester works well for product autocomplete:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">products_suggest</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="contextField">category</str>
    <str name="suggestAnalyzerFieldType">text_en</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

Query it with:

curl "http://localhost:8983/solr/products/suggest?\
suggest=true&\
suggest.dictionary=products_suggest&\
suggest.q=wire&\
suggest.count=10&\
suggest.cfq=Electronics"

The suggest.cfq (context filter query) is matched against the suggester's contextField, so here it restricts suggestions to the Electronics category, which is critical for multi-department stores. Note that buildOnCommit=true rebuilds the suggester on every commit; on large or frequently updated indexes, build it on a schedule instead.

Spell Correction

No search experience survives typos. Solr provides multiple spell checker implementations:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">name</str>
    <str name="distanceMeasure">internal</str>
    <str name="accuracy">0.7</str>
    <str name="maxEdits">2</str>
    <str name="minPrefix">1</str>
  </lst>
</searchComponent>

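Use spellcheck.extendedResults=true to get term frequencies for the original term and for each suggestion, so your frontend can decide whether to show “Did you mean?” or auto-correct. The request below assumes the spellcheck component is attached to the /select handler (for example via last-components):

curl "http://localhost:8983/solr/products/select?q=electroniks&\
spellcheck=true&\
spellcheck.extendedResults=true"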
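
The frontend decision can be as simple as comparing the two frequencies. The following is a minimal sketch, not a Solr API: correction_mode and the min_freq threshold are hypothetical, and the inputs are assumed to be the original term's index frequency and the top suggestion's frequency taken from the extended spellcheck results.

```python
def correction_mode(orig_freq, suggestion_freq, min_freq=10):
    """Pick a UX treatment from term frequencies (threshold is an assumption).

    - auto_correct: the typed term never occurs and the suggestion is common
    - did_you_mean: the suggestion is more common than what was typed
    - none: leave the query alone
    """
    if orig_freq == 0 and suggestion_freq >= min_freq:
        return "auto_correct"
    if suggestion_freq > orig_freq:
        return "did_you_mean"
    return "none"


if __name__ == "__main__":
    print(correction_mode(0, 42))   # unknown term, common suggestion
    print(correction_mode(3, 42))   # rare term, more common suggestion
    print(correction_mode(50, 12))  # typed term is already the common one
```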

Relevance Tuning and Boosting

Raw text search rarely produces the best product ranking. Solr’s function queries let you blend text relevance with business signals:

curl "http://localhost:8983/solr/products/select?q=wireless+headphones&\
defType=edismax&\
qf=name^4+description^1+tags^2&\
bf=product(rating,3)^2+if(in_stock,10,0)^5+recip(ms(NOW,launch_date),3.16e-11,1,1)^3"

Breaking this down:

qf assigns per-field boosts: name gets 4x, tags 2x, description 1x
bf (boost functions) adds: the rating multiplied by 3, a stock-availability bonus (10 if in stock, otherwise 0), and a recency boost using recip over the document age in milliseconds (3.16e-11 is approximately 1 divided by the number of milliseconds in a year, so the recency term halves after one year)
launch_date is not part of the schema above; add it as a pdate field before using the recency term
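
To see how the recency term behaves, it helps to evaluate recip by hand. Solr defines recip(x,m,a,b) as a / (m*x + b); the sketch below reproduces that arithmetic in Python for a few document ages. The helper names are illustrative, and a year is taken as 365.25 days.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # about 3.156e10 milliseconds


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m * x + b)."""
    return a / (m * x + b)


def recency_boost(age_years):
    """Value of recip(ms(NOW,launch_date), 3.16e-11, 1, 1) for a given age."""
    return recip(age_years * MS_PER_YEAR, 3.16e-11, 1, 1)


if __name__ == "__main__":
    for years in (0, 0.5, 1, 3):
        print(f"{years} years old -> {recency_boost(years):.3f}")
```

A brand-new product gets 1.0, a one-year-old product roughly 0.5, and a three-year-old product roughly 0.25, before the ^3 weight is applied.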

Merchandising and Personalization

Solr doesn’t have a built-in merchandising UI, but you can implement merchandising with the Query Elevation Component (which must be registered on the request handler) and sorting overrides:

# Elevate specific products for a category landing page
curl "http://localhost:8983/solr/products/select?q=*:*&\
fq=category:Headphones&\
enableElevation=true&\
forceElevation=true&\
elevateIds=PROMO-001,PROMO-002"

User-specific personalization is achievable with function queries and user profile data indexed as payloads or stored in auxiliary fields:

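# Boost by user's preferred brand: adds 5 when the document matches $ub, otherwise 0
# (\$ keeps the shell from expanding $ub inside the double-quoted URL)
curl "http://localhost:8983/solr/products/select?q=wireless&\
defType=edismax&\
qf=name^3+tags^1&\
bf=if(exists(query(\$ub)),5,0)&\
ub=brand:Sony"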

Enterprise Search

Enterprise document search demands content extraction from binary formats, permission-trimmed result sets, and federation across data silos.

Document Crawling with Apache Tika

Solr integrates with Apache Tika through the ExtractingRequestHandler (Solr Cell), which ships as an optional module that has to be enabled. With it, PDFs, Word documents, and presentations are indexed as easily as HTML:

curl -X POST "http://localhost:8983/solr/docs/update/extract?\
literal.id=doc001&\
literal.author=CalmOps&\
fmap.content=body" \
  -F "myfile=@/path/to/quarterly-report.pdf"

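Tika auto-detects the format and extracts both text and metadata (author, title, page count). Use fmap.content=body to map the extracted text into your schema’s body field. Language detection is a separate step, handled by Solr's language identification update processors.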

Schema for Enterprise Documents

{
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "title", "type": "text_en", "indexed": true, "stored": true},
    {"name": "body", "type": "text_en", "indexed": true, "stored": false},
    {"name": "author", "type": "string", "indexed": true, "docValues": true},
    {"name": "created_date", "type": "pdate", "indexed": true, "docValues": true},
    {"name": "modified_date", "type": "pdate", "indexed": true, "docValues": true},
    {"name": "doc_type", "type": "string", "indexed": true, "docValues": true},
    {"name": "allowed_groups", "type": "string", "multiValued": true, "indexed": true},
    {"name": "allowed_users", "type": "string", "multiValued": true, "indexed": true},
    {"name": "denied_groups", "type": "string", "multiValued": true, "indexed": true},
    {"name": "content_type", "type": "string", "indexed": true},
    {"name": "file_size", "type": "plong", "indexed": true},
    {"name": "page_count", "type": "pint", "indexed": true}
  ]
}

Schema fields do not accept a boost attribute. Weight title at query time instead, for example qf=title^3 body with edismax.

Permission-Based Access Control

Enterprise search result visibility must respect document-level permissions. The standard pattern indexes access control lists (ACLs) as multi-valued fields and filters at query time:

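# Single quotes keep the shell away from {!...}; --data-urlencode handles spaces and braces
curl -G "http://localhost:8983/solr/docs/select" \
  --data-urlencode 'q=quarterly report' \
  --data-urlencode 'fq={!tag=perms}(allowed_users:user123 OR allowed_groups:executives)' \
  --data-urlencode 'fq=-denied_groups:contractors'

Filter query tagging ({!tag=perms}) lets a facet exclude this filter with {!ex=perms}. Be careful doing that with permission filters: facet counts computed without the ACL filter reveal the existence of documents the user cannot open. Most deployments keep ACL filters applied to facets and reserve tag/exclude for multi-select navigation filters.

Performance note: ACL fields only need indexed=true for filtering; docValues matters only if you also facet or sort on them. Normalize principal names consistently (case, domain prefix) at index and query time. For organizations with thousands of groups, consider indexing a computed visibility_token that encodes the full access calculation at index time rather than combining multiple Boolean queries at query time.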
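
In practice the ACL filters are generated by the application from the authenticated user's identity, never typed by hand. The sketch below shows one way to build them; acl_filters and quote are hypothetical helpers, the field names come from the schema above, and the deny rule (hide a document if it denies any of the user's groups) is an assumption about the intended semantics.

```python
def quote(value):
    """Quote a value so the query parser treats it as a single literal term."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def acl_filters(user, groups):
    """Return the fq strings that trim results to what this user may see."""
    allow = [f"allowed_users:{quote(user)}"]
    allow += [f"allowed_groups:{quote(g)}" for g in groups]
    filters = ["(" + " OR ".join(allow) + ")"]
    if groups:
        denied = " OR ".join(quote(g) for g in groups)
        filters.append(f"-denied_groups:({denied})")
    return filters


if __name__ == "__main__":
    for fq in acl_filters("user123", ["executives", "contractors"]):
        print(fq)
```

Each returned string is sent as its own fq parameter, so Solr can cache the allow and deny filters independently.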

Content Enrichment and Metadata Extraction

Use Solr’s update chain to enrich documents during ingestion:

<updateRequestProcessorChain name="enrich">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">body</str>
    <str name="pattern">\b(\d{3}-\d{2}-\d{4})\b</str>
    <str name="replacement">[SSN_REDACTED]</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">title</str>
    <str name="dest">search_text</str>
  </processor>
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">body</str>
    <str name="dest">search_text</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

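This chain assigns a UUID when a document arrives without an id, redacts PII (SSNs), and copies title and body into a single search_text field for simple unified querying. Because two values are cloned into it, search_text must be declared as a multiValued field in the schema (it is not in the field list above).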
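
It is worth testing a redaction pattern offline before putting it in an update chain. The sketch below applies the same regular expression with Python's re module, which treats this particular pattern the same way Java's regex engine does; the sample strings are made up.

```python
import re

# Same pattern as in the update chain: 3 digits, 2 digits, 4 digits on word boundaries.
SSN_PATTERN = re.compile(r"\b(\d{3}-\d{2}-\d{4})\b")


def redact(text):
    """Replace anything that looks like an SSN with a fixed marker."""
    return SSN_PATTERN.sub("[SSN_REDACTED]", text)


if __name__ == "__main__":
    print(redact("Employee SSN 123-45-6789 on file"))
    print(redact("Order 1234-56-7890 shipped"))  # not SSN-shaped, left unchanged
```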

Federated Search with Shard Configuration

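When enterprise data lives in separate Solr collections or clusters (HR docs, engineering wikis, legal contracts), query across them with the shards parameter. Within a single SolrCloud cluster, the collection parameter or a collection alias does the same job. The schemas must be compatible (same uniqueKey field and matching field names), and recent Solr versions only contact shard URLs that are explicitly allowed in the node configuration:

# Only the cores listed in shards are searched, so hr is listed explicitly
curl "http://localhost:8983/solr/hr/select?q=compliance&\
shards=localhost:8983/solr/hr,solr-wiki:8983/solr/wiki,solr-legal:8983/solr/legal&\
shards.info=true"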

Add shards.info=true to see per-shard timing — essential for debugging slow federation targets.

Site Search

Site search has different demands from e-commerce: sitemap-driven indexing, search analytics to identify content gaps, and A/B testing to measure search-driven conversions.

Sitemap-Based Incremental Indexing

For content-heavy sites, tie Solr indexing to sitemap changes rather than full reindexes:

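#!/bin/bash
# Reindex every URL listed in the sitemap (simplified: no change detection)
SITE_URL="https://example.com"
SOLR_URL="http://localhost:8983/solr/site"
# grep -P requires GNU grep
for url in $(curl -s "${SITE_URL}/sitemap.xml" | grep -oP '<loc>\K[^<]+'); do
  # Pass the URL through the environment so it is never spliced into Python source
  curl -s "${url}" | PAGE_URL="${url}" python3 -c '
import html, json, os, re, sys
content = sys.stdin.read()
# Simplified extraction: <title> text, and the page with tags stripped as body
m = re.search(r"<title[^>]*>(.*?)</title>", content, re.I | re.S)
doc = {
  "id": os.environ["PAGE_URL"],
  "title": html.unescape(m.group(1).strip()) if m else "",
  "body": html.unescape(re.sub(r"<[^>]+>", " ", content)),
  "last_modified": "NOW",
  "sitemap_priority": "0.8",  # placeholder; read <priority> from the sitemap in real use
}
print(json.dumps(doc))
' | curl -s -X POST "${SOLR_URL}/update/json/docs?commitWithin=10000" \
    -H "Content-Type: application/json" --data-binary @-
done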

Better approach: Use a sitemap parser that tracks the lastmod field from the sitemap and skips URLs that haven’t changed.
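
The lastmod comparison could look like the sketch below. It assumes a standard sitemap using the sitemaps.org 0.9 namespace and a dictionary of previously indexed lastmod values that you persist yourself; changed_urls and the sample data are hypothetical. Entries without a lastmod are always treated as changed.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def changed_urls(sitemap_xml, seen):
    """Yield (url, lastmod) for entries whose lastmod differs from the stored value."""
    root = ET.fromstring(sitemap_xml)
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if loc and (not lastmod or seen.get(loc) != lastmod):
            yield loc, lastmod


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2026-04-01</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2026-04-02</lastmod></url>
</urlset>"""

if __name__ == "__main__":
    seen = {"https://example.com/a": "2026-04-01", "https://example.com/b": "2026-03-15"}
    print(list(changed_urls(SAMPLE, seen)))
```

Only the URLs returned here need to be fetched and posted to Solr; after a successful update, store the new lastmod value for each.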

Search Analytics

Solr’s Metrics API (/admin/metrics) and the QTime reported in each response expose performance data, but for usage analytics (what users searched for, zero-result queries), implement your own tracking:

// Analytics schema
{
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "query_string", "type": "string", "indexed": true, "docValues": true},
    {"name": "num_results", "type": "pint", "indexed": true},
    {"name": "clicked_rank", "type": "pint", "indexed": true},
    {"name": "session_id", "type": "string", "indexed": true},
    {"name": "timestamp", "type": "pdate", "indexed": true, "docValues": true}
  ]
}

Analyze zero-result queries to identify content gaps:

curl "http://localhost:8983/solr/analytics/select?q=num_results:0&\
facet=true&\
facet.field=query_string&\
facet.limit=20&\
facet.sort=count&\
facet.mincount=1&\
rows=0"

A/B Testing Search Results

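Run search experiments by assigning users to control or treatment groups and sending each group's queries with different parameters (or through separately configured request handlers). The treatment below assumes an image_url field, which the product schema above does not define: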

# Control
curl "http://localhost:8983/solr/products/select?q=wireless&defType=edismax&qf=name^4+description^1"

# Treatment — boosted by images/reviews
curl "http://localhost:8983/solr/products/select?q=wireless&defType=edismax&\
qf=name^4+description^1&\
bf=if(exists(image_url),5,0)^2+mul(review_count,0.1)^1"

Track per-group click-through rates in your analytics schema and confirm that the treatment's improvement is statistically significant before rolling out.
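
Two pieces of glue code are usually needed around an experiment like this: a stable group assignment and a significance check. The sketch below is one possible approach, not something Solr provides. assign_group hashes the session id so a user always lands in the same group, and two_proportion_z is a standard pooled two-proportion z-test on click-through counts. The experiment name and the sample numbers are made up.

```python
import hashlib
import math


def assign_group(session_id, experiment="boost-images", treatment_share=0.5):
    """Deterministically map a session to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z-score for the difference in click-through rate between groups A and B."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


if __name__ == "__main__":
    print(assign_group("session-42"))
    z = two_proportion_z(clicks_a=100, n_a=1000, clicks_b=130, n_b=1000)
    print(f"z = {z:.2f}")  # |z| > 1.96 corresponds to p < 0.05, two-sided
```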

Synonyms and Stop Words

Manage synonyms via Solr’s synonym filter in your schema’s analysis chain:

<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory"
          synonyms="synonyms.txt"
          expand="true"
          ignoreCase="true"/>
  <filter class="solr.StopFilterFactory"
          words="stopwords.txt"
          ignoreCase="true"/>
</analyzer>

In synonyms.txt:

laptop, notebook, ultrabook
cellphone, mobile, smartphone
tv, television, telly

Use SynonymGraphFilterFactory (not the deprecated SynonymFilterFactory) to ensure multi-word synonyms produce correct phrase query behavior.

Log and Time-Series Analytics

Solr’s streaming expressions and time-based rollups make it a viable alternative to dedicated analytics stores for moderate-scale log aggregation.

Time-Range Index Rollups

For operational logs with high ingest volume (millions/day), roll up old data into hourly summaries rather than retaining raw documents:

# Streaming expression to aggregate hourly ERROR counts for the last 7 days
# (--data-urlencode keeps the + in +1HOUR from being decoded as a space)
curl "http://localhost:8983/solr/logs/stream" \
  --data-urlencode 'expr=timeseries(logs,
    q="severity:ERROR",
    field="timestamp",
    start="NOW-7DAYS/HOUR",
    end="NOW",
    gap="+1HOUR",
    count(*)
  )'

This returns one bucket per hour with the ERROR count, computed through Solr's facet machinery rather than by streaming every raw log line to the client. Run it again with q="severity:WARN" for warnings. To keep only summaries long term, index the resulting tuples into a separate summary collection.
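
For readers new to rollups, the aggregation itself is simple: truncate each timestamp to the hour and count. The sketch below shows that logic client-side on a handful of made-up events; it is only an illustration of what the hourly buckets contain, not how Solr computes them.

```python
from collections import Counter
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def hourly_counts(events, severity):
    """Count events of one severity per hour bucket (timestamps in UTC, Solr date format)."""
    counts = Counter()
    for ts, sev in events:
        if sev == severity:
            hour = datetime.strptime(ts, TS_FORMAT).replace(minute=0, second=0)
            counts[hour.strftime(TS_FORMAT)] += 1
    return dict(sorted(counts.items()))


EVENTS = [
    ("2026-04-01T10:05:00Z", "ERROR"),
    ("2026-04-01T10:45:10Z", "ERROR"),
    ("2026-04-01T11:00:01Z", "WARN"),
    ("2026-04-01T11:30:00Z", "ERROR"),
]

if __name__ == "__main__":
    print(hourly_counts(EVENTS, "ERROR"))
```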

Streaming Expressions for Real-Time Aggregation

Streaming expressions let you express analytics queries without MapReduce overhead:

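expr=top(
  n=10,
  having(
    rollup(
      search(logs, q="*:*", fl="endpoint,response_time", sort="endpoint asc", qt="/export"),
      over="endpoint",
      avg(response_time)
    ),
    gt(avg(response_time), 200)
  ),
  sort="avg(response_time) desc"
)

This finds the 10 slowest endpoints by average response time, keeping only those whose average exceeds 200. The rollup needs its input sorted by the grouping field, and the /export handler requires docValues on endpoint and response_time. Streaming expressions execute server-side and can be embedded in dashboards.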

SQL Aggregation via Parallel SQL

Solr exposes parallel SQL through a /sql handler and a JDBC driver. The supported SQL is a subset: aggregations are limited to functions such as COUNT, SUM, AVG, MIN, and MAX, and GROUP BY works on fields rather than on expressions. The query below therefore assumes each log document carries an hour_bucket string field (for example 2026-04-01T10) populated at index time; percentiles are better computed with streaming expressions or JSON facets.

SELECT
  hour_bucket,
  COUNT(*) AS request_count,
  AVG(response_time) AS avg_latency
FROM logs
WHERE timestamp >= '2026-04-01T00:00:00Z'
GROUP BY hour_bucket
ORDER BY hour_bucket ASC

Run via the SQL handler:

curl --data-urlencode "stmt=SELECT hour_bucket, COUNT(*) AS request_count, AVG(response_time) AS avg_latency FROM logs WHERE timestamp >= '2026-04-01T00:00:00Z' GROUP BY hour_bucket ORDER BY hour_bucket ASC" \
  "http://localhost:8983/solr/logs/sql?aggregationMode=facet"

For production, use the JDBC driver instead of URL-encoded curl calls.

Geospatial Search

Solr’s spatial capabilities handle lat/lon queries, bounding box filters, and distance-based sorting, and they stay fast on large indexes when filters are cacheable and the field type matches the data.

Geospatial Schema

{
  "fields": [
    {"name": "store_id", "type": "string"},
    {"name": "store_name", "type": "text_en"},
    {"name": "location", "type": "location_rpt", "indexed": true, "multiValued": false},
    {"name": "address", "type": "string"},
    {"name": "hours", "type": "string"},
    {"name": "rating", "type": "pfloat"}
  ]
}

The location_rpt field type (SpatialRecursivePrefixTreeFieldType) supports shapes such as polygons and multi-valued points in addition to single points. For plain lat/lon points, the location type (LatLonPointSpatialField) is sufficient and generally faster.

Bounding Box and Radius Queries

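# Find stores within 10 km of a point
# (single quotes and --data-urlencode protect the braces and spaces in the local params)
curl -G "http://localhost:8983/solr/stores/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!geofilt pt=37.7749,-122.4194 sfield=location d=10}'

# Bounding box filter
curl -G "http://localhost:8983/solr/stores/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!bbox pt=37.7749,-122.4194 sfield=location d=10}'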

geofilt does exact circle filtering (more accurate). bbox does rectangular filtering (faster, useful for map tile rendering).

Distance Sorting and Score Boosting

Sort results by distance for “nearest store” queries:

curl -G "http://localhost:8983/solr/stores/select" \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq={!geofilt sfield=location pt=37.7749,-122.4194 d=25}' \
  --data-urlencode 'sort=geodist(location,37.7749,-122.4194) asc'

Boost by distance in relevance scoring:

curl "http://localhost:8983/solr/stores/select?q=coffee&\
defType=edismax&\
qf=store_name^2&\
bf=recip(geodist(location,37.7749,-122.4194),1,1000,100)"

Here recip(x,1,1000,100) computes 1000 / (x + 100), where x is the distance in kilometers: the boost is 10 at zero distance, 5 at 100 km, and decreases smoothly toward 0.
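
The boost curve is easy to sanity-check offline. The sketch below computes a great-circle distance with the haversine formula and feeds it through the same recip arithmetic. The earth radius of 6371 km and the second coordinate (roughly Oakland) are assumptions for illustration, so distances may differ slightly from what geodist returns.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_boost(distance_km):
    """Value of recip(distance, 1, 1000, 100), i.e. 1000 / (distance + 100)."""
    return 1000 / (1 * distance_km + 100)


if __name__ == "__main__":
    d = haversine_km(37.7749, -122.4194, 37.8044, -122.2712)
    print(f"distance: {d:.1f} km, boost: {distance_boost(d):.2f}")
```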

Publishing and Media Search

Media publishers use Solr for metadata search, tag-based navigation, and content recommendations across catalogs of millions of articles, videos, and audio files.

Schema for Media Assets

{
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "title", "type": "text_en"},
    {"name": "caption", "type": "text_en"},
    {"name": "tags", "type": "string", "multiValued": true, "indexed": true, "docValues": true},
    {"name": "media_type", "type": "string", "indexed": true, "docValues": true},
    {"name": "duration_secs", "type": "pint", "indexed": true},
    {"name": "resolution", "type": "string", "indexed": true},
    {"name": "license", "type": "string", "indexed": true},
    {"name": "upload_date", "type": "pdate", "indexed": true, "docValues": true},
    {"name": "view_count", "type": "plong", "indexed": true, "docValues": true},
    {"name": "featured", "type": "boolean", "indexed": true}
  ]
}

Schema fields do not accept a boost attribute. Weight title at query time instead, for example qf=title^5 caption tags with edismax.

Tag-Based Navigation with Boosted Filtering

# Browse by tag with promoted content on top
curl "http://localhost:8983/solr/media/select?q=*:*&\
fq=tags:photography&\
sort=featured+desc,view_count+desc&\
facet=true&\
facet.field=tags&\
facet.limit=50"

Content Recommendations with MoreLikeThis

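Solr’s MoreLikeThis (MLT) feature generates content recommendations without a separate ML pipeline. The /mlt endpoint below assumes a MoreLikeThisHandler registered under that path in solrconfig.xml, and the mlt.fl fields should be stored or have term vectors: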

curl "http://localhost:8983/solr/media/mlt?q=id:article-042&\
mlt.fl=title,caption,tags&\
mlt.mintf=2&\
mlt.mindf=1&\
mlt.minwl=3&\
mlt.maxwl=20&\
mlt.maxqt=25&\
mlt.count=10&\
mlt.boost=true"

Parameters explained:

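mlt.mintf=2: a term must appear at least twice in the source doc to be considered
mlt.mindf=1: a term must appear in at least 1 document in the index (a value of 1 filters nothing; raise it to drop rare terms)
mlt.minwl=3, mlt.maxwl=20: min/max word length for extracted terms
mlt.maxqt=25: at most 25 terms go into the generated query
mlt.boost=true: weight each term in the generated query by its relevance score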

For video platforms, use MLT on tags and captions to surface “watch next” recommendations.
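
To build intuition for how these thresholds shape the generated query, the sketch below applies the term-frequency and word-length filters to a piece of text. It is a deliberate simplification: real MLT analyzes text with the field's analyzer and ranks terms by a tf-idf style score, while this hypothetical candidate_terms helper tokenizes naively and ranks by raw frequency only.

```python
import re
from collections import Counter


def candidate_terms(text, mintf=2, minwl=3, maxwl=20, maxqt=25):
    """Pick query terms using MLT-like thresholds (frequency-only ranking)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(w for w in words if minwl <= len(w) <= maxwl)
    kept = [(term, n) for term, n in tf.items() if n >= mintf]
    kept.sort(key=lambda item: (-item[1], item[0]))  # most frequent first, then alphabetical
    return [term for term, _ in kept[:maxqt]]


if __name__ == "__main__":
    caption = "Solar panel review: solar panel efficiency and solar cost"
    print(candidate_terms(caption))
```

With mintf=2 only the repeated terms survive, which is why short titles and captions often need mlt.mintf=1 to produce any recommendations at all.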

Best Practices

Query Time vs. Index Time Decisions

| Decision | Index Time | Query Time | When To Choose |
|---|---|---|---|
| Field boosting | Copy to boosted field at index | `qf` parameter | Query time: iterating is easier |
| ACL computation | Pre-compute visibility token | `fq` with user/group | Index time: faster query (token is single term) |
| Synonym expansion | Expand at index (store all forms) | SynonymGraphFilter at query | Query time: smaller index, flexible |
| Text normalization | Lowercasing, stemming at index | Same analyzers applied to query | Both: index and query analysis must match or terms will not line up |

Cache strategies:

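queryResultCache: Enable with size=500 for common product listing pages. Monitor the hit ratio through the Metrics API (/admin/metrics) or the Admin UI's Plugins / Stats screen.
filterCache: Use class=solr.CaffeineCache (FastLRUCache was deprecated in Solr 8.3 and removed in 9.0) with size=1000 for faceted navigation filters. High hit ratio is critical for drill-down UX.
documentCache: Caches stored fields of recently fetched documents and is not autowarmed. Size it to at least the maximum rows per request multiplied by the expected number of concurrent queries.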

Replication factor choices in SolrCloud:

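2 replicas: Minimum for standard HA. Losing one replica leaves every shard available. (Quorum applies to the ZooKeeper ensemble, not to Solr replicas; run an odd number of ZooKeeper nodes, typically three.)
3 replicas: Recommended for read-heavy workloads and stricter availability targets. Each shard survives two concurrent replica failures, but every NRT replica indexes each update, so write cost grows with the replica count.
numShards vs. replicationFactor: Prefer fewer shards with more replicas when the index fits comfortably in fewer shards. A 2-shard, 3-replica cluster often serves queries with lower latency than a 6-shard, 1-replica cluster because every query fans out to all shards and the results must be merged.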

Index Design Rules

Always use docValues=true on sort/facet/group fields. Never rely on field cache.
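Set stored=false on fields your application never reads back. For large text fields this can shrink the index considerably; measure on your own data.
Use multiValued sparingly: multi-valued fields restrict some query-time features. Sorting, for example, needs an explicit field(name,min) or field(name,max) function on a docValues field.
Prefer text_en over text_general for English content. The stemming chain usually improves recall, at some cost to precision.
On Solr 8 and earlier, set maxShardsPerNode when creating a collection to prevent uneven distribution; maxShardsPerNode=4 is a reasonable starting point for most workloads. Solr 9 removed the parameter in favor of replica placement plugins.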
