Apache Solr is a highly reliable, scalable full-text search platform built on Apache Lucene. Indexing custom data into Solr is the core operation that turns raw data into searchable documents. This guide covers the full pipeline: schema definition, data ingestion methods, indexing strategies, performance tuning, and common pitfalls.
Solr Architecture Overview
Solr organizes data into cores (standalone mode) or collections (SolrCloud mode). A collection is split into shards for horizontal scaling, and each shard can have replicas for high availability and read throughput.
Client ──▶ Load Balancer ──▶ SolrCloud (ZooKeeper ensemble)
                                  │
              ┌───────────────────┼───────────────────┐
           Shard 1             Shard 2             Shard N
          ┌───┴───┐           ┌───┴───┐           ┌───┴───┐
      Leader R1  R2       Leader R1  R2       Leader R1  R2
Every document indexed into Solr is parsed, analyzed (tokenized, filtered, stemmed), and written to an inverted index on disk. A commit makes the index visible to searches.
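To make the commit boundary concrete, here is a minimal sketch using the pysolr client (covered later in this guide) against a hypothetical mycoll collection with no auto soft commits configured:
import pysolr

# always_commit=False buffers documents until an explicit commit
solr = pysolr.Solr("http://localhost:8983/solr/mycoll", always_commit=False)

solr.add([{"id": "demo-1", "name": "Commit demo"}])
print(solr.search("id:demo-1").hits)  # 0: indexed but not yet visible

solr.commit()  # hard commit flushes to disk and opens a new searcher
print(solr.search("id:demo-1").hits)  # 1: now visible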
Schema Design
Solr’s schema defines how documents are structured and how fields are analyzed. There are two modes.
Managed Schema vs Classic Schema
| Feature | Managed Schema | Classic schema.xml |
|---|---|---|
| API-driven | Yes (Schema API) | No (file-based) |
| Auto field creation | Optional via schemaFactory | Manual only |
| Runtime changes | Yes, no restart | Requires a core reload |
| Recommended for | New projects, dynamic schemas | Legacy or locked-down environments |
In managed-schema mode you use the Schema API at runtime:
# Add a string field
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field":{"name":"hostname","type":"string","stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
# Add a multi-valued field
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field":{"name":"tags","type":"string","multiValued":true,"stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
Field Types
Solr ships with dozens of built-in field types. The most common ones are:
| Field Type | Use Case | Example Data |
|---|---|---|
| `text_general` | Full-text search with stemming | Article body, descriptions |
| `string` | Exact match, faceting, sorting | IDs, SKUs, hostnames |
| `int` / `long` | Numeric range queries | Prices, counts, timestamps |
| `float` / `double` | Decimal values | Ratings, coordinates |
| `date` / `pdate` | Date/time queries | Created-at, event date |
| `boolean` | True/false flags | Published status |
| `location` | Geospatial queries | Latitude, longitude |
You can define custom field types in the schema:
// Add a custom text type that lowercases and splits on whitespace
{
"add-field-type": {
"name": "my_text",
"class": "solr.TextField",
"analyzer": {
"tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
"filters": [
{"class": "solr.LowerCaseFilterFactory"}
]
}
}
}
Dynamic Fields
Dynamic fields let you index fields whose names you don’t know at schema-design time. They match by suffix or prefix pattern:
# Match any field ending in _s as a string
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-dynamic-field":{"name":"*_s","type":"string","stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
# Match any field ending in _i as an integer
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-dynamic-field":{"name":"*_i","type":"int","stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
With these dynamic fields in place, a document containing price_i: 299 and vendor_s: "Acme" will be indexed automatically without explicit field definitions.
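A quick sketch of that behavior with pysolr (assuming the two dynamic-field rules above are registered on mycoll):
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycoll")

# price_i matches *_i (int) and vendor_s matches *_s (string);
# neither field was declared explicitly in the schema.
solr.add([{"id": "prod-42", "price_i": 299, "vendor_s": "Acme"}])
solr.commit()

# The dynamically created int field supports range queries immediately
print(solr.search("price_i:[100 TO 500]").hits)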
Copy Fields
Copy fields populate a single catch-all field from multiple sources so you can search across all fields at once:
# Copy the "name", "url", and "description" fields into "_text_"
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field":{"source":"name","dest":"_text_"}}' \
http://localhost:8983/solr/mycoll/schema
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field":{"source":"url","dest":"_text_"}}' \
http://localhost:8983/solr/mycoll/schema
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field":{"source":"description","dest":"_text_"}}' \
http://localhost:8983/solr/mycoll/schema
You can also copy everything:
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field":{"source":"*","dest":"_text_"}}' \
http://localhost:8983/solr/mycoll/schema
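Once the copy rules are in place, a single query against _text_ searches every copied source at once. A quick sketch with pysolr (assuming the mycoll collection used above):
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycoll")

# One query against the catch-all matches terms that appeared in
# name, url, or description before copying.
results = solr.search("_text_:calmops")
for doc in results:
    print(doc.get("name"), doc.get("url"))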
Data Import Methods
Solr accepts data in multiple formats. Choose the one that fits your pipeline.
JSON
JSON is the most developer-friendly format. Each document is a map, and you send an array of documents:
[
{
"id": 1,
"name": "CalmOps Blog",
"url": "https://calmops.com",
"description": "Software engineering knowledge hub",
"tags": ["devops", "backend", "web"],
"published": true
},
{
"id": 2,
"name": "ABCD Search",
"url": "https://abcd.net",
"description": "A fast privacy-focused search engine",
"tags": ["search", "privacy"],
"published": true
}
]
Post it with the /update handler:
curl -X POST -H 'Content-Type: application/json' \
--data-binary @documents.json \
http://localhost:8983/solr/mycoll/update?commit=true
JSON also supports partial updates and atomic operations:
// Atomic update: increment views, replace the description, append a tag
{
  "id": 1,
  "views_i": {"inc": 1},
  "description": {"set": "Updated description"},
  "tags": {"add": "new-tag"}
}
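A sketch posting that atomic update with the requests library (the URL assumes the mycoll collection used throughout):
import requests

update = [{
    "id": 1,
    "views_i": {"inc": 1},                          # increment
    "description": {"set": "Updated description"},  # replace
    "tags": {"add": "new-tag"},                     # append
}]

resp = requests.post(
    "http://localhost:8983/solr/mycoll/update",
    params={"commit": "true"},
    json=update,
)
resp.raise_for_status()
print(resp.json()["responseHeader"]["status"])  # 0 on success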
CSV
CSV is ideal for tabular data exported from databases or spreadsheets:
curl -X POST -H 'Content-Type: application/csv' \
--data-binary @documents.csv \
http://localhost:8983/solr/mycoll/update?commit=true
Example CSV:
id,name,url,description
1,CalmOps Blog,https://calmops.com,Software engineering knowledge hub
2,ABCD Search,https://abcd.net,A fast privacy-focused search engine
Use CSV query parameters to configure parsing: &separator=%09 for TSV, &header=false if there is no header row, and &skipLines=1 to skip leading lines (use &skip=field1,field2 to ignore whole columns).
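The same parameters from a script. A sketch with requests, assuming a hypothetical documents.tsv export (requests URL-encodes the tab separator for you):
import requests

with open("documents.tsv", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/mycoll/update",
        params={"separator": "\t", "header": "true", "commit": "true"},
        headers={"Content-Type": "application/csv"},
        data=f,
    )
resp.raise_for_status()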
XML
Solr also accepts the legacy Solr XML format:
<add>
<doc>
<field name="id">1</field>
<field name="name">CalmOps Blog</field>
<field name="url">https://calmops.com</field>
<field name="description">Software engineering knowledge hub</field>
<field name="tags">devops</field>
<field name="tags">backend</field>
</doc>
<doc>
<field name="id">2</field>
<field name="name">ABCD Search</field>
<field name="url">https://abcd.net</field>
<field name="description">A fast privacy-focused search engine</field>
<field name="tags">search</field>
</doc>
</add>
curl -X POST -H 'Content-Type: application/xml' \
--data-binary @documents.xml \
http://localhost:8983/solr/mycoll/update?commit=true
Data Import Handler (DIH) for Databases
DIH indexes directly from relational databases via JDBC. Note that DIH was deprecated in Solr 8.6 and removed from the core distribution in Solr 9, where it lives on as a community package; for new projects a client-side script (like the pysolr examples below) is usually the safer choice. Configure db-data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://localhost:5432/myapp"
user="app_user"
password="secret"/>
<document>
<entity name="article"
query="SELECT id, title, url, body, created_at FROM articles WHERE published = true">
<field column="id" name="id"/>
<field column="title" name="name"/>
<field column="url" name="url"/>
<field column="body" name="description"/>
<field column="created_at" name="created_date"/>
</entity>
</document>
</dataConfig>
Register the DIH request handler in solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
Trigger a full import:
curl 'http://localhost:8983/solr/mycoll/dataimport?command=full-import&clean=true&commit=true'
Apache Tika for Binary Documents
Solr’s ExtractingRequestHandler (using Apache Tika) indexes binary files (PDF, Word, HTML):
curl -X POST -H 'Content-Type: application/pdf' \
--data-binary @report.pdf \
'http://localhost:8983/solr/mycoll/update/extract?literal.id=doc1&literal.name=Q1+Report&commit=true'
Tika extracts the text content and metadata automatically; by default the body text lands in the field the handler is configured to map it to (commonly content, adjustable with fmap.* parameters). Use literal.* parameters to add static metadata.
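The same extract call from Python. A sketch with requests (report.pdf and the literal values are placeholders):
import requests

with open("report.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/mycoll/update/extract",
        # literal.* attaches static metadata to the extracted document
        params={"literal.id": "doc1", "literal.name": "Q1 Report",
                "commit": "true"},
        headers={"Content-Type": "application/pdf"},
        data=f,
    )
resp.raise_for_status()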
Import Method Comparison
| Method | Format | Speed | Best For |
|---|---|---|---|
| JSON POST | JSON array | Fast | API integration, partial updates |
| CSV POST | CSV/TSV | Fastest bulk | Tabular exports, spreadsheets |
| XML POST | Solr XML | Moderate | Legacy systems |
| DIH | JDBC | Moderate (row fetch) | Live database sync |
| Tika | Binary | Slow (parsing) | PDFs, Office docs, HTML |
| Post Tool | Any | Fast | Ad-hoc files, development |
Indexing Strategies
Full Index vs Incremental Index
Full index: Wipe the collection and re-index everything. Use for initial loads or when schema changes require re-indexing:
# Delete everything first
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"delete":{"query":"*:*"}}' \
http://localhost:8983/solr/mycoll/update?commit=true
# Then re-index
curl -X POST -H 'Content-Type: application/json' \
--data-binary @full-export.json \
http://localhost:8983/solr/mycoll/update?commit=true
Incremental index: Index only documents that changed since the last run. Requires a timestamp or version field:
# With DIH: delta-import uses a delta query
curl http://localhost:8983/solr/mycoll/dataimport?command=delta-import
# Manual: send only changed documents
curl -X POST -H 'Content-Type: application/json' \
--data-binary @delta.json \
http://localhost:8983/solr/mycoll/update?commit=true
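A minimal incremental-indexing sketch in Python, assuming a PostgreSQL articles table with an updated_at column and a local state file recording the last successful run (psycopg2, the table, and the file name are illustrative assumptions, not part of Solr):
#!/usr/bin/env python3
"""Incrementally index rows changed since the last run."""
import json
from datetime import datetime, timezone
from pathlib import Path
import psycopg2  # hypothetical source database
import pysolr

STATE_FILE = Path("last_run.json")

def last_run_ts() -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["ts"]
    return "1970-01-01T00:00:00+00:00"

def incremental_index():
    run_started = datetime.now(timezone.utc).isoformat()
    solr = pysolr.Solr("http://localhost:8983/solr/mycoll")
    conn = psycopg2.connect("dbname=myapp user=app_user")
    with conn, conn.cursor() as cur:
        # Fetch only rows changed since the previous successful run
        cur.execute(
            "SELECT id, title, url, body FROM articles WHERE updated_at > %s",
            (last_run_ts(),),
        )
        docs = [
            {"id": r[0], "name": r[1], "url": r[2], "description": r[3]}
            for r in cur
        ]
    if docs:
        solr.add(docs)
        solr.commit()
    # Record the run start time so updates made during the run are not missed
    STATE_FILE.write_text(json.dumps({"ts": run_started}))

if __name__ == "__main__":
    incremental_index()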
Commit Strategies
A commit makes indexed documents visible to search. Solr supports three commit types:
| Type | Command | When Data Is Visible | Cost |
|---|---|---|---|
| Hard commit | `commit=true` | Immediately after commit | High (fsync) |
| Soft commit | `softCommit=true` | Immediately (memory) | Low (no fsync) |
| Auto commit | `autoCommit` in solrconfig | At interval or doc count | Configurable |
Auto-commit configuration in solrconfig.xml:
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher> <!-- leave visibility to soft or explicit commits -->
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
Use soft commits during bulk indexing to avoid the fsync overhead of hard commits on every batch. Issue one hard commit at the end:
# Index without committing
curl -X POST -H 'Content-Type: application/json' \
--data-binary @batch-1.json \
http://localhost:8983/solr/mycoll/update
curl -X POST -H 'Content-Type: application/json' \
--data-binary @batch-2.json \
http://localhost:8983/solr/mycoll/update
# Single hard commit at the end
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"commit":{}}' \
http://localhost:8983/solr/mycoll/update
Optimize
Over time the index accumulates segments and deleted documents. Merging them down can improve search speed:
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"optimize":{"maxSegments":1}}' \
http://localhost:8983/solr/mycoll/update
Run optimize during low-traffic periods. It rewrites the entire index and is I/O intensive.
Batch Indexing with Python (pysolr)
For production pipelines, use the pysolr client:
#!/usr/bin/env python3
"""Batch index documents into Solr using pysolr."""
import pysolr
import json
from pathlib import Path
SOLR_URL = "http://localhost:8983/solr/mycoll"
BATCH_SIZE = 500
def load_documents(path: str) -> list[dict]:
with open(path) as f:
return json.load(f)
def batch_index(docs: list[dict]):
solr = pysolr.Solr(SOLR_URL, always_commit=False)
total = len(docs)
for i in range(0, total, BATCH_SIZE):
batch = docs[i:i + BATCH_SIZE]
solr.add(batch)
print(f"Indexed {min(i + BATCH_SIZE, total)}/{total}")
solr.commit()
print("Done. Hard commit issued.")
if __name__ == "__main__":
docs = load_documents("data/export.json")
batch_index(docs)
Alternatively, send documents from a shell script that loops curl with soft commits:
#!/bin/bash
# Batch index JSON files with soft commits
set -e
SOLR_URL="http://localhost:8983/solr/mycoll/update"
BATCH_DIR="./batches"
for f in "$BATCH_DIR"/*.json; do
echo "Indexing $f..."
curl -s -X POST -H 'Content-Type: application/json' \
--data-binary @"$f" \
"$SOLR_URL?softCommit=true"
echo
done
# Final hard commit
curl -s -X POST -H 'Content-Type: application/json' \
--data-binary '{"commit":{}}' \
"$SOLR_URL"
echo "All batches committed."
Performance Tuning
Merge Factors
Lucene’s merge policy controls when segments are merged. Tune for indexing throughput vs search performance:
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="segsPerTier">10</int>
<double name="deletesPctAllowed">20</double>
<double name="floorSegmentMB">50</double>
<double name="maxMergedSegmentMB">5120</double>
</mergePolicyFactory>
Higher segsPerTier (10-20) reduces merge overhead during indexing. Lower it (5) for better search performance on an already-built index.
Thread Pools
Older Solr releases let you cap concurrent indexing threads in the <indexConfig> section of solrconfig.xml:
<indexConfig>
  <maxIndexingThreads>8</maxIndexingThreads>
</indexConfig>
That setting was removed in later versions; in current Solr, indexing concurrency is governed by how many parallel update requests your clients send. Match client-side concurrency to the CPU core count of your Solr nodes: oversubscribing leads to context-switch overhead.
Batch Sizing
Send documents in batches of 500-2000 for optimal throughput. Smaller batches increase HTTP overhead; larger batches consume more memory on the Solr JVM. Measure with your own data and hardware:
# Test batch of 1000
curl -X POST -H 'Content-Type: application/json' \
--data-binary @batch-1000.json \
http://localhost:8983/solr/mycoll/update?softCommit=true
Monitor the Solr admin UI (localhost:8983/solr) under Plugins / Stats for update handler response times to find your sweet spot.
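If you prefer to script the measurement, a rough harness (a sketch; run it against a throwaway collection, since it re-adds the same documents each pass):
import time
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycoll")

def docs_per_second(docs: list[dict], batch_size: int) -> float:
    """Index all docs at the given batch size and return throughput."""
    start = time.perf_counter()
    for i in range(0, len(docs), batch_size):
        solr.add(docs[i:i + batch_size])
    solr.commit()
    return len(docs) / (time.perf_counter() - start)

# Example: compare candidate batch sizes on the same export
# docs = json.load(open("data/export.json"))
# for size in (100, 500, 1000, 2000):
#     print(size, round(docs_per_second(docs, size)), "docs/sec")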
JVM Heap and Caching
Solr’s indexing performance depends heavily on heap allocation:
# Set in bin/solr.in.sh
SOLR_JAVA_MEM="-Xms8g -Xmx8g"
For write-heavy workloads, increase the document cache and disable the query result cache:
<queryResultCache size="0" initialSize="0" autowarmCount="0"/>
<documentCache size="5000" initialSize="5000" autowarmCount="0"/>
Schema Change Management with the Schema API
Schema API requests are applied atomically: if any operation in a request is invalid, the whole request is rejected and the schema is left untouched. There is no built-in rollback, so keep your schema definitions in version control and revert a change by applying its inverse operation:
# Add a new field
curl -s -X POST -H 'Content-Type: application/json' \
  --data-binary '{"add-field":{"name":"rating","type":"float","stored":true}}' \
  http://localhost:8983/solr/mycoll/schema
# Revert by applying the inverse operation
curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"delete-field":{"name":"rating"}}' \
  http://localhost:8983/solr/mycoll/schema
# Check the schema format version (e.g. 1.6)
curl http://localhost:8983/solr/mycoll/schema/version
Because invalid requests are rejected without side effects, sending a change is itself a validation step:
curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"add-field":{"name":"test_field","type":"unknown_type"}}' \
  http://localhost:8983/solr/mycoll/schema
# Returns an error like "Field type 'unknown_type' not found" and applies nothing
Troubleshooting Indexing Errors
Field Not Found
Error: Document contains field "custom_field" but schema doesn't have it
Fix: Add the field or a matching dynamic field:
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"add-dynamic-field":{"name":"*_field","type":"string","stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
If managed schema is not enabled, add the field to schema.xml manually and reload the core:
curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycoll'
Type Mismatch
Error: Invalid Number: abc123 or expected Date String but got 12345
Fix: Ensure the field value matches the field type. Use the Schema API to check field definitions:
curl http://localhost:8983/solr/mycoll/schema/fields/price
If the type is wrong, replace the field (drop and re-add with correct type) and re-index:
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"delete-field":{"name":"price"}}' \
http://localhost:8983/solr/mycoll/schema
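Scripted, the drop-and-re-add flow looks like this (a sketch with requests; the float type and field name match the example above):
import requests

SCHEMA_URL = "http://localhost:8983/solr/mycoll/schema"

# Drop the mistyped field, then re-add it with the intended type.
for op in (
    {"delete-field": {"name": "price"}},
    {"add-field": {"name": "price", "type": "float", "stored": True}},
):
    requests.post(SCHEMA_URL, json=op).raise_for_status()

# Existing documents keep their old indexed values: re-send them so
# "price" is re-analyzed under the new type.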
Document Too Large
Error: Document is too large: the content exceeds the maximum size
Fix: Raise the request size limits on the request dispatcher in solrconfig.xml (values are in KB):
<requestDispatcher>
  <requestParsers multipartUploadLimitInKB="102400"
                  formdataUploadLimitInKB="102400"/>
</requestDispatcher>
Or filter out oversized documents from your pipeline:
import json

MAX_DOC_SIZE = 10 * 1024 * 1024  # 10 MB

def filter_large_docs(docs: list[dict]) -> list[dict]:
    """Drop documents whose serialized size exceeds the limit."""
    return [d for d in docs if len(json.dumps(d)) < MAX_DOC_SIZE]
409 Conflict on Schema API
Error: Schema modification failed because another operation is in progress
Fix: Wait and retry. Schema operations are serialized. Avoid concurrent schema changes:
# Exponential backoff retry
for i in 1 2 4 8; do
if curl -s -X POST ...; then
break
fi
sleep $i
done
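The same backoff in Python, for pipelines that already use requests (a sketch; the operation payload is whatever schema change you were sending):
import time
import requests

def post_schema_change(op: dict, retries: int = 4) -> dict:
    """Retry a schema change with exponential backoff on 409 conflicts."""
    delay = 1
    for _ in range(retries):
        resp = requests.post(
            "http://localhost:8983/solr/mycoll/schema", json=op)
        if resp.status_code != 409:
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
        delay *= 2
    raise RuntimeError("schema change still conflicting after retries")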
Out of Memory During Indexing
Error: java.lang.OutOfMemoryError: Java heap space
Fix: Reduce batch size, increase heap, or lower the number of concurrent client indexing threads:
SOLR_JAVA_MEM="-Xms16g -Xmx16g"
Complete End-to-End Walkthrough
# 1. Start Solr in cloud mode
bin/solr start -e cloud -noprompt
# 2. Create a collection
bin/solr create -c sites -s 1 -rf 1
# 3. Define schema fields via Schema API
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field": {"name":"name", "type":"text_general", "stored":true}}' \
http://localhost:8983/solr/sites/schema
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field": {"name":"url", "type":"string", "stored":true}}' \
http://localhost:8983/solr/sites/schema
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field": {"name":"description", "type":"text_general", "stored":true}}' \
http://localhost:8983/solr/sites/schema
# 4. Add a copy field for catch-all search
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field": {"source":"*","dest":"_text_"}}' \
http://localhost:8983/solr/sites/schema
# 5. Index JSON data
bin/post -c sites ~/data/solr-sites.json
# 6. Verify
curl "http://localhost:8983/solr/sites/select?q=search&wt=json"
The POST tool (bin/post) automatically issues a hard commit. If using curl directly, append ?commit=true to the URL or send a separate commit command.
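The same verification in Python with pysolr (this assumes the _default configset, where bare keyword queries are routed to the _text_ catch-all field):
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/sites")

# Same query as the curl above
results = solr.search("search", rows=10)
print(f"{results.hits} documents matched")
for doc in results:
    print(doc["name"], doc["url"])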