Apache Solr is a highly reliable, scalable full-text search platform built on Apache Lucene. Indexing custom data into Solr is the core operation that turns raw data into searchable documents. This guide covers the full pipeline: schema definition, data ingestion methods, indexing strategies, performance tuning, and common pitfalls.
Solr Architecture Overview
Solr organizes data into cores (standalone mode) or collections (SolrCloud mode). A collection is split into shards for horizontal scaling, and each shard can have replicas for high availability and read throughput.
Client → Load Balancer → SolrCloud (ZooKeeper ensemble)
                          │
          ┌───────────────┼───────────────┐
       Shard 1         Shard 2         Shard N
       ┌──┴──┐         ┌──┴──┐         ┌──┴──┐
    Leader R1 R2    Leader R1 R2    Leader R1 R2
Every document indexed into Solr is parsed, analyzed (tokenized, filtered, stemmed), and written to an inverted index on disk. A commit makes the index visible to searches.
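As a rough illustration of the analyze-then-index step, the sketch below builds a toy inverted index in Python. The names `analyze` and `build_inverted_index` are hypothetical and exist only for this example; Lucene's real analyzers and on-disk index format are far more elaborate.

```python
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Toy analyzer: split on whitespace, strip punctuation, lowercase."""
    tokens = []
    for raw in text.split():
        token = raw.strip(".,;:!?\"'()").lower()
        if token:
            tokens.append(token)
    return tokens


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


docs = {
    "1": "Software engineering knowledge hub",
    "2": "A fast privacy-focused search engine",
}
index = build_inverted_index(docs)
print(sorted(index["fast"]))  # IDs of the documents that contain "fast"
```

A query for a term then becomes a dictionary lookup rather than a scan of every document, which is the property that makes full-text search fast.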
Schema Design
Solr’s schema defines how documents are structured and how fields are analyzed. There are two modes.
Managed Schema vs Classic Schema
| Feature | Managed Schema | Classic schema.xml |
|---|---|---|
| API-driven | Yes (Schema API) | No – file-based |
| Auto field creation | Optional via schemaFactory | Manual only |
| Runtime changes | Yes, no restart | Requires reload/core reload |
| Recommended for | New projects, dynamic schemas | Legacy or locked-down environments |
In managed-schema mode you use the Schema API at runtime:
# Add a string field
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field":{"name":"hostname","type":"string","stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
# Add a multi-valued field
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field":{"name":"tags","type":"string","multiValued":true,"stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
Field Types
Solr ships with dozens of built-in field types. The most common ones are:
| Field Type | Use Case | Example Data |
|---|---|---|
| text_general | Full-text search (tokenized and lowercased; use a language-specific type such as text_en for stemming) | Article body, descriptions |
| string | Exact match, faceting, sorting | IDs, SKUs, hostnames |
| int / long (pint / plong in recent default configsets) | Numeric range queries | Prices, counts, timestamps |
| float / double (pfloat / pdouble in recent default configsets) | Decimal values | Ratings, coordinates |
| date / pdate | Date/time queries | Created-at, event date |
| boolean | True/false flags | Published status |
| location | Geo-spatial queries | Latitude, longitude |
You can define custom field types in the schema:
// Add a custom text type that lowercases and splits on whitespace
{
"add-field-type": {
"name": "my_text",
"class": "solr.TextField",
"analyzer": {
"tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
"filters": [
{"class": "solr.LowerCaseFilterFactory"}
]
}
}
}
Dynamic Fields
Dynamic fields let you index fields whose names you don’t know at schema-design time. They match by suffix or prefix pattern:
# Match any field ending in _s as a string
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-dynamic-field":{"name":"*_s","type":"string","stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
# Match any field ending in _i as an integer
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-dynamic-field":{"name":"*_i","type":"int","stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
With these dynamic fields in place, a document containing price_i: 299 and vendor_s: "Acme" will be indexed automatically without explicit field definitions.
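To make the matching rule concrete, here is a minimal Python sketch of how a field name could be resolved against explicit fields first and dynamic patterns second. The schema dictionaries and the function names are hypothetical, and the tie-break (longest pattern wins) is an assumption for illustration rather than a statement of Solr's exact internals.

```python
from typing import Optional

# Hypothetical schema snapshot; a real one would be read from the Schema API.
EXPLICIT_FIELDS = {"id": "string", "name": "text_general"}
DYNAMIC_FIELDS = {"*_s": "string", "*_i": "int"}


def matches(pattern: str, field_name: str) -> bool:
    """Dynamic field patterns carry a single '*' at the start or the end."""
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    return pattern == field_name


def resolve_field_type(field_name: str) -> Optional[str]:
    """Return the field type for a name, or None if nothing matches."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    # Assumption: prefer the longest (most specific) pattern when several match.
    for pattern in sorted(DYNAMIC_FIELDS, key=len, reverse=True):
        if matches(pattern, field_name):
            return DYNAMIC_FIELDS[pattern]
    return None


for name in ("price_i", "vendor_s", "id", "mystery"):
    print(name, "->", resolve_field_type(name))
```

A name that resolves to None here corresponds to the "unknown field" rejection described in the troubleshooting section.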
Copy Fields
Copy fields populate a single catch-all field from multiple sources so you can search across all fields at once:
# Copy the "name", "url", and "description" fields into "_text_"
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field":{"source":"name","dest":"_text_"}}' \
http://localhost:8983/solr/mycoll/schema
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field":{"source":"url","dest":"_text_"}}' \
http://localhost:8983/solr/mycoll/schema
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field":{"source":"description","dest":"_text_"}}' \
http://localhost:8983/solr/mycoll/schema
You can also copy everything:
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field":{"source":"*","dest":"_text_"}}' \
http://localhost:8983/solr/mycoll/schema
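The three separate requests above can likely be collapsed into one. The sketch below assumes the Schema API accepts an array of same-type commands in a single POST (check the Schema API documentation for your Solr version); `copy_field_request` is a hypothetical helper that only builds the request and does not send it.

```python
import json
import urllib.request


def copy_field_request(base_url: str, sources: list[str], dest: str) -> urllib.request.Request:
    """Build one Schema API request that adds a copy field per source."""
    payload = {"add-copy-field": [{"source": s, "dest": dest} for s in sources]}
    return urllib.request.Request(
        f"{base_url}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = copy_field_request(
    "http://localhost:8983/solr/mycoll",
    ["name", "url", "description"],
    "_text_",
)
print(req.data.decode("utf-8"))
# To send it against a running Solr instance: urllib.request.urlopen(req)
```

Batching the commands keeps the schema change to a single round trip, which also reduces the window for concurrent-modification conflicts.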
Data Import Methods
Solr accepts data in multiple formats. Choose the one that fits your pipeline.
JSON
JSON is the most developer-friendly format. Each document is a map, and you send an array of documents:
[
{
"id": 1,
"name": "CalmOps Blog",
"url": "https://calmops.com",
"description": "Software engineering knowledge hub",
"tags": ["devops", "backend", "web"],
"published": true
},
{
"id": 2,
"name": "ABCD Search",
"url": "https://abcd.net",
"description": "A fast privacy-focused search engine",
"tags": ["search", "privacy"],
"published": true
}
]
Post it with the /update handler:
curl -X POST -H 'Content-Type: application/json' \
--data-binary @documents.json \
http://localhost:8983/solr/mycoll/update?commit=true
JSON also supports atomic (partial) updates, which modify individual fields of an existing document. Send them to /update as an array of documents:
// Atomic update: increment views, replace description, add a tag
[
{
"id": 1,
"views_i": {"inc": 1},
"description": {"set": "Updated description"},
"tags": {"add": "new-tag"}
}
]
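When atomic updates are generated by a pipeline, a small builder helps keep the modifier names straight. This is a sketch only: `atomic_update` is a hypothetical helper, and the allowed set below covers just the common modifiers (`set`, `add`, `remove`, `inc`), not every one Solr supports.

```python
import json

# Subset of Solr's atomic update modifiers used in this sketch.
ATOMIC_OPS = {"set", "add", "remove", "inc"}


def atomic_update(doc_id, **changes) -> dict:
    """Build an atomic update document.

    Each keyword maps a field name to an (operation, value) pair.
    """
    doc = {"id": doc_id}
    for field, (op, value) in changes.items():
        if op not in ATOMIC_OPS:
            raise ValueError(f"unsupported atomic operation: {op}")
        doc[field] = {op: value}
    return doc


update = atomic_update(
    1,
    views_i=("inc", 1),
    description=("set", "Updated description"),
    tags=("add", "new-tag"),
)
# The /update handler expects an array of documents.
print(json.dumps([update], indent=2))
```

Rejecting unknown modifiers up front avoids sending a request that would otherwise be stored as a literal nested value or rejected by the server.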
CSV
CSV is ideal for tabular data exported from databases or spreadsheets:
curl -X POST -H 'Content-Type: application/csv' \
--data-binary @documents.csv \
http://localhost:8983/solr/mycoll/update?commit=true
Example CSV:
id,name,url,description
1,CalmOps Blog,https://calmops.com,Software engineering knowledge hub
2,ABCD Search,https://abcd.net,A fast privacy-focused search engine
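Use CSV query parameters to configure parsing: &separator=%09 for TSV, &header=false if the file has no header row (supply the column names with &fieldnames=id,name,url,description), &skipLines=1 to discard leading lines, and &skip=url to leave a column out of the index.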
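Characters such as a tab separator must be percent-encoded in the URL. A minimal sketch using Python's standard library, where `csv_update_url` is a hypothetical helper name:

```python
from urllib.parse import urlencode


def csv_update_url(base_url: str, **params: str) -> str:
    """Build an /update URL with percent-encoded CSV parsing parameters."""
    return f"{base_url}/update?{urlencode(params)}"


url = csv_update_url(
    "http://localhost:8983/solr/mycoll",
    separator="\t",   # tab-separated input
    header="false",   # the file has no header row
    commit="true",
)
print(url)
```

Letting `urlencode` handle the escaping avoids hand-writing sequences like %09 and the quoting mistakes that come with them.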
XML
Solr also accepts the legacy Solr XML format:
<add>
<doc>
<field name="id">1</field>
<field name="name">CalmOps Blog</field>
<field name="url">https://calmops.com</field>
<field name="description">Software engineering knowledge hub</field>
<field name="tags">devops</field>
<field name="tags">backend</field>
</doc>
<doc>
<field name="id">2</field>
<field name="name">ABCD Search</field>
<field name="url">https://abcd.net</field>
<field name="description">A fast privacy-focused search engine</field>
<field name="tags">search</field>
</doc>
</add>
curl -X POST -H 'Content-Type: application/xml' \
--data-binary @documents.xml \
http://localhost:8983/solr/mycoll/update?commit=true
Data Import Handler (DIH) for Databases
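DIH indexes directly from relational databases via JDBC. It was deprecated in Solr 8.6 and removed from the core distribution in 9.0, so on recent versions it has to be installed separately as a community package. Configure db-data-config.xml: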
<dataConfig>
<dataSource type="JdbcDataSource"
driver="org.postgresql.Driver"
url="jdbc:postgresql://localhost:5432/myapp"
user="app_user"
password="secret"/>
<document>
<entity name="article"
query="SELECT id, title, url, body, created_at FROM articles WHERE published = true">
<field column="id" name="id"/>
<field column="title" name="name"/>
<field column="url" name="url"/>
<field column="body" name="description"/>
<field column="created_at" name="created_date"/>
</entity>
</document>
</dataConfig>
Register the DIH request handler in solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
Trigger a full import:
curl 'http://localhost:8983/solr/mycoll/dataimport?command=full-import&clean=true&commit=true'
Apache Tika for Binary Documents
Solr’s ExtractingRequestHandler (using Apache Tika) indexes binary files (PDF, Word, HTML):
curl -X POST -H 'Content-Type: application/pdf' \
--data-binary @report.pdf \
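'http://localhost:8983/solr/mycoll/update/extract?literal.id=doc1&literal.name=Q1+Report&commit=true'
Tika extracts the text automatically. By default the body text goes into a field named content (remap it with fmap.content, for example to _text_), and stream metadata such as stream_name and stream_size is supplied alongside it. Use literal.<field> parameters to add static metadata.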
Import Method Comparison
| Method | Format | Speed | Best For |
|---|---|---|---|
| JSON POST | JSON array | Fast | API integration, partial updates |
| CSV POST | CSV/TSV | Fastest bulk | Tabular exports, spreadsheets |
| XML POST | Solr XML | Moderate | Legacy systems |
| DIH | JDBC | Moderate (row fetch) | Live database sync |
| Tika | Binary | Slow (parsing) | PDFs, Office docs, HTML |
| Post Tool | Any | Fast | Ad-hoc files, development |
Indexing Strategies
Full Index vs Incremental Index
Full index: Wipe the collection and re-index everything. Use for initial loads or when schema changes require re-indexing:
# Delete everything first
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"delete":{"query":"*:*"}}' \
http://localhost:8983/solr/mycoll/update?commit=true
# Then re-index
curl -X POST -H 'Content-Type: application/json' \
--data-binary @full-export.json \
http://localhost:8983/solr/mycoll/update?commit=true
Incremental index: Index only documents that changed since the last run. Requires a timestamp or version field:
# With DIH: delta-import uses a delta query
curl http://localhost:8983/solr/mycoll/dataimport?command=delta-import
# Manual: send only changed documents
curl -X POST -H 'Content-Type: application/json' \
--data-binary @delta.json \
http://localhost:8983/solr/mycoll/update?commit=true
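For the manual route, the delta can be computed by comparing each row's modification time against a stored high-water mark. The sketch below assumes every source row carries an `updated_at` datetime; that field name and both helper functions are hypothetical.

```python
from datetime import datetime, timezone


def select_changed(rows: list[dict], last_run: datetime) -> list[dict]:
    """Return only the rows modified after the previous indexing run."""
    return [row for row in rows if row["updated_at"] > last_run]


def next_watermark(rows: list[dict], last_run: datetime) -> datetime:
    """Newest modification time seen, or the old watermark if rows is empty."""
    return max((row["updated_at"] for row in rows), default=last_run)


last_run = datetime(2024, 1, 10, tzinfo=timezone.utc)
rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 12, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 1, 15, tzinfo=timezone.utc)},
]
delta = select_changed(rows, last_run)
print([row["id"] for row in delta])
print(next_watermark(delta, last_run).isoformat())
```

Persist the new watermark only after the delta has been indexed and committed, so a failed run is retried from the old value instead of silently skipping documents.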
Commit Strategies
A commit makes indexed documents visible to search. Solr supports three commit types:
| Type | Command | When Data Is Visible | Cost |
|---|---|---|---|
| Hard commit | commit=true | Immediately after commit | High (fsync) |
| Soft commit | softCommit=true | Immediately (memory) | Low (no fsync) |
| Auto commit | autoCommit in solrconfig | At interval or doc count | Configurable |
Auto-commit configuration in solrconfig.xml:
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>true</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
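During bulk indexing, avoid a hard commit on every batch, since each one pays the fsync cost. Send the batches without a commit parameter (or with softCommit=true if the documents must become searchable as you go) and issue one hard commit at the end: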
# Index without committing
curl -X POST -H 'Content-Type: application/json' \
--data-binary @batch-1.json \
http://localhost:8983/solr/mycoll/update
curl -X POST -H 'Content-Type: application/json' \
--data-binary @batch-2.json \
http://localhost:8983/solr/mycoll/update
# Single hard commit at the end
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"commit":{}}' \
http://localhost:8983/solr/mycoll/update
Optimize
Over time the index accumulates segments. Merging them improves search speed:
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"optimize":{"maxSegments":1}}' \
http://localhost:8983/solr/mycoll/update
Run optimize during low-traffic periods. It rewrites the entire index and is I/O intensive.
Batch Indexing with Python (pysolr)
For production pipelines, use the pysolr client:
#!/usr/bin/env python3
"""Batch index documents into Solr using pysolr."""
import json

import pysolr

SOLR_URL = "http://localhost:8983/solr/mycoll"
BATCH_SIZE = 500


def load_documents(path: str) -> list[dict]:
    """Load a JSON array of documents from disk."""
    with open(path) as f:
        return json.load(f)


def batch_index(docs: list[dict]):
    """Send documents in fixed-size batches, then commit once."""
    solr = pysolr.Solr(SOLR_URL, always_commit=False)
    total = len(docs)
    for i in range(0, total, BATCH_SIZE):
        batch = docs[i:i + BATCH_SIZE]
        solr.add(batch)
        print(f"Indexed {min(i + BATCH_SIZE, total)}/{total}")
    solr.commit()
    print("Done. Hard commit issued.")


if __name__ == "__main__":
    docs = load_documents("data/export.json")
    batch_index(docs)
For pipelines without Python, the same batching can be done from a shell script that calls curl in a loop:
#!/bin/bash
# Batch index JSON files with soft commits
set -e
SOLR_URL="http://localhost:8983/solr/mycoll/update"
BATCH_DIR="./batches"
for f in "$BATCH_DIR"/*.json; do
echo "Indexing $f..."
curl -s -X POST -H 'Content-Type: application/json' \
--data-binary @"$f" \
"$SOLR_URL?softCommit=true"
echo
done
# Final hard commit
curl -s -X POST -H 'Content-Type: application/json' \
--data-binary '{"commit":{}}' \
"$SOLR_URL"
echo "All batches committed."
Performance Tuning
Merge Factors
Lucene’s merge policy controls when segments are merged. Tune for indexing throughput vs search performance:
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="segmentsPerTier">10</int>
<double name="deletesPctAllowed">20</double>
<double name="floorSegmentMB">50</double>
<double name="maxMergedSegmentMB">5120</double>
</mergePolicyFactory>
A higher segmentsPerTier (10-20) reduces merge overhead during indexing. Lower it (for example, to 5) for better search performance on an already-built index.
Thread Pools
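Concurrent indexing is bounded by the number of threads working on updates. Older Solr releases exposed this in the indexConfig section of solrconfig.xml:
<int name="maxIndexingThreads">8</int>
<int name="maxIndexingThreadsPerQueue">4</int>
Check the reference guide for your version before relying on these settings: maxIndexingThreads was dropped from recent releases, where indexing concurrency is driven by how many client threads send updates in parallel. Either way, keep the level of parallelism close to your CPU core count, because oversubscribing leads to context-switch overhead.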
Batch Sizing
Send documents in batches of 500-2000 for optimal throughput. Smaller batches increase HTTP overhead; larger batches consume more memory on the Solr JVM. Measure with your data size:
# Test batch of 1000
curl -X POST -H 'Content-Type: application/json' \
--data-binary @batch-1000.json \
http://localhost:8983/solr/mycoll/update?softCommit=true
Monitor the Solr admin UI (localhost:8983/solr) under Plugins / Stats for update handler response times to find your sweet spot.
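One way to find that sweet spot is to time the same document set at several batch sizes. The harness below is a sketch: `chunks` and `measure_throughput` are hypothetical helpers, and the `send` callable is injected so the code runs without a server. In a real test it could be the `add` method of a pysolr client.

```python
import time
from typing import Callable, Iterator


def chunks(docs: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def measure_throughput(
    docs: list[dict],
    batch_size: int,
    send: Callable[[list[dict]], object],
) -> float:
    """Return documents per second achieved when sending in batches."""
    started = time.perf_counter()
    for batch in chunks(docs, batch_size):
        send(batch)
    elapsed = time.perf_counter() - started
    return len(docs) / elapsed if elapsed > 0 else float("inf")


# Stand-in sender so the sketch runs anywhere; swap in a real client call.
docs = [{"id": i} for i in range(2500)]
for size in (500, 1000, 2000):
    rate = measure_throughput(docs, size, send=lambda batch: None)
    print(f"batch={size}: {rate:,.0f} docs/sec")
```

Run each size several times against a realistic data sample, since document size and analysis cost affect the result more than the batch count alone.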
JVM Heap and Caching
Solr’s indexing performance depends heavily on heap allocation:
# Set in bin/solr.in.sh
SOLR_JAVA_MEM="-Xms8g -Xmx8g"
For write-heavy workloads, increase the document cache and disable the query result cache:
<queryResultCache size="0" initialSize="0" autowarmCount="0"/>
<documentCache size="5000" initialSize="5000" autowarmCount="0"/>
Schema Version Management with the Schema API
The Schema API applies each change as soon as it is accepted. Its documented commands (add-field, replace-field, delete-field, and the matching dynamic-field, field-type, and copy-field commands) do not include a rollback, so to undo a change you issue the inverse command or restore the configset from version control. The /schema/version endpoint returns the schema's version attribute (for example 1.6), which identifies the schema syntax version rather than counting modifications:
# Add a new field
curl -s -X POST -H 'Content-Type: application/json' \
--data-binary '{"add-field":{"name":"rating","type":"float","stored":true}}' \
http://localhost:8983/solr/mycoll/schema
# Undo the change with the inverse command
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"delete-field":{"name":"rating"}}' \
http://localhost:8983/solr/mycoll/schema
# Check the schema's version attribute
curl http://localhost:8983/solr/mycoll/schema/version
Each command is validated before it is applied, so an invalid change is rejected with an error. A valid change takes effect immediately, so this is not a dry run:
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"add-field":{"name":"test_field","type":"unknown_type"}}' \
http://localhost:8983/solr/mycoll/schema
# Returns an error because the field type "unknown_type" does not exist
Troubleshooting Indexing Errors
Field Not Found
Error: Document contains field "custom_field" but schema doesn't have it
Fix: Add the field or a matching dynamic field:
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"add-dynamic-field":{"name":"*_field","type":"string","stored":true,"indexed":true}}' \
http://localhost:8983/solr/mycoll/schema
If managed schema is not enabled, add the field to schema.xml manually and reload the core:
curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycoll'
Type Mismatch
Error: Invalid Number: abc123 or expected Date String but got 12345
Fix: Ensure the field value matches the field type. Use the Schema API to check field definitions:
curl http://localhost:8983/solr/mycoll/schema/fields/price
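If the type is wrong, replace the field definition and re-index. The replace-field command swaps the definition in one step (the pfloat type shown here is illustrative; use whichever type fits your data):
curl -X POST -H 'Content-Type: application/json' \
--data-binary '{"replace-field":{"name":"price","type":"pfloat","stored":true}}' \
http://localhost:8983/solr/mycoll/schema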
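Type mismatches are cheaper to catch before a batch is sent. The following sketch checks document values against an expected type map on the client side. The `field_types` mapping, the helper names, and the strict date pattern (whole seconds with a trailing Z) are assumptions for illustration; Solr itself accepts a somewhat wider range of date strings.

```python
from datetime import datetime


def _parses(parser, value) -> bool:
    """True if parser accepts the string form of value."""
    try:
        parser(str(value))
        return True
    except ValueError:
        return False


# Validators keyed by a simplified type name (assumed, not Solr's class names).
CHECKS = {
    "int": lambda v: _parses(int, v),
    "float": lambda v: _parses(float, v),
    "date": lambda v: _parses(
        lambda s: datetime.strptime(s, "%Y-%m-%dT%H:%M:%SZ"), v
    ),
}


def find_type_errors(doc: dict, field_types: dict[str, str]) -> list[str]:
    """Return the names of fields whose values do not fit the expected type."""
    bad = []
    for field, value in doc.items():
        check = CHECKS.get(field_types.get(field, ""))
        if check is None:
            continue  # no rule for this field, so leave it to Solr
        values = value if isinstance(value, list) else [value]
        if not all(check(v) for v in values):
            bad.append(field)
    return bad


field_types = {"price": "int", "created_date": "date", "rating": "float"}
doc = {"id": "1", "price": "abc123", "created_date": 12345, "rating": "4.5"}
print(find_type_errors(doc, field_types))
```

Documents that fail the check can be logged and skipped, so one malformed record does not cause the whole batch to be rejected.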
Document Too Large
Error: Document is too large: the content exceeds the maximum size
Fix: Raise the upload limits on the requestParsers element inside requestDispatcher in solrconfig.xml. The values are in KB and apply to multipart and form-encoded uploads:
<requestDispatcher>
<requestParsers multipartUploadLimitInKB="102400"
formdataUploadLimitInKB="102400"/>
</requestDispatcher>
Or filter out oversized documents from your pipeline:
import json
MAX_DOC_SIZE = 10 * 1024 * 1024  # 10 MB
def filter_large_docs(docs: list[dict]) -> list[dict]:
    # Measure the UTF-8 encoded size in bytes, not the character count
    return [d for d in docs if len(json.dumps(d).encode("utf-8")) < MAX_DOC_SIZE]
409 Conflict on Schema API
Error: Schema modification failed because another operation is in progress
Fix: Wait and retry. Schema operations are serialized. Avoid concurrent schema changes:
# Exponential backoff retry; -f makes curl exit non-zero on HTTP errors such as 409
for i in 1 2 4 8; do
  if curl -sf -X POST ...; then
    break
  fi
  sleep "$i"
done
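The same retry policy can be expressed in Python for pipelines that already use a client library. This is a sketch: `retry_with_backoff` is a hypothetical helper, it retries on any exception for brevity (a real pipeline would narrow that to conflict errors), and the `sleep` function is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    delays: Sequence[float] = (1, 2, 4, 8),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, waiting delays[i] seconds after the i-th failure."""
    for delay in delays:
        try:
            return operation()
        except Exception:  # broad on purpose in this sketch
            sleep(delay)
    # Final attempt: any error now propagates to the caller.
    return operation()


# Demo with a fake operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_schema_change() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("409 Conflict")
    return "ok"


waits: list[float] = []
result = retry_with_backoff(flaky_schema_change, sleep=waits.append)
print(result, waits)
```

With the default delays the helper makes at most five attempts, and the last failure is raised rather than swallowed so the caller can decide what to do next.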
Out of Memory During Indexing
Error: java.lang.OutOfMemoryError: Java heap space
Fix: Reduce batch size, increase heap, or reduce the number of concurrent indexing threads:
SOLR_JAVA_MEM="-Xms16g -Xmx16g"
<int name="maxIndexingThreads">4</int>
Complete End-to-End Walkthrough
# 1. Start Solr in cloud mode
bin/solr start -e cloud -noprompt
# 2. Create a collection
bin/solr create -c sites -s 1 -rf 1
# 3. Define schema fields via Schema API
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field": {"name":"name", "type":"text_general", "stored":true}}' \
http://localhost:8983/solr/sites/schema
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field": {"name":"url", "type":"string", "stored":true}}' \
http://localhost:8983/solr/sites/schema
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-field": {"name":"description", "type":"text_general", "stored":true}}' \
http://localhost:8983/solr/sites/schema
# 4. Add a copy field for catch-all search
curl -X POST -H 'Content-type:application/json' \
--data-binary '{"add-copy-field": {"source":"*","dest":"_text_"}}' \
http://localhost:8983/solr/sites/schema
# 5. Index JSON data
bin/post -c sites ~/data/solr-sites.json
# 6. Verify
curl "http://localhost:8983/solr/sites/select?q=search&wt=json"
The POST tool (bin/post) automatically issues a hard commit. If using curl directly, append ?commit=true to the URL or send a separate commit command.