Indexing Custom Data into Solr Search Engine

Apache Solr is a highly reliable, scalable full-text search platform built on Apache Lucene. Indexing custom data into Solr is the core operation that turns raw data into searchable documents. This guide covers the full pipeline: schema definition, data ingestion methods, indexing strategies, performance tuning, and common pitfalls.

Solr Architecture Overview

Solr organizes data into cores (standalone mode) or collections (SolrCloud mode). A collection is split into shards for horizontal scaling, and each shard can have replicas for high availability and read throughput.

Client → Load Balancer → SolrCloud (ZooKeeper ensemble)
                              │
              ┌───────────────┼───────────────┐
          Shard 1          Shard 2         Shard N
         ┌──┴──┐          ┌──┴──┐          ┌──┴──┐
      Leader R1 R2    Leader R1 R2      Leader R1 R2

Every document indexed into Solr is parsed, analyzed (tokenized, filtered, stemmed), and written to an inverted index on disk. A commit makes the index visible to searches.
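
A minimal round trip, assuming a collection named mycoll already exists (with the default dynamic fields, so *_s maps to a string type), looks like this: index one document with a commit, then query it:

curl -X POST -H 'Content-Type: application/json' \
  --data-binary '[{"id":"1","title_s":"hello"}]' \
  'http://localhost:8983/solr/mycoll/update?commit=true'

curl 'http://localhost:8983/solr/mycoll/select?q=title_s:hello'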

Schema Design

Solr’s schema defines how documents are structured and how fields are analyzed. There are two modes.

Managed Schema vs Classic Schema

| Feature | Managed Schema | Classic schema.xml |
|---|---|---|
| API-driven | Yes (Schema API) | No, file-based |
| Auto field creation | Optional via schemaFactory | Manual only |
| Runtime changes | Yes, no restart required | Requires a core reload |
| Recommended for | New projects, dynamic schemas | Legacy or locked-down environments |

In managed-schema mode you use the Schema API at runtime:

# Add a string field
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field":{"name":"hostname","type":"string","stored":true,"indexed":true}}' \
  http://localhost:8983/solr/mycoll/schema

# Add a multi-valued field
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field":{"name":"tags","type":"string","multiValued":true,"stored":true,"indexed":true}}' \
  http://localhost:8983/solr/mycoll/schema

Field Types

Solr ships with dozens of built-in field types. The most common ones are:

| Field Type | Use Case | Example Data |
|---|---|---|
| text_general | Full-text search with stemming | Article body, descriptions |
| string | Exact match, faceting, sorting | IDs, SKUs, hostnames |
| pint / plong | Numeric range queries | Prices, counts, timestamps |
| pfloat / pdouble | Decimal values | Ratings, coordinates |
| pdate | Date/time queries | Created-at, event date |
| boolean | True/false flags | Published status |
| location | Geo-spatial queries | Latitude, longitude |
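
For example, adding a numeric and a date field with these types via the Schema API (collection name mycoll assumed, as elsewhere in this guide):

# Add a float field for prices/ratings
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field":{"name":"price","type":"pfloat","stored":true,"indexed":true}}' \
  http://localhost:8983/solr/mycoll/schema

# Add a date field for range queries like created_date:[NOW-7DAYS TO NOW]
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field":{"name":"created_date","type":"pdate","stored":true,"indexed":true}}' \
  http://localhost:8983/solr/mycoll/schema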

You can define custom field types in the schema. This payload adds a text type that lowercases and splits on whitespace:

{
  "add-field-type": {
    "name": "my_text",
    "class": "solr.TextField",
    "analyzer": {
      "tokenizer": {"class": "solr.WhitespaceTokenizerFactory"},
      "filters": [
        {"class": "solr.LowerCaseFilterFactory"}
      ]
    }
  }
}
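
Post the payload to the same /schema endpoint used for fields:

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field-type":{"name":"my_text","class":"solr.TextField","analyzer":{"tokenizer":{"class":"solr.WhitespaceTokenizerFactory"},"filters":[{"class":"solr.LowerCaseFilterFactory"}]}}}' \
  http://localhost:8983/solr/mycoll/schema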

Dynamic Fields

Dynamic fields let you index fields whose names you don’t know at schema-design time. They match by suffix or prefix pattern:

# Match any field ending in _s as a string
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-dynamic-field":{"name":"*_s","type":"string","stored":true,"indexed":true}}' \
  http://localhost:8983/solr/mycoll/schema

# Match any field ending in _i as an integer
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-dynamic-field":{"name":"*_i","type":"int","stored":true,"indexed":true}}' \
  http://localhost:8983/solr/mycoll/schema

With these dynamic fields in place, a document containing price_i: 299 and vendor_s: "Acme" will be indexed automatically without explicit field definitions.
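
Concretely, posting such a document requires no schema call at all (price_i matches the *_i rule, vendor_s the *_s rule):

curl -X POST -H 'Content-Type: application/json' \
  --data-binary '[{"id":"3","vendor_s":"Acme","price_i":299}]' \
  'http://localhost:8983/solr/mycoll/update?commit=true'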

Copy Fields

Copy fields populate a single catch-all field from multiple sources so you can search across all fields at once:

# Copy the "name", "url", and "description" fields into "_text_"
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-copy-field":{"source":"name","dest":"_text_"}}' \
  http://localhost:8983/solr/mycoll/schema

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-copy-field":{"source":"url","dest":"_text_"}}' \
  http://localhost:8983/solr/mycoll/schema

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-copy-field":{"source":"description","dest":"_text_"}}' \
  http://localhost:8983/solr/mycoll/schema

You can also copy everything:

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-copy-field":{"source":"*","dest":"_text_"}}' \
  http://localhost:8983/solr/mycoll/schema
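
Note that the destination field must be multiValued when it receives multiple sources; the default catch-all _text_ already is. With these copy rules in place, one query against _text_ searches every copied field:

curl 'http://localhost:8983/solr/mycoll/select?q=_text_:engineering'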

Data Import Methods

Solr accepts data in multiple formats. Choose the one that fits your pipeline.

JSON

JSON is the most developer-friendly format. Each document is a map, and you send an array of documents:

[
  {
    "id": 1,
    "name": "CalmOps Blog",
    "url": "https://calmops.com",
    "description": "Software engineering knowledge hub",
    "tags": ["devops", "backend", "web"],
    "published": true
  },
  {
    "id": 2,
    "name": "ABCD Search",
    "url": "https://abcd.net",
    "description": "A fast privacy-focused search engine",
    "tags": ["search", "privacy"],
    "published": true
  }
]

Post it with the /update handler:

curl -X POST -H 'Content-Type: application/json' \
  --data-binary @documents.json \
  http://localhost:8983/solr/mycoll/update?commit=true

JSON also supports atomic (partial) updates. This payload increments a view counter, replaces the description, and appends a tag:

{
  "id": 1,
  "views_i": {"inc": 1},
  "description": {"set": "Updated description"},
  "tags": {"add": "new-tag"}
}
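
Post it as a normal update; fields not mentioned in the payload are left untouched. (Atomic updates require the document's other fields to be stored or have docValues so Solr can rebuild the document.)

curl -X POST -H 'Content-Type: application/json' \
  --data-binary '[{"id":1,"views_i":{"inc":1},"description":{"set":"Updated description"},"tags":{"add":"new-tag"}}]' \
  'http://localhost:8983/solr/mycoll/update?commit=true'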

CSV

CSV is ideal for tabular data exported from databases or spreadsheets:

curl -X POST -H 'Content-Type: application/csv' \
  --data-binary @documents.csv \
  http://localhost:8983/solr/mycoll/update?commit=true

Example CSV:

id,name,url,description
1,CalmOps Blog,https://calmops.com,Software engineering knowledge hub
2,ABCD Search,https://abcd.net,A fast privacy-focused search engine

Use CSV query parameters to configure parsing: &separator=%09 for TSV, &header=false if there is no header row, &skipLines=1 to ignore leading lines, and &skip=field1,field2 to drop specific columns.
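
For example, loading a tab-separated export with no header row, with column names supplied via the fieldnames parameter:

curl -X POST -H 'Content-Type: application/csv' \
  --data-binary @documents.tsv \
  'http://localhost:8983/solr/mycoll/update?commit=true&separator=%09&header=false&fieldnames=id,name,url,description'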

XML

Solr also accepts the legacy Solr XML format:

<add>
  <doc>
    <field name="id">1</field>
    <field name="name">CalmOps Blog</field>
    <field name="url">https://calmops.com</field>
    <field name="description">Software engineering knowledge hub</field>
    <field name="tags">devops</field>
    <field name="tags">backend</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="name">ABCD Search</field>
    <field name="url">https://abcd.net</field>
    <field name="description">A fast privacy-focused search engine</field>
    <field name="tags">search</field>
  </doc>
</add>

curl -X POST -H 'Content-Type: application/xml' \
  --data-binary @documents.xml \
  http://localhost:8983/solr/mycoll/update?commit=true

Data Import Handler (DIH) for Databases

DIH indexes directly from relational databases via JDBC. Note that DIH was deprecated in Solr 8.6 and removed in Solr 9; on current releases, use an external indexing script or the community-maintained DIH package instead. For Solr 8.x and earlier, configure db-data-config.xml:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5432/myapp"
              user="app_user"
              password="secret"/>
  <document>
    <entity name="article"
            query="SELECT id, title, url, body, created_at FROM articles WHERE published = true">
      <field column="id" name="id"/>
      <field column="title" name="name"/>
      <field column="url" name="url"/>
      <field column="body" name="description"/>
      <field column="created_at" name="created_date"/>
    </entity>
  </document>
</dataConfig>

Register the DIH request handler in solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

Trigger a full import:

curl 'http://localhost:8983/solr/mycoll/dataimport?command=full-import&clean=true&commit=true'
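
A full import runs asynchronously; poll its progress with the status command:

curl 'http://localhost:8983/solr/mycoll/dataimport?command=status'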

Apache Tika for Binary Documents

Solr’s ExtractingRequestHandler (using Apache Tika) indexes binary files (PDF, Word, HTML):

curl -X POST -H 'Content-Type: application/pdf' \
  --data-binary @report.pdf \
  'http://localhost:8983/solr/mycoll/update/extract?literal.id=doc1&literal.name=Q1+Report&commit=true'

Tika extracts text and metadata automatically; the extracted body is indexed into the content field unless you remap it with fmap.content (the default configset remaps it to _text_). Use literal. parameters to add static metadata.
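
To inspect what Tika would extract without indexing anything, the handler supports extractOnly=true:

curl -X POST -H 'Content-Type: application/pdf' \
  --data-binary @report.pdf \
  'http://localhost:8983/solr/mycoll/update/extract?extractOnly=true'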

Import Method Comparison

| Method | Format | Speed | Best For |
|---|---|---|---|
| JSON POST | JSON array | Fast | API integration, partial updates |
| CSV POST | CSV/TSV | Fastest bulk | Tabular exports, spreadsheets |
| XML POST | Solr XML | Moderate | Legacy systems |
| DIH | JDBC | Moderate (row fetch) | Live database sync |
| Tika | Binary | Slow (parsing) | PDFs, Office docs, HTML |
| Post Tool | Any | Fast | Ad-hoc files, development |

Indexing Strategies

Full Index vs Incremental Index

Full index: Wipe the collection and re-index everything. Use for initial loads or when schema changes require re-indexing:

# Delete everything first
curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"delete":{"query":"*:*"}}' \
  http://localhost:8983/solr/mycoll/update?commit=true

# Then re-index
curl -X POST -H 'Content-Type: application/json' \
  --data-binary @full-export.json \
  http://localhost:8983/solr/mycoll/update?commit=true

Incremental index: Index only documents that changed since the last run. Requires a timestamp or version field:

# With DIH: delta-import uses a delta query
curl http://localhost:8983/solr/mycoll/dataimport?command=delta-import

# Manual: send only changed documents
curl -X POST -H 'Content-Type: application/json' \
  --data-binary @delta.json \
  http://localhost:8983/solr/mycoll/update?commit=true

Commit Strategies

A commit makes indexed documents visible to search. Solr supports three commit types:

| Type | Command | When Data Is Visible | Cost |
|---|---|---|---|
| Hard commit | commit=true | Immediately after commit | High (fsync) |
| Soft commit | softCommit=true | Immediately (in memory) | Low (no fsync) |
| Auto commit | autoCommit in solrconfig.xml | At interval or doc count | Configurable |

Auto-commit configuration in solrconfig.xml:

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <!-- false: hard commits only flush to disk; soft commits control visibility -->
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

Use soft commits during bulk indexing to avoid the fsync overhead of hard commits on every batch. Issue one hard commit at the end:

# Index without committing
curl -X POST -H 'Content-Type: application/json' \
  --data-binary @batch-1.json \
  http://localhost:8983/solr/mycoll/update

curl -X POST -H 'Content-Type: application/json' \
  --data-binary @batch-2.json \
  http://localhost:8983/solr/mycoll/update

# Single hard commit at the end
curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"commit":{}}' \
  http://localhost:8983/solr/mycoll/update
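
Alternatively, the commitWithin parameter (in milliseconds) asks Solr to make documents visible within a time window, without an explicit commit per batch:

curl -X POST -H 'Content-Type: application/json' \
  --data-binary @batch-1.json \
  'http://localhost:8983/solr/mycoll/update?commitWithin=10000'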

Optimize

Over time the index accumulates segments. Merging them improves search speed:

curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"optimize":{"maxSegments":1}}' \
  http://localhost:8983/solr/mycoll/update

Run optimize during low-traffic periods. It rewrites the entire index and is I/O intensive.

Batch Indexing with Python (pysolr)

For production pipelines, use the pysolr client:

#!/usr/bin/env python3
"""Batch index documents into Solr using pysolr."""

import pysolr
import json
from pathlib import Path

SOLR_URL = "http://localhost:8983/solr/mycoll"
BATCH_SIZE = 500

def load_documents(path: str) -> list[dict]:
    with open(path) as f:
        return json.load(f)

def batch_index(docs: list[dict]):
    solr = pysolr.Solr(SOLR_URL, always_commit=False)
    total = len(docs)
    for i in range(0, total, BATCH_SIZE):
        batch = docs[i:i + BATCH_SIZE]
        solr.add(batch)
        print(f"Indexed {min(i + BATCH_SIZE, total)}/{total}")
    solr.commit()
    print("Done. Hard commit issued.")

if __name__ == "__main__":
    docs = load_documents("data/export.json")
    batch_index(docs)

Or, without a Python dependency, send documents from a shell script using curl in a loop:

#!/bin/bash
# Batch index JSON files with soft commits
set -e

SOLR_URL="http://localhost:8983/solr/mycoll/update"
BATCH_DIR="./batches"

for f in "$BATCH_DIR"/*.json; do
  echo "Indexing $f..."
  curl -s -X POST -H 'Content-Type: application/json' \
    --data-binary @"$f" \
    "$SOLR_URL?softCommit=true"
  echo
done

# Final hard commit
curl -s -X POST -H 'Content-Type: application/json' \
  --data-binary '{"commit":{}}' \
  "$SOLR_URL"
echo "All batches committed."

Performance Tuning

Merge Factors

Lucene’s merge policy controls when segments are merged. Tune for indexing throughput vs search performance:

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="segsPerTier">10</int>
  <double name="deletesPctAllowed">20</double>
  <double name="floorSegmentMB">50</double>
  <double name="maxMergedSegmentMB">5120</double>
</mergePolicyFactory>

Higher segsPerTier (10-20) reduces merge overhead during indexing. Lower it (5) for better search performance on an already-built index.

Thread Pools

Older Solr releases exposed an indexing thread setting in the <indexConfig> section of solrconfig.xml:

<indexConfig>
  <maxIndexingThreads>8</maxIndexingThreads>
</indexConfig>

That setting has since been removed; on modern Solr, indexing concurrency is driven from the client side by the number of parallel connections you open against the /update handler (or the thread count of a SolrJ ConcurrentUpdateSolrClient). Match the parallelism to the CPU core count of the Solr nodes; oversubscribing leads to context-switch overhead.
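
A simple way to drive parallel indexing from the client is xargs -P; this sketch (assuming pre-split batch files under ./batches, as in the earlier script) uploads four batches at a time:

ls ./batches/*.json | xargs -P 4 -I{} \
  curl -s -X POST -H 'Content-Type: application/json' \
    --data-binary @{} \
    'http://localhost:8983/solr/mycoll/update'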

Batch Sizing

Send documents in batches of 500-2000 for optimal throughput. Smaller batches increase HTTP overhead; larger batches consume more memory on the Solr JVM. Measure with your data size:

# Test batch of 1000
curl -X POST -H 'Content-Type: application/json' \
  --data-binary @batch-1000.json \
  http://localhost:8983/solr/mycoll/update?softCommit=true

Monitor the Solr admin UI (localhost:8983/solr) under Plugins / Stats for update handler response times to find your sweet spot.
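
To compare batch sizes empirically, time each upload; this sketch assumes files batch-100.json, batch-500.json, and so on, split from the same export:

for size in 100 500 1000 2000; do
  echo "Batch size $size:"
  time curl -s -o /dev/null -X POST -H 'Content-Type: application/json' \
    --data-binary @"batch-$size.json" \
    'http://localhost:8983/solr/mycoll/update?softCommit=true'
done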

JVM Heap and Caching

Solr’s indexing performance depends heavily on heap allocation:

# Set in bin/solr.in.sh
SOLR_JAVA_MEM="-Xms8g -Xmx8g"

For write-heavy workloads, increase the document cache and disable the query result cache:

<queryResultCache size="0" initialSize="0" autowarmCount="0"/>
<documentCache size="5000" initialSize="5000" autowarmCount="0"/>

Schema Version Management with the Schema API

The Schema API does not version or roll back changes for you, so treat the schema as code: keep every add-field/replace-field payload in source control and re-apply payloads to move between versions. Solr does track a schema version number that it bumps on every modification, which you can read to confirm a change landed:

# Add a new field
curl -s -X POST -H 'Content-Type: application/json' \
  --data-binary '{"add-field":{"name":"rating","type":"pfloat","stored":true}}' \
  http://localhost:8983/solr/mycoll/schema

# Check the current schema version
curl http://localhost:8983/solr/mycoll/schema/version

To revert a change, re-apply the previous definition with replace-field, or remove the field entirely:

curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"delete-field":{"name":"rating"}}' \
  http://localhost:8983/solr/mycoll/schema

Invalid changes are rejected outright and leave the schema untouched, which doubles as a cheap validation check:

curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"add-field":{"name":"test_field","type":"unknown_type"}}' \
  http://localhost:8983/solr/mycoll/schema
# Returns an error: field type 'unknown_type' not found

Troubleshooting Indexing Errors

Field Not Found

Error: Document contains field "custom_field" but schema doesn't have it

Fix: Add the field or a matching dynamic field:

curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"add-dynamic-field":{"name":"*_field","type":"string","stored":true,"indexed":true}}' \
  http://localhost:8983/solr/mycoll/schema

If managed schema is not enabled, add the field to schema.xml manually and reload the core:

curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=mycoll'

Type Mismatch

Error: Invalid Number: abc123 or expected Date String but got 12345

Fix: Ensure the field value matches the field type. Use the Schema API to check field definitions:

curl http://localhost:8983/solr/mycoll/schema/fields/price

If the type is wrong, fix it with replace-field (or delete-field followed by add-field) and re-index the affected documents:

curl -X POST -H 'Content-Type: application/json' \
  --data-binary '{"delete-field":{"name":"price"}}' \
  http://localhost:8983/solr/mycoll/schema

Document Too Large

Error: Document is too large: the content exceeds the maximum size

Fix: Raise the upload limits in the <requestDispatcher> section of solrconfig.xml (values are in KB):

<requestDispatcher>
  <requestParsers multipartUploadLimitInKB="102400"
                  formdataUploadLimitInKB="102400"/>
</requestDispatcher>

Or filter out oversized documents from your pipeline:

import json

MAX_DOC_SIZE = 10 * 1024 * 1024  # 10 MB

def filter_large_docs(docs: list[dict]) -> list[dict]:
    # Keep only documents whose serialized JSON size is under the cap
    return [d for d in docs if len(json.dumps(d)) < MAX_DOC_SIZE]

409 Conflict on Schema API

Error: Schema modification failed because another operation is in progress

Fix: Wait and retry. Schema operations are serialized. Avoid concurrent schema changes:

# Exponential backoff retry
for i in 1 2 4 8; do
  if curl -sf -X POST ...; then
    break
  fi
  sleep $i
done

Out of Memory During Indexing

Error: java.lang.OutOfMemoryError: Java heap space

Fix: Reduce batch size, increase heap, or lower the client-side indexing parallelism:

# In bin/solr.in.sh
SOLR_JAVA_MEM="-Xms16g -Xmx16g"

Complete End-to-End Walkthrough

# 1. Start Solr in cloud mode
bin/solr start -e cloud -noprompt

# 2. Create a collection
bin/solr create -c sites -s 1 -rf 1

# 3. Define schema fields via Schema API
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field": {"name":"name", "type":"text_general", "stored":true}}' \
  http://localhost:8983/solr/sites/schema

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field": {"name":"url", "type":"string", "stored":true}}' \
  http://localhost:8983/solr/sites/schema

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field": {"name":"description", "type":"text_general", "stored":true}}' \
  http://localhost:8983/solr/sites/schema

# 4. Add a copy field for catch-all search
curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-copy-field": {"source":"*","dest":"_text_"}}' \
  http://localhost:8983/solr/sites/schema

# 5. Index JSON data
bin/post -c sites ~/data/solr-sites.json

# 6. Verify
curl "http://localhost:8983/solr/sites/select?q=search&wt=json"

The POST tool (bin/post) automatically issues a hard commit. If using curl directly, append ?commit=true to the URL or send a separate commit command.
