
Meilisearch Internals: Search Engine Architecture and Algorithms

Introduction

Understanding how Meilisearch works internally helps developers make better architectural decisions, optimize search relevance, and troubleshoot issues. While Meilisearch is designed to be simple from a user’s perspective, its internals represent sophisticated engineering that delivers the instant search experiences users expect.

This article explores Meilisearch’s architecture, from how documents are stored and indexed to how queries are processed and ranked. We will examine the key data structures, algorithms, and optimizations that make Meilisearch fast and reliable.

Architecture Overview

Meilisearch is built in Rust, a systems programming language known for its performance and memory safety. This choice of language contributes significantly to Meilisearch’s speed and low resource usage.

High-Level Architecture

Meilisearch’s architecture consists of several key components working together:

  1. HTTP Server - Handles incoming requests and responses
  2. Index Engine - Manages document indexing and storage
  3. Search Engine - Processes queries and returns results
  4. Task System - Handles asynchronous operations
  5. Storage Layer - Manages persistent data

The HTTP server runs asynchronously using tokio, Rust’s async runtime, allowing it to handle many concurrent connections efficiently. When a request arrives, it is processed by the appropriate component based on the endpoint.

Document Flow

Understanding how documents flow through Meilisearch clarifies how the pieces of the system fit together:

  1. Ingestion - Documents are received via the API
  2. Parsing - Documents are parsed and validated
  3. Tokenization - Text fields are tokenized into searchable terms
  4. Indexing - Tokens are added to the inverted index
  5. Storage - Documents and index data are persisted

This pipeline is designed to be efficient while maintaining data consistency. Writes are handled asynchronously: documents are accepted immediately, enqueued as indexing tasks, and become searchable as soon as their task completes, which keeps write throughput high.

The Inverted Index

The inverted index is the fundamental data structure that enables fast full-text search. Unlike a forward index that maps documents to terms, an inverted index maps terms to documents.

How Inverted Index Works

In a traditional database, searching for documents containing a word requires scanning every document. With an inverted index, Meilisearch can instantly locate all documents containing a term.

Consider these documents:

Doc 1: "The quick brown fox"
Doc 2: "The lazy dog"
Doc 3: "A quick fox"

The inverted index would look like:

"the"     -> [Doc 1, Doc 2]
"quick"   -> [Doc 1, Doc 3]
"brown"   -> [Doc 1]
"fox"     -> [Doc 1, Doc 3]
"lazy"    -> [Doc 2]
"dog"     -> [Doc 2]
"a"       -> [Doc 3]

When searching for “quick fox”, Meilisearch:

  1. Looks up “quick” in the inverted index -> [Doc 1, Doc 3]
  2. Looks up “fox” in the inverted index -> [Doc 1, Doc 3]
  3. Intersects the results -> [Doc 1, Doc 3]

This intersection operation is extremely fast, typically taking microseconds even for large document sets.
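The build-and-intersect flow above can be sketched in a few lines of Python. This is a toy model to make the idea concrete, not Meilisearch's actual Rust implementation:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Intersect the postings lists of every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {
    1: "The quick brown fox",
    2: "The lazy dog",
    3: "A quick fox",
}
index = build_inverted_index(docs)
print(sorted(search(index, "quick fox")))  # [1, 3]
```

Note that lookup cost depends on the size of the postings lists, not the number of documents, which is why intersections stay fast as the corpus grows.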

Index Structure

Meilisearch stores the inverted index efficiently using various compression techniques:

  • Postings lists - Lists of document IDs for each term
  • Term dictionary - Sorted list of all unique terms
  • Term frequencies - How often each term appears in each document
  • Positions - Where each term appears in each document

This structure allows Meilisearch to quickly find matching documents and calculate relevance scores.

Segment-Based Indexing

Meilisearch uses a segment-based approach to indexing. New documents are added to an in-memory segment, and periodically, the segment is flushed to disk as an immutable segment file.

Benefits of segment-based indexing include:

  1. Fast writes - New documents are added to memory
  2. Crash safety - Periodic flushing to disk
  3. Concurrency - Multiple segments can be searched in parallel
  4. Compaction - Segments can be merged to optimize performance

When searching, Meilisearch queries all relevant segments and merges the results, presenting a unified view to the user.
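The merge step can be sketched with Python's heapq.merge, which lazily interleaves already-sorted per-segment result streams. This is illustrative only; the real engine does the equivalent in Rust over its own iterators:

```python
import heapq
from itertools import islice

def merge_segment_results(segment_hits, limit):
    """Merge per-segment (score, doc_id) lists, each already sorted by
    descending score, into a single top-`limit` list."""
    # heapq.merge consumes the sorted iterators lazily, so the full union
    # of all segments is never materialized in memory.
    merged = heapq.merge(*segment_hits, key=lambda hit: -hit[0])
    return list(islice(merged, limit))

seg_a = [(0.9, 11), (0.4, 12)]
seg_b = [(0.7, 21), (0.2, 22)]
print(merge_segment_results([seg_a, seg_b], 3))
# [(0.9, 11), (0.7, 21), (0.4, 12)]
```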

BM25 Ranking Algorithm

Meilisearch uses BM25 (Best Matching 25) as its primary relevance ranking algorithm. BM25 is a probabilistic ranking function that has proven effective over decades of information retrieval research.

Understanding BM25

BM25 calculates a relevance score based on:

  1. Term Frequency (TF) - How often does the term appear in the document?
  2. Inverse Document Frequency (IDF) - How rare or common is the term across all documents?
  3. Document Length - Are we comparing fairly across different document lengths?

The BM25 formula:

score(Q, D) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

Where:

  • Q is the query
  • D is the document
  • qi is a query term
  • f(qi, D) is the term frequency
  • |D| is the document length
  • avgdl is the average document length
  • k1 and b are tuning parameters (typically k1=1.5, b=0.75)

Why BM25?

BM25 remains popular because it:

  • Works well without extensive tuning
  • Handles term frequency saturation (diminishing returns for repeated terms)
  • Accounts for document length normalization
  • Has solid theoretical foundation in probabilistic relevance

BM25 in Meilisearch

Meilisearch implements BM25 with some optimizations:

# Conceptual BM25 implementation
def bm25_score(doc_terms, query_terms, idf, avg_doc_len, k1=1.5, b=0.75):
    score = 0.0
    doc_len = len(doc_terms)
    
    for term in query_terms:
        if term not in idf:
            continue
            
        tf = doc_terms.count(term)
        idf_weight = idf[term]
        
        # BM25 formula component
        numerator = tf * (k1 + 1)
        denominator = tf + k1 * (1 - b + b * (doc_len / avg_doc_len))
        
        score += idf_weight * (numerator / denominator)
    
    return score
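The bm25_score sketch above takes an idf mapping as input. One common way to compute it is the smoothed BM25 IDF shown below; this particular variant is an assumption for illustration, as the article does not specify which form Meilisearch uses:

```python
import math

def compute_idf(corpus):
    """BM25-style IDF: rarer terms get larger weights.
    Uses the common +1 smoothing so every weight stays positive."""
    n_docs = len(corpus)
    df = {}  # document frequency: in how many docs does each term appear?
    for doc_terms in corpus:
        for term in set(doc_terms):
            df[term] = df.get(term, 0) + 1
    return {
        term: math.log(1 + (n_docs - freq + 0.5) / (freq + 0.5))
        for term, freq in df.items()
    }

corpus = [
    ["quick", "brown", "fox"],
    ["lazy", "dog"],
    ["quick", "fox"],
]
idf = compute_idf(corpus)
# "quick" appears in 2 of 3 docs, so its weight is lower than "dog" (1 of 3)
```

The resulting dictionary can be passed directly as the idf argument of bm25_score.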

Tuning Relevance in Practice

Meilisearch does not expose BM25's k1 and b parameters directly; relevance behavior is instead tuned through its ranking rules:

curl -X PATCH 'http://localhost:7700/indexes/books/settings' \
  -H 'Authorization: Bearer your_master_key' \
  -H 'Content-Type: application/json' \
  -d '{
    "rankingRules": [
      "words",
      "typo",
      "proximity",
      "attribute",
      "sort",
      "exactness"
    ]
  }'

The “words” rule ranks documents by how many query terms they match, with documents matching more terms ranked first.

Tokenization

Tokenization is the process of converting raw text into searchable tokens. This step is crucial for search quality and is one of the areas where Meilisearch excels.

How Tokenization Works

Meilisearch’s tokenizer performs several operations:

  1. Normalization - Convert to lowercase, remove accents
  2. Segmentation - Split text into words
  3. Filtering - Remove stop words and punctuation
  4. Stemming - Reduce words to their root form (language-specific)

For example, the text “The Quick Brown Fox!” might be tokenized to:

["quick", "brown", "fox"]
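A toy version of this pipeline, assuming Latin-script text and an illustrative three-word stop list (Meilisearch's real tokenizer is far more sophisticated):

```python
import re
import unicodedata

STOP_WORDS = {"the", "a", "an"}  # illustrative subset, not Meilisearch's list

def tokenize(text):
    # Normalize: case-fold, then strip combining accent marks (café -> cafe)
    text = unicodedata.normalize("NFKD", text.casefold())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Segment: split on anything that is not a letter or digit
    tokens = re.findall(r"[a-z0-9]+", text)
    # Filter: drop stop words
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(tokenize("The Quick Brown Fox!"))  # ['quick', 'brown', 'fox']
print(tokenize("Café"))                  # ['cafe']
```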

Normalization

Normalization ensures that variations of a word match each other:

  • Case folding - “Search” becomes “search”
  • Accent removal - “cafรฉ” becomes “cafe”
  • Character folding - Various Unicode forms are standardized

This means searches for “Search” will find documents containing “search”, “SEARCH”, or “sÉArch”.

Language-Specific Processing

Meilisearch handles multiple languages by detecting the script and language of each field automatically and applying language-specific segmentation (for example, word segmentation for Japanese and Chinese, which are written without spaces). Language handling can be refined through index settings such as stop words:

# Configure French stop words
curl -X PATCH 'http://localhost:7700/indexes/books/settings' \
  -H 'Authorization: Bearer your_master_key' \
  -H 'Content-Type: application/json' \
  -d '{
    "stopWords": ["le", "la", "les", "de", "un", "une"]
  }'

For optimal results, configure stop words and synonyms that match your content language.

N-grams

Meilisearch matches partial words out of the box: the last word of each query is treated as a prefix, which is what makes search-as-you-type work without any configuration. Internally, the engine also combines adjacent query words into n-grams, so split and concatenated spellings (for example “new york” and “newyork”) can still match each other.

Caching Strategies

Caching is essential for Meilisearch’s fast response times. The search engine employs multiple caching strategies.

Query Cache

Recently executed queries are fast to repeat: the index pages they touch remain in the operating system's page cache, so a repeated query such as “gatsby” is answered from warm memory rather than disk. This cache stays fresh automatically, since updated pages replace stale ones as documents change.

Filter and Sort Caches

Results of expensive operations like filtering and sorting are cached:

  • Filter results are cached per unique filter combination
  • Sort operations are cached and reused
  • Facet distributions are cached until data changes

Cache Invalidation

Meilisearch automatically invalidates caches when:

  • New documents are added to an index
  • Documents are updated or deleted
  • Settings change that affect search results

This automatic invalidation ensures users always see current results without manual intervention.

Memory Management

Efficient memory usage is crucial for performance. Meilisearch manages memory through several mechanisms.

Memory-Mapped Files

Meilisearch uses memory-mapped files (mmap) for index data:

  1. Index files are mapped to virtual memory
  2. The operating system handles paging
  3. Memory is allocated on-demand as pages are accessed

Benefits include:

  • Fast startup times
  • Efficient memory usage across processes
  • Operating system manages caching
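The on-demand paging described above can be demonstrated with Python's mmap module. This is a toy illustration of the mechanism, not Meilisearch's storage code:

```python
import mmap
import os
import tempfile

# Write a small stand-in "index file" to disk
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"term:quick|docs:1,3")

# Map it into virtual memory; pages are faulted in only when touched
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing reads through the page cache; no explicit read() call
        snippet = mm[5:10].decode()

os.remove(path)
print(snippet)  # quick
```

The same principle lets Meilisearch open very large indexes instantly: nothing is loaded until a query actually touches it.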

In-Memory Structures

Some data structures are kept entirely in memory:

  • Filter caches
  • Recent query caches
  • Segment metadata

This in-memory storage provides microsecond access times for frequently used data.

Resource Limits

Configure resource limits to control memory usage:

MEILI_MAX_INDEXING_MEMORY=2GiB
MEILI_MAX_INDEXING_THREADS=4

These settings prevent Meilisearch from consuming excessive resources.

Query Processing Pipeline

Understanding how queries are processed helps optimize search performance.

Query Processing Stages

When a search request arrives, Meilisearch processes it through several stages:

  1. Parse - Parse the query string
  2. Tokenize - Convert query to tokens
  3. Plan - Determine which indexes and segments to search
  4. Execute - Retrieve matching documents from each segment
  5. Rank - Calculate relevance scores
  6. Merge - Combine results from all segments
  7. Format - Apply highlighting, truncation, pagination

Each stage is optimized for speed, and the entire pipeline typically completes in under 50 milliseconds.
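The stages above can be compressed into a toy end-to-end pipeline. Scoring here is just the count of matched terms, standing in for the full ranking-rule evaluation:

```python
def process_query(query, index, limit=10, offset=0):
    """Toy query pipeline: parse, tokenize, look up postings, rank, paginate."""
    # 1-2. Parse and tokenize the query string
    terms = query.lower().split()
    # 3-4. Plan and execute: fetch postings for each term
    scores = {}
    for term in terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # 5-6. Rank and merge: more matched terms first, doc id as tie-breaker
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    # 7. Format: apply pagination
    return ranked[offset:offset + limit]

index = {"quick": [1, 3], "fox": [1, 3], "lazy": [2]}
print(process_query("quick lazy fox", index))  # [1, 3, 2]
```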

Parallel Execution

Meilisearch parallelizes query execution:

  1. Segment Parallelism - Multiple segments are searched simultaneously
  2. Thread Pool - Search threads are managed in a pool
  3. Async I/O - Non-blocking operations maximize throughput

Configure parallelism:

MEILI_MAX_SEARCH_THREADS=4

Early Termination

To maintain fast response times, Meilisearch can terminate searches early:

  • Stop searching additional segments once enough results are found
  • Limit the number of documents scored
  • Apply pagination limits

This ensures that even complex queries return quickly.
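Limiting the number of documents scored is commonly done with a bounded min-heap, sketched here (not Meilisearch's actual code): only the current best k hits are kept while scanning, so sort cost and memory stay proportional to k rather than to the full candidate set.

```python
import heapq

def top_k(scored_docs, k):
    """Keep only the k best (score, doc_id) hits while scanning."""
    heap = []  # min-heap of the best k hits seen so far
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # New hit beats the current worst of the top k: swap it in
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

hits = [(0.2, 1), (0.9, 2), (0.5, 3), (0.7, 4)]
print(top_k(hits, 2))  # [(0.9, 2), (0.7, 4)]
```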

Typo Tolerance Implementation

Typo tolerance is one of Meilisearch’s signature features. Let’s examine how it works internally.

Edit Distance

Meilisearch calculates the edit distance (Levenshtein distance) between query terms and indexed terms:

levenshtein("gats", "gatsby") = 2    (two insertions)
levenshtein("gatsby", "gats") = 2    (the distance is symmetric)
levenshtein("gatzby", "gatsby") = 1  (one substitution)
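A direct dynamic-programming sketch of the distance (production engines typically precompile Levenshtein automata over the term dictionary rather than running this computation per term):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    # Classic DP formulation, keeping only one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

print(levenshtein("gats", "gatsby"))     # 2
print(levenshtein("kitten", "sitting"))  # 3
```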

Tolerance Configuration

The typo tolerance algorithm considers:

  • Word length - Longer words allow more typos
  • Edit distance - Number of insertions, deletions, or substitutions
  • Character swaps - Transposition of adjacent characters

Default settings:

  • Words of 1-4 characters: 0 typos allowed
  • Words of 5-8 characters: 1 typo allowed
  • Words of 9 or more characters: 2 typos allowed

These cutoffs correspond to the minWordSizeForTypos setting, with oneTypo=5 and twoTypos=9 by default.
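The length-based budget maps directly to a small function. The two cutoffs correspond to Meilisearch's minWordSizeForTypos setting (oneTypo / twoTypos) and are parameterized here since they are configurable:

```python
def allowed_typos(word, one_typo=5, two_typos=9):
    """How many typos are tolerated for a query word of this length.
    Cutoffs mirror the minWordSizeForTypos.oneTypo/.twoTypos settings."""
    if len(word) < one_typo:
        return 0
    if len(word) < two_typos:
        return 1
    return 2

print(allowed_typos("cat"))          # 0
print(allowed_typos("gatsby"))       # 1
print(allowed_typos("levenshtein"))  # 2
```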

Matching with Typos

When processing a query like “gatsby”:

  1. Meilisearch looks for exact matches first
  2. If no matches, it searches for terms within the allowed edit distance
  3. Results are ranked by relevance, with exact matches ranking higher

This provides the “search-as-you-type” experience users expect.

Ranking Customization

Meilisearch’s ranking can be customized to match specific requirements.

Ranking Rules

The ranking rules control how results are ordered:

curl -X PATCH 'http://localhost:7700/indexes/books/settings' \
  -H 'Authorization: Bearer your_master_key' \
  -H 'Content-Type: application/json' \
  -d '{
    "rankingRules": [
      "words",
      "typo",
      "proximity",
      "attribute",
      "sort",
      "exactness"
    ]
  }'

The rules apply in order: "words" (more matched query terms first), "typo" (fewer typos preferred), "proximity" (matched words closer together), "attribute" (matches in higher-priority attributes), "sort" (user-defined sort), and "exactness" (exact matches preferred).

Attribute Priority

The order of searchableAttributes determines how heavily each attribute counts in the "attribute" ranking rule:

curl -X PATCH 'http://localhost:7700/indexes/books/settings' \
  -H 'Authorization: Bearer your_master_key' \
  -H 'Content-Type: application/json' \
  -d '{
    "searchableAttributes": [
      "title",
      "author",
      "description"
    ]
  }'

With this order, a match in title outranks a match in author, which in turn outranks a match in description.

Tie Breaking

When results have equal relevance scores, tie-breaking rules determine the order:

curl -X PATCH 'http://localhost:7700/indexes/books/settings' \
  -H 'Authorization: Bearer your_master_key' \
  -H 'Content-Type: application/json' \
  -d '{
    "rankingRules": [
      "words",
      "typo",
      "proximity",
      "attribute",
      "sort:publication_year:desc",
      "exactness"
    ]
  }'

Performance Characteristics

Understanding Meilisearch’s performance characteristics helps with capacity planning.

Latency

Meilisearch provides consistent, low-latency search:

  • Typical queries: 1-10ms
  • Complex filtered queries: 10-50ms
  • Large paginated result sets: 50-100ms

These latencies stay largely flat as an index grows, because lookups go through the inverted index rather than scanning documents.

Throughput

Search throughput depends on hardware and query complexity:

  • Simple queries: 10,000+ queries per second
  • Complex queries: 1,000-5,000 queries per second
  • Indexing: 10,000-50,000 documents per second

Scalability

Meilisearch scales well within a single node:

  • Documents: Tested up to 10 million documents
  • Index size: Up to 100GB per index
  • Memory: Requires approximately 10-20% of index size

For larger deployments, consider sharding across multiple instances.

Debugging and Diagnostics

Understanding internals helps with troubleshooting.

Task Inspection

Monitor indexing tasks:

curl -X GET 'http://localhost:7700/tasks' \
  -H 'Authorization: Bearer your_master_key'

Index Statistics

Get detailed index statistics:

curl -X GET 'http://localhost:7700/indexes/books/stats' \
  -H 'Authorization: Bearer your_master_key'

Returns:

  • Number of documents
  • Indexing status
  • Field distribution
  • Index size

Logging

Enable detailed logging for debugging:

MEILI_LOG_LEVEL=debug
MEILI_LOG_FILE=/var/log/meilisearch.log


Conclusion

Meilisearch’s internals represent sophisticated engineering that delivers the fast, relevant search experiences users expect. The combination of Rust’s performance, the inverted index structure, BM25 ranking, and smart tokenization creates a search engine that excels in both speed and relevance.

Understanding these internals helps developers:

  • Make better configuration decisions
  • Optimize search relevance
  • Troubleshoot issues effectively
  • Plan capacity appropriately

In the next article, we will explore Meilisearch’s recent developments and what to expect in 2025-2026.
