Some Text Retrieval Toolkits

Open Source Search Engines

MeiliSearch
Easy to use and deploy search engine. Written in Rust.

MeiliSearch is a powerful, fast, open-source, easy to use and deploy search engine. Both searching and indexing are highly customizable. Features such as typo-tolerance, filters, and synonyms are provided out-of-the-box.

  • Search-as-you-type experience (answers in under 50 milliseconds)
  • Full-text search
  • Typo tolerant (understands typos and misspellings)
  • Faceted search and filters
  • Supports hanzi (Chinese characters)
  • Supports synonyms
  • Easy to install, deploy, and maintain
  • Whole documents are returned
  • Highly customizable
  • RESTful API
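
As a rough illustration of the RESTful API, the sketch below builds (but does not send) a search request using only the Python standard library. The host `http://localhost:7700`, the index name `movies`, the `genres` attribute, and the API key are hypothetical placeholders; the `filter` field follows the syntax of recent MeiliSearch releases, so check it against the version you deploy.

```python
import json
import urllib.request


def build_search_request(base_url, index_uid, query,
                         filter_expr=None, limit=20, api_key=None):
    """Build (without sending) a POST request for the index search route."""
    body = {"q": query, "limit": limit}
    if filter_expr is not None:
        body["filter"] = filter_expr  # e.g. "genres = Action"
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=f"{base_url}/indexes/{index_uid}/search",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical server, index, filter attribute, and key, for illustration only.
req = build_search_request(
    "http://localhost:7700", "movies", "botman",
    filter_expr="genres = Action", limit=5, api_key="hypothetical-key",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it to a running server. The misspelled query `botman` is deliberate: typo tolerance is handled on the server side, so the client sends the text as typed.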

Apache Nutch

Web-search software project.
Apache Nutch is a highly extensible and scalable open source web crawler software project. Stemming from Apache Lucene, the project comprises two codebases, namely:

Nutch 1.x (ACTIVE): A mature, production-ready crawler. 1.x enables fine-grained configuration and relies on Apache Hadoop data structures, which are well suited to batch processing.

Nutch 2.x (INACTIVE): An alternative that takes direct inspiration from 1.x but differs in one key area: storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent mappings. This makes it possible to implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in a number of NoSQL storage solutions. No more releases or bug fixes are anticipated for this codebase.

Xapian
Probabilistic information retrieval library.
Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It has built-in support for several families of weighting models and also supports a rich set of boolean query operators.
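
To give a feel for what a probabilistic weighting model computes, here is a minimal sketch of textbook BM25 in plain Python. It is not Xapian's code and does not use Xapian's API; the function name, the toy documents, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, already tokenized on whitespace.
docs = [
    "xapian is a search library".split(),
    "lucene is a search library written in java".split(),
    "a recipe for apple pie".split(),
]
scores = bm25_scores(["xapian", "search"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # → [0, 1, 2]
```

The first document matches both query terms and ranks first; the third matches none and scores zero. A boolean operator, by contrast, only decides whether a document matches at all and contributes nothing to the ordering.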

Features

  • Written in C++
  • Highly portable
  • Ranked search (so the most relevant documents are more likely to come near the top of the results list) with built-in support for multiple models from the Probabilistic, Divergence from Randomness, and Language Modelling families of weighting models. Custom user-supplied weighting models are also supported.
  • Boolean search operators are supported.
  • Wildcard search is supported (e.g. “xap*”)
  • Synonyms are supported, both explicitly (e.g. “~cash”) and as an automatic form of query expansion.
  • Snippets can be generated dynamically from matching documents, with matching words, phrases, and wildcards highlighted.
  • Xapian can suggest spelling corrections for user-supplied queries. This is based on words which occur in the data being indexed, so it works even for words which wouldn’t be found in a dictionary (e.g. “xapian” would be suggested as a correction for “xapain”).
  • Supports database files larger than 2 GB, which is essential for scaling to large document collections.
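
The idea behind index-based spelling correction in the list above can be sketched in plain Python. This illustrates the concept only and is not Xapian's implementation or API: the helper names, the `max_distance` threshold, and the term frequencies are made up, and the distance used is a simple edit distance that counts an adjacent transposition as one edit.

```python
def edit_distance(a, b):
    """Edit distance with insert, delete, substitute, and adjacent transpose."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def suggest(word, term_freqs, max_distance=2):
    """Return the closest indexed term, or None if no correction applies."""
    if word in term_freqs:
        return None  # already an indexed term, nothing to correct
    best = None
    for term, freq in term_freqs.items():
        dist = edit_distance(word, term)
        if dist > max_distance:
            continue
        key = (dist, -freq)  # prefer smaller distance, then higher frequency
        if best is None or key < best[0]:
            best = (key, term)
    return best[1] if best else None


# Hypothetical term frequencies collected from the indexed documents.
term_freqs = {"xapian": 12, "search": 40, "index": 25, "library": 7}
print(suggest("xapain", term_freqs))  # → xapian
```

Because the candidates come from the indexed vocabulary rather than a dictionary, a project-specific word such as “xapian” can be suggested as soon as it appears in the data.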

See the full feature list for more.

Typesense
Fast, typo-tolerant search engine. Written in C++. Worth trying.

  • Typo Tolerance: Handles typographical errors elegantly, out-of-the-box.
  • Simple and Delightful: Simple to set up, integrate with, operate and scale.
  • Blazing Fast: Built in C++. Meticulously architected from the ground up for low-latency (<50ms) instant searches.
  • Tunable Ranking: Easy to tailor your search results to perfection.
  • Sorting: Sort results based on a particular field at query time (helpful for features like “Sort by Price (asc)”).
  • Faceting & Filtering: Drill down and refine results.
  • Grouping & Distinct: Group similar results together to show more variety.
  • Federated Search: Search across multiple collections (indices) in a single HTTP request.
  • Geo Search: Search and sort by results around a geographic location.
  • Scoped API Keys: Generate API keys that only allow access to certain records, for multi-tenant applications.
  • Synonyms: Define words as equivalents of each other, so searching for a word will also return results for the synonyms defined.
  • Curation & Merchandizing: Boost particular records to a fixed position in the search results, to feature them.
  • Raft-based Clustering: Set up a distributed cluster that is highly available.
  • Seamless Version Upgrades: As new versions of Typesense come out, upgrading is as simple as swapping out the binary and restarting Typesense.
  • No Runtime Dependencies: Typesense is a single binary that you can run locally or in production with a single command.
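
The sorting and filtering features above map onto query parameters of the HTTP search endpoint. The sketch below only builds the URL and headers and sends nothing; the host `http://localhost:8108`, the `products` collection, the `name` and `price` fields, and the API key are hypothetical placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, collection, q, query_by,
                     filter_by=None, sort_by=None):
    """Build the URL for a Typesense document search (nothing is sent)."""
    params = {"q": q, "query_by": query_by}
    if filter_by is not None:
        params["filter_by"] = filter_by  # e.g. "price:<100"
    if sort_by is not None:
        params["sort_by"] = sort_by      # e.g. "price:asc"
    query = urlencode(params)
    return f"{base_url}/collections/{collection}/documents/search?{query}"


# Hypothetical server, collection, field names, and key.
url = build_search_url(
    "http://localhost:8108", "products", "shoe", "name",
    filter_by="price:<100", sort_by="price:asc",
)
headers = {"X-TYPESENSE-API-KEY": "hypothetical-key"}

# Decode the query string again to show what the server would receive.
decoded = parse_qs(urlsplit(url).query)
print(urlsplit(url).path)
print(decoded)
```

Here `sort_by="price:asc"` corresponds to the “Sort by Price (asc)” case mentioned in the list, and `filter_by` narrows the result set before ranking.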

Sphinx
Search engine designed with indexing database content in mind. Not very easy to use.

Apache Solr
Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.

Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world’s largest internet sites.

Apache Lucene
Search engine library

Elasticsearch
Flexible and powerful distributed RESTful search and analytics engine.

Lemur/Indri

References