Import Data from MongoDB into Meilisearch

Background

MongoDB is a document-oriented (JSON) database: it is schema-less and stores data as JSON-like documents.

Meilisearch is an open-source search engine with a search-as-you-type experience (similar to Algolia). It is written in Rust and is very fast at both building indexes and returning search results.

When building an index, Meilisearch is fast and needs very little memory. For example, indexing about 2 million documents, whose exported JSON file (a collection, in MongoDB terms) is around 1.1 GB uncompressed, takes only about 20 minutes, and the resulting index is around 8.5 GB.

Meilisearch accepts documents in JSON, NDJSON, and CSV formats; we will use JSON here.

This article walks through how to import data from MongoDB into Meilisearch.

Prepare the data for Meilisearch (add an id field)

Meilisearch needs a unique primary key field. Its values may only contain alphanumeric characters (0-9, a-z, A-Z), hyphens (-), and underscores (_); no other special characters are allowed. If you don't set a primary key explicitly, Meilisearch uses the first attribute whose name contains id, such as _id or user_id.

Every MongoDB document has an _id field. When exported, it looks like this:

{
	"_id":{"$oid":"623a9cdace6b4611493b8525"}
}

This cannot be used in Meilisearch because the key contains the special character $. We can still use the ObjectId as the primary key, just not directly: we need a field called id whose value is the ObjectId as a string, like this:

{
	"id": "623a9cdace6b4611493b8525"
}

This is fine for Meilisearch.

To get there, we can create a view in MongoDB that drops _id and adds an id field whose value is the string form of _id.

Execute this in the mongosh shell:

var pipeline = [{$addFields: {id:{"$toString": "$_id"}}}, {$project: {_id: 0}}]
db.createView("view_for_export", "sites", pipeline)

Here sites is the source collection name; the result is a new view called view_for_export.
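
To sanity-check the view before exporting, you can peek at a single document in mongosh (the field list will of course depend on your own collection):

db.view_for_export.findOne()

The returned document should have no _id field and a string-valued id field. Once it looks right, export the data: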

$ mongoexport --port=27017 --db=db1 --collection=view_for_export --out=sites.json --jsonArray

Note: the export must be in jsonArray format for Meilisearch to import it. A jsonArray file looks like this:

[
	{},
	{},
	{}
]
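
If you have jq installed, a quick sanity check that the export really is one big JSON array (the count below is from this example and will differ for your data):

$ jq 'type, length' sites.json
"array"
1938716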

Import data to Meilisearch

Before importing the data, we need to raise Meilisearch's HTTP payload size limit, since the default (about 100 MB) is smaller than our export:

$ nohup meilisearch --http-addr 0.0.0.0:7700 --master-key="abcd123123" --http-payload-size-limit=100Gb &
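
You can confirm the server is up before sending a large payload; Meilisearch's /health endpoint does not require the master key:

$ curl http://localhost:7700/health
{"status":"available"}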

Now import the data:

curl \
  -X POST 'http://localhost:7700/indexes/sites/documents' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer abcd123123' \
  --data-binary @sites.json
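
Meilisearch queues the addition and immediately replies with a summarized task object; note the task uid, which we will poll in the next step. Depending on your Meilisearch version the field is named taskUid or uid, and your values will differ:

{"taskUid":21,"indexUid":"sites","status":"enqueued","type":"documentAdditionOrUpdate","enqueuedAt":"2022-09-22T03:03:48.818566287Z"}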

Check the task status (replace the task uid 21 with your own):

curl -H 'Authorization: Bearer abcd123123' -X GET 'http://localhost:7700/tasks/21'
{
  "uid": 21,
  "indexUid": "sites",
  "status": "succeeded",
  "type": "documentAdditionOrUpdate",
  "details": { "receivedDocuments": 1938716, "indexedDocuments": 1938716 },
  "duration": "PT1189.588458063S",
  "enqueuedAt": "2022-09-22T03:03:48.818566287Z",
  "startedAt": "2022-09-22T03:03:48.846784313Z",
  "finishedAt": "2022-09-22T03:23:38.435242376Z"
}

Within 20 minutes, 1,938,716 documents (1.1 GB uncompressed) were indexed by Meilisearch. The total index size is 8.5 GB, and less than 1 GB of memory was used during the process.
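
As a final check, run a quick search against the new index (the query string here is just a placeholder):

$ curl \
  -X POST 'http://localhost:7700/indexes/sites/search' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer abcd123123' \
  --data-binary '{"q": "example"}'

If it returns hits almost instantly, the import worked.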

This is the end of the article.