⚡ Calmops

Building Vector Search with Redis: From Embeddings to Semantic Retrieval

TL;DR

This article maps out a production-ready vector search pipeline using Redis (RediSearch / Redis Stack). It includes concrete, runnable examples (Python + Node), index creation and tuning for HNSW, a benchmark harness to measure recall vs latency, production scaling patterns, and an operational checklist.

1. Why Redis

Redis delivers low-latency KNN queries, co-located document storage (RedisJSON), and secondary indexes (RediSearch), making hybrid search (filters plus semantic similarity) simple and fast. For many RAG and recommendation systems, this single-system approach reduces complexity compared to running a separate vector DB plus a document store.

2. Pipeline overview

  1. Ingest documents (JSON) with metadata
  2. Compute or obtain embeddings from a provider
  3. Store embeddings (JSON path or separate binary key)
  4. Create RediSearch index with VECTOR (HNSW)
  5. Query with KNN and apply post-processing/reranking
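Steps 1-3 can be sketched as a small payload builder. The key layout doc:{tenant}:{id} is an assumption (the {tenant} hash tag keeps one tenant's keys on a single cluster slot), and the embedding is passed in rather than computed, so the sketch works with any provider:

```python
import json

def build_doc(doc_id, title, body, embedding, tenant="acme"):
    # Assumed key layout: doc:{tenant}:{id}. The braces are a Redis
    # Cluster hash tag, so all of a tenant's docs share a slot.
    key = f"doc:{{{tenant}}}:{doc_id}"
    payload = {"title": title, "body": body, "embedding": embedding}
    return key, json.dumps(payload)

# Write with a real client:
# r.execute_command('JSON.SET', key, '$', payload)
```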

3. Embedding provider choices

  • Hosted APIs (OpenAI, Anthropic, AWS Bedrock): easy integration, stable models, pay-per-call.
  • Self-hosted (sentence-transformers, ggml-based models): lower marginal cost at scale, more ops overhead.
  • Recommendation: start with a hosted provider for development; add a local bulk re-embedding pipeline for refreshes.

Dimension advice: stick to the common model dimensions (768/1024/1536) and verify the model's output dimension before building the index, since DIM is fixed at index creation time.
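A minimal guard for that verification step; the 1536 default is just the example dimension used throughout this article, not a recommendation:

```python
import numpy as np

EXPECTED_DIM = 1536  # must match DIM in your FT.CREATE schema

def check_embedding(embedding, expected_dim=EXPECTED_DIM):
    # Coerce to float32 (the index TYPE) and fail fast on a mismatch,
    # rather than silently writing vectors the index cannot ingest.
    v = np.asarray(embedding, dtype=np.float32)
    if v.shape != (expected_dim,):
        raise ValueError(
            f"model returned shape {v.shape}, index expects ({expected_dim},)"
        )
    return v
```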

4. Storage strategies (detailed)

A. RedisJSON (single object)

  • Pros: one key per doc, easier queries, RediSearch can index JSON paths directly.
  • Cons: updating vectors rewrites JSON; larger objects increase memory churn.

B. Separate vector key (HASH or STRING binary)

  • Pros: cheap vector updates, less JSON churn.
  • Cons: you must keep each document and its vector key consistent (doc → vec coherence), and indexing is more complex.

Hybrid: store metadata in JSON and keep vector bytes in a separate HSET or STRING; create a synthetic document index that references the vector path.
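A sketch of the hybrid write path. The key names doc:<id> and vec:<id> are assumptions; issuing both commands through one pipeline keeps metadata and vector in step:

```python
import json

def hybrid_write_cmds(doc_id, meta, vec_bytes):
    # Assumed key names; on Redis Cluster you would hash-tag both
    # (doc:{42}, vec:{42}) so they land on the same slot.
    return [
        ("JSON.SET", f"doc:{doc_id}", "$", json.dumps(meta)),
        ("HSET", f"vec:{doc_id}", "embedding", vec_bytes),
    ]

# With a real client, send both in one pipeline:
# pipe = r.pipeline(transaction=False)
# for cmd in hybrid_write_cmds(42, {"title": "t"}, vec_bytes):
#     pipe.execute_command(*cmd)
# pipe.execute()
```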

5. Index creation

Example (JSON + HNSW vectors, 1536-dim):

FT.CREATE idx:docs ON JSON PREFIX 1 "doc:" SCHEMA \
  $.title AS title TEXT SORTABLE \
  $.body AS body TEXT \
  $.embedding AS embedding VECTOR HNSW 10 \
    TYPE FLOAT32 DIM 1536 DISTANCE_METRIC COSINE \
    M 16 EF_CONSTRUCTION 200

Notes:

  • Set DIM and TYPE to match your embedding bytes.
  • If running Redis Cluster, make sure each document key and its vector key hash to the same slot (use hash tags, e.g. doc:{tenant}:...).

6. Serializing embeddings (Python and Node)

Python (numpy):

import numpy as np

def to_vec_bytes(embedding_list):
    return np.array(embedding_list, dtype=np.float32).tobytes()

Node (Float32Array -> Buffer):

function toVecBuffer(embedding){
  const floatArray = new Float32Array(embedding);
  return Buffer.from(floatArray.buffer);
}
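Going the other way (for example, when reading a stored vector back for debugging), the Python decode is a one-liner, and the pair should round-trip exactly:

```python
import numpy as np

def from_vec_bytes(b):
    # Inverse of to_vec_bytes above: reinterpret the raw bytes as float32
    return np.frombuffer(b, dtype=np.float32)
```

Round-trip check: from_vec_bytes(to_vec_bytes(v)) should equal v for any float32-representable input.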

7. Query examples (Python runnable)

This Python example uses the redis-py client to issue a raw FT.SEARCH with KNN. Adjust it to your client library's helpers when available.

import numpy as np
from redis import Redis

r = Redis()

# query vector (numpy float32)
q = np.random.rand(1536).astype('float32')
q_bytes = q.tobytes()

# raw FT.SEARCH with KNN (pass vector bytes via PARAMS)
query = "*=>[KNN 10 @embedding $BLOB AS score]"
res = r.execute_command(
    'FT.SEARCH', 'idx:docs', query,
    'PARAMS', '2', 'BLOB', q_bytes,
    'RETURN', '3', 'title', 'body', 'score', 'DIALECT', '2'
)

# parse `res` according to RediSearch result format
print(res)

Notes: the exact parameterization differs by client and RediSearch version; DIALECT 2 enables newer query features.
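Parsing the raw reply: with RESP2 and no NOCONTENT, FT.SEARCH returns a flat array of the form [total, key1, [field1, value1, ...], key2, ...]. A small parser (a sketch; the shape differs under RESP3 or with NOCONTENT):

```python
def _s(x):
    # redis-py returns bytes unless decode_responses=True
    return x.decode() if isinstance(x, bytes) else x

def parse_search_reply(res):
    total, rest = res[0], res[1:]
    hits = []
    # keys and field arrays alternate in the flat reply
    for i in range(0, len(rest), 2):
        fields = rest[i + 1]
        doc = {_s(fields[j]): _s(fields[j + 1]) for j in range(0, len(fields), 2)}
        hits.append((_s(rest[i]), doc))
    return total, hits
```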

8. Hybrid search example

Apply filters then KNN (e.g., category + semantic):

FT.SEARCH idx:docs "@category:{finance}=>[KNN 10 @embedding $BLOB AS score]" PARAMS 2 BLOB <vec> RETURN 3 title body score DIALECT 2

Pattern: filter candidates first to reduce vector work and improve precision.
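A small (hypothetical) helper to build that hybrid query string; note that tag values containing spaces or punctuation must be escaped in RediSearch tag syntax:

```python
def hybrid_knn_query(tag_field, tag_value, k=10, vec_field="embedding"):
    # Escape characters that are special inside a {tag} clause
    escaped = "".join("\\" + c if c in ' ,{}|"' else c for c in tag_value)
    return f"@{tag_field}:{{{escaped}}}=>[KNN {k} @{vec_field} $BLOB AS score]"
```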

9. Benchmark harness (Python)

A simple harness to measure latency and Recall@K vs brute-force:

import time
import numpy as np
from redis import Redis

r = Redis()

# Load a small corpus for the brute-force baseline. Assumes vectors are
# L2-normalized (so dot product ranks the same as cosine distance) and
# that corpus row i was stored in Redis under key doc:i.
corpus = np.load('corpus.npy')    # shape (N, D), float32
queries = np.load('queries.npy')  # shape (Q, D), float32

# exact top-K by cosine similarity
def brute_topk(q, k=10):
    sims = corpus @ q
    idx = np.argpartition(-sims, k)[:k]
    return set(int(i) for i in idx)

# approximate top-K via RediSearch; returns corpus row indices
def redis_query(q_bytes, k=10):
    res = r.execute_command(
        'FT.SEARCH', 'idx:docs',
        f'*=>[KNN {k} @embedding $BLOB AS score]',
        'NOCONTENT',
        'PARAMS', '2', 'BLOB', q_bytes,
        'DIALECT', '2',
    )
    # NOCONTENT reply: [total, key1, key2, ...]
    return set(int(key.decode().rsplit(':', 1)[-1]) for key in res[1:])

# run benchmark
k = 10
latencies, recalls = [], []
for q in queries:
    q_bytes = q.astype('float32').tobytes()
    start = time.perf_counter()
    ids = redis_query(q_bytes, k)
    latencies.append(time.perf_counter() - start)
    recalls.append(len(brute_topk(q, k) & ids) / k)

print(f'mean recall@{k}: {np.mean(recalls):.3f}, '
      f'p50 latency: {np.median(latencies) * 1000:.1f} ms')

Measure recall while varying EF_RUNTIME to find the sweet spot.

10. HNSW tuning guidance

  • M (16-64): higher values cost more memory per node but improve recall.
  • EF_CONSTRUCTION (100-500): controls build-time graph quality (and build time).
  • EF_RUNTIME (per query): tune to your latency budget; higher gives better recall at higher latency.

Practical flow: build at a moderate EF_CONSTRUCTION, then sweep EF_RUNTIME from 20 to 200 and measure the recall/latency curve.
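EF_RUNTIME can be set per query inside the KNN clause, so the sweep is just a loop over query strings fed to the benchmark harness above (a sketch; field name assumed to match the schema):

```python
def knn_query(k=10, ef_runtime=None, vec_field="embedding"):
    # EF_RUNTIME is an optional per-query HNSW attribute inside the
    # KNN clause; omit it to use the index default.
    attrs = f" EF_RUNTIME {ef_runtime}" if ef_runtime is not None else ""
    return f"*=>[KNN {k} @{vec_field} $BLOB{attrs} AS score]"

ef_sweep = [knn_query(10, ef) for ef in (20, 50, 100, 200)]
```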

11. Scaling & topology

  • Per-tenant indexes or per-domain partitioning are simplest to scale horizontally.
  • For multi-tenant systems, route queries to the node owning the tenant’s index.
  • When cross-index search is required, run queries in parallel across shards and merge results at application level.
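The application-level merge in the last bullet is a plain top-K over the per-shard results. A sketch, assuming each shard returns (distance, doc_id) pairs where smaller distance means closer (as with COSINE distance):

```python
import heapq

def merge_shard_results(shard_results, k=10):
    # shard_results: iterable of per-shard lists of (distance, doc_id);
    # take the k globally smallest distances across all shards
    return heapq.nsmallest(k, (hit for shard in shard_results for hit in shard))
```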

12. Operational checklist (expanded)

  • Secure: ACLs, TLS, and network controls
  • Backup: test restore of RDB/AOF and module data
  • Observability: Prometheus exporter, RedisInsight for cluster visibility
  • Reindex: scripted reindex with blue-green switch to avoid downtime
  • Cost: monitor memory and consider Redis on Flash (auto tiering) or vector quantization when memory cost becomes the limiting factor
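The blue-green reindex item can be scripted around RediSearch index aliases: the query service always searches an alias, and FT.ALIASUPDATE repoints it atomically. A sketch with the Redis call injected so the flow is testable; the index names and schema args are hypothetical:

```python
def blue_green_reindex(execute, alias, old_index, new_index, schema_args):
    # 1. Build the new index alongside the old one
    execute("FT.CREATE", new_index, *schema_args)
    # 2. ... backfill: scan source docs and write them under the new prefix ...
    # 3. Atomically repoint the alias; readers never see a missing index
    execute("FT.ALIASUPDATE", alias, new_index)
    # 4. Drop the old index (without DD, the documents are kept)
    execute("FT.DROPINDEX", old_index)

# Usage with a real client:
# blue_green_reindex(r.execute_command, "idx:docs",
#                    "idx:docs:v1", "idx:docs:v2", ["ON", "JSON", ...])
```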

13. FAQs and common pitfalls

  • Q: Why are KNN queries slower during concurrent writes? A: HNSW index updates are heavier than point writes; prefer batched writes or maintain a write buffer and background merge.

  • Q: What causes poor recall? A: Low ef or low M; also vector normalization mismatches or using poor-quality embeddings.

14. Next steps & repo idea

Provide a runnable sample repo with these components:

  • ingestion worker (reads docs, computes embeddings, writes Redis)
  • index builder/script
  • query service (HTTP) exposing /search
  • benchmark harness

This will be implemented as a companion GitHub repo linked from this article.

Appendix: common FT.SEARCH examples, parsing notes, and CLI tips are included in the companion repo.
