Introduction
Vector databases store and search numerical embeddings, the vector representations that AI models generate for text, images, and audio. They are the core retrieval engine for RAG pipelines, semantic search, recommendation systems, and anomaly detection. As of May 2026, the major platforms have largely converged on core features such as hybrid search and disk-based indexes, with GPU acceleration available on some, but they differ significantly in deployment model, operational overhead, and cost structure.
The market has diversified beyond the original four contenders (Pinecone, Weaviate, Milvus, and Qdrant). pgvector brings vector search into PostgreSQL, eliminating the operational overhead of a separate database for smaller workloads. Chroma and LanceDB target rapid prototyping. Turbopuffer challenges managed pricing. The right choice depends on your scale, team expertise, infrastructure constraints, and budget.
This guide provides Python API examples for Pinecone, Weaviate (v1.37), Milvus (v2.5), Qdrant, and pgvector — covering upsert, similarity search, hybrid search, and filtering — and includes performance benchmarks, cost comparison, RAG pipeline patterns, and a decision framework for choosing the right platform.
How Vector Search Works
flowchart LR
    A["Raw Data<br/>text, images, audio"] --> B["Embedding Model<br/>e.g. text-embedding-3-large"]
    B --> C["Vector Embedding<br/>[0.012, -0.045, ..., 0.098]"]
    C --> D[Vector Database]
    D --> E[(Index<br/>HNSW / IVF / DiskANN)]
    Q["Query: 'red running shoes'"] --> B
    B --> QV[Query Vector]
    QV --> D
    D --> R[Top-K nearest neighbors]
    R --> S[Semantically relevant results]
The database indexes all stored vectors using approximate nearest neighbor (ANN) algorithms — most commonly HNSW (Hierarchical Navigable Small World), IVF (Inverted File Index), or DiskANN for disk-based indexes. At query time, the index returns the K nearest vectors by cosine similarity or Euclidean distance.
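To make "K nearest vectors by cosine similarity" concrete, here is a minimal sketch of the exact, brute-force search that ANN indexes approximate. It scores every stored vector, which is what HNSW, IVF, and DiskANN are designed to avoid. The function names and the toy three-dimensional data are illustrative assumptions, not part of any vendor API.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Score every stored vector and return the k best (id, score) pairs."""
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    return [(vec_id, score) for score, vec_id in heapq.nlargest(k, scored)]


# Toy 3-dimensional data; real embeddings have hundreds or thousands of dimensions.
store = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
top = exact_top_k([1.0, 0.0, 0.0], store, k=2)
print([vec_id for vec_id, _ in top])  # ['doc-a', 'doc-b']
```

The cost of this approach grows linearly with the number of stored vectors, which is why production systems trade a small amount of recall for an index.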
Python API Examples
Pinecone (Managed, Serverless)
Pinecone remains the managed-cloud default. Its serverless architecture removes all infrastructure management. The April 2026 Dedicated Read Nodes GA provides fixed-cost scaling for predictable workloads, claiming up to 97% lower costs at high query volumes compared to on-demand pricing:
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create the index once; the dimension must match the embedding model's output size.
if "example-index" not in pc.list_indexes().names():
    pc.create_index(
        name="example-index",
        dimension=1024,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index("example-index")

# Vectors are truncated to three values for display; real ones need all 1024.
index.upsert(vectors=[
    {
        "id": "doc-001",
        "values": [0.012, -0.045, 0.098],
        "metadata": {"title": "RAG Architecture Guide", "category": "ai", "year": 2026},
    }
])

# Top-5 nearest neighbors, restricted to records where both metadata conditions hold.
results = index.query(
    vector=[0.015, -0.042, 0.095],
    top_k=5,
    filter={"category": {"$eq": "ai"}, "year": {"$gte": 2025}},
    include_metadata=True,
)
print(results.matches)
Pinecone is best for teams that value zero infrastructure overhead and predictable performance. The trade-off is lock-in — you cannot self-host or inspect the underlying storage. Pricing becomes expensive at scale, hitting approximately $70-100/month for workloads that cost $25-30 on self-hosted alternatives.
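The metadata filter in the query above lists two fields at the top level, which Pinecone treats as an implicit AND. The following is a hypothetical local evaluator, not Pinecone code, that sketches those semantics for the comparison operators only. The names `OPERATORS` and `matches` are assumptions made for illustration.

```python
import operator

# Comparison operators only; set and logical operators are left out of this sketch.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, flt):
    """Return True if metadata satisfies every field condition in flt (implicit AND)."""
    for field, conditions in flt.items():
        if field not in metadata:
            return False
        for op_name, expected in conditions.items():
            if not OPERATORS[op_name](metadata[field], expected):
                return False
    return True


flt = {"category": {"$eq": "ai"}, "year": {"$gte": 2025}}
print(matches({"category": "ai", "year": 2026}, flt))  # True
print(matches({"category": "ai", "year": 2024}, flt))  # False
```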
Weaviate v1.37 (Self-Hosted or Cloud)
Weaviate v1.37 (April 2026) introduced a built-in MCP Server for LLM integration, Diversity Search with Maximum Marginal Relevance (MMR) reranking, Incremental Backups, and Extensible Tokenizers. Its hybrid search is among the strongest in the field — vector + BM25 + metadata-filtering composition is native:
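import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()

# No server-side vectorizer: the caller supplies vectors on insert and on query.
collection = client.collections.create(
    name="Documents",
    vectorizer_config=Configure.Vectorizer.none(),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ],
)

# Vectors here are illustrative three-value stand-ins for real embeddings.
collection.data.insert(
    properties={
        "title": "HNSW Index Optimization",
        "content": "Choosing the right ef_construction and M parameters...",
        "category": "database",
    },
    vector=[0.012, -0.045, 0.098],
)

# Hybrid query: BM25 on the text plus vector similarity, filtered by category.
response = collection.query.hybrid(
    query="index optimization parameters",
    vector=[0.015, -0.042, 0.095],
    alpha=0.75,
    limit=10,
    filters=Filter.by_property("category").equal("database"),
)

client.close()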
The alpha parameter controls the balance between vector and keyword search. An alpha of 0.75 means 75% vector similarity, 25% keyword. This is the most common hybrid search configuration for RAG pipelines.
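The weighting that alpha describes can be sketched as a simple score fusion. This is a simplified illustration under stated assumptions (min-max normalization of each score list, then a weighted sum), not Weaviate's exact fusion algorithm. The function names and sample scores are made up.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(vector_scores, keyword_scores, alpha=0.75):
    """Weight normalized vector scores by alpha and keyword scores by 1 - alpha."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    fused = {}
    for doc_id in set(vec) | set(kw):
        # A document missing from one result list contributes 0 from that side.
        fused[doc_id] = alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


vector_scores = {"a": 0.92, "b": 0.80, "c": 0.60}   # e.g. cosine similarities
keyword_scores = {"b": 7.5, "c": 3.0, "d": 1.5}     # e.g. BM25 scores
ranking = fuse(vector_scores, keyword_scores, alpha=0.75)
print([doc_id for doc_id, _ in ranking])  # ['a', 'b', 'c', 'd']
```

Setting alpha to 0 ranks by keyword score alone, and setting it to 1 ranks by vector similarity alone.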
Weaviate’s v1.37 MCP Server exposes the database to Claude, Cursor, and VS Code as RBAC-governed tools for agentic querying and data ingestion without writing API code. This is unique among vector databases and makes Weaviate the natural choice for AI agent workflows that need database access.
Weaviate’s GraphQL API is polarizing — some teams love its expressiveness for complex queries, others find it verbose for simple similarity searches. Performance is solid but not chart-topping: written in Go, it does not match Rust-based alternatives in raw tail latency.
Milvus 2.5 (Self-Hosted, Kubernetes)
Milvus is the most popular open-source vector database for large-scale deployments. It supports multiple index types (HNSW, IVF, DiskANN) and is built for Kubernetes-native scaling:
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="year", dtype=DataType.INT64),
]
schema = CollectionSchema(fields, description="Document embeddings")
collection = Collection("documents", schema)

# Build the HNSW index, then load the collection into memory for searching.
collection.create_index("embedding", {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    "params": {"M": 16, "efConstruction": 200},
})
collection.load()

# Column-based insert: one list per non-auto field, in schema order.
# Vectors are truncated to three values for display; real ones need all 1024.
collection.insert([
    [[0.012, -0.045, 0.098]],
    ["Milvus 2.5 Release Notes"],
    [2026],
])

# Top-5 search with a scalar filter expression on the year field.
results = collection.search(
    data=[[0.015, -0.042, 0.095]],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=5,
    expr="year >= 2025",
    output_fields=["title", "year"],
)
Milvus's distributed mode, the one intended for production scale, runs on Kubernetes. Zilliz Cloud provides a managed alternative with GPU acceleration. Milvus scales to 10B+ vectors in distributed mode, making it the choice for enterprise-scale workloads where a dedicated infrastructure team is available.
Qdrant (Self-Hosted or Cloud, Rust)
Qdrant is written in Rust, giving it the latency edge among open-source vector databases. Cloud GPU-accelerated indexing and Multi-AZ clusters launched in April 2026, reducing index build time by up to 10x on supported hardware:
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    FieldCondition,
    Filter,
    HnswConfigDiff,
    PointStruct,
    Range,
    VectorParams,
)

client = QdrantClient(url="http://localhost:6333")

# GPU-accelerated indexing is assumed to be enabled in the server or cluster
# configuration on GPU-enabled builds; it is not a per-collection option here.
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=200, full_scan_threshold=10000),
)

# Vectors are truncated to three values for display; real ones need all 1024.
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=[0.012, -0.045, 0.098],
            payload={"title": "Qdrant GPU Indexing", "year": 2026},
        )
    ],
)

# query_points supersedes the older search method in recent qdrant-client releases.
results = client.query_points(
    collection_name="documents",
    query=[0.015, -0.042, 0.095],
    limit=5,
    query_filter=Filter(
        must=[FieldCondition(key="year", range=Range(gte=2025))]
    ),
)
print(results.points)
Qdrant has the strongest payload filtering among all vector databases — complex filter syntax, payload indexes, and nested condition support. Its Rust implementation delivers the best raw query latency and throughput among open-source options.
pgvector (PostgreSQL Extension)
For teams already running PostgreSQL, pgvector eliminates the operational complexity of a separate vector database. Recent performance improvements have narrowed the gap with purpose-built solutions for workloads under 10M vectors:
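import psycopg2

conn = psycopg2.connect("dbname=vectors user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        embedding vector(1024),
        title TEXT,
        year INT
    )
""")
# Create the HNSW index before querying so the search below can use it.
cur.execute("CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)")

# psycopg2 sends a Python list as a SQL array; ::vector casts it to pgvector's type.
# Vectors are truncated to three values for display; vector(1024) needs all 1024.
cur.execute("""
    INSERT INTO documents (embedding, title, year)
    VALUES (%s::vector, %s, %s)
""", ([0.012, -0.045, 0.098], "pgvector Guide", 2026))

# <=> is cosine distance, so smaller values are closer matches.
cur.execute("""
    SELECT title, year, embedding <=> %s::vector AS distance
    FROM documents
    WHERE year >= 2025
    ORDER BY distance
    LIMIT 5
""", ([0.015, -0.042, 0.095],))
rows = cur.fetchall()
print(rows)

conn.commit()
cur.close()
conn.close()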
pgvector is the right choice when you want one database for everything and your vector workload fits within PostgreSQL’s scaling limits (~10-50M vectors). Beyond that, operational friction increases significantly compared to purpose-built solutions.
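When passing embeddings to pgvector as query parameters, one option is to send them in pgvector's text form, a bracketed comma-separated list such as `[0.1,0.2,0.3]`, and cast with `::vector`. The helper below is a minimal sketch of that formatting step. The name `to_pgvector_literal` and the optional dimension check are illustrative assumptions, not part of pgvector or psycopg2.

```python
import math


def to_pgvector_literal(values, expected_dim=None):
    """Format numbers as pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    floats = [float(v) for v in values]
    if expected_dim is not None and len(floats) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(floats)}")
    if not all(math.isfinite(v) for v in floats):
        raise ValueError("vector elements must be finite numbers")
    return "[" + ",".join(repr(v) for v in floats) + "]"


literal = to_pgvector_literal([0.012, -0.045, 0.098], expected_dim=3)
print(literal)  # [0.012,-0.045,0.098]
```

The resulting string can be bound like any other parameter, for example `cur.execute("SELECT ... ORDER BY embedding <=> %s::vector", (literal,))`. Checking the dimension on the client side surfaces a mismatch with the column definition before the database rejects the row.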
Feature Comparison (2026)
| Feature | Pinecone | Weaviate v1.37 | Milvus 2.5 | Qdrant | pgvector |
|---|---|---|---|---|---|
| Deployment | Managed only | Self-host + Cloud | Self-host + Cloud (Zilliz) | Self-host + Cloud | Postgres extension |
| Open Source | No | Yes (BSD-3) | Yes (Apache 2.0) | Yes (Apache 2.0) | Yes (PostgreSQL) |
| Hybrid Search | Via sparse-dense | Built-in (alpha) | Plugin | Built-in | Manual |
| GPU Acceleration | No | No | Yes (Zilliz) | Yes (Cloud) | No |
| MCP Server | No | Yes (v1.37) | No | No | No |
| Metadata Filtering | Good | Strong (GraphQL) | Good | Excellent | Full SQL |
| Max Scale | Billions | Hundreds of millions | 10B+ | Billions | 10-50M |
| Language | Proprietary | Go | Go + C++ | Rust | C (extension) |
| SOC 2 | Yes | Yes (Cloud) | No | Yes (Cloud) | Varies |
Performance Benchmarks
Testing with 1M vectors (1024 dimensions, cosine similarity), 2x Intel Xeon, 64GB RAM, NVIDIA A10G for GPU tests:
| Index Type | Build Time | Query Latency p50 | Query Latency p99 | Recall@10 |
|---|---|---|---|---|
| HNSW (M=16, ef=200) | 12 min | 8 ms | 18 ms | 99.2% |
| IVF (nlist=4096) | 6 min | 15 ms | 30 ms | 96.5% |
| DiskANN | 20 min | 25 ms | 50 ms | 97.0% |
| HNSW + GPU (Qdrant) | 2.1 min | 5 ms | 12 ms | 99.1% |
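
The Recall@10 column compares the approximate index's top 10 against the true top 10 from an exact search. A minimal sketch of that metric follows, assuming the usual definition (the fraction of true neighbors that the index returned). The function names and sample IDs are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(set(approx_ids[:k]) & truth) / len(truth)


def mean_recall_at_k(approx_results, exact_results, k=10):
    """Average recall@k over a batch of queries."""
    pairs = list(zip(approx_results, exact_results))
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(pairs)


exact = list(range(1, 11))                    # true 10 nearest neighbors
approx = [1, 2, 3, 4, 5, 6, 7, 8, 9, 42]      # index missed one of them
print(recall_at_k(approx, exact, k=10))       # 0.9
```

A reported figure such as 99.2% would be this value averaged over many test queries.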
Vendor-specific latency at 10M vectors (1536 dimensions, k=10):
| Database | p50 | p95 | p99 | QPS (1 node) |
|---|---|---|---|---|
| Pinecone | 28ms | 45ms | 78ms | 10,500 |
| Weaviate | 39ms | 62ms | 105ms | 8,200 |
| Qdrant | 22ms | 38ms | 54ms | 15,300 |
| Milvus | 30ms | 55ms | 85ms | 9,800 |
| pgvector | 35ms | 60ms | 95ms | 7,200 |
Qdrant leads in raw performance. Pinecone and Milvus handle the largest scales but at higher latency. Weaviate’s GraphQL API and module ecosystem add value at the cost of speed.
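The p50, p95, and p99 columns in the latency table are percentiles of per-query latency. The sketch below uses the nearest-rank method, one common definition. The source does not say which method its benchmarks used, and the samples here are made up to show how a few slow queries move p99 without touching p50.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 made-up latencies in milliseconds: 97 fast queries and 3 slow outliers.
latencies_ms = [20] * 97 + [90] * 3
print(percentile(latencies_ms, 50))  # 20
print(percentile(latencies_ms, 95))  # 20
print(percentile(latencies_ms, 99))  # 90
```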
Cost Comparison
Scenario: 1M vectors (1024 dimensions), 1M queries/month:
| Solution | Storage/Month | Queries/Month | Total |
|---|---|---|---|
| Pinecone Serverless | $35 | $8 | ~$43 |
| Qdrant Cloud | $25 | Included | ~$25 |
| Weaviate Cloud | $30 | Included | ~$30 |
| Self-hosted (Qdrant) | ~$50 (infra) | N/A | ~$50 |
| pgvector (existing Postgres) | ~$0 (existing) | N/A | ~$0 |
Scenario: 100M vectors (1024 dimensions):
| Solution | Estimated Monthly Cost |
|---|---|
| Pinecone | ~$800 |
| Qdrant Cloud | ~$400 |
| Weaviate Cloud | ~$500 |
| Self-hosted (8 nodes) | ~$600 |
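The cost gap between managed and self-hosted narrows as scale increases, since infrastructure costs dominate. Pinecone’s simplicity premium is most justified at small to medium scales. In the estimates above, self-hosting at 100M vectors (~$600) undercuts Pinecone (~$800) but not Qdrant Cloud (~$400) or Weaviate Cloud (~$500), so for 100M+ vectors self-hosted Qdrant or Milvus typically wins on cost against Pinecone rather than against every managed option.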
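The two cost tables can be put on a common footing by dividing each monthly estimate by the corpus size. The sketch below copies the figures from the tables above. It assumes the "Self-hosted" rows in the two scenarios are comparable, and it ignores that only the 1M scenario states a query volume.

```python
# Monthly cost estimates in USD, copied from the two tables above.
# Inner keys are corpus sizes in millions of vectors.
MONTHLY_COST = {
    "Pinecone": {1: 43, 100: 800},
    "Qdrant Cloud": {1: 25, 100: 400},
    "Weaviate Cloud": {1: 30, 100: 500},
    "Self-hosted": {1: 50, 100: 600},
}


def cost_per_million(solution, millions):
    """Monthly dollars per million vectors at the given corpus size."""
    return MONTHLY_COST[solution][millions] / millions


def relative_to(solution, baseline, millions):
    """How many times the baseline's bill the solution costs at this size."""
    return MONTHLY_COST[solution][millions] / MONTHLY_COST[baseline][millions]


for name in MONTHLY_COST:
    print(f"{name:15s} ${cost_per_million(name, 1):6.2f}/M at 1M   "
          f"${cost_per_million(name, 100):5.2f}/M at 100M")

print(round(relative_to("Self-hosted", "Pinecone", 1), 2))    # 1.16
print(round(relative_to("Self-hosted", "Pinecone", 100), 2))  # 0.75
```

On these numbers, self-hosting costs about 16% more than Pinecone at 1M vectors and 25% less at 100M.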
RAG Pipeline Patterns
Basic RAG with Metadata Filtering
The most common pattern — retrieve relevant documents, then augment the LLM prompt:
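from qdrant_client.models import FieldCondition, Filter, MatchValue

# embedding_model, vector_db, and llm are placeholders for your own embedding
# model, Qdrant-style client, and LLM wrapper; they are not defined here.


def basic_rag(query: str, collection: str, category: str) -> str:
    # 1. Embed the question with the same model used for the documents.
    query_vec = embedding_model.encode(query)
    # 2. Retrieve the five nearest documents within the requested category.
    results = vector_db.search(
        collection_name=collection,
        query_vector=query_vec.tolist(),
        query_filter=Filter(must=[
            FieldCondition(key="category", match=MatchValue(value=category))
        ]),
        limit=5,
    )
    # 3. Build the prompt from the retrieved text and ask the LLM.
    context = "\n\n".join(r.payload["content"] for r in results)
    prompt = f"Answer based on this context:\n{context}\nQuestion: {query}"
    return llm.invoke(prompt)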
Hybrid Search RAG
Combines semantic similarity with keyword matching for better retrieval when exact term matches matter:
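# vector_db.hybrid_search is a placeholder for your client's hybrid query call
# (for example, Weaviate's collection.query.hybrid), not a real library method.
# embedding_model and llm are placeholders as well.
def hybrid_rag(query: str, alpha: float = 0.75) -> str:
    query_vec = embedding_model.encode(query)
    results = vector_db.hybrid_search(
        query_text=query,
        query_vector=query_vec.tolist(),
        alpha=alpha,
        limit=10,
    )
    context = "\n\n".join(r.payload["content"] for r in results)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm.invoke(prompt)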
Multi-Vector Search
For multi-modal embeddings (text + image), search across multiple vector fields:
# vector_db is a placeholder client. In practice the text and image embeddings
# usually live in separate named vector fields of the same collection.
def multi_vector_search(text_embedding, image_embedding):
    """Search combining text and image similarity scores."""
    text_results = vector_db.search(
        collection_name="documents",
        query_vector=text_embedding,
        limit=20,
    )
    image_results = vector_db.search(
        collection_name="documents",
        query_vector=image_embedding,
        limit=20,
    )
    # Weighted sum of scores per document: 60% text, 40% image.
    combined = {}
    for r in text_results:
        combined[r.id] = combined.get(r.id, 0) + r.score * 0.6
    for r in image_results:
        combined[r.id] = combined.get(r.id, 0) + r.score * 0.4
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:10]
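The weighted sum above assumes text and image scores are on comparable scales. When they are not, a rank-based alternative is reciprocal rank fusion (RRF), which is a different technique from the one shown above and uses only each document's position in each result list. A minimal sketch, assuming the commonly used constant k=60 and made-up result lists:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text_ranking = ["a", "b", "c"]    # best-first IDs from the text search
image_ranking = ["b", "d", "a"]   # best-first IDs from the image search
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Documents that appear near the top of both lists rise to the top of the fused list, without any score normalization.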
Decision Guide
flowchart TD
A[How many vectors?] --> B{<10M?}
B -->|Yes| C{Using Postgres?}
C -->|Yes| pgvector["pgvector<br/>No new infra"]
C -->|No| D[Chroma / LanceDB<br/>Rapid prototyping]
B -->|No, 10-100M| E{Managed or self-host?}
E -->|Managed| Pinecone["Pinecone<br/>Zero ops, $43-800/mo"]
E -->|Self-host| Qdrant["Qdrant<br/>Best perf, open source"]
A -->|100M-1B+| F{Kubernetes team?}
F -->|Yes| Milvus["Milvus<br/>K8s-native, 10B scale"]
F -->|No| G{Need hybrid search?}
G -->|Yes| Weaviate["Weaviate<br/>Best hybrid, MCP"]
G -->|No| Qdrant
style pgvector fill:#336791,color:#fff
style Pinecone fill:#f59e0b,color:#fff
style Qdrant fill:#10b981,color:#fff
style Milvus fill:#6366f1,color:#fff
style Weaviate fill:#ec4899,color:#fff
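The flowchart above can also be read as a small function. The sketch below encodes its branches directly. The function name and parameters are illustrative, and because the diagram labels both "10-100M" and "100M-1B+", the choice to put exactly 100M vectors in the middle tier is an assumption.

```python
def recommend(num_vectors, uses_postgres=False, managed=False,
              has_k8s_team=False, needs_hybrid=False):
    """Walk the decision flowchart and return the suggested platform."""
    if num_vectors < 10_000_000:
        # Small workloads: reuse Postgres if it is already there.
        return "pgvector" if uses_postgres else "Chroma / LanceDB"
    if num_vectors <= 100_000_000:
        # Mid-size workloads: the choice is managed versus self-hosted.
        return "Pinecone" if managed else "Qdrant"
    # Large workloads: Milvus if a Kubernetes team is available.
    if has_k8s_team:
        return "Milvus"
    return "Weaviate" if needs_hybrid else "Qdrant"


print(recommend(2_000_000, uses_postgres=True))       # pgvector
print(recommend(50_000_000, managed=True))            # Pinecone
print(recommend(500_000_000, needs_hybrid=True))      # Weaviate
```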
Resources
- Pinecone Python SDK Documentation
- Weaviate v1.37 Release Notes — MCP Server, Diversity Search
- Milvus 2.5 Release Notes — Latest security patches
- Qdrant Cloud GPU Indexing — Multi-AZ and GPU features
- pgvector Documentation — PostgreSQL vector extension
- Hugging Face Embedding Models — Compatible embedding models
- Encore: Best Vector Databases 2026 — Comparison guide
- Firecrawl: Vector Database Benchmarks — Independent testing