Vector databases have become essential infrastructure for modern AI applications. They enable efficient storage and retrieval of high-dimensional embeddings, powering semantic search, recommendation systems, and AI-driven features. This guide covers everything you need to know about vector databases and embeddings in Python. See Python Guide for more context.
Understanding Embeddings
Embeddings are numerical representations of data—text, images, or other content—in high-dimensional space. They capture semantic meaning, allowing similar items to have similar embeddings.
Creating Text Embeddings
# Using OpenAI embeddings
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    """Generate embedding for text using OpenAI."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# Example usage
text = "Python is a powerful programming language"
embedding = get_embedding(text)
print(f"Embedding dimension: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
Using Hugging Face Embeddings
from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
sentences = [
    "Python is great for data science",
    "Machine learning requires embeddings",
    "Vector databases store high-dimensional data"
]
embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Calculate similarity between sentences
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Similarity: {similarity[0][0]:.4f}")
Batch Embedding Generation
def batch_embed(texts, batch_size=100):
    """Generate embeddings for large text collections."""
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = model.encode(batch)
        embeddings.extend(batch_embeddings)
        print(f"Processed {min(i + batch_size, len(texts))}/{len(texts)}")
    return embeddings

# Example
texts = ["Sample text " + str(i) for i in range(1000)]
embeddings = batch_embed(texts)
Vector Database Fundamentals
Vector databases are optimized for storing and querying embeddings using approximate nearest neighbor (ANN) search algorithms.
Key Concepts
- Approximate Nearest Neighbor (ANN): Fast search algorithm that trades accuracy for speed
- Indexing: Structures that enable efficient similarity search
- Dimensionality: Number of dimensions in embeddings (typically 384-1536)
- Distance Metrics: Cosine similarity, Euclidean distance, dot product
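The three distance metrics listed above can be computed directly with NumPy; the vectors here are small arbitrary values chosen only for illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

# Dot product: raw, unnormalized similarity
dot = float(np.dot(a, b))

# Euclidean (L2) distance: straight-line distance between points
euclidean = float(np.linalg.norm(a - b))

# Cosine similarity: angle between vectors, ignoring magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"dot={dot}, euclidean={euclidean:.4f}, cosine={cosine:.4f}")
```

Cosine similarity is the usual choice for text embeddings because it ignores vector magnitude and compares direction only.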
Comparison of Vector Databases
| Database | Type | Best For | Scalability |
|---|---|---|---|
| Pinecone | Managed | Production, serverless | High |
| Weaviate | Self-hosted | Flexibility, control | Medium-High |
| Milvus | Self-hosted | Large-scale, open-source | Very High |
| FAISS | Library | Local, research | Medium |
| Qdrant | Self-hosted | Performance, filtering | High |
Working with Pinecone
Pinecone is a managed vector database service ideal for production applications.
Setup and Basic Operations
from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI

# Initialize Pinecone (v3+ client; the old pinecone.init API is deprecated)
pc = Pinecone(api_key="your-api-key")

# Create index
index_name = "documents"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)

# Generate embeddings and upsert
client = OpenAI()
documents = [
    {"id": "1", "text": "Python is versatile"},
    {"id": "2", "text": "Machine learning is powerful"},
    {"id": "3", "text": "Embeddings enable semantic search"}
]

vectors = []
for doc in documents:
    embedding = client.embeddings.create(
        input=doc["text"],
        model="text-embedding-3-small"
    ).data[0].embedding
    vectors.append((doc["id"], embedding, {"text": doc["text"]}))

# Upsert vectors
index.upsert(vectors=vectors)

# Query
query_embedding = client.embeddings.create(
    input="What is Python used for?",
    model="text-embedding-3-small"
).data[0].embedding

results = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True
)
for match in results["matches"]:
    print(f"ID: {match['id']}, Score: {match['score']:.4f}")
    print(f"Text: {match['metadata']['text']}\n")
Hybrid Search with Metadata Filtering
# Upsert with rich metadata
vectors = [
    ("doc1", embedding1, {
        "text": "Python tutorial",
        "category": "programming",
        "date": "2025-01-01"
    }),
    ("doc2", embedding2, {
        "text": "ML basics",
        "category": "ai",
        "date": "2025-01-02"
    })
]
index.upsert(vectors=vectors)

# Query with metadata filtering
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"category": {"$eq": "programming"}},
    include_metadata=True
)
Working with Weaviate
Weaviate is a self-hosted vector database with built-in ML capabilities.
Setup and Indexing
import weaviate
from weaviate.classes.config import Configure, Property, DataType

# Connect to Weaviate
client = weaviate.connect_to_local()

# Define the collection (the v4 client replaces the old class-based schema API)
client.collections.create(
    name="Document",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
        Property(name="category", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
)
documents_collection = client.collections.get("Document")

# Add objects
objects = [
    {
        "title": "Python Guide",
        "content": "Learn Python programming",
        "category": "tutorial"
    },
    {
        "title": "ML Basics",
        "content": "Introduction to machine learning",
        "category": "education"
    }
]
for obj in objects:
    documents_collection.data.insert(properties=obj)

# Semantic search
response = documents_collection.query.near_text(
    query="Python programming tutorial",
    limit=3
)
for obj in response.objects:
    print(obj.properties["title"], obj.properties["content"])

client.close()
Working with FAISS (Local)
FAISS is Facebook’s library for efficient similarity search, ideal for local development.
Building and Searching
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
documents = [
    "Python is a programming language",
    "Machine learning uses embeddings",
    "Vector databases store vectors",
    "Semantic search finds similar items"
]
embeddings = model.encode(documents)
embeddings = np.array(embeddings).astype('float32')

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # L2 distance
index.add(embeddings)

# Search
query = "What is Python?"
query_embedding = model.encode([query])[0].astype('float32')
distances, indices = index.search(np.array([query_embedding]), 3)

print("Top 3 results:")
for i, idx in enumerate(indices[0]):
    print(f"{i+1}. {documents[idx]} (distance: {distances[0][i]:.4f})")
Using IVF Index for Large Datasets
# For large datasets, use an Inverted File (IVF) index
n_clusters = 100
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, n_clusters)

# Train the index (training requires at least n_clusters vectors,
# so run this on a realistically sized embedding matrix)
index.train(embeddings)
index.add(embeddings)

# Search with the nprobe parameter
index.nprobe = 10  # Number of clusters to search
distances, indices = index.search(np.array([query_embedding]), 3)
Advanced Embedding Techniques
Dimensionality Reduction
from sklearn.decomposition import PCA

# Reduce embedding dimensions
# (n_components must not exceed min(n_samples, n_features))
pca = PCA(n_components=256)
reduced_embeddings = pca.fit_transform(embeddings)
print(f"Original shape: {embeddings.shape}")
print(f"Reduced shape: {reduced_embeddings.shape}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.4f}")
Embedding Normalization
from sklearn.preprocessing import normalize
# Normalize embeddings for cosine similarity
normalized = normalize(embeddings, norm='l2')
# Verify normalization
print(f"Norm of first embedding: {np.linalg.norm(normalized[0]):.4f}")
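Normalization matters because, for unit-length vectors, the dot product equals cosine similarity, so a plain inner-product index (such as faiss.IndexFlatIP) then performs cosine search. A quick check on random vectors:

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
vectors = rng.random((5, 8))
unit = normalize(vectors, norm='l2')

# Dot product of the normalized vectors...
dot_normalized = unit[0] @ unit[1]

# ...equals the cosine similarity of the originals
cosine = (vectors[0] @ vectors[1]) / (
    np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])
)
print(f"{dot_normalized:.6f} == {cosine:.6f}")
```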
Caching Embeddings
import pickle
import os

def cache_embeddings(documents, cache_file="embeddings.pkl"):
    """Cache embeddings to avoid regenerating."""
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(documents)
    with open(cache_file, 'wb') as f:
        pickle.dump(embeddings, f)
    return embeddings

# Usage
embeddings = cache_embeddings(documents)
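One caveat with a fixed cache file name: it does not depend on the documents, so changing the document list silently returns stale embeddings. A sketch of a cache keyed by a content hash (the helper name and directory layout are assumptions, and any encoder function can be passed in):

```python
import hashlib
import os
import pickle

def cache_embeddings_by_hash(documents, encode_fn, cache_dir="emb_cache"):
    """Cache embeddings under a key derived from the document contents,
    so any change to the inputs produces a fresh cache file."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256("\x00".join(documents).encode("utf-8")).hexdigest()
    cache_file = os.path.join(cache_dir, f"{key}.pkl")
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    embeddings = encode_fn(documents)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings

# Usage with any encoder, e.g. a SentenceTransformer:
# embeddings = cache_embeddings_by_hash(documents, model.encode)
```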
Common Pitfalls and Best Practices
❌ Bad: Inconsistent Embedding Models
# DON'T: Mix different embedding models
embedding1 = model1.encode("text") # 384 dimensions
embedding2 = model2.encode("text") # 768 dimensions
# These can't be compared directly!
✅ Good: Consistent Embeddings
# DO: Use same model for all embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode("query")
document_embeddings = model.encode(documents)
# Now they're comparable
❌ Bad: Ignoring Embedding Costs
# DON'T: Generate embeddings for every query without caching
for query in queries:
    embedding = client.embeddings.create(input=query)  # Expensive!
✅ Good: Cache Query Embeddings
# DO: Cache embeddings
embedding_cache = {}

def get_cached_embedding(text):
    if text not in embedding_cache:
        embedding_cache[text] = client.embeddings.create(
            input=text,
            model="text-embedding-3-small"
        ).data[0].embedding
    return embedding_cache[text]
❌ Bad: No Metadata Filtering
# DON'T: Return all results without filtering
results = index.query(vector=query_embedding, top_k=1000)
✅ Good: Use Metadata Filters
# DO: Filter by metadata
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"category": {"$eq": "relevant_category"}}
)
Production Considerations
Scaling Embeddings
def scale_embedding_generation(documents, batch_size=100):
    """Generate embeddings at scale with error handling."""
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = []
    failed = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        try:
            batch_embeddings = model.encode(batch)
            embeddings.extend(batch_embeddings)
        except Exception as e:
            print(f"Error processing batch {i}: {e}")
            failed.extend(batch)
    return embeddings, failed
Monitoring Vector Database Performance
import time

def benchmark_search(index, query_embeddings, k=10):
    """Benchmark search performance."""
    times = []
    for query in query_embeddings:
        start = time.time()
        results = index.query(vector=query, top_k=k)
        times.append(time.time() - start)
    print(f"Average query time: {np.mean(times)*1000:.2f}ms")
    print(f"P95 query time: {np.percentile(times, 95)*1000:.2f}ms")
    print(f"P99 query time: {np.percentile(times, 99)*1000:.2f}ms")
Summary
Vector databases and embeddings are fundamental to modern AI applications. Key takeaways:
- Embeddings capture semantic meaning in high-dimensional space
- Vector databases enable efficient similarity search at scale
- Choose the right tool: Pinecone for managed, Weaviate for flexibility, FAISS for local
- Optimize costs by caching embeddings and using appropriate batch sizes
- Monitor performance and use metadata filtering for better results
- Maintain consistency in embedding models and dimensions
Vector databases unlock powerful semantic search capabilities, making them essential for RAG systems, recommendation engines, and AI-powered applications.