Introduction
DeepSeek has become one of the most influential AI companies in 2026, challenging OpenAI and Anthropic with open-source models that match or exceed proprietary alternatives. On April 24, 2026, DeepSeek released V4 — a family of Mixture-of-Experts models under the MIT license with 1M-token context windows and industry-leading coding benchmarks. V4-Pro achieves 80.6% on SWE-bench Verified and 93.5 on LiveCodeBench, the highest scores of any publicly available model.
This guide covers the complete DeepSeek model family with a focus on V4, provides Python API integration code using the OpenAI-compatible endpoint, explains deployment with vLLM and Docker for self-hosted scenarios, and includes the July 2026 model name migration timeline.
Model Family Overview
DeepSeek V4 (2026)
Released April 24, 2026, V4 represents a generational leap over V3.2. The architecture uses a hybrid CSA+HCA (Compressed Sparse Attention + Hierarchical Context Attention) mechanism that reduces FLOPs to 27% and KV cache to 10% of V3.2 at long context lengths.
| Model | Parameters | Active per Token | Context | Price (Input / 1M tokens) | SWE-bench Verified |
|---|---|---|---|---|---|
| V4-Flash | 284B | 13B | 1M tokens | $0.14 | 72.1% |
| V4-Pro | 1.6T | 49B | 1M tokens | $1.74 | 80.6% |
| V4-Pro-Max | 1.6T | 49B | 1M tokens | $3.48 | 80.6% |
V4-Flash is optimized for cost-sensitive production workloads. V4-Pro targets complex reasoning, code generation, and research tasks. Both use the Muon optimizer during training, which contributed to 2x training efficiency over V3.2.
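Because these are Mixture-of-Experts models, only part of the weights is used for each token. The short calculation below makes that ratio explicit using only the figures from the table; the dictionary layout and function name are illustrative, not part of any DeepSeek tooling.

```python
# Parameter counts in billions, copied from the model table above.
MODELS = {
    "V4-Flash": {"total_b": 284, "active_b": 13},
    "V4-Pro": {"total_b": 1600, "active_b": 49},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters activated per token."""
    return active_b / total_b


for name, p in MODELS.items():
    share = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {share:.1%} of weights active per token")
```

This works out to about 4.6% for V4-Flash and 3.1% for V4-Pro, which is why per-token compute is far lower than the total parameter count suggests.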
Previous Generation Models
DeepSeek V3.2 (late 2025): The immediate predecessor to V4. Still functional but being phased out. Uses 64K token context and the older MoE architecture without CSA+HCA attention.
DeepSeek R1 (January 2025): Reasoning-focused model with chain-of-thought capabilities. Achieved performance comparable to OpenAI o1 on math and logic benchmarks. R1 continues to be available for applications that benefit from explicit reasoning traces.
Janus Pro: Multimodal model with separate visual and language pathways. Supports image understanding (scene description, OCR, chart interpretation) and image generation.
API Integration
DeepSeek provides an OpenAI-compatible API, so the standard openai Python client works once base_url points at DeepSeek and the API key is a DeepSeek key.
Migration Alert: July 24, 2026 Cutoff
The legacy model names deepseek-chat and deepseek-reasoner will be fully retired on July 24, 2026 at 15:59 UTC. After this date, requests using those names will return errors. Replace them with deepseek-v4-flash and deepseek-v4-pro.
# BAD — will stop working after July 24, 2026
model = "deepseek-chat" # deprecated
# GOOD — use V4 model identifiers
model = "deepseek-v4-flash" # cost-optimized
model = "deepseek-v4-pro" # maximum capability
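For codebases where model names are scattered across configuration files, a small translation helper can ease the switch. The sketch below assumes a one-to-one mapping (deepseek-chat to deepseek-v4-flash, deepseek-reasoner to deepseek-v4-pro), which is an interpretation of the guidance above rather than a confirmed rule; the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

# Retirement time stated above: July 24, 2026 at 15:59 UTC.
LEGACY_CUTOFF = datetime(2026, 7, 24, 15, 59, tzinfo=timezone.utc)

# Assumed one-to-one mapping; confirm against DeepSeek's migration notes.
LEGACY_TO_V4 = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-pro",
}


def migrate_model_name(name: str) -> str:
    """Return the V4 identifier for a legacy name, or the name unchanged."""
    return LEGACY_TO_V4.get(name, name)


def legacy_names_retired(now: Optional[datetime] = None) -> bool:
    """Return True once the retirement cutoff has been reached."""
    now = now or datetime.now(timezone.utc)
    return now >= LEGACY_CUTOFF


print(migrate_model_name("deepseek-chat"))  # prints "deepseek-v4-flash"
```

Routing every request through migrate_model_name lets old configuration values keep working while they are cleaned up.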
Basic Chat Completion
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com"
)
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ],
    max_tokens=1024,
    temperature=0.3
)
print(response.choices[0].message.content)
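Production callers usually wrap requests like the one above in retry logic, since any hosted API can return transient errors. The helper below is a generic sketch using only the standard library; with_retries and its parameters are hypothetical names, and which exceptions are worth retrying is left to the caller.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2x, 4x, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

With the openai client, this could be called as with_retries(lambda: client.chat.completions.create(...), retry_on=(openai.RateLimitError, openai.APIConnectionError)), so that only transient failures are retried.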
Streaming Response
For interactive applications, enable streaming to receive tokens as they are generated:
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain how transformers work in 3 paragraphs."}],
    stream=True
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta, so guard both
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # finish the output with a newline
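When the full text is needed after streaming (for logging or caching, say), the deltas can be accumulated as they arrive. The sketch below assumes chunks shaped like the ones in the loop above; collect_stream is a hypothetical helper, and the fake chunks exist only so the example runs without network access.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Concatenate the text deltas from a chat completion stream."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None and empty strings
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk("Hel"),
    _fake_chunk(None),
    SimpleNamespace(choices=[]),
    _fake_chunk("lo"),
]
print(collect_stream(fake_stream))  # prints "Hello"
```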
Structured Output with Function Calling
DeepSeek V4 supports OpenAI-compatible function calling for structured outputs:
import json

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "user", "content": "Extract the company name, revenue, and year from: 'Acme Corp reported $12.5M revenue in 2025.'"}
    ],
    # tools / tool_choice replace the deprecated functions / function_call parameters
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_financial_data",
            "description": "Extract structured financial information",
            "parameters": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "revenue": {"type": "number"},
                    "year": {"type": "integer"}
                },
                "required": ["company", "revenue", "year"]
            }
        }
    }],
    # Force the model to call this specific function
    tool_choice={"type": "function", "function": {"name": "extract_financial_data"}}
)

# The arguments arrive as a JSON string on the first tool call
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(args)  # e.g. {'company': 'Acme Corp', 'revenue': 12500000, 'year': 2025}
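Model-generated arguments are not guaranteed to match the schema, so it is worth validating them before use. The following is a minimal sketch with the standard library only; parse_financial_args and EXPECTED_FIELDS are hypothetical names that mirror the schema in the request above.

```python
import json

# Expected fields and Python types, mirroring the schema in the request above.
EXPECTED_FIELDS = {"company": str, "revenue": (int, float), "year": int}


def parse_financial_args(raw_arguments: str) -> dict:
    """Parse and validate the JSON arguments string from a function call."""
    data = json.loads(raw_arguments)  # raises json.JSONDecodeError on bad JSON
    if not isinstance(data, dict):
        raise ValueError("arguments must be a JSON object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"wrong type for {field}: {type(value).__name__}")
    return data


sample = '{"company": "Acme Corp", "revenue": 12500000, "year": 2025}'
print(parse_financial_args(sample)["company"])  # prints "Acme Corp"
```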
Embedding Generation
# Assumes the embeddings endpoint accepts this model name; confirm in the API docs
response = client.embeddings.create(
    model="deepseek-v4-flash",
    input="DeepSeek V4 supports 1M token context windows."
)
embedding = response.data[0].embedding
print(f"Dimension: {len(embedding)}")  # the dimension depends on the model
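Embeddings are usually compared with cosine similarity. The function below is a plain-Python sketch of that metric and is independent of any particular embedding model; for large batches a vectorized library would be the more common choice.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # prints 0.0 (orthogonal)
```

Values near 1 indicate similar texts, values near 0 indicate unrelated ones.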
Self-Hosted Deployment
DeepSeek publishes model weights on Hugging Face under the MIT license, enabling self-hosted deployment on private infrastructure.
Hardware Requirements
| Model | Minimum GPU | Recommended GPU | VRAM (FP16) | VRAM (INT4) |
|---|---|---|---|---|
| V4-Flash | 1x A100 80GB | 2x A100 80GB | ~160 GB | ~45 GB |
| V4-Pro | 4x A100 80GB | 8x A100 80GB | ~800 GB | ~200 GB |
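A rough way to sanity-check VRAM figures is to multiply the total parameter count by the bytes stored per parameter. The sketch below assumes that all experts must be resident in memory, which is typical for Mixture-of-Experts serving but is not confirmed here for V4, and it ignores KV cache and activation memory. The function name is illustrative.

```python
def weight_memory_gb(total_params: int, bits_per_param: int) -> float:
    """Weight-only memory in decimal GB; excludes KV cache and activations."""
    return total_params * bits_per_param / 8 / 1e9


FLASH_PARAMS = 284_000_000_000   # 284B total, from the model table
PRO_PARAMS = 1_600_000_000_000   # 1.6T total, from the model table

for name, params in [("V4-Flash", FLASH_PARAMS), ("V4-Pro", PRO_PARAMS)]:
    for bits in (16, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{gb:,.0f} GB of weights")
```

This gives roughly 568 GB (16-bit) and 142 GB (4-bit) for V4-Flash, and 3,200 GB and 800 GB for V4-Pro. These weight-only estimates are higher than the figures in the table, so it is worth confirming actual requirements against the model card for the checkpoint format you download.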
Deployment with vLLM
vLLM provides optimized inference for DeepSeek V4 with PagedAttention and continuous batching:
# Install vLLM with DeepSeek support (vLLM >= 0.8.0 required)
pip install vllm
# Serve V4-Flash with tensor parallelism across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V4-Flash \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --port 8000
Query the local endpoint with the same OpenAI client:
local_client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000/v1"
)
response = local_client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Hello from local deployment!"}]
)
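Since the hosted and local endpoints differ only in base URL, API key, and model name, switching between them can be reduced to one configuration function. In the sketch below, DEEPSEEK_USE_LOCAL is a hypothetical environment variable invented for illustration; the URLs and model names are the ones used earlier in this guide.

```python
import os
from typing import Mapping

HOSTED_BASE_URL = "https://api.deepseek.com"
LOCAL_BASE_URL = "http://localhost:8000/v1"


def client_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Choose endpoint, key, and model name from environment variables.

    DEEPSEEK_USE_LOCAL is a hypothetical variable name, not a DeepSeek convention.
    """
    if env.get("DEEPSEEK_USE_LOCAL") == "1":
        return {
            "base_url": LOCAL_BASE_URL,
            "api_key": "not-needed",
            "model": "deepseek-ai/DeepSeek-V4-Flash",
        }
    return {
        "base_url": HOSTED_BASE_URL,
        "api_key": env.get("DEEPSEEK_API_KEY", ""),
        "model": "deepseek-v4-flash",
    }


print(client_settings({"DEEPSEEK_USE_LOCAL": "1"})["base_url"])
```

The returned base_url and api_key can be passed straight to OpenAI(...), and the model value to chat.completions.create.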
Deployment with Docker
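FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
# The CUDA base image ships without Python, so install it before vLLM
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install vllm
# Model weights are downloaded when the container starts
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
    "--model", "deepseek-ai/DeepSeek-V4-Flash", \
    "--tensor-parallel-size", "2", \
    "--max-model-len", "131072"]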
# Build and run
docker build -t deepseek-v4-flash .
# --ipc=host gives vLLM the shared memory it needs for tensor parallelism
docker run --gpus all --ipc=host -p 8000:8000 deepseek-v4-flash
Kubernetes Deployment
# deepseek-v4-flash-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v4-flash
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-v4-flash
  template:
    metadata:
      labels:
        app: deepseek-v4-flash
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "deepseek-ai/DeepSeek-V4-Flash"
            - "--tensor-parallel-size"
            - "2"
            - "--max-model-len"
            - "131072"
            - "--gpu-memory-utilization"
            - "0.95"
          resources:
            limits:
              nvidia.com/gpu: 2
          ports:
            - containerPort: 8000
Optimization Techniques
Quantization
INT4 quantization stores weights in 4 bits instead of 16, cutting weight memory by roughly 4x relative to FP16, usually at a small cost in output quality:
# Download quantized weights from Hugging Face
# Models are available in FP16, INT8, and INT4 formats
# Serve with INT4 via vLLM
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V4-Flash-INT4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536
Output Caching
For applications with repeated queries, exact-match caching avoids paying for the same prompt twice (a semantic cache would instead match on embedding similarity):
import hashlib

cache = {}

def cached_completion(prompt: str, model: str = "deepseek-v4-flash") -> str:
    # Key on model and prompt so different models never share an entry
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    if key in cache:
        return cache[key]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    text = response.choices[0].message.content
    cache[key] = text
    return text
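A plain dictionary cache like the one above grows without bound in a long-running service. One common remedy is a least-recently-used (LRU) limit, sketched below with the standard library. PromptCache is a hypothetical class; the completion call is injected as a function so the cache can be exercised without network access.

```python
import hashlib
from collections import OrderedDict
from typing import Callable


class PromptCache:
    """Exact-match LRU cache for completions, keyed on model and prompt."""

    def __init__(self, complete: Callable[[str, str], str], max_entries: int = 1000):
        self._complete = complete  # called as complete(prompt, model) on a miss
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, str]" = OrderedDict()

    def get(self, prompt: str, model: str = "deepseek-v4-flash") -> str:
        key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        text = self._complete(prompt, model)
        self._entries[key] = text
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used
        return text


# Stand-in completion function; a real one would call the chat API.
cache = PromptCache(lambda prompt, model: f"[{model}] {prompt.upper()}", max_entries=2)
print(cache.get("hello"))  # prints "[deepseek-v4-flash] HELLO"
```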
Pricing Comparison
| Provider | Model | Input / 1M tokens | Output / 1M tokens | Context |
|---|---|---|---|---|
| DeepSeek | V4-Flash | $0.14 | $0.28 | 1M |
| DeepSeek | V4-Pro | $1.74 | $3.48 | 1M |
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| Anthropic | Claude Opus 4.6 | $15.00 | $75.00 | 200K |
| Google | Gemini 2.5 Pro | $1.25 | $5.00 | 1M |
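At V4-Flash pricing ($0.14 per 1M input tokens), a million-token document costs $0.14 in input tokens to process, roughly 100x less than the $15.00 listed above for Claude Opus 4.6. Output tokens are billed separately, and whether the cheaper model gives comparable quality depends on the task.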
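To budget a real workload, both input and output tokens need to be counted. The estimator below is a sketch that uses only the DeepSeek prices from the table above; estimate_cost is a hypothetical helper, and actual billing may include factors (such as cache discounts) that are not modeled here.

```python
# USD per 1M tokens as (input, output), copied from the pricing table above.
PRICES = {
    "V4-Flash": (0.14, 0.28),
    "V4-Pro": (1.74, 3.48),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# A 1M-token document summarized into 2,000 output tokens on V4-Flash.
print(f"${estimate_cost('V4-Flash', 1_000_000, 2_000):.4f}")  # prints "$0.1406"
```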