Introduction
DeepSeek has become one of the most influential AI companies in 2026, challenging OpenAI and Anthropic with open-source models that match or exceed proprietary alternatives. On April 24, 2026, DeepSeek released V4, a family of Mixture-of-Experts models under the MIT license with 1M-token context windows and strong coding benchmarks. V4-Pro achieves 93.5 on LiveCodeBench, the highest score in the benchmark comparison in this guide, and 80.6% on SWE-bench Verified, 0.2 points behind Opus 4.6 Max.
This guide covers the complete DeepSeek model family with a focus on V4, provides Python API integration code using the OpenAI-compatible endpoint, explains deployment with vLLM and Docker for self-hosted scenarios, and includes the July 2026 model name migration timeline.
Model Family Overview
DeepSeek V4 (2026)
Released April 24, 2026, V4 represents a generational leap over V3.2. The architecture uses a hybrid CSA+HCA (Compressed Sparse Attention + Hierarchical Context Attention) mechanism that reduces FLOPs to 27% and KV cache to 10% of V3.2 at long context lengths.
| Model | Parameters | Active per Token | Context | Price (Input / 1M tokens) | SWE-bench Verified |
|---|---|---|---|---|---|
| V4-Flash | 284B | 13B | 1M tokens | $0.14 | 72.1% |
| V4-Pro | 1.6T | 49B | 1M tokens | $1.74 | 80.6% |
| V4-Pro-Max | 1.6T | 49B | 1M tokens | $3.48 | 80.6% |
V4-Flash is optimized for cost-sensitive production workloads. V4-Pro targets complex reasoning, code generation, and research tasks. Both use the Muon optimizer during training, which contributed to 2x training efficiency over V3.2.
Previous Generation Models
DeepSeek V3.2 (late 2025): The immediate predecessor to V4. Still functional but being phased out. Uses 64K token context and the older MoE architecture without CSA+HCA attention.
DeepSeek R1 (January 2025): Reasoning-focused model with chain-of-thought capabilities. Achieved performance comparable to OpenAI o1 on math and logic benchmarks. R1 continues to be available for applications that benefit from explicit reasoning traces.
Janus Pro: Multimodal model that supports image understanding (scene description, OCR, chart interpretation) and image generation. It uses a decoupled architecture in which visual encoding and language processing run through independent pathways, which prevents modality interference and improves both understanding and generation quality.
V4 Architecture Deep Dive
The V4 series introduces several structural innovations that enable its efficiency:
CSA+HCA Attention Mechanism: V4 combines Compressed Sparse Attention (CSA) with Hierarchical Context Attention (HCA). CSA reduces FLOPs by sparsifying attention computations — each token only attends to a subset of relevant tokens rather than the full sequence. HCA creates a hierarchical context representation that caches compressed summaries at multiple granularities, reducing KV cache memory to 10% of V3.2 at long context lengths.
Multi-Token Prediction (MTP): V4 predicts multiple future tokens simultaneously during training, improving sample efficiency and enabling faster inference through speculative decoding.
Thinking Mode Architecture: V4 supports three reasoning levels per model:
| Mode | Description | Best For | Latency Impact |
|---|---|---|---|
| Non-Thinking | Direct generation without explicit reasoning | Simple QA, classification, extraction | Fastest |
| High | Chain-of-thought reasoning with moderate depth | Code generation, analysis, planning | Moderate |
| Max | Extended reasoning with deep search of solution space | Math proofs, complex debugging, competition problems | Highest |
This hybrid thinking/non-thinking design means a single model serves both fast-path and deep-reasoning use cases, eliminating the need for separate models like R1.
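To make the sparsification idea behind CSA concrete, here is a toy top-k attention sketch in plain Python. It is not DeepSeek's implementation: the rule of keeping the `top_k` keys with the highest dot-product score, and every function name, are assumptions chosen for illustration. A real kernel would also avoid scoring every key; this version scores all of them for clarity and then mixes only the selected values.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_sparse_attention(query, keys, values, top_k):
    """Attend only to the top_k keys with the highest scaled dot-product score."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Indices of the best-scoring keys; every other position is skipped
    kept = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in kept])
    dim = len(values[0])
    output = [0.0] * dim
    for w, i in zip(weights, kept):
        for d in range(dim):
            output[d] += w * values[i][d]
    return output, sorted(kept)

keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
out, kept = topk_sparse_attention([1.0, 0.0], keys, values, top_k=2)
print(kept, out)
```

Setting `top_k` to the number of keys recovers ordinary dense attention, which makes the sketch easy to compare against a full-attention baseline.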
Benchmark Performance
| Benchmark (Metric) | V4-Flash High | V4-Flash Max | V4-Pro High | V4-Pro Max | Opus 4.6 Max | GPT-5.4 xHigh |
|---|---|---|---|---|---|---|
| MMLU-Pro (EM) | 86.4 | 86.2 | 87.1 | 87.5 | 89.1 | 87.5 |
| GPQA Diamond (Pass@1) | 87.4 | 88.1 | 89.1 | 90.1 | 91.3 | 93.0 |
| LiveCodeBench (Pass@1) | 88.4 | 91.6 | 89.8 | 93.5 | 88.8 | — |
| SWE-bench Verified | 72.1 | — | 80.6 | 80.6 | 80.8 | — |
| Codeforces (Rating) | 3,052 | 3,052 | 3,206 | 3,206 | 3,168 | 3,052 |
| HMMT 2026 Feb (Pass@1) | 91.9 | 94.8 | 94.0 | 95.2 | 96.2 | 97.7 |
| IMOAnswerBench (Pass@1) | 85.1 | 88.4 | 88.0 | 89.8 | 75.3 | 91.4 |
V4-Pro Max achieves the highest LiveCodeBench score (93.5) in this comparison, and its SWE-bench Verified score (80.6%) is within 0.2 points of Opus 4.6 Max (80.8%). V4-Flash Max achieves 91.6 on LiveCodeBench at a fraction of the cost, which makes it the most cost-effective coding option in the table.
API Integration
DeepSeek provides an OpenAI-compatible API. The same Python openai client library works by changing the base_url and API key.
Migration Alert: July 24, 2026 Cutoff
The legacy model names deepseek-chat and deepseek-reasoner will be fully retired on July 24, 2026 at 15:59 UTC. After this date, requests using those names will return errors. Replace them with deepseek-v4-flash and deepseek-v4-pro.
# BAD — will stop working after July 24, 2026
model = "deepseek-chat" # deprecated
# GOOD — use V4 model identifiers
model = "deepseek-v4-flash" # cost-optimized
model = "deepseek-v4-pro" # maximum capability
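A small helper can catch legacy names before the cutoff. The mapping below (`deepseek-chat` to `deepseek-v4-flash`, `deepseek-reasoner` to `deepseek-v4-pro`) is an assumption based on the order in which the names are listed above, and `resolve_model` is a hypothetical name; confirm the mapping against the official migration notes before relying on it.

```python
import warnings

# Assumed mapping; verify against DeepSeek's migration notes
LEGACY_MODEL_MAP = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-pro",
}

def resolve_model(name: str) -> str:
    """Return the V4 identifier for a legacy name, or the name unchanged."""
    replacement = LEGACY_MODEL_MAP.get(name)
    if replacement is None:
        return name
    warnings.warn(
        f"{name!r} is a legacy model name; using {replacement!r} instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return replacement

print(resolve_model("deepseek-v4-pro"))
```

Routing every `model=` argument through one function like this keeps the rename to a single place in the codebase.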
Basic Chat Completion
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted lists."}
    ],
    max_tokens=1024,
    temperature=0.3
)
print(response.choices[0].message.content)
Streaming Response
For interactive applications, enable streaming to receive tokens as they are generated:
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain how transformers work in 3 paragraphs."}],
    stream=True
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip them
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
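When the full text is also needed after streaming (for logging or caching), the deltas can be accumulated as they arrive. The sketch below assumes only the chunk shape used above (`chunk.choices[0].delta.content`). `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a real API stream so the example runs offline.

```python
from types import SimpleNamespace

def collect_stream(chunks, on_token=None):
    """Join streamed deltas into one string, skipping empty chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_token is not None:
                on_token(text)  # e.g. print the token as it arrives
    return "".join(parts)

def fake_chunk(text):
    """Build an object shaped like a streaming chunk (offline stand-in)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

demo_chunks = [
    fake_chunk("Hel"),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("lo"),
]
full_text = collect_stream(demo_chunks)
print(full_text)
```

With a real stream, passing `on_token=lambda t: print(t, end="", flush=True)` would reproduce the printing loop above while still returning the complete text.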
Structured Output with Function Calling
DeepSeek V4 supports OpenAI-compatible function calling for structured outputs:
import json

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "user", "content": "Extract the company name, revenue, and year from: 'Acme Corp reported $12.5M revenue in 2025.'"}
    ],
    # `tools` / `tool_choice` replace the deprecated `functions` / `function_call` parameters
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_financial_data",
            "description": "Extract structured financial information",
            "parameters": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "revenue": {"type": "number"},
                    "year": {"type": "integer"}
                },
                "required": ["company", "revenue", "year"]
            }
        }
    }],
    tool_choice={"type": "function", "function": {"name": "extract_financial_data"}}
)
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(args)  # e.g. {'company': 'Acme Corp', 'revenue': 12500000, 'year': 2025}
Embedding Generation
response = client.embeddings.create(
    model="deepseek-v4-flash",  # embeddings supported on Flash tier
    input="DeepSeek V4 supports 1M token context windows."
)
embedding = response.data[0].embedding
print(f"Dimension: {len(embedding)}")  # typically 2048 or 4096
Self-Hosted Deployment
DeepSeek publishes model weights on Hugging Face under the MIT license, enabling self-hosted deployment on private infrastructure.
Hardware Requirements
| Model | Minimum GPU | Recommended GPU | VRAM (FP16) | VRAM (INT4) |
|---|---|---|---|---|
| V4-Flash | 1x A100 80GB | 2x A100 80GB | ~160 GB | ~45 GB |
| V4-Pro | 4x A100 80GB | 8x A100 80GB | ~800 GB | ~200 GB |
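A weights-only estimate from the parameter count is a useful cross-check when sizing hardware. This sketch assumes every parameter is stored at the stated bit width and ignores KV cache, activations, and runtime overhead; `weight_memory_gb` is a hypothetical helper. By this estimate, 16-bit weights alone come to roughly 568 GB for V4-Flash and 3,200 GB for V4-Pro, above the table's figures, so it is worth confirming the precision of the published checkpoint before committing to a GPU count.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# Parameter counts taken from the model table earlier in this guide
MODELS = {"V4-Flash": 284e9, "V4-Pro": 1.6e12}

for name, params in MODELS.items():
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{weight_memory_gb(params, bits):,.0f} GB")
```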
Deployment with vLLM
vLLM provides optimized inference for DeepSeek V4 with PagedAttention and continuous batching:
# Install vLLM with DeepSeek support (vLLM >= 0.8.0 required)
pip install vllm

# Serve V4-Flash with tensor parallelism across 2 GPUs
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V4-Flash \
    --tensor-parallel-size 2 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --port 8000
Query the local endpoint with the same OpenAI client:
local_client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000/v1"
)
response = local_client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Flash",
    messages=[{"role": "user", "content": "Hello from local deployment!"}]
)
Deployment with Docker
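FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04
# The CUDA runtime image ships without Python, so install it before vLLM
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install vllm
# The model is downloaded when the container starts
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
    "--model", "deepseek-ai/DeepSeek-V4-Flash", \
    "--tensor-parallel-size", "2", \
    "--max-model-len", "131072"]

# Build and run (shell commands, not part of the Dockerfile)
docker build -t deepseek-v4-flash .
docker run --gpus all -p 8000:8000 deepseek-v4-flash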
Kubernetes Deployment
# deepseek-v4-flash-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-v4-flash
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-v4-flash
  template:
    metadata:
      labels:
        app: deepseek-v4-flash
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: [
            "--model", "deepseek-ai/DeepSeek-V4-Flash",
            "--tensor-parallel-size", "2",
            "--max-model-len", "131072",
            "--gpu-memory-utilization", "0.95"
          ]
          resources:
            limits:
              nvidia.com/gpu: 2
          ports:
            - containerPort: 8000
Optimization Techniques
Quantization
INT4 quantization reduces memory requirements by approximately 4x with minimal quality loss:
# Download quantized weights from Hugging Face
# Models are available in FP16, INT8, and INT4 formats

# Serve with INT4 via vLLM
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-V4-Flash-INT4 \
    --tensor-parallel-size 1 \
    --max-model-len 65536
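The 4x figure follows from storing each weight in 4 bits instead of 16. The toy sketch below shows the basic round trip with symmetric per-tensor quantization in plain Python. It is illustrative only and is not the scheme used to produce the published INT4 weights, which is not described here; production quantizers typically work per group or per channel.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map 4-bit integers back to approximate float weights."""
    return [q * scale for q in quantized]

weights = [0.70, -0.30, 0.12, 0.0, -0.70]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized:", q)
print("max abs error:", round(max_error, 4))
print("compression vs 16-bit:", 16 / 4)
```

The rounding error of each weight is bounded by half the scale, which is why outliers (which inflate the scale) hurt accuracy and why finer-grained scales are common in practice.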
Output Caching
For applications with repeated queries, exact-match caching of responses reduces API costs:
import hashlib

cache = {}

def cached_completion(prompt: str, model: str = "deepseek-v4-flash") -> str:
    # Key on both model and prompt so different models never share an entry
    key = hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()
    if key in cache:
        return cache[key]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    text = response.choices[0].message.content
    cache[key] = text
    return text
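For a long-running service, the cache usually needs a size bound and a key that covers everything that can change the response. The sketch below is one way to do that with the standard library. `make_cache_key` and `LRUResponseCache` are hypothetical names, and the `complete` callable is an assumed wrapper around the API call, replaced here by a fake so the example runs offline.

```python
import hashlib
import json
from collections import OrderedDict

def make_cache_key(model, messages, **params):
    """Hash model, messages, and sampling parameters in a stable order."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class LRUResponseCache:
    """Exact-match cache that evicts the least recently used entry."""

    def __init__(self, complete, max_entries=1000):
        self._complete = complete  # callable(model, messages, **params) -> str
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, model, messages, **params):
        key = make_cache_key(model, messages, **params)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        text = self._complete(model, messages, **params)
        self._entries[key] = text
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry
        return text

calls = []

def fake_complete(model, messages, **params):
    """Offline stand-in for the real API call."""
    calls.append(model)
    return f"answer from {model}"

response_cache = LRUResponseCache(fake_complete, max_entries=2)
msgs = [{"role": "user", "content": "hi"}]
response_cache.get("deepseek-v4-flash", msgs)
response_cache.get("deepseek-v4-flash", msgs)  # served from cache
response_cache.get("deepseek-v4-pro", msgs)    # different model, new entry
print(len(calls))
```

Caching is only safe for deterministic use cases; with a nonzero temperature, a cached answer hides the variation a fresh call would produce.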
Advanced API Patterns
Thinking Mode Selection
DeepSeek V4 supports multiple reasoning effort levels. Control this through the system prompt:
# Non-thinking mode (instant response)
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Respond directly without reasoning."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# Max thinking mode (deep reasoning)
response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a math expert. Reason step by step and show your work."},
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
    temperature=0.6,
    max_tokens=4096
)
V4-Pro in Max thinking mode allocates additional computation to search the solution space more thoroughly. On LiveCodeBench, V4-Flash improves from 55.2% without thinking to 91.6% in Max mode, a gain of about 36 points from thinking mode alone.
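If the mode is steered through the system prompt as described above, a small builder keeps the wording in one place. This is a sketch under that assumption: `build_messages` is a hypothetical helper, the "high" instruction is invented for illustration, and only the non-thinking and max instructions mirror the examples above.

```python
# Instruction wording per mode is illustrative; the "high" entry is an assumption
MODE_INSTRUCTIONS = {
    "non-thinking": "Respond directly without reasoning.",
    "high": "Reason through the problem briefly before answering.",
    "max": "Reason step by step and show your work.",
}

def build_messages(mode: str, user_prompt: str,
                   persona: str = "You are a helpful assistant.") -> list[dict]:
    """Return a chat message list whose system prompt selects the reasoning mode."""
    if mode not in MODE_INSTRUCTIONS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_INSTRUCTIONS)}")
    system = f"{persona} {MODE_INSTRUCTIONS[mode]}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("max", "Prove that the square root of 2 is irrational.",
                          persona="You are a math expert.")
print(messages[0]["content"])
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`.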
Structured Output with JSON Mode
DeepSeek V4 supports constrained JSON output for reliable structured data extraction:
import json

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "Extract information as valid JSON only."},
        {"role": "user", "content": "Extract: John Doe, age 35, works at Acme Corp in San Francisco"}
    ],
    response_format={"type": "json_object"}
)
data = json.loads(response.choices[0].message.content)
print(data)  # {"name": "John Doe", "age": 35, "company": "Acme Corp", "location": "San Francisco"}
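JSON mode constrains the syntax of the reply, not its keys, so it helps to validate the parsed object before using it. The sketch below parses into a dataclass and fails loudly on a missing or mistyped field. The field names are taken from the example output above and are an assumption about what the model returns; `Person` and `parse_person` are hypothetical names.

```python
import json
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    company: str
    location: str

REQUIRED_FIELDS = ("name", "age", "company", "location")

def parse_person(raw: str) -> Person:
    """Parse model output and raise ValueError if it does not match the schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    age = data["age"]
    # bool is a subclass of int in Python, so reject it explicitly
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("age must be an integer")
    return Person(
        name=str(data["name"]),
        age=age,
        company=str(data["company"]),
        location=str(data["location"]),
    )

sample = '{"name": "John Doe", "age": 35, "company": "Acme Corp", "location": "San Francisco"}'
person = parse_person(sample)
print(person)
```

Naming the expected keys in the system prompt makes a validation failure much less likely in the first place.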
Batch Processing with Concurrent Requests
from concurrent.futures import ThreadPoolExecutor, as_completed
from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://api.deepseek.com")

def process_document(doc_text: str) -> dict:
    response = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": "Summarize in 2 sentences."},
            {"role": "user", "content": doc_text[:2000]}
        ],
        max_tokens=100
    )
    return {"summary": response.choices[0].message.content}

documents = ["Doc1 text...", "Doc2 text...", "Doc3 text..."]  # 1000+ documents

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(process_document, doc): doc for doc in documents}
    for future in as_completed(futures):
        result = future.result()
        print(result["summary"])
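In the loop above, results arrive in completion order and a single failed request raises out of `future.result()`. For large batches it is often more convenient to keep results aligned with the inputs and record failures per item. The sketch below shows one way to do that; `run_batch` is a hypothetical helper, and `fake_summarize` stands in for the API call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_batch(worker, items, max_workers=10):
    """Run worker over items concurrently; return results in input order."""
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {
            executor.submit(worker, item): index
            for index, item in enumerate(items)
        }
        for future in as_completed(future_to_index):
            index = future_to_index[future]
            try:
                results[index] = {"ok": True, "value": future.result()}
            except Exception as exc:  # record the failure and keep going
                results[index] = {"ok": False, "error": str(exc)}
    return results

def fake_summarize(text):
    """Offline stand-in for a summarization request."""
    if not text:
        raise ValueError("empty document")
    return text.upper()

batch = run_batch(fake_summarize, ["doc one", "", "doc three"], max_workers=3)
print([entry["ok"] for entry in batch])
```

Failed items can then be retried separately without rerunning the whole batch.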
RAG with DeepSeek Embeddings
DeepSeek V4-Flash provides embedding generation for RAG pipelines. Combined with V4’s 1M context window, you can build powerful retrieval-augmented systems:
import numpy as np
from openai import OpenAI

class DeepSeekRAG:
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key, base_url="https://api.deepseek.com")

    def embed(self, texts: list[str]) -> np.ndarray:
        response = self.client.embeddings.create(
            model="deepseek-v4-flash",
            input=texts
        )
        return np.array([d.embedding for d in response.data])

    def retrieve(self, query: str, documents: list[str], top_k: int = 3) -> list[tuple[str, float]]:
        query_emb = self.embed([query])
        doc_embs = self.embed(documents)
        # Normalize so the dot product is cosine similarity, not raw magnitude
        query_emb = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
        doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
        scores = np.dot(doc_embs, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(documents[i], float(scores[i])) for i in top_indices]

    def generate(self, query: str, context: str) -> str:
        response = self.client.chat.completions.create(
            model="deepseek-v4-pro",
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content

rag = DeepSeekRAG(api_key="...")
docs = [
    "DeepSeek V4 was released on April 24, 2026 under MIT license.",
    "V4 uses Mixture-of-Experts with 1.6T total parameters.",
    "The model supports 1M token context windows."
]
results = rag.retrieve("When was DeepSeek V4 released?", docs)
answer = rag.generate("When was DeepSeek V4 released?", "\n".join([d for d, _ in results]))
print(answer)
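The ranking step can be checked offline with toy vectors before any API call is involved. The sketch below uses cosine similarity, which divides out vector length so that a long embedding does not outrank a better-aligned short one. The vectors are made up for illustration, and `rank_documents` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Document 1 has a much larger norm but points in a less similar direction
toy_docs = [[1.0, 0.0, 0.0], [10.0, 10.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]
ranking = rank_documents(toy_query, toy_docs, top_k=2)
print([index for index, _ in ranking])
```

With a raw dot product, document 1 would score 11 against document 0's score of 1 purely because of its length; cosine similarity ranks document 0 first.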
Fine-Tuning DeepSeek V4
DeepSeek V4 weights (MIT license) support fine-tuning for domain-specific adaptation:
# Using Hugging Face Transformers + PEFT
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model_name = "deepseek-ai/DeepSeek-V4-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load with 4-bit quantization for memory efficiency (requires bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)

# Apply LoRA adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Load and prepare dataset
dataset = load_dataset("json", data_files="training_data.jsonl")

def format_example(example):
    # Render each instruction/response pair with the model's chat template
    return {
        "input_ids": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["instruction"]},
             {"role": "assistant", "content": example["response"]}],
            tokenize=True
        )
    }

dataset = dataset.map(format_example)
# Train (requires GPU cluster)
# from transformers import Trainer, TrainingArguments
# training_args = TrainingArguments(
#     output_dir="./deepseek-v4-finetuned",
#     per_device_train_batch_size=1,
#     gradient_accumulation_steps=16,
#     num_train_epochs=3,
#     learning_rate=2e-4,
#     logging_steps=10,
#     save_strategy="epoch",
# )
# trainer = Trainer(model=model, args=training_args, train_dataset=dataset["train"])
# trainer.train()
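The reason LoRA makes this tractable is the parameter count: for a linear layer with `d_in` inputs and `d_out` outputs, rank-`r` adapters add `r * (d_in + d_out)` trainable parameters instead of `d_in * d_out`. The sketch below works through that arithmetic. The hidden size of 4096 and the square projection shapes are hypothetical placeholders, not V4's actual dimensions.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one rank-r LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

def lora_fraction(d_in: int, d_out: int, r: int) -> float:
    """Adapter size relative to the frozen d_out x d_in weight matrix."""
    return lora_params(d_in, d_out, r) / (d_in * d_out)

hidden = 4096  # hypothetical hidden size, for illustration only
rank = 16      # matches r=16 in the LoraConfig above

# Four adapted projections per layer: q_proj, k_proj, v_proj, o_proj
per_layer = 4 * lora_params(hidden, hidden, rank)
print(f"trainable params per layer: {per_layer:,}")
print(f"fraction of each projection: {lora_fraction(hidden, hidden, rank):.4%}")
```

Doubling `r` doubles the adapter size, so rank is the main knob for trading capacity against memory.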
Multi-Node Deployment
For V4-Pro (1.6T parameters), single-node inference requires 8x A100 80GB GPUs. For production throughput, the configuration below spans two such nodes, using tensor parallelism within each node and pipeline parallelism across nodes:
# docker-compose multi-node vLLM
# vLLM coordinates multi-node inference through Ray, so both containers
# must join the same Ray cluster before the server can use all 16 GPUs
version: '3.8'
services:
  vllm-node1:
    image: vllm/vllm-openai:latest
    command: [
      "--model", "deepseek-ai/DeepSeek-V4-Pro",
      "--tensor-parallel-size", "8",
      "--pipeline-parallel-size", "2",
      "--max-model-len", "131072",
      "--gpu-memory-utilization", "0.90",
      "--port", "8000"
    ]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    networks:
      - deepseek-net
    environment:
      - NCCL_SOCKET_IFNAME=eth0
      - NCCL_IB_DISABLE=0
  vllm-node2:
    image: vllm/vllm-openai:latest
    command: [
      "--model", "deepseek-ai/DeepSeek-V4-Pro",
      "--tensor-parallel-size", "8",
      "--pipeline-parallel-size", "2",
      "--max-model-len", "131072",
      "--gpu-memory-utilization", "0.90",
      "--port", "8000"
    ]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 8
              capabilities: [gpu]
    networks:
      - deepseek-net
networks:
  deepseek-net:
    driver: overlay
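The parallelism flags determine how many GPUs the deployment needs: the total is the tensor-parallel size multiplied by the pipeline-parallel size. A quick check like the sketch below can catch a mismatch before launch. `check_layout` is a hypothetical helper, and keeping each tensor-parallel group within one node is a common recommendation rather than a hard requirement.

```python
def required_gpus(tensor_parallel: int, pipeline_parallel: int) -> int:
    """Total GPUs used by one model replica."""
    return tensor_parallel * pipeline_parallel

def check_layout(tensor_parallel: int, pipeline_parallel: int,
                 nodes: int, gpus_per_node: int) -> int:
    """Validate a parallelism layout against the cluster; return GPUs needed."""
    needed = required_gpus(tensor_parallel, pipeline_parallel)
    available = nodes * gpus_per_node
    if needed > available:
        raise ValueError(f"layout needs {needed} GPUs but only {available} are available")
    if tensor_parallel > gpus_per_node:
        # Tensor parallelism is communication-heavy, so spanning nodes is usually slow
        raise ValueError("tensor-parallel group would span multiple nodes")
    return needed

# Values from the compose file above: TP=8, PP=2 on two 8-GPU nodes
print(check_layout(8, 2, nodes=2, gpus_per_node=8))
```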
Pricing Comparison
| Provider | Model | Input / 1M tokens | Output / 1M tokens | Context |
|---|---|---|---|---|
| DeepSeek | V4-Flash | $0.14 | $0.28 | 1M |
| DeepSeek | V4-Pro | $1.74 | $3.48 | 1M |
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| Anthropic | Claude Opus 4.6 | $15.00 | $75.00 | 200K |
| Google | Gemini 2.5 Pro | $1.25 | $5.00 | 1M |
At V4-Flash pricing ($0.14 per 1M input tokens), a million-token document costs $0.14 to process, roughly 100x less than Claude Opus 4.6 charges for the same input ($15.00), which matters most on tasks where output quality is comparable.
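Per-request cost follows directly from the table: multiply input and output token counts by their per-million prices and add the two. The sketch below uses the DeepSeek prices listed above; `request_cost` is a hypothetical helper, and the token counts in the second example are made up for illustration.

```python
# (input price, output price) in USD per 1M tokens, from the pricing table above
PRICES_PER_MILLION = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (1.74, 3.48),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A full 1M-token document with no output
print(f"${request_cost('deepseek-v4-flash', 1_000_000, 0):.2f}")
# A hypothetical 200K-token prompt with a 4K-token answer on V4-Pro
print(f"${request_cost('deepseek-v4-pro', 200_000, 4_000):.4f}")
```

Output tokens cost twice as much as input tokens on both models, so long generations can dominate the bill even when prompts are short.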
Resources
- DeepSeek Official Website
- DeepSeek API Documentation — OpenAI-compatible API reference
- DeepSeek V4 Release Notes — V4-Pro and V4-Flash announcement
- DeepSeek Models on Hugging Face — Model weights under MIT license
- vLLM DeepSeek V4 Guide — Optimized inference deployment
- DeepSeek Discord Community