Introduction
Anthropic’s Claude API provides access to the Claude model family across three tiers: Haiku (fastest, cheapest), Sonnet (default production tier), and Opus (maximum reasoning capability). As of May 2026, the current generation includes Opus 4.7 ($5/$25 per million tokens), Sonnet 4.6 ($3/$15), and Haiku 4.5 ($1/$5). All three support 1M token context windows (200K for Haiku), vision, function calling, and extended thinking.
This guide covers Python integration for all model tiers, streaming, vision with image analysis, prompt caching for up to 90% cost reduction on repeated context, batch API at 50% discount, extended thinking for complex reasoning tasks, and production deployment patterns.
Model Overview and Pricing
| Model | Release | Input / 1M | Output / 1M | Context | Caching (cache read) | Best For |
|---|---|---|---|---|---|---|
| Opus 4.7 | May 2026 | $5.00 | $25.00 | 1M | $0.50 | Complex reasoning, agents, vision |
| Opus 4.6 | Feb 2026 | $5.00 | $25.00 | 1M | $0.50 | Production code, analysis |
| Sonnet 4.6 | Feb 2026 | $3.00 | $15.00 | 1M | $0.30 | Default production tier |
| Haiku 4.5 | Late 2025 | $1.00 | $5.00 | 200K | $0.10 | High-volume, low-latency |
Prompt caching reduces repeated input tokens by up to 90% on cache reads. Batch API provides 50% discount for asynchronous workloads.
All pricing per Anthropic’s official rates as of May 2026. No changes since the 4.6 family launch.
Basic Setup
pip install anthropic
from anthropic import Anthropic
client = Anthropic(
api_key="sk-ant-api03-..."
)
Chat Completion with Streaming
For interactive applications, enable streaming to show tokens as they arrive:
with client.messages.stream(
model="claude-sonnet-4-20260514",
max_tokens=1024,
messages=[
{"role": "user", "content": "Explain how attention works in transformers."}
]
) as stream:
for text in stream.text_stream:
print(text, end="", flush=True)
Vision: Image Analysis
Claude can analyze images passed as base64-encoded data:
import base64
with open("architecture-diagram.png", "rb") as f:
image_data = base64.b64encode(f.read()).decode("utf-8")
response = client.messages.create(
model="claude-sonnet-4-20260514",
max_tokens=1024,
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the data flow in this architecture diagram."
},
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": image_data
}
}
]
}
]
)
print(response.content[0].text)
Opus 4.7 supports up to 3.75 MP images (triple the pixel budget of Opus 4.6), making it significantly better at UI screenshots, dense diagrams, and small text in images.
Prompt Caching
Prompt caching stores recently processed context, reducing cost on repeated prefixes such as system prompts or document corpora:
response = client.messages.create(
model="claude-sonnet-4-20260514",
max_tokens=256,
system=[
{
"type": "text",
"text": "You are a legal document analyst. Review the attached contract.",
# Cache the system prompt — it's reused across many requests
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": contract_text,
# Cache the contract text across multiple questions
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "What are the termination clauses?"
}
]
}
]
)
# Check cache hit/miss in response headers
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation: {usage.cache_creation_input_tokens}")
print(f"Cache read: {usage.cache_read_input_tokens}")
Cost comparison for a typical legal analysis workflow (100-page contract, 10 questions):
Without caching: 10 × $0.15 = $1.50 (full contract sent each time)
With caching: $0.19 (first request creates cache) + 9 × $0.02 = $0.37
Savings: ~75%
The cache has a 5-minute TTL (configurable to 1 hour with 2x write cost). Each cache hit costs 0.1x base input rate — a 90% discount on cached tokens.
Batch API
For high-volume, non-urgent workloads, the batch API provides 50% discount:
# Submit a batch of requests (up to 100K requests per batch)
batch = client.batches.create(
requests=[
{
"custom_id": "req-001",
"params": {
"model": "claude-sonnet-4-20260514",
"max_tokens": 256,
"messages": [
{"role": "user", "content": "Summarize: " + doc}
]
}
}
for doc in documents # list of 1000 documents
]
)
# Poll until complete
import time
while True:
batch_status = client.batches.retrieve(batch.id)
if batch_status.processing_status == "ended":
break
time.sleep(10)
# Retrieve results
for result in client.batches.results(batch.id):
print(f"{result.custom_id}: {result.response.body['content'][0]['text']}")
Extended Thinking
For complex reasoning tasks, Claude’s extended thinking mode lets the model show its reasoning process before producing the final answer:
response = client.messages.create(
model="claude-sonnet-4-20260514",
max_tokens=4096,
thinking={
"type": "enabled",
"budget_tokens": 2048 # How many tokens the model can use for thinking
},
messages=[
{
"role": "user",
"content": "Solve this: A train leaves station A at 60 mph. "
"Another train leaves station B at 90 mph. "
"The stations are 300 miles apart. "
"When and where do they meet?"
}
]
)
# The thinking block contains the model's reasoning
thinking_block = response.content[0].thinking
print(f"Reasoning: {thinking_block}")
# The text block contains the final answer
final_answer = response.content[0].text
print(f"Answer: {final_answer}")
Set budget_tokens to allocate how much processing the model should spend on reasoning before generating the response. Higher budgets produce more thorough analysis for complex problems (math proofs, legal analysis, multi-step coding).
Function Calling
import json
response = client.messages.create(
model="claude-sonnet-4-20260514",
max_tokens=512,
tools=[
{
"name": "get_weather",
"description": "Get the weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": {"type": "string"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
],
messages=[
{"role": "user", "content": "What's the weather in Tokyo?"}
]
)
# Check if Claude wants to call a tool
for block in response.content:
if block.type == "tool_use":
tool_name = block.name
tool_input = block.input
print(f"Calling {tool_name} with: {json.dumps(tool_input)}")
Production Routing Strategy
Route requests across model tiers to optimize cost/quality:
def route_request(prompt: str, complexity: str):
"""Route to the appropriate Claude model based on complexity."""
if complexity == "simple":
model = "claude-haiku-4-20251022" # $1/$5 — fast, cheap
elif complexity == "standard":
model = "claude-sonnet-4-20260514" # $3/$15 — default tier
elif complexity == "complex":
model = "claude-opus-4-20260515" # $5/$25 — maximum quality
else:
raise ValueError(f"Unknown complexity: {complexity}")
return client.messages.create(
model=model,
max_tokens=1024,
messages=[{"role": "user", "content": prompt}]
)
# Applying routing can cut 30-40% vs running everything on Sonnet.
# Use Haiku for classification/summarization,
# Sonnet for standard generation,
# Opus only for tasks that measurably benefit.
Resources
- Anthropic API Documentation — Complete API reference
- Claude API Pricing — Official pricing per model
- Prompt Caching Guide — Cache setup, TTL, cost calc
- Batch API Guide — Async 50% discount
- Extended Thinking — Reasoning budget configuration
- Vision Capabilities — Image analysis, media types, limits
Comments