
Claude API Complete Guide 2026: Opus 4.7, Sonnet 4.6, Haiku 4.5 — Integration and Best Practices

Created: March 2, 2026 Larry Qu 5 min read

Introduction

Anthropic’s Claude API provides access to the Claude model family across three tiers: Haiku (fastest, cheapest), Sonnet (default production tier), and Opus (maximum reasoning capability). As of May 2026, the current generation includes Opus 4.7 ($5/$25 per million input/output tokens), Sonnet 4.6 ($3/$15), and Haiku 4.5 ($1/$5). Opus and Sonnet support 1M-token context windows, while Haiku supports 200K; all three support vision, function calling, and extended thinking.

This guide covers Python integration for all model tiers, streaming, vision with image analysis, prompt caching for up to 90% cost reduction on repeated context, batch API at 50% discount, extended thinking for complex reasoning tasks, and production deployment patterns.

Model Overview and Pricing

Model       Release    Input / 1M  Output / 1M  Context  Cache read / 1M  Best for
Opus 4.7    May 2026   $5.00       $25.00       1M       $0.50            Complex reasoning, agents, vision
Opus 4.6    Feb 2026   $5.00       $25.00       1M       $0.50            Production code, analysis
Sonnet 4.6  Feb 2026   $3.00       $15.00       1M       $0.30            Default production tier
Haiku 4.5   Late 2025  $1.00       $5.00        200K     $0.10            High-volume, low-latency

Prompt caching reduces the cost of repeated input tokens by up to 90% on cache reads. The Batch API provides a 50% discount for asynchronous workloads.

All prices follow Anthropic’s official rates as of May 2026 and have not changed since the 4.6 family launch.
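To make the table concrete, here is a minimal sketch of a per-request cost estimate. The tier keys (`opus-4.7`, `sonnet-4.6`, `haiku-4.5`) and the `estimate_cost` helper are illustrative names, not API model IDs or SDK functions; the prices are copied from the table above, and caching is ignored.

```python
# Prices in USD per million tokens, copied from the pricing table.
# The dictionary keys are illustrative labels, not API model IDs.
PRICES = {
    "opus-4.7": {"input": 5.00, "output": 25.00},
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
    "haiku-4.5": {"input": 1.00, "output": 5.00},
}

BATCH_DISCOUNT = 0.5  # Batch API: 50% off, per the note above


def estimate_cost(tier: str, input_tokens: int, output_tokens: int,
                  batch: bool = False) -> float:
    """Return the estimated USD cost of one request, ignoring caching."""
    price = PRICES[tier]
    cost = (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000
    return cost * BATCH_DISCOUNT if batch else cost


# 10K input tokens and 1K output tokens on the Sonnet tier
print(f"${estimate_cost('sonnet-4.6', 10_000, 1_000):.4f}")
```

Passing `batch=True` halves the result, which is a quick way to see whether a workload is worth moving to the asynchronous path.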

Basic Setup

pip install anthropic

import os

from anthropic import Anthropic

# Read the key from the environment instead of hard-coding it in source.
client = Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"]
)

Chat Completion with Streaming

For interactive applications, enable streaming to show tokens as they arrive:

with client.messages.stream(
    model="claude-sonnet-4-20260514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain how attention works in transformers."}
    ]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
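Interactive applications often need the complete reply after streaming finishes, for example to store it in a conversation history. A minimal sketch, assuming the chunks are plain strings as yielded by `stream.text_stream`; `print_and_collect` is a hypothetical helper, and a plain list stands in for the live stream so the example runs offline.

```python
from typing import Iterable


def print_and_collect(chunks: Iterable[str]) -> str:
    """Print each text chunk as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


# With the SDK, this would be print_and_collect(stream.text_stream),
# called inside the `with client.messages.stream(...) as stream:` block.
full_text = print_and_collect(["Attention ", "weighs ", "tokens."])
```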

Vision: Image Analysis

Claude can analyze images passed as base64-encoded data:

import base64

with open("architecture-diagram.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20260514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the data flow in this architecture diagram."
                },
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data
                    }
                }
            ]
        }
    ]
)
print(response.content[0].text)

Opus 4.7 supports up to 3.75 MP images (triple the pixel budget of Opus 4.6), making it significantly better at UI screenshots, dense diagrams, and small text in images.
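When images arrive in several formats, the `media_type` field has to match the actual file type. The sketch below infers it from the file name and builds the same content block used in the request above. `image_block` is a hypothetical helper, and the set of accepted media types is an assumption to verify against the current API documentation.

```python
import base64
import mimetypes

# Assumption: image formats accepted by the API; check the current docs.
SUPPORTED_MEDIA_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}


def image_block(filename: str, raw: bytes) -> dict:
    """Build a base64 image content block, inferring media_type from the name."""
    media_type, _ = mimetypes.guess_type(filename)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(
            f"Unsupported or unknown image type for {filename!r}: {media_type}"
        )
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(raw).decode("utf-8"),
        },
    }
```

A call such as `image_block("architecture-diagram.png", open("architecture-diagram.png", "rb").read())` would produce the block to place in the `content` list next to the text prompt.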

Prompt Caching

Prompt caching stores recently processed context, reducing cost on repeated prefixes such as system prompts or document corpora:

response = client.messages.create(
    model="claude-sonnet-4-20260514",
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": "You are a legal document analyst. Review the attached contract.",
            # Cache the system prompt — it's reused across many requests
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": contract_text,
                    # Cache the contract text across multiple questions
                    "cache_control": {"type": "ephemeral"}
                },
                {
                    "type": "text",
                    "text": "What are the termination clauses?"
                }
            ]
        }
    ]
)

# Check cache hit/miss via the usage fields on the response
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache creation: {usage.cache_creation_input_tokens}")
print(f"Cache read: {usage.cache_read_input_tokens}")

Cost comparison for a typical legal analysis workflow (100-page contract, 10 questions):

Without caching: 10 × $0.15 = $1.50 (full contract sent each time)
With caching:    $0.19 (first request creates cache) + 9 × $0.02 = $0.37
Savings: ~75%

The cache has a 5-minute TTL (configurable to 1 hour with 2x write cost). Each cache hit costs 0.1x base input rate — a 90% discount on cached tokens.
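The cost comparison above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: a contract of 50,000 tokens (implied by $0.15 per uncached request at Sonnet's $3 per million), a 1.25x cache-write premium (implied by the $0.19 first request), question and output tokens ignored, and every follow-up question arriving before the cache expires.

```python
BASE_INPUT_PER_MTOK = 3.00     # Sonnet input price from the pricing table
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: 5-minute cache write premium
CACHE_READ_MULTIPLIER = 0.10   # cache hit at 0.1x the base input rate

CONTRACT_TOKENS = 50_000       # assumption: implied by $0.15 per request
QUESTIONS = 10

per_request = CONTRACT_TOKENS * BASE_INPUT_PER_MTOK / 1_000_000
without_cache = QUESTIONS * per_request
with_cache = (per_request * CACHE_WRITE_MULTIPLIER
              + (QUESTIONS - 1) * per_request * CACHE_READ_MULTIPLIER)
savings = 1 - with_cache / without_cache

print(f"Without caching: ${without_cache:.2f}")
print(f"With caching:    ${with_cache:.4f}")
print(f"Savings:         {savings:.1%}")
```

Without rounding, the cached total comes to about $0.32 and the savings to roughly 78%. The figures above round each request to the nearest cent, which is why they show $0.37 and ~75%.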

Batch API

For high-volume, non-urgent workloads, the Batch API provides a 50% discount:

# Submit a batch of requests (up to 100K requests per batch)
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"req-{i:04d}",  # must be unique within the batch
            "params": {
                "model": "claude-sonnet-4-20260514",
                "max_tokens": 256,
                "messages": [
                    {"role": "user", "content": "Summarize: " + doc}
                ]
            }
        }
        for i, doc in enumerate(documents, start=1)  # list of 1000 documents
    ]
)

# Poll until complete
import time
while True:
    batch_status = client.messages.batches.retrieve(batch.id)
    if batch_status.processing_status == "ended":
        break
    time.sleep(10)

# Retrieve results; an entry may have succeeded, errored, been canceled, or expired
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")
    else:
        print(f"{result.custom_id}: {result.result.type}")
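Because of the 100K-request limit noted in the comment above, larger jobs have to be split across several batches, and each request needs its own `custom_id` so results can be matched back to inputs. A minimal sketch of both steps follows; `build_requests` and `chunked` are hypothetical helpers, not part of the SDK.

```python
from typing import Iterator

MAX_REQUESTS_PER_BATCH = 100_000  # limit quoted in the batch example above


def build_requests(documents: list[str], model: str,
                   max_tokens: int = 256) -> list[dict]:
    """Build one batch request per document with a unique custom_id."""
    return [
        {
            "custom_id": f"req-{i:06d}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": "Summarize: " + doc}],
            },
        }
        for i, doc in enumerate(documents, start=1)
    ]


def chunked(requests: list[dict],
            size: int = MAX_REQUESTS_PER_BATCH) -> Iterator[list[dict]]:
    """Yield successive slices of at most `size` requests."""
    for start in range(0, len(requests), size):
        yield requests[start:start + size]
```

Each slice yielded by `chunked` would then be submitted as its own batch and polled independently.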

Extended Thinking

For complex reasoning tasks, Claude’s extended thinking mode lets the model show its reasoning process before producing the final answer:

response = client.messages.create(
    model="claude-sonnet-4-20260514",
    max_tokens=4096,
    thinking={
        "type": "enabled",
        "budget_tokens": 2048  # How many tokens the model can use for thinking
    },
    messages=[
        {
            "role": "user",
            "content": "Solve this: A train leaves station A at 60 mph. "
                       "Another train leaves station B at 90 mph. "
                       "The stations are 300 miles apart. "
                       "When and where do they meet?"
        }
    ]
)

# Thinking blocks precede the text block, so select blocks by type, not by index
thinking_text = "".join(b.thinking for b in response.content if b.type == "thinking")
print(f"Reasoning: {thinking_text}")

# The text block contains the final answer
final_answer = "".join(b.text for b in response.content if b.type == "text")
print(f"Answer: {final_answer}")

Set budget_tokens to allocate how much processing the model should spend on reasoning before generating the response. Higher budgets produce more thorough analysis for complex problems (math proofs, legal analysis, multi-step coding).
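Since thinking tokens count toward `max_tokens`, the budget needs to leave room for the visible answer. The sketch below validates a budget before sending a request. `thinking_params` is a hypothetical helper, and the 1,024-token minimum and the "budget below `max_tokens`" rule are assumptions to confirm against the current documentation.

```python
MIN_THINKING_BUDGET = 1024  # assumption: documented minimum; check current docs


def thinking_params(max_tokens: int, budget_tokens: int) -> dict:
    """Return a `thinking` argument after checking it against `max_tokens`."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {"type": "enabled", "budget_tokens": budget_tokens}


# Same values as the request above: 2048 of 4096 tokens reserved for thinking
params = thinking_params(max_tokens=4096, budget_tokens=2048)
answer_room = 4096 - params["budget_tokens"]  # rough room left for the answer
```

The returned dictionary can be passed directly as `thinking=params` in `client.messages.create`.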

Function Calling

import json

response = client.messages.create(
    model="claude-sonnet-4-20260514",
    max_tokens=512,
    tools=[
        {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    ],
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]
)

# Check if Claude wants to call a tool
for block in response.content:
    if block.type == "tool_use":
        tool_name = block.name
        tool_input = block.input
        print(f"Calling {tool_name} with: {json.dumps(tool_input)}")
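The example above stops once Claude requests a tool. To finish the exchange, the application runs the tool and sends the output back in a `tool_result` block that references the `tool_use` id. A minimal sketch of building that follow-up message list: `get_weather` here is a hypothetical stand-in that returns a fixed string, `tool_result_messages` is an illustrative helper, and a fake block replaces `response.content` so the example runs offline.

```python
from types import SimpleNamespace


def get_weather(location: str, unit: str = "celsius") -> str:
    """Hypothetical stand-in for a real weather lookup."""
    return f"22 degrees {unit} in {location}"


def tool_result_messages(user_prompt: str, response_content: list) -> list[dict]:
    """Build the follow-up message list that returns tool results to the model."""
    results = []
    for block in response_content:
        if block.type == "tool_use" and block.name == "get_weather":
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,  # must match the tool_use block's id
                "content": get_weather(**block.input),
            })
    return [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": response_content},
        {"role": "user", "content": results},
    ]


# A fake tool_use block stands in for response.content here.
fake_content = [SimpleNamespace(type="tool_use", id="toolu_demo",
                                name="get_weather", input={"location": "Tokyo"})]
messages = tool_result_messages("What's the weather in Tokyo?", fake_content)
```

Passing `messages` to a second `client.messages.create` call, with the same `tools` list, would let the model turn the tool output into a final answer.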

Production Routing Strategy

Route requests across model tiers to optimize cost/quality:

def route_request(prompt: str, complexity: str):
    """Route to the appropriate Claude model based on complexity."""
    if complexity == "simple":
        model = "claude-haiku-4-20251022"       # $1/$5 — fast, cheap
    elif complexity == "standard":
        model = "claude-sonnet-4-20260514"      # $3/$15 — default tier
    elif complexity == "complex":
        model = "claude-opus-4-20260515"        # $5/$25 — maximum quality
    else:
        raise ValueError(f"Unknown complexity: {complexity}")

    return client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

# Routing can cut costs by 30-40% compared with running everything on Sonnet.
# Use Haiku for classification/summarization,
# Sonnet for standard generation,
# Opus only for tasks that measurably benefit.
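How much routing saves depends on the traffic mix. The sketch below computes a traffic-weighted input price and compares it with an all-Sonnet baseline. The 60/35/5 split is a hypothetical mix chosen for illustration, and `blended_price` is an illustrative helper; the prices come from the pricing table.

```python
# Input prices in USD per million tokens, from the pricing table.
INPUT_PRICE = {"haiku": 1.00, "sonnet": 3.00, "opus": 5.00}

# Hypothetical traffic mix (share of tokens per tier); measure your own.
TRAFFIC_MIX = {"haiku": 0.60, "sonnet": 0.35, "opus": 0.05}


def blended_price(mix: dict[str, float], prices: dict[str, float]) -> float:
    """Return the traffic-weighted average price per million tokens."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * prices[tier] for tier, share in mix.items())


routed = blended_price(TRAFFIC_MIX, INPUT_PRICE)
savings = 1 - routed / INPUT_PRICE["sonnet"]
print(f"Blended input price: ${routed:.2f}/M, "
      f"savings vs all-Sonnet: {savings:.0%}")
```

With this mix the blended input price is $1.90 per million tokens, about 37% below Sonnet's $3.00. Output prices are five times the input price on every tier, so the same percentage applies to output tokens.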
