
AI Reasoning Models: o1, o3, DeepSeek R1, and How to Use Them

Introduction

Reasoning models (o1, o3, DeepSeek R1) think before they answer. They spend extra compute “thinking through” a problem: checking their work, trying alternative approaches, and catching errors. This makes them dramatically better at math, coding, and multi-step logic, but slower and more expensive than standard models.

When to use reasoning models:

  • Complex math or algorithm problems
  • Multi-step code debugging
  • Problems where you need to verify the answer is correct
  • Tasks where GPT-4o keeps making mistakes

When NOT to use them:

  • Simple Q&A or summarization
  • Creative writing
  • When you need fast responses
  • High-volume, low-complexity tasks

How Reasoning Models Work

Standard model (GPT-4o):
  Input → [single forward pass] → Output
  Fast, cheap, good for most tasks

Reasoning model (o1/o3):
  Input → [think... check... try again... verify...] → Output
  Slow, expensive, much better on hard problems

The "thinking" is called a "reasoning trace" or "chain of thought".
Some models expose it (DeepSeek R1); others hide it (o1).

Model Comparison

Model         Best For                    Speed    Cost              Open Source
o3            Hardest problems, research  Slowest  Highest           No
o1            Complex coding, math        Slow     High              No
o3-mini       Good balance                Medium   Medium            No
o1-mini       Coding, math on a budget    Medium   Lower             No
DeepSeek R1   Open-source reasoning       Medium   Free (self-host)  Yes
GPT-4o        General tasks               Fast     Standard          No

Using OpenAI Reasoning Models

from openai import OpenAI

client = OpenAI()

# o3-mini: good balance of capability and cost
response = client.chat.completions.create(
    model="o3-mini",
    messages=[
        {
            "role": "user",
            "content": """
            A train leaves Chicago at 9am traveling at 60mph toward New York (790 miles away).
            Another train leaves New York at 11am traveling at 80mph toward Chicago.
            At what time do they meet, and how far from Chicago?
            Show your work.
            """
        }
    ],
    # reasoning_effort controls how much the model thinks
    # "low" = faster/cheaper, "high" = slower/better
    reasoning_effort="medium",
)

print(response.choices[0].message.content)
# The model shows step-by-step work and should arrive at the correct answer

o1 for Code Debugging

# o1 excels at finding subtle bugs
response = client.chat.completions.create(
    model="o1",
    messages=[
        {
            "role": "user",
            "content": """
            This Python function is supposed to find all prime numbers up to n,
            but it's returning wrong results for some inputs. Find and fix the bug:

            def sieve_of_eratosthenes(n):
                primes = [True] * (n + 1)
                primes[0] = primes[1] = False
                for i in range(2, int(n**0.5)):  # bug is here
                    if primes[i]:
                        for j in range(i*i, n+1, i):
                            primes[j] = False
                return [i for i in range(n+1) if primes[i]]

            print(sieve_of_eratosthenes(10))  # returns [2, 3, 5, 7] (correct)
            print(sieve_of_eratosthenes(25))  # wrongly includes 25
            """
        }
    ],
)
# o1 will identify: range(2, int(n**0.5)) should be range(2, int(n**0.5) + 1)
# because int(25**0.5) = 5, but we need to check 5 itself
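For reference, here is the sieve with that off-by-one fix applied, so you can check the model's diagnosis directly:

```python
def sieve_of_eratosthenes(n):
    primes = [True] * (n + 1)
    primes[0] = primes[1] = False
    # +1 so that int(n**0.5) itself is checked (the bug was excluding it)
    for i in range(2, int(n**0.5) + 1):
        if primes[i]:
            for j in range(i*i, n+1, i):
                primes[j] = False
    return [i for i in range(n+1) if primes[i]]

print(sieve_of_eratosthenes(25))  # [2, 3, 5, 7, 11, 13, 17, 19, 23]
```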

Comparing o1 vs GPT-4o on Hard Problems

import time

problem = """
Prove that for any positive integer n, the sum of the first n odd numbers equals n².
Then write a Python function that verifies this for n = 1 to 100.
"""

# GPT-4o: fast but may make errors on the proof
start = time.time()
gpt4o_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": problem}]
)
print(f"GPT-4o: {time.time()-start:.1f}s")

# o1-mini: slower but more reliable on mathematical reasoning
start = time.time()
o1_response = client.chat.completions.create(
    model="o1-mini",
    messages=[{"role": "user", "content": problem}]
)
print(f"o1-mini: {time.time()-start:.1f}s")

# Typical results:
# GPT-4o: 2.1s (may have gaps in the proof)
# o1-mini: 15.3s (rigorous proof with verification)

DeepSeek R1: Open-Source Reasoning

DeepSeek R1 is a fully open-source reasoning model that matches o1 on many benchmarks. You can run it locally or via API.

Via API (Cheapest)

from openai import OpenAI  # DeepSeek uses OpenAI-compatible API

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1 model
    messages=[
        {
            "role": "user",
            "content": "What is the time complexity of merge sort? Explain why."
        }
    ],
)

# DeepSeek R1 shows its reasoning trace
print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)

Self-Hosted with Ollama

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull DeepSeek R1 (7B fits on consumer GPU)
ollama pull deepseek-r1:7b

# Run
ollama run deepseek-r1:7b "Solve: if 3x + 7 = 22, what is x?"

# Use via Python (pip install ollama)
import ollama

response = ollama.chat(
    model='deepseek-r1:7b',
    messages=[{
        'role': 'user',
        'content': 'Write a recursive function to compute Fibonacci numbers, then optimize it with memoization.'
    }]
)

print(response['message']['content'])
# Shows <think>...</think> reasoning trace, then the answer
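If you want the trace and the final answer as separate strings, you can split on the <think> tags yourself. This helper is not part of the ollama library, just a small sketch assuming the local R1 output format shown above:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a local R1 response into (reasoning trace, final answer)."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        return "", text.strip()  # no trace found; treat everything as the answer
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

raw = "<think>3x = 22 - 7 = 15, so x = 5</think>x = 5"
trace, answer = split_reasoning(raw)
print(answer)  # x = 5
```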

DeepSeek R1 Model Sizes

Model              VRAM Required  Quality  Use Case
deepseek-r1:1.5b   2GB            Basic    Testing
deepseek-r1:7b     8GB            Good     Consumer GPU
deepseek-r1:14b    16GB           Better   RTX 3090/4090
deepseek-r1:32b    32GB           Great    Workstation
deepseek-r1:70b    80GB           Best     A100/H100
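As a rough rule of thumb, you can map available VRAM to the largest tag from the table above. The thresholds here are just the table's numbers, not an official sizing guide:

```python
def pick_r1_tag(vram_gb: int) -> str:
    """Pick the largest DeepSeek R1 tag that fits in the given VRAM (per the table)."""
    tiers = [
        (80, "deepseek-r1:70b"),
        (32, "deepseek-r1:32b"),
        (16, "deepseek-r1:14b"),
        (8,  "deepseek-r1:7b"),
        (2,  "deepseek-r1:1.5b"),
    ]
    for required, tag in tiers:
        if vram_gb >= required:
            return tag
    raise ValueError("Need at least 2GB of VRAM for the 1.5b model")

print(pick_r1_tag(24))  # deepseek-r1:14b
```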

Prompting Strategies for Reasoning Models

Don’t Over-Prompt

# BAD: reasoning models don't need "think step by step"
# They already do this internally
response = client.chat.completions.create(
    model="o1-mini",
    messages=[{
        "role": "user",
        "content": "Think step by step and carefully reason through this problem..."  # unnecessary
    }]
)

# GOOD: just state the problem clearly
response = client.chat.completions.create(
    model="o1-mini",
    messages=[{
        "role": "user",
        "content": "Find all integer solutions to xยฒ - 5x + 6 = 0"
    }]
)

Give Context, Not Instructions

# BAD: telling the model HOW to think
"First, identify the variables. Then, set up equations. Then solve..."

# GOOD: give context and let the model reason
"I'm debugging a race condition in a multi-threaded Python application. Here's the code: [code]. The issue occurs intermittently under high load."

Use for Verification

# Use reasoning models to verify outputs from faster models
def verify_with_reasoning_model(code: str, test_cases: list) -> dict:
    """Use o1 to verify code correctness."""

    test_str = "\n".join([f"Input: {t['input']}, Expected: {t['expected']}" for t in test_cases])

    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{
            "role": "user",
            "content": f"""
            Verify this code is correct for all test cases.
            If there are bugs, identify them precisely.

            Code:
            {code}

            Test cases:
            {test_str}
            """
        }]
    )

    return {"verification": response.choices[0].message.content}

Cost Optimization

# Route queries to the right model based on complexity

def smart_route(question: str) -> str:
    """Use cheap model for simple questions, reasoning model for hard ones."""

    # Quick complexity check with cheap model
    classifier = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Is this question complex (requires multi-step reasoning, math, or debugging)? Answer only YES or NO.\n\nQuestion: {question}"
        }],
        max_tokens=5,
    )

    is_complex = "YES" in classifier.choices[0].message.content.upper()

    if is_complex:
        # Use reasoning model for hard questions
        response = client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "user", "content": question}],
            reasoning_effort="medium",
        )
    else:
        # Use fast model for simple questions
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
        )

    return response.choices[0].message.content

# Examples:
print(smart_route("What is the capital of France?"))  # -> gpt-4o-mini
print(smart_route("Prove the Pythagorean theorem"))   # -> o3-mini

Benchmarks: Where Reasoning Models Win

Task                          GPT-4o    o1-mini   o3
─────────────────────────────────────────────────────
AIME 2024 (math competition)   13%       70%      96%
HumanEval (coding)             90%       95%      99%
GPQA (PhD-level science)       53%       73%      87%
Simple Q&A                     95%       95%      95%
Creative writing               ★★★★      ★★★       ★★★
Speed                          Fast      Slow     Slowest
Cost per 1M tokens             $5        $3       $15

Key insight: For simple tasks, reasoning models aren’t better, just slower and more expensive. Use them selectively.
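To make the selectivity concrete, here is a quick spend sketch using the per-1M-token prices from the table above (illustrative numbers only; real pricing varies by provider and over time):

```python
# Prices per 1M tokens, taken from the benchmark table above (illustrative only)
PRICE_PER_1M = {"gpt-4o": 5.00, "o1-mini": 3.00, "o3": 15.00}

def estimate_cost(model: str, tokens: int) -> float:
    """Rough spend estimate for a given token count at the table's prices."""
    return PRICE_PER_1M[model] * tokens / 1_000_000

# Monthly spend at 10M tokens per model:
for model, _ in PRICE_PER_1M.items():
    print(f"{model}: ${estimate_cost(model, 10_000_000):.2f}")
```

Note that OpenAI reasoning models also bill their hidden reasoning tokens as output tokens, so the effective cost per query is higher than the visible answer length suggests.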
