Introduction
Reasoning models (o1, o3, DeepSeek R1) think before they answer. They spend extra compute “thinking through” a problem: checking their work, trying alternative approaches, and catching errors. This makes them dramatically better at math, coding, and multi-step logic, but slower and more expensive than standard models.
When to use reasoning models:
- Complex math or algorithm problems
- Multi-step code debugging
- Problems where you need to verify the answer is correct
- Tasks where GPT-4o keeps making mistakes
When NOT to use them:
- Simple Q&A or summarization
- Creative writing
- When you need fast responses
- High-volume, low-complexity tasks
How Reasoning Models Work
Standard model (GPT-4o):
Input → [single forward pass] → Output
Fast, cheap, good for most tasks
Reasoning model (o1/o3):
Input → [think... check... try again... verify...] → Output
Slow, expensive, much better on hard problems
The "thinking" is called a "reasoning trace" or "chain of thought".
Some models show it (DeepSeek R1); some hide it (o1).
Model Comparison
| Model | Best For | Speed | Cost | Open Source |
|---|---|---|---|---|
| o3 | Hardest problems, research | Slowest | Highest | No |
| o1 | Complex coding, math | Slow | High | No |
| o3-mini | Good balance | Medium | Medium | No |
| o1-mini | Coding, math on budget | Medium | Lower | No |
| DeepSeek R1 | Open-source reasoning | Medium | Free (self-host) | Yes |
| GPT-4o | General tasks | Fast | Standard | No |
Using OpenAI Reasoning Models
```python
from openai import OpenAI

client = OpenAI()

# o3-mini: good balance of capability and cost
response = client.chat.completions.create(
    model="o3-mini",
    messages=[
        {
            "role": "user",
            "content": """
A train leaves Chicago at 9am traveling at 60mph toward New York (790 miles away).
Another train leaves New York at 11am traveling at 80mph toward Chicago.
At what time do they meet, and how far from Chicago?
Show your work.
""",
        }
    ],
    # reasoning_effort controls how much the model thinks
    # "low" = faster/cheaper, "high" = slower/better
    reasoning_effort="medium",
)

print(response.choices[0].message.content)
# The visible answer includes the requested step-by-step work;
# the internal reasoning tokens themselves are not returned
```
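As a sanity check on the expected answer, the train problem works out in a few lines of plain arithmetic (no API calls needed):

```python
# The Chicago train gets a 2-hour head start: 2 h * 60 mph = 120 mi.
# The remaining 790 - 120 = 670 mi gap closes at 60 + 80 = 140 mph.
head_start = 2 * 60
hours_after_11am = (790 - head_start) / (60 + 80)
miles_from_chicago = head_start + 60 * hours_after_11am
print(f"Meet about {hours_after_11am:.2f} h after 11am, "
      f"{miles_from_chicago:.1f} mi from Chicago")
# Meet about 4.79 h after 11am, 407.1 mi from Chicago (around 3:47pm)
```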
o1 for Code Debugging
```python
# o1 excels at finding subtle bugs
response = client.chat.completions.create(
    model="o1",
    messages=[
        {
            "role": "user",
            "content": """
This Python function is supposed to find all prime numbers up to n,
but it's returning wrong results for some inputs. Find and fix the bug:

def sieve_of_eratosthenes(n):
    primes = [True] * (n + 1)
    primes[0] = primes[1] = False
    for i in range(2, int(n**0.5)):  # bug is here
        if primes[i]:
            for j in range(i*i, n+1, i):
                primes[j] = False
    return [i for i in range(n+1) if primes[i]]

print(sieve_of_eratosthenes(10))  # returns [2, 3, 5, 7, 9] (9 is not prime!)
print(sieve_of_eratosthenes(25))  # incorrectly includes 25 (25 = 5 * 5 is never marked)
""",
        }
    ],
)

# o1 will identify: range(2, int(n**0.5)) should be range(2, int(n**0.5) + 1)
# because int(25**0.5) = 5, and 5 itself must be checked
```
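The one-character fix can be confirmed directly. Here is the corrected function:

```python
def sieve_of_eratosthenes(n):
    primes = [True] * (n + 1)
    primes[0] = primes[1] = False
    # +1 so that i = int(n**0.5) itself is checked (matters when n is a perfect square)
    for i in range(2, int(n**0.5) + 1):
        if primes[i]:
            for j in range(i*i, n+1, i):
                primes[j] = False
    return [i for i in range(n+1) if primes[i]]

print(sieve_of_eratosthenes(10))  # [2, 3, 5, 7]
print(sieve_of_eratosthenes(25))  # [2, 3, 5, 7, 11, 13, 17, 19, 23]
```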
Comparing o1 vs GPT-4o on Hard Problems
```python
import time

problem = """
Prove that for any positive integer n, the sum of the first n odd numbers equals n².
Then write a Python function that verifies this for n = 1 to 100.
"""

# GPT-4o: fast but may make errors on the proof
start = time.time()
gpt4o_response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": problem}],
)
print(f"GPT-4o: {time.time()-start:.1f}s")

# o1-mini: slower but more reliable on mathematical reasoning
start = time.time()
o1_response = client.chat.completions.create(
    model="o1-mini",
    messages=[{"role": "user", "content": problem}],
)
print(f"o1-mini: {time.time()-start:.1f}s")

# Typical results:
# GPT-4o: 2.1s, may have gaps in the proof
# o1-mini: 15.3s, rigorous proof with verification
```
DeepSeek R1: Open-Source Reasoning
DeepSeek R1 is a fully open-source reasoning model that matches o1 on many benchmarks. You can run it locally or via API.
Via API (Cheapest)
```python
from openai import OpenAI  # DeepSeek uses an OpenAI-compatible API

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1 model
    messages=[
        {
            "role": "user",
            "content": "What is the time complexity of merge sort? Explain why.",
        }
    ],
)

# DeepSeek R1 returns its reasoning trace in a separate field
print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)
```
Self-Hosted with Ollama
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull DeepSeek R1 (7B fits on a consumer GPU)
ollama pull deepseek-r1:7b

# Run
ollama run deepseek-r1:7b "Solve: if 3x + 7 = 22, what is x?"
```
```python
# Use via Python
import ollama

response = ollama.chat(
    model='deepseek-r1:7b',
    messages=[{
        'role': 'user',
        'content': 'Write a recursive function to compute Fibonacci numbers, then optimize it with memoization.',
    }],
)
print(response['message']['content'])
# Output includes a <think>...</think> reasoning trace, then the answer
```
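Because R1's trace arrives inline, it is easy to split from the final answer. A minimal sketch (the `split_reasoning` helper is illustrative, not part of the Ollama API):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate a <think>...</think> trace from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

raw = "<think>22 - 7 = 15, and 15 / 3 = 5.</think>\nx = 5"
reasoning, answer = split_reasoning(raw)
print(reasoning)  # 22 - 7 = 15, and 15 / 3 = 5.
print(answer)     # x = 5
```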
DeepSeek R1 Model Sizes
| Model | VRAM Required | Quality | Use Case |
|---|---|---|---|
| deepseek-r1:1.5b | 2GB | Basic | Testing |
| deepseek-r1:7b | 8GB | Good | Consumer GPU |
| deepseek-r1:14b | 16GB | Better | RTX 3090/4090 |
| deepseek-r1:32b | 32GB | Great | Workstation |
| deepseek-r1:70b | 80GB | Best | A100/H100 |
Prompting Strategies for Reasoning Models
Don’t Over-Prompt
```python
# BAD: reasoning models don't need "think step by step"
# (they already do this internally)
response = client.chat.completions.create(
    model="o1-mini",
    messages=[{
        "role": "user",
        "content": "Think step by step and carefully reason through this problem...",  # unnecessary
    }],
)

# GOOD: just state the problem clearly
response = client.chat.completions.create(
    model="o1-mini",
    messages=[{
        "role": "user",
        "content": "Find all integer solutions to x² - 5x + 6 = 0",
    }],
)
```
Give Context, Not Instructions
```python
# BAD: telling the model HOW to think
"First, identify the variables. Then, set up equations. Then solve..."

# GOOD: give context and let the model reason
"I'm debugging a race condition in a multi-threaded Python application. Here's the code: [code]. The issue occurs intermittently under high load."
```
Use for Verification
```python
# Use reasoning models to verify outputs from faster models
def verify_with_reasoning_model(code: str, test_cases: list) -> dict:
    """Use a reasoning model (here o1-mini) to verify code correctness."""
    test_str = "\n".join(
        f"Input: {t['input']}, Expected: {t['expected']}" for t in test_cases
    )
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{
            "role": "user",
            "content": f"""
Verify this code is correct for all test cases.
If there are bugs, identify them precisely.

Code:
{code}

Test cases:
{test_str}
""",
        }],
    )
    return {"verification": response.choices[0].message.content}
```
Cost Optimization
```python
# Route queries to the right model based on complexity
def smart_route(question: str) -> str:
    """Use a cheap model for simple questions, a reasoning model for hard ones."""
    # Quick complexity check with a cheap model
    classifier = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Is this question complex (requires multi-step reasoning, math, or debugging)? Answer only YES or NO.\n\nQuestion: {question}",
        }],
        max_tokens=5,
    )
    is_complex = "YES" in classifier.choices[0].message.content.upper()

    if is_complex:
        # Use a reasoning model for hard questions
        response = client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "user", "content": question}],
            reasoning_effort="medium",
        )
    else:
        # Use a fast model for simple questions
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
        )
    return response.choices[0].message.content

# Examples:
print(smart_route("What is the capital of France?"))  # → gpt-4o-mini
print(smart_route("Prove the Pythagorean theorem"))   # → o3-mini
```
Benchmarks: Where Reasoning Models Win
| Task | GPT-4o | o1-mini | o3 |
|---|---|---|---|
| AIME 2024 (math competition) | 13% | 70% | 96% |
| HumanEval (coding) | 90% | 95% | 99% |
| GPQA (PhD-level science) | 53% | 73% | 87% |
| Simple Q&A | 95% | 95% | 95% |
| Creative writing | Comparable | Comparable | Comparable |
| Speed | Fast | Slow | Slowest |
| Cost per 1M tokens | $5 | $3 | $15 |
Key insight: For simple tasks, reasoning models aren't better, just slower and more expensive. Use them selectively.
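The per-token prices in the table make the trade-off concrete. A rough sketch (prices are the table's figures and will drift; the token counts are illustrative, and remember that reasoning models also bill their hidden reasoning tokens):

```python
# $ per 1M tokens, from the table above
PRICE_PER_M = {"gpt-4o": 5.0, "o1-mini": 3.0, "o3": 15.0}

def query_cost(model: str, tokens: int) -> float:
    """Dollar cost of a query at the table's listed price per 1M tokens."""
    return PRICE_PER_M[model] / 1_000_000 * tokens

# A reasoning model can consume many extra tokens per query,
# so the same question costs more even at a lower list price.
print(f"gpt-4o,  1k tokens: ${query_cost('gpt-4o', 1_000):.4f}")   # $0.0050
print(f"o1-mini, 8k tokens: ${query_cost('o1-mini', 8_000):.4f}")  # $0.0240
```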
Resources
- OpenAI o1 Documentation
- DeepSeek R1 Paper
- DeepSeek API
- Ollama โ run models locally
- OpenAI Reasoning Effort