What Are Reasoning Models?
Reasoning models represent a paradigm shift in large language models. Unlike traditional LLMs that generate responses in a single pass, reasoning models spend more time “thinking” before answering, breaking down complex problems into steps, exploring multiple approaches, and correcting their own mistakes along the way.
This capability emerges from training techniques that emphasize test-time compute: the idea that allowing models more inference time can dramatically improve performance on difficult tasks. The breakthrough came with OpenAI’s o1 model (codenamed “Strawberry”) and has since been replicated and enhanced by models like DeepSeek V3.2, which achieved gold medal performance at IMO 2025 and IOI 2025.
Core Concepts
Chain-of-Thought (CoT)
Chain-of-thought prompting encourages models to verbalize their reasoning step by step. Rather than jumping to an answer, the model articulates intermediate steps, making the reasoning process transparent and often more accurate.
# Traditional prompting
Q: If Alice has 5 apples and gives 2 to Bob, how many does she have?
A: 3
# Chain-of-thought prompting
Q: If Alice has 5 apples and gives 2 to Bob, how many does she have?
A: Let's think step by step. Alice starts with 5 apples. She gives away 2 apples.
Therefore, she has 5 - 2 = 3 apples remaining. The answer is 3.
Test-Time Compute
Test-time compute refers to the computational resources allocated during inference rather than during training. Reasoning models are designed to use this additional compute strategically, spending more tokens and time on harder problems while remaining efficient on simpler ones.
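The simplest form of test-time compute scaling is best-of-n sampling: draw several independent reasoning paths and take the majority answer. A minimal sketch (the `make_toy_model` helper is a hypothetical stand-in for a stochastic model, not a real API):

```python
import random
from collections import Counter

def solve_with_sampling(model, problem, num_samples=8):
    """Sample several reasoning paths and return the majority final answer."""
    answers = [model(problem) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for a stochastic reasoning model: correct ~70% of the time.
def make_toy_model(seed=0):
    rng = random.Random(seed)
    return lambda problem: "3" if rng.random() < 0.7 else "4"

print(solve_with_sampling(make_toy_model(), "5 - 2 = ?"))  # -> 3
```

Even though any single sample may be wrong, the majority vote across samples is right far more often, which is why more samples buy better accuracy on hard problems.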
Reinforcement Learning for Reasoning
Unlike supervised fine-tuning with human-annotated reasoning traces, reasoning models often use reinforcement learning to develop their reasoning capabilities. The model learns to maximize rewards for correct answers and logical consistency, discovering novel reasoning strategies that humans may not have demonstrated.
Process Reward Models (PRM)
PRMs provide feedback at each step of the reasoning process, not just at the final answer. This enables more granular training signals and helps models avoid early mistakes that compound through subsequent reasoning steps.
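One common way to use per-step PRM scores is to rank whole candidate solutions by aggregating their step scores, e.g. taking the minimum (a chain is only as strong as its weakest step). A sketch under that assumption; the function names and the specific aggregation choices are illustrative:

```python
def path_score(step_scores, method="min"):
    """Aggregate per-step PRM scores into one path-level score."""
    if method == "min":      # a chain is only as strong as its weakest step
        return min(step_scores)
    if method == "product":  # treat scores as per-step success probabilities
        result = 1.0
        for s in step_scores:
            result *= s
        return result
    raise ValueError(method)

def best_path(candidates, method="min"):
    """Pick the candidate path whose PRM scores aggregate highest."""
    return max(candidates, key=lambda steps: path_score(steps, method))

# Two hypothetical candidate solutions with per-step PRM scores.
good = [0.9, 0.8, 0.85]
shaky = [0.95, 0.3, 0.9]   # one weak middle step sinks the whole path
print(best_path([good, shaky]))  # -> [0.9, 0.8, 0.85]
```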
Leading Reasoning Models
OpenAI o3 / o4-mini Series
OpenAI’s o1 model launched in September 2024, introducing the reasoning model category to mainstream users. The o3 and o4-mini models, released in early 2025, pushed performance further on mathematical and scientific reasoning tasks.
OpenAI o3 (January 2025):
- Advanced mathematical reasoning with 83.6% on AIME 2024 (high reasoning effort)
- 77.0% on GPQA PhD-level science questions
- Strong performance on FrontierMath (32% with Python tools on first attempt)
- 2073 Elo on Codeforces competitive programming
OpenAI o3-mini (January 2025):
- Most cost-efficient reasoning model in the o-series
- Optimized for STEM reasoning (science, math, coding)
- 24% faster response time than o1-mini (7.7s vs 10.16s avg)
- First small reasoning model with full developer features: function calling, Structured Outputs, developer messages
- Three reasoning effort options: low, medium, high
- Supports web search for up-to-date answers with links
- Available to free-tier ChatGPT users
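The effort setting is exposed as a request parameter. A minimal sketch of how such a request might be assembled; the parameter names follow OpenAI's Chat Completions API for o-series models, but verify against the current docs before relying on them:

```python
def build_request(prompt, effort="medium"):
    """Assemble a chat request for an o-series model with a reasoning effort level."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,  # low | medium | high
        "messages": [{"role": "user", "content": prompt}],
    }

# The dict can then be passed to a client call such as
# client.chat.completions.create(**build_request(...)).
print(build_request("Prove that sqrt(2) is irrational.", effort="high")["reasoning_effort"])  # -> high
```

Higher effort trades latency and token cost for deeper reasoning, so it makes sense to keep it a per-request knob rather than a global default.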
OpenAI o4-mini:
- Compact reasoning model optimized for efficiency
- Maintains strong performance while reducing latency
- Available alongside o3 and o3-mini in ChatGPT and API
Key characteristics across o-series:
- Hidden chain-of-thought that users cannot directly access (except o3-mini with search)
- Higher latency and cost per token compared to non-reasoning models
- Deliberative alignment safety training
- Significantly surpasses GPT-4o on challenging safety and jailbreak evaluations
DeepSeek V3 / V3.2 Series
DeepSeek V3.2, released in December 2025, represents a breakthrough in open-source reasoning models with two variants: DeepSeek-V3.2 and DeepSeek-V3.2-Speciale. Built on three key technical innovations, DeepSeek Sparse Attention (DSA), scalable reinforcement learning, and large-scale agentic task synthesis, the models combine computational efficiency with strong reasoning and agentic performance.
Key characteristics:
- Open-source weights (MIT License)
- 685B parameters with DeepSeek Sparse Attention for efficient long-context processing
- Two variants: standard V3.2 for general use, V3.2-Speciale for deep reasoning tasks
- First model to integrate thinking directly into tool-use scenarios
- Supports tool-use in both thinking and non-thinking modes
Performance highlights:
- DeepSeek-V3.2-Speciale achieves gold medal level at IMO 2025 (35/42), IOI 2025, ICPC World Finals, and China’s Mathematical Olympiad
- Comparable to GPT-5 and Gemini-3.0-Pro on reasoning benchmarks
- Surpasses GPT-5 in high-compute variant evaluations
- Strong performance on mathematical proof and logical verification tasks
Technical innovations:
- DeepSeek Sparse Attention (DSA): Efficient attention mechanism reducing computational complexity while preserving long-context performance
- Scalable RL Framework: Robust post-training protocol enabling reasoning capabilities competitive with proprietary models
- Agentic Task Synthesis: Novel pipeline generating 1,800+ environment scenarios with 85K+ complex instructions for tool-use training
Note: The DeepSeek-V3.2-Speciale variant is designed exclusively for deep reasoning tasks and does not support tool calling.
Claude 4 Series (Opus 4.5 / Sonnet 4.5 / Haiku 4.5)
Anthropic released the Claude 4 series in late 2025, with Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5 representing significant advances in reasoning, coding, and agentic capabilities.
Claude Opus 4.5 (November 2025):
- Positioned by Anthropic as its best model for coding, agents, and computer use
- Frontier performance with dramatically improved token efficiency
- Strong improvements on everyday tasks like slides and spreadsheets
- Available through Claude subscription and API
Claude Sonnet 4.5 (September 2025):
- New benchmark records in coding, reasoning, and computer use
- Anthropic’s most aligned model to date
- Accompanied by Claude Agent SDK for building capable agents
- Strong balance of capability and efficiency
Claude Haiku 4.5 (October 2025):
- Matches state-of-the-art coding capabilities from previous generations
- Unprecedented speed and cost-efficiency for complex tasks
- Optimized for high-volume, low-latency applications
Key characteristics across Claude 4 series:
- Enhanced extended thinking capabilities
- Strong computer use and agentic behavior
- Improved alignment and instruction following
- Available through Claude Pro, Team, Enterprise plans and API
Resources:
- Claude Opus 4.5 Announcement
- Claude Sonnet 4.5 Announcement
- Claude Haiku 4.5 Announcement
- Claude Documentation
Google Gemini Series
Google has released multiple Gemini models with reasoning capabilities, evolving rapidly through 2025.
Gemini 2.0 Series (December 2024): The Gemini 2.0 series introduced native multimodality and enhanced reasoning capabilities.
- Gemini 2.0 Pro: Google’s flagship model for complex reasoning tasks
- Gemini 2.0 Flash: Optimized for speed and efficiency with competitive reasoning
- Gemini 2.0 Flash Thinking: Specialized for extended reasoning with visible thought process
- Gemini 2.0 Experimental: Testing ground for new reasoning techniques
Key characteristics:
- Native multimodal understanding (text, images, video, audio)
- Native tool use and function calling
- 1M+ token context window on larger models
- Native agentic capabilities for complex task completion
- Strong performance on STEM reasoning, coding, and scientific tasks
Gemini 2.5 Series (Early 2025): The 2.5 series brings significant improvements in reasoning depth and agentic behavior.
Gemini Robotics-ER 1.5 (December 2025): A state-of-the-art embodied reasoning model for robots, excelling in:
- Visual and spatial understanding
- Task planning and progress estimation
- Complex multi-step task execution
- Real-world robotics applications
Additional Models:
- Gemini 3 Pro: Integrated into Gemini CLI for agentic coding
- Gemini 3 Flash: Fast reasoning with state-of-the-art capabilities
- Veo 3.1: Video generation with reasoning capabilities
Other Notable Models
Qwen3 Series (May 2025): Alibaba’s latest reasoning model series, including Qwen3-30B-A3B-Thinking-2507 and other variants. Key innovations include hybrid thinking mode (interleaves thinking and non-thinking blocks), configurable thinking length up to 32K tokens, and strong performance on mathematical and coding tasks. Available in multiple sizes from 8B to 30B+ parameters with Apache 2.0 license.
Qwen QwQ-32B: Alibaba’s 32-billion parameter reasoning model that punches above its weight class, demonstrating that reasoning capabilities can be achieved at smaller scales.
Kimi k1.5: Moonshot AI’s reasoning model, showing strong performance on Chinese-language reasoning tasks.
Skywork o1: Another open-weight reasoning model contributing to the ecosystem of accessible reasoning models.
Technical Deep Dive
How Reasoning Models Are Trained
Training reasoning models involves a sophisticated multi-stage pipeline that combines supervised fine-tuning with reinforcement learning. Unlike traditional language models that primarily learn from next-token prediction, reasoning models are trained to generate extended thought processes that lead to correct answers.
Stage 1: Foundation and Cold Start
The journey begins with a pre-trained language model that has general language capabilities. From this base, a cold start phase initializes reasoning behavior using high-quality reasoning demonstrations.
Cold Start Process:
- The model is fine-tuned on human-annotated or synthetic reasoning traces
- These traces contain step-by-step problem-solving examples showing how to break down complex problems
- Examples include mathematical proofs, coding solutions, and logical deductions
- The goal is to establish a baseline reasoning pattern that the model can improve upon
Key Insight: DeepSeek’s research showed that even a small amount of cold-start data (thousands of examples) can bootstrap reasoning capabilities that pure RL can then amplify. The cold start teaches the model to “think aloud” rather than jump to conclusions.
Stage 2: Reinforcement Learning with GRPO
The core innovation in reasoning model training is Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that doesn’t require a separate reward model.
How GRPO Works:
- For each training problem, the model generates multiple candidate reasoning paths (typically 4-16)
- Each path is scored based on whether it reaches the correct answer
- Paths that lead to correct answers are reinforced; incorrect paths are penalized
- The relative advantage of better paths over worse paths determines the gradient
GRPO Advantage:
- No separate reward model needed (unlike PPO which requires a critic network)
- More stable training dynamics
- Particularly effective for problems with verifiable answers (math, coding, logic puzzles)
- Enables the model to discover novel reasoning strategies not demonstrated in training data
# Conceptual GRPO training step
from statistics import mean

MAX_REASONING = 16384  # token budget for each reasoning path

def grpo_update(model, problem, correct_answer, num_samples=8):
    # Generate multiple candidate reasoning paths for the same problem
    samples = [model.generate(problem, max_tokens=MAX_REASONING)
               for _ in range(num_samples)]
    # Score each sample (1 if it reaches the correct answer, 0 otherwise)
    rewards = [1 if sample.endswith(correct_answer) else 0 for sample in samples]
    # Compute relative advantages:
    # better-than-average samples get positive gradients,
    # worse-than-average samples get negative gradients
    advantages = [r - mean(rewards) for r in rewards]
    # Update the model to prefer higher-advantage samples
    model.update(samples, advantages)
    return samples
Stage 3: Reward Modeling - PRM vs ORM
Two types of reward signals guide reasoning model training:
Outcome Reward Models (ORM):
- Provide a single reward at the end of the reasoning process
- Binary: correct (1) or incorrect (0)
- Simple but provides limited training signal
- Works well when final answers are verifiable
Process Reward Models (PRM):
- Provide feedback at each reasoning step
- Can identify which specific steps are correct or incorrect
- Enables more granular credit assignment
- Helps models avoid early mistakes that compound through reasoning
# PRM provides step-by-step feedback
step_1: "First, I need to identify the variables..." -> PRM: 0.9 (good start)
step_2: "The derivative of x² is 2x..." -> PRM: 1.0 (correct)
step_3: "Therefore the integral is 2x²..." -> PRM: 0.2 (incorrect)
step_4: "Let me recalculate..." -> PRM: 0.8 (recovering)
Hybrid Approaches:
- Many state-of-the-art models use both ORM and PRM
- ORM for final answer validation
- PRM for intermediate step guidance
- DeepSeek V3.2 uses a sophisticated combination of both
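One simple way to blend the two signals is a weighted sum of the outcome reward and the averaged process rewards. A sketch; the 0.7/0.3 split and the function name are illustrative choices, not taken from any published recipe:

```python
def hybrid_reward(final_correct, step_scores, orm_weight=0.7):
    """Blend an outcome reward with averaged process rewards.

    final_correct: bool, did the final answer verify?
    step_scores:   per-step PRM scores in [0, 1]
    """
    orm = 1.0 if final_correct else 0.0
    prm = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return orm_weight * orm + (1 - orm_weight) * prm

# Correct final answer, but one weak intermediate step drags the reward down.
print(hybrid_reward(True, [0.9, 1.0, 0.2, 0.8]))
```

The outcome term anchors training to verifiable correctness, while the process term still penalizes sloppy intermediate steps that happened to reach the right answer.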
Stage 4: Scaling Test-Time Compute
A key insight from reasoning model research is that test-time compute (allowing models more inference time) can dramatically improve performance on difficult tasks.
Test-Time Compute Strategies:
- Parallel Sampling: Generate multiple reasoning paths and select the best one. More samples mean a higher probability of finding correct reasoning; this works well when the model can recognize correct answers.
- Sequential Extension: Allow longer reasoning chains. The model thinks for more tokens before answering, enabling multi-step reasoning that wouldn’t fit in shorter contexts.
- Adaptive Allocation: Spend more compute on harder problems. Easy problems get minimal thinking, hard problems get extensive reasoning, optimizing the cost-performance tradeoff.
# Test-time compute strategies
def test_time_compute(model, problem, strategy="adaptive"):
    if strategy == "parallel":
        # Generate multiple paths, vote on the answer
        answers = [model.generate(problem) for _ in range(16)]
        return majority_vote(answers)
    elif strategy == "sequential":
        # Extended thinking, regenerating until verification passes
        for attempt in range(4):
            response = model.generate(problem, max_tokens=16384)
            if is_verified_correct(response):
                return response
        return response  # Fall back to the last attempt
    elif strategy == "adaptive":
        # Estimate difficulty, then allocate compute accordingly
        difficulty = estimate_difficulty(problem)  # 1 = easy, 3 = hard
        samples = {1: 4, 2: 8, 3: 16}[difficulty]
        return model.generate(problem, num_samples=samples)
Stage 5: Distillation
Once a capable reasoning model is trained, its capabilities can be distilled into smaller, faster models.
Distillation Process:
- The large reasoning model generates thousands of reasoning traces
- These traces become training data for smaller models
- The smaller model learns to mimic the reasoning patterns
- Result: A smaller model with reasoning capabilities close to the teacher
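The collection step above is essentially rejection sampling: keep only the teacher traces that reach a verified answer. A minimal sketch (`build_distillation_set` and `toy_teacher` are hypothetical names for illustration):

```python
def build_distillation_set(teacher, problems, answers, samples_per_problem=4):
    """Collect teacher reasoning traces that end in the correct answer.

    Correct traces become (prompt, completion) SFT pairs for a smaller
    student model; incorrect traces are discarded.
    """
    dataset = []
    for problem, answer in zip(problems, answers):
        for _ in range(samples_per_problem):
            trace = teacher(problem)
            if trace.strip().endswith(answer):  # simple exact-match filter
                dataset.append({"prompt": problem, "completion": trace})
    return dataset

# Toy teacher that always reasons its way to the right answer.
def toy_teacher(problem):
    return "Step 1: 5 - 2 = 3. The answer is 3"

data = build_distillation_set(toy_teacher, ["5 - 2 = ?"], ["3"])
print(len(data))  # -> 4
```

In practice the filter would be a real verifier (an exact-match grader for math, a test runner for code), since the quality of the distilled student depends entirely on the quality of the kept traces.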
Key Findings:
- DeepSeek-R1 demonstrated that reasoning can be effectively distilled
- 7B models can achieve significant reasoning capabilities through distillation
- Distilled models are 10-100x faster while maintaining 80-90% of reasoning performance
How Reasoning Models Work Internally
Understanding the internal mechanics of reasoning models reveals why they outperform traditional LLMs on complex tasks.
The Thinking Process
When a reasoning model encounters a problem, it doesn’t just predict the next token. Instead, it enters a reasoning mode characterized by:
- Extended Context Generation: The model produces hundreds to thousands of tokens of internal reasoning
- Self-Correction: The model identifies and corrects its own mistakes mid-stream
- Strategy Switching: The model tries multiple approaches when one isn’t working
- Verification: The model checks its own work before finalizing answers
Emergent Reasoning Patterns:
Through reinforcement learning, reasoning models develop sophisticated behaviors that weren’t explicitly programmed:
- Self-Reflection: “Wait, let me double-check that step…”
- Backtracking: “That approach won’t work, let me try another…”
- Systematic Decomposition: Breaking complex problems into subproblems
- Analogical Reasoning: “This is similar to problem X, which I solved by…”
- Verification: “Let me verify this answer by working backwards…”
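Open reasoning models typically emit this thinking as visible text wrapped in delimiter tags, which callers strip before showing the final answer. A sketch assuming the DeepSeek-R1-style `<think>...</think>` convention; other models use different delimiters or hide the reasoning entirely:

```python
import re

def split_thinking(output):
    """Separate a model's reasoning trace from its final answer.

    Assumes the <think>...</think> tag convention used by DeepSeek-R1-style
    models; adjust the pattern for other tag conventions.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()
    thinking = match.group(1).strip()
    answer = output[match.end():].strip()
    return thinking, answer

raw = "<think>5 apples minus 2 is 3.</think>The answer is 3."
thinking, answer = split_thinking(raw)
print(answer)  # -> The answer is 3.
```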
Architectural Considerations
While reasoning models use the same transformer architecture as standard LLMs, their training creates different internal representations:
Attention Patterns:
- Reasoning models show increased attention to earlier context during reasoning
- They revisit problem statements multiple times during extended thinking
- Attention becomes more structured, following reasoning chains
Activation Patterns:
- Certain neurons activate preferentially during reasoning tasks
- Interpretability analyses suggest reasoning models develop specialized circuits for logical operations
- Activity patterns consistent with a working-memory-like mechanism appear during extended reasoning
The Chain-of-Thought Mechanism
Chain-of-thought prompting works because it:
- Improves Locality: Breaking problems into steps keeps the information each step needs close at hand, preventing it from being “lost” across long sequences
- Enables Self-Correction: Each step can be evaluated independently
- Provides Training Signal: Intermediate steps allow more granular feedback
- Matches Human Reasoning: Leveraging patterns humans use for problem-solving
# How chain-of-thought improves reasoning

# Direct answer (traditional LLM):
direct = "The answer is 42."  # No visibility into the reasoning

# Chain-of-thought (reasoning model):
chain_of_thought = """
Let me work through this step by step:

Step 1: Identify what we're solving for
  We need to find x where f(x) = 0

Step 2: Apply the quadratic formula
  x = [-b ± sqrt(b² - 4ac)] / 2a

Step 3: Substitute values
  a=1, b=-5, c=6
  x = [5 ± sqrt(25 - 24)] / 2
  x = [5 ± 1] / 2

Step 4: Calculate both solutions
  x = 6/2 = 3 or x = 4/2 = 2

Step 5: Verify
  f(2) = 2² - 5*2 + 6 = 4 - 10 + 6 = 0 ✓
  f(3) = 3² - 5*3 + 6 = 9 - 15 + 6 = 0 ✓

The answer is x = 2 or x = 3.
"""
Why Reasoning Models Excel
On Verifiable Tasks:
- Mathematics: Answers can be definitively verified
- Coding: Code can be compiled and tested
- Logic: Solutions can be proven correct
- Factual QA: Claims can be checked against knowledge bases
The Key Advantage: Reasoning models can spend more computational resources on difficult problems, effectively “thinking harder” when needed. This test-time compute scaling provides a new dimension for improving model performance beyond just training compute.
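Verifiability is what makes the extra compute pay off: a cheap checker can grade expensive reasoning. For instance, any proposed root of a quadratic can be verified by substitution (function name is illustrative):

```python
def verify_quadratic_root(a, b, c, x, tol=1e-9):
    """Check a proposed root of ax² + bx + c = 0 by substitution."""
    return abs(a * x * x + b * x + c) < tol

# x² - 5x + 6 = 0 has roots 2 and 3.
print(verify_quadratic_root(1, -5, 6, 2))  # -> True
print(verify_quadratic_root(1, -5, 6, 4))  # -> False
```

The same asymmetry (solving is hard, checking is easy) is what lets parallel sampling, PRM scoring, and distillation filtering all work in the verifiable domains listed above.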
Practical Applications
Mathematical Problem Solving
Reasoning models excel at solving complex mathematical problems, showing strong performance on:
- Competition mathematics (AIME, IMO)
- Graduate-level coursework
- Mathematical proof construction
- Scientific calculations
Software Engineering
The extended reasoning capabilities make these models particularly effective for:
- Complex debugging scenarios
- Algorithm design and optimization
- Architecture decisions
- Multi-file project coordination
Scientific Research
Researchers use reasoning models for:
- Literature synthesis and hypothesis generation
- Experimental design optimization
- Data analysis and interpretation
- Grant proposal review and critique
Legal and Financial Analysis
The ability to trace reasoning steps is valuable for:
- Contract analysis and risk assessment
- Regulatory compliance checking
- Complex financial modeling
- Case law research and synthesis
Prompting Strategies for Reasoning Models
Explicit Step-by-Step Instructions
Solve this problem step by step, showing your work for each step.
Before giving your final answer, verify it by working backwards or
considering alternative approaches.
Self-Verification Prompts
After arriving at your answer, check for:
1. Are there any hidden assumptions?
2. Does the answer make dimensional sense?
3. Can you verify with a different approach?
Constrained Reasoning
First, identify the key constraints in this problem.
Then, develop a systematic approach to address each constraint.
Finally, check that your solution satisfies all constraints.
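These patterns are easy to apply programmatically as prompt templates. A sketch; the wording mirrors the examples above and the function name is a hypothetical helper:

```python
VERIFY_SUFFIX = (
    "\n\nBefore giving your final answer, verify it by working backwards "
    "or considering alternative approaches."
)

def reasoning_prompt(task, style="steps"):
    """Wrap a task in one of the prompting patterns above."""
    if style == "steps":
        return (f"Solve this problem step by step, showing your work "
                f"for each step.\n\n{task}{VERIFY_SUFFIX}")
    if style == "constraints":
        return ("First, identify the key constraints in this problem. "
                "Then develop a systematic approach to address each constraint, "
                f"and check that your solution satisfies all of them.\n\n{task}")
    raise ValueError(f"unknown style: {style}")

print(reasoning_prompt("Find all x with x^2 = 4.", style="constraints")[:5])  # -> First
```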
Benchmarks and Evaluation
Understanding how reasoning models are evaluated helps frame their capabilities:
| Benchmark | Description | Difficulty |
|---|---|---|
| AIME | American Invitational Mathematics Examination | High |
| MMLU | Massive Multitask Language Understanding | Medium-High |
| MATH-500 | 500 challenging math problems | High |
| GPQA | Graduate-Level Google-Proof Q&A | Very High |
| ARC | Abstraction and Reasoning Corpus | Variable |
| BIG-Bench Hard | Diverse challenging tasks | High |
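Most of these benchmarks reduce to exact-match grading over (problem, answer) pairs, so a minimal evaluation loop is short. A toy sketch (the `eval`-based model is a stand-in for a real API call, and all names are illustrative):

```python
def evaluate(model, benchmark):
    """Score a model on (problem, answer) pairs with exact-match grading."""
    correct = sum(1 for problem, answer in benchmark
                  if model(problem).strip() == answer)
    return correct / len(benchmark)

# Toy benchmark and model for illustration only.
toy_benchmark = [("2 + 2", "4"), ("3 * 3", "9"), ("10 / 2", "5")]
toy_model = lambda p: str(int(eval(p)))  # a real model would be an API call
print(evaluate(toy_model, toy_benchmark))  # -> 1.0
```

Real harnesses add answer-extraction (parsing the final line of a long reasoning trace) and often average over multiple samples per problem, but the grading core looks like this.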
Resources for Further Learning
Official Documentation
- OpenAI o1/o3/o4-mini Guide
- DeepSeek-V3.2 Documentation
- Anthropic Claude Thinking
- Qwen3 Documentation
Technical Papers
- DeepSeek-V3.2 Technical Report
- Qwen3 Technical Report
- OpenAI o1 Technical Summary
- Chain-of-Thought Prompting Elicits Reasoning
Future Directions
The reasoning model field is evolving rapidly:
- Longer Reasoning Chains: Pushing the boundaries of how many reasoning steps models can reliably perform
- Multimodal Reasoning: Extending reasoning capabilities to visual, audio, and video inputs
- Efficiency Improvements: Making reasoning models faster and cheaper through architecture innovations and distillation
- Specialized Reasoning: Developing reasoning models optimized for specific domains like medicine, law, or science
- Hybrid Approaches: Combining reasoning models with retrieval systems, tools, and external compute
Getting Started
To experiment with reasoning models:
1. API Access: Try OpenAI o3/o4-mini, DeepSeek V3.2, Claude 4 series (Opus 4.5, Sonnet 4.5, Haiku 4.5), or Qwen3
2. Local Deployment: Run DeepSeek-V3.2 locally using Ollama, LM Studio, or vLLM
3. Experimentation: Start with mathematical problems, then progress to more complex multi-step tasks
4. Evaluation: Test on benchmarks relevant to your use case to understand capabilities and limitations
Reasoning models represent a fundamental capability advancement in AI systems. Understanding when and how to use them effectively will be essential for building sophisticated AI applications in 2025 and beyond.