Introduction
Large language models generate text token by token through a process called decoding. While models can produce remarkably fluent text, standard greedy decoding selects only the single most probable token at each step, potentially missing better reasoning paths. Self-consistency decoding addresses this limitation by sampling multiple diverse reasoning paths and selecting the most consistent answer through majority voting.
This technique, introduced by researchers at Google, has become a cornerstone method for improving reasoning accuracy in LLMs without requiring additional training or model modifications.
Understanding the Problem
Limitations of Greedy Decoding
Standard decoding strategies have inherent flaws:
Greedy Decoding:
"Think step by step: What is 17 ร 24?"
โ "First, 17 ร 20 = 340"
โ "Then, 17 ร 4 = 68"
โ "Add them: 340 + 68 = 408" โ
โ Correct answer!
But what if the model makes an early mistake?
โ "First, 17 ร 20 = 340"
โ "Then, 17 ร 4 = 64" (wrong!)
โ "Add them: 340 + 64 = 404" โ
โ Wrong answer - and no recovery possible!
The problem: Greedy decoding commits to every token selection, with no mechanism to explore alternatives or recover from early errors.
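The contrast is easy to see at the level of a single decoding step. The sketch below uses a toy five-token distribution (the logits are purely illustrative) to show how greedy decoding always commits to the argmax, while temperature sampling keeps alternative tokens, and hence alternative reasoning paths, reachable:

```python
import math
import random

# Toy next-token distribution over a 5-token vocabulary (illustrative logits).
logits = [2.0, 1.5, 0.5, 0.2, -1.0]

def softmax(xs, temperature=1.0):
    exps = [math.exp(x / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Greedy decoding: always commits to the argmax token - no recovery if wrong.
greedy_token = max(range(len(logits)), key=lambda i: logits[i])

# Temperature sampling: draws from the full distribution, so lower-probability
# tokens (and the reasoning paths they open up) remain reachable.
probs = softmax(logits, temperature=0.7)
sampled_token = random.choices(range(len(logits)), weights=probs, k=1)[0]

print(greedy_token)   # always 0
print(sampled_token)  # varies from run to run
```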
The Self-Consistency Principle
Self-consistency is based on a simple but powerful observation:
For problems with a unique correct answer, multiple independent reasoning paths are more likely to converge on the correct solution than on an incorrect one.
This is similar to how human experts might solve a problem multiple ways to verify their answer.
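The intuition can be made quantitative under a simplifying assumption: if each sampled path is independently correct with probability p, and incorrect paths scatter across different wrong answers, the chance that a strict majority lands on the correct answer grows with the number of samples. Real reasoning errors are correlated, so this is an optimistic bound, but it shows why voting helps:

```python
from math import comb

def majority_correct_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n independent paths is correct,
    assuming each path is correct with probability p and wrong paths do not
    collude on the same wrong answer (a simplification of real behavior)."""
    return sum(
        comb(n, k) * p**k * (1 - p)**(n - k)
        for k in range(n // 2 + 1, n + 1)
    )

print(majority_correct_prob(0.6, 1))  # 0.6
print(majority_correct_prob(0.6, 5))  # ~0.683 - voting beats a single sample
```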
How Self-Consistency Works
The Algorithm
```python
import re
import torch
from collections import Counter


class SelfConsistencyDecoder:
    def __init__(self, model, tokenizer, num_samples=5, temperature=0.7):
        self.model = model
        self.tokenizer = tokenizer
        self.num_samples = num_samples
        self.temperature = temperature

    def generate_with_cot(self, prompt, max_length=512):
        """Generate a single CoT response with sampling."""
        inputs = self.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = self.model.generate(
                inputs["input_ids"],
                max_length=max_length,
                temperature=self.temperature,
                do_sample=True,
                top_p=0.9,  # Nucleus sampling
                pad_token_id=self.tokenizer.pad_token_id,
            )
        # Decode only the newly generated tokens, not the echoed prompt,
        # so numbers in the prompt don't confuse answer extraction
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

    def extract_answer(self, response):
        """Extract the final answer from a CoT response.

        Different answer formats need different extraction strategies.
        """
        # Try explicit answer markers first - they are the most reliable signal
        patterns = [
            r'[Tt]he answer is[:\s]+(.+?)(?:\.|$)',
            r'[Aa]nswer[:\s]+(.+?)(?:\.|$)',
            r'=\s*(.+?)(?:\.|$)',
        ]
        for pattern in patterns:
            match = re.search(pattern, response)
            if match:
                return match.group(1).strip()
        # Fallback: the last number in the response is often the answer
        numbers = re.findall(r'[-+]?\d*\.?\d+', response)
        if numbers:
            return numbers[-1]
        return None

    def decode(self, prompt):
        """Main self-consistency decoding process."""
        # Step 1: Generate multiple reasoning paths
        responses = []
        for _ in range(self.num_samples):
            responses.append(self.generate_with_cot(prompt))
        # Step 2: Extract an answer from each response
        answers = []
        for response in responses:
            answer = self.extract_answer(response)
            if answer:
                answers.append(answer)
        # Step 3: Majority vote
        if not answers:
            # Fallback: no parseable answers - return the first sampled path
            return responses[0]
        answer_counts = Counter(answers)
        return answer_counts.most_common(1)[0][0]
```
Visual Representation
Prompt: "If a train travels 120km in 2 hours, what is its speed?"
┌───────────────────┐
│  Generate Path 1  │
│  "120 ÷ 2 = 60"   │ ──┐
└───────────────────┘   │
┌───────────────────┐   │  Sample
│  Generate Path 2  │   │  Multiple
│  "120 ÷ 2 = 60"   │ ──┤  Paths
└───────────────────┘   │
┌───────────────────┐   │
│  Generate Path 3  │   │
│  "120 ÷ 2 = 60"   │ ──┘
└───────────────────┘
          │
          ▼
┌───────────────────┐
│  Extract Answers  │
│   [60, 60, 60]    │
└───────────────────┘
          │
          ▼
┌───────────────────┐
│   Majority Vote   │
│     60 (3/3)      │ ──▶ Final Answer
└───────────────────┘
Implementation Strategies
1. Temperature Sampling
Varying temperature controls randomness:
```python
def generate_diverse_paths(prompt, num_paths=5, temperature=0.7):
    """Generate diverse reasoning paths using temperature sampling."""
    responses = []
    for i in range(num_paths):
        # Use a slightly different temperature for each path
        path_temp = temperature * (1 + i * 0.1)
        response = model.generate(
            prompt,
            temperature=path_temp,
            top_p=0.95,
            do_sample=True,
        )
        responses.append(response)
    return responses
```
2. Beam Search with Self-Consistency
Combining beam search with majority voting:
```python
def beam_search_with_consistency(prompt, num_beams=5, num_groups=3):
    """Use multiple beam groups and vote across them."""
    all_candidates = []
    for group in range(num_groups):
        # Different random seed per group for diversity
        torch.manual_seed(42 + group)
        outputs = model.generate(
            prompt,
            num_beams=num_beams,
            temperature=0.8,
            do_sample=True,
            output_scores=True,
            return_dict_in_generate=True,
        )
        # With return_dict_in_generate=True the decoded candidates
        # live in the .sequences field of the output object
        all_candidates.extend(outputs.sequences)
    # Vote across all candidates
    answers = [extract_answer(c) for c in all_candidates]
    return majority_vote(answers)
```
3. Chain-of-Thought Integration
Self-consistency works best with Chain-of-Thought prompting:
```python
def self_consistency_cot(prompt):
    """Full self-consistency with CoT prompting."""
    # Add CoT prompting
    cot_prompt = f"""Think step by step and show your work.
Then provide your final answer.
Question: {prompt}
Let me think step by step:"""
    # Generate multiple paths (parameter name matches generate_diverse_paths)
    paths = generate_diverse_paths(
        cot_prompt,
        num_paths=7,
        temperature=0.9,
    )
    # Extract and vote
    answers = [extract_answer(p) for p in paths]
    return majority_vote(answers)
```
Performance Analysis
Accuracy Improvements
| Task | Greedy | Self-Consistency (5 samples) | Improvement |
|---|---|---|---|
| Arithmetic (GSM8K) | 17.9% | 47.5% | +165% |
| Multi-digit arithmetic | 55.0% | 78.7% | +43% |
| Commonsense reasoning | 72.4% | 83.2% | +15% |
| Symbolic reasoning | 61.6% | 84.3% | +37% |
Latency Trade-offs
Self-consistency increases inference time linearly with the number of samples:
Total Time ≈ (Single Generation Time) × (Number of Samples) + (Voting Time)
Typical values:
- Single generation: ~500ms
- 5 samples: ~2.5s total
- 10 samples: ~5s total
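These figures assume samples are drawn one after another; in practice the paths can often be batched (for example, a single generation call that returns several sequences), so wall-clock time grows much more slowly than the sequential estimate. The helper below sketches both regimes; the 1.3× batch overhead factor is an assumed illustrative value, not a measured constant:

```python
def estimated_latency(single_gen_s: float, num_samples: int,
                      voting_s: float = 0.01, batched: bool = False,
                      batch_overhead: float = 1.3) -> float:
    """Estimate self-consistency latency in seconds.

    Sequential sampling scales linearly with the sample count; batched
    sampling costs roughly one generation plus an assumed overhead factor.
    """
    if batched:
        return single_gen_s * batch_overhead + voting_s
    return single_gen_s * num_samples + voting_s

print(estimated_latency(0.5, 5))                # sequential: ~2.5 s
print(estimated_latency(0.5, 5, batched=True))  # batched: well under 1 s
```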
When to Use Self-Consistency
| Use Case | Recommended | Not Recommended |
|---|---|---|
| Math problems | ✅ High benefit | |
| Logical reasoning | ✅ High benefit | |
| Factual questions | | ❌ Low benefit |
| Creative writing | | ❌ Not applicable |
| Code generation | ✅ Moderate benefit | |
| Translation | | ❌ Single correct answer unclear |
Advanced Techniques
1. Weighted Voting
Weight votes by generation confidence:
```python
def weighted_majority_vote(responses, model):
    """Weight each vote by the model's confidence in its response."""
    weighted_counts = Counter()
    for response in responses:
        answer = extract_answer(response)
        confidence = calculate_confidence(response, model)
        weighted_counts[answer] += confidence
    return weighted_counts.most_common(1)[0][0]


def calculate_confidence(response, model):
    """Estimate confidence from token probabilities.

    Uses mean per-token surprisal as a heuristic: lower surprisal means
    the model assigned high probability to its own tokens.
    """
    import numpy as np

    tokens = model.tokenize(response)
    probs = [model.predict_prob(t) for t in tokens]
    mean_surprisal = np.mean([-np.log(p + 1e-10) for p in probs])
    # Lower surprisal -> higher confidence
    return 1 / (1 + mean_surprisal)
```
2. Semantic Clustering
Group semantically equivalent answers:
```python
from collections import Counter

from sklearn.cluster import AgglomerativeClustering


def semantic_majority_vote(responses, embeddings):
    """Cluster semantically similar answers before voting."""
    # Cluster answer embeddings; with distance_threshold set,
    # n_clusters must be None, and cosine distance needs non-ward linkage
    clusters = AgglomerativeClustering(
        n_clusters=None,
        metric='cosine',
        linkage='average',
        distance_threshold=0.1,
    ).fit_predict(embeddings)
    # Find the largest cluster
    cluster_counts = Counter(clusters)
    dominant_cluster = cluster_counts.most_common(1)[0][0]
    # Return a representative answer from the largest cluster
    cluster_answers = [r for i, r in enumerate(responses)
                       if clusters[i] == dominant_cluster]
    return cluster_answers[0]
```
3. Iterative Refinement
Multiple rounds of self-consistency:
```python
def iterative_self_consistency(prompt, max_rounds=3):
    """Iteratively refine answers through multiple rounds."""
    current_prompt = prompt
    answers = []
    for round_num in range(max_rounds):
        # Generate samples and collect their extracted answers
        answers = generate_and_vote(current_prompt, num_samples=5)
        # Check convergence
        if len(set(answers)) == 1:
            return answers[0]  # All paths agree
        # Feed the disagreement back into the prompt
        current_prompt += f"\nPrevious attempts gave: {answers}"
    return majority_vote(answers)
```
Best Practices
1. Sample Diversity
Maximize reasoning path diversity:
```python
def maximize_diversity(num_paths=5):
    """Build varied sampling configurations for diverse reasoning paths."""
    temps = [0.3, 0.5, 0.7, 0.9, 1.0]      # varied temperatures
    top_ps = [0.8, 0.85, 0.9, 0.95, 1.0]   # varied nucleus (top-p) cutoffs
    seeds = [42, 123, 456, 789, 1011]      # varied random seeds
    # One sampling configuration per path
    return [
        {"temperature": t, "top_p": p, "seed": s}
        for t, p, s in zip(temps[:num_paths], top_ps[:num_paths],
                           seeds[:num_paths])
    ]
```
2. Answer Extraction
Handle various answer formats:
```python
def robust_answer_extraction(responses):
    """Try multiple extraction strategies on each response."""
    extractors = [
        extract_numeric_last,
        extract_after_equals,
        extract_in_box,
        extract_quoted,
        extract_from_options,  # For multiple choice
    ]
    all_answers = []
    for response in responses:
        for extractor in extractors:
            answer = extractor(response)
            if answer:
                all_answers.append(answer)
                break
    return all_answers
```
3. Error Handling
Deal with extraction failures:
```python
def handle_extraction_failures(responses):
    """Graceful handling when answers can't be extracted."""
    successful = []
    failed = []
    for response in responses:
        answer = extract_answer(response)
        if answer:
            successful.append(answer)
        else:
            failed.append(response)
    if successful:
        return successful
    elif failed:
        # Fallback: vote over a raw response rather than nothing
        return [failed[0]]
    else:
        return ["UNKNOWN"]
```
Cost Optimization
Reducing Compute While Maintaining Quality
| Strategy | Samples | Accuracy Retention | Speedup |
|---|---|---|---|
| Standard | 5-10 | 100% | 1x |
| Early stopping | 3-5 | ~85% | 1.5-2x |
| Confidence-based | 3-5 | ~90% | 1.5x |
| Cached paths | Variable | ~95% | 2-3x |
```python
def early_stopping_self_consistency(prompt, max_samples=5,
                                    threshold=0.8, min_samples=3):
    """Stop sampling early once a consensus fraction is reached."""
    answers = []
    for _ in range(max_samples):
        answers.append(generate_and_extract(prompt))
        # Require a few samples before checking consensus - otherwise the
        # very first answer trivially has 100% agreement and we always stop
        if len(answers) >= min_samples:
            counts = Counter(answers)
            top_fraction = counts.most_common(1)[0][1] / len(answers)
            if top_fraction >= threshold:
                break
    return majority_vote(answers)
```
Combining with Other Techniques
With Tree of Thoughts
```python
def tot_with_self_consistency(prompt):
    """Combine Tree of Thoughts with self-consistency."""
    # Generate multiple thought trees
    trees = [generate_tree(prompt) for _ in range(5)]
    # Take the best path from each tree
    paths = [tree.best_path() for tree in trees]
    # Vote across the best paths
    answers = [extract_answer(p) for p in paths]
    return majority_vote(answers)
```
With Speculative Decoding
```python
def speculative_with_consistency(prompt):
    """Combine speculative decoding with self-consistency (conceptual sketch)."""
    # Use a smaller draft model for fast sampling
    draft_responses = draft_model.sample(prompt, num_samples=5)
    # Verify each draft with the larger target model
    verified = [
        target_model.verify(prompt, draft)
        for draft in draft_responses
    ]
    return majority_vote(verified)
```
Conclusion
Self-consistency decoding represents a powerful ensemble technique that significantly improves LLM reasoning without requiring model retraining. By generating multiple reasoning paths and selecting the most consistent answer, it transforms the inherent stochasticity of language model sampling from a limitation into an advantage.
Key insights:
- Reasoning Path Diversity: Multiple paths increase likelihood of finding correct solutions
- Majority Voting: The correct answer is more likely to appear consistently
- No Training Required: Works with any pretrained LLM
- Compute Trade-off: Accuracy improves with more samples, but at linear cost
The technique exemplifies a broader principle in modern AI: rather than fighting randomness, we can harness it through intelligent aggregation. As LLMs continue to grow in capability, self-consistency remains a simple yet effective method for extracting reliable answers from potentially noisy generation processes.
Resources
- Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
- Google Research: Self-Consistency
- Majority Voting for LLM Reasoning