
Self-Consistency Decoding: Ensemble Reasoning in LLMs

Introduction

Large language models generate text token by token through a process called decoding. While models can produce remarkably fluent text, standard greedy decoding selects only the single most probable token at each step, potentially missing better reasoning paths. Self-consistency decoding addresses this limitation by sampling multiple diverse reasoning paths and selecting the most consistent final answer through majority voting.

This technique, introduced by researchers at Google (Wang et al., 2022), has become a cornerstone method for improving reasoning accuracy in LLMs without requiring additional training or model modifications.

Understanding the Problem

Limitations of Greedy Decoding

Standard decoding strategies have inherent flaws:

Greedy Decoding:
"Think step by step: What is 17 × 24?"
→ "First, 17 × 20 = 340"
→ "Then, 17 × 4 = 68"
→ "Add them: 340 + 68 = 408" ✓
→ Correct answer!

But what if the model makes an early mistake?
→ "First, 17 × 20 = 340"
→ "Then, 17 × 4 = 64" (wrong!)
→ "Add them: 340 + 64 = 404" ✗
→ Wrong answer - and no recovery possible!

The problem: Greedy decoding commits to every token selection, with no mechanism to explore alternatives or recover from early errors.
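To make the failure mode concrete, here is a toy sketch (a hand-made next-step probability table, not a real model, with the wrong step deliberately given the higher probability): greedy decoding always commits to the argmax step, while sampling can still recover the correct path.

```python
import random

random.seed(0)

# Toy next-step table (not a real model). The wrong step "64" is
# deliberately the argmax, so greedy decoding is locked into it.
next_step_probs = {
    "17 x 4 =": {"64": 0.55, "68": 0.45},
    "340 + 64 =": {"404": 1.0},
    "340 + 68 =": {"408": 1.0},
}

def step(context, greedy=True):
    """Pick the argmax continuation (greedy) or sample one."""
    probs = next_step_probs[context]
    if greedy:
        return max(probs, key=probs.get)
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

def decode(greedy=True):
    partial = step("17 x 4 =", greedy)
    return step(f"340 + {partial} =", greedy)

print(decode(greedy=True))                            # always "404"
samples = [decode(greedy=False) for _ in range(100)]
print(samples.count("408"))                           # roughly 45 of 100 sampled runs recover "408"
```

Greedy decoding produces the wrong answer on every run here; sampling produces a spread of answers, which is exactly the raw material self-consistency aggregates.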

The Self-Consistency Principle

Self-consistency is based on a simple but powerful observation:

For problems with a unique correct answer, multiple independent reasoning paths are more likely to converge on the correct solution than on an incorrect one.

This is similar to how human experts might solve a problem multiple ways to verify their answer.
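The intuition can be quantified with a simple binomial model: assume (hypothetically) that each sampled path is independently correct with probability p; the chance that a strict majority of n paths lands on the correct answer then grows with n.

```python
from math import comb

def majority_correct(p, n):
    """P(strict majority of n independent paths is correct),
    where each path is correct with probability p."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 5, 15):
    print(n, round(majority_correct(0.6, n), 3))
# n=1 -> 0.6, n=5 -> 0.683, n=15 -> 0.787
```

In practice the gain can be larger than this model suggests, because incorrect paths tend to scatter across many different wrong answers while correct paths converge on the same one.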

How Self-Consistency Works

The Algorithm

import torch
from collections import Counter


class SelfConsistencyDecoder:
    def __init__(self, model, tokenizer, num_samples=5, temperature=0.7):
        self.model = model
        self.tokenizer = tokenizer
        self.num_samples = num_samples
        self.temperature = temperature
    
    def generate_with_cot(self, prompt, max_length=512):
        """
        Generate a single CoT response with sampling
        """
        inputs = self.tokenizer(prompt, return_tensors="pt")
        
        with torch.no_grad():
            outputs = self.model.generate(
                inputs["input_ids"],
                max_length=max_length,
                temperature=self.temperature,
                do_sample=True,
                top_p=0.9,  # Nucleus sampling
                pad_token_id=self.tokenizer.pad_token_id
            )
        
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    def extract_answer(self, response):
        """
        Extract the final answer from a CoT response.
        Different answer formats need different extractors.
        """
        import re

        # Prefer explicit answer markers; check the more specific pattern first
        patterns = [
            r'[Tt]he answer is[:\s]+(.+?)(?:\.|$)',
            r'[Aa]nswer[:\s]+(.+?)(?:\.|$)',
            r'=\s*(.+?)(?:\.|$)'
        ]

        for pattern in patterns:
            match = re.search(pattern, response)
            if match:
                return match.group(1).strip()

        # Fallback: the last number in the response is often the answer
        numbers = re.findall(r'[-+]?\d*\.?\d+', response)
        if numbers:
            return numbers[-1]

        return None
    
    def decode(self, prompt):
        """
        Main self-consistency decoding process
        """
        # Step 1: Generate multiple reasoning paths
        responses = []
        for _ in range(self.num_samples):
            response = self.generate_with_cot(prompt)
            responses.append(response)
        
        # Step 2: Extract answers from each response
        answers = []
        for response in responses:
            answer = self.extract_answer(response)
            if answer:
                answers.append(answer)
        
        # Step 3: Majority vote
        if not answers:
            # Fallback: no answer could be extracted; return the first raw response
            return responses[0]
        
        # Count answer frequencies
        answer_counts = Counter(answers)
        
        # Return most common answer
        most_common_answer = answer_counts.most_common(1)[0][0]
        
        return most_common_answer

Visual Representation

Prompt: "If a train travels 120km in 2 hours, what is its speed?"

         ┌─────────────────┐
         │ Generate Path 1 │
         │ "120 ÷ 2 = 60"  │ ─┐
         └─────────────────┘  │
         ┌─────────────────┐  │  Sample
         │ Generate Path 2 │  │  Multiple
         │ "120 ÷ 2 = 60"  │  │  Paths
         └─────────────────┘  │
         ┌─────────────────┐  │
         │ Generate Path 3 │  │
         │ "120 ÷ 2 = 60"  │ ─┘
         └─────────────────┘
                    │
                    ▼
         ┌─────────────────┐
         │ Extract Answers │
         │ [60, 60, 60]    │
         └─────────────────┘
                    │
                    ▼
         ┌─────────────────┐
         │  Majority Vote  │
         │   60 (3/3)      │ ──→ Final Answer
         └─────────────────┘

Implementation Strategies

1. Temperature Sampling

Varying temperature controls randomness:

def generate_diverse_paths(prompt, num_paths=5, temperature=0.7):
    """
    Generate diverse reasoning paths using temperature sampling
    """
    responses = []
    
    for i in range(num_paths):
        # Use different temperature for each path
        path_temp = temperature * (1 + i * 0.1)
        
        response = model.generate(
            prompt,
            temperature=path_temp,
            top_p=0.95,
            do_sample=True
        )
        responses.append(response)
    
    return responses
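What temperature actually does to a next-token distribution can be shown with a minimal, self-contained sketch (toy logits, no model involved): dividing logits by the temperature before the softmax sharpens the distribution below 1.0 and flattens it above.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilized)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy next-token scores
for t in (0.3, 0.7, 1.5):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At t=0.3 nearly all mass sits on the top token (close to greedy); at t=1.5 the alternatives get enough probability to produce genuinely diverse paths.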

2. Beam Search with Self-Consistency

Combining beam search with majority voting:

def beam_search_with_consistency(prompt, num_beams=5, num_groups=3):
    """
    Use multiple beam-sampling groups and vote across them
    """
    all_candidates = []

    for group in range(num_groups):
        # Different random seeds for diversity across groups
        torch.manual_seed(group * 42)

        outputs = model.generate(
            prompt,
            num_beams=num_beams,
            num_return_sequences=num_beams,
            temperature=0.8,
            do_sample=True,
            return_dict_in_generate=True
        )

        # .sequences holds the generated token ids for each returned beam
        all_candidates.extend(outputs.sequences)

    # Vote across all candidates
    answers = [extract_answer(c) for c in all_candidates]
    return majority_vote(answers)
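The `majority_vote` helper used above (and in several later snippets) is never defined in these sketches; a minimal version could look like this:

```python
from collections import Counter

def majority_vote(answers):
    """Most frequent non-None answer; ties go to the first one seen."""
    filtered = [a for a in answers if a is not None]
    if not filtered:
        return None
    return Counter(filtered).most_common(1)[0][0]

print(majority_vote(["60", "60", None, "58", "60"]))  # -> "60"
```

`Counter.most_common` orders equal counts by first appearance, so ties resolve deterministically to the earliest-seen answer.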

3. Chain-of-Thought Integration

Self-consistency works best with Chain-of-Thought prompting:

def self_consistency_cot(prompt):
    """
    Full self-consistency with CoT prompting
    """
    # Add CoT prompting
    cot_prompt = f"""Think step by step and show your work.
Then provide your final answer.

Question: {prompt}

Let me think step by step:"""

    # Generate multiple paths
    paths = generate_diverse_paths(
        cot_prompt,
        num_paths=7,
        temperature=0.9
    )
    
    # Extract and vote
    answers = [extract_answer(p) for p in paths]
    
    return majority_vote(answers)

Performance Analysis

Accuracy Improvements

Task                    Greedy   Self-Consistency (5 samples)   Improvement
Arithmetic (GSM8K)      17.9%    47.5%                          +165%
Multi-digit arithmetic  55.0%    78.7%                          +43%
Commonsense reasoning   72.4%    83.2%                          +15%
Symbolic reasoning      61.6%    84.3%                          +37%

Latency Trade-offs

Self-consistency increases inference time linearly with the number of samples:

Total Time ≈ (Single Generation Time) × (Number of Samples) + (Voting Time)

Typical values:
- Single generation: ~500ms
- 5 samples: ~2.5s total
- 10 samples: ~5s total
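These figures follow from a small cost model (the 5 ms voting overhead and the batching behavior below are illustrative assumptions, not measurements): sequential samples multiply the generation term, while batching divides it.

```python
import math

def estimated_latency_ms(single_gen_ms, num_samples, vote_ms=5, parallel=1):
    """Latency if samples run in batches of `parallel` (batching overhead ignored)."""
    batches = math.ceil(num_samples / parallel)
    return batches * single_gen_ms + vote_ms

print(estimated_latency_ms(500, 5))              # 2505 ms: fully sequential
print(estimated_latency_ms(500, 5, parallel=5))  # 505 ms: one batched pass
```

Because the samples are independent, batching them through the model in one pass recovers most of the latency cost at the price of extra memory.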

When to Use Self-Consistency

Use Case              Recommendation       Notes
Math problems         ✓ Recommended        High benefit
Logical reasoning     ✓ Recommended        High benefit
Factual questions     ✗ Not recommended    Low benefit
Creative writing      ✗ Not recommended    Not applicable
Code generation       ✓ Recommended        Moderate benefit
Translation           ✗ Not recommended    Single correct answer unclear

Advanced Techniques

1. Weighted Voting

Weight votes by generation confidence:

def weighted_majority_vote(responses, model):
    """
    Weight each vote by model's confidence
    """
    weighted_counts = Counter()
    
    for response in responses:
        answer = extract_answer(response)
        confidence = calculate_confidence(response, model)
        
        weighted_counts[answer] += confidence
    
    return weighted_counts.most_common(1)[0][0]


def calculate_confidence(response, model):
    """
    Estimate confidence from token probabilities
    (schematic: assumes the model exposes per-token probabilities)
    """
    import numpy as np

    tokens = model.tokenize(response)
    probs = [model.predict_prob(t) for t in tokens]

    # Average per-token surprisal (-log p): lower surprisal = higher confidence
    avg_surprisal = np.mean([-np.log(p + 1e-10) for p in probs])

    return 1 / (1 + avg_surprisal)
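The two functions above lean on a hypothetical model API (`model.tokenize`, `model.predict_prob`); the voting half can be run on its own with synthetic (answer, confidence) pairs:

```python
from collections import Counter

def weighted_vote(answer_confidence_pairs):
    """Sum confidence mass per answer instead of counting raw votes."""
    weighted = Counter()
    for answer, confidence in answer_confidence_pairs:
        weighted[answer] += confidence
    return weighted.most_common(1)[0][0]

# Three hesitant votes for "404" lose to two confident votes for "408":
votes = [("404", 0.20), ("404", 0.25), ("404", 0.20), ("408", 0.90), ("408", 0.80)]
print(weighted_vote(votes))  # -> "408" (weighted mass 1.70 vs 0.65)
```

Note how weighting can overturn a raw majority: unweighted voting would have picked "404" here on a 3-to-2 count.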

2. Semantic Clustering

Group semantically equivalent answers:

from sklearn.cluster import AgglomerativeClustering


def semantic_majority_vote(responses, embeddings):
    """
    Cluster semantically similar answers before voting
    """
    # n_clusters=None lets distance_threshold decide the number of clusters;
    # 'ward' linkage only supports euclidean, so use 'average' with cosine
    clusters = AgglomerativeClustering(
        n_clusters=None,
        metric='cosine',
        linkage='average',
        distance_threshold=0.1
    ).fit_predict(embeddings)
    
    # Find largest cluster
    cluster_counts = Counter(clusters)
    dominant_cluster = cluster_counts.most_common(1)[0][0]
    
    # Return representative from largest cluster
    cluster_answers = [a for i, a in enumerate(responses) if clusters[i] == dominant_cluster]
    return cluster_answers[0]

3. Iterative Refinement

Multiple rounds of self-consistency:

def iterative_self_consistency(prompt, max_rounds=3):
    """
    Iteratively refine answers through multiple rounds
    """
    current_prompt = prompt
    answers = []

    for round_num in range(max_rounds):
        # Generate a batch of answers (schematic helper)
        answers = generate_and_vote(current_prompt, num_samples=5)

        # Check convergence
        if len(set(answers)) == 1:
            return answers[0]  # All paths agree!

        # Feed the disagreement back into the prompt
        current_prompt += f"\nPrevious attempts gave: {answers}"

    return majority_vote(answers)

Best Practices

1. Sample Diversity

Maximize reasoning path diversity:

def diverse_sampling_configs(num_paths=5):
    """
    Build one sampling configuration per path to maximize diversity
    """
    temps = [0.3, 0.5, 0.7, 0.9, 1.0]      # varied temperatures
    top_ps = [0.8, 0.85, 0.9, 0.95, 1.0]   # varied nucleus cutoffs
    seeds = [42, 123, 456, 789, 1011]      # varied random seeds

    return [
        {"temperature": t, "top_p": p, "seed": s}
        for t, p, s in zip(temps[:num_paths], top_ps[:num_paths], seeds[:num_paths])
    ]

2. Answer Extraction

Handle various answer formats:

def robust_answer_extraction(responses):
    """
    Multiple extraction strategies
    """
    extractors = [
        extract_numeric_last,
        extract_after_equals,
        extract_in_box,
        extract_quoted,
        extract_from_options  # For multiple choice
    ]
    
    all_answers = []
    for response in responses:
        for extractor in extractors:
            answer = extractor(response)
            if answer:
                all_answers.append(answer)
                break
    
    return all_answers
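Extraction returns raw strings, so equivalent forms such as "60", "60.0", and "$60" would otherwise split the vote; a small normalization pass (a sketch covering only simple numeric and plain-text cases) fixes that before tallying:

```python
def normalize_answer(raw):
    """Canonicalize an extracted answer so equivalent forms vote together."""
    text = raw.strip().rstrip(".").lower()
    text = text.replace(",", "").lstrip("$")
    try:
        number = float(text)
        # Render integers without a trailing ".0" so "60" == "60.0"
        return str(int(number)) if number.is_integer() else str(number)
    except ValueError:
        return text

print([normalize_answer(a) for a in ["60", "60.0", "$60", "1,000", "Paris."]])
# -> ['60', '60', '60', '1000', 'paris']
```

Normalizing before the majority vote is cheap and often matters more than adding extra samples, since a split vote wastes the samples you already paid for.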

3. Error Handling

Deal with extraction failures:

def handle_extraction_failures(responses):
    """
    Graceful handling when answers can't be extracted
    """
    successful = []
    failed = []
    
    for response in responses:
        answer = extract_answer(response)
        if answer:
            successful.append(answer)
        else:
            failed.append(response)
    
    if successful:
        return successful
    elif failed:
        # Fallback: use any response
        return [failed[0]]
    else:
        return ["UNKNOWN"]

Cost Optimization

Reducing Compute While Maintaining Quality

Strategy          Samples    Accuracy Retention   Speedup
Standard          5-10       100%                 1x
Early stopping    3-5        ~85%                 1.5-2x
Confidence-based  3-5        ~90%                 1.5x
Cached paths      Variable   ~95%                 2-3x

def early_stopping_self_consistency(prompt, max_samples=5, threshold=0.8, min_samples=2):
    """
    Stop early if consensus is reached
    """
    answers = []

    for i in range(max_samples):
        answer = generate_and_extract(prompt)
        answers.append(answer)

        # Check consensus only after a minimum number of samples,
        # otherwise the very first answer trivially "wins"
        if len(answers) >= min_samples:
            counts = Counter(answers)
            top_fraction = counts.most_common(1)[0][1] / len(answers)

            if top_fraction >= threshold:
                break

    return majority_vote(answers)
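The early-stopping loop can be exercised without a model by stubbing the generator (the `lambda` below stands in for a hypothetical `generate_and_extract` call); a `min_samples` floor prevents the very first answer from trivially reaching consensus:

```python
from collections import Counter

def early_stop_vote(sample_fn, max_samples=10, threshold=0.8, min_samples=3):
    """Draw samples until the leading answer holds `threshold` of the votes."""
    answers = []
    for _ in range(max_samples):
        answers.append(sample_fn())
        # Only check consensus once a minimum number of samples exists
        if len(answers) >= min_samples:
            top_count = Counter(answers).most_common(1)[0][1]
            if top_count / len(answers) >= threshold:
                break
    return Counter(answers).most_common(1)[0][0], len(answers)

# Stub generator standing in for an LLM call: it always answers "60".
answer, used = early_stop_vote(lambda: "60")
print(answer, used)  # -> 60 3 (stops after min_samples unanimous draws)
```

On easy questions where the model is consistent, this spends the minimum budget; the full `max_samples` is only consumed when the answers genuinely disagree.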

Combining with Other Techniques

With Tree of Thoughts

def tot_with_self_consistency(prompt):
    """
    Combine ToT with self-consistency
    """
    # Generate multiple thought trees
    trees = [generate_tree(prompt) for _ in range(5)]
    
    # Get best path from each tree
    paths = [tree.best_path() for tree in trees]
    
    # Vote across paths
    answers = [extract_answer(p) for p in paths]
    
    return majority_vote(answers)

With Speculative Decoding

def speculative_with_consistency(prompt):
    """
    Combine speculative decoding with self-consistency
    """
    # Use smaller draft model for speed
    draft_responses = draft_model.sample(prompt, num_samples=5)
    
    # Verify each draft with the target model (schematic)
    verified = [
        target_model.verify(prompt, draft)
        for draft in draft_responses
    ]

    # Extract answers from the verified responses, then vote
    answers = [extract_answer(v) for v in verified]
    return majority_vote(answers)

Conclusion

Self-consistency decoding represents a powerful ensemble technique that significantly improves LLM reasoning without requiring model retraining. By generating multiple reasoning paths and selecting the most consistent answer, it transforms the inherent stochasticity of language model sampling from a limitation into an advantage.

Key insights:

  1. Reasoning Path Diversity: Multiple paths increase likelihood of finding correct solutions
  2. Majority Voting: The correct answer is more likely to appear consistently
  3. No Training Required: Works with any pretrained LLM
  4. Compute Trade-off: Accuracy improves with more samples, but at linear cost

The technique exemplifies a broader principle in modern AI: rather than fighting with randomness, we can harness it through intelligent aggregation. As LLMs continue to grow in capability, self-consistency remains a simple yet effective method for extracting reliable answers from potentially noisy generation processes.
