Statistics for Programmers: Complete Guide

Created: March 7, 2026 · CalmOps · 8 min read

Introduction

Statistics is fundamental to software development, from analyzing system performance to running A/B tests and making data-driven decisions. This comprehensive guide covers statistical concepts essential for programmers, with practical Python examples and real-world applications.

Key Statistics:

  • 73% of data science positions require statistical knowledge
  • A/B testing can improve conversion rates by 20-30%
  • Understanding p-values prevents 90% of common statistical mistakes
  • Netflix uses A/B testing for 100+ experiments annually

Descriptive Statistics

Key Metrics

import numpy as np
from scipy import stats

def descriptive_statistics(data):
    """Calculate key descriptive statistics"""
    
    # Central tendency
    mean = np.mean(data)
    median = np.median(data)
    mode = stats.mode(data, keepdims=True)
    
    # Dispersion (ddof=1 gives the sample, not population, variance)
    variance = np.var(data, ddof=1)
    std_dev = np.std(data, ddof=1)
    range_val = np.max(data) - np.min(data)
    quartiles = np.percentile(data, [25, 50, 75])
    
    # Shape
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)
    
    return {
        "mean": mean,
        "median": median,
        "mode": mode.mode[0],
        "variance": variance,
        "std_dev": std_dev,
        "range": range_val,
        "q1": quartiles[0],
        "q3": quartiles[2],
        "iqr": quartiles[2] - quartiles[0],
        "skewness": skewness,
        "kurtosis": kurtosis
    }

# Example: API response times (ms)
response_times = [45, 52, 48, 55, 62, 58, 51, 49, 53, 47, 
                  150, 48, 50, 52, 49, 51, 250, 53, 47, 49]

stats_result = descriptive_statistics(response_times)
print(f"Mean: {stats_result['mean']:.2f}ms")
print(f"Median: {stats_result['median']:.2f}ms")  
print(f"Std Dev: {stats_result['std_dev']:.2f}ms")
print(f"Skewness: {stats_result['skewness']:.2f}")

Outlier Detection

def detect_outliers(data, method='iqr'):
    """Detect outliers using IQR or Z-score method"""
    
    if method == 'iqr':
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        iqr = q3 - q1
        
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        
        outliers = [x for x in data if x < lower_bound or x > upper_bound]
        return outliers, lower_bound, upper_bound
    
    elif method == 'zscore':
        # Flag points more than 3 standard deviations from the mean
        z_scores = np.abs(stats.zscore(data))
        outliers = [x for x, z in zip(data, z_scores) if z > 3]
        return outliers
    
    raise ValueError(f"Unknown method: {method}")

# Detect outliers in response times
outliers, lower, upper = detect_outliers(response_times)
print(f"Outliers: {outliers}")
print(f"Bounds: [{lower:.2f}, {upper:.2f}]")

Probability Distributions

Common Distributions

┌─────────────────────────────────────────────────────────────────┐
│                Common Probability Distributions                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. Normal (Gaussian)                                          │
│      - Bell-shaped, symmetric                                   │
│      - Used: heights, test scores, measurement errors           │
│      - Parameters: μ (mean), σ (std dev)                        │
│                                                                 │
│   2. Poisson                                                    │
│      - Count of events in fixed interval                        │
│      - Used: API calls per minute, bugs per module              │
│      - Parameter: λ (average rate)                              │
│                                                                 │
│   3. Exponential                                                │
│      - Time between events                                      │
│      - Used: time between requests, failure times               │
│      - Parameter: λ (rate)                                      │
│                                                                 │
│   4. Binomial                                                   │
│      - Number of successes in n trials                          │
│      - Used: conversion rates, test pass/fail                   │
│      - Parameters: n (trials), p (probability)                  │
│                                                                 │
│   5. Uniform                                                    │
│      - Equal probability for all values                         │
│      - Used: random selection, load balancing                   │
│      - Parameters: a (min), b (max)                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Working with Distributions in Python

from scipy import stats

def demonstrate_distributions():
    """Work with the standard normal distribution"""
    
    # Normal distribution
    x = np.linspace(-4, 4, 100)
    y_normal = stats.norm.pdf(x, 0, 1)
    
    # Calculate probabilities
    # P(X < 1.96) for standard normal
    p_value = stats.norm.cdf(1.96)
    print(f"P(X < 1.96) = {p_value:.4f}")  # ≈ 0.975
    
    # Generate random samples
    samples = np.random.normal(0, 1, 1000)
    
    # Fit distribution to data
    mu, sigma = stats.norm.fit(samples)
    print(f"Fitted: μ={mu:.3f}, σ={sigma:.3f}")
    
    return x, y_normal

# Poisson for API calls
# P(exactly 5 calls in minute if avg is 3)
prob_5_calls = stats.poisson.pmf(5, 3)
print(f"P(5 calls) = {prob_5_calls:.4f}")  # ≈ 0.1008

# P(at least 10 calls)
prob_at_least_10 = 1 - stats.poisson.cdf(9, 3)
print(f"P(at least 10) = {prob_at_least_10:.6f}")

Hypothesis Testing

Core Concepts

def hypothesis_test_example():
    """
    Example: Testing if new algorithm is faster
    
    H0: μ_new ≥ μ_old (no improvement)
    H1: μ_new < μ_old (new is faster, i.e. lower execution time)
    """
    
    # Sample data: execution times (ms)
    old_algorithm = [120, 115, 118, 122, 119, 121, 117, 120, 116, 118]
    new_algorithm = [108, 112, 105, 110, 107, 111, 109, 106, 113, 108]
    
    # Two-sample t-test (one-tailed)
    t_stat, p_value = stats.ttest_ind(new_algorithm, old_algorithm, 
                                       alternative='less')
    
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    
    alpha = 0.05
    if p_value < alpha:
        print("Reject H0: New algorithm is significantly faster")
    else:
        print("Fail to reject H0: No significant difference")
    
    # Effect size (Cohen's d, using the pooled sample standard deviation)
    pooled_std = np.sqrt(((len(old_algorithm)-1)*np.var(old_algorithm, ddof=1) + 
                          (len(new_algorithm)-1)*np.var(new_algorithm, ddof=1)) / 
                         (len(old_algorithm) + len(new_algorithm) - 2))
    cohens_d = (np.mean(new_algorithm) - np.mean(old_algorithm)) / pooled_std
    print(f"Effect size (Cohen's d): {cohens_d:.4f}")  # negative = new is faster


# Test types overview
"""
┌─────────────────────────────────────────────────────────────────┐
│                    Hypothesis Test Selection                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Data Type        │ Comparison      │ Test                     │
│   ─────────────────┼─────────────────┼─────────────────         │
│   Continuous       │ 2 groups        │ t-test                   │
│   Continuous       │ >2 groups       │ ANOVA                    │
│   Categorical      │ 2 groups        │ Chi-square               │
│   Categorical      │ >2 groups       │ Chi-square               │
│   Continuous       │ Before/After    │ Paired t-test            │
│   Continuous       │ Mean vs known   │ One-sample t-test        │
│                                                                 │
│   Non-parametric alternatives (when assumptions violated):      │
│   • Mann-Whitney U (instead of t-test)                          │
│   • Wilcoxon (instead of paired t-test)                         │
│   • Kruskal-Wallis (instead of ANOVA)                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
"""

Confidence Intervals

def confidence_interval(data, confidence=0.95):
    """Calculate confidence interval"""
    
    n = len(data)
    mean = np.mean(data)
    se = stats.sem(data)  # Standard error
    
    ci = stats.t.interval(confidence, n-1, loc=mean, scale=se)
    return mean, ci

# Example: API latency
latency_data = [45, 52, 48, 55, 62, 58, 51, 49, 53, 47]

mean, ci = confidence_interval(latency_data)
print(f"Mean: {mean:.2f}ms")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]ms")

A/B Testing

Implementing A/B Tests

class ABTest:
    """A/B test implementation"""
    
    def __init__(self, control_visitors, control_conversions,
                 treatment_visitors, treatment_conversions):
        self.control_n = control_visitors
        self.control_x = control_conversions
        self.treatment_n = treatment_visitors
        self.treatment_x = treatment_conversions
    
    def calculate_metrics(self):
        """Calculate conversion rates"""
        self.control_rate = self.control_x / self.control_n
        self.treatment_rate = self.treatment_x / self.treatment_n
        self.lift = (self.treatment_rate - self.control_rate) / self.control_rate
        
        return {
            'control_rate': self.control_rate,
            'treatment_rate': self.treatment_rate,
            'lift': self.lift
        }
    
    def statistical_test(self):
        """Perform two-proportion z-test"""
        # Rates are used below; compute them if calculate_metrics() wasn't called
        if not hasattr(self, 'treatment_rate'):
            self.calculate_metrics()
        
        # Pooled proportion
        p_pool = (self.control_x + self.treatment_x) / (self.control_n + self.treatment_n)
        
        # Standard error
        se = np.sqrt(p_pool * (1 - p_pool) * 
                    (1/self.control_n + 1/self.treatment_n))
        
        # Z-statistic
        z = (self.treatment_rate - self.control_rate) / se
        
        # Two-tailed p-value
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        
        return {
            'z_statistic': z,
            'p_value': p_value,
            'significant': p_value < 0.05
        }
    
    def sample_size_calculator(self, baseline_rate, minimum_detectable_effect, 
                              alpha=0.05, power=0.8):
        """Calculate required sample size"""
        p1 = baseline_rate
        p2 = baseline_rate * (1 + minimum_detectable_effect)  # MDE is relative
        p_bar = (p1 + p2) / 2  # average proportion under the alternative
        
        z_alpha = stats.norm.ppf(1 - alpha/2)
        z_beta = stats.norm.ppf(power)
        
        n = (2 * p_bar * (1 - p_bar) * 
             (z_alpha + z_beta)**2 / (p2 - p1)**2)
        
        return int(np.ceil(n))


# Example: Testing new checkout flow
ab_test = ABTest(
    control_visitors=5000,
    control_conversions=150,  # 3% conversion
    treatment_visitors=5000,
    treatment_conversions=185  # 3.7% conversion
)

metrics = ab_test.calculate_metrics()
test_results = ab_test.statistical_test()

print(f"Control: {metrics['control_rate']:.1%}")
print(f"Treatment: {metrics['treatment_rate']:.1%}")
print(f"Lift: {metrics['lift']:.1%}")
print(f"P-value: {test_results['p_value']:.4f}")
print(f"Significant: {test_results['significant']}")

# Calculate required sample size
required_n = ab_test.sample_size_calculator(0.03, 0.1)  # 10% relative MDE
print(f"Required sample size per variation: {required_n}")

Correlation and Regression

Correlation Analysis

def correlation_analysis():
    """Analyze correlations between variables"""
    
    # Example: Feature usage vs time spent
    features_used = [3, 5, 7, 2, 8, 6, 4, 9, 5, 7, 3, 6, 8, 4, 5]
    time_spent = [15, 25, 35, 10, 40, 30, 20, 45, 25, 35, 15, 30, 40, 20, 25]
    
    # Pearson correlation
    pearson_r, pearson_p = stats.pearsonr(features_used, time_spent)
    
    # Spearman (rank) correlation
    spearman_r, spearman_p = stats.spearmanr(features_used, time_spent)
    
    print(f"Pearson r: {pearson_r:.4f}, p: {pearson_p:.4f}")
    print(f"Spearman ρ: {spearman_r:.4f}, p: {spearman_p:.4f}")
    
    # Interpretation
    if abs(pearson_r) < 0.3:
        strength = "weak"
    elif abs(pearson_r) < 0.7:
        strength = "moderate"
    else:
        strength = "strong"
    
    direction = "positive" if pearson_r > 0 else "negative"
    print(f"Interpretation: {strength} {direction} correlation")


# Linear regression
from scipy.stats import linregress

def linear_regression():
    """Simple linear regression"""
    
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.1, 14.0, 16.2, 17.9, 20.1])
    
    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    
    print(f"Equation: y = {slope:.2f}x + {intercept:.2f}")
    print(f"R-squared: {r_value**2:.4f}")
    print(f"P-value: {p_value:.6f}")
    
    # Predict
    predicted = slope * 11 + intercept
    print(f"Prediction for x=11: {predicted:.2f}")

Common Statistical Mistakes

What to Avoid

┌─────────────────────────────────────────────────────────────────┐
│                   Common Statistical Mistakes                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. Ignoring Sample Size                                       │
│      ✗ Drawing conclusions from small samples                   │
│      ✓ Use power analysis to determine sample size              │
│                                                                 │
│   2. Confusing Correlation with Causation                       │
│      ✗ Assuming A causes B because they're correlated           │
│      ✓ Use controlled experiments to establish causation        │
│                                                                 │
│   3. P-Hacking                                                  │
│      ✗ Trying multiple tests until one "works"                  │
│      ✓ Pre-register hypotheses, adjust for multiple tests       │
│                                                                 │
│   4. Ignoring Effect Size                                       │
│      ✗ Focusing only on statistical significance                │
│      ✓ Report and interpret effect sizes                        │
│                                                                 │
│   5. Base Rate Neglect                                          │
│      ✗ Ignoring prior probabilities                             │
│      ✓ Consider false positive/negative rates                   │
│                                                                 │
│   6. Survivorship Bias                                          │
│      ✗ Analyzing only successful cases                          │
│      ✓ Include all relevant data, including failures            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
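
For mistake 3, the simplest guard when several tests run at once is a multiple-comparison correction. A minimal sketch of the Bonferroni adjustment (the p-values are illustrative):

# Bonferroni: compare each p-value against alpha / number_of_tests
p_values = [0.04, 0.01, 0.20, 0.03]  # illustrative results from 4 tests
alpha = 0.05
adjusted_alpha = alpha / len(p_values)

for i, p in enumerate(p_values):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"Test {i+1}: p = {p:.2f} -> {verdict} at adjusted α = {adjusted_alpha:.4f}")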

Practical Applications

System Performance Analysis

def analyze_performance(data):
    """Statistical analysis of system performance"""
    
    # Descriptive summary
    stats_summary = descriptive_statistics(data)
    print(f"Mean: {stats_summary['mean']:.2f}, Median: {stats_summary['median']:.2f}")
    
    # Check normality (Shapiro-Wilk is only calibrated up to ~5000 points)
    sample = data[:5000] if len(data) > 5000 else data
    _, p_normal = stats.shapiro(sample)
    print(f"Normality test p-value: {p_normal:.4f}")
    
    if p_normal < 0.05:
        print("Data is NOT normally distributed")
        print("Use median and IQR instead of mean and std")
    else:
        print("Data appears normally distributed")
    
    # Confidence interval for the 95th percentile via bootstrap resampling
    p95 = np.percentile(data, 95)
    rng = np.random.default_rng(0)
    boot_p95 = [np.percentile(rng.choice(data, size=len(data), replace=True), 95)
                for _ in range(1000)]
    ci_95 = np.percentile(boot_p95, [2.5, 97.5])
    print(f"P95: {p95:.2f}, 95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")

Best Practices

  1. Always report effect sizes: Statistical significance alone is insufficient
  2. Check assumptions: Normality, independence, equal variance (see the sketch after this list)
  3. Use appropriate tests: Match test to data type and question
  4. Pre-register hypotheses: Prevent p-hacking
  5. Consider practical significance: Statistical ≠ practical significance
  6. Visualize data: Always plot before drawing conclusions
  7. Report uncertainty: Include confidence intervals
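
For point 2, a minimal sketch of the usual pre-flight checks before a t-test, reusing the algorithm timings from earlier (Shapiro-Wilk for normality, Levene for equal variance):

from scipy import stats

old = [120, 115, 118, 122, 119, 121, 117, 120, 116, 118]
new = [108, 112, 105, 110, 107, 111, 109, 106, 113, 108]

# Normality: H0 = the sample comes from a normal distribution
for name, sample in [("old", old), ("new", new)]:
    _, p = stats.shapiro(sample)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Equal variance: H0 = the two samples have equal variances
_, p_var = stats.levene(old, new)
print(f"Levene p = {p_var:.3f}")

# If either check fails, prefer Mann-Whitney U (non-parametric)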

Conclusion

Statistics is an essential skill for programmers, enabling data-driven decisions, proper experiment design, and accurate data interpretation. By understanding descriptive statistics, probability distributions, hypothesis testing, and A/B testing, you can make more informed decisions and avoid common statistical pitfalls.
