Introduction
Statistics is fundamental to software development, from analyzing system performance to running A/B tests and making data-driven decisions. This comprehensive guide covers statistical concepts essential for programmers, with practical Python examples and real-world applications.
Why statistics matters:
- Statistical knowledge is a core requirement for most data science and analytics roles
- Well-run A/B tests routinely surface double-digit conversion-rate improvements
- Correctly interpreting p-values prevents many of the most common statistical mistakes
- Companies such as Netflix run large numbers of A/B experiments every year
Descriptive Statistics
Key Metrics
import numpy as np
from scipy import stats

def descriptive_statistics(data):
    """Calculate key descriptive statistics."""
    # Central tendency
    mean = np.mean(data)
    median = np.median(data)
    mode = stats.mode(data, keepdims=True)

    # Dispersion (ddof=1 gives the sample variance / std dev)
    variance = np.var(data, ddof=1)
    std_dev = np.std(data, ddof=1)
    range_val = np.max(data) - np.min(data)
    quartiles = np.percentile(data, [25, 50, 75])

    # Shape
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)

    return {
        "mean": mean,
        "median": median,
        "mode": mode.mode[0],
        "variance": variance,
        "std_dev": std_dev,
        "range": range_val,
        "q1": quartiles[0],
        "q3": quartiles[2],
        "iqr": quartiles[2] - quartiles[0],
        "skewness": skewness,
        "kurtosis": kurtosis,
    }

# Example: API response times (ms)
response_times = [45, 52, 48, 55, 62, 58, 51, 49, 53, 47,
                  150, 48, 50, 52, 49, 51, 250, 53, 47, 49]

stats_result = descriptive_statistics(response_times)
print(f"Mean: {stats_result['mean']:.2f}ms")
print(f"Median: {stats_result['median']:.2f}ms")
print(f"Std Dev: {stats_result['std_dev']:.2f}ms")
print(f"Skewness: {stats_result['skewness']:.2f}")
Outlier Detection
def detect_outliers(data, method='iqr'):
    """Detect outliers using the IQR or Z-score method."""
    if method == 'iqr':
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        outliers = [x for x in data if x < lower_bound or x > upper_bound]
        return outliers, lower_bound, upper_bound
    elif method == 'zscore':
        z_scores = np.abs(stats.zscore(data))
        outliers = [x for x, z in zip(data, z_scores) if z > 3]
        # Same return shape as the IQR branch for a consistent API
        return outliers, None, None
    raise ValueError(f"Unknown method: {method}")

# Detect outliers in response times
outliers, lower, upper = detect_outliers(response_times)
print(f"Outliers: {outliers}")
print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
Probability Distributions
Common Distributions
1. Normal (Gaussian)
   - Bell-shaped, symmetric
   - Used: heights, test scores, measurement errors
   - Parameters: μ (mean), σ (std dev)

2. Poisson
   - Count of events in a fixed interval
   - Used: API calls per minute, bugs per module
   - Parameter: λ (average rate)

3. Exponential
   - Time between events
   - Used: time between requests, failure times
   - Parameter: λ (rate)

4. Binomial
   - Number of successes in n trials
   - Used: conversion rates, test pass/fail
   - Parameters: n (trials), p (probability)

5. Uniform
   - Equal probability for all values
   - Used: random selection, load balancing
   - Parameters: a (min), b (max)
Working with Distributions in Python
import numpy as np
from scipy import stats

def demonstrate_distributions():
    """Work with the standard normal distribution."""
    # Normal PDF over [-4, 4]
    x = np.linspace(-4, 4, 100)
    y_normal = stats.norm.pdf(x, 0, 1)

    # Calculate probabilities: P(X < 1.96) for the standard normal
    p_value = stats.norm.cdf(1.96)
    print(f"P(X < 1.96) = {p_value:.4f}")  # ≈ 0.9750

    # Generate random samples
    samples = np.random.normal(0, 1, 1000)

    # Fit a distribution to data
    mu, sigma = stats.norm.fit(samples)
    print(f"Fitted: μ={mu:.3f}, σ={sigma:.3f}")
    return x, y_normal

# Poisson for API calls:
# P(exactly 5 calls in a minute if the average is 3)
prob_5_calls = stats.poisson.pmf(5, 3)
print(f"P(5 calls) = {prob_5_calls:.4f}")  # ≈ 0.1008

# P(at least 10 calls)
prob_at_least_10 = 1 - stats.poisson.cdf(9, 3)
print(f"P(at least 10) = {prob_at_least_10:.6f}")
Hypothesis Testing
Core Concepts
def hypothesis_test_example():
    """
    Example: testing whether a new algorithm is faster (lower times).
    H0: μ_new ≥ μ_old (no improvement)
    H1: μ_new < μ_old (new is faster)
    """
    # Sample data: execution times (ms)
    old_algorithm = [120, 115, 118, 122, 119, 121, 117, 120, 116, 118]
    new_algorithm = [108, 112, 105, 110, 107, 111, 109, 106, 113, 108]

    # Two-sample t-test, one-tailed: mean of new < mean of old
    t_stat, p_value = stats.ttest_ind(new_algorithm, old_algorithm,
                                      alternative='less')
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")

    alpha = 0.05
    if p_value < alpha:
        print("Reject H0: New algorithm is significantly faster")
    else:
        print("Fail to reject H0: No significant difference")

    # Effect size (Cohen's d), using sample variances (ddof=1)
    pooled_std = np.sqrt(((len(old_algorithm) - 1) * np.var(old_algorithm, ddof=1) +
                          (len(new_algorithm) - 1) * np.var(new_algorithm, ddof=1)) /
                         (len(old_algorithm) + len(new_algorithm) - 2))
    cohens_d = (np.mean(new_algorithm) - np.mean(old_algorithm)) / pooled_std
    print(f"Effect size (Cohen's d): {cohens_d:.4f}")
Test selection overview:

| Data Type   | Comparison      | Test              |
|-------------|-----------------|-------------------|
| Continuous  | 2 groups        | t-test            |
| Continuous  | >2 groups       | ANOVA             |
| Categorical | 2 groups        | Chi-square        |
| Categorical | >2 groups       | Chi-square        |
| Continuous  | Before/After    | Paired t-test     |
| Continuous  | Mean vs known   | One-sample t-test |

Non-parametric alternatives (when assumptions are violated):
- Mann-Whitney U (instead of the t-test)
- Wilcoxon signed-rank (instead of the paired t-test)
- Kruskal-Wallis (instead of ANOVA)
Confidence Intervals
def confidence_interval(data, confidence=0.95):
    """Calculate a t-based confidence interval for the mean."""
    n = len(data)
    mean = np.mean(data)
    se = stats.sem(data)  # standard error of the mean
    ci = stats.t.interval(confidence, n - 1, loc=mean, scale=se)
    return mean, ci

# Example: API latency
latency_data = [45, 52, 48, 55, 62, 58, 51, 49, 53, 47]
mean, ci = confidence_interval(latency_data)
print(f"Mean: {mean:.2f}ms")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]ms")
A/B Testing
Implementing A/B Tests
class ABTest:
    """A/B test implementation using a two-proportion z-test."""

    def __init__(self, control_visitors, control_conversions,
                 treatment_visitors, treatment_conversions):
        self.control_n = control_visitors
        self.control_x = control_conversions
        self.treatment_n = treatment_visitors
        self.treatment_x = treatment_conversions
        # Compute rates up front so each method works independently
        self.control_rate = self.control_x / self.control_n
        self.treatment_rate = self.treatment_x / self.treatment_n

    def calculate_metrics(self):
        """Calculate conversion rates and relative lift."""
        self.lift = (self.treatment_rate - self.control_rate) / self.control_rate
        return {
            'control_rate': self.control_rate,
            'treatment_rate': self.treatment_rate,
            'lift': self.lift,
        }

    def statistical_test(self):
        """Perform a two-proportion z-test."""
        # Pooled proportion under H0 (no difference)
        p_pool = (self.control_x + self.treatment_x) / (self.control_n + self.treatment_n)
        # Standard error of the difference in proportions
        se = np.sqrt(p_pool * (1 - p_pool) *
                     (1 / self.control_n + 1 / self.treatment_n))
        # Z-statistic
        z = (self.treatment_rate - self.control_rate) / se
        # Two-tailed p-value
        p_value = 2 * (1 - stats.norm.cdf(abs(z)))
        return {
            'z_statistic': z,
            'p_value': p_value,
            'significant': p_value < 0.05,
        }

    def sample_size_calculator(self, baseline_rate, minimum_detectable_effect,
                               alpha=0.05, power=0.8):
        """Approximate required sample size per variation (pooled-variance formula)."""
        p1 = baseline_rate
        p2 = baseline_rate * (1 + minimum_detectable_effect)
        p_bar = (p1 + p2) / 2
        z_alpha = stats.norm.ppf(1 - alpha / 2)
        z_beta = stats.norm.ppf(power)
        n = (2 * p_bar * (1 - p_bar) *
             (z_alpha + z_beta) ** 2 / (p2 - p1) ** 2)
        return int(np.ceil(n))

# Example: testing a new checkout flow
ab_test = ABTest(
    control_visitors=5000,
    control_conversions=150,    # 3.0% conversion
    treatment_visitors=5000,
    treatment_conversions=185,  # 3.7% conversion
)

metrics = ab_test.calculate_metrics()
test_results = ab_test.statistical_test()

print(f"Control: {metrics['control_rate']:.1%}")
print(f"Treatment: {metrics['treatment_rate']:.1%}")
print(f"Lift: {metrics['lift']:.1%}")
print(f"P-value: {test_results['p_value']:.4f}")
print(f"Significant: {test_results['significant']}")

# Calculate required sample size for a 10% relative MDE
required_n = ab_test.sample_size_calculator(0.03, 0.1)
print(f"Required sample size per variation: {required_n}")
Correlation and Regression
Correlation Analysis
def correlation_analysis():
    """Analyze correlations between variables."""
    # Example: feature usage vs time spent
    features_used = [3, 5, 7, 2, 8, 6, 4, 9, 5, 7, 3, 6, 8, 4, 5]
    time_spent = [15, 25, 35, 10, 40, 30, 20, 45, 25, 35, 15, 30, 40, 20, 25]

    # Pearson (linear) correlation
    pearson_r, pearson_p = stats.pearsonr(features_used, time_spent)
    # Spearman (rank) correlation
    spearman_r, spearman_p = stats.spearmanr(features_used, time_spent)

    print(f"Pearson r: {pearson_r:.4f}, p: {pearson_p:.4f}")
    print(f"Spearman ρ: {spearman_r:.4f}, p: {spearman_p:.4f}")

    # Interpretation
    if abs(pearson_r) < 0.3:
        strength = "weak"
    elif abs(pearson_r) < 0.7:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if pearson_r > 0 else "negative"
    print(f"Interpretation: {strength} {direction} correlation")
# Linear regression
from scipy.stats import linregress

def linear_regression():
    """Simple linear regression."""
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    y = np.array([2.1, 4.3, 5.8, 8.2, 9.9, 12.1, 14.0, 16.2, 17.9, 20.1])

    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    print(f"Equation: y = {slope:.2f}x + {intercept:.2f}")
    print(f"R-squared: {r_value**2:.4f}")
    print(f"P-value: {p_value:.6f}")

    # Predict for a new x value
    predicted = slope * 11 + intercept
    print(f"Prediction for x=11: {predicted:.2f}")
Common Statistical Mistakes
What to Avoid
1. Ignoring Sample Size
   ✗ Drawing conclusions from small samples
   ✓ Use power analysis to determine sample size

2. Confusing Correlation with Causation
   ✗ Assuming A causes B because they're correlated
   ✓ Use controlled experiments to establish causation

3. P-Hacking
   ✗ Trying multiple tests until one "works"
   ✓ Pre-register hypotheses, adjust for multiple tests

4. Ignoring Effect Size
   ✗ Focusing only on statistical significance
   ✓ Report and interpret effect sizes

5. Base Rate Neglect
   ✗ Ignoring prior probabilities
   ✓ Consider false positive/negative rates

6. Survivorship Bias
   ✗ Analyzing only successful cases
   ✓ Include all relevant data, including failures
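For the p-hacking point, the standard remedy when you legitimately run several tests is a multiple-comparison correction. A minimal sketch of the Bonferroni and Holm adjustments (the p-values below are made up for illustration; both helpers are ours):

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p < alpha / number_of_tests."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: less conservative than Bonferroni, same FWER guarantee."""
    m = len(p_values)
    order = np.argsort(p_values)  # test the smallest p-value first
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p_values = [0.003, 0.04, 0.012, 0.2, 0.03]
print("Bonferroni:", bonferroni(p_values))  # [True, False, False, False, False]
print("Holm:      ", holm(p_values))        # [True, False, True, False, False]
```

Holm rejects a superset of what Bonferroni rejects (here it also catches p = 0.012) while controlling the same family-wise error rate, which is why it is usually the better default.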
Practical Applications
System Performance Analysis
def analyze_performance(data):
    """Statistical analysis of system performance data."""
    # Descriptive summary
    stats_summary = descriptive_statistics(data)

    # Check normality (Shapiro-Wilk is reliable only up to ~5000 samples)
    _, p_normal = stats.shapiro(data[:5000] if len(data) > 5000 else data)
    print(f"Normality test p-value: {p_normal:.4f}")
    if p_normal < 0.05:
        print("Data is NOT normally distributed")
        print("Use median and IQR instead of mean and std")
    else:
        print("Data appears normally distributed")

    # Confidence interval for the p95 latency via bootstrap resampling
    # (there is no simple analytic SE for a percentile)
    p95 = np.percentile(data, 95)
    rng = np.random.default_rng(0)
    boot_p95 = [np.percentile(rng.choice(data, size=len(data), replace=True), 95)
                for _ in range(1000)]
    ci_95 = np.percentile(boot_p95, [2.5, 97.5])
    print(f"P95: {p95:.2f}, 95% CI: [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
Best Practices
- Always report effect sizes: Statistical significance alone is insufficient
- Check assumptions: Normality, independence, equal variance
- Use appropriate tests: Match test to data type and question
- Pre-register hypotheses: Prevent p-hacking
- Consider practical significance: Statistical ≠ practical significance
- Visualize data: Always plot before drawing conclusions
- Report uncertainty: Include confidence intervals
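The "check assumptions" practice can be automated. For example, Levene's test for equal variances can decide between Student's and Welch's t-test; a sketch under illustrative data (the `compare_means` helper is ours):

```python
from scipy import stats

def compare_means(a, b, alpha=0.05):
    """Two-sample comparison that checks the equal-variance assumption first."""
    # Levene's test: H0 = the two groups have equal variances
    _, p_levene = stats.levene(a, b)
    equal_var = bool(p_levene >= alpha)
    # Welch's t-test (equal_var=False) when variances clearly differ
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
    return {"equal_var": equal_var, "t": t_stat, "p": p_value}

group_a = [120, 115, 118, 122, 119, 121, 117, 120, 116, 118]
group_b = [108, 130, 95, 140, 100, 125, 90, 135, 112, 99]  # much higher variance
result = compare_means(group_a, group_b)
print(result)
```

Here Levene's test flags unequal variances, so the comparison silently falls back to Welch's t-test instead of assuming a pooled variance that the data do not support.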
Conclusion
Statistics is an essential skill for programmers, enabling data-driven decisions, proper experiment design, and accurate data interpretation. By understanding descriptive statistics, probability distributions, hypothesis testing, and A/B testing, you can make more informed decisions and avoid common statistical pitfalls.