
Statistics Fundamentals for Data Science

Introduction

Statistics is the grammar of data science. In 2026, with organizations generating unprecedented volumes of data, the ability to extract meaningful insights from noise has become a critical skill for developers and engineers. Statistics provides the mathematical foundation for making inferences, predictions, and decisions from data.

This guide covers essential statistical concepts from descriptive statistics through advanced inference methods. We’ll focus on practical implementation with Python, using real-world examples from analytics, A/B testing, and machine learning. Each concept is reinforced with code demonstrations that you can adapt for your own projects.

Whether you’re analyzing user behavior, evaluating experiment results, or building data pipelines, statistical literacy empowers you to make evidence-based decisions confidently. We’ll move from fundamental concepts to sophisticated techniques, building a comprehensive toolkit for data analysis.

Descriptive Statistics

Understanding Your Data

Before diving into complex analyses, you must understand your data. Descriptive statistics summarize and visualize data, revealing patterns, outliers, and distributions. This exploratory phase guides subsequent analysis and often surfaces insights directly.

Measures of central tendency describe the “typical” value: mean (arithmetic average), median (middle value), and mode (most frequent). Each tells a different story. The mean is sensitive to outliers, the median is robust, and the mode identifies peaks in categorical distributions.

Measures of dispersion describe spread: variance, standard deviation, range, and interquartile range. A dataset with high variance is more “noisy” and harder to predict. Understanding dispersion helps set realistic expectations for model performance.

import numpy as np
from collections import Counter

class DescriptiveStatistics:
    """Utilities for descriptive statistics."""
    
    @staticmethod
    def mean(data: list) -> float:
        """Calculate arithmetic mean."""
        return sum(data) / len(data) if data else 0
    
    @staticmethod
    def median(data: list) -> float:
        """Calculate median value."""
        if not data:
            return 0
        sorted_data = sorted(data)
        n = len(sorted_data)
        mid = n // 2
        if n % 2 == 0:
            return (sorted_data[mid - 1] + sorted_data[mid]) / 2
        return sorted_data[mid]
    
    @staticmethod
    def mode(data: list):
        """Calculate mode(s) - returns list of modes."""
        if not data:
            return []
        counts = Counter(data)
        max_count = max(counts.values())
        return [k for k, v in counts.items() if v == max_count]
    
    @staticmethod
    def variance(data: list, sample: bool = True) -> float:
        """Calculate variance. Sample=True uses n-1 denominator (Bessel's correction)."""
        if not data:
            return 0
        m = DescriptiveStatistics.mean(data)
        n = len(data) - 1 if sample else len(data)
        return sum((x - m) ** 2 for x in data) / n
    
    @staticmethod
    def std_dev(data: list, sample: bool = True) -> float:
        """Calculate standard deviation."""
        return np.sqrt(DescriptiveStatistics.variance(data, sample))
    
    @staticmethod
    def quartiles(data: list) -> dict:
        """Calculate Q1, Q2 (median), Q3 and IQR."""
        sorted_data = sorted(data)
        n = len(sorted_data)
        
        def percentile(p):
            """Calculatepth percentile."""
            idx = (n - 1) * p
            lower = int(np.floor(idx))
            upper = int(np.ceil(idx))
            weight = idx - lower
            return sorted_data[lower] * (1 - weight) + sorted_data[upper] * weight
        
        return {
            'Q1': percentile(0.25),
            'Q2': percentile(0.50),
            'Q3': percentile(0.75),
            'IQR': percentile(0.75) - percentile(0.25)
        }

# Example: Analyze website response times
response_times = [45, 52, 48, 67, 43, 51, 49, 55, 62, 58, 
                 44, 156, 47, 53, 50, 61, 54, 46, 49, 52]

print("Response Time Analysis (ms):")
print(f"  Mean: {DescriptiveStatistics.mean(response_times):.2f}")
print(f"  Median: {DescriptiveStatistics.median(response_times):.2f}")
print(f"  Mode: {DescriptiveStatistics.mode(response_times)}")
print(f"  Std Dev: {DescriptiveStatistics.std_dev(response_times):.2f}")
print(f"  Quartiles: {DescriptiveStatistics.quartiles(response_times)}")

# Note: 156ms looks like an outlier - worth investigating

Data Visualization

Visualization transforms abstract numbers into intuitive understanding. The right chart reveals patterns that summary statistics miss. Histograms show distribution shape, box plots compare groups, scatter plots reveal relationships, and time series plots display trends.

Modern Python offers powerful visualization libraries. Matplotlib provides fine-grained control, Seaborn creates statistical graphics elegantly, and Plotly enables interactive web visualizations. For 2026 applications, consider integrating visualizations into dashboards or building interactive explorers with Streamlit or Dash.

import matplotlib.pyplot as plt

def visualize_distribution(data: list, title: str = "Distribution"):
    """Create histogram with statistical annotations."""
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # Histogram
    counts, bins, patches = ax.hist(data, bins=20, edgecolor='black', alpha=0.7)
    
    # Add mean and median lines
    mean_val = DescriptiveStatistics.mean(data)
    median_val = DescriptiveStatistics.median(data)
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.1f}')
    ax.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: {median_val:.1f}')
    
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.set_title(title)
    ax.legend()
    
    return fig, ax

# Example: Compare two distributions
np.random.seed(42)
normal_data = np.random.normal(100, 15, 1000)
skewed_data = np.random.exponential(15, 1000) + 85

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.hist(normal_data, bins=30, edgecolor='black', alpha=0.7)
ax1.set_title('Normal Distribution (symmetric)')
ax2.hist(skewed_data, bins=30, edgecolor='black', alpha=0.7)
ax2.set_title('Exponential + 85 (right-skewed)')
plt.tight_layout()
plt.show()

print(f"Normal - Mean: {np.mean(normal_data):.2f}, Median: {np.median(normal_data):.2f}")
print(f"Skewed - Mean: {np.mean(skewed_data):.2f}, Median: {np.median(skewed_data):.2f}")

Probability Distributions in Statistics

Common Distributions

Statistical analysis relies heavily on probability distributions. Understanding these distributions (when they arise, how to identify them, and how to use them) provides the foundation for inference and modeling.

The normal distribution appears everywhere in statistics due to the Central Limit Theorem. Many natural phenomena approximate normality, and sample means follow it regardless of the underlying distribution (given sufficient sample size). This makes normal-based methods widely applicable.

The t-distribution resembles the normal but has heavier tails. It's essential when working with small samples (n < 30), where the normal approximation is unreliable. As sample size increases, the t-distribution approaches the normal.

The chi-square distribution appears in variance analysis and goodness-of-fit tests. It's used to test whether observed frequencies match expected frequencies, a fundamental question in categorical data analysis.
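A quick way to see the t-distribution's heavier tails is to compare two-sided 95% critical values from scipy; the t value shrinks toward the normal's 1.96 as degrees of freedom grow:

```python
from scipy import stats

# Two-sided 95% critical values: t vs. normal
for df in [5, 10, 30, 100]:
    t_crit = stats.t.ppf(0.975, df)
    print(f"df={df:>3}: t critical = {t_crit:.3f}")

z_crit = stats.norm.ppf(0.975)
print(f"normal critical = {z_crit:.3f}")
```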

from scipy import stats
import numpy as np

class DistributionAnalysis:
    """Statistical distributions for data analysis."""
    
    @staticmethod
    def normal_test(data: list, alpha: float = 0.05) -> dict:
        """Test if data follows normal distribution using Shapiro-Wilk test."""
        statistic, p_value = stats.shapiro(data)
        return {
            'test': 'Shapiro-Wilk',
            'statistic': statistic,
            'p_value': p_value,
            'is_normal': p_value > alpha,
            'interpretation': 'Normal' if p_value > alpha else 'Not normal'
        }
    
    @staticmethod
    def confidence_interval_mean(data: list, confidence: float = 0.95) -> tuple:
        """Calculate confidence interval for the mean."""
        n = len(data)
        mean = np.mean(data)
        se = stats.sem(data)  # Standard error of mean
        h = se * stats.t.ppf((1 + confidence) / 2, n - 1)
        return mean - h, mean + h
    
    @staticmethod
    def bootstrap_ci(data: list, statistic_func, n_bootstrap: int = 10000, 
                    confidence: float = 0.95) -> tuple:
        """Calculate bootstrap confidence interval for any statistic."""
        bootstrap_stats = []
        n = len(data)
        
        for _ in range(n_bootstrap):
            sample = np.random.choice(data, size=n, replace=True)
            bootstrap_stats.append(statistic_func(sample))
        
        alpha = 1 - confidence
        lower = np.percentile(bootstrap_stats, alpha / 2 * 100)
        upper = np.percentile(bootstrap_stats, (1 - alpha / 2) * 100)
        
        return lower, upper

# Example: Test normality of response times
response_times = [45, 52, 48, 67, 43, 51, 49, 55, 62, 58, 
                  44, 156, 47, 53, 50, 61, 54, 46, 49, 52]

result = DistributionAnalysis.normal_test(response_times)
print(f"Normality test: {result}")

# Confidence interval for mean response time
ci = DistributionAnalysis.confidence_interval_mean(response_times)
print(f"95% CI for mean: [{ci[0]:.2f}, {ci[1]:.2f}] ms")

# Bootstrap CI for median
bootstrap_ci = DistributionAnalysis.bootstrap_ci(response_times, np.median)
print(f"Bootstrap 95% CI for median: [{bootstrap_ci[0]:.2f}, {bootstrap_ci[1]:.2f}]")

The Central Limit Theorem

The Central Limit Theorem (CLT) states that the sampling distribution of the mean approaches normal as sample size increases, regardless of the original distribution. This theorem is the foundation for most statistical inference: it justifies using normal-based methods even when data isn't normally distributed.

In practice, the CLT means that with sufficiently large samples (typically n ≥ 30), we can treat sample means as normally distributed. This enables confidence intervals and hypothesis tests for means even when the underlying population is skewed or multimodal.

class CentralLimitTheorem:
    """Demonstrate and apply the Central Limit Theorem."""
    
    @staticmethod
    def demonstrate_clt(population_dist: str, sample_sizes: list, 
                       n_samples: int = 1000):
        """
        Demonstrate CLT by showing sample means approach normal distribution.
        """
        np.random.seed(42)
        
        # Generate population
        if population_dist == 'exponential':
            population = np.random.exponential(2, 10000)
        elif population_dist == 'uniform':
            population = np.random.uniform(0, 10, 10000)
        elif population_dist == 'bimodal':
            population = np.concatenate([
                np.random.normal(3, 1, 5000),
                np.random.normal(7, 1, 5000)
            ])
        
        # For each sample size, calculate sample means
        results = {}
        for n in sample_sizes:
            sample_means = []
            for _ in range(n_samples):
                sample = np.random.choice(population, size=n, replace=True)
                sample_means.append(np.mean(sample))
            results[n] = sample_means
        
        return results, population

# Demonstrate CLT
sample_sizes = [5, 10, 30, 100]
results, population = CentralLimitTheorem.demonstrate_clt('exponential', sample_sizes)

# Plot the results
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, n in enumerate(sample_sizes):
    axes[idx].hist(results[n], bins=30, edgecolor='black', alpha=0.7, density=True)
    axes[idx].set_title(f'Sample size n={n}')
    axes[idx].set_xlabel('Sample mean')
    axes[idx].set_ylabel('Density')
    
    # Overlay normal distribution
    mu, sigma = np.mean(population), np.std(population) / np.sqrt(n)
    x = np.linspace(min(results[n]), max(results[n]), 100)
    axes[idx].plot(x, stats.norm.pdf(x, mu, sigma), 'r-', lw=2, label='Theoretical normal')

plt.suptitle('Central Limit Theorem Demonstration\n(Sample means approach normal distribution)', 
             fontsize=14)
plt.tight_layout()
plt.show()

Hypothesis Testing

Fundamentals of Statistical Tests

Hypothesis testing provides a rigorous framework for making decisions from data. We start with a null hypothesis (H₀), typically a status quo or no-effect claim, and an alternative hypothesis (H₁), the claim we want to demonstrate. Then we collect data and determine whether the evidence supports rejecting H₀.

The process involves calculating a test statistic from the data and comparing it to a null distribution. If the statistic falls in an extreme region (the rejection region), we reject H₀ in favor of H₁. The probability of incorrectly rejecting a true null is the significance level (α), typically 0.05.

P-values quantify evidence against H₀. A p-value of 0.03 means there's only a 3% chance of observing data as extreme as we did (or more extreme) if H₀ were true. Small p-values provide evidence against H₀. However, p-values are often misunderstood: they don't give the probability that H₀ is true, nor do they measure effect size.

class HypothesisTesting:
    """Common statistical hypothesis tests."""
    
    @staticmethod
    def one_sample_t_test(data: list, mu_0: float, alpha: float = 0.05) -> dict:
        """
        One-sample t-test: Test if sample mean differs from hypothesized population mean.
        """
        n = len(data)
        mean = np.mean(data)
        std = np.std(data, ddof=1)
        se = std / np.sqrt(n)
        t_stat = (mean - mu_0) / se
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), n - 1))
        
        return {
            'test': 'One-sample t-test',
            'n': n,
            'mean': mean,
            'hypothesized_mean': mu_0,
            't_statistic': t_stat,
            'p_value': p_value,
            'reject_null': p_value < alpha,
            'interpretation': f"Mean {'differs' if p_value < alpha else 'does not differ'} from {mu_0}"
        }
    
    @staticmethod
    def two_sample_t_test(data1: list, data2: list, 
                         equal_variance: bool = True, alpha: float = 0.05) -> dict:
        """
        Two-sample t-test: Test if two groups have different means.
        """
        n1, n2 = len(data1), len(data2)
        mean1, mean2 = np.mean(data1), np.mean(data2)
        var1, var2 = np.var(data1, ddof=1), np.var(data2, ddof=1)
        
        if equal_variance:
            # Pooled variance
            pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
            se = np.sqrt(pooled_var * (1/n1 + 1/n2))
            df = n1 + n2 - 2
        else:
            # Welch's t-test (unequal variances)
            se = np.sqrt(var1/n1 + var2/n2)
            df = (var1/n1 + var2/n2)**2 / ((var1/n1)**2/(n1-1) + (var2/n2)**2/(n2-1))
        
        t_stat = (mean1 - mean2) / se
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
        
        return {
            'test': "Two-sample t-test" + (" (Welch)" if not equal_variance else ""),
            'n1': n1, 'n2': n2,
            'mean1': mean1, 'mean2': mean2,
            't_statistic': t_stat,
            'p_value': p_value,
            'reject_null': p_value < alpha
        }
    
    @staticmethod
    def paired_t_test(before: list, after: list, alpha: float = 0.05) -> dict:
        """
        Paired t-test: Test if paired observations have different means.
        """
        differences = [a - b for a, b in zip(after, before)]
        n = len(differences)
        mean_diff = np.mean(differences)
        std_diff = np.std(differences, ddof=1)
        se = std_diff / np.sqrt(n)
        t_stat = mean_diff / se
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), n - 1))
        
        return {
            'test': 'Paired t-test',
            'n': n,
            'mean_difference': mean_diff,
            't_statistic': t_stat,
            'p_value': p_value,
            'reject_null': p_value < alpha
        }

# Example: A/B test analysis
# Group A (control): conversion rates
group_a = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
group_b = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

result = HypothesisTesting.two_sample_t_test(group_a, group_b)
print(f"A/B Test Results:")
print(f"  Group A mean: {result['mean1']:.3f}")
print(f"  Group B mean: {result['mean2']:.3f}")
print(f"  t-statistic: {result['t_statistic']:.3f}")
print(f"  p-value: {result['p_value']:.4f}")
print(f"  Significant at ฮฑ=0.05: {result['reject_null']}")

Common Statistical Tests

Different situations call for different tests. Here’s a quick reference:

T-tests compare means: one-sample (against a known value), two-sample (independent groups), or paired (before/after). ANOVA extends this to compare more than two groups. Chi-square tests compare categorical frequencies. Correlation tests measure association between variables. Regression tests model relationships between predictors and outcomes.

The choice depends on data type (continuous vs. categorical), number of groups, sample size, and assumptions (normality, variance equality). Modern approaches emphasize effect sizes and confidence intervals alongside p-values.

class AdvancedTests:
    """Advanced statistical tests."""
    
    @staticmethod
    def anova_one_way(*groups, alpha: float = 0.05) -> dict:
        """
        One-way ANOVA: Test if 3+ groups have different means.
        """
        f_stat, p_value = stats.f_oneway(*groups)
        
        return {
            'test': 'One-way ANOVA',
            'f_statistic': f_stat,
            'p_value': p_value,
            'reject_null': p_value < alpha,
            'n_groups': len(groups),
            'group_means': [np.mean(g) for g in groups]
        }
    
    @staticmethod
    def chi_square_test(observed: list, expected: list = None) -> dict:
        """
        Chi-square test for categorical data.
        """
        if expected is None:
            # Test of independence (observed is a contingency table)
            chi2, p_value, dof, expected = stats.chi2_contingency(observed)
        else:
            # Goodness of fit
            chi2, p_value = stats.chisquare(observed, expected)
            dof = len(observed) - 1
        
        return {
            'test': 'Chi-square test',
            'chi2_statistic': chi2,
            'p_value': p_value,
            'degrees_of_freedom': dof,
            'expected_frequencies': expected
        }
    
    @staticmethod
    def correlation_test(x: list, y: list, method: str = 'pearson') -> dict:
        """
        Test correlation between two variables.
        """
        if method == 'pearson':
            corr, p_value = stats.pearsonr(x, y)
        elif method == 'spearman':
            corr, p_value = stats.spearmanr(x, y)
        
        return {
            'test': f'{method.capitalize()} correlation',
            'correlation': corr,
            'p_value': p_value,
            'interpretation': 'Significant' if p_value < 0.05 else 'Not significant'
        }

# Example: ANOVA - comparing 3 product versions
version_a = [85, 87, 82, 86, 88, 84, 85]
version_b = [92, 89, 94, 91, 90, 93, 92]
version_c = [78, 82, 79, 81, 80, 77, 79]

anova_result = AdvancedTests.anova_one_way(version_a, version_b, version_c)
print(f"ANOVA Results:")
print(f"  F-statistic: {anova_result['f_statistic']:.3f}")
print(f"  p-value: {anova_result['p_value']:.6f}")
print(f"  Significant: {anova_result['reject_null']}")
print(f"  Group means: A={anova_result['group_means'][0]:.1f}, "
      f"B={anova_result['group_means'][1]:.1f}, C={anova_result['group_means'][2]:.1f}")

# Example: Chi-square test - product color preferences
observed = [35, 28, 22, 15]  # Red, Blue, Green, Yellow
expected = [30, 30, 30, 10]  # Hypothesized distribution
chi_result = AdvancedTests.chi_square_test(observed, expected)
print(f"\nChi-square test:")
print(f"  ฯ‡ยฒ statistic: {chi_result['chi2_statistic']:.3f}")
print(f"  p-value: {chi_result['p_value']:.4f}")

Regression Analysis

Simple Linear Regression

Regression analysis models relationships between variables. Simple linear regression fits a straight line through data points, predicting a dependent variable (Y) from an independent variable (X). The model is Y = β₀ + β₁X + ε, where β₀ is the intercept, β₁ is the slope, and ε is random error.

The coefficient of determination (R²) measures how well the model explains variance in Y. An R² of 0.75 means the model explains 75% of the variance. However, R² can be misleading: adding more predictors always increases it, so adjusted R² penalizes complexity.

Coefficient interpretation requires caution. The slope β₁ estimates the change in Y for each unit increase in X. But correlation doesn't imply causation: confounding variables might drive both. Additionally, extrapolation beyond the observed data range is risky.
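Adjusted R² has a closed form, 1 − (1 − R²)(n − 1)/(n − k − 1), for n observations and k predictors. A minimal helper (the function name is ours, for illustration):

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared: penalizes R2 for the number of predictors k,
    given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same R2 of 0.75 looks worse once 10 predictors are counted
print(adjusted_r_squared(0.75, n=50, k=1))   # ~0.745
print(adjusted_r_squared(0.75, n=50, k=10))  # ~0.686
```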

class LinearRegression:
    """Simple linear regression implementation."""
    
    def __init__(self):
        self.coef_ = None
        self.intercept_ = None
        self.r_squared_ = None
    
    def fit(self, X: list, y: list) -> 'LinearRegression':
        """Fit linear regression model."""
        n = len(X)
        x_mean = np.mean(X)
        y_mean = np.mean(y)
        
        # Calculate slope and intercept
        numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(X, y))
        denominator = sum((x - x_mean) ** 2 for x in X)
        
        self.coef_ = numerator / denominator
        self.intercept_ = y_mean - self.coef_ * x_mean
        
        # Calculate R-squared
        y_pred = [self.intercept_ + self.coef_ * x for x in X]
        ss_res = sum((y_i - yp) ** 2 for y_i, yp in zip(y, y_pred))
        ss_tot = sum((y_actual - y_mean) ** 2 for y_actual in y)
        self.r_squared_ = 1 - (ss_res / ss_tot)
        
        return self
    
    def predict(self, X: list) -> list:
        """Make predictions."""
        return [self.intercept_ + self.coef_ * x for x in X]
    
    def summary(self) -> dict:
        """Return model summary."""
        return {
            'intercept': self.intercept_,
            'coefficient': self.coef_,
            'r_squared': self.r_squared_
        }

# Example: Predicting sales from advertising spend
ad_spend = [100, 150, 200, 250, 300, 350, 400, 450, 500, 550]
sales = [1200, 1450, 1800, 2100, 2400, 2750, 3100, 3400, 3800, 4200]

model = LinearRegression()
model.fit(ad_spend, sales)

print("Linear Regression Results:")
print(f"  Intercept: {model.intercept_:.2f}")
print(f"  Coefficient: {model.coef_:.4f}")
print(f"  R-squared: {model.r_squared_:.4f}")
print(f"\nInterpretation:")
print(f"  Each $1 increase in ad spend predicts ${model.coef_:.2f} in sales")
print(f"  Model explains {model.r_squared_*100:.1f}% of variance in sales")

# Predictions
test_spend = [175, 425]
predictions = model.predict(test_spend)
print(f"\nPredicted sales for ad spend $175, $425:")
print(f"  {predictions[0]:.0f}, {predictions[1]:.0f}")

Multiple Regression

Multiple regression extends simple regression with multiple predictors. The model is Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε. This allows us to isolate the effect of each predictor while controlling for others, a crucial capability for understanding real-world relationships.

Interpreting coefficients requires understanding multicollinearity. When predictors correlate highly, individual coefficients become unstable and potentially misleading. Variance Inflation Factor (VIF) detects problematic multicollinearity; VIF > 10 indicates concerning correlation.

Feature selection methods help identify the most important predictors. Forward selection starts with no predictors and adds them one by one. Backward elimination starts with all predictors and removes insignificant ones. LASSO regression adds L1 regularization that can shrink coefficients to zero, performing selection automatically.
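A sketch of the VIF idea, using the identity that the VIFs are the diagonal of the inverse correlation matrix of the predictors (the data here is synthetic):

```python
import numpy as np

def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF for each column of X: the diagonal of the inverse correlation
    matrix. VIF > 10 is a common rule of thumb for problematic
    multicollinearity."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic predictors: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # highly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = variance_inflation_factors(X)
print(f"VIFs: x1={vifs[0]:.1f}, x2={vifs[1]:.1f}, x3={vifs[2]:.1f}")
```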

from sklearn.linear_model import LinearRegression as SKLinearRegression
from sklearn.preprocessing import StandardScaler

class MultipleRegression:
    """Multiple linear regression with sklearn."""
    
    def __init__(self):
        self.model = SKLinearRegression()
        self.scaler = StandardScaler()
        self.coefficients_ = None
        self._X_scaled = None
        self._y = None
        
    def fit(self, X: list, y: list) -> 'MultipleRegression':
        """Fit multiple regression on standardized features."""
        self._X_scaled = self.scaler.fit_transform(X)
        self._y = y
        self.model.fit(self._X_scaled, y)
        
        # Convert coefficients back to the original feature scale
        self.coefficients_ = self.model.coef_ / self.scaler.scale_
        
        return self
    
    def predict(self, X: list) -> list:
        """Make predictions."""
        X_scaled = self.scaler.transform(X)
        return self.model.predict(X_scaled)
    
    def r_squared(self) -> float:
        """Return R-squared on the training data."""
        return self.model.score(self._X_scaled, self._y)
    
    def coefficients(self, feature_names: list) -> dict:
        """Return coefficients with feature names."""
        return dict(zip(feature_names, self.coefficients_))

# Example: Predicting house prices
# Features: sqft, bedrooms, bathrooms, age, location_score
X = [
    [2000, 3, 2, 10, 7],
    [1500, 2, 1, 30, 5],
    [2500, 4, 3, 5, 8],
    [1800, 3, 2, 20, 6],
    [3000, 4, 3, 2, 9],
    [1200, 2, 1, 40, 4],
    [2200, 3, 2, 15, 7],
    [2800, 4, 3, 8, 8],
]
y = [350000, 280000, 420000, 310000, 480000, 240000, 370000, 450000]

feature_names = ['sqft', 'bedrooms', 'bathrooms', 'age', 'location_score']
model = MultipleRegression()
model.fit(X, y)

print("Multiple Regression Results:")
print(f"  R-squared: {model.r_squared():.4f}")
print("\nCoefficients (standardized impact):")
for name, coef in model.coefficients(feature_names).items():
    print(f"    {name}: {coef:.2f}")

Statistical Power and Sample Size

Understanding Statistical Power

Statistical power is the probability of detecting a true effect: a statistically significant result when the alternative hypothesis is actually true. A power of 0.80 means an 80% chance of finding a real effect. Power analysis helps determine how much data we need.

Low power is a pervasive problem in industry. Many A/B tests are run with samples too small to detect realistic effects, so real improvements go unnoticed (false negatives). Conversely, with huge samples, even trivial effects become “significant.” Power analysis should precede any experiment.

Effect size measures the magnitude of the phenomenon you’re studying. Small effects require larger samples to detect. Common effect sizes include Cohen’s d (mean difference in standard deviation units) for t-tests and correlation coefficients for association tests.

class PowerAnalysis:
    """Statistical power and sample size calculations."""
    
    @staticmethod
    def sample_size_t_test(effect_size: float, alpha: float = 0.05, 
                          power: float = 0.80) -> int:
        """
        Calculate sample size for two-sample t-test.
        
        Args:
            effect_size: Cohen's d (difference / common std)
            alpha: Significance level
            power: Desired power (1 - beta)
            
        Returns:
            Required sample size per group
        """
        # Approximation using normal distribution
        z_alpha = stats.norm.ppf(1 - alpha/2)
        z_beta = stats.norm.ppf(power)
        
        n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
        return int(np.ceil(n))
    
    @staticmethod
    def power_t_test(n: int, effect_size: float, alpha: float = 0.05) -> float:
        """
        Calculate power for given sample size.
        """
        z_alpha = stats.norm.ppf(1 - alpha/2)
        z_beta = effect_size * np.sqrt(n/2)
        power = stats.norm.cdf(z_beta - z_alpha)
        return power

# Example: Planning an A/B test
# Want to detect at least 2% improvement in conversion rate
# Baseline: 10%, Expected: 12%
baseline = 0.10
expected = 0.12
# Cohen's h for proportions: 2*arcsin(sqrt(p2)) - 2*arcsin(sqrt(p1))
effect_size = 2 * np.arcsin(np.sqrt(expected)) - 2 * np.arcsin(np.sqrt(baseline))

print("A/B Test Power Analysis:")
print(f"  Baseline conversion: {baseline:.1%}")
print(f"  Expected conversion: {expected:.1%}")
print(f"  Effect size (Cohen's h): {effect_size:.4f}")

# Calculate required sample size for different power levels
for target_power in [0.70, 0.80, 0.90]:
    n = PowerAnalysis.sample_size_t_test(effect_size, power=target_power)
    print(f"  Sample size for {target_power:.0%} power: {n:,} per group")

# Power with fixed sample size
current_n = 5000
actual_power = PowerAnalysis.power_t_test(current_n, effect_size)
print(f"\nWith {current_n:,} per group:")
print(f"  Actual power: {actual_power:.1%}")
print(f"  Probability of detecting effect: {actual_power:.1%}")

Common Pitfalls and Best Practices

Statistical Mistakes to Avoid

P-hacking involves analyzing data multiple ways until finding significance. If you test 20 hypotheses at α=0.05, you'd expect one false positive by chance. Pre-registering hypotheses and correcting for multiple comparisons protects against this.
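The multiple-comparisons point can be sketched with a simple Bonferroni correction (the p-values below are hypothetical):

```python
def bonferroni_correct(p_values: list, alpha: float = 0.05) -> list:
    """Bonferroni correction: compare each p-value against alpha / m,
    where m is the number of tests. Controls the family-wise error rate."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# 20 hypothetical p-values; only the smallest survives correction
p_values = [0.001, 0.02, 0.04, 0.049] + [0.2] * 16
naive = [p < 0.05 for p in p_values]
corrected = bonferroni_correct(p_values)
print(f"Significant before correction: {sum(naive)}")      # 4
print(f"Significant after correction:  {sum(corrected)}")  # 1
```

Bonferroni is conservative; the Holm step-down procedure controls the same error rate with more power.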

Significance ≠ Importance: A statistically significant effect might be trivial in practice. With large samples, tiny differences become significant. Always report effect sizes and confidence intervals alongside p-values.
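A pooled-standard-deviation Cohen's d makes the effect-size point concrete; the groups below are simulated to show a difference that is real but tiny:

```python
import numpy as np

def cohens_d(group1: list, group2: list) -> float:
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1 = np.var(group1, ddof=1)
    var2 = np.var(group2, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

# With huge samples even a half-point shift is "significant", yet d is tiny
rng = np.random.default_rng(1)
a = rng.normal(100.0, 15, 100000)
b = rng.normal(100.5, 15, 100000)
d = cohens_d(b, a)
print(f"Cohen's d: {d:.3f}")  # a small effect despite near-certain significance
```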

Correlation ≠ Causation: Two variables might move together without one causing the other. A confounding variable might drive both, or the relationship might be reverse causation. Experimental designs (randomization) help establish causation.

Base Rate Neglect: Ignoring the prior probability leads to poor decisions. Even with a “95% accurate” test, if the condition is rare, most positive results will be false positives. Use Bayes’ theorem to update probabilities properly.
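The 95%-accurate-test example works out as follows with Bayes' theorem (treating "accuracy" as 95% sensitivity and 95% specificity, and assuming a 1% base rate for illustration):

```python
def posterior_positive(prevalence: float, sensitivity: float,
                       specificity: float) -> float:
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# "95% accurate" test, 1% prevalence: most positives are still false
p = posterior_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(f"P(condition | positive) = {p:.1%}")  # only about 16%
```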

Best Practices

Report transparently: Include sample sizes, summary statistics, effect sizes, confidence intervals, and p-values. This allows readers to assess significance and practical importance.

Visualize results: Plots communicate findings more effectively than tables. Show data distributions, effect sizes, and uncertainty.

Replicate findings: Single studies rarely provide definitive evidence. Seek replication across conditions, time periods, or populations.

Consider practical significance: Statistical significance doesn’t guarantee practical importance. Ask: “So what?” and “Does this matter for decisions?”

Conclusion

Statistics provides the toolkit for making rigorous inferences from data. This guide covered descriptive statistics, probability distributions, hypothesis testing, regression analysis, and power analysis. These fundamentals empower you to design experiments, analyze results, and draw valid conclusions.

Key takeaways: start with exploratory analysis, choose tests appropriate to your data and questions, always report effect sizes and uncertainty, and remember that statistical significance is just one piece of evidence. Combine statistical thinking with domain expertise and practical considerations for robust decisions.

As you apply these methods, remember that statistics is both science and art. The science provides rigorous methods; the art involves choosing the right methods, communicating results clearly, and interpreting findings wisely. Practice on real problems, learn from mistakes, and continuously refine your statistical intuition.
