Statistical Analysis with SciPy: A Practical Guide
Statistical analysis is at the heart of data science. Whether you’re testing hypotheses, comparing groups, or understanding distributions, having the right tools makes all the difference. SciPy’s scipy.stats module provides a comprehensive collection of statistical functions that transform raw data into actionable insights.
In this guide, we’ll explore how to leverage SciPy for practical statistical analysis. We’ll move beyond theory to show you exactly how to implement statistical tests, work with probability distributions, and interpret results in real-world scenarios.
Why SciPy for Statistical Analysis?
SciPy is built on top of NumPy and provides:
- Comprehensive Statistical Functions: Over 80 probability distributions and statistical tests
- Efficient Computation: Optimized C implementations for performance
- Integration with Python Ecosystem: Works seamlessly with Pandas, NumPy, and Matplotlib
- Production-Ready: Trusted by researchers, data scientists, and engineers worldwide
- Well-Documented: Extensive documentation with clear examples
Let’s dive into practical statistical analysis with SciPy.
Getting Started with SciPy Statistics
Installation and Setup
# Install SciPy (if not already installed)
# pip install scipy numpy pandas matplotlib
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Set random seed for reproducibility
np.random.seed(42)
Basic Statistical Measures
Before running complex tests, let’s understand how to calculate basic descriptive statistics:
# Generate sample data
data = np.random.normal(loc=100, scale=15, size=1000)
# Calculate descriptive statistics
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data, ddof=1)   # sample standard deviation (ddof=1 matches stats.describe)
variance = np.var(data, ddof=1)  # sample variance
# Using SciPy for more detailed statistics
describe_result = stats.describe(data)
print(f"Mean: {mean:.2f}")
print(f"Median: {median:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Variance: {variance:.2f}")
print(f"\nSciPy describe():")
print(f"Count: {describe_result.nobs}")
print(f"Min: {describe_result.minmax[0]:.2f}")
print(f"Max: {describe_result.minmax[1]:.2f}")
print(f"Mean: {describe_result.mean:.2f}")
print(f"Variance: {describe_result.variance:.2f}")
print(f"Skewness: {describe_result.skewness:.2f}")
print(f"Kurtosis: {describe_result.kurtosis:.2f}")
The describe() function provides a comprehensive summary including skewness (asymmetry) and kurtosis (tail heaviness), which are important for understanding data distribution shape.
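If you only need the shape statistics, `scipy.stats` also exposes them as standalone functions. A quick sketch, reusing the same simulated data:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
data = np.random.normal(loc=100, scale=15, size=1000)

# stats.skew and stats.kurtosis compute the same shape measures as describe().
# kurtosis() uses Fisher's definition by default, so a normal distribution -> 0.
skewness = stats.skew(data)
excess_kurtosis = stats.kurtosis(data)

print(f"Skewness: {skewness:.3f}")                # near 0 for symmetric data
print(f"Excess kurtosis: {excess_kurtosis:.3f}")  # near 0 for normal-like tails
```

Both values should hover near zero here, since the data was drawn from a normal distribution.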
Probability Distributions
Understanding Distributions
Probability distributions are fundamental to statistics. SciPy provides access to over 80 distributions, each with methods for probability density function (PDF), cumulative distribution function (CDF), and random sampling.
Working with the Normal Distribution
# Create a normal distribution object
normal_dist = stats.norm(loc=100, scale=15)
# Generate random samples
samples = normal_dist.rvs(size=1000)
# Calculate probability density at specific points
x_values = np.linspace(70, 130, 100)
pdf_values = normal_dist.pdf(x_values)
# Calculate cumulative probability (P(X <= x))
prob_less_than_110 = normal_dist.cdf(110)
print(f"P(X <= 110) = {prob_less_than_110:.4f}")
# Calculate quantiles (inverse CDF)
quantile_95 = normal_dist.ppf(0.95)
print(f"95th percentile: {quantile_95:.2f}")
# Visualize the distribution
plt.figure(figsize=(10, 6))
plt.plot(x_values, pdf_values, 'b-', linewidth=2, label='PDF')
plt.fill_between(x_values, pdf_values, alpha=0.3)
plt.axvline(110, color='r', linestyle='--', label='X=110')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Normal Distribution (μ=100, σ=15)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Other Common Distributions
# Exponential distribution (waiting times, lifetimes)
exp_dist = stats.expon(scale=2)
exp_samples = exp_dist.rvs(size=1000)
# Binomial distribution (number of successes in n trials)
binom_dist = stats.binom(n=10, p=0.5)
binom_samples = binom_dist.rvs(size=1000)
# Poisson distribution (count of events in fixed interval)
poisson_dist = stats.poisson(mu=3)
poisson_samples = poisson_dist.rvs(size=1000)
# Uniform distribution (equal probability across range)
uniform_dist = stats.uniform(loc=0, scale=10)
uniform_samples = uniform_dist.rvs(size=1000)
# Visualize multiple distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Exponential
axes[0, 0].hist(exp_samples, bins=30, density=True, alpha=0.7)
axes[0, 0].set_title('Exponential Distribution')
# Binomial
axes[0, 1].hist(binom_samples, bins=11, density=True, alpha=0.7)
axes[0, 1].set_title('Binomial Distribution')
# Poisson
axes[1, 0].hist(poisson_samples, bins=15, density=True, alpha=0.7)
axes[1, 0].set_title('Poisson Distribution')
# Uniform
axes[1, 1].hist(uniform_samples, bins=30, density=True, alpha=0.7)
axes[1, 1].set_title('Uniform Distribution')
plt.tight_layout()
plt.show()
Hypothesis Testing
Understanding Hypothesis Testing
Hypothesis testing is a systematic method for making decisions about populations based on sample data. The process involves:
- Null Hypothesis (H₀): The default assumption (usually “no effect”)
- Alternative Hypothesis (H₁): What we’re trying to prove
- Test Statistic: A calculated value from the sample
- P-value: Probability of observing the data if H₀ is true
- Significance Level (α): Threshold for rejecting H₀ (typically 0.05)
One-Sample T-Test
Test whether a sample mean differs significantly from a known population mean.
# Scenario: A manufacturer claims their product lasts 1000 hours
# We test 30 units and want to verify the claim
sample_data = np.array([
985, 1010, 995, 1020, 1005, 990, 1015, 1000, 1008, 1002,
998, 1012, 1003, 997, 1018, 1001, 1009, 994, 1011, 1004,
992, 1016, 1006, 999, 1013, 1002, 1007, 996, 1014, 1000
])
# Perform one-sample t-test
# H₀: μ = 1000 (the claimed mean)
# H₁: μ ≠ 1000 (the mean is different)
t_statistic, p_value = stats.ttest_1samp(sample_data, popmean=1000)
print("One-Sample T-Test Results:")
print(f"Sample mean: {np.mean(sample_data):.2f}")
print(f"Sample std dev: {np.std(sample_data, ddof=1):.2f}")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significance level: 0.05")
# Interpretation
if p_value < 0.05:
    print("Result: REJECT the null hypothesis")
    print("Conclusion: The sample mean differs significantly from 1000 hours")
else:
    print("Result: FAIL TO REJECT the null hypothesis")
    print("Conclusion: No significant evidence that the mean differs from 1000 hours")
Two-Sample T-Test
Compare means between two independent groups.
# Scenario: Compare test scores between two teaching methods
group_a = np.array([78, 82, 85, 88, 90, 76, 84, 87, 89, 91,
79, 83, 86, 88, 92, 77, 85, 87, 90, 93])
group_b = np.array([72, 75, 78, 80, 82, 70, 76, 79, 81, 84,
73, 77, 79, 81, 83, 71, 77, 80, 82, 85])
# Perform two-sample t-test
# H₀: μ_A = μ_B (means are equal)
# H₁: μ_A ≠ μ_B (means are different)
t_statistic, p_value = stats.ttest_ind(group_a, group_b)
print("Two-Sample T-Test Results:")
print(f"Group A mean: {np.mean(group_a):.2f}")
print(f"Group B mean: {np.mean(group_b):.2f}")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: The two groups have significantly different means")
else:
    print("Result: No significant difference between group means")
# Visualize the comparison
plt.figure(figsize=(10, 6))
plt.boxplot([group_a, group_b], labels=['Method A', 'Method B'])
plt.ylabel('Test Score')
plt.title('Comparison of Teaching Methods')
plt.grid(True, alpha=0.3)
plt.show()
Paired T-Test
Compare measurements from the same subjects before and after an intervention.
# Scenario: Blood pressure before and after medication
before = np.array([140, 145, 138, 142, 148, 141, 139, 144, 146, 143,
140, 142, 145, 141, 147, 139, 143, 145, 142, 140])
after = np.array([135, 138, 132, 136, 141, 135, 133, 138, 140, 137,
134, 136, 139, 135, 141, 133, 137, 139, 136, 134])
# Perform paired t-test
# H₀: μ_difference = 0 (no change)
# H₁: μ_difference ≠ 0 (there is a change)
t_statistic, p_value = stats.ttest_rel(before, after)
differences = before - after
print("Paired T-Test Results:")
print(f"Mean before: {np.mean(before):.2f}")
print(f"Mean after: {np.mean(after):.2f}")
print(f"Mean difference: {np.mean(differences):.2f}")
print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: The medication significantly changed blood pressure")
else:
    print("Result: No significant change in blood pressure")
Analysis of Variance (ANOVA)
One-Way ANOVA
Compare means across three or more groups.
# Scenario: Compare sales performance across three regions
region_north = np.array([45, 52, 48, 55, 50, 49, 53, 51, 54, 47])
region_south = np.array([38, 42, 40, 45, 41, 39, 43, 41, 44, 40])
region_west = np.array([50, 58, 52, 60, 55, 51, 59, 53, 61, 54])
# Perform one-way ANOVA
# H₀: μ_north = μ_south = μ_west (all means are equal)
# H₁: At least one mean is different
f_statistic, p_value = stats.f_oneway(region_north, region_south, region_west)
print("One-Way ANOVA Results:")
print(f"North mean: {np.mean(region_north):.2f}")
print(f"South mean: {np.mean(region_south):.2f}")
print(f"West mean: {np.mean(region_west):.2f}")
print(f"F-statistic: {f_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: Significant differences exist between regions")
    print("Next step: Perform post-hoc tests to identify which regions differ")
else:
    print("Result: No significant differences between regions")
# Visualize the comparison
plt.figure(figsize=(10, 6))
plt.boxplot([region_north, region_south, region_west],
labels=['North', 'South', 'West'])
plt.ylabel('Sales')
plt.title('Sales Performance by Region')
plt.grid(True, alpha=0.3)
plt.show()
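For the post-hoc step mentioned above, recent SciPy versions (1.8+) ship Tukey's HSD test, which compares every pair of groups while controlling the family-wise error rate. A sketch with the same region data:

```python
import numpy as np
from scipy import stats

region_north = np.array([45, 52, 48, 55, 50, 49, 53, 51, 54, 47])
region_south = np.array([38, 42, 40, 45, 41, 39, 43, 41, 44, 40])
region_west = np.array([50, 58, 52, 60, 55, 51, 59, 53, 61, 54])

# tukey_hsd requires SciPy >= 1.8; it returns pairwise statistics and p-values
result = stats.tukey_hsd(region_north, region_south, region_west)
print(result)  # prints a table of pairwise mean differences and p-values

# result.pvalue is a symmetric matrix: entry [i, j] compares group i vs group j
print(f"North vs South p-value: {result.pvalue[0, 1]:.4f}")
print(f"North vs West  p-value: {result.pvalue[0, 2]:.4f}")
print(f"South vs West  p-value: {result.pvalue[1, 2]:.4f}")
```

Unlike running repeated t-tests, Tukey's HSD adjusts for the number of comparisons automatically.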
Correlation and Regression
Pearson Correlation
Measure linear relationship between two continuous variables.
# Scenario: Relationship between advertising spend and sales
advertising = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550])
sales = np.array([500, 650, 750, 900, 1100, 1250, 1400, 1600, 1800, 2000])
# Calculate Pearson correlation
correlation, p_value = stats.pearsonr(advertising, sales)
print("Pearson Correlation Results:")
print(f"Correlation coefficient: {correlation:.4f}")
print(f"P-value: {p_value:.6f}")
# Interpretation: significance means r is unlikely to be zero;
# judge the strength of the relationship from the magnitude of r itself
if p_value < 0.05:
    if correlation > 0:
        print("Result: Statistically significant positive correlation")
    else:
        print("Result: Statistically significant negative correlation")
else:
    print("Result: No significant correlation")
# Visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(advertising, sales, s=100, alpha=0.6)
# Add regression line
z = np.polyfit(advertising, sales, 1)
p = np.poly1d(z)
plt.plot(advertising, p(advertising), "r--", linewidth=2, label='Regression line')
plt.xlabel('Advertising Spend ($)')
plt.ylabel('Sales ($)')
plt.title(f'Advertising vs Sales (r={correlation:.3f}, p={p_value:.4f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
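Beyond the correlation coefficient, `stats.linregress` fits the same least-squares line shown in the plot and returns the slope, intercept, r, p-value, and standard errors in a single call:

```python
import numpy as np
from scipy import stats

advertising = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550])
sales = np.array([500, 650, 750, 900, 1100, 1250, 1400, 1600, 1800, 2000])

# Ordinary least-squares fit: sales ~ slope * advertising + intercept
result = stats.linregress(advertising, sales)

print(f"Slope: {result.slope:.3f} (additional sales per advertising dollar)")
print(f"Intercept: {result.intercept:.2f}")
print(f"R-squared: {result.rvalue**2:.4f}")
print(f"P-value: {result.pvalue:.6f}")
print(f"Slope std error: {result.stderr:.4f}")
```

The squared `rvalue` is the familiar R², the share of variance in sales explained by advertising spend.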
Spearman Rank Correlation
Measure monotonic relationship (useful for non-linear or ranked data).
# Scenario: Relationship between employee experience and performance rating
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
performance = np.array([3, 4, 5, 6, 7, 8, 7, 9, 8, 10])
# Calculate Spearman correlation
correlation, p_value = stats.spearmanr(experience, performance)
print("Spearman Rank Correlation Results:")
print(f"Correlation coefficient: {correlation:.4f}")
print(f"P-value: {p_value:.6f}")
# Compare with Pearson
pearson_corr, pearson_p = stats.pearsonr(experience, performance)
print(f"\nComparison:")
print(f"Pearson correlation: {pearson_corr:.4f}")
print(f"Spearman correlation: {correlation:.4f}")
Chi-Square Test
Chi-Square Goodness of Fit
Test whether observed frequencies match expected frequencies.
# Scenario: A die is rolled 600 times
# Expected: each face appears 100 times
# Observed: actual counts from experiment
observed = np.array([95, 108, 102, 97, 105, 93]) # Counts for faces 1-6
expected = np.array([100, 100, 100, 100, 100, 100]) # Expected counts
# Perform chi-square goodness of fit test
chi2_statistic, p_value = stats.chisquare(observed, expected)
print("Chi-Square Goodness of Fit Test:")
print(f"Chi-square statistic: {chi2_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: The die appears to be biased")
else:
    print("Result: The die appears to be fair")
# Visualize
plt.figure(figsize=(10, 6))
x = np.arange(1, 7)
width = 0.35
plt.bar(x - width/2, observed, width, label='Observed', alpha=0.8)
plt.bar(x + width/2, expected, width, label='Expected', alpha=0.8)
plt.xlabel('Die Face')
plt.ylabel('Frequency')
plt.title('Chi-Square Goodness of Fit Test')
plt.xticks(x)
plt.legend()
plt.grid(True, alpha=0.3, axis='y')
plt.show()
Chi-Square Test of Independence
Test whether two categorical variables are independent.
# Scenario: Relationship between gender and product preference
# Create contingency table
contingency_table = np.array([
[50, 30, 20], # Male: Product A, B, C
[35, 45, 40] # Female: Product A, B, C
])
# Perform chi-square test of independence
chi2_statistic, p_value, dof, expected_freq = stats.chi2_contingency(contingency_table)
print("Chi-Square Test of Independence:")
print(f"Chi-square statistic: {chi2_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
if p_value < 0.05:
    print("Result: Gender and product preference are NOT independent")
    print("Conclusion: There is a significant relationship between gender and preference")
else:
    print("Result: Gender and product preference are independent")
print("\nExpected frequencies:")
print(expected_freq)
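As with t-tests, it helps to pair the chi-square statistic with an effect size. Cramér's V, a common choice for contingency tables, rescales chi-square to the 0-1 range. A sketch (the bias-corrected variant is sometimes preferred for small tables):

```python
import numpy as np
from scipy import stats

contingency_table = np.array([
    [50, 30, 20],  # Male: Product A, B, C
    [35, 45, 40],  # Female: Product A, B, C
])

chi2_statistic, p_value, dof, expected_freq = stats.chi2_contingency(contingency_table)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), bounded in [0, 1]
n = contingency_table.sum()
min_dim = min(contingency_table.shape) - 1
cramers_v = np.sqrt(chi2_statistic / (n * min_dim))

print(f"Chi-square: {chi2_statistic:.4f}, p-value: {p_value:.4f}")
print(f"Cramér's V: {cramers_v:.3f}")  # rough guide: 0.1 small, 0.3 medium, 0.5 large
```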
Practical Workflow: Complete Analysis Example
Let’s combine multiple statistical techniques in a realistic scenario.
# Scenario: Analyze customer satisfaction data across product categories
# Generate sample data
np.random.seed(42)
categories = ['Electronics', 'Clothing', 'Home', 'Sports']
satisfaction_scores = {
'Electronics': np.random.normal(7.2, 1.5, 100),
'Clothing': np.random.normal(7.8, 1.2, 100),
'Home': np.random.normal(7.5, 1.4, 100),
'Sports': np.random.normal(8.1, 1.1, 100)
}
# 1. Descriptive Statistics
print("=" * 60)
print("DESCRIPTIVE STATISTICS")
print("=" * 60)
for category, scores in satisfaction_scores.items():
    print(f"\n{category}:")
    print(f"  Mean: {np.mean(scores):.2f}")
    print(f"  Median: {np.median(scores):.2f}")
    print(f"  Std Dev: {np.std(scores):.2f}")
    print(f"  Min: {np.min(scores):.2f}, Max: {np.max(scores):.2f}")
# 2. Test for Normality (Shapiro-Wilk test)
print("\n" + "=" * 60)
print("NORMALITY TEST (Shapiro-Wilk)")
print("=" * 60)
for category, scores in satisfaction_scores.items():
    stat, p_value = stats.shapiro(scores)
    is_normal = "Yes" if p_value > 0.05 else "No"
    print(f"{category}: p-value = {p_value:.4f}, Normal: {is_normal}")
# 3. ANOVA to compare means across categories
print("\n" + "=" * 60)
print("ONE-WAY ANOVA")
print("=" * 60)
f_stat, p_value = stats.f_oneway(
satisfaction_scores['Electronics'],
satisfaction_scores['Clothing'],
satisfaction_scores['Home'],
satisfaction_scores['Sports']
)
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: Significant differences exist between categories")
else:
    print("Result: No significant differences between categories")
# 4. Pairwise comparisons (t-tests with Bonferroni correction)
print("\n" + "=" * 60)
print("PAIRWISE COMPARISONS (with Bonferroni correction)")
print("=" * 60)
categories_list = list(satisfaction_scores.keys())
n_comparisons = len(categories_list) * (len(categories_list) - 1) / 2
bonferroni_alpha = 0.05 / n_comparisons
for i in range(len(categories_list)):
    for j in range(i + 1, len(categories_list)):
        cat1, cat2 = categories_list[i], categories_list[j]
        t_stat, p_val = stats.ttest_ind(
            satisfaction_scores[cat1],
            satisfaction_scores[cat2]
        )
        significant = "Yes" if p_val < bonferroni_alpha else "No"
        print(f"{cat1} vs {cat2}: p-value = {p_val:.4f}, Significant: {significant}")
# 5. Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Box plot
ax = axes[0, 0]
data_list = [satisfaction_scores[cat] for cat in categories]
ax.boxplot(data_list, labels=categories)
ax.set_ylabel('Satisfaction Score')
ax.set_title('Distribution by Category')
ax.grid(True, alpha=0.3, axis='y')
# Violin plot
ax = axes[0, 1]
parts = ax.violinplot(data_list, positions=range(len(categories)), showmeans=True)
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(categories)
ax.set_ylabel('Satisfaction Score')
ax.set_title('Violin Plot by Category')
ax.grid(True, alpha=0.3, axis='y')
# Histogram with overlays
ax = axes[1, 0]
for category, scores in satisfaction_scores.items():
    ax.hist(scores, alpha=0.5, label=category, bins=20)
ax.set_xlabel('Satisfaction Score')
ax.set_ylabel('Frequency')
ax.set_title('Distribution Comparison')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# Mean with confidence intervals
ax = axes[1, 1]
means = [np.mean(satisfaction_scores[cat]) for cat in categories]
ci_95 = [1.96 * np.std(satisfaction_scores[cat], ddof=1) / np.sqrt(len(satisfaction_scores[cat]))
         for cat in categories]
ax.bar(categories, means, yerr=ci_95, capsize=10, alpha=0.7)
ax.set_ylabel('Mean Satisfaction Score')
ax.set_title('Mean with 95% Confidence Intervals')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Interpreting P-Values and Statistical Significance
Understanding P-Values
The p-value is the probability of observing your data (or more extreme) if the null hypothesis is true. It’s NOT the probability that the null hypothesis is true.
- p-value < 0.05: Typically considered statistically significant (reject H₀)
- p-value < 0.01: Highly statistically significant
- p-value > 0.05: Not statistically significant (fail to reject H₀)
Common Misconceptions
# Example: What a p-value of 0.03 means
print("If p-value = 0.03:")
print("✓ Correct: There's a 3% chance of observing data this extreme if H₀ is true")
print("✗ Wrong: There's a 97% chance the alternative hypothesis is true")
print("✗ Wrong: The result is definitely real and important")
print("\nStatistical significance ≠ Practical significance")
print("A large sample can produce significant p-values for trivial effects")
Effect Size
Always report effect size alongside p-values:
# Example: Cohen's d for effect size
group1 = np.array([10, 12, 14, 16, 18])
group2 = np.array([15, 17, 19, 21, 23])
# Calculate Cohen's d (use the sample standard deviation, ddof=1, when pooling)
mean_diff = np.mean(group2) - np.mean(group1)
pooled_std = np.sqrt((np.std(group1, ddof=1)**2 + np.std(group2, ddof=1)**2) / 2)
cohens_d = mean_diff / pooled_std
print(f"Cohen's d: {cohens_d:.3f}")
print("Interpretation:")
print("  |d| < 0.2: Small effect")
print("  0.2 ≤ |d| < 0.5: Small to medium effect")
print("  0.5 ≤ |d| < 0.8: Medium to large effect")
print("  |d| ≥ 0.8: Large effect")
Best Practices for Statistical Analysis
1. Check Assumptions
Different tests have different assumptions. Always verify them:
# Check normality before using parametric tests
data = np.random.normal(100, 15, 100)
stat, p_value = stats.shapiro(data)
print(f"Shapiro-Wilk test p-value: {p_value:.4f}")
print("Data is normally distributed" if p_value > 0.05 else "Data is not normally distributed")
# Check homogeneity of variance (Levene's test)
group1 = np.random.normal(100, 10, 50)
group2 = np.random.normal(100, 20, 50)
stat, p_value = stats.levene(group1, group2)
print(f"Levene's test p-value: {p_value:.4f}")
print("Variances are equal" if p_value > 0.05 else "Variances are not equal")
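When Levene's test suggests unequal variances, the standard remedy is Welch's t-test, which `ttest_ind` enables via `equal_var=False`. A minimal sketch on the same two groups:

```python
import numpy as np
from scipy import stats

np.random.seed(42)
group1 = np.random.normal(100, 10, 50)  # smaller spread
group2 = np.random.normal(100, 20, 50)  # larger spread

# equal_var=False runs Welch's t-test, which does not pool the two variances
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.4f}, p-value = {p_value:.4f}")
```

Welch's version is often recommended as the default for two-sample comparisons, since it loses little power even when the variances happen to be equal.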
2. Use Appropriate Tests
- Parametric tests (t-test, ANOVA): Assume normality, more powerful
- Non-parametric tests (Mann-Whitney U, Kruskal-Wallis): No normality assumption, more robust
# Non-parametric alternative to t-test
group1 = np.array([1, 2, 3, 4, 5])
group2 = np.array([3, 4, 5, 6, 7])
# Mann-Whitney U test (non-parametric)
stat, p_value = stats.mannwhitneyu(group1, group2)
print(f"Mann-Whitney U test p-value: {p_value:.4f}")
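Similarly, the Kruskal-Wallis test mentioned above is the non-parametric counterpart of one-way ANOVA, comparing three or more groups via ranks:

```python
import numpy as np
from scipy import stats

# Kruskal-Wallis: rank-based alternative to one-way ANOVA (3+ groups)
group1 = np.array([45, 52, 48, 55, 50])
group2 = np.array([38, 42, 40, 45, 41])
group3 = np.array([50, 58, 52, 60, 55])

h_stat, p_value = stats.kruskal(group1, group2, group3)
print(f"Kruskal-Wallis H = {h_stat:.4f}, p-value = {p_value:.4f}")
```

Because it only uses ranks, the test is robust to outliers and skewed distributions.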
3. Control for Multiple Comparisons
When performing multiple tests, adjust significance level:
# Bonferroni correction
n_tests = 10
alpha = 0.05
bonferroni_alpha = alpha / n_tests
print(f"Adjusted significance level: {bonferroni_alpha:.4f}")
4. Report Complete Results
Always include:
- Test statistic
- P-value
- Effect size
- Confidence intervals
- Sample size
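A t-based confidence interval for a mean, for example, can come straight from SciPy's distribution objects. A sketch using `stats.sem` and `stats.t.interval` on a small sample:

```python
import numpy as np
from scipy import stats

sample = np.array([985, 1010, 995, 1020, 1005, 990, 1015, 1000, 1008, 1002])

mean = np.mean(sample)
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# 95% CI for the mean, using the t distribution with n-1 degrees of freedom
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"Mean: {mean:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

Using the t distribution (rather than the 1.96 normal approximation) matters most for small samples like this one.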
Conclusion
SciPy’s statistical capabilities empower you to move beyond descriptive statistics to rigorous hypothesis testing and inference. Key takeaways:
- Start with Exploration: Use descriptive statistics and visualizations to understand your data
- Check Assumptions: Verify that your data meets test requirements
- Choose Appropriate Tests: Select parametric or non-parametric tests based on your data
- Interpret Carefully: P-values indicate statistical significance, not practical importance
- Report Completely: Include test statistics, p-values, effect sizes, and confidence intervals
- Combine with Domain Knowledge: Statistical results should inform, not replace, expert judgment
Statistical analysis is both an art and a science. With SciPy as your tool and these principles as your guide, you’ll be able to extract meaningful insights from data and make confident, data-driven decisions.
Key SciPy Functions Reference
| Function | Purpose |
|---|---|
| `stats.describe()` | Comprehensive descriptive statistics |
| `stats.ttest_1samp()` | One-sample t-test |
| `stats.ttest_ind()` | Two-sample t-test |
| `stats.ttest_rel()` | Paired t-test |
| `stats.f_oneway()` | One-way ANOVA |
| `stats.pearsonr()` | Pearson correlation |
| `stats.spearmanr()` | Spearman rank correlation |
| `stats.chisquare()` | Chi-square goodness of fit |
| `stats.chi2_contingency()` | Chi-square test of independence |
| `stats.shapiro()` | Shapiro-Wilk normality test |
| `stats.levene()` | Levene’s homogeneity of variance test |
| `stats.mannwhitneyu()` | Mann-Whitney U test (non-parametric) |
Start exploring your data with SciPy today, and unlock the power of statistical analysis in Python!