Data analysis combines statistics, programming, and domain knowledge to extract insights from data. For accountants and analysts moving into a data-driven role, Python offers the most accessible path: a single language that handles everything from Excel automation to machine learning. See Python Guide for more context.
Core Statistics Concepts with Python
Statistics is the foundation of data analysis. These concepts appear in every analysis you will perform.
Descriptive Statistics
Descriptive statistics summarize data distributions — central tendency, spread, and shape.
import numpy as np
import pandas as pd
from scipy import stats
data = np.array([23, 45, 67, 45, 23, 89, 45, 12, 34, 56])
mean = np.mean(data)  # 43.9
median = np.median(data)  # 45.0
std_dev = np.std(data, ddof=1)  # ≈ 22.9 (sample std)
variance = np.var(data, ddof=1)  # ≈ 523.0
skewness = stats.skew(data)  # ≈ 0.53
kurtosis = stats.kurtosis(data)  # ≈ -0.35 (excess kurtosis)
print(f"Mean: {mean:.2f}, Median: {median:.1f}, Std: {std_dev:.2f}")
Use ddof=1 for the sample standard deviation (n − 1 denominator). NumPy's default, ddof=0, gives the population formula.
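The difference is easy to check directly. A quick sketch, reusing the array above:

```python
import numpy as np

data = np.array([23, 45, 67, 45, 23, 89, 45, 12, 34, 56])

# Population std (ddof=0, NumPy's default) divides by n;
# sample std (ddof=1) divides by n - 1 and is slightly larger.
pop_std = np.std(data)             # divides by 10
sample_std = np.std(data, ddof=1)  # divides by 9

print(f"Population std: {pop_std:.2f}")
print(f"Sample std:     {sample_std:.2f}")
```

The gap shrinks as n grows, which is why the distinction matters most for small samples.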
Probability Distributions
Understanding distributions lets you model uncertainty and make probabilistic predictions.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
# Normal distribution
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x, loc=0, scale=1)
# Binomial distribution
n, p = 10, 0.5
binom_pmf = stats.binom.pmf(np.arange(0, 11), n, p)
# Poisson distribution
lam = 3
poisson_pmf = stats.poisson.pmf(np.arange(0, 11), lam)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].plot(x, y); axes[0].set_title("Normal Distribution")
axes[1].bar(range(11), binom_pmf); axes[1].set_title("Binomial (n=10, p=0.5)")
axes[2].bar(range(11), poisson_pmf); axes[2].set_title("Poisson (λ=3)")
plt.tight_layout()
plt.show()
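Beyond plotting densities, the same scipy.stats distribution objects answer concrete probability questions through cdf (cumulative probability), ppf (inverse CDF), and pmf. A brief sketch:

```python
from scipy import stats

# P(Z <= 1.96) for a standard normal: about 0.975
p_below = stats.norm.cdf(1.96, loc=0, scale=1)

# The 95th percentile of the same distribution (inverse CDF)
q95 = stats.norm.ppf(0.95)

# P(exactly 5 successes) for Binomial(n=10, p=0.5)
p_five = stats.binom.pmf(5, 10, 0.5)

print(f"P(Z <= 1.96)    = {p_below:.4f}")
print(f"95th percentile = {q95:.3f}")
print(f"P(X = 5)        = {p_five:.4f}")
```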
Hypothesis Testing
Hypothesis testing determines whether observed effects are statistically significant.
from scipy import stats
import numpy as np
# Two-sample t-test: compare two groups
group_a = np.array([23, 25, 21, 22, 26, 24, 20, 23])
group_b = np.array([30, 28, 32, 29, 31, 27, 33, 28])
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
# p < 0.05: reject the null hypothesis at the 5% significance level
# Chi-squared test: independence of categorical variables
from scipy.stats import chi2_contingency
table = np.array([[30, 10], [20, 40]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi-squared: {chi2:.3f}, p-value: {p:.4f}")
Always interpret p-values in context: p < 0.05 means the observed difference would be unlikely if the null hypothesis were true; it says nothing about how large the effect is. Sample size and practical significance matter too.
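One way to weigh practical significance is an effect size such as Cohen's d. A minimal sketch reusing the two groups above; the pooled-standard-deviation formula here assumes roughly equal group variances:

```python
import numpy as np

group_a = np.array([23, 25, 21, 22, 26, 24, 20, 23])
group_b = np.array([30, 28, 32, 29, 31, 27, 33, 28])

# Cohen's d: difference in means divided by the pooled standard deviation
n_a, n_b = len(group_a), len(group_b)
pooled_std = np.sqrt(
    ((n_a - 1) * np.var(group_a, ddof=1) + (n_b - 1) * np.var(group_b, ddof=1))
    / (n_a + n_b - 2)
)
d = (np.mean(group_b) - np.mean(group_a)) / pooled_std
print(f"Cohen's d: {d:.2f}")  # |d| > 0.8 is conventionally a large effect
```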
Correlation and Regression
import numpy as np
from scipy import stats
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores = np.array([35, 45, 50, 55, 60, 70, 75, 85])
r, p_value = stats.pearsonr(hours_studied, exam_scores)
print(f"Pearson r: {r:.3f}, p-value: {p_value:.4f}")
# r ≈ 0.994 → very strong positive correlation
slope, intercept, r_value, p_value, std_err = stats.linregress(
hours_studied, exam_scores
)
print(f"Score ≈ {slope:.1f} × hours + {intercept:.1f}")
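The fitted line can then make point predictions. A quick sketch; the 6.5-hour input is purely illustrative:

```python
import numpy as np
from scipy import stats

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores = np.array([35, 45, 50, 55, 60, 70, 75, 85])

result = stats.linregress(hours_studied, exam_scores)

# Predict a score for an input not in the training data
predicted = result.slope * 6.5 + result.intercept
print(f"Predicted score for 6.5 hours: {predicted:.1f}")
```

Treat predictions outside the observed range of hours with caution; the linear fit is only validated on 1 to 8 hours.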
Python Libraries for Data Analysis
| Library | Purpose | Key Functions |
|---|---|---|
| pandas | Tabular data manipulation | read_csv(), groupby(), merge(), pivot_table() |
| NumPy | Numerical computing | array(), mean(), linalg.inv(), random.randn() |
| matplotlib | Static plotting | plot(), scatter(), hist(), subplots() |
| seaborn | Statistical visualization | boxplot(), heatmap(), pairplot(), displot() |
| scikit-learn | Machine learning | train_test_split(), LinearRegression(), RandomForestClassifier() |
| statsmodels | Statistical modeling | OLS(), Logit(), tsa.seasonal_decompose() |
| openpyxl | Excel file I/O | load_workbook(), Workbook(), cell.value |
| xlwings | Live Excel automation | Book(), Range(), api.Quit() |
pandas Data Manipulation
import pandas as pd
import numpy as np
df = pd.read_csv("sales_data.csv")
df.info()
df.describe()
# Filtering and aggregation
region_sales = (
df.groupby("region")["revenue"]
.agg(["sum", "mean", "count"])
.reset_index()
)
# Pivot table
pivot = pd.pivot_table(
df,
values="revenue",
index="region",
columns="product_category",
aggfunc="sum",
fill_value=0,
)
# Merging datasets
customers = pd.read_csv("customers.csv")
transactions = pd.read_csv("transactions.csv")
merged = pd.merge(transactions, customers, on="customer_id", how="left")
# Window functions
df["rolling_avg"] = df.groupby("product")["revenue"].transform(
lambda x: x.rolling(window=7).mean()
)
Visualization with matplotlib and seaborn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset("penguins")
# Bar plot with error bars
sns.barplot(data=df, x="species", y="body_mass_g", errorbar=("ci", 95))
plt.title("Body Mass by Penguin Species with 95% CI")
plt.show()
# Correlation heatmap
numeric_cols = df.select_dtypes(include=["float64", "int64"])
sns.heatmap(numeric_cols.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()
# Pairplot for multivariate exploration
sns.pairplot(df, hue="species", diag_kind="kde")
plt.show()
# Time series line chart
dates = pd.date_range("2025-01-01", periods=100)
values = np.random.randn(100).cumsum() + 100
ts_df = pd.DataFrame({"date": dates, "value": values})
sns.lineplot(data=ts_df, x="date", y="value")
plt.title("Time Series with Seaborn")
plt.show()
SQL Fundamentals
SQL is the universal language for querying relational databases. Every analyst must know it.
Core Query Patterns
-- Basic filtering and sorting
SELECT
customer_id,
SUM(amount) AS total_spent,
COUNT(*) AS transaction_count
FROM transactions
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000
ORDER BY total_spent DESC
LIMIT 10;
-- Window functions
SELECT
transaction_date,
amount,
AVG(amount) OVER (PARTITION BY customer_id
ORDER BY transaction_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
AS rolling_7_day_avg
FROM transactions;
-- Joins
SELECT
c.customer_name,
t.transaction_date,
t.amount
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
WHERE t.transaction_date >= CURRENT_DATE - INTERVAL '30 days';
-- Subqueries
SELECT department, avg_salary
FROM (
SELECT
department,
AVG(salary) AS avg_salary,
RANK() OVER (ORDER BY AVG(salary) DESC) AS rnk
FROM employees
GROUP BY department
) ranked
WHERE rnk <= 3;
Common Table Expressions (CTEs)
WITH revenue_by_month AS (
SELECT
DATE_TRUNC('month', transaction_date) AS month,
SUM(amount) AS revenue
FROM transactions
WHERE status = 'completed'
GROUP BY 1
),
growth AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
(revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100
AS growth_pct
FROM revenue_by_month
)
SELECT * FROM growth WHERE growth_pct IS NOT NULL;
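These queries can also run from Python. A minimal sketch using the standard-library sqlite3 module with a throwaway in-memory table; note that PostgreSQL-specific functions used above, such as DATE_TRUNC and INTERVAL, have different SQLite equivalents (e.g. strftime):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        customer_id INTEGER,
        amount REAL
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 500), (1, 700), (2, 300), (2, 200), (3, 1500)],
)

# Same GROUP BY / HAVING pattern as above, on the sample data
rows = conn.execute("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM transactions
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
    ORDER BY total_spent DESC
""").fetchall()
print(rows)
conn.close()
```

In practice you would swap sqlite3 for a PostgreSQL or MySQL driver (or pd.read_sql) and keep the query text largely unchanged.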
Data Cleaning and Preprocessing
Real-world data is messy; cleaning and preparation commonly consume 60-80% of an analyst's time.
Handling Missing Values
import pandas as pd
import numpy as np
df = pd.DataFrame({
"age": [25, np.nan, 30, np.nan, 35, 40],
"salary": [50000, 60000, np.nan, 65000, 70000, np.nan],
"department": ["A", "B", "A", "B", np.nan, "A"],
})
# Check missing pattern
print(df.isnull().sum())
print(df.isnull().mean() * 100) # percentage per column
# Imputation strategies
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["department"] = df["department"].fillna(df["department"].mode()[0])
# Forward fill for time series
df_time = pd.DataFrame({
"date": pd.date_range("2025-01-01", periods=5),
"value": [100, np.nan, np.nan, 104, np.nan],
})
df_time["value"] = df_time["value"].ffill()
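When a row or column is mostly missing, dropping it can beat imputing. A quick sketch on the same DataFrame; the thresholds are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 30, np.nan, 35, 40],
    "salary": [50000, 60000, np.nan, 65000, 70000, np.nan],
    "department": ["A", "B", "A", "B", np.nan, "A"],
})

# Keep only rows with at least 3 non-null values (i.e. complete rows)
trimmed = df.dropna(thresh=3)

# Keep columns that are at most 40% missing (all three qualify here)
keep = df.columns[df.isnull().mean() <= 0.4]
print(trimmed.shape, list(keep))
```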
Outlier Detection
import numpy as np
import pandas as pd
data = pd.Series([12, 14, 15, 13, 16, 14, 15, 13, 145, 12, 14])
# IQR method
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = data[(data < lower_bound) | (data > upper_bound)]
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]
print(f"IQR outliers: {outliers_iqr.tolist()}")
print(f"Z-score outliers: {outliers_z.tolist()}")
Normalization and Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np
df = pd.DataFrame({
"feature_a": np.random.randn(100) * 10 + 50,
"feature_b": np.random.rand(100) * 100,
})
# Standardization (z-score): zero mean, unit variance
scaler_std = StandardScaler()
df_standardized = pd.DataFrame(
scaler_std.fit_transform(df), columns=df.columns
)
# Min-Max: scale to [0, 1]
scaler_mm = MinMaxScaler()
df_normalized = pd.DataFrame(
scaler_mm.fit_transform(df), columns=df.columns
)
print(f"Before: mean={df.mean().round(2).values}, std={df.std().round(2).values}")
print(f"After standardization: mean={df_standardized.mean().round(4).values}")
Excel Automation with Python
Accountants working with Excel can save hundreds of hours by automating repetitive tasks with Python.
openpyxl: Read and Write Excel Files
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.chart import BarChart, Reference
wb = Workbook()
ws = wb.active
ws.title = "Monthly Report"
# Headers
headers = ["Month", "Revenue", "Expenses", "Profit"]
ws.append(headers)
# Data
data = [
("Jan", 50000, 32000, 18000),
("Feb", 55000, 34000, 21000),
("Mar", 48000, 31000, 17000),
("Apr", 62000, 35000, 27000),
]
for row in data:
ws.append(row)
# Styling header row
header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
for cell in ws[1]:
cell.font = header_font
cell.fill = header_fill
cell.alignment = Alignment(horizontal="center")
# Add a bar chart
chart = BarChart()
chart.title = "Monthly Financial Summary"
chart.y_axis.title = "Amount ($)"
data_ref = Reference(ws, min_col=2, max_col=4, min_row=1, max_row=5)
cats_ref = Reference(ws, min_col=1, min_row=2, max_row=5)
chart.add_data(data_ref, titles_from_data=True)
chart.set_categories(cats_ref)
ws.add_chart(chart, "F2")
wb.save("monthly_report.xlsx")
xlwings: Control Live Excel with Python
import xlwings as xw
# Open existing workbook
app = xw.App(visible=False)
wb = app.books.open("financial_model.xlsx")
sheet = wb.sheets["Model"]
# Read values
revenue = sheet.range("B2:B13").value
costs = sheet.range("C2:C13").value
# Calculate and write results
profit = [r - c for r, c in zip(revenue, costs)]
sheet.range("D2:D13").value = profit
# Auto-fit columns
sheet.autofit()
wb.save("financial_model_updated.xlsx")
app.quit()
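For plain tabular I/O without styling or live Excel, pandas is often the shortest route; it uses openpyxl under the hood for .xlsx files. A minimal sketch (the simple_report.xlsx filename is illustrative):

```python
import pandas as pd

# Build a small report table in pandas
df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],
    "Revenue": [50000, 55000, 48000],
    "Expenses": [32000, 34000, 31000],
})
df["Profit"] = df["Revenue"] - df["Expenses"]

# Write to Excel, then read it back to verify the round trip
df.to_excel("simple_report.xlsx", sheet_name="Summary", index=False)
check = pd.read_excel("simple_report.xlsx", sheet_name="Summary")
print(check["Profit"].tolist())
```

Reach for openpyxl when you need styling and charts, and xlwings when you need to drive a live Excel session.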
Comparison: Python vs R vs Excel vs Tableau
| Feature | Python | R | Excel | Tableau |
|---|---|---|---|---|
| Learning curve | Moderate | Steep | Low | Low |
| Data volume | Very large (memory/tooling bound) | Very large (memory bound) | ~1M rows | ~10M+ rows |
| Statistical modeling | Excellent | Excellent | Basic | Limited |
| Visualization | Very good | Very good | Good | Excellent |
| Automation | Excellent | Good | VBA macros | Limited |
| Cost | Free | Free | Paid license | Paid license |
| Reproducibility | Excellent | Excellent | Poor | Manual steps |
| Community | Massive | Strong | Huge | Moderate |
| ML/AI integration | Best-in-class | Good | None | Limited |
| Accounting tasks (Excel I/O) | Excellent (openpyxl, xlwings) | Good (openxlsx) | Native | Not suitable |
When to use each:
- Python: end-to-end data pipeline, ML, automation, large datasets
- R: advanced statistical analysis, academic research, specialized plots
- Excel: ad-hoc analysis, quick reports, what-if scenarios, client distribution
- Tableau: interactive dashboards, executive presentations, data storytelling
12-Week Data Analysis Learning Roadmap
Weeks 1-2: Python Fundamentals
- Variables, data types, control flow, functions
- Lists, dictionaries, sets, tuples
- File I/O and exception handling
- Project: Build a personal expense tracker script
Weeks 3-4: Numerical Computing & Data Manipulation
- NumPy arrays, broadcasting, vectorized operations
- pandas Series and DataFrames, groupby, merge, pivot tables
- Reading/writing CSV, Excel, JSON
- Project: Clean and analyze a sales dataset from Kaggle
Weeks 5-6: Statistics & Probability
- Descriptive statistics: mean, median, variance, correlation
- Probability distributions: normal, binomial, Poisson
- Hypothesis testing: t-tests, chi-squared, ANOVA
- Project: A/B test analysis on marketing campaign data
Weeks 7-8: SQL & Databases
- SELECT, WHERE, GROUP BY, HAVING, ORDER BY
- JOIN types, subqueries, CTEs
- Window functions: ROW_NUMBER, RANK, LAG/LEAD
- Project: Run analytical queries on a sample e-commerce database
Weeks 9-10: Data Visualization
- matplotlib: line, scatter, bar, histogram, subplots
- seaborn: pairplot, heatmap, boxplot, violin plot
- Dashboard-style figures with multiple panels
- Project: Build a multi-panel executive dashboard
Weeks 11-12: Excel Automation & Portfolio
- openpyxl: read, write, style, charts
- xlwings: live Excel workbook automation
- Combine SQL + Python + Excel into a complete pipeline
- Project: End-to-end automated financial reporting system
Building a Data Analysis Portfolio
A strong portfolio demonstrates real skills. Include these types of projects:
- Exploratory Data Analysis (EDA) — Clean and explore a public dataset, publish findings as a Jupyter notebook
- Automated Reporting — Python script that pulls from SQL, processes data, and outputs a formatted Excel report
- Statistical Analysis — Hypothesis testing on real data with clear conclusions
- Dashboard — Matplotlib/seaborn multi-panel visualization telling a data story
- SQL Portfolio — 10+ analytical queries solving business problems
Host notebooks on GitHub and Kaggle. Write clear READMEs that explain the business context, methodology, and findings.
Recommended Resources
Courses
- Python for Everybody (Coursera)
- Data Analysis with Python (freeCodeCamp)
- SQL for Data Analysis (Mode Analytics)
- Statistics with Python (Coursera)
- Accounting Data Analytics Specialization (Coursera)
- Python Data Analysis and Visualization (MOOC)
Books
- Python for Data Analysis by Wes McKinney (O’Reilly) — pandas creator’s definitive guide
- Naked Statistics by Charles Wheelan — intuitive statistics without jargon
- SQL for Data Analysis by Cathy Tanimura — practical business SQL
- Python Crash Course by Eric Matthes — fastest path to Python fluency
- Excel + Python: Fast Data Analysis and Processing (Turing) — Excel-Python integration
Practice Platforms
- Kaggle — Datasets, competitions, notebooks
- LeetCode SQL — SQL interview practice
- StrataScratch — Real interview questions
- HackerRank — SQL skill assessments
Tools
- VS Code with Python, Jupyter, SQL extensions
- Jupyter Notebook / JupyterLab — Interactive analysis
- Git — Version control for your code
- MySQL or PostgreSQL — Open-source databases
- DBeaver — Universal database client
Summary
The learning path is: Python fundamentals → pandas/NumPy → statistics → SQL → visualization → Excel automation. Each stage builds on the previous one. By week 12 you will have an end-to-end data pipeline that pulls raw data from a database, cleans and analyzes it in Python, and outputs a formatted Excel report with charts — the exact workflow expected of a modern data analyst or data-driven accountant.