Data analysis combines statistics, programming, and domain knowledge to extract insights from data. For accountants and analysts moving into a data-driven role, Python offers the most accessible path: a single language that handles everything from Excel automation to machine learning. See Python Guide for more context.
Core Statistics Concepts with Python
Statistics is the foundation of data analysis. These concepts appear in every analysis you will perform.
Descriptive Statistics
Descriptive statistics summarize data distributions — central tendency, spread, and shape.
import numpy as np
import pandas as pd
from scipy import stats
data = np.array([23, 45, 67, 45, 23, 89, 45, 12, 34, 56])
mean = np.mean(data)  # 43.9
median = np.median(data)  # 45.0
std_dev = np.std(data, ddof=1)  # ≈ 22.9 (sample std)
variance = np.var(data, ddof=1)  # ≈ 523.0
skewness = stats.skew(data)  # ≈ 0.53
kurtosis = stats.kurtosis(data)  # ≈ -0.35 (excess kurtosis)
print(f"Mean: {mean:.2f}, Median: {median:.1f}, Std: {std_dev:.2f}")
Use ddof=1 for the sample standard deviation (n − 1 denominator). NumPy's default, ddof=0, gives the population formula.
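The difference is easy to check directly. A quick sketch, reusing the array above:

```python
import numpy as np

data = np.array([23, 45, 67, 45, 23, 89, 45, 12, 34, 56])

# Population std (ddof=0, NumPy's default) divides by n;
# sample std (ddof=1) divides by n - 1 and is slightly larger.
pop_std = np.std(data)             # divides by 10
sample_std = np.std(data, ddof=1)  # divides by 9

print(f"Population std: {pop_std:.2f}")
print(f"Sample std:     {sample_std:.2f}")
```

The gap shrinks as n grows, which is why the distinction matters most for small samples.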
Probability Distributions
Understanding distributions lets you model uncertainty and make probabilistic predictions.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
# Normal distribution
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x, loc=0, scale=1)
# Binomial distribution
n, p = 10, 0.5
binom_pmf = stats.binom.pmf(np.arange(0, 11), n, p)
# Poisson distribution
lam = 3
poisson_pmf = stats.poisson.pmf(np.arange(0, 11), lam)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].plot(x, y); axes[0].set_title("Normal Distribution")
axes[1].bar(range(11), binom_pmf); axes[1].set_title("Binomial (n=10, p=0.5)")
axes[2].bar(range(11), poisson_pmf); axes[2].set_title("Poisson (λ=3)")
plt.tight_layout()
plt.show()
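Beyond plotting densities, the same scipy.stats distribution objects answer concrete probability questions through cdf (cumulative probability), ppf (inverse CDF), and pmf. A brief sketch:

```python
from scipy import stats

# P(Z <= 1.96) for a standard normal: about 0.975
p_below = stats.norm.cdf(1.96, loc=0, scale=1)

# The 95th percentile of the same distribution (inverse CDF)
q95 = stats.norm.ppf(0.95)

# P(exactly 5 successes) for Binomial(n=10, p=0.5)
p_five = stats.binom.pmf(5, 10, 0.5)

print(f"P(Z <= 1.96)    = {p_below:.4f}")
print(f"95th percentile = {q95:.3f}")
print(f"P(X = 5)        = {p_five:.4f}")
```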
Hypothesis Testing
Hypothesis testing determines whether observed effects are statistically significant.
from scipy import stats
import numpy as np
# Two-sample t-test: compare two groups
group_a = np.array([23, 25, 21, 22, 26, 24, 20, 23])
group_b = np.array([30, 28, 32, 29, 31, 27, 33, 28])
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
# p < 0.05: reject the null hypothesis at the 5% significance level
# Chi-squared test: independence of categorical variables
from scipy.stats import chi2_contingency
table = np.array([[30, 10], [20, 40]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi-squared: {chi2:.3f}, p-value: {p:.4f}")
Always interpret p-values in context: p < 0.05 means the observed difference would be unlikely if the null hypothesis were true; it says nothing about how large the effect is. Sample size and practical significance matter too.
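One way to weigh practical significance is an effect size such as Cohen's d. A minimal sketch reusing the two groups above; the pooled-standard-deviation formula here assumes roughly equal group variances:

```python
import numpy as np

group_a = np.array([23, 25, 21, 22, 26, 24, 20, 23])
group_b = np.array([30, 28, 32, 29, 31, 27, 33, 28])

# Cohen's d: difference in means divided by the pooled standard deviation
n_a, n_b = len(group_a), len(group_b)
pooled_std = np.sqrt(
    ((n_a - 1) * np.var(group_a, ddof=1) + (n_b - 1) * np.var(group_b, ddof=1))
    / (n_a + n_b - 2)
)
d = (np.mean(group_b) - np.mean(group_a)) / pooled_std
print(f"Cohen's d: {d:.2f}")  # |d| > 0.8 is conventionally a large effect
```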
Correlation and Regression
import numpy as np
from scipy import stats
hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores = np.array([35, 45, 50, 55, 60, 70, 75, 85])
r, p_value = stats.pearsonr(hours_studied, exam_scores)
print(f"Pearson r: {r:.3f}, p-value: {p_value:.4f}")
# r ≈ 0.994 → very strong positive correlation
slope, intercept, r_value, p_value, std_err = stats.linregress(
hours_studied, exam_scores
)
print(f"Score ≈ {slope:.1f} × hours + {intercept:.1f}")
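The fitted line can then make point predictions. A quick sketch; the 6.5-hour input is purely illustrative:

```python
import numpy as np
from scipy import stats

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores = np.array([35, 45, 50, 55, 60, 70, 75, 85])

result = stats.linregress(hours_studied, exam_scores)

# Predict a score for an input not in the training data
predicted = result.slope * 6.5 + result.intercept
print(f"Predicted score for 6.5 hours: {predicted:.1f}")
```

Treat predictions outside the observed range of hours with caution; the linear fit is only validated on 1 to 8 hours.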
Python Libraries for Data Analysis
| Library | Purpose | Key Functions |
|---|---|---|
| pandas | Tabular data manipulation | read_csv(), groupby(), merge(), pivot_table() |
| NumPy | Numerical computing | array(), mean(), linalg.inv(), random.randn() |
| matplotlib | Static plotting | plot(), scatter(), hist(), subplots() |
| seaborn | Statistical visualization | boxplot(), heatmap(), pairplot(), displot() |
| scikit-learn | Machine learning | train_test_split(), LinearRegression(), RandomForestClassifier() |
| statsmodels | Statistical modeling | OLS(), Logit(), tsa.seasonal_decompose() |
| openpyxl | Excel file I/O | load_workbook(), Workbook(), cell.value |
| xlwings | Live Excel automation | Book(), Range(), api.Quit() |
pandas Data Manipulation
import pandas as pd
import numpy as np
df = pd.read_csv("sales_data.csv")
df.info()
df.describe()
# Filtering and aggregation
region_sales = (
df.groupby("region")["revenue"]
.agg(["sum", "mean", "count"])
.reset_index()
)
# Pivot table
pivot = pd.pivot_table(
df,
values="revenue",
index="region",
columns="product_category",
aggfunc="sum",
fill_value=0,
)
# Merging datasets
customers = pd.read_csv("customers.csv")
transactions = pd.read_csv("transactions.csv")
merged = pd.merge(transactions, customers, on="customer_id", how="left")
# Window functions
df["rolling_avg"] = df.groupby("product")["revenue"].transform(
lambda x: x.rolling(window=7).mean()
)
Visualization with matplotlib and seaborn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset("penguins")
# Bar plot with error bars
sns.barplot(data=df, x="species", y="body_mass_g", errorbar=("ci", 95))
plt.title("Body Mass by Penguin Species with 95% CI")
plt.show()
# Correlation heatmap
numeric_cols = df.select_dtypes(include=["float64", "int64"])
sns.heatmap(numeric_cols.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()
# Pairplot for multivariate exploration
sns.pairplot(df, hue="species", diag_kind="kde")
plt.show()
# Time series line chart
dates = pd.date_range("2025-01-01", periods=100)
values = np.random.randn(100).cumsum() + 100
ts_df = pd.DataFrame({"date": dates, "value": values})
sns.lineplot(data=ts_df, x="date", y="value")
plt.title("Time Series with Seaborn")
plt.show()
SQL Fundamentals
SQL is the universal language for querying relational databases. Every analyst must know it.
Core Query Patterns
-- Basic filtering and sorting
SELECT
customer_id,
SUM(amount) AS total_spent,
COUNT(*) AS transaction_count
FROM transactions
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000
ORDER BY total_spent DESC
LIMIT 10;
-- Window functions
SELECT
transaction_date,
amount,
AVG(amount) OVER (PARTITION BY customer_id
ORDER BY transaction_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
AS rolling_7_day_avg
FROM transactions;
-- Joins
SELECT
c.customer_name,
t.transaction_date,
t.amount
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
WHERE t.transaction_date >= CURRENT_DATE - INTERVAL '30 days';
-- Subqueries
SELECT department, avg_salary
FROM (
SELECT
department,
AVG(salary) AS avg_salary,
RANK() OVER (ORDER BY AVG(salary) DESC) AS rnk
FROM employees
GROUP BY department
) ranked
WHERE rnk <= 3;
Common Table Expressions (CTEs)
WITH revenue_by_month AS (
SELECT
DATE_TRUNC('month', transaction_date) AS month,
SUM(amount) AS revenue
FROM transactions
WHERE status = 'completed'
GROUP BY 1
),
growth AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
(revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100
AS growth_pct
FROM revenue_by_month
)
SELECT * FROM growth WHERE growth_pct IS NOT NULL;
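These queries can also run from Python. A minimal sketch using the standard-library sqlite3 module with a throwaway in-memory table; note that PostgreSQL-specific functions used above, such as DATE_TRUNC and INTERVAL, have different SQLite equivalents (e.g. strftime):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        customer_id INTEGER,
        amount REAL
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 500), (1, 700), (2, 300), (2, 200), (3, 1500)],
)

# Same GROUP BY / HAVING pattern as above, on the sample data
rows = conn.execute("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM transactions
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
    ORDER BY total_spent DESC
""").fetchall()
print(rows)
conn.close()
```

In practice you would swap sqlite3 for a PostgreSQL or MySQL driver (or pd.read_sql) and keep the query text largely unchanged.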
Data Cleaning and Preprocessing
Real-world data is messy; cleaning and preparation commonly consume 60-80% of an analyst's time.
Handling Missing Values
import pandas as pd
import numpy as np
df = pd.DataFrame({
"age": [25, np.nan, 30, np.nan, 35, 40],
"salary": [50000, 60000, np.nan, 65000, 70000, np.nan],
"department": ["A", "B", "A", "B", np.nan, "A"],
})
# Check missing pattern
print(df.isnull().sum())
print(df.isnull().mean() * 100) # percentage per column
# Imputation strategies
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["department"] = df["department"].fillna(df["department"].mode()[0])
# Forward fill for time series
df_time = pd.DataFrame({
"date": pd.date_range("2025-01-01", periods=5),
"value": [100, np.nan, np.nan, 104, np.nan],
})
df_time["value"] = df_time["value"].ffill()
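When a row or column is mostly missing, dropping it can beat imputing. A quick sketch on the same DataFrame; the thresholds are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 30, np.nan, 35, 40],
    "salary": [50000, 60000, np.nan, 65000, 70000, np.nan],
    "department": ["A", "B", "A", "B", np.nan, "A"],
})

# Keep only rows with at least 3 non-null values (i.e. complete rows)
trimmed = df.dropna(thresh=3)

# Keep columns that are at most 40% missing (all three qualify here)
keep = df.columns[df.isnull().mean() <= 0.4]
print(trimmed.shape, list(keep))
```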
Outlier Detection
import numpy as np
import pandas as pd
data = pd.Series([12, 14, 15, 13, 16, 14, 15, 13, 145, 12, 14])
# IQR method
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = data[(data < lower_bound) | (data > upper_bound)]
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]
print(f"IQR outliers: {outliers_iqr.tolist()}")
print(f"Z-score outliers: {outliers_z.tolist()}")
Normalization and Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np
df = pd.DataFrame({
"feature_a": np.random.randn(100) * 10 + 50,
"feature_b": np.random.rand(100) * 100,
})
# Standardization (z-score): zero mean, unit variance
scaler_std = StandardScaler()
df_standardized = pd.DataFrame(
scaler_std.fit_transform(df), columns=df.columns
)
# Min-Max: scale to [0, 1]
scaler_mm = MinMaxScaler()
df_normalized = pd.DataFrame(
scaler_mm.fit_transform(df), columns=df.columns
)
print(f"Before: mean={df.mean().round(2).values}, std={df.std().round(2).values}")
print(f"After standardization: mean={df_standardized.mean().round(4).values}")
Excel Automation with Python
Accountants working with Excel can save hundreds of hours by automating repetitive tasks with Python.
openpyxl: Read and Write Excel Files
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.chart import BarChart, Reference
wb = Workbook()
ws = wb.active
ws.title = "Monthly Report"
# Headers
headers = ["Month", "Revenue", "Expenses", "Profit"]
ws.append(headers)
# Data
data = [
("Jan", 50000, 32000, 18000),
("Feb", 55000, 34000, 21000),
("Mar", 48000, 31000, 17000),
("Apr", 62000, 35000, 27000),
]
for row in data:
ws.append(row)
# Styling header row
header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
for cell in ws[1]:
cell.font = header_font
cell.fill = header_fill
cell.alignment = Alignment(horizontal="center")
# Add a bar chart
chart = BarChart()
chart.title = "Monthly Financial Summary"
chart.y_axis.title = "Amount ($)"
data_ref = Reference(ws, min_col=2, max_col=4, min_row=1, max_row=5)
cats_ref = Reference(ws, min_col=1, min_row=2, max_row=5)
chart.add_data(data_ref, titles_from_data=True)
chart.set_categories(cats_ref)
ws.add_chart(chart, "F2")
wb.save("monthly_report.xlsx")
xlwings: Control Live Excel with Python
import xlwings as xw
# Open existing workbook
app = xw.App(visible=False)
wb = app.books.open("financial_model.xlsx")
sheet = wb.sheets["Model"]
# Read values
revenue = sheet.range("B2:B13").value
costs = sheet.range("C2:C13").value
# Calculate and write results
profit = [r - c for r, c in zip(revenue, costs)]
sheet.range("D2:D13").value = profit
# Auto-fit columns
sheet.autofit()
wb.save("financial_model_updated.xlsx")
app.quit()
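For plain tabular I/O without styling or live Excel, pandas is often the shortest route; it uses openpyxl under the hood for .xlsx files. A minimal sketch (the simple_report.xlsx filename is illustrative):

```python
import pandas as pd

# Build a small report table in pandas
df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar"],
    "Revenue": [50000, 55000, 48000],
    "Expenses": [32000, 34000, 31000],
})
df["Profit"] = df["Revenue"] - df["Expenses"]

# Write to Excel, then read it back to verify the round trip
df.to_excel("simple_report.xlsx", sheet_name="Summary", index=False)
check = pd.read_excel("simple_report.xlsx", sheet_name="Summary")
print(check["Profit"].tolist())
```

Reach for openpyxl when you need styling and charts, and xlwings when you need to drive a live Excel session.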
Comparison: Python vs R vs Excel vs Tableau
| Feature | Python | R | Excel | Tableau |
|---|---|---|---|---|
| Learning curve | Moderate | Steep | Low | Low |
| Data volume | Very large (memory/tooling bound) | Very large (memory bound) | ~1M rows | ~10M+ rows |
| Statistical modeling | Excellent | Excellent | Basic | Limited |
| Visualization | Very good | Very good | Good | Excellent |
| Automation | Excellent | Good | VBA macros | Limited |
| Cost | Free | Free | Paid license | Paid license |
| Reproducibility | Excellent | Excellent | Poor | Manual steps |
| Community | Massive | Strong | Huge | Moderate |
| ML/AI integration | Best-in-class | Good | None | Limited |
| Accounting tasks (Excel I/O) | Excellent (openpyxl, xlwings) | Good (openxlsx) | Native | Not suitable |
When to use each:
- Python: end-to-end data pipeline, ML, automation, large datasets
- R: advanced statistical analysis, academic research, specialized plots
- Excel: ad-hoc analysis, quick reports, what-if scenarios, client distribution
- Tableau: interactive dashboards, executive presentations, data storytelling
12-Week Data Analysis Learning Roadmap
Weeks 1-2: Python Fundamentals
- Variables, data types, control flow, functions
- Lists, dictionaries, sets, tuples
- File I/O and exception handling
- Project: Build a personal expense tracker script
Weeks 3-4: Numerical Computing & Data Manipulation
- NumPy arrays, broadcasting, vectorized operations
- pandas Series and DataFrames, groupby, merge, pivot tables
- Reading/writing CSV, Excel, JSON
- Project: Clean and analyze a sales dataset from Kaggle
Weeks 5-6: Statistics & Probability
- Descriptive statistics: mean, median, variance, correlation
- Probability distributions: normal, binomial, Poisson
- Hypothesis testing: t-tests, chi-squared, ANOVA
- Project: A/B test analysis on marketing campaign data
Weeks 7-8: SQL & Databases
- SELECT, WHERE, GROUP BY, HAVING, ORDER BY
- JOIN types, subqueries, CTEs
- Window functions: ROW_NUMBER, RANK, LAG/LEAD
- Project: Run analytical queries on a sample e-commerce database
Weeks 9-10: Data Visualization
- matplotlib: line, scatter, bar, histogram, subplots
- seaborn: pairplot, heatmap, boxplot, violin plot
- Dashboard-style figures with multiple panels
- Project: Build a multi-panel executive dashboard
Weeks 11-12: Excel Automation & Portfolio
- openpyxl: read, write, style, charts
- xlwings: live Excel workbook automation
- Combine SQL + Python + Excel into a complete pipeline
- Project: End-to-end automated financial reporting system
Building a Data Analysis Portfolio
A strong portfolio demonstrates real skills. Include these types of projects:
- Exploratory Data Analysis (EDA) — Clean and explore a public dataset, publish findings as a Jupyter notebook
- Automated Reporting — Python script that pulls from SQL, processes data, and outputs a formatted Excel report
- Statistical Analysis — Hypothesis testing on real data with clear conclusions
- Dashboard — Matplotlib/seaborn multi-panel visualization telling a data story
- SQL Portfolio — 10+ analytical queries solving business problems
Host notebooks on GitHub and Kaggle. Write clear READMEs that explain the business context, methodology, and findings.
Recommended Resources
Courses
- Python for Everybody (Coursera)
- Data Analysis with Python (freeCodeCamp)
- SQL for Data Analysis (Mode Analytics)
- Statistics with Python (Coursera)
- Accounting Data Analytics Specialization (Coursera)
- Python Data Analysis and Visualization (MOOC)
Books
- Python for Data Analysis by Wes McKinney (O’Reilly) — pandas creator’s definitive guide
- Naked Statistics by Charles Wheelan — intuitive statistics without jargon
- SQL for Data Analysis by Cathy Tanimura — practical business SQL
- Python Crash Course by Eric Matthes — fastest path to Python fluency
- Excel + Python: Fast Data Analysis and Processing (Turing) — Excel-Python integration
Practice Platforms
- Kaggle — Datasets, competitions, notebooks
- LeetCode SQL — SQL interview practice
- StrataScratch — Real interview questions
- HackerRank — SQL skill assessments
Tools
- VS Code with Python, Jupyter, SQL extensions
- Jupyter Notebook / JupyterLab — Interactive analysis
- Git — Version control for your code
- MySQL or PostgreSQL — Open-source databases
- DBeaver — Universal database client
Summary
The learning path is: Python fundamentals → pandas/NumPy → statistics → SQL → visualization → Excel automation. Each stage builds on the previous one. By week 12 you will have an end-to-end data pipeline that pulls raw data from a database, cleans and analyzes it in Python, and outputs a formatted Excel report with charts — the exact workflow expected of a modern data analyst or data-driven accountant.