Programming Resources for Data Analysts and Accountants

Created: October 30, 2022 · CalmOps · 10 min read

Data analysis combines statistics, programming, and domain knowledge to extract insights from data. For accountants and analysts moving into a data-driven role, Python offers the most accessible path: a single language that handles everything from Excel automation to machine learning. See Python Guide for more context.

Core Statistics Concepts with Python

Statistics is the foundation of data analysis. These concepts appear in every analysis you will perform.

Descriptive Statistics

Descriptive statistics summarize data distributions — central tendency, spread, and shape.

import numpy as np
import pandas as pd
from scipy import stats

data = np.array([23, 45, 67, 45, 23, 89, 45, 12, 34, 56])

mean = np.mean(data)            # 43.9
median = np.median(data)        # 45.0
std_dev = np.std(data, ddof=1)  # 22.9 (sample std)
variance = np.var(data, ddof=1) # 523.0
skewness = stats.skew(data)     # 0.53
kurtosis = stats.kurtosis(data) # -0.35

print(f"Mean: {mean:.2f}, Median: {median:.1f}, Std: {std_dev:.2f}")

Use ddof=1 for sample standard deviation (n-1 denominator). For population data, omit it.
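
One related gotcha: NumPy and pandas disagree on the default. np.std() uses ddof=0 (population), while pd.Series.std() uses ddof=1 (sample). A quick check on the data above:

import numpy as np
import pandas as pd

data = np.array([23, 45, 67, 45, 23, 89, 45, 12, 34, 56])

print(np.std(data))               # 21.70 — population std (ddof=0 default)
print(pd.Series(data).std())      # 22.87 — sample std (ddof=1 default)
print(np.std(data, ddof=1))       # 22.87 — matches pandas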

Probability Distributions

Understanding distributions lets you model uncertainty and make probabilistic predictions.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

# Normal distribution
x = np.linspace(-4, 4, 1000)
y = stats.norm.pdf(x, loc=0, scale=1)

# Binomial distribution
n, p = 10, 0.5
binom_pmf = stats.binom.pmf(np.arange(0, 11), n, p)

# Poisson distribution
lam = 3
poisson_pmf = stats.poisson.pmf(np.arange(0, 11), lam)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].plot(x, y); axes[0].set_title("Normal Distribution")
axes[1].bar(range(11), binom_pmf); axes[1].set_title("Binomial (n=10, p=0.5)")
axes[2].bar(range(11), poisson_pmf); axes[2].set_title("Poisson (λ=3)")
plt.tight_layout()
plt.show()
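
The same distribution objects also answer probability questions directly, without plotting. A short sketch (the thresholds are illustrative):

from scipy import stats

# P(Z <= 1.96) for a standard normal — about 0.975
print(stats.norm.cdf(1.96))

# P(at least 8 heads in 10 fair coin flips) — about 0.055
print(1 - stats.binom.cdf(7, n=10, p=0.5))

# P(exactly 3 events) when the average rate is 3 per period — about 0.224
print(stats.poisson.pmf(3, mu=3))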

Hypothesis Testing

Hypothesis testing determines whether observed effects are statistically significant.

from scipy import stats
import numpy as np

# Two-sample t-test: compare two groups
group_a = np.array([23, 25, 21, 22, 26, 24, 20, 23])
group_b = np.array([30, 28, 32, 29, 31, 27, 33, 28])

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
# p < 0.05 conventionally indicates a significant difference

# Chi-squared test: independence of categorical variables
from scipy.stats import chi2_contingency
table = np.array([[30, 10], [20, 40]])
chi2, p, dof, expected = chi2_contingency(table)
print(f"Chi-squared: {chi2:.3f}, p-value: {p:.4f}")

Always interpret p-values in context: p < 0.05 is conventional evidence against the null hypothesis, but sample size and practical significance matter too.
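
One common way to quantify practical significance is an effect size such as Cohen's d. A minimal sketch using the two groups above (the pooled-standard-deviation formula is standard, but there is no scipy built-in for it):

import numpy as np

group_a = np.array([23, 25, 21, 22, 26, 24, 20, 23])
group_b = np.array([30, 28, 32, 29, 31, 27, 33, 28])

def cohens_d(a, b):
    # Pooled sample variance across both groups
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

print(f"Cohen's d: {cohens_d(group_a, group_b):.2f}")  # ≈ -3.27; |d| ≥ 0.8 is conventionally large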

Correlation and Regression

import numpy as np
from scipy import stats

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_scores  = np.array([35, 45, 50, 55, 60, 70, 75, 85])

r, p_value = stats.pearsonr(hours_studied, exam_scores)
print(f"Pearson r: {r:.3f}, p-value: {p_value:.4f}")
# r ≈ 0.994 → very strong positive correlation

slope, intercept, r_value, p_value, std_err = stats.linregress(
    hours_studied, exam_scores
)
print(f"Score ≈ {slope:.1f} × hours + {intercept:.1f}")

Python Libraries for Data Analysis

Library | Purpose | Key Functions
pandas | Tabular data manipulation | read_csv(), groupby(), merge(), pivot_table()
NumPy | Numerical computing | array(), mean(), linalg.inv(), random.randn()
matplotlib | Static plotting | plot(), scatter(), hist(), subplots()
seaborn | Statistical visualization | boxplot(), heatmap(), pairplot(), displot()
scikit-learn | Machine learning | train_test_split(), LinearRegression(), RandomForestClassifier()
statsmodels | Statistical modeling | OLS(), Logit(), tsa.seasonal_decompose()
openpyxl | Excel file I/O | load_workbook(), Workbook(), cell.value
xlwings | Live Excel automation | Book(), Range(), api.Quit()

pandas Data Manipulation

import pandas as pd
import numpy as np

df = pd.read_csv("sales_data.csv")

df.info()
df.describe()

# Filtering and aggregation
region_sales = (
    df.groupby("region")["revenue"]
    .agg(["sum", "mean", "count"])
    .reset_index()
)

# Pivot table
pivot = pd.pivot_table(
    df,
    values="revenue",
    index="region",
    columns="product_category",
    aggfunc="sum",
    fill_value=0,
)

# Merging datasets
customers = pd.read_csv("customers.csv")
transactions = pd.read_csv("transactions.csv")
merged = pd.merge(transactions, customers, on="customer_id", how="left")

# Window functions
df["rolling_avg"] = df.groupby("product")["revenue"].transform(
    lambda x: x.rolling(window=7).mean()
)
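
Rolling windows operate on row order, not on dates, so sort before applying them. A minimal sketch assuming the frame has a date column (min_periods=1 avoids NaN for the first rows of each group):

df = df.sort_values(["product", "date"])
df["rolling_avg"] = df.groupby("product")["revenue"].transform(
    lambda x: x.rolling(window=7, min_periods=1).mean()
)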

Visualization with matplotlib and seaborn

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

df = sns.load_dataset("penguins")

# Bar plot with error bars
sns.barplot(data=df, x="species", y="body_mass_g", errorbar=("ci", 95))
plt.title("Body Mass by Penguin Species with 95% CI")
plt.show()

# Correlation heatmap
numeric_cols = df.select_dtypes(include=["float64", "int64"])
sns.heatmap(numeric_cols.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()

# Pairplot for multivariate exploration
sns.pairplot(df, hue="species", diag_kind="kde")
plt.show()

# Time series line chart
dates = pd.date_range("2025-01-01", periods=100)
values = np.random.randn(100).cumsum() + 100
ts_df = pd.DataFrame({"date": dates, "value": values})
sns.lineplot(data=ts_df, x="date", y="value")
plt.title("Time Series with Seaborn")
plt.show()

SQL Fundamentals

SQL is the universal language for querying relational databases. Every analyst must know it.

Core Query Patterns

-- Basic filtering and sorting
SELECT
    customer_id,
    SUM(amount) AS total_spent,
    COUNT(*) AS transaction_count
FROM transactions
WHERE transaction_date >= '2025-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000  -- standard SQL does not allow column aliases in HAVING
ORDER BY total_spent DESC
LIMIT 10;

-- Window functions
SELECT
    transaction_date,
    amount,
    AVG(amount) OVER (PARTITION BY customer_id
                      ORDER BY transaction_date
                      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
        AS rolling_7_day_avg
FROM transactions;

-- Joins
SELECT
    c.customer_name,
    t.transaction_date,
    t.amount
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
WHERE t.transaction_date >= CURRENT_DATE - INTERVAL '30 days';

-- Subqueries
SELECT department, avg_salary
FROM (
    SELECT
        department,
        AVG(salary) AS avg_salary,
        RANK() OVER (ORDER BY AVG(salary) DESC) AS rnk
    FROM employees
    GROUP BY department
) ranked
WHERE rnk <= 3;

Common Table Expressions (CTEs)

WITH revenue_by_month AS (
    SELECT
        DATE_TRUNC('month', transaction_date) AS month,
        SUM(amount) AS revenue
    FROM transactions
    WHERE status = 'completed'
    GROUP BY 1
),
growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
        (revenue - LAG(revenue) OVER (ORDER BY month))
        / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100
            AS growth_pct
    FROM revenue_by_month
)
SELECT * FROM growth WHERE growth_pct IS NOT NULL;
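
In practice you rarely run SQL in isolation: pandas can execute a query and hand back a DataFrame. A minimal sketch assuming a local SQLite file named transactions.db with a transactions table (note that SQLite lacks Postgres features like DATE_TRUNC and INTERVAL):

import sqlite3
import pandas as pd

conn = sqlite3.connect("transactions.db")  # hypothetical local database

query = """
    SELECT customer_id,
           SUM(amount) AS total_spent,
           COUNT(*)    AS transaction_count
    FROM transactions
    WHERE transaction_date >= '2025-01-01'
    GROUP BY customer_id
    HAVING SUM(amount) > 1000
    ORDER BY total_spent DESC
    LIMIT 10;
"""

top_customers = pd.read_sql_query(query, conn)
conn.close()
print(top_customers.head())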

Data Cleaning and Preprocessing

Real-world data is messy. Cleaning and preparation routinely consume the majority of an analyst's time — figures of 60-80% are commonly cited.

Handling Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 30, np.nan, 35, 40],
    "salary": [50000, 60000, np.nan, 65000, 70000, np.nan],
    "department": ["A", "B", "A", "B", np.nan, "A"],
})

# Check missing pattern
print(df.isnull().sum())
print(df.isnull().mean() * 100)  # percentage per column

# Imputation strategies (assign back; chained inplace fillna is deprecated)
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["department"] = df["department"].fillna(df["department"].mode()[0])

# Forward fill for time series
df_time = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=5),
    "value": [100, np.nan, np.nan, 104, np.nan],
})
df_time["value"].fillna(method="ffill", inplace=True)

Outlier Detection

import numpy as np
import pandas as pd

data = pd.Series([12, 14, 15, 13, 16, 14, 15, 13, 145, 12, 14])

# IQR method
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = data[(data < lower_bound) | (data > upper_bound)]

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]

print(f"IQR outliers: {outliers_iqr.tolist()}")
print(f"Z-score outliers: {outliers_z.tolist()}")

Normalization and Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "feature_a": np.random.randn(100) * 10 + 50,
    "feature_b": np.random.rand(100) * 100,
})

# Standardization (z-score): zero mean, unit variance
scaler_std = StandardScaler()
df_standardized = pd.DataFrame(
    scaler_std.fit_transform(df), columns=df.columns
)

# Min-Max: scale to [0, 1]
scaler_mm = MinMaxScaler()
df_normalized = pd.DataFrame(
    scaler_mm.fit_transform(df), columns=df.columns
)

print(f"Before: mean={df.mean().values}, std={df.std().values}")
print(f"After standardization: mean={df_standardized.mean().values:.4f}")

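
When scaling feeds a model, fit the scaler on the training split only and reuse it on the test split; fitting on all the data leaks test information into training. A minimal sketch:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.random.randn(100, 2)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # apply the same transform to test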
Excel Automation with Python

Accountants working with Excel can save hundreds of hours by automating repetitive tasks with Python.

openpyxl: Read and Write Excel Files

from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.chart import BarChart, Reference

wb = Workbook()
ws = wb.active
ws.title = "Monthly Report"

# Headers
headers = ["Month", "Revenue", "Expenses", "Profit"]
ws.append(headers)

# Data
data = [
    ("Jan", 50000, 32000, 18000),
    ("Feb", 55000, 34000, 21000),
    ("Mar", 48000, 31000, 17000),
    ("Apr", 62000, 35000, 27000),
]
for row in data:
    ws.append(row)

# Styling header row
header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
for cell in ws[1]:
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = Alignment(horizontal="center")

# Add a bar chart
chart = BarChart()
chart.title = "Monthly Financial Summary"
chart.y_axis.title = "Amount ($)"
data_ref = Reference(ws, min_col=2, max_col=4, min_row=1, max_row=5)
cats_ref = Reference(ws, min_col=1, min_row=2, max_row=5)
chart.add_data(data_ref, titles_from_data=True)
chart.set_categories(cats_ref)
ws.add_chart(chart, "F2")

wb.save("monthly_report.xlsx")

xlwings: Control Live Excel with Python

import xlwings as xw

# Open existing workbook
app = xw.App(visible=False)
wb = app.books.open("financial_model.xlsx")
sheet = wb.sheets["Model"]

# Read values
revenue = sheet.range("B2:B13").value
costs = sheet.range("C2:C13").value

# Calculate and write results (transpose writes the list down a column,
# not across a row)
profit = [r - c for r, c in zip(revenue, costs)]
sheet.range("D2").options(transpose=True).value = profit

# Auto-fit columns
sheet.autofit()

wb.save("financial_model_updated.xlsx")
app.quit()
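
If the script raises midway, the invisible Excel process stays alive in the background. Wrapping the session in try/finally guarantees cleanup — a sketch of the same workflow:

import xlwings as xw

app = xw.App(visible=False)
try:
    wb = app.books.open("financial_model.xlsx")
    sheet = wb.sheets["Model"]
    revenue = sheet.range("B2:B13").value
    costs = sheet.range("C2:C13").value
    sheet.range("D2").options(transpose=True).value = [
        r - c for r, c in zip(revenue, costs)
    ]
    wb.save("financial_model_updated.xlsx")
finally:
    app.quit()  # always release the Excel process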

Comparison: Python vs R vs Excel vs Tableau

Feature | Python | R | Excel | Tableau
Learning curve | Moderate | Steep | Low | Low
Data volume | RAM-bound, scales with tooling | RAM-bound, scales with tooling | ~1M rows | ~10M+ rows
Statistical modeling | Excellent | Excellent | Basic | Limited
Visualization | Very good | Very good | Good | Excellent
Automation | Excellent | Good | VBA macros | Limited
Cost | Free | Free | Paid license | Paid license
Reproducibility | Excellent | Excellent | Poor | Manual steps
Community | Massive | Strong | Huge | Moderate
ML/AI integration | Best-in-class | Good | None | Limited
Accounting tasks (Excel I/O) | Excellent (openpyxl, xlwings) | Good (openxlsx) | Native | Not suitable

When to use each:

  • Python: end-to-end data pipeline, ML, automation, large datasets
  • R: advanced statistical analysis, academic research, specialized plots
  • Excel: ad-hoc analysis, quick reports, what-if scenarios, client distribution
  • Tableau: interactive dashboards, executive presentations, data storytelling

12-Week Data Analysis Learning Roadmap

Weeks 1-2: Python Fundamentals

  • Variables, data types, control flow, functions
  • Lists, dictionaries, sets, tuples
  • File I/O and exception handling
  • Project: Build a personal expense tracker script

Weeks 3-4: Numerical Computing & Data Manipulation

  • NumPy arrays, broadcasting, vectorized operations
  • pandas Series and DataFrames, groupby, merge, pivot tables
  • Reading/writing CSV, Excel, JSON
  • Project: Clean and analyze a sales dataset from Kaggle

Weeks 5-6: Statistics & Probability

  • Descriptive statistics: mean, median, variance, correlation
  • Probability distributions: normal, binomial, Poisson
  • Hypothesis testing: t-tests, chi-squared, ANOVA
  • Project: A/B test analysis on marketing campaign data

Weeks 7-8: SQL & Databases

  • SELECT, WHERE, GROUP BY, HAVING, ORDER BY
  • JOIN types, subqueries, CTEs
  • Window functions: ROW_NUMBER, RANK, LAG/LEAD
  • Project: Run analytical queries on a sample e-commerce database

Weeks 9-10: Data Visualization

  • matplotlib: line, scatter, bar, histogram, subplots
  • seaborn: pairplot, heatmap, boxplot, violin plot
  • Dashboard-style figures with multiple panels
  • Project: Build a multi-panel executive dashboard

Weeks 11-12: Excel Automation & Portfolio

  • openpyxl: read, write, style, charts
  • xlwings: live Excel workbook automation
  • Combine SQL + Python + Excel into a complete pipeline
  • Project: End-to-end automated financial reporting system

Building a Data Analysis Portfolio

A strong portfolio demonstrates real skills. Include these types of projects:

  1. Exploratory Data Analysis (EDA) — Clean and explore a public dataset, publish findings as a Jupyter notebook
  2. Automated Reporting — Python script that pulls from SQL, processes data, and outputs a formatted Excel report
  3. Statistical Analysis — Hypothesis testing on real data with clear conclusions
  4. Dashboard — Matplotlib/seaborn multi-panel visualization telling a data story
  5. SQL Portfolio — 10+ analytical queries solving business problems

Host notebooks on GitHub and Kaggle. Write clear READMEs that explain the business context, methodology, and findings.

Books

  • Python for Data Analysis by Wes McKinney (O’Reilly) — pandas creator’s definitive guide
  • Naked Statistics by Charles Wheelan — intuitive statistics without jargon
  • SQL for Data Analysis by Cathy Tanimura — practical business SQL
  • Python Crash Course by Eric Matthes — fastest path to Python fluency
  • Excel + Python: Fast Data Analysis and Processing (Turing) — Excel-Python integration

Summary

The learning path is: Python fundamentals → pandas/NumPy → statistics → SQL → visualization → Excel automation. Each stage builds on the previous one. By week 12 you will have an end-to-end data pipeline that pulls raw data from a database, cleans and analyzes it in Python, and outputs a formatted Excel report with charts — the exact workflow expected of a modern data analyst or data-driven accountant.
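
As a concrete shape for that capstone (file and table names are illustrative), the whole pipeline fits in a page:

import sqlite3
import pandas as pd

# 1. Pull raw data from a database (hypothetical sales.db)
conn = sqlite3.connect("sales.db")
df = pd.read_sql_query("SELECT region, product_category, revenue FROM sales", conn)
conn.close()

# 2. Clean and aggregate in pandas
df = df.dropna(subset=["revenue"])
report = pd.pivot_table(
    df, values="revenue", index="region",
    columns="product_category", aggfunc="sum", fill_value=0,
)

# 3. Write a formatted Excel report
with pd.ExcelWriter("quarterly_report.xlsx", engine="openpyxl") as writer:
    report.to_excel(writer, sheet_name="Revenue by Region")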
