Introduction
Data science and technical research demand precise typesetting for equations, code, statistical results, and visualizations. LaTeX delivers the rigor and polish that academic publications and technical reports require.
This guide covers LaTeX techniques specific to data science: statistical reporting, code listings, data visualization, and reproducible research workflows.
Statistical Reporting
Tables for Statistics
\usepackage{booktabs}
\usepackage{siunitx}
\begin{table}[htbp]
\centering
\caption{Descriptive Statistics}
\label{tab:descriptives}
\sisetup{
table-format = 2.3,
table-space-text-post = {***},
table-align-text-post = false
}
\begin{tabular}{
l
S[table-format=3.2]
S[table-format=6.2]
S[table-format=1.3]
}
\toprule
& \multicolumn{3}{c}{Variable} \\
\cmidrule{2-4}
Statistic & {Age} & {Income} & {Score} \\
\midrule
Mean & 34.52 & 54230.18 & 0.724 \\
SD & 8.34 & 15234.56 & 0.156 \\
Min & 22 & 25000 & 0.31 \\
Max & 65 & 125000 & 0.98 \\
N & 250 & 250 & 250 \\
\bottomrule
\end{tabular}
\end{table}
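If the raw values carry more precision than you want to print, siunitx can round at typesetting time instead of forcing you to edit the data; a minimal sketch using options documented in the siunitx manual:

```latex
% Round every S-column entry to two decimal places when typeset;
% the source values stay untouched
\sisetup{
  round-mode      = places,
  round-precision = 2
}
```

This keeps the table source identical to the analysis output, which helps when the numbers are regenerated.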
Regression Results
\begin{table}[htbp]
\centering
\caption{Linear Regression Results}
\label{tab:regression}
\begin{tabular}{lccc}
\toprule
& \multicolumn{3}{c}{Dependent Variable} \\
\cmidrule{2-4}
Predictor & $\beta$ & SE & $p$ \\
\midrule
(Intercept) & 2.34 & 0.45 & $<$.001 \\
Age & 0.12 & 0.03 & .002 \\
Education & 0.45 & 0.12 & $<$.001 \\
Income & 0.00 & 0.00 & .234 \\
\midrule
$R^2$ & \multicolumn{3}{c}{0.34} \\
$F$ & \multicolumn{3}{c}{42.56} \\
$p$ & \multicolumn{3}{c}{$<$.001} \\
\bottomrule
\end{tabular}
\end{table}
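Significance stars are conventionally defined in a note attached to the table itself. One common approach, sketched here with the threeparttable package (the starred row is illustrative):

```latex
\usepackage{threeparttable}

\begin{table}[htbp]
\centering
\begin{threeparttable}
\caption{Regression Results with Notes}
\begin{tabular}{lccc}
\toprule
Predictor & $\beta$ & SE & $p$ \\
\midrule
Age & 0.12\tnote{**} & 0.03 & .002 \\
\bottomrule
\end{tabular}
\begin{tablenotes}
\small
\item[**] $p < .01$
\end{tablenotes}
\end{threeparttable}
\end{table}
```

The `\tnote` markers and the `tablenotes` block stay attached to the table, so the legend cannot drift away from the results it explains.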
Effect Sizes and Confidence Intervals
\begin{table}[htbp]
\centering
\caption{Effect Sizes with 95\% CI}
\label{tab:effects}
\sisetup{
table-format = 1.2(2),
separate-uncertainty = true
}
\begin{tabular}{l S[table-format=1.2(2)] c}
\toprule
Measure & {Estimate} & Interpretation \\
\midrule
Cohen's $d$ & 0.45(12) & Small--medium \\
Pearson's $r$ & 0.38(8) & Small--medium \\
Odds ratio & 2.34(45) & Medium--large \\
\bottomrule
\end{tabular}
\end{table}
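The same compact uncertainty notation works in running text via `\num`, so inline values match the table formatting exactly:

```latex
% With separate-uncertainty=true this prints as 2.34 +- 0.45
The odds ratio was \num{2.34(45)}.
```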
Code Listings
Python Code
\usepackage{minted}
\begin{listing}[htbp]
\begin{minted}[
fontsize=\small,
linenos,
frame=lines,
framesep=2mm
]{python}
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load data
df = pd.read_csv('data.csv')
# Prepare features
X = df.drop('target', axis=1)
y = df['target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.3f}")
\end{minted}
\caption{Model training pipeline}
\label{code:model}
\end{listing}
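Because minted shells out to Pygments for highlighting, the document must be compiled with shell escape enabled or every listing will fail:

```latex
% minted calls pygmentize externally; compile with:
%   pdflatex -shell-escape main.tex
% or, with latexmk:
%   latexmk -pdf -shell-escape main.tex
```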
R Code
\begin{minted}[fontsize=\small]{R}
library(tidyverse)
library(lme4)
# Load data
data <- read_csv("experiment.csv")
# Fit mixed model
model <- lmer(
outcome ~ treatment + (1|subject),
data = data
)
# Summarize
summary(model)
confint(model)
\end{minted}
Multiple Languages
\newminted{python}{linenos,frame=lines}
\newminted{r}{linenos,frame=lines}
\newminted{sql}{linenos,frame=lines}
\newminted{bash}{linenos,frame=lines}
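Each `\newminted` call defines a `<language>code` environment with those defaults baked in, so individual listings stay short:

```latex
% \newminted{python}{...} defines the pythoncode environment
\begin{pythoncode}
scores = [0.91, 0.88, 0.93]
print(sum(scores) / len(scores))
\end{pythoncode}
```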
Data Visualization
PGFPlots Basic Plot
\usepackage{pgfplots}
\pgfplotsset{compat=1.18}
\begin{figure}[htbp]
\centering
\begin{tikzpicture}
\begin{axis}[
width=\linewidth,
height=6cm,
xlabel={X Axis Label},
ylabel={Y Axis Label},
legend pos=north west,
grid=major
]
\addplot[blue,mark=*] coordinates {
(1, 2.3)
(2, 4.5)
(3, 3.7)
(4, 5.2)
(5, 4.8)
};
\addlegendentry{Data Series 1}
\addplot[red,smooth] coordinates {
(1, 2.1)
(2, 4.2)
(3, 3.9)
(4, 5.0)
(5, 4.6)
};
\addlegendentry{Data Series 2}
\end{axis}
\end{tikzpicture}
\caption{Comparison of Two Series}
\label{fig:comparison}
\end{figure}
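Rather than hard-coding coordinates, pgfplots can read them straight from the CSV files in your data directory; a sketch, where the file path and the column names `x` and `y` are assumptions about your data layout:

```latex
\begin{tikzpicture}
\begin{axis}[xlabel={X Axis Label}, ylabel={Y Axis Label}, grid=major]
% col sep=comma for CSV; x= and y= name the columns to plot
\addplot table[col sep=comma, x=x, y=y] {data/results.csv};
\end{axis}
\end{tikzpicture}
```

When the analysis regenerates the CSV, the figure updates on the next compile with no manual copying.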
Error Bars
\begin{tikzpicture}
\begin{axis}[
ybar,
width=\linewidth,
height=6cm,
xtick=data,
symbolic x coords={A,B,C,D},
legend style={at={(0.5,-0.15)},anchor=north},
ylabel={Mean Value}
]
\addplot+[error bars/.cd,y dir=both,y explicit]
coordinates {
(A, 5.2) +- (0.4, 0.3)
(B, 6.8) +- (0.5, 0.4)
(C, 4.9) +- (0.3, 0.5)
(D, 7.2) +- (0.6, 0.4)
};
\addplot+[error bars/.cd,y dir=both,y explicit]
coordinates {
(A, 4.8) +- (0.3, 0.4)
(B, 6.2) +- (0.4, 0.5)
(C, 5.1) +- (0.5, 0.3)
(D, 6.9) +- (0.5, 0.5)
};
\legend{Group 1, Group 2} % placeholder series names
\end{axis}
\end{tikzpicture}
Heatmaps
\begin{tikzpicture}
\begin{axis}[
width=8cm,
height=8cm,
colorbar,
view={0}{90}
]
\addplot3[surf] file {data/heatmap.dat};
\end{axis}
\end{tikzpicture}
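For small matrices, `matrix plot*` takes the cells inline instead of reading an external file; a minimal sketch with illustrative values:

```latex
\begin{tikzpicture}
\begin{axis}[colorbar, view={0}{90}]
% mesh/cols tells pgfplots how to reshape the flat coordinate list
\addplot3[matrix plot*, mesh/cols=3] coordinates {
(0,0,0.2) (1,0,0.5) (2,0,0.9)
(0,1,0.4) (1,1,0.7) (2,1,0.3)
};
\end{axis}
\end{tikzpicture}
```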
Reproducible Research
Knitr/R Markdown Integration
% In R, use knitr to generate LaTeX
% knitr::write_bib(c("ggplot2", "dplyr"), "packages.bib")
% Then in LaTeX
\usepackage[backend=biber]{biblatex}
\addbibresource{packages.bib}
Jupyter/Python Integration
# Use jupytext to convert notebooks
# jupytext --to latex notebook.ipynb
# Or use nbconvert
# jupyter nbconvert --to latex notebook.ipynb
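A common reproducibility pattern is to have the analysis script write table fragments to disk and `\input` them, so no number is ever retyped by hand; a sketch, where the path `tables/descriptives.tex` is purely illustrative:

```latex
% The analysis script writes tables/descriptives.tex (path is an example);
% regenerated numbers then flow into the document automatically
\begin{table}[htbp]
\centering
\caption{Descriptive Statistics (generated)}
\input{tables/descriptives.tex}
\end{table}
```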
Version Control for Data
% Reference data files
\usepackage{filecontents} % only needed on pre-2019 kernels; now built in
\begin{filecontents}{data/experiment.csv}
treatment,outcome,subject
A,5.2,1
A,4.8,2
B,6.1,1
B,5.9,2
\end{filecontents}
Mathematical Optimization
Equations in Papers
\begin{align}
\min_{\mathbf{w}} &\quad \frac{1}{2}\lVert\mathbf{w}\rVert^2 \\
\text{s.t.} &\quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad \forall i
\end{align}
\begin{align}
L(\mathbf{w}, b, \boldsymbol{\alpha}) &= \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]
\end{align}
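Setting the derivatives of the Lagrangian with respect to $\mathbf{w}$ and $b$ to zero (giving $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$) and substituting back yields the dual problem:

```latex
\begin{align}
\max_{\boldsymbol{\alpha}} &\quad \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \\
\text{s.t.} &\quad \alpha_i \geq 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
\end{align}
```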
Algorithms
% algorithm2e defines its own algorithm float; do not load the
% algorithm package alongside it, as the two clash
\usepackage{algorithm2e}
\begin{algorithm}[htbp]
\SetAlgoLined
\KwData{Input data $X$, labels $y$, regularization $\lambda$}
\KwResult{Model parameters $\theta$}
Initialize $\theta^{(0)}$ randomly\;
\For{$t = 1$ to $T$}{
Compute gradient: $g \leftarrow \nabla L(\theta^{(t-1)})$\;
Update: $\theta^{(t)} \leftarrow \theta^{(t-1)} - \eta_t \cdot g$\;
}
\caption{Gradient Descent Algorithm}
\label{algo:gd}
\end{algorithm}
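The algorithm takes a regularization parameter $\lambda$; for completeness, the objective it minimizes has the usual regularized form (the per-example loss $\ell$ and model $f_\theta$ are generic placeholders here):

```latex
\begin{equation}
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f_\theta(x_i), y_i\bigr)
  + \frac{\lambda}{2} \lVert \theta \rVert^2
\end{equation}
```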
Best Practices
Organization
project/
├── main.tex
├── references.bib
├── figures/
│   └── *.pdf, *.png
├── data/
│   └── *.csv
└── code/
    └── analysis.py
Data Citation
@dataset{smith2024data,
author = {Smith, John},
title = {Experimental Dataset},
year = {2024},
publisher = {Data Repository},
doi = {10.5281/zenodo.xxxxx}
}
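With biblatex loaded as above, the dataset entry is cited like any other reference:

```latex
% Standard biblatex citation commands
Data were drawn from \textcite{smith2024data};
full details are in the repository \autocite{smith2024data}.
```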
Code Citation
@misc{python2024,
title = {Python Programming Language},
author = {Python Software Foundation},
year = {2024},
url = {https://www.python.org}
}
Conclusion
LaTeX provides professional capabilities for data science documentation. From statistical tables to code listings to sophisticated visualizations, LaTeX handles the demanding requirements of technical research.
Combine LaTeX with reproducible research tools for efficient workflows that ensure your publications accurately reflect your analysis.