Skip to main content
โšก Calmops

LaTeX for Data Science: Technical Reports and Research Papers

Introduction

Data science and technical research require precise typesetting for equations, code, statistical results, and visualizations. LaTeX provides the precision and professionalism required for academic publications and technical reports.

This guide covers LaTeX techniques specific to data science: statistical reporting, code listings, data visualization, and reproducible research workflows.

Statistical Reporting

Tables for Statistics

\usepackage{booktabs}
\usepackage{siunitx}

\begin{table}[htbp]
  \centering
  \caption{Descriptive Statistics}
  \label{tab:descriptives}
  % Table-wide siunitx defaults; per-column S[...] formats below override
  % table-format.  table-space-text-post reserves horizontal room for
  % significance stars appended after a number (e.g. 0.724***) so starred
  % and unstarred cells stay decimal-aligned.
  \sisetup{
    table-format=2.3,
    table-space-text-post={***},
    table-align-text-post=false
  }
  \begin{tabular}{
    l                      % statistic name
    S[table-format=3.2]    % Age
    S[table-format=6.2]    % Income (values up to 125000.00)
    S[table-format=1.3]    % Score
  }
    \toprule
    & \multicolumn{3}{c}{Variable} \\
    \cmidrule(lr){2-4}
    % Non-numeric headers in S columns must be braced so siunitx does not
    % try to parse them as numbers.
    Statistic & {Age} & {Income} & {Score} \\
    \midrule
    Mean & 34.52 & 54230.18 & 0.724 \\
    SD & 8.34 & 15234.56 & 0.156 \\
    Min & 22 & 25000 & 0.31 \\
    Max & 65 & 125000 & 0.98 \\
    N & 250 & 250 & 250 \\
    \bottomrule
  \end{tabular}
\end{table}

Regression Results

\begin{table}[htbp]
  \centering
  \caption{Linear Regression Results}
  \label{tab:regression}
  % Coefficient table: one row per predictor, with model-fit statistics
  % below the second \midrule.  p-values follow APA style: no leading
  % zero, "<.001" for very small values.
  \begin{tabular}{lccc}
    \toprule
    & \multicolumn{3}{c}{Dependent Variable} \\
    \cmidrule{2-4}
    Predictor & $\beta$ & SE & $p$ \\
    \midrule
    (Intercept) & 2.34 & 0.45 & $<$.001 \\
    Age & 0.12 & 0.03 & .002 \\
    Education & 0.45 & 0.12 & $<$.001 \\
    Income & 0.00 & 0.00 & .234 \\
    \midrule
    % Fit statistics span the three value columns.
    $R^2$ & \multicolumn{3}{c}{0.34} \\
    $F$ & \multicolumn{3}{c}{42.56} \\
    $p$ & \multicolumn{3}{c}{$<$.001} \\
    \bottomrule
  \end{tabular}
\end{table}

Effect Sizes and Confidence Intervals

\begin{table}[htbp]
  \centering
  \caption{Effect Sizes with 95\% CI}
  \label{tab:effects}
  % separate-uncertainty prints "0.45 +/- 0.12" rather than the compact
  % "0.45(12)"; siunitx parses the \pm in each cell as the uncertainty
  % separator.
  \sisetup{
    table-format=1.3,
    separate-uncertainty=true
  }
  \begin{tabular}{l S[table-format=1.3(2)] c}
    \toprule
    % Header of the S column must be braced so it is typeset as text,
    % not parsed as a number.
    Measure & {Estimate} & Interpretation \\
    \midrule
    Cohen's d & 0.45 \pm 0.12 & Small-medium \\
    Pearson's r & 0.38 \pm 0.08 & Small-medium \\
    Odds Ratio & 2.34 \pm 0.45 & Medium-large \\
    \bottomrule
  \end{tabular}
\end{table}

Code Listings

Python Code

% minted calls Pygments externally, so the document must be compiled
% with --shell-escape enabled.
\usepackage{minted}

% `listing' is minted's float wrapper: it lets the code block carry a
% \caption and \label like a figure or table.
\begin{listing}[htbp]
\begin{minted}[
  fontsize=\small,
  linenos,
  frame=lines,
  framesep=2mm
]{python}
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load data
df = pd.read_csv('data.csv')

# Prepare features
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.3f}")
\end{minted}
\caption{Model training pipeline}
\label{code:model}
\end{listing}

R Code

% Inline (non-floating) listing; "R" is Pygments' lexer name for R.
\begin{minted}[fontsize=\small]{R}
library(tidyverse)
library(lme4)

# Load data
data <- read_csv("experiment.csv")

# Fit mixed model
model <- lmer(
  outcome ~ treatment + (1|subject),
  data = data
)

# Summarize
summary(model)
confint(model)
\end{minted}

Multiple Languages

% \newminted{<lang>}{<opts>} defines a shorthand environment named
% <lang>code (e.g. \begin{pythoncode} ... \end{pythoncode}) with the
% given options pre-applied, so per-listing option lists are not
% repeated throughout the document.
\newminted{python}{linenos,frame=lines}
\newminted{r}{linenos,frame=lines}
\newminted{sql}{linenos,frame=lines}
\newminted{bash}{linenos,frame=lines}

Data Visualization

PGFPlots Basic Plot

\usepackage{pgfplots}
% Pin the pgfplots compatibility level so axis defaults stay stable
% across pgfplots releases.
\pgfplotsset{compat=1.18}

\begin{figure}[htbp]
  \centering
  \begin{tikzpicture}
    \begin{axis}[
      width=\linewidth,
      height=6cm,
      xlabel={X Axis Label},
      ylabel={Y Axis Label},
      legend pos=north west,
      grid=major
    ]
      % Series 1: straight segments between points, dot markers.
      \addplot[blue,mark=*] coordinates {
        (1, 2.3)
        (2, 4.5)
        (3, 3.7)
        (4, 5.2)
        (5, 4.8)
      };
      \addlegendentry{Data Series 1}

      % Series 2: `smooth' draws an interpolated curve through the
      % points instead of straight segments.
      \addplot[red,smooth] coordinates {
        (1, 2.1)
        (2, 4.2)
        (3, 3.9)
        (4, 5.0)
        (5, 4.6)
      };
      \addlegendentry{Data Series 2}
    \end{axis}
  \end{tikzpicture}
  \caption{Comparison of Two Series}
  \label{fig:comparison}
\end{figure}

Error Bars

% Grouped bar chart with explicit y error bars.
\begin{tikzpicture}
  \begin{axis}[
    ybar,
    width=\linewidth,
    height=6cm,
    xtick=data,
    symbolic x coords={A,B,C,D},
    % Legend centered below the axis.
    legend style={at={(0.5,-0.15)},anchor=north},
    ylabel={Mean Value}
  ]
    % With "y dir=both, y explicit", the second component of each
    % "+- (ex, ey)" pair is used as a symmetric y error; the first
    % component (x error) is ignored because no x error direction is
    % set.  NOTE(review): if asymmetric errors were intended here, the
    % "-= / +=" syntax is needed instead of "+-" — confirm.
    \addplot+[error bars/.cd,y dir=both,y explicit]
      coordinates {
        (A, 5.2) +- (0.4, 0.3)
        (B, 6.8) +- (0.5, 0.4)
        (C, 4.9) +- (0.3, 0.5)
        (D, 7.2) +- (0.6, 0.4)
      };
    \addplot+[error bars/.cd,y dir=both,y explicit]
      coordinates {
        (A, 4.8) +- (0.3, 0.4)
        (B, 6.2) +- (0.4, 0.5)
        (C, 5.1) +- (0.5, 0.3)
        (D, 6.9) +- (0.5, 0.5)
      };
  \end{axis}
\end{tikzpicture}

Heatmaps

% Heatmap rendered as a 3D surface viewed from directly above
% (view={0}{90}); `colorbar' adds the z-to-color legend.
% NOTE(review): `surf' expects grid-ordered x y z data (blank line
% between scan lines in the .dat file) — confirm data/heatmap.dat
% matches that layout.
\begin{tikzpicture}
  \begin{axis}[
    width=8cm,
    height=8cm,
    colorbar,
    view={0}{90}
  ]
    \addplot3[surf] file {data/heatmap.dat};
  \end{axis}
\end{tikzpicture}

Reproducible Research

Knitr/R Markdown Integration

% In R, use knitr to generate LaTeX
% knitr::write_bib(c("ggplot2", "dplyr"), "packages.bib")

% Then in LaTeX
\usepackage[backend=biber]{biblatex}
\addbibresource{packages.bib}

Jupyter/Python Integration

# Use jupytext to convert notebooks
# jupytext --to latex notebook.ipynb

# Or use pandoc
# jupyter nbconvert --to latex notebook.ipynb

Version Control for Data

% Embed small data files directly in the .tex source for reproducibility.
% The filecontents environment has been part of the LaTeX kernel since
% October 2019, so the obsolete `filecontents' package is not needed.
% By default the kernel environment refuses to overwrite an existing
% file; [overwrite] regenerates it on every compile.
% NOTE(review): the target directory (data/) must already exist —
% LaTeX will not create it.
\begin{filecontents}[overwrite]{data/experiment.csv}
treatment,outcome,subject
A,5.2,1
A,4.8,2
B,6.1,1
B,5.9,2
\end{filecontents}

Mathematical Optimization

Equations in Papers

% Hard-margin SVM primal problem.  Norms use \lVert...\rVert: typing
% "||" produces two adjacent \mid relation symbols with incorrect
% spacing.  (With mathtools, \DeclarePairedDelimiter\norm{\lVert}{\rVert}
% gives a reusable \norm{} macro.)
\begin{align}
  \min_{\mathbf{w}} &\quad \frac{1}{2}\lVert\mathbf{w}\rVert^2 \\
  \text{s.t.} &\quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad \forall i
\end{align}

% Corresponding Lagrangian.  A single equation has no alignment point,
% so plain `equation' is preferred over a one-line `align'.
\begin{equation}
  L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{i=1}^{n} \alpha_i \bigl[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \bigr]
\end{equation}

Algorithms

% algorithm2e supplies its own `algorithm' floating environment; also
% loading the separate `algorithm' package (as the original did) makes
% the two definitions clash.  Load exactly one algorithm package.
\usepackage{algorithm2e}

\begin{algorithm}[htbp]
  \SetAlgoLined
  \KwData{Input data $X$, labels $y$, regularization $\lambda$}
  \KwResult{Model parameters $\theta$}

  Initialize $\theta^{(0)}$ randomly\;
  % \KwTo typesets the loop keyword consistently with \For.
  \For{$t = 1$ \KwTo $T$}{
    Compute gradient: $g \leftarrow \nabla L(\theta^{(t-1)})$\;
    Update: $\theta^{(t)} \leftarrow \theta^{(t-1)} - \eta_t \cdot g$\;
  }
  \caption{Gradient Descent Algorithm}
  \label{algo:gd}
\end{algorithm}

Best Practices

Organization

project/
โ”œโ”€โ”€ main.tex
โ”œโ”€โ”€ references.bib
โ”œโ”€โ”€ figures/
โ”‚   โ””โ”€โ”€ *.pdf, *.png
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ *.csv
โ””โ”€โ”€ code/
    โ””โ”€โ”€ analysis.py

Data Citation

@dataset{smith2024data,
  author = {Smith, John},
  title = {Experimental Dataset},
  year = {2024},
  publisher = {Data Repository},
  doi = {10.5281/zenodo.xxxxx}
}

Code Citation

@misc{python2024,
  title = {Python Programming Language},
  author = {Python Software Foundation},
  year = {2024},
  url = {https://www.python.org}
}

Conclusion

LaTeX provides professional capabilities for data science documentation. From statistical tables to code listings to sophisticated visualizations, LaTeX handles the demanding requirements of technical research.

Combine LaTeX with reproducible research tools for efficient workflows that ensure your publications accurately reflect your analysis.

Resources

Comments