Jupyter for Research and Documentation: Interactive Computing and Reproducible Research

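Jupyter notebooks combine executable code, visualizations, and narrative text in a single document, which makes them well suited to research and documentation. This guide covers setup, Markdown documentation, interactive widgets, reproducible project layout, exporting and sharing, and common pitfalls.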

Jupyter Fundamentals

Installation and Setup

# Install Jupyter
pip install jupyter

# Install additional kernels
pip install ipykernel
python -m ipykernel install --user --name myenv --display-name "Python (myenv)"

# Start Jupyter
jupyter notebook

# Start JupyterLab (modern interface)
pip install jupyterlab
jupyter lab
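
After registering a kernel, it can be worth confirming from inside a notebook that the kernel really runs in the intended environment. The following is a minimal sketch using only the standard library; `describe_interpreter` is a hypothetical helper name, not part of Jupyter.

```python
import platform
import sys


def describe_interpreter() -> dict:
    """Return basic facts about the interpreter running this kernel."""
    return {
        "executable": sys.executable,  # path to the Python binary in use
        "prefix": sys.prefix,          # root directory of the active environment
        # True inside a venv/virtualenv; conda environments are not detected this way
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "python_version": platform.python_version(),
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the printed `executable` does not point into the environment you registered, the notebook is probably attached to a different kernel.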

Notebook Structure

# Cell 1: Imports and Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Cell 2: Load Data
data = pd.read_csv('data.csv')
print(data.head())

# Cell 3: Exploratory Analysis
print(data.describe())
data.info()  # info() prints its report directly and returns None

# Cell 4: Visualization
plt.figure(figsize=(10, 6))
plt.hist(data['column'], bins=30)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution')
plt.show()

# Cell 5: Statistical Analysis
result = stats.ttest_ind(data['group1'], data['group2'])
print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")

# Cell 6: Conclusions
print("Summary of findings...")
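
Behind the cell layout above, a notebook file is a JSON document containing a list of cells. The sketch below builds a small notebook dictionary by hand to show that structure; `make_cell` and `make_notebook` are hypothetical helpers, and in practice the `nbformat` package is the usual way to create and validate notebooks.

```python
import json


def make_cell(cell_type: str, source: str) -> dict:
    """Build one notebook cell; code cells also carry outputs and a counter."""
    cell = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        cell["execution_count"] = None  # not executed yet
        cell["outputs"] = []
    return cell


def make_notebook(cells: list) -> dict:
    """Wrap cells in the top-level notebook structure (format version 4)."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = make_notebook([
    make_cell("markdown", "# Analysis\nLoad the data, then summarize it."),
    make_cell("code", "import pandas as pd\ndata = pd.read_csv('data.csv')"),
    make_cell("code", "data.describe()"),
])

text = json.dumps(notebook, indent=1)  # this string is what an .ipynb file holds
print([cell["cell_type"] for cell in notebook["cells"]])
# prints ['markdown', 'code', 'code']
```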

Markdown and Documentation

Markdown Cells

# Main Title

## Section 1
This is a paragraph with **bold** and *italic* text.

### Subsection
- Bullet point 1
- Bullet point 2
- Bullet point 3

1. Numbered item 1
2. Numbered item 2

### Code Example
```python
def hello():
    print("Hello, World!")
```

Mathematical Equations

Inline equation: $y = mx + b$

Display equation:

$$\int_0^{\infty} e^{-x^2} dx = \frac{\sqrt{\pi}}{2}$$
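
When an equation appears in a notebook, a neighbouring code cell can check it numerically. The sketch below approximates the integral above with a hand-written trapezoid rule; truncating the upper limit at 10 is an assumption that works here because the integrand is negligible beyond that point.

```python
import math


def trapezoid(f, a: float, b: float, n: int) -> float:
    """Approximate the integral of f over [a, b] using n equal-width trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# Integrate exp(-x^2) from 0 to 10 as a stand-in for 0 to infinity
approx = trapezoid(lambda x: math.exp(-x * x), 0.0, 10.0, 10_000)
exact = math.sqrt(math.pi) / 2

print(f"numerical: {approx:.8f}")
print(f"exact:     {exact:.8f}")
```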

Links and Images

[Link text](URL)
![Image alt text](IMAGE_URL)

Tables

| Column 1 | Column 2 |
|----------|----------|
| Value 1  | Value 2  |
| Value 3  | Value 4  |
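
Tables like the one above can also be generated from data instead of typed by hand. The following is a minimal sketch with a hypothetical helper, `to_markdown_table`, that produces the pipe-delimited syntax from a list of rows.

```python
def to_markdown_table(headers, rows) -> str:
    """Render headers and rows as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(value) for value in row) + " |")
    return "\n".join(lines)


table = to_markdown_table(
    ["Column 1", "Column 2"],
    [["Value 1", "Value 2"], ["Value 3", "Value 4"]],
)
print(table)
```

Inside a notebook, passing the resulting string to `IPython.display.Markdown` renders it as a formatted table rather than plain text.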


Interactive Widgets

Using ipywidgets

```python
import ipywidgets as widgets
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt

# Slider widget
slider = widgets.FloatSlider(
    value=5,
    min=0,
    max=10,
    step=0.1,
    description='Value:'
)

# Dropdown widget
dropdown = widgets.Dropdown(
    options=['Option 1', 'Option 2', 'Option 3'],
    value='Option 1',
    description='Choose:'
)

# Button widget
button = widgets.Button(description='Click me')

def on_button_click(b):
    print("Button clicked!")

button.on_click(on_button_click)

# Interactive plot
@widgets.interact
def plot_function(amplitude=5, frequency=1):
    x = np.linspace(0, 10, 100)
    y = amplitude * np.sin(frequency * x)
    
    plt.figure(figsize=(10, 6))
    plt.plot(x, y)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title(f'Amplitude: {amplitude}, Frequency: {frequency}')
    plt.grid(True)
    plt.show()

# Display widgets
display(slider, dropdown, button)
```

Reproducible Research

Project Structure

project/
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_analysis.ipynb
│   └── 03_results.ipynb
├── data/
│   ├── raw/
│   └── processed/
├── src/
│   ├── __init__.py
│   ├── data_processing.py
│   └── analysis.py
├── results/
│   ├── figures/
│   └── tables/
├── requirements.txt
└── README.md
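
The layout above can be created with a short script so that every project starts from the same skeleton. This is a sketch under the assumption that empty placeholder files are acceptable; `scaffold` is a hypothetical helper, and the demo writes into a temporary directory rather than a real project.

```python
import tempfile
from pathlib import Path

# Directories and placeholder files taken from the tree shown above
SUBDIRS = ["notebooks", "data/raw", "data/processed", "src",
           "results/figures", "results/tables"]
FILES = ["src/__init__.py", "requirements.txt", "README.md"]


def scaffold(root) -> Path:
    """Create the project skeleton under root; existing files are left alone."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root


with tempfile.TemporaryDirectory() as tmp:
    project = scaffold(Path(tmp) / "project")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print(created)
```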

Best Practices

# Cell 1: Import all dependencies
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Cell 2: Set random seeds for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)  # seeds NumPy's legacy global generator

# Cell 3: Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Cell 4: Load data with path handling
from pathlib import Path

DATA_DIR = Path('../data/raw')
data = pd.read_csv(DATA_DIR / 'data.csv')

# Cell 5: Document parameters
TRAIN_TEST_SPLIT = 0.8
RANDOM_STATE = 42
MODEL_PARAMS = {
    'n_estimators': 100,
    'max_depth': 10,
    'random_state': RANDOM_STATE
}

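# Cell 6: Use functions from modules
# The notebook lives in notebooks/, so add the project root to the import path
import sys
sys.path.insert(0, str(Path('..').resolve()))

from src.data_processing import clean_data, prepare_features

cleaned_data = clean_data(data)
features = prepare_features(cleaned_data)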

# Cell 7: Document results
print(f"Data shape: {data.shape}")
print(f"Missing values: {data.isnull().sum().sum()}")
print(f"Features created: {len(features.columns)}")
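
Documenting parameters is more useful when the record also captures the environment that produced the results. The sketch below collects the Python version, platform, package versions, and parameters into one dictionary; `run_metadata` and `package_version` are hypothetical helpers, and the package list is only an example.

```python
import json
import platform
from importlib import metadata


def package_version(name: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def run_metadata(params: dict, packages: list) -> dict:
    """Collect parameters and environment details for one analysis run."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "params": params,
    }


record = run_metadata(
    params={"seed": 42, "train_test_split": 0.8},
    packages=["numpy", "pandas", "scipy"],
)
print(json.dumps(record, indent=2))
```

Saving this record next to the figures and tables in `results/` makes it possible to tell later which settings and library versions produced them.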

Exporting and Sharing

Convert Notebooks

# Convert to HTML
jupyter nbconvert --to html notebook.ipynb

# Convert to PDF (requires a LaTeX installation that provides xelatex)
jupyter nbconvert --to pdf notebook.ipynb

# Convert to Markdown
jupyter nbconvert --to markdown notebook.ipynb

# Convert to Python script
jupyter nbconvert --to script notebook.ipynb

# Convert with template
jupyter nbconvert --to html notebook.ipynb --template lab
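
To illustrate what a script conversion does conceptually, the sketch below pulls the code cells out of a notebook dictionary and joins them into one script. It is a simplified illustration, not nbconvert's implementation: it ignores markdown cells and does nothing special with magic commands. `code_cells_to_script` is a hypothetical helper.

```python
def code_cells_to_script(notebook: dict) -> str:
    """Concatenate the source of all code cells, separated by blank lines."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        chunks.append(source.rstrip())
    return "\n\n".join(chunks) + "\n"


example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["x = 1\n", "y = x + 1\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": "print(y)"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

script = code_cells_to_script(example)
print(script)
```

A real notebook can be loaded for this with `json.loads` on the contents of the `.ipynb` file.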

Create Presentations

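# Install RISE for presentations (works with the classic Notebook interface)
pip install RISE

# Assign a slide type to each cell via View > Cell Toolbar > Slideshow

# In notebook, use markdown cells with:
# <!-- .slide: data-background="#ffffff" -->
# # Slide Title

# Use code cells for interactive demos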

Share Notebooks

# Using nbviewer
# https://nbviewer.jupyter.org/github/username/repo/blob/main/notebook.ipynb

# Using Binder
# Add requirements.txt to repo
# https://mybinder.org/v2/gh/username/repo/main?filepath=notebook.ipynb

# Using Google Colab
# Upload notebook or link from GitHub
# https://colab.research.google.com/
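
The nbviewer and Binder links above follow a fixed pattern, so they can be generated for any notebook in a repository. This is a small sketch based only on the URL shapes shown above; `nbviewer_url` and `binder_url` are hypothetical helpers, and the user, repository, and branch names are placeholders.

```python
from urllib.parse import quote


def nbviewer_url(user: str, repo: str, ref: str, path: str) -> str:
    """Build an nbviewer link for a notebook stored in a GitHub repository."""
    return f"https://nbviewer.jupyter.org/github/{user}/{repo}/blob/{ref}/{quote(path)}"


def binder_url(user: str, repo: str, ref: str, path: str) -> str:
    """Build a Binder launch link that opens the given notebook."""
    return f"https://mybinder.org/v2/gh/{user}/{repo}/{ref}?filepath={quote(path)}"


print(nbviewer_url("username", "repo", "main", "notebook.ipynb"))
print(binder_url("username", "repo", "main", "notebooks/02_analysis.ipynb"))
```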

Advanced Features

Magic Commands

# Time a single statement (use %%timeit on the first line of a cell to time the whole cell)
%timeit sum(range(100))

# Line profiling (requires: pip install line_profiler)
%load_ext line_profiler
%lprun -f function_name function_name(args)

# Memory profiling (requires: pip install memory_profiler)
%load_ext memory_profiler
%memit function_name(args)

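# Display matplotlib figures inline as static images
%matplotlib inline

# Interactive figures in the classic Notebook interface
# (in JupyterLab, install ipympl and use %matplotlib widget instead)
%matplotlib notebook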

# Shell commands
!ls -la

# Install a package into the environment of the running kernel
%pip install package

# Variable inspection
%whos

# Run external script
%run script.py
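
Magic commands only work inside IPython, so code moved into a module or script needs a plain-Python equivalent. The sketch below times the same statement as the `%timeit` example using the standard `timeit` module; `time_statement` is a hypothetical wrapper, and the loop counts are arbitrary.

```python
import timeit


def time_statement(stmt: str, number: int = 10_000, repeat: int = 5) -> float:
    """Return the best per-call time in seconds across `repeat` timing runs."""
    timings = timeit.repeat(stmt, number=number, repeat=repeat)
    return min(timings) / number


best = time_statement("sum(range(100))")
print(f"best of 5 runs: {best * 1e6:.3f} microseconds per call")
```

Taking the minimum rather than the mean reduces the influence of other processes competing for the CPU during the measurement.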

Debugging

# Set breakpoint
import pdb
pdb.set_trace()

# Or use IPython debugger
from IPython.core.debugger import set_trace
set_trace()

# Post-mortem debugging
%debug

# Verbose error messages
%xmode Verbose

Notebook Extensions

Useful Extensions

# Install nbextensions (for the classic Notebook 6.x interface;
# not supported by Notebook 7 or JupyterLab)
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

# Install nbextensions configurator
pip install jupyter_nbextensions_configurator
jupyter nbextensions_configurator enable --user

# Popular extensions:
# - Table of Contents
# - Collapsible Headings
# - Variable Inspector
# - ExecuteTime
# - Codefolding

Best Practices

Clear structure: Order cells so the notebook runs from top to bottom
Documentation: Use markdown cells to explain what each step does and why
Reproducibility: Set random seeds and document parameters in one place
Modularity: Move reusable code into importable modules
Version control: Track notebooks in Git
Testing: Include validation cells that check assumptions about the data
Performance: Monitor the execution time of slow cells
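
Version control works better when stored outputs do not clutter the diffs. The sketch below clears outputs and execution counts from a notebook dictionary before it is committed; `strip_outputs` is a hypothetical helper, and dedicated tools such as nbstripout handle this more thoroughly.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs and counters cleared."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


executed = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Results"},
        {"cell_type": "code", "metadata": {}, "execution_count": 3,
         "outputs": [{"output_type": "stream", "name": "stdout", "text": "42\n"}],
         "source": "print(42)"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

clean = strip_outputs(executed)
print(clean["cells"][1]["outputs"], clean["cells"][1]["execution_count"])
# prints [] None
```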

Common Pitfalls

Bad Practice:

# Don't: Unclear cell order
# Cell 5 depends on Cell 10

# Don't: No documentation
x = data[data['col'] > 5]

# Don't: Hardcoded paths
df = pd.read_csv('/Users/username/Desktop/data.csv')

# Don't: Long-running cells
for i in range(1000000):
    # Heavy computation
    pass

Good Practice:

# Do: Logical cell order
# Cell 1: Setup
# Cell 2: Load data
# Cell 3: Process
# Cell 4: Analyze

# Do: Document with markdown
# ## Data Filtering
# Filter rows where column > 5
x = data[data['col'] > 5]

# Do: Use relative paths
from pathlib import Path
data_path = Path('../data/raw/data.csv')
df = pd.read_csv(data_path)

# Do: Optimize long computations
# Use vectorization, parallel processing, or caching

Conclusion

Jupyter notebooks are a practical tool for research and documentation because code, visualizations, and narrative live in one place. Keeping notebooks organized, documented, and reproducible, as outlined above, makes the resulting analyses easier to verify and share.
