
Pandas Time Series Analysis: Master Temporal Data Manipulation


Introduction

Time series data is everywhere. Stock prices, sensor readings, website traffic, weather measurements, and sales figures all change over time. Working with temporal data effectively is a critical skill for any data professional, yet it presents unique challenges that don’t exist with static datasets.

Time series analysis requires specialized techniques because temporal data has inherent structure: observations are ordered by time, often have regular intervals, and frequently exhibit patterns like trends and seasonality. Pandas provides powerful tools specifically designed for these challenges, making it the go-to library for time series work in Python.

In this post, we’ll explore practical techniques for manipulating and analyzing time series data with Pandas. You’ll learn how to work with datetime indices, resample data to different frequencies, calculate rolling statistics, filter by time periods, and handle missing temporal data. By the end, you’ll have a comprehensive toolkit for tackling real-world time series problems.


Part 1: Working with Datetime Data

Creating Datetime Objects

The foundation of time series work is proper datetime handling. Pandas makes this straightforward:

import pandas as pd
import numpy as np

# Create datetime from string
date = pd.to_datetime('2024-01-15')
print(date)  # 2024-01-15 00:00:00

# Create datetime from multiple formats
dates = pd.to_datetime(['2024-01-15', '2024-01-16', '2024-01-17'])
print(dates)

# Create datetime with specific format
dates = pd.to_datetime(['15/01/2024', '16/01/2024'], format='%d/%m/%Y')
print(dates)

# Create date range
dates = pd.date_range('2024-01-01', periods=10, freq='D')
print(dates)

# Create date range with specific frequency
dates = pd.date_range('2024-01-01', '2024-01-31', freq='W')  # Weekly
print(dates)

Creating Time Series with Datetime Index

The key to effective time series work is using a datetime index:

# Create sample time series data
dates = pd.date_range('2024-01-01', periods=30, freq='D')
values = np.random.randn(30).cumsum()

# Create Series with datetime index
ts = pd.Series(values, index=dates)
print(ts)

# Create DataFrame with datetime index
df = pd.DataFrame({
    'temperature': np.random.uniform(15, 25, 30),
    'humidity': np.random.uniform(40, 80, 30),
    'pressure': np.random.uniform(1000, 1020, 30)
}, index=dates)
print(df)
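
In practice, time series usually arrive in files rather than being generated in memory. A minimal sketch of loading CSV data straight into a datetime-indexed DataFrame (the inline CSV payload here is a stand-in for a real file path):

```python
import io
import pandas as pd

# Hypothetical CSV payload; in practice pass a file path to read_csv instead
csv_data = io.StringIO(
    "date,temperature\n"
    "2024-01-01,18.5\n"
    "2024-01-02,19.2\n"
    "2024-01-03,17.8\n"
)

# parse_dates converts the column to datetime64; index_col makes it the index
df = pd.read_csv(csv_data, parse_dates=['date'], index_col='date')
print(df.index.dtype)  # datetime64[ns]
```

Doing the conversion at load time avoids a separate `pd.to_datetime` pass afterwards.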

Extracting Datetime Components

Once you have datetime data, you can extract useful components:

# Create time series
dates = pd.date_range('2024-01-01', periods=365, freq='D')
ts = pd.Series(np.random.randn(365), index=dates)

# Extract components
print(ts.index.year)      # Year
print(ts.index.month)     # Month (1-12)
print(ts.index.day)       # Day of month
print(ts.index.dayofweek) # Day of week (0=Monday, 6=Sunday)
print(ts.index.quarter)   # Quarter (1-4)
print(ts.index.isocalendar().week)  # ISO week number (.week was removed in pandas 2.0)

# Create DataFrame with components
df = pd.DataFrame({
    'value': ts.values,
    'year': ts.index.year,
    'month': ts.index.month,
    'day': ts.index.day,
    'dayofweek': ts.index.dayofweek,
    'quarter': ts.index.quarter
}, index=ts.index)
print(df.head())
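
The `.index` attributes above only exist on a DatetimeIndex. When datetimes live in an ordinary column instead, the same components are reached through the `.dt` accessor:

```python
import pandas as pd

# Datetimes stored as a regular column, not as the index
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-25'])})

# .dt exposes the same components as the DatetimeIndex attributes
df['month'] = df['date'].dt.month
df['dayofweek'] = df['date'].dt.dayofweek
print(df)
```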

Timezone Handling

When working with data from different regions, timezone handling is crucial:

# Create timezone-aware datetime
dates = pd.date_range('2024-01-01', periods=10, freq='D', tz='UTC')
print(dates)

# Create with specific timezone
dates = pd.date_range('2024-01-01', periods=10, freq='D', tz='US/Eastern')
print(dates)

# Convert between timezones
dates_utc = dates.tz_convert('UTC')
print(dates_utc)

# Remove timezone information
dates_naive = dates.tz_localize(None)
print(dates_naive)

# Add timezone to naive datetime
dates_aware = dates_naive.tz_localize('US/Eastern')
print(dates_aware)
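
Localizing naive timestamps can fail around daylight-saving transitions: one wall-clock time is skipped in spring, and another occurs twice in autumn. `tz_localize` exposes the `nonexistent` and `ambiguous` parameters for these cases; a small sketch using the 2024 US transitions:

```python
import pandas as pd

# 2024-03-10 02:30 does not exist in US/Eastern (clocks jump 02:00 -> 03:00)
spring = pd.Timestamp('2024-03-10 02:30')
shifted = spring.tz_localize('US/Eastern', nonexistent='shift_forward')
print(shifted)  # 2024-03-10 03:00:00-04:00

# 2024-11-03 01:30 occurs twice (clocks fall back 02:00 -> 01:00)
fall = pd.Timestamp('2024-11-03 01:30')
first = fall.tz_localize('US/Eastern', ambiguous=True)    # DST reading (EDT, -04:00)
second = fall.tz_localize('US/Eastern', ambiguous=False)  # standard reading (EST, -05:00)
print(first, second)
```

Without these parameters, both cases raise an exception, which is often the safer default for data pipelines.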

Part 2: Datetime Indexing and Selection

Selecting by Date

One of the most powerful features of datetime indexing is intuitive date-based selection:

# Create time series
dates = pd.date_range('2024-01-01', periods=365, freq='D')
ts = pd.Series(np.random.randn(365), index=dates)

# Select by specific date
print(ts['2024-01-15'])

# Select by month
print(ts['2024-01'])

# Select by year
print(ts['2024'])

# Select by date range
print(ts['2024-01-01':'2024-01-31'])

# Select using loc
print(ts.loc['2024-01-01':'2024-01-31'])

# Select using iloc (position-based)
print(ts.iloc[0:10])

Filtering by Time Periods

You can filter time series data based on temporal conditions:

# Create time series with hourly data
dates = pd.date_range('2024-01-01', periods=168, freq='H')  # One week
values = np.random.randn(168).cumsum()
ts = pd.Series(values, index=dates)

# Filter for specific hour of day
morning = ts[ts.index.hour < 12]
print(f"Morning observations: {len(morning)}")

# Filter for specific day of week
weekdays = ts[ts.index.dayofweek < 5]  # Monday-Friday
print(f"Weekday observations: {len(weekdays)}")

# Filter for specific month
january = ts[ts.index.month == 1]
print(f"January observations: {len(january)}")

# Filter for business days
business_days = ts[ts.index.dayofweek < 5]
print(f"Business day observations: {len(business_days)}")

# Filter for specific quarter
q1 = ts[ts.index.quarter == 1]
print(f"Q1 observations: {len(q1)}")

Selecting Between Times

The .between_time() method is useful for intraday filtering:

# Create intraday time series
dates = pd.date_range('2024-01-01', periods=480, freq='30min')
ts = pd.Series(np.random.randn(480), index=dates)

# Select between specific times
trading_hours = ts.between_time('09:30', '16:00')
print(f"Trading hours observations: {len(trading_hours)}")

# Select outside specific times
after_hours = ts.between_time('16:00', '09:30')
print(f"After hours observations: {len(after_hours)}")
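
A companion to `.between_time()` is `.at_time()`, which keeps only the observations stamped with one exact time of day, e.g. pulling the 09:30 reading from every day:

```python
import numpy as np
import pandas as pd

# Two days of half-hourly observations
dates = pd.date_range('2024-01-01', periods=96, freq='30min')
ts = pd.Series(np.arange(96), index=dates)

# Keep only the observations stamped exactly 09:30 (one per day here)
opens = ts.at_time('09:30')
print(opens)
```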

Part 3: Resampling Time Series

Resampling changes the frequency of time series data. It’s one of the most powerful time series operations:

Upsampling (Increasing Frequency)

# Create daily data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
ts = pd.Series(range(10), index=dates)
print("Original (daily):")
print(ts)

# Upsample to hourly
ts_hourly = ts.resample('H').asfreq()
print("\nUpsampled to hourly (with NaN):")
print(ts_hourly.head(50))

# Upsample with forward fill
ts_ffill = ts.resample('H').ffill()
print("\nUpsampled with forward fill:")
print(ts_ffill.head(50))

# Upsample with interpolation
ts_interp = ts.resample('H').interpolate()
print("\nUpsampled with interpolation:")
print(ts_interp.head(50))

Downsampling (Decreasing Frequency)

# Create hourly data
dates = pd.date_range('2024-01-01', periods=168, freq='H')
values = np.random.uniform(20, 30, 168)
ts = pd.Series(values, index=dates)

# Downsample to daily (mean)
daily_mean = ts.resample('D').mean()
print("Daily mean:")
print(daily_mean)

# Downsample to daily (sum)
daily_sum = ts.resample('D').sum()
print("\nDaily sum:")
print(daily_sum)

# Downsample to daily (multiple aggregations)
daily_stats = ts.resample('D').agg(['mean', 'min', 'max', 'std'])
print("\nDaily statistics:")
print(daily_stats)

# Downsample to weekly
weekly = ts.resample('W').mean()
print("\nWeekly mean:")
print(weekly)

Common Resampling Frequencies

Note: pandas 2.2 renamed several offset aliases; the modern spellings are shown below. The older uppercase forms ('M', 'Q', 'Y', 'H', 'T', 'S') still work in many installed versions but emit deprecation warnings.

# Common frequency strings
# 'D'   - Daily
# 'W'   - Weekly
# 'ME'  - Month end (formerly 'M')
# 'MS'  - Month start
# 'QE'  - Quarter end (formerly 'Q')
# 'QS'  - Quarter start
# 'YE'  - Year end (formerly 'Y')
# 'YS'  - Year start
# 'h'   - Hourly (formerly 'H')
# 'min' - Minute (formerly 'T')
# 's'   - Second (formerly 'S')
# 'B'   - Business day

dates = pd.date_range('2024-01-01', periods=365, freq='D')
ts = pd.Series(np.random.randn(365).cumsum(), index=dates)

print("Monthly mean:")
print(ts.resample('ME').mean())

print("\nQuarterly sum:")
print(ts.resample('QE').sum())

print("\nBusiness day mean:")
print(ts.resample('B').mean())

Resampling with Custom Functions

# Define custom aggregation function
def price_range(x):
    return x.max() - x.min()

# Apply custom function during resampling
dates = pd.date_range('2024-01-01', periods=100, freq='H')
prices = pd.Series(np.random.uniform(100, 110, 100), index=dates)

daily_range = prices.resample('D').apply(price_range)
print("Daily price range:")
print(daily_range)

# Multiple custom functions
daily_stats = prices.resample('D').agg([
    ('mean', 'mean'),
    ('range', price_range),
    ('volatility', 'std')
])
print("\nDaily statistics:")
print(daily_stats)

Part 4: Rolling Windows and Moving Calculations

Rolling windows calculate statistics over a moving window of data, perfect for smoothing and trend analysis:

Basic Rolling Operations

# Create time series
dates = pd.date_range('2024-01-01', periods=100, freq='D')
prices = pd.Series(np.random.uniform(100, 110, 100), index=dates)

# 7-day moving average
ma7 = prices.rolling(window=7).mean()
print("7-day moving average:")
print(ma7.head(10))

# 30-day moving average
ma30 = prices.rolling(window=30).mean()

# Plot comparison
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(prices.index, prices, label='Price', alpha=0.7)
plt.plot(ma7.index, ma7, label='7-day MA', linewidth=2)
plt.plot(ma30.index, ma30, label='30-day MA', linewidth=2)
plt.legend()
plt.title('Price with Moving Averages')
plt.show()

Multiple Rolling Statistics

# Calculate multiple rolling statistics
rolling_stats = prices.rolling(window=20).agg([
    ('mean', 'mean'),
    ('std', 'std'),
    ('min', 'min'),
    ('max', 'max')
])
print("20-day rolling statistics:")
print(rolling_stats.head(25))

# Calculate rolling correlation
dates = pd.date_range('2024-01-01', periods=100, freq='D')
price1 = pd.Series(np.random.randn(100).cumsum(), index=dates)
price2 = pd.Series(np.random.randn(100).cumsum(), index=dates)

rolling_corr = price1.rolling(window=30).corr(price2)
print("\n30-day rolling correlation:")
print(rolling_corr.head(35))
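
`.rolling()` also accepts arbitrary functions through `.apply()`. Passing `raw=True` hands each window to the function as a plain NumPy array rather than a Series, which is usually faster. A sketch computing a rolling range:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0, 108.0])

# Rolling range (max - min) over a 3-observation window
def window_range(arr):
    # With raw=True, arr is a NumPy array, not a Series
    return arr.max() - arr.min()

rolling_range = prices.rolling(window=3).apply(window_range, raw=True)
print(rolling_range)
```

As with any rolling calculation, the first `window - 1` values are NaN.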

Expanding Windows

Expanding windows grow over time, useful for cumulative calculations:

# Expanding mean (cumulative average)
expanding_mean = prices.expanding().mean()
print("Expanding mean:")
print(expanding_mean.head(10))

# Expanding sum (cumulative sum)
expanding_sum = prices.expanding().sum()
print("\nExpanding sum:")
print(expanding_sum.head(10))

# Expanding standard deviation
expanding_std = prices.expanding().std()
print("\nExpanding std:")
print(expanding_std.head(10))

Exponential Weighted Moving Average (EWMA)

EWMA gives more weight to recent observations:

# Exponential weighted moving average
ewma = prices.ewm(span=20).mean()
print("EWMA (span=20):")
print(ewma.head(10))

# Compare with simple moving average
sma = prices.rolling(window=20).mean()

plt.figure(figsize=(12, 6))
plt.plot(prices.index, prices, label='Price', alpha=0.5)
plt.plot(sma.index, sma, label='SMA (20)', linewidth=2)
plt.plot(ewma.index, ewma, label='EWMA (span=20)', linewidth=2)
plt.legend()
plt.title('SMA vs EWMA')
plt.show()
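
What `ewm(span=20)` computes is worth spelling out: the span maps to a smoothing factor alpha = 2 / (span + 1). With `adjust=False` it is exactly the recursion y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t] (the example above uses the default `adjust=True`, which additionally corrects the bias in early values). A quick check against a hand-rolled loop:

```python
import numpy as np
import pandas as pd

x = pd.Series([10.0, 11.0, 9.5, 12.0, 10.5])
span = 3
alpha = 2 / (span + 1)

# Manual recursive EWMA: y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = [x.iloc[0]]
for value in x.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * value)

pandas_ewma = x.ewm(span=span, adjust=False).mean()
print(np.allclose(manual, pandas_ewma))  # True
```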

Part 5: Handling Missing Temporal Data

Time series data often has gaps. Handling them properly is crucial:

Identifying Missing Data

# Create time series with missing dates
dates = pd.date_range('2024-01-01', periods=30, freq='D')
# Remove some dates
dates = dates.drop(dates[[5, 10, 15, 20]])
values = np.random.randn(len(dates))
ts = pd.Series(values, index=dates)

print("Original series with gaps:")
print(ts)

# Check for missing dates
print(f"\nMissing values: {ts.isnull().sum()}")

# Identify gaps
expected_dates = pd.date_range(ts.index.min(), ts.index.max(), freq='D')
missing_dates = expected_dates.difference(ts.index)
print(f"\nMissing dates: {missing_dates}")

Reindexing to Fill Gaps

# Reindex to include all dates
ts_reindexed = ts.reindex(expected_dates)
print("After reindexing:")
print(ts_reindexed)

# Fill missing values with forward fill
ts_ffill = ts.reindex(expected_dates).ffill()
print("\nWith forward fill:")
print(ts_ffill)

# Fill missing values with backward fill
ts_bfill = ts.reindex(expected_dates).bfill()
print("\nWith backward fill:")
print(ts_bfill)

# Fill with interpolation
ts_interp = ts.reindex(expected_dates).interpolate()
print("\nWith interpolation:")
print(ts_interp)

# Fill with specific value
ts_filled = ts.reindex(expected_dates).fillna(0)
print("\nFilled with 0:")
print(ts_filled)
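
When the index is already sorted, `.asfreq()` is a convenient shorthand for this reindexing pattern: it conforms the series to a regular grid spanning its first and last timestamps.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05', '2024-01-07'])
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

# asfreq('D') conforms to a daily grid, inserting NaN for missing days
regular = ts.asfreq('D')
print(regular)

# Equivalent to reindexing against the full daily range
full_range = pd.date_range(ts.index.min(), ts.index.max(), freq='D')
print(regular.equals(ts.reindex(full_range)))  # True
```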

Handling Business Day Gaps

# Create business day frequency
dates = pd.bdate_range('2024-01-01', periods=20)
ts = pd.Series(np.random.randn(20), index=dates)

# Reindex to include all business days
all_bdays = pd.bdate_range(ts.index.min(), ts.index.max())
ts_filled = ts.reindex(all_bdays).ffill()
print("Business day series with gaps filled:")
print(ts_filled)

Part 6: Practical Time Series Examples

Example 1: Stock Price Analysis

# Create stock price data
dates = pd.date_range('2024-01-01', periods=252, freq='B')  # Trading days
np.random.seed(42)
prices = 100 + np.random.randn(252).cumsum()
volume = np.random.uniform(1000000, 5000000, 252)

df = pd.DataFrame({
    'price': prices,
    'volume': volume
}, index=dates)

# Calculate returns
df['daily_return'] = df['price'].pct_change()

# Calculate moving averages
df['ma_20'] = df['price'].rolling(window=20).mean()
df['ma_50'] = df['price'].rolling(window=50).mean()

# Calculate volatility
df['volatility'] = df['daily_return'].rolling(window=20).std()

# Identify trading signals
df['signal'] = 0
df.loc[df['ma_20'] > df['ma_50'], 'signal'] = 1  # Buy signal
df.loc[df['ma_20'] < df['ma_50'], 'signal'] = -1  # Sell signal

print("Stock analysis:")
print(df.head(60))

# Summary statistics
print("\nDaily return statistics:")
print(df['daily_return'].describe())

print("\nMonthly returns:")
print(df['daily_return'].resample('M').sum())
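
The `signal` column in the example above labels regimes, not trades. To locate the actual crossover days, difference the signal: a jump of +2 marks a bullish crossover (-1 to 1) and -2 a bearish one. A toy sketch with a hand-made signal series:

```python
import pandas as pd

# Toy signal: 1 while the fast MA is above the slow MA, -1 otherwise
signal = pd.Series([-1, -1, 1, 1, 1, -1, -1, 1],
                   index=pd.date_range('2024-01-01', periods=8, freq='D'))

# A nonzero diff marks the day the regime flipped
crossovers = signal.diff()
buy_days = signal.index[crossovers == 2]    # -1 -> 1: bullish crossover
sell_days = signal.index[crossovers == -2]  # 1 -> -1: bearish crossover
print(buy_days)
print(sell_days)
```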

Example 2: Sensor Data with Anomaly Detection

# Create sensor data
dates = pd.date_range('2024-01-01', periods=1000, freq='H')
# Normal data with occasional spikes
np.random.seed(42)
values = np.random.normal(20, 2, 1000)
# Add anomalies
anomaly_indices = np.random.choice(1000, 20, replace=False)
values[anomaly_indices] += np.random.uniform(10, 20, 20)

ts = pd.Series(values, index=dates)

# Calculate rolling mean and std
rolling_mean = ts.rolling(window=24).mean()
rolling_std = ts.rolling(window=24).std()

# Detect anomalies (values > 2 std from mean)
upper_bound = rolling_mean + 2 * rolling_std
lower_bound = rolling_mean - 2 * rolling_std
anomalies = (ts > upper_bound) | (ts < lower_bound)

print(f"Anomalies detected: {anomalies.sum()}")
print("\nAnomalous readings:")
print(ts[anomalies])

# Visualize
plt.figure(figsize=(14, 6))
plt.plot(ts.index, ts, label='Sensor Reading', alpha=0.7)
plt.plot(rolling_mean.index, rolling_mean, label='24h Mean', linewidth=2)
plt.fill_between(upper_bound.index, upper_bound, lower_bound, alpha=0.2, label='±2σ Band')
plt.scatter(ts[anomalies].index, ts[anomalies], color='red', label='Anomalies', s=50)
plt.legend()
plt.title('Sensor Data with Anomaly Detection')
plt.show()

Example 3: Website Traffic Analysis

# Create hourly website traffic data
dates = pd.date_range('2024-01-01', periods=720, freq='H')  # 30 days
# Traffic pattern: higher during business hours, lower at night
hour_of_day = dates.hour
base_traffic = 1000 + 500 * np.sin(2 * np.pi * (hour_of_day - 8) / 24)  # peaks ~14:00, troughs ~02:00
traffic = base_traffic + np.random.normal(0, 100, 720)

df = pd.DataFrame({
    'traffic': traffic,
    'hour': hour_of_day
}, index=dates)

# Daily traffic
daily_traffic = df['traffic'].resample('D').sum()
print("Daily traffic:")
print(daily_traffic)

# Hourly average by hour of day
hourly_pattern = df.groupby('hour')['traffic'].mean()
print("\nAverage traffic by hour:")
print(hourly_pattern)

# Weekly traffic
weekly_traffic = df['traffic'].resample('W').sum()
print("\nWeekly traffic:")
print(weekly_traffic)

# Identify peak hours
peak_hours = hourly_pattern.nlargest(5)
print("\nPeak traffic hours:")
print(peak_hours)

# Calculate day-over-day growth
daily_growth = daily_traffic.pct_change()
print("\nDay-over-day growth:")
print(daily_growth)

Example 4: Temperature Data with Seasonal Decomposition

# Create temperature data with seasonal pattern
dates = pd.date_range('2023-01-01', periods=365, freq='D')
# Seasonal pattern: higher in summer, lower in winter
day_of_year = np.arange(365)
seasonal = 15 * np.sin(2 * np.pi * (day_of_year - 105) / 365)  # peaks mid-July, troughs mid-January
trend = 0.01 * day_of_year
noise = np.random.normal(0, 1, 365)
temperature = 15 + seasonal + trend + noise

ts = pd.Series(temperature, index=dates)

# Calculate components
rolling_mean = ts.rolling(window=30).mean()  # Trend
seasonal_component = ts - rolling_mean  # Seasonal + noise

# Monthly statistics
monthly_stats = ts.resample('M').agg(['mean', 'min', 'max'])
print("Monthly temperature statistics:")
print(monthly_stats)

# Identify warmest and coldest periods
print(f"\nWarmest day: {ts.idxmax().date()} ({ts.max():.1f}°C)")
print(f"Coldest day: {ts.idxmin().date()} ({ts.min():.1f}°C)")

# Calculate heating/cooling degree days
heating_degree_days = (18 - ts).clip(lower=0).resample('M').sum()
cooling_degree_days = (ts - 22).clip(lower=0).resample('M').sum()

print("\nMonthly heating degree days:")
print(heating_degree_days)

print("\nMonthly cooling degree days:")
print(cooling_degree_days)

Part 7: Performance Optimization

When working with large time series datasets, performance matters:

Efficient Data Types

# Use appropriate data types to save memory
df = pd.DataFrame({
    'price': np.random.uniform(100, 110, 100000),
    'volume': np.random.randint(1000000, 5000000, 100000),
    'category': np.random.choice(['A', 'B', 'C'], 100000)
})

# Check memory usage before optimizing
original_memory = df.memory_usage(deep=True).sum()
print("Default memory usage:")
print(df.memory_usage(deep=True))

# Optimize data types
df['price'] = df['price'].astype('float32')
df['volume'] = df['volume'].astype('int32')
df['category'] = df['category'].astype('category')

optimized_memory = df.memory_usage(deep=True).sum()
print("\nOptimized memory usage:")
print(df.memory_usage(deep=True))
print(f"Memory saved: {(1 - optimized_memory / original_memory) * 100:.1f}%")

Efficient Resampling

# For large datasets, specify the aggregation function upfront
dates = pd.date_range('2020-01-01', periods=1000000, freq='S')
ts = pd.Series(np.random.randn(1000000), index=dates)

# Efficient: specify aggregation directly
result = ts.resample('H').mean()

# Less efficient: multiple passes
# result = ts.resample('H').apply(lambda x: x.mean())
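
Both forms produce the same numbers; the built-in aggregation is faster only because it dispatches to a compiled code path instead of calling a Python function once per bin. A quick equivalence check on a smaller series:

```python
import numpy as np
import pandas as pd

# One day of minute-level data
dates = pd.date_range('2024-01-01', periods=1440, freq='min')
ts = pd.Series(np.random.randn(1440), index=dates)

fast = ts.resample('h').mean()                     # compiled fast path
slow = ts.resample('h').apply(lambda x: x.mean())  # Python call per bin
print(np.allclose(fast, slow))  # True
```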

Indexing Performance

# Ensure datetime index is sorted for better performance
df = df.sort_index()

# Use .loc with datetime slicing (very fast)
subset = df.loc['2024-01-01':'2024-01-31']

# Avoid repeated resampling
# Instead of:
# daily_mean = ts.resample('D').mean()
# daily_sum = ts.resample('D').sum()

# Do:
daily_agg = ts.resample('D').agg(['mean', 'sum'])

Best Practices

Here are key best practices for time series work:

1. Always Use Datetime Index

# Good: datetime index
ts = pd.Series(values, index=pd.date_range('2024-01-01', periods=len(values)))

# Avoid: integer index
ts = pd.Series(values)

2. Handle Timezones Explicitly

# Be explicit about timezones
dates = pd.date_range('2024-01-01', periods=10, freq='D', tz='UTC')

# Convert when necessary
dates_eastern = dates.tz_convert('US/Eastern')

3. Check for Missing Data

# Always check for gaps
expected_dates = pd.date_range(ts.index.min(), ts.index.max(), freq='D')
missing = expected_dates.difference(ts.index)
if len(missing) > 0:
    print(f"Missing dates: {missing}")

4. Use Appropriate Frequencies

# Specify frequency when creating datetime index
dates = pd.date_range('2024-01-01', periods=100, freq='D')

# With only start and end, date_range defaults to daily frequency
dates = pd.date_range('2024-01-01', '2024-04-10')
print(dates.freq)  # <Day>
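
For an index built from explicit timestamps, the `freq` attribute is often `None` even when the spacing is perfectly regular; `pd.infer_freq` can recover it:

```python
import pandas as pd

# Index built from explicit timestamps: freq is not set automatically
idx = pd.DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04'])
print(idx.freq)            # None
print(pd.infer_freq(idx))  # 'D', inferred from the regular daily spacing
```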

5. Document Your Transformations

# Clear variable names and comments
daily_returns = prices.pct_change()  # Daily percentage change
volatility = daily_returns.rolling(20).std()  # 20-day rolling volatility

Common Pitfalls to Avoid

1. Forgetting to Set Datetime Index

# Wrong: datetime as column, not index
df = pd.DataFrame({'date': dates, 'value': values})

# Correct: datetime as index
df = pd.DataFrame({'value': values}, index=dates)

2. Not Handling Timezones

# Problem: mixing naive and aware datetimes
dates_naive = pd.date_range('2024-01-01', periods=10)
dates_aware = pd.date_range('2024-01-01', periods=10, tz='UTC')
# This will cause errors when combining

# Solution: be consistent
dates = pd.date_range('2024-01-01', periods=10, tz='UTC')

3. Assuming Frequency is Preserved

# After filtering, frequency may be lost
ts_filtered = ts[ts > 0]
print(ts_filtered.index.freq)  # Likely None

# Reindex if needed
ts_filtered = ts_filtered.reindex(pd.date_range(ts_filtered.index.min(), 
                                                  ts_filtered.index.max(), 
                                                  freq='D'))

4. Not Handling NaN in Rolling Calculations

# Rolling calculations produce NaN at the start
rolling_mean = ts.rolling(window=20).mean()
print(rolling_mean.head(20))  # First 19 values are NaN

# Handle appropriately
rolling_mean = ts.rolling(window=20, min_periods=1).mean()

Conclusion

Time series analysis is a specialized but essential skill for data professionals. Pandas provides powerful tools specifically designed for temporal data, making it possible to handle complex time series operations efficiently.

Key Takeaways:

  • Datetime indexing is fundamental to effective time series work
  • Resampling changes frequency and aggregates data
  • Rolling windows calculate statistics over moving periods
  • Expanding windows grow over time for cumulative calculations
  • Missing data requires explicit handling with reindexing and filling
  • Timezone handling is important for global data
  • Performance optimization matters with large datasets
  • Best practices ensure reliable and maintainable code

Time series analysis opens doors to understanding trends, detecting anomalies, forecasting future values, and making data-driven decisions based on temporal patterns. As you work with more time series data, you’ll develop intuition for which techniques solve specific problems. Start with simple operations like resampling and moving averages, then gradually explore more advanced techniques like seasonal decomposition and anomaly detection.

