
Pandas Data Loading and Cleaning: Essential Workflows for Data Professionals


Introduction

In the real world, data is messy. It comes from multiple sources, contains errors, has missing values, and often isn’t in the format you need. Commonly cited estimates suggest that data professionals spend 60-80% of their time loading and cleaning data rather than on actual analysis or modeling.

This is where pandas becomes invaluable. As a data professional, you’ll spend a significant portion of your time using pandas to load data from various sources and transform it into a clean, usable format. The quality of your data directly impacts the quality of your analysis and models, making data cleaning one of the most critical skills in your toolkit.

In this post, we’ll explore practical workflows for loading data from multiple sources and cleaning it effectively. You’ll learn techniques for handling missing values, removing duplicates, fixing data types, and preparing datasets for analysis. By the end, you’ll have a comprehensive understanding of data preprocessing workflows that you can apply immediately to your projects.


Part 1: Loading Data with Pandas

Loading from CSV Files

CSV (Comma-Separated Values) is the most common data format you’ll encounter. Pandas makes loading CSV files straightforward:

import pandas as pd

# Basic CSV loading
df = pd.read_csv('data.csv')

# With common parameters
df = pd.read_csv('data.csv',
                  sep=',',              # Delimiter (default is comma)
                  header=0,             # Row number for column names
                  encoding='utf-8',     # File encoding
                  nrows=1000)           # Load only first 1000 rows

When working with large CSV files, you can optimize loading:

# Load only specific columns
df = pd.read_csv('data.csv', usecols=['name', 'age', 'salary'])

# Specify data types to save memory
df = pd.read_csv('data.csv',
                  dtype={'age': 'int32', 'salary': 'float32', 'name': 'string'})

# Parse dates automatically
df = pd.read_csv('data.csv', parse_dates=['date_column'])

# Handle missing values during loading
df = pd.read_csv('data.csv', na_values=['NA', 'N/A', 'null', ''])

Loading from Excel Files

Excel files are common in business environments:

# Load from Excel
df = pd.read_excel('data.xlsx')

# Specify sheet name
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Load multiple sheets
sheets = pd.read_excel('data.xlsx', sheet_name=['Sheet1', 'Sheet2'])
df1 = sheets['Sheet1']
df2 = sheets['Sheet2']

# With parameters
df = pd.read_excel('data.xlsx',
                   sheet_name=0,
                   header=0,
                   usecols=['name', 'age'],
                   dtype={'age': 'int32'})

Loading from JSON Files

JSON is increasingly common for API responses and web data:

# Load from JSON
df = pd.read_json('data.json')

# Specify orientation (how data is structured)
df = pd.read_json('data.json', orient='records')  # List of dictionaries
df = pd.read_json('data.json', orient='columns')  # Dictionary of columns

# Parse dates
df = pd.read_json('data.json', convert_dates=['date_column'])

Loading from SQL Databases

For larger datasets stored in databases:

import sqlite3
from sqlalchemy import create_engine

# Using SQLite
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM users', conn)
conn.close()

# Using SQLAlchemy (works with multiple databases)
engine = create_engine('sqlite:///database.db')
df = pd.read_sql_table('users', engine)

# With a query
df = pd.read_sql_query('SELECT name, age, salary FROM users WHERE age > 25', engine)

Loading from Multiple Files

Often you need to combine data from multiple files:

import glob

# Load and concatenate multiple CSV files
files = glob.glob('data/*.csv')
dfs = [pd.read_csv(file) for file in files]
df = pd.concat(dfs, ignore_index=True)

# Load with progress tracking for large operations
from tqdm import tqdm
dfs = [pd.read_csv(file) for file in tqdm(files)]
df = pd.concat(dfs, ignore_index=True)

Part 2: Inspecting Your Data

Before cleaning, always inspect your data to understand what you’re working with:

import pandas as pd

# Load sample data
df = pd.read_csv('employees.csv')

# Get basic information
print(df.head())           # First 5 rows
print(df.tail())           # Last 5 rows
print(df.shape)            # (rows, columns)
print(df.info())           # Data types and missing values
print(df.describe())       # Statistical summary

# Check for missing values
print(df.isnull().sum())   # Count of missing values per column
print(df.isnull().sum() / len(df) * 100)  # Percentage missing

# Check for duplicates
print(df.duplicated().sum())  # Count of duplicate rows
print(df.duplicated(subset=['email']).sum())  # Duplicates in specific column

# Get unique values
print(df['department'].unique())
print(df['department'].nunique())

# Check data types
print(df.dtypes)

Example output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   name        950 non-null     object
 1   age         980 non-null     float64
 2   salary      1000 non-null    int64
 3   department  990 non-null     object
 4   hire_date   1000 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 39.1 KB

Part 3: Handling Missing Values

Missing values are one of the most common data quality issues. Here’s how to handle them:

Identifying Missing Values

# Check for missing values
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Visualize missing data
import matplotlib.pyplot as plt
df.isnull().sum().plot(kind='bar')
plt.title('Missing Values by Column')
plt.show()

# Check for specific patterns
print(df[df['age'].isnull()])  # Rows with missing age

Removing Missing Values

# Remove rows with any missing values
df_clean = df.dropna()

# Remove rows where specific columns are missing
df_clean = df.dropna(subset=['age', 'salary'])

# Remove columns with too many missing values
df_clean = df.dropna(axis=1, thresh=int(len(df) * 0.5))  # thresh must be an integer: keep columns with at least 50% non-null values

# Remove rows where all values are missing
df_clean = df.dropna(how='all')

Filling Missing Values

Filling (imputation) is often better than removing data:

# Fill with a constant value
df['age'] = df['age'].fillna(0)
df['department'] = df['department'].fillna('Unknown')

# Forward fill (use previous value)
# Note: fillna(method='ffill') is deprecated in recent pandas; use .ffill()
df['department'] = df['department'].ffill()

# Backward fill (use next value)
df['department'] = df['department'].bfill()

# Fill with statistical measures
df['age'] = df['age'].fillna(df['age'].mean())
df['salary'] = df['salary'].fillna(df['salary'].median())

# Fill with group-specific values
df['age'] = df.groupby('department')['age'].transform(
    lambda x: x.fillna(x.mean())
)

# Fill with interpolation (useful for time series)
df['value'] = df['value'].interpolate(method='linear')

Advanced Missing Value Handling

# Use different strategies for different columns
fill_values = {
    'age': df['age'].mean(),
    'salary': df['salary'].median(),
    'department': 'Unknown'
}
df = df.fillna(fill_values)

# Use scikit-learn for sophisticated imputation
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
df[['age', 'salary']] = imputer.fit_transform(df[['age', 'salary']])

# KNN imputation for better results
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df[['age', 'salary']] = imputer.fit_transform(df[['age', 'salary']])

Part 4: Handling Duplicates

Duplicate records can skew your analysis:

# Identify duplicates
print(df.duplicated().sum())  # Total duplicates
print(df.duplicated(subset=['email']).sum())  # Duplicates in specific column

# View duplicate rows
print(df[df.duplicated(subset=['email'], keep=False)])

# Remove duplicates
df_clean = df.drop_duplicates()

# Remove duplicates keeping first occurrence
df_clean = df.drop_duplicates(subset=['email'], keep='first')

# Remove duplicates keeping last occurrence
df_clean = df.drop_duplicates(subset=['email'], keep='last')

# Remove duplicates based on multiple columns
df_clean = df.drop_duplicates(subset=['first_name', 'last_name', 'email'])

Practical example:

# Before cleaning
print(f"Original rows: {len(df)}")
print(f"Duplicate emails: {df.duplicated(subset=['email']).sum()}")

# Remove duplicates
df_clean = df.drop_duplicates(subset=['email'], keep='first')

# After cleaning
print(f"Cleaned rows: {len(df_clean)}")
print(f"Rows removed: {len(df) - len(df_clean)}")

Part 5: Fixing Data Types

Incorrect data types can cause problems in analysis and modeling:

Identifying Data Type Issues

# Check current data types
print(df.dtypes)

# Check for columns that should be numeric but aren't
print(df['age'].dtype)  # Should be int or float
print(df['salary'].dtype)  # Should be float

# Check for columns that should be categorical
print(df['department'].dtype)  # Could be category

Converting Data Types

# Convert to numeric
df['age'] = pd.to_numeric(df['age'], errors='coerce')  # Invalid values become NaN
df['salary'] = df['salary'].astype('float64')

# Convert to integer
df['count'] = df['count'].astype('int32')

# Convert to string
df['zip_code'] = df['zip_code'].astype('string')

# Convert to categorical (saves memory)
df['department'] = df['department'].astype('category')
df['status'] = pd.Categorical(df['status'], categories=['active', 'inactive', 'pending'])

# Convert to datetime
df['hire_date'] = pd.to_datetime(df['hire_date'])
df['birth_date'] = pd.to_datetime(df['birth_date'], format='%d/%m/%Y')

# Convert to boolean
df['is_active'] = df['is_active'].astype('bool')

Handling Type Conversion Errors

# Coerce invalid values to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Raise an error on invalid values
try:
    df['age'] = pd.to_numeric(df['age'], errors='raise')
except ValueError as e:
    print(f"Conversion error: {e}")

# Note: errors='ignore' (keep original values on failure) is deprecated in
# recent pandas versions; prefer errors='coerce' and handle the NaNs explicitly

Part 6: Standardizing and Formatting Data

Inconsistent formatting causes problems in analysis:

String Standardization

# Convert to lowercase
df['department'] = df['department'].str.lower()

# Convert to uppercase
df['status'] = df['status'].str.upper()

# Title case
df['name'] = df['name'].str.title()

# Remove whitespace
df['department'] = df['department'].str.strip()

# Replace values
df['status'] = df['status'].str.replace('Y', 'Yes').str.replace('N', 'No')

# Remove special characters
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)

# Standardize phone numbers
df['phone'] = df['phone'].str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True)

Numeric Standardization

# Round to specific decimal places
df['salary'] = df['salary'].round(2)

# Convert to percentage
df['growth_rate'] = (df['growth_rate'] * 100).round(2)

# Normalize to 0-1 range
df['normalized_score'] = (df['score'] - df['score'].min()) / (df['score'].max() - df['score'].min())

# Standardize (z-score normalization)
df['standardized_score'] = (df['score'] - df['score'].mean()) / df['score'].std()

Date Standardization

# Extract components from dates
df['year'] = df['hire_date'].dt.year
df['month'] = df['hire_date'].dt.month
df['day'] = df['hire_date'].dt.day
df['day_of_week'] = df['hire_date'].dt.day_name()

# Calculate approximate age in whole years from birth date
df['age'] = (pd.Timestamp.now() - df['birth_date']).dt.days // 365

# Format dates for display
df['formatted_date'] = df['hire_date'].dt.strftime('%Y-%m-%d')

Part 7: Handling Outliers and Invalid Values

Outliers can skew analysis and modeling:

# Identify outliers using IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print(f"Found {len(outliers)} outliers")

# Remove outliers
df_clean = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# Cap outliers instead of removing
df['salary'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)

# Identify outliers using z-score
import numpy as np
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary']))
outliers = df[z_scores > 3]

# Remove rows with invalid values
df_clean = df[df['age'] > 0]  # Age must be positive
df_clean = df_clean[df_clean['salary'] > 0]  # Salary must be positive
df_clean = df_clean[df_clean['age'] < 120]  # Age must be reasonable

Part 8: Practical Data Cleaning Workflow

Let’s put it all together with a complete example:

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('raw_data.csv')

print("=== INITIAL INSPECTION ===")
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nDuplicates: {df.duplicated().sum()}")

# Step 1: Remove duplicates
print("\n=== REMOVING DUPLICATES ===")
df = df.drop_duplicates(subset=['email'], keep='first')
print(f"Shape after removing duplicates: {df.shape}")

# Step 2: Handle missing values
print("\n=== HANDLING MISSING VALUES ===")
df['age'] = df['age'].fillna(df['age'].median())
df['department'] = df['department'].fillna('Unknown')
df = df.dropna(subset=['email'])  # Email is critical
print(f"Missing values after handling:\n{df.isnull().sum()}")

# Step 3: Fix data types
print("\n=== FIXING DATA TYPES ===")
df['age'] = pd.to_numeric(df['age'], errors='coerce').astype('int32')
df['salary'] = pd.to_numeric(df['salary'], errors='coerce').astype('float32')
df['hire_date'] = pd.to_datetime(df['hire_date'])
df['department'] = df['department'].astype('category')
print(f"Data types after conversion:\n{df.dtypes}")

# Step 4: Standardize strings
print("\n=== STANDARDIZING STRINGS ===")
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.strip().str.lower()
# .str methods return an object Series, so re-apply the category dtype afterwards
df['department'] = df['department'].str.strip().str.title().astype('category')

# Step 5: Handle outliers
print("\n=== HANDLING OUTLIERS ===")
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_count = len(df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)])
print(f"Outliers found: {outliers_count}")
df = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# Step 6: Validate data
print("\n=== VALIDATION ===")
print(f"Age range: {df['age'].min()} - {df['age'].max()}")
print(f"Salary range: ${df['salary'].min():.2f} - ${df['salary'].max():.2f}")
print(f"Departments: {df['department'].unique()}")

print("\n=== FINAL RESULT ===")
print(f"Final shape: {df.shape}")
print(f"Rows removed: {len(pd.read_csv('raw_data.csv')) - len(df)}")
print(f"\nFirst few rows:\n{df.head()}")

# Save cleaned data
df.to_csv('cleaned_data.csv', index=False)
print("\nCleaned data saved to 'cleaned_data.csv'")

Part 9: Performance Considerations

When working with large datasets, performance matters:

# Use appropriate data types to save memory
df['age'] = df['age'].astype('int32')  # Instead of int64
df['score'] = df['score'].astype('float32')  # Instead of float64
df['category'] = df['category'].astype('category')  # Instead of object

# Check memory usage
print(df.memory_usage(deep=True))
print(f"Total memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Load only needed columns
df = pd.read_csv('large_file.csv', usecols=['name', 'age', 'salary'])

# Process in chunks for very large files
chunks = []
for chunk in pd.read_csv('huge_file.csv', chunksize=10000):
    # Clean each chunk
    chunk = chunk.dropna()
    chunk['age'] = pd.to_numeric(chunk['age'], errors='coerce')
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

# Use efficient operations
# Good: vectorized operations
df['new_col'] = df['col1'] + df['col2']

# Avoid: loops
# for i in range(len(df)):
#     df.loc[i, 'new_col'] = df.loc[i, 'col1'] + df.loc[i, 'col2']

Part 10: Common Data Cleaning Scenarios

Scenario 1: E-commerce Order Data

# Load order data
df = pd.read_csv('orders.csv')

# Clean product names
df['product_name'] = df['product_name'].str.strip().str.title()

# Convert price to numeric (strip the literal '$' prefix)
df['price'] = df['price'].str.replace('$', '', regex=False).astype('float')

# Parse order date
df['order_date'] = pd.to_datetime(df['order_date'])

# Handle missing quantities
df['quantity'] = df['quantity'].fillna(1).astype('int32')

# Remove invalid orders
df = df[df['price'] > 0]
df = df[df['quantity'] > 0]

# Remove duplicate orders (same customer, same product, same date)
df = df.drop_duplicates(subset=['customer_id', 'product_id', 'order_date'])

print(f"Cleaned {len(df)} orders")

Scenario 2: Customer Contact Data

# Load customer data
df = pd.read_csv('customers.csv')

# Standardize names
df['first_name'] = df['first_name'].str.strip().str.title()
df['last_name'] = df['last_name'].str.strip().str.title()

# Standardize email
df['email'] = df['email'].str.strip().str.lower()

# Clean phone numbers
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
df['phone'] = df['phone'].str.replace(r'(\d{3})(\d{3})(\d{4})', r'(\1) \2-\3', regex=True)

# Remove duplicate emails
df = df.drop_duplicates(subset=['email'], keep='first')

# Remove rows with missing critical fields
df = df.dropna(subset=['email', 'phone'])

print(f"Cleaned {len(df)} customer records")

Scenario 3: Time Series Data

# Load sensor data
df = pd.read_csv('sensor_data.csv')

# Parse timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Sort by timestamp
df = df.sort_values('timestamp')

# Handle missing values with interpolation
df['temperature'] = df['temperature'].interpolate(method='linear')

# Remove outliers
Q1 = df['temperature'].quantile(0.25)
Q3 = df['temperature'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['temperature'] >= Q1 - 1.5*IQR) & (df['temperature'] <= Q3 + 1.5*IQR)]

# Remove duplicates (same sensor, same timestamp)
df = df.drop_duplicates(subset=['sensor_id', 'timestamp'])

print(f"Cleaned {len(df)} sensor readings")

Best Practices Summary

Here are key best practices for data loading and cleaning:

  1. Always inspect first: Use head(), info(), and describe() before cleaning
  2. Document your process: Keep track of what you cleaned and why
  3. Validate assumptions: Check that your data makes sense after cleaning
  4. Handle missing data thoughtfully: Removal vs. imputation depends on context
  5. Use appropriate data types: Saves memory and improves performance
  6. Standardize formats: Consistent formatting prevents errors
  7. Remove duplicates carefully: Understand what constitutes a duplicate
  8. Test on samples first: Before processing huge files, test on a subset
  9. Keep original data: Always work on a copy, preserve the raw data
  10. Log your changes: Track how many rows/values you modified
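The practices above can be folded into a small, reusable cleaning routine. The function below is a minimal sketch, assuming the column names (`email`, `age`, `department`) used in the examples throughout this post; adapt the steps and columns to your own schema.

```python
import pandas as pd

def clean_employees(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the core cleaning steps from this post and log what changed."""
    df = raw.copy()  # work on a copy, preserve the raw data (practice 9)
    start = len(df)

    # Remove duplicates and rows missing critical fields (practices 7, 4)
    df = df.drop_duplicates(subset=['email'], keep='first')
    df = df.dropna(subset=['email'])

    # Coerce types and impute remaining gaps (practices 5, 4)
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    df['age'] = df['age'].fillna(df['age'].median())

    # Standardize formats (practice 6)
    df['department'] = df['department'].fillna('Unknown').str.strip().str.title()

    # Log the changes (practice 10)
    print(f"Removed {start - len(df)} of {start} rows")
    return df

# Quick check on a tiny in-memory frame
raw = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', None, 'b@x.com'],
    'age': [30, 30, 25, None],
    'department': [' sales', 'sales', 'hr', None],
})
clean = clean_employees(raw)
print(clean)
```

Wrapping the steps in one function keeps the workflow testable and repeatable, and makes it easy to log exactly how many rows each run removed.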

Conclusion

Data loading and cleaning is a critical skill that directly impacts the quality of your analysis and models. While it may seem tedious, investing time in proper data preparation pays dividends throughout your project.

Key Takeaways:

  • Pandas provides flexible tools for loading data from multiple sources
  • Thorough data inspection is the first step in any cleaning workflow
  • Missing values, duplicates, and incorrect types are common issues with practical solutions
  • Standardizing and formatting data prevents errors in analysis
  • A systematic approach to cleaning ensures consistency and reproducibility
  • Performance considerations matter when working with large datasets

The techniques you’ve learned here form the foundation of professional data work. Start applying these practices to your datasets, and you’ll quickly develop the intuition for identifying and fixing data quality issues. Remember: clean data is the foundation of good analysis.

