Pandas Data Manipulation: Filtering, Sorting, and Merging Datasets
Introduction
Data manipulation is at the heart of every data analysis project. Whether you’re preparing data for analysis, cleaning datasets, or combining information from multiple sources, you’ll rely on three fundamental operations: filtering, sorting, and merging.
These operations are so common that mastering them will dramatically improve your productivity as a data professional. In fact, you’ll likely use these techniques in nearly every project you work on. The ability to efficiently filter large datasets, organize data in meaningful ways, and combine information from multiple sources is what separates competent data professionals from exceptional ones.
In this post, we’ll explore each of these operations in depth, starting with practical examples and building toward more complex scenarios. By the end, you’ll have a comprehensive toolkit for manipulating data with pandas.
Part 1: Filtering Data
Filtering is the process of selecting rows from a DataFrame based on specific conditions. It’s one of the most frequently used operations in data analysis.
Boolean Indexing Basics
The foundation of filtering in pandas is boolean indexing. This involves creating a boolean mask (a Series of True/False values) and using it to select rows:
import pandas as pd
# Create sample sales data
sales_data = {
'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
'price': [1200, 25, 75, 350, 150],
'quantity_sold': [5, 150, 80, 20, 60],
'region': ['North', 'South', 'North', 'East', 'West']
}
df = pd.DataFrame(sales_data)
# Create a boolean mask
expensive = df['price'] > 100
print(expensive)
# Output:
# 0 True
# 1 False
# 2 False
# 3 True
# 4 True
# Name: price, dtype: bool
# Use the mask to filter
expensive_products = df[expensive]
print(expensive_products)
Output:
product price quantity_sold region
0 Laptop 1200 5 North
3 Monitor 350 20 East
4 Headphones 150 60 West
You can also filter directly without creating an intermediate variable:
# Filter in one line
expensive_products = df[df['price'] > 100]
# Filter for specific value
north_region = df[df['region'] == 'North']
# Filter for cheap products
cheap_products = df[df['price'] < 100]
Multiple Conditions with Logical Operators
Often you need to filter based on multiple conditions. Use & (AND), | (OR), and ~ (NOT) operators:
# AND condition: expensive AND high sales
high_value = df[(df['price'] > 100) & (df['quantity_sold'] > 50)]
print(high_value)
# OR condition: cheap OR low sales
low_priority = df[(df['price'] < 100) | (df['quantity_sold'] < 30)]
print(low_priority)
# NOT condition: not in North region
not_north = df[~(df['region'] == 'North')]
print(not_north)
# Complex conditions
result = df[((df['price'] > 100) & (df['region'] == 'North')) |
(df['quantity_sold'] > 100)]
print(result)
Important: Always use &, |, and ~ for boolean operations on Series, and wrap each condition in parentheses. Python's and, or, and not try to evaluate the truth of an entire Series, which is ambiguous and raises a ValueError.
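To see why this matters, here is a short sketch (using a DataFrame shaped like the sales data above) that shows `and` failing where `&` succeeds:

```python
import pandas as pd

df = pd.DataFrame({
    'price': [1200, 25, 75, 350, 150],
    'quantity_sold': [5, 150, 80, 20, 60],
})

# Python's `and` forces bool(Series), which pandas refuses
raised = False
try:
    df[(df['price'] > 100) and (df['quantity_sold'] > 50)]
except ValueError:
    raised = True
print('`and` raised ValueError:', raised)

# `&` performs element-wise logic instead
both = df[(df['price'] > 100) & (df['quantity_sold'] > 50)]
print(both)
```

Only the row with price 150 and quantity 60 satisfies both conditions here.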
Using .loc[] for Label-Based Filtering
The .loc[] method allows you to filter using labels (column names and index values):
# Filter rows where price > 100
expensive = df.loc[df['price'] > 100]
# Filter specific columns for rows where price > 100
expensive_names = df.loc[df['price'] > 100, 'product']
print(expensive_names)
# Filter multiple columns
expensive_info = df.loc[df['price'] > 100, ['product', 'price']]
print(expensive_info)
# Filter with multiple conditions
result = df.loc[(df['price'] > 100) & (df['region'] == 'North'),
['product', 'price', 'region']]
print(result)
Using .iloc[] for Position-Based Filtering
The .iloc[] method filters based on row and column positions (integers):
# Get first 3 rows
first_three = df.iloc[0:3]
print(first_three)
# Get rows 1 and 3, columns 0 and 2
subset = df.iloc[[1, 3], [0, 2]]
print(subset)
# Get all rows, specific columns by position
products_and_prices = df.iloc[:, [0, 1]]
print(products_and_prices)
# Combine with boolean indexing: .iloc does not accept a boolean
# Series directly, so convert it to a NumPy array (or just use .loc)
expensive_mask = df['price'] > 100
expensive_products = df.iloc[expensive_mask.to_numpy()]
print(expensive_products)
String-Based Filtering
Filtering text columns requires special string methods:
# Exact match
north_products = df[df['region'] == 'North']
# Contains substring
contains_phone = df[df['product'].str.contains('phone', case=False)]
print(contains_phone)
# Starts with
starts_with_m = df[df['product'].str.startswith('M')]
print(starts_with_m)
# Ends with
ends_with_s = df[df['product'].str.endswith('s')]
print(ends_with_s)
# String length
long_names = df[df['product'].str.len() > 6]
print(long_names)
# Is in list
regions = ['North', 'South']
filtered = df[df['region'].isin(regions)]
print(filtered)
Using .query() for Complex Conditions
The .query() method allows you to write filter conditions as strings, which can be more readable:
# Simple condition
expensive = df.query('price > 100')
print(expensive)
# Multiple conditions
result = df.query('price > 100 and quantity_sold > 50')
print(result)
# Using variables
min_price = 100
max_price = 500
result = df.query('@min_price < price < @max_price')
print(result)
# Complex query
result = df.query('(price > 100 and region == "North") or quantity_sold > 100')
print(result)
Filtering with .isin() and .between()
These methods provide convenient ways to filter for multiple values or ranges:
# Filter for specific values
regions = ['North', 'East']
filtered = df[df['region'].isin(regions)]
print(filtered)
# Filter for values NOT in list
filtered = df[~df['region'].isin(['South', 'West'])]
print(filtered)
# Filter for values in range
mid_range = df[df['price'].between(100, 500)]
print(mid_range)
# Exclude range
outside_range = df[~df['price'].between(100, 500)]
print(outside_range)
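In pandas 1.3 and later, .between() also accepts an inclusive argument ('both', 'left', 'right', or 'neither') to control whether the endpoints count as matches. A quick sketch:

```python
import pandas as pd

prices = pd.Series([100, 150, 350, 500, 1200])

# inclusive='both' is the default; 'neither' excludes both endpoints
strict = prices[prices.between(100, 500, inclusive='neither')]
print(strict)
```

Here 100 and 500 are dropped, leaving only 150 and 350.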
Part 2: Sorting Data
Sorting organizes your data in a meaningful order, making it easier to identify patterns and trends.
Sorting by Single Column
The .sort_values() method sorts a DataFrame by one or more columns:
# Sort by price in ascending order (default)
sorted_by_price = df.sort_values('price')
print(sorted_by_price)
# Sort by price in descending order
sorted_by_price_desc = df.sort_values('price', ascending=False)
print(sorted_by_price_desc)
# Sort by product name alphabetically
sorted_by_name = df.sort_values('product')
print(sorted_by_name)
Sorting by Multiple Columns
You can sort by multiple columns to create a hierarchy:
# Create a more complex dataset
sales_data = {
'region': ['North', 'North', 'South', 'South', 'East', 'East'],
'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse', 'Laptop', 'Mouse'],
'price': [1200, 25, 1200, 25, 1200, 25],
'quantity_sold': [5, 150, 8, 120, 3, 180]
}
df = pd.DataFrame(sales_data)
# Sort by region first, then by quantity_sold
sorted_df = df.sort_values(['region', 'quantity_sold'], ascending=[True, False])
print(sorted_df)
Output:
region product price quantity_sold
5 East Mouse 25 180
4 East Laptop 1200 3
1 North Mouse 25 150
0 North Laptop 1200 5
3 South Mouse 25 120
2 South Laptop 1200 8
Handling Missing Values During Sorting
By default, missing values are placed at the end. You can control this behavior:
# Create data with missing values
data = {
'name': ['Alice', 'Bob', None, 'Diana'],
'score': [85, None, 92, 88]
}
df = pd.DataFrame(data)
# Default: NaN values at the end
sorted_df = df.sort_values('score')
print(sorted_df)
# Put NaN values first
sorted_df = df.sort_values('score', na_position='first')
print(sorted_df)
# Put NaN values last (explicit)
sorted_df = df.sort_values('score', na_position='last')
print(sorted_df)
In-Place vs. Returning New DataFrames
By default, .sort_values() returns a new DataFrame. You can modify the original:
# Default: returns new DataFrame, original unchanged
sorted_df = df.sort_values('price')
print("Original df:")
print(df)
print("\nSorted df:")
print(sorted_df)
# In-place: modifies original DataFrame
df.sort_values('price', inplace=True)
print("\nAfter inplace sort:")
print(df)
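One gotcha worth knowing: with inplace=True the method returns None, so assigning its result throws the sorted data away. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'price': [75, 25, 1200]})

# inplace operations return None -- do not assign the result
result = df.sort_values('price', inplace=True)
print(result)   # None
print(df)       # df itself is now sorted
```

If you want the sorted frame in a new variable, drop inplace=True and assign normally.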
Sorting by Index
You can also sort by the index:
# Create DataFrame with custom index
df = pd.DataFrame(
{'value': [10, 20, 30, 40]},
index=['d', 'b', 'a', 'c']
)
# Sort by index
sorted_by_index = df.sort_index()
print(sorted_by_index)
# Sort by index descending
sorted_by_index_desc = df.sort_index(ascending=False)
print(sorted_by_index_desc)
Sorting by Custom Order
Sometimes you need to sort by a specific order that’s not alphabetical or numerical:
# Create data with categorical values
df = pd.DataFrame({
'priority': ['High', 'Low', 'Medium', 'High', 'Low'],
'task': ['A', 'B', 'C', 'D', 'E']
})
# Define custom order
priority_order = ['High', 'Medium', 'Low']
# Convert to categorical with specific order
df['priority'] = pd.Categorical(df['priority'],
categories=priority_order,
ordered=True)
# Sort by the categorical column
sorted_df = df.sort_values('priority')
print(sorted_df)
Output:
priority task
0 High A
3 High D
2 Medium C
1 Low B
4 Low E
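If you'd rather not convert the column to a categorical, sort_values also accepts a key function (available in pandas 1.1+) that is applied to the column before comparing. A sketch using a rank mapping:

```python
import pandas as pd

df = pd.DataFrame({
    'priority': ['High', 'Low', 'Medium', 'High', 'Low'],
    'task': ['A', 'B', 'C', 'D', 'E']
})

# Map each label to a numeric rank, then sort by that rank
rank = {'High': 0, 'Medium': 1, 'Low': 2}
sorted_df = df.sort_values('priority', key=lambda s: s.map(rank))
print(sorted_df)
```

The sort is stable, so ties keep their original order (A before D, B before E).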
Part 3: Merging Datasets
Merging combines data from multiple DataFrames based on common columns or indices. This is essential when working with data from different sources.
Understanding Join Types
Before merging, it’s important to understand the different types of joins:
Left DataFrame: Right DataFrame:
key value_left key value_right
A 1 A x
B 2 B y
C 3 D z
INNER JOIN (only matching keys):
key value_left value_right
A 1 x
B 2 y
LEFT JOIN (all from left, matching from right):
key value_left value_right
A 1 x
B 2 y
C 3 NaN
RIGHT JOIN (all from right, matching from left):
key value_left value_right
A 1 x
B 2 y
D NaN z
OUTER JOIN (all from both):
key value_left value_right
A 1 x
B 2 y
C 3 NaN
D NaN z
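The diagram above can be reproduced directly in code. Passing indicator=True to merge adds a _merge column recording whether each row came from the left frame, the right frame, or both, which is handy for auditing a join:

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value_left': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'value_right': ['x', 'y', 'z']})

# how= selects the join type; indicator=True tags each row's origin
inner = pd.merge(left, right, on='key', how='inner')
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(inner)
print(outer)
```

The inner join keeps only keys A and B; the outer join keeps all four keys, with C flagged as left_only and D as right_only.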
Basic Merge Operations
The .merge() method is the primary way to combine DataFrames:
# Create sample datasets
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'city': ['NYC', 'LA', 'Chicago', 'Boston']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104],
'customer_id': [1, 2, 1, 3],
'amount': [100, 250, 150, 300]
})
# Inner join (default): only matching customer_ids
merged = pd.merge(customers, orders, on='customer_id')
print("Inner Join:")
print(merged)
Output:
customer_id name city order_id amount
0 1 Alice NYC 101 100
1 1 Alice NYC 103 150
2 2 Bob LA 102 250
3 3 Charlie Chicago 104 300
Different Join Types
# Left join: all customers, matching orders
left_merge = pd.merge(customers, orders, on='customer_id', how='left')
print("Left Join:")
print(left_merge)
# Right join: all orders, matching customers
right_merge = pd.merge(customers, orders, on='customer_id', how='right')
print("Right Join:")
print(right_merge)
# Outer join: all from both DataFrames
outer_merge = pd.merge(customers, orders, on='customer_id', how='outer')
print("Outer Join:")
print(outer_merge)
Merging on Multiple Keys
When you need to match on multiple columns:
# Create datasets with composite keys
sales_2023 = pd.DataFrame({
'region': ['North', 'North', 'South', 'South'],
'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse'],
'sales_2023': [100, 200, 150, 180]
})
sales_2024 = pd.DataFrame({
'region': ['North', 'North', 'South', 'South'],
'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse'],
'sales_2024': [120, 210, 160, 190]
})
# Merge on multiple columns
merged = pd.merge(sales_2023, sales_2024,
on=['region', 'product'])
print(merged)
Output:
region product sales_2023 sales_2024
0 North Laptop 100 120
1 North Mouse 200 210
2 South Laptop 150 160
3 South Mouse 180 190
Merging on Different Column Names
When the join columns have different names in each DataFrame:
# Create datasets with different column names
customers = pd.DataFrame({
'cust_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})
orders = pd.DataFrame({
'customer_id': [1, 2, 1, 3],
'amount': [100, 250, 150, 300]
})
# Merge using left_on and right_on
merged = pd.merge(customers, orders,
left_on='cust_id',
right_on='customer_id')
print(merged)
# Drop the redundant column
merged = merged.drop('customer_id', axis=1)
print(merged)
Handling Duplicate Column Names
When both DataFrames have columns with the same name (other than the join key):
# Create datasets with overlapping column names
df1 = pd.DataFrame({
'key': [1, 2, 3],
'value': [10, 20, 30],
'source': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
'key': [1, 2, 3],
'value': [100, 200, 300],
'source': ['X', 'Y', 'Z']
})
# Merge with suffixes for duplicate columns
merged = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
print(merged)
Output:
key value_left source_left value_right source_right
0 1 10 A 100 X
1 2 20 B 200 Y
2 3 30 C 300 Z
Using .join() Method
The .join() method is similar to .merge() but joins on indices by default:
# Create DataFrames with indices
df1 = pd.DataFrame(
{'value1': [10, 20, 30]},
index=['A', 'B', 'C']
)
df2 = pd.DataFrame(
{'value2': [100, 200, 300]},
index=['A', 'B', 'C']
)
# Join on index
joined = df1.join(df2)
print(joined)
# Join with different index
df3 = pd.DataFrame(
{'value3': [1000, 2000]},
index=['A', 'B']
)
joined = df1.join(df3, how='left')
print(joined)
Using .concat() for Combining DataFrames
The .concat() function combines DataFrames along rows or columns:
# Create sample DataFrames
df1 = pd.DataFrame({
'name': ['Alice', 'Bob'],
'score': [85, 90]
})
df2 = pd.DataFrame({
'name': ['Charlie', 'Diana'],
'score': [88, 92]
})
# Concatenate vertically (stack rows)
combined = pd.concat([df1, df2], ignore_index=True)
print("Vertical concatenation:")
print(combined)
# Concatenate horizontally (side by side)
df3 = pd.DataFrame({
'grade': ['A', 'A', 'B', 'A']
})
combined = pd.concat([combined, df3], axis=1)
print("\nHorizontal concatenation:")
print(combined)
Practical Merging Example
Let’s work through a realistic scenario:
# Sales data from different regions
north_sales = pd.DataFrame({
'product_id': [1, 2, 3],
'product_name': ['Laptop', 'Mouse', 'Keyboard'],
'north_sales': [50, 200, 150]
})
south_sales = pd.DataFrame({
'product_id': [1, 2, 3],
'product_name': ['Laptop', 'Mouse', 'Keyboard'],
'south_sales': [45, 180, 140]
})
# Product information
products = pd.DataFrame({
'product_id': [1, 2, 3],
'category': ['Electronics', 'Accessories', 'Accessories'],
'price': [1200, 25, 75]
})
# Merge sales data
sales = pd.merge(north_sales, south_sales, on=['product_id', 'product_name'])
print("Combined sales:")
print(sales)
# Merge with product information
final = pd.merge(sales, products, on='product_id')
print("\nFinal report:")
print(final)
# Calculate total sales
final['total_sales'] = final['north_sales'] + final['south_sales']
final['revenue'] = final['total_sales'] * final['price']
print("\nWith revenue:")
print(final[['product_name', 'total_sales', 'price', 'revenue']])
Part 4: Combining Operations
In real-world scenarios, you’ll often combine filtering, sorting, and merging:
# Create sample data
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'city': ['NYC', 'LA', 'Chicago', 'Boston', 'NYC']
})
orders = pd.DataFrame({
'order_id': [101, 102, 103, 104, 105, 106],
'customer_id': [1, 2, 1, 3, 2, 4],
'amount': [100, 250, 150, 300, 200, 175],
'date': pd.to_datetime(['2024-01-15', '2024-01-20', '2024-02-10',
'2024-02-15', '2024-03-01', '2024-03-10'])
})
# Step 1: Merge customers and orders
merged = pd.merge(customers, orders, on='customer_id')
# Step 2: Filter for orders over $150
high_value = merged[merged['amount'] > 150]
# Step 3: Sort by amount descending
result = high_value.sort_values('amount', ascending=False)
print("High-value orders (>$150), sorted by amount:")
print(result[['name', 'city', 'amount', 'date']])
# Alternative: chain operations
result = (pd.merge(customers, orders, on='customer_id')
.query('amount > 150')
.sort_values('amount', ascending=False)
[['name', 'city', 'amount', 'date']])
print("\nUsing method chaining:")
print(result)
Best Practices
Here are key best practices for data manipulation:
1. Understand Your Data First
# Always inspect before manipulating
print(df.head())
print(df.info())
print(df.describe())
2. Use Appropriate Methods
# Use .query() for readability with complex conditions
result = df.query('price > 100 and region == "North"')
# Use .isin() for multiple values
result = df[df['region'].isin(['North', 'South'])]
# Use .between() for ranges
result = df[df['price'].between(100, 500)]
3. Be Explicit About Join Types
# Always specify how='inner', how='left', etc.
# Don't rely on defaults
merged = pd.merge(df1, df2, on='key', how='inner')
4. Handle Duplicates in Merge Keys
# Check for duplicates before merging
print(df1['key'].duplicated().sum())
print(df2['key'].duplicated().sum())
# If duplicates exist, understand the result
merged = pd.merge(df1, df2, on='key')
print(f"Original rows: {len(df1)}, {len(df2)}")
print(f"Merged rows: {len(merged)}")
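Beyond counting rows manually, modern pandas can enforce the relationship you expect via merge's validate argument, which raises a MergeError when the assumption is violated. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 2], 'a': [10, 20, 30]})  # key 2 duplicated
df2 = pd.DataFrame({'key': [1, 2], 'b': ['x', 'y']})

# many_to_one allows duplicates on the left, but not the right
ok = pd.merge(df1, df2, on='key', validate='many_to_one')
print(ok)

# one_to_one rejects the duplicate left keys
rejected = False
try:
    pd.merge(df1, df2, on='key', validate='one_to_one')
except pd.errors.MergeError:
    rejected = True
print('one_to_one rejected:', rejected)
```

Making the expected cardinality explicit turns a silent row explosion into a loud, immediate error.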
5. Use Method Chaining for Clarity
# Chain operations for readable code
result = (df
.query('price > 100')
.sort_values('quantity', ascending=False)
[['product', 'price', 'quantity']])
6. Preserve Original Data
# Work on copies when needed
df_filtered = df[df['price'] > 100].copy()
df_filtered['new_column'] = 'value'
# Original df is unchanged
Common Pitfalls to Avoid
1. Using and/or Instead of &/|
# Wrong - will raise an error
result = df[df['price'] > 100 and df['quantity'] > 50]
# Correct
result = df[(df['price'] > 100) & (df['quantity'] > 50)]
2. Forgetting Parentheses in Complex Conditions
# Wrong - operator precedence issues
result = df[df['price'] > 100 & df['quantity'] > 50]
# Correct
result = df[(df['price'] > 100) & (df['quantity'] > 50)]
3. Not Checking for Duplicates Before Merging
# Check first
if df1['key'].duplicated().any():
print("Warning: Duplicates found in df1")
# Then merge
merged = pd.merge(df1, df2, on='key')
4. Assuming Column Names Match
# Wrong - assumes both have 'customer_id'
merged = pd.merge(df1, df2, on='customer_id')
# Correct - specify if names differ
merged = pd.merge(df1, df2, left_on='cust_id', right_on='customer_id')
5. Not Handling Missing Values After Merge
# Check for NaN values after merge
print(merged.isnull().sum())
# Handle appropriately
merged = merged.dropna(subset=['important_column'])
Conclusion
Filtering, sorting, and merging are the fundamental operations you’ll use repeatedly in data analysis. Mastering these techniques will make you significantly more productive and enable you to tackle complex data manipulation tasks with confidence.
Key Takeaways:
- Filtering uses boolean indexing to select rows based on conditions
- Sorting organizes data in meaningful ways using .sort_values()
- Merging combines data from multiple sources using .merge(), .join(), or .concat()
- Always understand your data before manipulating it
- Use appropriate methods for readability and performance
- Be explicit about join types and column names
- Chain operations for cleaner, more readable code
These operations form the foundation of data manipulation in pandas. As you become more comfortable with them, you’ll develop the intuition to combine them in creative ways to solve complex data problems. Practice with your own datasets, and you’ll quickly become proficient at transforming raw data into insights.
Additional Resources
- Pandas Merge Documentation
- Pandas Sort Values Documentation
- Pandas Query Documentation
- Pandas Indexing and Selection