
Pandas Data Manipulation: Filtering, Sorting, and Merging Datasets

Introduction

Data manipulation is at the heart of every data analysis project. Whether you’re preparing data for analysis, cleaning datasets, or combining information from multiple sources, you’ll rely on three fundamental operations: filtering, sorting, and merging.

These operations are so common that mastering them will dramatically improve your productivity as a data professional. You'll likely use these techniques in nearly every project you work on, and the ability to efficiently filter large datasets, organize data in meaningful ways, and combine information from multiple sources is what separates exceptional data professionals from merely competent ones.

In this post, we’ll explore each of these operations in depth, starting with practical examples and building toward more complex scenarios. By the end, you’ll have a comprehensive toolkit for manipulating data with pandas.


Part 1: Filtering Data

Filtering is the process of selecting rows from a DataFrame based on specific conditions. It’s one of the most frequently used operations in data analysis.

Boolean Indexing Basics

The foundation of filtering in pandas is boolean indexing. This involves creating a boolean mask (a Series of True/False values) and using it to select rows:

import pandas as pd

# Create sample sales data
sales_data = {
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'price': [1200, 25, 75, 350, 150],
    'quantity_sold': [5, 150, 80, 20, 60],
    'region': ['North', 'South', 'North', 'East', 'West']
}
df = pd.DataFrame(sales_data)

# Create a boolean mask
expensive = df['price'] > 100
print(expensive)
# Output:
# 0     True
# 1    False
# 2    False
# 3     True
# 4     True
# Name: price, dtype: bool

# Use the mask to filter
expensive_products = df[expensive]
print(expensive_products)

Output:

     product  price  quantity_sold region
0     Laptop   1200              5  North
3    Monitor    350             20   East
4 Headphones    150             60   West

You can also filter directly without creating an intermediate variable:

# Filter in one line
expensive_products = df[df['price'] > 100]

# Filter for specific value
north_region = df[df['region'] == 'North']

# Filter with a different comparison
cheap_products = df[df['price'] < 100]

Multiple Conditions with Logical Operators

Often you need to filter based on multiple conditions. Use & (AND), | (OR), and ~ (NOT) operators:

# AND condition: expensive AND high sales
high_value = df[(df['price'] > 100) & (df['quantity_sold'] > 50)]
print(high_value)

# OR condition: cheap OR low sales
low_priority = df[(df['price'] < 100) | (df['quantity_sold'] < 30)]
print(low_priority)

# NOT condition: not in North region
not_north = df[~(df['region'] == 'North')]
print(not_north)

# Complex conditions
result = df[((df['price'] > 100) & (df['region'] == 'North')) | 
            (df['quantity_sold'] > 100)]
print(result)

Important: Always use &, |, and ~ for boolean operations on Series, and wrap each condition in parentheses. Python's and, or, and not won't work here: they try to reduce a whole Series to a single boolean, which raises a ValueError ("The truth value of a Series is ambiguous").

Using .loc[] for Label-Based Filtering

The .loc[] method allows you to filter using labels (column names and index values):

# Filter rows where price > 100
expensive = df.loc[df['price'] > 100]

# Filter specific columns for rows where price > 100
expensive_names = df.loc[df['price'] > 100, 'product']
print(expensive_names)

# Filter multiple columns
expensive_info = df.loc[df['price'] > 100, ['product', 'price']]
print(expensive_info)

# Filter with multiple conditions
result = df.loc[(df['price'] > 100) & (df['region'] == 'North'), 
                ['product', 'price', 'region']]
print(result)

Using .iloc[] for Position-Based Filtering

The .iloc[] method filters based on row and column positions (integers):

# Get first 3 rows
first_three = df.iloc[0:3]
print(first_three)

# Get rows 1 and 3, columns 0 and 2
subset = df.iloc[[1, 3], [0, 2]]
print(subset)

# Get all rows, specific columns by position
products_and_prices = df.iloc[:, [0, 1]]
print(products_and_prices)

# Combine with boolean indexing: .iloc does not accept a boolean
# Series directly, so convert the mask to a NumPy array first
# (or simply use .loc / plain [] with the Series mask)
expensive_mask = df['price'] > 100
expensive_products = df.iloc[expensive_mask.to_numpy(), :]
print(expensive_products)

String-Based Filtering

Filtering text columns requires special string methods:

# Exact match
north_products = df[df['region'] == 'North']

# Contains substring
contains_phone = df[df['product'].str.contains('phone', case=False)]
print(contains_phone)

# Starts with
starts_with_m = df[df['product'].str.startswith('M')]
print(starts_with_m)

# Ends with
ends_with_s = df[df['product'].str.endswith('s')]
print(ends_with_s)

# String length
long_names = df[df['product'].str.len() > 6]
print(long_names)

# Is in list
regions = ['North', 'South']
filtered = df[df['region'].isin(regions)]
print(filtered)
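One caveat with the string methods above: they return NaN for missing values, and a mask containing NaN cannot be used for boolean indexing. Passing na=False tells pandas to treat missing entries as non-matches. A minimal sketch, using a small hypothetical frame with a missing product name:

```python
import pandas as pd

# A column with a missing value: str methods would return NaN for it
items = pd.DataFrame({'product': ['Laptop', None, 'Headphones']})

# na=False treats the missing entry as False instead of NaN,
# so the mask is safe to use for filtering
matches = items[items['product'].str.contains('phone', case=False, na=False)]
print(matches)
```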

Using .query() for Complex Conditions

The .query() method allows you to write filter conditions as strings, which can be more readable:

# Simple condition
expensive = df.query('price > 100')
print(expensive)

# Multiple conditions
result = df.query('price > 100 and quantity_sold > 50')
print(result)

# Using variables
min_price = 100
max_price = 500
result = df.query('@min_price < price < @max_price')
print(result)

# Complex query
result = df.query('(price > 100 and region == "North") or quantity_sold > 100')
print(result)

Filtering with .isin() and .between()

These methods provide convenient ways to filter for multiple values or ranges:

# Filter for specific values
regions = ['North', 'East']
filtered = df[df['region'].isin(regions)]
print(filtered)

# Filter for values NOT in list
filtered = df[~df['region'].isin(['South', 'West'])]
print(filtered)

# Filter for values in range
mid_range = df[df['price'].between(100, 500)]
print(mid_range)

# Exclude range
outside_range = df[~df['price'].between(100, 500)]
print(outside_range)
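Note that .between() includes both endpoints by default. Since pandas 1.3 it also accepts an inclusive parameter ('both', 'neither', 'left', or 'right') to control whether the bounds themselves match. A quick sketch:

```python
import pandas as pd

prices = pd.Series([100, 250, 500])

# Default: both endpoints are included, so all three values match
print(prices.between(100, 500))

# inclusive='neither' excludes both bounds (pandas >= 1.3)
strict = prices[prices.between(100, 500, inclusive='neither')]
print(strict)
```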

Part 2: Sorting Data

Sorting organizes your data in a meaningful order, making it easier to identify patterns and trends.

Sorting by Single Column

The .sort_values() method sorts a DataFrame by one or more columns:

# Sort by price in ascending order (default)
sorted_by_price = df.sort_values('price')
print(sorted_by_price)

# Sort by price in descending order
sorted_by_price_desc = df.sort_values('price', ascending=False)
print(sorted_by_price_desc)

# Sort by product name alphabetically
sorted_by_name = df.sort_values('product')
print(sorted_by_name)

Sorting by Multiple Columns

You can sort by multiple columns to create a hierarchy:

# Create a more complex dataset
sales_data = {
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse', 'Laptop', 'Mouse'],
    'price': [1200, 25, 1200, 25, 1200, 25],
    'quantity_sold': [5, 150, 8, 120, 3, 180]
}
df = pd.DataFrame(sales_data)

# Sort by region first, then by quantity_sold
sorted_df = df.sort_values(['region', 'quantity_sold'], ascending=[True, False])
print(sorted_df)

Output:

  region   product  price  quantity_sold
4   East    Laptop   1200              3
5   East     Mouse     25            180
0  North    Laptop   1200              5
1  North     Mouse     25            150
2  South    Laptop   1200              8
3  South     Mouse     25            120

Handling Missing Values During Sorting

By default, missing values are placed at the end. You can control this behavior:

# Create data with missing values
data = {
    'name': ['Alice', 'Bob', None, 'Diana'],
    'score': [85, None, 92, 88]
}
df = pd.DataFrame(data)

# Default: NaN values at the end
sorted_df = df.sort_values('score')
print(sorted_df)

# Put NaN values first
sorted_df = df.sort_values('score', na_position='first')
print(sorted_df)

# Put NaN values last (explicit)
sorted_df = df.sort_values('score', na_position='last')
print(sorted_df)

In-Place vs. Returning New DataFrames

By default, .sort_values() returns a new DataFrame. You can modify the original:

# Default: returns new DataFrame, original unchanged
sorted_df = df.sort_values('price')
print("Original df:")
print(df)
print("\nSorted df:")
print(sorted_df)

# In-place: modifies original DataFrame
df.sort_values('price', inplace=True)
print("\nAfter inplace sort:")
print(df)

Sorting by Index

You can also sort by the index:

# Create DataFrame with custom index
df = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=['d', 'b', 'a', 'c']
)

# Sort by index
sorted_by_index = df.sort_index()
print(sorted_by_index)

# Sort by index descending
sorted_by_index_desc = df.sort_index(ascending=False)
print(sorted_by_index_desc)

Sorting by Custom Order

Sometimes you need to sort by a specific order that’s not alphabetical or numerical:

# Create data with categorical values
df = pd.DataFrame({
    'priority': ['High', 'Low', 'Medium', 'High', 'Low'],
    'task': ['A', 'B', 'C', 'D', 'E']
})

# Define custom order
priority_order = ['High', 'Medium', 'Low']

# Convert to categorical with specific order
df['priority'] = pd.Categorical(df['priority'], 
                                categories=priority_order, 
                                ordered=True)

# Sort by the categorical column
sorted_df = df.sort_values('priority')
print(sorted_df)

Output:

  priority task
0     High    A
3     High    D
2   Medium    C
1      Low    B
4      Low    E
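If you'd rather not convert the column to a Categorical, .sort_values() also accepts a key function (pandas 1.1+) that maps the column to sortable values on the fly. A sketch using a hypothetical rank mapping:

```python
import pandas as pd

df = pd.DataFrame({
    'priority': ['High', 'Low', 'Medium', 'High', 'Low'],
    'task': ['A', 'B', 'C', 'D', 'E']
})

# Map each label to a numeric rank; lower rank sorts first
rank = {'High': 0, 'Medium': 1, 'Low': 2}

# key= receives the whole Series and must return one of equal length
sorted_df = df.sort_values('priority', key=lambda s: s.map(rank))
print(sorted_df)
```

This keeps the column as plain strings, which can be simpler when the custom order is only needed for one sort.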

Part 3: Merging Datasets

Merging combines data from multiple DataFrames based on common columns or indices. This is essential when working with data from different sources.

Understanding Join Types

Before merging, it’s important to understand the different types of joins:

Left DataFrame:        Right DataFrame:
key  value_left        key  value_right
A    1                 A    x
B    2                 B    y
C    3                 D    z

INNER JOIN (only matching keys):
key  value_left  value_right
A    1           x
B    2           y

LEFT JOIN (all from left, matching from right):
key  value_left  value_right
A    1           x
B    2           y
C    3           NaN

RIGHT JOIN (all from right, matching from left):
key  value_left  value_right
A    1           x
B    2           y
D    NaN         z

OUTER JOIN (all from both):
key  value_left  value_right
A    1           x
B    2           y
C    3           NaN
D    NaN         z
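The diagram above can be reproduced directly in pandas. The sketch below loops over the four join types and passes indicator=True, which adds a _merge column reporting which side each row came from:

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'value_left': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B', 'D'], 'value_right': ['x', 'y', 'z']})

# Try each join type; indicator=True adds a _merge column showing
# whether a row matched 'both' sides, 'left_only', or 'right_only'
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left, right, on='key', how=how, indicator=True)
    print(f"\n{how.upper()} JOIN ({len(merged)} rows):")
    print(merged)
```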

Basic Merge Operations

The .merge() method is the primary way to combine DataFrames:

# Create sample datasets
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['NYC', 'LA', 'Chicago', 'Boston']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 1, 3],
    'amount': [100, 250, 150, 300]
})

# Inner join (default): only matching customer_ids
merged = pd.merge(customers, orders, on='customer_id')
print("Inner Join:")
print(merged)

Output:

   customer_id     name      city  order_id  amount
0            1    Alice      NYC       101     100
1            1    Alice      NYC       103     150
2            2      Bob       LA       102     250
3            3  Charlie  Chicago       104     300

Different Join Types

# Left join: all customers, matching orders
left_merge = pd.merge(customers, orders, on='customer_id', how='left')
print("Left Join:")
print(left_merge)

# Right join: all orders, matching customers
right_merge = pd.merge(customers, orders, on='customer_id', how='right')
print("Right Join:")
print(right_merge)

# Outer join: all from both DataFrames
outer_merge = pd.merge(customers, orders, on='customer_id', how='outer')
print("Outer Join:")
print(outer_merge)

Merging on Multiple Keys

When you need to match on multiple columns:

# Create datasets with composite keys
sales_2023 = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse'],
    'sales_2023': [100, 200, 150, 180]
})

sales_2024 = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Mouse'],
    'sales_2024': [120, 210, 160, 190]
})

# Merge on multiple columns
merged = pd.merge(sales_2023, sales_2024, 
                  on=['region', 'product'])
print(merged)

Output:

  region product  sales_2023  sales_2024
0  North  Laptop         100         120
1  North   Mouse         200         210
2  South  Laptop         150         160
3  South   Mouse         180         190

Merging on Different Column Names

When the join columns have different names in each DataFrame:

# Create datasets with different column names
customers = pd.DataFrame({
    'cust_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana']
})

orders = pd.DataFrame({
    'customer_id': [1, 2, 1, 3],
    'amount': [100, 250, 150, 300]
})

# Merge using left_on and right_on
merged = pd.merge(customers, orders, 
                  left_on='cust_id', 
                  right_on='customer_id')
print(merged)

# Drop the redundant column
merged = merged.drop('customer_id', axis=1)
print(merged)

Handling Duplicate Column Names

When both DataFrames have columns with the same name (other than the join key):

# Create datasets with overlapping column names
df1 = pd.DataFrame({
    'key': [1, 2, 3],
    'value': [10, 20, 30],
    'source': ['A', 'B', 'C']
})

df2 = pd.DataFrame({
    'key': [1, 2, 3],
    'value': [100, 200, 300],
    'source': ['X', 'Y', 'Z']
})

# Merge with suffixes for duplicate columns
merged = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
print(merged)

Output:

   key  value_left source_left  value_right source_right
0    1          10           A          100            X
1    2          20           B          200            Y
2    3          30           C          300            Z

Using .join() Method

The .join() method is similar to .merge() but joins on indices by default:

# Create DataFrames with indices
df1 = pd.DataFrame(
    {'value1': [10, 20, 30]},
    index=['A', 'B', 'C']
)

df2 = pd.DataFrame(
    {'value2': [100, 200, 300]},
    index=['A', 'B', 'C']
)

# Join on index
joined = df1.join(df2)
print(joined)

# Join with different index
df3 = pd.DataFrame(
    {'value3': [1000, 2000]},
    index=['A', 'B']
)

joined = df1.join(df3, how='left')
print(joined)

Using .concat() for Combining DataFrames

The .concat() function combines DataFrames along rows or columns:

# Create sample DataFrames
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'score': [85, 90]
})

df2 = pd.DataFrame({
    'name': ['Charlie', 'Diana'],
    'score': [88, 92]
})

# Concatenate vertically (stack rows)
combined = pd.concat([df1, df2], ignore_index=True)
print("Vertical concatenation:")
print(combined)

# Concatenate horizontally (side by side)
df3 = pd.DataFrame({
    'grade': ['A', 'A', 'B', 'A']
})

combined = pd.concat([combined, df3], axis=1)
print("\nHorizontal concatenation:")
print(combined)
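When concatenating vertically, the keys argument labels each source frame with a hierarchical index, which makes it easy to trace rows back to their origin. A brief sketch reusing two frames like the ones above:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [85, 90]})
df2 = pd.DataFrame({'name': ['Charlie', 'Diana'], 'score': [88, 92]})

# keys= tags each source frame, producing a two-level index
# (source label, original row number)
combined = pd.concat([df1, df2], keys=['group1', 'group2'])
print(combined)

# Select one source back out by its label
print(combined.loc['group1'])
```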

Practical Merging Example

Let’s work through a realistic scenario:

# Sales data from different regions
north_sales = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Laptop', 'Mouse', 'Keyboard'],
    'north_sales': [50, 200, 150]
})

south_sales = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Laptop', 'Mouse', 'Keyboard'],
    'south_sales': [45, 180, 140]
})

# Product information
products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'category': ['Electronics', 'Accessories', 'Accessories'],
    'price': [1200, 25, 75]
})

# Merge sales data
sales = pd.merge(north_sales, south_sales, on=['product_id', 'product_name'])
print("Combined sales:")
print(sales)

# Merge with product information
final = pd.merge(sales, products, on='product_id')
print("\nFinal report:")
print(final)

# Calculate total sales
final['total_sales'] = final['north_sales'] + final['south_sales']
final['revenue'] = final['total_sales'] * final['price']
print("\nWith revenue:")
print(final[['product_name', 'total_sales', 'price', 'revenue']])

Part 4: Combining Operations

In real-world scenarios, you’ll often combine filtering, sorting, and merging:

# Create sample data
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'Boston', 'NYC']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'customer_id': [1, 2, 1, 3, 2, 4],
    'amount': [100, 250, 150, 300, 200, 175],
    'date': pd.to_datetime(['2024-01-15', '2024-01-20', '2024-02-10', 
                            '2024-02-15', '2024-03-01', '2024-03-10'])
})

# Step 1: Merge customers and orders
merged = pd.merge(customers, orders, on='customer_id')

# Step 2: Filter for orders over $150
high_value = merged[merged['amount'] > 150]

# Step 3: Sort by amount descending
result = high_value.sort_values('amount', ascending=False)

print("High-value orders (>$150), sorted by amount:")
print(result[['name', 'city', 'amount', 'date']])

# Alternative: chain operations
result = (pd.merge(customers, orders, on='customer_id')
          .query('amount > 150')
          .sort_values('amount', ascending=False)
          [['name', 'city', 'amount', 'date']])
print("\nUsing method chaining:")
print(result)

Best Practices

Here are key best practices for data manipulation:

1. Understand Your Data First

# Always inspect before manipulating
print(df.head())
print(df.info())
print(df.describe())

2. Use Appropriate Methods

# Use .query() for readability with complex conditions
result = df.query('price > 100 and region == "North"')

# Use .isin() for multiple values
result = df[df['region'].isin(['North', 'South'])]

# Use .between() for ranges
result = df[df['price'].between(100, 500)]

3. Be Explicit About Join Types

# Always specify how='inner', how='left', etc.
# Don't rely on defaults
merged = pd.merge(df1, df2, on='key', how='inner')

4. Handle Duplicates in Merge Keys

# Check for duplicates before merging
print(df1['key'].duplicated().sum())
print(df2['key'].duplicated().sum())

# If duplicates exist, understand the result
merged = pd.merge(df1, df2, on='key')
print(f"Original rows: {len(df1)}, {len(df2)}")
print(f"Merged rows: {len(merged)}")
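Beyond counting rows manually, .merge() accepts a validate argument that asserts the key relationship you expect ('one_to_one', 'one_to_many', 'many_to_one', or 'many_to_many') and raises a MergeError if the data violates it. A sketch with hypothetical customer and order frames:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'customer_id': [1, 1, 2],
                       'amount': [100, 150, 250]})

# validate='one_to_many' asserts customer_id is unique on the left;
# the merge proceeds because that assumption holds
merged = pd.merge(customers, orders, on='customer_id',
                  how='inner', validate='one_to_many')
print(merged)

# A 'one_to_one' check fails here, since orders repeats customer_id 1
try:
    pd.merge(customers, orders, on='customer_id', validate='one_to_one')
except pd.errors.MergeError as exc:
    print("Merge validation failed:", exc)
```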

5. Use Method Chaining for Clarity

# Chain operations for readable code
result = (df
          .query('price > 100')
          .sort_values('quantity', ascending=False)
          [['product', 'price', 'quantity']])

6. Preserve Original Data

# Work on copies when needed
df_filtered = df[df['price'] > 100].copy()
df_filtered['new_column'] = 'value'
# Original df is unchanged

Common Pitfalls to Avoid

1. Using and/or Instead of &/|

# Wrong - will raise an error
result = df[df['price'] > 100 and df['quantity'] > 50]

# Correct
result = df[(df['price'] > 100) & (df['quantity'] > 50)]

2. Forgetting Parentheses in Complex Conditions

# Wrong - operator precedence issues
result = df[df['price'] > 100 & df['quantity'] > 50]

# Correct
result = df[(df['price'] > 100) & (df['quantity'] > 50)]

3. Not Checking for Duplicates Before Merging

# Check first
if df1['key'].duplicated().any():
    print("Warning: Duplicates found in df1")

# Then merge
merged = pd.merge(df1, df2, on='key')

4. Assuming Column Names Match

# Wrong - assumes both have 'customer_id'
merged = pd.merge(df1, df2, on='customer_id')

# Correct - specify if names differ
merged = pd.merge(df1, df2, left_on='cust_id', right_on='customer_id')

5. Not Handling Missing Values After Merge

# Check for NaN values after merge
print(merged.isnull().sum())

# Handle appropriately
merged = merged.dropna(subset=['important_column'])
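To see which rows actually failed to match, rather than just counting NaNs, indicator=True adds a _merge column you can filter on before deciding how to handle the gaps. A sketch with hypothetical frames:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'customer_id': [1, 2],
                       'amount': [100, 250]})

# _merge marks each row as 'both', 'left_only', or 'right_only',
# so unmatched customers can be inspected directly
merged = pd.merge(customers, orders, on='customer_id',
                  how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched)
```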

Conclusion

Filtering, sorting, and merging are the fundamental operations you’ll use repeatedly in data analysis. Mastering these techniques will make you significantly more productive and enable you to tackle complex data manipulation tasks with confidence.

Key Takeaways:

  • Filtering uses boolean indexing to select rows based on conditions
  • Sorting organizes data in meaningful ways using .sort_values()
  • Merging combines data from multiple sources using .merge(), .join(), or .concat()
  • Always understand your data before manipulating it
  • Use appropriate methods for readability and performance
  • Be explicit about join types and column names
  • Chain operations for cleaner, more readable code

These operations form the foundation of data manipulation in pandas. As you become more comfortable with them, you’ll develop the intuition to combine them in creative ways to solve complex data problems. Practice with your own datasets, and you’ll quickly become proficient at transforming raw data into insights.

