Regular expressions are one of the most powerful tools in a programmer’s toolkit. They allow you to find patterns in text, extract data, and transform strings with precision. Yet many developers use regex only occasionally, treating it as a mysterious black box.
Python’s re module provides three fundamental operations for working with regular expressions: matching, searching, and substitution. Understanding the differences between these operations and when to use each one is the key to writing effective regex code.
This guide explores all three operations with practical examples you can use immediately in your projects.
What Are Regular Expressions?
A regular expression (regex) is a pattern that describes a set of strings. It’s a language for matching text. For example:
- \d+ matches one or more digits
- [a-z]+ matches one or more lowercase letters
- \w+@\w+\.\w+ matches a simple email pattern
Regular expressions are incredibly useful for:
- Validation: Checking if input matches expected format
- Extraction: Finding specific patterns in text
- Transformation: Replacing or reformatting text
- Parsing: Breaking down structured text
Understanding the Three Core Operations
Before diving into code, let’s understand the conceptual difference between the three operations:
- Matching: Check if a pattern matches at the beginning of a string
- Searching: Find a pattern anywhere in a string
- Substitution: Replace patterns with new text
This distinction is crucial and often confuses beginners.
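To make the distinction concrete, here is a minimal sketch that runs all three operations with the same pattern on one string. The sample sentence is made up for illustration.

```python
import re

text = "Order 66 shipped on day 12"

# match() only succeeds if the pattern starts at position 0
at_start = re.match(r'\d+', text)
print(at_start)                 # None: the string starts with "Order"

# search() scans forward and stops at the first hit
first_hit = re.search(r'\d+', text)
print(first_hit.group())        # 66

# sub() rewrites every non-overlapping hit
masked = re.sub(r'\d+', '#', text)
print(masked)                   # Order # shipped on day #
```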
Part 1: Matching with match()
What is Matching?
The match() function checks if a pattern matches at the beginning of a string. It returns a match object if successful, or None if the pattern doesn’t match.
Basic Matching Example
import re
# Pattern: starts with one or more digits
pattern = r'^\d+'
# Test strings
text1 = "123 Main Street"
text2 = "Main Street 123"
# match() checks from the beginning
match1 = re.match(pattern, text1)
match2 = re.match(pattern, text2)
print(f"Text 1 matches: {match1 is not None}") # Output: True
print(f"Text 2 matches: {match2 is not None}") # Output: False
# Access matched text
if match1:
    print(f"Matched: {match1.group()}") # Output: Matched: 123
Practical Example: Validating Phone Numbers
import re
def validate_phone(phone):
    """Validate US phone number format: (123) 456-7890"""
    pattern = r'^\(\d{3}\) \d{3}-\d{4}$'
    return re.match(pattern, phone) is not None
# Test cases
phones = [
    "(555) 123-4567", # Valid
    "555-123-4567", # Invalid format
    "(555) 1234567", # Invalid format
    "(555) 123-456", # Invalid format
]
for phone in phones:
    result = "Valid" if validate_phone(phone) else "Invalid"
    print(f"{phone}: {result}")
# Output:
# (555) 123-4567: Valid
# 555-123-4567: Invalid
# (555) 1234567: Invalid
# (555) 123-456: Invalid
Practical Example: Extracting Data with Groups
import re
def parse_date(date_string):
    """Extract date components using groups"""
    pattern = r'^(\d{4})-(\d{2})-(\d{2})$'
    match = re.match(pattern, date_string)
    if match:
        year, month, day = match.groups()
        return {
            'year': int(year),
            'month': int(month),
            'day': int(day)
        }
    return None
# Test cases
dates = [
    "2025-12-16",
    "2025-13-01", # Month 13 is out of range, but the pattern only checks digit counts
    "25-12-16", # Wrong format
]
for date in dates:
    result = parse_date(date)
    if result:
        print(f"{date}: {result}")
    else:
        print(f"{date}: Invalid format")
# Output:
# 2025-12-16: {'year': 2025, 'month': 12, 'day': 16}
# 2025-13-01: {'year': 2025, 'month': 13, 'day': 1}
# 25-12-16: Invalid format
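Note that 2025-13-01 is accepted above, because the pattern checks the shape of the string and not the calendar. One possible way to close that gap, sketched here under the assumption that a standard-library datetime.date is an acceptable return type, is to let the date constructor reject impossible values. The function name parse_valid_date is hypothetical.

```python
import re
from datetime import date

def parse_valid_date(date_string):
    """Return a datetime.date, or None if the format or the calendar value is wrong."""
    match = re.match(r'^(\d{4})-(\d{2})-(\d{2})$', date_string)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # date() raises ValueError for month 13, April 31, and similar values
        return date(year, month, day)
    except ValueError:
        return None

print(parse_valid_date("2025-12-16"))  # 2025-12-16
print(parse_valid_date("2025-13-01"))  # None
print(parse_valid_date("25-12-16"))    # None
```

The regex handles the format and ordinary Python handles the range check, which tends to be easier to read than encoding calendar rules in the pattern itself.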
Key Points About match()
- Matches only at the beginning of the string
- Returns a match object or None
- Use group() to get the matched text
- Use groups() to get captured groups
- Always use raw strings (r"") for regex patterns
Part 2: Searching with search() and findall()
What is Searching?
The search() function finds a pattern anywhere in a string, not just at the beginning. The findall() function finds all occurrences of a pattern.
Basic Searching Example
import re
# Pattern: email-like format
pattern = r'\w+@\w+\.\w+'
text = "Contact us at [email protected] or [email protected]"
# search() finds first occurrence
match = re.search(pattern, text)
if match:
    print(f"Found: {match.group()}") # Output: Found: [email protected]
# findall() finds all occurrences
matches = re.findall(pattern, text)
print(f"All matches: {matches}")
# Output: All matches: ['[email protected]', '[email protected]']
Practical Example: Extracting Email Addresses
import re
def extract_emails(text):
    """Extract all email addresses from text"""
    # Note: inside a character class "|" is a literal pipe, so the TLD class is [A-Za-z]
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.findall(pattern, text)
# Test text
text = """
Please send feedback to [email protected] or [email protected].
For urgent issues, contact [email protected].
"""
emails = extract_emails(text)
print("Extracted emails:")
for email in emails:
    print(f" - {email}")
# Output:
# Extracted emails:
# - [email protected]
# - [email protected]
# - [email protected]
Practical Example: Finding URLs
import re
def extract_urls(text):
    """Extract all URLs from text"""
    pattern = r'https?://[^\s]+'
    return re.findall(pattern, text)
# Test text
text = """
Check out https://www.example.com for more info.
Visit https://github.com/user/repo for the code.
Also see http://old-site.com (deprecated).
"""
urls = extract_urls(text)
print("Found URLs:")
for url in urls:
    print(f" - {url}")
# Output:
# Found URLs:
# - https://www.example.com
# - https://github.com/user/repo
# - http://old-site.com
Practical Example: Extracting Data with Groups
import re
def extract_log_entries(log_text):
    """Extract timestamp, level, and message from log entries"""
    pattern = r'\[(\d{2}:\d{2}:\d{2})\] (ERROR|WARNING|INFO): (.+)'
    matches = re.findall(pattern, log_text)
    entries = []
    for time, level, message in matches:
        entries.append({
            'time': time,
            'level': level,
            'message': message
        })
    return entries
# Test log
log_text = """
[10:30:45] INFO: Application started
[10:31:12] WARNING: High memory usage detected
[10:32:00] ERROR: Database connection failed
[10:32:15] INFO: Attempting reconnection
"""
entries = extract_log_entries(log_text)
for entry in entries:
    print(f"{entry['time']} [{entry['level']}] {entry['message']}")
# Output:
# 10:30:45 [INFO] Application started
# 10:31:12 [WARNING] High memory usage detected
# 10:32:00 [ERROR] Database connection failed
# 10:32:15 [INFO] Attempting reconnection
Key Points About search() and findall()
- search() finds the first occurrence anywhere in the string
- findall() finds all occurrences and returns a list
- With multiple groups, findall() returns a list of tuples (with a single group, a list of that group's strings)
- Use search() when you need just one match
- Use findall() when you need all matches
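findall() returns plain strings or tuples, so match positions are lost. When positions matter, re.finditer() is a related function in the same module that yields full match objects instead. A minimal sketch, with a made-up sample sentence:

```python
import re

text = "Error at line 14, warning at line 203"

# finditer() yields match objects lazily, so groups and positions stay available
hits = [(m.group(1), m.span()) for m in re.finditer(r'line (\d+)', text)]

for number, span in hits:
    print(number, span)
# 14 (9, 16)
# 203 (29, 37)
```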
Part 3: Substitution with sub() and subn()
What is Substitution?
The sub() function replaces all occurrences of a pattern with a replacement string. The subn() function does the same but also returns the number of replacements made.
Basic Substitution Example
import re
# Pattern: one or more whitespace characters
pattern = r'\s+'
text = "This   has    irregular     spacing"
# sub() replaces all occurrences
cleaned = re.sub(pattern, ' ', text)
print(f"Original: '{text}'")
print(f"Cleaned: '{cleaned}'")
# Output:
# Original: 'This   has    irregular     spacing'
# Cleaned: 'This has irregular spacing'
# subn() returns (new_string, count)
cleaned, count = re.subn(pattern, ' ', text)
print(f"Replacements made: {count}") # Output: Replacements made: 3
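Both sub() and subn() also accept an optional count argument that caps the number of replacements. The sketch below uses a made-up string to show the difference between the default and count=1.

```python
import re

text = "a-b-c-d"

# count=1 replaces only the first occurrence; the default (0) replaces all
first_only = re.sub(r'-', '+', text, count=1)
replace_all = re.sub(r'-', '+', text)

print(first_only)   # a+b-c-d
print(replace_all)  # a+b+c+d
```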
Practical Example: Cleaning User Input
import re
def clean_username(username):
    """Clean username: remove special characters, convert to lowercase"""
    # Remove anything that's not alphanumeric or underscore
    cleaned = re.sub(r'[^a-zA-Z0-9_]', '', username)
    # Convert to lowercase
    cleaned = cleaned.lower()
    # Remove leading underscores
    cleaned = re.sub(r'^_+', '', cleaned)
    return cleaned
# Test cases
usernames = [
    "John_Doe!",
    "___admin___",
    "User@123",
    "Test-User",
]
for username in usernames:
    cleaned = clean_username(username)
    print(f"'{username}' -> '{cleaned}'")
# Output:
# 'John_Doe!' -> 'john_doe'
# '___admin___' -> 'admin___' (only leading underscores are removed)
# 'User@123' -> 'user123'
# 'Test-User' -> 'testuser'
Practical Example: Formatting Phone Numbers
import re
def format_phone(phone):
    """Format phone number to (XXX) XXX-XXXX"""
    # Remove all non-digits
    digits = re.sub(r'\D', '', phone)
    # Check if we have exactly 10 digits
    if len(digits) != 10:
        return None
    # Format using groups
    pattern = r'(\d{3})(\d{3})(\d{4})'
    replacement = r'(\1) \2-\3'
    return re.sub(pattern, replacement, digits)
# Test cases
phones = [
    "5551234567",
    "555-123-4567",
    "(555) 123-4567",
    "555.123.4567",
    "123", # Too short
]
for phone in phones:
    formatted = format_phone(phone)
    print(f"{phone:20} -> {formatted}")
# Output:
# 5551234567           -> (555) 123-4567
# 555-123-4567         -> (555) 123-4567
# (555) 123-4567       -> (555) 123-4567
# 555.123.4567         -> (555) 123-4567
# 123                  -> None
Practical Example: Using Functions in Substitution
import re
def convert_markdown_links(text):
    """Convert [text](url) to HTML <a> tags"""
    pattern = r'\[([^\]]+)\]\(([^\)]+)\)'
    def replace_link(match):
        # Called once per match; the return value becomes the replacement
        link_text = match.group(1)
        url = match.group(2)
        return f'<a href="{url}">{link_text}</a>'
    return re.sub(pattern, replace_link, text)
# Test markdown
markdown = """
Check out [Python Docs](https://docs.python.org) for more info.
Visit [GitHub](https://github.com) to share code.
"""
html = convert_markdown_links(markdown)
print(html)
# Output:
# Check out <a href="https://docs.python.org">Python Docs</a> for more info.
# Visit <a href="https://github.com">GitHub</a> to share code.
Key Points About sub() and subn()
- sub() replaces all occurrences and returns the new string
- subn() returns a tuple: (new_string, count)
- Use backreferences (\1, \2) to reference captured groups
- Use a function as the replacement for complex logic
- The function receives a match object and returns the replacement string
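Backreferences can also be written by name. If the pattern uses named groups, the replacement string can refer to them with \g<name>, which tends to stay readable as patterns grow. A small sketch with a made-up input string:

```python
import re

# Reorder an ISO date into DD/MM/YYYY using named groups
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
reordered = re.sub(pattern, r'\g<day>/\g<month>/\g<year>', "Due 2025-12-16")

print(reordered)  # Due 16/12/2025
```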
Comparison: When to Use Each Operation
| Operation | Function | Use Case | Returns |
|---|---|---|---|
| Matching | match() | Check if pattern matches at start of string | Match object or None |
| Searching | search() | Find first occurrence anywhere in string | Match object or None |
| Finding All | findall() | Find all occurrences in string | List of matches |
| Substitution | sub() | Replace all occurrences | New string |
| Substitution | subn() | Replace all occurrences and count | Tuple (string, count) |
Decision Tree
Do you need to:
├─ Check if pattern matches at START of string?
│  └─ Use match()
├─ Find pattern ANYWHERE in string?
│  ├─ Need just the first match?
│  │  └─ Use search()
│  └─ Need all matches?
│     └─ Use findall()
└─ Replace patterns with new text?
   ├─ Need to know how many replacements?
   │  └─ Use subn()
   └─ Just replace?
      └─ Use sub()
Common Regex Patterns
Here are some useful patterns for common tasks:
import re
# Email validation
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# URL validation
url_pattern = r'^https?://[^\s/$.?#].[^\s]*$'
# Phone number (loose: optional "+", optional leading 1, then 9-15 digits)
phone_pattern = r'^\+?1?\d{9,15}$'
# Date (YYYY-MM-DD)
date_pattern = r'^\d{4}-\d{2}-\d{2}$'
# Hex color
color_pattern = r'^#[0-9a-fA-F]{6}$'
# IPv4 address (shape only; octets above 255 still pass)
ipv4_pattern = r'^(\d{1,3}\.){3}\d{1,3}$'
# Username (alphanumeric and underscore, 3-20 chars)
username_pattern = r'^[a-zA-Z0-9_]{3,20}$'
# Password (at least 8 chars, uppercase, lowercase, digit)
password_pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$'
# Test patterns
test_cases = {
    'email': ('[email protected]', email_pattern),
    'url': ('https://example.com', url_pattern),
    'phone': ('+1234567890', phone_pattern),
    'date': ('2025-12-16', date_pattern),
    'color': ('#FF5733', color_pattern),
    'ipv4': ('192.168.1.1', ipv4_pattern),
    'username': ('john_doe', username_pattern),
    'password': ('SecurePass123', password_pattern),
}
for name, (test_str, pattern) in test_cases.items():
    match = re.match(pattern, test_str)
    result = "Valid" if match else "Invalid"
    print(f"{name:12} '{test_str:20}' -> {result}")
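Several of these patterns check only the shape of the input. The IPv4 pattern, for instance, accepts 999.999.999.999. One way to tighten it, sketched here with a hypothetical helper named is_valid_ipv4, is to capture each octet and check its range in Python rather than in the regex.

```python
import re

def is_valid_ipv4(address):
    """Check the dotted-quad shape with a regex, then the 0-255 range with int()."""
    match = re.fullmatch(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})', address)
    if not match:
        return False
    return all(0 <= int(octet) <= 255 for octet in match.groups())

print(is_valid_ipv4("192.168.1.1"))      # True
print(is_valid_ipv4("999.999.999.999"))  # False
print(is_valid_ipv4("192.168.1"))        # False
```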
Best Practices and Common Pitfalls
Best Practice 1: Use Raw Strings
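# Wrong: Python processes backslash escapes in regular strings
pattern = '\d+' # Still works, but '\d' is an invalid escape (SyntaxWarning in Python 3.12+)
pattern = '\bcat\b' # Silently broken: '\b' is a backspace character here, not a word boundary
# Correct: Raw string preserves backslashes
pattern = r'\d+' # This is the regex pattern for digits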
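The risk is easiest to see with an escape that Python does recognize, such as \b. The following sketch compares the two spellings; the sample sentence is made up.

```python
import re

print(len('\b'))    # 1: a single backspace character
print(len(r'\b'))   # 2: a backslash followed by "b"

# Without the raw prefix the regex engine never sees a word boundary
plain_result = re.search('\bcat\b', "a cat sat")
raw_result = re.search(r'\bcat\b', "a cat sat")

print(plain_result)        # None
print(raw_result.group())  # cat
```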
Best Practice 2: Compile Patterns for Reuse
import re
# Good: Compile once, use many times
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
emails = [
    "[email protected]",
    "[email protected]",
    "[email protected]"
]
for email in emails:
    if email_pattern.match(email):
        print(f"Valid: {email}")
# Less efficient: re.match() has to look the pattern up in the module's cache on every call
for email in emails:
    if re.match(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', email):
        print(f"Valid: {email}")
Best Practice 3: Use Named Groups for Clarity
import re
# Good: Named groups are self-documenting
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.match(pattern, '2025-12-16')
if match:
    date_dict = match.groupdict()
    print(f"Year: {date_dict['year']}")
    print(f"Month: {date_dict['month']}")
    print(f"Day: {date_dict['day']}")
# Less clear: Numbered groups
pattern = r'(\d{4})-(\d{2})-(\d{2})'
match = re.match(pattern, '2025-12-16')
if match:
    year, month, day = match.groups()
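Named groups pair well with the re.VERBOSE flag, which lets a pattern span several lines and carry its own comments. This is a sketch of the same date pattern in that style. The layout is a matter of taste, not a requirement.

```python
import re

date_pattern = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
""", re.VERBOSE)

match = date_pattern.match("2025-12-16")
print(match.groupdict())  # {'year': '2025', 'month': '12', 'day': '16'}
```

In verbose mode, whitespace in the pattern is ignored, so a literal space has to be escaped or placed in a character class.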
Pitfall 1: Forgetting the ^ and $ Anchors
import re
# Problem: Matches partial strings
pattern = r'\d{3}-\d{4}'
text = "Call 555-1234 now"
print(re.search(pattern, text)) # Matches!
# Solution: Use anchors for full string match
pattern = r'^\d{3}-\d{4}$'
print(re.match(pattern, text)) # None
print(re.match(pattern, "555-1234")) # Matches!
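An alternative to writing the anchors by hand is re.fullmatch(), which succeeds only if the whole string matches. It is also slightly stricter than $, because $ still matches just before a trailing newline. A brief sketch:

```python
import re

pattern = r'\d{3}-\d{4}'

exact = re.fullmatch(pattern, "555-1234")
embedded = re.fullmatch(pattern, "Call 555-1234 now")
with_newline = re.fullmatch(pattern, "555-1234\n")
anchored_newline = re.match(r'^\d{3}-\d{4}$', "555-1234\n")

print(exact is not None)             # True
print(embedded)                      # None
print(with_newline)                  # None
print(anchored_newline is not None)  # True: $ matches before the final newline
```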
Pitfall 2: Greedy vs Non-Greedy Matching
import re
text = "<p>Hello</p> and <p>World</p>"
# Greedy: Matches too much
pattern = r'<p>.*</p>'
print(re.findall(pattern, text))
# Output: ['<p>Hello</p> and <p>World</p>']
# Non-greedy: Matches correctly
pattern = r'<p>.*?</p>'
print(re.findall(pattern, text))
# Output: ['<p>Hello</p>', '<p>World</p>']
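Another common way to avoid over-matching is a negated character class, which states exactly what the inner text may not contain. The sketch below assumes the paragraphs hold no nested tags. For real HTML, a proper parser is usually the safer choice.

```python
import re

text = "<p>Hello</p> and <p>World</p>"

# [^<]* cannot run past the next "<", so each match stays inside one paragraph
contents = re.findall(r'<p>([^<]*)</p>', text)
print(contents)  # ['Hello', 'World']
```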
Pitfall 3: Not Escaping Special Characters
import re
# Problem: . matches any character
pattern = r'file.txt'
print(re.search(pattern, 'file.txt')) # Matches
print(re.search(pattern, 'filextxt')) # Also matches!
# Solution: Escape special characters
pattern = r'file\.txt'
print(re.search(pattern, 'file.txt')) # Matches
print(re.search(pattern, 'filextxt')) # None
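When the literal text comes from a variable, such as user input, escaping by hand is error-prone. re.escape() adds the backslashes for you. In this sketch the filename is a made-up example.

```python
import re

filename = "report.v2(final).txt"   # hypothetical user-supplied text
pattern = re.escape(filename)       # escapes ".", "(" and ")"

literal_hit = re.search(pattern, "saved report.v2(final).txt")
lookalike = re.search(pattern, "saved reportXv2(final)Xtxt")

print(literal_hit is not None)  # True
print(lookalike)                # None
```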
Conclusion
Regular expressions are powerful tools for text processing. Understanding the three core operations (matching, searching, and substitution) gives you the foundation to solve most regex problems:
- Use match() when you need to validate that a string starts with a specific pattern
- Use search() when you need to find a pattern anywhere in a string
- Use findall() when you need to extract all occurrences of a pattern
- Use sub() when you need to replace patterns with new text
Key takeaways:
- Always use raw strings (r"") for regex patterns
- Compile patterns if you use them multiple times
- Use named groups for complex patterns
- Test your patterns thoroughly with various inputs
- Consider performance for large texts
- Document complex patterns with comments
- Use online regex testers to debug patterns
Regular expressions have a steep learning curve, but the investment pays off. Start with simple patterns, gradually build complexity, and soon you’ll be writing powerful regex solutions with confidence.