Python Regular Expressions: Matching, Searching, and Substitution

Regular expressions are one of the most powerful tools in a programmer’s toolkit. They allow you to find patterns in text, extract data, and transform strings with precision. Yet many developers use regex only occasionally, treating it as a mysterious black box.

Python’s re module provides three fundamental operations for working with regular expressions: matching, searching, and substitution. Understanding the differences between these operations and when to use each one is the key to writing effective regex code.

This guide explores all three operations with practical examples you can use immediately in your projects.

What Are Regular Expressions?

A regular expression (regex) is a pattern that describes a set of strings. It’s a language for matching text. For example:

  • \d+ matches one or more digits
  • [a-z]+ matches one or more lowercase letters
  • \w+@\w+\.\w+ matches a simple email pattern
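To see these three patterns in action, here is a minimal sketch using re.findall(), which Part 2 covers in detail. The sample string is invented for illustration.

```python
import re

# Made-up sample text for illustration
sample = "Order 66 shipped to bob@example.com in 3 days"

digits = re.findall(r'\d+', sample)
lowercase = re.findall(r'[a-z]+', sample)
emails = re.findall(r'\w+@\w+\.\w+', sample)

print(digits)         # ['66', '3']
print(lowercase[:3])  # ['rder', 'shipped', 'to']  (the uppercase O is skipped)
print(emails)         # ['bob@example.com']
```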

Regular expressions are incredibly useful for:

  • Validation: Checking if input matches expected format
  • Extraction: Finding specific patterns in text
  • Transformation: Replacing or reformatting text
  • Parsing: Breaking down structured text

Understanding the Three Core Operations

Before diving into code, let’s understand the conceptual difference between the three operations:

  • Matching: Check if a pattern matches at the beginning of a string
  • Searching: Find a pattern anywhere in a string
  • Substitution: Replace patterns with new text

This distinction is crucial and often confuses beginners.
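The difference is easiest to see by running all three operations with the same pattern on the same string. This is a small sketch; the sample text is made up.

```python
import re

text = "Invoice 2025 is due"

matched = re.match(r'\d+', text)         # the string does not START with digits
found = re.search(r'\d+', text)          # digits appear later in the string
replaced = re.sub(r'\d+', '####', text)  # every run of digits is replaced

print(matched)        # None
print(found.group())  # 2025
print(replaced)       # Invoice #### is due
```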


Part 1: Matching with match()

What is Matching?

The match() function checks if a pattern matches at the beginning of a string. It returns a match object if successful, or None if the pattern doesn’t match.

Basic Matching Example

import re

# Pattern: starts with one or more digits
pattern = r'^\d+'

# Test strings
text1 = "123 Main Street"
text2 = "Main Street 123"

# match() checks from the beginning
match1 = re.match(pattern, text1)
match2 = re.match(pattern, text2)

print(f"Text 1 matches: {match1 is not None}")  # Output: True
print(f"Text 2 matches: {match2 is not None}")  # Output: False

# Access matched text
if match1:
    print(f"Matched: {match1.group()}")  # Output: Matched: 123

Practical Example: Validating Phone Numbers

import re

def validate_phone(phone):
    """Validate US phone number format: (123) 456-7890"""
    pattern = r'^\(\d{3}\) \d{3}-\d{4}$'
    return re.match(pattern, phone) is not None

# Test cases
phones = [
    "(555) 123-4567",  # Valid
    "555-123-4567",    # Invalid format
    "(555) 1234567",   # Invalid format
    "(555) 123-456",   # Invalid format
]

for phone in phones:
    result = "Valid" if validate_phone(phone) else "Invalid"
    print(f"{phone}: {result}")

# Output:
# (555) 123-4567: Valid
# 555-123-4567: Invalid
# (555) 1234567: Invalid
# (555) 123-456: Invalid

Practical Example: Extracting Data with Groups

import re

def parse_date(date_string):
    """Extract date components using groups"""
    pattern = r'^(\d{4})-(\d{2})-(\d{2})$'
    match = re.match(pattern, date_string)
    
    if match:
        year, month, day = match.groups()
        return {
            'year': int(year),
            'month': int(month),
            'day': int(day)
        }
    return None

# Test cases
dates = [
    "2025-12-16",
    "2025-13-01",  # Invalid month, but the pattern only checks digit counts
    "25-12-16",    # Wrong format
]

for date in dates:
    result = parse_date(date)
    if result:
        print(f"{date}: {result}")
    else:
        print(f"{date}: Invalid format")

# Output:
# 2025-12-16: {'year': 2025, 'month': 12, 'day': 16}
# 2025-13-01: {'year': 2025, 'month': 13, 'day': 1}
# 25-12-16: Invalid format
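Note that the pattern accepts month 13 because it only checks the shape of the string. One possible way to reject impossible dates is to hand the captured numbers to datetime.date, which raises ValueError for them. The function name parse_valid_date below is hypothetical, chosen for this sketch.

```python
import re
from datetime import date

DATE_RE = re.compile(r'^(\d{4})-(\d{2})-(\d{2})$')

def parse_valid_date(date_string):
    """Return a datetime.date, or None if the format or the calendar date is invalid."""
    match = DATE_RE.match(date_string)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)  # raises ValueError for month 13, day 32, etc.
    except ValueError:
        return None

print(parse_valid_date("2025-12-16"))  # 2025-12-16
print(parse_valid_date("2025-13-01"))  # None
print(parse_valid_date("25-12-16"))    # None
```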

Key Points About match()

  • Matches only at the beginning of the string
  • Returns a match object or None
  • Use group() to get the matched text
  • Use groups() to get captured groups
  • Always use raw strings (r"") for regex patterns
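As a quick illustration of these points, the sketch below inspects a single match object. group() with no argument returns the whole match, group(n) returns one captured group, and span() gives the start and end positions.

```python
import re

match = re.match(r'(\d+) (\w+)', "123 Main Street")

print(match.group())   # 123 Main
print(match.group(1))  # 123
print(match.group(2))  # Main
print(match.groups())  # ('123', 'Main')
print(match.span())    # (0, 8)
```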

Part 2: Searching with search() and findall()

What is Searching?

The search() function finds a pattern anywhere in a string, not just at the beginning. The findall() function finds all occurrences of a pattern.

Basic Searching Example

import re

# Pattern: email-like format
pattern = r'\w+@\w+\.\w+'

text = "Contact us at support@example.com or sales@example.org"

# search() finds first occurrence
match = re.search(pattern, text)
if match:
    print(f"Found: {match.group()}")  # Output: Found: support@example.com

# findall() finds all occurrences
matches = re.findall(pattern, text)
print(f"All matches: {matches}")
# Output: All matches: ['support@example.com', 'sales@example.org']

Practical Example: Extracting Email Addresses

import re

def extract_emails(text):
    """Extract all email addresses from text"""
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.findall(pattern, text)

# Test text
text = """
Please send feedback to feedback@example.com or support@example.com.
For urgent issues, contact admin@example.org.
"""

emails = extract_emails(text)
print("Extracted emails:")
for email in emails:
    print(f"  - {email}")

# Output:
# Extracted emails:
#   - feedback@example.com
#   - support@example.com
#   - admin@example.org

Practical Example: Finding URLs

import re

def extract_urls(text):
    """Extract all URLs from text"""
    pattern = r'https?://[^\s]+'
    return re.findall(pattern, text)

# Test text
text = """
Check out https://www.example.com for more info.
Visit https://github.com/user/repo for the code.
Also see http://old-site.com (deprecated).
"""

urls = extract_urls(text)
print("Found URLs:")
for url in urls:
    print(f"  - {url}")

# Output:
# Found URLs:
#   - https://www.example.com
#   - https://github.com/user/repo
#   - http://old-site.com

Practical Example: Extracting Data with Groups

import re

def extract_log_entries(log_text):
    """Extract timestamp and message from log entries"""
    pattern = r'\[(\d{2}:\d{2}:\d{2})\] (ERROR|WARNING|INFO): (.+)'
    matches = re.findall(pattern, log_text)
    
    entries = []
    for time, level, message in matches:
        entries.append({
            'time': time,
            'level': level,
            'message': message
        })
    return entries

# Test log
log_text = """
[10:30:45] INFO: Application started
[10:31:12] WARNING: High memory usage detected
[10:32:00] ERROR: Database connection failed
[10:32:15] INFO: Attempting reconnection
"""

entries = extract_log_entries(log_text)
for entry in entries:
    print(f"{entry['time']} [{entry['level']}] {entry['message']}")

# Output:
# 10:30:45 [INFO] Application started
# 10:31:12 [WARNING] High memory usage detected
# 10:32:00 [ERROR] Database connection failed
# 10:32:15 [INFO] Attempting reconnection

Key Points About search() and findall()

  • search() finds the first occurrence anywhere in the string
  • findall() finds all occurrences and returns a list
  • With one capturing group, findall() returns a list of that group's text; with several groups, it returns a list of tuples
  • Use search() when you need just one match
  • Use findall() when you need all matches
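The return shape of findall() depends on how many capturing groups the pattern has, which the short sketch below demonstrates. It also shows re.finditer(), a related standard-library function not covered above, which yields match objects when you need positions as well as text.

```python
import re

text = "x=10, y=20"

no_groups = re.findall(r'\w=\d+', text)       # whole matches
one_group = re.findall(r'\w=(\d+)', text)     # just the group's text
two_groups = re.findall(r'(\w)=(\d+)', text)  # one tuple per match

print(no_groups)   # ['x=10', 'y=20']
print(one_group)   # ['10', '20']
print(two_groups)  # [('x', '10'), ('y', '20')]

# finditer() yields match objects, so start positions are available too
positions = [(m.group(1), m.start()) for m in re.finditer(r'(\w)=(\d+)', text)]
print(positions)   # [('x', 0), ('y', 6)]
```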

Part 3: Substitution with sub() and subn()

What is Substitution?

The sub() function replaces all occurrences of a pattern with a replacement string. The subn() function does the same but also returns the number of replacements made.

Basic Substitution Example

import re

# Pattern: one or more whitespace characters (spaces, tabs, newlines)
pattern = r'\s+'

text = "This   has    irregular    spacing"

# sub() replaces all occurrences
cleaned = re.sub(pattern, ' ', text)
print(f"Original: '{text}'")
print(f"Cleaned:  '{cleaned}'")

# Output:
# Original: 'This   has    irregular    spacing'
# Cleaned:  'This has irregular spacing'

# subn() returns (new_string, count)
cleaned, count = re.subn(pattern, ' ', text)
print(f"Replacements made: {count}")  # Output: Replacements made: 3

Practical Example: Cleaning User Input

import re

def clean_username(username):
    """Clean username: remove special characters, convert to lowercase"""
    # Remove anything that's not alphanumeric or underscore
    cleaned = re.sub(r'[^a-zA-Z0-9_]', '', username)
    # Convert to lowercase
    cleaned = cleaned.lower()
    # Remove leading and trailing underscores
    cleaned = re.sub(r'^_+|_+$', '', cleaned)
    return cleaned

# Test cases
usernames = [
    "John_Doe!",
    "___admin___",
    "User@123",
    "Test-User",
]

for username in usernames:
    cleaned = clean_username(username)
    print(f"'{username}' -> '{cleaned}'")

# Output:
# 'John_Doe!' -> 'john_doe'
# '___admin___' -> 'admin'
# 'User@123' -> 'user123'
# 'Test-User' -> 'testuser'

Practical Example: Formatting Phone Numbers

import re

def format_phone(phone):
    """Format phone number to (XXX) XXX-XXXX"""
    # Remove all non-digits
    digits = re.sub(r'\D', '', phone)
    
    # Check if we have exactly 10 digits
    if len(digits) != 10:
        return None
    
    # Format using groups
    pattern = r'(\d{3})(\d{3})(\d{4})'
    replacement = r'(\1) \2-\3'
    return re.sub(pattern, replacement, digits)

# Test cases
phones = [
    "5551234567",
    "555-123-4567",
    "(555) 123-4567",
    "555.123.4567",
    "123",  # Too short
]

for phone in phones:
    formatted = format_phone(phone)
    print(f"{phone:20} -> {formatted}")

# Output:
# 5551234567           -> (555) 123-4567
# 555-123-4567         -> (555) 123-4567
# (555) 123-4567       -> (555) 123-4567
# 555.123.4567         -> (555) 123-4567
# 123                  -> None

Practical Example: Using Functions in Substitution

import re

def convert_markdown_links(text):
    """Convert [text](url) to HTML <a> tags"""
    pattern = r'\[([^\]]+)\]\(([^\)]+)\)'
    
    def replace_link(match):
        # Group 1 is the link label, group 2 is the URL
        label = match.group(1)
        url = match.group(2)
        return f'<a href="{url}">{label}</a>'
    
    return re.sub(pattern, replace_link, text)

# Test markdown
markdown = """
Check out [Python Docs](https://docs.python.org) for more info.
Visit [GitHub](https://github.com) to share code.
"""

html = convert_markdown_links(markdown)
print(html)

# Output:
# Check out <a href="https://docs.python.org">Python Docs</a> for more info.
# Visit <a href="https://github.com">GitHub</a> to share code.

Key Points About sub() and subn()

  • sub() replaces all occurrences and returns the new string
  • subn() returns a tuple: (new_string, count)
  • Use backreferences (\1, \2) to reference captured groups
  • Use a function as the replacement for complex logic
  • The function receives a match object and returns the replacement string
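Two details of sub() that the examples above do not show are sketched here: the optional count argument, which limits how many replacements are made, and the \g<1> form of a backreference, which avoids ambiguity when the replacement text continues with a digit. The sample strings are made up.

```python
import re

# count=2 replaces only the first two occurrences
limited = re.sub(r'-', '+', "a-b-c-d", count=2)
print(limited)  # a+b+c-d

# \g<1>0 means "group 1, then a literal 0"; r'\10' would be read as group 10
padded = re.sub(r'(\d)', r'\g<1>0', "a1b2")
print(padded)   # a10b20
```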

Comparison: When to Use Each Operation

Operation     Function   Use Case                                     Returns
Matching      match()    Check if pattern matches at start of string  Match object or None
Searching     search()   Find first occurrence anywhere in string     Match object or None
Finding All   findall()  Find all occurrences in string               List of matches
Substitution  sub()      Replace all occurrences                      New string
Substitution  subn()     Replace all occurrences and count            Tuple (string, count)

Decision Tree

Do you need to:
├─ Check if pattern matches at START of string?
│  └─ Use match()
├─ Find pattern ANYWHERE in string?
│  ├─ Need just the first match?
│  │  └─ Use search()
│  └─ Need all matches?
│     └─ Use findall()
└─ Replace patterns with new text?
   ├─ Need to know how many replacements?
   │  └─ Use subn()
   └─ Just replace?
      └─ Use sub()

Common Regex Patterns

Here are some useful patterns for common tasks:

import re

# Email validation
email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

# URL validation
url_pattern = r'^https?://[^\s/$.?#].[^\s]*$'

# Phone number (loose international format: optional +, then 9-16 digits)
phone_pattern = r'^\+?1?\d{9,15}$'

# Date (YYYY-MM-DD)
date_pattern = r'^\d{4}-\d{2}-\d{2}$'

# Hex color
color_pattern = r'^#[0-9a-fA-F]{6}$'

# IPv4 address (format only; does not check that each octet is 0-255)
ipv4_pattern = r'^(\d{1,3}\.){3}\d{1,3}$'

# Username (alphanumeric and underscore, 3-20 chars)
username_pattern = r'^[a-zA-Z0-9_]{3,20}$'

# Password (at least 8 chars, uppercase, lowercase, digit)
password_pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$'

# Test patterns
test_cases = {
    'email': ('user@example.com', email_pattern),
    'url': ('https://example.com', url_pattern),
    'phone': ('+1234567890', phone_pattern),
    'date': ('2025-12-16', date_pattern),
    'color': ('#FF5733', color_pattern),
    'ipv4': ('192.168.1.1', ipv4_pattern),
    'username': ('john_doe', username_pattern),
    'password': ('SecurePass123', password_pattern),
}

for name, (test_str, pattern) in test_cases.items():
    match = re.match(pattern, test_str)
    result = "Valid" if match else "Invalid"
    print(f"{name:12} '{test_str:20}' -> {result}")
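Several of these patterns validate only the shape of the input. The IPv4 pattern, for instance, accepts 999.999.999.999. A minimal sketch of one way to tighten it is to capture each octet and check its numeric range in Python. The helper name is_valid_ipv4 is hypothetical, and re.fullmatch() is used here in place of the ^ and $ anchors.

```python
import re

IPV4_RE = re.compile(r'(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})')

def is_valid_ipv4(address):
    """Check the dotted-quad shape with regex, then the 0-255 range in Python."""
    match = IPV4_RE.fullmatch(address)
    if match is None:
        return False
    return all(int(octet) <= 255 for octet in match.groups())

print(is_valid_ipv4("192.168.1.1"))      # True
print(is_valid_ipv4("999.999.999.999"))  # False
print(is_valid_ipv4("1.2.3"))            # False
```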

Best Practices and Common Pitfalls

Best Practice 1: Use Raw Strings

# ❌ Risky: Python processes backslash escapes before re sees the pattern
pattern = '\d+'  # '\d' is an invalid string escape; it still works, but recent Python versions warn about it

# ✓ Correct: Raw string preserves backslashes
pattern = r'\d+'  # This is the regex pattern for digits
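The risk is clearest with an escape that Python does recognize. In a normal string, '\b' is a single backspace character, so a word-boundary pattern written without the r prefix silently fails to match. The sample text below is made up.

```python
import re

print(len('\b'), len(r'\b'))  # 1 2

text = "a cat sat"
without_raw = re.findall('\bcat\b', text)  # looks for backspace + "cat" + backspace
with_raw = re.findall(r'\bcat\b', text)    # \b reaches re as a word boundary

print(without_raw)  # []
print(with_raw)     # ['cat']
```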

Best Practice 2: Compile Patterns for Reuse

import re

# ✓ Good: Compile once, use many times
email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

emails = [
    "alice@example.com",
    "bob@example.org",
    "carol@example.net"
]

for email in emails:
    if email_pattern.match(email):
        print(f"Valid: {email}")

# ❌ Less tidy: pattern repeated inline and looked up in re's internal cache on every call
for email in emails:
    if re.match(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', email):
        print(f"Valid: {email}")

Best Practice 3: Use Named Groups for Clarity

import re

# ✓ Good: Named groups are self-documenting
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.match(pattern, '2025-12-16')

if match:
    date_dict = match.groupdict()
    print(f"Year: {date_dict['year']}")
    print(f"Month: {date_dict['month']}")
    print(f"Day: {date_dict['day']}")

# ❌ Less clear: Numbered groups
pattern = r'(\d{4})-(\d{2})-(\d{2})'
match = re.match(pattern, '2025-12-16')
if match:
    year, month, day = match.groups()
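Named groups can also be referenced in a replacement string with the \g<name> syntax, which keeps substitutions readable. A short sketch, with a made-up sample sentence:

```python
import re

date_re = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# Reorder YYYY-MM-DD into DD/MM/YYYY using group names instead of numbers
reordered = date_re.sub(r'\g<day>/\g<month>/\g<year>', "Released on 2025-12-16")
print(reordered)  # Released on 16/12/2025

# Named groups are also available directly on the match object
found = date_re.search("Released on 2025-12-16")
print(found.group('year'))  # 2025
```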

Pitfall 1: Forgetting the ^ and $ Anchors

import re

# ❌ Problem: Matches partial strings
pattern = r'\d{3}-\d{4}'
text = "Call 555-1234 now"
print(re.search(pattern, text))  # Matches!

# ✓ Solution: Use anchors for full string match
pattern = r'^\d{3}-\d{4}$'
print(re.match(pattern, text))  # None
print(re.match(pattern, "555-1234"))  # Matches!
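One subtlety is worth knowing when validating input: $ also matches just before a trailing newline, so an anchored pattern can accept a string that ends in "\n". The standard-library function re.fullmatch(), not used elsewhere in this guide, requires the whole string to match and needs no anchors. A brief sketch:

```python
import re

# $ matches at the very end OR just before a final newline
anchored = re.match(r'^\d{3}-\d{4}$', "555-1234\n")
print(anchored is not None)  # True

# fullmatch() requires the entire string to match the pattern
strict = re.fullmatch(r'\d{3}-\d{4}', "555-1234\n")
print(strict is not None)    # False

exact = re.fullmatch(r'\d{3}-\d{4}', "555-1234")
print(exact is not None)     # True
```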

Pitfall 2: Greedy vs Non-Greedy Matching

import re

text = "<p>Hello</p> and <p>World</p>"

# ❌ Greedy: Matches too much
pattern = r'<p>.*</p>'
print(re.findall(pattern, text))
# Output: ['<p>Hello</p> and <p>World</p>']

# ✓ Non-greedy: Matches correctly
pattern = r'<p>.*?</p>'
print(re.findall(pattern, text))
# Output: ['<p>Hello</p>', '<p>World</p>']
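When the closing delimiter is a single character, a negated character class is a common alternative to the non-greedy quantifier: it states directly that the match may not contain the delimiter. The sketch below compares the three forms on a made-up string with quoted words.

```python
import re

text = 'say "hello" and "goodbye"'

greedy = re.findall(r'"(.*)"', text)       # runs to the LAST quote
lazy = re.findall(r'"(.*?)"', text)        # stops at the next quote
negated = re.findall(r'"([^"]*)"', text)   # cannot cross a quote at all

print(greedy)   # ['hello" and "goodbye']
print(lazy)     # ['hello', 'goodbye']
print(negated)  # ['hello', 'goodbye']
```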

Pitfall 3: Not Escaping Special Characters

import re

# ❌ Problem: . matches any character
pattern = r'file.txt'
print(re.search(pattern, 'file.txt'))  # Matches
print(re.search(pattern, 'filextxt'))  # Also matches!

# ✓ Solution: Escape special characters
pattern = r'file\.txt'
print(re.search(pattern, 'file.txt'))  # Matches
print(re.search(pattern, 'filextxt'))  # None
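When the text to match comes from a variable or from user input, escaping by hand is error-prone. The standard-library function re.escape() escapes every special character for you. A small sketch with a made-up filename:

```python
import re

filename = "report (final).txt"

# Unescaped, the parentheses form a group and the dot matches any character
unescaped = re.search(filename, "see report (final).txt")
print(unescaped)  # None

# re.escape() makes the whole string match literally
literal = re.compile(re.escape(filename))
print(literal.search("see report (final).txt") is not None)  # True
print(literal.search("see report (final)xtxt") is not None)  # False
```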

Conclusion

Regular expressions are powerful tools for text processing. Understanding the three core operations (matching, searching, and substitution) gives you the foundation to solve most regex problems:

  • Use match() when you need to validate that a string starts with a specific pattern
  • Use search() when you need to find a pattern anywhere in a string
  • Use findall() when you need to extract all occurrences of a pattern
  • Use sub() when you need to replace patterns with new text

Key takeaways:

  1. Always use raw strings (r"") for regex patterns
  2. Compile patterns if you use them multiple times
  3. Use named groups for complex patterns
  4. Test your patterns thoroughly with various inputs
  5. Consider performance for large texts
  6. Document complex patterns with comments
  7. Use online regex testers to debug patterns
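For takeaway 6, one option is the re.VERBOSE flag, which lets a pattern span several lines with inline comments. Below is a sketch that rewrites the phone pattern from Part 1 this way. The name PHONE_RE is chosen for this example, and the literal space is written as [ ] because verbose mode ignores bare whitespace.

```python
import re

PHONE_RE = re.compile(r"""
    ^\(          # opening parenthesis
    (\d{3})      # area code
    \)[ ]        # closing parenthesis and a single space
    (\d{3})      # prefix
    -            # separator
    (\d{4})      # line number
    $
""", re.VERBOSE)

match = PHONE_RE.match("(555) 123-4567")
print(match.groups())                   # ('555', '123', '4567')
print(PHONE_RE.match("555-123-4567"))   # None
```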

Regular expressions have a steep learning curve, but the investment pays off. Start with simple patterns, gradually build complexity, and soon you’ll be writing powerful regex solutions with confidence.