Introduction
Regular expressions are one of the most practical tools in computer science. They provide a concise way to describe patterns in strings and are used extensively in:
- Text processing and searching
- Input validation
- Programming language lexical analysis
- Data extraction
- String manipulation
In this article, we’ll explore regular expressions, their syntax, and their relationship to formal language theory.
Regular Languages
Definition of Regular Language
A regular language is a language that can be recognized by a finite automaton or described by a regular expression.
Examples:
Lโ = {a*} (zero or more a's)
Lโ = {(ab)*} (zero or more ab's)
Lโ = {0, 1}* (all binary strings)
Lโ = {a*b*} (zero or more a's followed by zero or more b's)
Properties of Regular Languages
Finite Alphabet: Regular languages are defined over a finite alphabet.
Finite Automata: Regular languages can be recognized by finite automata.
Closure Properties: Regular languages are closed under union, concatenation, and Kleene star.
Regular Expression Syntax
Basic Operators
Concatenation: Placing symbols next to each other.
ab matches "ab"
abc matches "abc"
Alternation (|): Matching one of several options.
a|b matches "a" or "b"
cat|dog matches "cat" or "dog"
Kleene Star (*): Zero or more repetitions.
a* matches "", "a", "aa", "aaa", ...
(ab)* matches "", "ab", "abab", "ababab", ...
Kleene Plus (+): One or more repetitions.
a+ matches "a", "aa", "aaa", ...
(ab)+ matches "ab", "abab", "ababab", ...
Optional (?): Zero or one occurrence.
a? matches "" or "a"
(ab)? matches "" or "ab"
Quantifier {n}: Exactly n repetitions.
a{3} matches "aaa"
(ab){2} matches "abab"
Quantifier {m,n}: Between m and n repetitions.
a{2,4} matches "aa", "aaa", or "aaaa"
Character Classes
Dot (.): Matches any character except newline.
a.b matches "aab", "abb", "acb", ...
Character Set […]: Matches any character in the set.
[abc] matches "a", "b", or "c"
[0-9] matches any digit
[a-z] matches any lowercase letter
Negated Set [^…]: Matches any character NOT in the set.
[^abc] matches any character except "a", "b", "c"
[^0-9] matches any non-digit
Predefined Classes:
\d matches any digit [0-9]
\D matches any non-digit [^0-9]
\w matches any word character [a-zA-Z0-9_]
\W matches any non-word character
\s matches any whitespace
\S matches any non-whitespace
Anchors
Caret (^): Matches start of string.
^a matches "a" at the beginning
^[0-9] matches a digit at the start
Dollar ($): Matches end of string.
a$ matches "a" at the end
[0-9]$ matches a digit at the end
Word Boundary (\b): Matches boundary between word and non-word characters.
\bcat\b matches "cat" as a whole word
Regular Expression Examples
Example 1: Email Validation
Pattern:
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
Explanation:
^- Start of string[a-zA-Z0-9._%+-]+- One or more valid email characters@- Literal @[a-zA-Z0-9.-]+- Domain name\.- Literal dot[a-zA-Z]{2,}- Two or more letters (TLD)$- End of string
Matches:
[email protected] โ
[email protected] โ
[email protected] โ
user@domain โ
Example 2: Phone Number
Pattern:
^\d{3}-\d{3}-\d{4}$
Explanation:
^\d{3}- Start with exactly 3 digits-- Literal hyphen\d{3}- Exactly 3 digits-- Literal hyphen\d{4}$- Exactly 4 digits at end
Matches:
123-456-7890 โ
555-867-5309 โ
12-34-5678 โ
123-45-6789 โ
Example 3: URL
Pattern:
^https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/.*)?$
Explanation:
^https?://- Start with http or https[a-zA-Z0-9.-]+- Domain name\.[a-zA-Z]{2,}- TLD(/.*)?- Optional path
Matches:
http://example.com โ
https://www.example.com โ
https://example.com/path/to/page โ
ftp://example.com โ
Example 4: IPv4 Address
Pattern:
^(\d{1,3}\.){3}\d{1,3}$
Explanation:
^(\d{1,3}\.){3}- Three groups of 1-3 digits followed by dot\d{1,3}$- Final group of 1-3 digits
Matches:
192.168.1.1 โ
10.0.0.1 โ
255.255.255.255 โ
256.1.1.1 โ (technically matches, but invalid IP)
Regular Expressions and Finite Automata
Relationship
Every regular expression can be converted to a finite automaton, and vice versa.
Conversion Process:
- Regular expression โ NFA (Thompson’s construction)
- NFA โ DFA (subset construction)
- DFA โ Regular expression (state elimination)
Example: Converting Regex to Automaton
Regular Expression: (a|b)*abb
NFA:
States: q0, q1, q2, q3, q4, q5, q6
Start: q0
Accept: q6
Transitions:
q0 --ฮต--> q1, q4
q1 --a--> q2
q1 --b--> q3
q2 --ฮต--> q1
q3 --ฮต--> q1
q4 --a--> q5
q5 --b--> q6
q6 --b--> q6 (accept)
Practical Regular Expression Usage
Text Searching
Find all email addresses in text:
Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Find all numbers:
Pattern: \d+
Input Validation
Validate username (alphanumeric, 3-20 characters):
Pattern: ^[a-zA-Z0-9]{3,20}$
Validate password (at least 8 chars, one uppercase, one digit):
Pattern: ^(?=.*[A-Z])(?=.*\d).{8,}$
String Replacement
Replace all spaces with underscores:
Pattern: \s+
Replacement: _
Convert camelCase to snake_case:
Pattern: ([a-z])([A-Z])
Replacement: $1_$2
Common Regex Patterns
Patterns for Common Tasks
Whole numbers:
^\d+$
Decimal numbers:
^\d+\.\d+$
Hexadecimal:
^[0-9a-fA-F]+$
HTML tags:
<[^>]+>
URLs:
https?://[^\s]+
Dates (MM/DD/YYYY):
^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$
Time (HH:MM:SS):
^([01]\d|2[0-3]):([0-5]\d):([0-5]\d)$
Credit card:
^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$
Limitations of Regular Expressions
What Regular Expressions Cannot Match
Balanced parentheses:
Cannot match: (()), ((())), etc.
Reason: Requires counting, which needs more than finite automata
Palindromes:
Cannot match: aba, abba, abcba, etc.
Reason: Requires comparing beginning and end
Equal repetitions:
Cannot match: aโฟbโฟ (equal a's and b's)
Reason: Requires counting and comparison
Why These Limitations Exist
Regular expressions can only recognize regular languages, which are limited to:
- Finite memory
- No counting capability
- No nested structures
These limitations are fundamental to finite automata.
Glossary
- Regular language: Language recognizable by finite automata
- Regular expression: Pattern for describing regular languages
- Concatenation: Placing patterns next to each other
- Alternation: Choosing between options (|)
- Kleene star (*): Zero or more repetitions
- Kleene plus (+): One or more repetitions
- Character class: Set of characters to match
- Anchor: Position in string (start, end, boundary)
- Quantifier: Specifying number of repetitions
- Finite automaton: Machine recognizing regular languages
Practice Problems
Problem 1: Pattern Matching
Write a regex for:
- All strings starting with “a”
- All strings ending with “b”
- All strings with exactly 3 characters
Solution:
^a.*.*b$^.{3}$
Problem 2: Character Classes
Write a regex for:
- All lowercase letters
- All digits
- All alphanumeric characters
Solution:
[a-z][0-9]or\d[a-zA-Z0-9]or\w
Problem 3: Quantifiers
Write a regex for:
- One or more a’s
- Zero or more b’s
- Between 2 and 5 c’s
Solution:
a+b*c{2,5}
Problem 4: Complex Patterns
Write a regex for:
- Valid username (alphanumeric, 3-20 chars)
- Valid date (MM/DD/YYYY)
Solution:
^[a-zA-Z0-9]{3,20}$^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$
Related Resources
Online Platforms
- Regex101 - Interactive regex tester
- RegexPal - Regex pattern tester
- Khan Academy Computer Science - CS fundamentals
- Coursera Formal Languages - Online courses
- MIT OpenCourseWare - University courses
Interactive Tools
- Regex101 - Full-featured regex tester
- RegExr - Visual regex explorer
- Regex Visualizer - Visualize regex patterns
- Automata Simulator - Automata visualization
- Pattern Tester - Test patterns
Recommended Books
- “Mastering Regular Expressions” by Friedl
- “Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, Ullman
- “Regular Expressions Cookbook” by Goyvaerts & Levithan
- “Perl Regular Expressions” by Friedl
- “Theory of Computation” by Sipser
Academic Journals
- Journal of Computer and System Sciences
- Theoretical Computer Science
- ACM Transactions on Computational Logic
- Information and Computation
- Formal Aspects of Computing
Software Tools
- Regex101 - Online regex tester
- JFLAP - Automata simulator
- Graphviz - Graph visualization
- Perl - Regex engine
- Python re module - Python regex
Conclusion
Regular expressions are a powerful and practical tool for:
- Pattern matching and searching
- Input validation
- Text processing
- Lexical analysis
- String manipulation
Understanding regular expressions and their relationship to formal language theory provides insight into both practical programming and theoretical computer science.
In the next article, we’ll explore parsing and syntax analysis, which build on regular expressions and context-free grammars.
Next Article: Parsing and Syntax Analysis
Previous Article: Context-Free Grammars
Comments