Regular Expressions and Regular Languages

Introduction

Regular expressions are one of the most practical tools in computer science. They provide a concise way to describe patterns in strings and are used extensively in:

Text processing and searching
Input validation
Programming language lexical analysis
Data extraction
String manipulation

In this article, we’ll explore regular expressions, their syntax, and their relationship to formal language theory.

Regular Languages

Definition of Regular Language

A regular language is a language that can be recognized by a finite automaton or described by a regular expression.

Examples:

L₁ = {a*}                    (zero or more a's)
L₂ = {(ab)*}                 (zero or more ab's)
L₃ = {0, 1}*                 (all binary strings)
L₄ = {a*b*}                  (zero or more a's followed by zero or more b's)

Properties of Regular Languages

Finite Alphabet: Regular languages are defined over a finite alphabet.

Finite Automata: Regular languages can be recognized by finite automata.

Closure Properties: Regular languages are closed under union, concatenation, and Kleene star.

Regular Expression Syntax

Basic Operators

Concatenation: Placing symbols next to each other.

ab matches "ab"
abc matches "abc"

Alternation (|): Matching one of several options.

a|b matches "a" or "b"
cat|dog matches "cat" or "dog"

Kleene Star (*): Zero or more repetitions.

a* matches "", "a", "aa", "aaa", ...
(ab)* matches "", "ab", "abab", "ababab", ...

Kleene Plus (+): One or more repetitions.

a+ matches "a", "aa", "aaa", ...
(ab)+ matches "ab", "abab", "ababab", ...

Optional (?): Zero or one occurrence.

a? matches "" or "a"
(ab)? matches "" or "ab"

Quantifier {n}: Exactly n repetitions.

a{3} matches "aaa"
(ab){2} matches "abab"

Quantifier {m,n}: Between m and n repetitions.

a{2,4} matches "aa", "aaa", or "aaaa"

Character Classes

Dot (.): Matches any character except newline.

a.b matches "aab", "abb", "acb", ...

Character Set […]: Matches any character in the set.

[abc] matches "a", "b", or "c"
[0-9] matches any digit
[a-z] matches any lowercase letter

Negated Set [^…]: Matches any character NOT in the set.

[^abc] matches any character except "a", "b", "c"
[^0-9] matches any non-digit

Predefined Classes:

\d matches any digit [0-9]
\D matches any non-digit [^0-9]
\w matches any word character [a-zA-Z0-9_]
\W matches any non-word character
\s matches any whitespace
\S matches any non-whitespace

Anchors

Caret (^): Matches start of string.

^a matches "a" at the beginning
^[0-9] matches a digit at the start

Dollar ($): Matches end of string.

a$ matches "a" at the end
[0-9]$ matches a digit at the end

Word Boundary (\b): Matches boundary between word and non-word characters.

\bcat\b matches "cat" as a whole word

Regular Expression Examples

Example 1: Email Validation

Pattern:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Explanation:

^ - Start of string
[a-zA-Z0-9._%+-]+ - One or more valid email characters
@ - Literal @
[a-zA-Z0-9.-]+ - Domain name
\. - Literal dot
[a-zA-Z]{2,} - Two or more letters (TLD)
$ - End of string

Matches:

[email protected] ✓
[email protected] ✓
[email protected] ✗
user@domain ✗

Example 2: Phone Number

Pattern:

^\d{3}-\d{3}-\d{4}$

Explanation:

^\d{3} - Start with exactly 3 digits
- - Literal hyphen
\d{3} - Exactly 3 digits
- - Literal hyphen
\d{4}$ - Exactly 4 digits at end

Matches:

123-456-7890 ✓
555-867-5309 ✓
12-34-5678 ✗
123-45-6789 ✗

Example 3: URL

Pattern:

^https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/.*)?$

Explanation:

^https?:// - Start with http or https
[a-zA-Z0-9.-]+ - Domain name
\.[a-zA-Z]{2,} - TLD
(/.*)? - Optional path

Matches:

http://example.com ✓
https://www.example.com ✓
https://example.com/path/to/page ✓
ftp://example.com ✗

Example 4: IPv4 Address

Pattern:

^(\d{1,3}\.){3}\d{1,3}$

Explanation:

^(\d{1,3}\.){3} - Three groups of 1-3 digits followed by dot
\d{1,3}$ - Final group of 1-3 digits

Matches:

192.168.1.1 ✓
10.0.0.1 ✓
255.255.255.255 ✓
256.1.1.1 ✓ (technically matches, but invalid IP)

Regular Expressions and Finite Automata

Relationship

Every regular expression can be converted to a finite automaton, and vice versa.

Conversion Process:

Regular expression → NFA (Thompson’s construction)
NFA → DFA (subset construction)
DFA → Regular expression (state elimination)

Example: Converting Regex to Automaton

Regular Expression: (a|b)*abb

NFA:

States: q0, q1, q2, q3, q4, q5, q6
Start: q0
Accept: q6

Transitions:
q0 --ε--> q1, q4
q1 --a--> q2
q1 --b--> q3
q2 --ε--> q1
q3 --ε--> q1
q4 --a--> q5
q5 --b--> q6
q6 --b--> q6 (accept)

Practical Regular Expression Usage

Text Searching

Find all email addresses in text:

Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

Find all numbers:

Pattern: \d+

Input Validation

Validate username (alphanumeric, 3-20 characters):

Pattern: ^[a-zA-Z0-9]{3,20}$

Validate password (at least 8 chars, one uppercase, one digit):

Pattern: ^(?=.*[A-Z])(?=.*\d).{8,}$

String Replacement

Replace all spaces with underscores:

Pattern: \s+
Replacement: _

Convert camelCase to snake_case:

Pattern: ([a-z])([A-Z])
Replacement: $1_$2

Common Regex Patterns

Patterns for Common Tasks

Whole numbers:

^\d+$

Decimal numbers:

^\d+\.\d+$

Hexadecimal:

^[0-9a-fA-F]+$

HTML tags:

<[^>]+>

URLs:

https?://[^\s]+

Dates (MM/DD/YYYY):

^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$

Time (HH:MM:SS):

^([01]\d|2[0-3]):([0-5]\d):([0-5]\d)$

Credit card:

^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$

Limitations of Regular Expressions

What Regular Expressions Cannot Match

Balanced parentheses:

Cannot match: (()), ((())), etc.
Reason: Requires counting, which needs more than finite automata

Palindromes:

Cannot match: aba, abba, abcba, etc.
Reason: Requires comparing beginning and end

Equal repetitions:

Cannot match: aⁿbⁿ (equal a's and b's)
Reason: Requires counting and comparison

Why These Limitations Exist

Regular expressions can only recognize regular languages, which are limited to:

Finite memory
No counting capability
No nested structures

These limitations are fundamental to finite automata.

Glossary

Regular language: Language recognizable by finite automata
Regular expression: Pattern for describing regular languages
Concatenation: Placing patterns next to each other
Alternation: Choosing between options (|)
Kleene star (*): Zero or more repetitions
Kleene plus (+): One or more repetitions
Character class: Set of characters to match
Anchor: Position in string (start, end, boundary)
Quantifier: Specifying number of repetitions
Finite automaton: Machine recognizing regular languages

Practice Problems

Problem 1: Pattern Matching

Write a regex for:

All strings starting with “a”
All strings ending with “b”
All strings with exactly 3 characters

Solution:

^a.*
.*b$
^.{3}$

Problem 2: Character Classes

Write a regex for:

All lowercase letters
All digits
All alphanumeric characters

Solution:

[a-z]
[0-9] or \d
[a-zA-Z0-9] or \w

Problem 3: Quantifiers

Write a regex for:

One or more a’s
Zero or more b’s
Between 2 and 5 c’s

Solution:

a+
b*
c{2,5}

Problem 4: Complex Patterns

Write a regex for:

Valid username (alphanumeric, 3-20 chars)
Valid date (MM/DD/YYYY)

Solution:

^[a-zA-Z0-9]{3,20}$
^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$

Online Platforms

Regex101 - Interactive regex tester
RegexPal - Regex pattern tester
Khan Academy Computer Science - CS fundamentals
Coursera Formal Languages - Online courses
MIT OpenCourseWare - University courses

Interactive Tools

Regex101 - Full-featured regex tester
RegExr - Visual regex explorer
Regex Visualizer - Visualize regex patterns
Automata Simulator - Automata visualization
Pattern Tester - Test patterns

Recommended Books

“Mastering Regular Expressions” by Friedl
“Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, Ullman
“Regular Expressions Cookbook” by Goyvaerts & Levithan
“Perl Regular Expressions” by Friedl
“Theory of Computation” by Sipser

Academic Journals

Software Tools

Regex101 - Online regex tester
JFLAP - Automata simulator
Graphviz - Graph visualization
Perl - Regex engine
Python re module - Python regex

Conclusion

Regular expressions are a powerful and practical tool for:

Pattern matching and searching
Input validation
Text processing
Lexical analysis
String manipulation

Understanding regular expressions and their relationship to formal language theory provides insight into both practical programming and theoretical computer science.

In the next article, we’ll explore parsing and syntax analysis, which build on regular expressions and context-free grammars.

Next Article: Parsing and Syntax Analysis

Previous Article: Context-Free Grammars

Introduction

Regular Languages

Definition of Regular Language

Properties of Regular Languages

Regular Expression Syntax

Basic Operators

Character Classes

Anchors

Regular Expression Examples

Example 1: Email Validation

Example 2: Phone Number

Example 3: URL

Example 4: IPv4 Address

Regular Expressions and Finite Automata

Relationship

Example: Converting Regex to Automaton

Practical Regular Expression Usage

Text Searching

Input Validation

String Replacement

Common Regex Patterns

Patterns for Common Tasks

Limitations of Regular Expressions

What Regular Expressions Cannot Match

Why These Limitations Exist

Glossary

Practice Problems

Problem 1: Pattern Matching

Problem 2: Character Classes

Problem 3: Quantifiers

Problem 4: Complex Patterns

Related Resources

Online Platforms

Interactive Tools

Recommended Books

Academic Journals

Software Tools

Conclusion

Comments