Skip to main content
โšก Calmops

Regular Expressions and Regular Languages

Introduction

Regular expressions are one of the most practical tools in computer science. They provide a concise way to describe patterns in strings and are used extensively in:

  • Text processing and searching
  • Input validation
  • Programming language lexical analysis
  • Data extraction
  • String manipulation

In this article, we’ll explore regular expressions, their syntax, and their relationship to formal language theory.

Regular Languages

Definition of Regular Language

A regular language is a language that can be recognized by a finite automaton or described by a regular expression.

Examples:

Lโ‚ = {a*}                    (zero or more a's)
Lโ‚‚ = {(ab)*}                 (zero or more ab's)
Lโ‚ƒ = {0, 1}*                 (all binary strings)
Lโ‚„ = {a*b*}                  (zero or more a's followed by zero or more b's)

Properties of Regular Languages

Finite Alphabet: Regular languages are defined over a finite alphabet.

Finite Automata: Regular languages can be recognized by finite automata.

Closure Properties: Regular languages are closed under union, concatenation, and Kleene star.

Regular Expression Syntax

Basic Operators

Concatenation: Placing symbols next to each other.

ab matches "ab"
abc matches "abc"

Alternation (|): Matching one of several options.

a|b matches "a" or "b"
cat|dog matches "cat" or "dog"

Kleene Star (*): Zero or more repetitions.

a* matches "", "a", "aa", "aaa", ...
(ab)* matches "", "ab", "abab", "ababab", ...

Kleene Plus (+): One or more repetitions.

a+ matches "a", "aa", "aaa", ...
(ab)+ matches "ab", "abab", "ababab", ...

Optional (?): Zero or one occurrence.

a? matches "" or "a"
(ab)? matches "" or "ab"

Quantifier {n}: Exactly n repetitions.

a{3} matches "aaa"
(ab){2} matches "abab"

Quantifier {m,n}: Between m and n repetitions.

a{2,4} matches "aa", "aaa", or "aaaa"

Character Classes

Dot (.): Matches any character except newline.

a.b matches "aab", "abb", "acb", ...

Character Set […]: Matches any character in the set.

[abc] matches "a", "b", or "c"
[0-9] matches any digit
[a-z] matches any lowercase letter

Negated Set [^…]: Matches any character NOT in the set.

[^abc] matches any character except "a", "b", "c"
[^0-9] matches any non-digit

Predefined Classes:

\d matches any digit [0-9]
\D matches any non-digit [^0-9]
\w matches any word character [a-zA-Z0-9_]
\W matches any non-word character
\s matches any whitespace
\S matches any non-whitespace

Anchors

Caret (^): Matches start of string.

^a matches "a" at the beginning
^[0-9] matches a digit at the start

Dollar ($): Matches end of string.

a$ matches "a" at the end
[0-9]$ matches a digit at the end

Word Boundary (\b): Matches boundary between word and non-word characters.

\bcat\b matches "cat" as a whole word

Regular Expression Examples

Example 1: Email Validation

Pattern:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Explanation:

  • ^ - Start of string
  • [a-zA-Z0-9._%+-]+ - One or more valid email characters
  • @ - Literal @
  • [a-zA-Z0-9.-]+ - Domain name
  • \. - Literal dot
  • [a-zA-Z]{2,} - Two or more letters (TLD)
  • $ - End of string

Matches:

[email protected] โœ“
[email protected] โœ“
[email protected] โœ—
user@domain โœ—

Example 2: Phone Number

Pattern:

^\d{3}-\d{3}-\d{4}$

Explanation:

  • ^\d{3} - Start with exactly 3 digits
  • - - Literal hyphen
  • \d{3} - Exactly 3 digits
  • - - Literal hyphen
  • \d{4}$ - Exactly 4 digits at end

Matches:

123-456-7890 โœ“
555-867-5309 โœ“
12-34-5678 โœ—
123-45-6789 โœ—

Example 3: URL

Pattern:

^https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}(/.*)?$

Explanation:

  • ^https?:// - Start with http or https
  • [a-zA-Z0-9.-]+ - Domain name
  • \.[a-zA-Z]{2,} - TLD
  • (/.*)? - Optional path

Matches:

http://example.com โœ“
https://www.example.com โœ“
https://example.com/path/to/page โœ“
ftp://example.com โœ—

Example 4: IPv4 Address

Pattern:

^(\d{1,3}\.){3}\d{1,3}$

Explanation:

  • ^(\d{1,3}\.){3} - Three groups of 1-3 digits followed by dot
  • \d{1,3}$ - Final group of 1-3 digits

Matches:

192.168.1.1 โœ“
10.0.0.1 โœ“
255.255.255.255 โœ“
256.1.1.1 โœ“ (technically matches, but invalid IP)

Regular Expressions and Finite Automata

Relationship

Every regular expression can be converted to a finite automaton, and vice versa.

Conversion Process:

  1. Regular expression โ†’ NFA (Thompson’s construction)
  2. NFA โ†’ DFA (subset construction)
  3. DFA โ†’ Regular expression (state elimination)

Example: Converting Regex to Automaton

Regular Expression: (a|b)*abb

NFA:

States: q0, q1, q2, q3, q4, q5, q6
Start: q0
Accept: q6

Transitions:
q0 --ฮต--> q1, q4
q1 --a--> q2
q1 --b--> q3
q2 --ฮต--> q1
q3 --ฮต--> q1
q4 --a--> q5
q5 --b--> q6
q6 --b--> q6 (accept)

Practical Regular Expression Usage

Text Searching

Find all email addresses in text:

Pattern: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

Find all numbers:

Pattern: \d+

Input Validation

Validate username (alphanumeric, 3-20 characters):

Pattern: ^[a-zA-Z0-9]{3,20}$

Validate password (at least 8 chars, one uppercase, one digit):

Pattern: ^(?=.*[A-Z])(?=.*\d).{8,}$

String Replacement

Replace all spaces with underscores:

Pattern: \s+
Replacement: _

Convert camelCase to snake_case:

Pattern: ([a-z])([A-Z])
Replacement: $1_$2

Common Regex Patterns

Patterns for Common Tasks

Whole numbers:

^\d+$

Decimal numbers:

^\d+\.\d+$

Hexadecimal:

^[0-9a-fA-F]+$

HTML tags:

<[^>]+>

URLs:

https?://[^\s]+

Dates (MM/DD/YYYY):

^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$

Time (HH:MM:SS):

^([01]\d|2[0-3]):([0-5]\d):([0-5]\d)$

Credit card:

^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$

Limitations of Regular Expressions

What Regular Expressions Cannot Match

Balanced parentheses:

Cannot match: (()), ((())), etc.
Reason: Requires counting, which needs more than finite automata

Palindromes:

Cannot match: aba, abba, abcba, etc.
Reason: Requires comparing beginning and end

Equal repetitions:

Cannot match: aโฟbโฟ (equal a's and b's)
Reason: Requires counting and comparison

Why These Limitations Exist

Regular expressions can only recognize regular languages, which are limited to:

  • Finite memory
  • No counting capability
  • No nested structures

These limitations are fundamental to finite automata.

Glossary

  • Regular language: Language recognizable by finite automata
  • Regular expression: Pattern for describing regular languages
  • Concatenation: Placing patterns next to each other
  • Alternation: Choosing between options (|)
  • Kleene star (*): Zero or more repetitions
  • Kleene plus (+): One or more repetitions
  • Character class: Set of characters to match
  • Anchor: Position in string (start, end, boundary)
  • Quantifier: Specifying number of repetitions
  • Finite automaton: Machine recognizing regular languages

Practice Problems

Problem 1: Pattern Matching

Write a regex for:

  1. All strings starting with “a”
  2. All strings ending with “b”
  3. All strings with exactly 3 characters

Solution:

  1. ^a.*
  2. .*b$
  3. ^.{3}$

Problem 2: Character Classes

Write a regex for:

  1. All lowercase letters
  2. All digits
  3. All alphanumeric characters

Solution:

  1. [a-z]
  2. [0-9] or \d
  3. [a-zA-Z0-9] or \w

Problem 3: Quantifiers

Write a regex for:

  1. One or more a’s
  2. Zero or more b’s
  3. Between 2 and 5 c’s

Solution:

  1. a+
  2. b*
  3. c{2,5}

Problem 4: Complex Patterns

Write a regex for:

  1. Valid username (alphanumeric, 3-20 chars)
  2. Valid date (MM/DD/YYYY)

Solution:

  1. ^[a-zA-Z0-9]{3,20}$
  2. ^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}$

Online Platforms

Interactive Tools

  • “Mastering Regular Expressions” by Friedl
  • “Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, Ullman
  • “Regular Expressions Cookbook” by Goyvaerts & Levithan
  • “Perl Regular Expressions” by Friedl
  • “Theory of Computation” by Sipser

Academic Journals

Software Tools

Conclusion

Regular expressions are a powerful and practical tool for:

  • Pattern matching and searching
  • Input validation
  • Text processing
  • Lexical analysis
  • String manipulation

Understanding regular expressions and their relationship to formal language theory provides insight into both practical programming and theoretical computer science.

In the next article, we’ll explore parsing and syntax analysis, which build on regular expressions and context-free grammars.


Next Article: Parsing and Syntax Analysis

Previous Article: Context-Free Grammars

Comments