Introduction
Parsing is the process of analyzing a sequence of symbols (tokens) to determine its grammatical structure. It’s a fundamental component of:
- Compilers and interpreters
- Programming language design
- Natural language processing
- Data format parsing (JSON, XML, etc.)
- Query language processing
In this article, we’ll explore parsing techniques, algorithms, and their applications.
Parsing Basics
Definition of Parsing
Parsing is the process of analyzing a string of symbols according to the rules of a formal grammar.
Input: A sequence of tokens (lexemes) Output: A parse tree or abstract syntax tree (AST) Process: Determine if the input is valid and build its structure
Tokens and Lexemes
Token: A category of symbol (e.g., identifier, number, operator) Lexeme: The actual text of a token
Example:
Input: x = 5 + 3;
Tokens: identifier assignment number plus number semicolon
Lexemes: x = 5 + 3 ;
Parse Tree vs. Abstract Syntax Tree
Parse Tree: Shows all details of the parsing process, including non-terminals.
Abstract Syntax Tree (AST): Simplified tree showing only essential structure.
Example:
Input: 2 + 3 * 4
Parse Tree:
E
/ \
E +
| \
2 T
/ \
T *
| \
3 4
AST:
+
/ \
2 *
/ \
3 4
Top-Down Parsing
Definition
Top-down parsing starts with the start symbol and tries to derive the input string.
Process:
- Start with the start symbol
- Apply production rules to expand non-terminals
- Match terminals with input tokens
- Continue until entire input is consumed
Recursive Descent Parsing
Recursive descent parsing uses recursive functions to parse each non-terminal.
Algorithm:
For each non-terminal A:
Create a function parse_A()
For each production A โ ฮฑ:
Try to match ฮฑ
If successful, return
If no production matches, error
Example:
Grammar:
E โ T + E | T
T โ F * T | F
F โ (E) | num
Recursive descent parser:
def parse_E():
parse_T()
if current_token == '+':
consume('+')
parse_E()
def parse_T():
parse_F()
if current_token == '*':
consume('*')
parse_T()
def parse_F():
if current_token == '(':
consume('(')
parse_E()
consume(')')
elif current_token == 'num':
consume('num')
else:
error()
LL(1) Parsing
LL(1) parsing uses a parsing table to determine which production to apply.
LL(1): Left-to-right scan, Leftmost derivation, 1 token lookahead
Advantages:
- Efficient (O(n) time)
- Easy to implement
- Good error messages
Disadvantages:
- Limited to LL(1) grammars
- Requires grammar transformation
Bottom-Up Parsing
Definition
Bottom-up parsing starts with the input tokens and tries to reduce them to the start symbol.
Process:
- Read input tokens
- Find substrings that match production right-hand sides
- Replace with left-hand side non-terminal
- Continue until only start symbol remains
Shift-Reduce Parsing
Shift-reduce parsing uses a stack and a parsing table.
Operations:
- Shift: Push next token onto stack
- Reduce: Replace top of stack with non-terminal
- Accept: Parsing complete
- Error: Invalid input
Example:
Grammar:
E โ E + T | T
T โ T * F | F
F โ (E) | num
Input: 2 + 3
Stack operations:
1. Shift 2: [2]
2. Reduce F โ num: [F]
3. Reduce T โ F: [T]
4. Reduce E โ T: [E]
5. Shift +: [E, +]
6. Shift 3: [E, +, 3]
7. Reduce F โ num: [E, +, F]
8. Reduce T โ F: [E, +, T]
9. Reduce E โ E + T: [E]
10. Accept
LR Parsing
LR parsing is a powerful bottom-up parsing technique.
LR(k): Left-to-right scan, Rightmost derivation, k token lookahead
Variants:
- SLR: Simple LR
- LALR: Look-ahead LR (most common)
- LR(1): Full LR
Advantages:
- Handles most practical grammars
- Efficient (O(n) time)
- Good error detection
Disadvantages:
- Complex to implement
- Requires parser generator
Parsing Algorithms
CYK Algorithm
The Cocke-Younger-Kasami (CYK) algorithm parses using a grammar in Chomsky Normal Form.
Time Complexity: O(nยณ) Space Complexity: O(nยฒ)
Algorithm:
For i = 1 to n:
For each production A โ a:
If a matches token i:
Add A to table[i][i]
For length = 2 to n:
For i = 1 to n - length + 1:
j = i + length - 1
For k = i to j - 1:
For each production A โ BC:
If B in table[i][k] and C in table[k+1][j]:
Add A to table[i][j]
If start symbol in table[1][n]:
Accept
Else:
Reject
Earley Parser
The Earley parser handles arbitrary context-free grammars.
Time Complexity: O(nยณ) worst case, O(n) for unambiguous grammars Space Complexity: O(nยฒ)
Advantages:
- Handles any CFG
- Efficient for many practical grammars
- Can handle ambiguous grammars
Error Handling
Error Detection
Errors occur when:
- No valid production matches
- Unexpected token encountered
- End of input reached prematurely
Error Recovery
Strategies:
- Panic mode: Skip tokens until synchronization point
- Phrase level: Replace erroneous token with expected token
- Error productions: Add special productions for common errors
- Global correction: Find minimal edits to make input valid
Example:
Input: x = 5 +;
Error: Expected expression after +
Recovery options:
1. Skip semicolon and continue
2. Insert default value (0)
3. Report error and stop
Practical Parsing
Lexical Analysis (Tokenization)
Before parsing, input is converted to tokens.
Example:
Input: x = 5 + 3;
Lexer output:
ID(x) ASSIGN NUM(5) PLUS NUM(3) SEMICOLON
Parser Generators
Tools that automatically generate parsers from grammar specifications.
Popular tools:
- Yacc/Bison: LALR parser generator
- ANTLR: LL(*) parser generator
- Flex/Lex: Lexer generator
- Lemon: LALR parser generator
Example (Bison):
%token ID NUM PLUS ASSIGN SEMICOLON
%%
program: statement
| program statement
;
statement: ID ASSIGN expression SEMICOLON
;
expression: NUM
| expression PLUS NUM
;
%%
Glossary
- Parsing: Analyzing structure of input according to grammar
- Token: Category of symbol
- Lexeme: Actual text of a token
- Parse tree: Tree showing parsing structure
- AST: Abstract syntax tree
- Top-down parsing: Starting from start symbol
- Bottom-up parsing: Starting from input tokens
- Shift-reduce parsing: Using stack and table
- LR parsing: Left-to-right, rightmost derivation
- Error recovery: Handling parsing errors
Practice Problems
Problem 1: Parse Tree
Given grammar:
E โ E + T | T
T โ T * F | F
F โ (E) | num
Draw parse tree for “2 + 3 * 4”.
Solution:
E
/ \
E +
| \
2 T
/ \
T *
| \
3 4
Problem 2: Recursive Descent
Implement recursive descent parser for:
S โ aSb | ฮต
Solution:
def parse_S():
if current_token == 'a':
consume('a')
parse_S()
consume('b')
elif current_token == 'EOF':
pass # ฮต production
else:
error()
Problem 3: Shift-Reduce
Trace shift-reduce parsing for “2 * 3” using grammar from Problem 1.
Solution:
1. Shift 2: [2]
2. Reduce F โ num: [F]
3. Reduce T โ F: [T]
4. Shift *: [T, *]
5. Shift 3: [T, *, 3]
6. Reduce F โ num: [T, *, F]
7. Reduce T โ T * F: [T]
8. Reduce E โ T: [E]
9. Accept
Problem 4: Error Handling
For input “x = 5 +;” with grammar from Problem 1, describe error and recovery.
Solution: Error: Expected expression after + Recovery: Insert default value or skip to semicolon
Related Resources
Online Platforms
- Khan Academy Computer Science - CS fundamentals
- Coursera Compilers - Compiler courses
- MIT OpenCourseWare - University courses
- Brilliant.org Computer Science - Interactive lessons
- Stanford CS143 - Compilers course
Interactive Tools
- JFLAP - Automata and grammar simulator
- Yacc/Bison - Parser generator
- ANTLR - Parser generator
- Graphviz - Graph visualization
- Parse Tree Visualizer - Visualize parse trees
Recommended Books
- “Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, Ullman
- “Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, Ullman
- “Engineering a Compiler” by Cooper & Torczon
- “Modern Compiler Implementation” by Appel
- “Theory of Computation” by Sipser
Academic Journals
- Journal of Computer and System Sciences
- Theoretical Computer Science
- ACM Transactions on Programming Languages and Systems
- Information and Computation
- Formal Aspects of Computing
Software Tools
- ANTLR - Parser generator
- Bison - LALR parser generator
- Flex - Lexer generator
- Lemon - LALR parser generator
- Yacc - Parser generator
Conclusion
Parsing is a fundamental technique for:
- Compiler and interpreter construction
- Programming language design
- Data format processing
- Natural language processing
- Query language implementation
Understanding parsing algorithms and techniques is essential for building language processors and understanding how computers process structured input.
In the next article, we’ll explore the Chomsky hierarchy, which classifies languages by their computational power.
Next Article: Chomsky Hierarchy
Previous Article: Regular Expressions and Regular Languages
Comments