Parsing and Syntax Analysis

Introduction

Parsing is the process of analyzing a sequence of symbols (tokens) to determine its grammatical structure. It’s a fundamental component of:

Compilers and interpreters
Programming language design
Natural language processing
Data format parsing (JSON, XML, etc.)
Query language processing

In this article, we’ll explore parsing techniques, algorithms, and their applications.

Parsing Basics

Definition of Parsing

Parsing is the process of analyzing a string of symbols according to the rules of a formal grammar.

Input: A sequence of tokens (lexemes) Output: A parse tree or abstract syntax tree (AST) Process: Determine if the input is valid and build its structure

Tokens and Lexemes

Token: A category of symbol (e.g., identifier, number, operator) Lexeme: The actual text of a token

Example:

Input: x = 5 + 3;

Tokens:    identifier  assignment  number  plus  number  semicolon
Lexemes:   x            =           5       +     3       ;

Parse Tree vs. Abstract Syntax Tree

Parse Tree: Shows all details of the parsing process, including non-terminals.

Abstract Syntax Tree (AST): Simplified tree showing only essential structure.

Example:

Input: 2 + 3 * 4

Parse Tree:
        E
       / \
      E   +
      |    \
      2     T
           / \
          T   *
          |    \
          3     4

AST:
        +
       / \
      2   *
         / \
        3   4

Top-Down Parsing

Definition

Top-down parsing starts with the start symbol and tries to derive the input string.

Process:

Start with the start symbol
Apply production rules to expand non-terminals
Match terminals with input tokens
Continue until entire input is consumed

Recursive Descent Parsing

Recursive descent parsing uses recursive functions to parse each non-terminal.

Algorithm:

For each non-terminal A:
  Create a function parse_A()
  For each production A → α:
    Try to match α
    If successful, return
  If no production matches, error

Example:

Grammar:
E → T + E | T
T → F * T | F
F → (E) | num

Recursive descent parser:
def parse_E():
  parse_T()
  if current_token == '+':
    consume('+')
    parse_E()

def parse_T():
  parse_F()
  if current_token == '*':
    consume('*')
    parse_T()

def parse_F():
  if current_token == '(':
    consume('(')
    parse_E()
    consume(')')
  elif current_token == 'num':
    consume('num')
  else:
    error()

LL(1) Parsing

LL(1) parsing uses a parsing table to determine which production to apply.

LL(1): Left-to-right scan, Leftmost derivation, 1 token lookahead

Advantages:

Efficient (O(n) time)
Easy to implement
Good error messages

Disadvantages:

Limited to LL(1) grammars
Requires grammar transformation

Bottom-Up Parsing

Definition

Bottom-up parsing starts with the input tokens and tries to reduce them to the start symbol.

Process:

Read input tokens
Find substrings that match production right-hand sides
Replace with left-hand side non-terminal
Continue until only start symbol remains

Shift-Reduce Parsing

Shift-reduce parsing uses a stack and a parsing table.

Operations:

Shift: Push next token onto stack
Reduce: Replace top of stack with non-terminal
Accept: Parsing complete
Error: Invalid input

Example:

Grammar:
E → E + T | T
T → T * F | F
F → (E) | num

Input: 2 + 3

Stack operations:
1. Shift 2: [2]
2. Reduce F → num: [F]
3. Reduce T → F: [T]
4. Reduce E → T: [E]
5. Shift +: [E, +]
6. Shift 3: [E, +, 3]
7. Reduce F → num: [E, +, F]
8. Reduce T → F: [E, +, T]
9. Reduce E → E + T: [E]
10. Accept

LR Parsing

LR parsing is a powerful bottom-up parsing technique.

LR(k): Left-to-right scan, Rightmost derivation, k token lookahead

Variants:

SLR: Simple LR
LALR: Look-ahead LR (most common)
LR(1): Full LR

Advantages:

Handles most practical grammars
Efficient (O(n) time)
Good error detection

Disadvantages:

Complex to implement
Requires parser generator

Parsing Algorithms

CYK Algorithm

The Cocke-Younger-Kasami (CYK) algorithm parses using a grammar in Chomsky Normal Form.

Time Complexity: O(n³) Space Complexity: O(n²)

Algorithm:

For i = 1 to n:
  For each production A → a:
    If a matches token i:
      Add A to table[i][i]

For length = 2 to n:
  For i = 1 to n - length + 1:
    j = i + length - 1
    For k = i to j - 1:
      For each production A → BC:
        If B in table[i][k] and C in table[k+1][j]:
          Add A to table[i][j]

If start symbol in table[1][n]:
  Accept
Else:
  Reject

Earley Parser

The Earley parser handles arbitrary context-free grammars.

Time Complexity: O(n³) worst case, O(n) for unambiguous grammars Space Complexity: O(n²)

Advantages:

Handles any CFG
Efficient for many practical grammars
Can handle ambiguous grammars

Error Handling

Error Detection

Errors occur when:

No valid production matches
Unexpected token encountered
End of input reached prematurely

Error Recovery

Strategies:

Panic mode: Skip tokens until synchronization point
Phrase level: Replace erroneous token with expected token
Error productions: Add special productions for common errors
Global correction: Find minimal edits to make input valid

Example:

Input: x = 5 +;

Error: Expected expression after +

Recovery options:
1. Skip semicolon and continue
2. Insert default value (0)
3. Report error and stop

Practical Parsing

Lexical Analysis (Tokenization)

Before parsing, input is converted to tokens.

Example:

Input: x = 5 + 3;

Lexer output:
ID(x) ASSIGN NUM(5) PLUS NUM(3) SEMICOLON

Parser Generators

Tools that automatically generate parsers from grammar specifications.

Popular tools:

Yacc/Bison: LALR parser generator
ANTLR: LL(*) parser generator
Flex/Lex: Lexer generator
Lemon: LALR parser generator

Example (Bison):

%token ID NUM PLUS ASSIGN SEMICOLON

%%
program: statement
       | program statement
       ;

statement: ID ASSIGN expression SEMICOLON
         ;

expression: NUM
          | expression PLUS NUM
          ;
%%

Glossary

Parsing: Analyzing structure of input according to grammar
Token: Category of symbol
Lexeme: Actual text of a token
Parse tree: Tree showing parsing structure
AST: Abstract syntax tree
Top-down parsing: Starting from start symbol
Bottom-up parsing: Starting from input tokens
Shift-reduce parsing: Using stack and table
LR parsing: Left-to-right, rightmost derivation
Error recovery: Handling parsing errors

Practice Problems

Problem 1: Parse Tree

Given grammar:

E → E + T | T
T → T * F | F
F → (E) | num

Draw parse tree for “2 + 3 * 4”.

Solution:

Problem 2: Recursive Descent

Implement recursive descent parser for:

S → aSb | ε

Solution:

def parse_S():
  if current_token == 'a':
    consume('a')
    parse_S()
    consume('b')
  elif current_token == 'EOF':
    pass  # ε production
  else:
    error()

Problem 3: Shift-Reduce

Trace shift-reduce parsing for “2 * 3” using grammar from Problem 1.

Solution:

1. Shift 2: [2]
2. Reduce F → num: [F]
3. Reduce T → F: [T]
4. Shift *: [T, *]
5. Shift 3: [T, *, 3]
6. Reduce F → num: [T, *, F]
7. Reduce T → T * F: [T]
8. Reduce E → T: [E]
9. Accept

Problem 4: Error Handling

For input “x = 5 +;” with grammar from Problem 1, describe error and recovery.

Solution: Error: Expected expression after + Recovery: Insert default value or skip to semicolon

Online Platforms

Khan Academy Computer Science - CS fundamentals
Coursera Compilers - Compiler courses
MIT OpenCourseWare - University courses
Brilliant.org Computer Science - Interactive lessons
Stanford CS143 - Compilers course

Interactive Tools

JFLAP - Automata and grammar simulator
Yacc/Bison - Parser generator
ANTLR - Parser generator
Graphviz - Graph visualization
Parse Tree Visualizer - Visualize parse trees

Recommended Books

“Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, Ullman
“Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, Ullman
“Engineering a Compiler” by Cooper & Torczon
“Modern Compiler Implementation” by Appel
“Theory of Computation” by Sipser

Academic Journals

Software Tools

ANTLR - Parser generator
Bison - LALR parser generator
Flex - Lexer generator
Lemon - LALR parser generator
Yacc - Parser generator

Conclusion

Parsing is a fundamental technique for:

Compiler and interpreter construction
Programming language design
Data format processing
Natural language processing
Query language implementation

Understanding parsing algorithms and techniques is essential for building language processors and understanding how computers process structured input.

In the next article, we’ll explore the Chomsky hierarchy, which classifies languages by their computational power.

Next Article: Chomsky Hierarchy

Previous Article: Regular Expressions and Regular Languages

Parsing and Syntax Analysis

Introduction

Parsing Basics

Definition of Parsing

Tokens and Lexemes

Parse Tree vs. Abstract Syntax Tree

Top-Down Parsing

Definition

Recursive Descent Parsing

LL(1) Parsing

Bottom-Up Parsing

Definition

Shift-Reduce Parsing

LR Parsing

Parsing Algorithms

CYK Algorithm

Earley Parser

Error Handling

Error Detection

Error Recovery

Practical Parsing

Lexical Analysis (Tokenization)

Parser Generators

Glossary

Practice Problems

Problem 1: Parse Tree

Problem 2: Recursive Descent

Problem 3: Shift-Reduce

Problem 4: Error Handling

Online Platforms

Interactive Tools

Recommended Books

Academic Journals

Software Tools

Conclusion

Comments

Share this article

👍 Was this article helpful?

Introduction

Parsing Basics

Definition of Parsing

Tokens and Lexemes

Parse Tree vs. Abstract Syntax Tree

Top-Down Parsing

Definition

Recursive Descent Parsing

LL(1) Parsing

Bottom-Up Parsing

Definition

Shift-Reduce Parsing

LR Parsing

Parsing Algorithms

CYK Algorithm

Earley Parser

Error Handling

Error Detection

Error Recovery

Practical Parsing

Lexical Analysis (Tokenization)

Parser Generators

Glossary

Practice Problems

Problem 1: Parse Tree

Problem 2: Recursive Descent

Problem 3: Shift-Reduce

Problem 4: Error Handling

Related Resources

Online Platforms

Interactive Tools

Recommended Books

Academic Journals

Software Tools

Conclusion

Comments

Share this article

👍 Was this article helpful?