Skip to main content
โšก Calmops

Parsing and Syntax Analysis

Introduction

Parsing is the process of analyzing a sequence of symbols (tokens) to determine its grammatical structure. It’s a fundamental component of:

  • Compilers and interpreters
  • Programming language design
  • Natural language processing
  • Data format parsing (JSON, XML, etc.)
  • Query language processing

In this article, we’ll explore parsing techniques, algorithms, and their applications.

Parsing Basics

Definition of Parsing

Parsing is the process of analyzing a string of symbols according to the rules of a formal grammar.

Input: A sequence of tokens (lexemes) Output: A parse tree or abstract syntax tree (AST) Process: Determine if the input is valid and build its structure

Tokens and Lexemes

Token: A category of symbol (e.g., identifier, number, operator) Lexeme: The actual text of a token

Example:

Input: x = 5 + 3;

Tokens:    identifier  assignment  number  plus  number  semicolon
Lexemes:   x            =           5       +     3       ;

Parse Tree vs. Abstract Syntax Tree

Parse Tree: Shows all details of the parsing process, including non-terminals.

Abstract Syntax Tree (AST): Simplified tree showing only essential structure.

Example:

Input: 2 + 3 * 4

Parse Tree:
        E
       / \
      E   +
      |    \
      2     T
           / \
          T   *
          |    \
          3     4

AST:
        +
       / \
      2   *
         / \
        3   4

Top-Down Parsing

Definition

Top-down parsing starts with the start symbol and tries to derive the input string.

Process:

  1. Start with the start symbol
  2. Apply production rules to expand non-terminals
  3. Match terminals with input tokens
  4. Continue until entire input is consumed

Recursive Descent Parsing

Recursive descent parsing uses recursive functions to parse each non-terminal.

Algorithm:

For each non-terminal A:
  Create a function parse_A()
  For each production A โ†’ ฮฑ:
    Try to match ฮฑ
    If successful, return
  If no production matches, error

Example:

Grammar:
E โ†’ T + E | T
T โ†’ F * T | F
F โ†’ (E) | num

Recursive descent parser:
def parse_E():
  parse_T()
  if current_token == '+':
    consume('+')
    parse_E()

def parse_T():
  parse_F()
  if current_token == '*':
    consume('*')
    parse_T()

def parse_F():
  if current_token == '(':
    consume('(')
    parse_E()
    consume(')')
  elif current_token == 'num':
    consume('num')
  else:
    error()

LL(1) Parsing

LL(1) parsing uses a parsing table to determine which production to apply.

LL(1): Left-to-right scan, Leftmost derivation, 1 token lookahead

Advantages:

  • Efficient (O(n) time)
  • Easy to implement
  • Good error messages

Disadvantages:

  • Limited to LL(1) grammars
  • Requires grammar transformation

Bottom-Up Parsing

Definition

Bottom-up parsing starts with the input tokens and tries to reduce them to the start symbol.

Process:

  1. Read input tokens
  2. Find substrings that match production right-hand sides
  3. Replace with left-hand side non-terminal
  4. Continue until only start symbol remains

Shift-Reduce Parsing

Shift-reduce parsing uses a stack and a parsing table.

Operations:

  • Shift: Push next token onto stack
  • Reduce: Replace top of stack with non-terminal
  • Accept: Parsing complete
  • Error: Invalid input

Example:

Grammar:
E โ†’ E + T | T
T โ†’ T * F | F
F โ†’ (E) | num

Input: 2 + 3

Stack operations:
1. Shift 2: [2]
2. Reduce F โ†’ num: [F]
3. Reduce T โ†’ F: [T]
4. Reduce E โ†’ T: [E]
5. Shift +: [E, +]
6. Shift 3: [E, +, 3]
7. Reduce F โ†’ num: [E, +, F]
8. Reduce T โ†’ F: [E, +, T]
9. Reduce E โ†’ E + T: [E]
10. Accept

LR Parsing

LR parsing is a powerful bottom-up parsing technique.

LR(k): Left-to-right scan, Rightmost derivation, k token lookahead

Variants:

  • SLR: Simple LR
  • LALR: Look-ahead LR (most common)
  • LR(1): Full LR

Advantages:

  • Handles most practical grammars
  • Efficient (O(n) time)
  • Good error detection

Disadvantages:

  • Complex to implement
  • Requires parser generator

Parsing Algorithms

CYK Algorithm

The Cocke-Younger-Kasami (CYK) algorithm parses using a grammar in Chomsky Normal Form.

Time Complexity: O(nยณ) Space Complexity: O(nยฒ)

Algorithm:

For i = 1 to n:
  For each production A โ†’ a:
    If a matches token i:
      Add A to table[i][i]

For length = 2 to n:
  For i = 1 to n - length + 1:
    j = i + length - 1
    For k = i to j - 1:
      For each production A โ†’ BC:
        If B in table[i][k] and C in table[k+1][j]:
          Add A to table[i][j]

If start symbol in table[1][n]:
  Accept
Else:
  Reject

Earley Parser

The Earley parser handles arbitrary context-free grammars.

Time Complexity: O(nยณ) worst case, O(n) for unambiguous grammars Space Complexity: O(nยฒ)

Advantages:

  • Handles any CFG
  • Efficient for many practical grammars
  • Can handle ambiguous grammars

Error Handling

Error Detection

Errors occur when:

  • No valid production matches
  • Unexpected token encountered
  • End of input reached prematurely

Error Recovery

Strategies:

  1. Panic mode: Skip tokens until synchronization point
  2. Phrase level: Replace erroneous token with expected token
  3. Error productions: Add special productions for common errors
  4. Global correction: Find minimal edits to make input valid

Example:

Input: x = 5 +;

Error: Expected expression after +

Recovery options:
1. Skip semicolon and continue
2. Insert default value (0)
3. Report error and stop

Practical Parsing

Lexical Analysis (Tokenization)

Before parsing, input is converted to tokens.

Example:

Input: x = 5 + 3;

Lexer output:
ID(x) ASSIGN NUM(5) PLUS NUM(3) SEMICOLON

Parser Generators

Tools that automatically generate parsers from grammar specifications.

Popular tools:

  • Yacc/Bison: LALR parser generator
  • ANTLR: LL(*) parser generator
  • Flex/Lex: Lexer generator
  • Lemon: LALR parser generator

Example (Bison):

%token ID NUM PLUS ASSIGN SEMICOLON

%%
program: statement
       | program statement
       ;

statement: ID ASSIGN expression SEMICOLON
         ;

expression: NUM
          | expression PLUS NUM
          ;
%%

Glossary

  • Parsing: Analyzing structure of input according to grammar
  • Token: Category of symbol
  • Lexeme: Actual text of a token
  • Parse tree: Tree showing parsing structure
  • AST: Abstract syntax tree
  • Top-down parsing: Starting from start symbol
  • Bottom-up parsing: Starting from input tokens
  • Shift-reduce parsing: Using stack and table
  • LR parsing: Left-to-right, rightmost derivation
  • Error recovery: Handling parsing errors

Practice Problems

Problem 1: Parse Tree

Given grammar:

E โ†’ E + T | T
T โ†’ T * F | F
F โ†’ (E) | num

Draw parse tree for “2 + 3 * 4”.

Solution:

        E
       / \
      E   +
      |    \
      2     T
           / \
          T   *
          |    \
          3     4

Problem 2: Recursive Descent

Implement recursive descent parser for:

S โ†’ aSb | ฮต

Solution:

def parse_S():
  if current_token == 'a':
    consume('a')
    parse_S()
    consume('b')
  elif current_token == 'EOF':
    pass  # ฮต production
  else:
    error()

Problem 3: Shift-Reduce

Trace shift-reduce parsing for “2 * 3” using grammar from Problem 1.

Solution:

1. Shift 2: [2]
2. Reduce F โ†’ num: [F]
3. Reduce T โ†’ F: [T]
4. Shift *: [T, *]
5. Shift 3: [T, *, 3]
6. Reduce F โ†’ num: [T, *, F]
7. Reduce T โ†’ T * F: [T]
8. Reduce E โ†’ T: [E]
9. Accept

Problem 4: Error Handling

For input “x = 5 +;” with grammar from Problem 1, describe error and recovery.

Solution: Error: Expected expression after + Recovery: Insert default value or skip to semicolon

Online Platforms

Interactive Tools

  • “Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, Ullman
  • “Introduction to Automata Theory, Languages, and Computation” by Hopcroft, Motwani, Ullman
  • “Engineering a Compiler” by Cooper & Torczon
  • “Modern Compiler Implementation” by Appel
  • “Theory of Computation” by Sipser

Academic Journals

Software Tools

  • ANTLR - Parser generator
  • Bison - LALR parser generator
  • Flex - Lexer generator
  • Lemon - LALR parser generator
  • Yacc - Parser generator

Conclusion

Parsing is a fundamental technique for:

  • Compiler and interpreter construction
  • Programming language design
  • Data format processing
  • Natural language processing
  • Query language implementation

Understanding parsing algorithms and techniques is essential for building language processors and understanding how computers process structured input.

In the next article, we’ll explore the Chomsky hierarchy, which classifies languages by their computational power.


Next Article: Chomsky Hierarchy

Previous Article: Regular Expressions and Regular Languages

Comments