CSE411

CSE411: Introduction to Compilers

Parsing #1

Jooyong Yi (UNIST)

2023 Spring
CSE411

Where We Are

The Phases of Compilation

                 |-----------|        |----------|
Source Program → | Front End | → IR → | Back End | → Target Program
                 |-----------|        |----------|

The Phases of Frontend

         |-------|            |--------|           |----------|        |-----------|
Source → | Lexer | → tokens → | Parser | → parse → | Semantic |→ AST → |   IR      | → IR
Program  |       |            |        |   tree    | Analyzer |        | Generator |
         |-------|            |--------|           |----------|        |-----------|
2023 Spring
CSE411

Interface of Parser

                     |--------|           
a stream of tokens → | Parser | → parse tree
                     |--------|         
2023 Spring
CSE411

What does a parser do?

2023 Spring
CSE411

What does a parser do?

  1. Performs syntax checking
    • e.g., If you forgot to end a sentence with the semicolon in a C program, the compiler reports an error.
  2. Generates a parse tree
    • Why tree?
2023 Spring
CSE411

Simplified Interface of Parser

                     |--------|           
a stream of tokens → | Parser | → ok / error
                     |--------|         
2023 Spring
CSE411

Parsing Example

// Written in ANTLR v4.
// In ANTLR, each rule ends with the semicolon. 
INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
PLUS : '+' ;
       |-------|            
10+x → | Lexer | → 
       |-------|                          
2023 Spring
CSE411

Parsing Example

// Written in ANTLR v4.
// In ANTLR, each rule ends with the semicolon. 
INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
PLUS : '+' ;
       |-------|            
10+x → | Lexer | → INT PLUS ID
       |-------|                          
2023 Spring
CSE411

Parsing Example

// Written in ANTLR v4.
// In ANTLR, each rule ends with the semicolon. 
INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
PLUS : '+' ;
              |--------|
INT PLUS ID → | Parser | →  
              |--------|
2023 Spring
CSE411

Parsing Example

start : exp EOF
exp: INT | ID | exp PLUS exp ;

INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
PLUS : '+' ;
              |--------|
INT PLUS ID → | Parser | → ok
              |--------|
2023 Spring
CSE411

Parsing Example

start : exp EOF
exp : INT | ID | exp PLUS exp ;

INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
PLUS : '+' ;
                  |--------|
INT PLUS ID ID →  | Parser | → 
                  |--------|
2023 Spring
CSE411

Parsing Example

start : exp EOF
exp: INT | ID | exp PLUS exp ;

INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
PLUS : '+' ;
                 |--------|
INT PLUS ID ID → | Parser | → error
                 |--------|
2023 Spring
CSE411

How to write a parser?

2023 Spring
CSE411

Synthesizing a Parser

         Specification for the parser
                   ↓
            |-------------|    
            |    Parser   |        
            | Synthesizer |
            |-------------|
                   |
              synthesizes   
                   ↓
              |--------|            
INT PLUS ID → | Parser | → ok
              |--------|                          
2023 Spring
CSE411

Parser Synthesizer Examples

> man yacc

YACC(1)                                                                User Commands                                                                YACC(1)

NAME
       yacc - GNU Project parser generator

SYNOPSIS
       yacc [OPTION]... FILE

DESCRIPTION
       Yacc (Yet Another Compiler Compiler) is a parser generator.  This version is a simple wrapper around bison(1).  It passes option -y, --yacc to activate 
       the upward compatibility mode. See bison(1) for more information.
2023 Spring
CSE411

What to Give as the Specification for Parser?

2023 Spring
CSE411

What to Give as the Specification for Parser?

start : exp EOF
exp: INT | ID | exp PLUS exp ;

INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
PLUS : '+' ;

Is it a regular expression?

2023 Spring
CSE411

Constructing a regular expression

Given an alphabet , a regular expression is constructed using the following rules:

  1. , , and are all regular expressions. These are called primitive regular expressions.
  2. If and are regular expressions, so are , , , , and .
  3. A string is a regular expression if and only if it can be derived from the primitive regular expressions by a finite number of applications of the rules in (2).
2023 Spring
CSE411

Is it a regular expression?

start : exp EOF
exp: INT | ID | '(' exp ')' ;

INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
2023 Spring
CSE411

Is it a regular expression?

start : exp EOF
exp: INT | ID | '(' exp ')' ;

INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
  • The language is a prototypical example of a language that cannot be described by a regular expression.
2023 Spring
CSE411

What formalism do we need?

start : exp EOF
exp: INT | ID | exp '+' exp | '(' exp ')' ;

INT : '0' | ('1'..'9') ('0'..'9')* ;
ID : ( 'a'..'z' | 'A'..'Z' | '_' | '$' ) ( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '$' )* ;
2023 Spring
CSE411

Context-Free Grammar (CFG)

start : exp EOF
exp: INT | ID | exp '+' exp | '(' exp ')' ;
  • CFG
    • : a finite set of non-terminals (e.g., exp).
    • : a finite set of terminal symbols (e.g., '+', 'a', '1')
    • : a start non-terminal (e.g., start)
    • : a finite set of productions.
      • Each production has the form
        • where and
2023 Spring
CSE411

CFG and Syntax Checking

  • Suppose we have a CFG and a program .
  • What does it mean to perform syntax checking for with regard to ?
2023 Spring
CSE411

CFG and Syntax Checking

  • Suppose we have a CFG and a program .
  • No syntax error
    • represents the language of
    • represents all syntactically valid programs
2023 Spring
CSE411

The Language of CFG

// Language G
start : exp EOF
exp: INT | ID | exp '+' exp | '(' exp ')' ;
  • INT '+' ID EOF . Why?
  • INT '+' ID ID EOF . Why?
2023 Spring
CSE411

The Language of CFG

// Language G
start : exp EOF
exp: INT | ID | exp '+' exp | '(' exp ')' ;
  • INT '+' ID EOF . Why?
    • start exp EOF exp '+' exp EOF INT '+' exp EOF
      INT '+' ID EOF
2023 Spring
CSE411

The Language of CFG

// Language G
start : exp EOF
exp: INT | ID | exp '+' exp | '(' exp ')' ;
  • INT '+' ID ID EOF . Why?
    • start INT '+' ID ID EOF
2023 Spring
CSE411

The Language of CFG

2023 Spring
CSE411

Derivation

  • Assume CFG where
  • Suppose you have a string where
  • Then, derives and we write
2023 Spring
CSE411

Derivation

  • When , we can simply write .
  • Notation represents the application of zero or more times.
2023 Spring
CSE411

The Language of CFG

  • Assume CFG

2023 Spring
CSE411

Synthesizing a Parser

                  CFG
                   ↓
            |-------------|    
            |    Parser   |        
            | Synthesizer |
            |-------------|
                   |
              synthesizes   
                   ↓
              |--------|            
    program → | Parser | → ok / error
              |--------|                          
  • Which automata can be used for a parser?
2023 Spring
CSE411

CFG and Pushdown Automata (PDA)

Theory: For any context-free grammar , there exists an non-deterministic pushdown automata (NPDA) accepting .

2023 Spring
CSE411

Non-deterministic PDA (NPDA)

Input
□□□□□□□□
  ↓
|---------|    Stack
| Control | ⇔  |__|
|  Unit   |    |__|
|---------|    |__|
2023 Spring
CSE411

Non-deterministic PDA (NPDA)

  • NPDA
    • : a finite set of internal states of the control unit
    • : the input alphabet
    • : the stack alphabet
    • : the transition function
    • : the initial state of the control unit
    • : the stack start symbol
    • : the set of final states
2023 Spring
CSE411

Acceptance Condition of NPDA

  • Suppose
  • if and
    1. When all symbols of the input are consumed, the final state is reached.
    2. The state of the stack is irrelevant to the acceptance.
2023 Spring
CSE411

NPDA Example

  • NPDA
2023 Spring
CSE411

Deterministic PDA (DPDA)

  • DPDA
    • : a finite set of internal states of the control unit
    • : the input alphabet
    • : the stack alphabet
    • : the transition function
      • If is defined, then should not be defined for every .
    • : the initial state of the control unit
    • : the stack start symbol
    • : the set of final states
2023 Spring
CSE411

RE --> NFA --> DFA

CFG --> NPDA --> DPDA ?

2023 Spring
CSE411

RE --> NFA --> DFA

CFG --> NPDA --> DPDA (This does not work. Why?)

2023 Spring
CSE411

CFG --> NPDA --> DPDA does not work

  • NPDA -X-> DPDA
    • The expressive power of NPDA and DPDA are not equivalent.
    • NPDA is more expressive.
2023 Spring
CSE411

CFG --> NPDA --> DPDA does not work

  • NPDA -X-> DPDA
    • The expressive power of NPDA and DPDA are not equivalent.
    • NPDA is more expressive.
  • We prefer a deterministic algorithm.
  • What conclusion can we draw?
2023 Spring
CSE411

CFG --> NPDA --> DPDA does not work

  • NPDA -X-> DPDA
    • The expressive power of NPDA and DPDA are not equivalent.
    • NPDA is more expressive.
  • We prefer a deterministic algorithm.
  • What conclusion can we draw?
    • We use a restricted CFG that can be handled by a DPDA.
    • We will see various types of restricted CFGs and their DPDAs.
2023 Spring