CSE411

CSE411: Introduction to Compilers

Lexer (Scanner)

Jooyong Yi (UNIST)

2024 Spring
CSE411

Lexing and Parsing

        position  =  initial  +  rate   *   60
                           ↓
        |------------------------------------|
        |      Lexical Analyzer (Lexer)      |
        |------------------------------------|
                           ↓
        <id, 1> <=> <id, 2> <+> <id, 3> <*> <int, 60>
                           ↓
        |------------------------------------|
        |      Syntax Analyzer (Parser)      |
        |------------------------------------|
                           ↓
                       =
                      / \
                <id,1>   +
                        / \
                   <id,2>  *
                          / \
                    <id,3>   <int,60>                   
2024 Spring
CSE411

Lexing

        position  =  initial  +  rate   *   60               source program
                           ↓
        |------------------------------------|
        |      Lexical Analyzer (Lexer)      |
        |------------------------------------|
                           ↓
        <id, 1> <=> <id, 2> <+> <id, 3> <*> <int, 60>        a sequence of tokens
2024 Spring
CSE411

Terminologies

  • Token
    • the most basic unit in the grammar of a programming language.
    • a.k.a., lexeme
2024 Spring
CSE411

Terminologies

  • Token
    • the most basic unit in the grammar of a programming language.
    • a.k.a., lexeme
  • Lexing
    • the process of converting a sequence of characters into a sequence of lexemes.
2024 Spring
CSE411

Learning Objective

  • Understand how to extract tokens from a source program
        position  =  initial  +  rate   *   60               source program
                           ↓
        |------------------------------------|
        |      Lexical Analyzer (Lexer)      | 
        |------------------------------------|
                           ↓
        <id, 1> <=> <id, 2> <+> <id, 3> <*> <int, 60>        a sequence of tokens
2024 Spring
CSE411

How to Write a Lexer?

2024 Spring
CSE411

How to Write a Lexer?

  • Option #1: write a lexer yourself
2024 Spring
CSE411

Alternative Option: Synthesizing a Lexer


             |-------------|    
             |    Lexer    |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
                |-------|            
int val = 10; → | Lexer | → int ID ASSIGN NUM ;
                |-------|                          
2024 Spring
CSE411

Lexer Synthesizer Examples

> man lex

FLEX(1)                               Programming                                              FLEX(1)

NAME
       flex - the fast lexical analyser generator

SYNOPSIS
       flex [OPTIONS] [FILE]...

DESCRIPTION
       Generates programs that perform pattern-matching on text.
2024 Spring
CSE411

Input to Lex

        Specification for the lexer
                    ↓
             |-------------|    
             |    Lexer    |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
                |-------|            
int val = 10; → | Lexer | → int ID ASSIGN NUM ;
                |-------|                          
2024 Spring
CSE411

What to Give as the Specification for Lex?

2024 Spring
CSE411

Specification for the NUM Token

What is a single expression that represents the following?

  • 1
  • 21
  • 411
  • ...
2024 Spring
CSE411

Specification for the NUM Token

What is a single expression that represents the following?

  • 1
  • 21
  • 411
  • ...

(1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)(0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*

2024 Spring
CSE411

Regular Expression (RE) as the Specification for Lex

        Specification for the lexer written in RE
                    ↓
             |-------------|    
             |    Lexer    |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
                |-------|            
int val = 10; → | Lexer | → INT ID ASSIGN NUM ;
                |-------|                          
2024 Spring
CSE411

Regular Expressions

  • a: An ordinary character stands for itself.
  • "ab": Quotation, a string in quotes stands for itself literally. E.g., "if"
  • : The empty string
  • : Alternation, choosing from or .
  • [0-9]: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • : Concatenation, an followed by an .
  • : Repetition (zero or more times)
  • : Repetition (one or more times)
  • : Optional, zero or one occurrences of .
2024 Spring
CSE411

Regular Expression Example

What is a single expression that represents the following?

  • 1, 21, 401, ...
  • -1, -21, -401, ...
  • +1, +21, +401, ...
2024 Spring
CSE411

Regular Expression Example

What is a single expression that represents the following?

  • 1, 21, 401, ...
  • -1, -21, -401, ...
  • +1, +21, +401, ...

(-|+)? [1-9] [0-9]*

2024 Spring
CSE411

Regular Expression Example

What is a single expression that represents the following?

  • 1, 21, 401, ...
  • -1, -21, -401, ...
  • +1, +21, +401, ...

(-|+)? [1-9] [0-9]*
(-|+|) [1-9] [0-9]*

2024 Spring
CSE411

Using RE as the Specification Language for Lex

           Specification for INT: (-|+)?[1-9][0-9]*
        Specification for ASSIGN: =
            Specification for ID: [a-zA-Z] [a-zA-Z0-9]* 
                    ↓
             |-------------|    
             |    Lexer    |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
                |-------|            
     x = -401 → | Lexer | → ID ASSIGN INT
                |-------|                          
2024 Spring
CSE411

Potential Alternative

           Specification for INT: 1, -21, 401, ...
        Specification for ASSIGN: =
            Specification for ID: x, y, x2, xy, ...
                    ↓
             |-------------|    
             |    Lexer    |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
                |-------|            
     x = -401 → | Lexer | → ID ASSIGN INT
                |-------|                          
2024 Spring
CSE411

Language of a Regular Expression

  • Let be (-|+)? [1-9] [0-9]*.
  • Then the language of , denoted with , includes: 1, -21, 401, ...
2024 Spring
CSE411

Program Synthesis

             1, -21, 401, ...
                    ↓
             |-------------|    
             |   Program   |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
           (-|+)?[1-9][0-9]*
2024 Spring
CSE411

Synthesized Lexer

What does a synthesized lexer look like?

2024 Spring
CSE411

Synthesized Lexer

  • Input: [1-9] [0-9]*
  • Output:
2024 Spring
CSE411

Lex Example

> flex scanner.l
> gcc -o scanner lex.yy.c
> echo "x = -401" | ./scanner
2024 Spring
CSE411

RE and DFA (Deterministic Finite Automaton)

  • Input (RE): [1-9] [0-9]*
  • Output (DFA):
2024 Spring
CSE411

RE and DFA (Deterministic Finite Automaton)

Theory: Let be a regular expression. Then there exists a deterministic finite automaton that accepts .

2024 Spring
CSE411

How to turn RE into DFA?

2024 Spring
CSE411

How to turn RE into DFA?

  • RE NFA DFA
2024 Spring
CSE411

DFA and NFA (Non-deterministic Finite Automaton)

2024 Spring
CSE411

DFA and NFA

  • DFA and NFA
    • : a finite set of states.
    • : an input alphabet.
      • For DFA only:
    • : the initial state
    • : a set of final (accepting) states
    • : a transition function.
      • For DFA:
      • For NFA:
2024 Spring
CSE411

RE, DFA and NFA

  • RE NFA DFA
    • For every NFA , there exists a DFA such that and accept the same language.
    • Let be a regular expression. Then there exists a DFA that accepts .
    • Let be a regular expression. Then there exists an NFA that accepts .
2024 Spring
CSE411

RE NFA

  • RE:
  • NFA:
2024 Spring
CSE411

RE NFA

  • RE:
  • NFA:
2024 Spring
CSE411

RE NFA

  • RE:
  • NFA:
2024 Spring
CSE411

RE NFA

  • RE:
  • NFA:
2024 Spring
CSE411

RE NFA

  • RE:
  • NFA:
2024 Spring
CSE411

RE NFA

  • RE:
  • NFA:
2024 Spring
CSE411

NFA DFA

How to transform NFA to DFA?

2024 Spring
CSE411

NFA DFA

  • Map multiple NFA states to a single DFA state
2024 Spring
CSE411

NFA DFA

  • Bisimulate both automata
NFA State DFA State a b
{0, 1, 2, 4, 7} A ? ?
2024 Spring
CSE411

NFA DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B ?
{1, 2, 3, 4, 6, 7, 8} B ? ?
2024 Spring
CSE411

NFA DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B ? ?
{1, 2, 4, 5, 6, 7} C ? ?
2024 Spring
CSE411

NFA DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B ?
{1, 2, 4, 5, 6, 7} C ? ?
2024 Spring
CSE411

NFA DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C ? ?
{1, 2, 4, 5, 6, 7, 9} D ? ?
2024 Spring
CSE411

NFA DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D ? ?
2024 Spring
CSE411

NFA DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B ?
2024 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B E
{1, 2, 4, 5, 6, 7, 10} E ? ?
2024 Spring
CSE411

NFA DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B E
{1, 2, 4, 5, 6, 7, 10} E B C
2024 Spring
CSE411

Subset Construction Algorithm

  • Input: NFA:
  • Output: DFA

Note that

  • DFA and NFA
    • : a finite set of states.
    • : an input alphabet.
      • For DFA only:
    • : the initial state
    • : a set of final (accepting) states
    • : a transition function.
      • For DFA:
      • For NFA:
2024 Spring
CSE411

How to obtain

  • Input: NFA:
  • Output: DFA

We need a function that satisfies:

2024 Spring
CSE411

Property of

  • (the same is true for and )
2024 Spring
CSE411

Property of

  • (the same is true for and )
2024 Spring
CSE411

-Closure

  • is an -closure.
2024 Spring
CSE411

-Closure

  • Given a set and its -closure ,
2024 Spring
CSE411

-Closure

  • Given a set , its -closure satisfies the following:

Another formal definition (also showing how to compute):

  • Given a set , its -closure is the smallest set satisfying the following:
2024 Spring
CSE411

Subset Construction Algorithm

  • Input: NFA:
  • Output: DFA
d0 := ε-closure({n0})
D := {d0}
W := {d0}  // worklist
while W ≠ ∅:
   remove q from W
   for α in Σ:
      T := ∅
      for s in q:
         T := T ∪ δ_N(s, α)
      T' := ε-closure(T)
      D' := D ∪ {T'}
      δ_D(q, α) = T' // update δ_D
      if T' ∉ D:
         W := W ∪ {T'}
      D := D'
F_D = {q ∊ D | q ∩ F_N ≠ ∅} // define F_D      
2024 Spring