CSE411

CSE411: Introduction to Compilers

Lexer (Scanner)

Jooyong Yi (UNIST)

2023 Spring
CSE411

Interface of Lexer

         |-------|            
Source → | Lexer | → tokens 
Program  |       |          
         |-------|          
2023 Spring
CSE411

Token

What is token in the context of a programming language?

2023 Spring
CSE411

Token

A token is the most basic unit in the grammar of a programming language.

2023 Spring
CSE411

Token

How many units do you find in the following C program snippet?

int

(1). 1 unit consisting of int
(2). 2 units consisting of i and nt (or in and t)
(3). 3 units consisting of i, n, and t

2023 Spring
CSE411

Token

How many units do you find in the following C program snippet?

int val = 10;
2023 Spring
CSE411

Lexing (Scanning) Example

                |-------|            
int val = 10; → | Lexer | → int ID ASSIGN NUM ;
                |-------|                          
2023 Spring
CSE411

Lexing (Scanning) Example

                |-------|            
int val = 10; → | Lexer | → int ID ASSIGN NUM ;
                |-------|                          
  • A token is also called a lexical token, hence the name lexer.
2023 Spring
CSE411

How to Write a Lexer?

2023 Spring
CSE411

How to Write a Lexer?

  • Option #1: write a lexer yourself
2023 Spring
CSE411

Alternative Option: Synthesizing a Lexer


             |-------------|    
             |    Lexer    |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
                |-------|            
int val = 10; → | Lexer | → int ID ASSIGN NUM ;
                |-------|                          
2023 Spring
CSE411

Lexer Synthesizer Examples

> man lex

FLEX(1)                               Programming                                              FLEX(1)

NAME
       flex - the fast lexical analyser generator

SYNOPSIS
       flex [OPTIONS] [FILE]...

DESCRIPTION
       Generates programs that perform pattern-matching on text.
2023 Spring
CSE411

Input to Lex

        Specification for the lexer
                    ↓
             |-------------|    
             |    Lexer    |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
                |-------|            
int val = 10; → | Lexer | → int ID ASSIGN NUM ;
                |-------|                          
2023 Spring
CSE411

What to Give as the Specification for Lex?

2023 Spring
CSE411

Specification for the NUM Token

What is a single expression that represents the following?

  • 1
  • 21
  • 411
  • ...
2023 Spring
CSE411

Specification for the NUM Token

What is a single expression that represents the following?

  • 1
  • 21
  • 411
  • ...

(1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)(0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)*

2023 Spring
CSE411

Regular Expression (RE) as the Specification for Lex

        Specification for the lexer written in RE
                    ↓
             |-------------|    
             |    Lexer    |        
             | Synthesizer |
             |-------------|
                    |
               synthesizes   
                    ↓
                |-------|            
int val = 10; → | Lexer | → int ID ASSIGN NUM ;
                |-------|                          
2023 Spring
CSE411

Regular Expressions

  • a: An ordinary character stands for itself.
  • "ab": Quotation, a string in quotes stands for itself literally. E.g., "if"
  • : The empty string
  • : Alternation, choosing from or .
  • [0-9]: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • : Concatenation, an followed by an .
  • : Repetition (zero or more times)
  • : Repetition (one or more times)
  • : Optional, zero or one occurrences of .
2023 Spring
CSE411

Regular Expression Example

What is a single expression that represents the following?

  • 1, 21, 411, ...
  • -1, -21, -411, ...
  • +1, +21, +411, ...
2023 Spring
CSE411

Regular Expression Example

What is a single expression that represents the following?

  • 1, 21, 411, ...
  • -1, -21, -411, ...
  • +1, +21, +411, ...

(-|+)? [1-9] [0-9]*

2023 Spring
CSE411

Regular Expression Example

What is a single expression that represents the following?

  • 1, 21, 411, ...
  • -1, -21, -411, ...
  • +1, +21, +411, ...

(-|+)? [1-9] [0-9]*
(-|+|) [1-9] [0-9]*

2023 Spring
CSE411

Language of a Regular Expression

  • Let be .
  • Then the language of , denoted with , includes: 1, 21, 411, ...
2023 Spring
CSE411

Alphabets, Strings, and Languages

  1. An alphabet is a finite non-empty set of symbols.
    • e.g.,
  2. A string is a finite sequence of symbols chosen from an alphabet.
    • e.g., 1, 21, 411,
  3. Given an alphabet ,
    • : the set of strings over of length .
    • the set of all strings over alphabet :
  4. A language is a subset of where all strings in satisfy the property of .
2023 Spring
CSE411

Languages of Regular Expressions

  • (a) {a}
  • ("ab") {"ab"}
  • ([0-9]) {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
  • ( )
2023 Spring
CSE411

Synthesized Lexer

What does a synthesized lexer look like?

2023 Spring
CSE411

Synthesized Lexer

  • Input: [1-9] [0-9]*
  • Output:
2023 Spring
CSE411

RE and DFA (Deterministic Finite Automaton)

Theory: Let be a regular expression. Then there exists a deterministic finite automaton that accepts .

2023 Spring
CSE411

RE --> DFA

Try the following tool: https://github.com/yakout/compiler

  • Caveat: most programs are not bug-free. Do not blindly rely on the tool.
2023 Spring
CSE411

DFA and NFA (Non-deterministic Finite Automaton)

  • DFA and NFA
    • : a finite set of states.
    • : an input alphabet.
      • For DFA only:
    • : the initial state
    • : a set of final (accepting) states
    • : a transition function.
      • For DFA:
      • For NFA:
2023 Spring
CSE411

RE --> NFA --> DFA

Theory: For every NFA , there exists a DFA such that and accept the same language.

2023 Spring
CSE411

RE --> NFA

  1. RE:
  2. NFA:
2023 Spring
CSE411

RE --> NFA

  1. RE:
  2. NFA:
2023 Spring
CSE411

RE --> NFA

  1. RE:
  2. NFA:
2023 Spring
CSE411

Thompson's Construction

RE -- Thompson's construction --> NFA

2023 Spring
CSE411

NFA --> DFA

How to transform NFA to DFA?

2023 Spring
CSE411

NFA --> DFA

  • Map multiple NFA states to a single DFA state
2023 Spring
CSE411

NFA --> DFA

  • Bisimulate both automata
NFA State DFA State a b
{0, 1, 2, 4, 7} A ? ?
2023 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B ?
{1, 2, 3, 4, 6, 7, 8} B ? ?
2023 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B ? ?
{1, 2, 4, 5, 6, 7} C ? ?
2023 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B ?
{1, 2, 4, 5, 6, 7} C ? ?
2023 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C ? ?
{1, 2, 4, 5, 6, 7, 9} D ? ?
2023 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D ? ?
2023 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B ?
2023 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B E
{1, 2, 4, 5, 6, 7, 10} E ? ?
2023 Spring
CSE411

NFA --> DFA

NFA State DFA State a b
{0, 1, 2, 4, 7} A B C
{1, 2, 3, 4, 6, 7, 8} B B D
{1, 2, 4, 5, 6, 7} C B C
{1, 2, 4, 5, 6, 7, 9} D B E
{1, 2, 4, 5, 6, 7, 10} E B C
2023 Spring
CSE411

Subset Construction Algorithm

  • Input: NFA:
  • Output: DFA

Note that

  • DFA and NFA
    • : a finite set of states.
    • : an input alphabet.
      • For DFA only:
    • : the initial state
    • : a set of final (accepting) states
    • : a transition function.
      • For DFA:
      • For NFA:
2023 Spring
CSE411

How to obtain

  • Input: NFA:
  • Output: DFA

We need a function that satisfies:

2023 Spring
CSE411

Property of

  • (the same is true for and )
2023 Spring
CSE411

Property of

  • (the same is true for and )
2023 Spring
CSE411

-Closure

  • is an -closure.
2023 Spring
CSE411

-Closure

  • Given a set and its -closure ,
2023 Spring
CSE411

-Closure

  • Given a set , its -closure satisfies the following:

Another formal definition (also showing how to compute):

  • Given a set , its -closure is the smallest set satisfying the following:
2023 Spring
CSE411

Subset Construction Algorithm

  • Input: NFA:
  • Output: DFA
d0 := ε-closure({n0})
D := {d0}
W := {d0}  // worklist
while W ≠ ∅:
   remove q from W
   for α in Σ:
      T := ∅
      for s in q:
         T := T ∪ δ_N(s, α)
      T' := ε-closure(T)
      D' := D ∪ {T'}
      δ_D(q, α) = T' // update δ_D
      if T' ∉ D:
         W := W ∪ {T'}
      D := D'
F_D = {q ∊ D | q ∩ F_N ≠ ∅} // define F_D      
2023 Spring
CSE411

Our Example

d0 := ε-closure({n0})
D := {d0}
W := {d0}  // worklist
...

d0 = {0, 1, 2, 4, 7}
W0 = {0, 1, 2, 4, 7}

2023 Spring
CSE411

Our Example

   remove q from W
   for α in Σ:
      T := ∅
      for s in q:
         T := T ∪ δ_N(s, α)
      T' := ε-closure(T)

α = a
T = {3, 8}
T' = {1, 2, 3, 4, 6, 7, 8}

2023 Spring
CSE411

Our Example

      D' := D ∪ {T'}
      δ_D(q, α) = T' // update δ_D
      if T' ∉ D:
         W := W ∪ {T'}

T' = {1, 2, 3, 4, 6, 7, 8}
D = {{0, 1, 2, 4, 7}}
D' = {{0, 1, 2, 4, 7}, {1, 2, 3, 4, 6, 7, 8}}
δ_D({0, 1, 2, 4, 7}, a) = {1, 2, 3, 4, 6, 7, 8}

2023 Spring