- Writing a compiler gives a student experience with large-scale applications development. Your compiler program may be the largest program you write as a student. Experience working with really big data structures and complex interactions between algorithms will help you out on your next big programming project.
- Compiler writing is one of the shining triumphs of CS theory. It demonstrates the value of theory over the impulse to just "hack up" a solution.
- Compiler writing is a basic element of programming language research. Many language researchers write compilers for the languages they design.
- Many applications have similar properties to one or more phases of a compiler, and compiler expertise and tools can help an application programmer working on other projects besides compilers.
Compilers - a brief overview
Purpose: translate a program in some language (the source language) into a lower-level language (the target language).
Phases:
- Lexical Analysis:
- Converts a sequence of characters into words, or tokens
- Syntax Analysis:
- Converts a sequence of tokens into a parse tree
- Semantic Analysis:
- Manipulates parse tree to verify symbol and type information
- Intermediate Code Generation:
- Converts parse tree into a sequence of intermediate code instructions
- Optimization:
- Manipulates intermediate code to produce a more efficient program
- Final Code Generation:
- Translates intermediate code into final (machine/assembly) code
Example of the Compilation Process
Consider the example from Figure 1.10 on p. 13 of the book in detail.
position = initial + rate * 60
The 30 or so characters of this single line of source code are first transformed by lexical analysis into a sequence of 7 tokens. Those tokens are then used to build a tree of height 4 during syntax analysis. Semantic analysis may transform the tree into one of height 5, which includes the type conversion necessary for real addition on an integer operand. Intermediate code generation uses a simple traversal algorithm to linearize the tree back into a sequence of machine-independent three-address-code instructions.
t1 = inttoreal(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Optimization of the intermediate code allows the four instructions to be reduced to two machine-independent instructions. Final code generation might implement these two instructions using 5 machine instructions, in which the actual registers and addressing modes of the CPU are utilized.
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Overview of Lexical Analysis
Lexical Analyzer a.k.a. scanner
- Convert from a sequence of characters into a (shorter) sequence of tokens
- Identify and categorize specific character sequences into tokens
- Skip comments & whitespace
- Handle lexical errors (illegal characters, malformed tokens)
- Efficiency is crucial; scanner may perform elaborate input buffering
- Tokens are specified as regular expressions, e.g. IDENTIFIER=[a-zA-Z][a-zA-Z0-9]*
- Lexical Analyzers are implemented by DFAs.
The term "token" usually refers to an object (or struct) containing complete information about a single lexical entity, but it is often also used to refer the category ("class" if you prefer) of that entity. The term "lexeme" denotes the actual string of characters that comprise a particular occurrence ("instance" if you like) of a token.
Regular Expressions
The notation we use to precisely capture all the variations that a given category of token may take is called a "regular expression" (or, less formally, a "pattern"; the word "pattern" is really vague, and there are lots of other notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of strings. In order to even talk about "strings", you first have to define an alphabet, the set of characters which can appear.
- Epsilon is a regular expression denoting the set containing the empty string
- Any letter in the alphabet is also a regular expression denoting the set containing a one-letter string consisting of that letter.
- For regular expressions r and s,
r | s
is a regular expression denoting the union of r and s
- For regular expressions r and s,
r s
is a regular expression denoting the set of strings consisting of a member of r followed by a member of s
- For regular expression r,
r*
is a regular expression denoting the set of strings consisting of zero or more occurrences of r.
- You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like plus, concatenation is like times, and closure is like exponentiation)
Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions:
- For regular expression r,
r+
is a regular expression denoting the set of strings consisting of one or more occurrences of r. Equivalent to rr*
- For regular expression r,
r?
is a regular expression denoting the set of strings consisting of zero or one occurrence of r. Equivalent to r|epsilon
- The notation [abc] is short for a|b|c. [a-z] is short for a|b|...|z
Finite Automata
A finite automaton is an abstract, mathematical machine, also known as a finite state machine, with the following components:
- A set of states S
- A set of input symbols E (the alphabet)
- A transition function move(state, symbol) : new state(s)
- A start state S0
- A set of final states F
For a deterministic finite automaton (DFA), the function move(state, symbol) goes to at most one state, and symbol is never epsilon. Finite automata correspond in a 1:1 relationship to transition diagrams; from any transition diagram one can write down the formal automaton in terms of items #1-#5 above, and vice versa. To draw the transition diagram for a finite automaton:
- draw a circle for each state s in S; put a label inside the circles to identify each state by number or name
- draw an arrow between Si and Sj, labeled with x whenever the transition says to move(Si, x) : Sj
- draw a "wedgie" into the start state S0 to identify it
- draw a second circle inside each of the final states in F
DFA Implementation
The nice part about DFA's is that they are efficiently implemented on computers. What DFA does the following code correspond to? What is the corresponding regular expression? You can speed this code fragment up even further if you are willing to use goto's or write it in assembler.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
   int state = 0, input = getchar();
   for (;;)
      switch (state) {
      case 0:
         switch (input) {
         case 'a': state = 1; input = getchar(); break;
         case 'b': input = getchar(); break;
         default: printf("dfa error\n"); exit(1);
         }
         break;
      case 1:
         switch (input) {
         case EOF: printf("accept\n"); exit(0);
         default: printf("dfa error\n"); exit(1);
         }
      }
}
Nondeterministic Finite Automata (NFA's)
Notational convenience motivates more flexible machines in which the function move() can go to more than one state on a given input symbol, and some states can move to other states even without consuming an input symbol. Fortunately, one can prove that for any NFA there is an equivalent DFA; NFA's are just a notational convenience. So, finite automata help us get from a set of regular expressions to a computer program that recognizes them efficiently.
Regular expressions can be converted automatically to NFA's
Each rule in the definition of regular expressions has a corresponding NFA; NFA's are composed using epsilon transitions. This is cited in the text as "Thompson's construction" (Algorithm 3.3). We will work examples such as (a|b)*abb in class and during lab.
- For epsilon, draw two states with a single epsilon transition.
- For any letter in the alphabet, draw two states with a single transition labeled with that letter.
- For regular expressions r and s, draw r | s by adding a new start state with epsilon transitions to the start states of r and s, and a new final state with epsilon transitions from each final state in r and s.
- For regular expressions r and s, draw rs by adding epsilon transitions from the final states of r to the start state of s.
- For regular expression r, draw r* by adding new start and final states, and epsilon transitions (a) from the start state to the final state, (b) from the final state back to the start state, (c) from the new start to the old start and from the old final states to the new final state.
- For a parenthesized regular expression (r) you can use the NFA for r.
NFA's can be converted automatically to DFA's
In: NFA N
Out: DFA D
Method: Construct the transition table Dtran (a.k.a. the "move function"). Each DFA state is a set of NFA states. Dtran simulates in parallel all possible moves N can make on a given string.
Operations to keep track of sets of NFA states:
- e_closure(s)
- set of states reachable from state s via epsilon
- e_closure(T)
- set of states reachable from any state in set T via epsilon
- move(T,a)
- set of states to which there is an NFA transition from states in T on symbol a
Algorithm:
Dstates := {e_closure(start_state)}
while T := unmarked_member(Dstates) do {
mark(T)
for each input symbol a do {
U := e_closure(move(T,a))
if not member(Dstates, U) then
insert(Dstates, U)
Dtran[T,a] := U
}
}
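As a small illustration (the NFA below is chosen for brevity and is not from the text), consider an NFA for a*b with states 0-4, start state 0, final state 4, and transitions 0 -epsilon-> 1, 0 -epsilon-> 3, 1 -a-> 2, 2 -epsilon-> 1, 2 -epsilon-> 3, 3 -b-> 4. The construction proceeds:
A = e_closure({0}) = {0,1,3}
move(A,a) = {2}, so Dtran[A,a] = e_closure({2}) = {1,2,3} = B
move(A,b) = {4}, so Dtran[A,b] = e_closure({4}) = {4}     = C
move(B,a) = {2}, so Dtran[B,a] = B
move(B,b) = {4}, so Dtran[B,b] = C
move(C,a) = move(C,b) = {}, so C has no outgoing transitions
C is the only accepting DFA state (it contains NFA final state 4), and the resulting three-state DFA accepts exactly the strings matching a*b.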
lex(1) and flex(1)
These programs generally take a lexical specification given in a .l file and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler. The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.
FILE *yyin; /* set this variable prior to calling yylex() */
int yylex(); /* call this function once for each token */
char yytext[]; /* yylex() writes the token's lexeme to this array */
The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections separated by %%:
header
%%
body
%%
helper functions
The header consists of C code fragments enclosed in %{ and %} as well as macro definitions consisting of a name and a regular expression denoted by that name. lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.
letter [a-zA-Z]
digit [0-9]
ident {letter}({letter}|{digit})*
The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions, this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:
" " { /* no-op, discard whitespace */ }
{ident} { return IDENTIFIER; }
"*" { return ASTERISK; }
You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters. The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals.
Lex extended regular expressions
Lex further extends the regular expressions with several helpful operators. Lex's regular expressions include:
- c
- normal characters mean themselves
- \c
- backslash escapes remove the meaning from most operator characters. Inside character sets and quotes, backslash performs C-style escapes.
- "s"
- Double quotes mean to match the C string given as itself. This is particularly useful for multi-byte operators and may be more readable than using backslash multiple times.
- [s]
- This character set operator matches any one character among those in s.
- [^s]
- A negated-set matches any one character not among those in s.
- .
- The dot operator matches any one character except newline: [^\n]
- r*
- match r 0 or more times.
- r+
- match r 1 or more times.
- r?
- match r 0 or 1 time.
- r{m,n}
- match r between m and n times.
- r1r2
- concatenation. match r1 followed by r2
- r1|r2
- alternation. match r1 or r2
- (r)
- parentheses specify precedence but do not match anything
- r1/r2
- lookahead. match r1 when r2 follows, without consuming r2
- ^r
- match r only when it occurs at the beginning of a line
- r$
- match r only when it occurs at the end of a line
Lexical Attributes and Token Objects
Besides the token's category, the rest of the compiler may need several pieces of information about a token in order to perform semantic analysis, code generation, and error handling. These are stored in an object instance of class Token, or in C, a struct. The fields are generally something like:
struct token {
int category;
char *text;
int linenumber;
int column;
char *filename;
union literal value;
};
The union literal will hold computed values of integers, real numbers, and strings.
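As an illustration of how such an object might get filled in by the scanner, here is a sketch only: the helper name maketoken() is made up for the example, and yylineno assumes %option yylineno in the .l file.
#include <stdlib.h>
#include <string.h>

extern char yytext[];    /* lexeme of the token just matched, from lex */
extern int yylineno;     /* current line number; requires %option yylineno */

/* Build a token object for the lexeme just matched by yylex(); a semantic
   action would call this and hand the result to the parser.  The column
   and value fields are left unset in this sketch. */
struct token *maketoken(int category, char *filename)
{
   struct token *t = malloc(sizeof(struct token));
   t->category = category;
   t->text = strdup(yytext);
   t->linenumber = yylineno;
   t->filename = filename;
   return t;
}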
Lexical Analysis and the Symbol Table
In many compilers, the symbol table and memory management components of the compiler interact with several phases of compilation, starting with lexical analysis.
- Efficient storage is necessary to handle large input files.
- There is a colossal amount of duplication in lexical data: variable names, strings and other literal values duplicate frequently
- What token type to use may depend on previous declarations.
A hash table or other efficient data structure can avoid this duplication. The software engineering design pattern to use is called the "flyweight". Example: Figure 3.18, p. 109. Use "install_id()" instead of "strdup()" to avoid duplication in the lexical data.
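A minimal sketch of what install_id() might look like, assuming a simple chained hash table (the table size, hash function, and field names are arbitrary choices for the example):
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024

/* Flyweight string table: every identical lexeme is stored exactly once. */
struct entry { char *s; struct entry *next; } *bucket[NBUCKETS];

static unsigned hash(char *s)
{
   unsigned h = 0;
   while (*s) h = h * 31 + *s++;
   return h % NBUCKETS;
}

char *install_id(char *lexeme)
{
   unsigned h = hash(lexeme);
   struct entry *e;
   for (e = bucket[h]; e != NULL; e = e->next)
      if (strcmp(e->s, lexeme) == 0)
         return e->s;              /* already present: share the stored copy */
   e = malloc(sizeof *e);
   e->s = strdup(lexeme);
   e->next = bucket[h];
   bucket[h] = e;
   return e->s;
}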
Syntax Analysis
Parsing is the act of performing syntax analysis to verify an input program's compliance with the source language. A by-product of this process is typically a tree that represents the structure of the program.
Context Free Grammars
A context free grammar G has:
- A set of terminal symbols, T
- A set of nonterminal symbols, N
- A start symbol, s, which is a member of N
- A set of production rules of the form A -> w, where A is a nonterminal and w is a string of terminal and nonterminal symbols.
A context free grammar can be used to generate strings in the corresponding language as follows:
let X = the start symbol s
while there is some nonterminal Y in X do
apply any one production rule using Y, e.g. Y -> w
When X consists only of terminal symbols, it is a string of the language denoted by the grammar. Each iteration of the loop is a derivation step. If an iteration has several nonterminals to choose from at some point, the rules of derivation would allow any of these to be applied. In practice, parsing algorithms tend to always choose the leftmost nonterminal, or the rightmost nonterminal, resulting in strings that are leftmost derivations or rightmost derivations.
Grammar Ambiguity
- The grammar
E -> E + E
E -> E * E
E -> ( E )
E -> ident
allows two different derivations for strings such as "x + y * z". The grammar is ambiguous, but the semantics of the language dictate a particular operator precedence that should be used. One way to eliminate such ambiguity is to rewrite the grammar. For example, we can force the precedence we want by adding some nonterminals and production rules.
E -> E + T
E -> T
T -> T * F
T -> F
F -> ( E )
F -> ident
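To see the precedence forced by the rewritten grammar, here is one leftmost derivation of "x + y * z" (each step expands the leftmost nonterminal):
E => E + T
  => T + T
  => F + T
  => x + T
  => x + T * F
  => x + F * F
  => x + y * F
  => x + y * z
The multiplication ends up grouped under a single T, below the addition, which is exactly the precedence we wanted.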
Recursive Descent Parsing
- Perhaps the simplest parsing method, for a large subset of context free grammars, is called recursive descent. It is simple because the algorithm closely follows the production rules of nonterminal symbols.
- Write 1 procedure per nonterminal rule
- Within each procedure, a) match terminals at appropriate positions, and b) call procedures for non-terminals.
- Pitfalls: left recursion is FATAL; must distinguish between several production rules, or potentially, one has to try all of them via backtracking.
Recursive Descent Parsing Example #1
- The grammar
S -> A B C
A -> a A
A -> epsilon
B -> b
C -> c
maps to code like
procedure S()
if A() & B() & C() then succeed
end
procedure A()
if currtoken == a then # use production 2
currtoken = scan()
return A()
else
succeed # production rule 3, match epsilon
end
procedure B()
if currtoken == b then
currtoken = scan()
succeed
else fail
end
procedure C()
if currtoken == c then
currtoken = scan()
succeed
else fail
end
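A sketch of the same parser in C follows; the single-character tokens and the scan() stand-in for the real lexical analyzer are simplifications made for the example:
#include <stdio.h>
#include <stdlib.h>

/* Recursive descent for S -> A B C, A -> a A | epsilon, B -> b, C -> c.
   Tokens are just the characters 'a', 'b', 'c'. */
int currtoken;

int scan(void) { return getchar(); }

void fail(void) { printf("parse error\n"); exit(1); }

void A(void)
{
   if (currtoken == 'a') {        /* production A -> a A */
      currtoken = scan();
      A();
   }                              /* else: production A -> epsilon, match nothing */
}

void B(void)
{
   if (currtoken == 'b') currtoken = scan();
   else fail();
}

void C(void)
{
   if (currtoken == 'c') currtoken = scan();
   else fail();
}

void S(void) { A(); B(); C(); }

int main(void)
{
   currtoken = scan();
   S();
   if (currtoken == EOF || currtoken == '\n') printf("accept\n");
   else fail();
   return 0;
}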
Removing Left Recursion
E -> E + T | T
T -> T * F | F
F -> ( E ) | ident
We can remove the left recursion by introducing new nonterminals and new production rules.
E -> T E'
E' -> + T E' | epsilon
T -> F T'
T' -> * F T' | epsilon
F -> ( E ) | ident
Getting rid of such immediate left recursion is not enough; one must also get rid of indirect left recursion, where two or more nonterminals are mutually left-recursive. One can rewrite any CFG to remove left recursion (Algorithm 4.1):
for i := 1 to n do
    for j := 1 to i-1 do begin
        replace each production of the form Ai -> Aj gamma
        with the productions Ai -> delta1 gamma | delta2 gamma | ... | deltak gamma,
        where Aj -> delta1 | delta2 | ... | deltak are the current Aj productions
    end
    eliminate the immediate left recursion among the Ai productions
Backtracking?
- Current token could begin more than one of your possible production rules? Try all of them, remember and reset state for each try.
S -> cAd
A -> ab
A -> a
Left factoring can often solve such problems:
S -> cAd
A -> a A'
A'-> b
A'-> (epsilon)
One can also perform left factoring (Algorithm 4.2) to reduce or eliminate the lookahead or backtracking needed to tell which production rule to use. If the end result has no lookahead or backtracking needed, the resulting CFG can be solved by a "predictive parser" and coded easily in a conventional language. If backtracking is needed, a recursive descent parser takes more work to implement, but is still feasible. As a more concrete example:
S -> if E then S
S -> if E then S1 else S2
can be factored to:
S -> if E then S S'
S'-> else S2 | epsilon
Some More Parsing Theory
- Automatic techniques for constructing parsers start with computing some basic functions for symbols in the grammar. These functions are useful in understanding both recursive descent and bottom-up LR parsers.
First(a)
- First(a) is the set of terminals that begin strings derived from a, which can include epsilon.
- First(X) starts with the empty set.
- if X is a terminal, First(X) is {X}.
- if X -> epsilon is a production, add epsilon to First(X).
- if X is a non-terminal and X -> Y1 Y2 ... Yk is a production, add First(Y1) to First(X).
for (i = 1; Yi can derive epsilon; i++)
add First(Yi+1) to First(X)
Follow(A)
- Follow(A) for nonterminal A is the set of terminals that can appear immediately to the right of A in some sentential form. To compute Follow, apply these rules to all nonterminals in the grammar:
- Add $ to Follow(S)
- if A -> aBb then add First(b) - epsilon to Follow(B)
- if A -> aB or A -> aBb where epsilon is in First(b), then add Follow(A) to Follow(B).
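As a worked example, applying these First and Follow rules to the non-left-recursive expression grammar given earlier (E -> T E', E' -> + T E' | epsilon, T -> F T', T' -> * F T' | epsilon, F -> ( E ) | ident) gives:
First(F)  = { (, ident }
First(T') = { *, epsilon }
First(T)  = First(F) = { (, ident }
First(E') = { +, epsilon }
First(E)  = First(T) = { (, ident }

Follow(E)  = { $, ) }
Follow(E') = Follow(E) = { $, ) }
Follow(T)  = { +, $, ) }
Follow(T') = Follow(T) = { +, $, ) }
Follow(F)  = { *, +, $, ) }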
Bottom Up Parsing
- Bottom up parsers start from the sequence of terminal symbols and work their way back up to the start symbol by repeatedly replacing grammar rules' right hand sides by the corresponding non-terminal. This is the reverse of the derivation process, and is called "reduction".
Example. For the grammar
(1) S->aABe
(2) A->Abc
(3) A->b
(4) B->d
the string "abbcde" can be parsed bottom-up by the following reduction steps:abbcde
aAbcde
aAde
aABe
S
Definition: a handle is a substring that
- matches a right hand side of a production rule in the grammar and
- whose reduction to the nonterminal on the left hand side of that grammar rule is a step along the reverse of a rightmost derivation.
Shift Reduce Parsing
- A shift-reduce parser performs its parsing using the following structure
Stack Input
$ w$
At each step, the parser performs one of the following actions.
- Shift one symbol from the input onto the parse stack
- Reduce one handle on the top of the parse stack. The symbols from the right hand side of a grammar rule are popped off the stack, and the nonterminal symbol is pushed on the stack in their place.
- Accept is the operation performed when the start symbol is alone on the parse stack and the input is empty.
- Error actions occur when no successful parse is possible.
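For the grammar and the input "abbcde" from the bottom-up parsing example above, a shift-reduce parse proceeds roughly as follows ($ marks the bottom of the stack and the end of the input):
Stack     Input      Action
$         abbcde$    shift
$a        bbcde$     shift
$ab       bcde$      reduce A -> b
$aA       bcde$      shift
$aAb      cde$       shift
$aAbc     de$        reduce A -> Abc
$aA       de$        shift
$aAd      e$         reduce B -> d
$aAB      e$         shift
$aABe     $          reduce S -> aABe
$S        $          accept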
LR Parsers
- LR denotes a class of bottom-up parsers that is capable of handling virtually all programming language constructs. LR is efficient; it runs in linear time with no backtracking needed. The class of languages handled by LR is a proper superset of the class of languages handled by top-down "predictive parsers". LR parsing detects an error as soon as it is possible to do so. Generally, building an LR parser is too big and complicated a job to do by hand, so we use tools to generate LR parsers.
The LR parsing algorithm is given below. See Figure 4.29 for a schematic.
ip = first symbol of input
repeat {
s = state on top of parse stack
a = *ip
case action[s,a] of {
SHIFT s': { push(a); push(s') }
REDUCE A->beta: {
pop 2*|beta| symbols; s' = new state on top
push A
push goto(s', A)
}
ACCEPT: return 0 /* success */
ERROR: { error("syntax error", s, a); halt }
}
}
Conflicts in Shift-Reduce Parsing
- "Conflicts" occur when an ambiguity in the grammar creates a situation where the parser does not know which step to perform at a given point during parsing. There are two kinds of conflicts that occur.
- shift-reduce
- a shift reduce conflict occurs when the grammar indicates that different successful parses might occur with either a shift or a reduce at a given point during parsing. The vast majority of situations where this conflict occurs can be correctly resolved by shifting.
- reduce-reduce
- a reduce reduce conflict occurs when the parser has two or more handles at the same time on the top of the stack. Whatever choice the parser makes is just as likely to be wrong as not. In this case it is usually best to rewrite the grammar to eliminate the conflict, possibly by factoring.
Example shift reduce conflict:
S->if E then S
S->if E then S else S
In many languages two nested "if" statements produce a situation where an "else" clause could legally belong to either "if". The usual rule (to shift) attaches the else to the nearest (i.e. inner) if statement. Example reduce reduce conflict:
(1) S -> id LP plist RP
(2) S -> E GETS E
(3) plist -> plist, p
(4) plist -> p
(5) p -> id
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist, E
(9) elist -> E
By the time the stack holds ...id LP id
the parser will not know which rule to use to reduce the id: (5) or (7).
- "Conflicts" occur when an ambiguity in the grammar creates a situation where the parser does not know which step to perform at a given point during parsing. There are two kinds of conflicts that occur.
- shift-reduce
- a shift reduce conflict occurs when the grammar indicates that different successful parses might occur with either a shift or a reduce at a given point during parsing. The vast majority of situations where this conflict occurs can be correctly resolved by shifting.
- reduce-reduce
- a reduce reduce conflict occurs when the parser has two or more handles at the same time on the top of the stack. Whatever choice the parser makes is just as likely to be wrong as not. In this case it is usually best to rewrite the grammar to eliminate the conflict, possibly by factoring.
S->if E then S S->if E then S else S
In many languages two nested "if" statements produce a situation where an "else" clause could legally belong to either "if". The usual rule (to shift) attaches the else to the nearest (i.e. inner) if statement. Example reduce reduce conflict:(1) S -> id LP plist RP (2) S -> E GETS E (3) plist -> plist, p (4) plist -> p (5) p -> id (6) E -> id LP elist RP (7) E -> id (8) elist -> elist, E (9) elist -> E
By the point the stack holds ...id LP id
the parser will not know which rule to use to reduce the id: (5) or (7).
Constructing SLR Parsing Tables:
- Definition: An LR(0) item of a grammar G is a production of G with a dot at some position of the RHS.
Example: The production A->aAb gives the items:
A->.aAb A->a.Ab A->aA.b A->aAb.
Note: A production A -> epsilon generates only one item:
A->.
Intuition: an item A -> a.b denotes:
- a - we have already seen a string derivable from a
- b - we hope to see a string derivable from b
Functions on Sets of Items
- Closure: if I is a set of items for a grammar G, then closure(I) is the set of items constructed as follows:
- Every item in I is in closure(I).
- If A -> a.Bb is in closure(I) and B -> g is a production, then add B -> .g to closure(I).
These two rules are applied repeatedly until no new items can be added.
Intuition: If A -> a.Bb e closure(I) then we hope to see a string derivable from B in the input. So if B -> g is a production, we should hope to see a string derivable from g. Hence, B -> .g e closure(I).
Goto: if I is a set of items and X is a grammar symbol, then goto(I,X) is defined to be:
goto(I,X) = closure({[A -> aX.b] | [A -> a.Xb] e I})
Intuition:
- [A->a.Xb] e I => we've seen a string derivable from a; we hope to see a string derivable from Xb.
- Now suppose we see a string derivable from X
- Then, we should "goto" a state where we've seen a string derivable from aX, and where we hope to see a string derivable from b. The item corresponding to this is [A->aX. b]
- Example: Consider the grammar
E -> E+T | T
T -> T*F | F
F -> (E) | id
Let I = {[E -> E.+T]}, then: goto(I,+) = closure({[E -> E+.T]})
= closure({[E -> E+.T], [T -> .T*F], [T -> .F]})
= closure({[E -> E+.T], [T -> .T*F], [T -> .F], [F -> .(E)], [F -> .id]})
= { [E -> E + .T],[T -> .T * F],[T -> .F],[F -> .(E)],[F -> .id]}
The Sets of Items Construction
- Given a grammar G with start symbol S, construct the augmented grammar by adding a special production S'->S where S' does not appear in G.
- Algorithm for constructing the canonical collection of LR(0) items for an augmented grammar G':
begin
C := { closure({[S' -> .S]}) };
repeat
for each set of items I e C:
for each grammar symbol X:
if goto(I,X) != {} and goto(I,X) !e C then
add goto(I,X) to C;
until no new sets of items can be added to C;
return C;
end
Valid items: an item A -> beta1 . beta2 is valid for a viable prefix alpha beta1 if there is a derivation:
S' =>*rm alpha A w =>rm alpha beta1 beta2 w
Suppose A -> beta1 . beta2 is valid for alpha beta1, and alpha beta1 is on the parsing stack:
- if beta2 != epsilon, we should shift
- if beta2 = epsilon, A -> beta1 is the handle, and we should reduce by this production
Note: two valid items may tell us to do different things for the same viable prefix. Some of these conflicts can be resolved using lookahead on the input string.
Constructing an SLR Parsing Table
- Given a grammar G, construct the augmented grammar by adding the production S' -> S.
- Construct C = {I0, I1, … In}, the set of sets of LR(0) items for G'.
- State i is constructed from Ii, with the parsing action determined as follows:
- [A -> alpha . a beta] e Ii, where a is a terminal, and goto(Ii,a) = Ij : set action[i,a] = "shift j"
- [A -> alpha .] e Ii : set action[i,a] to "reduce A -> alpha" for all a e FOLLOW(A), where A != S'
- [S' -> S.] e Ii : set action[i,$] to "accept"
- goto transitions are constructed as follows: for all non-terminals A, if goto(Ii, A) = Ij, then goto[i,A] = j
- All entries not defined by (3) and (4) are made "error". If there are any multiply defined entries, the grammar is not SLR.
- The initial state of the parser is the one constructed from the set of items containing [S' -> .S].
Example:
S -> aABe    FIRST(S)  = {a}    FOLLOW(S)  = {$}
A -> Abc     FIRST(A)  = {b}    FOLLOW(A)  = {b,d}
A -> b       FIRST(B)  = {d}    FOLLOW(B)  = {e}
B -> d       FIRST(S') = {a}    FOLLOW(S') = {$}
I0 = closure([S'->.S])
   = closure([S'->.S],[S->.aABe])
goto(I0,S) = closure([S'->S.]) = I1
goto(I0,a) = closure([S->a.ABe])
           = closure([S->a.ABe],[A->.Abc],[A->.b]) = I2
goto(I2,A) = closure([S->aA.Be],[A->A.bc])
           = closure([S->aA.Be],[A->A.bc],[B->.d]) = I3
goto(I2,b) = closure([A->b.]) = I4
goto(I3,B) = closure([S->aAB.e]) = I5
goto(I3,b) = closure([A->Ab.c]) = I6
goto(I3,d) = closure([B->d.]) = I7
goto(I5,e) = closure([S->aABe.]) = I8
goto(I6,c) = closure([A->Abc.]) = I9
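Filling in the parse table from these item sets and FOLLOW sets gives the following actions (all other entries are "error"); this table is derived here as an illustration and is not reproduced from the text:
action[0,a] = shift 2              goto[0,S] = 1
action[1,$] = accept
action[2,b] = shift 4              goto[2,A] = 3
action[3,b] = shift 6
action[3,d] = shift 7              goto[3,B] = 5
action[4,b] = action[4,d] = reduce A -> b
action[5,e] = shift 8
action[6,c] = shift 9
action[7,e] = reduce B -> d
action[8,$] = reduce S -> aABe
action[9,b] = action[9,d] = reduce A -> Abc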
YACC
- YACC files end in .y and take the form
declarations
%%
grammar
%%
subroutines
The declarations section defines the terminal symbols (tokens) and nonterminal symbols. The most useful declarations are:
- %token a
- declares terminal symbol a; YACC can generate a set of #define's that map these symbols onto integers, in a y.tab.h file
- %start A
- specifies the start symbol for the grammar (defaults to nonterminal on left side of the first production rule).
The grammar gives the production rules, interspersed with program code fragments called semantic actions that let the programmer do what's desired when the grammar productions are reduced. They follow the syntax
A : body ;
where body is a sequence of 0 or more terminals, nonterminals, or semantic actions (code, in curly braces) separated by spaces. As a notational convenience, multiple production rules may be grouped together using the vertical bar (|).
The Value Stack
- YACC's parse stack contains only states
- YACC maintains a parallel set of values
- $ is used in semantic actions to name elements on the value stack
- $$ denotes the value associated with the LHS (nonterminal) symbol
- $n denotes the value associated with RHS symbol at position n.
- Value stack typically used to construct the parse tree
- Typical rule with semantic action: A : b C d { $$ = tree(R,3,$1,$2,$3); }
- The default value stack is an array of integers
- The value stack can hold arbitrary values in an array of unions
- The union type is declared with %union and is named YYSTYPE
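A minimal sketch showing how %union, token types, and the value stack fit together in a .y file; the struct tree node type and the tree() constructor are hypothetical helpers for building parse tree nodes, not part of YACC:
%{
/* Sketch only: tree() is a hypothetical parse tree constructor. */
struct tree;
struct tree *tree(int rule, int nkids, ...);
int yylex(void);
void yyerror(char *s);
%}

%union {
   struct tree *node;
}

%token <node> IDENT NUMBER PLUS
%type  <node> expr term

%%
expr : expr PLUS term   { $$ = tree(1, 3, $1, $2, $3); }
     | term             { $$ = $1; }
     ;
term : IDENT            { $$ = $1; }
     | NUMBER           { $$ = $1; }
     ;
For $1 of IDENT or NUMBER to be meaningful, the scanner's semantic actions must assign yylval.node before returning each token.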
YACC precedence and associativity declarations
- YACC headers can specify precedence and associativity rules for otherwise heavily ambiguous grammars. Precedence is determined by increasing order of these declarations. Example:
%right ASSIGN
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%%
expr: expr ASSIGN expr
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr POWER expr
;
YACC error handling and recovery
- Use the special predefined token error where errors are expected
- On an error, the parser pops states until it enters one that has an action on the error token.
- For example: statement: error ';' ;
- The parser must see 3 good tokens before it decides it has recovered.
- yyerrok tells the parser to skip the 3-token recovery rule
- yyclearin throws away the current (error-causing?) token
- yyerror(s) is called when a syntax error occurs (s is the error message)
Semantic Analysis
- Semantic ("meaning") analysis refers to a phase of compilation in which the input program is studied in order to determine what operations are to be carried out. The two primary components of a classic semantic analysis phase are variable reference analysis and type checking. These components both rely on an underlying symbol table.
What we have at the start of semantic analysis is a parse tree built during syntax analysis. In principle, semantic rules can be written as attribute grammar definitions; they can have all the synthesized attributes they want. In practice, attributes get stored in parse tree nodes and the semantic rules are evaluated either (a) during parsing (for easy rules) or (b) during one or more (sub)tree traversals.
- Semantic ("meaning") analysis refers to a phase of compilation in which the input program is studied in order to determine what operations are to be carried out. The two primary components of a classic semantic analysis phase are variable reference analysis and type checking. These components both rely on an underlying symbol table.What we have at the start of semantic analysis is a tree built definitions; they can have all the synthesized attributes they want. In practice, attributes get stored in parse tree nodes and the semantic rules are evaluated either (a) during parsing (for easy rules) or (b) during one or more (sub)tree traversals.
Symbol Table Module
- Symbol tables are used to resolve names within name spaces. Symbol tables are generally organized hierarchically according to the scope rules of the language. See the operations defined on pages 474-476 of the text. To wit:
- mktable(parent)
- creates a new symbol table, whose scope is local to (or inside) parent
- enter(table, symbolname, type, offset)
- insert a symbol into a table
- addwidth(table)
- sums the widths of all entries in the table
- enterproc(table, name, newtable)
- enters the named procedure, whose local scope is newtable, into table
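A sketch of what the first two of these operations might look like in C; the struct layouts, field names, and the use of a simple linked list per scope are assumptions made for the example:
#include <stdlib.h>

/* One symbol table per scope, chained to its enclosing (parent) scope. */
struct sym_entry {
   char *name;
   struct c_type *type;          /* type representation; see "Representing C Types" below */
   int offset;
   struct sym_entry *next;
};

struct sym_table {
   struct sym_table *parent;     /* enclosing scope, NULL for the globals */
   struct sym_entry *entries;
};

struct sym_table *mktable(struct sym_table *parent)
{
   struct sym_table *t = malloc(sizeof(struct sym_table));
   t->parent = parent;
   t->entries = NULL;
   return t;
}

void enter(struct sym_table *t, char *name, struct c_type *type, int offset)
{
   struct sym_entry *e = malloc(sizeof(struct sym_entry));
   e->name = name;
   e->type = type;
   e->offset = offset;
   e->next = t->entries;
   t->entries = e;
}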
Type Checking
- Perhaps the primary component of semantic analysis in many traditional compilers consists of the type checker. In order to check types, one first must have a representation of those types (a type system) and then one must implement comparison and composition operators on those types using the semantic rules of the source language being compiled. Lastly, type checking will involve adding synthesized attributes through those parts of the language grammar that involve expressions and values.
Type Systems
- Types are defined recursively according to rules defined by the source language being compiled. A type system might start with rules like:
- Base types (int, char, etc.) are types
- Named types (via typedef, etc.) are types
- Types composed using other types are types, for example:
- array(T, indices) is a type. In some languages indices always start with 0, so array(T, size) works.
- T1 x T2 is a type (specifying, more or less, the tuple or sequence T1 followed by T2; x is a so-called cross-product operator).
- record((f1 x T1) x (f2 x T2) x ... x (fn x Tn)) is a type
- in languages with pointers, pointer(T) is a type
- (T1 x ... Tn) -> Tn+1 is a type denoting a function mapping parameter types to a return type
- In some languages, type expressions may contain variables whose values are types.
In addition, a type system includes rules for assigning these types to the various parts of the program; usually this will be performed using attributes assigned to grammar symbols.
Representing C (C++, Java, etc.) Types
- The type system is represented using data structures in the compiler's implementation language. In the symbol table and in the parse tree attributes used in type checking, there is a need to represent and compare source language types. You might start by trying to assign a numeric code to each type, kind of like the integers used to denote each terminal symbol and each production rule of the grammar. But what about arrays? What about structs? There are an infinite number of types; any attempt to enumerate them will fail. Instead, you should create a new data type to explicitly represent type information. This might look something like the following:
struct c_type {
   int base_type;   /* 1 = int, 2 = float, ...; also codes for array, pointer, struct */
   union {
      struct array {
         int size;
         struct c_type *elemtype;
      } a;
      struct c_type *p;      /* pointed-to type, when base_type says "pointer" */
      struct struc {
         char *label;
         struct field **f;   /* the fields, e.g. a NULL-terminated array */
      } s;
   } u;
};

struct field {
   char *name;
   struct c_type *elemtype;
};
Given this representation, how would you initialize a variable to represent each of the following types?
int [10][20]
struct foo { int x; char *s; }
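One possible answer, sketched using the struct c_type declaration above; the numeric type codes and the variable names here are invented for illustration:

/* Hypothetical type codes; any consistent encoding would do. */
enum { T_INT = 1, T_FLOAT = 2, T_CHAR = 3, T_PTR = 7, T_ARRAY = 8, T_STRUCT = 9 };

/* int [10][20] : an array of 10 elements, each an array of 20 ints */
struct c_type ct_int      = { T_INT };
struct c_type ct_int20    = { T_ARRAY, .u.a = { 20, &ct_int } };
struct c_type ct_int10x20 = { T_ARRAY, .u.a = { 10, &ct_int20 } };

/* struct foo { int x; char *s; } */
struct c_type ct_char     = { T_CHAR };
struct c_type ct_charptr  = { T_PTR, .u.p = &ct_char };

struct field fld_x = { "x", &ct_int };
struct field fld_s = { "s", &ct_charptr };
struct field *foo_fields[] = { &fld_x, &fld_s, NULL };   /* NULL-terminated field list */

struct c_type ct_foo = { T_STRUCT, .u.s = { "foo", foo_fields } };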
Run-time Environments
- Relationship between source code names and data objects during execution
- Procedure activations
- Memory management and layout
- Library functions
Scopes and Bindings
- Variables may be declared explicitly or implicitly in some languages
Scope rules for each language determine how to go from names to declarations.
Each use of a variable name must be associated with a declaration. This is generally done via a symbol table. In most compiled languages it happens at compile time (in contrast with, for example, LISP).
Environment and State
- Environment maps source code names onto storage addresses (at compile time), while state maps storage addresses into values (at runtime). Environment relies on binding rules and is used in code generation; state operations are loads/stores into memory, as well as allocations and deallocations. Environment is concerned with scope rules, state is concerned with things like the lifetimes of variables.
Runtime Memory Regions
- Operating systems vary in terms of how they organize program memory for runtime execution, but a typical scheme looks like this:
code
static data
stack (grows down)
heap (may grow up, from bottom of address space)
The code section may be read-only, and shared among multiple instances of a program. Dynamic loading may introduce multiple code regions, which may not be contiguous, and some of them may be shared by different programs. The static data area may consist of two sections, one for "initialized data" and one for uninitialized data (i.e. all zeros at the beginning). Some OSes place the heap at the very end of the address space, with a big hole so either the stack or the heap may grow arbitrarily large. Other OSes fix the stack size, place the heap above the stack, and grow it downward.
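A quick way to see this layout on a typical system is to print the address of something in each region. The following sketch assumes a hosted C environment; the exact addresses and their ordering will vary by OS and compiler:

#include <stdio.h>
#include <stdlib.h>

int initialized = 42;   /* static data (initialized section) */
int uninitialized;      /* static data (zeroed/bss section) */

int main(void)
{
   int local = 0;                            /* stack */
   int *dynamic = malloc(sizeof *dynamic);   /* heap */

   /* Casting a function pointer to void * is not strictly portable,
      but it is a common way to illustrate where the code region lives. */
   printf("code:        %p\n", (void *)main);
   printf("static data: %p %p\n", (void *)&initialized, (void *)&uninitialized);
   printf("stack:       %p\n", (void *)&local);
   printf("heap:        %p\n", (void *)dynamic);

   free(dynamic);
   return 0;
}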
Intermediate Code Generation
- Goal: list of machine-independent instructions for each procedure/method in the program. Basic data layout of all variables.
Can be formulated as syntax-directed translation
- add new attributes where necessary, e.g. for expression E we might have
- E.place
- the name that holds the value of E
- E.code
- the sequence of intermediate code statements evaluating E.
- new helper functions, e.g.
newtemp()
- returns a new temporary variable each time it is called
newlabel()
- returns a new label each time it is called
- actions that generate intermediate code formulated as semantic rules
Production          Semantic Rules

S -> id ASN E       S.code = E.code || gen(ASN, id.place, E.place)

E -> E1 PLUS E2     E.place = newtemp();
                    E.code = E1.code || E2.code || gen(PLUS, E.place, E1.place, E2.place);

E -> E1 MUL E2      E.place = newtemp();
                    E.code = E1.code || E2.code || gen(MUL, E.place, E1.place, E2.place);

E -> MINUS E1       E.place = newtemp();
                    E.code = E1.code || gen(NEG, E.place, E1.place);

E -> LP E1 RP       E.place = E1.place;
                    E.code = E1.code;

E -> IDENT          E.place = id.place;
                    E.code = emptylist();
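To make the helper functions concrete, here is one way newtemp() and gen() might be implemented in C, with the || concatenation realized as a list-joining function. The struct and function names are illustrative assumptions, not code from the text:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct instr {                 /* one three-address instruction */
   int opcode;                 /* e.g. PLUS, MUL, ASN, ... */
   char *dest, *src1, *src2;
   struct instr *next;
};

char *newtemp(void)            /* returns a fresh temporary name on each call */
{
   static int count = 0;
   char buf[16];
   sprintf(buf, "t%d", ++count);
   return strdup(buf);
}

struct instr *gen(int opcode, char *dest, char *src1, char *src2)
{
   struct instr *i = calloc(1, sizeof *i);
   i->opcode = opcode;
   i->dest = dest;
   i->src1 = src1;
   i->src2 = src2;
   return i;
}

struct instr *concat(struct instr *a, struct instr *b)   /* the || of the rules above */
{
   struct instr *p;
   if (a == NULL)
      return b;
   for (p = a; p->next != NULL; p = p->next)
      ;
   p->next = b;
   return a;
}

/* The semantic action for E -> E1 PLUS E2 might then read:
 *
 *    E.place = newtemp();
 *    E.code  = concat(concat(E1.code, E2.code),
 *                     gen(PLUS, E.place, E1.place, E2.place));
 */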
Three-Address Code
- Basic idea: break down source language expressions into simple pieces that:
- translate easily into real machine code
- form a linearized representation of a syntax tree
- allow us to check our own work to this point
- allow machine independent code optimizations to be performed
- increase the portability of the compiler
Instruction set:

x := y op z              store result of binary operation on y and z to x
x := op y                store result of unary operation on y to x
x := y                   store y to x
x := &y                  store address of y to x
x := *y                  store contents pointed to by y to x
*x := y                  store y to location pointed to by x
goto L                   unconditional jump to L
if x rop y then goto L   binary conditional jump to L
if x then goto L         unary conditional jump to L
if !x then goto L        unary negative conditional jump to L
param x                  store x as a parameter
call p,n,x               call procedure p with n parameters, store result in x
return x                 return from procedure, use x as the result

Declarations (Pseudo instructions): These declarations list size units as "bytes"; in a uniform-size environment offsets and counts could be given in units of "slots", where a slot (4 bytes on 32-bit machines) holds anything.

global x,n1,n2           declare a global named x at offset n1 having n2 bytes of space
proc x,n1,n2             declare a procedure named x with n1 bytes of parameter space and n2 bytes of local variable space
local x,n                declare a local named x at offset n from the procedure frame
label Ln                 designate that label Ln refers to the next instruction
end                      declare the end of the current procedure

Adaptations for Object Oriented Code

x := y field z           look up field named z within y, store address to x
class x,n1,n2            declare a class named x with n1 bytes of class variables and n2 bytes of class method pointers
field x,n                declare a field named x at offset n in the class frame
new x                    create a new instance of class named x
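Putting a few of these instructions together, a source statement such as x = f(a + 1, b) * c might break down into the following sequence; the particular temporaries chosen are only an illustration:

t1 := a + 1
param t1
param b
call f,2,t2
t3 := t2 * c
x := t3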
Intermediate Code for Control Flow
- Code for control flow (if-then, switches, and loops) consists of code to test conditions, and the use of goto instructions and labels to route execution to the correct code. Each chunk of code that is executed together (no jumps into or out of it) is called a basic block. The basic blocks are nodes in a control flow graph, where goto instructions, as well as falling through from one basic block to another, are edges connecting basic blocks.
Depending on your source language's semantic rules for things like "short-circuit" evaluation for boolean operators, the operators like || and && might be similar to + and * (non-short-circuit) or they might be more like if-then code.
A general technique for implementing control flow code is to add new attributes to tree nodes to hold labels that denote the possible targets of jumps. The labels in question are sort of analogous to FIRST and FOLLOW; for any given list of instructions corresponding to a given tree node, we might want a .first attribute to hold the label for the beginning of the list, and a .follow attribute to hold the label for the next instruction that comes after the list of instructions. The .first attribute can be easily synthesized. The .follow attribute must be inherited from a sibling. The labels have to actually be allocated and attached to instructions at appropriate nodes in the tree corresponding to grammar production rules that govern control flow. An instruction in the middle of a basic block needs neither a .first nor a .follow.
C code                       Attribute Manipulations

S -> if E then S1            E.true = newlabel();
                             E.false = S.follow;
                             S1.follow = S.follow;
                             S.code = E.code || gen(LABEL, E.true) || S1.code

S -> if E then S1 else S2    E.true = newlabel();
                             E.false = newlabel();
                             S1.follow = S.follow;
                             S2.follow = S.follow;
                             S.code = E.code || gen(LABEL, E.true) || S1.code ||
                                      gen(GOTO, S.follow) || gen(LABEL, E.false) || S2.code

Exercise: OK, so what does a while loop look like?
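One plausible answer, sketched in the same style; this rule is an illustration rather than a rule from the text, and it uses the .first attribute mentioned above for the label that starts the loop test:

S -> while E do S1           S.first = newlabel();
                             E.true = newlabel();
                             E.false = S.follow;
                             S1.follow = S.first;
                             S.code = gen(LABEL, S.first) || E.code ||
                                      gen(LABEL, E.true) || S1.code ||
                                      gen(GOTO, S.first)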
On Boolean Operators, and Short Circuit Control Flow
- Different languages have different semantics for booleans; for example Pascal treats them as identical to arithmetic operators, while the C family of languages (and many others) specify "short-circuit" evaluation, in which operands are not evaluated once the boolean result is known. Some ("kitchen-sink" design) languages have two sets of boolean operators: short-circuit and non-short-circuit.
Implementation techniques for these alternatives include:
- treat boolean operators same as arithmetic operators, evaluate each and every one into temporary variable locations.
- add extra attributes to keep track of code locations that are targets of jumps. Boolean expressions' results evaluate to jump instructions.
- one could change the machine execution model so it implicitly routes control from expression failure to the appropriate location. In order to do this one would
- mark boundaries of code in which failure propagates
- maintain a stack of such marked "expression frames"
Non-short Circuit Example
a < b || c < d && e < f
translates into
100: if a < b goto 103
101: t1 = 0
102: goto 104
103: t1 = 1
104: if c < d goto 107
105: t2 = 0
106: goto 108
107: t2 = 1
108: if e < f goto 111
109: t3 = 0
110: goto 112
111: t3 = 1
112: t4 = t2 AND t3
113: t5 = t1 OR t4
Short-Circuit Example
a < b || c < d && e < f
translates into
     if a < b goto L1
     if c < d goto L2
     goto L3
L2:  if e < f goto L1
L3:  t = 0
     goto L4
L1:  t = 1
L4:
Note: L3 might instead be the target E.false; L1 might instead be E.true;
no computation of a 0 or 1 into t might be needed at all.
Final Code Generation
Final code generation takes a linear sequence of 3-address intermediate
code instructions, and translates each 3-address instruction into one or
more native instructions.
The big issues in code generation are (a) instruction selection, and (b)
register allocation and assignment.
Instruction Selection
The hardware may have many different sequences of instructions to
accomplish a given task. Instruction selection must choose a particular
sequence. At issue may be: how many registers to use, whether a special
case instruction is available, and what addressing mode(s) to use. Given
a choice among equivalent/alternative sequences, the decision on which
sequence of instructions to use is based on estimates or measurements of
which sequence executes the fastest. This is usually determined by the
number of memory references incurred during execution, including the
memory references for the instructions themselves. Simply picking the
shortest sequence of instructions is often a good approximation of the
optimal result, since fewer instructions usually translates into fewer
memory references.
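For example, a statement like a = a + 1 might be implemented either as a single memory-increment instruction or as a load/add/store sequence; the mnemonics below are generic illustrations, not from any particular instruction set:

INC   a                ; one instruction operating directly on memory

LOAD  R1, a            ; versus a three-instruction sequence through a register
ADD   R1, R1, 1
STORE a, R1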
Register Allocation and Assignment
Accessing values in registers is much much faster than accessing main memory.
Register allocation denotes the selection of which variables will go
into registers. Register assignment is the determination of exactly
which register a given variable is placed in. The goal of these operations
is generally to minimize the total number of memory accesses required
by the program.
In the Old Days, there were accumulator architectures, in which
only one register was present. On such an architecture,
register allocation and assignment is not needed; the compiler has few
options about how it uses the accumulator register. The traditional
16-bit x86 architecture was only a little better,
with 4 registers instead of 1. At the other extreme, Recent History has
included CPUs with 32 or more general purpose registers. On such systems,
high quality compiler register allocation and assignment makes a huge
difference in program execution speed. Unfortunately, optimal register
allocation and assignment is NP-complete, so compilers must settle for
doing a "good" job.
When the number of variables in use at a given time exceeds the number
of registers available (the common case), some variables may be used
directly from memory if the instruction set supports memory-based operations.
When an instruction set does not support memory-based operations, all
variables must be loaded into a register in order to perform arithmetic
or logic using them.
Even if an instruction set does support memory-based operations, most
compilers will want to load a value into a register while it is
being used, and then spill it back out to main memory when the register
is needed for another purpose. The task of minimizing memory accesses
becomes the task of minimizing register loads and spills.
Code Generation for Virtual Machines
A virtual machine architecture such as the JVM changes the "final" code
generation somewhat. We have seen several changes, some of which
simplify final code generation and some of which complicate things.
- no registers, simplified addressing
- a virtual machine may omit a register model and avoid complex
addressing modes for different types of variables
- uni-size or descriptor-based values
- if all variables are "the same size", some of the details of
memory management are simplified. In Java most values occupy
a standard "slot" size, although some values occupy two slots.
In Icon and Unicon, all values are stored using a same-size descriptor.
- runtime type system
- requiring type information at runtime may complicate the
code generation task since type information must be present
in generated code. For example in Java method invocation and
field access instructions must encode class information.
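For example, a Java method like int add(int a, int b) { return a + b; } compiles (as javap -c would show it) to stack-based code with no registers at all, roughly:

iload_1        push parameter a (local variable slot 1)
iload_2        push parameter b (local variable slot 2)
iadd           pop two ints, push their sum
ireturn        return the int on top of the operand stack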