Mimir:Draft5 Chapter11

From Openitware

Chapter 11: Writing Your Own Parser

"QUOTE HERE" - Author



Chapter 11 continues the tokenizer-to-parser experiment with its second half: the parser itself. Before building one, it is important to understand exactly what a parser is in the context of programming and to realize that there are several different types of parsers to choose from.

What is a Parser?

The first thing to ask is: what is a parser? Parsing, or syntactic analysis, is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar.[1] The parser shown in the next section moves through its input line by line, identifying each part of the line it is given and taking action depending on what it finds.

Types of Parsers

There are two ways of parsing information: bottom-up and top-down.

The name bottom-up parser comes from the concept of a parse tree, in which the most detailed parts are at the bushy bottom of the (upside-down) tree, and larger structures composed from them are in successively higher layers, until at the top or "root" of the tree a single unit describes the entire input stream. A bottom-up parse discovers and processes that tree starting from the bottom left end, and incrementally works its way upwards and rightwards.[2]
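
To make the bottom-up idea concrete, here is a minimal shift-reduce sketch. It is not part of SQRL: the class name ShiftReduceSketch and the toy grammar (E ::= E + n | n, where n is a single digit) are assumptions chosen only for illustration. Symbols are pushed onto a stack one at a time, and whenever the top of the stack matches the right-hand side of a rule it is replaced by E, so the tree is built from the leaves upward.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only: a hand-written shift-reduce recognizer for
//   E ::= E + n | n      (n is a single digit)
class ShiftReduceSketch {

    // Returns true if the whole input can be reduced to a single E.
    static boolean parse(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char ch : input.replace(" ", "").toCharArray()) {
            // Shift: push the next symbol, writing every digit as 'n'.
            if (Character.isDigit(ch)) {
                stack.push('n');
            } else if (ch == '+') {
                stack.push('+');
            } else {
                return false;            // symbol outside the toy grammar
            }
            reduce(stack);
        }
        return stack.size() == 1 && stack.peek() == 'E';
    }

    // Reduce: replace a matched right-hand side on top of the stack with E.
    private static void reduce(Deque<Character> stack) {
        if (stack.peek() != 'n') {
            return;                      // nothing to reduce right after a '+'
        }
        stack.pop();                     // take the n off the stack
        if (stack.size() >= 2 && stack.peek() == '+') {
            char plus = stack.pop();
            if (stack.peek() == 'E') {
                stack.pop();             // E + n  becomes  E
                stack.push('E');
                return;
            }
            stack.push(plus);            // not a valid match, undo
            stack.push('n');
            return;
        }
        if (stack.isEmpty()) {
            stack.push('E');             // a leading n  becomes  E
        } else {
            stack.push('n');             // n in an illegal position
        }
    }
}
```

With this sketch, ShiftReduceSketch.parse("1 + 2 + 3") accepts the input, while "1 +" and "12" are rejected because the stack never reduces to a single E.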

Top-down parsing can be viewed as an attempt to find leftmost derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right.[3] In contrast to the bottom-up parser, the top-down parser starts at the top of the tree and works its way down. The parser we are building is an example of a top-down parser, as you will discover in the next section.
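
As a rough illustration of the top-down approach, the sketch below gives each grammar rule its own method and starts from the root rule. It is not the SQRL parser: the class name TopDownSketch and the toy grammar (sums of single digits) are assumptions made for this example.

```java
// Hypothetical illustration only: a recursive-descent (top-down) evaluator for
//   sum  ::= digit rest
//   rest ::= + digit rest | (empty)
class TopDownSketch {
    private final String input;
    private int pos = 0;                 // index of the next unread character

    TopDownSketch(String input) {
        this.input = input.replace(" ", "");
    }

    static int eval(String text) {
        TopDownSketch p = new TopDownSketch(text);
        int value = p.sum();             // start at the root rule
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    // sum ::= digit rest
    private int sum() {
        int left = digit();
        return rest(left);
    }

    // rest ::= + digit rest | (empty)
    private int rest(int soFar) {
        if (pos < input.length() && input.charAt(pos) == '+') {
            pos++;                       // consume '+'
            return rest(soFar + digit());
        }
        return soFar;                    // the empty alternative
    }

    // digit ::= 0 | 1 | ... | 9
    private int digit() {
        if (pos >= input.length() || !Character.isDigit(input.charAt(pos))) {
            throw new IllegalArgumentException("digit expected at " + pos);
        }
        return input.charAt(pos++) - '0';
    }
}
```

Calling TopDownSketch.eval("1 + 2 + 3") walks sum, then digit, then rest, consuming characters from left to right, and returns 6. Malformed input such as "1 +" raises an IllegalArgumentException.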

There are some key differences in the end results from choosing one over the other.

Top-Down Parsers:

Slower - They are commonly written as recursive-descent parsers, which can be slower, particularly when they have to back up and retry.
Greedy -
Easier - Though they are slower, they are easier to understand and easier to create. This is the main reason we chose to build our project as a top-down parser rather than a bottom-up one.
LL - Top-down parsers read the input from left to right and expand the leftmost entry first.

Bottom-Up Parsers:

Faster - Without recursion slowing them down, they tend to be much more efficient.
Harder - The structure of a bottom-up parser is much more complicated and harder to understand.
Popular - Many parser generators, such as the yacc family, produce bottom-up parsers.
LR - Bottom-up parsers read the input from left to right and build a rightmost derivation in reverse.

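Finally, bottom-up parsers can parse some languages that top-down parsers cannot, simply because of the way those languages' grammars are built. A left-recursive rule, for example, sends a simple recursive-descent parser into an endless loop.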

Our Lexical Analyzer - The Parser

Understanding the Parser

There are countless ways to design your own language, and any rule you can think up can be enforced by the parser. Those familiar with Python know that you must indent nested structures with the appropriate white space or it will give an error. In C++ you must end every statement with a semicolon. Our language is simple enough, as you will discover.

The code below represents the skeleton of the parser for SQRL. Not all sections have been written: blank segments will be filled in by students later as a learning exercise, and these blank segments will be pointed out as we come across them in this breakdown. Additionally, the segments we reviewed in the previous section, the lexical analyzer, are not included. The first three written lines below are provided more for context; you will notice they mirror what's in the analyzer.

import java.util.*;

public class Parse {

    private int[] vars;    // one slot per variable name, a through z

    public Parse() {
        vars = new int[26];
        for (int i = 0; i < 26; i++)
            vars[i] = 0;
    }

The constructor above creates the array that stores variables, which you will define later. Notice that the array has exactly 26 slots: one for each single-letter variable name, a through z, each holding an integer.

    // parse token for <code>
    private void parseCode(String token) {
        do {
            parseStmt(token);        // ::= <statement> <code> | <statement>
            token = getToken();
        } while (!token.equals("."));
    }

The parseCode function does one thing: it passes the first token of each statement to parseStmt, the main driving function of the parser, and then fetches the next token, repeating until that token is a "." character.
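
The loop shape can be tried in isolation with the sketch below. It is only an approximation: the real getToken belongs to the lexical analyzer from the previous section, so the stand-in here (which simply splits the input on spaces) and the class name CodeLoopSketch are assumptions. Instead of calling parseStmt, it records each token it would have handed over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only: the real getToken comes from the lexical
// analyzer in the previous section. Here tokens are just split on spaces.
class CodeLoopSketch {
    private final Iterator<String> tokens;

    CodeLoopSketch(String program) {
        tokens = Arrays.asList(program.trim().split("\\s+")).iterator();
    }

    // Stand-in for getToken: returns "." when the input runs out.
    private String getToken() {
        return tokens.hasNext() ? tokens.next() : ".";
    }

    // Same do/while shape as parseCode: handle one token, fetch the next,
    // and stop at the "." terminator.
    List<String> run() {
        List<String> seen = new ArrayList<>();
        String token = getToken();
        do {
            seen.add(token);             // parseStmt(token) would go here
            token = getToken();
        } while (!token.equals("."));
        return seen;
    }
}
```

Running new CodeLoopSketch("print 5 . ignored").run() yields the tokens print and 5 and never looks past the ".". Because the loop is a do/while, the first token is always processed before the "." check is made.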

    // parse token for <statement>
    private void parseStmt(String token) {
        int val;
        if (token.equals("print")) {
            token = getToken();
            val = parseExpr(token);
            // blank segment (nothing is printed yet)
        } else if (token.equals("input")) {
            token = getToken();
            if (!isVar(token)) {
                // blank segment
            }
            System.out.print("? ");
            val = scan.nextInt();
            storeVar(token, val);
        } else if (token.equals("if")) {
            // blank segment
        } else if (isVar(token)) {
            // blank segment
        }
    }

The parseStmt function contains one case for each basic command in the language: print, input, if, and variable assignment. The if and assignment cases are blank here. This is the big driving function of the language, where the parser starts working out exactly what it is looking at.

    // parses <expr>
    private int parseExpr(String token) {
        String op;
        int val;
        val = parseVal(token);
        op = getToken();
        switch (op.charAt(0)) {
            case '+':
                val = val + parseVal(getToken());    // ::= <val> + <val>
                break;
            case '-':    // blank segment
            case '*':    // blank segment
            case '/':    // blank segment
            default:    // oops, a unary expression
                line = op + line;    // put the token back on the front of line
        }
        return val;
    }

The parseExpr function recognizes the operators of basic mathematical expressions and sets up the action for each one. If it sees a "+", the code as written takes the value it has already parsed, fetches another value using getToken and parseVal, and adds the two. If it sees a "-", "*", or "/", the code as written does nothing useful, because those cases are blank. You will fill them in later.
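
One possible way to think about the missing cases is as a small helper that applies an operator to two values that have already been parsed. The sketch below is not part of the SQRL skeleton: the class ExprSketch and the method apply are hypothetical names, and integer division is an assumption that follows from the int values used throughout.

```java
// Hypothetical helper, not part of the SQRL skeleton: applies one binary
// operator to two values that have already been parsed.
class ExprSketch {
    static int apply(int left, char op, int right) {
        switch (op) {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            case '/': return left / right;   // integer division; throws if right is 0
            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}
```

Each case returns immediately, so there is no risk of one case falling through into the next. In the skeleton's own switch, the same protection comes from ending each case with a break.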

    // parses token for <val> and returns an integer value
    private int parseVal(String token) {
        if (isNumeric(token))                    // ::= <num>
            return Integer.parseInt(token);
        if (isVar(token))                        // ::= <var>
            return parseVar(token);
        return -1; // should never happen
    }

This function is a fork: it checks a single token and, if isNumeric reports a number, converts it to an int, or, if isVar reports a variable, hands it to parseVar to look up its value.

    // store value into <var>
    private void storeVar(String token, int val) {
        vars[((int)(token.charAt(0))) - 97] = val;    // 97 is the ASCII code for 'a'
    }

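The storeVar function stores a value in the 26-slot array created at the beginning of the code above. The slot is found directly from the variable's letter: its ASCII code minus 97, so a maps to slot 0 and z maps to slot 25. The token passed in should be a single lowercase letter, and val the number to associate with it.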
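
The slot arithmetic can be checked on its own with a short sketch. The class VarSlotSketch and its methods slot, store, and load are hypothetical names used only for this illustration; subtracting the character 'a' is equivalent to subtracting 97.

```java
// Hypothetical illustration only: the slot arithmetic behind storeVar.
class VarSlotSketch {
    private final int[] vars = new int[26];   // Java zero-fills new int arrays

    // 'a' is ASCII 97, so 'a' - 'a' is slot 0 and 'z' - 'a' is slot 25.
    static int slot(String name) {
        return name.charAt(0) - 'a';
    }

    void store(String name, int val) {
        vars[slot(name)] = val;
    }

    int load(String name) {
        return vars[slot(name)];
    }
}
```

After store("b", 6), a call to load("b") returns 6, while a letter that was never stored, such as "c", still reads back as 0.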

    // checks to see if token is a <num>
    private boolean isNumeric(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isDigit(token.charAt(i)))
                return false;
        }
        return true;
    }

The isNumeric function checks every character of a token, so it handles numbers of any length. It uses a built-in Java method, Character.isDigit, to confirm that each character is a digit. If at any point the function finds a non-digit, it returns false; if it reaches the end of the token, it returns true.
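
A few sample calls show how the loop behaves at the edges. The sketch below copies the same loop into a standalone class (NumericSketch is a hypothetical name) and makes it static so that it can be called directly.

```java
// Hypothetical standalone copy of the isNumeric loop, for experimenting.
class NumericSketch {
    static boolean isNumeric(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isDigit(token.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isNumeric("42"));   // true
        System.out.println(isNumeric("4x"));   // false
        System.out.println(isNumeric("-3"));   // false: '-' is not a digit
        System.out.println(isNumeric(""));     // true: the loop never runs
    }
}
```

The last case is worth noticing: an empty token passes the check, so a caller that goes on to use Integer.parseInt would need to guard against it.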

    // checks to see if token is a <var>
    private boolean isVar(String token) {
        return (token.length() == 1 && isAlpha(token.charAt(0)));
    }

This segment checks that the token is exactly one character long, and uses the isAlpha function to make sure that the one-character token is alphabetic.

    // is it a to z?
    private boolean isAlpha(char ch) {
        return ((int) ch) >= 97 && ((int) ch) <= 122;    // 97 is 'a', 122 is 'z'
    }

The isAlpha function returns true if the ASCII value of the character sent to it falls between 97 and 122, the values of the lowercase letters a through z. Uppercase letters are rejected. This feeds back into the prior function.
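
The two checks can be exercised together in a small standalone sketch. VarCheckSketch is a hypothetical class name, and the character literals 'a' and 'z' are used in place of 97 and 122, which is equivalent.

```java
// Hypothetical standalone copy of the two checks, for experimenting.
class VarCheckSketch {
    // Same range as 97..122, written with character literals.
    static boolean isAlpha(char ch) {
        return ch >= 'a' && ch <= 'z';
    }

    // A variable is exactly one lowercase letter.
    static boolean isVar(String token) {
        return token.length() == 1 && isAlpha(token.charAt(0));
    }
}
```

Under these rules isVar("a") is true, while "A", "ab", and "5" are all rejected.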

    // parses <var> and returns integer value
    private int parseVar(String token) {
        return 0;    // blank segment
    }
}    // end of class Parse

This last portion is meant to parse a variable and return the integer stored for it in the array created at the beginning of the code above. As written, it is a stub that always returns 0.

Copy all of the code and put it together with the lexical analyzer to run your own parser. SQRL, when ready, should be capable of understanding the following statements.

print 5 + 6
a = 5
b = 6
print a + b
if a < b if b < 5 print 7 + 2

Please note that in this last example, unless you use different values than the ones given earlier for a and b, the code should print nothing at all when run: a has been set to 5 and b has been set to 6, so the second if condition (b < 5) is false, and the statement ends before the print can execute.
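
Read as ordinary Java, the last SQRL line behaves like two nested if statements. The sketch below is only an analogy for tracing the logic, not a piece of the parser; the class NestedIfTrace and its run method are hypothetical names.

```java
// Hypothetical analogy only: the SQRL line
//   if a < b if b < 5 print 7 + 2
// written as nested Java if statements.
class NestedIfTrace {
    // Returns what the SQRL line would print for the given a and b.
    static String run(int a, int b) {
        StringBuilder out = new StringBuilder();
        if (a < b) {                 // if a < b
            if (b < 5) {             // if b < 5
                out.append(7 + 2);   // print 7 + 2
            }
        }
        return out.toString();
    }
}
```

With a set to 5 and b set to 6, run returns an empty string, because the inner condition fails. With a set to 1 and b set to 2, both conditions hold and the result is "9".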

Once you have a deeper understanding of the way it works, please move on to the next section for exercises where you will be tasked with adding on more functionality to the program.

Table Driven Parsing

(From Casey: I have notes for this, but I am not sure how to write tables in wiki format.)


A quick summary of the chapter should go here

Key Terms

A list of key terms should go here. This should be created using some sort of glossary type plugin.

Problem Sets

Problem 1

Can you make SQRL follow this BNF Grammar?

 <SQRL> ::= <code> .
 <code> ::= <statement> <code>
 <code> ::= <statement>
 <statement> ::= print <expr>
 <statement> ::= input <var>
 <statement> ::= if <cond> <statement>
 <statement> ::= <var> = <expr>
 <expr> ::= <val> + <val>
 <expr> ::= <val> - <val>
 <expr> ::= <val> * <val>
 <expr> ::= <val> / <val>
 <expr> ::= <val>
 <cond> ::= <val> = = <val>  //Note that this is a double equals
 <cond> ::= <val> > <val>
 <cond> ::= <val> < <val>
 <val> ::= <num>
 <val> ::= <var>
 <num> ::= <dig> | <dig><num>
 <dig> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 <var> ::= a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
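
As a hint for reading the grammar, a recursive rule can often be turned into a recursive method almost word for word. The sketch below does this for the num and dig rules only. It is an illustration under assumptions, not the required solution: the class NumRuleSketch and the methods isDig and isNum are hypothetical names.

```java
// Hypothetical illustration only: the <num> and <dig> rules as methods.
class NumRuleSketch {
    // <dig> ::= 0 | 1 | ... | 9
    static boolean isDig(char ch) {
        return ch >= '0' && ch <= '9';
    }

    // <num> ::= <dig> | <dig><num>
    static boolean isNum(String s) {
        if (s.isEmpty() || !isDig(s.charAt(0))) {
            return false;            // every alternative starts with a <dig>
        }
        // Either the digit stands alone, or the rest must itself be a <num>.
        return s.length() == 1 || isNum(s.substring(1));
    }
}
```

Here isNum("407") is true and isNum("4a") is false. Unlike a plain loop over the characters, this version also rejects the empty string, because the grammar requires at least one digit.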
