




Chapter 10: Tokenization: A Lexical Analyzer in Java

"QUOTE HERE" - Author

Introduction


Chapter 10 focuses on tokenization as performed by a lexical analyzer built with the Java programming language. We will go step by step to discover exactly what it means to begin building a programming language from scratch. Java will aid us in the process, handling some of the lower-level work while we tackle the higher level.

A Lexical Analyzer

What is a Lexical Analyzer?

A Lexical Analyzer is a program that breaks programming code into smaller segments. These segments, called tokens, are snippets of code that are meaningful on their own at some level. A lexical analyzer can recognize scanned code that forms valid individual words, or determine that captured code doesn't conform to any known configuration and respond appropriately. Do not worry about what a parser is at this time, as that will be covered in much more detail in Chapter 11.

What we will be doing in the next section is creating the lexical analyzer. It will allow us to enter text, which will then be analyzed and subsequently tokenized. Tokenization, when applied to data security, is the process of substituting a sensitive data element with a non-sensitive equivalent, referred to as a token, that has no extrinsic or exploitable meaning or value. [1] Knowing what a token is at any given time and taking the appropriate action is vital to the success of the lexical analyzer. A token could be holding or representing a command, a numeric value, an operator, or something that is invalid altogether.
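To make those categories concrete, here is a minimal sketch of how such token kinds might be represented in Java. The TokenType enum and the classify method are hypothetical illustrations for this paragraph, not part of the analyzer we build below:

// Hypothetical token categories matching the kinds described above.
enum TokenType { COMMAND, NUMBER, OPERATOR, INVALID }

class TokenDemo
{
    // A rough classifier, for illustration only.
    static TokenType classify(String token)
    {
        if (token.matches("[0-9]+"))
            return TokenType.NUMBER;    // e.g. "42"
        if (token.matches("[+\\-*/=<>]"))
            return TokenType.OPERATOR;  // e.g. "+"
        if (token.matches("[A-Za-z]+"))
            return TokenType.COMMAND;   // e.g. a keyword such as "print"
        return TokenType.INVALID;       // anything else
    }
}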


Building the Lexical Analyzer

The first piece to look at is the primary class Parse. Outside of its internal functions, the class consists of effectively just the following few lines:

import java.util.*;

public class Parse
{
    private String line = "";  // unconsumed remainder of the current input line
    private Scanner scan;      // will read the text being parsed

Of course, this initializes the class, then creates two instance variables. The first, line, starts as an empty string, so a parsing session begins cleanly with nothing left over to process. The second, scan, is a placeholder for the Scanner that will pull in the text being parsed.

The next function to look at is run.

    public void run()
    {
        String token;
        
        scan = new Scanner(System.in);
        while(true)
        {
            System.out.print("> ");  // prompt for a line of input
            do 
            {
                token = getToken();
                if (token.equals("."))  // "." is the termination command
                {
                    return;
                }
                System.out.print("[" + token + "]");  // echo each token in brackets
            } while (line.length() > 0);
            System.out.println();
        }
    }

The run function first creates a placeholder for the incoming token string, then attaches a Scanner to standard input. The outer while loop prints the system prompt. The inner do-while loop repeatedly calls getToken, returning if the termination command, ".", has been received, and otherwise printing each token the user entered inside square brackets until the line has been fully consumed.
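Here is a sample session, reconstructed by tracing the code above rather than taken from a live run, assuming the user types x = 3+4. at the prompt:

    > x = 3+4.
    [x][ ][=][ ][3][+][4]

Notice that the blanks between tokens come back as single-character tokens of their own; as written, the analyzer divides a line into whitespace, delimiters, and words rather than silently discarding the whitespace.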

The isBlank function is next.

    private boolean isBlank(char ch)
    {
        switch (ch)
        {
            case ' ':
            case '\t':
                return true;
        }
        return false;
    }

The isBlank function discovers whitespace. It looks specifically for spaces and tab characters, returning true if the character is one of them and false otherwise.
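As an aside, the standard library offers a broader test, Character.isWhitespace, which also covers newlines and other whitespace characters. An alternative version of isBlank written in terms of it might look like this; it is a sketch of a possible substitution, not the chapter's code:

    // Alternative: delegate to the standard library, which recognizes
    // more whitespace kinds (newline, form feed, etc.) than the switch above.
    private boolean isBlank(char ch)
    {
        return Character.isWhitespace(ch);
    }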


    private boolean isDelim(char ch)
    {
        if (isBlank(ch))
            return true;
            
        switch (ch)
        {
            case '.':
            case '+': case '-': case '*': case '/':
            case '=': case '>': case '<':
                return true;
        }
        return false;
    }

The isDelim function checks to see if a character is a delimiter--something that either indicates an operation or simply separates tokens without carrying meaning of its own. It first runs the isBlank function to determine whether the incoming character is a space or a tab; if so, it returns true. It then tests the character against a list of mathematical and comparison operators, plus the terminating period. If nothing matches, the function returns false: the character it's looking at is not a delimiting character.
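As the language grows, the delimiter set grows with it. For example, one might also treat grouping and separator characters as delimiters; the following extension is a speculative sketch, not part of the chapter's analyzer:

    // Hypothetical extension: recognize parentheses and separators too,
    // so that "(3+4)" would split into five tokens.
    private boolean isDelimExtended(char ch)
    {
        if (isDelim(ch))
            return true;

        switch (ch)
        {
            case '(': case ')':
            case ';': case ',':
                return true;
        }
        return false;
    }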

The last major function of our analyzer as written here is the getToken function.

    private String getToken()
    {
        int i; // loop index into the current line
        String token;
        
        while (line.length() == 0)
        {
            line = scan.nextLine();
       
            // skip leading blanks
            for (i=0; i<line.length(); i++)
                if (!isBlank(line.charAt(i)))
                    break;
            line = line.substring(i);
        }
        
        for (i=0; i<line.length();i++)
            if (isDelim(line.charAt(i)))
            {
                if (i == 0) // the delimiter itself is the token
                    i++;
                    
                token = line.substring(0, i);
                line = line.substring(i);
                return token;
            }
        
        token = line;
        line = "";
        return token;
    }

} // one final brace to close class Parse

The getToken function is the major driving engine of the analyzer. It opens by declaring a loop index i and a holding place for the outgoing data called token. While the stored line is empty, it reads a fresh line from the Scanner and strips any leading blanks using isBlank. It then walks along the line with isDelim, looking for the first delimiter: everything before that delimiter becomes the token, and the consumed characters are cut from the front of line. If the delimiter sits at the very start of the line, that single character is itself returned as the token. If no delimiter is found at all, the entire remaining line is returned as one token and line is emptied. Because line always holds whatever has not yet been consumed, repeated calls walk through the input one token at a time.
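The listing stops short of an entry point. Assuming the file is saved as Parse.java, a minimal main method placed inside class Parse would let us try the analyzer out; the chapter's listing does not include one, so this is an assumed addition:

    // Hypothetical entry point, not shown in the listing above.
    public static void main(String[] args)
    {
        Parse parser = new Parse();
        parser.run();  // reads lines and prints tokens until "." is entered
    }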

What the clever observer will note is that while this analyzer has successfully managed to divide code into whitespace, delimiters, and tokens, it is not yet set up to recognize any actual incoming commands. That functionality will be added as the analyzer continues to be developed in the next chapter, where parsing is introduced.

What is a token?

A token is a categorized block of text, usually consisting of an indivisible run of characters known as a lexeme. A lexical analyzer initially reads in lexemes and categorizes them according to function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything: English, gibberish symbols, anything; it just needs to be a useful part of the structured text. Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer reads in a stream of lexemes, categorizes them into tokens, and reports an error if it finds text that cannot form a valid token.
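To illustrate tokens defined by regular expressions, the sketch below uses Java's java.util.regex package to categorize the lexemes in a string. The patterns and category names are illustrative assumptions, not the lex tool itself:

import java.util.regex.*;

public class RegexTokens
{
    public static void main(String[] args)
    {
        // One alternative per token category; any other non-blank character is invalid.
        Pattern pattern = Pattern.compile(
            "(?<NUMBER>[0-9]+)|(?<WORD>[A-Za-z]+)|(?<OP>[+\\-*/=<>.])|(?<INVALID>\\S)");
        Matcher m = pattern.matcher("x = 3+4.");
        while (m.find())
        {
            if (m.group("NUMBER") != null)
                System.out.println("NUMBER: " + m.group());
            else if (m.group("WORD") != null)
                System.out.println("WORD:   " + m.group());
            else if (m.group("OP") != null)
                System.out.println("OP:     " + m.group());
            else
                System.out.println("INVALID: " + m.group());
        }
    }
}

Run against the sample line x = 3+4., this prints WORD, OP, NUMBER, OP, NUMBER, and OP lines in turn; the whitespace between matches is simply skipped by find().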


Summary

In this chapter we built a simple lexical analyzer in Java. The Parse class reads lines of input through a Scanner and, in its getToken function, splits each line into tokens: whitespace, delimiting characters such as mathematical operators, and the words in between. The helper functions isBlank and isDelim classify individual characters, while run drives the whole process, printing each token in brackets until the terminating "." is entered. The analyzer does not yet recognize actual commands; that functionality arrives with parsing in Chapter 11.

Key Terms

Lexical analyzer -- a program that breaks programming code into smaller, meaningful segments called tokens.
Token -- a categorized block of text produced by a lexical analyzer.
Lexeme -- an indivisible run of characters that the analyzer reads in and categorizes.
Tokenization -- the assignment of meaning to lexemes, turning them into tokens.
Delimiter -- a character, such as whitespace or a mathematical operator, that separates or terminates tokens.

Problem Sets

A list of practice problems