Chapter 12: Lex and Yacc
"Make everything as simple as possible, but not simpler" - Albert Eistein
Lex & Yacc
Lex and yacc are tools designed for writers of compilers and interpreters, although they are also useful for many applications that will interest the noncompiler writer. Any application that looks for patterns in its input, or has an input or command language is a good candidate for lex and yacc. Furthermore, they allow for rapid application prototyping, easy modification, and simple maintenance of programs.
Lex and yacc were both developed at Bell Laboratories in the 1970s. Yacc was the first of the two, developed by Stephen C. Johnson. Lex was designed by Mike Lesk and Eric Schmidt to work with yacc. Both lex and yacc have been standard UNIX utilities since 7th Edition UNIX. System V and older versions of BSD use the original AT&T versions, while the newest version of BSD uses flex (see below) and Berkeley yacc. The articles written by the developers remain the primary source of information on lex and yacc.
Lex is a lexical analyzer generator: the program it produces is a "lexical analyzer" (also called a lexer or scanner). The analyzer's main job is to break up an input stream into more usable elements, that is, to identify the "interesting bits" in a text file.
Yacc takes a grammar that you specify and writes a parser that recognizes valid “sentences” in that grammar.
Overview - What are Lex and Yacc (or Flex and Bison)
Lexical analyzers: A lexer takes an arbitrary input stream and tokenizes it, i.e., divides it up into lexical tokens. This tokenized output can then be processed further, usually by yacc, or it can be the “end product.”
When you write a lex specification, you create a set of patterns which lex matches against the input. Each time one of the patterns matches, the lex program invokes C code that you provide, which does something with the matched text. In this way a lex program divides the input into strings which we call tokens.
In summary, lex recognizes regular expressions and yacc recognizes entire grammars. Lex divides the input stream into pieces (tokens), and yacc then takes these pieces and groups them together logically.
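The pattern-and-action idea can be sketched in a few lines of Python. This is only an illustration of the matching model, not what lex generates (lex emits a table-driven scanner in C). The function name run_rules and the three rules are made up for this example. Like lex, the sketch picks the longest match at each position, prefers the earlier rule on a tie, and copies unmatched characters through unchanged.

```python
import re

def run_rules(rules, text):
    """Scan text left to right, firing the action of the best-matching rule."""
    pos = 0
    passthrough = []  # characters that no rule matched
    while pos < len(text):
        best = None  # (match object, action) for the longest match so far
        for pattern, action in rules:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, action)
        if best is None:
            passthrough.append(text[pos])  # default: copy the character through
            pos += 1
        else:
            m, action = best
            action(m.group())  # hand the matched text to the user's action
            pos = m.end()
    return "".join(passthrough)

seen = []
rules = [
    (re.compile(r"[0-9]+"), lambda s: seen.append(("NUMBER", s))),
    (re.compile(r"[a-z]+"), lambda s: seen.append(("WORD", s))),
    (re.compile(r"[ \t\n]+"), lambda s: None),  # whitespace: do nothing
]
leftover = run_rules(rules, "abc 42 x!")
print(seen)      # [('WORD', 'abc'), ('NUMBER', '42'), ('WORD', 'x')]
print(leftover)  # !
```

Each rule pairs a pattern with an action, just as each line in the rules section of a lex specification does.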
The parser that yacc writes accepts only the valid “sentences” of the grammar you specify. The term “sentence” here is used in a fairly general way: for a grammar of the C language, for example, the sentences are syntactically valid C programs.
A grammar is a series of rules that the parser uses to recognize syntactically valid input.
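To make the idea of grammar rules concrete, here is a small hand-written parser in Python for a toy arithmetic grammar. It is a stand-in, not yacc output: yacc generates a table-driven bottom-up (LALR) parser in C, while this sketch uses recursive descent, with one function per rule. The grammar and the function name parse are assumptions chosen for the example.

```python
def parse(tokens):
    """Return the value of an arithmetic sentence, or raise SyntaxError."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():  # expr : term { '+' term }
        value = term()
        while peek() == "+":
            advance()
            value += term()
        return value

    def term():  # term : factor { '*' factor }
        value = factor()
        while peek() == "*":
            advance()
            value *= factor()
        return value

    def factor():  # factor : NUMBER | '(' expr ')'
        tok = peek()
        if tok is not None and tok.isdigit():
            advance()
            return int(tok)
        if tok == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        raise SyntaxError(f"unexpected token {tok!r}")

    result = expr()
    if peek() is not None:  # leftover tokens mean the sentence is invalid
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result

print(parse(["2", "+", "3", "*", "(", "4", "+", "1", ")"]))  # 17
```

A token list such as ["2", "+"] is not a valid sentence in this grammar, so parse raises SyntaxError for it.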
Lex (Flex): A lexical analyzer generator
A regular expression is a pattern that the regular expression engine attempts to match in input text. A pattern consists of one or more character literals, operators, or constructs.
Regular Expression Syntax
Some syntax rules for regular expressions are shown below. This list is not complete; there are many more rules. The great thing about these rules, however, is that you can combine them to build very expressive patterns.
- . matches any single character
- [ ] matches a single character that is inside the brackets
- [^ ] matches a single character that is not contained within the brackets
- ^ matches the starting position within the string
- $ matches the ending position or the position just before a string-ending new line
The following operators combine with the rules above to make patterns more flexible.
- ? matches the preceding element zero or one time
- + matches the preceding element one or more times
- | the choice operator matches the expression before or after the operator. Essentially a boolean OR
Some examples of these rules in action:
- ab?c matches only "ac" or "abc"
- [hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at"
- cat|dog matches "cat" or "dog"
- .at matches any three-character string ending with "at", including "hat", "cat", and "bat"
- ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line
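The examples above can be checked with Python's re module, whose syntax agrees with the rules listed here for these particular patterns (lex and Python differ in other details). The helper name check and the extra non-matching strings are assumptions added for illustration.

```python
import re

# (pattern, strings that should match, strings that should not)
examples = [
    (r"ab?c",    ["ac", "abc"],                     ["abbc", "ab"]),
    (r"[hc]+at", ["hat", "cat", "chat", "cchchat"], ["at"]),
    (r"cat|dog", ["cat", "dog"],                    ["cow"]),
    (r".at",     ["hat", "cat", "bat"],             ["at", "chat"]),
]

def check(pattern, text):
    # fullmatch succeeds only if the whole string matches the pattern
    return re.fullmatch(pattern, text) is not None

for pattern, good, bad in examples:
    assert all(check(pattern, s) for s in good)
    assert not any(check(pattern, s) for s in bad)

# ^ anchors the match at the start of the string
print(bool(re.search(r"^[hc]at", "hat trick")))  # True
print(bool(re.search(r"^[hc]at", "top hat")))    # False
```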
A token is a categorized block of text. The block of text itself, a sequence of characters treated as one indivisible unit, is known as a lexeme. A lexical analyzer reads in lexemes and categorizes them according to function, giving them meaning. This assignment of meaning is known as tokenization, or "tokenizing." A token can look like anything: English, gibberish symbols, anything; it just needs to be a useful part of the structured text. Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. If the lexer finds input that matches no token pattern, it typically reports an error (with lex itself you write a catch-all rule for this, because by default unmatched text is simply copied to the output).
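A minimal sketch of tokenization in Python follows, assuming a made-up set of token categories (NUMBER, IDENT, OP) for a tiny assignment language. Each token pairs a category with its lexeme, and input that fits no category raises an error. Unlike lex, Python tries the alternatives in the combined pattern in order rather than choosing the longest match, which makes no difference for these categories.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str    # the category assigned by the lexer
    lexeme: str  # the matched text

# Hypothetical token categories, each defined by a regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP",     r"[=+*()-]"),
    ("SKIP",   r"[ \t]+"),
]
# One combined pattern with a named group per category.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"invalid character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":  # whitespace is matched but not kept
            tokens.append(Token(m.lastgroup, m.group()))
        pos = m.end()
    return tokens

for tok in tokenize("total = price * 2"):
    print(tok.kind, tok.lexeme)
```

Calling tokenize("a $ b") raises ValueError, because "$" belongs to none of the categories.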
Parsing a line of code
Lex Implementation Example
Yacc (Bison): Yet Another Compiler-Compiler
Yacc rules (Context Free Grammars)
Yacc BNF format description
Yacc Code example
A quick summary of the chapter should go here
BNF: In computer science, BNF (Backus Normal Form or Backus–Naur Form) is one of the two main notation techniques for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document formats, instruction sets and communication protocols; the other main technique for writing context-free grammars is the van Wijngaarden form. They are applied wherever exact descriptions of languages are needed: for instance, in official language specifications, in manuals, and in textbooks on programming language theory.
Interpreter: In computer science, an interpreter is a computer program that directly executes, i.e. performs, instructions written in a programming or scripting language, without previously compiling them into a machine language program. An interpreter generally uses one of the following strategies for program execution:
- parse the source code and perform its behavior directly
- translate source code into some efficient intermediate representation and immediately execute this
- explicitly execute stored precompiled code made by a compiler which is part of the interpreter system
Compiler: A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code). The most common reason for converting source code is to create an executable program.
Scanner: A scanner (lexical analyzer) is a function lex : string -> Lex.token list where Lex.token is a datatype of atomic symbols of the language.
Token: A "token" is a single piece of data or an operator that has some meaning to the higher-level parser. The purpose of "tokenizing" (creating tokens) is to remove unnecessary characters (spaces, line breaks, and so on) from the input, and so substantially reduce the necessary complexity of the parser.
Parsing: Parsing is an essential step for both interpreters and compilers. It is how a program reads text from a file and makes sense of it. A parser can only do this if the grammar of the language is clearly defined, since it cannot analyze input whose grammar it does not know. Beyond recovering structure, most parsers also check the input for correct syntax.
Regular expression: A regular expression (usually shortened to regex) is a sequence of characters that forms a search pattern. This provides a common way of searching through strings. Regular expressions were introduced in the 1950s by the American mathematician Stephen Kleene. Regular grammars, type 3 in the Chomsky hierarchy, describe exactly the languages that regular expressions can match.
Rule: A rule is a statement that tells you what is allowed or what will happen within a particular system.
A list of practice problems