Mimir:Draft5 Chapter9

Back to Table of Contents

=Chapter 9: Grammars= ''"Let schoolmasters puzzle their brain,

''With grammar, nonsense and learning,

''Good liquor, I stoutly maintain,

Gives genius the better discerning." - Oliver Goldsmith (1728-1774)

Introduction
Grammars are metalanguages, that is, languages which are used to describe other languages. Grammars are used to make statements about statements in other languages and to completely specify the rules of syntax. We know grammar from normal speaking languages, such as English. Grammars are also used to describe machine programming languages. The structure formulated by a grammar leads to a systematic way of processing expressions written in a programming language.

Brief Historical Background
Talk about Pāṇini and grammar origins in Sanskrit. Lead up to BNF.

Structure of Grammar
A grammar consists of four parts: identifiers, terminals, non-terminals, and production rules. A grammar must contain at least one rule and this rule must have an identifier. Identifiers distinguish rules from other rules, while also acting as a reference for the rule. They must begin with either an alpha character or an underscore. Following can be either an alpha, a numeric, or an underscore. Non-terminals are symbols that are used to write out the grammar. They can be expanded and are represented in < >. Terminals are symbols that appear in the language generated by the grammar. They are written exactly as they are, inside single quotes ' '. Production Rules are composed of of a non-terminal followed by ::= followed by a combination terminal &/or non-terminal.The main notation used to represent grammars is Backus-Naur Form or BNF for short. This notation looks a lot like HTML and is used to describe the structure of a language.


 * ::= ; <-- production rule in BNF
 *  ::= + | = | - |
 * ::= number ...
 * ::= ;
 *  ::= { }
 *  ::= if ;
 *  ::= if ;
 * ::= ;
 *  ::= { }
 *  ::= if ;
 *  ::= if ;
 *  ::= if ;
 *  ::= if ;

Identifiers
Identifiers distinguish rules from other rules, while also acting as a reference for the rule.


 *  ::= |<start-char><seq-class>
 * <start-char> ::= |
 * <seq -chars> ::= <any-char>|<any-char>|<any-char><seq-chars>
 * <any-char> ::= | |
 * ::= a|...|z|A|...|Z
 * ::= 0|...|9
 * ::= 0|...|9

Terminals
Terminals are symbols that appear in the language generated by the grammar.They cannot be broken down further, e.g. a literal character or digit.

They are written exactly as they are, inside single quotes ' '.

Non-Terminals
Non-terminals are symbols that are used to write out the grammar and that can be reduced further by the production rules.They are represented in < >.

Production rules
Production Rules are composed of of a non-terminal followed by ::= followed by a combination terminal &/or non-terminal.

Type 0: Unrestricted Grammars
These are unrestricted grammars that encompasses all other formal languages. You can also refer to this type as a recursive enumerable language. Any language that is acceptable by a deterministic Turing Machine is considered to be recursive enumerable language. Where the rules given are allowed to at most perform one action for any given situation.

Type 0 grammars have rules of the form a -> b where a and b are arbitrary strings over a vocabulary or V, a cannot be empty. a and b are just terminals {a, b} in this grammar. Their are no non-terminals {S, A, B} and no start symbols S.

As an example of how a basic recursive enumerable language is interpreted by a deterministic Turing Machine.

For string : w Let be a L recursively enumerable language and the Turing Machine that accepts it M if w ∈ L then M halts in a final state if w ∉ L then M halts in a non-final state or loops forever

Type 1: Context Sensitive Grammars
These are grammars that can be parsed by a linear bounded non-deterministic Turing machine (The linear bounded portion of this definition means that the tape is bounded by a constant times the length of the input). In contrast to the deterministic Turing Machine the linear bounded non-deterministic Turing machine may have a set of rules that allows for more than one action for a given situation. For example, a non-deterministic Turing machine may have both "If you are in state 2 and you see an 'A', change it to a 'B' and move left" and "If you are in state 2 and you see an 'A', change it to a 'C' and move right" in its rule set.

Type 1 grammars have rules of the form Alpha A Beta -> Alpha L Beta Alpha L and Beta are strings of terminals and/or non-terminals. Alpha and Beta can be empty but L must NOT be empty. the rule S -> E is allowed but only if it does not show up on the right side of the rule.

Example for Language L

L = {a^n b^n c^n | n >= 1} let V = {S, B} and set S → aSBc abc cB → Bc bB → bb

Type 2: Context Free Grammars
With context-free grammars we are always free to replace a non-terminal in a working string with the right-hand-side of any of its rules. It doesn't matter what's on either side of that non-terminal in the working string. In other words, the context of the non-terminal doesn't matter. These are defined by the rules Alpha -> Lambda with Alpha being a non-terminal and Lambda a string of terminals and non-terminals. These are languages taht can be recognized by a non-deterministic push-down automation where a stack is formed and the parsing starts from the top and continues down the stack.

Example grammar that generates strings representing arithmetic expressions with the four operators +, -, *, /

--> number --> --> +  --> -  --> *  --> /


 * The non-terminal symbols in this example are the, which in this case is also the start symbol .The terminal symbols are {+,-,*,/,,number}
 * The first rule (or production) states that an can be rewritten as (or replaced by) a number. In other words, a number is a valid expression.
 * The second rule says that an in parentheses is also an . Note that this rule defines an expression in terms of expressions, an example of the use of recursion in the definition of context-free grammars.
 * The remaining rules say that the sum, difference, product, or division of two is also an expression.

With the above grammar we can run a parse tree for the following expression x + y * z

|                |          / | \             / | \      *    +            /       \        /         \      / | \      z      x        / | \ +           *        |        |                |        |    x        y                y        z

Type 3: Regular Grammars
These languages are all languages that can be decided by a finite state automaton also known as a state machine. These are known as Regular languages and are commonly utilized to define search patterns and the lexical structure of programming languages.

Regular Expressions
Regular Expressions (usually shortened to regex) is a sequence of characters that form a search pattern. This provides a common way of searching through strings. Regex's were created in the 1950s by an American mathematician named Stephen Kleene. In type three of Chomsky's Hierarchy the regular grammars use a similar system of terminals and non-terminals to Regex's.

Regular Expression Syntax
Some syntax rules for regular expressions are shown below. Note that this list does not include all of the regular expression syntax rules because there are many of them. The great thing about these rules however is that you can combine rules to create some very clever string parsing techniques.


 * . matches any single character
 * [ ] matches a single character that is inside the brackets
 * [^ ] matches a single character that is not contained within the brackets
 * ^ matches the starting position within the string
 * $ matches the ending position or the position just before a string-ending new line

Below is a list of metacharacters that can be used to combine some of the previous syntax rules to make it more flexible.


 * ? matches the preceding element zero or one time
 * + matches the preceding element one or more times
 * | the choice operator matches the expression before or after the operator. Essentially a boolean OR

Some examples of these rules in actions are
 * ab?c matches only "ac" or "abc"
 * [hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at"
 * cat|dog matches "cat" or "dog"
 * .at matches any three-character string ending with "at", including "hat", "cat", and "bat"
 * ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line

Summary
A quick summary of the chapter should go here

Key Terms
A list of key terms should go here. This should be created using some sort of glossary type plugin.

Problem Sets
A list of practice problems

Top of Page - Prev Chapter -  Next Chapter