Mimir:Draft2 Chapter9

From Openitware
Jump to: navigation, search

Back to Table of Contents



Chapter TODOs

  • quote
  • content
  • summary
  • key terms
  • exercises

Chapter 9: Grammars

"QUOTE HERE" - Author

Introduction to Grammars

Grammars are used to completely specify syntax rules of languages. The structure imposed by a grammar gives a systemic way of processing expressions.

A grammar consists of four main pieces identifiers, terminals, non-terminals, and production rules. A grammar must contain at least one rule and this rule must have an identifier. Identifiers distinguish rules from other rules, while also acting as a reference for the rule. They must begin with either an alpha character or an underscore. Following can be either an alpha, a numeric, or an underscore. Non-terminals are symbols that are used to write out the grammar. They can be expanded and are represented in < >. Terminals are symbols that appear in the language generated by the grammar. They are written exactly as they are, inside single quotes ' '. Production Rules are composed of of a non-terminal followed by ::= followed by a combination terminal &/or non-terminal.The main notation used to represent grammars is Backus-Naur Form or BNF for short. This notation looks a lot like HTML and is used to describe the structure of a language.


  •  ::= <statements>; <code> <-- production rule in BNF
  • <code> ::= <statement>;
  • <statement> ::= <variable> <operator> <value>;
  • <variable> ::= <identifier>;
  • <op> ::= + | = | - |
  • <value> ::= <variable> | <number>
  • <identifier> ::= number ...
  • <number> ::= ...
  • <statement> ::= ;
  • <statement> ::= <if – clause>;
  • <code - block> ::= {<code>}
  • <if – statement> ::= if(<expression>) <statement>;
  • <if – statement> ::= if(<expression>) <block – statement>;


::=    means "is defined as"
 |      means "or"

Identifiers

Identifiers distinguish rules from other rules, while also acting as a reference for the rule.

  • <id> ::= <start-char>|<start-char><seq-class>
  • <start-char> ::= <letter>|<symbol>
  • <seq -chars> ::= <any-char>|<any-char>|<any-char><seq-chars>
  • <any-char> ::= <letter>|<symbol>|<digit>
  • <letter> ::= a|...|z|A|...|Z
  • <symbol> ::= ~|$
  • <digit> ::= 0|...|9

Terminals

Terminals are symbols that appear in the language generated by the grammar.They cannot be broken down further, e.g. a literal character or digit.

They are written exactly as they are, inside single quotes ' '.

Non-Terminals

Non-terminals are symbols that are used to write out the grammar and that can be reduced further by the production rules.They are represented in < >.

Production rules

Production Rules are composed of of a non-terminal followed by ::= followed by a combination terminal &/or non-terminal.

Chomsky's Hierarchy

Type 0: Unrestricted Grammars

These are unrestricted grammars that encompasses all other formal languages. You can also refer to this type as a recursive enumerable language. Any language that is acceptable by a deterministic Turing Machine is considered to be recursive enumerable language. Where the rules given are allowed to at most perform one action for any given situation.

Type 0 grammars have rules of the form a -> b where a and b are arbitrary strings over a vocabulary or V, a cannot be empty. a and b are just terminals {a, b} in this grammar. Their are no non-terminals {S, A, B} and no start symbols S.

As an example of how a basic recursive enumerable language is interpreted by a deterministic Turing Machine.

For string : w
Let be a L recursively enumerable language
and the Turing Machine that accepts it M
if w ∈ L then M halts in a final state
if w ∉ L then M halts in a non-final state or loops forever

Type 1: Context Sensitive Grammars

These are grammars that can be parsed by a linear bounded non-deterministic Turing machine (The linear bounded portion of this definition means that the tape is bounded by a constant times the length of the input). In contrast to the deterministic Turing Machine the linear bounded non-deterministic Turing machine may have a set of rules that allows for more than one action for a given situation. For example, a non-deterministic Turing machine may have both "If you are in state 2 and you see an 'A', change it to a 'B' and move left" and "If you are in state 2 and you see an 'A', change it to a 'C' and move right" in its rule set.

Type 1 grammars have rules of the form Alpha A Beta -> Alpha L Beta Alpha L and Beta are strings of terminals and/or non-terminals. Alpha and Beta can be empty but L must NOT be empty. the rule S -> E is allowed but only if it does not show up on the right side of the rule.

Example for Language L

L = {a^n b^n c^n | n >= 1}
let V = {S, B} and set
S → aSBc
abc
cB → Bc
bB → bb

Type 2: Context Free Grammars

With context-free grammars we are always free to replace a non-terminal in a working string with the right-hand-side of any of its rules. It doesn't matter what's on either side of that non-terminal in the working string. In other words, the context of the non-terminal doesn't matter. These are defined by the rules Alpha -> Lambda with Alpha being a non-terminal and Lambda a string of terminals and non-terminals. These are languages taht can be recognized by a non-deterministic push-down automation where a stack is formed and the parsing starts from the top and continues down the stack.

Example grammar that generates strings representing arithmetic expressions with the four operators +, -, *, /

<expr> --> number
<expr> --> ( <expr> )
<expr> --> <expr> + <expr>
<expr> --> <expr> - <expr>
<expr> --> <expr> * <expr>
<expr> --> <expr> / <expr>
  • The non-terminal symbols in this example are the <expr>, which in this case is also the start symbol .The terminal symbols are {+,-,*,/,(,),number}
  • The first rule (or production) states that an <expr> can be rewritten as (or replaced by) a number. In other words, a number is a valid expression.
  • The second rule says that an <expr> in parentheses is also an <expr>. Note that this rule defines an expression in terms of expressions, an example of the use of recursion in the definition of context-free grammars.
  • The remaining rules say that the sum, difference, product, or division of two <expr> is also an expression.

With the above grammar we can run a parse tree for the following expression x + y * z

         <expr>            <expr>
           |                 |
         / | \             / | \
     <expr>*<expr>    <expr> + <expr>    
       /       \        /         \
     / | \      z      x        / | \
<expr> + <expr>            <expr> * <expr>    
   |        |                |        |
   x        y                y        z

Type 3: Regular Grammars

These languages are all languages that can be decided by a finite state automaton also known as a state machine. These are known as Regular languages and are commonly utilized to define search patterns and the lexical structure of programming languages.

Regular Expressions

Regular Expressions (usually shortened to regex) is a sequence of characters that form a search pattern. This provides a common way of searching through strings. Regex's were created in the 1950s by an American mathematician named Stephen Kleene. In type three of Chomsky's Hierarchy the regular grammars use a similar system of terminals and non-terminals to Regex's.

Regular Expression Syntax

Some syntax rules for regular expressions are shown below. Note that this list does not include all of the regular expression syntax rules because there are many of them. The great thing about these rules however is that you can combine rules to create some very clever string parsing techniques.

  • . matches any single character
  • [ ] matches a single character that is inside the brackets
  • [^ ] matches a single character that is not contained within the brackets
  • ^ matches the starting position within the string
  • $ matches the ending position or the position just before a string-ending new line

Below is a list of metacharacters that can be used to combine some of the previous syntax rules to make it more flexible.

  •  ? matches the preceding element zero or one time
  • + matches the preceding element one or more times
  • | the choice operator matches the expression before or after the operator. Essentially a boolean OR

Some examples of these rules in actions are

  • ab?c matches only "ac" or "abc"
  • [hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at"
  • cat|dog matches "cat" or "dog"
  • .at matches any three-character string ending with "at", including "hat", "cat", and "bat"
  • ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line

Summary

A quick summary of the chapter should go here

Key Terms

A list of key terms should go here. This should be created using some sort of glossary type plugin.

Problem Sets

A list of practice problems