Mimir:Draft1 Chapter2

From Openitware
Jump to: navigation, search

Back to Table of Contents


Chapter 2: An Overview of Programming Languages

"It has been raining all day on my elephant skull and also elsewhere." - Diary of a genius, Salvador Dali

Programming Language Generations

As our power of knowledge grows from generation to generation, so does that of which we create. Every generation of programming language is built on top of its predecessor's accomplishments, spawning more powerful grammar and therefore more powerful functionality with each epoch.

First Generation (1GL)

First Generation languages are Machine Code. These cryptic instructions coded by the programmer as certain patterns of 1s and 0s are executed directly by a CPU, there is no compilation or interpretation. The instruction is a very low level task, the lowest level--a hardware level task--such as a jump, a copy of data from one register to another, a comparison of data, or an arithmetic function on data in a register. Furthermore, not all CPUs have the same machine code instruction set, though some families of processor may.

Second Generation (2GL)

Second Generation programming languages, known as assembly languages, have a one-to-one relationship with machine code, i.e. one statement in assembly language translates to exactly one statement in machine language. The difference of the two being in grammar: in assembly the programmer can read and write code much easier using labels and operations that are represented as text, where in machine code it would be represented only as binary values. In machine code to copy data from one register to another could look something like this:

01110001 10001111 01110010

In Assembly language the same instruction could look something like:

CPY 0x71 0x72

As you can see it is much easier to decipher what the code is doing in assembly language--copying the contents at memory location 0x71 to location 0x72.

Again like first generation languages, the instruction set and specific grammars of a second generation programming language are CPU dependent. Coding on this type of platform requires specification documents that include coding syntax, register names, the built in operation names, and CPU electrical hardware information, per CPU.

Unlike the first generation, assembly language requires a conversion process called assembly, to convert the code into machine language, the only language a CPU knows.

Example listing of assembly language source code (subroutine), generated by the NASM, an assembler for 32-bit Intel x86 CPU code Assembly Language:

                     ;-----------------------------------------------------------
                     ; zstr_count:
                     ; Counts a zero-terminated ASCII string to determine its size
                     ; in:   eax = start address of the zero terminated string
                     ; out:  ecx = count = the length of the string
 
                     zstr_count:                   ; Entry point
00000030 B9FFFFFFFF      mov  ecx, -1              ; Init the loop counter, pre-decrement
                                                   ;  to compensate for the increment
                     .loop:
00000035 41              inc  ecx                  ; Add 1 to the loop counter
00000036 803C0800        cmp  byte [eax + ecx], 0  ; Compare the value at the string's
                                                   ;  [starting memory address Plus the
                                                   ;  loop offset], to zero
0000003A 75F9            jne  .loop                ; If the memory value is not zero,
                                                   ;  then jump to the label called '.loop',
                                                   ;  otherwise continue to the next line
                     .done:
                                                   ; We don't do a final increment,
                                                   ;  because even though the count is base 1,
                                                   ;  we do not include the zero terminator in the
                                                   ;  string's length
0000003C C3              ret                       ; Return to the calling program

Third Generation (3GL)

Third Generation programming languages brought logical structure to the software, in order to make the languages more programmer friendly. The level of abstraction is higher, thus can be considered higher level languages than first and second generation counterparts.

New features have been implemented, like improved support for aggregate data types and expressing concepts in a way that favors the programmer, not the computer.

Early examples for 3GL are Fortran, ALGOL, COBOL, first introduced in the late 1950's, with C, C++, Java, BASIC, Pascal and others to follow. 3GLs focus on structured programming and have no meaning for object encapsulation concepts. C++, Java, C# followed a different path than the one mapped by 3GL guidelines.

Third generation programming languages are characterized by a one-to-many mapping with assembly and machine language, i.e. one statement in a third generation language will translate to numerous statements in a first or second generation language. At this time compilers and interpreters were created; their role is to handle the conversion of source code to machine code, where the one-to-many translation occurs.

Another key characteristic of a third generation language is its hardware independence, as the grammar of the language is no longer dependent on the CPU.

Fourth Generation (4GL)

The need for a Fourth Generation of programming languages appeared as a new higher level of abstraction was envisioned, one that could generate from a more natural language, the equivalent of the very complicated 3GL instructions with fewer errors.

On the early machines, by using a few cards 4GL language, replacing boxes of punched cards written in a 3GL language was common.

We can identify a few types of 4GLs:

  • Table-driven programming - the programmer uses control tables instead of code to define his logic. Example: PowerBuilder
  • Report-generator - generates a report, or a program to generate a report, based on data and report format. Examples: Oracle Reports, OpenOffice Base
  • Form-generator - manage online interactions or generate programs to do so
  • Data management - provide sophisticated commands for data manipulation/documentation, for statistical analysis and reporting. Examples: SAS, SPSS, Stata
  • Other - attempts to generate whole systems from CASE tool outputs have been made

Some engineering systems were automated to use data flow diagrams, relationship diagrams and life history diagrams to generate an impressive number of lines of code as the final product.

4GL examples: DataFlex, FoxPro, IBM Rational EGL, LabView, Perl, Python, Ruby

A sample code written in a database access 4GL can be seen below:

EXTRACT ALL CUSTOMERS WHERE "PREVIOUS PURCHASES" TOTAL MORE THAN $1000

Fifth Generation (5GL)

Fifth Generation programming languages are based on solving problems using constraints rather than an algorithm written by a programmer. Most logic programming languages are considered to be 5GLs. They are designed to make the computer solve a problem without the programmer, and are used mainly in artificial intelligence research.

Examples: PROLOG, OPS5, Mercury.

Variables

Variable Types

In computer programming, types or datatypes help classify one of the many different types of values a variable can hold. Some common datatypes include boolean (true or false), string, and integer. In a typed language such as Java, variables must have a type associated with them. The type of the variable is important because it determines a variety of characteristics of that specific variable like:

  • Possible values that can be assigned
  • The actual meaning of the variable
  • The operations one can do with variables of a certain datatype

For example we have some Java code below:

int count = 10;

That one line of code is saying we have a variable called count and it is of type int which in Java and many other languages is short for Integer. The Integer type ONLY allows values that are whole numbers like: -5, 0, 5, 10, 12, 20 etc. If we were writing a Java program and we do something like int count = "10"; and then we compile this up, we're going to get a syntactical error and the compile will fail. It will fail because we specified the type of this variable count as an int.

Why are data types important?

In computers, memory is where data and information about programs is stored. That memory is made up from binary digits known as bits stored as either a 0 or 1. Computers access these bits not as individual bits, but rather as a collection of bits usually consisting of 8, 16, 32 or 64 bits. The collection length on a machine is determined by how many simultaneous operations a CPU can process. A basic rule of thumb is the larger the number of bits, the larger chunk of data can be processed.

Now that we have context of memory, let's move onto how data is stored. Any piece of data, like a variable that has the value of 10, is stored in memory at an actual physical location - a location is known as an address. Obviously it's difficult for us to know exactly where this location is or how to access it. This is why we name this address with a variable.

Variables are a much easier way of reserving a spot in memory to hold our information for our program. This way we don't have to remember random numbers like 234098.

When we specify the datatype of a variable, for instance an int, we are setting aside an actual size of memory for our data. The number of bytes that can reserved for a data type can vary based on which programming language you're using. Like we said above, we have a variable that is of type int. When this line of code gets hit, the computer knows exactly how much memory to reserve because we used the data type int.

Strong vs Weak Typed Languages

There is a significant difference between a strong vs weakly typed language. We have already gone over a lot regarding strongly typed variables like integers and strings. When using a language with strong types, that value is known to have specific characteristics and those characteristics cannot change.

Conversely with weakly typed languages, the type and characteristics are determined on how it is used. For example, if we have an expression like this:

a = 5;
b = "10";
c = a + b;

Depending on the language, the value of c will be 5 + 10 if the code can interpret "10" as the integer value 10.

We could also make the b variable equal to "ten" and the code will convert that string value to whatever ASCII values that represents.

If we did this in a strongly typed language, the compiler would through an error as any string value assigned to an integer typed variable isn't allowed.

There are of course advantages to each.

For strongly typed languages, the programmer is forced to create the behavior of the program explicitly. There is no "hidden" behavior where another programmer could be modifying legacy code and the parameter names are not descriptive enough causing that programmer to have no idea what type of variable their working with.

For weakly typed languages, the advantage is writing less code. Also, it maybe faster because there is no overhead for processing and remembering the unique data types a strongly typed language has to do.

Dynamic vs Static Typed Language

Dynamic typing and static typing cause a good amount of confusion among programmers. Below will define in detail each one and also include an example snippet of code.

Dynamic typing

A dynamic type programming language is one that the type is interpreted at runtime. The pros of this type of language is that one can write less code quicker as they don't have to specify the type of each variable they use. The downside of this is when one has to error check. Because of the type being computed at runtime, there's no type-checking before hand so in order to test, a programmer must run the program and clean-up afterwards. Common languages that are dynamically typed include Python and PHP. Below is some Python code:

firstName = "Joshua"
firstName = 10;

We have defined the variable name firstName as a string value "Joshua" - then right after that we changed the value to an integer 10. In a dynamically typed language, this would run perfectly fine and the value (until you change it again) of firstName would be 10.

Statically typing

In a statically typed language, the type is computed at compile time instead of runtime. If when compiled and there are no errors, then the code is guaranteed to work, syntactically. One of the many pros of a static typed language includes the speed at which a programmer can fix bugs and type mismatches because of this precision. A downside of course is this requires writing more code and making sure before compiling that all variables are typed properly. Popular programming languages that are statically typed include C, C++, Java, and C#.

// Example from above.
firstName = "Joshua"
firstName = 10;
string firstName = "Joshua";
int age = 22;

Above I put the example we had for a dynamically typed language and below that an example showing how it would be written correctly in Java, a statically typed language. This is a statically typed language because all variable names AND their types must be explicitly declared. If we attempted to assign the value of 10 to firstName, we would get an error at compile time telling us that it cannot evaluate an integer to a string.

Data Types

Primitive data types

Primitive data types are mostly found in statically typed languages such as Java. Like we said above, this means that all variable names and type must be declared explicitly in order to pass the type-check at compile time. Below is a table of the more common primitive types in statically typed languages.

Type Description Example
int Whole number integers 10, 45, 109
double Real numbers 10.0, 5.67, -98.5
char Single character 'j', 's', 'a'
Complex data types

When thinking of a common complex data type, the array has to be most popular. A complex datatype is any type that holds multiple independent values. An array is a complex data type because it is one object made of up a number of indepedent values. An example in Java:

int songs[10];

This statement is saying we want to define an array variable named songs and to set aside 10 integer values in memory. It's important to note here that we did NOT initialize any of those 10 integer values inside the array, but rather just allocated the space we want to in memory.

Complex data types can also be types that you define yourself as the programmer. In Object-oriented based languages, we can create a new class which will have properties and functions inside it that actually define what the class is. For example we have this Java code:

// Creating the class Car.
public class Car {
	private int year;
	private String make;
	private String model;

	.
	.
	.

	public setYear(int y) {
		year = y;
	}

	public getYear() {
		return year;
	}

	.
	.
	.
}

// Using the Car class.
Car myCar = new Car();
myCar.setYear(2013);
System.out.println(myCar.getYear());

This code above is creating a new class titled Car which has 3 properties (year, make, model) and what I showed, 2 methods. Of course there would be "setters" and "getters" for the other properties as well. However, we now can use this Class and create a new object, or variable, of type Car. Once the line Car myCar = new Car() compiles, we now have initialized a new object of type Car and can use the public properties/methods inside that class throughout the program.

One can see exactly why the difference between a primitive data type and a complex data type is significantly different.

Variable Scope

The scope of a variable is what defines the availability of it during the executing of the program. Some programmers say that the scope of the variable is actually a property in it of itself. In other words, a variables scope is the area of a program where that specific variable has meaning and thus being visible to the code within that scope.

In most programming languages, there are different levels of scope - Global, Parametric, and Local.

  • Global variables are commonly known as variables with an indefinite scope, or visible to the entire program. Programming languages handle a global variable differently. For instance, many languages like C or C++ there is no actual global identifier, however if there is a variable defined outside a function, then that variable is treated as having "file scope" which means it's visible everywhere in the file. However in PHP, there is an actual global keyword you can place in front of defining a variable, and then you can use that variable anywhere that file is included in. For example:
// config.php
<?php

global $SITE_NAME = "Josh's CD Shop";

?>
// index.php
<?php
require_once('config.php');

echo $SITE_NAME;
?>

In those two snippets of code, we are defining a GLOBAL variable $SITE_NAME in a file named config.php. Then in a completely different file, index.php, we are including that config.php file and using the global variable we defined.

Global variables are often viewed as bad practice because it can create confusing, and more error-prone code. Code in a more larger project can be both read and more importantly maintained when the scope of variables are limited to their specific code block or function. Using global variables can cause headaches because in any part of a program, they can be modified (if their not memory protected which some languages provide), which makes it difficult to remember exactly their intended use. Below we will introduce better ways to manage your variables within a program.

  • Local variables are ones that can be accessed and used within certain functions; it is given local scope to wherever it may be. For example, say we have a block of code like below:
<?php

function test() {
	$a = 5;	
}

echo $a;

?>

When we try to run this PHP code, we will get an error. That is because we are trying to echo a variable that is NOT in scope.

In our function test(), we defined a variable $a = 5. Then directly after the } which closes that scope, we went ahead and echo'd out that same variable $a. This is wrong because the PHP file doesn't have access to LOCAL variables inside functions.

  • Parametric variables can be accessed inside a method or function which accepts it as an argument. This is how you can solve the problem of using global variables in a program - by passing values to functions instead of creating global variables so that function can use them.
<?php
class Item {
	private $price;

	function setPrice($myPrice, $amountOff) {
		if($amountOff != null) {
			// We have a sale.
			$price = ($myPrice - $amountOff);
		} else {
			$price = $myPrice;
		}
	}	

	function getPrice() {
		return $price;
	}
}

.
.
.

// Somewhere else in the program ...
$item = new Item();
$item.setPrice(100, 20);

echo $item.getPrice();

?>

This snippet above demonstrates the power and simplicity of a parametric variable. We have a class which has a private property $price ... the only way to set this value would be to create a custom setter function and pass that function a value we'd like to use. In the function setPrice($myPrice, $amountOff) we specified arguments of $myPrice and $amountOff which we will accept in the function as parametric scoped variables - they can only be accessed within this function.

Using this concept, we can create a program that is harmonious because we always know what the variables are, and their intention. If we put $amountOff in a global variable, we may never know how that variable is even defined, never mind where. Using it this way, we are complete control of where that value is set and who has access to it.

Grammar

Grammars are used to completely specify syntax rules of programming languages. The structure imposed by a grammar gives a systemic way of processing expressions.

One of the most used techniques to represent programming language grammar is Backus-Naur Form or BNF for short. This notation looks a lot like HTML and is used to describe the structure of a language.

A grammar consists of four main components: identifiers, terminals, non-terminals, and production rules.

A grammar must contain at least one rule and this rule must have an identifier. Identifiers distinguish rules from other rules, while also acting as a reference for the rule and must begin with an alpha character or an underscore.

Following can be either an alpha, a numeric, or an underscore.

Non-terminals are symbols that are used to write out the grammar. They can be expanded and are represented in < >.

Terminals are symbols that appear in the language generated by the grammar. They are written exactly as they are, inside single quotes ' '.

Production Rules are composed of of a non-terminal followed by ::= followed by a combination terminal &/or non-terminal.


Additional meta-symbols of BNF are:

::=    means "is defined as"
 |      means "or"

Example of a BNF Grammar: production rules for context free grammar (CFG) of an identifier

<ident> ::= <start_char> <remain_chars>
<start_char> ::= <symbol> | <letter>
<remain_chars> ::= <alpha_num> | <alpha_num> <remain_chars>
<alpha_num> ::= <symbol> | <letter> | <number>
<symbol> ::= _
<letter> ::= a | b | ... | z | A | B | ... | Z
<number> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This captures legal identifiers such as:

  count
  count1
  _count2
  _count34
  __count_

A non-legal identifiers would be:

 0count

Summary

A quick summary of the chapter should go here

Key Terms

A list of key terms should go here. This should be created using some sort of glossary type plugin.

Problem Sets

A list of practice problems