Speech:Summer 2013 Eric Beikman

Week 1: June 2 - June 8th
Goals:
 * Re-familiarize myself with Sphinx.
 * Read about Sphinx to determine courses of action.
 * Begin to create scripts to automate the experiment process.

Results

This week, my goals were to determine a course of action for the upcoming weeks. Last semester we achieved average word error rates in the mid-to-low 30s using a 5-hour corpus. This semester, we hope to drive the word error rate of the models as low as we can, to establish a set of baseline models and scores. Once such baselines are established, we will have a reference point to gauge the effectiveness of new speech-recognition technologies.

Due to the amount of experiments we will be doing, a side-task of generating a set of scripts to automate the process will be invaluable. By spending less time on running the experiments, we can spend more time analyzing the data. These scripts will begin to automate individual groups of steps within the existing process, with the ultimate goal of having a single control script sequentially execute each of the smaller scripts.

This week, I have created the first two of these scripts:
 * train_01.pl
 * Automates the experiment directory setup steps. It will determine the lowest available experiment number to use, or accept a specified experiment number, and it will prevent the user from overwriting an existing experiment.
 * train_02.pl
 * Automates the Trainer configuration step. Depending on the flags given, it will either edit the config file on lines 6-8 and 79-80 to match the specific experiment, or replace the entire config file with one that is provided.
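As a rough illustration, the two modes can be sketched in shell (the real script is Perl; the line number, config variable, and paths here are illustrative stand-ins):

```shell
# Hypothetical sketch of train_02.pl's two modes (names and paths illustrative).
edit_mode() {
  # Edit mode: rewrite one of the experiment-specific lines in place;
  # train_02.pl does this for lines 6-8 and 79-80 of sphinx_train.cfg.
  sed -i "6s|.*|\$CFG_BASE_DIR = '$1';|" sphinx_train.cfg
}
replace_mode() {
  # Replace mode: swap in a provided config file wholesale.
  cp "$1" sphinx_train.cfg
}
```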

Week 2: June 9 - June 16th
Goals
 * Improve the Scripts created last week.
 * Begin to run experiments to decrease the word error rate.

Results

The new set of scripts has proved its utility by automating the first few tasks in the training process, which were otherwise very tedious to do by hand.

For train_01.pl, I've discovered a bug which results from the inconsistent output of $0 within Perl. I originally assumed that $0 would only return the script's name; however, on certain systems, it returned not only the name but also the filepath! To resolve the bug, I used the File::Basename module and its basename($0) function, which strips off any filepath that $0 returns.

Since the sets of experiments which I will be running throughout this semester may incorporate heavily customized sphinx_train.cfg files, I needed a way to easily adapt a sphinx_train.cfg file to a new experiment. To do so, I modified train_02.pl with a new operation. Using the -c flag and specifying the filepath of a sphinx_train.cfg file, the script will now copy over the given file and make only the adjustments needed for the given experiment. This feature was useful during this week's experiments, as it allowed me to quickly copy and convert the customized config files from the training experiment to its corresponding decode and score experiment.

As these scripts have proven their usefulness, I will be creating additional scripts in the future to further streamline the process of running experiments.

As for experiments: in my research last week, I identified two possible ways to increase the word accuracy of generated models:
 * 1) Adjust the minimum number of Baum-Welch iterations.
 * 2) Increase the senone value for the train.

I'm concerned that we may have reached the limit of preparing the data for Sphinx, meaning that the data and transcripts are already optimal for Sphinx and further data preparation may not yield any improvement.

Using the new train_01.pl script, I've generated four new experiments: 0101, 0102, 0103, and 0104. 0101 and 0102 are training experiments which correspond to the two areas identified above. Experiments 0103 and 0104 are decode tests on the trains from experiments 0101 and 0102, respectively.

For experiment 0101, I raised the minimum Baum-Welch iterations from the default value of 1 to 4. Based on my research, the Sphinx Trainer will automatically determine the number of iterations it needs to create the best model; by adjusting the value, we might see a slight increase in accuracy. That said, the manual warns that the more BW iterations are forced, the more likely the model will be tuned exclusively to the training data, increasing the word error rate for decodes using data which isn't part of the training corpus. In experiment 0103, the results from decoding the model from 0101 were surprising: the word error rate was probably the worst I've seen yet. It is safe to assume that determining the number of required BW iterations is best left to Sphinx.

For experiment 0102, we adjusted the senone value from 1000 to 2000, as recommended by CMU for the amount of data utilized (5 hours). This change had a very beneficial effect on the word error rate of the model, generating the lowest score we've seen for the Last_5hr/train corpus: 29.3, a decrease of about 3 points from our previous best in experiment 0090. As a result, we need to ensure that we adjust the senone value within the Sphinx trainer to match what is recommended by CMU.

It will be interesting to see the effects of increasing the senone value even further using the same data set. For instances like this, it would be nice to have a tool which could create a number of mostly similar experiments with differing values; for example, 5 experiments with senone values ranging from 2500 to 4000.
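A sketch of such a tool in shell (the one-line config is a stand-in for a real sphinx_train.cfg; $CFG_N_TIED_STATES is SphinxTrain's senone count, and the filenames are illustrative):

```shell
# Generate near-identical configs that differ only in their senone value.
# One-line stand-in for a real sphinx_train.cfg, for demonstration:
printf '$CFG_N_TIED_STATES = 1000;\n' > sphinx_train.cfg
for senones in 2500 3000 3500 4000; do
  sed "s/^\([$]CFG_N_TIED_STATES *= *\).*/\1$senones;/" sphinx_train.cfg \
      > "sphinx_train.$senones.cfg"
done
```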

Week 3: June 17 - June 23
Goals

 * Experiment with senone values.
 * Look into putting SpEAK onto Caesar.

Sphinx
Last week, I determined that increasing the number of senones in an acoustic model results in a higher-scoring model. This week I decided to explore what happens if we increase the senone value even further.

I ran the following experiments, which utilize the same variables as Experiment 0089 and Experiment 0102. Namely:
 * 1) The same 5-hour corpus (/mnt/main/corpus/switchboard/last_5hr/train) for creating the model.
 * 2) The same version of the genTrans.pl script. (genTrans5.pl)
 * 3) The same dictionaries and phone lists.
 * 4) The same 30-minute corpus for the decode experiments (last_5hr/test).

The differences were:
 * Experiment 0105 and its corresponding decode experiment
 * Use a senone value of 2500.
 * Experiment 0106 and its corresponding decode experiment
 * Use a senone value of 3000.
 * Experiment 0107 and its corresponding decode experiment
 * Use a senone value of 4000.



The graph above illustrates a diminishing improvement past a senone value of 2500. For a 5-hour corpus, CMU recommends a senone value between 1000 and 2500, so the results above roughly correlate with CMU's recommendations. It can be assumed that after a certain value, the increase in word accuracy may not justify the additional overhead caused by the additional senones; the decoder will need to work even harder to find a matching senone, making decodes take longer. A higher senone value pays off when you have more sample data: more data contains a greater variety of triphones, and thus benefits from an acoustic model which has more of them defined.

Below is a table of suggested senone values for different corpus sizes. This data is taken directly from CMU's Sphinx 3 FAQ, and is recorded here for convenience.

SpEAK
The SpEAK source code is located here

Before I put the site on a public-facing server (while the public may not have credentials to Caesar, people can still try to log in), I wanted to test the site and check the code to ensure that it is indeed ready for deployment; namely:
 * 1) Verify that unauthorized users can't log in or otherwise use exploits to either root the system or damage the database.
 * 2) Ensure that there are no major bugs in the code which will negatively affect the server.
 * 3) Determine the best way to install it.

As such, I have hosted the site on my personal Linux test machine using XAMPP. XAMPP is an easily installable Apache distribution including Apache, a PHP interpreter, MySQL, and other utilities, with all the configuration needed to run everything out of the box. By using XAMPP, I was able to quickly get the SpEAK site up and running. I've listed my observations below:

Recommendations for future classes:
 * Clean up the Trunk branch
 * There are quite a few scripts, documentation files, and other files which are not applicable to the current code base and should be removed.
 * For example, there is a PDF in the SQL directory containing diagrams for the database; however, it's out of date!
 * There are a few scripts to populate the database with sample data which don't work, presumably due to changes in the database schema.
 * There are little to no instructions on how to set up the system!
 * There is little to no documentation in general!
 * That being said, the code itself is pretty well commented.
 * add.php has a note which states that if two people think they are creating an experiment with a given experiment number, only one will take.
 * This really isn't acceptable; think of how frustrating it would be to create and type up an experiment, only to have it not save because somebody took your experiment number while you were typing.
 * PHP is whining that date() has not been assigned a time zone.
 * The user edit screen needs a second 'confirmation' password field.
 * Getting locked out of your account because of a typo when you tried to change your password is not cool.
 * Include a radio button determining the type of experiment.
 * There are a few different types of experiments due to how Caesar is set up:
 * Training experiments, where acoustic models are generated but not tested.
 * Decode experiments, where Language models are generated and an acoustic model is tested.
 * The reason to keep these separate is that normally you wish to test models using a corpus that is smaller than the corpus used to make the models. This is mainly for time reasons; there isn't a good reason to run a 5-hour decode on a 5-hour train! A 30-minute decode to test models built from 5 hours of audio is far more efficient in terms of time!
 * Train and Decode experiments, where models are created, then a decode is run using the same audio data used to create the models.
 * Other: A catch all for all experiments which don't meet the above types.
 * A database table for corpora.
 * Depending on the type of experiment performed, each experiment will have an association with the corpus used to create the models and/or decode.
 * Info about the corpus could include the filename/path, length, source (e.g. Switchboard), and offset if the corpus is a subset of a larger corpus (for example, the last_5hr corpus has an offset of 303 hours, as it starts 303 hours into the 308-hour Switchboard corpus).
 * The SpEAK Google site page needs more detail as to what exactly the code represents, how it works, and its licensing. If this is open software, then we need to document this so it can be distributed to those who might find it useful.
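The time-zone whine above has a one-line fix in php.ini (the zone here is an example; pick the server's actual zone):

```ini
; php.ini -- silences the "it is not safe to rely on the system's timezone
; settings" warnings from date(). Example zone; substitute the real one.
date.timezone = "America/New_York"
```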

Next week I will begin to prep Caesar to host this site. Some research will need to be done to determine the best way to do this. More than likely we should utilize an HTTPS connection; this will require a bit of research on my part.

=Week 4: 6/24-6/30 =

Goals:

 * Get SpEAK up and running on Caesar.
 * Research if others have used the Switchboard corpus with Sphinx
 * Learn more about the variables in the Sphinx Trainer config file (sphinx_train.cfg).
 * Specifically, how they would affect the word error rate of any generated models.

SpEAK
SpEAK operates in a LAMP environment. LAMP is an acronym for Linux, Apache, MySQL, and PHP. As Caesar is currently running a Linux O.S. (OpenSUSE), we simply need to install and configure Apache2, MySQL, and PHP. The following instructions are based on a set from here. The link mainly details how to install and configure it on OpenSUSE, the installation portions will vary based on the Linux distro in use, but the actual configuration is more or less the same across distributions.


 * Steps:

MySQL

 * 1) Install MySQL.
 * *Use YAST.
 * *They've changed the name of the MySQL package to mysql-community-server.
 * *The MySQL client will also be installed.
 * 2) Configure runlevels so MySQL starts automatically.
 * *Runlevels are used by most Unix & Unix-like operating systems to determine which services run at which time.
 * **Reference [en.wikipedia.org/wiki/Runlevel Wikipedia] for more information.
 * *Use
 * **It adds the daemon (mysql) to start at the necessary runlevels.
 * **Use  to see if MySQL is set to run correctly.
 * ***Check that at least runlevel '5' is set to 'on'.
 * 3) Start the mysql service.
 * *You can either reboot the box, or simply enter in
 * 4) Run the MySQL secure installation script.
 * *Run it completely! Do not exit out halfway.
 * 5) The initial root password is null, so hit enter.
 * 6) Set a new root password!
 * *Don't make it the same as the machine's root password; this way, if someone figures out the MySQL password, they still won't have the machine's root password.
 * 7) Remove anonymous users (Y). Any users that want access to data must log in.
 * 8) Disallow root login remotely (Y). Only people who have access to the machine should be able to log in as root.
 * 9) Remove the test database (Y). We might as well; it's not needed.
 * 10) Reload privileges (Y). This ensures that all changes made thus far are enforced immediately.
 * 11) Change where MySQL saves its data.
 * *By default, MySQL will put the actual database-related files within /var/lib/mysql. This is on the OS partition, which has no redundancy due to how Caesar is set up. We want to move it to the RAID drives, whose mount point is /mnt/main.
 * *The easiest way to do this will probably be to make /var/lib/mysql a softlink to a directory in /mnt/main. This is similar to how /usr/local is redirected to /mnt/main/local.
 * 12) Stop the mysql daemon.
 * 13) Move the existing database data to its new location.
 * 14) Create a softlink linking /var/lib/mysql to the /mnt/main/var/mysql directory.
 * 15) Start the MySQL daemon back up.
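The data-directory move in the last few steps can be sketched as a small shell helper (paths are examples; run it as root with the mysql daemon stopped):

```shell
# Relocate a data directory and leave a softlink at the old path.
move_datadir() {
  src="$1"    # e.g. /var/lib/mysql (on the non-redundant OS partition)
  dest="$2"   # e.g. /mnt/main/var/mysql (on the RAID volume)
  mkdir -p "$(dirname "$dest")"
  mv "$src" "$dest"      # move the existing database files
  ln -s "$dest" "$src"   # old path now points at the new location
}
# Usage, roughly:
#   rcmysql stop
#   move_datadir /var/lib/mysql /mnt/main/var/mysql
#   rcmysql start
```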

You can log in to mysql using:  Database users and passwords are not the same as the machine's. Root access on the machine is needed to get access to the MySQL client; otherwise you will get a CNF error. I need to do some research as to why this is.

Apache2 And PHP
Since the PHP interpreter is tightly coupled to Apache and doesn't need any configuration, installing Apache and PHP is combined into a single set of instructions. This section is also based on the official OpenSUSE documentation.


 * 1) Using YAST2 (or your distro's software package manager), search for, select, and install the following packages (and any dependencies):
 * *apache2
 * **apache2 was already installed on Caesar, but was not configured or running.
 * *apache2-mod_php5
 * 2) Ensure that the Apache daemon starts automatically.
 * 3) Start Apache.

Apache should now be up and running. You may need to verify that the machine's firewall allows incoming requests on port 80; otherwise you won't be able to see any web sites. As recommended by the OpenSUSE documentation, instead of editing the master config file (httpd.conf), we will set up a virtual host for the site. This effectively allows Caesar to host more than one website; in other words, it makes Caesar more flexible and capable. It may be a good idea to use this as a way to host multiple versions of SpEAK: a stable "Production" site in which actual experiment data will be recorded, and an unstable "Test" site used to test new versions of SpEAK for major flaws before they are rolled over to the stable site.

We have the following goals for Apache:
 * All requests on port 80 are redirected to the UNH Speech Home page on Foss.
 * All secure requests on port 443 are sent to SpEAK.


 * 1) To show existing virtual hosts, use
 * *Caesar wasn't running any virtual hosts at the time, so it returned nothing when I ran it.
 * 2) Go to /etc/apache2/vhosts.d/
 * 3) Copy one of the config templates according to the type of site being set up. Give the file a nice, descriptive name reflecting what configuration it is for, but make sure it has a .conf suffix! Otherwise it will not load!
 * *Copy vhost.template for a normal HTTP host.
 * **We need this for the "default" host.
 * *Copy vhost-ssl.template for a secure HTTPS host.
 * **We need this for our SpEAK site.
 * 4) Configure the new virtual host config. Open it with your favorite text editor; make sure you use "sudo" or else you won't have write privileges.
 * 5) The <VirtualHost> tag defines the address (and thus the interfaces) and port associated with this virtual host. By default it is *:80 (or *:443 for HTTPS/SSL configs); that is, it will accept connections on all interfaces on port 80 (or 443).
 * *The defaults are fine.
 * 6) Define the ServerAdmin statement.
 * *This defines who to contact in case something goes wrong with the server. It needs to be an email address.
 * *I used Mike Jonas's address for this host, as he is the individual who wields ultimate control over these machines.
 * 7) Define the ServerName statement.
 * *As the name suggests, this is where you put the server's full DNS address.
 * 8) Define the DocumentRoot statement.
 * *The document root is essentially the directory on the server which represents the root of the website.
 * *DO NOT MAKE IT THE SERVER'S ROOT DIRECTORY!
 * **That isn't terribly bright, as it could potentially give access to everything on the server.
 * *In our case, we are setting it to one of the directories within /mnt/main/srv/vhosts/, depending on the site the config is for.
 * 9) Define the ErrorLog and CustomLog statements.
 * *These define where logs are stored.
 * *The default directory (/var/log/apache2) is fine for both logs. We just need to change the logs' names.
 * **I just changed the names of the logfiles to  -error_log and  -access_log combined, for each type of logfile respectively.
 * 10) Ensure that HostnameLookups is off. Having this on makes the server do reverse-DNS queries for connected clients; this is wasteful and not needed.
 * 11) Comment out or remove the ScriptAlias statement. It isn't needed.
 * 12) Disable the user-directory directive.
 * *This directive defines user directories; in other words, what would happen if you stuck a ~/ at the end of the website's root. It's what UNH's student webpage hosting at pubpages.unh.edu uses.
 * *We don't want this for Caesar; it isn't needed and it's a security risk.
 * *Disable it by ensuring that  is the only thing within this directive.
 * 13) Set up the <Directory> directive.
 * *The default value for it is
 * *The part in quotations needs to be set to your website's root directory (as defined above).
 * 14) Ensure that the Order keyword is set to  and the Allow keyword is set to
 * 15) To ensure that there aren't any errors in the config file(s), run
 * 16) Edit the listen.conf file to change which ports Apache listens on.
 * *By default, Apache will not listen on port 80; it will instead listen only on port 8080 or 443 (for https connections only), regardless of what is defined in the virtual host's config file.
 * *For ours, we defined:
 * **To listen on port 80.
 * **To listen on port 443.
 * *You then need to close any ports that aren't in use, by either deleting the line or commenting it out.
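Putting the ServerAdmin/ServerName/DocumentRoot/log settings together, the resulting vhost looks along these lines (server names, paths, and the admin address below are placeholders, and the actual template on disk contains more than this):

```apache
<VirtualHost *:80>
    ServerAdmin webmaster@example.edu
    ServerName caesar.example.edu
    DocumentRoot /mnt/main/srv/vhosts/default
    ErrorLog /var/log/apache2/default-error_log
    CustomLog /var/log/apache2/default-access_log combined
    HostnameLookups Off
    <Directory "/mnt/main/srv/vhosts/default">
        Order allow,deny
        Allow from all
    </Directory>
</VirtualHost>
```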

Special Instructions for the default site:
We need to do some additional steps for this site. For now, let's make all requests which reach Caesar on port 80 redirect to the Speech page on Foss. It will be a nifty shortcut for those who wish to bypass the slow parts of Foss to get to the Speech portions; in the future, this method may be used to redirect non-secure http connections (port 80) to secure ones (port 443). The best way to implement this is the mod_rewrite extension, within the default.conf virtual host config file. To do this, simply enter the following into the appropriate virtual host:

RewriteEngine on
RewriteRule ^(.*) http://foss.unh.edu/projects/index.php/Speech:Home [R=301,L]
 * 1) Redirects all traffic to the Speech project homepage on Foss.
 * RewriteEngine on is the directive which turns on the rewrite engine.
 * The rewrite engine is a powerful tool used to shape HTTP requests.
 * RewriteRule is where we specify what happens to which requests.
 * It follows the syntax: RewriteRule Pattern Substitution [flags]
 * Where:
 * Pattern is a standardized perl-style regular expression.
 * In our case we match everything, starting at the beginning.
 * Substitution is what the matched request will be rewritten to.
 * In our case we want it to go to the Speech page on Foss.
 * [flags] is one or more commands encapsulated within the two square brackets. We have two:
 * R defines the HTTP redirect code sent back to the client. In our case it's code 301, or "Moved Permanently".
 * L states that this is the last RewriteRule to be processed.

If you get an error message stating that Apache cannot find the "RewriteEngine" directive, the rewrite module may not be enabled. Use  to easily enable it.

Special Instructions for SpEAK virtual host:
Since this host will be running an HTTPS service, we need to do some things differently:
 * Ensure that you copy over the proper virtual host template.
 * We need to create an SSL cert. We are gonna use the official OpenSUSE instructions, which can be found here.
 * We are gonna make a self-signed certificate, so the client will complain about an untrusted connection. But it's better than nothing.


 * Change directories to /usr/share/doc/packages/apache2
 * Execute
 * At the prompts, enter:
 * 'R' for RSA.
 * Enter the info asked for. Since we are signing it ourselves, we are using a fake CA.
 * Enter '3' for the certificate version.
 * Enter data similar to the step before last. It does not have to be the same.
 * Enter '3' for the certificate version.
 * Enter 'Y' to encrypt the CA's RSA private key.
 * I used the server's root password for the pass phrase.
 * Enter 'N' so that the server's RSA private key is not encrypted (otherwise Apache would need the pass phrase every time it starts).
 * This certificate is good for ONE YEAR. In other words, it needs to be renewed before June 28, 2014.
 * The certificate has been moved to its proper locations under /etc/apache2. The script output lies.
 * Reload the server and attempt to connect. If it's acting like the server doesn't exist:
 * Check the listen.conf file to see that port 443 is open.
 * Ensure that mod_ssl is installed on the server (it should be by default).
 * Make sure that the SSL flag for Apache was set.
 * Then do a HARD restart of the httpd daemon: stop it, then start it up again. Reloading it gracefully will not work.

We now need to download the SpEAK source code from its SVN repository. Execute the following in the /mnt/main/srv/vhosts/ directory. Make sure that the speak directory does not already exist.

Once the above is done, ensure that the document root is set to the php directory within speak. The database is empty, so you should log in with the database's root credentials and create an administrator. Then log out, log back in as that new administrator, and create user accounts. One should never use the database administrator account with SpEAK once an administrator is defined! In the future, access to SpEAK from this account should be disabled by default.

Unfortunately, I ran into a bit of trouble with SpEAK. Namely, it was erroring out on some constructs. Error logs are found in the /var/log/apache2/ directory; the one I was interested in was speak-error_log. After some reading, I found it was throwing errors on portions of code using syntax introduced in later versions of the PHP interpreter; it turns out the PHP interpreter included with the distribution was out of date.

Since a newer version cannot be found in the mainstream repositories for the version of OpenSUSE installed, I had to update PHP manually. Updating Caesar to a more modern version (of any distro) should be a priority for this reason.

To bypass this, I had to:
 * 1) Find a repository containing RPM packages of the necessary packages for this version of SUSE.
 * *In this case, I found a customized repository created by Corot Sebastien (scorot). Thanks!
 * 2) Set it as a custom repository within Zypper.
 * 3) Force an update to PHP5 and all associated packages.
 * *By default, SUSE prefers to keep packages updated from the repository they were installed from.
 * **Since we need to update using RPMs from another repository, we essentially need to re-install PHP and all related packages.
 * ** will do.
 * 4) Zypper will state there are problems, as all the dependencies need to be updated to this new repository too.
 * *Selecting solution "1" will do just this.

Of course, the update broke Apache. After installing some packages containing the necessary dependencies, and a reboot, the machine won't let me back on....

Apache tips:

 * Almost all of the config files mentioned above are merely extensions of httpd.conf.
 * Meaning that you could put the same directives into any of the conf files, including httpd.conf, and Apache will work.
 * The conf files are separated to help manageability (related directives are grouped together), then are "imported".
 * After changing a config file, you can 'gracefully' restart the server by executing
 * It is "graceful" because instead of suddenly shutting down the server and thus cutting off active requests, it waits until all active requests have been completed, THEN restarts.
 * Apache2 has to be restarted after a config change. It does not need to be restarted if a site's content has changed.
 * Do NOT edit the httpd.conf file!
 * In most cases it isn't needed. Use virtual hosts; not only is that more flexible, but it's also easier (messing up an httpd.conf file affects every site on the server).
 * To easily enable an Apache module, use:
 * To easily disable an Apache module, use
 * Make sure you restart Apache after doing either command.
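The commands elided above are, as best I can reconstruct them, SUSE's Apache helper scripts (with apachectl as the distribution-neutral fallback); treat these as assumptions to verify on Caesar:

```
a2enmod rewrite          # enable an Apache module
a2dismod rewrite         # disable an Apache module
apachectl configtest     # check the config file(s) for errors
apachectl graceful       # 'graceful' restart: finish active requests first
rcapache2 restart        # SUSE's init wrapper for a hard restart
```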

=Week 5 (7/1-7/7) =

Goals

 * Fix Caesar
 * Roll back updates
 * Configure Marathon into Rome
 * Rome will be a server running Fedora 19
 * Run some test experiments on Rome to verify that changing the OS does not affect the results.
 * Attempt to get SpEAK running on Rome.

Results

 * Fix Caesar:
 * Rolled back all recent updates.
 * The newest OpenSSL update in particular caused issues with SSHD.
 * The easiest way to do this was to simply remove the new (unofficial) repository and re-install the packages that I had updated.
 * After a reboot, sshd was running normally and allowing remote connections.

This little fiasco proved that SpEAK will not run on OpenSUSE 11.3 without updating various components, which may break other components; the fixes to those issues will break yet more components, and so forth.

We determined the best option is to upgrade the server to a newer OS. Since we were already talking of switching distros, we decided to switch to Fedora 19.

Marathon is a Dell desktop which was running Windows XP. We decided to make an image of the unit (to ease reverting back to its original state if we so choose), then wipe the drive, install Fedora on it, and dub it "Rome". Rome will be our Fedora test box.

Our goals are simple:
 * Determine if Fedora meets our needs.
 * Determine if Fedora works with Sphinx
 * Determine if using Fedora affects the results of experiments.
 * Determine if Fedora will be a good platform for hosting SpEAK.


 * Install Fedora on Rome:

To make an image of the existing drive, I booted off the Fedora LiveCD and used dd, piped into xz to compress the image. It took a while, but xz was able to get a 50GB image down to about 25GB; decompressing an xz (LZMA2) file takes much less time, and thus shouldn't take too long when/if we put the image back. This image has been put onto the drive at /home/mcy58/marathon.img.xz
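The imaging step looked roughly like this (device and output paths are examples; double-check the device name with something like fdisk -l before pointing dd at it):

```shell
# Image a whole disk and compress it on the fly with xz.
image_disk() {
  src="$1"   # e.g. /dev/sda -- the disk to image
  out="$2"   # e.g. /home/mcy58/marathon.img.xz
  dd if="$src" bs=4M status=none | xz -T0 > "$out"
}
# Restore later with:
#   xz -dc /home/mcy58/marathon.img.xz | dd of=/dev/sda bs=4M
```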

This machine has been taken back to my house for testing. Since it needs access to the private LAN for NFS and for web hosting, we need to set up a host-to-host VPN between Rome and Caesar.

I chose strongSwan for this task, as it's well supported on both Caesar (OpenSUSE) and Rome (Fedora).

Made Caesar give access to /mnt/main over 192.168.10.0/23, allowing access from both the existing 192.168.10.0/24 subnet and the new 192.168.11.0/24 subnet (which will consist of Marathon over a VPN tunnel).

Once the tunnel was established, Caesar and Rome behaved as if they were directly connected to each other.

Installing Apache
Now, to install Apache:
 * Unlike in SUSE, the Fedora package for Apache is called 'httpd', which is the name of the running Apache daemon in both distributions.
 * PHP is simply called "php".
 * MySQL is called the same as Suse though.
 * This site is a good resource for installing the above packages.

Apache, along with PHP and MySQL, was installed on the machine. Before configuring SpEAK, I set up a test page to determine whether the port-forwarding/static NAT/masquerade configured on Caesar was working. Ideally, Caesar will take all traffic incoming on ports 80 and 443 and forward it to Rome, which will respond to the request. Although I know that Rome is hosting websites properly, Caesar is not forwarding traffic properly. I'm apprehensive about playing with the firewall remotely (lest I kick everybody out again).

Testing Sphinx

 * Like the other servers, Rome has a soft link at /usr/local that points to a customized local directory on Caesar (at /mnt/main/local). As this is where most of the Sphinx utilities are installed, this method ensures that the version of Sphinx is consistent across every server within the cluster.
 * Running train_01.pl failed the first few times.
 * After some troubleshooting, I determined the Sphinx configuration script was failing because it could not find a module (Pod/Usage.pm).
 * This was somewhat concerning, as research had determined that this module is part of the standard Perl distribution.
 * As Perl came pre-installed as part of Fedora, a missing module may indicate a failed installation.
 * To play it safe, I decided to search for (and possibly install) the missing module using CPAN.
 * CPAN was not installed by default; I had to install it using Yum.
 * After installing and configuring CPAN, I was able to search for and confirm that Pod/Usage.pm does exist on Rome.
 * Note: in order for Sphinx to run properly, CPAN must be installed and configured (execute cpan once and follow the prompts).
 * Unfortunately, GenTrans failed the first few times.
 * After some debugging, it turns out that sox is the program which is failing.
 * Apparently an executable can be found within /usr/local on caesar.
 * This is how sox is accessed on the batch machines.
 * However, this version of sox appears to be incompatible with Rome for whatever reason.
 * Even if sox is installed directly on Rome (within /usr/bin), the path variable always chooses the executable in /usr/local first.
 * Oddly, the path variable on caesar is reversed in that /usr/bin is checked before /usr/local/.
 * I was able to temporarily fix this problem by temporarily overriding the $PATH variable using the following:
 * setenv PATH "/usr/bin:/usr/local/bin:/usr/local/sbin:/usr/sbin"
 * After doing so, Sox and GenTrans were able to run successfully.
 * That being said, this is only a temporary fix.
 * Additionally, sox completes normally, but returns a warning:
 * sox WARN sox: Option `-U' is deprecated, use `-e mu-law' instead.
 * The sox used on Rome is a later version than the one provided on Caesar. We should probably change the commands used in genTrans.pl to address this; otherwise, some day it won't work at all!
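The deprecated flag could be updated mechanically; this is only a sketch, assuming genTrans5.pl invokes sox with a bare `-U` flag, which should be verified against the actual script:

```shell
# Replace the deprecated sox -U option with its modern equivalent
# in genTrans5.pl, keeping a .bak backup of the original.
sed -i.bak 's/-U /-e mu-law /g' genTrans5.pl
```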

=Week: 5 (7/8-7/14) =

Goals:

 * Get Rome Integrated & SpEAK running.
 * Run some more experiments on Rome.
 * Run a different 5-hour test train.

SpEAK & Rome:

 * Once Rome was put back onto the same LAN as the other batch machines, the firewall settings on caesar worked properly.
 * With a bit of tweaking to reflect rome's new IP address (192.168.10.11), of course.
 * This is likely because the Virtual interface that rome was on when it was attached via a VPN tunnel was not assigned a firewall zone (inside/outside).
 * Since the LAN where the batch machines reside is in the "Inside" zone, the masquerade rules work properly.
 * Since the VPN tunnel isn't needed, it was removed from Rome and disabled on Caesar.
 * Put up test page on Rome.
 * I had some issues with certificates with SpEAK on Rome; I was hoping I could resolve them from home once it was installed.
 * Unfortunately, the day after installing Rome in the server closet, I lost all access to it.
 * It's strange; I tested for access (and the site) when I got home.
 * I can ping Rome, but there isn't any SSH access. It freezes when I try my account; it accepts the root credentials, but then freezes after login. I don't even get a shell.
 * It's not only Rome; similar things are happening to:
 * Asterix
 * Automatix
 * Idefix
 * verleihnix
 * It's not just connections from Caesar; attempting to SSH into one of the above machines from an accessible batch machine has the same results.
 * Oddly enough, this problem appears to be intermittent. I was able to get into Rome as root later in the day.
 * I still couldn't access Asterix et al. Perhaps they share a switch which is going bad?

=Week: 6 (7/15-7/21) =

Goals:

 * Fix Caesar.
 * Determine why Rome is giving different results than the other experiments.

Results

 * Troubleshot the network on site.
 * Turns out that Caesar had an option for "Protect Firewall from Internal Zone" enabled.
 * Effectively applying a very strict firewall policy for the internal interface.
 * Disabling it allowed for normal network operation.
 * Experiments 0117 and 0116, both of which were completed exclusively on Rome, were clones of experiments 0089 and 0090 respectively.
 * They were designed to validate that there were no significant changes in word error rate when using the existing Sphinx implementation on a Fedora server.
 * Unfortunately, given the same inputs (dictionaries, transcripts, LM, etc.), there was a significant change in word error rates.


 * The version of sox which exists on caesar for use by the batch machines is out of date and will not run on Rome; the version of sox which runs on Rome is a much newer version.
 * Gentrans5 appears to be incompatible with this newer version; while it did appear to work on Rome, sox was throwing non-fatal errors regarding obsolete syntax.
 * Because of the errors encountered by sox and genTrans, I believe that the issue is not inherently Sphinx's problem, but rather a problem with the input wave files generated by sox.

 * By running the following command to compare the file hashes of a single wavefile between experiments 0117 and 0116 and the gentrans5 reference experiment (0090), I was able to confirm the hypothesis that the input files differed for this set of experiments:

Exp/0115> shasum wav/sw4816B-ms98-a-0033.sph ../0114/wav/sw4816B-ms98-a-0033.sph ../0090/wav/sw4816B-ms98-a-0033.sph
577ac157cd8c5a477c3e5b7281190472702e1a7b  wav/sw4816B-ms98-a-0033.sph
577ac157cd8c5a477c3e5b7281190472702e1a7b  ../0114/wav/sw4816B-ms98-a-0033.sph
7c7acd0253db5f1134d366d7540aa2007da56b27  ../0090/wav/sw4816B-ms98-a-0033.sph

=Week: 7 (7/22-7/31) =

Goals:

 * Confirm that Sox is the cause of differing experiment results on Rome.
 * Determine what would be the best way to resolve this.
 * Look into the options within sphinx_train.cfg and what they do.
 * Run some experiments to improve the WER using the info above.

Results

 * Sox:
 * Started a new experiment, 0118, to replicate experiment 0090.
 * Except instead of running GenTrans, I copied over the transcript, fileid list, and wavefiles.
 * After running the feats creation script, I compared a wavefile with those from the previous experiment on Rome and from the reference experiment (0090):

[ejg58@rome 0118]$ shasum wav/sw4816B-ms98-a-0033.sph ../0114/wav/sw4816B-ms98-a-0033.sph ../0090/wav/sw4816B-ms98-a-0033.sph
7c7acd0253db5f1134d366d7540aa2007da56b27  wav/sw4816B-ms98-a-0033.sph
577ac157cd8c5a477c3e5b7281190472702e1a7b  ../0114/wav/sw4816B-ms98-a-0033.sph
7c7acd0253db5f1134d366d7540aa2007da56b27  ../0090/wav/sw4816B-ms98-a-0033.sph


 * As we can see above, the file hashes for experiments 0118 and 0090 are identical. This shows that the issue is in fact genTrans, and not the feats creation script.
 * When starting the decode script, I encountered another issue:
 * Previously, I was able to resolve this issue on Rome by setting $LD_LIBRARY_PATH to /usr/local/lib.
 * I have a better solution now:
 * Create a file at /etc/ld.so.conf.d/sphinx.conf.
 * Add /usr/local/lib to the file.
 * Execute ldconfig to reload the shared libraries.
 * After the above steps, the decoder started.
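The steps above amount to the following (run as root; the library path comes from the earlier $LD_LIBRARY_PATH workaround):

```shell
# Register /usr/local/lib with the dynamic linker permanently,
# instead of exporting LD_LIBRARY_PATH in every shell.
echo "/usr/local/lib" > /etc/ld.so.conf.d/sphinx.conf
ldconfig   # rebuild the shared-library cache
```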


 * Sphinx Config file:
 * Started a new page in the wiki.
 * See [Speech:Sphinx_train.cfg|Sphinx_train.cfg] section for more info.
 * Spotted a few promising options.
 * Turns out senones are closely related to an option called densities.
 * Set using
 * See [model type and model parameters|here] for more information.
 * Started experiments 0119 through 0124 to test out this knowledge.
 * 0121 and 0122 use a senone value of 1000 and a density of 16, which is double the normal (8).
 * 0123 and 0124 use a senone value of 4000 and a density of 16.
 * Tried a few experiments using Automatic Gain control.
 * This more or less adjusts the volume to be consistent.
 * It is noted that telephone recording equipment usually applies AGC during recording.
 * Experiments using this option within sphinx (Exps 0119 and 0120) showed absolutely no changes in WER when compared to the reference experiments (0089 & 0090)
 * As Switchboard is based on phone recordings, it is likely that AGC was applied during its creation.
 * Using it within Sphinx has no effect other than eating up resources.
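A sketch of how these two options could be set, whether by train_02.pl or a quick sed; the variable names $CFG_N_TIED_STATES and $CFG_FINAL_NUM_DENSITIES come from SphinxTrain, but the config file path is an assumption:

```shell
# Set senones to 4000 and densities to 16 (the 0123/0124 settings)
# in a sphinx_train.cfg; adjust the path to the experiment's etc dir.
sed -i \
  -e 's/^\$CFG_N_TIED_STATES = .*/$CFG_N_TIED_STATES = 4000;/' \
  -e 's/^\$CFG_FINAL_NUM_DENSITIES = .*/$CFG_FINAL_NUM_DENSITIES = 16;/' \
  etc/sphinx_train.cfg
```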


 * Misc
 * I've been searching for a good version control system to use with our ever-expanding set of scripts. I believe that I have found an answer: Git.
 * Git is a distributed Version control system.
 * It is designed for open-source projects with an undefined number of individuals working on a single project.
 * Instead of using a centralized repository like SVN, users "checkout" a copy of the master branch from a central source to a local repository.
 * They work locally, with a local list of changes being kept.
 * They can then "Push" their revisions to the master branch.
 * Git will work for us out of the box: we don't need to set up a server.
 * Just a central spot to hold the master branch.
 * Developers will retrieve a copy and work in a local repository located within their home directories.
 * They will test the scripts within the home directory; when a script is stable, it can be pushed to the master repository for use.
 * I have tested Git out by improving one of the new scripts I've made this semester:
 * clone_exps.pl will now also copy over a transcript, file list, and wavefiles from an original experiment to a specified target experiment.
 * Meaning that one can feasibly "Clone" an experiment with little effort!
 * Please note that it will NOT adapt the target experiment's sphinx_train.cfg file OR create feats.
 * This is intentional: there is no need to have this script do those tasks. Use the train_02.pl and make_feats.pl scripts, respectively.
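The workflow described above boils down to a few commands; the repository path /mnt/main/scripts/user.git matches the later setup notes, while the clone destination is an assumption:

```shell
# One-time: checkout a working copy into the home directory.
git clone /mnt/main/scripts/user.git ~/scripts
cd ~/scripts

# Day-to-day: edit and test locally, then publish when stable.
git add clone_exps.pl
git commit -m "Improve clone_exps.pl"
git push origin master
```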

=Week: 8 (8/1-8/7) =

Goals:

 * Run a full 308 hour train with existing and updated processes.
 * Support other team members as needed.
 * Document how to use Git for future usage.
 * Begin to update how to run Experiments using the new scripts and processes.

Results
cp: cannot stat `/mnt/main/corpus/dist/Switchboard/flat/sw02289.sph': No such file or directory
cp: cannot stat `/mnt/main/corpus/dist/Switchboard/flat/sw04361.sph': No such file or directory
cp: cannot stat `/mnt/main/corpus/dist/Switchboard/flat/sw04379.sph': No such file or directory

I won't be able to use that corpus until I get those audio files!
 * Run a full 308 hour train with existing and updated processes.
 * Since I am using a new corpus, I wanted to validate that everything was in place in preparation for the experiment.
 * Turns out there are two different sub-corpora which utilize the full Switchboard corpus:
 * 308-hour and full
 * The differences between the two are minimal, but the latter must be slightly longer, as it has a larger transcript.
 * The audio files for either transcript are not populated.
 * Using copySph.pl on the full corpus failed as there were missing audio files within /mnt/main/corpus/dist/Switchboard/flat.
 * Through investigation, I determined that Disks 4 and 8 were not included within the wav files!
 * This was probably due to how it was stored. The entire contents of the directories representing those disks had capitalized names; as Linux is case-sensitive, someone probably used a script which assumed that everything was lowercase, and it thus failed.
 * More concerning, I have 3 transcripts for which there are no wavefiles!
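A possible repair for the capitalization problem is sketched below; the disk directory name and destination path are assumptions based on the notes above:

```shell
# Copy the capitalized disk contents into the flat directory with
# lowercased filenames so the case-sensitive scripts can find them.
for f in /mnt/main/corpus/dist/Switchboard/disk4/*.SPH; do
  base=$(basename "$f")
  cp "$f" "/mnt/main/corpus/dist/Switchboard/flat/$(echo "$base" | tr 'A-Z' 'a-z')"
done
```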


 * In place of those experiments, I created a new 10 hour corpus, 10hr2, and ran 4 experiments on that.
 * 10hr2 was created by concatenating first_5hr with last_5hr.
 * Experiments 0127-0130 contain two sets of experiments. A training set using 10hr2 and an associated decoding step which uses the last_5hr test corpus.
 * As the 10hr2 corpus was based partly on last_5hr, last_5hr/test is appropriate for a test-on-train evaluation.
 * Results were slightly worse than models created with last_5hr.
 * However, it is anticipated that these models are superior at decoding new data, as they were trained on more data.
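Concatenating the two 5-hour corpora could look like the sketch below; the file names and directory layout are assumptions, not the actual corpus paths:

```shell
# Build the 10hr2 corpus by concatenating the transcript and fileid
# lists of first_5hr and last_5hr (order must match between the two).
cat first_5hr/train.trans   last_5hr/train.trans   > 10hr2/train.trans
cat first_5hr/train.fileids last_5hr/train.fileids > 10hr2/train.fileids
```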


 * Document Git:

I've completed implementing git on Caesar.


 * There are two master repositories:
 * A 'bare' repository at /mnt/main/scripts/user.git, which users will pull from and push to.
 * Due to Git's implementation, the executable scripts cannot be accessed directly within this directory.
 * An 'executable' repository at /mnt/main/scripts/user, which users are discouraged from cloning, but which contains an always up-to-date and easily accessible copy of the executables.
 * When a push is made to the main 'bare' repository, a script is triggered which pulls the updates into the executable repository.

Information on how it's implemented, along with instructions, can be found here: Speech:Git

=Week: 9 (8/7-8/14) =

Goals:

 * Work on running the full 308 hour train with existing and updated processes.
 * Support other team members as needed.
 * Finish Paperwork

Results

 * Run a full 308 hour train:
 * The files above are not recoverable due to an error on the original disks holding the data.
 * To get past this, we have to remove the references to the missing audio.
 * This is quite painful for me, but necessary if we are to get the train to run.
 * Created a new sub-corpus within full.
 * train2
 * Removes all references to the missing audio files: sw02289, sw04361, and sw04379.
 * Started a new experiment: 0134.
 * Started genTrans5.pl. It took 12+ hours for a full 308 train.
 * The add.txt file is 4662 words long....
 * This will take quite a bit. I may have to


 * Support other team members as needed:
 * Assisted Tommy with Torque
 * Turns out Torque does not like having two hosts with the same hostname but different addresses.
 * Removing these redundant entries worked like a charm.
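A quick way to spot such redundant entries, as a sketch; it assumes the duplicates live in /etc/hosts, which is where Torque node names commonly resolve:

```shell
# Print any hostname that appears on more than one line of /etc/hosts;
# these are the kind of redundant entries that confused Torque.
awk '{for (i = 2; i <= NF; i++) print $i}' /etc/hosts | sort | uniq -d
```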