Speech:Spring 2012 Old



=Introduction=

This Spring 2012 proposal outlines each group's plan for the semester. The Networking group will attempt to bridge the Unix servers together, giving each access to the Internet. The Software group will attempt to catalog the speech recognition software called Sphinx and create scripts that link its files to a shared drive called /mnt/main. The Data group will attempt to clean and prepare the transcripts of the audio files for use in a mini train and a full train, and will also create a dictionary from the transcribed text. The Data group will also convert the .sph files to .wav files for the running of the mini and full trains.

COMMENTS: (Mike Jonas) Need an introduction on what the point is of this project proposal.

COMMENTS: (Mike Jonas) Also, for all groups, we need definitive time lines with start and stop date estimates and specific individual assignments...see Spring 2011 proposal.

COMMENTS: (Mike Jonas) Still need final two sections covering Mini-train and Full-train and who is going to do what when.

Hardware Group
Sky Swendsboe, Evan Morse, Aaron Green, Damir Ibrahimovic

Caesar Update

Overview
Our proposal is a three-phase approach to update the software currently on the server. The following phases are subject to team input and can be updated at any time.

Hardware & Networking Proposal
One idea presented is to have Caesar act as a DHCP router and DNS server, with the other nine servers connecting through Caesar's second NIC. The nine servers would then always be able to reach the Internet through the internal 192.168.X.X network, routed out through Caesar's external IP.

Using this method, although the bandwidth load on Caesar will be greater, no new connection to the Internet will need to be made: users will not need to physically switch Ethernet lines around or go into the root system and enable/disable the NICs on each machine.

With this configuration, if need be, the servers can update their software automatically by downloading it via the Internet or by establishing a connection with a website to obtain an update, making the update and Internet connection process smooth and eliminating any extra configuration needs.

We will accomplish this by utilizing DNS/DHCP with Caesar's extra NIC: one NIC dedicated to the outside world, while the other serves an inside 192.168.X.X network that routes through Caesar's outside line. Caesar will act as a dedicated router.

This method of making a machine act as a router has already been proven on a Windows machine by myself and Evan Morse. Our challenge in this procedure is figuring out how to configure Internet sharing on openSUSE/Linux. We will need two Linux-powered machines to test this on: one with two NICs and one with a single NIC, both with Linux installed.

After we prove this can be successfully accomplished on the test machines, we can set up this design on Caesar. The other machines will only need a static address and a default gateway back to Caesar.
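As a hedged sketch of the routing setup described above: on Linux this is typically IP forwarding plus NAT. The interface names (eth0/eth1), the 192.168.1.x addressing, and Caesar's internal IP are assumptions for illustration; exact steps may differ on the openSUSE version in use.

```shell
# On Caesar (run as root): forward traffic from the internal NIC (eth1)
# out the external NIC (eth0). Interface names and addresses are placeholders.
sysctl -w net.ipv4.ip_forward=1                       # enable packet forwarding
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE  # NAT internal traffic out eth0
iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT         # allow internal -> external
iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT

# On each of the other nine servers: a static address and a default
# gateway pointing back at Caesar's internal IP (example values).
ip addr add 192.168.1.10/24 dev eth0
ip route add default via 192.168.1.1
```

This keeps the nine servers on the private network while Caesar alone holds the outside line, matching the design above.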

COMMENTS: (Mike Jonas) Guys, we are not updating the systems for the sake of updating. We need justifications as to why the system needs to be updated, what we will be unable to do. (i.e. can we not do speech training and recognition with the current installation base? Are there new tools we need that will slow our work down? Please keep the entire project in mind. I think the bar should be high for us to undertake updates. Backups are important however...

Phase 1
The first thing we have to do is back up the current configuration. Every update may create problems, so we want to preserve known-good data. The backup can be done many different ways, but we will use a simple and effective one: partition cloning, copying the /home partition to another partition (on a second hard drive, either external or internal) as root. The following command will execute this task:

dd if=/dev/sda4 of=/dev/sdb2

Here /dev/sda4 is the partition where /home is found, and /dev/sdb2 is, in this case, an external USB drive. Hint: to list all hard disks and their partitions (including USB drives) we can use:

fdisk -l

In order to execute these commands you have to be logged in as root or a superuser. Cloning will take some time, but once it finishes the data will be copied to the new hard drive. The other way is a partition image, copying the /home partition to a file:

dd if=/dev/sda4 of=/yourFilename.dd

This copies the data to the file name we chose for the backup. To restore the partition from the file, we type:

dd if=/yourFilename.dd of=/dev/sda4

Phase 2
After the backup is done we will update the current system from openSUSE version 11 to version 12 (this update is for openSUSE only). The update will be done using the following commands:

zypper refresh
zypper update

The process is automatic and may or may not need user input.

Phase 3
After the system has been updated, testing will need to be performed to ensure no functionality has been lost. The test phase will include running all the machines against the server to check for operational malfunctions. Documentation in all three phases is key and will help future classes perform speedy updates to the system.

Software Group
The software group consists of Jonathan Schultz, Bethany Ross, and Matt Vartanian.

Overview
The software group has taken on the challenge of documenting the Sphinx files, creating a hierarchy for them on the mount drive, and writing scripts that make directories and link the Sphinx files to /mnt/main, which will make training and decoding easier for our fellow classmates and future Capstone student projects. The new location will also keep these files safe if something should happen to the Unix system.

Software Proposal
We plan to document the Unix system to find and catalog the Sphinx files on the Caesar machine. We will also find an open-source cloud service to store these files for protection; it needs to be a free service accessible from anywhere in case both the Caesar machine and the physical backups fail. Once we determine which folders and files in /usr/local on Caesar relate to Sphinx, we will create a catalog of those files and determine what each does. We will then write shell scripts to link the Sphinx files to /mnt/main, which is shared by all servers, and document the files' new home. We will need to come up with a naming convention for the folders that makes them easier to find. We believe this hierarchy will help the groups that write scripts for a mini train and full train in the future. We may have to go back to older scripts written for the current file locations and change them, if we determine that the new location will help the project's future progress.
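A minimal sketch of such a linking script is below. The directory names are placeholders (the real Sphinx install location and the /mnt/main layout are still to be determined), so the sketch builds everything inside a temporary sandbox rather than touching real paths:

```shell
#!/bin/sh
# Sketch only: demonstrate linking "Sphinx" files into a shared tree.
# All paths are placeholders created inside a temporary sandbox.
SANDBOX=$(mktemp -d)
SRC="$SANDBOX/usr/local/sphinx3/bin"   # stands in for the Sphinx install on Caesar
DEST="$SANDBOX/mnt/main/sphinx3/bin"   # stands in for the shared NFS location

mkdir -p "$SRC" "$DEST"
touch "$SRC/sphinx3_decode"            # hypothetical example file name

# Link every file from the install tree into the shared tree.
for f in "$SRC"/*; do
    ln -s "$f" "$DEST/$(basename "$f")"
done

ls -l "$DEST"                          # the shared tree now holds symlinks
```

Because links rather than copies live under the shared tree, the bulk of the code stays in one place and the links are cheap to rebuild if the layout changes.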

COMMENTS: (Mike Jonas) Think about using our NFS mount to reduce repetition. Since we have /mnt/main shared by everyone, why not find a place to copy all of the shared sources on the NFS mount and then deal with using Unix links to link executables/binaries from their needed /usr/local/bin or /usr/bin directories into the shared /mnt/main/.../bin directories. Links are pretty easy to maintain and the bulk of the code would live on the NFS mount share. (P.S. I cleaned up /root/... on caesar to only show things we need).

Phase 1
In Phase 1 of our software proposal we will determine the tools we will use to complete our part of the overall proposal. We will need to choose the scripting language for the script that automates linking the files to /mnt/main and making the new folders for the Sphinx file hierarchy. We will need to find an open-source way of securing our scripts and the Sphinx files at an offsite location. We will also need to collect the name of each file on Caesar that relates to Sphinx and determine what these files do.


 * Jonathan Schultz will determine the scripting language to link the Sphinx files to /mnt/main by February 27th.
 * Jonathan Schultz will determine how to catalog the Sphinx files by February 27th.
 * Jonathan Schultz will find out what each sphinx file does for cataloging by February 27th.

Phase 2
In Phase 2 of our Software Proposal, we will write the scripts in the language that best fits the needs of our project. We will determine what each Sphinx file does. We will come up with a naming convention that we and our fellow classmates believe will help us find the Sphinx files that we will use to create a mini train and a full train. We'll also design the hierarchy of the new location on the mount drive.


 * Jonathan Schultz will write script with scripting language by March 12th.
 * Jonathan Schultz will create catalog for the files of Sphinx by March 14th.

Phase 3
In the third and final phase of our software proposal, we will run the scripts we have created to make the folders and directories on /mnt/main, link the Sphinx files on Caesar into them, and copy the files to an offsite location.


 * Jonathan Schultz will test script on Virtual Machine by March 17th.
 * Jonathan Schultz will run script to link files to /mnt/main on March 19th.

Data Group
This group consists of Brandon McLaughlin, Johnny Mom, and Michael Henenberg.

Overview
The purpose of the data group is to take the transcripts and convert them into the format Sphinx accepts so that it can correctly recognize speech. The sphere files (.sph) need to be converted to usable wave files (.wav), which Sphinx can use.

Data Proposal
Our group intends to clean and prepare the transcripts of the conversations in the database that we have. We will first write test scripts in Perl to see how they affect a small set of transcriptions (100 lines); once we have worked out the problems, we will put them into an official script that automates the formatting of the transcripts on a mass scale. We hope this process will clean the files 100 percent, although there are always a few stray characters in some that might stick around. We will use the Perl programming language to write a full script that pulls the transcript files out of the directory and cleans them one at a time. The conversion of SPHERE (SPeech HEader REsources) files will utilize SoX, a command-line program in UNIX that can convert audio files to other audio formats. With it we will convert the SPHERE (.sph) files to WAVE (.wav) files, which Sphinx can then use to train and decode the audio while a dictionary is produced in the process.

COMMENTS: (Mike Jonas) Check out /root/SCRIPTS/genTrans.pl as it does most of this already. It also ties the generating of wave files into the process.

Phase 1

 * Start: Feb 21, 2012
 * End: May 15, 2012

First, we will utilize the Perl programming language to write some test scripts that do the work of cleaning the transcripts. We will try automating the process of converting the transcripts to a usable form accepted by Sphinx. We will first test the script by moving a few lines at a time to our home directory for the script to be executed on. The automated process will grab the specific transcript files and copy them to a folder labeled "Clean_Transcripts", where the resulting files should be more than 95 percent clean.
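As a hedged illustration of the kind of cleaning involved (the group plans to write the real scripts in Perl; the markers shown here, such as [laughter] and {um}, are assumed examples of transcript noise):

```shell
#!/bin/sh
# Sketch: strip bracketed noise annotations from one sample transcript
# line and collapse the leftover spaces. Real transcripts may use
# different markers, so the patterns below are illustrative only.
line="so [laughter] i think {um} that is fine"
# Remove [..] markers, then {..} markers, then squeeze repeated spaces.
cleaned=$(printf '%s\n' "$line" | sed -e 's/\[[^]]*\]//g' -e 's/{[^}]*}//g' -e 's/  */ /g')
echo "$cleaned"   # should print: so i think that is fine
```

The same substitutions translate directly into Perl `s///g` expressions when the official script is written.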

Phase 2

 * Start: Feb 21, 2012
 * End: May 15, 2012

The second phase will be converting the files to the format that Sphinx understands and uses. The format we need to convert from is SPHERE (.sph), and the format Sphinx accepts is WAVE (.wav). We will utilize the SoX command on openSUSE, which is already installed. Lastly, we need to find a way to automate this process rather than converting each SPHERE file one by one.
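A sketch of how the batch conversion could be automated is below. The file names are made up, and the loop only prints each SoX command as a dry run; dropping the echo (or piping the output to sh) would perform the real conversions.

```shell
#!/bin/sh
# Sketch: convert every .sph file in a directory to .wav with SoX.
# Uses a temporary sandbox with empty placeholder files for illustration,
# and prints the commands instead of running them.
DIR=$(mktemp -d)
touch "$DIR/sw02001.sph" "$DIR/sw02002.sph"   # hypothetical file names

for f in "$DIR"/*.sph; do
    wav="${f%.sph}.wav"        # swap the .sph suffix for .wav
    echo "sox $f $wav"         # dry run; remove echo to convert for real
done
```

The `${f%.sph}.wav` substitution is the key piece: it derives each output name automatically so no file has to be converted by hand.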

Phase 3

 * Start: Whenever the mini/full train is ready for testing.
 * End: May 15, 2012

In the third and final phase we will coordinate with the Speech Tools group so they can use our cleaned transcripts and converted SPHERE files for testing on the mini/full train.

Speech Tools Group
Aaron Jarzombek, Brice Rader, Chad Connors, and Ted Pulkowski

Overview
The speech tools group has taken on the challenge of getting Sphinx 3 up and running on local machines to run the decoder (Aaron Jarzombek and Brice Rader) and the trainer (Chad Connors and Ted Pulkowski). We will document how we get everything working so we can share it with the rest of the groups. We will have all of our major tasks completed by April 3rd so the other groups have enough time to be taught the tools and to use them themselves.

Mini Train
The Mini Train will use a small portion of the audio and transcripts, about one hour's worth. A mini Switchboard set will need to be created from the Data group's cleaned transcripts and dictionary. We will also need a language model to run the mini train. We will then move to decoding the audio files and comparing the output to the transcripts created by the Data group.


 * Jonathan Schultz will attempt to make a language model from the Data group's transcripts once the transcripts have been cleaned up and the dictionary created. Date to be determined.

Create Mini Train Set
To create the Mini Train Set, a one-hour subset of the transcripts and audio will need to be created.

Create Mini Test Set
To create a Mini Test Set, we will need to develop a set consisting of a small part of the Switchboard Corpus. We will need to break the one hour of audio and transcripts into smaller pieces to test the parallel running of the servers.
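One hedged way to break the set into pieces for parallel runs is to split the list of utterance IDs, one chunk per server. The file name mini.fileids, the made-up IDs, and the count of nine servers are assumptions here:

```shell
#!/bin/sh
# Sketch: split a list of utterance IDs into chunks, one per server.
DIR=$(mktemp -d)

# Build a placeholder list of 90 utterance IDs (names are made up).
i=1
while [ $i -le 90 ]; do
    echo "sw0$i" >> "$DIR/mini.fileids"
    i=$((i + 1))
done

# Split into 9 roughly equal pieces (part_aa .. part_ai) without
# breaking lines; each server would process one piece.
split -n l/9 "$DIR/mini.fileids" "$DIR/part_"

ls "$DIR" | grep -c '^part_'   # count the pieces produced
```

Each server then trains or decodes against its own part_ file, and the results are merged afterwards.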

We will need to evaluate the performance of the servers running in parallel.

Run Mini Switchboard
We will then take the dictionary, train on the audio in parallel, and then decode to see how accurately our models and dictionary recognize the audio.

Full Train
The Full Train will take all 100 hours of audio and transcripts and run them in parallel, in pieces, across the servers. We will need to create a bigger, more robust language model, and we will need all of the transcripts cleaned and a dictionary covering all 100 hours.


 * Jonathan Schultz will update the Mini Train language model if needed. Date to be determined as we come closer to this stage of the experiment.

Create Full Train Set
To create the Full Train Set, we will need to develop a set consisting of 10-hour pieces from the Switchboard Corpus.

Create Full Test Set
To create a Full Test Set, we will need to take the full 100 hours that was broken up and run the pieces in parallel on the servers to save processing time.

We need to evaluate whether the servers running in parallel can handle the full audio.

Run Full Switchboard
We will take our updated dictionary covering all 100 hours, train on the audio in parallel, and then decode to see how accurately our models and dictionary recognize the full audio. We will then need to make adjustments to raise the recognition rate of the experiment.