Speech:Spring 2012 Proposal


Introduction

This proposal will outline tasks needed to build a system capable of utilizing a full set of robust Switchboard models. COMMENT: I'm starting you off, add more to the introduction...

Infrastructure

COMMENT: All work on Infrastructure needs to be complete by March 27th. (Note that you lose a week for Spring Break.)

Capture

Hardware Configuration

Assigned to: Aaron Green

This portion of the class proposal outlines the hardware makeup of the Speech Tools system and my individual contributions to the team “Speech” project in an effort to move the system closer to its full functionality. The hardware makeup is a key component of this project, and properly identifying the major components is one of the main parts of this proposal. The hardware proposal is divided into three categories, which cover the hardware makeup, specific system performance aspects, and the overall capabilities of each piece of equipment that makes up the entire hardware footprint. After looking at the key components, research must be done to see whether the components are up to date and functional for the task. Possible upgrade requirements will need to be properly identified and weighed against cost, labor hours, and school network restrictions before an upgrade can be completed.

The first phase of the hardware proposal is to properly identify each piece of equipment in an informative table for future users. The table will include each component's speed, memory, disc space, CPU information, and estimated value in our hardware configuration. In order to properly map each component's capabilities, I will need to use Linux command-line server commands, similar to the resource links below, to run queries about the hardware. This will take some time to investigate, but via the command line I should be able to extract all the data with a few Linux commands. Once all the data is compiled it will be inserted into a functional and usable table.
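
As a starting point, commands along these lines should pull most of the data needed for the table (a sketch only; output formats vary by distribution, and dmidecode needs root):

  # CPU model and clock speed
  grep -E 'model name|cpu MHz' /proc/cpuinfo | sort -u
  # total and available memory
  free -m
  # disc space per mounted filesystem
  df -h
  # installed memory modules and maximum supported memory (requires root)
  dmidecode -t memory | grep -E 'Size|Maximum Capacity'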

The table will be drafted in a spreadsheet program and then transferred into wiki HTML table tags within our wiki page. This will give the table future link capability and allow system changes to be reflected easily with small edits to the HTML tags. The table will have a header and border and will clearly show all the components listed by name. The name of each component can carry a wiki link for future development. The table will be simple but informative, and will also include additional information about each component that may not be readily apparent to system users.

The additional information about the components will take some investigation and research into hardware makeup, possible upgrade opportunities, and cost estimates for the current components. The makeup will be covered in the first part of the hardware outline, but upgrade possibilities will require online research into our server and system performance limits versus the cost to upgrade to newer components. This can all be done online after the system has been effectively outlined.

I plan to use all available tools to complete this project prior to the project deadline.

Timeline:

  1. (Feb 28 – Mar 5) Identify all the components in the system within a spreadsheet. For each component, identify CPU information, speed, memory, max memory, and disc space. Start on the wiki table to insert data from the spreadsheet into the live wiki site. Extra table blocks will be made for the following weeks' work.
  2. (Mar 6 – Mar 12) Look at each component anatomically. Research estimated cost for current hardware components. Look at possible hardware upgrade opportunities within each component. This information needs to be inserted into the wiki table. This process will continue into next week.
  3. (Mar 13 – Mar 20) Continue: look at each component anatomically. Research estimated cost for current hardware components. Look at possible hardware upgrade opportunities within each component. Also, estimate the labor to upgrade system components and complete a risk-versus-gain assessment of the upgrade information researched. This information needs to be inserted into the wiki table.
  4. (Mar 21 – Mar 26) Finalize the hardware table and write documentation/description for the hardware section.

System Software Configuration

Assigned to: Damir Ibrahimovic (3-4 paragraphs plus bulleted timeline)

  • The correct software configuration on every system is very important. Every piece of hardware in the current machines must have the correct software installed in order to operate properly. It is important to follow up with updates for your release and find out which updates can increase system performance or resolve known bugs or other issues. The openSUSE project announced on November 30th that there will be no more updates for the openSUSE 11.3 Linux distribution starting January 16th, 2012. Support for openSUSE 11.3 will be officially dropped, which means that from that date the openSUSE project will stop "feeding" openSUSE 11.3 with security or critical fixes and software updates. We have a system that runs this version, and knowing that we will be stuck with what we have is a bit scary, but if it works for our project then we are OK!
  • What I need to do is find out what the exact current OS version is. Because we are using Sphinx3 for our project, I need to find out whether this version is fully compatible with the current OS version and a possible new version of the OS. There are also additional packages that come with Sphinx3 that need to be checked. A new release usually brings new features, bug fixes, and other improvements to the OS, but that does not mean that everything will work as it did on the old system. For example, packages intended for openSUSE 11.3 are unlikely to work on other releases. This is due to numerous differences in compilers, libraries, and release contents.
  • Having all software correctly installed and running versions that are supported on the current system is a must. There is a new openSUSE release that supports cloud technologies, better handling of smaller screens and multi-screen setups, better notifications, and centralized online accounts configuration. The current version is working well for now, but we have been talking about backing up the system to a cloud! For this reason, and for other possible benefits, I will make my proposal on system software. Before starting an update there are a few things we should be aware of: the system minimum requirements, whether other currently installed programs will work on the new system, and what needs to be done before the update.
  • The timeline is:
    • On 03/05, access Caesar and the other machines and check the current software configuration using YaST or the command line (sketched below).
    • On 03/12, compare the current software configuration with the new one; create a comparison table for the current, new, and beta versions (if one exists). Write down the benefits of upgrading to the new software version and the reasons not to upgrade.
    • On 03/19, finish up the proposal and post it to the wiki.
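
A rough sketch of the checks planned for 03/05 (the package name in the last command is only a guess, since Sphinx may have been built from source):

  # show the installed openSUSE release and kernel
  cat /etc/SuSE-release
  uname -r
  # list configured repositories and any pending updates
  zypper lr
  zypper lu
  # see whether Sphinx-related packages were installed as RPMs
  rpm -qa | grep -i sphinx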

Speech Software Configuration

Assigned to: Bethany Ross (3-4 paragraphs plus bulleted timeline)


Modify

Speech Software Installation

Matt Vartanian and Jonathan Schultz of the software team have been given the job of documenting and installing CMU's speech recognition software on a server called Majestix. Majestix is one of several networked servers connected to a main server, Caesar. As of 2/24/2012, Caesar is the only server with CMU's speech recognition software installed. The software team plans to locate the files that will be created during the installation of the speech software, perform the installation of the speech software and tools on Majestix, and make soft links to those files on the mounted directory /mnt/main.

The directory /mnt/main is physically located on Caesar. Each server networked to Caesar (including Majestix) has a mounted directory called /mnt/main which points to Caesar's /mnt/main directory. The software team will be installing the speech recognition software in the directory /usr/local/bin under the root of Majestix. Once this has been accomplished, they will then create soft links in Caesar's /mnt/main which point to the speech recognition software on Majestix, making it readily available on all servers.

To complete this task the software team must first create a directory called root/tools under /mnt/main on Caesar. Next, once the installation of the speech software has been completed on Majestix, they will create a soft link in that tools folder that points to Majestix's /usr/local/bin folder. Once this task is complete, every server on Caesar's network will be able to run the tools needed to decode and train with CMU's speech recognition software. If not, installation of the speech tools and Sphinx 3 will have to be performed on each server individually.

To create the soft links, the software team will need to rename the bin folder in /usr/local to binold. Then they will create a soft link named bin in the /usr/local directory and copy the files over to the link's target location. Once this is done they will create another link in the /mnt/main/root/tools directory to complete the chain. Once the links are in place they will delete binold. In theory this creates the install point for the software and makes it possible for the other servers to run the software off the shared /mnt/main directory.
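
A rough sketch of one plausible reading of these steps (untested; the exact link direction should be confirmed on Majestix before anything is deleted):

  # on Majestix: set the original bin directory aside
  mv /usr/local/bin /usr/local/binold
  # put the tools on the shared mount and point /usr/local/bin at them
  mkdir -p /mnt/main/root/tools/bin
  cp -a /usr/local/binold/. /mnt/main/root/tools/bin/
  ln -s /mnt/main/root/tools/bin /usr/local/bin
  # once the link checks out, the old copy can be removed
  rm -rf /usr/local/binold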

The timeline for these tasks is as follows.

  • Jonathan Schultz will create the folders on /mnt/main between February 28th and February 29th
  • Jonathan Schultz will install Sphinx 3 and locate all files on Majestix between March 1 and March 4
  • Matt Vartanian will install the Speech Tools and locate all files on Majestix between March 1 and March 4
  • Jonathan Schultz and Matt Vartanian will decide if the soft link will work or if we need to install everything on each server
  • If the files are in the right place, Jonathan Schultz will create the link on Majestix between March 5 and March 6
  • If the files are in the right place, Matt Vartanian will create the link on /mnt/main between March 5 and March 6
  • If the files are not in /usr/local/bin, Jonathan Schultz will install Sphinx 3 on each server between March 5 and March 19
  • If the files are not in /usr/local/bin, Matt Vartanian will install the Speech Tools on each server between March 5 and March 19

Assigned to: Jonathan Schultz & Matt Vartanian (4-5 paragraphs plus bulleted timeline)

Speech Data Corpora

Assigned to: Brandon McLaughlin & Michael Henenberg (4-5 paragraphs plus bulleted timeline)

Divide into Mini & Full with train, dev & eval sets for each

The speech data corpora will consist of all the data transcriptions and the .sph files converted into .wav files. All the disks currently rest in the /media/data/switchboard directory. The disks must be moved to the /mnt/main/corpus/dist directory. In order for the .sph files to be converted into .wav files the SOX command must be used. SOX will take a .sph file and convert it to a .wav file; the command to be used is: sox filename.sph filename.wav. The best way to do this will be to bring a whole disk into Brandon's or Mike's testing folder so no damage is done to the real files.
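
A minimal sketch of the conversion, assuming SOX can read NIST Sphere input directly (if it cannot, the sph2pipe utility may be needed as an intermediate step):

  # run inside a personal testing folder so the originals are untouched
  for f in *.sph; do
      sox "$f" "${f%.sph}.wav"
  done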

The folders needed to hold the files will be created first, with the appropriate files moved into them as they are completed. There is a script that can create the needed directories and files. Michael will attempt to learn to use the script to create the folders. However, if the deadline becomes a problem then he will instead create the folders manually in the directories, so progress can continue.

The second part of the file conversion is cleaning up the transcripts. Right now all the transcriptions still have the headers and the brackets marking where emotions were shown. They also have brackets marking where words were completed when the speaker did not fully say the word. Again, all these files will be moved to Brandon's or Mike's testing folder so that no damage gets done to the real files. The command that can be used is the SED command. This command will run through the text and eliminate whatever is specified in the command, cleaning the transcriptions up to the point where they will be about 95% complete. Writing a script to run this command will be much more efficient, since we will not have to run the command by hand on every file.
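
A sketch of the kind of SED invocation we have in mind; the bracket patterns here are assumptions and will need to be checked against the real transcripts:

  # drop noise/emotion markers such as [laughter] or [noise],
  # then unbracket partial-word completions such as abou[t]
  sed -e 's/\[\(laughter\|noise\|silence\)\]//g' \
      -e 's/\[\([a-z]*\)\]/\1/g' \
      transcript.txt > transcript.clean.txt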

The switchboard directory created in corpus will hold mini and full. These folders will hold the files needed for the mini and full trains and will be divided into train, dev, and eval. The /train folder in both will store the cleaned transcripts and the wave files created in the second part. The dev folders will hold the file samples used to test the train's rate of correct transcription. The eval folders will hold the files used to analyze the results of the train in order to judge the accuracy of the train's transcriptions.

The timeline for this section is as follows:

Michael

  • Create folders between February 28th and March 6th.
  • Copy files in old folders to the new folders between February 28th and March 6th.
  • Make sure everything works properly by March 6th.

Brandon/Mike (once done with his roles)

  • Convert .sph files to .wav files using the SOX command and put the files in the correct folder between February 28th and March 10th.
  • Copy the converted files to the correct folder (/mnt/main/corpus/dist) between March 2nd and March 27th.
  • Clean up the transcripts using the SED command between March 9th and March 24th.
  • Copy the converted transcripts to the Switchboard/(mini,full)/train folders between March 9th and March 24th.


Dates subject to small changes.

Develop

Network Bridge

Assigned to: Skyler Swendsboe & Evan Morse (4-5 paragraphs plus bulleted timeline)

Here is a DRAFT: we will refine, but this is our current work. The main task for our section of the hardware group is to set up a network bridge. This will be accomplished by using Caesar as a DHCP router and DNS server. Caesar will act as a gateway for the other 9 servers so that they will always have a connection to the internet. The way this will work is by using a second NIC card in Caesar for the other servers to connect to.

The setup will consist of two networks: Caesar's connection to the UNH network, and the local network created by Caesar for the other 9 servers. Caesar will have a UNH IP address for its internet connection. The network Caesar creates locally will be a 192.168.X.X network. Each server should also have its own static IP for the sake of ease of use.

The physical setup of this bridge will be simple; Caesar has two NIC cards. One card is connected to the UNH network, the other will connect to a switch. The switch connected to Caesar will also have a wire from each of the servers feeding into it. This is how all 9 servers can connect simultaneously to Caesar.

The purpose of this setup is to allow each machine to maintain a connection to the internet for ease of updating/staying current with the software used. This will allow all groups in the project to get whatever packages their machines need with ease. This will also allow the servers to perform any other type of internet related task if necessary.

The process of creating a “DHCP/DNS router” has been done by both members of this group on Windows machines. The main obstacle for this task is learning how to do this on an openSUSE machine through the terminal. This will be tested on two openSUSE test machines before a full deployment onto Caesar.

There is not much required to test the network bridge plan we have laid out in this proposal. The only things needed are two openSUSE systems: one with two NIC cards and one with a single NIC card. Getting this test system working will translate directly onto Caesar. The only difference for the real setup is that the extra NIC plugs into a switch, not directly into another NIC.
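
A rough sketch of what the gateway configuration on the two-NIC test machine might look like (interface names, the address range, and the rcdhcpd service name are assumptions, and the dhcp-server package would need to be installed first):

  # give the second NIC a static address on the internal network
  ifconfig eth1 192.168.1.1 netmask 255.255.255.0 up
  # turn on packet forwarding and NAT traffic out of the UNH-facing NIC (eth0)
  echo 1 > /proc/sys/net/ipv4/ip_forward
  iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
  # minimal /etc/dhcpd.conf entry for the internal subnet:
  #   subnet 192.168.1.0 netmask 255.255.255.0 {
  #     range 192.168.1.10 192.168.1.100;
  #     option routers 192.168.1.1;
  #     option domain-name-servers 192.168.1.1;
  #   }
  rcdhcpd restart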

Timeline: (just a draft)

  • Evan Morse: set up hardware of test systems (2/21/12)
  • Evan Morse: research openSUSE bridging (2/22/12)
  • Sky Swendsboe: set up experimental Linux at home (2/22/12)
  • Sky Swendsboe: research Linux networking environment (TBD)


Given the materials mentioned here, Sky and I are both sure we can have this system up and running just as expected.

The idea presented is to have Caesar act as a DHCP router and a DNS server and have the other 9 servers connect through Caesar's 2nd NIC card, so that the 9 servers can always reach the internet at all times via the 192.168.X.X network through Caesar's IP to the outside world.

Using this method, although the load on Caesar bandwidth-wise will be greater, no new connection to the internet will need to be made: users will not need to physically switch Ethernet lines around or go into the root system and enable/disable the NICs in each machine.

With this configuration, if need be, the servers can update their software automatically by downloading it via the Internet or establishing a connection with a website to obtain an update: making the updating process and Internet connection process smooth and eliminating any extra configuration needs.

We will accomplish this by utilizing DNS/DHCP with Caesar's extra NIC card: one card dedicated to the outside world, while the other serves an inside 192.168.X.X network that routes through Caesar's outside line, with Caesar acting as a dedicated router.

This method of making a machine act as a router has already been proven on a Windows machine by myself and Evan Morse. Our challenge in this procedure is figuring out how to set up Internet sharing with openSUSE/Linux. We will need two Linux-powered machines to test this on: one with two NICs and one with just one NIC, but both will need a copy of Linux installed.

After we prove this can be successfully accomplished on the test machines, we can setup this design on Caesar. The other machines will only need a static address and a default gateway back to Caesar.

Backup Mechanism

Assigned to: Bethany Ross (1 paragraph plus bulleted timeline)

We will find an open-source cloud to store these files for protection. This needs to be a free service that can be accessed from anywhere in case of a failure of the Caesar machine and the physical backups. Once this location is found I will post the information to the wiki so that everyone can access the files if needed.

Experiment Repository

Assigned to: Johnny Mom (3-4 paragraphs plus bulleted timeline)

The Experiment Repository will include the decoder experiment files created by the current script "run_decode.pl". The script runs a Sphinx 3 decode job which creates various files for the specific task name, or experiment ID in this case. The decode experiment script uses the models produced in the training experiment to decode. The script will be edited to execute a decode job that organizes the files into the correct experiment folders based on the ID and the various files created by the job.

The new decoder script will be able to run a decode job depending on the experiment ID (ex: 1003). The script will be run by typing the command "run_decode.pl <taskname>", where <taskname> is the ID, such as 1003. The script will then create the appropriate folders in /mnt/main/exp/1003. The folder 1003 will include folders such as trains, wavs, config, trans, logs, and more, based on what is created by the decode experiment. This will give users who run an experiment an easier time seeing where certain files are and build structure into the experiment, rather than having random files strewn into one folder.
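
As a sketch of the intended behavior (the folder names here simply follow this proposal and may change once the script is reworked):

  run_decode.pl 1003
  # expected result after the run:
  #   /mnt/main/exp/1003/
  #     config/  logs/  trains/  trans/  wavs/  ...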

The types of files and the structure of the folders will also be explained through documentation on foss.unh.edu. The files in the folders created by the script will be explained in terms of what purpose they serve for Sphinx 3. Although the folders are not created initially by Sphinx, they will be explained in terms of why those files are placed in them. Categorizing the files will help users understand the significance of the folders and files, which in turn will help show how Sphinx works when running a decode job.

The timeline for this section is as follows: Johnny

  • (Feb 25 - Mar 5) Look at current decode script to understand how it currently works and start to document which file and folder needs to be where with a reason.
  • (Mar 6 - Mar 12) Create a new script to automatically create specific experiment folders and place files correctly in those corresponding folders based on what was found.
  • (Mar 13 - Mar 20) Document on FOSS.UNH.EDU on why the decode output created those files and folders based on previous weeks of analysis.
  • (Mar 1 - Mar 26) Finalize and make sure everything works properly.

Timeline may be subject to change.

Speech Modeling

Sample Run

COMMENT: All work for Sample Run excluding Dissemination needs to be complete by March 27th. (Note that you lose a week for Spring Break.)

Data Preparation

Assigned to: Chad Connors (3-4 paragraphs plus bulleted timeline)

Capture what input files are needed for train and decode and how they were created...this includes dictionary creation.

In order for us to set up a train and decode we must first prepare the data that we will be using in the test. To do this we must navigate around Caesar to find the current wav files. They appear to be in the Sphinx train section as files ending with an .sph file name. These are Sphere files that the data group will be converting to wav files to use for our train and decode. It looks as though they will be moving them to another, more streamlined directory, which should be called /mnt/main/corpus/dist.

All of the sound files we have come with transcripts of what is being said. As of right now there are some extra brackets and items in the transcripts that can potentially mess up our train and decode. The speech data corpora group will clean this up, so the situation will have to be monitored while preparing all the information needed for the train and decode. In this task I will work with the information they have found and look through transcripts and wav files to document which items are needed for our train and decode. This might be simple observation and documentation of their work or more detailed items depending on where they are.

Once all wav files and transcripts are accounted for we will need a working dictionary. There are currently dictionaries on Caesar. Further inspection of the current dictionary will be needed to see whether it just needs to be updated or completely redone. The second consideration with dictionaries used in speech recognition software is that they usually require two instances. The first is a language dictionary, which is used for standard words in (in our case) the English language. The second dictionary, called the filler dictionary, is used for non-speech sounds that are mapped to corresponding non-speech or speech-like sound units. There will need to be further investigation into whether we really need both dictionaries for our purposes, but as of now it appears we will need the language dictionary only. Once all these tasks are completed the rest of the team will be able to start the train and decode.
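
For reference, dictionary entries for Sphinx generally take the form shown below; these lines are illustrative only (the words, phone spellings, and filler tokens are assumptions, not taken from our current dictionaries):

  Language dictionary (word followed by its phonetic spelling):
    ABOUT    AH B AW T
    HELLO    HH AH L OW
  Filler dictionary (non-speech events mapped to sound units):
    <s>      SIL
    </s>     SIL
    <sil>    SIL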

Projected Timeline

  • Complete by 3/5: Look into current wav files and transcripts; get a general idea of what the data group will be doing with the files.
  • Complete by 3/12: Finish the log of where the transcripts and wav files will be. Track their name, content and all info associated with them. Begin looking into dictionary creation.
  • Complete by 3/26: Create and finish the dictionary.

Language Modeling

Assigned to: Ted Pulkowski (3-4 paragraphs plus bulleted timeline)

Capture what steps are needed to generate language model.

In order to generate a language model, there are a few requirements and steps that need to be followed. Two Perl scripts are located in the /media/data/trans directory on Caesar. One is called CreateLanguageModelFromText.perl, which is the script that actually generates the language model. The other is called ParseTranscript.perl. I'm not exactly sure what this script does specifically, because although Nick from last year's capstone class mentions in his logs that both scripts are in the same directory, ParseTranscript.perl is missing. I will do everything I can to locate the missing script.

  • Completed by Monday March 5

There are five different files created during the process of generating a language model. The script first takes a text file and generates a word frequency file from it. The word frequency file is then used to make a vocab file, which is in turn used to create an ID 3-gram file. Once that is done, two language models are created, one in ARPA format and one in binary format (a possible command sequence is sketched below). The text file is simply a copy of the transcripts. As of February 27, I am not really sure what the other files are used for. My plan is to read each of the files to learn what each one does and why it's required.

  • Completed by Monday March 19
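
If the script wraps the CMU-Cambridge language model toolkit, which the file sequence above suggests, the underlying steps would look roughly like this (an assumption until the script itself is read; file names are illustrative):

  text2wfreq  < transcripts.txt   > transcripts.wfreq
  wfreq2vocab < transcripts.wfreq > transcripts.vocab
  text2idngram -vocab transcripts.vocab -idngram transcripts.idngram < transcripts.txt
  # build the ARPA-format and binary-format language models
  idngram2lm -vocab_type 0 -idngram transcripts.idngram -vocab transcripts.vocab -arpa transcripts.arpa
  idngram2lm -vocab_type 0 -idngram transcripts.idngram -vocab transcripts.vocab -binary transcripts.binlm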

After learning what data is stored in each file in the /media/data/trans directory, I will attempt to run both the CreateLanguageModelFromText.perl and ParseTranscript.perl scripts. My hope is that I can gain enough of an understanding of how the scripts and files work together to be able to explain everything to the rest of the class. If things progress nicely, I would like to have this done by March 19. I am giving myself more than enough time so I understand it perfectly.

  • Completed by Monday March 26

Building and Verifying Models

Assigned to: Aaron Jarzombek & Brice Rader (4-5 paragraphs plus bulleted timeline)

Capture what steps are needed specifically in training and what it generates that the decoder then uses...verification is a test on train data.

In order to understand how the Sphinx decode system works we must first break it down into its base components. By breaking it down we will be able to analyze each step that the system goes through in order to produce a functioning decode. We are going to have to dig around on Caesar in order to find where the scripts are located. We are aware of the run_decode script at this time, but have to find the others. After finding a script we can run cat on it, which shows the scripts and directories associated with it.

  • Completed by 3/5, if completed early in week we will start on phase two
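
A few commands of the kind we expect to use for this first phase (the script's location is not yet known, so the path below is a placeholder):

  # locate the decode script, then inspect it
  find / -name 'run_decode.pl' 2>/dev/null
  cat /path/to/run_decode.pl
  # list the other scripts and directories it references
  grep -nE '\.(pl|sh)|/mnt/|/usr/' /path/to/run_decode.pl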

With the list of files, directories and scripts that the main script uses, we will be able to track down what each of the individual scripts does. If we can catalogue the information that is called and make a datasheet to reference, it will be much easier for us to understand. The internal commands link the necessary files, directories and scripts by their path names and assign them to static variables. Knowing the path is key to using the information, unless it is all located in the local directory with the script. This is not usually the case, so exploring the main scripts is necessary. Once we find other scripts we must learn how they work in order to figure out whether they are applicable to us or not.

  • Completed by 3/12

We hope to run some of the scripts and see what the output is with the input given in the system already. After we have some sample data, we would like to try and change the input data. This will hopefully give us some different output. By obtaining a different output we can begin to understand how to manipulate the data to get the information we desire from it. This task could be very cumbersome, but would provide essential information if we wish to understand how the decode works.

  • Completed by 3/19

To go along with the data that we will be trying to manipulate and change the output of, we are going to find the .wav files and match the output from the decoder to the sound. The sounds in the .wav files should be the same as the input values. On Caesar, the .wav files are named swxxxxx. If the information comes out differently we will have to discuss training methods with the training portion of our team. Once the training is matched up well we should receive the correct sample output. After this step is complete the system will be in good shape to run a sample train on the data to make sure everything is correct. We will attempt to grab utterances, take the audio corresponding to the utterances, and put them together.

  • Completed by 3/26 (Overall progress completed will have documentation of how the perl scripts work that pertain to the decoder and trainer and documentation of how the trainer works)

Dissemination

COMMENT: Give approx two weeks to have 4 groups, each led by one of Speech Tools members, repeat training & decoding using new infrastructure


Mini Run

A mini run will take a small part of the audio and transcripts, about 1 hour's worth, to run through both training and decoding. A mini Switchboard set will need to be created from the Data Group's transcripts and dictionary. We will also need a language model to run the mini train. We will then move to decoding the audio files and comparing the results to the transcripts created by the Data Group.

Training

To create the mini train set, 1 hour of the transcripts and audio will need to be assembled. We will need a phonetic dictionary and a language model to create the mini train set. A dictionary with phonetic spellings must be created from a mini Switchboard set and a test set of the 1 hour of transcripts that we will use. The language model can be created with the CMU language model toolkit, or we can use Spring 2011's language model if we use the same 1 hour of transcripts and audio that was used to create it. Once we have these items we will be able to run a mini Switchboard training set. If we are able to complete this task we will be prepared to run a full Switchboard training set with the knowledge we learned from our mini train. COMMENT: Expand explanation...

Decoding

To create a mini test set, we will need to develop a set consisting of a small part of the Switchboard Corpus. We will need to break up the audio and transcripts into smaller pieces to test. We will then evaluate the performance by taking the dictionary and trained models, decoding with them, and seeing how close our models come to correctly recognizing the audio.

Full Run

The full train will take all 100 hours of audio and transcripts and run them in parallel on each server in pieces of 10 hours each. We will need to create a bigger, more robust language model. We will need all of the transcripts to be cleaned and a dictionary covering all 100 hours to be created.

Training

To create the full train set, we will need to divide it into 10 subsets consisting of 10 hours each from the Switchboard Corpus. COMMENT: Expand explanation...

Parallelization

COMMENT: How do we parallelize our training runs? Open question...anyone want to tackle this for the second half of semester?

Decoding

To create a full test set, we will need to take a subset of the 100 hours of the Switchboard Corpus for testing. COMMENT: Expand explanation...