Speech:Spring 2014 John Kelley Log



Week Ending February 4th, 2014

 * Task:

Read past logs and research the experiments run and data gathered in previous semesters


 * Results:

I have a basic understanding of the requirements and past experiments, along with the data gathered


 * Plan:

Continue to work on proposal and understand requirements
 * Concerns:

Communicating effectively with the group

Week Ending February 11, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending February 18, 2014

 * 2/15/2014: Read logs and brainstormed proposal ideas
 * 2/16/2014: Read logs
 * 2/17/2014: Read logs
 * 2/18/2014: Completed the proposal and submitted it to the Proposal group page
 * Task:

My first task for this week is to work on our proposal. After reading Josh's proposal group page, I believe his idea would work best for creating a streamlined narrative, something that Professor Jonas is looking for. Uniform entries would drastically help the flow of the document. Reading past proposals, you can clearly see that there was no real continuity in layout or even discussion. It seemed to me that every semester was unique in the way it approached the proposal. I believe our proposal will present a lot of what the professor is looking for. I fully plan to take advantage of Josh's idea and submit our proposal to the proposal group's page in the same format he has listed.

My second task was to continue working on training. I'm still reading the how-to guide on running trains and tests on trains.

My third task was to familiarize myself more with Unix. I still find myself struggling with basic commands that others seem to have no problem remembering. This is my first time actually working with Unix, so remembering basic commands is proving to be my only setback. I have been reading basic Unix tutorials in various corners of the internet.
 * Results:

I have successfully put our assignments and thoughts into text with our proposal. I feel as though the Data Group will accomplish a lot this semester, and make things much easier for future Data Groups. Coming into this blind certainly wasn't fun. I constantly found myself stumped when people asked me questions that I felt I should know: What format is the audio in? Where is it located? Are the transcripts complete? Do they contain odd characters? After doing some research, I finally feel more confident in my abilities and in my understanding of our group's tasks and responsibilities.
 * Plan:

I plan to complete a couple of basic Perl tutorials to get an understanding of the language. I have little programming experience, and what I do have isn't very strong. Programming has never been a strong suit of mine, but I'm willing to learn Perl, not only for this project but because it will be helpful in future endeavors. I also plan to run an instance of the genTrans6 script myself. I found this website when googling basic Perl tutorials: http://www.perl.com/pub/2000/10/begperl1.html. After following it for a while, I feel like I have the most basic understanding of the language. Thankfully, it doesn't seem nearly as difficult as some other languages. For some reason, Perl reminds me of PHP. Maybe this is because they are similar, but I wouldn't know for sure, because my knowledge of PHP is essentially non-existent.
 * Concerns:

Learning Perl will be difficult. I don't expect to understand it much by the time the week is over. It will take me a couple of weeks to feel comfortable using it. As I explained earlier, I have little to no programming experience.

Week Ending February 25, 2014

 * 2/23/2014: Read logs
 * 2/24/2014: In the process of updating the wiki
 * Task:

My task was to create a page on MediaWiki that would be useful to us as the Data Group, and to future Data Groups. My plan is for it to contain the most up-to-date information on, and the locations of, the things the Data Group is responsible for, such as transcripts and audio. The URL for this page is http://foss.unh.edu/projects/index.php/Speech:DataInfo. I will continue to update it periodically over the week.
 * Results:

I have added the locations of the various transcript files on Cesar to the wiki page. I also added a link to Matt Henniger's transcript Excel spreadsheet, which is used to calculate the total time of the transcripts.


 * Plan:


 * Concerns:

Week Ending March 4, 2014

 * 3/1/2014: Read logs
 * 3/2/2014: Read logs
 * 3/3/2014: Read logs and worked on Unix commands for cleaning transcripts
 * 3/4/2014: Added additional Unix commands containing more advanced regular expressions
 * Task:

Learn Unix better to help the Modeling Group clean up the transcripts.
 * Results:

After using basic commands and asking my peers for assistance, I performed the following commands to try to replicate what Professor Jonas did in class. My Unix knowledge is still very basic, and my regular expressions were very weak. I first changed to the transcript file location:

cd /mnt/main/corpus/dist/Switchboard/transcripts/ICSI_Transcriptions/trans/icsi/

Then I performed an ls to see all the files listed. I couldn't remember the name of the main transcript file, but saw it was called ms98_icsi_word.text. Then I used cat to page through the text in the PuTTY shell window:

cat ms98_icsi_word.text | less

I wanted to use the vim text editor to do a quick find on the brackets, so I typed:

vim ms98_icsi_word.text

At this point I realized I was definitely behind with my Unix knowledge, because I was stuck. With help from a coworker I used uniq to see unique lines:

cat ms98_icsi_word.text | uniq

Then I grepped for the left bracket:

cat ms98_icsi_word.text | grep '\['

I realized this wasn't nearly enough, especially considering the complexity of Professor Jonas' regular expression, so I looked at grep's help:

grep --help | more

I saw the -w option and thought it would be useful for showing only the lines where the bracket is a whole word, and it was:

cat ms98_icsi_word.text | grep -w '\['

My coworker also told me about another built-in text editor, nano, that doesn't require as much command knowledge. I opened the transcript in nano to try to make the process easier, but this is where I became stuck:

nano ms98_icsi_word.text

This is where I'm at right now, and I want to continue working on regular expressions and Unix knowledge to get these transcripts cleaned up a bit.
 * Plan:

I plan to continue collaborating not only with my coworker who is knowledgeable in Unix, but also with the Modeling Group, especially Colby and David, who seem to have a good amount of Unix and regular-expression knowledge. After doing more research on regular expressions I came across a useful website: http://www.grymoire.com/unix/Regular.html contains a large list of regular-expression tools, including grep, and explains the syntax and the different characters that can be used. For example, the ^ and $ characters anchor a match to the beginning and end of a line. If I want to find lines in our transcript that contain brackets, I could run:

grep '\[.*\]' ms98_icsi_word.text

This matches a left bracket anywhere on the line, followed by any run of characters, followed by a right bracket. An even more advanced grep that only looks for lines containing annotations of the form [text-text] would be:

grep '\[[a-z]*-[a-z]*\]' ms98_icsi_word.text
 * Concerns:

My biggest concern was messing something up because of my lack of Unix knowledge. I didn't want to ruin the transcript file and have everyone hate me. Even more so, I don't know what our backups are like, so I thought that if I really wanted to make changes to the transcript file I could send them to a new text file; that way, if something catastrophic happens, a backup has already been created.

The Unix command I learned to copy a file is as follows:

cp ms98_icsi_word.text ms98_icsi_word_duplicate.text

This command copies the contents of the first transcript file to a second file.
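The bracket-hunting commands above can be combined into a small cleanup sketch. This is not the procedure Professor Jonas used; the annotation shape ([...]) and the sample data are stand-ins, and the real file lives under /mnt/main/corpus/:

```shell
# Stand-in transcript with one bracketed annotation (the real file is
# ms98_icsi_word.text; this sample just demonstrates the idea)
printf 'sw2001A 0.1 0.5 hello\nsw2001A 0.5 0.9 [laughter]\n' > sample.text

# Count lines containing any [...] annotation
grep -c '\[[^]]*\]' sample.text

# Write a cleaned copy instead of editing in place, so the original
# transcript doubles as the backup discussed above
sed 's/\[[^]]*\]//g' sample.text > sample_clean.text
```

Writing to a new file rather than editing in place follows the backup idea from the Concerns section: the source transcript is never modified.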

Week Ending March 18, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending March 25, 2014

 * 3/23/2014: Read logs
 * 3/25/2014: Began running an experiment. The directory for my newly created experiment can be found here: http://foss.unh.edu/projects/index.php/Speech:Exps_0228
 * Task:

I will be following the wiki page on creating an experiment and running a train. I'm hoping the issues I run into will be easily resolved. I would like to complete my first experiment this week, and not have to wait until class to sort out my issues.
 * Results:

I noticed the wiki Experiments page had its last experiment listed as 0227, so I created a 0228 page. This is odd, however, because when I ran the master_run_train.pl script, the experiment directory 0237 was created. This leads me to believe experiments were most likely run but not entered into the wiki page. Output from the script:

Successfully created 0237 experiment directory. Please type '1' to continue:

I then continued to follow the steps recreating the 0161 experiment. I used a density of 8 and a senone value of 3000, and ran it on the first_5hr/train corpus. Once genTrans8.pl was run, I received the following output from the script:

can't open file: No such file or directory at /mnt/main/scripts/user/genTrans8.pl line 29.

I didn't know what this error meant, but the script asked me to enter a 1 to continue, so I did. The rest of the script then ran, with a few errors here and there, most of them relating back to the missing file / directory described above. According to the output of the script, the dictionary and phone files were copied successfully. This is the output I received:

Executing pruneDictionary2.pl: /mnt/main/scripts/train/scripts_pl/pruneDictionary2.pl 0237_train.trans /mnt/main/corpus/dist/custom/switchboard.dic 0237.dic
text2wfreq : Reading text from standard input...
cat: 0237_train.trans : No such file or directory
text2wfreq : Done.
Now inside directory: /mnt/main/Exp/0237/etc
Copying over the filler dictionary ...
cp -i /mnt/main/root/tools/SphinxTrain-1.0/train1/etc/train1.filler 0237.filler
Success!
Copying over the genPhones.csh script ...
cp -i /mnt/main/scripts/user/genPhones.csh .
Success!
Executing genPhones.csh: ./genPhones.csh 0237
Success!
Successfully created the Phones file located in the /etc directory.

Then came step 5, where I needed to navigate to my experiment directory to call the RunAll.pl script. I ran into the same issue that Jared experienced and described in his log. The error I received was as follows:

Configuration (e.g. etc/sphinx_train.cfg) not defined
Compilation failed in require at RunAll.pl line 48.
BEGIN failed--compilation aborted at RunAll.pl line 48.

I realized that I stupidly wasn't in my experiment's directory when I ran the script, which is why it was missing the cfg file. I navigated to my directory and ran the command again. This is the output I received:

MODULE: 00 verify training files
O.S. is case sensitive ("A" != "a"). Phones will be treated as case sensitive.
Phase 1: DICT - Checking to see if the dict and filler dict agrees with the phonelist file.
WARNING: The phonelist (/mnt/main/Exp/0237/etc/0237.phone) does not define the phone SIL (required!)
Found 3 words using 1 phones
WARNING: This phone (SIL) occurs in the dictionary (/mnt/main/Exp/0237/etc/0237.dic), but not in the phonelist (/mnt/main/Exp/0237/etc/0237.phone)
Phase 2: DICT - Checking to make sure there are not duplicate entries in the dictionary
Can not open listoffiles (/mnt/main/Exp/0237/etc/0237_train.fileids) at /mnt/main/Exp/0237/scripts_pl/00.verify/verify_all.pl line 203.
Something failed: (/mnt/main/Exp/0237/scripts_pl/00.verify/verify_all.pl)

I have no idea what the error "Something failed:" is supposed to mean, so at this point I'm at a standstill until I get to class tomorrow (Wednesday) and can discuss my issues with the experiment experts. I wish I had a better understanding of the issue, as I would really like to successfully run a train on my created experiment. On the original tutorial page I noticed it mentioned that the first train will usually fail, and to run it again. It said an html file would be output to the exp directory containing readable information. When I ran lynx 0236.html, however, all I got was what was already in the terminal window, and it wasn't very useful.
 * Plan:

I want to go through the wiki tutorial to run my first experiment (a 0161 clone). I chose 0161 because it was a first_5hr train with a density of 8 and a senone value of 3000, generally basic values. Also, Jared seemed to think this would be a good experiment to start with, so I took after him and chose it as well. I would like to finish all of this Tuesday night when I get home.
 * Concerns:

I realize I'm bound to run into errors.
 * 3/25/2014: I, of course, ran into an issue. The error presented itself when I ran RunAll.pl. I received the output "Something failed: (/mnt/main/Exp/0236/scripts_pl/00.verify/verify_all.pl)"
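In hindsight, a small pre-flight check would have caught the empty etc/ files before RunAll.pl was ever run. This is a sketch, assuming the 0237 layout described above; the exact list of files the setup scripts produce may differ:

```shell
# Hypothetical pre-flight check: confirm the files the setup scripts should
# have produced exist and are non-empty before launching RunAll.pl.
EXP=/mnt/main/Exp/0237
for f in "$EXP/etc/0237.dic" "$EXP/etc/0237.phone" "$EXP/etc/0237_train.fileids"; do
    if [ ! -s "$f" ]; then
        echo "missing or empty: $f"
    fi
done
```

The -s test is true only for files that exist and have nonzero size, which is exactly the condition the failed genTrans8.pl run violated.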

Week Ending April 1, 2014

 * 3/26/2014: Worked on experiments in and out of class
 * 3/27/2014: Attempted to run the decode on experiment 0241 with no success
 * 3/28/2014: Successfully got my decode to run on experiment 0241
 * 3/30/2014: Read others' logs. Added more to my results section

 * Task:

Due to last week's unsuccessful experiment, I'm going to re-create my original failed experiment with Colby Johnson, and then create another on my own.
 * Results:

In class today (3/26/2014) I went over my issues with Colby. We first discovered that my RunAll.pl script failed because SIL was missing from my phonelist. After further investigation, we realized that the master script had failed to run genTrans8.pl. I saw this in the terminal when I ran it yesterday (the 25th) but, due to my naivety, assumed it was OK to continue, considering the script kept going and asked me to enter '1' to continue. Once Colby and I took a look at my experiment directory, we realized the phonelist, and essentially every other created file, was empty because genTrans8.pl had failed. None of my files were populated with information, which is why I received that error when I ran RunAll.pl. He helped me recreate this experiment; we used a density of 8 and a senone value of 5,000, and ran it against the mini corpus subset. I followed along as he successfully trained the data and then ran the decode. Unfortunately, even though this was originally my experiment, the modifications were made under Colby's username, so I was unable to score the data without receiving a permission denied error. I even tried sudo when I ran

sudo sclite -r _train.trans -h hyp.trans -i swb >> scoring.log

However, even after giving the root password, I was still presented with the permission denied error. I'm going to have to leave scoring this experiment up to Colby.
 * Update: Colby ran the score, and the following results were presented:

SYSTEM SUMMARY PERCENTAGES by SPEAKER (hyp.trans)

| SPKR    | # Snt | # Wrd | Corr | Sub | Del  | Ins  | Err  | S.Err |
| Sum/Avg |  549  | 10774 | 79.7 | 8.7 | 11.6 |  4.8 | 25.0 |  85.4 |
| Mean    |  2.9  |  57.0 | 81.5 | 8.2 | 10.3 |  7.8 | 26.3 |  86.2 |
| S.D.    |  1.9  |  44.5 | 16.3 | 8.1 | 13.5 | 14.8 | 19.7 |  25.2 |
| Median  |  3.0  |  47.0 | 86.1 | 6.9 |  5.4 |  3.9 | 21.0 | 100.0 |

Successful Completion

I then ran my own experiment in class today. The experiment is 0241 (http://foss.unh.edu/projects/index.php/Speech:Exps_0241). I successfully trained the data in class, and all that's left is for me to create the language model, then decode and score the data. I am in the process of completing these steps as I write this log.
 * Update: I have successfully created the language model. One of the last lines of output concerned me, however, as I don't know whether it's a real issue or not. The following was presented to me after running % ./lm_create.pl trans_parsed

ERROR: "ngram_model_arpa.c", line 76: No \data\ mark in LM file

So now I'm moving on to the decode process. I remembered that when we ran the decode in class, it was running but nothing was output to the terminal. I forgot this initially, however, assumed the decode had failed because I didn't see anything, and interrupted my terminal. I don't know whether that will cause the decode to fail or not, but I ran it again with a trick I learned from Colby: running it in the background with nohup and &. So I ran

nohup ./run_decode.pl 0241 0241 &

and now I have the process running in the background. I remembered that in class the decode didn't take too long to complete, so I'm hoping I'll see some results soon, and then I can score my test on train. I opened my decode.log, noticed the following error on page 11, and believe my decode has stopped running:

FATAL_ERROR: "mdef.c", line 680: No mdef-file
 * 3/27/2014

I spoke with Colby, and we determined the issue was that I was using the wrong decode script, along with not specifying the senone value as a parameter when running the decode command. I then copied the decode2.pl script to my DECODE directory and tried again, this time with the following command:

nohup run_decode2.pl 0241 0241 3000 &

This specifies that I'm using the decode2 script on my experiment, with my experiment's acoustic model and a senone value of 3000, while running it in the background. It ran a little longer than last time, but still failed. I opened my decode.log in Lynx to see what the issue was. Instead of only 11 pages, I got 17 pages this time, so it ran about a millisecond longer than last time, which is still progress. The error I received was the following:

FATAL_ERROR: "lm_3g_dmp.c", line 462: fread(/mnt/main/Exp/0241/LM/tmp.arpa) failed

This was after it started reading the tmp.arpa file. I'm still at a standstill, waiting for any feedback from Colby. He's been incredibly helpful.
 * 3/28/2014

Today I decided to try to run the decode again. I went to see Colby yesterday, and he was stumped as to why my decode wasn't running. We looked at my language model, and although it had been built, all of the files were empty. We thought this was strange, so we tried to build the language model again. Unfortunately, it failed a second time. Colby wasn't sure why the language model wouldn't build, and thought something must have changed. I suggested trying to build the language model with a different corpus subset than the mini/train we had been using. I left soon after, and I don't know what happened between that time (~2:30pm) and right now (10:15am), but the language model was built successfully. I'm now running the decode, and so far it looks like it's going to finish without error.
 * Update: My decode ran successfully, and I ran SClite to score my experiment. I came out with the following results:

SYSTEM SUMMARY PERCENTAGES by SPEAKER (hyp.trans.uniq)

| SPKR    | # Snt | # Wrd | Corr | Sub | Del | Ins  | Err  | S.Err |
| Sum/Avg |  549  | 10774 | 86.0 | 6.6 | 7.4 |  4.8 | 18.8 |  84.2 |
| Mean    |  2.9  |  57.0 | 87.3 | 6.5 | 6.2 | 10.1 | 22.8 |  86.1 |
| S.D.    |  1.9  |  44.5 | 10.8 | 7.7 | 6.8 | 22.8 | 24.2 |  23.8 |
| Median  |  3.0  |  47.0 | 89.6 | 5.5 | 4.9 |  3.8 | 16.7 | 100.0 |
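For reference, the Err column reported by SClite is the word error rate: (substitutions + deletions + insertions) as a percentage of reference words. A quick arithmetic check against the Sum/Avg row of the hyp.trans.uniq score (small rounding differences are expected):

```shell
# WER for the Sum/Avg row: Sub 6.6 + Del 7.4 + Ins 4.8, all already percentages
awk 'BEGIN { printf "%.1f\n", 6.6 + 7.4 + 4.8 }'   # prints 18.8
```

This matches the 18.8 Err figure in the table, so the three error types really do account for the whole rate.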

I don't know much about interpreting the results of the train, but I assume an 18% WER is good? I'll have to talk to Colby more about understanding what I'm actually doing, and not just spitting out scoring charts. I did, however, run into a few issues when scoring the experiment. The two common issues that occur when running SClite, as described on the wiki page, both happened to me. The two errors I ran into were as follows (note: I didn't capture the actual errors I got, so the IDs here are different):

Error: double reference text for id '(sw2479a-ms98-a-0071)'

Error: Not enough Reference files loaded
Missing: (sw2259a-ms98-a-0021) (sw2295b-ms98-a-0011) (sw2331a-ms98-a-0049) (sw2389b-ms98-a-0096) (sw2428a-ms98-a-0017) (sw2442b-ms98-a-0059) (sw2451b-ms98-a-0044)

I was able to fix the first error easily with the help of my coworker. I think the wiki page should be updated with simple vi usability commands, because without my coworker's help I wouldn't have known how to enter search mode, etc. Anyway, I ran vi 0241_train.trans to open the file, then pressed Shift + : to enter command mode and typed :set ignorecase. Then I searched by entering a forward slash (/). I searched for my ID and found the two instances the wiki said I would; they were basically the same line with a slight difference. So I followed the wiki and removed the duplicate with the :d command, then saved and exited with :wq.

After this, I had to fix the second issue, where not enough reference files were loaded. To fix it I followed the wiki tutorial, and it was much easier than the first part. I copied hyp.trans through uniq, then passed hyp.trans.uniq when I ran the SClite script the second time, and it actually worked!

The last issue I ran into was getting the table into my log and experiment page. I opened the scoring.log file in Lynx, and then remembered that when I highlight something and press Ctrl+C, it closes the Lynx viewer. I found an option called print and did a print-to-screen, which printed the whole table into my terminal window. Then I highlighted it all and copied it into a text document, where I could remove the extra junk and keep just the parts I wanted (the averages). Then I pasted it into my logs above, and onto the experiment page.

Colby said something about the permissions in a directory changing, so I guess building the language model was temporarily broken for everyone, and not just me. I noticed the permissions must have changed, because commands that never used to give me issues now ask for confirmation. For example, running rm to remove my decode.log file in my decode directory asked for confirmation. Also, when I tried to recursively remove all the files from my LM directory to rebuild my language model, I used rm -r, and it said permission denied, so I had to put sudo in front of it and enter the root password. After this I didn't run into any more issues.
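The duplicate-reference fix described above (hunting down IDs by hand in vi) can also be scripted. This is a sketch, not the wiki's procedure; it assumes the utterance ID is the last whitespace-separated field on each line, and the sample data is made up:

```shell
# Stand-in transcript with a duplicated utterance ID (the real file is
# 0241_train.trans; these lines are illustrative only)
printf 'hello world (sw2479a-ms98-a-0071)\nhello word (sw2479a-ms98-a-0071)\nbye now (sw2479a-ms98-a-0072)\n' > ref.trans

# Keep only the first line seen for each ID ($NF = last field)
awk '!seen[$NF]++' ref.trans > ref.trans.uniq
```

The awk idiom prints a line only the first time its last field appears, which removes exactly the "double reference text" duplicates without touching unique utterances.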


 * 3/30/2014: I spoke with Colby about the installation of SciPy on Fedora 19. Initially, I assumed he was asking me whether it could be installed on Fedora 19. I replied that it could, because I believe Fedora 19 comes with Python version 2, and SciPy is compatible with versions 2.6 and 2.7, as well as 3.2 and newer. Once I spoke more with Colby, he told me he was running into an issue in the terminal with missing mirrors when downloading the SciPy package. It sounds like he's connected to the repositories, but they fail when it comes to the download. I saw his logs, and he created a page that shows the error messages he's getting; it's essentially a 404. The mirrors are missing, which is preventing the download of the package. I suggested he might try adding more repositories, and maybe that way he would be successful. I don't know his progress on that, but I figure if he doesn't have it working by class on Wednesday, I can show him how to add more repositories on the Fedora machine.

 * Plan:

I plan to review the scored experiment created with Colby Johnson (0237). I also plan to run my own experiment without the assistance of others. After watching Colby run an experiment, train it, and then create the LM, decode, and score it, I'm confident in my ability to do it without assistance; however, if I do run into issues, I will consult the wiki page and then my bootcamp group members (team one).
 * 3/27/2014: Today I want to decode the experiment I ran yesterday. After reading Colby's email, it seems I need to change the command to use the decode2.pl script and add an additional parameter for my senone value, because it wasn't the default.
 * 3/28/2014: I'm going to try one last time to decode and score my experiment. I'm hoping whatever was wrong with the language model yesterday has been fixed today.
 * 3/30/2014: The plan for today is to read logs and update my results section.
 * Concerns:

I'm concerned about the fatal errors I've been getting consistently when trying to decode my experiment. The first was resolved; however, the second time I ran the decode, with a different script and different parameters, it ran for about a millisecond longer but ended up failing with another fatal error. See the logs above for the specifics of the errors.
 * 3/28/2014: This is my third day trying to successfully decode my experiment, and I'm really hoping I don't run into any more fatal errors.
 * 3/30/2014: I don't know how the progress with installing SciPy on Fedora 19 is going. Colby wanted the data group to help out with this issue, so I've been looking into possible causes, and my thought is that maybe he just needs to add additional repositories.

Week Ending April 8, 2014

 * 4/4/2014: Trying to tackle my task of getting Scipy and Numpy installed on Rome
 * 4/5/2014: Continuing to run my experiment with MLLT parameter... failing
 * 4/6/2014: Last night I successfully trained my data set using the LDA/MLLT parameters
 * 4/8/2014: Starting a new experiment. Essentially a replica of my previous experiment, but with a new language model

The first thing I thought was that the repositories were messed up. So I logged into Rome as root and did a cd to /etc/yum.repos.d. An ls listed all of the default repositories:

fedora.repo fedora-updates.repo fedora-updates-testing.repo

These are the default repositories, so it didn't make any sense to me that they wouldn't work. I tried a yum check-update to see if there were updates, which there were. I then tried yum update, but ran into the same issue: a ton of 404s and "Trying other mirror" messages were being printed to the terminal. A few Google searches led me to believe the repositories were bunk, so I created a temporary directory (backup) and did a mv *.repo ./backup to move the repo files into the backup folder. I then created my own repo file with vi, with the following parameters:

[kernelrepo]
name=Kernelrepo
baseurl=http://mirrors.kernel.org/fedora/releases/19/Everything/x86_64/os/
enabled=1
gpgcheck=0

I then tried the yum update again, and it said it couldn't find a valid base URL. I figured the repository I added wasn't going to work, and was really confused at this point. I moved the files back into their original directory and deleted the backup directory and the repo file I had just created. I then opened the repositories in vi to see if something was wrong with the URLs, which there wasn't. I did notice something peculiar when I ran the yum update. It displayed for only a brief second, but it was enough to catch my eye: it said the peer's certificate could not be resolved. This didn't make any sense, because the certificate had to work; it was dated for 2015. So I ran

URLGRABBER_DEBUG=1 yum check-update

This command essentially displays the process of connecting to the repository: the IP, successful connections, certificates; it walks through pinging the server to show you exactly what's happening. And this is what made my jaw drop. I noticed a timestamp before the requests to ping the server: 2004:08:08. That's right, 2004! I ran a date command, and it responded with the same 2004 date. Rome's date was set to 2004, and that was what was causing the issues. I facepalmed so hard once I realized it. I used a simple date command to change the date to today's date, which worked. I then realized, though, that it reset the time to 12:00 am, so depending on the time of day someone tries to run a yum command, they might encounter the same issue. I need to figure out how to use an ntp command to get the date set automatically via the internet. I then ran the yum update... and it worked! Rome was good to go, and I was finally able to update our packages. The best part was that there were some 1,084 updates to be applied, half a gig's worth. I went ahead and ran the updates before doing anything else. Once that completed, I ran

yum install scipy

The SciPy package includes NumPy as well, which is the module we needed to progress. Rome is running correctly again, and the yum commands work as they should. With SciPy and NumPy now installed, we should be able to make some progress on training. Instead of finding an ntp command to synchronize the date and time via the internet, I decided to just change it manually. Rome's date and timestamp are now accurate, but they will need to be changed manually for daylight savings time. To change the date on Rome, follow these simple steps:

1) Log into Rome as root: su -
2) Type the date command to see the current date and time: date
3) Change the date of Rome following this format: date -s "D M Y H:M:S"
   For example, date -s "04 OCT 2012 16:45:05" would set the date to October 4th, 2012 at 4:45 PM.
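The date string format used in those steps can be dry-run without actually touching the clock, and (as noted above) an NTP client would remove the need for manual changes entirely. The ntpdate line below is an assumption about what's available on this Fedora install, so it's left commented out:

```shell
# Dry-run the date string: parse it and print it back without setting the clock.
# GNU date's -d flag interprets the string instead of applying it.
date -d "04 OCT 2012 16:45:05" +"%Y-%m-%d %H:%M:%S"   # prints 2012-10-04 16:45:05

# To set the clock from the network instead of by hand (run as root;
# assumes the ntpdate package is installed and pool.ntp.org is reachable):
# ntpdate pool.ntp.org
```

Checking the string with -d first avoids accidentally setting the machine to a malformed date, which is exactly the class of problem that broke yum here.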
 * Task:
 * The Avengers tasked me with getting NumPy and SciPy installed on our Fedora machine (Rome). Colby told me about his initial problems: every time he attempted a yum install scipy, he would receive a series of HTTP 404 errors followed by "Trying other mirror" messages, continuing until yum eventually ran out of mirrors to check.
 * My next task was to run an experiment using the LDA/MLLT parameters.
 * My third and final task this week was to re-run my original experiment, but with a newly built language model.
 * Results:

I started a new child experiment in the Avengers' directory (0251). My child experiment is 008. This experiment will run with the CDMLLT RunAll script. With the SciPy and NumPy modules now installed, we should be able to train the data with the MLLT parameter. I have already created and trained the experiment twice. The first time, I didn't realize it was supposed to go into our directory, and created 0252; I deleted my page from the experiments wiki page and removed the 0252 directory so that the Justice League could use 0252 as their directory. The second time, I didn't realize I had to change line 156 of the sphinx_train.cfg file to yes, which turns the MLLT functionality on; it was essentially just running normally without this configuration. I'm going to train my experiment for the third time, and hopefully the third time's the charm.

Phase 2: Flat initialize FATAL_ERROR: "main.c", line 98: Failed to read LDA matrix This step had 1 ERROR messages and 0 WARNING messages. Please check the log file for details. I've done the following things to try and remedy this situation, with no luck. 1) I went to /mnt/main/root/tools/SphinxTrain-1.0/python and ran python setup.py build and then sudo python setup.py install which was on the CMU website - This didn't work, so I tried the following
 * 4/5/2014: I've spent the better half of 7 hours trying to get my experiment to run with the RunAll_CDMLLT.pl script. Every time I do, I get the following output:

2) I then copied the python folder from root/tools/SphinxTrain-1.0 to /mnt/main/Exp/0251//python, and ran the python setup.py and python setup.py install once again - This also didn't work

3) I changed the permission of the python folder with the following command: $ chmod -R 777 python - This still didn't work

4) I opened the log file for the LDA train: $ vim .lda-train.log. At the bottom of the log, it reads "LDA training complete". This means the matrix must have been created successfully.

5) I then ran the LDA training stage and checked whether the matrix file was created in the directory model_parameters/name.lda (it was not). This is confusing, because the LDA log said it was trained successfully, yet the LDA matrix file was missing from my model_parameters directory. So I ran train_lda.pl directly and just got a series of missing-config-file errors.

The only thing I've found so far is to modify the train_lda.pl script and add the following code: print catfile($ST::CFG_BASE_DIR, 'python', 'sphinx', 'lda.py'), $logfile, 0, $ldafile, @bwaccumdirs); However, I'm not about to modify the entire project's train_lda.pl script, in case something blows up. I moved a copy of it to my directory, but it can't run outside the context of where it sits in the scripts/train/scripts_pl directory. I have resorted to posting my question on the Sphinx-Users Google Group. Hopefully someone there can help me out, because I'm burnt out from trying to get this experiment to train correctly. I don't know what else I can do at this point, because unfortunately there aren't thousands of people reporting Sphinx issues online. Searching their issue database led me nowhere. I found one lead on a SourceForge forum, but following the steps listed there has not helped me.
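To make the log-vs-matrix mismatch concrete, here is a minimal reproduction of the symptom in a throwaway directory (the real paths would be the experiment's logdir/ and model_parameters/ directories):

```shell
# The LDA log reports success, yet no .lda matrix file was ever written.
EXP=$(mktemp -d)
mkdir -p "$EXP/logdir" "$EXP/model_parameters"
echo "LDA training complete" > "$EXP/logdir/lda-train.log"   # what step 4 found

grep -q "LDA training complete" "$EXP/logdir/lda-train.log" && echo "log claims success"
ls "$EXP"/model_parameters/*.lda 2>/dev/null || echo "but the LDA matrix is missing"
rm -rf "$EXP"
```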

 * 4/6/2014: I was ready to give up on my experiment out of frustration yesterday, after trying for roughly 10 hours to do everything I could to get my data set to train successfully using the LDA/MLLT parameters. I decided to try and find an expert on CMU's Sphinx, and looked at their website for direction. I noticed on the contacts page that they have a chat channel, #cmusphinx, on webchat.freenode.net. When I joined, I noticed one of the administrators was active and asked him for assistance. After exchanging logs and describing the problem, he knew what the problem was right away. According to him, our senone values have been far too high for the data sets we are training on. Because the senone values were too high, I was getting NaNs in my LDA matrix, which corrupted it, which caused it to be read incorrectly by the MLLT module, which caused my train to fail. I took his suggestion and lowered my senone values based on the number of utterances, which can be found in a log file in the following directory:

/mnt/main/Exp//logdir/46.lda_train

Do an ls in this directory and open .N-1.bw.log and .N-2.bw.log with the following command: $ vim .N-1.bw.log. This will show you how many utterances you have. For example, my data set's log has this excerpt:

utt>  260       sw2109B-ms98-a-0080  430    0    68 37  2 3 1.444418e-12 1.694514e+00 7.286411e+02
utt>  261       sw2110A-ms98-a-0065  961    0   324 33  2 4 3.864975e-12 2.175610e+00 2.090761e+03
utt>  262       sw2110A-ms98-a-0083  474    0   208 35  3 6 5.329107e-12 -7.955380e-01 -3.770850e+02
utt>  263       sw2110A-ms98-a-0086  899    0   400 33  2 4 3.317159e-12 1.881487e+00 1.691456e+03
utt>  264       sw2111A-ms98-a-0004  186    0    76 36  3 6 6.287379e-12 -3.991770e-01 -7.424692e+01
utt>  265       sw2111B-ms98-a-0035  399    0   136 40  2 4 4.862602e-12 2.905137e+00 1.159150e+03

I basically learned that you should tune your senone value according to the number of utterances. Since mine only contained ~1000, I tuned my senones to a more appropriate number, preventing a series of NaNs from being inserted into my LDA matrix. Right now I'm in the process of running the new decode method on my experiment to get the results. The steps have changed and are slightly different, so I'm talking to Colby to find out more about the new decode process. It looks easier, because apparently I don't need to score it afterwards; the score is already completed and stored in a new directory, as described on the Decode wiki page. After speaking with Colby, it seems the new decode process is much different, but easier.
 * Here are the new steps to complete a decode on your trained data set. NOTE: ONLY FOLLOW STEP #6 ON SMALL DATA SETS, SUCH AS MINI/TRAIN.

1) Make sure you have the language model built.
2) Run the following command from the base experiment directory: $ /mnt/main/root/sphinx3/scripts/setup_sphinx3.pl -task
3) cd to /etc
4) Enter the command: $ vim sphinx_decode.cfg
5) Move down to lines 43 and 44 and replace the instance of the word "test" with the word "train". For example, change the line
   $DEC_CFG_LISTOFFILES    = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.fileids";
   to
   $DEC_CFG_LISTOFFILES    = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_train.fileids";
6) Move down to line 51 and change .lm.DMP to tmp.arpa. For example, change the line
   $DEC_CFG_LANGUAGEMODEL = "$DEC_CFG_LANGUAGEMODEL_DIR/.lm.DMP";
   to
   $DEC_CFG_LANGUAGEMODEL = "$DEC_CFG_LANGUAGEMODEL_DIR/tmp.arpa";
7) Save and close the sphinx_decode.cfg file.
8) cd back to the base experiment directory.
9) Run the following command: $ nohup scripts_pl/decode/slave.pl &
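The vim edits in steps 5 and 6 can also be scripted. This sketch applies the same two substitutions to a scratch copy of the relevant lines; to use it for real, run the sed against your own etc/sphinx_decode.cfg after backing it up:

```shell
# Scripted equivalent of steps 5-6, demonstrated on a scratch file.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
$DEC_CFG_LISTOFFILES    = "$DEC_CFG_BASE_DIR/etc/${DEC_CFG_DB_NAME}_test.fileids";
$DEC_CFG_LANGUAGEMODEL = "$DEC_CFG_LANGUAGEMODEL_DIR/.lm.DMP";
EOF

sed -i -e 's/_test\./_train./' \
       -e 's/\.lm\.DMP/tmp.arpa/' "$CFG"
cat "$CFG"
rm -f "$CFG"
# Then, from the base experiment directory (step 9):
#   nohup scripts_pl/decode/slave.pl &
```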

My experiment is currently in the decode process. Results to follow...
 * 4/8/2014: I ran the decode on my experiment and the results were incredibly surprising. They are as follows:

TOTAL Words: 9699 Correct: 2398 Errors: 7748
TOTAL Percent correct = 24.72% Error = 79.88% Accuracy = 20.12%
TOTAL Insertions: 447 Deletions: 2032 Substitutions: 5269

Now these results are obviously not very good. After a brief discussion with Colby and David, however, we believe we know why this is the case and are working on fixing it. I'm going to run another experiment that is essentially a replica of my last one, except this time with a modified language model. To modify the LM, I followed these steps as described to me by Colby:

1) I moved into the experiment's /etc directory and copied the _train.trans file to the LM directory: $ cp _train.trans ../LM
2) I then renamed the file to trans_unedited: $ mv _train.trans trans_unedited
3) I ran the following sed command: $ sed 's/(sw[0-9][a-z][a-z][0-9][a-z]-[0-9]\" \"[0-9].[0-9]\" \"[0-9].[0-9])//g' trans_unedited >> trans_parsed
4) I moved to the logdir/decode directory and renamed my files: $ mv -1-2.log -1-2.log.old and $ mv -2-2.log <exp#>-2-2.log.old
5) I did the same thing in my result directory: $ mv <exp#>.align <exp#>.align.old

Now I'm running the decode again on the same data set. The new decoding process is different and requires fewer steps: I simply navigate to my base directory and run $ nohup scripts_pl/decode/slave.pl & This automatically generates the results and puts them into the result directory as the .align file. This is a script built into Sphinx that previous classes didn't know about. It seems previous classes spent a lot of time writing Perl scripts that were already included in Sphinx, either because they didn't look through the included scripts, or because they didn't know what those scripts did. Now that I have the same experiment being decoded with the new language model, I will post the results as soon as they are finished.
 * Update: I finished decoding my replica experiment and the results are surprising. For some reason, even with the modified language model, my word error rate was exactly the same. Here are my results:

TOTAL Words: 9699 Correct: 2398 Errors: 7748
TOTAL Percent correct = 24.72% Error = 79.88% Accuracy = 20.12%
TOTAL Insertions: 447 Deletions: 2032 Substitutions: 5269

I really don't know how to interpret this and will have to discuss my findings with my group members. The only thing I can think of is that the data set is so small that the language model change wouldn't make a difference anyway.
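As a sanity check, the percentages in the scoring output follow directly from the raw counts (errors = insertions + deletions + substitutions; accuracy = 1 - WER), so the totals above are at least internally consistent:

```shell
# Recompute the sclite-style totals from the raw counts above.
words=9699 ins=447 del=2032 sub=5269
correct=$((words - del - sub))   # 2398 (insertions don't consume reference words)
errors=$((ins + del + sub))      # 7748
awk -v w=$words -v c=$correct -v e=$errors 'BEGIN {
  printf "Percent correct = %.2f%% Error = %.2f%% Accuracy = %.2f%%\n",
         100*c/w, 100*e/w, 100*(w-e)/w
}'
# -> Percent correct = 24.72% Error = 79.88% Accuracy = 20.12%
```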

 * Plan:
 * 4/4/2014: Today I plan on tackling the repository issue head on and getting it resolved
 * 4/4/2014: I successfully got SciPy and NumPy installed on Fedora. The problem was that the machine's date was set to 2004, which causes conflicts with the repositories' certificates. I logged into Rome as root and changed the date and time to today's current date and time. See my steps above for changing the date in the UNIX environment.
 * 4/5/2014: I'm going to continue working on getting my experiment to run with the RunAll_CDMLLT.pl script
 * 4/6/2014: I successfully trained my experiment last night thanks to a breakthrough in methodology I learned from a Sphinx expert on the #cmusphinx IRC channel. I am now going to decode my experiment, see what our word error rate is, and go from there.
 * 4/8/2014: I'm going to re-run my last experiment, except this time with a new language model
 * Concerns:

At first I had absolutely no idea why Colby was getting a series of 404s from the default repositories, but I eventually found out why we were receiving them. My main concern now is getting a good WER on my next experiment, detailed above. Earlier in the week I couldn't get the experiment to train without receiving the fatal error mentioned above: it claimed a missing LDA matrix even though the LDA log said training completed, so I believed the problem had something to do with Python. Now that I've trained my data, my only remaining concern is learning the new decode process. I am also concerned about the results, and how this new breakthrough in methodology will affect our word error rates.
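As a footnote to the 4/4 date fix: resetting a badly skewed clock looks roughly like this (GNU coreutils; the set and sync steps need root, so they're shown commented rather than live):

```shell
date                                 # inspect the (wrong) system date
# As root, set the correct date/time and persist it to the hardware clock:
#   date -s "2014-04-04 12:00:00"
#   hwclock --systohc
# Repository SSL certificate validation fails when the clock predates the certs.
```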

Week Ending April 15, 2014

 * 4/12/2014: Updated tasks and continued reading CMU tutorials. This week I'm going to focus on creating a 100_hr/train2 AM, with several different parameters changed, a new LM and hopefully a few new scripts to see how it turns out.
 * 4/13/2014: Read logs. I'm still waiting for the steps on creating an acoustic model based on the new methodology.
 * 4/14/2014: I spoke with David this morning about the new process for creating an acoustic model. I'm just waiting to hear back from him about his results. There was an issue with a newly created transcript file containing more lines than a base transcript file, so once this is resolved I will be able to continue work.
 * 4/15/2014: Unfortunately, it seems like it's been a slow week for anyone who wasn't delegated the task of reworking Caesar's directories. Those of us not writing scripts were kind of at a standstill this week.

Today I went ahead and created the initial steps of my Acoustic Model for 100hr/train2. I followed the new steps as described to me by Colby and David. Instead of using the previous methods and scripts, I'm doing it the way it was intended, which is by using the tools that come with Sphinx. Right now, however, we have a blocker. I cannot train my acoustic model because of this, and progress has come to an unfortunate halt. I'm positive everything will be up and running again in no time.
 * Task:
 * This week I'll be creating a 100_hr/train1 acoustic model. I'll be using a modified version of genTrans, a modified language model, and heavily modified sphinx_train.cfg and sphinx_decode.cfg files. I'm hoping we can get Sphinx 4 up and running sometime soon so I can re-run this experiment with a feature only available in Sphinx 4 that essentially pre-trains the data. We have a couple of barriers right now. The first is that decoding my acoustic model will take roughly a week on a single machine. We have a few ideas on how to cut down on time. Obviously, the first of those is getting Torque running, but I don't know what Forest's progress is on that; I know he was able to send tasks to nodes but couldn't get them to respond. Utilizing Torque would reduce our decode time and xRT factor by a substantial amount. The other plan would be to execute a strategy we discussed with Jonas. I can't go into too much detail on how it works, but essentially we decode our acoustic model via a makeshift parallelization method. Our group's (the Avengers) level of knowledge about speech has improved at least threefold these past two weeks. We're working on some relatively advanced material compared to past, and even current, semesters.


 * Results:
 * Results to follow...
 * Created the acoustic model for 100hr/train2. I'm not able to train it yet, though, due to the blocker we have.

 * Plan:

To execute my plan, I'll need to wait for the new acoustic model creation steps to be in place, because of the recent change to our directory structure on Caesar and the branch machines. Essentially, instead of having genTrans recreate the wheel each time an experiment (acoustic model) is created, links will be put in place to reference the necessary .sph files. We were running out of space on Caesar and needed to change our structure drastically. Once these steps have been created, I will go forward with the creation of my AM and hope for the best. I can't wait to see how it turns out.
 * Next week I'll be able to continue progress on my AM. I'll hopefully get it trained and ready to test on a solid data set. Next week should be much more eventful. It's hard going from being constantly busy with work last week to having nothing I can do this week.
 * Concerns:

My biggest concern right now is getting our xRT factor down to the required 1.5. This has been our biggest hurdle, because there are so many variables that affect the real-time factor, the majority of which we have only just begun to tinker with. The xRT factor also directly impacts how quickly we can decode experiments: higher real-time factors mean longer decodes, which is bad for us. We're trying to run and decode as many AMs as we can to tweak the variables and parameters just right before we run a real experiment we could potentially use as our submission. I'm also hoping I can get this acoustic model created before Wednesday's class. The recent changes have thrown everyone off a bit; maybe the end of the semester wasn't the best time to restructure Caesar. I do understand the need to do so, however. It was clear we couldn't continue going the old route, as we had used up 98% of Caesar's disk capacity.
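The xRT concern is easy to quantify, since decode wall-time is just audio duration times the real-time factor (for a fixed machine count):

```shell
# xRT = processing time / audio duration, so decode time scales
# linearly with the real-time factor.
awk 'BEGIN {
  hours = 100                              # audio to decode, in hours
  for (xrt = 1.5; xrt <= 2.0; xrt += 0.5)
    printf "%.1f xRT -> %.0f CPU-hours (%.2f days)\n", xrt, hours * xrt, hours * xrt / 24
}'
# -> 1.5 xRT -> 150 CPU-hours (6.25 days)
# -> 2.0 xRT -> 200 CPU-hours (8.33 days)
```

The 2.0 xRT figure matches the earlier estimate of roughly a week to decode on a single machine, which is why Torque or some other parallelization matters so much.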

Week Ending April 22, 2014

 * 4/21/2014: Another slow week. We're still waiting on progress with the new acoustic model creation / training system before we can begin.

This week I was really hoping to start creating a few acoustic models so we could finally begin training on real data sets. Unfortunately, we still haven't worked out the kinks of our new AM creation and training structure. This is putting us all behind, but it's going to be worth it in the end. Our group has a plan of attack that won't be fully divulged until the competition is over. At this point, we will release our documentation for the upcoming semesters. The progress made on speech this semester is most definitely astounding. I firmly believe we're a lot closer to making real-world progress. I received an email from Colby today explaining what our status currently is, and it looks like we'll begin working very soon. David said implementing the new process was more difficult than he expected.
 * Task:

My task beginning today is to research subvq, which should help improve our real-time factor.
 * Results:

So far I haven't been able to really do anything Capstone-related, as we've all been at a standstill.
 * Plan:

I'm sure that next week we will begin creating and training our acoustic models on the newly created data sets and improved structures.
 * Concerns:

My number one concern is getting to the point where we can start making progress.

Week Ending April 29, 2014

 * 4/27/2014: Logged in and read the logs to see if anything changed; it hasn't. Little communication has been going on, so I'm going to send everyone an email to get a quick update.

It seems yet again we are at a standstill. During class last week we discussed our course of action from now until the end of the competition: essentially, to run about 10 experiments simultaneously across different machines using varying senone values, in an attempt to fine-tune our AM and train it to its absolute maximum potential given our current knowledge. A lot has been put into modifying the way our acoustic models are created and trained. We're going to test using a 1 hr data set, find the best parameters, and then retrain using the 5 hr data set, which is what's required for a valid submission in the competition. Our current best is 30% with a 2.0 xRT, which is significantly higher than we know we're capable of, and we have a few good tricks up our sleeves. I would really like to get our training underway, as the process takes a very long time: 100 hours of data takes roughly one and a half days of running in the background on our machines to complete, and extrapolating that by how many runs we need to put out concerns me.
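The planned sweep could be driven by something like the sketch below. The directory layout and the RunAll script name are hypothetical; the senone count itself lives in $CFG_N_TIED_STATES in SphinxTrain's sphinx_train.cfg:

```shell
# Sketch: one experiment directory per senone value, each trained in
# the background after its config is stamped with that value.
for senones in 200 400 600 800 1000; do
  exp="/mnt/main/Exp/0251/senone_$senones"   # hypothetical layout
  [ -d "$exp" ] || continue                  # skip experiments not yet set up
  sed -i "s/\(\$CFG_N_TIED_STATES *= *\)[0-9]*/\1$senones/" "$exp/etc/sphinx_train.cfg"
  (cd "$exp" && nohup scripts_pl/RunAll.pl > runall.out 2>&1 &)
done
```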
 * Task:

Communicate with teammates, essentially. I've sent out an email asking everyone where they're at and where we can go from here.
 * Results:

Results will follow.
 * Plan:
 * Concerns:

Finishing training our AMs in time.

Week Ending May 6, 2014

 * Task:


 * Results:


 * Plan:


 * Concerns: