Speech:Spring 2013 Eric Beikman Log



Week Ending February 5th, 2013

 * Task:
 * Gain knowledge regarding the Sphinx system, specifically how to build and run Experiments. We will need a fairly intimate knowledge of this process as we will be teaching others in the future.

(2/2)
 * Results:
 * I was curious about each of the different train sizes and the amount of audio each SPH file contains. Each corpus subset is located in /mnt/main/corpus/switchboard/; there are four directories here: 10hr, full, mini, and tiny. Most of the corpora don’t seem to be set up correctly. According to [], there should be three directories within each corpus, each containing two subfolders for the sound files and transcriptions. Only the mini train appears to be set up correctly.


 * To more quickly determine the size of each corpus (in minutes of audio), I decided the easiest way would be to find the average size of each .sph file, along with the bitrate it's encoded with. According to www.fileinfo.com/extension/sph, sph files use 16-bit linear PCM at a 16 kHz sample rate, with a 1024-byte ASCII header. Looking at this header using:

reveals a bit more. It suggests that the file uses an 8 kHz sample rate, an 8-bit sample depth, 2 audio channels, and a form of the G.711 audio codec. Using the formula samplesPerSecond × bitDepth × channels, we can determine how many bits per second the codec uses: 8000 × 8 × 2 = 128000. We will need to find the average file size of each sph file, subtract the 1024-byte header, convert to bits, and divide by the bitrate we determined. For something like this, I determined that a small AWK script would give the results we need. The command ls -l gives the file size (in bytes) in the fifth column. Piping the output of ls -l into the following awk statement should determine the average file size and length of each audio file. The statement assumes that there are 27 directories correlating to 27 disks, that each directory holds only sph files, and that there are 3 lines separating the output of each directory listing.
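The original AWK statement is not preserved in this log; a minimal sketch of that kind of pipeline might look like the following. Note that ls -l reports sizes in bytes, so this sketch converts to bits before dividing by the 128000 bit/s rate; the function name and output format are my assumptions.

```shell
# Hypothetical reconstruction of the averaging pipeline described above.
# Assumes: sizes in bytes in column 5 of ls -l, a 1024-byte header,
# and the 128000 bit/s rate computed in the text.
sph_stats() {
  ls -l "$@" | awk '/\.sph$/ {
      n++; total += $5            # count files, sum sizes (bytes)
    }
    END {
      avg = total / n
      printf "%d Files with an average size of %d bytes. ", n, avg
      printf "Average audio length of %d seconds\n", (avg - 1024) * 8 / 128000
    }'
}
```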


 * Executing the above on Verleihnix gives the following:

2269 Files with an average size of 6006451 bits. Average audio length of 46 seconds


 * I was also curious to see how long a single train takes to run. Finding such information now will be invaluable for benchmarking purposes when we run experiments of our own. Looking at /mnt/main/Exp/0017/0017.html, Experiment 17 was executed on Caesar, starting on 8/30/2012 at 12:43 with the last log entry on 8/31/2012 at 3:29. We also know that this train session went through 10 hours of audio. With this information, we can gather that a single train session running on Caesar takes about 14 hours 46 minutes, so we can estimate that every 1 hour of audio samples takes about 1.5 hours to process on Caesar. I assume that the other servers will take longer, as they are slightly slower and will have greater I/O overhead due to the data being on the networked drives.

(2/3)
 * Reviewed other experiment logs.


 * Discovered Sphinx FAQ, which has some helpful information regarding training, tests, and modeling.
 * The FAQ for Sphinx 3 mentions there is an option, ‘-npart,’ which appears to have something to do with partitioning training data for multiple machines to share the job. This definitely warrants additional research as it has the potential to drastically cut down the time needed to create models.


 * Realized I forgot to measure the sizes of the various corpus subsets yesterday.


 * - will count the number of files in the given directory; in this case, it's the mini corpus.


 * There are 2289 files in the mini training corpus. Multiplying this by the average length of each sph file determined above (46 seconds) and dividing by 60 gives about 1754 minutes of audio data, or roughly 30 hours.
 * Similarly, there are 18 files in the /mnt/main/corpus/switchboard/tiny/train/wav/ corpus. This is about 13 minutes' worth of audio.
 * The /mnt/main/corpus/switchboard/full/train/wav corpus is empty.
 * Something isn’t right in the 10hr corpus. There are 1626 files contained within the training directory, which equates to 1246 minutes, or roughly 21 hours worth of audio.
 * Either my calculations are incorrect, which is likely, or the 10hr corpus is mislabeled.
 * It’s also possible that there are 20 hours' worth of audio in the training folder but only 10 hours of transcription data available, so Sphinx ignores the audio for which it has no transcripts.
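The arithmetic above can be packaged as a small helper; a sketch assuming the 46-second average file length computed earlier (the function name is mine):

```shell
# Convert a file count into approximate minutes of audio, assuming
# an average of 46 seconds per sph file (from the earlier estimate).
audio_minutes() {
  for n in "$@"; do
    awk -v n="$n" 'BEGIN { printf "%d files: ~%d minutes of audio\n", n, n * 46 / 60 }'
  done
}
```

For example, `audio_minutes 2289 18 1626` reproduces the mini, tiny, and 10hr figures above.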

(2/4)
 * Reviewed Sphinx FAQ to learn more about Modeling process.
 * Looked into issues Tommy McCarthy was experiencing when attempting to run trains on the client servers.
 * Looking at the output of cat /etc/exports shows that Caesar restricts its /mnt/main NFS share: root accounts on clients which mount this share get only read access.
 * Until we get our own accounts, all processes which require write access to /mnt/main will have to be done on Caesar.
 * We need to be careful as caesar is running with two fans missing. Running a large train may overheat the unit.
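The export behavior described above typically comes from root squashing; a hypothetical /etc/exports entry illustrating it (the actual options on Caesar are not recorded in this log):

```
# Remote root is mapped to an unprivileged user (root_squash is the
# default), so root on a client effectively gets only read access
# to root-owned files, even on an rw export.
/mnt/main  *(rw,sync,root_squash)
```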

(2/5)
 * Dictionary entries added (word followed by its phone sequence):
 * FEDERALDES F EH1 D R AH0 L EH1 Z
 * DUCTWORK D AH1 K T W ER1 K
 * COGNIZITIVE K AA1 G N AH0 Z IH0 T IH0 V
 * CHOWPERD CH AW1 P ER0 D
 * ALBRIDGE AO1 L B R IH1 JH
 * SOUTHBEND S AW1 TH B EH1 N D
 * Attempted Experiment based on instructions written here: [].
 * Did not create new Corpus subset
 * Used the "Tiny" subset.
 * Encountered issues with words not within dictionary. Added words to Experiment's dictionary file.
 * Some of these words don't appear to be spelled correctly within the transcript.
 * Was able to create Acoustic model
 * Took about 20 minutes for about 15 minutes worth of audio.
 * Was able to create language model.
 * Decoding (run_decode.pl) resulted in failure
 * Looked at logfile, run_decode.pl was looking for a 0018/model_parameters/0018.cd_cont_1000 resource.
 * Found a 0018.cd_semi_1000 directory. Edited the local copy of ./run_decode to utilize this file.
 * Was able to continue onwards, but hit another error in run_decode.pl
 * FATAL_ERROR: "kbcore.c", line 492: #Feature streams(4) in the feature for continuous HMM!= 1

 * Plan:
 * Plan on learning enough of the process to run a mini or tiny train experiment by the end of this week. (Complete)
 * Discuss the results of the Experiment with the modeling team.
 * Concerns:

Week Ending February 12, 2013

 * Task:
 * Our Task for this week is a continuation of the previous week's goals. We are to gain more familiarity with the modeling process.
 * We will also try to debug the decode issues and run an experiment successfully to completion.

(2/7) (2/10)
 * Results:
 * Had team meeting over Google+ Hangout. Walked through the steps required to run an experiment, creating Experiment #0019.
 * Encountered similar results as Experiment 0018, namely, the Decode process failed.
 * Now that everybody is up to speed, we will begin to look into this issue.
 * Comparing training HTML log files between the last successful experiment (0017) and Exp 0018.
 * In both Experiment 0019 and 0017, the Phase 3 Normalization step had 5 error messages, so that isn't the issue. Similar error messages appear in the same locations in both logs; I don't think they are the issue.
 * Compared experiment 0017 and 0019's decode logs.
 * It appears that the decode process for 0018 is looking for files in 0018/model_parameters but can't find what it's looking for.
 * Comparing the model_parameters directories between 0019 and 0017 reveals that the files in 0019 are named differently.
 * Specifically, directories are named 0019.cd_semi_1000* instead of 0019.cd_cont_1000*
 * According to the logs, this is the reason why the decode is failing.
 * Thinking that there was a bad configuration within the cfg file, I searched for "semi" within "sphinx_train.cfg".
 * Surely enough, the config file was configured to setup the train in the Sphinx 2 format, which uses the "semi" name.
 * The assignment statement on line 80 needs to be commented out while line 79 needs to be uncommented.
 * Recreated the 0019 Experiment based on these changes.
 * Train failed not too long into it. It is having trouble finding /mnt/main/Exp/0019/trees/0019.unpruned/EY0-0.dtree and errors out.
 * The equivalent of this file exists in 0018 and 0017.
 * It appears that the script and instructions are designed for SPHINX 2 and not SPHINX 3
 * I need to look into more about what each option in the config file does.

(2/11/2013) Spoke with Tyler Martin; Caesar was shut down for maintenance and replacement of some of the fans. (2/12/2013) Caesar is still down as of Tuesday evening. I need to compare the two experiments to fully understand why one failed and the other succeeded. Read logs and the Sphinx online documentation.
 * Looked in configuration files.
 * Comparing between the training logfiles between 0018 and 0017, I noticed that the results for the last module are different. It appears that the last step in 0018 was converting Sphinx 3 format models to the format used by Sphinx 2.
 * I'm not sure why it did that. I can't find any difference in the configuration between the two.
 * It appears there are two types of models, continuous and semi-continuous.
 * The place in the sphinx_train.cfg file where you switch between the two has a comment reading "Sphinx III" for the continuous option and "Sphinx II" for the semi-continuous option.
 * Caesar had a sudden system halt and kicked me off while I was still investigating. System is still down as of 15 minutes later. Will try again tonight.
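Based on the comments described above, the relevant switch in sphinx_train.cfg presumably looks something like this excerpt from a stock SphinxTrain configuration (the exact lines on our system may differ):

```perl
#$CFG_HMM_TYPE = '.semi.'; # Sphinx II (semi-continuous models)
$CFG_HMM_TYPE  = '.cont.'; # Sphinx III (continuous models)
```

With '.semi.' selected, the trainer produces the cd_semi_1000* directories seen earlier; '.cont.' yields the cd_cont_1000* names the decoder expects.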

 * Plan:
 * Look into what each option in the sphinx training configuration file (sphinx_train.cfg) does.


 * Concerns:

Week Ending February 19, 2013

 * Task:
 * Get Caesar and Sphinx up and running so the modeling team can continue.

(2/16/2013)
 * Results:
 * Now that Caesar is up and running, I continued onwards with my investigation

 * I've decided to start with a 'clean slate' and create a new corpus. After reading the documentation online, I determined that the mini corpus has one hour's worth of dialog in it.
 * I created a new corpus, mini2, which has an hour-long block of transcripts. To make a model distinct from the one that would be created with mini, the transcripts for mini2 start right after the ones in mini. This means that mini and mini2 comprise two different blocks of the transcript.


 * Attempted to create new experiment, ran into command not found errors. We will probably need to re-install Sphinx.

(2/17/2013) (2/18/2013)
 * Installed via zypper the tools needed for compilation and installation, such as GCC, Automake, and Make, along with any required dependencies.
 * To play it safe, I decided to re-compile the binaries, just in case we are missing any libraries.
 * Installed SphinxBase, SphinxTrain, and the CMU toolkit.
 * The Sphinx 3 decoder is not installing correctly; it doesn't look like it was compiled before, and it's not compiling now.
 * I keep getting the error ../../src/libs3decoder/.libs/libs3decoder.so: undefined reference to `bio_read_wavfile'  when compiling.
 * Attempted to hard-code the required header and library paths.
 * The same error occurred
 * Thinking that this would only affect compilation, I decided to hold off on this and start a train.
 * The dictionary pruning errors out. Returns the error message: sh: text2wfreq: command not found.
 * This is a part of the CMU toolkit. I checked the install of that package and it seems to be compiled.
 * I can't find it in /usr/local, though; it's in /mnt/main/local. I have no idea why.
 * Assisted Tyler with some final configuration of Caesar.
 * Set Eth0 to static address.
 * We also experimented with getting Caesar to allow logins via UNH's Wildcats accounts.
 * We were able to get Caesar joined to the domain using Samba
 * It may not meet our needs, so we may need to disable it.

(2/19/2013)


 * Continuing onwards in my quest to get everything working!
 * I have a feeling that the instructions [] are incomplete.


 * I will start by attempting to undo everything I've done and start new.
 * Ran make uninstall for Sphinxbase; the other packages don't have this ability. Confirmed that no sphinx entries remained in /usr/local.
 * Ran make distclean in all sphinx source-code directories to delete all files created during previous compilations and any Makefiles.
 * Created a new directory, test, under /mnt/main/root
 * After reviewing the README files for all packages, I've determined that we need to compile and install Sphinxbase first, and that it needs to be compiled in a directory called 'sphinxbase'; otherwise all other sphinx packages will fail during compilation.
 * Copied the existing sphinxbase-0.6.1 directory into the new test directory, renaming it to 'sphinxbase'.
 * After running the configure script for sphinxbase, I noticed that it was complaining about not being able to find header files for python.
 * Just to play it safe, I installed the python-devel package using zypper, ran make distclean again, and re-ran the configure script successfully.
 * Was able to successfully compile sphinxbase, ran make check successfully to test the compiled files, was able to install the program successfully. Confirmed that files have been inserted into /usr/local/bin and /usr/local/lib among others.


 * Now onto the CMU-Toolkit
 * This should contain tools which are utilized by some of the scripts we use.
 * Namely text2wfreq
 * To ensure a 'fresh' and unmodified codebase, I have extracted from /mnt/main/install/tar the speechtools.tar file.
 * These should be the same versions as was previously installed.
 * Within that file, I have extracted the CMU-Cam_toolkit.
 * The README file specifies that Little-endian machines, such as the one we are running, are required to set the "ByteSwap_Flag" to -DSLM_SWAP_BYTES within the makefile
 * Ran make. There is a shell script called "install-sh" which, according to the README, is to be executed to install everything.
 * This script merely returns an error that it's missing arguments.
 * Looking at the makefile, I've determined that this script is called using 'make install'
 * Also looking at the makefile, I've determined that by default it installs all binaries and library files to two directories, bin and lib, located one directory higher than the source code.
 * Looking at the Prune_dictionary.pl script, it calls text2wfreq directly, without referencing where the executable is.
 * The executable for this program is not located in any of the directories defined by the PATH variable, hence it can't be found.
 * To resolve this, I will copy the existing makefile, naming the copy Makefile.old.
 * I will then define the binary and library directories within the Makefile to point to /usr/local/bin and /usr/local/lib respectively.
 * Executed make install again. The installer script deposited the correct files to the correct locations.
 * There is no 'make uninstall' defined, if we want to remove what we've inserted into /usr/local, we will need to remove everything manually.


 * Now for the Sphinx Trainer:
 * Extracted SphinxTrain into test directory.
 * Reviewed Readme file to see if there was anything out of the normal.
 * Ran configure script and make as instructed in README.
 * It created a file in the parent directory of the source code.
 * make install didn't work. It isn't defined in the Makefile.
 * The install-sh script is located in a similar position but, like the one in the CMU toolkit, it doesn't do anything on its own.
 * After reviewing the Sphinx trainer's usage, I've determined that it doesn't need to be installed, as the setup_SphinxTrain.pl script will copy the appropriate binary files over to the experiment directories.
 * Confirmed that the existing /mnt/main/root/tools/SphinxTrain-1.0 directory has been untouched. Everything looks fine there.
 * Removed /mnt/main/root/test/SphinxTrain-1.0 as it isn't needed.


 * Now, for the fun one...the Sphinx decoder....
 * Extracted a fresh version of the sphinx 3 decoder from the /mnt/main/install/tar directory into my working test directory.
 * Read the readme
 * Confirmed that the sphinxbase source code was located at the proper position and the right name (sphinxbase) as specified by the readme.
 * Executed autogen.sh
 * Executed make
 * Make failed again.
 * Took another look at the compiler output.
 * The actual compilation of the code is executing successfully; the linker is what's throwing the error. Specifically, it fails when it tries to link the function bio_read_wavfile.
 * Used find /mnt/main/root -type f | xargs grep -rl 'bio_read_wav' to find where it is being referenced.
 * It was being called in /mnt/main/root/test/sphinx3/src/programs/main_align.c, but it wasn't defined anywhere.
 * After some searching on Google, I determined it should have been defined in bio.c, within sphinxbase.
 * I couldn't find the declaration within the sphinxbase 0.6.1 which was already on Caesar
 * Found it was defined within Sphinxbase 0.7.
 * Uninstalled Sphinxbase 0.6.1 and downloaded, compiled, and installed sphinxbase 0.7.
 * Recompiling the sphinx 3 decoder then worked; I was able to install it after compilation.


 * Continued onwards with 0020
 * Failed to create the feats. Traced the failure back to genTrans.pl
 * The genTrans.pl script has a flaw where it assumes that sox is installed on the system and doesn't account for any failures.
 * Made a modification which fixes these flaws; the new script is called genTrans2.pl. Confirmed it works successfully.
 * Installed Sox
 * I will keep this script separate until I receive permission to incorporate my changes into genTrans.pl
 * Ran the train; it errored out due to missing dictionary entries.
 * Began entering dictionary entries when someone took Caesar down.
 * It's not accepting logins. Will have to try another time.
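The missing-sox failure mode described above is easy to guard against; a sketch of the kind of check genTrans2.pl adds, expressed in shell (the actual script is Perl, and the function name here is mine):

```shell
# Fail early, with a clear message, when a required external tool
# (such as sox) is not on the PATH, instead of failing mid-run.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "$1: command not found; please install it first" >&2
    return 1
  }
}
```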


 * Plan:


 * Concerns:

Week Ending February 26, 2013
 * Task:
 * Get a train to run successfully!

(2/23/13)
 * Results:
 * Looking over the last attempted run of Experiment 0020, it failed because multiple dictionary entries were missing.
 * The task of adding dictionary entries is extremely tedious, to the point that, on long trains with many missing words, it isn't a productive use of time.
 * I created a new script, updateDict.pl, which takes a list of words and the corresponding pronunciations, adds them into the dictionary, and then sorts the entries.
 * It can also be used to update the pronunciations of existing entries.
 * The script is still somewhat new and shouldn't be trusted.
 * To minimize potential damage (deletion or unwanted modification of the dictionary), a copy of the existing dictionary should be made, moved into the user's home directory, and worked on there.
 * The tool works fairly well. It was able to do in 20 seconds what took me more than 3 hours.
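The core merge-and-sort behavior described above can be sketched in shell (updateDict.pl itself is Perl and also replaces changed pronunciations; this sketch only merges and sorts, and the function name is mine):

```shell
# Merge new "WORD PHONES..." entries into a dictionary file and keep
# it sorted. sort -u drops exact duplicate lines; the real script
# additionally updates entries whose pronunciation changed.
merge_dict() {  # $1 = dictionary file, $2 = file of new entries
  sort -u "$1" "$2" -o "$1"
}
```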


 * Added missing word entries utilizing word pronunciations added in Experiment 0017 [].
 * Restarted train.
 * 3 words were still missing: 7, 20, and Baltimore.
 * Added new words to the dictionary using new script.
 * Started Train on Obelix.
 * This may take a while...


 * Train ran successfully after about 2-3 hours.
 * Unlike experiments 0018 and 0019, the train created the continuous models.
 * Created the corresponding Language model for this experiment.
 * Started a Decode session using the Acoustic models created in this experiment.
 * The decode kept going throughout the night.

(2/24/2013)
 * Decode ran successfully after about 6-7 hours of run-time.
 * Read Logs and Information regarding next steps.

(2/25/2013)
 * Started the verification process using SCLite.
 * Ran the decoder parser.
 * Started SCLite.
 * SCLite threw out errors relating to missing transcripts.
 * According to instructions, this is normal and can be resolved by filtering out redundant entries within the decoder and training transcripts.
 * Ran through this process
 * I am still getting the same error.
 * The instructions for the decode do not give a good explanation of what each of the two arguments represents.
 * I know that the first one specifies the Experiment to run the decoder for, and the second specifies the Acoustic model. The example uses two separate experiments, running a decode for Experiment 0015 with the acoustic models developed in 0012.
 * I don't know why it is implemented in this fashion.
 * I used 0020 for each argument. Are they supposed to be different?
 * I will have to do some more research and testing on this.
 * Perhaps try a decode utilizing a different Acoustic model to verify this.

(2/26/2013)
 * Read more about the decode script.
 * Started a Decode for experiment 0020 utilizing the acoustic model for 0015
 * The process is very slow; it will likely keep going until tomorrow morning (Wednesday 2/27).


 * Plan:


 * Concerns:

Week Ending March 5, 2013
 * Task:
 * Determine why scoring isn't working properly.

(3/2) (3/3)
 * Results:
 * On Wednesday the decode process I started the previous day finished.
 * Scoring it resulted in the same errors as before.
 * We need to determine why it's failing when scoring.
 * I may have an issue with my Corpus subset.
 * Read Logs and more about the NIST SCLITE program.
 * This site is a pretty good summary of what SCLite does.
 * From my understanding, SCLite has two main input files:
 * A Hypothesized text file
 * Which is the output from a speech decoder
 * And a Reference text
 * Which is an accurate transcript file of the audio files that were decoded.
 * The scoring process first "Aligns" these two texts by:
 * Selecting Matching Reference and Hypothesis texts
 * Aligning the Reference and the Hypothesis texts
 * Finally, it scores the decoding by determining the amount of correctly decoded words per speaker.
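For reference, a typical sclite invocation matching the inputs described above might look like the following; the file names are placeholders, and the exact flags our scoring scripts use are not recorded in this log:

```
# "trn" is sclite's transcript format: one utterance per line,
# ending in an (utterance-id) tag used for alignment.
sclite -r reference.trn trn -h hypothesis.trn trn -i rm -o sum stdout
```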

 * From the error message that I am getting, ''Error: Not enough Reference files loaded. Missing:'', followed by a long list of transcript ids, it seems the corpus may have included wav files which do not have any associated transcripts.
 * I have no idea why this is so. I would think that the train would not initiate due to the missing transcripts.

(3/4)
 * Investigated the missing reference files mentioned in the SCLite error.
 * Oddly enough, they aren't mentioned in either the hypothesis file or the transcript.
 * Perhaps the next course of action would be to attempt a scoring utilizing a reference transcript from Experiment 0017, which consisted of all the audio for which we have a transcript.
 * According to the SCLite documentation, this won't be much of a problem as the program will only use portions for which there are reference text.

(3/5)
 * Met with Professor Jonas to discuss the scoring issues.
 * After some troubleshooting, we determined that the hour-long corpus section I used for 0020 actually contained only 5-15 minutes' worth of unique audio, and the transcript file contained 4 separate copies of the same transcripts.
 * What really threw us off was that one of those extra copies was slightly different from the others (used 'the' instead of 'this').
 * I need to be aware that the transcripts (/mnt/main/Exp//etc/_train.trans) and the hypothesis file created during the decoding process may be screwy in this manner.
 * If scoring fails, we may need to sort them alphabetically, remove redundant entries, and ensure that the Hyp and Trans files have the same number of words/lines.
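The cleanup described above amounts to sorting each file and dropping duplicate lines; a minimal sketch (the function name is mine, file paths assumed):

```shell
# Sort a transcript (or hypothesis) file in place and remove
# duplicate lines. Apply to both files before re-scoring.
dedupe() {
  sort -u "$1" -o "$1"
}
```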


 * Plan:


 * Concerns:
 * I'm concerned that the corpus subset I created for Experiment 0020 was flawed in some way, meaning that the scoring won't proceed.

Week Ending March 12, 2013
 * Task:
 * The task for this week is to run more experiments, preferably to get to the point where we can run them with little to no issue. The team, including myself, also needs to list and detail the experiments we have created to this point, and any future experiments, on the Speech wiki page.

(3/10/13)
 * Results:
 * Read logs to determine the status of other teams.
 * Added Experiment data to the Wiki.

(3/11/13)
 * Finished adding all data onto the wiki for experiments which I have been directly involved with.
 * Re-read the notes and instructions on making corpus subsets to determine why the Mini2 corpus subset I created for Experiment 0020 contained so much repeated data.

(3/12/13)
 * Started a new Experiment, Experiment 0024.
 * To ensure that there will be no issues with the corpus, I have utilized a pre-existing corpus subset, Mini.
 * This subset has already had experiments performed on it, so we should have minimal issues with it.
 * Started the train; it failed due to missing words and incorrect phones. Added the missing words and corrected the phones in the dictionary.
 * Started Train successfully.
 * Train took about 0.5-1 hour to process.
 * Created Language model.
 * Started Decode
 * Strangely, the decode only took a fraction of the time that Experiment 0020 took.
 * Will Score tomorrow

(3/13/13)
 * Started scoring process.
 * Ran into similar issues as Experiment 0020.
 * Was able to resolve issues by removing redundant transcript entries.
 * This step probably should be done before the Training.
 * The instructions on the wiki may need to be adjusted.
 * First we should run another experiment with these adjustments, comparing the scores of both.
 * Also, perhaps a script should be made which checks the transcripts for such redundancies and resolves them.

 * Plan:
 * (3/11/13) Plan on starting a new train tomorrow.
 * Concerns:

Week Ending March 19, 2013

 * Task:
 * Run trains to gain a familiarity with the system.
 * Run Experiment 0025, testing to see if training with the transcript and dictionary created to score 0024 will result in an increase in accuracy.
 * Attempt to build a proper, valid corpus subset.

(3/16/2013)
 * Results:
 * Started Experiment 0025
 * Utilized the transcript and dictionary from Experiment 0024.
 * First attempt to run the train failed:
 * Had to run the transcript filelist (0025_train.fileids) through uniq due to it having redundant entries.
 * We have to note that when training with a transcript that has been run through uniq, the file id list needs to be run through uniq as well.


 * Train ran successfully.
 * Created Language Model without issues.
 * Started Decode.
 * Scoring will be addressed tomorrow.

(3/17/2013)
 * Scored Experiment 0025
 * Oddly enough, Experiment 0024 had a slightly lower error rate than Experiment 0025, suggesting that training with redundant transcript entries actually slightly increases the accuracy of the resulting models.
 * Typed up experiment reports for Experiments 0024 and 0025.

(3/18)
 * Started a new decode experiment, 0026.
 * The goal of this, and another experiment (to be created after the conclusion of Exp. 0026), is to confirm the results found in Experiments 0024 and 0025.
 * Attempted to run the decode. The decode is failing; will need to troubleshoot.

(3/19)
 * Was able to get the Decode running for Experiment 0026.
 * The decoder requires that a dictionary is created within the experiment to be decoded.
 * I assumed it was going to use the main dictionary for sphinx.
 * Additionally, Feat data needs to be created.
 * To do that, the Sphinxtrain configuration file needs to be created.
 * After creating the dictionary and the Feats data, the decoder started.
 * Decoder is expected to run for a little bit.
 * Read logs afterwards.

 * Plan:
 * (3/17/2013) Score Experiment 0025.
 * (3/19) Determine why Experiment 0026 is not decoding.
 * Concerns:

Week Ending March 26, 2013
 * Task:
 * Bring Team "B" up to speed on building models.

(3/22/2013) (3/24)
 * Results:
 * Worked with Josh to update the Documentation/instructions.
 * Updated Speech:Training by incorporating the information created by Cedric Woodbury over Summer 2012, along with clarifications and modifications that the Spring 2013 modelling team has made to the process.
 * Updated Speech:Run_Decode by removing the portions regarding building the language model.
 * Moved it to its appropriate position in the existing Speech:Create_LM.
 * Created Speech:Spring_2013_updateDict.pl
 * This script was created by myself earlier in the semester. The modelling team has been using it, and we feel it's stable enough to be used by others.
 * Moved script into /mnt/main/scripts/user for easier access by others.
 * Emailed team regarding update. Contacted other Modelling team members as well to inform them of the changes.
 * Made clarifications and formatting changes to Speech:Training.
 * The Instructions for making a train now more closely resemble the instructions for adding dictionary entries.
 * Also verified the transcript by stripping out everything but the first 6 characters on each line, removing duplicate entries, and then counting the remaining lines, which represent unique files. The result was 61, which correlates with the amount of file data I calculated earlier in this process.
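The counting step described above can be sketched as a one-line pipeline (the function name is mine; the 6-character file-id prefix is taken from the text):

```shell
# Keep only the first 6 characters of each line (the file id),
# drop duplicates, and count the remaining unique files.
count_unique_prefixes() {
  cut -c1-6 "$1" | sort -u | wc -l
}
```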

After confirming the written transcript's validity, I moved it into the /mnt/main/corpus/switchboard/last_5hr/train/trans directory.

For the second part, we needed to create softlinks to each audio file referenced in the transcript. Luckily, the copySph.pl script works great and did this quickly and with little issue. It was able to get all 61 audio files and put them in the wav directory. To confirm we have 5 hours' worth of audio, I used sox: the command gets the length of all audio data in the directory, retrieves only the second line, containing the audio length, then uses awk to get only the second column, which contains a number. Surprisingly, this added the resulting numbers up for me as well. The corpus has about 18238.138 seconds' worth of audio, which is a tiny bit over 5 hours, or about 5 hours and 4 minutes to be more precise.
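The exact command is not preserved in the log; the totalling step it describes could look like the following sketch, which feeds per-file durations (for example from `soxi -D *.sph`, one number per line) into awk. The function name and the use of soxi are my assumptions:

```shell
# Sum a stream of per-file durations (seconds, one per line).
sum_seconds() {
  awk '{ total += $1 } END { print total }'
}
# assumed usage: soxi -D *.sph | sum_seconds
```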

Now we needed to create a list of words which are used in the transcript but missing from the dictionary. Normally we have to build the experiment and run a train to get a list of missing words. To bypass this, I have updated the dictionary creation scripts.

They will now keep track of any words they can't find, outputting them to an add.txt file at the end of the process; this integrates well with the updateDict.pl script. Furthermore, I have made tweaks to make the process run faster. The script is now about twice as fast as the old one. I experimented with another updated version which is much faster still, but I'm still having trouble with it; I will work on it another time.

Using these scripts, I generated a transcript within my home directory using genTrans2.pl, then used the pruneDictionary2.pl script to create a dictionary. We are missing about 104 words. The words have been shared amongst the group and will be gathered.

(4/7/13)

Read logs. Participated in inter-team communications. We should start the train by Monday evening at the latest; any later than that, we run the risk of not having time to troubleshoot.

(4/8/13)

Started work on the train. Another group member had already performed the initial steps; I just had to run genTrans.pl, merge the dictionary with the additions text file using updateDict.pl, and do everything else needed for the train, then start it. The new pruneDictionary2.pl and dictionary2.pl scripts work quite well; using them, we were able to generate a list of missing words early in the week and send it out to the group to fill in. This is much superior to having to run the train and let it fail just to get a list.

I expect the train to run until sometime tomorrow morning. After that we will begin the decode and scoring processes. I've been in communication with Kevin A. and he got the test corpus set up, so we should be good in that regard.

(4/9/13) The train ran for a much shorter time than I had anticipated, finishing at roughly 1:00-2:00am.

Created the language model and started the decode. Scoring was extremely easy; the new corpus did not have any redundant transcripts, so no extra work was needed to score.

 * Plan:
 * Run the train on 4/8. Run the decode & score on 4/9.


 * Concerns:

Week Ending April 16, 2013
 * Task:
 * Get the error rate of the models down.

(4/12)
 * Results:
 * There are two areas which I think are affecting accuracy:
 * The dictionary used for the decoder and LM creation may be affecting accuracy.
 * All documentation I've seen states that stress indicators are not used.
 * The issue with GenTrans stripping out the markers that denote transcribers' notes, effectively turning the notes into words in the transcript.
 * This is likely the main issue.
 * Two/Three experiments will need to be done:
 * A test LM creation and Decode using 0074's acoustic model.
 * Two experiments to train and score using an improved transcript.
 * The First experiment can be done now.
 * For the second set of experiments, I'll need to research the best way to fix the transcriptions.
 * Updated Group assignment sheet to reflect the above.
 * Created a new dictionary using 0074's dictionary, but stripping out any numbers.
 * Effectively removing stress indicators.
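Since CMU-style dictionaries mark lexical stress with a digit appended to vowel phones (e.g. AH0, AW1), removing stress indicators amounts to deleting digits from the pronunciation field. A naive sketch of that step (the file name is illustrative; note this simple version would also delete digits in headwords such as alternate-pronunciation markers like "(2)", so real use should target only the phone fields):

```shell
# Strip CMU stress digits (AH0 -> AH, AW1 -> AW) from a toy dictionary.
printf 'ABOUT AH0 B AW1 T\n' > 0074.dic
sed 's/[0-9]//g' 0074.dic > nostress.dic
cat nostress.dic   # ABOUT AH B AW T
```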

(4/13)
 * Did research on how best to fix the transcript.
 * Right now I'm leaning towards:
 * Removing the brackets around partial words, which are marked like [..]-.
 * Replacing the brackets around noises and non-word elements with "++".
 * Then mapping such words in the filler dictionary to a special phone, effectively forcing the trainer to expect the word but not account for it when training.
 * We'll need to generate the script to do this.
 * If we get a better accuracy, then we can make an improved version of GenTrans.
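The two rewrites above can be sketched with sed. This is only an illustration of the idea; the real Switchboard markup is more varied, and the actual Perl script may handle it differently:

```shell
# Toy example of the two transcript rewrites described above.
line='[laughter] i was say[ing]- hello'
# 1) partial words: drop the brackets but keep the trailing dash
step1=$(printf '%s\n' "$line" | sed 's/\[\([^][]*\)\]-/\1-/g')
# 2) noises and non-word elements: replace remaining [x] with ++x++
printf '%s\n' "$step1" | sed 's/\[\([^][]*\)\]/++\1++/g'
# prints: ++laughter++ i was saying- hello
```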

(4/14)
 * Team members finished Experiment 0080
 * It turns out the decoder did not like that the LM and the acoustic model did not have the same phone list, which prevented a decode.
 * Instead of abandoning our work with this experiment, we shall create a new Train experiment, called Experiment 0082, which will train using a dictionary and phone list that doesn't have stress indicators.
 * Team members Completed Experiment 0082.
 * Team members finished a decode for Experiment 0080 using the acoustic model from Experiment 0082.
 * The decoder is experiencing an issue that prevents it from decoding all but two audio files.
 * Not sure what is going on, will debug Monday or Tuesday.


 * Created a new experiment, 0081, for the train with the fixed transcript.
 * Created a transcript for 0081:
 * The process I've come up with is as follows:
 * Created a small perl script to encapsulate noises in '++' and remove brackets around incomplete words.
 * Used script to generate new corpus last_5hr/train2.
 * This process is easier; inserting the regex used in the Perl script directly into GenTrans2 caused it not to run correctly.
 * Ran GenTrans2 like normal. GenTrans will ignore the ++s, so it runs correctly.
 * Executed "cat 0081_train.trans | text2wfreq | sort | grep -v sw | grep ++ | more" to get a list of words that contain ++.
 * Used the list to fix instances where the ++ was encapsulated within an existing word.
 * This is extremely tedious; a better solution is a script that avoids introducing these in the first place!
 * Finally, since the transcript contains bracketed notations with spaces inside them, the trainer considers each one two separate words and errors out. For now, we will replace such notations with ++INTELLIGIBLE++ by renaming the transcript generated by genTrans to 0081_train.trans.old and then running a substitution over it.
 * Populate the filler dictionary with all the new non-word entries, such as ++NOISE++, ++INTELLIGIBLE++, ++LAUGHTER++, and so on, mapping each to one of three new phones.
 * Create three new phones, called +NOISE+, +INTELLIGIBLE+, and +LAUGHTER+, to which the new filler-dictionary entries are mapped. The trainer will account for them but will ignore them when building the models.
 * Due to the large number of changes, I ran the train myself as there was some debugging I had to do.
 * The train appeared to run correctly; the logs noted that the filler phones were ignored.
 * Started Experiment 0083 to score experiment 0081.
 * Based on the initial results of experiment 0080, Experiment 0083 will need to use the filler dictionary, the dictionary, and the phone list of experiment 0081 or the decoder may error out.
 * I'm unsure at this time whether we should perform similar steps to create the transcript for 0081, or whether we can just build the transcript like normal.
 * My main concern is that the decoder will crash due to the unexpected phones.
 * Ultimately decided we should create the transcript like normal. We will fix it or create a new experiment with the improved transcript if issues arise.
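For reference, the filler-dictionary entries described above would look roughly like this in SphinxTrain's filler-dictionary layout (each ++word++ mapped to its filler phone; the exact set of entries depends on which noise labels actually occur in the transcript):

```
++NOISE++         +NOISE+
++INTELLIGIBLE++  +INTELLIGIBLE+
++LAUGHTER++      +LAUGHTER+
```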

(4/16)
 * Performed inter-group communications.
 * Scoring experiment 0083 used the incorrect corpus, and thus cannot be compared with last week's experiment.
 * Created experiment 0087 and performed a decode and score on the acoustic model generated in 0081.
 * The test scores were marginally worse than what was observed in 0075.
 * I think that the method defined above still needs adjustment.


 * Plan:
 * Create a better transcription.
 * Concerns:
 * There isn't a lot of documentation on how best to use the filler dictionary for Sphinx 3.

Week Ending April 23, 2013

 * Task:
 * Help prepare the modelling group's URC poster & abstract
 * This is our primary objective.
 * Perform a new experiment using the new genTrans scripts to improve accuracy further.
 * This can be done after the poster and abstract are completed.

(4/18/13)
 * Results:
 * Prepared a rough draft of the abstract and shared it with the team.
 * Communicated with modelling team members and improved abstract.

(4/18)
 * Made adjustments to Abstract.
 * Submitted it via email to Ellen Ruggles.

(4/20)
 * Collaborated with team to complete the final version of the poster.
 * Submitted poster and abstract to URC service.

(4/21)
 * Found that Matt had created an updated version of genTrans.pl
 * Started experiment 0089 to test it out.


 * Plan:


 * Concerns:

Week Ending April 30, 2013

 * Task:

(4/27/13)
 * Results:
 * Updated assignment sheets with assignments for everybody.
 * Experiment 0094 will be to test 0024's acoustic model with last_5hr/test.
 * Experiment 0095 will be to test 0089's acoustic model with mini/train.
 * Another task has been created to sort through the logs generated by the trainer in 0089 to determine the number and types of errors encountered.

(4/28/13)
 * Checked output of genTrans5.pl in experiment 0094.
 * The script is leaving in double quotes.
 * Removed double-quotes with sed.
 * We will need to debug genTrans5.pl
 * Since we found that genTrans5.pl produces a better transcript for creating an acoustic model, we should consider either creating a new language-model transcript-creation script based on genTrans5, or integrating that capability as a feature within genTrans6.
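The interim quote cleanup would look something like the following; the exact command used is not recorded in the log, and the file names here are illustrative:

```shell
# Strip the stray double quotes genTrans5.pl leaves in the transcript.
printf '"hello" there "world"\n' > 0094_train.trans   # toy transcript
sed 's/"//g' 0094_train.trans > cleaned.trans
cat cleaned.trans   # hello there world
```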

(4/29/13)
 * Read logs & monitored the group assignment page.
 * Assigned experiment 0096
 * In Experiment 0096, we will be testing the latest and greatest Acoustic and Language models from CMU themselves.
 * From my understanding, these are roughly equivalent to what we could expect from a full 308-hour train with no issues.
 * We hope to get a baseline as to what type of accuracy Sphinx is capable of.
 * The Experiment is running with the full CMU 0.7a Dictionary without stress indicators.
 * This is because both the LM and AM were created with this dictionary; using a dictionary with stress indicators would result in a phone mismatch, preventing a successful decode.
 * Helped Brian D. get the experiment up and running.
 * The decode execution script needed some modifications to point to the right models; other than that, the decode was able to start without major problems.
 * The decode is progressing more slowly than normal due to the size of the models and the dictionary. We will let this decoder run until the morning.

(4/30/13)
 * Compared the results of all experiments.
 * Experiment 0096 had a very poor word error rate of 72.0.
 * After doing some research, it turns out the models were created with high-quality (16 kHz) audio data; since we are using telephone-quality (8 kHz) data, this can result in a very poor decode.
 * In experiment 0094, using the acoustic model from experiment 0024 with last_5hr/test, we got an error rate of 58.1.
 * In experiment 0095, using the acoustic model from experiment 0089 with mini/train, we got an error rate of 75.5.
 * Based on the results of 0094 and 0095, I think the issue is still with genTrans.pl.


 * Plan:


 * Concerns:

Week Ending May 7, 2013
 * Task:
 * Increase the scores.

(5/2/13) (5/4/13)
 * Results:
 * Created a task list for a new experiment, Exp 0098.
 * This experiment will be to get the word error rate of a decode using the full 5 hour last_5hr/train corpus.
 * Read logs.
 * Looked into the issues gathered from experiment 0089 last week.
 * Most of the issues appear to be data-related.
 * We can't really make any adjustments to the Sphinx Trainer to improve this.
 * Saw that Matt had created a new version of GenTrans to resolve some issues with the wavefiles we observed.
 * Created task lists for experiment 0099, and 0100 to run a train and decode with this new script.
 * Plan:

(5/5/13)
 * Read Logs
 * Monitored team members.

(5/7/13)
 * Researched the results of the experiment this week.
 * Oddly enough, we got a worse error rate using genTrans6 than genTrans5.
 * There may be one or two reasons for this:
 * The interference caused by not cleanly cutting the audio may actually create superior models, as the models learn to account for such interference.
 * The Sphinx decoder is not looking at the individual channels but at the audio as a whole (i.e., mono). As some portions overlap, we may be getting errors where the overlapping speech is incomprehensible.
 * Concerns: