Speech:Spring 2016 Jonathon Shallow Log


 * Home
 * Semesters
 * Spring 2016
 * Proposal
 * Report
 * Information - General Project Information
 * Experiments - List of speech experiments

Helpful Links:

 * 1) Wiki markup guide
 * 2) Wiki-Markup
 * 3) Introduction to Speech Recognition
 * 4) http://people.inf.ethz.ch/jaggim/meetup/3/slides/ML-Meetup-3-Dixon.pdf
 * 5) CMU Pronouncing Dictionary
 * 6) http://www.speech.cs.cmu.edu/cgi-bin/cmudict
 * 7) SP16 Competition Report (Good Recap of Findings)
 * 8) http://pubpages.unh.edu/~jax472/capstone/Competition_Report/report.txt

03-Feb

 * Task:
 * Today we split into groups and had our first official meeting as the modeling group. Ben led the show thanks to his previous experience, which allowed us to hit the ground running. The task of the day was to correct some script errors Ben had found so that we could run our first experiment as a team.
 * Results:
 * We went through the process to run a train. It follows:
 * SSH into caesar, then SSH into a working machine
 * CD into /mnt/main/Exp
 * Make a directory corresponding to your experiment #:
 * mkdir 0281
 * Run the script to create the directory structure:
 * "/mnt/main/scripts/user/prepareTrainExperiment.pl switchboard first_5hr/train"
 * Generate Feats Data
 * "/mnt/main/scripts/user/generateFeats.pl"
 * Run the train
 * "nohup scripts_pl/RunAll.pl . &"
 * The nohup command is crucial: it allows the command to continue processing after you log out. There is no need to remain logged into the machine while you wait the X hours for the train to complete.
 * Debugging Script
 * The errors we received are listed on the experiments page under 0281
 * The errors were all user errors that we corrected after looking at the offending scripts and debugging them. We were passing in the wrong parameters in step 4 mentioned above
 * 4. Run the script to create the directory structure:
 * "/mnt/main/scripts/user/prepareTrainExperiment.pl switchboard first_5hr/train"
 * However the command we entered was:
 * /mnt/main/scripts/user/prepareTrainExperiment.pl switchboard /mnt/main/corpus/switchboard/train
 * As you can see, the second arguments vary. When we examined the prepareTrainExperiment.pl script, we realized that the second argument is relative to the "/mnt/main/corpus/" directory, where the switchboard directory is located.
 * After that was fixed the scripts ran as expected and we closed the session for the day while we waited for the experiment to run.
 * Plan:
 * Waiting for the train to complete so we can move onto the decode process.
 * Concerns:

04-Feb

 * Task:
 * Fix errors found in running scoring on experiment 0281-005
 * Results:
 * Previous train completed and we moved on to the decode/score phase. However we encountered an error while scoring.
 * Results are posted in the modeling log. We believe the issue was user error in the form of an improperly entered command while setting up the decode environment. This ultimately caused the score to fail because a required file did not follow the sub-experiment naming conventions. We are unable to resolve the issue at this time because we do not have individual user accounts to run the required commands.
 * Plan:
 * Establish individual user accounts on caesar so all members of the modeling group can participate.
 * Concerns:

05-Feb

 * Task:
 * Fix errors found in running scoring on experiment 0281-005 (continued)
 * Results:
 * Unable to rename the 0281/005/001_decode.fileids to the required 005_decode.fileids due to folder permission problems. This was because the 0281 experiment was created with a user name from last semester that is no longer available. Because of this I created experiment 0283, and we plan on using that as the modeling group's master experiment directory.


 * I initiated a train on 0283 sub-experiment 001. The train is currently in progress and should be available for the decode/score process around 1700 05 Feb 2016.
 * Plan:
 * Wait for 0283/001 to train and then attempt to decode/score and see if we receive the same results as 0281/005.
 * Concerns:
 * I did not make a 0283/001 directory prior to running the train. The fix should be to simply copy all the contents of 0283 into a 001 subdirectory after the train runs. However, the tutorial has not been updated to reflect this.

09-Feb

 * Task:
 * Attempt decode on 0283/001.
 * Results:
 * Fix 0283 directory
 * Created a 001 directory (I failed to do this before running the train, so all the 001 sub-experiment files were in the 0283 root directory), then moved all 001 files into the appropriate directory
 * Language Model Creation
 * from 0283/001 directory: $ mkdir LM
 * from 0283/001 directory: $ cd LM
 * from 0283/001/LM directory: cp -i /mnt/main/corpus/switchboard/first_5hr/train/trans/train.trans trans_unedited
 * I verified that train.trans existed in the above directory before entering the command
 * from 0283/001/LM directory: ls
 * verified trans_unedited was copied
 * Preparing transcript per tutorial (verified ParseTranscript.perl script was in corresponding directory)
 * 0283/001/LM $ /mnt/main/corpus/switchboard/dist/transcripts/ICSI_Transcriptions/trans/icsi/ParseTranscript.perl trans_unedited trans_parsed
 * The parsed transcript is printed to the screen; you will see the transcript of the conversation. trans_parsed should appear in the directory (ls to confirm)
 * Copy script to create the language model
 * 0283/001/LM $ cp -i /mnt/main/scripts/user/lm_create.pl .
 * verified script was in corresponding directory prior to entering command (don't forget the period at the end of the command!)
 * As of 15:57 I entered the above command and it hung after "Merging temporary files". There is nothing in the tutorial about how long this command takes to execute; recommend adding that.
 * Ignore the above line. I forgot the "cp -i" at the start of the command. I corrected myself, re-ran the command, and everything worked in a matter of a second or two.
 * Execute the script
 * 0283/001/LM $ ./lm_create.pl trans_parsed
 * Run the Decode
 * Create subset of _train.fileids in etc.
 * 0283/001 $ cd etc
 * $ head -1000 0283_train.fileids > 0283_decode.fileids
 * My file structure is not in line with the sub-experiment numbers we would expect; this is because I failed to create the 001 subset directory before running the train. I am guessing this is going to cause an error in running the decode.
 * If you ls the directory, you should notice the file naming convention: 001.dic, 001.filler, 001_decode.fileids, etc. You need to make sure the above command uses the appropriate convention. Because I ran the train in the 0283 directory, all my files had a 0283 prefix; that is why the above command has 0283_train.fileids > 0283_decode.fileids.
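The head step above can be sanity-checked on any machine. The sketch below fakes the fileids list with seq (the 0283 prefix matches my directory; a properly created sub-experiment would use the 001 prefix):

```shell
# Stand-in for the real fileids list: one utterance id per line.
seq -f "utt_%g" 5000 > 0283_train.fileids
# Take the first 1000 utterances as the decode subset.
head -1000 0283_train.fileids > 0283_decode.fileids
wc -l < 0283_decode.fileids   # 1000
```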
 * Set up decoding
 * $ cd ..
 * 0283/001 $ mkdir DECODE
 * $ cd DECODE
 * cp -i /mnt/main/scripts/user/run_decode.pl .
 * nohup run_decode.pl 0283 0283/001 1000
 * Again my failure to create the 001 directory prior to training is coming back to haunt me. The prefix in the tutorial is 001, which I believe corresponds to 0283 in my naming convention.
 * Getting a "Command not found" error when running "nohup run_decode.pl 0283 0283/001 1000". This doesn't make sense, as the run_decode.pl script is in the directory.
 * Sent a request out to team members to see if they can replicate the error. It may be worth cancelling this and making a new sub-experiment, ensuring we create the 002 directory before the train so that all naming conventions are in order with the tutorial.
 * Plan:
 * Started a new train 0283/002
 * Hopefully by creating the 002 directory and creating the train inside it, we will avoid the errors we encountered in 0283/001
 * 0283/002 training as of 17:54 09 Feb 16
 * Concerns:

15-Feb

 * Task:
 * I have been busy with a job at work that is keeping me from helping the team much this week. The job was scheduled prior to the start of the semester, and I accepted it because it fell in the second week, figuring this would be the less intensive portion of the semester. I am currently stuck in Kentucky due to the winter storm and am hoping to get back by class on 17 February.


 * I will be reviewing team member logs and other group logs today and tomorrow. I will also review the CMU Sphinx tutorial.
 * Results:
 * Plan:
 * Concerns:
 * I do not want to fall behind the curve or expect the modeling group to pick up my slack. I have coordinated my absence with the group and we are communicating via email/text. It doesn't seem to be a burden on them currently.

16-Feb

 * Task:
 * Read logs to gain a better understanding of the processes themselves and attempt to understand how to reduce the Word Error Rate (WER)
 * Results:
 * I read through the first few sections of the CMU tutorial. I had read them earlier, but they made more sense now that I have had more exposure to the Sphinx environment. It also helped me get an understanding of the vocabulary (i.e. what a phone, subword, triphone, etc. is); while this isn't crucial, I think it will make it easier to wrap my head around the big picture.
 * I read through the logs of Sam Sweet (Spring '15 Semester) and the Bruins team log from the same semester. The Bruins team log was very helpful as they archived their email traffic. This makes it very easy to see the configuration tweaks and what worked/didn't work with them. The group is running a 125 hr train/decode now with non-default configuration values.
 * Plan:
 * Concerns:
 * I need to gain a better understanding of what the individual configuration variables do. The time it takes to run an experiment means we can't just play around with different settings and see what they do. The time we have is limited and the process takes days to complete, so we need to research and make educated decisions regarding the settings.
 * I am also still stuck in Kentucky due to winter storms. I was supposed to be back on Sunday and it is now looking doubtful that I will be back before Friday. This will no doubt leave me behind the curve on what the modeling group is doing; however, they are documenting everything appropriately and we are staying in touch via email. I will be able to hit the ground running when I return and make up for the lost time.

21-Feb

 * Task:
 * Update Proposal


 * Results:
 * The following updates are proposed; working here and then getting group approval prior to copying and pasting to the wiki proposal


 * Overview
 * The role of the modeling group is to develop effective models to be used in the speech recognition process. There are three models used in speech recognition: acoustic, phonetic, and language.
 * Acoustic
 * An acoustic model contains acoustic properties for each senone. There are context-independent models that contain properties (the most probable feature vectors for each phone) and context-dependent ones (built from senones with context). The acoustic model is very complex and contains multiple variables that can be manipulated.
 * Phonetic
 * Phonetic dictionaries contain a mapping from words to phones. However, only two to three pronunciation variations are noted, and this degrades accuracy (although it is practical enough for most applications).
 * Language
 * The language model restricts the word search. It defines which words could follow previously recognized words. For example, if we recognize the word “Merry” we can assign a high probability to words such as Christmas, Hanukkah, or Holidays. A language model with a successful search space restriction will result in a lower WER.
 * Additionally, the modeling group is required to serve as subject matter experts (SMEs) in the speech recognition field. While this in no way takes away from the work of the other groups, we are expected to have an in-depth understanding and to serve as advisors to the other groups. This will require close collaboration with internal groups and external assets such as the CMU speech forum.


 * Goals:
 * The previous semesters have established a very respectable baseline that we hope to build upon with the following goals:
 * WER Reduction
 * First and foremost, we will replicate the previous semester's WER. This will be accomplished by reviewing the previous experiments, implementing their language model modifications, and recreating a successful decode on par with historical results.
 * Secondly, we will focus on researching, understanding, and tweaking the variables that affect the acoustic model. We have identified the following:
 * Variance Normalization
 * Senone Count
 * Density
 * Convergence Ratio
 * Language Weight
 * Possible changes to the dictionary will also receive attention
 * Advising
 * Our research into the technical understanding of speech recognition will allow us to provide advice and guidance to other groups. By sharing what we learn, we will increase the overall effectiveness of every group in pursuing the ultimate goal of reducing the WER.
 * Documentation
 * One of the major issues has been errors in, or a lack of, historical documentation. Addressing it will require the collaboration of all groups to update or build documentation in each of our specific focus areas. The end goal is to ensure that following semesters can minimize the learning curve and maximize the time spent improving the system.
 * Plan:
 * While we have the previously mentioned goals, it is difficult to determine a going-forward plan for how to accomplish them, due to our currently limited knowledge of the speech recognition software. Our current plan is as follows, though it is subject to change upon further research.
 * Modification of the Sphinx train configuration
 * Each week we will modify the variables in the acoustic model to what we hypothesize will achieve a lower WER.
 * We will compare the weekly decode results against our best baseline and determine which changes were most and least effective.
 * Continued Research
 * While trains are running, further research into the models will be conducted into improving the Real-Time Factor (RTF) of both training and decoding while avoiding compromising the WER.
 * Collaboration
 * We will continue to share our lessons learned with internal groups while reaching out to external assets.
 * We will help the Data Group with data verification and the Experiments Group with script verification.
 * Assign team leaders and develop a master contact list.
 * Documentation
 * Documentation will be reviewed and updated weekly.
 * A vocabulary section will be added to the tutorial and will be updated as new terms are encountered.
 * Plan on adding screen-shots of an actual train running, language model building, decoding, scoring, and other train configurations.
 * Assign a master editor for the wiki entries in the general areas of the capstone wiki to ensure accountability and accuracy of the material.
 * Sources
 * http://cmusphinx.sourceforge.net/wiki/tutorialconcepts#models
 * http://cmusphinx.sourceforge.net/wiki/tutorialadapt

23-Feb

 * Task:
 * Continue research using the CMU Sphinx Tutorial and the CMU Sphinx Forum
 * Results:
 * Registered for the forum. Using the search function for any of the keywords relating to the project, models, vocabulary, or WER reduction techniques
 * Real-Time Factor
 * It appears we can configure parallel jobs to possibly speed up training. I have not checked whether this is currently being utilized.
 * Following changes are to the sphinx_train.cfg file in the /etc directory
 * Src: Training Acoustic Model

If you are on a multicore machine or in a PBS cluster you can run training in parallel; the following options should do the trick:

$CFG_QUEUE_TYPE = "Queue";

Change the type to "Queue::POSIX" to run on multicore. Then change the number of parallel processes to run:

$CFG_NPART = 1;
$DEC_CFG_NPART = 1;   # Define how many pieces to split decode in

If you are running on an 8 core machine, start around 10 parts to fully load the CPU during training.
 * 1) Queue::POSIX: for multiple CPUs on a local machine
 * 2) Queue::PBS: to use a PBS/TORQUE queue
 * $CFG_NPART: how many parts to run Forward-Backward estimation in
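Based on the documentation quoted above, the relevant sphinx_train.cfg lines would look something like the fragment below. The part counts here are illustrative assumptions following the "8 core machine, ~10 parts" guidance, not values we have tested:

```perl
# sphinx_train.cfg fragment (illustrative values, per the CMU docs quoted above)
$CFG_QUEUE_TYPE = "Queue::POSIX";   # run parallel jobs on a multicore machine
$CFG_NPART      = 10;               # pieces to split Forward-Backward training into
$DEC_CFG_NPART  = 10;               # pieces to split the decode into
```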


 * Acoustic Model Types (src: Acoustic Model Types)
 * There are three types of acoustic models: PTM, semi-continuous, and continuous. A mixture of Gaussians is used to compute the score of each frame; the difference between the models is how the mixture of Gaussians is built.
 * Continuous
 * Every senone has its own set of Gaussians. The total number in the model is about 150,000.
 * Semi-Continuous
 * Has 700 Gaussians total, far fewer than continuous; they are reused with different mixtures to score the frame. Fast, but a bit less accurate.
 * PTM
 * The "golden middle": 5000 Gaussians. Provides better accuracy than semi-continuous but is still fast enough to be used in most applications (accuracy is almost the same as the continuous model!)

Support for phonetically-tied mixture acoustic models has been added to the Subversion repository for SphinxTrain, Sphinx3, and PocketSphinx. Briefly, phonetically-tied mixture models are somewhere between semi-continuous and fully-continuous models, offering most of the speed of the former combined with the ability of the latter to effectively use large amounts of training data. Parameter settings for training PTM models are present in the template sphinx_train.cfg file created by SphinxTrain, and can be enabled by setting $CFG_HMM_TYPE to “.ptm.”. The development version of PocketSphinx will automatically recognize PTM models, while Sphinx3 requires you to add “-senmgau .ptm.” to the command line. We have made PTM models for English and Mandarin available for download on the SourceForge downloads page. These have not been extensively optimized, but the English models, at least, already offer better performance than comparable fully-continuous models. Compressed and optimized versions of these in 8k bandwidth will be released with PocketSphinx 0.6.
 * $CFG_FINAL_NUM_DENSITIES
 * 0283/004 used 64. 32 is recommended for continuous models with more than 100 hours of data.
 * If switching to PTM, this should be set to 256
 * $CFG_N_TIED_STATES
 * The number of senones to train in a model. The more senones, the more precisely it discriminates sounds. With too many senones, the model will not be generic enough to recognize unseen data (higher WER on unseen data).

Vocabulary   Hours in db   Senones   Densities   Example
20           5             200       8           Tidigits Digits Recognition
100          20            2000      8           RM1 Command and Control
5000         30            4000      16          WSJ1 5k Small Dictation
20000        80            4000      32          WSJ1 20k Big Dictation
60000        200           6000      16          HUB4 Broadcast News
60000        2000          12000     64          Fisher Rich Telephone Transcription

For semi-continuous and PTM models, use a fixed number of 256 densities.


 * Plan:
 * To be completed:
 * RTF
 * Determine if our servers can handle running parallel jobs and if our scripts are configured properly
 * Verify the acoustic model types we are using. Determine if Sphinx3 supports PTM models. Test to determine the differences in RTF and WER between semi-continuous and PTM models (note: if training with semi-continuous or PTM models, use 256 Gaussians for $CFG_FINAL_NUM_DENSITIES)
 * Concerns:
 * Acoustic Model
 * using 0283/003/etc/sphinx_train.cfg
 * $CFG_QUEUE_TYPE is set to "Queue::POSIX" as we would expect for parallel jobs. However, I do not see the $CFG_NPART or $DEC_CFG_NPART variables. The documentation says to set these to determine the number of parallel processes to run. Either I am not seeing them, or they are not declared, which could mean they default to 1 (which I'm assuming wouldn't split the jobs at all)
 * We are also using a continuous model. Suggest changing to PTM if Sphinx3 supports it. Documentation states this is faster with approximately the same accuracy as continuous

24-Feb

 * Task:
 * Increase vocab to 30,000. Run another train/decode to compare results.


 * Results:
 * Class day. As a class we discussed the proposal. There were some problems we needed to work out, primarily the lack of consistency in formatting and differences between 3rd person and 1st person writing in some groups. I volunteered to be the "master editor" which sounds a lot fancier than it really is. I am really just responsible for ensuring the formatting and writing tense is consistent throughout all groups.


 * We reviewed the results of the last score/decode (0283/004). That resulted in a WER of 33.9% (1.5% improvement).
 * Using documentation: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html#text2wfreq and lm_create.pl
 * Realized that on line 31 of lm_create.pl we can add "-top 30000" to increase the vocab from the default of the 20,000 most commonly used words to 30,000. Note that "-top 20000" is not included in the script because wfreq2vocab defaults to 20,000


 * Plan:
 * Write script to sort tmp.wfreq into an ascending list of word-frequency pairs by the frequency.


 * Concerns:

25-Feb

 * Task:
 * Write script as noted in 24-Feb Plan
 * Results:
 * Wrote a quick Perl script to achieve the above plan. Of note: there are approximately 19,000 instances of words with a frequency of less than 10. This is important because when we run lm_create.pl there is a parameter '-gt' that defaults to 10; this includes only the words with a minimum frequency of 10 in our tmp.vocab file, omitting approximately 4500 instances from the tmp.wfreq file.
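The effect of the cutoff can be illustrated with awk on a toy word-frequency file (same two-column word/count layout as tmp.wfreq). The awk lines here are just a stand-in for the filtering that happens inside wfreq2vocab, and whether the real '-gt' comparison is > or >= is worth verifying against the toolkit docs:

```shell
# Toy tmp.wfreq: word followed by its count.
printf 'the 500\nhello 12\nrare 3\nodd 9\nyes 10\n' > tmp.wfreq
# A cutoff of 10 keeps only words seen at least 10 times...
awk '$2 >= 10' tmp.wfreq > vocab_gt10
# ...while lowering the cutoff to 5 admits more words into the vocabulary.
awk '$2 >= 5' tmp.wfreq > vocab_gt5
wc -l < vocab_gt10   # 3
wc -l < vocab_gt5    # 4
```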
 * Plan:
 * Further research the implications of this. Plan to adjust the '-gt' parameter (set to 5) for the next train to analyze its effect on the decode/score process and the vocabulary/dictionary.
 * Concerns:

27-Feb

 * Task:
 * Meet with group to discuss 0283/005 results and work on the proposal.
 * Results:
 * 0283/005 resulted in a 33.0% WER, a 0.9% improvement. The vocab file also included 23,569 words (up from 20,000 in 0283/004), due to our adding the -top 30000 parameter to the lm_create.pl file.
 * Worked on standardizing the format of the proposal.
 * Plan:
 * 0283/006 will be started with the purpose of evaluating the '-gt' parameter. It will be set to 5, which should include all words from tmp.wfreq with a minimum frequency of 5 (approximately 3,000 more than would be included with the default value of 10 used in the previous trains).
 * Concerns:

1-Mar

 * Task:
 * Read further into the LM documentation
 * Results:
 * Realized there is a Linux command to sort the tmp.wfreq file, as opposed to writing a Perl script to do it.

sort -n -k 2 tmp.wfreq > sorted.txt
 * sort is the main command
 * -n sorts numerically (rather than lexicographically)
 * -k is for "key"; the following 2 selects the second field (or token) in each line to sort by
 * tmp.wfreq is the file to sort
 * > is the output redirection symbol; it writes the command's results to a file
 * sorted.txt is the file name you wish to write to. If the file does not exist, it will be created.
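The sort behavior is easy to check on a toy file with the same two-column layout as tmp.wfreq:

```shell
# Toy two-column word-frequency file, like tmp.wfreq.
printf 'apple 30\nbanana 2\ncherry 100\n' > tmp.wfreq
sort -n -k 2 tmp.wfreq > sorted.txt   # ascending by the numeric second field
head -1 sorted.txt                    # banana 2  (least frequent word first)
```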


 * Plan:
 * Concerns:

2-Mar

 * Task:
 * Re-run trains on 256 hour data.


 * Results:
 * After discussions in class, it became apparent that our training on 125 hr data would yield better results, regardless of our inputs, than training on 256 hours of data. We also realized, after discussions with the data group, that there is a significant error in the corpus data (a minimum of 5% invalid data). After our discussion, we decided to create a new corpus containing only data verified by the data group. This gives us approximately 32 hours of verified data. We created a new corpus (fixed_30k) with the valid data and will run a train using the 0271/003 configuration (best results from the previous semester) on our new data (although it should be noted that their configuration was tuned for a much larger corpus).
 * Ran into a few errors due to the new directory changes not being reflected in /mnt/main/scripts/user/prepareTrainExperiment.pl. Created a temporary prepareTrainExperiment2.pl script with the updated paths; currently running a train on our new corpus.
 * Updated the tutorial.
 * Added the trailing '&' for the run decode example
 * Removed the '.' in step 6 of the "run a train" tutorial


 * Plan:
 * We used the config file from 0271/003. This is tuned for a larger corpus; however, it was also used for a corpus with data we have recently found to be invalid.


 * Concerns:

3-Mar

 * Task:
 * Decode/Score 0283/007
 * Start train on 0283/008


 * Results:
 * Scored 0283/007. see Here
 * Started 0283/008. see Here
 * Made two changes to the tutorial:
 * Decode on Trained Data - added "./" before run_decode.pl, without it the script will fail.
 * Run Train Setup - Under section 5: Generate Feats, added a quick clarification that generateFeats.pl needs to be run from the top level of the sub-experiment.


 * Plan:
 * Decode/score 0283/008 when it is complete.


 * Concerns:

4-Mar

 * Task:
 * Decode/Score 0283/008


 * Results:
 * Decode complete on 0283/008. WER of 43.7%. This is disappointing; we expected a much better WER considering we were training with known good data and the recommended configuration settings from CMU. Will continue to investigate.


 * Plan:
 * Continue to investigate WER. Come up with plan for 0283/009
 * Decode/score our new train (0283/007) when it is completed (Est. 1pm 3-March)


 * Concerns:
 * Decode only took approximately 2 hours. We expected it to take much longer. This, combined with the WER of 43.7%, could be a sign of an error somewhere.

7-Mar

 * Task:
 * Read current logs/recent email traffic regarding possible errors in corpus .sph size.


 * Results:
 * James has a good writeup in his log regarding the current findings of the email traffic.


 * Plan:
 * Assist James and the data group in rectifying the data issues.


 * Concerns:
 * While continuing research into modeling is important, it is useless to continue experimenting on data that is not verified. Will meet up with data group during class on the 9th and determine if we need to divert some modeling group assets to assist them.

Week Ending March 22, 2016

 * Task:
 * 9-March
 * Continue to investigate corpus data issues
 * 10-March
 * Review email traffic regarding setting up new corpus
 * Review switchboard corpus documentation
 * Review Sox Documentation
 * Review email traffic
 * Meet with James, write script to generate new utterance audio files
 * 14-March
 * Meet with James and Ryan to work on new corpus structure
 * Write script to create new corpus directory structure
 * Create utterances for new corpus
 * 18-March
 * Meet with James, continue work on his genUtts.pl, expand the makeCorpus.pl


 * Results:
 * 9-March
 * To accurately determine the length of an audio file from the file size I found this:

bitrate = bitsPerSample * samplesPerSecond * channels

So in this case, for stereo, the bitrate is 8 * 44100 * 2 = 705,600 bits per second. To get the file size, multiply the bitrate by the duration (in seconds) and divide by 8 (to get from bits to bytes):

fileSize = (bitsPerSample * samplesPerSecond * channels * duration) / 8

So in this case, 30 seconds of stereo will take up (8 * 44100 * 2 * 30) / 8 = 2,646,000 bytes.

src: http://stackoverflow.com/questions/13556265/how-to-caluclate-audio-file-size
 * With the above in mind, I discovered the file size also includes a 1024 byte header. src: http://www.ee.columbia.edu/ln/LabROSA/doc/HTKBook21/node64.html | https://www.isip.piconepress.com/projects/speech/software/tutorials/production/fundamentals/v1.0/section_02/s02_01_p04.html. The .sph files in the /mnt/main/corpus/switchboard/fixed_30k/train/audio conv and utt directories have 1024 byte headers. The differences between the /conv/*.sph and /utt/*.sph headers are shown below:

/conv/*.sph
NIST_1A
   1024
conversation_id -s4 2001
database_id -s25 Switchboard-1_release-2.0
channel_count -i 2
sample_coding -s4 ulaw
channels_interleaved -s4 TRUE
sample_count -i 2018387
sample_rate -i 8000
sample_n_bytes -i 1
sample_sig_bits -i 8
end_head

/utt/*.sph
NIST_1A
   1024
sample_count -i 84670
sample_n_bytes -i 1
channel_count -i 1
sample_byte_format -s1 1
sample_rate -i 8000
sample_coding -s4 ulaw
end_head


 * Sample_rate: sample rate in Hz
 * Sample rate is the number of samples of audio carried per second, measured in Hz or kHz (one kHz being 1,000 Hz). For example, 44,100 samples per second can be expressed as either 44,100 Hz or 44.1 kHz. Bandwidth is the difference between the highest and lowest frequencies carried in an audio stream
 * Sample_n_bytes: number of bytes in each sample. (1 byte per sample, 8000 samples per second)
 * Sample_count: number of samples in file
 * So filesize of /conv/*.sph is sample_count * channel_count = 2018387 * 2 = 4036774
 * add in our 1024 header = 4036774 + 1024 = 4037798
 * ls -l confirms this is the file size, which confirms the 1024-byte header.
 * Filesize of /utt/*.sph is sample_count * channel_count = 84670 * 1 = 84670
 * add in our 1024 header = 84670 + 1024 = 85694
 * ls -l confirms this again. This confirms that both /utt and /conv .sph files have headers of 1024 bytes
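The two size checks above reduce to one line of shell arithmetic each (numbers copied from the headers above):

```shell
# /conv/*.sph: 2018387 samples x 2 channels x 1 byte each, plus the 1024-byte header.
echo $(( 2018387 * 2 * 1 + 1024 ))   # 4037798
# /utt/*.sph: 84670 samples x 1 channel x 1 byte each, plus the header.
echo $(( 84670 * 1 * 1 + 1024 ))     # 85694
```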


 * To determine length of the audio file, simply divide sample count by sample rate:
 * using /utt/*.sph header above: sample count / sample rate = 84670 / 8000 = 10.58375
 * to verify that, I found the file in train.trans:

grep sw2001A-ms98-a-002 train.trans

sw2001A-ms98-a-0002 0.977625 11.561375 hi um yeah i'd like to talk about how you dress for work and and um what do you normally what type of outfit do you normally have to wear

 * taking the end time - the start time = 11.561375 - 0.977625 = 10.58375
 * this confirms the above: the times in train.trans (at least for this one file) match the utterance length.
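Both duration checks can be reproduced with awk, since shell arithmetic is integer-only (the numbers are the ones from the header and transcript above):

```shell
# Duration from the utterance header: sample_count / sample_rate.
awk 'BEGIN { printf "%.5f\n", 84670 / 8000 }'           # 10.58375
# Duration from the transcript times: end - start.
awk 'BEGIN { printf "%.6f\n", 11.561375 - 0.977625 }'   # 10.583750
```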


 * 10-March
 * Read emails
 * Read switchboard documentation https://catalog.ldc.upenn.edu/docs/LDC97S62/swb1_manual.txt
 * Met with prof Jonas to discuss new corpus structure
 * Reviewed SoX documentation http://sox.sourceforge.net/sox.pdf
 * Script
 * Using some peer programming techniques, James and I started writing createUtts.pl, the script that will eventually generate the new audio files. James has a great write-up on it:

Pseudocode: Takes in a transcript file (i.e. train.trans) and generates utts from the conversation audio files in /mnt/main/corpus/switchboard/dist/flat

Usage: createUtts.pl /absolute/path/to/train.trans /absolute/path/to/the/directory/you/want/the/utts/in

    Get arguments
    Open file
    Loop
        Successively read each line
        Throw the full file name into a variable
        Throw a formatted file name (i.e. sw2345, with a 0 added after the w: sw02345) into a variable
        Throw the start time into a variable
        Throw the end time into a variable
        Throw the diff between end time and start time into a variable
        Use the sox command like so:
            sox /mnt/main/corpus/dist/flat/<formatted file name> <directory you want the utts in> [start time] [end time - start time]
    End Loop
    Close file

The script was about 90% complete when I had to leave.


 * 14-March
 * Wrote makeCorpus.pl to automate creating new corpus directory structure
 * James ran the createUtts script at 2:37pm
 * I started a train 0283/009 at 2:59pm


 * 18-March
 * Ran createUtts.pl on the full transcript. It took approximately two hours. After running some grep checks for negative numbers, we concluded that 256 of our utterances were bad data (negative durations). Word counts on the transcripts verified this. Unfortunately, the log files from the createUtts.pl script do not list the conv files they belong to.
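A minimal version of that negative-duration check, assuming transcript lines of the form "id start end words..." (the train.trans layout shown by the grep output in the 9-March entry):

```shell
# Toy transcript: utterance id, start time, end time, then words.
printf 'sw02001A-1 0.5 3.2 hello there\nsw02001A-2 4.0 3.1 bad line\n' > toy.trans
# Flag any utterance whose end time is not after its start time.
awk '$3 <= $2 { print $1 }' toy.trans   # sw02001A-2
```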
 * Expanded makeCorpus.pl. Previously it only built the directory structure for a new corpus. Added functionality to automatically populate the directory with the appropriate utterance files and train.trans file, based on a command argument giving the desired number of hours for the corpus.


 * 20-March
 * James completed the changes to the createUtts.pl log functionality such that the new logs would include the conv files the utterances were pulled from. We had a brief phone/text conversation and discovered 3 conv files were missing, which accounts for all 256 missing utterances.


 * Plan:
 * 10-March
 * I will be out of the loop for the next few days. James is going to finish the script and test it. I will be in contact with him via cell phone.
 * 18-March
 * Update createUtts.pl to include conversation files in the conv logs so we can track down what conv files are bad.
 * Discuss with Jonas whether he wants to add random-sampling functionality to makeCorpus.pl. There have been talks about this but I am unsure as to exactly what is desired.
 * 20-March
 * Either find the missing conv files or remove the offending utterances from the full transcript. Awaiting guidance from Jonas.


 * Concerns:

23-March

 * Task:
 * Find missing conv files.
 * Expand makeCorpus.pl to automatically build the test/dev.trans, test/eval.trans, and test/train.trans files.


 * Results:
 * They are supposed to be on disks 3 and 22. Disk 3 is missing sw02289.sph. sw04361.sph & sw04379.sph are not on disk 22.
 * To demonstrate the corpus/test/trans folder structure:
 * Dev.trans is a random sample across the corpus; it is independent of corpus/train/train.trans (the utterances in dev.trans will not appear in corpus/train/train.trans).
 * Eval.trans is also a random independent sample across the corpus.
 * Test/train.trans is a random dependent sample across the corpus; the utterances in test/train.trans will also appear in train/train.trans.

24-March

 * Task:
 * Develop corpus building method
 * Remove missing utterances from train.trans.


 * Results:
 * Wrote linkTransAudio.pl, sampleTrans.pl and altered makeCorpus.pl
 * linkTransAudio goes through the transcript and creates links to the utterances in the utt src directory. This must be run from inside an utt directory.
 * sampleTrans.pl [-r] creates samples of every nth line of the transcript, where n is passed as a command argument. It creates a new file called train.trans-sampled with the output. With the [-r] option, every nth line goes into train.trans-sampled and a second file (train.trans-remaining) will contain the complement of the sampled file. This is how the dev.trans and eval.trans are pulled out of the original transcript.
 * makeCorpus.pl simply creates the directory tree, nothing more.
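The every-nth-line split that sampleTrans.pl -r performs can be sketched with awk (a simplified stand-in for the actual Perl script):

```shell
d=$(mktemp -d); cd "$d"
printf 'u1\nu2\nu3\nu4\nu5\nu6\n' > train.trans
# Every 3rd line goes to -sampled, everything else to -remaining (the complement)
awk 'NR % 3 == 0 { print > "train.trans-sampled"; next }
                 { print > "train.trans-remaining" }' train.trans
cat train.trans-sampled
```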


 * The process to create a new corpus is as follows:
 * Run makeCorpus.pl 
 * Copy a transcript file to /info/misc
 * CD into /info/misc
 * Sample Transcripts:
 * Run sampleTrans.pl -r n train.trans
 * This will create train.trans-sampled (every nth line) and train.trans-remaining (the remainder)
 * Rename train.trans-sampled to dev.trans and move it into the /test/trans directory
 * Rename train.trans to train.trans-orig1 (archiving the untouched train.trans file)
 * Rename train.trans-remaining to train.trans (allows us to repeat a sample on the trans file that has the dev.trans lines removed from it)
 * Run sampleTrans.pl -r n train.trans
 * This will create train.trans-sampled (every nth line) and train.trans-remaining (the remainder)
 * Rename train.trans-sampled to eval.trans and move it into the /test/trans directory
 * Rename train.trans to train.trans-orig2 (archiving again)
 * Rename train.trans-remaining to train.trans (allows us to repeat a sample on the trans file that has both the dev.trans and eval.trans lines removed from it)
 * Run sampleTrans.pl n train.trans (NO -r here)
 * This will create only a train.trans-sampled file; no train.trans-remaining will be created
 * Move train.trans-sampled to /test/trans and rename it train.trans
 * Copy /info/misc/train.trans to /train/trans/train.trans (this is the trans file remaining after all our samples; it is what we will use for the trains)
 * Create Links to utterances
 * The train/audio/utt files
 * CD into train/audio/utt
 * Run linkTransAudio.pl <path to train/trans/train.trans> <path to src utterances (such as /mnt/main/corpus/switchboard/full/train/audio/utt/)>
 * ls afterward to verify you have good links
 * The test/audio/utt files
 * Repeat the same process as above 3 times: once each for eval.trans, dev.trans, and train.trans


 * Fixing utterances
 * We came to the conclusion to remove all instances of sw02289, sw04361, and sw04379 from full/trans/train.trans.
 * Ran this script


#!/usr/bin/perl
# Remove all utterances from conversations sw02289, sw04361, and sw04379

$file    = $ARGV[0];
$fileout = "file.txt";

open FIN, "<", $file or die "Can not open file!\n";
open my $fout, '>', $fileout or die "Error\n";

$count = 1;
while (my $entry = <FIN>) {
    # Fill array with items in entry (i.e. file name, start time, etc.)
    @entryItems = split ' ', $entry;
    # Copy full file name (i.e. sw3041A-ms98-a-0002)
    $fullFileName = $entryItems[0];
    # Create a formatted file name to find in the flat directory
    $part1FileName = substr $fullFileName, 0, 2;   # sw   -- using the example full file name above
    $part2FileName = substr $fullFileName, 2, 4;   # 3041
    $formattedFileName = $part1FileName . "0" . $part2FileName;   # sw03041

    if ($formattedFileName eq "sw02289" or $formattedFileName eq "sw04361" or $formattedFileName eq "sw04379") {
        # print "$entry\n";
        print "$count\n";
        $count += 1;
    } else {
        print {$fout} $entry;
    }
}
close FIN;
close $fout;


 * renamed the full/trans/train.trans to full/trans/train-old.trans
 * moved my fixed_train.trans to full/trans/train.trans

[jax472@caesar trans]$ ls -l
total 51156
drwxr-xr-x. 3 root root       4096 Apr 12  2015 tmp
-rw-rw-r--. 1 root cis790 26203468 Aug 13  2013 train-old.trans
-rw-r--r--. 1 root root   26172492 Mar 24 15:33 train.trans
 * Note the 256 line count difference between train-old.trans and train.trans.


 * Making 150hr corpus
 * Using Sudo:
 * Creating directory structure, copying full trans into it
 * cd /mnt/main/corpus/switchboard
 * perl /mnt/main/scripts/user/makeCorpus.pl 150hr
 * cd 150hr/info/misc
 * cp /mnt/main/corpus/switchboard/full/train/trans/train.trans .
 * Sampling train.trans to get ~160hr trans (starting with the full 311hr transcript, working in 150hr/info/misc)
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * 311.761 (full 311 hour trans confirmed)
 * perl /mnt/main/scripts/user/sampleTrans.pl -r 2 train.trans
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled
 * 156.072 hrs
 * mv train.trans train.trans-full (archive full transcript as train.trans-full)
 * rm train.trans-remaining (throwaway extra sampling info, not needed)
 * mv train.trans-sampled train.trans (sampled trans to work with)
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * 156.072 (confirms train.trans is the correct sampled trans)
 * Sampling our ~160hr trans to get eval (~5hr, removed), dev (~5hr, removed), test/trans/train.trans (~5hr, not removed), and train/trans/train.trans (the remainder, ~150hr).
 * perl /mnt/main/scripts/user/sampleTrans.pl -r 30 train.trans
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * 156.072 (will be train.trans-old1)
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled
 * 5.2679 (will be dev.trans)
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans-remaining
 * 150.804 (will be new train.trans)
 * mv train.trans train.trans-old1 (archiving 156hr train.trans)
 * mv train.trans-sampled ../../test/trans/dev.trans (5.2679 file now dev.trans in test/trans)
 * mv train.trans-remaining train.trans (prepare to sample leftover 150.804 hr)
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * 150.804 (confirms train.trans is the leftover trans, ready to sample and remove eval)
 * perl /mnt/main/scripts/user/sampleTrans.pl -r 30 train.trans
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled
 * 4.93205 (will be eval.trans)
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans-remaining
 * 145.872 (will be the new train.trans with the eval.trans samples removed)
 * mv train.trans-sampled ../../test/trans/eval.trans (sample now test/trans/eval.trans)
 * mv train.trans train.trans-old2 (archiving again)
 * mv train.trans-remaining train.trans (now working with remaining 145.872)
 * perl /mnt/main/scripts/user/sampleTrans.pl 30 train.trans (sampling WITHOUT removing)
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled
 * 4.95455 (will be test/trans/train.trans)
 * mv train.trans-sampled ../../test/trans/train.trans (sample now /test/trans/train.trans)
 * awk '{total += $3 - $2} END {print total / 3600}' train.trans
 * 145.872 (confirming what will be our train/trans/train.trans)
 * cp train.trans ../../train/trans/train.trans (copying train.trans into train/trans)
 * ls (files left in info/misc)
 * train.trans
 * train.trans-full
 * train.trans-old1
 * train.trans-old2
 * Verifying test files
 * cd ../../test/trans/
 * ls
 * dev.trans (awked = 5.2679)
 * eval.trans (awked = 4.93205)
 * train.trans (awked = 4.95455)
 * Creating utterance links
 * cd ../audio/utt (going into test/audio/utt)
 * perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/dev.trans /mnt/main/corpus/switchboard/full/train/audio/utt/ (creating links for dev.trans to full corpus utts)
 * ls -l | wc -l
 * 4174
 * perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/eval.trans /mnt/main/corpus/switchboard/full/train/audio/utt/ (creating links for eval.trans to full corpus utts)
 * ls -l | wc -l
 * 8208
 * perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/ (creating links for test/trans/train.trans to full corpus utts)
 * ls -l | wc -l
 * 12106 (final number of utts in test/audio/utt)
 * Creating utterance links for 150hr/train/audio/utt
 * cd ../../../train/trans (now in 150hr/train/trans)
 * ls
 * train.trans (awked = 145.872)
 * cd ../audio/utt (now in 150hr/train/audio/utt)
 * ls
 * empty
 * perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/ (creating links for train/trans/train.trans to full corpus utts.)
 * ls -l | wc -l
 * 116959
 * exit (done with sudo)
 * 150hr corpus successfully built

Starting train on 150hr:
 * cd /mnt/main/Exp/0283
 * mkdir 014
 * cd 014
 * prepareTrainExperiment.pl switchboard 150hr/train

[jax472@caesar 014]$ prepareTrainExperiment.pl switchboard 150hr/train 014
Creating directory structure...done!
Modifying sphinx_train.cfg...done!
rm: cannot remove `wav/*.sph': No such file or directory
Linking to utterance files... done!
Preparing data input files...
genTransMJ.pl 100% completed. processing. done
genTrans.pl 100% complete!
Generating dictionary file.../mnt/main/corpus/switchboard/dist/dict/custom/master.dic
Processing 26194 words against dictionary...
Added 3 files to add.txt
Created 014.dic done!
Replacing filler words in transcript...done!
Generating filler dictionary...done!
Generating phones list...etc/014 done!
Preparation complete!
 * The Internet died for about 5 minutes during this process, so I removed all files from 0283/014 and restarted to avoid any conflicts this may have caused.

[jax472@caesar 014]$ prepareTrainExperiment.pl switchboard 150hr/train 014
Creating directory structure...done!
Modifying sphinx_train.cfg...done!
rm: cannot remove `wav/*.sph': No such file or directory
Linking to utterance files... done!
Preparing data input files...
genTransMJ.pl 100% completed. processing. done
genTrans.pl 100% complete!
Generating dictionary file.../mnt/main/corpus/switchboard/dist/dict/custom/master.dic
Processing 26194 words against dictionary...
Added 3 files to add.txt
Created 014.dic done!
Replacing filler words in transcript...done!
Generating filler dictionary...done!
Generating phones list...etc/014 done!
Preparation complete!
 * nohup generateFeats.pl & (took ~10 minutes)
 * top (confirm no trains are running on caesar)
 * nohup scripts_pl/RunAll.pl & (train running 9:33pm 24-March)


 * Plan:
 * Verify the working sox files and the new corpora creation scripts with a train.

25-March

 * Task:
 * Score 014
 * Search log to determine why WERs are so high


 * Results:


 * 010/013/014 are inconsistent with the training data. The 10-line experiments were testing the train process on the new utts; 013/014 were full trains on the new utts; 009 and previous were on the previous semester's utts. Definitely a discrepancy caused by the new utts. Looking to see if I can find some documentation on how the hour count in the .html files under module 00 is calculated.
 * It does show we have been lax about reviewing the .html files and the corresponding logs. Definitely something we need to improve upon.

27-March

 * Task:
 * Run experiment testing new utterances with senone count and density values configured to the appropriate values.


 * Results:
 * Exp 0283/016
 * $CFG_N_TIED_STATES set to 8000 (senones)
 * $CFG_FINAL_NUM_DENSITIES set to 32 (density)

28-March

 * Task:
 * Create LM and Decode/Score 0283/016


 * Results:
 * Exp 0283/016
 * Created the LM as shown in the tutorial with one change:
 * modified lm_create.pl line 31 from
 * system( $folder."wfreq2vocab <tmp.wfreq> tmp.vocab" );
 * to system( $folder."wfreq2vocab -top 30000 <tmp.wfreq> tmp.vocab" );
 * Ran into some confusion with the 016_decode.fileids. The 016_train.fileids was across the whole 145hr corpus (116958 lines); James said we want to pull the fileids from 145hr/test/trans/train.trans. The tutorial hasn't been updated to reflect our new corpus. I'm 95% certain I understand what's going on, but I will clarify with James and we will update the tutorial as needed.
 * The command to grab the fileids from test/trans/train.trans (~5hr randomly sampled, without removal, from 145hr/train/trans/train.trans) is: awk '{print $1}' /mnt/main/corpus/switchboard/145hr/test/trans/train.trans >> ./016_decode.fileids
 * This grabs the utterance file ids from the 5hr sample and puts them in the decode.fileids file used for the decode. We do this because decoding on a 5hr random sample across the whole corpus gives us better insight into the whole corpus than just running the first X lines of 145hr/train/trans/train.trans (which would only cover the beginning of the corpus).


 * Score
 * 30.2% WER. Best result yet. We had several utterances with a 0% WER, but we also had some as high as 2000% WER. Unclear how this is even possible.

31-Mar

 * Task:
 * Run train to determine if system was affected by accidental emacs install.


 * Results:
 * Train 0283/017 started


 * Plan:


 * Concerns:

06-Apr

 * Task:
 * Class
 * Find command line SCLITE documentation in hopes we can produce a scoring.log for individual utterances. The intent is to help the data group verify newly generated utterances.


 * Results:
 * http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/options.htm#report_options_0
 * SCLITE has a -o (output) option that takes multiple arguments. "lur" is the flag that produces a Labeled Utterance Report.


 * Plan:
 * Concerns:

07-Apr

 * Task:
 * Score/decode team cap 0288/003
 * Investigate how to generate feats on a corpus, not on an experiment
 * Organizational work for team cap


 * Results:
 * 0288/003 results secret. Will be published after the competition.
 * Determined that the sphinx_train.cfg file is required for generating feats. Will have to create an experiment on the full corpus (a throwaway experiment; we just want the sphinx_train.cfg file). generateFeats also needs the ctl file, which is the XXX_train.fileids. This contains the file names of the utterances; it will be easy to build with a simple one-line awk command.
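That one-liner might look like the following, assuming the first whitespace-separated field of each trans line is the utterance file id (the transcript contents and experiment number 005 are illustrative):

```shell
d=$(mktemp -d); cd "$d"
printf 'sw2345A-0001 0.0 2.5 hello\nsw2345A-0002 3.0 4.1 yes\n' > train.trans
# Keep only the first field of each line: the utterance file id
awk '{ print $1 }' train.trans > 005_train.fileids
cat 005_train.fileids
```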


 * Plan:
 * Generate feats on a small (first_5hr) test corpus in my home directory. When I am confident in the process I will do it on the full corpus.


 * Concerns:

08-Apr

 * Task:
 * Score/decode team cap 004/005
 * Organizational work for team cap


 * Results:
 * 003/004 results secret. Will be published after the competition.
 * Attempted a decode on unseen data. It failed. Examination of decode.log showed a fatal error: it appears the feats for the unseen data are not generated in the experiment directory. This makes sense, as dev.trans and eval.trans are removed from the main train.trans (that is why the data is "unseen"). From our earlier investigation we know that generateFeats uses the XXX_train.fileids. That file is generated during prepareTrainExperiment.pl and is based on whatever trans was passed as an argument. Because we train on seen data (the main train.trans), the XXX_train.fileids contains those fileids and does not include the dev.trans/eval.trans ids (as they are removed from the main train.trans).


 * Plan:
 * Concerns:

09-Apr

 * Task:
 * Score team cap 006. Train team cap 007.
 * Organizational work for team cap


 * Results:
 * 006 results secret. Will be published after the competition.


 * Plan:
 * Concerns:

10-Apr

 * Task:
 * Generate feats on full corpus
 * Organizational work for team cap


 * Results:
 * Due to the reliance of generateFeats.pl on the subexperiment/etc/sphinx_train.cfg file, the simplest way to generate the feats for the full corpus was to create a new experiment in my home directory on the full corpus. I created the directory, ran prepareTrainExperiment, and then ran generateFeats. This produced all the feats for the full corpus in my home/subexp/feat directory. That directory was then copied to /mnt/main/corpus/switchboard/full/train/audio/mfc.


 * Plan:
 * Create a script to generate links to the mfc files. The intent is to replace generateFeats.pl with this.


 * Concerns:

11-Apr

 * Task:
 * Modify linkTransAudio to create soft links to full/train/audio/mfc feats.


 * Results:
 * I created linkTransAudio several weeks ago to create links from transcript files to utterances in the full corpus. Links are ideal because they are quick to create and save a lot of space. A third parameter was added to linkTransAudio which tells the script which file extension to use: "mfc" for *.mfc (feats) and "sph" for *.sph (utterances).
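An illustrative shell equivalent of what the linking step does (linkTransAudio.pl itself is Perl; the extension argument and loop here are a sketch, not its actual code):

```shell
ext=mfc                       # "mfc" for feats, "sph" for utterances
src=$(mktemp -d)              # stand-in for full/train/audio/mfc
dst=$(mktemp -d)              # stand-in for the experiment's feat directory
printf 'u1 0.0 1.0 a\nu2 1.0 2.0 b\n' > "$dst/train.trans"
touch "$src/u1.$ext" "$src/u2.$ext"
# For each utterance id in the trans, create a soft link instead of a copy
while read -r id _; do
    ln -s "$src/$id.$ext" "$dst/$id.$ext"
done < "$dst/train.trans"
ls -l "$dst"
```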


 * Plan:
 * Run some experiments using the new linkTransAudio script to determine the savings in time and disk space over generating new feats for every experiment.


 * Concerns:

13-Apr

 * Task:
 * Meet with team and plan for upcoming week


 * Results:
 * The team decided not to trade. The other team decided not to trade as well. Both teams seem confident with their team members.
 * Did our first decode on unseen data.
 * Team meeting went great. Everyone contributed to the discussion and we hashed out a plan for the upcoming week.


 * Plan:
 * Concerns:

17-Apr

 * Task:
 * Decoded/scored some secret experiments.
 * Researched into optimal WER for switchboard corpus


 * Results:
 * Results of secret experiments to be released after the competition.
 * While I only spent about half an hour to 45 minutes researching the optimal WER for the switchboard corpus, I have not found anything conclusive. The majority of the results are from the mid-90's, with WERs starting around ~70% and being reduced to ~50%.


 * Plan:
 * Continue research into optimal WER for switchboard corpus.


 * Concerns:

18-Apr

 * Task:
 * Start new trains


 * Results:
 * Started new trains on the team cap drones. They will be done later this afternoon and James is going to pick up from there with the decodes.


 * Plan:
 * Continue research into optimal WER for switchboard corpus.


 * Concerns:

20-Apr

 * Task:
 * Research switchboard WER target


 * Results:
 * It appears that advances in artificial neural networks (including deep neural networks and convolutional neural networks) led to dramatic improvements in WER in 2011. This produced a switchboard benchmark of 18.5%. Sphinx uses GMM-HMMs (Gaussian mixture model hidden Markov models), so we can expect a benchmark of 25.2%. Currently we are in the low 40's for unseen-data WER.
 * Sources:
 * http://recognize-speech.com/acoustic-model/knn/benchmarks-comparison-of-different-architectures
 * http://research.microsoft.com/en-us/news/features/speechrecognition-082911.aspx


 * Plan:


 * Concerns:

21-Apr

 * Task:
 * Update genFeats.pl and linkTransAudio.pl to create soft links to /mnt/main/corpus/switchboard/full/train/audio/mfc/ as opposed to generating new feats every time. This will save time and disk space.
 * Researched reasons for failed team cap trains.


 * Results:
 * Scripts updated and placed in /mnt/main/scripts/user/history/genFeats/6/ and /mnt/main/scripts/user/history/linkTransAudio/1/. Created a new experiment using the updated scripts and confirmed successful linking of feats with no train errors. Informed Matt of the progress and asked him for a code review before moving the scripts to /mnt/main/scripts/user/ and updating the corresponding wiki pages.


 * Plan:
 * Concerns:

23-Apr

 * Task:
 * Google hangout with Tom and James
 * Results:
 * James, Tom, and I were working on figuring out why some of Team Cap's experiments were failing. After several attempts and some research, we determined that the failures were due to outdated sphinx software and uninstalled libraries. Tom researched how to install the libraries and worked on installing these on Miraculix.


 * Plan:
 * Concerns:

26-Apr

 * Task:
 * Continue research into decoding issues
 * Get in contact with Tom to get Miraculix back up
 * Meet with James


 * Results:
 * Attempting to get decoding going on multiple cores. sphinx_decode.cfg is not used in the run_decode script. Attempted using scripts_pl/decode/slave.pl and received multiple errors when starting the decode process. Not much information in the historical logs about why we don't use the built-in decode scripts the way we do the built-in train scripts. Found documentation on what to add to run_decode.pl to get the process working. However, I would still like to find the reasoning behind not using the built-in sphinx decoder scripts, and why we still generate a sphinx_decode.cfg if we never actually use the script that uses it.
 * Sent emails to Tom about getting Miraculix back up. Currently we're running at 25% server power with Miraculix down and the data group running a decode on Asterix.
 * James and I met for about an hour. Discussed the decode issues and worked through them. Looked into other reasons trains are failing and found missing binaries and outdated scripts.


 * Plan:
 * Meet with team cap and Jonas


 * Concerns:
 * Caesar needs to be updated with the most recent Sphinx builds. I have no doubt we could establish the world-class baseline if our hands were not tied by outdated software and missing scripts.

27-Apr

 * Task:
 * Score 023.
 * Train/decode 024.
 * Meet with team Stark.
 * Meet with team Cap in class.


 * Results:
 * Surprising results on 023: a 10% decrease in WER over baseline. Assuming this is an error; confirming with 024.
 * 024 showed the same 10% decrease over baseline.
 * Merged the teams back together. We decided we could work better as a class on installing the missing dependencies. Both teams were pretty much at the same point anyway.


 * Plan:
 * Dig further into 023/024. I am very suspicious of such a large increase in performance.


 * Concerns:

28-Apr

 * Task:
 * Score 0294/001
 * Determine way to include sphinx_decode.cfg file in run_decode


 * Results:
 * Decoding on multiple cores (using the ctloffset parameter) splits the decode logs into multiple parts as well. To put these back together:

cat decode.log* >> decode.log
 * This will give you a notification:

cat: decode.log: input file is output file
 * Do not worry about this (the glob also matches the output file itself; cat detects that and skips it)
 * 0294/001 scored 17.5% WER. This is suspect; I have previously seen suspect decode results come from running decodes on multiple cores.
 * Moved all decode logs, the run_decode.pl script used, hyp.trans, and scoring.log to 0294/001/DECODE/decode-old. Re-decoding 001 using a single core.
 * Sphinx_decode.cfg
 * After some brainstorming on the issue, I realized that the decode.cfg file is not very helpful. All it really does is change what parameters sphinx3_decode uses. Our run_decode.pl calls sphinx3_decode manually and specifies the command-line arguments to pass in. We can just modify the call to sphinx3_decode directly, changing whatever arguments we want.
 * I built a perl module called sp16Decode. In it is a readme that details instructions of use and lists all parameters that sphinx3_decode utilizes. I modified a run_decode script to iterate through a hash table (found in sp16Decode/decode_config.pm). It adds the parameter value as set in decode_config.pm to the system call to sphinx3_decode in run_decode.pl.
 * This allows us to easily change decode parameters without having to change run_decode.pl every time. It also exposes many more parameters than sphinx_decode.cfg does (a few in decode.cfg compared to all of them in sp16Decode/decode_config.pm).


 * Plan:
 * Decode 0294/001 again, using single core (using the run_decode script we have always previously used)
 * Sent out a code review request to others to review the above sp16Decode module


 * Concerns:

30-Apr

 * Task:
 * Score 005, check scores on 004
 * Results:
 * Matt scored 004. Showed a WER of 29.0% on seen data. The baseline exp showed 29.1%. We wanted to see exactly 29.1%, but the small difference could be rounding error.
 * Most importantly, 004 showed 29.0% on both single-core decodes and multi-core (multithreaded) decodes. This verifies that our process of splitting the decode (using the -ctloffset and -ctlcount args) and then catting the logs back together (cat decode.log* >> decode.log) does not affect WER.
 * I scored 005: high WER, in the 75's, on unseen data (dev.trans). Matt is going to look into it.
 * Wrote a new version of the sp16Decode module. Added a decode_util.pm module that will serve as the utility (helper) module. Placed two subroutines in there: one that gets the proper ctloffset, and another that gets the proper ctlcount. run_decode.pl for the package was also modified to put the system call to $cmd in a while loop (so the proper cmd with the proper ctloffset is generated depending on the number of cores desired).
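The offset/count bookkeeping those two subroutines handle can be sketched as follows (illustrative arithmetic, not the module's actual code): with 10 utterances split across 4 parts, each sphinx3_decode invocation gets its own -ctloffset/-ctlcount pair.

```shell
total=10; nparts=4
count=$(( (total + nparts - 1) / nparts ))    # ceiling division: lines per part
for part in 0 1 2 3; do
    offset=$(( part * count ))
    n=$count
    # The last part may have fewer lines than the others
    [ $(( offset + n )) -gt "$total" ] && n=$(( total - offset ))
    echo "part $part: -ctloffset $offset -ctlcount $n"
done
```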

3-May

 * Task:
 * Catch up on emails for the team
 * Work on sp16Decode module
 * Results:
 * Matt started another train experimenting with language weight. The value was set to 13, which is the high end of the CMUSphinx recommendations. The first attempt at language weight, where it was set to 27, showed very bad results. It will be interesting to test the high and low ends of the recommended range.
 * Checked on the sp16Decode module. I wanted to port the most recent changes (stored in sp16Decode/dev) to the main module. I noticed that Matt had already ported over the changes, but I found a few bad references (still pointing to sp16DecodeDev, which is in the dev folder). Fixed those minor issues and confirmed all references to Dev were removed by using grep.

4-May

 * Task:
 * Meet in class


 * Results:
 * Discussed changes to be made to the sp16Decode module. We are going to rename it to run_decode and store it in the user/scripts directory. From what I understand, it will be used in the same manner; however, we are moving away from the DECODE directory in subexperiments. Because of this, run_decode will be run from the subexp/etc directory. The new process would be to copy the decode_config.pm file into the etc directory, configure it as appropriate, and then call run_decode.pl from the user/scripts directory. We also want to eliminate all the command-line arguments from run_decode except for npart, so:

run_decode.pl 001 0294/001 4000 8

would just be:

run_decode.pl 8

and all the other settings would be set via decode_config.pm.
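The config-driven call could be sketched like this, with hypothetical parameter names and values (the real decode_config.pm is a Perl hash; this shell associative array is just the idea):

```shell
# Hypothetical decode parameters; the real set lives in decode_config.pm
declare -A cfg=( [lw]=13 [maxwpf]=20 )
cmd="sphinx3_decode"
# Append each configured parameter to the command (sorted for a stable order)
for k in $(printf '%s\n' "${!cfg[@]}" | sort); do
    cmd="$cmd -$k ${cfg[$k]}"
done
echo "$cmd"
```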


 * Plan:
 * Do the above


 * Concerns: