Speech:Spring 2017 Dylan Lindstrom Log



Week Ending February 7, 2017

 * Task:

2/3 - Read through logs from previous semesters, log into Caesar and change my password, familiarize myself with the directories, read up on Sphinx, sign up for Slack.

2/6 - Set up Pulse Secure on laptop so I can VPN into UNH's network, read through documentation on how to conduct experiments then attempt to create one on Caesar and run my first train.

2/7 - Check to see if train completed, create language model, run decode, generate scoring report, read through previous semester's Data logs to help devise our proposal.


 * Results:

2/3 - I went through a lot of documentation from past semesters. I am now more familiar with the terminology surrounding the project. In particular, I found the following logs to be helpful in understanding the project:


 * Spring 2016 Data Group Log
 * Brenden Collins' Spring 2016 Log
 * Dakota Heyman's Spring 2015 Log
 * Capstone Terms

I logged into Caesar and successfully updated my password. I also navigated to /mnt and explored the different directories there. I read this documentation on Sphinx, which helped me to better understand some of the terms used in the Wiki. I also set up an account for Slack and joined our class and group channels.

2/6 - I successfully set up Pulse Secure so I can access the servers from home. I read up on the documentation for setting up experiments and familiarized myself with the experiment directory on Caesar. I was able to successfully set up my first experiment and am running my first train now. I will check back later to see if it has completed. The following entries provided all the information I needed to set up my first experiment and run my first train:


 * Exp directory
 * 'addExp.pl' script
 * Starting a new train

2/7 - The train completed successfully. However, after trying to decode, I do not believe the 'run_decode.pl' script ran correctly. I also received an error (segmentation fault) when I went to generate the scoring report. I attempted a new sub-experiment after reading through some of the other logs. I saw that Vitali completed his experiment, so I looked at his log and found that he discovered the language model should be created within the sub-experiment directory rather than the base directory. I decided to try this, but this time around I did not receive any errors, only a blank line. I will have to talk to other members to see what might be going on.


 * Plan:

2/3 - Continue to go through logs, read documentation on experiments, then try to set one up.

2/6 - Wait for train to finish, create language model, run decode, read through previous semester's Data logs to help devise our proposal.

2/7 - Try to run another decode, work on proposal with group.


 * Concerns:

2/3 - My main concern at this point is getting myself familiarized with important terminology and making sure I understand all the different components of the project.

2/6 - It seems like there might be a minor issue with the 'addExp.pl' script, as it is included in the /scripts/user directory which has been added to the path, but the command is not recognized unless you use the absolute path. Outside of that, I am concerned about our proposal and will need to look more closely at what the Data group accomplished last year so we can start to build off that.

2/7 - I am concerned about the decode issue.

Week Ending February 14, 2017

 * Task:

2/8 - Work on proposal, continue reading past semesters' logs, look for ways to improve Word Error Rate in relation to the data, familiarize myself with the audio directories/files

2/12 - Attempt to get the decode to work again, locate and examine the dictionary file and find ways to make improvements

2/13 - Check on the decode, find the purpose of different files on the server (.sph, .mfc, .wav) by going through logs, try to download and listen to a file and match it up with its transcript

2/14 - Look into creating a new corpus size, try to convert a .sph file to a .wav file, then listen to the .wav file on my machine and match it up with its transcript


 * Results:

2/8 - While reading through Brenden Collins' log from last year, I discovered an entry about the segmentation of data. He linked to this paper from Mississippi State, which states:

"Segmentation of conversational speech into relatively short phrases enhances the transcription accuracy, helps in reducing the computational requirements for training and testing each utterance,  and simplifies the application of the language model (LM) during recognition."

I am not sure how our data is currently segmented, but they offered an approach to re-segmenting data, which improved their Word Error Rate by 2%. This approach included:


 * Echo cancellation
 * Manual adjustment of utterance boundaries
 * Correction of the orthographic transcription of the new utterance
 * Readjustment of boundaries if necessary
 * Supervised recognition on the new utterances to get a time-aligned transcription
 * Review of the word boundaries and final correction of transcriptions

I believe these steps are worth looking into for the data group. It seems like last year's group touched on this a bit, but we could potentially do more. I also looked at the /mnt/main/corpus/switchboard/full/train/audio directory to familiarize myself with the types of files I will be working with. I discovered there are files with .mfc, .sph, and .wav file extensions. I will have to look into specifically what the purpose of each type of file is.

2/12 - The decode is running. In the past I did not see the process running in the background when I ran the 'top' command, so I am hoping this time it will be a success. The dictionary files are located in the '/mnt/main/corpus/dict' directory. According to the CMU Sphinx website, 'cmudict.0.7b' is the most recent dictionary file. Upon looking at this file, it appears that the dictionary already takes into account alternate pronunciations of words. Words that have multiple known pronunciations are listed more than once, and each additional entry is marked with a number in parentheses after the word. [ex: Abducted (1)] This number is an index that identifies the alternate pronunciation (the first entry for a word carries no number); it is not a count of how many alternatives exist. This documentation from CMU Sphinx gives ideas on how improvements could be made to the dictionary file, including:


 * Adding new words to the dictionary, such as new slang and tech-related words (CMU cites the words 'Spotify', 'Skype', and 'iPad' as good examples)
 * Figuring out the context in which words are spoken
 * Automatic dictionary acquisition

We will have to look into each of these options to determine whether or not they are something our group can work toward.
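The alternate-pronunciation notation described above can be illustrated with a short sketch. It is written in Python purely for illustration (the project's own scripts are Perl). The function name and the sample entries are hypothetical, and the only assumption about the file is the "WORD(n)  PHONES" layout with ";;;" comment lines.

```python
import re
from collections import defaultdict

# Matches "WORD" or "WORD(2)" at the start of an entry; the number, if present,
# marks an alternate pronunciation of WORD.
ENTRY_RE = re.compile(r"^(?P<word>\S+?)(?:\((?P<alt>\d+)\))?\s+(?P<phones>.+)$")

def group_pronunciations(lines):
    """Map each base word to the list of its pronunciations, in file order."""
    prons = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blank lines and comments
            continue
        match = ENTRY_RE.match(line)
        if match:
            prons[match.group("word")].append(match.group("phones").split())
    return dict(prons)

# Illustrative entries only; they are not copied from the file on the server.
sample = [
    ";;; sample comment line",
    "HELLO  HH AH0 L OW1",
    "HELLO(1)  HH EH0 L OW1",
    "WORLD  W ER1 L D",
]
grouped = group_pronunciations(sample)
print(len(grouped["HELLO"]))  # number of pronunciations listed for HELLO
```

Grouping by base word makes it easy to see how many variants a word has, which the "(n)" suffix alone does not say.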

2/13 - The decode completed successfully, but I ran into a "Not enough reference files loaded" error when trying to generate the scoring report. I followed up by trying to use the 'uniq' command, as suggested here, but received a "Too many arguments" error. I will have to consult with my group about this. Aside from that, I located all the original .sph files [sw02001.sph - sw04940.sph] in /mnt/main/corpus/switchboard/dist/consolidated. I discovered that these are the original audio files, whereas the utterance files can be found in /mnt/main/corpus/switchboard/full/train/audio/utt. The .wav files I saw are converted utterance files. The .mfc files are best explained by CMU here, and a more in-depth explanation can be found here.

2/14 - I found this information on how to create a specific corpus size from last year's data group. However, there seem to be a lot of steps involved and I am a bit confused about what I read. Maybe I can work on this tomorrow with the group if it is deemed important, though there are likely more pressing needs at this time.

I tried to use the following command to convert a portion of a .sph file to a .wav file:

sox .sph .wav

But I received permission errors.


 * Plan:

2/8 - Try to get the decode working, read more into possible steps to reduce Word Error Rate, read logs to try and find information on the audio files, look at the dictionary file and think of ways to make improvements

2/12 - Check on the decode then (hopefully) generate the scoring report, find the purpose of each different audio file on the server (.sph, .mfc, .wav) by going through logs, try to download and listen to a file and match it up with its transcript

2/13 - Look through logs to find out how to create a new corpus size, try to convert a portion of a .sph file to a .wav file, listen to the .wav file and match it up with its transcript

2/14 - Talk about corpus sizes with my group, try to get SOX working so we can convert files


 * Concerns:

2/8 - I am concerned with the purpose of all the different file types. I will have to dive back into logs/documentation to determine that.

2/12 - I am still concerned with the purpose of the different file types. I will examine these files tomorrow.

2/13 - I think that perhaps our group should look into creating a 5hr sub-corpus to use for experiment testing purposes. It does not seem like the rest of the class has had a lot of success with running an experiment all the way through either. If we create a 5hr corpus, train and decode times should be significantly shorter, allowing us to troubleshoot these issues with more ease.
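One way to think about sizing such a sub-corpus is to add up utterance durations until a target is reached. The sketch below is in Python for illustration only and is not the procedure documented by the Data group. It assumes transcript lines of the form "utterance-id start-seconds end-seconds text"; the function name and sample lines are hypothetical.

```python
def select_subcorpus(transcript_lines, target_seconds):
    """Take utterances in order until their total duration reaches the target."""
    selected, total = [], 0.0
    for line in transcript_lines:
        if total >= target_seconds:
            break
        parts = line.split(maxsplit=3)
        if len(parts) < 3:
            continue  # not an "id start end [text]" line
        start, end = float(parts[1]), float(parts[2])
        selected.append(line)
        total += end - start
    return selected, total

# Hypothetical lines; a real 5hr corpus would use target_seconds = 5 * 3600.
lines = [
    "utt-0001 0.0 6.0 first utterance",
    "utt-0002 6.0 11.0 second utterance",
    "utt-0003 11.0 15.0 third utterance",
]
chosen, seconds = select_subcorpus(lines, target_seconds=10.0)
print(len(chosen), seconds)
```

Taking utterances in file order is the simplest choice. Whether that gives a representative sample of speakers is a separate question that this sketch does not address.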

2/14 - I am concerned about using SOX to convert files and getting our experiments to finally run.

Week Ending February 21, 2017

 * Task:

2/16 - Get sox working, match up audio files to transcripts, read more into the basics of linguistics to better understand how to improve the dictionary file

2/19 - Revise proposal, check on train and try to run decode again, look at transcript file for new cases to use in our script

2/20 - Look into adding new words to the dictionary file

2/21 - Create 5hr corpus, finalize proposal


 * Results:

2/16 - Thanks to Matt, I was able to get sox working on my own machine so that I could copy over files from the server to my PC and then convert the files to .wav format for listening. However, a bit later, we found that VLC (a free media player) was able to play .sph files without converting them. I was able to successfully listen to audio files on my PC and match up the audio with the corresponding transcripts. I also read up on the basics of linguistics so that I could better understand the dictionary file. The following links provided me with some new information on the subject:


 * CMU Dictionary Tutorial
 * CMU Pronouncing Dictionary
 * Lexicon Tool

2/19 - Talked with the group on Slack for a bit about current tasks. The train completed, but the decode failed instantly with FATAL_ERROR: "mdef.c", line xxx: No mdef-file. I tried creating a 5hr sub-corpus to troubleshoot these issues, but ran into more issues. Specifically, it looks like a script is prepending 'etc/' to some of the paths I used while following the instructions listed here. Additionally, I searched the main transcript file for lines containing brackets [], then pasted all of those lines into a new .txt document so that our group will have an easier time looking for new cases to use in our script.

2/20 - Started a word file of possible words to be added to the dictionary file, utilizing the CMU Pronouncing Dictionary/Lexicon Tool, as detailed in the links I posted on 2/16. The words will not be added to the dictionary until they are approved. The word file contains all new words, along with predicted pronunciations of each word as generated by the CMU Lexicon Tool. As a reference for new words, I am using the Oxford English Dictionary, to which new words are added four times per year (I started with March 2013).

A hand file, which contains corrections for pronunciations, will also be maintained in the event that the pronunciations provided by the Lexicon Tool are not accurate. Maintaining this file will require some basic knowledge of linguistics. I also found this tool for CMU Sphinx, which includes a way to generate pronunciations in the command line (type in "Hello" and the console outputs "HH EH L OW"). This might be something to incorporate in the future.

If adding new words to the dictionary file is deemed important, a program could be created that would take the word file and compare it to the dictionary file, removing any duplicate entries from the word file. The word file would then be uploaded into the Lexicon Tool to generate the pronunciations, and then the file could be fed into another program that would add the new words to the dictionary file and then sort it alphabetically. This would make the task of adding new words much less tedious.
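The first step of that idea, comparing the word file against the dictionary and dropping words that are already present, could look roughly like the sketch below. It is Python for illustration only (not the Perl that was eventually written). The function names and sample data are hypothetical, and it assumes dictionary lines start with the word, optionally followed by "(n)", with ";;;" marking comments.

```python
def words_in_dictionary(dict_lines):
    """Collect base words from dictionary lines, ignoring comments and '(n)' suffixes."""
    words = set()
    for line in dict_lines:
        if not line.strip() or line.startswith(";;;"):
            continue
        head = line.split()[0]
        words.add(head.split("(")[0].upper())
    return words

def filter_new_words(candidates, dict_lines):
    """Return candidates not already in the dictionary, without repeats, in input order."""
    known = words_in_dictionary(dict_lines)
    seen, fresh = set(), []
    for word in candidates:
        key = word.strip().upper()
        if key and key not in known and key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh

# Hypothetical data for illustration.
dictionary = ["HELLO  HH AH0 L OW1", "HELLO(1)  HH EH0 L OW1"]
candidates = ["hello", "selfie", "Selfie", "emoji"]
print(filter_new_words(candidates, dictionary))
```

Comparing in upper case avoids treating "Selfie" and "selfie" as different words, which matters if the word source mixes capitalization.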

2/21 - I attempted to create a 5hr corpus again, though the documentation on how to do so correctly is not at all clear. I had to make a copy of the linkTransAudio.pl script and make some modifications, as it seems that it was edited last semester for another purpose and therefore was no longer working in the way that the instructions claimed. Specifically, I removed the 'etc/' that was being prepended to the transcript argument, removed the 'feats/' that was prepended to $target_file, and I removed the extension argument and hard coded .sph as the default extension. All of these changes have been temporarily saved as 'linkTransUtt.pl'. After this, I thought that I had it mostly figured out, but when I went to run a train, the genTrans.pl script did not complete. It appears that some required files are missing, specifically _train.fileids. I will have to look into this more and try to figure out a way to understand all the scripts that are being used in this process.


 * Plan:

2/16 - Revise our proposal, get experiments working

2/19 - Continue working on proposal, enlist the help of someone who has successfully run a decode/scoring report, look for words to add to dictionary file.

2/20 - Continue adding words to word file and looking through the transcript file

2/21 - Work on fixing the 5hr corpus and update corresponding documentation, begin work on genTrans.pl script


 * Concerns:

2/16 - I am still concerned about getting our experiments to work.

2/19 - Experiments still not working.

2/20 - No new concerns.

2/21 - Some documentation from past semesters is really lacking and I am concerned that we will be spending too much time trying to get things to work.

Week Ending February 28, 2017

 * Task:

2/24 - Test a full train/decode on the 5hr corpus I created.

2/25 - Look through my portion of transcript, work on updating corpus creation documentation

2/27 - Update documentation for corpus creation

2/28 - Finish fixing 5hr corpus, upload new documentation


 * Results:

2/24 - I ran a full train/decode on the 5hr corpus and it completed successfully. The experiment results can be viewed here. While looking at the results, I compared hyp.trans to the original transcript and noticed that hyp.trans was already stripped of brackets while the words within them were being preserved. This confirmed our suspicions from class on Wednesday that the 'genTrans.pl' script was already doing what our proposal outlined. I am not sure how this will affect our proposal and our tasks for the week ahead or the rest of the semester. Matt will be working with the Experiment group over the weekend to test the script further to make sure the regular expressions are all correct. Below are two comparisons of a line from the original transcript file and a line from the generated hyp.trans file:

Line from original transcript:
sw2001A-ms98-a-0023 105.068750 112.051750 uh other times it could be very casual if you knew you would be at a desk [laughter-all] day and nobody would see you [noise] um

Line from hyp.trans:
UH OTHER TIMES IT COULD BE VERY CASUAL YOU YOU WOULD BE IN A DESK ALL DAY IN NOBODY WOULD SEE IT BUT UM (sw2001A-ms98-a-0023)

Line from original transcript:
sw2001A-ms98-a-0029 153.333375 162.042250 the men often have to wear shirt and [laughter-tie] [laughter-no] [laughter-matter] r[ight]- right right what time of what time of the year that's right

Line from hyp.trans:
THE MEN OFTEN HAVE TO WEAR CERTAIN TYPES NO MATTER WHAT RIGHT RIGHT WHAT TIME IS THE RIGHT TIME OF THE YEAR THAT'S RIGHT (sw2001A-ms98-a-0029)
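Differences like these are what Word Error Rate summarizes: substitutions, deletions, and insertions divided by the number of reference words. The Python sketch below shows that calculation on short hypothetical strings. It is not the project's scoring tool, and it assumes both strings are already normalized (same case, tags removed).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    # WER is undefined for an empty reference; 0.0 is returned here by convention.
    return prev[-1] / len(ref) if ref else 0.0

# One substituted word out of six reference words.
wer = word_error_rate("BE AT A DESK ALL DAY", "BE IN A DESK ALL DAY")
print(round(wer, 3))
```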

2/25 - Upon inspection of my portion of the transcript file, I found several words with '_1' appended to the ends of them, but after looking at the 'genTrans.pl' script, it looks like this has already been taken care of by a regular expression ($message =~ s/\_1//g; #remove _1). Additionally, I have begun work on a much cleaner list of instructions on how to properly add new corpora. It should be done and updated on the Wiki soon.

2/27 - While updating the documentation for corpus creation, I discovered a few potential issues with my 5hr corpus that I created previously. There are a lot of files that need to be renamed/moved around during the process of creation, and the two documents I have been using as guides from the data group last year do completely different things at multiple parts. I have done a lot of trial and error on the server trying to figure out which instructions from which documents are correct. This has been and will continue to be a lengthy process.

2/28 - Today I was able to figure out how to correctly create new corpora. Using the instructions found here, and some of the commands found in Brenden Collins' log, I was able to piece things together and successfully created a working 5hr corpus for class-wide use in experiments. The new documentation for corpora creation can currently be viewed on our group page. I also successfully ran a full train/decode on the corpus to verify that it works. The results of the experiment can be found here. During this process, I discovered I was looking at the incorrect transcript file when checking to see if the [brackets] were being removed. After looking at 008_train.trans, I saw that the only cases being removed are [laughter], [noise], [vocalized-noise], and < >. However, it appears all the regular expressions we need for all the other cases we have found (so far) are commented out in 'genTrans.pl', so we will have to test those regular expressions to make sure they are correct.
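The removals described in this entry and in the 2/25 entry can be sketched as follows. This is a Python illustration, not the regular expressions in 'genTrans.pl'. The function name is hypothetical, and treating "< >" as "anything inside angle brackets" is an assumption.

```python
import re

# Non-speech tags named in the log entry above.
NON_SPEECH = re.compile(r"\[(?:laughter|noise|vocalized-noise)\]")
# Assumption: "< >" refers to any tag enclosed in angle brackets.
ANGLE_TAG = re.compile(r"<[^>]*>")

def clean_transcript_text(text):
    """Drop non-speech tags, angle-bracket tags, and the '_1' suffix."""
    text = NON_SPEECH.sub(" ", text)
    text = ANGLE_TAG.sub(" ", text)
    text = text.replace("_1", "")      # same effect as s/\_1//g
    return " ".join(text.split())      # collapse the leftover whitespace

# Hypothetical input line (text portion only, without the utterance ID and times).
print(clean_transcript_text("[noise] um okay_1 [laughter] yes <b_aside> so"))
```

Tags such as [laughter-all] are deliberately left alone here, which matches the observation that only the listed cases are currently removed.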


 * Plan:

2/24 - Wait for confirmation on the script, revise documentation/scripts for the creation of new corpora

2/25 - Wait for confirmation on the script, upload revised corpus creation documentation, create working version of 'linkTransAudio.pl'

2/27 - Continue working through the corpus creation documents.

2/28 - Add documentation for a new script ('linkTransUtt.pl') that was created based off the 'linkTransAudio.pl' script, for use only in corpus creation


 * Concerns:

2/24 - I am concerned with our current and future tasks now that we have some confirmation that the script already does what we wanted to add.

2/25 - No new concerns.

2/27 - I am concerned that I have not heard from my team about the script.

2/28 - I am still concerned that I have not heard anything about the script.

Week Ending March 7, 2017

 * Task:

3/3 - Work on Perl script to fetch new words added to the Oxford English Dictionary

3/6 - Finish going through my portion of the transcript

3/7 - Continue work on Perl script for dictionary file


 * Results:

3/3 - I began working on the Perl script. The script is able to successfully grab the HTML from the OED website so that I can then parse the HTML and grab the newly added words. I ran into some errors while trying to use some different modules, so I will have to look into those to fully understand them before I can continue.

3/6 - After completing my portion of the transcript, I discovered a couple of lines that we may need to address in our script:

sw3807A-ms98-a-0073 266.859625 268.855250 yeah that's right so h[e]- -[h]e -[h]e

sw3817B-ms98-a-0026 165.001000 168.897125 [noise] sounds like the [trelve/twelve] twelve tribes of Israel or something

Specifically, we should check to make sure the script will correctly preserve the word "he" three times while removing the brackets and dashes. I am not sure what the [trelve/twelve] tag is about, so we should check to see how the script currently handles this line and then make any necessary changes. There are additional lines like these with the same formats. Outside of these two cases, I did not see anything in the transcript that has not already been documented.
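A rough Python sketch of how these cases might be handled is below. It is an illustration, not the 'genTrans.pl' logic. In particular, reading [trelve/twelve] as "[spoken/intended]" and keeping the second form is an assumption, and the function names are hypothetical.

```python
import re

def restore_token(token):
    """Rewrite one bracketed transcript token as a plain word where possible."""
    # [laughter-word] -> word (the word was spoken while laughing)
    match = re.fullmatch(r"\[laughter-(.+)\]", token)
    if match:
        return match.group(1)
    # Assumption: [spoken/intended] -> intended
    match = re.fullmatch(r"\[([^/\]]+)/([^/\]]+)\]", token)
    if match:
        return match.group(2)
    # Partial words such as h[e]-, -[h]e, r[ight]-: drop dashes and brackets
    if "[" in token and (token.startswith("-") or token.endswith("-")):
        return token.strip("-").replace("[", "").replace("]", "")
    return token

def restore_line(text):
    """Apply restore_token to every whitespace-separated token."""
    return " ".join(restore_token(token) for token in text.split())

print(restore_line("yeah that's right so h[e]- -[h]e -[h]e"))
```

This version keeps the full word for partial utterances. Dropping them entirely would be the other option, and which one helps recognition is something only an experiment can show.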

3/7 - Today I was able to get a good portion of the dictionary script done. Currently it grabs HTML from an OED page, like here, finds the appropriate words on the page, saves them to an HTML file, then strips the li tags and puts each word on a new line and saves it as a CSV file. The next step I will have to take is to eliminate any characters after the comma on each line. I will try to do this by using the Text::CSV Perl module and a regular expression to extract only the word. For example, the list currently has the word followed by a comma and then the type of word it is, like below:

 * andic, adj.
 * Andisol, n.
 * andosol, n.
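The extraction steps described in this entry (pull the list items out of the page, then keep only the text before the comma) can be sketched with Python's standard library. This is an illustration, not the Perl script. The class and function names are hypothetical, and the sample HTML is a made-up fragment built from the three words above.

```python
from html.parser import HTMLParser

class ListItemParser(HTMLParser):
    """Collect the text content of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._inside = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._inside:
            self._inside = False
            self.items.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

def extract_words(html):
    """Return the headword of each list item (the text before the first comma)."""
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return [item.split(",", 1)[0].strip() for item in parser.items if item]

sample_html = "<ul><li>andic, adj.</li><li>Andisol, n.</li><li>andosol, n.</li></ul>"
print(extract_words(sample_html))
```

Splitting on the first comma is enough for entries shaped like the examples. A headword that itself contains a comma would need a stricter rule.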


 * Plan:

3/3 - Continue working on Perl script, finish going through my portion of the transcript

3/6 - Continue working on the dictionary script

3/7 - Finish first portion of the dictionary script


 * Concerns:

3/3 - No concerns today

3/6 - No concerns today

3/7 - None

Week Ending March 21, 2017

 * Task:

3/10 - Finish first portion of dictionary script

3/13 - Test and revise 'getNewWords.pl' script, begin working on new script to add words to dictionary file

3/20 - Add support for the different URL structures that I discovered to the 'getNewWords.pl' script, do further testing

3/21 - Begin working on second portion of dictionary script


 * Results:

3/10 - I completed the first portion of the script to grab new words from the Oxford English Dictionary website. The script is called 'getNewWords.pl' and accepts two arguments: a month (September, December, March, or June) and a year. Depending on what the user inputs, the script grabs the newly added words matching the date provided and adds them to a text file in a format that can be interpreted by the CMU Lexicon Tool. The Lexicon Tool is meant only as a temporary solution. Future semesters may look into installing a tool that could do this entirely in the command line, such as Logios, which is what the Lexicon Tool is built on. Future semesters may also look into adapting the script to use other websites for grabbing new words.

3/13 - Upon further testing of the script, I found that some of the older entries on the OED website use a different URL structure than the newer ones, which was causing an error whenever the script tried to grab the HTML from the website. After searching through all entries dating back to the year 2000, I found four different URL structures used. Below are four examples of the different URL structures and which months/years they were used for.

 * March 2000 - March 2011: /recent-updates-to-the-oed/previous-updates/march-2000-update/
 * June 2011 - June 2012: /recent-updates-to-the-oed/previous-updates/june-2011/new-words-list/
 * September 2012: /recent-updates-to-the-oed/previous-updates/september-2012/new-words-list-september-2012/
 * December 2012 - Present: /recent-updates-to-the-oed/previous-updates/december-2012-update/new-words-list-december-2012/

I will update the script to reflect these URL structures during my next session.
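The four structures above can be expressed as a single date-based lookup. The Python sketch below is an illustration of that idea, not the code in 'getNewWords.pl'. It assumes every update within a date range follows the one example shown for that range, and the function name is hypothetical.

```python
MONTHS = {"march": 3, "june": 6, "september": 9, "december": 12}
BASE = "/recent-updates-to-the-oed/previous-updates/"

def update_path(month, year):
    """Build the URL path for a quarterly update from its month name and year."""
    month = month.lower()
    if month not in MONTHS:
        raise ValueError("month must be March, June, September, or December")
    key = (year, MONTHS[month])   # tuples compare by year first, then month
    slug = f"{month}-{year}"
    if key < (2000, 3):
        raise ValueError("the archive starts in March 2000")
    if key <= (2011, 3):          # March 2000 - March 2011
        return f"{BASE}{slug}-update/"
    if key <= (2012, 6):          # June 2011 - June 2012
        return f"{BASE}{slug}/new-words-list/"
    if key == (2012, 9):          # September 2012 only
        return f"{BASE}{slug}/new-words-list-{slug}/"
    return f"{BASE}{slug}-update/new-words-list-{slug}/"  # December 2012 onward

print(update_path("June", 2011))
```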

3/20 - I successfully updated the script to reflect the different URL structures. The script is now in working condition, with the exception of at least one date (March 2011) due to a unique page layout. An example of the word file that is generated by the CMU Lexicon Tool after the script generates the .txt file can be found here.

3/21 - I began working on the second portion of the dictionary script, which will take the generated word file and add the new words/pronunciations to the dictionary file. Right now I am using a copy of the dictionary file so that nothing gets overwritten.


 * Plan:

3/10 - Test and revise 'getNewWords.pl' script, begin working on new script to add words to dictionary file

3/13 - Add support for the different URL structures that I discovered to the 'getNewWords.pl' script

3/20 - Begin work on 'addNewWords.pl' script, which will take the word file and add new words/pronunciations to the dictionary file

3/21 - Continue working on dictionary script, research more ways to improve quality of data


 * Concerns:

3/10 - Concerned that we may have lost a group member and may need to rework our proposal

3/13 - No new concerns

3/20 - None

3/21 - None

Week Ending March 28, 2017

 * Task:

3/27 - Run initial experiment on Miraculix, look into an error MJ has been receiving with the new genTrans.pl script

3/28 - Check experiment results, work on dictionary script


 * Results:

3/27 - During the initial experiment, the train worked, though the 'genTrans.pl' script took much longer than usual to run (likely due to the SSH session). When I moved on to the decode process and ran the 'top' command to see if the script was running, the process I usually see did not appear, so something went wrong somewhere. I was using the 5hr corpus. I will try the experiment over again using the same corpus to see if it was just an error on my part. After MJ ran into some issues with the new version of the genTrans.pl script, I looked into what could be going wrong but was not able to get the script working. There appears to be something wrong with the section of code that generates the transcript file (_train.trans). Matt will look into this error further.

3/28 - The experiment failed again at the same point. Andrew says he believes it is because /usr/local/ needs to be copied over to Miraculix. Aside from that, the dictionary script is almost complete. I am having some issues getting the alphanumeric sort to work, but should have it figured out soon.


 * Plan:

3/27 - Check on results from initial experiment, work on dictionary script

3/28 - Finish dictionary script, get the updated version of 'genTrans.pl' working on the server


 * Concerns:

3/27 - Concerned that the updated genTrans.pl script is not yet working

3/28 - No new concerns

Week Ending April 4, 2017

 * Task:

4/3 - Continue working on dictionary script

4/4 - Run experiments using the new 'genTrans.pl'


 * Results:

4/3 - The dictionary script is almost complete. I am looking into using a different module (Sort::Naturally) for sorting the words in the dictionary file. I also need to figure out a regular expression to ignore commented lines in the dictionary file when sorting.

4/4 - The updated version of 'genTrans.pl' is now working, though the 'verify_all.pl' script now spits out some warnings/errors. This is likely due to the fact that 'verify_all.pl' checks for at least one of the following phones: [laughter], [noise], [vocalized], and all of these phones are being removed by the updated 'genTrans.pl' script. We will have to update 'verify_all.pl' to reflect these changes. I ran experiment 008 on the 5hr corpus using the old 'genTrans.pl' and experiment 013 on the 5hr corpus using the new 'genTrans.pl'. After comparing results, the script is in mostly working condition. Upon looking at the new words found in the generated transcript, I found there were a lot of words like "WHA" or "FI", which come as a result of only removing the bracketed part of words like WH[AT] or FI[ND]. We will have to make changes to 'genTrans.pl' to either include the full word or remove it entirely. Aside from this, it looks like the script is doing everything we want it to. I will run a longer train and post the results when it is complete.


 * Plan:

4/3 - Finish dictionary script, run experiments using updated genTrans.pl script

4/4 - Check results of train, fix 'verify_all.pl' and 'genTrans.pl' scripts, continue working on dictionary script


 * Concerns:

4/3 - Not sure if the genTrans.pl script is in working condition yet

4/4 - None

Week Ending April 11, 2017

 * Task:

4/7 - Work on dictionary script, look at results from train

4/10 - Finish dictionary script

4/11 - Begin generating new version of dictionary file


 * Results:

4/7 - I've run into some issues with the dictionary script. I will have to read into Perl's scope a bit more before it is figured out. The longer (30hr) train I ran produced results similar to those of the prior (5hr) train. The results are still skewed because of the issue I previously mentioned with the 'genTrans.pl' script. In addition, several "words" were not found in the dictionary: debtwise, delicacy, staticky, fledged, anticapitalism, Reynold, jazzercise, depleter, vestment, antisupporters, Scorpio, antigovernment, sustaining, oopsy, inculturated, avant, undocument, assumingly, hallucinations, aquacise, restorage, bystandard, climatized, composting, garde, sheetrocking, junipers, unmelodic, and subminimum.
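A list like this can be produced by comparing every transcript word against the dictionary and counting the misses. The Python sketch below illustrates that check. It is not the project's tooling; the function name and sample data are hypothetical, and it assumes the transcript text has already been stripped of utterance IDs, times, and tags.

```python
from collections import Counter

def out_of_vocabulary(transcript_lines, dictionary_words):
    """Count transcript words that have no entry in the dictionary."""
    known = {word.upper() for word in dictionary_words}
    missing = Counter()
    for line in transcript_lines:
        for word in line.upper().split():
            if word not in known:
                missing[word] += 1
    return missing

# Hypothetical data for illustration.
transcript = ["we went to jazzercise", "composting is fun", "jazzercise is fun"]
dictionary = ["WE", "WENT", "TO", "IS", "FUN"]
missing = out_of_vocabulary(transcript, dictionary)
print(missing.most_common())
```

Counting occurrences, rather than only listing the words, helps decide which missing words are worth adding first.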

4/10 - I was able to fix the errors I was receiving from before. As a result, I was able to successfully run the dictionary scripts. The scripts are getNewWords.pl and addNewWords.pl. getNewWords.pl grabs a list of newly added words from the Oxford English Dictionary website and then adds them to a text file that can be read by the CMU Lexicon Tool. Once the word file is generated by the CMU Lexicon Tool, the file must be uploaded to the server where it can be accepted as an argument in addNewWords.pl. When running addNewWords.pl, the following arguments are accepted: word file path and dictionary file path.

getNewWords.pl example: perl getNewWords.pl -S 2011

This command will grab the words that were added in the month of September (-S or -s) during the year 2011. Since the OED only updates their dictionary with new words four times a year, there are four month flags that can be used: -m or -M for March, -j or -J for June, -s or -S for September, and -d or -D for December. The archive goes back to the year 2000, so any year from 2000-present should work.
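The flag handling described above amounts to a case-insensitive lookup plus a year check. The Python sketch below illustrates that logic only. It is not the argument parsing used in 'getNewWords.pl', and the function name is hypothetical.

```python
FLAG_TO_MONTH = {"m": "March", "j": "June", "s": "September", "d": "December"}

def parse_month_year(argv):
    """Turn arguments like ['-S', '2011'] into ('September', 2011)."""
    if len(argv) != 2:
        raise ValueError("expected a month flag and a year")
    flag, year_text = argv
    if len(flag) != 2 or flag[0] != "-" or flag[1].lower() not in FLAG_TO_MONTH:
        raise ValueError("month flag must be -m, -j, -s, or -d (either case)")
    if not year_text.isdigit() or int(year_text) < 2000:
        raise ValueError("year must be 2000 or later")
    return FLAG_TO_MONTH[flag[1].lower()], int(year_text)

print(parse_month_year(["-S", "2011"]))
```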

addNewWords.pl example: perl addNewWords.pl 6080.dict cmudict.0.7b

This command will accept a word file (6080.dict) and a dictionary file (cmudict.0.7b). A new file is then created that combines the word and dictionary file. The script sorts the file alphanumerically and also removes any repeated entries.
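The combine, sort, and de-duplicate behavior described here can be sketched in Python as follows. This is an illustration, not the code in 'addNewWords.pl'. It assumes ";;;" marks comment lines, keeps the first pronunciation seen for a repeated entry, and uses plain string ordering, which may differ from the sort the script uses. The function name is hypothetical.

```python
def merge_dictionaries(word_lines, dict_lines):
    """Combine dictionary and word-file entries, drop repeats, and sort them."""
    comments, entries = [], {}
    # Dictionary lines come first so an existing entry wins over a new one.
    for line in list(dict_lines) + list(word_lines):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(";;;"):
            comments.append(line)      # comments are kept out of the sort
            continue
        parts = line.split(None, 1)
        head = parts[0].upper()
        phones = " ".join(parts[1].split()) if len(parts) > 1 else ""
        entries.setdefault(head, phones)
    merged = [f"{head}  {phones}" for head, phones in sorted(entries.items())]
    return comments + merged

# Hypothetical data for illustration.
existing = [";;; header", "ZEBRA  Z IY1 B R AH0", "APPLE  AE1 P AH0 L"]
new_words = ["SELFIE  S EH1 L F IY0", "APPLE  AE1 P AH0 L"]
for line in merge_dictionaries(new_words, existing):
    print(line)
```

Keeping comment lines at the top and sorting only the entries is one way to handle the commented lines mentioned in the 4/3 entry.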

4/11 - I created a new version of the dictionary file for use in experiments. The new version of the file was generated using the dictionary scripts I created. This version has 9,017 new entries in it, none of which should be duplicates. More words will be added once the genTrans.pl script is figured out, and then I will run some experiments to see what effect the added words have on word error rate.


 * Plan:

4/7 - Continue working on both scripts

4/10 - Finalize/test the dictionary scripts and the updated genTrans.pl script

4/11 - Finalize scripts, run experiments to see effect of new data


 * Concerns:

4/7 - Concerned about the dictionary script

4/10 - None

4/11 - None

Week Ending April 18, 2017

 * Task:

4/12 - Troubleshoot errors with the updated genTrans.pl script


 * Results:

4/12 - Tried fixing the error we are getting with the 'verify_all.pl' script. After discovering I was using the wrong copy of verify_all.pl (there are at least three occurrences of this file on the server -- the correct one is copied over to the subexperiment directory from /mnt/main/root/tools/SphinxTrain-1.0/scripts_pl/00.verify), I tried commenting out the code that calls the script in 'RunAll.pl', but the script ran into another error further down the line with "Something failed: (/mnt/main/Exp/0298/016/scripts_pl/20.ci_hmm/slave_convg.pl)". The issue likely lies with the elimination of the +vocalized+, +laughter+, and +noise+ phones from the transcript, but we aren't sure how to remedy this. We will continue to look into these issues so that we can run a full train/decode.
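A small consistency check between the phone list and the dictionary might help narrow down problems of this kind. The Python sketch below is only an illustration. It is not what 'verify_all.pl' does (whether that script performs these same checks is an assumption), and the function name and sample data are hypothetical.

```python
from collections import Counter

def check_phone_list(phone_list, dictionary_entries):
    """Report duplicate phones and mismatches between phone list and dictionary."""
    counts = Counter(phone_list)
    duplicates = sorted(phone for phone, n in counts.items() if n > 1)
    used = set()
    for phones in dictionary_entries.values():
        used.update(phones)
    listed = set(phone_list)
    return {
        "duplicates": duplicates,             # listed more than once
        "unused": sorted(listed - used),      # listed but never used
        "unlisted": sorted(used - listed),    # used but not listed
    }

# Hypothetical phone list and dictionary for illustration.
phones = ["AH", "B", "B", "+NOISE+"]
entries = {"ABBA": ["AH", "B", "AH"], "CAT": ["K", "AH", "T"]}
print(check_phone_list(phones, entries))
```

If filler phones are removed from the transcript but still appear in the phone list, they would show up under "unused" when the filler dictionary is left out of the comparison.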


 * Plan:

4/12 - Continue working on resolving errors associated with the updated genTrans.pl script


 * Concerns:

4/12 - I've exhausted a lot of possibilities with the errors the script is giving, so I am not sure what else to try at the moment. As a whole, this is our group's biggest concern.

Week Ending April 25, 2017

 * Task:

4/24 - Update documentation on Wiki


 * Results:

4/24 - I added the following documentation to the Wiki: linkTransUtt.pl and Creating New Corpora.


 * Plan:

4/24 - Continue debugging updated genTrans.pl script


 * Concerns:

4/24 - Concerned about the genTrans.pl script. It doesn't seem like the new data will be ready this semester.

Week Ending May 2, 2017

 * Task:

4/26 - Continue trying to fix the errors with genTrans.pl

4/27 - Continue working on errors

5/2 - Work with updated dictionary to see if we can get a train running using old data


 * Results:

4/26 - With the help of the CMU Lexicon Tool, I attempted to manually add the partial words to the master dictionary to see if it would solve the error we have been getting. Although the 'add.txt' for the experiment I ran had no words to add to the dictionary, the experiment still failed.

4/27 - I tried deleting the phones from the phones list in my sub-experiment directory, but it did not work. It looks like the phones are being used in several different scripts, which is causing a lot of errors in different places.

5/2 - Any time I use a dictionary file that differs from 'master.dic', including the improved dictionary file I created or CMU's latest dictionary file, something fails during the train. It appears that duplicate phones are being added to the phone list, which is then causing errors in the 'verify_all.pl' script. Alex is going to look into this issue further. I am also interested as to why 'master.dic' (the dictionary file that is used for all experiments when run with 'makeTrain.pl') has only ~40k words in it whereas the latest release from CMU has over 130k. This may have been to reduce train/decode times, similar to how the 'pruneDictionary.pl' script works, but I am not sure what the purpose of pruning the dictionary twice would be.


 * Plan:

4/26 - More debugging

4/27 - More debugging


 * Concerns:

4/26 - Script still not working

4/27 - Script still not working

5/2 - Script(s) still not working

Week Ending May 9, 2017

 * Task:

5/8 - Make a last-ditch effort on getting scripts to work

5/9 - Make sure all scripts are uploaded to server and all documentation is on the Wiki


 * Results:

5/8 - I thought I would see if I could try a few more things, but when running experiments using either 'makeTrain.dic.pl' (test script which points to the improved dictionary) or 'makeTrain.new.pl' (test script which calls the updated 'genTrans.pl' script), I am still receiving the "Something failed" error in the 'verify_all.pl' script. Getting these scripts to work may be a priority for next semester's data group, though it seems programming knowledge is imperative for this task, so it might be best left up to a different group like Modeling.

5/9 - Added the dictionary scripts to the server under the scripts/user directory for any future semesters that might want to use them. The improved dictionary we were working on is also on the server, in the same location as the master dictionary.


 * Plan:

5/8 - Update any remaining documentation on the Wiki

5/9 - Finalize reports


 * Concerns:

5/8 - Nothing new

5/9 - Nothing new