Speech:Spring 2017 Matthew Fintonis Log



Week Ending February 7, 2017

Task
2/4 - Research Linux/Unix command-line basics and try them out. Installed Red Hat in a VirtualBox VM on my personal computer to do all my learning and testing.
2/5 - Logged onto wiki to write required post
2/7 - Work on overview part of our proposal and what our goal will be
Results
2/4 - I have a much better understanding of Linux commands, which will be very beneficial for the future.
2/5 - Required post written
2/7 - Added the overview and main goals of our group in the proposal.
Plan
2/4 - Now that I have a better understanding of Linux commands, I'm going to attempt to copy some of the sound files we will be looking at over to my local machine for listening. Going to see if I can write a Perl script to do it automatically (a rough sketch of what I have in mind is at the end of this Plan section).
2/5 - Write next required post
2/7 - Discuss what I have done with my group and complete our portion of the proposal
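For the sound-file copying mentioned above, a minimal sketch of the kind of Perl script I have in mind is below; the host name, corpus path, and file names are placeholders, not the real locations on the server.
 #!/usr/bin/perl
 # Rough sketch only: copy a few corpus audio files from the class server
 # to my local machine for listening. Host, path, and file names are
 # placeholders, not the actual locations.
 use strict;
 use warnings;
 
 my $remote = 'user@caesar.example.edu:/path/to/corpus/audio';   # placeholder
 my $local  = "$ENV{HOME}/speech_samples";
 mkdir $local unless -d $local;
 
 # scp must already be set up (key-based login) for this to run unattended.
 foreach my $file (qw(sample1.sph sample2.sph sample3.sph)) {    # placeholder names
     system('scp', "$remote/$file", "$local/") == 0
         or warn "could not copy $file\n";
 }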
Concerns
2/4 - Only concern right now is messing something up on the server with my script, as the drone servers are not up yet. Might wait until the drone servers are up before running the script (will still work on the script though).
2/5 - Forgetting to write required post
2/7 - Not completing our portion of the proposal

Week Ending February 14, 2017

Task


Results


Plan


Concerns


Week Ending February 21, 2017

Task
2/18 - Start research into the transcript to see how extraneous data and non-words are marked in it. Document findings.
2/19 - Write a log
Results
2/18 - Out of the 250,000 lines in the transcript, about 65,000 contain a word inside brackets. As of right now, the current script just removes any content inside brackets. While this is fine for tags such as [noise] and [laughter], it is not fine for tags that actually contain words. For example, [laughter-word] means that the person is laughing while saying “word”. The current scripts completely remove this word, which is a problem because in some lines half of the sentence is spoken through laughter. The other cases I found are listed in the table below (a regex sketch of how these rules could be applied follows this week's Results).
Case | Description | What should stay/be removed
[laughter] | Laughter | Remove
[laughter-word] | Laughter while speaking | Remove the laughter tag so [laughter-word] becomes word
[noise] | Random noise | Remove
[vocalized-noise] | Vocal noise (e.g. pfft) | Remove
wo[rd]- | Person said "wo" but got cut off or stuttered; "word" was the intended word | Remove "[rd]-" and keep "wo"
-[wo]rd | Person said "rd" but meant to say "word"; usually occurs when a clip begins with someone already speaking | Remove "-[wo]" and keep "rd"
[worm/word] | Person said "worm" but "word" makes sense in the context (misspeak) | Keep "worm", as the voice recognition will pick this up and not "word"
[laughter-wo[rd]-] | Combination of laughter and an unfinished/cut-off word | Keep "wo" and remove the rest of the tag
[wokd/word] | "Wokd" is what the person said but is not an English word; "word" makes sense in the context | Keep "wokd", as the voice recognition will pick this up and not "word"
2/19 - Log planned
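To make the cases above concrete, here is a minimal Perl sketch of cleanup rules matching the table; it is illustrative only and not the actual transcript-cleaning script. Rule order matters: the cut-off rules run before the laughter rule so the combined [laughter-wo[rd]-] case works out.
 #!/usr/bin/perl
 # Illustrative cleanup rules for the bracket cases in the table above.
 # This is a sketch, not the real genTrans script.
 use strict;
 use warnings;
 
 while (my $line = <STDIN>) {
     chomp $line;
     $line =~ s/(\w+)\[[^\]]*\]-/$1/g;                      # wo[rd]-  -> wo  (cut-off word)
     $line =~ s/-\[[^\]]*\](\w+)/$1/g;                      # -[wo]rd  -> rd  (clipped start)
     $line =~ s/\[laughter-([^\]]+)\]/$1/g;                 # [laughter-word] -> word
     $line =~ s/\[(?:noise|vocalized-noise|laughter)\]//g;  # sound-only tags removed
     $line =~ s/\[([^\/\]]+)\/[^\]]+\]/$1/g;                # [worm/word] -> worm (keep what was said)
     $line =~ s/\s+/ /g;                                    # collapse leftover whitespace
     print "$line\n";
 }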
Plan
2/18 - Keep investigating the transcript to try and find any other marked words other than the cases listed above.
2/19 - Follow through on plan
Concerns
2/18 - None at the moment
2/19 - Not following through on plan

Week Ending February 28, 2017

Task
2/22 - Discuss future plans for the week with the group. Get the lines from the transcript that have [] markings and divvy them up among us. Create a plan to update the script that removes the bracketed data.
Results
2/22 - Created a small program that locates all the lines with bracket markings in the 250,000-line transcript file and splits those 65,000 lines equally into 4 files (a rough sketch is below). Each of us will look through a file to find any bracket markings we haven't already found. We are also looking for lines whose only content is [noise], [vocalized-noise], or [laughter]; lines that contain only sound and no words are useless for training and only raise the word error rate. I talked with Cody from the Experiments group and we have a plan for improving the script that removes the bracketed data.
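A rough sketch of what that splitting program does; the transcript and output file names here are made up for illustration.
 #!/usr/bin/perl
 # Sketch: grab every transcript line containing a bracket tag and
 # round-robin those lines into four files for the group to review.
 # File names are placeholders, not the real ones on the server.
 use strict;
 use warnings;
 
 my @out;
 for my $n (1 .. 4) {
     open(my $fh, '>', "bracket_lines_$n.txt") or die "cannot open output $n: $!";
     push @out, $fh;
 }
 
 open(my $trans, '<', 'train.trans') or die "cannot open transcript: $!";
 my $i = 0;
 while (my $line = <$trans>) {
     next unless $line =~ /\[[^\]]*\]/;     # keep only lines with [] markings
     print { $out[$i % 4] } $line;
     $i++;
 }
 close $trans;
 close $_ for @out;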
Plan
2/22 - Cody and I will both work on the script and I will provide him samples from the transcript to test with. Cody and I work well together, and improving this should be relatively easy for us to accomplish.
Concerns
2/22 - None as of now

Week Ending March 7, 2017

Task


Results


Plan


Concerns


Week Ending March 21, 2017

Task


Results


Plan


Concerns


Week Ending March 28, 2017

Task
3/24 - Plan to do two proper logs for the week
Results
3/24 - Planned to do two proper logs for the week
Plan
3/24 - Do two proper logs for the week
Concerns
3/24 - Not doing two proper logs for the week

Week Ending April 4, 2017

Task
3/29 - Look into why the makeTrain.new.pl script is failing and figure out a way to fix it. Once it is fixed, run a train and decode and compare the results and word error rate of the updated regular expressions with those of the old scripts.
4/3 - Update the script that cross-checks the trans file to be decoded against the dictionary so it adds words that were not originally in the dictionary
Results
3/29 - Discovered why the script was not working: I had to add perl to the beginning of the command that runs genTrans.new.pl. The reason I didn't need to specify perl before was that genTrans.pl was already registered on the system as an executable that could be run from anywhere, while the new script was not. After fixing the file to specify perl, the script ran successfully and was ready for a train and decode. However, upon starting the train and decode, many errors showed up, and they were all the same error: since Cody and I now allow partial words to get through, these words fail the check against the dictionary. We are working on updating the script that detects these words so that, instead of failing the check, it simply adds the partial words to the dictionary.
4/3 - I have successfully updated the script that cross-checks the train against the dictionary for invalid words. Originally, the console would just spit out an error for every invalid word found. However, since we updated the regular expressions to keep more valid words, that behavior is no longer appropriate. Now when the check runs into an unknown word, it adds that word (most likely a partial word the person actually said) to the dictionary; a simplified sketch of the idea follows.
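Simplified sketch of that change; the file names and dictionary layout are assumptions, and the new entries are written without phonemes, which is the piece still to be sorted out.
 #!/usr/bin/perl
 # Simplified sketch of the cross-check update: instead of erroring on a
 # transcript word that is missing from the dictionary, append it.
 # File names and dictionary layout are assumptions for this sketch.
 use strict;
 use warnings;
 
 my %dict;
 open(my $dic, '<', 'train.dic') or die "cannot open dictionary: $!";
 while (<$dic>) {
     my ($word) = split;             # first column is the word, the rest are phonemes
     $dict{lc $word} = 1 if defined $word;
 }
 close $dic;
 
 open(my $trans, '<', 'train.trans') or die "cannot open transcript: $!";
 open(my $add,   '>>', 'train.dic') or die "cannot append to dictionary: $!";
 while (my $line = <$trans>) {
     for my $word (split ' ', $line) {
         next if $word =~ /^<.*>$/;          # skip utterance markers like <s> and </s>
         next if $dict{lc $word};
         print {$add} lc($word), "\n";       # note: no phonemes yet for the new word
         $dict{lc $word} = 1;
     }
 }
 close $trans;
 close $add;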
Plan
3/29 - Work on updating the script to add words to the dictionary instead of failing the check.
4/3 - Run a train and decode with the updated script and compare the results of the old scripts with the new ones.
Concerns
3/29 - Hopefully we won't have issues updating the script. It should be a simple update so my concern for this is low.
4/3 - The updated script should bring a large improvement to the word error rate; however, I need to actually run a train and decode to verify the changes.

Week Ending April 11, 2017

Task
4/9 - Was extremely busy this week and weekend due to work and a programming competition in New York. Going to work on scripts and other work for the week tomorrow and Tuesday night after school.
4/10 - The task I had for Sunday (the one day this week I was able to work on this) was to update the scripts so that the partial words produced from the transcript get added to the dictionary.
Results
4/9 - Did not get much done and planned for future work.
4/10 - I worked on the scripts for a while and did a lot of testing but was unable to get the script working 100%. Even after finding partial words and adding them to the dictionary file, the script that cross-checks the transcript with the dictionary was still throwing errors about the missing items. I may need to add a completely new script that adds the partial words to the dictionary before the check between them occurs. If I can get this working, then I can run a train and decode and see if the updated regular expressions improve the WER.
Plan
4/9 - Do week's tasks after work
4/10 - Continue working on scripts and figure out a way to fix the errors that are being thrown when the transcript checks against the dictionary. Find a good way to add the partial/new words to the dictionary.
Concerns
4/9 - Still not having the time to do the work I need to do.
4/10 - While I may be able to add the words to the dictionary, I'm worried that those words will never be properly recognized because adding the words does not add the phonemes associated with them. I'm concerned that these partial words may end up causing more issues than they fix.

Week Ending April 18, 2017

Task
4/12 - Work on getting the scripts working so a train and decode can run with the updated regular expressions.
4/17 - Continue looking into scripts and try and get a train and decode working with the new data
4/18 - Continue investigating
Results
4/12 - Tried many different variations on the scripts but could not get a train and decode working. The train and decode would constantly spit out error messages about items being in the transcript but not in the dictionary even when we manually placed missing words from the transcript into the dictionary.
4/17 - Still unable to come to a solution for the scripts. The train and decode process is still throwing errors during the verification stage, where it compares the transcript to the dictionary to make sure all the words in the transcript are in the dictionary.
4/18 - I was finally able to get past the verification step of the train and decode process by manually adding, in their proper places, the missing items the verification step was complaining about. I had to debug the perl script verify_all.pl by placing logs throughout the entire program to see where it was breaking. It turns out the program failed at the very end when it checked its current status; the status was being set incorrectly earlier in the program, so fixing that resolved the issue. The train and decode then started, but it threw errors and could not be completed. You can see the results of the train and decode I attempted in the experiment located here: /mnt/main/Exp/0298/016.016.html. The 016.html has all the attempts of that train and decode, and near the bottom of it you will see the errors the train and decode spat out.
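The debugging itself was just a matter of printing the running status after each step so the failing step becomes obvious; the step and variable names in this tiny illustration are invented, not the real ones from verify_all.pl.
 #!/usr/bin/perl
 # Tiny illustration of the approach: warn after each phase so you can see
 # exactly where the status flips. Names are invented for this sketch.
 use strict;
 use warnings;
 
 my $status = 'passed';
 
 check_dictionary(\$status);
 warn "DEBUG: after dictionary check, status = $status\n";
 
 check_transcript(\$status);
 warn "DEBUG: after transcript check, status = $status\n";
 
 die "verification FAILED\n" unless $status eq 'passed';
 print "verification passed\n";
 
 sub check_dictionary { }   # stand-ins for the real checks
 sub check_transcript { }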
Plan
4/12 - Continue investigating scripts and try and find a solution to the issue.
4/17 - Continue investigating
4/18 - Look into errors generated during train and decode process. Once a train and decode can be fully run I can compare the new regular expressions to the old ones and see if there is a measurable improvement to the WER.
Concerns
4/12 - At this rate, we may not be able to use the new regular expressions in our data. If we are unable to, I don't think the WER would have changed too much with these changes, but it would still be a measurable improvement, even if only 1%.
4/17 - Same concern as before.
4/18 - If I cannot get this working then we will have to revert to using the old regular expressions (basically same concern as before)

Week Ending April 25, 2017

Task
4/25 - Continue investigating scripts and other issues and determine if it is feasible to have the data ready by the end of next week.
Results
4/25 - I was very busy this week due to work, other classes, and some family things I had to take care of, so I wasn't able to work on this for more than one day. After investigating the regular expressions and the data they produce in the transcript, keeping partial words may not be a possibility. For example, an entry in the transcript like prev[ious]- used to be completely removed from the transcript. With our updated regular expressions, however, the word "prev" is kept. This causes a few issues during the verification step of a train and decode: the word "prev" is not in the dictionary and throws an error. One would think this could be fixed by simply adding the word to the dictionary, but that is not enough. The word also needs the phonemes that tell the train and decode how it is actually pronounced (see the example entries below). These CANNOT be auto-generated and must be manually created for partial words. Manually adding all the partial words to the dictionary along with their respective phonemes would take a ridiculous amount of time, and it may be better to completely remove partial words. Finally, there seems to be some kind of automated process that grabs the words used in the train and decode from the dictionary and puts them into a local copy in the experiments folder. This causes issues because that automated process only works with the old regular expressions, since it knows what words to expect; even manually adding the missing words to the dictionary does not resolve the issue. Overall, a lot of work would need to be done to get the regular expressions working correctly, and we simply may not have the time at this point in the year to finish it. This might be something that next year's Data group can work on, as they can focus on it from the start of the semester (*wink* *wink* *nudge* *nudge*)
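For context, a dictionary entry pairs a word with its phonemes (ARPAbet-style symbols), roughly like the lines below (stress markers omitted); the second line is the kind of entry that would have to be written by hand for a partial word, and its pronunciation here is just my guess.
 PREVIOUS  P R IY V IY AH S
 PREV      P R EH V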
Plan
4/25 - Continue investigating and try a train and decode with the other improvements we made, excluding the partial words. Continue trying to get a train and decode to run using the new regular expressions.
Concerns
4/25 - This may still not work because, for example, our new regular expressions turn "[laughter-process]" into "process", and the words kept this way may still not be in the dictionary, which will continue to cause the same issues.

Week Ending May 2, 2017

Task
5/2 - Look into the feasibility of generating the phonemes of partial words on the fly during the train and decode verification step. Move the updated scripts into their own folder and make a readme file describing what was changed and what the point of the updates was.
Results
5/2 - From my testing and work with the verification scripts in the train and decode process, it will not be very feasible to generate the phonemes for the words on the fly. To generate phonemes on the fly we would have to use a fully trained model, and doing that during the verification process, especially on larger trains, could really slow down this normally quick step.
Plan
5/2 - After more research I discovered this page (http://cmusphinx.sourceforge.net/wiki/tutorialdict) on Sphinx's website. It could potentially be used to generate the phonemes for the words, but it must be run on a trained model. We may have to have the Tools group install it on Caesar if it is not already installed. I'm going to try creating a script to pull all the partial words out of the transcript (a rough sketch is below) and then run the tool on them to generate the phonemes. If that is successful, I will add those entries to the dictionary file and see if a train and decode can finally be run. I will then compare the WER before the regular expression updates to the WER after.
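A first cut at the extraction part might look like this; the raw-transcript and output file names are placeholders, and it only targets the cut-off-word patterns from the table a few weeks back.
 #!/usr/bin/perl
 # Sketch: scan the raw transcript for cut-off words (wo[rd]- and -[wo]rd
 # markings), collect the fragments our new rules keep, and write them out
 # one per line so a grapheme-to-phoneme tool can be run over the list.
 # File names are placeholders for this sketch.
 use strict;
 use warnings;
 
 my %partials;
 open(my $raw, '<', 'raw.trans') or die "cannot open raw transcript: $!";
 while (my $line = <$raw>) {
     $partials{lc $1} = 1 while $line =~ /(\w+)\[[^\]]*\]-/g;   # wo[rd]-  keeps "wo"
     $partials{lc $1} = 1 while $line =~ /-\[[^\]]*\](\w+)/g;   # -[wo]rd  keeps "rd"
 }
 close $raw;
 
 open(my $out, '>', 'partial_words.txt') or die "cannot write word list: $!";
 print {$out} "$_\n" for sort keys %partials;
 close $out;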
Concerns
5/2 - I'm not sure if the new approach I found for generating phonemes is feasible, especially with what little time we have left in the class. It is something I will work on, but I don't think I will be able to complete it before time is up.

Week Ending May 9, 2017

Task


Results


Plan


Concerns