Speech:Spring 2016 Justin Gauthier Log



Week Ending February 9, 2016

 * Task:

Going through online unix command tutorials. I have also gone through the dict and noaa directories to make myself comfortable with file locations and where all the files necessary to run experiments are. Looking through other teams' log files. Searched through the switchboard directory.

 * Results:

There seem to be four massive dictionary files containing phonemes for thousands of words.

/mnt/main/corpus/dict

There are also .wav and transcript files located in the noaa directory.

/mnt/main/corpus/noaa/40min_split/adapt/audio/wav
/mnt/main/corpus/noaa/40min_split/adapt/trans/transcript.txt

Searching through the switchboard directory, I eventually found .wav and transcript.txt files. It seems like the files in the corpus directory all rely on each other to run experiments. These locations will be documented for easy access.

 * Plan:

The sheer number of files and directories is going to be an issue, which is why we are familiarizing ourselves with the file locations so no one has to search for them later.

 * Concerns:

Week Ending February 16, 2016

 * Task:

Reviewed logs of all users to get caught up for Wednesday. Since I was gone last week I wanted to get familiar with what my teammates have done. Looking over Brian A's log file gave me a solid guideline for my work today. Going through and locating .sph files and figuring out how to convert them to .wav, giving us the chance to review them and see if they are worth using for testing. Also trying to understand soft links. There is a concern about the number of .sph files there are, so I am going to see if there is a way to convert multiple .sph files at a time. My goal Wednesday is to look at the proposal written by some of the group, understand the reasoning behind it, and help write the proposal next week.

 * Results:

I converted sw2001A-ms98-a-0002.sph to a .wav. Two women were having a conversation about what type of clothes to wear for work. It was ten seconds long, which wasn't too bad, but there are a massive number of them to go through. It did, however, sound clear and had good dialog. I have also successfully created a script that lets you convert as many .sph files into .wav files as you want. I have created a sphFolder and a wavFolder so the files stay separated after the script finishes running. I will share this script with my team and whoever else would like to use it tomorrow. I also think soft links are a genius concept and should be implemented throughout the directory for easy access.

 * Plan:

Need to discuss which parts of the directory need soft links. I also want to talk to the team about the proposal so I can do my part of it this week. The final thing I would like to talk about is which .sph files we should convert and listen to, as well as priorities for next class.

 * Concerns:

Missing class last week. Not being a part of the proposal so far because of my personal situation.

Week Ending February 23, 2016

 * Task:

During class time we received our goal for the semester, which is to verify that the .trans and .sph files correspond with each other. This has not been done yet, so no one knows whether the tests we have been running use correct data. Our goal during class was to figure out a way to grab 120 of the first 30,000 utterances so we can divide them among the four of us and listen to them. I have also created a script to convert .sph to .wav, which I will explain how to use in the results section.

Find all 30 .sph files that I was assigned. Convert all of them to .wav. Find a more convenient way to get .sph files from caesar to my personal PC.

Listen to all .wav files and compare them to the .sph files.

 * Results:

The unix command I used to find every 250th segment is shown below.

head -n 30K 001_train.trans | awk '!(NR%250)'

In order for this batch file to work you are going to have to place it in your sox folder. Mine is located in C:\Program Files (x86)\sox-14-4-2. You are also going to want to create two folders in the same location: sphFolder and wavFolder. After you have done all of this, open CMD, use the cd command to get to the location of the batch file, and simply run it from there. I have attached a screenshot of what it should look like when you run it. If you have any questions let me know.

NOTE: You are going to have to run Notepad as an admin before you create the batch file in order to save it in the sox folder.

Batch file content:

cd %~dp0
for %%a in (sphFolder\*.sph) do sox "%%~a" "wavFolder\%%~na.wav"
pause
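For teammates working directly on caesar rather than Windows, the same conversion can be sketched as a small shell function. This is my own illustrative sketch, not part of the batch script above; it assumes sox is installed and mirrors the sphFolder/wavFolder layout.

```shell
# Convert every .sph in one folder to a .wav in another, mirroring the
# sphFolder/wavFolder layout of the batch script (assumes sox is on PATH).
convert_sph_dir() {          # usage: convert_sph_dir SRC_DIR DST_DIR
    for f in "$1"/*.sph; do
        base=$(basename "$f" .sph)   # strip path and the .sph extension
        sox "$f" "$2/$base.wav"
    done
}
```

Run it as, for example, `convert_sph_dir sphFolder wavFolder`.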

Finding all 30 .sph files took longer than expected, so I went searching for an alternative way to transfer files from caesar to my personal PC and found a backup option: PSCP.exe. This allows you to use CMD to transfer files from one place to the other.

1. Download PSCP.EXE from the PuTTY download page
2. Open command prompt and type set PATH=
3. In command prompt, point to the location of pscp.exe using the cd command
4. Type pscp

Here is an example of what you type into CMD to transfer a file, as well as the outcome.

pscp jmg2014@caesar.unh.edu:/mnt/main/corpus/switchboard/256hr/train/audio/utt/sw2256B-ms98-a-0074.sph C:\Users\justin\Desktop\
jmg2014@caesar.unh.edu's password:
sw2256B-ms98-a-0074.sph  | 28 kB |  28.7 kB/s | ETA: 00:00:00 | 100%

All .wav files matched the corresponding transcripts; however, a couple were low quality and I would not want to use them for testing/training.

 * Plan:

Locate all the .sph files that I am assigned for this week and review them against the segments of the .trans file that we are using. Use the PSCP file-transfer .exe to transfer all files from caesar to my personal PC from now on. This will save time and allow us to listen to more .wav files instead. Potentially remove unwanted .sph files.

 * Concerns:

The amount of time it is going to take to review the entire .trans file.

Week Ending March 1, 2016

 * Task:

Automate the process of finding the .sph files and corresponding utterances for review. This week we are testing how many utterances we can actually listen to on a weekly basis; as of right now we have set the weekly split at 60 utterances each. The main goal of the project is to review at least 1% of the .trans file, which is about 600 utterances per person. Finalize the Data group portion of the proposal. Find a way to search for the 256 missing utterance files. Located the 60 .sph files that I was assigned and converted all of them to .wav files. Will listen to some of them as well. Listen to the rest of the utterances after the last faulty one that I listened to.

 * Results:

The proposal is in its final stage before submission. Comparing the .trans and the .wav files, 43 out of the 60 did not match. The first 43 of the .wav files that I converted all said the same thing ("you know you can't even buy a loaf of bread in this country"). That statement was not in the portion of the .trans file that I am comparing the .wav files to. The remaining 17 .wav files matched their corresponding text.

 * Plan:

Locate all 60 .sph files that correspond to the .trans segments that we assigned to each other in class. Convert them to .wav files and compare the two. Finish the proposal at a reasonable time for final edit and submission. Pick random .sph files within the 60 .sph files that I grabbed before and see if those say the same thing as well. Find out exactly how many .sph files are the same, and find that line in the .trans to figure out where the problem may have started.

 * Concerns:
 * The difficulty of finding the missing utterance files.
 * Listening to the assigned number of utterances in a week's time frame.
 * Are there more parts of the .trans that don't match up with the .sph files that we review?
 * How serious this find could be.

Week Ending March 8, 2016

 * Task:
 * Figure out the starting and ending point of the faulty .sph files.
 * Grab the next set of the 30K files for the team to look over this week.
 * Take the good 30K files we have checked already and create a new corpus with that for testing.
 * Search for the 11K+ missing .sph files in 125hr while looking for the same error.
 * Grab 62 of my assigned .sph files.
 * Take a look at the 125hr corpus and make sure that the error stated in the 256hr corpus is not found.
 * Go through my 62 .sph files and check if they correspond with the .trans on the 256hr corpus.


 * Results:
 * Going through the 256hr utterance files it looks like the fault starts at sw2333A-ms98-a-0166 or line 32602 of the .trans file and ends at sw2416B-ms98-a-0143 or line 43760 of the .trans file. The total amount of unusable .sph files totals 11,158. This is part of the reason why our WER is 40%.
 * The next set of utterance files to check is between lines 60K and 90K. In total the group has 245 utterances to go through this week, which we split up evenly.
 * The modelling group created a new corpus with the 32K good utterance files with the corresponding .trans so they can run an accurate test this week.
 * Quickly searching through the 125hr train, I did not find any bad utterances. It did, however, make me wonder how the 125hr corpus was actually made, because its .sph files are different from the 256hr corpus's.
 * Grabbed all 62 files that we assigned each other in the Data group. Will review tomorrow.
 * While searching through the 125hr corpus I realized that the .sph file names are not the same in each corpus. This makes me want to look into how the names of these .sph files came to be and the script that was potentially used to create them. This could give us a better understanding of why we are beginning to find corrupt data.
 * Found more corrupted files in my 62 .sph file set and located the first utterance at which the repeat starts (sw2657A-ms98-a-0114, line 75007).
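Since the corrupt files all repeat the same audio, one quick way to spot them without listening is to look for byte-identical files. This is a sketch of my own (the helper name and approach are mine, not part of our existing tooling); it assumes the utterance files sit in one directory, as in /mnt/main/corpus/switchboard/256hr/train/audio/utt.

```shell
# List groups of byte-identical .sph files so repeated utterances can be
# spotted without listening to each one.
find_dup_sph() {             # usage: find_dup_sph DIR
    # md5sum prints "checksum  filename"; sorting groups identical checksums,
    # and uniq -w32 (compare only the 32-char checksum) with
    # --all-repeated=separate prints every member of each duplicate group.
    md5sum "$1"/*.sph | sort | uniq --all-repeated=separate -w32
}
```

An empty output means no two files in the directory are byte-for-byte identical.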


 * Plan:
 * Go through the 125hr corpus and try to locate missing .sph files and understand how it was created.
 * Find the end of the repeated utterance.
 * Figure out how the 256hr corpus was actually created to have a better understanding as to why these repeated .sph files keep occurring.
 * Concerns:
 * How many other corrupt .sph files there are in the 256hr corpus.
 * Brenden found more corrupt data in his set of 60 .sph files, and as described earlier in an email, we came to realize that there is a total of 311 hours in the 256hr corpus. The main concern now is how many files in the 256hr corpus are corrupt compared to correct.

Week Ending March 22, 2016

 * Task:
 * Go through the model building tutorial and run my first train. Read through everything and understand how everything works.
 * Check to see if training completed successfully. Then create a language model and run a successful decode.
 * Looked over teams logs and email from this week.


 * Results:
 * As of right now the train is running. I have gone through all of the steps. The last step is interesting, however, because two parts of the explanation don't add up. It states, "Note: the nohup and .& allow you to essentially 'disconnect'," yet the actual command at the bottom of that entry is written without the period to the left of the ampersand. I started my train using the command with the .& at the end, and as of right now it reports a failure but still seems to be running.

Training failed in iteration 1 Something failed: (/mnt/main/Exp/0284/002/scripts_pl/20.ci_hmm/slave_convg.pl)
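For reference, the disconnect pattern the tutorial describes can be sketched as follows. This is my own illustrative sketch, not the tutorial's actual command: a short sleep/echo job stands in for the real training script, and the log file name is mine.

```shell
# The nohup-and-& pattern: nohup detaches the job from hangup signals so it
# survives logout, '&' backgrounds it, and stdout/stderr go to a log file.
# (sleep/echo stand in for the actual training command.)
nohup sh -c 'sleep 1; echo done' > train.log 2>&1 &
job=$!
# In real use you would now log out; here we wait so the log is complete.
wait "$job"
grep done train.log
```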


 * Train completed successfully overnight. Created a language model successfully as well as running a successful decode. Even though the train was only five hours, I feel like the error rate is still too high.


 * Still searching for the reason why the size of the current corpus is 258 hours and the manual behind the switchboard we used says "over 240hrs."


 * Plan:
 * It still seems to be running after the failure so I am going to run the train overnight and check for errors tomorrow.
 * Get a better understanding of running an experiment and try to make some changes to get better overall results.
 * More research on current switchboard.
 * Concerns:
 * The train is not going to finish. If that is the case I will go back to previous logs and review my mistake.
 * Any potential errors that can be introduced when creating/running an experiment.
 * Not finding the answer to switchboard size discrepancy.

Week Ending March 29, 2016

 * Task:
 * Understand the logic and how to use the three scripts that Jon made to create the new corpus.


 * Go through the train.trans in the full corpus and grab every 1000th utterance to review. We decided to go through every 1000th because the number of error files we ended up finding reached over 10,000.


 * Grabbed all 62 .sph files that I was assigned from the full .trans.


 * Listen to all 62 of the .sph files that I assigned to myself as well as comparing those to the corresponding .trans utterances.


 * Results:
 * Grabbed every 1000th utterance for review and divided them up between the Data group.
 * Justin - 1-62
 * Brian A.- 63-125
 * Brian D. - 126-188
 * Brenden - 189-250


 * While transferring the files over to my PC, I did notice that all of the file sizes were different which seems to be a good sign.


 * All audio files that I listened to were of above-average quality, with no errors when compared to the utterances in the .trans. There were, however, a couple of audio files with other people talking behind the main speaker.


 * Plan:
 * Review scripts.
 * Grab specific sph files according to the utterances that I extracted from the .trans.
 * Compare the 62 .sph files to the parts of the .trans that was extracted.
 * Concerns:
 * Potential errors within the full corpus.
 * Bad audio.
 * Errors in sections of the corpus that we did not review.

Week Ending April 5, 2016

 * Task:
 * Look into CFG_VARNORM and understand the (yes/no) setting that normalizes the variance of input files to 1.0.
 * Review the information from our first train.
 * More research regarding CFG_VARNORM
 * Grab every 1000th utterance between lines 250,000 and 300,000.
 * Create a train with CFG_VARNORM enabled and run an experiment to test results.
 * Create a train with CFG_VARNORM disabled and run an experiment to test results.


 * Results:
 * As of right now I am unable to find documentation describing why you should or should not normalize the variance of input files to 1.0. From the websites I have looked at so far, it seems like everyone who runs a train sets CFG_VARNORM = no. Will look into this more tomorrow.
 * Results from the first train were not what we were expecting, since the 145hr train ended up with a lower WER than what we got on our 300hr train.
 * Still could not find much information on CFG_VARNORM except for the fact that everyone seems to run a train with it disabled.
 * Grabbed 50 utterances and the .sph files that match.
 * Currently running a train on the first_5hr corpus with CFG_VARNORM enabled.
 * After reviewing both experiments with VARNORM enabled and disabled I ended up with some interesting results that I have already told my team. I feel like this finding might help our WER on the 311hr corpus.
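For reference, this switch lives in SphinxTrain's etc/sphinx_train.cfg. The fragment below is illustrative only (surrounding settings omitted; the value shown is the one every example config we found uses), not our full configuration:

```perl
# etc/sphinx_train.cfg (fragment, illustrative)
# Normalize the variance of input feature files to 1.0 (yes/no).
$CFG_VARNORM = 'no';   # example configs we found leave this disabled
```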


 * Plan:
 * More research to try and uncover the proper setting for CFG_VARNORM.
 * Finish experiment with CFG_VARNORM enabled and then run another one with it disabled to compare results.
 * Concerns:
 * Unable to solve VARNORM.
 * Unable to get a better result on the 300hr train.
 * Same WER when comparing the two experiments.
 * The results I found from my two experiments on the first_5hr corpus do not help WER on the 311hr corpus.

Week Ending April 12, 2016

 * Task:
 * Do my part of the poster which is the methodology or how the Data team did our specific jobs from week to week.
 * Research how to use the sclite -o lur function.
 * Need to understand how to create a .stm file for proper use of the sclite -o lur function.

Once we had all of the audio files we needed to listen to, we would compare them to the utterances we exported in the beginning, to test whether or not we were actually using good data.
 * Results:
 * Finished my part of the Data team poster.
 * In order to grab the data needed to compare the utterances in the 256hr corpus with the .sph files, we needed a command that would cut the corpus down to 30,000 lines and then grab every 250th (later every 125th) utterance in that range.
 * head -n 30K 001_train.trans | awk '!(NR%250)'
 * head -n 60K 001_train.trans | tail -n 30K | awk '!(NR%125)'
 * From there we would use an executable known as pscp in command prompt to grab each .sph file that matched the utterances from the previous command.
 * pscp -pw ****** jmg2014@caesar.unh.edu:/mnt/main/corpus/switchboard/256hr/train/audio/utt/sw2256B-ms98-a-0074.sph C:\Users\justin\Desktop\sphfolder
 * After we exported all of the .sph files that we needed to listen to we then created a script that uses SOX to convert the .sph files to .wav files.
 * cd %~dp0
 * for %%a in (sphFolder\*.sph) do sox "%%~a" "wavFolder\%%~na.wav"
 * pause
 * Later on we found that we could simply listen to the .sph files using VLC player.
 * After researching the sclite -o lur I decided to test it with one of my experiments.
 * sclite -r 002_train.trans -h hyp.trans -i swb -o lur
 * This ended up creating a table but provided no results.
 * After doing more research, it looks like a .stm file is going to need to be used, or the file will need to be formatted like one.
 * Example of a .stm file structure:
 * 2345 A 2345-a 0.10 2.03 uh huh yes i thought
 * 2345 A 2345-b 2.10 3.04 dog walking is a very
 * 2345 A 2345-a 3.50 4.59 yes but it's worth it
 * http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/infmts.htm#ctm_fmt_name_0
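The two head/tail/awk commands above are instances of a general pattern for sampling every Nth line of a line range. A sketch (the function name and parameters are mine, not part of our tooling):

```shell
# Print every Nth line from lines FIRST..LAST of a file.
sample_lines() {             # usage: sample_lines FILE FIRST LAST N
    file=$1 first=$2 last=$3 n=$4
    # head keeps lines 1..LAST, tail -n + drops lines before FIRST,
    # and awk prints every line whose position in the range is a multiple of N.
    head -n "$last" "$file" | tail -n +"$first" | awk -v n="$n" '!(NR % n)'
}
```

For example, `sample_lines 001_train.trans 30001 60000 125` reproduces the second command above (with exact line counts instead of head's K suffix).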


 * Plan:
 * Search for a way to create or alter information that we already have to create a .stm file with the proper format.
 * Possible new script needed to create the .stm file.
 * Concerns:
 * Having difficulty finding information regarding sclite -o lur and .stm.

Week Ending April 19, 2016

 * Task:
 * Try to run a score after the decode on the full corpus to test our data.
 * Researched information regarding FINAL_NUM_DENSITIES and CFG_HMM_TYPE = ptm.
 * Try to setup a decode which scores the error rate of each utterance compared to the same .sph file.


 * Results:
 * At first I tried to use the ***_train.trans and the hyp.trans to calculate the score using the -o lur function in sclite. The first attempt produced a score table, but it only contained the title of each column. Taking a closer look at the notes from my previous log, it looks like our train.trans and hyp.trans are not in the proper format. According to the notes at http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/infmts.htm#ctm_fmt_name_0, the files have to be in the same format as a .stm file (example below). I thought the train.trans file we have in the corpora looked similar to that file structure, so I used it, and it gave me the following errors: "Error: double reference text for id ''" and "Error: Not enough Reference files loaded".


 * 2345 A 2345-a 0.10 2.03 uh huh yes i thought
 * 2345 A 2345-b 2.10 3.04 dog walking is a very
 * 2345 A 2345-a 3.50 4.59 yes but it's worth it
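Per the sctk input-formats page linked above, the STM fields break down as follows (annotation mine, applied to the first example line):

```text
;; STM segment record layout:
;; <waveform-file> <channel> <speaker-id> <begin-time> <end-time> [<label>] transcript
2345  A  2345-a  0.10  2.03  uh huh yes i thought
```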


 * Information for team use only.
 * It looks like the file type or structure needed is based on a .stm file, whose structure is represented above. What I don't fully understand is what information is required, and where to get that information to create it.


 * Plan:
 * More research will be needed.
 * Potentially create a script from the information we receive from the decode in order to create a file structure similar to a .stm.


 * Concerns:
 * Will not be able to get the -o lur command to work.
 * Not completely understanding the file structure of a .stm file which could delay the scoring process of the full corpus.

Week Ending April 26, 2016

 * Task:
 * Try to get a better understanding of how sclite's -o lur command works, as well as the file structures it needs.
 * Searching for more information regarding the lur command.
 * More research on -o lur. Check whether the decode on half of the full corpus on Asterix is finished and, if it is, run a proper score.


 * Results:
 * USING STM FORMAT FOR LABELED UTTERANCE REPORTS (LUR):
 * Motivation:
 * For the Fall '95 ARPA CSR Evaluation, it was desirable to report not only overall error-rate statistics but also error-rate statistics for arbitrary partitions and/or groups of partitions within the test set. To this end, the STM file format was extended to encode arbitrary subset information for each segment.
 * Usage:
 * The subset information is encoded by adding two types of information into the STM file. The first information type is a special comment line, the subset information line (SIL). The SIL defines the subset's label id, a short column heading, and a description. The special comment line format is:
 * LABEL "<id>" "<column head>" "<description>"
 * where:
 * <id>
 * The subset id. Used to label each segment that belongs to the subset. The format is arbitrary, but without spaces.
 * <column head>
 * Used as column headings in generated reports. Format is arbitrary.
 * <description>
 * Used for subset descriptions in generated reports. May be of arbitrary length and format. Double backslashes '\\' add a line feed.
 * The order of the SIL lines in the STM file defines the order of subset presentation in the generated reports. The second type of information incorporated into the STM file is an optional sixth field in the text segment record. The field consists of a comma-separated list of subset ids enclosed in angle brackets. Each unique id must have a special comment line, specified above, to be properly interpreted; otherwise the id will be ignored.
 * Each position within the label field, separated by commas, defines a group of subsets that are presented separately in the generated reports. So, for instance, the first group might be all segments, the second might be either male or female, and the third might be the story. The example below shows an STM file encoded with this information.


 * LABEL "M" "Male" "Male Talkers"
 * LABEL "F" "Female" "Female Talkers"
 * LABEL "01" "Story 1" "Business news"
 * LABEL "00" "Not in Story" "Words or Phrases not contained in a story"
 * 940328 1 A 4.00 18.10  FROM LOS ANGELES
 * 940328 1 B 18.10 25.55  MEXICO IN TURMOIL


 * -o lur output example: http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/outputs.htm#outputs_lur_name_0
 * -o lur example: https://catalog.ldc.upenn.edu/docs/LDC97S66/H496EVSC.TXT


 * The .stm file is the reference file and the .ctm is the hyp file. The file structures for each seem to be different.
 * More info. found here: http://www1.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm


 * The decode on half of the corpus is still running. Will have to wait and run the score sometime next week.


 * Plan:
 * Look for examples of how -o lur is used.
 * Would like to find an example of a .stm file structure.
 * Run score after Asterix is finished decoding half of the full corpus.
 * Concerns:
 * Won't be able to find more information about sclite.
 * Very little information regarding -o lur.
 * Can't seem to find examples of -o lur needed file structure.
 * Not knowing the exact time that the decode will end on Asterix.

Week Ending May 3, 2016

 * Task:
 * Create a script that uses the hyp.trans.pra file and calculate the word error rate of each sentence in the full corpus.
 * Work on the script and try to figure out all of the parts needed for it to function properly.
 * The script did not seem to function properly in Windows, so I decided to start the process over in my Ubuntu VirtualBox, which was a more difficult process than I thought.


 * Results:
 * It looks like the decode is still running on the full corpus. I grabbed a test hyp.trans.pra to test the script against while I write it. The script is going to create a log file with all of the necessary information. We also figured out that sclite seems to score the start and stop of each utterance. Still need to do more research to see whether we need to keep the start and stop or whether we can get rid of them.
 * The script will output the WER of every sentence, as well as each reference sentence and hyp sentence, so we can understand what is actually being delivered when the full corpus is decoded. The decode still looks like it is running on Asterix. To me this is a good thing, because I still haven't finished the script to score this corpus properly.
 * I installed Komodo IDE to use for testing and editing purposes because I have used it briefly before. I then grabbed a mini script to test whether I had everything, and I didn't seem to have some of the corresponding modules. I installed cpan as well as the modules needed for the script to run, but I don't 100% know what the problem is, so I will continue to work on it tomorrow in class.
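The core calculation the script needs can be sketched in shell. This is my own sketch, not the Perl script itself; it assumes sclite's .pra output carries lines of the form "Scores: (#C #S #D #I) c s d i", with per-sentence WER = (S + D + I) / (C + S + D), where C+S+D is the reference word count.

```shell
# Print the WER of each scored sentence in a sclite .pra file, assuming
# "Scores: (#C #S #D #I) <corr> <sub> <del> <ins>" lines.
wer_per_sentence() {         # usage: wer_per_sentence FILE.pra
    awk '/Scores:/ {
        # $1="Scores:", $2-$5="(#C #S #D #I)", so the counts are $6..$9
        c = $6; s = $7; d = $8; i = $9
        printf "utt %d WER: %.2f%%\n", ++n, 100 * (s + d + i) / (c + s + d)
    }' "$1"
}
```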


 * Plan:
 * Keep working on scoring script and check to see if start and stop is not needed.
 * Keep working on script and hope that the decode will finish by class on Wednesday.
 * Work on script tomorrow in class with help from team.


 * Concerns:
 * Creating a Perl script might take a little longer than expected because I have to learn the language as I go.
 * Decode won't finish in time.
 * Script will not work properly.
 * Did not research start and stop.
 * Perl script hard for me because I am a beginner.

Week Ending May 10, 2016

 * Task:
 * Run command that scores each utterance in the full corpus.
 * Add information to the wiki documenting what we did and how we did everything throughout the semester as well as potential next steps for the following capstone.


 * Results:
 * After the decode finished I opened the etc directory and didn't find any .trans file. After talking with Brenden, he ended up getting the proper .trans file for each of the decodes that were run. I then went into the DECODE directory in each of the experiments (007 and 008). I needed to run parseDecode as root; I am guessing that is how the decode was run in the first place. The decode ended up creating four decode.log files in each experiment (decode_1.log, decode_2.log, decode_3.log and decode_4.log). After finding these, I had to create four hyp.trans files with the same naming structure as the decode.log files. Once I finished that, I was able to run the command sclite -r 008_train.trans -h hyp_1.trans -i swb -o all lur for both experiments and all the hyp.trans files. The total WER for this decode was 41.65%. The scoring of each sentence is spelled out in the proper experiment directory, because the tables were too big to put in the wiki.


 * All information about the two decodes that were run is in the following directories:
 * /mnt/main/Exp/0284/007/etc
 * /mnt/main/Exp/0284/008/etc


 * Added information to the data group page as well as information regarding the data group in the final report.


 * Out of all of the hyp.trans files there was only one utterance with an error rate that was over 80%: sw2640b.


 * Plan:
 * Document findings so next year's data group can easily find the information needed for whatever task they are given.
 * More documentation will be needed.


 * Concerns:
 * None