Speech:Spring 2016 Brenden Collins Log


Week Ending February 9, 2016
 * Task:

2/7: After our class session we decided as a group to familiarize ourselves with the system environment, navigating directories and locating the data we are responsible for during this capstone project. Prof. Jonas noted during class that some data may be missing from experiments, since the transcripts that come out repeat some of the same lines in place of the correct ones. By familiarizing ourselves with the data being used we should hopefully be able to resolve this problem. Previous semesters changed file directories and the links between them, so we hope to reinvestigate the data that was moved, and the links, to ensure that everything is still properly connected. Prof. Jonas also talked about finding what an accurate error percentage is for the Switchboard corpus, and he hopes for training on unseen data this semester, as it has never been done before.

2/8: Read through other students' logs for this semester to see how everyone is getting started with the project, as well as my own team's logs. Change the password for my own user account. Look into the Experiment group a bit to see exactly how they are using data.

2/9: Try to understand Sphinx a little better.

 * Results:

2/7: I've kept reading through previous semesters' logs to get an idea of how the system as a whole is put together, as well as navigating the file directories step by step to verify what the previous semester has said. So far so good; everything is where they say it is. I've taken notes to give me a couple of shortcuts and better familiarize myself with the file structure, which will come in handy until I'm comfortable enough to work from memory.

2/8: Password change was easy enough. Now I'll be able to use my own account instead of root to log into caesar. I had some issues logging in with ssh before, so I am using the Pulse Secure VPN and then ssh'ing into caesar.unh.edu from there. Still reading the previous semester's logs about the Experiment group.

2/9: Watched a couple of videos on YouTube to get a better idea of what Sphinx does and how it uses data. I still feel I need a clearer understanding. I'm looking forward to meeting with the team again to see what questions we all have and what we've found out together.

 * Plan:

2/7: The plan for this week is to continue checking data files and to learn about linking in the UNIX environment, to see how different file directories can point to each other.

2/8: Further look at linking in UNIX. Check out the Experiments directories to see where the data is being linked.

2/9: Meet with the team tomorrow and share what we have discovered during this first week.

 * Concerns:

2/7: The concern I have this week is the number of data files that exist for this project. So far it seems there are tons and tons of files, which will keep us very busy if it gets to the point of manually having to read every file to identify its contents. There may be an easier way, but our group will have to figure that out within the coming weeks.

2/8: The only concern is still trying to wrap my head around the project as a whole while also focusing on my own tasks.

2/9: A lot of information to grasp for this project.

Week Ending February 16, 2016
 * Task:

2/11: Convert .sph files from the corpus to .wav to begin validating that the audio files match the transcripts. Work on ideas for the proposal.

2/14: On top of previous tasks, Professor Jonas said that a paper was published with a baseline for the Switchboard corpus. I am going to locate that paper to see what the baseline is, taking notes on any settings that allowed them to reach it. That might allow us to tweak the settings currently being used to see if we, the capstone project, can reach that same baseline.

2/15: Find a documented baseline for the switchboard corpus. Read about the training and decoding process and how and what data is used.

2/16: Look through documentation to check what WER (word error rate) has been achieved in past semesters.
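As a reference for the metric I'm checking: WER is commonly computed as substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. A toy calculation with invented counts (not numbers from any of our runs):

```shell
# Toy WER arithmetic with made-up counts -- illustration only.
# S = substitutions, D = deletions, I = insertions, N = reference words.
awk 'BEGIN { S = 5; D = 3; I = 2; N = 100; printf "%.1f%%\n", 100 * (S + D + I) / N }'
# prints 10.0%
```

So 10 errors against 100 reference words gives a 10% WER; a decode can exceed 100% when insertions pile up.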

 * Results:

2/11: I know Brian A. has figured out how to convert files, but I am having some issues. Once we all get on board with converting files, we can break them up so that each of us can check their accuracy. We will have to decide on a random sample size of the audio files, because trying to do all of them would take forever.

2/14: Was finally successful in installing sox and setting the environment variable to get it to work. I converted sq02001.sph to a .wav file and am now listening to it. It's two ladies talking about clothing they are allowed to wear at work. It is roughly 4 minutes long, so I'm wondering if every other .sph file is roughly the same length. If the data group evaluates all of this audio against the transcript data, we would each have about 64 hours of data to check. We might instead take a random sample, possibly 50% of the data, and check that.

2/15: Found a paper published by Mississippi State University discussing a baseline for the Switchboard corpus. The paper can be found at: http://groups.inf.ed.ac.uk/switchboard/reseg-swbd.pdf It appears that they made some big changes to the segmentation of the speech to better identify words and phrases. I'm not sure how our current files and utterances are broken up, or who originally broke them up. One idea in the paper is to identify disfluency (things like stutters, partial words, and laughter) and separate it from other words to improve the baseline. The paper states they were able to decrease the word error rate by almost 2 percent by resegmenting the data, which is something we might look into. I will search for further documentation from previous semesters' data groups to try to understand why the current files are segmented as they are.

2/16: Found some documentation published by David Meehan from 2014 about obtaining WER percentages, and about tweaking configurations in Sphinx to help decrease the WER. I will look further for any documentation about how the audio files were segmented, or whether anyone ever looked into resegmenting them.

 * Plan:

2/11: Keep reading about how to convert to .wav; send an e-mail to the group. Think of some ideas for the proposal.

2/14: I am going to locate the paper published for the baseline to see what data I can extract from it. I also plan to look at the data and how it can be divided among the four data team members, and to read over previous data group proposals to help prepare what we want to write for Spring 2016.

2/15: Check for any documentation about why the files are segmented the way they are: whether they came from the original disks like that, or someone segmented the 256 hours of conversation manually.

2/16: Speak with the group tomorrow and see what everyone found out this week. Also bring up how we might be able to look at the audio data and resegment it, if that is a possibility.

 * Concerns:

2/11: The number of audio files, and working on the proposal. As the data group we will have to complete our own section, but then we will have to meld it into the whole proposal with the other groups.

2/14: How we will be able to split up the data, as well as finding an accurate way to validate all the audio files. Also, preparing our proposal, due in two weeks.

2/15: There don't seem to be many updates from other groups/people.

2/16: Been sick so I haven't dug as deep as I wanted to this week. Will reconvene with the group tomorrow and we can talk about our plans. Also see previous concerns.

Week Ending February 23, 2016
 * Task:

2/18: Create a random sample of audio files to listen to and compare against the transcript. Work on the proposal.

2/19: Revise Proposal, convert audio files, prepare document to annotate findings.

2/21: Finish analysis of first chunk of 120 random samples from audio files.

2/22: Read over the proposal; brainstorm ideas for listing checked audio files, either in an Excel document or possibly on the data group page.

 * Results:

2/18: Successfully took a random sample and will use a script to convert the selected audio files to .wav format. Will document how this was done in a future update.

2/19: Revised the proposal and sent an e-mail to the group. Will wait for feedback on revision 2.0 for the data group. I'll also convert audio files tonight, pulling them from Caesar to my home machine.

2/21: The last full train run as an experiment was 0271/001. We have used the transcript files from there, as well as the audio from /mnt/main/corpus/switchboard/256hr/train/audio/utt. That path contains the utterance files for the entire 256 hour data set. The data group divided 120 random samples between the four of us to make evaluation easier. From my list of 30 audio files I didn't find any huge discrepancies. There were a couple of files that may have issues, which I will discuss with the team when we meet back up. If we agree the files in question are fine, then we can move on to the next batch to evaluate. The command to select the first random sample of 120 files was:

head -n 30000 001_train.trans | awk '!(NR%250)'

The head command takes the first 30,000 lines of the 001_train.trans file, and the awk command selects every 250th line (using the modulus of the line number) and outputs it. The /mnt/main/corpus/switchboard/256hr/train/audio/utt directory is very large, so to extract the .sph audio files from there I copied them to my home directory and then used FileZilla to pull the audio to my own laptop.

cp sw2157B-ms98-a-0038.sph /mnt/main/home/sp16/bpu4 is the command I used to copy audio files to my home directory.
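The every-250th-line selection above can be sketched on toy data; here generated numbers stand in for transcript lines, and the file name is only illustrative:

```shell
# Demonstrate the modulus-based sampling on generated data:
# awk '!(NR%250)' prints only lines whose line number is a multiple of 250.
seq 1 1000 > /tmp/demo_train.trans
head -n 1000 /tmp/demo_train.trans | awk '!(NR%250)'
# prints 250, 500, 750, 1000 (one per line)
```

With 30,000 input lines this yields 30000/250 = 120 selected lines, matching our sample size.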

2/22: Proposal doesn't look too bad. Will wait until Wednesday to hear Prof. Jonas' feedback and the class' feedback on what the next iteration will be.

 * Plan:

2/18: Revise the proposal and make notes about the audio files.

2/19: Convert audio files, create document.

2/21: Continue to look at proposal for possible changes, further document commands used to validate audio files.

2/22: Continue thinking about how we will document our audio findings.

 * Concerns:

2/18: No concerns for this week yet.

2/19: None.

2/21: None.

2/22: None.

Week Ending March 1, 2016

 * Task:

2/25: Begin analysis of the next section of audio files. Revise the proposal for the final grade. Think up an idea for how to properly record listened-to audio files.

2/27: Convert new batch of audio files to .wav. Work on proposal update.

2/28: Continue listening and documenting files.

3/1: See previous tasks above.


 * Results:

2/25: We worked in class to devise ways to more easily copy audio files from Caesar to our personal machines. A script may be the easiest way to do this, but for now we have some automation that helps us accomplish the task. We will look into writing a script to make this much easier in future weeks and future semesters.

2/27: Successfully converted next batch of .sph files to .wav, will listen and annotate shortly.

2/28: Not having any major issues listening to files. Everything seems to be going well.

3/1: Successfully converted and listened to all audio files this week. I did not find any issues with audio matching to transcripts.


 * Plan:

2/25: Continue copying .sph files and converting them to .wav to listen to and annotate. Make the changes to the proposal that were talked about in class.

2/27: Annotate files and revise the proposal for the final draft.

2/28: Further annotate files.

3/1: Reconvene with team tomorrow and go over our findings.


 * Concerns:

2/25: Getting a good grade on the finished proposal.

2/27: See previous.

2/28: See previous.

3/1: See previous.

Week Ending March 8, 2016

 * Task:

3/3: Excellent find by Brian A. and Justin G. this past week: they discovered a batch of duplicated audio files that didn't match the transcript in last week's random sample. These audio files range from sw2333A-ms98-a-0166 to sw2416B-ms98-a-0143 in the /mnt/main/corpus/switchboard/256hr/train/audio/utt directory. We will try to locate these audio files in other directories to ensure that they don't repeat the same audio over and over again. We also plan to continue with our overall random 1% sample, so this week I will convert and listen to a new batch of audio files. I will also try to locate a hypothesis transcript file from a 256hr train to check for duplicate transcript lines, which could point us toward duplicate audio files that need examining.

3/5: Finish copying and converting 62 audio files this week. Will need to compare them against the transcript file and document discrepancies. I will also begin the search for hyp.trans files.

3/7: Evaluate audio files.

3/8: Locate some hyp.trans files and check for duplicates.


 * Results:

3/3: Currently copying .sph files and converting to .wav for listening this week.

3/5: Finished conversion, will document findings shortly.

3/7: All 62 audio files in this week's evaluation are no good. None of the transcript data matches the audio recordings. Each of the audio recordings is the same file over and over again: "I WAS THAT'S EXACTLY WHAT I WAS GOING TO ASK YOU HAVE YOU EVER HAVE HAVE YOU BEEN DOWNTOWN RECENTLY." I will have to look back in the main transcript to see where this recording starts, to evaluate how many files are no good. I will also continue searching for hyp.trans files to look for duplicates. The following audio files are all the same:

"I WAS THAT'S EXACTLY WHAT I WAS GOING TO ASK YOU HAVE YOU EVER HAVE HAVE YOU BEEN DOWNTOWN RECENTLY."

(sw2672B-ms98-a-0024) (sw2673B-ms98-a-0029) (sw2674B-ms98-a-0009) (sw2675A-ms98-a-0016) (sw2675B-ms98-a-0107) (sw2676B-ms98-a-0089) (sw2678A-ms98-a-0027) (sw2679B-ms98-a-0067) (sw2680B-ms98-a-0031) (sw2680B-ms98-a-0129) (sw2681A-ms98-a-0083) (sw2682B-ms98-a-0104) (sw2684B-ms98-a-0034) (sw2684B-ms98-a-0124) (sw2685B-ms98-a-0103) (sw2687B-ms98-a-0007) (sw2687B-ms98-a-0106) (sw2688A-ms98-a-0086) (sw2690A-ms98-a-0024) (sw2691A-ms98-a-0013) (sw2692A-ms98-a-0021) (sw2692B-ms98-a-0124) (sw2693A-ms98-a-0133) (sw2694A-ms98-a-0039) (sw2695A-ms98-a-0100) (sw2696A-ms98-a-0028) (sw2697B-ms98-a-0009) (sw2697A-ms98-a-0106) (sw2698A-ms98-a-0079) (sw2699A-ms98-a-0055) (sw2700B-ms98-a-0038) (sw2701A-ms98-a-0043) (sw2702B-ms98-a-0010) (sw2703B-ms98-a-0009) (sw2703A-ms98-a-0102) (sw2704A-ms98-a-0059) (sw2705B-ms98-a-0045) (sw2706A-ms98-a-0017) (sw2707A-ms98-a-0008) (sw2707A-ms98-a-0123) (sw2708A-ms98-a-0047) (sw2709A-ms98-a-0034) (sw2710B-ms98-a-0038) (sw2711A-ms98-a-0024) (sw2711A-ms98-a-0127) (sw2712B-ms98-a-0048) (sw2712B-ms98-a-0156) (sw2713B-ms98-a-0064) (sw2714A-ms98-a-0061) (sw2715A-ms98-a-0067) (sw2715A-ms98-a-0175) (sw2716B-ms98-a-0086) (sw2717A-ms98-a-0065) (sw2718B-ms98-a-0059) (sw2719A-ms98-a-0033) (sw2720B-ms98-a-0009) (sw2721A-ms98-a-0034) (sw2722A-ms98-a-0031) (sw2723A-ms98-a-0037) (sw2723B-ms98-a-0140) (sw2724A-ms98-a-0076) (sw2725A-ms98-a-0089)

3/8: I was able to locate a couple of hyp.trans files in Exp 271. The sub-experiments that had these files were 003/etc/hyp.trans, 003/etc/hyp1.trans, 004/etc/hyp.trans, and 005/etc/hyp.trans. A scan through these transcript files didn't show me any duplicate entries. Prof. Jonas has his sandbox in Exp 272, and sub-experiment 004 has a hyp.trans file with 3 or 4 statements repeated over and over. As the data group we may want to confirm these files are all duplicated, but also check another directory, like the full directory, to see whether correct audio files exist for these entries.
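One quick way to surface repeated hypothesis lines like the ones in Exp 272/004 would be a sort/uniq count. This is only a sketch on a made-up file, not a run against the real hyp.trans:

```shell
# Count repeated lines in a hyp.trans-style file; the most duplicated
# hypotheses float to the top. File contents here are invented.
printf 'HAVE YOU BEEN DOWNTOWN RECENTLY\nHAVE YOU BEEN DOWNTOWN RECENTLY\nI LIKE THAT\n' > /tmp/hyp.trans
sort /tmp/hyp.trans | uniq -c | sort -rn | head -n 5
```

Any count greater than 1 on the first lines of the output would flag an utterance worth spot-checking against its audio.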


 * Plan:

3/3: Finish copying and converting. Check for other 256hr trains in the experiments directory that might include a hyp.trans file, to see the computer's proposed transcript. If I can locate one of these files I may be able to find other duplicate audio files in our corpus.

3/5: Locate hyp.trans files to check for duplicate transcriptions which might lead us to identifying audio files that have issues.

3/7: Search through old experiments and if a hyp.trans isn't found maybe we can suggest running a full train to get this file and inspect it.

3/8: Meet with the team tomorrow and come up with an action plan for what we can do to fix these issues with the data.

 * Concerns:

3/3: No major concerns this week.

3/5: No major concerns this week.

3/7: More issues keep arising, which will ultimately lead to much more work. I'm sure we can tackle some of the problems, but I don't think the project as a whole will be able to right the wrongs of many previous years.

3/8: The number of errors in the data.

Week Ending March 22, 2016

 * Task:

3/15: Focus on project team tasks, learn to run a successful train and decode.

3/18: Run Train on first_5hr to become familiar with the training process.

3/20: Decode the train that was run on 3/18.


 * Results:

3/15: After finding that a huge chunk of the audio files evaluated so far were corrupt, the Data group has decided to work with the Modeling group to figure out how to fix the audio files. The Modeling group has met with Prof. Jonas to build a new corpus structure, which will include 5hr, 150hr, and 300hr corpora while keeping full; each will be true to its title, e.g. matching 5 hours of transcript to 5 hours of audio files. The Data group also has a lot to understand to help build this new structure. I also plan to run a train and decode on the old data to familiarize myself with the process.

3/18: Was able to run a train successfully. I will come back when it is finished to run the decode. The train was run on the first_5hr corpus, which probably isn't the best to train on, but at least it will give me some experience with the process so I can be an asset to my new team.

3/20: Ran a successful decode. It didn't seem too difficult; however, the WER was 80%, which is pretty bad. I'm not too worried thus far, as I know there are issues with the corpus as a whole that the Modeling group is working on. I still have to read through the e-mails again about the new structure of the corpus and how it is being built.


 * Plan:

3/15: Run successful train/decode and familiarize myself with the process. Learn new corpus structure and assist to ensure things are correct.

3/18: Decode and score when the train has completed.

3/20: Understand what is happening in the e-mail chain, and run another train, this time maybe tweaking some settings to see what they do.


 * Concerns:

3/15: This is going to be a lot of work to catch up on, but I'll do what I need to be a productive member of the team.

3/18: Figuring out the process.

3/20: Other work that needs to be completed while also focusing on this project.

Week Ending March 29, 2016
 * Task:

3/23: Move the old corpora into the old/ directory to allow for new ones to be built. Test the new corpora when the Modeling group finishes the script to build them.

3/26: Play around with the new scripts that will be used to create user-defined corpus sizes.

3/27: Read through documentation and run scripts.

3/29: Create a 300hr corpus as requested by Prof. Jonas. Evaluate a new audio sample from the fixed full/ corpus to ensure the integrity of the audio data.


 * Results:

3/23: Moved 256hr, 3170, 125hr_3170, and fixed_30k into /mnt/main/corpus/switchboard/old. Awaiting the creation of the new corpora that will allow us to run trains. first_5hr was left in place because some students were still running a train on it; once they have completed that train, first_5hr will also be moved into the old directory.

3/26: The Modeling group, specifically Jon S., has finished some scripts that allow building user-defined corpus sizes to train and decode on. I will run through these scripts tonight in my home directory to see how they work and become familiar with the process.

3/27: Ran through the scripts to create a corpus in my home directory and didn't run into any issues building it. The following is what I did to create the corpus. These directions came from Jon S. of the Modeling group:

1. Run makeCorpus.pl
2. Copy a transcript file to /info/misc
3. CD into /info/misc
4. Sample Transcripts:
 * 1. Run sampleTrans.pl -r 
 * 1. This will create train.trans-sampled (every nth line) and train.trans-remaining (the remainder)
 * 2. Rename train.trans-sampled to dev.trans, move it into the /test/trans directory
 * 3. Rename train.trans to train.trans-orig1 (archiving the untouched train.trans file)
 * 4. Rename train.trans-remaining to train.trans (allows us to repeat a sample on the trans file that has the dev.trans lines removed from it)
 * 2. Run sampleTrans.pl -r 
 * 1. This will create train.trans-sampled (every nth line) and train.trans-remaining (the remainder)
 * 2. Rename train.trans-sampled to eval.trans, move it into the /test/trans directory
 * 3. Rename train.trans to train.trans-orig2 (archiving the untouched train.trans file)
 * 4. Rename train.trans-remaining to train.trans (allows us to repeat a sample on the trans file that has the dev.trans and eval.trans lines removed from it)
 * 3. Run sampleTrans.pl  NO -R HERE
 * 1. This will create a train.trans-sampled file; no train.trans-remaining will be created
 * 2. Move train.trans-sampled to /test/trans and rename it train.trans
 * 4. Copy /info/misc/train.trans to /train/trans/train.trans (this is the trans file remaining after all our samples; it is what we will use for the trains)

5. Create Links to utterances
 * 1. The train/audio/utt files
 * 1. CD into train/audio/utt
 * 2. Run linkTransAudio.pl  
 * 3. Ls afterward to verify you have good links
 * 2. The test/audio/utt files
 * 1. Repeat the same process as above 3 times: eval.trans, dev.trans, and train.trans

3/29: Ran through the process of creating the 300hr corpus, which is technically a "full" corpus. ~5 hours have been sampled for dev.trans and ~5 hours have been sampled for eval.trans. The process is as follows:

[root@caesar switchboard]# perl /mnt/main/scripts/user/makeCorpus.pl 300hr

300hr directory structure created!

[root@caesar switchboard]# ls

145hr 300hr  dist  first_10_lines  first_4hr  full  old

[root@caesar switchboard]# cd 300hr/

[root@caesar 300hr]# cd info/

[root@caesar info]# cd misc/

[root@caesar misc]# pwd

/mnt/main/corpus/switchboard/300hr/info/misc

[root@caesar misc]# cp /mnt/main/corpus/switchboard/full/train/trans/train.trans .

[root@caesar misc]# awk '{total += $3 - $2} END {print total /3600}' train.trans

311.761

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl -r 1 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total /3600}' train.trans-sampled

311.761

[root@caesar misc]# mv train.trans train.trans-full

[root@caesar misc]# rm train.trans-remaining

rm: remove regular empty file `train.trans-remaining'? yes

[root@caesar misc]# mv train.trans-sampled train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total /3600}' train.trans

311.761

[root@caesar misc]# ls

train.trans train.trans-full

[root@caesar misc]# perl /mnt//main/scripts/user/sampleTrans.pl 60 train.trans

[root@caesar misc]# ls

train.trans train.trans-full  train.trans-sampled

[root@caesar misc]# awk '{total += $3 - $2} END {print total /3600}' train.trans-sampled

5.21426

[root@caesar misc]# ls

train.trans train.trans-full  train.trans-sampled

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# ls

train.trans train.trans-full

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 60 train.trans  *******CHECKING MATH HERE TO GET 5HR CHUNK

[root@caesar misc]# ls

train.trans train.trans-full  train.trans-sampled

[root@caesar misc]# awk '{total += $3 - $2} END {print total /3600}' train.trans-sampled

5.21426

[root@caesar misc]# ls

train.trans train.trans-full  train.trans-sampled

[root@caesar misc]# rm train.trans-sampled                                    ********DELETING FILE NOW TO TAKE A SAMPLE WITH ACTUAL REMOVAL

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl -r 60 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

311.761

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

5.2679

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-remaining

306.493

[root@caesar misc]# mv train.trans train.trans-old1

[root@caesar misc]# mv train.trans-sampled ../../test/trans/dev.trans

[root@caesar misc]# mv train.trans-remaining train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

306.493

[root@caesar misc]# ls

train.trans train.trans-full  train.trans-old1

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 60 train.trans

[root@caesar misc]# ls

train.trans train.trans-full  train.trans-old1  train.trans-sampled

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

5.17714

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl -r 60 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

5.13372

[root@caesar misc]# ls

train.trans      train.trans-old1       train.trans-sampled

train.trans-full train.trans-remaining

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-remaining

301.36

[root@caesar misc]# mv train.trans-sampled ../../test/trans/eval.trans

[root@caesar misc]# mv train.trans train.trans-old2

[root@caesar misc]# mv train.trans-remaining train.trans

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 60 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

4.98117

[root@caesar misc]# mv train.trans-sampled ../../test/trans/train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total /3600}' train.trans

301.36

[root@caesar misc]# cp train.trans ../../train/trans/train.trans

[root@caesar misc]# ls

train.trans train.trans-full  train.trans-old1  train.trans-old2

[root@caesar misc]# cd ../../test/trans

[root@caesar trans]# ls

dev.trans eval.trans  train.trans

[root@caesar trans]# awk '{total += $3 - $2} END {print total /3600}' dev.trans

5.2679

[root@caesar trans]# awk '{total += $3 - $2} END {print total /3600}' eval.trans

5.13372

[root@caesar trans]# awk '{total += $3 - $2} END {print total /3600}' train.trans

4.98117

[root@caesar trans]# cd ../audio/utt/

[root@caesar utt]# perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/dev.trans /mnt/main/corpus/switchboard/full/train/audio/utt/

[root@caesar utt]# ls -l | wc -l

4174

[root@caesar utt]# perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/eval.trans /mnt/main/corpus/switchboard/full/train/audio/utt/

[root@caesar utt]# ls -l | wc -l

8277

[root@caesar utt]# perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt/

[root@caesar utt]# ls -l | wc -l

12311

[root@caesar utt]# cd ../../../train/trans/

[root@caesar trans]# pwd

/mnt/main/corpus/switchboard/300hr/train/trans

[root@caesar trans]# ls

train.trans

[root@caesar trans]# awk '{total += $3 - $2} END {print total /3600}' train.trans

301.36

[root@caesar trans]# cd ../audio/utt/

[root@caesar utt]# ls

[root@caesar utt]# perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/train.trans /mnt/main/corpus/switchboard/full/train/audio/utt

[root@caesar utt]# ls -l | wc -l

242055

[root@caesar utt]#
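For reference, the awk one-liner repeated throughout the session above sums per-utterance durations: fields 2 and 3 of each .trans line are taken to be the utterance start/end times in seconds, so the total divided by 3600 gives hours. A toy check with invented lines:

```shell
# Two fake utterances of 1800 s each should total exactly 1 hour.
# The utterance IDs and times below are made up for illustration.
printf 'sw0000A-0001 0.0 1800.0\nsw0000B-0002 100.0 1900.0\n' > /tmp/toy.trans
awk '{total += $3 - $2} END {print total / 3600}' /tmp/toy.trans
# prints 1
```

This is how the 311.761, 5.2679, etc. hour figures in the session were obtained.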


 * Plan:

3/23: Evaluate the new corpora after they have been built. Will pay close attention to the e-mail chain to know when this is completed so we can get a start on it.

3/26: Run scripts, reach out to the Modeling group if I run into any issues.

3/27: Run a train on newly created corpus and see what happens.

3/29: Continue analysis of the audio data in the newly created full corpus to check for any persistent errors.

 * Concerns:

3/23: Getting on track to help the Modeling group where we can in evaluating this newly built corpus.

3/26: The amount of work to accomplish this weekend.

3/27: See above.

3/29: A lot of work.

Week Ending April 5, 2016
 * Task:

3/30: Create a 30hr corpus for Team Stark 'reasons'.

4/5: Evaluate some past bad audio files to ensure they are now correct. Also research some parameters that Team Stark may be able to use to get a better WER.

4/6: See above.

 * Results:

3/30:

[root@caesar switchboard]# perl /mnt/main/scripts/user/makeCorpus.pl 30hr

30hr directory structure created!

[root@caesar switchboard]# cd 30hr/info/misc/

[root@caesar misc]# pwd

/mnt/main/corpus/switchboard/30hr/info/misc

[root@caesar misc]# cp /mnt/main/corpus/switchboard/full/train/trans/train.trans .

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

311.761

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 10 train.trans

[root@caesar misc]# ls

train.trans train.trans-sampled

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

31.0565

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 12 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

26.0382

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# ls

train.trans

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 8 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

38.9952

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 7 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

44.7362

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl -r 8 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

38.922

[root@caesar misc]# mv train.trans train.trans-full

[root@caesar misc]# rm train.trans-remaining

rm: remove regular file `train.trans-remaining'? yes

[root@caesar misc]# mv train.trans-sampled train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

38.922

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 11 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

3.66002

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 10 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

3.88465

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 8 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

4.9214

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl -r 8 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

4.80358

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

38.922

[root@caesar misc]# ls

train.trans train.trans-full  train.trans-remaining  train.trans-sampled

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

38.922

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

4.80358

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

38.922

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'? yes

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl -r 8 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

38.922

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

4.80358

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-remaining

34.1184

[root@caesar misc]# mv train.trans train.trans-old1

[root@caesar misc]# mv train.trans-sampled ../../test/trans/dev.trans

[root@caesar misc]# mv train.trans-remaining train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans

34.1184

[root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 8 train.trans

[root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.trans-sampled

4.32405

[root@caesar misc]# rm train.trans-sampled

rm: remove regular file `train.trans-sampled'?
yes [root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl -r 8 train.trans [root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.tran                                                                    s-sampled 4.26969 [root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.tran                                                                    s-remaining 29.8487 [root@caesar misc]# mv train.trans-sampled ../../test/trans/eval.trans [root@caesar misc]# mv train.trans train.trans-old2 [root@caesar misc]# mv train.trans-remaining train.trans [root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 8 train.trans                                                                        [root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.tran                                                                     s-sampled 3.70656 [root@caesar misc]# rm train.trans-sampled rm: remove regular file `train.trans-sampled'? 
yes [root@caesar misc]# perl /mnt/main/scripts/user/sampleTrans.pl 6 train.trans                                                                        [root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.tran                                                                     s-sampled 5.00748 [root@caesar misc]# mv train.trans-sampled ../../test/trans/train.trans [root@caesar misc]# awk '{total += $3 - $2} END {print total / 3600}' train.tran                                                                    s 29.8487 [root@caesar misc]# cp train.trans ../../train/trans/train.trans [root@caesar misc]# ls train.trans train.trans-full  train.trans-old1  train.trans-old2 [root@caesar misc]# cd ../../test/trans/ [root@caesar trans]# ls dev.trans eval.trans  train.trans [root@caesar trans]# awk '{total += $3 - $2} END {print total / 3600}' train.tra                                                                    ns 5.00748 [root@caesar trans]# awk '{total += $3 - $2} END {print total / 3600}' eval.tran                                                                    s 4.26969 [root@caesar trans]# awk '{total += $3 - $2} END {print total / 3600}' dev.trans 4.80358 [root@caesar trans]# cd ../audio/utt/ [root@caesar utt]# perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/dev                                                                    .trans /mnt/main/corpus/switchboard/full/train/audio/utt/ [root@caesar utt]# ls -l | wc -l 3913 [root@caesar utt]# perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/eva                                                                    l.trans /mnt/main/corpus/switchboard/full/train/audio/utt/ [root@caesar utt]# ls -l | wc -l 7336 [root@caesar utt]# perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/tra                                                                    in.trans /mnt/main/corpus/switchboard/full/train/audio/utt/ [root@caesar utt]# ls -l | wc -l 11328 [root@caesar utt]# 
cd ../../../train/trans/ [root@caesar trans]# ls train.trans [root@caesar trans]# awk '{total += $3 - $2} END {print total / 3600}' train.tra                                                                    ns 29.8487 [root@caesar trans]# cd ../audio/utt/ [root@caesar utt]# ls [root@caesar utt]# perl /mnt/main/scripts/user/linkTransAudio.pl ../../trans/tra                                                                    in.trans /mnt/main/corpus/switchboard/full/train/audio/utt/ [root@caesar utt]# ls -l | wc -l 23958 [root@caesar utt]# exit
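The awk one-liner used throughout the session above is how the corpus sizes (311.761, 38.922, 29.8487 hours, etc.) were checked. Assuming fields 2 and 3 of a .trans line are the utterance start and end times in seconds (an inference from the totals shown, not a documented format), a minimal self-contained sketch:

```shell
# Mock transcript: utterance id, start time (s), end time (s).
# The field layout is an assumption inferred from the session above.
cat > /tmp/mock.trans <<'EOF'
sw02001-A_000 0.0 1800.0
sw02001-A_001 1800.0 5400.0
sw02001-B_000 0.0 3600.0
EOF

# Sum (end - start) over every utterance, then convert seconds to hours.
awk '{total += $3 - $2} END {print total / 3600}' /tmp/mock.trans   # prints 2.5
```

This is why repeated runs of sampleTrans.pl followed by the awk check work as a trial-and-error loop: sample, measure the hours, delete, and resample with a different ratio until the total lands near the target.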
 * Results:

4/6: Checked the previous erroneous data that first led to our discovery to ensure it had been completely fixed by the modeling group; the small sample I checked seemed to be fine. While looking up parameters I also found a document, a paper published in 2015, which I'll bring to my team today so we can see if any of its data is useful. It is located here: http://arxiv.org/pdf/1505.05899.pdf

3/30: Work on the other items discussed in class.
 * Plan:

4/5: Keep researching parameters.

4/6: Meet with Team Stark today to review train and decode findings.


 * Concerns:

3/30: Class work balance.

4/5: I have been sick and not able to focus on these tasks. I will have a lot to do before class tomorrow to catch up.

4/6: Getting my groups poster done on top of my other course load this week. There is a lot of work to accomplish.

Week Ending April 12, 2016
4/9: The Data group meeting in class led to a discussion about completing the URC poster. We will work on putting this together and have divided sections up amongst ourselves. The Team Stark meeting covered what we hope to finish this week. I have been tasked with running a decode on unseen data from the 145hr train. I will begin this process now and see how everything goes. Prof. Jonas also spoke in class about trying to score the entire 300hr corpus in hopes of identifying possible problem utterances. The Data group will look into how this might be done and report back.
 * Task:

4/10: Run a decode on the completed 145hr train. Figure out how to correctly point to unseen data for this task so we can check our WER on unseen data, something we haven't done yet.

4/11: Figure out what went wrong with decode on unseen data.

4/12: Now that I have some information, work through a decode on unseen data.


 * Results:

4/9: Brian A. and Justin G. have submitted their portions of the poster. I will put all the information together tonight and e-mail it out to them to take a look and critique. Currently starting the 145hr decode of unseen data. Will post the results in our experiment directory for Team Stark.

4/10: Positive feedback on the poster for the Data Group, so it will be submitted to Prof. Jonas today for the URC. I think it did a good job hitting all the points we wanted as well as recognizing the Data Group's achievements for this semester. I'm about to start the unseen data decode on the 145hr train. The results, successful or not, will be posted in the next log. Team Stark is currently controlling the publishing of our numerical results to hopefully give us an edge in the competition.

4/11: Went through the process to build a Language Model for the 145hr train and then tried to run a decode on unseen data, specifically eval.trans, which errored out. I spoke with Matt H. and he informed me there are some additional steps that must be completed to decode on unseen data. Team Stark has scheduled a Google Hangout meeting tonight to go over how to train on unseen data. I will take note of what I need to do to complete this decode and post an update in our shared folder in the meantime.

4/12: Team Stark had a good information session last night via Google Hangout. Some issues with unseen data were addressed, which I will try to run through tonight to check if I can be successful.


 * Plan:

4/9: Complete decode on unseen data and report findings to Team Stark. Finish the poster for the URC and send it out to the Data group.

4/10: Complete decode on unseen data from the 145hr train that has just completed.

4/11: Google Hangout meeting this evening with Team Stark to talk about the unseen data decode.

4/12: Run decode.


 * Concerns:

4/9: Lots of stuff to do this week.

4/10: Lots of work for other classes as well as this one.

4/11: Dealing with other school work.

4/12: Don't feel too well and have other assignments I have to complete as well.

Week Ending April 19, 2016

 * Task:

4/14: Score my previous experiment 012 and see what the result is. All documentation about experiments will be located in the Experiments directory and will be made public after the competition is over. I will also be doing some research into decoding to see if it is possible to decrease decoding time. We will have to look through the decoding scripts on Caesar to break down how each works, as well as look up any attributes that may help us get a better score.

4/18: Read through Team Stark e-mails and look further into the possibility of multithread decoding. Assist team members with anything they need help with. Consider how to decode/score the entire corpus if a multithread decode is possible. 'Multithread decode' means using multiple CPU cores on Caesar to speed up the decode process.
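To sketch what a multithread (really multi-process) decode could look like: split the utterance list into chunks and run one decoder process per chunk, one per core. The per-chunk decode command is hypothetical here, so a harmless `wc -l` stands in to keep the sketch runnable:

```shell
# Make a mock utterance list of 100 entries.
seq 1 100 > /tmp/utt.list

# Split into 4 chunks without breaking lines (GNU split): chunk.aa .. chunk.ad.
split -n l/4 /tmp/utt.list /tmp/chunk.

# Run one worker per chunk, up to 4 at a time. In a real run, the command
# here would be a (hypothetical) per-chunk decode wrapper, not wc -l.
ls /tmp/chunk.* | xargs -P 4 -n 1 wc -l
```

The partial hypothesis files from each chunk would then need to be merged before scoring; that merge step is not shown.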

4/19: Think up some good items to address at tomorrow's meeting. Tomorrow is also the URC, so we will be limited in what we can go over as a team; I will have to think of some good discussion items to bring.


 * Results:

4/14: Score was kind of as expected. I will have to look into changing other attributes in the future to see if I can get something better. Will continue to look up stuff in the decoding process and propose ideas to Team Stark through e-mail.

4/18: Still lots to read through as far as the documentation found by my teammates. I will read through some of it tonight and see what I can make sense of.

4/19: Read through some of the Team Stark e-mails. There are some good findings by Matt H. which might help us to be able to decode much faster than real time. If this becomes a possibility, the Data Group may be able to decode and score the whole corpus which can help us look at the data in numerical form and allow us to possibly remove poor audio recordings to help us get a better WER on the corpus as a whole.


 * Plan:

4/14: Look up how the train/decoding process works.

4/18: Read documentation about linear alignment and other training attributes that will hopefully give us a world-class baseline. Send an e-mail to Team Stark asking if anyone needs assistance with anything currently running.

4/19: Prepare for the Undergraduate Research Conference tomorrow. It will be focused more on the Data side of things than on team work. I'm just curious how much time we will have tomorrow to talk and work with our teams.


 * Concerns:

4/14: Dealing with other classes.

4/18: Dealing with other assignments while still trying to focus here.

4/19: Preparing for URC, hoping we have time for a team meeting tomorrow, working on other class assignments.

Week Ending April 26, 2016
4/21: The big task this week will be trying to decode the whole 300 hour corpus. I'll have to do some research behind this, and I will need 2 machines to split the corpus across. Both teams will be sacrificing a machine (or part of one) to accomplish this task, as the decode will be split into parts to achieve the desired result. I will also be looking at attributes Team Stark wants us to investigate. This will be quite a bit of work for this week.
 * Task:

4/23: Have a Google Hangout session with Team Stark about decoding and delegating tasks. Run 2 part decode on Asterix and Obelix of entire 300hr corpus.

4/24: Run a successful multi-process decode on a split 300hr corpus.

4/26: Check decode process.

4/21: I will be meeting with Matt H. and Ben L. tomorrow night via Google Hangout to assist me with the process. In the meantime I will refresh myself running a decode and look at running a decode on already trained data.
 * Results:

4/23: Google Hangout session went fine. Matt H. helped walk me through some steps. I will be running the decode tonight on Asterix and Obelix of the 300hr corpus. I will be pulling from a 300hr train 0289/002 run by Matt H. so I won't need to retrain this corpus. I will update again when I am successful. I think it should take a couple days to accomplish this task so we will see how it goes.

4/24: Matt H. helped me work out a small issue with getting this running. I have started two experiments: 0284/007, the first half of the 300hr corpus (121027 utterance files) on Obelix, and 0284/008, the second half of the 300hr corpus on Asterix. I will be following these two experiments to see when the decodes complete and will inform both teams. In the meantime I will also try to see what improvements I can make to our WER.
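Halving a transcript for a two-machine decode like this can be done with head/tail, assuming one utterance per line. The file names below are illustrative stand-ins, not the real corpus paths:

```shell
# Mock transcript with 7 utterance lines.
seq 1 7 > /tmp/full300.trans

# Find the midpoint (rounding up so part 1 gets the extra line).
total=$(wc -l < /tmp/full300.trans)
half=$(( (total + 1) / 2 ))

head -n "$half" /tmp/full300.trans            > /tmp/part1.trans   # e.g. for Obelix
tail -n +"$(( half + 1 ))" /tmp/full300.trans > /tmp/part2.trans   # e.g. for Asterix

wc -l /tmp/part1.trans /tmp/part2.trans   # 4 lines and 3 lines
```

Every line ends up in exactly one half, so no utterance is decoded twice or dropped.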

4/26: Decode is still running on both Obelix and Asterix. I think that there may be an issue with this decode as I am unsure if the correct language model (LM) was used. I will let this process finish and figure out in class if this was done the correct way. Tomorrow's class will allow us to regroup as a team because we couldn't talk much last week due to the URC.

4/21: Look through decode process again. Also look to play with some attributes and run a train and decode on a smaller set of data. I will have to check when we have server availability to do so.
 * Plan:

4/23: Run 2 decodes, one on Asterix and one on Obelix. Look up some attributes Ben L. sent out in an e-mail that relate to the train process which might help us decrease WER.

4/24: Wait for decode processes to complete. Read e-mail again from Ben L. about possible improvements that could be made.

4/26: Bring concerns to class and talk with the team tomorrow.

4/21: I have concerns about being successful with this large decode task. Prof. Jonas wanted me to head this up personally for the Data group, so hopefully everything works out.
 * Concerns:

4/23: Being successful with these decodes as well as having time for all these other assignments I have.

4/24: See previous.

4/26: See previous.

Week Ending May 3, 2016
4/30: Run a new decode on the 300hr corpus to reflect the correct LM. Figure out how to score a decode per utterance rather than per conversation speaker.
 * Task:

5/2: Still look into how to score individual utterances instead of scoring based on a speaker. Prepare information for our final report.

5/3: Continue looking into scoring individual utterances, continue gathering information for the final report, and figure out what needs to be included in the 'team' report.
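One angle on the per-utterance scoring problem: SCLite aggregates results by the id in the trailing parenthesized tag of each trn-format line, so if that tag only identifies a speaker, the report comes out per speaker. Making the tag unique per utterance should then yield per-utterance scores. That grouping behavior is an assumption here, and the file names and trn layout shown are illustrative:

```shell
# Mock hypothesis file in trn-style format: text followed by (id).
# Here both lines share a speaker-level id, so scores would aggregate.
cat > /tmp/hyp.trn <<'EOF'
hello there (sw2001-A)
how are you (sw2001-A)
EOF

# Append a running line number inside the tag so every line gets a
# unique utterance-level id: (sw2001-A) -> (sw2001-A_1), (sw2001-A_2).
awk '{ sub(/\)$/, "_" NR ")"); print }' /tmp/hyp.trn > /tmp/hyp-utt.trn
cat /tmp/hyp-utt.trn

# The same rewrite would be applied to the reference file, then scored
# as usual (not run here; shown for context only):
# sclite -r /tmp/ref-utt.trn trn -h /tmp/hyp-utt.trn trn -i rm -o all
```

The rewrite has to be applied identically to reference and hypothesis files, since SCLite matches segments by id.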


 * Results:

4/30: Both teams have decided to call a truce, as we both hit a point in our research where updates to Sphinx need to be completed to attain a better overall WER. I think this unity will allow us to get a much better overall WER than previous semesters. It's also great to share both teams' 'secrets' and to be able to talk with the Data Group again to achieve what we want, which is very accurate and clean data. Experiments 0284/007 and 0284/008 were started last week, but it appears they are both still running. Based on calculations by Prof. Jonas, these should have taken about 4 days. I am considering killing the decode and starting a full 8-part decode on one server to see what happens. The other thing the Data group is responsible for this week is figuring out how to run a score in SCLite that scores every utterance rather than just a speaker. Some progress was made which suggests scoring might be a little skewed due to how the transcript files are formatted, but further investigation is needed.

5/2: Haven't made any progress with finding how to score per utterance. I have also noticed that the two decode processes are still running on Obelix and Asterix. I'm not sure how much longer these are going to run, or if we should kill these experiments and run a full decode using 8 parts on Obelix itself. I will consult with the Data Group and see what the consensus is. If it is going to take a while to decode, it might be better to let this finish so we will at least have a completed decode for next semester to look at if we can't figure out how to score each utterance.

5/3: I know Justin G. is still looking into the script for formatting the transcript to look at individual utterances rather than conversations. We also suspect that the begin and end tags of the transcript might subtly impact WER, so that is something else we will have to look into. We will also have to update the Data Group's section of the final report and wrap up what we want the next semester to carry forward. I think ultimately that will be getting the multi-part decode of the entire 300hr corpus working better, figuring out how to score each utterance, and lastly deciding what a good or bad score is and listening to those files. The 300hr decode is still running on both Obelix and Asterix in 2 parts; I'm not sure when this will finish.


 * Plan:

4/30: Wait for Experiments 0284/007 and 0284/008 to complete. If they don't complete by tomorrow, I might, with the consensus of the Data Group, consider killing this decode, because we know the LM that was built won't really give us an accurate score. We also plan to keep looking up how to score every utterance. Justin G. has found some good documentation that we will keep looking through to figure out what we need to do.

5/2: Talk with Data Group about this decode taking longer than expected. Talk with Data Group about what we want to include for our final report.

5/3: See if anyone has found any information about scoring per utterance, talk to Data Group about our portion of the final report.


 * Concerns:

4/30: End of semester wrap up with big projects for my other classes.

5/2: Working on about 3 projects concurrently and also having to write up papers for other classes. Just have to remind myself there is just 2 weeks left.

5/3: Other projects I need to complete this week and finals week coming.

Week Ending May 10, 2016

 * Task:

5/5: Update and smooth "Data Group Wiki Page". Work on final report. Check for finished decode on Experiments 0284/007 and 0284/008.

5/7: Continue working on "Data Group Wiki Page" and also work on final report.

5/8: See above.

5/10: Finish "Data Group Wiki Page". Keep working on final report.


 * Results:

5/5: Justin G. and I composed a list in class this week of what we want to include in the final report, as well as making a "Data Group Wiki Page". We never used this during the semester for updates, so we want to use it now to record a lot of the key things we did this semester and how to do them. Some ideas included how to build corpora, how to find audio files, how to listen to audio files, how to generate a report showing every utterance's score, and whatever else we can think of. Also filled Brian A. in, so we should have enough to do this week to wrap up Capstone.

5/7: It appears that both Experiment 0284/007 on Obelix and 0284/008 on Asterix have completed. I will talk with Justin G. to see if we can get these scored and generate a LUR (Labeled Utterance Report) for both so that we will be able to see the error percentage for every utterance in this decode.

5/8: Still working on group wiki page. Trying to manage other classes finals material and projects so I will keep working on this as time becomes available. Will also check when LUR has been generated for the two experiments that completed that encompass the whole corpus.

5/10: Was able to lay down some good information on the Data Group Wiki Page which might help future semesters. I've outlined how to build a corpus, where audio files are located, how to listen to audio files, how to generate a Labeled Utterance Report (LUR) and I'll have Brian A. and Justin G. look over it and see what else we should add. I'm about to start adding more stuff to the final report and I will have the Data Group look over it once that is complete as well.
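Once a per-utterance error listing like the LUR exists, surfacing the likely-bad recordings is just a sort. The two-column format below (utterance id, WER percent) is a hypothetical stand-in for the real report layout:

```shell
# Hypothetical per-utterance WER listing; the real LUR format may differ.
cat > /tmp/utt-wer.txt <<'EOF'
sw2001-A_001 12.5
sw2001-A_002 87.0
sw2001-B_001 33.3
EOF

# Sort by WER descending; the worst utterances (candidates for bad audio
# worth listening to) float to the top.
sort -k2,2nr /tmp/utt-wer.txt | head -n 1   # prints: sw2001-A_002 87.0
```

Replacing `head -n 1` with a threshold filter (e.g. `awk '$2 > 50'`) would give future semesters a candidate list of utterances to audit by ear.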


 * Plan:

5/5: Work on wiki page, final report and watch for decode to finish.

5/7: Keep working on wiki page and final report.

5/8: Carry on with finishing the wiki and working on the final report.

5/10: Finish up Data's portion of the final report. Tie up any other loose ends.


 * Concerns:

5/5: Being successful with the semester wrapping up.

5/7: Finishing the rest of the class projects and studying for finals.

5/8: Equal work balance for last week of class. Almost there.

5/10: Still having to write a paper and study for another exam as well as other projects that are still due.