Speech:Spring 2017 Maryjean Emerson Log



Week Ending February 7, 2017

Task

1. Download and login with VPN

2. Create Experiment Folder for the Data Group experiments

3. Read logs

4. Read logs

Results

1. Was unable to get the VPN to work at home; it kept killing my WiFi. I asked for help from my group, the class forum, and the IT developer at work, but none of them were able to figure it out. Finally had Josh Young look at it for me, and he found an adapter error and had to delete and re-install all of my network adapters. I am now able to get the VPN to work from home and sign in to caesar with no problem.

2. When I logged in and went to the correct folder I found that the experiment folder had already been created by another group, and then one of our members created another folder for us, so our numbers seem to have switched. I sent a message to my team about it and let them know that we need to use 0297 as our main Exp folder for creating sub-experiments.

3. Read logs

4. Read logs

Plan

1. Download the VPN, log in, and familiarize myself with the Unix paths used to access the data.

2. Make the main experiment folder in /mnt/main for the Data Group so that sub-experiments can be created.

3. Read Logs

4. Read Logs

Concerns

1. Very concerned that I was not going to be able to get the VPN to work on my computer, because that is where the bulk of my work will need to be done since I can only get to campus one day a week. Luckily there is no problem now that it is all configured correctly.

2. Want to make sure that all the groups are using the correct main Exp folders for their sub-experiments.

3. None

4. None

Week Ending February 14, 2017

Task

2/8 - Work on the semester goals and finalize our piece of the proposal rough draft due today. Learn how to copy files to our computers for future help when we start listening/comparing. Download the transcript file onto my local directory.

2/11 - Create a sub-experiment in our group folder.

2/13 - Since I was successful in creating the folders in our experiment folder, I will try to run a train and finish the rest of the instructions on the wiki site.

2/14 - Try again with the train instructions and finish running the experiment.

Results

2/8 - We all agreed that we should start from a specific point and work our way through in a linear fashion. Then we can record exactly where we stopped our analysis and hand that off to the next semester so they can pick up where we left off. With this process in place, as each semester moves along eventually all the files will be checked for accuracy, so the overall data has a very low WER and the other groups (Experiments and Modeling) have good data to work with.

Found this in Brian's log ---> "2/24: I copied all 61 files into a directory in my home drive called "Week2_SPHFiles" using the command cp /filepath/{file1,file2,etc} $home\Week2_SPHFiles. This worked flawlessly. I was able to get the file names with the commas by using Excel and its concatenate function. After copying the files to my computer".
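
For my own reference, a cleaned-up sketch of that copy on the Unix side, assuming a Unix-style home path rather than the backslash in the quote (the source path and file names are placeholders, not the actual 61 files):

mkdir -p ~/Week2_SPHFiles
cp /filepath/{file1.sph,file2.sph} ~/Week2_SPHFiles/   # brace expansion copies each listed file into the new directory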

Success downloading the transcripts. Matt helped me, showing me how to use this command in the Windows command prompt --> C:\Users\Maryjean>pscp mhe2000@caesar.unh.edu:/mnt/main/corpus/switchboard/full/train/trans/train.trans c:/Users/Maryjean/documents mhe2000@caesar.unh.edu's password: train.trans | 25559 kB | 3651.3 kB/s | ETA: 00:00:00 | 100% As you can see it was a success, and by opening it up in Atom I was able to see it well organized. This sets me up to easily reference it in the future against any audio files that I listen to for errors.

2/11 - Could not get the sub-directory 003 created in our group's experiment directory 0298; it told me permission denied. With the help of Tucker and Vitali through Slack I was able to get the directory made: Tucker went in as root and changed permissions on the group directories. Then I needed their help again with using the addExp.pl script. I thought it needed to be run in the experiment directory, but they told me to go to the scripts/user directory instead, and with the command perl addExp.pl -s I was able to create the entry on the wiki page as well.

2/13 - Read the logs of the class and team members. The train timed out before it finished.

2/14 - Successfully removed the files/folders in my 0298/003 experiment directory. Had to try a couple of commands, but with Google I figured out this one: "rm -rf /mnt/main/Exp/0298/003/*". The train failed though: there were hundreds of lines with warnings like "WARNING: This word: SHIFT was in the transcript file, but is not in the dictionary ( YOU DON'T HAVE TO SHIFT [LAUGHTER] ). Do cases match?", with many different words and phrases, and then at the end this error showed up: "Phase 7: TRANSCRIPT - Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once Something failed: (/mnt/main/Exp/0298/003/scripts_pl/00.verify/verify_all.pl)". I am going to try another experiment in the next incremented directory, 004.
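
A rough way to see which transcript words are missing from the dictionary before re-running; this is only a sketch with placeholder file names (the real transcript and .dic file for the experiment would need to be substituted). Raw transcript lines are utterance id, start time, end time, then the words, so fields 4 and up are the words:

awk '{for (i = 4; i <= NF; i++) print toupper($i)}' train.trans | sort -u > trans_words.txt   # unique words in the transcript
awk '{print $1}' experiment.dic | sort -u > dict_words.txt                                    # words the dictionary knows
comm -23 trans_words.txt dict_words.txt                                                       # words in the transcript but not in the dictionary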

I created the sub-directory but ran the commands in the main directory, and had to remove all the directories that were created using these commands:

"[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/bin

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/bwaccumdir

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/029 0294/ 0295/ 0296/ 0297/ 0298/ 0299/

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/etc

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/feat

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/logdir

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/model model_architecture/ model_parameters/

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/model_architecture

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/model_parameters

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/python

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/scripts_pl

[mhe2000@caesar 0298]$ rm -rf /mnt/main/Exp/0298/wav"

I then went into the correct sub-directory, 004, and re-ran the commands. Luckily I hadn't run the train yet.
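
For what it's worth, the whole cleanup above could probably be done in one command with brace expansion; a sketch of the same removals (the path obviously needs double-checking before running anything with rm -rf):

rm -rf /mnt/main/Exp/0298/{bin,bwaccumdir,etc,feat,logdir,model_architecture,model_parameters,python,scripts_pl,wav}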

I was able to run the train but am not sure what to do for the decode so I will wait until class tomorrow night to get help.

Plan

2/8 - Finalize our overall goals for this semester.

Talked about increasing the accuracy and decreasing the WER (word error rate). Last semester used a random sampling. Read through logs to see how the team members from last semester did it and what the most efficient way to achieve this is.

Use a command to copy the transcript file onto my computer (SUCCESS)

2/11 - Run an experiment by myself to make sure I understand the process and can do it alone. I will use the instructions that Vitali shared with the class on Slack at this link: Speech:Models

2/13 - Finish with my experiment by running the train.

2/14 - Retry the train since it timed out last night. Remove the partial information that was in the experiment /0298/003 directory and start over.

Concerns

2/8 - no concerns at this time

2/11 - no concerns at this time

2/13 - no concerns at this time.

Week Ending February 21, 2017

Task

2/16 - Read through my group members' notes and catch up on the Slack conversation. Make sure I get my tasks for the week and organize them to spread them out over the week before the next class.

2/19 - Go through Matt's Document and start researching.

2/20 - Start at the end of the transcripts file and look for patterns and brackets that are "noise" to see how we can make the data more complete.

2/21 - Google the Switchboard we are using and other technologies and get more information about them.

Results

2/16 - Discovered we need to do a better job on the final proposal. Downloaded the file Matt created for the train and got some other good notes from him that he posted on the wiki about one of our main goals as a group. Dylan also had a good resource for listening to .sph files. I will familiarize myself with all the documentation and experiment at home myself.

2/19 - I added my comments to Matt's document and suggested where I would be the best fit among the different tasks. I started looking into one of the tasks (research how other companies' or outside parties' transcripts mark extraneous data). I also looked through the train document to better understand the transcripts and how they are created, what the different [] mean, etc.

Dylan took Matt's tran.txt document and condensed it to only the lines that had brackets [] in them. An example:

Line 243280: sw4717B-ms98-a-0030 159.530750 161.067875 [laughter-right]

Line 243281: sw4717A-ms98-a-0021 160.303625 162.545625 [laughter] it's great

Line 243283: sw4717A-ms98-a-0023 164.522375 170.195875 [laughter-oh] yeah [laughter] it's about the six grade it's basically a newspaper for those that watch [laughter-TV]

I found though that there were a lot of repeated lines, because if there was more than one bracket in a line the line would appear that many times; for example, the last line above was entered 3 times because it has [laughter-oh], [laughter], and [laughter-TV]. So I started to delete the multiples to condense the document even more. I will discuss with my group how we could write a script to do that instead of going line by line, which is what I was doing.

2/20 - I continued to go through the condensed document to remove duplicate lines. This is tedious and I really need to find a way to do it with a script (see the sketch below).
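
A possible one-line approach, assuming the condensed document is a plain text file (the file name here is just a placeholder for Dylan's condensed document):

awk '!seen[$0]++' bracket_lines.txt > bracket_lines_unique.txt   # keep only the first occurrence of each line, preserving order
sort -u bracket_lines.txt > bracket_lines_unique.txt             # alternative, if the line order does not matter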

I also worked with the group to get the proposal finalized. I started the project overview at the beginning of the proposal and discussed it on Slack with the other group members. Tucker is leading the pack and doing a really great job getting things organized.

2/21 - Used the following sites to get more information about the transcripts.

https://www.isip.piconepress.com/projects/switchboard/doc/transcription_guidelines/

https://www.isip.piconepress.com/projects/switchboard/doc/transcription_guidelines/transcription_guidelines.text

Plan

2/16 - Catch up on what I missed in class.

2/19 - Look through the new proposal document that Matt created and add my input.

2/20 - Last night I started going through the transcript document to try and find other types of non-word brackets that are being used and any patterns that I can find.

2/21 - Read logs and read up on switchboard transcripts and how they are created, formatted etc.

Concerns

2/16 - Worried that we won't be able to get the final proposal done.

2/19 - still worried about the final proposal.

2/20 - Getting the finalized document done. Tucker is doing a really great job trying to make everything speak in one voice and look professional and consistent, but I worry that we won't be able to get it all done.

2/21 - still worried about the final proposal.

Week Ending February 28, 2017

Task

2/22 - Work with Dylan on finishing my experiment. Do more research on the Unix commands to filter through the transcripts (looking specifically at the tail, head, grep, etc. commands that Professor Jonas mentioned in class).

2/23 - Get started going through my portion of the transcripts that Matt handed out.

2/26 - Continue going through the transcript to look for brackets different from the standard 4 we found for non-speech audio. Check in on the other team members and see if there is anything I can help with.

2/28 - Finish looking through my part of the transcripts and record my findings.

Results

2/22 - Finally getting used to the layout of the wiki and how to find information. There is no streamlined way to get to it.

Using the links I found the other night, I came across this statement:

Non-speech sounds during conversations: transcribe these using only the following list of expressions in brackets:
[laughter][noise][vocalized-noise]
Laughter during speech: If laughter occurs directly before a word, place the [laughter] tag before the spoken word. If laughter occurs after a spoken word, place the [laughter] tag after the word. If the speaker laughs while saying the word, but the word is still understood, transcribe this as [laughter-word]

- With this information we know there are only the 4 types of brackets. So from here on out, when I go through the transcripts I will compare the bracketed transcript lines with their audio files to make sure the transcript uses the correct [ ] notation.
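
One way to double-check that claim against the full transcript is to pull out every bracketed tag and count each distinct one; a sketch using the transcript path from 2/8:

grep -o '\[[^][]*\]' /mnt/main/corpus/switchboard/full/train/trans/train.trans | sort | uniq -c | sort -rn   # count each distinct [tag], most frequent first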

Got our breakdowns from Matt on which piece of the transcripts we are each supposed to be reviewing.

2/23 - I haven't found anything yet other than the 4 non-speech brackets that are already listed.

2/26 - Still not finding anything different from what I have seen before in the transcripts; specifically, I do not see any brackets other than the standard ones. No update from the group except that Dylan was able to successfully run his experiment on the 5-hour corpus and confirm that the laughter brackets are being removed but no words are being removed. This is good to know, so we can move forward with tasks and work on something new. I will work on running an experiment with his corpus as well (a quick check is sketched below).
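
A quick sanity check I can run on the generated transcript once my experiment finishes, sketched here with a placeholder experiment number:

grep -c '\[' /mnt/main/Exp/0298/NNN/etc/NNN_train.trans   # should print 0 if all brackets were removed (NNN = the experiment number)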

2/28 - found these anomalies in my transcripts:

[laughter] - 3212 in my document, 18% of total
[laughter-word] - 2842 in document, 16% of total
[word-laughter] - 0 in document
[ ] - 17085 total
them_1 (280), 48% of total
because_1 (243), 42% of total
about_1 (23), 3.9% of total
okay_1 (14), 2.4% of total
se_1 (2) "becau[se_1]", .3% of total
<b_aside> <e_aside> - tags are always together (8 pairs, 16 total), 2.8% of total
"_" - 578 total

- More details, including each line of all the above findings, are in the .txt file I created and shared with my team.

Plan

2/22 - Work with Dylan on finishing my experiment. Do more research on the Unix commands to filter through the transcripts (looking specifically at the tail, head, grep, etc. commands that Professor Jonas mentioned in class).

2/23 - Get started going through my portion of the transcripts that Matt handed out.

2/26 - Continue going through the transcript to look for brackets different from the standard 4 we found for non-speech audio. Check in on the other team members and see if there is anything I can help with.

2/28 - Finish looking through my part of the transcripts and record my findings.

Concerns

2/22 - Making sure I can keep up and get over the learning curve so that I stay up to date with the group/class.

2/23 - none so far this week

2/26 - none at this time.

2/28 - when we will be able to get the script up and running to verify the regex

Week Ending March 7, 2017

Task

3/5 - researching more options to add to the dictionary and also more pronunciations for dictionary words.

3/6 - Reach out to the team to see everyone's progress.

Results

3/5 - Started with some of the resources that Dylan found and then tried to branch out from there.

3/6 - Checked in with everyone on Slack. Matt let us know where he and Cody are with the script: the current script didn't work like it was supposed to, so they made the changes needed. They still need to check it on a full transcript. Dylan said he was working on the script for adding to the dictionary; I offered my help with that if he needed it. Dylan said he would send me any information he found on his piece of the transcript so that I can do a full write-up for everyone. I am waiting to hear from Matt and John about that.

Plan

3/5 - researching more options to add to the dictionary and also more pronunciations for dictionary words.

3/6 - Reach out to the team to see everyone's progress and start documenting our team findings on the transcripts.

Concerns

3/5 - Worried about the script that Matt and Cody are working on and what the update is; I haven't heard from Matt on how it is going this week.

3/6 - Worried about my teammates and our communication this week. We haven't talked much.

Week Ending March 21, 2017

Task
Spring Break


Results
Spring Break


Plan
Spring Break


Concerns
Spring Break

Week Ending March 28, 2017

Task

3/22 - Start to analyze the new transcript to see what new words are being kept now with the new script that were not being kept with the old script.

Also tried to run a train in Miraculix with Dylan to make sure that it worked on one of our team (Empire) machines. Vitali ran one on the other machine Asterix to test it as well.

3/23 - Run a Train with the new genTrans.pl script.

3/28 - Check in on everyone's log and see their progress this week

Results

3/22 - Dylan and I ran a new experiment, 009, in our group folder with the old script to get baseline results. A good thing to note that we found looking at the results: the scoring report produced with the existing genTrans.pl script is not correct, because the report includes the <s> and </s> tags on each line in its scoring, which gives a better score than it should.

example:
the scoring log (/mnt/main/Exp/0298/009/etc/scoring.log) of the transcript shows that for the first speaker there is 1 sentence and 3 words.
     |-----------------------------------------------------------------|
     | SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
     |---------+-------------+-----------------------------------------|
     | sw2001b |    1      3 |100.0    0.0    0.0   66.7   66.7  100.0 |
     |---------+-------------+-----------------------------------------|
In the actual 5hr corpus transcript (/mnt/main/corpus/switchboard/5hr/train/trans/train.trans) it looks like this:
sw2001B-ms98-a-0038 154.073750 155.341375 it
The transcript that was created (/mnt/main/Exp/0298/009/etc/009_train.trans) from the 5hr corpus experiment looks like this:
<s> IT </s> (sw2001B-ms98-a-0038)
- So this shows that the genTrans.pl script removes the speaker reference but then adds the <s> and </s> tags to the beginning and end of the sentence, and both of those tags are counted in the scoring report, which inflates the score. We need to speak to Jonas about what he wants to do about that. Matt and Cody said that their new genTrans.pl script will remove those tags. We will be able to compare as soon as we can run the new script on a corpus.
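
To preview what the generated transcript looks like without those boundary tags (a sketch only; it does not change how the scoring script itself counts them):

sed -e 's/<s> //g' -e 's/ <\/s>//g' /mnt/main/Exp/0298/009/etc/009_train.trans | head   # show the first lines with the <s> and </s> tags stripped
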
It took a little work to get into Miraculix, but we were finally able to do it, and Dylan ran the 5hr corpus on his computer since mine was being used for the Data team task we are also working on.

3/23 - Yesterday in class we were having trouble getting the new makeTrans.new.pl script to produce the correct files. Matt made the adjustments he thought were needed, so I am trying again to run the train. Running the train was unsuccessful again, with the same error as before: "Fatal Error: Can not open etc/010_train.fileids!" This is because there is no such file; we need to find out why the new script is not generating the right files. With experiment 0298/009 the directories that are created are the following:

[mhe2000@caesar 009]$ ls
009.html bin bwaccumdir etc feat LM logdir model_architecture model_parameters python qmanager scripts_pl trees wav
With experiment 0298/010 (running the new makeTrans.new.pl script) the following directories are created:
[mhe2000@caesar 010]$ ls
010.html bin bwaccumdir etc feat logdir model_architecture model_parameters python scripts_pl wav
- The LM folder is created later with the language model, so that difference is OK, but the "qmanager" and "trees" directories are not being created. I need to take a look at the original script and see how it differs from the updated one (see the diff sketch below).
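
The plan is simply to diff the two scripts. Assuming both sit in /mnt/main/scripts/user like genTrans.pl does, something like:

diff /mnt/main/scripts/user/makeTrans.pl /mnt/main/scripts/user/makeTrans.new.pl   # show exactly what changed between the original and updated script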

3/28 - Read everyone's log and saw their progress this week. It looks like the Empire team is having trouble running a train on Miraculix, so we will need to address that in class on Wednesday.

Plan

3/22 - Start to analyze the new transcript to see what new words are being kept now with the new script that were not being kept with the old script.

Also tried to run a train in Miraculix with Dylan to make sure that it worked on one of our team (Empire) machines. Vitali ran one on the other machine Asterix to test it as well.

3/23 - Run a Train with the new genTrans.pl script.

3/28 - Check in on everyone's log and see their progress this week

Concerns

3/22 - none at this point in the week

3/23 - worried about getting the train, LM and decode to run properly with the new genTrans.pl script.

3/28 - Still worried about getting the train, LM and decode to run properly with the new genTrans.pl script, and about getting the train running on Miraculix for the Empire team.

Week Ending April 4, 2017

Task

4/1 - check in on other group and team members to read their logs and see how they are doing so far this week.

4/2 - start on the URC POSTER

4/3 - keep working on the URC POSTER and see if I can ssh into miraculix and do some work in there for the Empire team to run an experiment.


Results

4/1 - It looks like the Empire group needs to copy some directories over to Miraculix to do the decode. I asked in Slack if it had been done yet and, if not, told the group I would give it a try. Matt is still working with Cody on the script and getting it to work with the decode.

4/2 - Started gathering topics and data examples and thinking about what should and needs to be included on our poster. I told the other group members my ideas and asked for their input on what else they wanted. I was able to get examples, a template, and the requirements from another professor to do it in PowerPoint.

4/3 - A very good learning experience for me using some Linux commands and moving from one machine to the other. I was able to ssh into miraculix successfully as root. I then needed to copy a directory from caesar to miraculix. It took some research online, but I found how to copy from a local machine to a remote one. Alex was helpful in that he told me to use the rsync command rather than cp. I initially went into caesar and used this command -

[root@caesar main]# rsync -r /mnt/main/local root@miraculix:/usr/local

Then I ssh'd into miraculix to see if the directory had copied over, and found that it did, but into the wrong place - here is what happened.

[root@miraculix ~]# cd /usr/local
[root@miraculix local]# ls
bin etc games include lib lib64 libexec local sbin share src

This showed me that a new local directory had been created inside the existing local. This was incorrect; I wanted to replace the existing local. So I used this command -

[root@miraculix local]# rm -rf local

to remove the directory. I went back into caesar and repeated the steps using this command instead -

[root@caesar main]# rsync -r /mnt/main/local root@miraculix:/usr/.

When I went back and ssh'd into miraculix I checked and found that the directories had copied over; comparing some of them with what was in caesar confirmed it was successful. Hopefully this will help us run the decode on miraculix now.
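
The nesting on the first attempt comes from how rsync treats the source path: without a trailing slash it copies the directory itself, with a trailing slash it copies only the contents. A sketch of the variants:

rsync -r /mnt/main/local root@miraculix:/usr/local    # copies the directory itself, producing /usr/local/local
rsync -r /mnt/main/local root@miraculix:/usr/         # places the directory at /usr/local (what worked)
rsync -r /mnt/main/local/ root@miraculix:/usr/local/  # trailing slash: copies only the contents into /usr/local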


Plan

4/1 - Check in on other group and team members, read their logs, and see how they are doing so far this week.

4/2 - start on the URC POSTER

4/3 - keep working on the URC POSTER and see if I can ssh into miraculix and do some work in there for the Empire team to run an experiment.


Concerns

4/1 - how the genTrans script is coming along

4/2 - I'm worried about how long it will take to get the new genTrans.pl script up and running so we can provide better data for the other teams.

4/3 - same concerns as before

Week Ending April 11, 2017

Task

4/5 - create a README.txt to put into the scripts.pl to explain what has been done with genTrans.pl.

4/6 - checking in to read logs

4/7 - finish up URC poster

4/11 - checking in with team members and to read logs

Results

4/5 - I created the file and documented the differences between the genTrans.pl and genTrans.new.pl files.

The regular expressions that were in the old script file were:

$message = $line; # copy line to new variable
$message =~ s/sw[0-9]*[A-B]-ms98-a-[0-9]* [0-9]*.[0-9]* [0-9]*.[0-9]* //; # remove everything before the message
$message =~ s/\[noise\]\s//g; # remove [noise]
$message =~ s/\[laughter\]\s//g; # remove [laughter]
$message =~ s/\[vocalized-noise\]\s//g; # remove [vocalized-noise]
$message =~ s/<.*?>//g; # remove <<word>>
$message =~ s/  / /g; # replace double space with single space
$message = uc $message; # all text to uppercase

The regular expressions that were added to the new script file were:

$message = $line; # copy line to new variable
$message =~ s/sw[0-9]*[A-B]-ms98-a-[0-9]* [0-9]*.[0-9]* [0-9]*.[0-9]* //; # remove everything before the message
$message =~ s/\"//g;
CODY'S CODE
$message =~ s/noise]//g;#changed - [noise]
$message =~ s/\[laughter//g;#added
$message =~ s/\[vocalized//g;#added
$message =~ s/\[.*?\]-//g;
$message =~ s/-\[.*?\]//g;
$message =~ s/\/.*?\]//g;
CODY'S CODE END
$message =~ s/\[//g;
$message =~ s/\]//g;
$message =~ s/\-/ /g;
$message =~ s/\// /g;
$message =~ s/\{//g;
$message =~ s/\}//g;
$message =~ s/\_1//g;

So the part between "CODY'S CODE" and "CODY'S CODE END" is the new set of regular expressions that was added. These handle removing the non-speech brackets and the text inside them. Also modified was the makeTrans.pl file: the path to the genTrans.pl script needed to be adjusted so that it calls the new genTrans.new.pl script. This was the difference that was made (a quick way to try the new substitutions on a sample line is sketched after the snippet below):

makeTrans.pl:

$cmd = "/mnt/main/scripts/user/genTrans.pl $flag $corpus $corpus_dir $exp";

makeTrans.new.pl:

$cmd = "perl /mnt/main/scripts/user/genTrans.new.pl $flag $corpus $corpus_dir $exp";

4/6 - checking in to read logs

4/7 - Worked on the final touches of the URC poster. I decided to focus the poster on the new genTrans.new.pl script versus the original genTrans.pl script. I gave a small description of what the data is and what we do. Included were the problem (the quality of the data), the method to fix it (implement new regular expressions in the scripts), and the implementation and results (the script before, what was added/removed in the new script, and transcript examples of the results using the two scripts). I took before and after images of the scripts and the results to compare and contrast, show what the changes are, and show how they affect the data. I also included what the next steps are for us as a team. After having the other guys review and approve it, I was able to submit it to Professor Jonas on Monday, April 10th, for printing.

4/11 - checking in with team members and to read logs

Plan

4/5 - create a README.txt to put into the scripts.pl to explain what has been done with genTrans.pl.

4/6 - checking in to read logs

4/7 - finish up URC poster

4/11 - checking in with team members and to read logs

Concerns

4/5 - none at this time

4/6 - none at this time

4/7 - none at this time

4/11 - The scripts are still giving us issues, so we can't yet get an accurate score for how much the decode improved.

Week Ending April 18, 2017

Task

4/12 - Troubleshoot why the scripts that are attached to the train are spitting out errors.

4/13 - checking in

4/18 - checking in


Results


4/12 - Dylan was able to discover at least one thing: there are multiple copies of the scripts_pl directory on caesar, and we were trying to modify the wrong files. The error seems to be occurring in the verify_all.pl script. We have tried a couple of things, commenting out parts of the code and blocking the script altogether, but the same error on the same file is still produced. We are starting to run into a wall. Will continue to research options.

4/13 - checking in

4/18 - checking in


Plan

4/12 - Troubleshoot why the scripts that are attached to the train are spitting out errors.

4/13 - checking in

4/18 - checking in


Concerns

4/12 - I don't think we will be able to discover the problem with the scripts in time to get a full experiment run.

4/13 - checking in

4/18 - checking in

Week Ending April 25, 2017

Task

4/22 - checking in


Results

4/22 - checking in


Plan

4/22 - checking in

Concerns

4/22 - checking in

Week Ending May 2, 2017

Task

4/30 - help team out with written report

5/1 - start final report

Results

4/30 - help team out with written report

5/1 - Started the final report in the wiki; got the outline started for all the groups and team competitions.

Plan

4/30 - help team out with written report

5/1 - start final report

Concerns

4/30 - that our team will not get the best results

5/1 - same as above

Week Ending May 9, 2017

Task

5/5 - checking in

5/7 - checking in

Results

5/5 - checking in

5/7 - checking in

Plan

5/5 - checking in

5/7 - checking in

Concerns

5/5 - checking in

5/7 - checking in