Speech:Spring 2018 Tri Nguyen Log



Week Ending February 5th, 2018
 * Task:

1/31: I logged into caesar.unh.edu, then tried some Unix commands that I had found on Google.

2/1: Our group wanted to meet at least once more on a different day outside of class. We discussed this through email.

2/3: Read through other students' logs and tasks from the Spring 2016 and 2017 semesters to understand more about the project.

2/4: Tried to run Sphinx Train on my experiment folder, but it was unsuccessful.

 * Results:

1/31: The login was successful.

2/1: We exchanged our schedules and phone numbers by email.

2/3: Checked out the Experiments directories and created my directory within the 0303 folder.

 * Plan:

2/1: Next week our group will probably meet on Tuesday.

2/3: Read more about Sphinx.

2/4: Read more about Sphinx Train and ask my team about it.
 * Concerns:

2/1: I checked our schedules, but I still couldn't find an available day before Monday.

2/3: There is a lot of information, and it's hard to cover all of it.

2/4: I followed the instructions to run the train, but it was unsuccessful.

Week Ending February 12, 2018
 * Task:

2/6: Arias and Rosali showed me how they ran the train; the commands are the same, but the way they are used differs from the instructions on the wiki page. Learned about the servers and how to start them directly.

2/7: Read about the language model and the decode process. Used FileZilla to explore and listen to the audio (.wav) files. Learned how to edit the parameters of a train from the Guide for Running a Train.

2/8: Met up with the group from 2 pm to 4 pm; we talked about the proposal, listened to the .sph files with VLC, created the language model, and decoded the train. I listened to ten .sph files and compared them with the transcript.

2/11: Discussed the rough draft of the proposal with my group through email and continued reading logs and tutorials. Deleted sub-folder 006, then created it again to test the train.

2/12: I deleted sub-folders 005 and 006 and ran the train on 001 and 002.
 * Results:

2/6: Trained with different parameters (5 hours instead of 30 hours) from the Asterix server.

2/8: I had created the language model and run the decode on the trained data before coming to the meeting. I got stuck at the last step of the decoding process, Prepare the hypothesis transcript; it looks like the command  didn't work as the instructions said, because the result file hyp.trans was empty. When we met, Arias created the language model, but he made it inside his root exp folder  while I created it inside the trained folder. The rest of the process was the same, and we also got stuck at the  step. We read about SCLite and searched on Google, but unfortunately the problem was not solved. We were out of time, so we decided to discuss the proposal draft on Discord.

2/11: Arias submitted the rough draft of the proposal for the Data group today.

2/12: I got the error. On the wiki page Run Decode Trained Data, it says. I tried both ways, but it still didn't work. I opened the hyp.trans file from  and deleted all the missing files, then I got the result:
 * Plan:

2/6: Read the Guide for Running a Train to know more about it.

2/8: Try to solve the decoding issue and contribute to completing the proposal draft before the deadline.

2/7: Prepare for the meeting on 2/8; create the language model and run the decode (trained data).

2/11: Read the tutorial and try to decode the trained data.
 * Concerns:

2/6: The train was unsuccessful at first because all the servers were down. I could finally run the train after professor Jonas restarted the servers.

2/8: I tried to run the train on the sub-experiment directory within , and this time it was unsuccessful again. I noticed that the train could only find around 4781 words using 43 phones for the dictionary instead of 12146 words using 43 phones. I still couldn't find out how to fix it. I am the only one who uses ; the rest of the group uses .

2/12: The decoding was successful, but I knew it was not the right way to do it. I will try to find out how to solve it in tomorrow's class.

Week Ending February 19, 2018
 * Task:

2/13: I tried to run the decoding a second time on the same trained data. It was successful, but the result was very short. I spent a few hours with Rosali and Prof. Jonas finding the reason. I learned a lot of new Linux commands and understood how the scripts work.

2/14: I read the wiki pages to find more information about the data files (dictionaries, audio files, transcripts, tags, annotations, filler words).

2/15: I ran the train and decoded it again, then prepared the presentation and the proposal for the meeting with my group at 1 pm.

2/16: Counted and checked the soft links of the audio files in the corpus. Discussed building the proposal with Rosali.

2/17: Read logs and learned more about the corpus, and learned how to build a new one. Discussed with prof. Jonas and Rosali through email to finish the proposal, because we needed to submit it tonight.
 * Results:

2/13: The problem was very complicated. The first issue I found was duplicated data in the dictionary. It happened because I did the decoding twice. Prof. Jonas fixed it by using the command  to sort the data and delete the duplicates.

The second issue was that the dictionary in the language model was missing some words, for example. As a result, those words would not be recognized.

2/15: The result of the train was better than last time because I used the commands  and  to sort and delete unwanted data.

2/16: I counted the number of sound files (.sph) inside each folder of 300hr, 145hr, 30hr, and 5hr, and they matched the number of lines in the transcript files (tran.trans). Then I used the command  to check where each .sph file belonged. I learned the commands from Dakota's 2015 log. The .sph files were linked to , and all the soft links were okay.

For the .wav files, I only checked the 40min and full folders in the noaa folder. The transcripts for 40min matched the audio files; the transcript for the full folder had some extra blank lines, but the data also matched.

All the checks above only let us know that the data is still in its place; they can't make sure that the data is not corrupt.
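A check like this can be sketched in a few lines. The helper below is hypothetical (not one of the project's scripts) and only compares counts, consistent with the caveat above that matching counts don't prove the data is uncorrupted:

```python
def counts_match(audio_files, transcript_lines, ext=".sph"):
    """Compare the number of audio files (matched by extension) with the
    number of non-blank transcript lines. Hypothetical helper, not one of
    the project's scripts."""
    n_audio = sum(1 for name in audio_files if name.endswith(ext))
    n_lines = sum(1 for line in transcript_lines if line.strip())
    return n_audio == n_lines, n_audio, n_lines
```

Ignoring blank lines matches what was seen in the full folder: extra blank lines in the transcript, but the data still lined up.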

I think the 2014 and 2015 classes already cleaned up the bad data, which is why the Data groups of 2016 and 2017 didn't do it again. They just changed the scripts to remove the brackets, added new words to the dictionary, and fixed the bad translations from the audio to the transcripts.
 * Plan:

2/13: Our group will need to finish the proposal and prepare the presentation for the next class.

2/16: I will email prof. Jonas to ask about the bad data and soft links. It will affect our proposal.

2/14: I don't have much time today, but I will spend more time on the weekend to finish all the tasks.
 * Concerns:

2/16: How to check whether the data is corrupt will be a problem.

Week Ending February 26, 2018

 * Task:

2/21: Our group met at 2 pm to run the second train (the train with the parseLMTrans.pl script that I had edited). The only difference between the two experiments was the command used to run the parseLMTrans.pl script. The result was better than the first experiment.

The first experiment used the normal command: parseLMTrans.pl trans_unedited trans_parsed

The second experiment used the edited command: /mnt/main/Exp/0303/026/parseLMTrans_nob.pl trans_unedited trans_parsed

I will edit the script again to improve the WER of the experiment.

After that, Rosali and I went to the server room to meet Daniel Rubin. He showed us how to create a virtual image of a drone server (asterix) and cloned it with the BB command. He did it by booting into a persistent bootable Debian distro (a version of Linux on a flash drive). I know how to do this on a Windows system, but it was the first time I saw how it works on Linux (Daniel told us it's not a Unix system).

We had a meeting on the Discord app at 9:30 PM with the Model group. We talked about the three experiments and the two scripts that we need to run/edit. The Data group and Model group will have another meeting on Friday at 6 pm.

 * Results:

2/21: Issac tried to create another experiment but failed. Rosali and I also tried to create our experiments, but I was the only one who created one successfully, even though all of us ran the same commands. Arias Talari passed by and tried to help us, and Rosali was able to create her experiment at 4 pm.

2/22: Our group had a meeting in the library. We discussed how to choose random audio files to check, and ran the train and decoding with the new parseLMTrans.pl script that I had edited. The WER of all the results was less than 30%:

The results of our classmates this year:

2/23: I edited the genTrans.pl script today, but I couldn't make it work. I changed the script so it would remove the brackets, the [Laughter-] string, and the - character.
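A rough Python stand-in for that edit might look like this (a sketch assuming the targets are literally the [Laughter-] marker, the bracket characters, and the dash; the real change was in Perl inside genTrans.pl):

```python
import re

def clean_transcript_line(text):
    # Sketch of the cleanup described above (hypothetical, not genTrans.pl):
    text = text.replace("[Laughter-]", " ")   # drop the laughter marker
    text = re.sub(r"[\[\]]", "", text)        # strip the bracket characters
    text = text.replace("-", " ")             # remove the dash character
    return " ".join(text.split())             # normalize whitespace
```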

The script makeTrains.pl calls genTrans.pl when we run a train, so I edited it first. I changed the link from the old genTrains.pl to my edited one.

However, when I ran the train, the script created some new words that only existed in the transcript but not in the dictionary.

I tried to remove all of them to get rid of that issue, but I was afraid it would affect the WER result.

2/24: I read the logs of Dylan Lindstrom (2017) and Matthew Fintonis (2017) and found out that they tried to add new words to the dictionary both by hand and by script. I didn't have enough time this week to add new words, so I will remove the words that don't exist in the dictionary to make the train work.
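The stop-gap of dropping out-of-dictionary words can be sketched like this (hypothetical helper; the real work happened in the Perl scripts):

```python
def drop_oov_words(transcript_line, dictionary_words):
    """Keep only the words that appear in the pronunciation dictionary."""
    return " ".join(w for w in transcript_line.split() if w in dictionary_words)
```

As noted above, removing words this way risks skewing the WER, since the reference transcript no longer matches what was actually said.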

I edited genTrains and asked the Data group and Model group to test it, but the scripts makeTrain_2018.pl and genTrains_2018.pl only worked well on the 5hr corpus. I tested it several times, and I thought it was better than the old one:

2/26: I tested the train with the script makeTrain2018.pl on a 30hr corpus. The first and second steps worked.

However, when I typed the command for the third step, many errors showed up. The program failed to pass the check against the dictionary. I tried to fix this issue by adding more words to the dictionary.
 * Plan:

2/21: Our group will have a meeting at 1 PM tomorrow. I will come a little late because I have to work until 1 pm, and it takes at least 40 minutes to drive from my company to the campus.

2/23: Read more logs and try to find out how to safely edit the genTrans.pl script.

2/24: Test and improve the genTrans.pl script, listen to the audio files to test and edit the transcript.

 * Concerns:

2/21: I ran the second experiment at least 6 times today. Four of the runs failed; the successful ones were on the Caesar server.

2/23: I kept repeating the process (edit the script, run the train, delete the experiment) no fewer than 20 times today. I made one big mistake today: at one point I was in the 0303 folder (/mnt/main/Exp/0303) but thought I was still in my experiment folder (/mnt/main/Exp/0303/034), and I used the command. I didn't have permission to delete other experiments, and I pressed Ctrl + C right away, so nothing happened.

2/24: So far so good.

Week Ending March 5, 2018
 * Task:

2/28: The Data group had a meeting today; Rosali and I reviewed the scripts of the students from 2017 and their customized dictionary. They had several versions of genTrans.pl and makeTrain.pl, such as makeTrain.dic.pl, makeTrain.new.pl, makeTrain.test.pl, genTrans.new.pl, and genTrans.test.pl. We read and discussed the changes they made to the scripts (most of them were regular expressions). Then we tried to run their scripts, but none of them worked.

They tried to remove the brackets, "-", and "_1", then add the new words to the dictionary. There were a ton of new words to add; for example, from the word "wh[at]-" we could get the new words "wh", "wh-", "-", or "what-", which didn't exist in the original dictionary. Rosali reminded me of the official CMU site; unfortunately, they haven't updated their dictionary in a long time, and I didn't think they would add a word like "wh-". It just doesn't make sense.

We also checked the history of the genTrain.pl scripts, but we didn't have enough time to check them one by one. We didn't know which of them were working and which were bad versions.

3/2: I dug into the history folder to find out what the students of 2015 did with genTrain.pl. The versions of genTrain.pl that they had modified were in folders 5 to 11. I checked the scripts in folder 11 first because they would be the last ones that the students of 2015 had changed. There were three versions of genTrain.pl in that folder, but none of them made any change to the exported truth file. Most of the commands had been disabled with the comment character "#":

In the other folders, they tested the scripts without the comment character, but when I used makeTrain.pl to call them, all of them had the same error: many words were in the transcript file but not in the dictionary.

I also opened the genTrain.pl from 2014, and everything was the same as the genTrain.pl in folder 11, so I guessed the script didn't change from 2014 to 2016. In 2017, the students tried to change that script but were also unsuccessful.

Professor Jonas told me not to delete the character "-", but when I did, hundreds of new words appeared, most of them not in the dictionary:

I edited it so it could run on 30hr, and I hoped it could also run on 300hr. To make it run, there were two ways to modify the script: the first was to add more words to the dictionary, and the second was to remove the words that didn't exist in the dictionary. I used the second method to test genTrain.pl.

3/3: I changed the script so it would remove both [] and - in the truth transcript and the hyp transcript. I will test the script tonight.

I downloaded 50 audio files to listen to and compared them with the transcript. All the translated text in the transcript was fine. The only problem was the noise: there was a lot of noise in the audio files. When the program turns the audio into text, it also turns these noises into words, and it's not easy to fix this problem because there are too many of them and we still don't have an effective solution. Issac will use his program to remove the noise, and we will test it when we have enough data.

3/4: Prof. Jonas told me that the result of a test must be the same each time, but I tested both the current scripts and my new scripts, and the results were always different. I used some Unix commands (cat, wc, uniq) to check and found out that the hyp.trans file always had less data than the _train.trans file. This was the hyp.trans from my newest test:

I checked the number of audio files and the number of lines in the transcript, and they were the same. I read the logs of my classmates but still didn't see anyone talking about this problem. I will ask them on our Discord channel today.

I checked all the new experiments created this week, and all of them had the same problem as mine. I still couldn't find a way to create a hyp.trans with the same size as _train.trans.
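One way to see which utterances are missing is to diff the ids of the two files. The sketch below assumes each transcript line ends with an id in parentheses, the usual sclite-style convention; the function name is hypothetical:

```python
import re

def missing_utterance_ids(ref_lines, hyp_lines):
    """Return ids that appear in the reference transcript but not in
    hyp.trans, assuming lines end with "(utt_id)"."""
    def ids(lines):
        found = set()
        for line in lines:
            m = re.search(r"\(([^()]+)\)\s*$", line)
            if m:
                found.add(m.group(1))
        return found
    return ids(ref_lines) - ids(hyp_lines)
```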

 * Results:

2/28: We couldn't use the scripts of the students from 2017. I will try to read and run the scripts of the students from 2015, because the current version of the script was created in 2016.

 * Plan:

3/2: I will email prof. Jonas and ask about the "-" character; I still want to get rid of it.

3/4: Try to find out why the results of the test on the same scripts and data were different.
 * Concerns:

3/3: The same test had a different result on a different drone.

Week Ending March 12, 2018
 * Task:

3/8: We found out why all the hyp.trans files didn't have enough data after reading prof. Jonas's email. He wanted our group to edit the scripts and get new results.

I used the script from experiment 26 (keep the brackets and the dash character when running the language model) on Exps_0305_007, and the result was:

| Sum/Avg | 4172 60215 | 72.2   19.9    7.9    7.1   34.9   86.1 |
| Mean    |  1.3  19.1 | 75.4   18.6    5.9   14.5   39.0   86.4 |
| S.D.    |  0.5  16.5 | 18.2   15.4    7.9   28.2   32.5   31.9 |
| Median  |  1.0  15.0 | 75.0   16.9    2.9    3.6   33.3  100.0 |
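As a sanity check on these scoring tables: in sclite output, the Err column is the sum of the substitution, deletion, and insertion percentages, so on the Sum/Avg row 19.9 + 7.9 + 7.1 = 34.9.

```python
def total_error_rate(sub_pct, del_pct, ins_pct):
    # WER (the Err column) is substitutions + deletions + insertions.
    return sub_pct + del_pct + ins_pct
```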

I also created a script for removing only the brackets but keeping the dash. There were 392 new words that appeared in the transcript but not in the dictionary. I will ask prof. Jonas about this problem: do I need to add those words to a new customized dictionary, and what phonetics do I need to use for each word?

The last script I worked on today removes the brackets and dash in both the transcript file and the LM. I ran the train in  and the result was:

3/9: The group met on Discord. Rosali suggested that we could change the truth transcript to match the audio, and we might need to discuss this with the other groups and prof. Jonas. The professor didn't like the descriptions that we left on the experiments, so I reran all my experiments on 0305 and gave them detailed descriptions.

I messed up the experiment Exps_0305_007, which should have had the same result as Exps_0303_026. I changed the script and didn't check the data after running it. I usually check a Perl script on rextester.com before running it on the server, but I think I forgot to do that this time. I forgot to delete the number at the beginning of each line:

I did that experiment again in . It kept the brackets and the dash in the LM.

The result was okay this time:

| Sum/Avg | 4172 60215 | 72.9   19.3    7.8    7.4   34.5   87.4 |
| Mean    |  1.3  19.1 | 75.8   18.4    5.8   15.3   39.5   87.8 |
| S.D.    |  0.5  16.5 | 18.0   15.3    7.7   29.0   32.9   30.2 |
| Median  |  1.0  15.0 | 75.9   16.7    2.6    4.0   33.3  100.0 |

Our group will focus on the analysis of the training vs. dev set this weekend, and we may have a meeting on Monday if possible.

3/10: The Data group needed to clean up all the experiments under the 0305 folder. Prof. Jonas wanted us to finish these four experiments:

0305/003: remove '[ ]' & '-' from LM, keep in dictionary and training transcript

0305/007: keep '[ ]' & '-' in LM, dictionary, training transcript

0305/008: remove only '[ ]' from LM, dictionary, training transcript but keep '-'

0305/009: remove '[ ]' & '-' from LM, dictionary and training transcript

I was responsible for the 0305/003, 0305/007, and 0305/008 experiments. Steve of the Model group informed us that he would run the 300hr decode on Asterix and that it would take over 100 hours. The obelix server was very slow, so I did my decode on miraculix. I would be fine with 0305/007 and 0305/009, but for 0305/008 I didn't know how to re-create the dictionary. Professor Jonas showed us how to generate the dictionary in an email, so I saved that one for last.

It looks like I misunderstood "remove only '[ ]' from LM, dictionary, training transcript". It means remove EVERYTHING INSIDE and INCLUDING the brackets, not just the brackets themselves. For example, the word F[ALL]- will become F-, not FALL-. I will redo this experiment tomorrow, but this is an important issue.
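Under that reading, the whole bracketed group disappears. A one-line sketch (hypothetical helper, not the project's script):

```python
import re

def remove_bracketed_groups(word):
    """Remove everything inside and including the brackets,
    so F[ALL]- becomes F-, not FALL-."""
    return re.sub(r"\[[^\]]*\]", "", word)
```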

3/11: I ran the train and decoding on the drone miraculix, but it was not successful. The drone couldn't create the decode.log file. The Systems group told us that we still could not run the test on miraculix, idefix, brutus, and rome. I will run the train again on obelix, but it's very slow.


 * Results:


 * Plan:

3/8: I will test the default scripts tomorrow. Brian did that test before in his 0303/042 folder; the result was:

0303/042/etc/scoring.log: | Sum/Avg | 4172 60215 | 72.5 19.6 7.9 7.3 34.8 87.5 |

I just want to try it again and make sure it's the right result.

3/10: For 0305/008 (remove only '[ ]' from LM, dictionary, and training transcript but keep '-'), generate the dictionary on Sunday.
 * Concerns:

3/9: Always remember to check the script and the result after running it.

Week Ending March 26, 2018
 * Task:

03/20: I created two Perl scripts to export and count the speakers in the transcripts. The script speaker.pl can export the speakers' names from a transcript file.

I used two scripts to solve the ratio problem:

./compare.pl dev.trans_speakers_uniq train.trans_speakers_uniq > dev_train

$ wc dev_train

2110 2110 16880 dev_train

There are 2110 speakers in the dev set that are also in the train set.

$ wc dev.trans_speakers_uniq

3168 3168 25344 dev.trans_speakers_uniq

We have a total of 3168 speakers in the dev set, so 2110 x 100 / 3168 = 66.6% of the speakers in the dev set are also in the train set.
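The ratio computed above with compare.pl and wc amounts to a set intersection over unique speaker names; a minimal sketch:

```python
def speaker_overlap_pct(dev_speakers, train_speakers):
    """Percentage of unique dev speakers that also appear in train."""
    dev, train = set(dev_speakers), set(train_speakers)
    return 100.0 * len(dev & train) / len(dev)
```

With 2110 shared speakers out of 3168 unique dev speakers, this gives the 66.6% above.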

03/22: I ran the train/decode on 0305/013 as tmn1001, but it didn't work because I wasn't the one who created that experiment. I created a new experiment, 0305/014, to test the script from 0305/013, and the result was:

| Sum/Avg | 4172 60569 | 72.6   19.3    8.1    7.0   34.4   87.6 |
| Mean    |  1.3  19.2 | 75.6   18.4    6.0   14.7   39.2   88.0 |
| S.D.    |  0.5  16.5 | 18.1   15.4    7.8   28.2   32.2   30.1 |
| Median  |  1.0  15.0 | 75.0   16.7    2.9    3.7   33.3  100.0 |

03/23: I tried to run the train/decode on 0305/013 as tmn1001 again, but it didn't work because I wasn't the one who created that experiment. I changed the user from tmn1001 to root and continued to work on 0305/011 and 0305/013. I fixed the script in 0305/011/scripts, but the system didn't let me run the decoding step:

nohup run_decode.pl 0305/013 0305/013 1000 &

nohup: ignoring input and appending output to `/root/nohup.out'

nohup: failed to run command `run_decode.pl': No such file or directory

I tried it again with the absolute path for the script, but it still didn't work:

nohup /mnt/main/scripts/user/run_decode.pl 0305/013 0305/013 1000 &

nohup: ignoring input and appending output to `/root/nohup.out'

[1]+ Done                    nohup /mnt/main/scripts/user/run_decode.pl 0305/013 0305/013 1000

03/24: I ran the train/decode on the unseen data for the Avengers group on 0305/015. I created the main experiment folder for the training on 0305/015 and the sub-experiment for decoding and grading on 0305/016. It was the first time I ran a train/decode on unseen data. I finally understood the meaning of /

0305/013: tail -8 /mnt/main/Exp/0305/013/etc/scoring.log

Yashna and I edited the Perl script this morning. We planned to remove both the brackets and the dash character from both the transcript and the LM. As for the specific cases like [kleeping/keeping], [Bamorghini/Lamborghini], [bystandard/bystander], ...: because in most cases the second word is the right one, we eliminated the first word and kept the second one.
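That rule (keep the second word of each [wrong/right] pair) can be sketched with one substitution; this is a hypothetical Python stand-in for the Perl edit:

```python
import re

def keep_second_alternative(text):
    """For pairs like [Bamorghini/Lamborghini], keep the second
    (usually correct) word and drop the first."""
    return re.sub(r"\[[^/\]]*/([^\]]*)\]", r"\1", text)
```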

03/30: I did the train/decoding on 30hr for unseen data. The folder 0305/020 was created for training and 0305/021 for decoding. However, at this step:

RUN nohup run_decode.pl /

But I couldn't run the decode with that command. I changed it to a one-line command:

/usr/local/bin/sphinx3_decode -hmm /mnt/main/Exp/0305/033/model_parameters/033.mllt_cd_cont_1000 -lm /mnt/main/Exp/0305/033/LM/tmp.arpa -dict /mnt/main/Exp/0305/033/etc/033.dic -fdict /mnt/main/Exp/0305/033/etc/033.filler -ctl /mnt/main/Exp/0305/033/etc/033_decode.fileids -cepdir /mnt/main/Exp/0305/033/feat -cepext .mfc >& decode.log &


 * Results:

04/11: Successfully created the acoustic model on 300 hours of training with my scripts.


 * Plan:

04/11: Run the decoding on Asterix.

04/15: Run the LDA decoding following prof. Jonas's instructions.
 * Concerns:

Week Ending April 23, 2018
 * Task:

04/19: I ran the 300-hour seen-data experiment with the new genTrain.pl on 0310/022 (train folder) and 0310/023 (decode folder). I used the scripts that remove both [] and - from the transcript and language model.

In the afternoon, Daniel R and I worked on building the new dictionary for the Avengers group. I also ran a new 5hr experiment to test it.

04/20: Created a new Perl script to combine the dictionaries and remove the unwanted words. It would be easier to do this in Java; if I can't finish the script this week, I will use Java for the job. - Avenger's Secret -
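The combining step can be sketched as a dictionary merge with a removal pass (hypothetical helper assuming word-to-pronunciation maps; the real script was in Perl):

```python
def merge_dictionaries(primary, secondary, unwanted=()):
    """Merge two pronunciation dictionaries (word -> pronunciation).
    Entries from `primary` win on conflicts; `unwanted` words are dropped."""
    merged = dict(secondary)
    merged.update(primary)
    for word in unwanted:
        merged.pop(word, None)
    return merged
```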

04/22: The tests on both new dictionaries were better than the current one (master.dic). Our group has already tested them on the 5hr and 30hr corpora, and Lamia also ran her experiments on 145hr. I thought our group would win this semester.

Steve and Rosali already finished their 300hr test with the new genTrain.pl on the Automatix drone. I ran the train two days ago and the decode this morning. It should be done on Monday. All of the tests used 8000 senones and didn't enable LDA.

04/22: Yashna and I analyzed their acoustic model, and it looks like they keep the brackets, laughter, and dash in their transcript, but in their language model they removed all the brackets and the laughter and kept just the dash character. It's weird that they also keep some words like BECAUSE, ESPECIALLY, and OKAY_. This year, I think both groups are going to use the scripts on both the acoustic model and the language model. To apply the scripts to the acoustic model without modifying genTrain.pl, we need to use parseTrainTrans_no_brackets.pl and pruneDic_no_brackets.pl before running the train. The scripts still have some issues, and I hope I can fix all of them next week.

Rosali found out that


 * Results:

04/23: The result for the train on 0310/022; it was a 300hr experiment on seen data with the scripts from 0305/013 and the new genTrain.pl:

tail -8 scoring.log

| Sum/Avg | 4034 57411 | 73.9   19.8    6.3    8.0   34.1   87.8 |
| Mean    |  1.3  18.5 | 76.2   19.1    4.7   16.2   39.9   87.6 |
| S.D.    |  0.5  16.1 | 18.4   16.0    7.0   30.6   34.9   30.8 |
| Median  |  1.0  13.0 | 77.5   16.7    0.0    4.5   33.3  100.0 |


 * Plan:


 * Concerns:

Week Ending April 30, 2018

 * Task:

04/25: Our group ran into a big problem. We also didn't have enough time to run all the tests, especially the 300hr experiments.

04/26: We decided that if we couldn't fix that problem today, we would use the previous version of **** to run the experiments.

04/27: Ran the 300hr experiments with my new scripts, along with the professor's scripts.

04/29: Got the results for two 300hr seen experiments. Started a new one for 300hr.


 * Results:


 * Plan:


 * Concerns:

Week Ending May 7, 2018
 * Task:

05/01: Steve, Daniel, and I discussed the LDA training and the normal one. We wanted to focus more on the normal train.

Jaden, Daniel, and I went to prof. Jonas's office and asked for help. Our dictionary had problems, and prof. Jonas fixed most of them. We only needed to fix 66 words.

05/01: Daniel finished the new dictionary. I checked it with the scripts on 300hr and I thought it was going to work well. Our group would test the new dictionary on 5hr first.

05/03: Ran three 300hr experiments on unseen data; two of them used the scripts that remove the brackets but keep the dash character.


 * Results:


 * Plan:


 * Concerns: