Speech:Summer 2014 Erol Aygar



Week Ending June 4th, 2014
I will write my findings, tasks, notes, takeaways here.
 * Introduction:

I also plan to install Fedora 20 on a workstation.

After several attempts, I discovered that installation via USB has issues; the DVD image worked, however.

Takeaways from the Fedora installation:
 * network settings are hard to configure; still working on it
 * GNOME UI


 * Notes:
 * Meeting with Dr. Jonas.
 * Received tasks for the first couple of weeks.
 * Start disassembling the server
 * Obtain Fedora Installation
 * Started Fedora 20 installation on machines


 * Tasks:
 * Call Router
 * Hardware (inventory/cleanup)
 * Fedora Updates
 * GNU Radio Toolkit | GNH Radio
 * Acoustic Model Improvements

Week Ending June 11th, 2014
I started reviewing previous semester proposals and reports. My favorites are the following, as they introduced the authors' experiences (issues, struggles, solutions) pragmatically:
 * [David's notes]
 * [Colby's notes]
 * [Josh's notes]

Missing files: I worked on experiments and found that some files are missing. I think we need to reorganize the server. The switchboard and script folders seem to have been changed during the final experiment contest. The .dic, .filler, .phone, .fileids, and .trans files are missing under most of the switchboard corpus (the info folders of train-related files are present for first_5hr/train and 100hr/train2, but the rest are missing): train.dic, train.filler, train.phone, train_train.fileids, train_train.trans

Low space: a message on caesar states that 1 GB of space is left on the server.
 * Met with Colby and talked about the Avengers experiments; obtained his team's notes and experiments.
 * Started replicating the experiments they made; encountered problems with missing data.
 * Ran a training (first_5hr) successfully and understood how they generated the scripts; planning to continue running those experiments.
 * Started reviewing the configuration of caesar and the installation media to propose a fix for the missing files. I will therefore copy the switchboard corpus and try to install it on an external PC.
 * Overviewed the organization of the scripts to look for opportunities to restructure the way they are used.
 * The initial status report was generated and published this week.

I also got explanatory information from team Avengers:
 * Avenger's team notes

Week Ending June 18th, 2014
June 16th, 2014

Today I went to the system room to examine how much space we have on CAESAR; it is less than 1GB. I also checked ASTERIX to see if it is different, but it is not. I then checked the size of each experiment's folder. The 0251 folder is huge. I tried to copy it to a flash disk to examine it in a convenient place, but I couldn't: the size appears to be 40GB for one experiment folder (/mnt/main/Exp/0251/012/). Since the system links files, the folder size increases dramatically when you try to copy it. I think we need more space. I will put the files on an external drive tomorrow.

My review started from the first experiment, 001; however, after speaking to Colby, he recommended that I review especially the 007 and 012 experiments under folder 0251. I tried to copy the installation folders to a flash disk so that I can figure out the missing files for the mini and tiny experiments, and the scripts as well. I wonder why the scripts are copied into the experiment directory and manipulated each time; I think we need a version control mechanism for our scripts, such as git or VSS.

I arranged a conference call with David for tomorrow. I plan to improve my notes.

June 17th, 2014

Planned
 * meet with David around noon
 * copy the files to an external drive
 * organize notes
 * meet with Jared at 6pm
 * run experiment

Actual
 * I started working at 3pm today, and Jared joined at 6pm.
 * I realized the file-size change is because symbolic links are used instead of literal files in each experiment. When I tried to copy the files, there was a warning message that the FAT filesystem does not support symbolic links; therefore all the dependent files were copied into the destination.
 * I need to find a way to copy the experiment files, especially those under the 0251 folder, onto an external device so that I can understand the mechanism of the scripts, which will help me fix the missing files. I therefore compressed the experiment folders (/mnt/main/Exp/025/) to free space and to let the files be copied without all their dependencies. A backup procedure might be a solution for the space need. I am also trying tar, which might be another alternative for this purpose.
 * Couldn't meet with David yet; I will reschedule.
 * After compressing some files and freeing more space on caesar, I will start another experiment.
 * the USB ports are very slow; I haven't checked, but they are most probably older than USB 2.0, are they?
 * I left the compression of each 0251 sub-directory running, because it will take a while.
 * Jared and I left the room around 9, since the school was closing.
 * I copied the Sphinx installation files. Working on installing on a desktop. (Compiled; make check returned some errors: FAIL: test-decode-raw.sh)
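Since FAT-formatted USB drives cannot store symbolic links, one workaround is to archive the experiment folder with tar, which records links as links instead of expanding their targets. A minimal sketch with made-up demo paths (the real folders live under /mnt/main/Exp/):

```shell
# Demo setup: a fake experiment folder containing a symlink (hypothetical paths).
mkdir -p /tmp/exp_demo/feat
echo "audio-features" > /tmp/exp_demo/feat/utt1.mfc
ln -sf /tmp/exp_demo/feat/utt1.mfc /tmp/exp_demo/link_to_feat.mfc

# tar stores the symlink itself, not a second copy of the target file.
tar -C /tmp -cf /tmp/exp_demo.tar exp_demo

# Extract elsewhere and confirm the link survived as a link.
mkdir -p /tmp/exp_restore
tar -C /tmp/exp_restore -xf /tmp/exp_demo.tar
ls -l /tmp/exp_restore/exp_demo/link_to_feat.mfc
```

An rsync -a copy to an ext-formatted external drive would preserve the links the same way; only FAT forces the link targets to be materialized.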

June 18th, 2014

Planned: ask Dr. Jonas for recommendations about
 * backing up to free some space. For instance, we could copy the experiments before 100 to another location. I am not sure it is a good idea.
 * installation files for a fresh install on another machine, to migrate and then continue experiments
 * continuing to copy the tar files onto the HDD.

Actual
 * TODO : I will continue filling this field tomorrow.
 * TODO : I will also update/enrich previous logs


 * Enriching the Acoustic Model Improvements page [] and reviewing the overall process to gain a better understanding and to identify improvements
 * Running an experiment; encountered a Broken pipe issue; I will try again at school. My (personal) root experiment directory is 0253; created f5h01, f5h02, f5hr03 base directories to test the procedure.
 * Meeting with Dr. Jonas @3pm

June 19th, 2014 The Avengers' finest experiment documented in the report:
Size of train set: 125+ hr | Size of test set: 6.2 hr
Time to run decode: 32 density 5.06 hr; 64 density 13.1 hr
Real time factor of decode: 32 density 1.3; 64 density 2.12
WER: 48.67%
Corpus used: 3170 (/mnt/main/corpus/switchboard/3170)
 * space issue solved.
 * received directions from Dr. Jonas about the experiments; trying to understand the parameters used in the 0251/007 and 0251/012 experiments
 * instead of using last semester's scripts, utilize the previous summer semester's sphinx tools
 * the objective is to replicate the best results

June 20th, 2014
 * continue working on the experiments, scrutinizing them (0251/007;0251/012) to replicate their results
 * Reviewing the parameters, logs, files used in this experiments

June 21st, 2014 bw: Baum-Welch iteration; norm: normalize; etc.
 * Reviewing the log files and trying to use the parameters to generate the same results. Trying to get them to work.
 * plan to generate log files by using the parameters found under the log files, and compare results with the replicated sub-tasks.
 * successfully ran sub-tasks using parameters found under the log files. Getting familiar with the executables found in the log files (bw; norm; cp_parm; init_gau; mk_mdef_gen; mk_flat; init_mixw; mk_mdef_gen; bldtree; make_quests; mk_mdef_gen; prunetree; tiestate; inc_comp; sphinx3_decode; ..etc)
 * found a good resource about the Sphinx toolkit; skimming it. [sphinx] The explanations of the executables and their parameters are documented on this website.

Week Ending June 25th, 2014
June 23rd, 2014
 * replicating experiment /0251/012/
 * running the train with the same parameters
 * figured out the 3170 corpus. I talked to David about the 3170 corpus and learned that it was created around the last weeks of the spring semester experiment contest between the Avengers and Justice teams, by Colby and David. The main objective was to have a clean, full-length dataset for their experiments.
 * another thing I learned from David is that, since the scripts were updated, it is no longer necessary to use the [.dic, .filler, .phone, etc.] files under the train/info folders of each corpus to run the training. They updated the scripts to use the files differently. I will check it later.
 * I started a new experiment to replicate /0251/012/; it is under the folder /0253/T12/. It seems it will take a long time. I used exactly the same parameters for the train, and the decode as well. Furthermore, I am watching the folder structure.
 * I reviewed the prepareexperiment script. It is a great starting point for figuring out the overall mechanism.
 * LM replicated
 * Training replicated
 * Waiting to finish the decode
 * I also started another experiment to test if it will be successful. /0253/M01 which stands for Mini Corpus number 1. It is running on Asterix.

June 24th, 2014 Although the parameters are the same, a difference is identified in the logs:
/0251/012 Current Overall Likelihood Per Frame = -8.06946342376317
/0253/A12 Current Overall Likelihood Per Frame = -8.06958701002461
 * The decode was not successful. Reviewing the log files under the log directory, and the html file under the base directory of the experiment (/0253/T12).
 * It is noted that the decode was terminated with the message "Bad feature type argument."
 * I am comparing the log files with the 0251/012 logs. The difference starts at Phase 2: the original experiment had normalization in phase 3; mine does not. I will investigate more.
 * Started another experiment with the same parameters.
 * I am following the logs side by side (.html files under each base experiment folder, and log files under the logs directory).

June 25th, 2014 The results are parallel.
 * 0253/C12 running on MAJESTIX (train)
 * 0253/A12 running on CAESAR (train)
 * 0253/B12 running on ASTERIX (decode)
 * created the LM (language model); the results are the same, so it is OK.
 * reading about pruning. After decoding, I will prune and run the decode 12 times, as they did on the baseline. Working on the log files generated under logdir.
 * experiments are still running. I am checking each step and comparing the results with the baseline (/0251/012/).
 * Yesterday I also started an experiment with an identical configuration. It is on MAJESTIX, the server the Avengers ran their finest experiment on.
 * A delta in the "Likelihood Per Frame" value and in the number of "Errors and Warnings" is observed throughout the experiments. As an example, I am copying the following from the log files.

This is on page 37 of the log file: MODULE 45, Normalization for iteration 3.

0251/012 (Baseline, ran on MAJESTIX)
Current Overall Likelihood Per Frame = 6.77582517416873
Convergence Ratio = 0.0387806164293251
242 ERROR messages and 0 WARNING
276 ERROR messages and 1 WARNING

0253/C12 running on MAJESTIX (train)
Current Overall Likelihood Per Frame = 6.77426608919931
Convergence Ratio = 0.038835669173824
242 ERROR messages and 0 WARNING
276 ERROR messages and 1 WARNING

0253/A12 running on CAESAR (train)
Current Overall Likelihood Per Frame = 6.77557659995219
Convergence Ratio = 0.0387940635415625
246 ERROR messages and 1 WARNING
280 ERROR messages and 0 WARNING

SENTENCE ERROR: 95.5% (147/154) WORD ERROR RATE: 29.2% (576/1970)
 * The decode on ASTERIX produced the same result as the initial, unpruned decode.
 * will have the weekly meeting with Dr. Jonas and the team at 3pm.


 * I discussed the results with Dr. Jonas, and he asked why the best result is documented as 48.67%.

These are last semester's results: TOTAL Words: 1970 Correct: 1686 Errors: 576 TOTAL Percent correct = 85.58% Error = 29.24% Accuracy = 70.76% TOTAL Insertions: 292 Deletions: 57 Substitutions: 227

result_64.default TOTAL Words: 1970 Correct: 1710 Errors: 552 TOTAL Percent correct = 86.80% Error = 28.02% Accuracy = 71.98% TOTAL Insertions: 292 Deletions: 59 Substitutions: 201

result_64.pruned01 TOTAL Words: 1970 Correct: 1711 Errors: 551 TOTAL Percent correct = 86.85% Error = 27.97% Accuracy = 72.03% TOTAL Insertions: 292 Deletions: 56 Substitutions: 203

result_64.pruned02 TOTAL Words: 1970 Correct: 1703 Errors: 558 TOTAL Percent correct = 86.45% Error = 28.32% Accuracy = 71.68% TOTAL Insertions: 291 Deletions: 58 Substitutions: 209

result_64.pruned03 TOTAL Words: 1970 Correct: 1706 Errors: 555 TOTAL Percent correct = 86.60% Error = 28.17% Accuracy = 71.83% TOTAL Insertions: 291 Deletions: 57 Substitutions: 207

result_64.pruned04 TOTAL Words: 1970 Correct: 1703 Errors: 560 TOTAL Percent correct = 86.45% Error = 28.43% Accuracy = 71.57% TOTAL Insertions: 293 Deletions: 58 Substitutions: 209

result_64.pruned05 TOTAL Words: 1970 Correct: 1705 Errors: 557 TOTAL Percent correct = 86.55% Error = 28.27% Accuracy = 71.73% TOTAL Insertions: 292 Deletions: 56 Substitutions: 209

result_64.pruned06 TOTAL Words: 1970 Correct: 1704 Errors: 559 TOTAL Percent correct = 86.50% Error = 28.38% Accuracy = 71.62% TOTAL Insertions: 293 Deletions: 56 Substitutions: 210

result_64.pruned07 TOTAL Words: 1970 Correct: 1704 Errors: 556 TOTAL Percent correct = 86.50% Error = 28.22% Accuracy = 71.78% TOTAL Insertions: 290 Deletions: 56 Substitutions: 210

result_64.pruned08 TOTAL Words: 1970 Correct: 1698 Errors: 559 TOTAL Percent correct = 86.19% Error = 28.38% Accuracy = 71.62% TOTAL Insertions: 287 Deletions: 60 Substitutions: 212

result_64.pruned09 TOTAL Words: 1970 Correct: 1695 Errors: 564 TOTAL Percent correct = 86.04% Error = 28.63% Accuracy = 71.37% TOTAL Insertions: 289 Deletions: 60 Substitutions: 215

result_64.pruned10 TOTAL Words: 1970 Correct: 1699 Errors: 558 TOTAL Percent correct = 86.24% Error = 28.32% Accuracy = 71.68% TOTAL Insertions: 287 Deletions: 59 Substitutions: 212

result_64.pruned11 TOTAL Words: 1970 Correct: 1700 Errors: 558 TOTAL Percent correct = 86.29% Error = 28.32% Accuracy = 71.68% TOTAL Insertions: 288 Deletions: 57 Substitutions: 213

result_64.pruned12 TOTAL Words: 48386 Correct: 38877 Errors: 19770 TOTAL Percent correct = 80.35% Error = 40.86% Accuracy = 59.14% TOTAL Insertions: 10261 Deletions: 2237 Substitutions: 7272

result_64.pruned13 This is the best result reported last semester TOTAL Words: 48386 Correct: 35528 Errors: 23550 TOTAL Percent correct = 73.43% Error = 48.67% Accuracy = 51.33% TOTAL Insertions: 10692 Deletions: 2673 Substitutions: 10185

June 26th, 2014
 * comparing the results of the two new trainings with the baseline. They are not identical, but very close.
 * I wonder why the Avengers haven't used the prune12 result. I will contact Colby and David, and ask Prof. Jonas' opinion as well.
 * You can also download the results of the trainings using the following command. Be aware that the last parameter is a local address, so you should specify it or create an identical directory.

scp -r ea2003@caesar.unh.edu:"/mnt/main/Exp/0251/012/012.html /mnt/main/Exp/0253/A12/A12.html /mnt/main/Exp/0253/B12/B12.html /mnt/main/Exp/0253/C12/C12.html" ~/development/speech/results

June 27th, 2014
 * Continuing work on trainings: I replicated successfully, but there is a slight difference between each run. I wonder why there is a delta between the results, so I decided to run experiments with the same parameters on different machines. I started the D12, E12, and F12 trainings, which will yield 5+ replicated trainings, each on a different machine.
 * Trainings accomplished. I cannot achieve a perfect replication and am trying to understand why. The delta is at the 0.000x level, and the number of warnings and errors also differs with each run.
 * reading HMM and sphinx3 documents to find out why. I started reviewing the parameters, logs, and files used in these experiments, and successfully used them with the sphinx3 executables.
 * I started breaking down the perl scripts and working directly with the sphinx3 commands. They are working. I think that if I can understand the parameters' meanings, it will be easier to decrease the error and warning counts, which in turn will yield better acoustic model trainings.
 * Language models are replicated successfully.
 * I am also scrutinizing the Avengers' finest experiments. I identified a better result under the experiments folder and am trying to figure out why it was not turned in by the team. I will contact the team members.

Week Ending July 2nd, 2014
June 30th, 2014
 * I am traveling to Istanbul this week and will continue my study remotely. I am preparing and publishing the weekly status report a little late because of my move; sorry for the delay.
 * Dr. Jonas described the idea behind pruning the tree last week after class in Durham. After this conversation, and another previous one, I realized that it is not the best course of action to run dozens of experiments; Dr. Jonas called this approach "brute force". Instead, understanding the parameters and the mechanism behind them will be wiser. I am reviewing the course slides and skimming the "Continuous Speech Recognition by Statistical Methods" paper as well, so that I will recognize the parameters and their effects.

July 1st, 2014

July 2nd, 2014

July 3rd, 2014

July 4th, 2014

July 5th, 2014
 * After meeting with Dr. Jonas, I started focusing on the files used as input for the decode. Comparing the .mfc files, I am planning to yield the same result as the Avengers did in their last experiments.

012 19770/48386 WER 40.9%
012 23549/48386 WER 48.7%
012 576/1970 WER 29.2%

my similar results are as follows:
B12 576/1970 WER 29.2%
C12 645/1970 WER 32.8%

I couldn't increase the number of words processed, so I decided to run 2 decodes again using the parameters and file ids used in the following experiment (/mnt/main/Exp/0251/007)

The expected result is 007 4968/10951 WER 45.4%

July 6th, 2014
 * I worked on the /0253/G12, /0253/C12, and /0253/A12 experiments; started the decode for the full file list
 * identified a huge log file under E12 (E12.html)
 * restarted the decode for F12 from the beginning

Week Ending July 9th, 2014
July 7th, 2014
 * sphinx3_decode is still running (I started it last night) on Asterix, Majestix, Automatix and Caesar for the following experiments: 0253/A12, /F12, /C12, /G12
 * Training is running for /E12

July 8th, 2014

July 9th, 2014 Working on the following experiments, with the following findings:

/0253/A12
 * decoding started on 06 July and is still running
 * 101,011 segments are decoding in 2 parts, ~50,500 each
 * checked the fileids and transcript files; they are identical to the original experiment

/0253/B12
 * the previous decode resulted in 95.5% Sentence Error (147/154) | 29.2% WER under this folder
 * fileids and transcript files are updated with the full list of 101,011 segments
 * decoding ended with a fatal error: Bad feature type argument - feat.c 52

/0253/C12
 * fileids and transcript files are updated with the full list of 101,011 segments
 * the previous decode resulted in 94.2% (145/154) Sentence Error (635/1970) | 32.3% WER
 * decoding in progress

/0253/D12
 * decoding in progress

/0253/E12
 * the log file is huge!
 * the log file contains several notifications stating that "the word is not in the dictionary"
 * reviewing

/0253/F12
 * started a small test with 5 sentences, but encountered an error during decoding: "Bad feature type argument"
 * result: Sentence Error 100% (5/5), WER 108% (120/111)

/0253/G12
 * decoding started on July 06th, in progress

/0253/H07
 * training started, using the configuration file under the 0251/007 folder

July 10th, 2014

July 11th, 2014

Week Ending July 16th, 2014
July 14th, 2014

July 15th, 2014

July 16th, 2014

July 17th, 2014
 * stopped the decoding processes, as they were trying to decode the 125 hr data
 * retrained the experiment under the /0253/012 folder, using parameters identical to the 0251/012 experiment
 * created fileid lists to guarantee that the files are also the same for decoding
 * created another decode under /0253/C12

July 18th, 2014
 * I realized that my understanding of the process had flaws, since I tried to decode all 125 hr of data. I therefore decided to search for more readings and encountered a descriptive tutorial on a website by the Robust Group, CMU's speech recognition group. They provide valuable material about speech recognition and an open-source tutorial about the recognition process. I decided to replicate what they describe; thus I will gain a better understanding and hands-on experience not only using the tool, but also installing it. Please find the tutorial here [Robust Group Tutorial], and their website [here].

SENTENCE ERROR: 93.2% (4667/5008) WORD ERROR RATE: 40.9% (19789/48386)
 * the experiment resulted in lower WER and SE rates.
 * Result : /mnt/main/Exp/0253/012/

Last year's finest score, again:
Size of train set: 125+ hr | Size of test set: 6.2 hr
Time to run decode: 32 density 5.06 hr; 64 density 13.1 hr
Real time factor of decode: 32 density 1.3; 64 density 2.12
WER: 32 density 48.67%; 64 density 40.55%
Corpus used: 3170 (/mnt/main/corpus/switchboard/3170)

July 19th, 2014
 * Reviewed logs and checked whether the parameters are identical.
 * In one of the experiments, I tried to split the decoding process by manipulating the $DEC_CFG_NPART parameter in sphinx_decode.cfg. I used 1, 2, even 100 parts. The result is the same. The difference, I assume, is that the number of parallel processes increases when the decode is split into parts: while watching the processes with the $ top command to gain a better understanding, I saw more sphinx3_decode processes (8-10) when I increased this parameter. According to the Robust Group's tutorial, the $CFG_QUEUE_TYPE parameter should be set to "Queue::POSIX" to use multiple CPUs. I will play with these parameters later to gain more efficiency (in terms of time).
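A sketch of the two settings discussed above, written to a throwaway demo file (the real settings live in etc/sphinx_decode.cfg under the experiment's base directory; the value 4 is just an example):

```shell
# Demo config file standing in for the experiment's etc/sphinx_decode.cfg.
mkdir -p /tmp/npart_demo
cat > /tmp/npart_demo/sphinx_decode.cfg <<'EOF'
# Split the decode control file into 4 parts, run as separate processes:
$DEC_CFG_NPART = 4;
# Per the Robust Group tutorial, POSIX queueing enables multiple CPUs:
$CFG_QUEUE_TYPE = "Queue::POSIX";
EOF
# Count the parameter lines we just wrote:
grep -c '^\$' /tmp/npart_demo/sphinx_decode.cfg
```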

/mnt/main/Exp/0253/012/logdir/decode/012-100-100.log
sphinx3_decode -senmgau .cont. -hmm /mnt/main/Exp/0253/012/model_parameters/012.cd_cont_8000 -lw 11 -feat -beam 1e-50 -pbeam 1e-50 -wbeam 1e-30 -maxhmmpf 2000 -ci_pbeam 1e-7 -maxcdsenpf 2750 -maxwpf 10 -dict /mnt/main/Exp/0253/012/etc/012.dic -fdict /mnt/main/Exp/0253/012/etc/012.filler -lm /mnt/main/Exp/0253/012/LM/tmp.arpa -wip 0.7 -ctl /mnt/main/Exp/0253/012/etc/012_decode.fileids -ctloffset 4957 -ctlcount 51 -cepdir /mnt/main/Exp/0253/012/feat -cepext .mfc -hyp /mnt/main/Exp/0253/012/result/012-100-100.match -agc none -varnorm no -cmn current
 * Here are example parameters I used to check whether the decode and train configuration files under different base experiment folders use the same parameters; namely, 0253/012 and 0251/012/ are set to decode the same amount of data. I check not only the parameters in the sphinx_train.cfg and sphinx_decode.cfg files under the /etc directory, but also follow the log files under the /logdir folder.

/mnt/main/Exp/0251/012/logdir/decode_64.pruned12/012-1-2.log
sphinx3_decode -senmgau .cont. -hmm /mnt/main/Exp/0251/012/model_parameters/012.cd_cont_8000 -lw 11 -feat -beam 1e-50 -pbeam 1e-50 -wbeam 1e-30 -maxhmmpf 2000 -ci_pbeam 1e-7 -maxcdsenpf 2750 -maxwpf 10 -dict /mnt/main/Exp/0251/012/etc/012.dic -fdict /mnt/main/Exp/0251/012/etc/012.filler -lm /mnt/main/Exp/0251/012/LM/tmp.arpa -wip 0.7 -ctl /mnt/main/Exp/0251/012/etc/012_test.fileids -ctloffset 0 -ctlcount 2504 -cepdir /mnt/main/Exp/0251/012/feat -cepext .mfc -hyp /mnt/main/Exp/0251/012/result/012-1-2.match -agc none -varnorm no -cmn current

Week Ending July 23rd 2014
July 21st, 2014
 * compared the experiment results started last week.
 * continue reading the "Robust Group" documents
 * here are some experiment results which I want to share. I also added the data used.

/mnt/main/Exp/0253/A12 (test3_fileids) SENTENCE ERROR: 95.5% (147/154) WORD ERROR RATE: 29.2% (576/1970)

I created a fileids list and transcript files from the files used by the training process under this basedir. This way, I planned to check whether there is a correlation between the WER and the size of the decode files. /mnt/main/Exp/0253/A12 (mini_fileids) SENTENCE ERROR: 94.3% (49/53) WORD ERROR RATE: 46.5% (241/520) We cannot infer a correlation, since the WER increased as the dataset decreased; at least we cannot positively correlate them at this time, and it would be better to re-examine this idea.
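The WER figures quoted throughout these results are simply errors divided by total reference words; for instance, the 576/1970 result works out as follows (a quick arithmetic check, not part of the scoring pipeline):

```shell
# WER = (insertions + deletions + substitutions) / total reference words
awk 'BEGIN { errors = 576; words = 1970; printf "WER %.1f%%\n", 100 * errors / words }'
# prints: WER 29.2%
```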

/mnt/main/Exp/0253/C12 SENTENCE ERROR: 93.4% (4677/5008) WORD ERROR RATE: 43.0% (20805/48386)

/mnt/main/Exp/0253/D12 decode cancelled.

/mnt/main/Exp/0253/F12 SENTENCE ERROR: 96.8% (149/154) WORD ERROR RATE: 32.8% (645/1970)

/mnt/main/Exp/0253/F12 decode cancelled.

The following experiment produced nearly identical results.

/mnt/main/Exp/0253/012 SENTENCE ERROR: 93.2% (4667/5008) WORD ERROR RATE: 40.9% (19789/48386)

From last log files /mnt/main/Exp/0251/012 SENTENCE ERROR: 93.2% (4666/5008) WORD ERROR RATE: 40.9% (19770/48386)

From last semester's report:
Size of train set: 125+ hr | Size of test set: 6.2 hr
Time to run decode: 32 density 5.06 hr; 64 density 13.1 hr
Real time factor of decode: 32 density 1.3; 64 density 2.12
WER: 32 density 48.67%; 64 density 40.55%
Corpus used: 3170 (/mnt/main/corpus/switchboard/3170)

July 22nd, 2014
 * comparing experiment folders and file sizes.

du /mnt/main/Exp/0251/012/feat/ 2495900
du /mnt/main/Exp/0251/012/wav/ 408568

du /mnt/main/Exp/0253/012/feat/ 2497836
du /mnt/main/Exp/0253/012/wav/ 408568

du /mnt/main/Exp/0253/A12/feat/ 2495908
du /mnt/main/Exp/0253/A12/wav/ 36071752

du /mnt/main/Exp/0253/B12/feat/ 2497860
du /mnt/main/Exp/0253/B12/wav/ 408568

du /mnt/main/Exp/0253/C12/feat/ 2495896
du /mnt/main/Exp/0253/C12/wav/ 36071752

du /mnt/main/Exp/0253/D12/feat/ 2495948
du /mnt/main/Exp/0253/D12/wav/ 36071752

du /mnt/main/Exp/0253/E12/feat/ 2495916
du /mnt/main/Exp/0253/E12/wav/ 36071752

du /mnt/main/Exp/0253/F12/feat/ 2495904
du /mnt/main/Exp/0253/F12/wav/ 36071752

du /mnt/main/Exp/0253/G12/feat/ 2495904
du /mnt/main/Exp/0253/G12/wav/ 36071752

du /mnt/main/Exp/0253/H07/feat/ 2495896
du /mnt/main/Exp/0253/H07/wav/ 36071752

du /mnt/main/Exp/0253/M01/feat/ 19168
du /mnt/main/Exp/0253/M01/wav/ 36071752

July 23rd, 2014
 * initiated 0253/M01 experiment, 36071752
 * reviewing sclite, the scoring tool for speech recognition, and the meaning of its inputs, outputs, and parameters from the following website: NIST Sclite Scoring Tool
 * during my readings on the Robust Group website, I encountered the following parameter to automate the scoring process. I tried it, and it works.

If we modify the $DEC_CFG_ALIGN parameter from "builtin" to "sclite", it automatically aligns and calculates the score. The parameter can be found in the etc/sphinx_decode.cfg file.
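The one-line change described above could be applied with sed; a sketch against a throwaway copy of the config (the real file sits under the experiment's etc/ directory):

```shell
# Demo file standing in for etc/sphinx_decode.cfg.
mkdir -p /tmp/align_demo
printf '%s\n' '$DEC_CFG_ALIGN = "builtin";' > /tmp/align_demo/sphinx_decode.cfg
# Switch the aligner from the built-in scorer to sclite:
sed -i 's/"builtin"/"sclite"/' /tmp/align_demo/sphinx_decode.cfg
cat /tmp/align_demo/sphinx_decode.cfg
# prints: $DEC_CFG_ALIGN = "sclite";
```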

July 24th, 2014

During the meeting yesterday we identified 3 main tasks for the next weeks (125 hr training and 32 density will be used):
 * investigate the errors and warnings in the training and decoding processes, to improve them
 * run another experiment decoding a subset of the train data, which is called test on train
 * run another experiment decoding data that is not a subset of the train data, namely test on development

This is the result of the 0253/A12 decode. I started it yesterday with the parameters in decode3_fileids: density 32, and the same file ids as the 64-density experiment used.

SENTENCE ERROR: 92.4% (902/977) WORD ERROR RATE: 42.8% (3814/8912)


 * I started another experiment to check with smaller decode data. I am using /A12_mini_fileids and the related transcript files.
 * added explanation of files and folders under Acoustic Model Improvements term notes [Acoustic Model Improvements]

SENTENCE ERROR: 98.3% (57/58) WORD ERROR RATE: 46.2% (307/665)
 * This is the result of the mini experiment I started in the 0253/A12 folder (/A12_mini_fileids)

July 25th, 2014 This is the result of adding DEC_CFG_LANGUAGEWEIGHT = "6"; # it was 11 alongside $DEC_CFG_GAUSSIANS = 8;
 * Yesterday I changed/added some more parameters to check whether the result is affected. Here are the results.

SENTENCE ERROR: 100.0% (22/22) WORD ERROR RATE: 53.8% (119/223)

This is another result, from adding the DEC_CFG_LANGUAGEWEIGHT = "13"; and $DEC_CFG_GAUSSIANS = 8; parameters. The result did not change. SENTENCE ERROR: 100.0% (22/22) WORD ERROR RATE: 53.8% (119/223)

SENTENCE ERROR: 98.3% (57/58) WORD ERROR RATE: 46.2% (307/665)
 * I will remove the line and run it with the original values again. Done. Results are the same.


 * I am reading about creating Language Models and Acoustic Models on CMU website.
 * http://cmusphinx.sourceforge.net/wiki/tutoriallm
 * http://cmusphinx.sourceforge.net/wiki/tutorialam

Here are the commands that come with cmuclmtk: binlm2arpa, evallm, idngram2lm, idngram2stats, lm_combine, lm_interpolate, mergeidngram, ngram2mgram, text2idngram, text2wfreq, text2wngram, wfreq2vocab, wngram2idngram
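The usual chain for building an ARPA language model with these tools is text2wfreq → wfreq2vocab → text2idngram → idngram2lm. A sketch with a toy corpus; the file names are placeholders, and the toolkit steps run only if cmuclmtk is on the PATH:

```shell
# Toy corpus (placeholder text).
printf 'HELLO WORLD\nHELLO SPEECH\n' > /tmp/corpus.txt

if command -v text2wfreq >/dev/null 2>&1; then
  text2wfreq  < /tmp/corpus.txt   > /tmp/corpus.wfreq   # word frequency counts
  wfreq2vocab < /tmp/corpus.wfreq > /tmp/corpus.vocab   # vocabulary list
  text2idngram -vocab /tmp/corpus.vocab -idngram /tmp/corpus.idngram < /tmp/corpus.txt
  idngram2lm -idngram /tmp/corpus.idngram -vocab /tmp/corpus.vocab -arpa /tmp/corpus.arpa
fi
```

The resulting /tmp/corpus.arpa would be the kind of file passed to sphinx3_decode via -lm, like the LM/tmp.arpa files in the logs above.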


 * I am reading about and testing these commands to gain a better understanding of the CMU toolkit.
 * I created a mini decode list to test whether the decode runs through successfully. It can also be used to introduce the system to newcomers in upcoming semesters.
 * I continue reading on the

July 26th, 2014
 * I realized that PocketSphinx works great, and better than sphinx3, on the local machine. I compiled and installed sphinxbase on my Mac, and it works great with PocketSphinx. On the other hand, the sphinx3 compile encountered errors. I found solutions for most of the ./configure and make issues, and finally compiled the code on Mac OS X.
 * The first issue I encountered was the sphinxbase dependency: the compiler couldn't find the sphinxbase directory. I solved it using the parameters described in ./configure --help.
 * The next issue was choosing between 64-bit and 32-bit for the sphinx3 compilation. When I downloaded it with the svn command or in .tar.gz format from the official website, I had to run ./autogen.sh to create the ./configure script. It was not possible to continue because of dependencies (autoconf, autogen, etc.), which I then solved by downloading, compiling, and installing them. When I continued with ./configure, on my first try I couldn't compile for 64-bit, but I then successfully compiled sphinx3. Some tests were not successful, but it is training/decoding.
 * Since sphinx3 is not the only decoder, it might be wise to try PocketSphinx as well, at least on the local machine, to catch the warnings and errors. I will do that next.

Week Ending July 30th, 2014
July 28th, 2014


 * I am scrutinizing the CMU toolkit by installing it locally, because I feel that if I can catch the errors or warnings from the beginning, at least in the early stage of training, it will help solve the errors and warnings identified during training and decoding, and hopefully improve the recognition process.
 * I also read and played with the utility scripts provided with the CMU dictionary. I read the phonemes and played with the phoneme dictionary. I also read a little about the [IPA]--International Phonetic Alphabet.
 * I am also enriching my notes under semester notes

July 29th, 2014

July 30th, 2014 The 3 main tasks for the upcoming weeks are as follows (for the next experiments, training will be with the 125 hr data, and density will be 32):
 * As Dr. Jonas announced last week, and reminded us today, we will not meet today. As a reminder to myself, I am re-writing the notes we identified during last week's meeting.
 * investigate the errors and warnings in the training and decoding processes, to improve them
 * run another experiment decoding a subset of the train data, which is called test on train
 * run another experiment decoding data that is not a subset of the train data, namely test on development

July 31st, 2014

August 1st, 2014

August 2nd, 2014

Week Ending August 6th, 2014
August 4th, 2014
SENTENCE ERROR: 93.2% (4667/5008) WORD ERROR RATE: 40.9% (19789/48386)
SENTENCE ERROR: 93.4% (4677/5008) WORD ERROR RATE: 43.0% (20805/48386)
 * my best result is under the following experiment folder /0253/012
 * my second best result is under /0253/C12
 * Reviewing log files and trying to understand the flaws in this experiment. Identified 1000+ errors during decoding.

SENTENCE ERROR: 95.5% (147/154) WORD ERROR RATE: 29.2% (576/1970)
 * other results with smaller files are under the 0253/012 folder

August 5th, 2014

August 6th, 2014

August 7th, 2014

August 8th, 2014

August 9th, 2014

Week Ending August 13th, 2014
August 11th, 2014
SENTENCE ERROR: 56.2% (73/130)  WORD ERROR RATE: 16.5% (128/773)
SENTENCE ERROR: 56.9% (73/130)  WORD ERROR RATE: 17.9% (138/773)
 * working on tutorial experiments to decrease the WER
 * downloaded the tutorial and trained with a small data set. My objective is to check whether or not our installation triggers the errors during the training process.
 * In order to compare, and to use as a benchmark in my diagnostics, I used an audio set which I downloaded from the CMU website. I generated the feature files on my local machine and used the language model provided with the tutorial. Here is the result of this experiment. (/Exp/0253/an40)
 * Furthermore, I uploaded and ran the same experiment, with same inputs and parameters on sphinx and obtained the following results:
 * I tried to use these parameters to decrease my best WER result, but have not been successful yet. The experiment stopped during the pruning stage with the following log: Unable to open /mnt/main/Exp/0253/an4/trees/an4.unpruned/OY-0.dtree for reading; No such file or directory (/Exp/0253/an40)

August 12th, 2014
$DEC_CFG_WORDPENALTY = "0.7";
$DEC_CFG_PBEAM = "1e-50";
$DEC_CFG_MAXHMMPF = "2000";
$DEC_CFG_CIPBEAM = "1e-7";
$DEC_CFG_MAXCDSENPF = "2750";
$DEC_CFG_MAXWPF = "10";
$CFG_STATESPERHMM = 3;
$CFG_SKIPSTATE = 'no';
 * I compared the configuration files under the best results and the an4 tutorial experiment. Here are some findings:
 * the following lines do not exist in the decode configuration of the tutorial experiments:
 * the following lines also differ between experiments--not the values but where they apply: they are 3 and 'no' respectively if the HMM type is continuous, and 5 and 'yes' if the type is semi-continuous under our best results. I will change these variables and experiment with them.
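Spotting these differences systematically is easiest with diff on the two config files. A sketch below; the file contents are hypothetical stand-ins, not the real experiment configs:

```shell
# Compare two hypothetical config excerpts and print only differing lines.
workdir=$(mktemp -d)
cat > "$workdir/best.cfg" <<'EOF'
$CFG_STATESPERHMM = 5;
$CFG_SKIPSTATE = 'yes';
$DEC_CFG_WORDPENALTY = "0.7";
EOF
cat > "$workdir/an4.cfg" <<'EOF'
$CFG_STATESPERHMM = 3;
$CFG_SKIPSTATE = 'no';
$DEC_CFG_WORDPENALTY = "0.7";
EOF
# Keep only the < and > lines, i.e. content unique to one file
delta=$(diff "$workdir/best.cfg" "$workdir/an4.cfg" | grep '^[<>]')
echo "$delta"
```

Lines that are identical in both configs (like the word penalty here) drop out, leaving only the variables worth experimenting with.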

August 13th, 2014
Here are my findings:
SENTENCE ERROR: 90.0% (18/20)  WORD ERROR RATE: 36.6% (77/213)
SENTENCE ERROR: 81.0% (162/200)  WORD ERROR RATE: 32.0% (712/2228)
SENTENCE ERROR: 85.0% (17/20)  WORD ERROR RATE: 25.4% (54/213)
SENTENCE ERROR: 79.5% (159/200)  WORD ERROR RATE: 32.0% (712/2228)
 * Yesterday I worked on smaller data sets to find out what triggers the errors, and I inherited an experiment from the tutorials. I generated a language model from the transcript files which I used for the training. Previously the LM was generated from the whole data under the 3170 folder. Furthermore, I tried the configuration parameters from my best result together with the files that come from the tutorials.
 * I also read about current corpus structure and its usage. There are some changes made each semester and I want to understand those changes. I found some useful information about corpus (audio and transcript files) under the following links: [corpus] [data] [switchboard].


 * I downloaded, configured, and made (but did not install) the current versions of sphinx3, sphinxbase, and SphinxTrain from their official websites. I linked my experiments and tried some experiment runs with those versions under my experiment folder in order to check whether there is a correlation between the environment configuration and the errors: I did not observe any difference.

 * test on train results: SENTENCE ERROR: 85.0% (17/20)  WORD ERROR RATE: 25.4% (54/213)
 * test on development results: SENTENCE ERROR: 100.0% (44/44)  WORD ERROR RATE: 87.7% (490/559)


 * Findings: instead of using a single LM for a corpus, i.e. 3170, it's better to generate an LM from each training set's transcripts and use this LM for decoding.
 * the .match file shows improvement opportunities: some files are perfectly recognized, some are nearly not recognized at all--I plan to listen to those audio files. I want to share some examples below; please also refer to the 000/results/000.align file for details.

id: (user-sw3189a-ms98-a-0013) Scores: (#C #S #D #I) 14 0 1 0
REF:  that was a great big thing AND we do that with his family about every
HYP:  that was a great big thing *** we do that with his family about every
Eval:

id: (user-sw3189a-ms98-a-0012) Scores: (#C #S #D #I) 37 1 2 1
REF:  IT was a great big lodge and so we had plenty of room and everything and and some of them live in that area so IT wasn't too hard for them to come in the day and **** LEAVE  but um
HYP:  ** was a great big lodge and so we had plenty of room and everything and and some of them live in that area so ** wasn't too hard for them to come in the day and YEAH UH-HUH but um
Eval: D

id: (user-sw3189a-ms98-a-0011) Scores: (#C #S #D #I) 10 4 4 0
REF:  OH   I     oh UH JEEZ I think over um three hundred WERE there it WAS QUITE a large
HYP:  WERE THERE oh ** **** * think over um three hundred **** there it WOW WOW   a large
Eval: S   S        D  D    D                             D             S   S

id: (user-sw3189b-ms98-a-0017) Scores: (#C #S #D #I) 22 0 3 0
REF:  NOW was this his entire family cousins aunts uncles things like that or his immediate family OH WOW oh wow how many people were there
HYP:  *** was this his entire family cousins aunts uncles things like that or his immediate family ** *** oh wow how many people were there
Eval: D                                                                                           D  D

id: (user-sw3189a-ms98-a-0017) Scores: (#C #S #D #I) 4 1 4 4
REF:  * **** yeah so WE HAVE A LITTLE shindig WITH them *** ****
HYP:  A YEAR yeah so ** **** * ****** shindig THE  them ARE THEY

id: (user-sw3189b-ms98-a-0028) Scores: (#C #S #D #I) 8 1 4 3
REF:  *** ** wow it really is AND YOU GET TOGETHER once a year *** wow UH
HYP:  HIP OH wow it really is *** *** *** ******** once a year WOW wow LITTLE
Eval: I  I                   D   D   D   D                    I       S
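The per-training-set LM idea from the findings above starts with stripping the utterance ids from that set's transcript, so only the bare text is fed to the LM toolkit. A sketch, assuming the usual trailing "(utterance-id)" transcript format; the sample file and its contents are made up:

```shell
# Strip trailing "(utterance-id)" markers from a transcript so the bare
# text can be fed to an LM toolkit. The sample transcript is hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/train.trans" <<'EOF'
that was a great big thing and we do that (sw3189a-ms98-a-0013)
it was a great big lodge (sw3189a-ms98-a-0012)
EOF
sed -e 's/ *([^)]*)$//' "$workdir/train.trans" > "$workdir/lm_input.txt"
cat "$workdir/lm_input.txt"
```

The cleaned lm_input.txt can then be passed to the LM toolkit of choice (e.g. the CMU-Cambridge toolkit's text2idngram/idngram2lm chain) to build a decode LM matched to the training data.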

August 14th, 2014
$ grep -r "ERROR" ./logdir/ | grep -o "sw.....-....-.-...." | sort | uniq > err.list
$ for LINES in `cat < err.list`; do
$     sed -ie "s/${LINES}.*$//g" test.fileids;
$     echo "${LINES}";
$ done
$ sort test.fileids | uniq > ready.fileids
 * I am working on the files which are logged as ERROR: I wrote a script to list those files. My plan is to eliminate these files and train again.
 * I also started another training with full data (3170) with the configuration parameters from my best mini test results.
 * I manually created symbolic links to the /wav and /feat folders in my 0253/C12 experiment to save space. If it works without problems, I will document the procedure in my logs.
 * I also wrote some scripts to remove those files from the .transcript and .fileids files

$ for LINES in `cat < err.list`; do
$     sed -ie "s/^.*${LINES}.*$//g" test.transcription;
$     echo $LINES;
$ done
$ sort test.transcription | uniq > ready.transcription
 * I plan to try some experiments without those files to check whether or not the errors are caused by the files. Secondly, I plan to collect those files (audio and transcripts) to analyze whether they hold clues or form a pattern of flaws for future investigations.
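An alternative to the sed loops above is a single grep pass: -v -F -f drops every line containing any id listed in err.list. Same list format as my scripts; the sample contents below are made up:

```shell
# Filter error-listed utterance ids out of a fileids list in one pass.
# The err.list and test.fileids contents here are hypothetical samples.
workdir=$(mktemp -d)
printf 'sw3189a-ms98-a-0013\n' > "$workdir/err.list"
printf 'sw3189a-ms98-a-0013\nsw3189b-ms98-a-0017\n' > "$workdir/test.fileids"
# -F: fixed strings, -f: patterns from file, -v: keep non-matching lines
grep -v -F -f "$workdir/err.list" "$workdir/test.fileids" \
  | sort | uniq > "$workdir/ready.fileids"
cat "$workdir/ready.fileids"
```

This also avoids the empty lines the sed substitution leaves behind, so the sort | uniq cleanup step keeps only real entries.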

August 15th, 2014

August 16th, 2014

Week Ending August 20th, 2014
August 18th, 2014
 * I am working on the corpus, and the scripts which create the corpus from the dist (the Switchboard CD) files.
 * We have the following data: 23 CDs under /mnt/main/corpus/dist directory, Total amount of data is 256.395 hrs
 * I am creating the corpus from scratch to ensure that nothing goes missing during the transform process
 * sox is used in the scripts for audio manipulation--trimming, format conversion, etc.
 * transcript files under the corpus include timing information for the CDs; this info is used to trim the audio files for alignment.
 * some utility scripts, such as corpusSize, genTrans, monoGen, etc. have been developed over the semesters for special purposes. I am reviewing those scripts and compiling/gathering/refining a set for future use
 * Last semester, a new corpus was created by the contributors. They named it 3170 because they started from file number 3170. Last semester's training used the .sph files from the second half of Disc 14 to the end of Disc 22, since it used the 3170 corpus. I calculated the total number of hours under this folder and got 125.3 hrs of training data (using corpusSize0.pl; 102.4 hrs if we do not count the overlaps, using corpusSize.pl).
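These hour totals can be cross-checked by summing per-file durations: soxi -D prints a file's duration in seconds. The durations file below uses hypothetical values so the arithmetic is visible; in practice it would be generated from the .sph files:

```shell
# Sum per-utterance durations (in seconds) into hours.
# In practice: for f in *.sph; do soxi -D "$f"; done > durations.txt
workdir=$(mktemp -d)
printf '300.5\n120.25\n59.25\n' > "$workdir/durations.txt"   # hypothetical values
hours=$(awk '{s += $1} END {printf "%.3f", s / 3600}' "$workdir/durations.txt")
echo "$hours hrs"
```

The three sample durations add up to 480 seconds, i.e. 0.133 hrs; running the same awk over a real durations list gives the corpus total.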

August 19th, 2014

August 20th, 2014
 * I plan to create a baseline folder /mnt/main/corpus/BL00 under main corpus folder. I will ask Dr. Jonas for permission.
 * I copied all original .sph files from the discs into a separate folder and started regenerating the trimmed utterances from scratch. During this process I recreated a corpusGenerate.pl script which automates the process.
 * At first glance, there seem to be improvement opportunities in the training data. First, some audio files have low volume--we can normalize them up to a target level, and I created a script to make this possible. Second, some utterances are not perfectly trimmed: their duration is not set to a sufficient length, so they end unexpectedly and abruptly. To remedy this, I think we can increase the duration of the utterances by a small fixed amount or by a percentage. I will also try this.
 * I also checked the file list provided with the Switchboard CDs and the files match; in other words, the data set is complete.
 * I searched for and found the following paper about acoustic model normalization; it might enlighten future improvements, so I want to share it here. [link]
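Both audio fixes can be done with sox in one pass: --norm normalizes the level and the pad effect appends silence to the end. The target level (-3 dB) and pad length (0.2 s) below are my assumptions, not agreed project settings:

```shell
# Normalize an utterance's level and pad its ending with sox.
# The -3 dB target and 0.2 s pad are assumptions, not project settings.
fix_utterance() {
  # pad 0 0.2 = add 0 s at the start, 0.2 s of silence at the end
  sox --norm=-3 "$1" "$2" pad 0 0.2
}
if command -v sox >/dev/null 2>&1; then
  status="sox available: fix_utterance in.wav out.wav"
else
  status="sox not installed"
fi
echo "$status"
```

Wrapping it in a function makes it easy to loop over a whole utterance folder once the parameters are settled.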

August 21st, 2014
comm -3 full_transcript.text train.trans > delta.txt
03/26/2013: Three previously missing files were added to this release. (sw02289.sph, sw04361.sph, sw04379.sph) File tables and documentation were updated to reflect the addition of these files. Please contact ldc@ldc.upenn.edu to obtain this update. All copies of this corpora obtained after the above date already include this update.
 * I compiled the .sph files and some utility scripts that I generated under the BL00 folder
 * I found two transcript files for the full set, but they differ; the differences between the files are documented in delta.txt, which I will use to create a subset trimming to check which one to use. The following command can be used to find the delta between two files (-3 suppresses the common lines):
 * the following 3 files are missing within the CDs: sw02289.sph sw04361.sph sw04379.sph  Therefore, I eliminated those files from the corpus and transcripts.
 * I found some information about those files, and how to obtain them. They are said to be provided on demand. [link]

September
Sep 15th, 2014
/mnt/main/corpus/noaa
/mnt/main/corpus/dist/Switchboard/flat
/mnt/main/corpus/switchboard (working)
/mnt/main/corpus/switchboard_old (not working)
/mnt/main/corpus/switchboard/125hr_3170 (3170 means the file list starts with sw03170.sph; this name was given by last semester's winning team) removed the audio files under this directory, but created a symbolic link to the original files (ln -s TARGET LINK_NAME)
/mnt/main/corpus/switchboard/125hr
perl /mnt/main/corpus/BL00/scripts/corpusConsolidate.pl
/mnt/main/corpus/BL00/scripts/file.lst
/mnt/main/corpus/dist/Switchboard/consolidated
rm *.wav | ls -1 *.sph | sed -e 's/\..*$//' | xargs -I {} sox {}.sph {}.wav
 * folder created for GNU Radio decoding
 * using the dist files under the following folder
 * separated the working corpus files from the non-working ones
 * renamed the best-result corpus with an appropriate folder name
 * created another 125hr corpus; removed the error files and took enough audio files from the following disks
 * wrote a script and consolidated the distribution audio files under one folder
 * removed 10hr from the root folder
 * consolidated all Switchboard disk files under one folder to make the symbolic links
 * the following script converts the .sph files in a directory into .wav files
 * extracted sample transcript and audio files from the master transcripts for creating new corpora. Sent them to Marcel to generate the noaa corpus; also prepared the folders for them.
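The sph-to-wav one-liner works, but its leading `rm *.wav |` pipe does nothing useful, and the conversion reads more clearly as a loop. The sketch below dry-runs over dummy files by collecting the sox commands instead of executing them; drop the indirection to convert for real (sox must support the .sph format):

```shell
# Dry-run sketch of converting every .sph in a directory to .wav.
# Dummy files are created so the loop has something to iterate over.
workdir=$(mktemp -d)
cd "$workdir"
touch sw03170.sph sw03171.sph
cmds=""
for f in *.sph; do
  base=${f%.sph}                 # strip only the .sph extension
  cmds="${cmds}sox $f $base.wav
"
done
printf '%s' "$cmds"   # replace this with eval/direct sox calls to run for real
```

Stripping only the trailing .sph (rather than sed 's/\..*$//') also keeps file names with extra dots intact.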

Sep 16th, 2014
/mnt/main/corpus/switchboard/256hr/
ln -s /mnt/main/corpus/dist/Switchboard/consolidated conv
/mnt/main/corpus/switchboard/256hr/clean/audio
 * created utterances from scratch and put them in a new corpus. This is all the data from Switchboard, trimmed with the final transcript. One thing to remember is that the Switchboard disks have 3 audio files missing. I wrote to Dr. Jonas about how to obtain them, and hopefully we can get those files. The following corpus does not include those wav files or the related transcript entries--I eliminated them from the .trans file.
 * made the symbolic links to the audio files under dist folder
 * all the utterances and conversation files are located under the following folder

Sep 17th, 2014
grep -r "ERROR" ./logdir/ | grep -o "sw.....-....-.-...." | sort | uniq > err.list
cat ERROR.list | sort | uniq > UNIQUE_ERROR.list
comm -3 256hr.fileids UNIQUE_ERROR.list > RESULT.txt
 * searched for and found the files that had encountered errors during experiments, in each experiment run
 * made them distinct
 * extracted those files from the FULL file list of the Switchboard distribution
 * used the resulting file list to train again:
 * started 2 trainings under 0253/A125hr_3170
 * and 0253/A125hr_3170Clean (added more files from the last)

Sep 24th, 2014
 * after our meeting tonight I started 2 more experiments. Since we talked about a possible difference in the utterance files, I mapped one (C124hr_3170) to my best result's feats folder, and secondly I created another training (D124hr_3170) using the 3170 corpus but local feats. Both experiments use 3170124hr_cleaned.fileids--which I generated last week (best results minus the identified error files).