Speech:Exps 0282 012

Description
Author: Peter Ferro

Date: 3/23/16

Purpose: For testing makeTest.pl.

Details: I intend to use makeTest.pl with the -t flag, with 0282/005 as the source, 0282/012 as the destination, and first_5hr as the corpus. This means the sources do not match, the

Results: 3/23/16: After creating the etc, LM and DECODE directories with mkdir, the code ran unexpectedly fast. My decode.log file showed that this was caused by a fatal error in run_decode: mdef.c reported that there was no mdef file. Here was the end result...
INFO: info.c(65): Host: 'caesar'
INFO: info.c(69): Directory: '/mnt/main/Exp/0282/012/DECODE'
INFO: info.c(73): /usr/local/bin/sphinx3_decode Compiled on: Apr 23 2012, AT: 10:50:45

INFO: cmd_ln.c(512): Parsing command line:
/usr/local/bin/sphinx3_decode \
-hmm /mnt/main/Exp/0282/012/model_parameters/012.cd_cont_3506 \
-lm /mnt/main/Exp/0282/012/LM/tmp.arpa \
-dict /mnt/main/Exp/0282/012/etc/012.dic \
-fdict /mnt/main/Exp/0282/012/etc/012.filler \
-ctl /mnt/main/Exp/0282/012/etc/012_decode.fileids \
-cepdir /mnt/main/Exp/0282/012/feat \
-cepext .mfc

...extra configuration details are omitted...

INFO: kbcore.c(442): Begin Initialization of Core Models:
ERROR: "cmd_ln.c", line 724: Cannot open configuration file /mnt/main/Exp/0282/012/model_parameters/012.cd_cont_3506/feat.params for reading
INFO: kbcore.c(462): Parsed model-specific feature parameters from /mnt/main/Exp/0282/012/model_parameters/012.cd_cont_3506/feat.params
INFO:  Initialization of the log add table
INFO:  Log-Add table size = 29356 x 2 >> 0
INFO:
INFO: feat.c(848): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: kbcore.c(489): .cont.
INFO:  Initialization of feat_t, report:
INFO:  Feature type         = 1s_c_d_dd
INFO:  Cepstral size        = 13
INFO:  Number of streams    = 1
INFO:  Vector size of stream[0]: 39
INFO:  Number of subvectors = 0
INFO:  Whether CMN is used  = 1
INFO:  Whether AGC is used  = 0
INFO:  Whether variance is normalized = 0
INFO:
INFO:  Reading HMM in Sphinx 3 Model format
INFO:  Model Definition File: (null)
INFO:  Mean File: (null)
INFO:  Variance File: (null)
INFO:  Mixture Weight File: (null)
INFO:  Transition Matrices File: (null)
FATAL_ERROR: "mdef.c", line 680: No mdef-file

I suspect the error comes from missing files. Soft-linking the files, and making sure that the folders mentioned in the configuration are pointed to (along with any additional folders that further investigation turns up), could solve this problem. However, this is not a complete solution, because of numbering conflicts between the source (005) and the destination (012).
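The soft-linking idea can be sketched roughly as follows. This is only an illustration in Python, not the Perl of makeTest.pl; the function name link_shared_dirs and the default list of directories (model_parameters and feat, taken from the paths on the decoder command line) are assumptions, and the sketch does nothing about the 005/012 numbering conflict.

```python
import os


def link_shared_dirs(src_exp, dst_exp, names=("model_parameters", "feat")):
    """Symlink selected subdirectories of a source experiment into a destination.

    The directory names are assumptions based on the paths in the decode log.
    Returns the names that were actually linked.
    """
    linked = []
    for name in names:
        target = os.path.join(src_exp, name)
        link = os.path.join(dst_exp, name)
        if not os.path.isdir(target):
            continue  # the source experiment has nothing to link for this name
        if os.path.lexists(link):
            continue  # never clobber something already present in the destination
        os.symlink(target, link)
        linked.append(name)
    return linked
```

With the experiments from this page, the call would look like link_shared_dirs("/mnt/main/Exp/0282/005", "/mnt/main/Exp/0282/012"). The linked files keep their 005 prefix, so the decoder arguments still have to be pointed at the 005 names.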
The numbering conflicts could force me to modify run_decode.pl completely, and the original would then lie abandoned, no longer used by my code. My current plan is to copy the files in the etc directory that are specified in the decode.log file. For any other files that I am handling, I will create a soft link to them, because I am not directly modifying the directory. I will then determine whether I need anything more than what the decode.log file specifies. If so, I'll make another copy. Here's what happened on my second attempt, according to the log...
INFO: info.c(65): Host: 'caesar'
INFO: info.c(69): Directory: '/mnt/main/Exp/0282/012/DECODE'
INFO: info.c(73): /usr/local/bin/sphinx3_decode Compiled on: Apr 23 2012, AT: 10:50:45

INFO: cmd_ln.c(512): Parsing command line:
/usr/local/bin/sphinx3_decode \
-hmm /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_3506 \
-lm /mnt/main/Exp/0282/012/LM/tmp.arpa \
-dict /mnt/main/Exp/0282/012/etc/012.dic \
-fdict /mnt/main/Exp/0282/012/etc/012.filler \
-ctl /mnt/main/Exp/0282/012/etc/012_decode.fileids \
-cepdir /mnt/main/Exp/0282/012/feat \
-cepext .mfc

So far so good...

INFO: kbcore.c(442): Begin Initialization of Core Models:
ERROR: "cmd_ln.c", line 724: Cannot open configuration file /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_3506/feat.params for reading
INFO: kbcore.c(462): Parsed model-specific feature parameters from /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_3506/feat.params
INFO:  Initialization of the log add table
INFO:  Log-Add table size = 29356 x 2 >> 0
INFO:
INFO: feat.c(848): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: kbcore.c(489): .cont.
INFO:  Initialization of feat_t, report:
INFO:  Feature type         = 1s_c_d_dd
INFO:  Cepstral size        = 13
INFO:  Number of streams    = 1
INFO:  Vector size of stream[0]: 39
INFO:  Number of subvectors = 0
INFO:  Whether CMN is used  = 1
INFO:  Whether AGC is used  = 0
INFO:  Whether variance is normalized = 0
INFO:
INFO:  Reading HMM in Sphinx 3 Model format
INFO:  Log-Add table size = 29356 x 2 >> 0
INFO:
INFO: feat.c(848): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: kbcore.c(489): .cont.
INFO:  Initialization of feat_t, report:
INFO:  Feature type         = 1s_c_d_dd
INFO:  Cepstral size        = 13
INFO:  Number of streams    = 1
INFO:  Vector size of stream[0]: 39
INFO:  Number of subvectors = 0
INFO:  Whether CMN is used  = 1
INFO:  Whether AGC is used  = 0
INFO:  Whether variance is normalized = 0
INFO:
INFO:  Reading HMM in Sphinx 3 Model format
INFO:  Model Definition File: (null)
INFO:  Mean File: (null)
INFO:  Variance File: (null)
INFO:  Mixture Weight File: (null)
INFO:  Transition Matrices File: (null)
FATAL_ERROR: "mdef.c", line 680: No mdef-file

Well, I made better progress, but I fell down again at model_parameters. My symlink was successful, but I still need to figure out how to deal with the language model. I strongly suspect that these model_parameters directories may actually be modified (or that I need to auto-extract something...). Further investigation turned up an odd one out in Speech:Exps_0282_003. It is exactly what I was looking for, as it explains why my code is failing (and why non-default parameters are likely to fail): the model_parameters directory names are hard-wired to $CFG_N_TIED_STATES. That's not good for my code, as I now need to revamp it again to deal with this problem... and it reveals that the senone count doesn't appear to be as adjustable after training as I expected. Oh boy...

3/26/16: I decided to simply steal the folder name from model_parameters outright so that I could still process all of the files in the decode fileids. To get this part right, I first made a test script: I wanted to perform an ls operation that took the first item alphabetically. Why? Because run_decode.pl has a major flaw with its senone count: it has to match whatever model_parameters used when it was created by the train. After testing this part, I incorporated the change into my script and let the script do its magic.
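The "first item alphabetically" lookup can be sketched as below. This is a Python illustration rather than the test script itself; the function name first_model_dir is hypothetical, restricting the listing to directories is an added assumption, and Python's sorted() orders by code point, which can differ from the locale-dependent order of ls.

```python
import os


def first_model_dir(model_parameters_dir):
    """Return the alphabetically first subdirectory name in model_parameters.

    Assumption: the first entry is the trained model directory whose name
    carries the senone count used by the train.
    """
    entries = sorted(
        name
        for name in os.listdir(model_parameters_dir)
        if os.path.isdir(os.path.join(model_parameters_dir, name))
    )
    if not entries:
        raise FileNotFoundError(
            "no model directories found in " + model_parameters_dir
        )
    return entries[0]
```

The returned name can then be passed to the decoder's -hmm argument in place of a name built from the experiment number and the configured senone count.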
It didn't immediately terminate, which is a very good sign for me. It also appears to have worked, as the log seems to have the usual set of data in it.

INFO: info.c(65): Host: 'caesar'
INFO: info.c(69): Directory: '/mnt/main/Exp/0282/012/DECODE'
INFO: info.c(73): /usr/local/bin/sphinx3_decode Compiled on: Apr 23 2012, AT: 10:50:45

INFO: cmd_ln.c(512): Parsing command line:
/usr/local/bin/sphinx3_decode \
-hmm /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000 \
-lm /mnt/main/Exp/0282/012/LM/tmp.arpa \
-dict /mnt/main/Exp/0282/012/etc/012.dic \
-fdict /mnt/main/Exp/0282/012/etc/012.filler \
-ctl /mnt/main/Exp/0282/012/etc/012_decode.fileids \
-cepdir /mnt/main/Exp/0282/012/feat \
-cepext .mfc

Once again, cutting out the fine print in the configuration...

INFO: kbcore.c(442): Begin Initialization of Core Models:
INFO: cmd_ln.c(512): Parsing command line:
\
-alpha 0.97 \
-dither yes \
-doublebw no \
-nfilt 40 \
-ncep 13 \
-lowerf 133.33334 \
-upperf 6855.4976 \
-nfft 512 \
-wlen 0.0256 \
-transform legacy \
-feat 1s_c_d_dd \
-agc none \
-cmn current \
-varnorm no

I'm cutting more of the current configuration.

INFO:  Initialization of the log add table
INFO:  Log-Add table size = 29356 x 2 >> 0
INFO:
INFO: feat.c(848): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: kbcore.c(489): .cont.
INFO:  Initialization of feat_t, report:
INFO:  Feature type         = 1s_c_d_dd
INFO:  Cepstral size        = 13
INFO:  Number of streams    = 1
INFO:  Vector size of stream[0]: 39
INFO:  Number of subvectors = 0
INFO:  Whether CMN is used  = 1
INFO:  Whether AGC is used  = 0
INFO:  Whether variance is normalized = 0
INFO:
INFO:  Reading HMM in Sphinx 3 Model format
INFO:  Model Definition File: /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/mdef
INFO:  Mean File: /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/means
INFO:  Variance File: /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/variances
INFO:  Mixture Weight File: /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/mixture_weights
INFO:  Transition Matrices File: /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/transition_matrices
INFO: mdef.c(683): Reading model definition: /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/mdef
INFO:  Initialization of mdef_t, report:
INFO:  43 CI-phone, 37311 CD-phone, 3 emitstate/phone, 129 CI-sen, 1129 Sen, 3676 Sen-Seq
INFO:
INFO: kbcore.c(299): Using optimized GMM computation for Continuous HMM, -topn will be ignored
INFO: cont_mgau.c(164): Reading mixture gaussian file '/mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/means'
INFO: cont_mgau.c(423): 1129 mixture Gaussians, 8 components, 1 streams, veclen 39
INFO: cont_mgau.c(164): Reading mixture gaussian file '/mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/variances'
INFO: cont_mgau.c(423): 1129 mixture Gaussians, 8 components, 1 streams, veclen 39
INFO: cont_mgau.c(164): Reading mixture gaussian file '/mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/variances'
INFO: cont_mgau.c(423): 1129 mixture Gaussians, 8 components, 1 streams, veclen 39
INFO: cont_mgau.c(524): Reading mixture weights file '/mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/mixture_weights'
WARNING: "cont_mgau.c", line 667: Weight normalization failed for 9 senones
INFO: cont_mgau.c(679): Read 1129 x 8 mixture weights
INFO: cont_mgau.c(707): Removing uninitialized Gaussian densities 0 1 2 3 4 5 6 7 8
WARNING: "cont_mgau.c", line 781: 72 densities removed (9 mixtures removed entirely)
INFO: cont_mgau.c(797): Applying variance floor
INFO: cont_mgau.c(815): 21 variance values floored
INFO: cont_mgau.c(863): Precomputing Mahalanobis distance invariants
INFO: tmat.c(119): Reading HMM transition probability matrices: /mnt/main/Exp/0282/012/model_parameters/005.cd_cont_1000/transition_matrices
WARNING: "tmat.c", line 192: Normalization failed for tmat 0 from state 0
WARNING: "tmat.c", line 192: Normalization failed for tmat 0 from state 1
WARNING: "tmat.c", line 192: Normalization failed for tmat 0 from state 2
WARNING: "tmat.c", line 192: Normalization failed for tmat 1 from state 0
WARNING: "tmat.c", line 192: Normalization failed for tmat 1 from state 1
WARNING: "tmat.c", line 192: Normalization failed for tmat 1 from state 2
WARNING: "tmat.c", line 192: Normalization failed for tmat 2 from state 0
WARNING: "tmat.c", line 192: Normalization failed for tmat 2 from state 1
WARNING: "tmat.c", line 192: Normalization failed for tmat 2 from state 2
INFO:  Initialization of tmat_t, report:
INFO:  Read 43 transition matrices of size 3x4

There is something concerning about the decode, though. Here's what I discovered...

ERROR: "wid.c", line 282:  is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: -/DIESEL is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: -EAH is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: -SE is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: -T'S is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: A- is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: ABOUT_ is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: ACIDS is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: ALLEVIATE is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: ALZHEIMER'S is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: AMMONIA is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282: AN- is not a word in dictionary and it is not a class tag.

I've snipped the rest. The most likely reason these errors pop up is that the words were never used in the train, and the train itself uses a limited number of senones. Well, at least that means both seen and unseen data are involved here... I looked at Speech:Exps_0282_005's decode log and found a few of the same errors, but they're not really words...

ERROR: "wid.c", line 282:  is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282:  is not a word in dictionary and it is not a class tag.
ERROR: "wid.c", line 282:  is not a word in dictionary and it is not a class tag.

These dictionaries are presumably formed when training, which explains why some words never make it to the dictionary in the first place. Hmm...
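To compare the missing words between two decode logs without reading them by eye, the wid.c errors can be collected with a small script. This is a minimal sketch; the function name missing_words is hypothetical, and the pattern is based only on the error lines quoted on this page.

```python
import re

# Matches lines such as:
#   ERROR: "wid.c", line 282: ACIDS is not a word in dictionary and it is not a class tag.
WID_ERROR = re.compile(
    r'ERROR: "wid\.c", line \d+: (.*?) is not a word in dictionary'
)


def missing_words(log_text):
    """Return the distinct non-empty words reported by wid.c, in log order."""
    seen = []
    for match in WID_ERROR.finditer(log_text):
        word = match.group(1).strip()
        # Entries with an empty word (as in the 0282/005 log) are skipped.
        if word and word not in seen:
            seen.append(word)
    return seen
```

Running it over the text of both decode.log files and comparing the two lists would show which missing words are specific to this experiment.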