Speech:Exps 0288


Description

Authors:
Jon Shallow
James Schumacher
Peter Farro
Brian Anker
Saverna Ahmad
Justin Gauthier
Daisuke Matsukura
Thomas Rubino
Michael Salem
Meagan Wolf

Date:
3/30/16

Purpose:
Produce as low a WER (word error rate) as possible.

Details:
TBD

Results:

Research Log (copied and pasted -- unformatted for wiki -- sorry):
Captain America


Research Log Spring 2016 Team Members:


Brian Anker, Jon Shallow, Justin Gauthier, James Schumacher, Saverna Ahmad, Meagan Wolf, Tom Rubino, Peter Ferro, Mike Salem, Daisuke Matsukura



Experiments:
Exp directory: 0288
Dates: Apr - May 16
Corpus: Switchboard
Hours: 30 and 300

Intent: Research and determine the effects of sphinx_train.cfg parameters on WER. A baseline WER for a 30hr corpus will be established using the CMU-recommended parameters. Parameters will then be evaluated to determine their effect on the WER. Once optimal values for the evaluated parameters are determined, the team will evaluate these changes on a 300hr corpus.
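For reference, the WER figures quoted throughout this log follow the usual definition: substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. The sketch below is only an illustration of that arithmetic, not the scoring tool the experiments used; the function name and the sample sentences are made up for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one substitution (sat -> sit) and one deletion (the).
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{example_wer:.1%}")
```

"Seen" results score the decoder against transcripts that were part of training (train.trans); "unseen" results use held-out transcripts, which is why they come out much higher.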



0288/001 (300HR)
Parameter Tested: CFG_VARNORM = 'no', CFG_FINAL_NUM_DENSITIES = 32, CFG_N_TIED_STATES = 8000, CFG_CONVERGENCE_RATIO = 0.04
Hypothesis: In our first experiment we ran a 300 hour train with CFG_VARNORM set to 'no', CFG_FINAL_NUM_DENSITIES set to 32, CFG_N_TIED_STATES set to 8000 and CFG_CONVERGENCE_RATIO set to 0.04. We had hoped to get a low WER as a result.
Result:
    • 35.1% WER on seen data (train.trans)
    • 91 errors in 001.html
    • 34 hours to complete

sphinx_train.cfg:
    $CFG_AGC = 'none';
    $CFG_CMN = 'current';
    $CFG_VARNORM = 'no';
    $CFG_LTSOOV = 'no';
    $CFG_FULLVAR = 'no';
    $CFG_DIAGFULL = 'no';
    $CFG_FEATURE = "1s_c_d_dd";
    $CFG_NUM_STREAMS = 1;
    $CFG_INITIAL_NUM_DENSITIES = 1;
    $CFG_FINAL_NUM_DENSITIES = 32;
    $CFG_CONVERGENCE_RATIO = 0.04;
    $CFG_LDA_DIMENSION = 29;
    $CFG_LDA_MLLT = 'no';

0288/002 (300HR)
Parameter Tested: $CFG_VARNORM = 'no', $CFG_FINAL_NUM_DENSITIES = 64, $CFG_N_TIED_STATES = 8000, $CFG_CONVERGENCE_RATIO = 0.004
Hypothesis: In the previous experiment, we had set CFG_FINAL_NUM_DENSITIES to 32. In this experiment we decided to use the same configuration, but set CFG_FINAL_NUM_DENSITIES to 64. The recommended value when training on over 100 hours is 32. The higher this number is, the more precisely the model discriminates sounds. By upping the value we thought we had a chance of obtaining a better WER.
Result:
    • 82 errors in 002.html
    • 26.6% WER on seen data (train.trans)
    • 42.2% WER on unseen data (Dev.trans)
    • 4:30 decode time




0288/003 (30HR)
Parameter Tested: $CFG_FINAL_NUM_DENSITIES = 16 (density) and $CFG_N_TIED_STATES = 4000 (senones)
Hypothesis: In our first 30 hour train experiment we decided to set CFG_FINAL_NUM_DENSITIES to 16 and CFG_N_TIED_STATES to 4000. 32 is the recommended density for corpora over 100 hours, so a value of 16 was thought to be roughly proportionate.
Result: Established the 30hr baseline of 29.1%. This used all default values of sphinx_train.cfg except for the two settings listed above. These are the recommended values for a corpus of 30 hours from the CMUSphinx documentation.
    • 88 errors in 003.html
    • 29.1% WER on seen data


0288/004 (CONSIDER UNUSABLE) (30HR)
Parameter Tested: CFG_CONVERGENCE_RATIO = 0.004
Hypothesis: Consider unusable: two parameters (CFG_CONVERGENCE_RATIO and CFG_VARNORM) were changed with respect to the baseline experiment (003), so it is unclear which one affected the results.
Result: Testing the effect of a lower convergence ratio (from 0.04 to 0.004). Previous tests have shown us this increases WER. Also testing the effect of setting VARNORM to 'yes'; a previous experiment showed a 2.8-point WER difference on a 5 hour corpus.
    • 27.4% WER on seen data (train.trans)

0288/005 (30HR)
Parameter Tested: CFG_AGC = 'max'
Hypothesis: CFG_AGC is the configuration setting for automatic gain control.
Result: Testing the effect of automatic gain control (CFG_AGC). The default value is 'none' and the value we set for this experiment is 'max'. Switching to 'max' resulted in a decrease of 1.1 points from the 29.1% baseline, to 28.0%.
    • 28.0% WER on seen data (train.trans)


0288/006 (CONSIDER UNUSABLE)
Parameter Tested: CFG_CONVERGENCE_RATIO = 0.004, $CFG_CMN = 'none'
Hypothesis: Consider unusable: two parameters (CFG_CONVERGENCE_RATIO and CFG_CMN) were changed with respect to the baseline experiment (003), so it is unclear which one affected the results.
Result: 27.9% WER on seen data (train.trans)
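Experiments 004 and 006 were discarded because each changed two settings at once. A check along the following lines, run before starting a train, would flag that situation. This is a minimal sketch and not part of the team's tooling: the function names are hypothetical, the regular expression only handles simple one-line `$CFG_NAME = value;` assignments, and the baseline values shown are the defaults quoted elsewhere in this log.

```python
import re

# Matches simple assignments such as:  $CFG_VARNORM = 'no';
CFG_LINE = re.compile(r"""\$(CFG_\w+)\s*=\s*['"]?([^'";]+?)['"]?\s*;""")


def parse_cfg(text: str) -> dict:
    """Collect $CFG_* assignments from sphinx_train.cfg-style text."""
    return {name: value.strip() for name, value in CFG_LINE.findall(text)}


def changed_parameters(baseline: dict, experiment: dict) -> dict:
    """Return {name: (baseline_value, experiment_value)} for every difference."""
    names = sorted(set(baseline) | set(experiment))
    return {
        name: (baseline.get(name), experiment.get(name))
        for name in names
        if baseline.get(name) != experiment.get(name)
    }


baseline_003 = parse_cfg("""
$CFG_CMN = 'current';
$CFG_VARNORM = 'no';
$CFG_CONVERGENCE_RATIO = 0.04;
""")
experiment_006 = parse_cfg("""
$CFG_CMN = 'none';
$CFG_VARNORM = 'no';
$CFG_CONVERGENCE_RATIO = 0.004;
""")

diff = changed_parameters(baseline_003, experiment_006)
if len(diff) > 1:
    print("More than one parameter changed:", ", ".join(diff))
```

With the values above the check reports both CFG_CMN and CFG_CONVERGENCE_RATIO, which is exactly why the 006 result cannot be attributed to either one.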

0288/007 (30HR)
Parameter Tested: CFG_CMN = 'none'
Hypothesis: The previous experiment is considered unusable. In this experiment, we made a single configuration change, setting CMN to 'none'. This parameter is used to reduce distortion on corpora with distorted sound.
Result: 29.9% WER on seen data



0288/008 (30HR)
Parameter Tested:
    • $CFG_LO_FILT = 200
    • $CFG_HI_FILT = 3500
    • $CFG_WAVEFILE_RATE = 8000
    • All other settings as in baseline 003

Hypothesis: CONSIDER UNUSABLE: The three parameters changed belong to a set of four parameters that work together. These results should not be considered valid. We were hoping for a slightly better rate, as these are phone conversations originally recorded at 8 kHz.

Result: 35.4% WER


0288/009 (004 REDO) (30HR)
Parameter Tested: $CFG_FINAL_NUM_DENSITIES = 16 (density), $CFG_N_TIED_STATES = 4000 (senones), $CFG_VARNORM = 'yes'
Hypothesis: This is a redo of the failed experiment 004.
Result: 35.6% WER



0288/010 (006 REDO) (30HR)
Parameter Tested: $CFG_FINAL_NUM_DENSITIES = 16 (density), $CFG_N_TIED_STATES = 4000 (senones), $CFG_VARNORM = 'yes'
Hypothesis: This is a redo of the failed experiment 006.
Result:
    • 30% WER on seen data
    • 56.3% WER on unseen data


0288/011 (300HR)
Parameter Tested:
    • $CFG_AGC = 'max'
    • CFG_CONVERGENCE_RATIO = 0.001
    • Everything identical to 002 (the last 300 hr train) except the AGC and convergence ratio values

Hypothesis: The AGC parameter was tested in this experiment. AGC stands for automatic gain control; in a two-sided conversation it automatically adjusts the gain level.
Result: 30% WER



0288/012(30HR) Parameter Tested

    • $CFG_FINAL_NUM_DENSITIES = 16 (density)
    • $CFG_N_TIED_STATES = 4000(senones)
    • $CFG_CONVERGENCE_RATIO = 0.004;

Hypothesis: CFG_CONVERGENCE_RATIO set to 0.004 instead of the regular 0.04.
Result: 53% WER on unseen data


0288/013(30HR) Parameter Tested

    • $CFG_FINAL_NUM_DENSITIES = 32 (density)
    • $CFG_N_TIED_STATES = 4000(senones)


Hypothesis: CFG_FINAL_NUM_DENSITIES was changed to 32 in an attempt to gain a better word error rate. The recommended value for corpora of 100+ hours is 32.
Result: 53.1% WER on unseen data

0288/014(30HR) Parameter Tested

    • $CFG_FINAL_NUM_DENSITIES = 64 (density)
    • $CFG_N_TIED_STATES = 4000(senones)


Hypothesis: CFG_FINAL_NUM_DENSITIES has been set to 64 on a 30 hour corpus. This was another attempt to see how the WER is affected by raising the density.
Result: 53.6% WER on unseen data

0288/015(30HR) Parameter Tested

    • $CFG_FINAL_NUM_DENSITIES = 16 (density)
    • $CFG_N_TIED_STATES = 4000(senones)
    • $CFG_LTSOOV = 'yes';

Hypothesis: This is the first time CFG_LTSOOV has been used in our experiments. We changed the default setting to 'yes' for experimentation.
Result: 54.1% WER on unseen data


0288/016 (30HR)
Parameter Tested:
    • $CFG_FINAL_NUM_DENSITIES = 16 (density)
    • $CFG_N_TIED_STATES = 4000 (senones)
    • $CFG_FORCEDALIGN = 'yes'

Hypothesis:
Result: 53.9% WER on unseen data


0288/017 (30HR)
Parameter Tested:
    • $CFG_FINAL_NUM_DENSITIES = 16 (density)
    • $CFG_N_TIED_STATES = 4000 (senones)
    • $CFG_FORCEDALIGN = 'yes' -- required for below
    • $CFG_VTLN = 'yes'
    • $CFG_VTLN_START = 0.70
    • $CFG_VTLN_END = 1.40
    • $CFG_VTLN_STEP = 0.05

Hypothesis: In this experiment CFG_FORCEDALIGN, CFG_VTLN and the CFG_VTLN_START/END/STEP range were used. This experiment failed due to the reasons mentioned below.
Result: 017 failed.

Some research (https://sourceforge.net/p/cmusphinx/discussion/help/thread/7c8f824c/?limit=25) led me to believe a possible cause would be a missing vtln_align.pl. The script is found here:
[jax472@miraculix 04.vtln_align]$ pwd
/mnt/main/scripts/train/scripts_pl/04.vtln_align
[jax472@miraculix 04.vtln_align]$ ls
slave_align.pl vtln_align.pl

Plan: Create a new experiment (019), copy vtln_align.pl into exp/bin. Retrain.


0288/018 (30HR)
Parameter Tested:
    • $CFG_FINAL_NUM_DENSITIES = 16 (density)
    • $CFG_N_TIED_STATES = 4000 (senones)
    • $CFG_FORCEDALIGN = 'yes'
    • $CFG_FALIGN_CI_MGAU = 'yes' -- requires the above to be yes

Hypothesis: CFG_FORCEDALIGN and CFG_FALIGN_CI_MGAU both set to 'yes'. These parameters are used for automatic word alignment.
Result: 53.1% WER on unseen data


0288/019 (30HR)
Parameter Tested:
    • $CFG_FINAL_NUM_DENSITIES = 16 (density)
    • $CFG_N_TIED_STATES = 4000 (senones)
    • $CFG_FORCEDALIGN = 'yes' -- required for below
    • $CFG_VTLN = 'yes'
    • $CFG_VTLN_START = 0.70
    • $CFG_VTLN_END = 1.40
    • $CFG_VTLN_STEP = 0.05

Hypothesis: Copied /mnt/main/local/bin/sphinx3_align and /mnt/main/scripts/train/scripts_pl/04.vtln_align to 019/bin.

      • We believe missing these is what caused the failure of 017

Result UPDATE:

EMAIL 1: mike,

Team cap is trying to add vocal tract length normalization to our experiments. 0288/017 and 019 failed (due to what we believed was missing bin files).

We believe it is due to a directory (vtlnout/0.70/) not being made:


019.html file: [jax472@caesar 019]$ tail -10 019.html

../scripts_pl/make_feats.pl <a href="file:///mnt/main/Exp/0288/019/logdir/04.vtln_align/019.extract.0.70.log">Log File</a>

completed

Phase 6: Running force alignment in 2 parts

Force alignment starting: (2 of 2)

sphinx3_align <a href="file:///mnt/main/Exp/0288/019/logdir/04.vtln_align/019.2.vtln.log">Log File</a>

Force alignment starting: (1 of 2)

sphinx3_align <a href="file:///mnt/main/Exp/0288/019/logdir/04.vtln_align/019.1.vtln.log">Log File</a>

This step had 1 ERROR messages and 0 WARNING messages. Please check the log file for details.

completed

This step had 1 ERROR messages and 0 WARNING messages. Please check the log file for details.

completed

Failed in part 1

Failed in part 1

[jax472@caesar 04.vtln_align]$ tail -5 019.1.vtln.log
-wdsegdir /mnt/main/Exp/0288/019/vtlnout/0.70,CTL
-wlen 0.025625 2.562500e-02

SYSTEM_ERROR: "main_align.c", line 974: Failed to open file /mnt/main/Exp/0288/019/vtlnout/0.70/019.alignedtranscripts.1 for writing; No such file or directory
Thu Apr 21 13:23:37 2016


Found the vtln script here:
[jax472@miraculix 04.vtln_align]$ pwd
/mnt/main/scripts/train/scripts_pl/04.vtln_align
[jax472@miraculix 04.vtln_align]$ ls
slave_align.pl vtln_align.pl

Phase 6 starts in vtln_align.pl at line 230. Further scanning through vtln_align shows that the directory appears to never be made.

A google search led me to the cmusphinx github: https://github.com/cmusphinx/sphinxtrain

Under the scripts there, they have "12.vtln_align" last updated 2 years ago. Under that, in the vtln_align.pl file on line 61 & 62:

my $outdir = "$ST::CFG_BASE_DIR/vtlnout/$warp";
mkdir($outdir,0777);

where $warp would be the .70 value we need. Appears that this makes the directory we need.

Our 04.vtln_align directory was last modified in Apr 2012. Looks like we may have an outdated version where VTLN is not working.

Googling the SYSTEM_ERROR message I found: https://sourceforge.net/p/cmusphinx/discussion/help/thread/7c8f824c/#bf9f

This appears to show that prior to a 2014 fix, VTLN was broken. However, the author of that post also claims that VTLN is not supported in decoding, so it may not work. The following post from the original poster says that he got it working, though.


Thoughts on this? It appears it is possible to get working and does improve performance, but would require some work with updating sphinxtrain.

EMAIL 2: Here is what you can do:

    • Take Miraculix and remove the /usr/local link to /mnt/main/local
          • you'll have to recreate the /usr/local directory (it might actually just have been renamed to /usr/local-OFF)
    • Unmount /mnt/main
          • umount -a should do it
          • You need to do this because some sphinx scripts will go into /mnt/main/root and we don't want that stuff touched either.
    • Plug in the network wire from brutus
    • Set up DHCP
    • Download and install anything from sphinx that you want
          • note, you can also copy some (or all) of the executables from /mnt/main/local to /usr/local before unmounting /mnt/main
          • you can also copy all of /mnt/main/root to (say) /usr/local/root
          • perhaps copy /mnt/main/install to /root/install
    • Once you have configured your new version of Sphinx that allows you to do LDA as well as VTL, and it is set up to use /usr/local as well as /usr/local/root, you can then mount /mnt/main again to get back the experiment directories and on Asterix you will be able to run your own modified executables.

Please document this (privately if you have to) so that in the future we could for instance just grab your new executables and replace ours. If you move all the root stuff into /usr/local/root then when we move all of the /usr/local stuff into /mnt/main/local we will effectively have moved /mnt/main/root to /mnt/main/local/root so that works really nicely for future upgrades.

This way you won't be hampered by the rule of not changing Caesar but you also don't unfairly hamper the other team.

BTW, please inform Brenden Collins to use Asterix instead of Miraculix to run his three parallel decode jobs. I'd rather keep Asterix clean since that is usually our golden drone machine so I'd like to have that be the template for the others (i.e. the comic book is called Asterix after all).

Let me know what you think and please be careful with /mnt/main and be sure it is unmounted on Miraculix before doing any installs.

Mike
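The failure described in EMAIL 1 comes down to one output directory per warp factor not existing before sphinx3_align writes into it. Purely as an illustration of that diagnosis, the sketch below enumerates the warp factors implied by the 0.70 / 1.40 / 0.05 settings and creates a vtlnout/<warp> directory for each inside a throwaway temporary directory. It is not the sphinxtrain fix itself (that is the Perl mkdir quoted above); the function names are hypothetical, and treating the end value as inclusive is an assumption.

```python
import os
import tempfile


def warp_factors(start_hundredths: int, end_hundredths: int, step_hundredths: int) -> list:
    """Warp factors as two-decimal strings such as '0.70'.

    Working in integer hundredths avoids floating-point drift when stepping.
    """
    return ["%.2f" % (h / 100)
            for h in range(start_hundredths, end_hundredths + 1, step_hundredths)]


def make_vtln_output_dirs(base_dir: str, warps: list) -> list:
    """Create base_dir/vtlnout/<warp> for every warp factor and return the paths."""
    created = []
    for warp in warps:
        path = os.path.join(base_dir, "vtlnout", warp)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# CFG_VTLN_START = 0.70, CFG_VTLN_END = 1.40, CFG_VTLN_STEP = 0.05
warps = warp_factors(70, 140, 5)

# A temporary directory stands in for the experiment directory here.
with tempfile.TemporaryDirectory() as exp_dir:
    dirs = make_vtln_output_dirs(exp_dir, warps)
    all_exist = all(os.path.isdir(d) for d in dirs)

print(len(warps), "warp factors from", warps[0], "to", warps[-1])
```

Under these assumptions a single VTLN run covers 15 warp factors, each needing feature extraction and a forced-alignment pass, which is worth keeping in mind when estimating how long a working VTLN train would take.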


0288/020 (30HR)
Parameter Tested:
    • $CFG_FINAL_NUM_DENSITIES = 16 (density)
    • $CFG_N_TIED_STATES = 4000 (senones)
    • $CFG_FORCEDALIGN = 'yes'
    • $CFG_FALIGN_CI_MGAU = 'yes'
    • $CFG_MMIE_MAX_ITERATIONS = 4
    • $CFG_MMIE_CONSTE = "3.0"
    • $CFG_LANGUAGEWEIGHT = "11.5"
    • $CFG_LANGUAGEMODEL = "LMFILE"

Hypothesis: CFG_MMIE_MAX_ITERATIONS set to 4 with the aim of decreasing WER.
Result: n/a


0288/020 (30HR)
Parameter Tested:

    • $CFG_FINAL_NUM_DENSITIES = 16 (density)
    • $CFG_N_TIED_STATES = 4000 (senones)
    • $CFG_NPART = 8 (hoping this makes training faster)

Hypothesis: CFG_NPART was used in an attempt to make the trains run faster.

Result: n/a




Parameter Research:


CFG_LO_FILT:
Tool: sphinxtrain
Location: sphinx_train.cfg / feat.params
What it is: For telephone 8 kHz speech the value is 200.
Research Findings: It appears that the LO/HI values are a filter for fine tuning the frequencies that the system listens to. In other words, there is no sense trying to hear audio tones that cannot be captured when recording 8 kHz audio. I believe the Switchboard corpus was originally recorded at 8 kHz ( https://catalog.ldc.upenn.edu/LDC97S62 ), so we probably should be using this. Notes from previous classes indicate they seemed to get an improvement when setting them.
Resources: https://foss.unh.edu/projects/index.php/Speech:Summer_2013_Eric_Beikman


CFG_HI_FILT:
Tool: sphinxtrain
Location: sphinx_train.cfg / feat.params
What it is: For telephone 8 kHz speech the value is 3500.
Research Findings: It appears that the LO/HI values are a filter for fine tuning the frequencies that the system listens to. In other words, there is no sense trying to hear audio tones that cannot be captured when recording 8 kHz audio. I believe the Switchboard corpus was originally recorded at 8 kHz ( https://catalog.ldc.upenn.edu/LDC97S62 ), so we probably should be using this. Notes from previous classes indicate they seemed to get an improvement when setting them.
Resources: https://foss.unh.edu/projects/index.php/Speech:Summer_2013_Eric_Beikman
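The reasoning behind the 200 / 3500 values can be checked with the sampling theorem: audio sampled at 8 kHz cannot represent anything above 4 kHz, so the upper filter edge has to sit at or below that. The sketch below is only that arithmetic; the function names are hypothetical and the 5000 Hz value is an arbitrary counterexample, not a setting from any experiment.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


def band_fits(lo_filt_hz: float, hi_filt_hz: float, sample_rate_hz: float) -> bool:
    """True when the filter band lies inside what the sampling rate can capture."""
    return 0 <= lo_filt_hz < hi_filt_hz <= nyquist_hz(sample_rate_hz)


# Settings from experiment 008: LO 200 Hz, HI 3500 Hz, 8000 Hz audio.
print(nyquist_hz(8000))             # 4000.0
print(band_fits(200, 3500, 8000))   # True
print(band_fits(200, 5000, 8000))   # False: upper edge is above 4000 Hz
```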

CFG_NUM_FILT:
Tool: sphinxtrain
Location: sphinx_train.cfg
What it is: For wideband speech it's 25; for telephone 8 kHz speech a reasonable value is 15.
Research Findings: Won't the reasonable value depend on what is assigned to CFG_NUM_FILT? For example, if it's 40: for wideband speech it's 40, for telephone 8 kHz speech a reasonable value is 31.
Resources: http://messe2media.com/files/sphinx_train.cfg


CFG_LIFTER:
Tool: sphinxtrain
Location: sphinx_train.cfg
What it is: The cepstrum lifter is a smoothing step to improve recognition.
Research Findings: A good visual representation of cepstral liftering is linked below.
Resources: http://www.mathworks.com/matlabcentral/mlc-downloads/downloads/submissions/52088/versions/4/screenshot.jpg

CFG_VECTOR_LENGTH:
Tools: sphinxtrain
Location: sphinx_train.cfg
What it is: ?
Research Findings: Every source that I found agrees that 13 is enough.
Resources: http://lnu.diva-portal.org/smash/get/diva2:332123/FULLTEXT01

CFG_AGC:
Tools: sphinxtrain
Location: sphinx_train.cfg
What it is: (none/max) Type of AGC to apply to input files
Research Findings: What is AGC? AGC, or automatic gain control, is a power control option used to automatically adjust the speech level of an audio system. One reason this might be used is when a device records both sides of a telephone conversation. The device must record both the relatively large signal from the local user and the much smaller signal from the remote user at comparable loudness. Some telephone recording devices incorporate automatic gain control to produce acceptable-quality recordings. Some implementations also keep the AGC from acting on non-speech signals like laptop fans, paper rustling, keyboard typing, etc. Note that the non-speech sounds won't be rejected; they will still be heard at a normal volume, but the AGC will not change its gain in response to those sounds.
***NOTE: This is general information on AGC, and the details of its functionality in Sphinx might not be as sophisticated as in other systems with better-written algorithms.
AGC has two settings that can be configured in Sphinx 3: none and max. What it sounds like to me is that if AGC is set to none, Sphinx will not perform automatic gain control on the audio recordings being input. I was not able to find exact information on what the term "max" means in Sphinx, but again, what it sounds like to me is that this setting turns AGC on and the gain will be adjusted.
***NOTE: AGC settings must match in the training process and in the decode process, or accuracy will be poor. The settings used to train your base models may have differed in one or more ways from the settings you used while training with the new data. The most dangerous setting mismatch is the agc (max/none). Check the other settings too, and finally make sure that during decoding you use the same agc (and other relevant settings like varnorm and cmn) as during training.
Resources: http://www.speech.cs.cmu.edu/sphinxman/FAQ.html
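The log above could not pin down what 'max' does. One plausible reading, and my understanding of how the Sphinx front end treats it, is that the energy-like first cepstral coefficient (c0) of every frame is shifted so that the loudest frame in the utterance sits at zero. Treat that as an assumption to be confirmed against the sphinxbase source rather than as a statement of what the installed version does; the function name and the tiny frame values are made up for the example.

```python
def agc_max(frames):
    """Shift c0 (the first value of every frame) so the utterance maximum becomes 0.

    ASSUMPTION: this mirrors one reading of Sphinx's 'max' AGC mode; the other
    coefficients in each frame are left untouched.
    """
    if not frames:
        return []
    peak_c0 = max(frame[0] for frame in frames)
    return [[frame[0] - peak_c0] + list(frame[1:]) for frame in frames]


# Three hypothetical frames: [c0, c1]
utterance = [[5.0, 1.0], [7.0, 2.0], [6.0, 3.0]]
normalized = agc_max(utterance)
print(normalized)
```

Whatever the exact implementation, the note above about matching the setting between training and decoding still applies: a model trained on shifted c0 values will score unshifted features poorly.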

CFG_CMN:
Tools: sphinxtrain
Location: sphinx_train.cfg
What it is: (current/none) Type of cepstral mean subtraction/normalization to apply to input files
Research Findings: The goal of CMN is to reduce distortion introduced by the sound input, a microphone being one example. This would probably matter most if the sound quality were poor or the volume were wildly inconsistent to the point of distortion. CMN is applied by default, so a test should be done where the only difference is this setting to see what happens.
Resources: https://foss.unh.edu/projects/index.php/Speech:Sphinx_train.cfg http://cmusphinx.sourceforge.net/doc/sphinx4/edu/cmu/sphinx/frontend/feature/BatchCMN.html
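As a rough picture of what cepstral mean normalization does, the sketch below subtracts the per-utterance mean of each cepstral dimension from every frame, which removes any constant offset a microphone or channel adds. It assumes the 'current' setting means "use the mean of the current utterance", in line with the batch CMN description linked above; the function name and the sample values are hypothetical, and this is not the Sphinx implementation.

```python
def cepstral_mean_normalize(frames):
    """Subtract the utterance-level mean of each dimension from every frame."""
    if not frames:
        return []
    n_frames = len(frames)
    n_dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(n_dims)]
    return [[frame[d] - means[d] for d in range(n_dims)] for frame in frames]


# Two hypothetical two-dimensional frames; the means are 2.0 and 12.0.
cmn_frames = cepstral_mean_normalize([[1.0, 10.0], [3.0, 14.0]])
print(cmn_frames)
```

After normalization every dimension averages to zero over the utterance, so two recordings that differ only by a constant channel offset end up with the same features.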

CFG_VARNORM:
Tool: sphinxtrain
Location: sphinx_train.cfg
What it is: (yes/no) Normalize variance of input files to 1.0
Research Findings: All sources that I have seen set VARNORM = 'no'. I ended up running two experiments using the first_5hr corpus to test. I kept all values the same except for VARNORM during these two tests. Results: with VARNORM = 'yes' it had a 44.4% WER; with VARNORM = 'no' it had a 47.2% WER.
Resources: ?
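"Normalize variance of input files to 1.0" can be pictured as follows: for each feature dimension, remove the utterance mean and divide by the utterance standard deviation. This is a minimal sketch under stated assumptions (mean removal happens first, population variance is used, and a zero-variance dimension is mapped to 0.0); the function name and data are hypothetical and the real front end may differ in those details.

```python
import math


def variance_normalize(frames):
    """Scale each dimension to zero mean and unit variance over the utterance."""
    if not frames:
        return []
    n_frames = len(frames)
    n_dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]
    # Population variance per dimension (an assumption for this sketch).
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / n_frames
                 for d in range(n_dims)]
    stds = [math.sqrt(v) for v in variances]
    return [[(f[d] - means[d]) / stds[d] if stds[d] > 0 else 0.0
             for d in range(n_dims)]
            for f in frames]


# One hypothetical dimension with mean 2.0 and variance 4.0.
vn_frames = variance_normalize([[0.0], [4.0]])
print(vn_frames)
```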



CFG_FULLVAR:
Tool: sphinxtrain
Location: sphinx_train.cfg
What it is: (yes/no) Train full covariance matrices
Research Findings: The team found the paper "Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data" on 4/13/2016.
Resources: https://www.semanticscholar.org/paper/Covariance-Matrix-Enhancement-Approach-to-Train-Vanek-Machlica/3fdb0021cc2b6b8412d21ce8b7ab5c1ff6ee6139 http://cmusphinx.sourceforge.net/wiki/ldamllt


CFG_DIAGFULL:
Tool: sphinxtrain
Location: sphinx_train.cfg
What it is: (yes/no) Use diagonals only of full covariance matrices for Forward-Backward evaluation (recommended if CFG_FULLVAR is yes)
Research Findings: The article found is "Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data".
Resources: https://goo.gl/SaEu45

CFG_FORCE_ALIGN_BEAM:
Tool: sphinxtrain
Location: sphinx_train.cfg
What it is: Use a particular beam width for force alignment. The wider (i.e. smaller numerically) the beam, the fewer sentences will be rejected for bad alignment.
Research Findings: Finding #1: Mentions force alignment -- After researching a bit more, the only way to determine the optimal value for this parameter is to experiment. The value we're using right now for 0288/002 is actually slightly bigger than for 0288/001, because we're modeling our 0288/002 experiment after 0271/003. So maybe they looked into this parameter and found 1e-50 was a better value than the default 1e-60.
Resources: http://www.speech.cs.cmu.edu/sphinxman/FAQ.html#1




CFG_CONVERGENCE_RATIO:
Tool: sphinxtrain
Location: sphinx_train.cfg
What it is: This number is the ratio of the difference in likelihood between the current and the previous iteration of Baum-Welch to the total likelihood in the previous iteration.
Research Findings: See page 6 of the following pdf.
Resources: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-345-automatic-speech-recognition-spring-2003/assignments/assignment8.pdf
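The definition above can be written out as a small stopping rule: keep running Baum-Welch iterations until the relative improvement in likelihood drops below the configured ratio. The sketch below assumes log-likelihoods (which are negative, hence the absolute value in the denominator) and uses made-up numbers; the function names are hypothetical and the trainer's actual bookkeeping may differ.

```python
def convergence_ratio(previous_likelihood: float, current_likelihood: float) -> float:
    """Relative likelihood improvement between two Baum-Welch iterations.

    ASSUMPTION: likelihoods are log-likelihoods, so the previous value is
    negative and its absolute value is used as the denominator.
    """
    return (current_likelihood - previous_likelihood) / abs(previous_likelihood)


def iterations_until_converged(likelihoods, threshold):
    """Return the first iteration index whose improvement falls below threshold."""
    for i in range(1, len(likelihoods)):
        if convergence_ratio(likelihoods[i - 1], likelihoods[i]) < threshold:
            return i
    return None  # never converged within the given iterations


# Hypothetical total log-likelihoods after iterations 0..3.
history = [-1000.0, -900.0, -880.0, -878.0]
print(iterations_until_converged(history, 0.04))    # stops at iteration 2
print(iterations_until_converged(history, 0.004))   # stops at iteration 3
```

The example shows the practical effect of the experiments that lowered the ratio from 0.04 to 0.004: a smaller ratio demands a smaller improvement before stopping, so training runs for more iterations.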

CFG_FINAL_NUM_DENSITIES:
Tool:
Location:
What it is:
Research Findings: If you are training continuous models for a large vocabulary and have more than 100 hours of data, put 32 here. It can be any power of 2: 4, 8, 16, 32, 64. If you are training a semi-continuous or PTM model, use 256 gaussians. The related CFG_N_TIED_STATES value is the number of senones to train in a model. The more senones a model has, the more precisely it discriminates the sounds. On the other hand, if you have too many senones, the model will not be generic enough to recognize unseen speech. That means that the WER will be higher on unseen data. That's why it is important not to overtrain the models. In case there are too many unseen senones. The lower the density used, the worse the result.
Resources: ?
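The power-of-2 restriction is easier to remember if the trainer is pictured as growing each mixture by doubling, from CFG_INITIAL_NUM_DENSITIES up to CFG_FINAL_NUM_DENSITIES. That doubling picture is an assumption on my part, not something this log verified; the sketch below just validates a value and lists the stages such a schedule would pass through. Function names are hypothetical.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... using the single-set-bit trick."""
    return n > 0 and (n & (n - 1)) == 0


def density_schedule(initial: int, final: int) -> list:
    """Densities visited when doubling from initial up to final (assumed behaviour)."""
    if not (is_power_of_two(initial) and is_power_of_two(final)) or initial > final:
        raise ValueError("initial and final must be powers of 2 with initial <= final")
    schedule = []
    n = initial
    while n <= final:
        schedule.append(n)
        n *= 2
    return schedule


# Values from the 300 hr configuration: initial 1, final 32.
print(density_schedule(1, 32))   # [1, 2, 4, 8, 16, 32]
print(is_power_of_two(62))       # False: 62 would not be a valid setting
```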

CFG_HMM_TYPE:
Tool: .ptm
Location:
What it is:
Research Findings: The difference between PTM, semi-continuous and continuous models is the following. We use a mixture of gaussians to compute the score of each frame; the difference is how we build that mixture. In a continuous model every senone has its own set of gaussians, so the total number of gaussians in the model is about 150 thousand. That's too many to compute the mixture efficiently. In a semi-continuous model we have just 700 gaussians, far fewer than in a continuous model, and we only use them with different mixture weights to score the frame. Due to the smaller number of gaussians, semi-continuous models are fast, but because of their more hardcoded structure they are also a bit less accurate. PTM models are a golden mean here. They use about 5000 gaussians, providing better accuracy than semi-continuous models while still being significantly faster than continuous models, so they can be used in mobile applications. The accuracy of a PTM model is almost the same as that of a continuous model.
Resources: ?
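Since a continuous model gives every senone its own set of gaussians, the total for our experiments is simply senones times densities. The sketch below does that multiplication for the configurations used in this log, to compare against the "about 150 thousand" figure quoted above; the function name is hypothetical.

```python
def continuous_gaussian_count(senones: int, densities_per_senone: int) -> int:
    """Total gaussians when every senone has its own mixture."""
    return senones * densities_per_senone


configs = {
    "30 hr baseline (003)": (4000, 16),
    "300 hr (001)": (8000, 32),
    "300 hr (002)": (8000, 64),
}
totals = {name: continuous_gaussian_count(s, d) for name, (s, d) in configs.items()}
for name, total in totals.items():
    print(f"{name}: {total:,} gaussians")
```

The 30 hr baseline comes to 64,000 gaussians and the 300 hr trains to 256,000 and 512,000, so the larger experiments sit well above the typical figure cited in the research findings, which is consistent with their long training and decode times.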