Speech:Sphinx train.cfg

Sphinx Trainer Configuration File (sphinx_train.cfg) Overview

The sphinx_train.cfg file is a simple Perl script that is used during feature ("feats") generation (make_feats.pl) and training, among other tasks. It is located in the experiment's etc directory. It consists mostly of variable assignments and is used together with a calling script (runAll.pl) that contains the actual logic.

It defines:
 * The experiment number.
 * The base experiment directory for this experiment.
   * /mnt/main/Exp/
 * The root experiment directory.
   * /mnt/main/Exp/
 * Where to find the experiment dictionaries, transcripts, corpus file lists, and other important files.
 * What type of model to create.
 * Characteristics of any created model.
 * And more!

This document attempts to explain each part of the file, and its effects on the trainer.
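
Because the file is ordinary Perl, a calling script can load it simply by evaluating it. The following is a minimal sketch of that idea, not the actual logic of runAll.pl: it writes a tiny stand-in config with made-up values to a temporary file and loads it with `do`. The package name ST is an assumption, suggested by the `$ST::CFG_...` references in the commented-out lines of the real file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny stand-in config file (made-up values, not the real file).
my ($fh, $cfg_path) = tempfile(SUFFIX => '.cfg', UNLINK => 1);
print {$fh} <<'CFG';
$CFG_DB_NAME  = "train1";
$CFG_BASE_DIR = "/tmp/$CFG_DB_NAME";
$CFG_LIST_DIR = "$CFG_BASE_DIR/etc";
$CFG_DONE     = 1;
return 1;
CFG
close $fh or die "cannot write $cfg_path: $!";

# Evaluate the file inside package ST so that its globals become $ST::CFG_*.
# The trailing "return 1;" is what makes "do" report success here.
{
    package ST;
    our ($CFG_DB_NAME, $CFG_BASE_DIR, $CFG_LIST_DIR, $CFG_DONE);
    my $ok = do $cfg_path;
    die "could not load $cfg_path: " . ($@ || $!) unless $ok;
}

print "experiment name: $ST::CFG_DB_NAME\n";
print "list directory:  $ST::CFG_LIST_DIR\n";
```

Loading the file this way keeps all settings in one namespace, so the logic in the calling script can stay separate from the per-experiment values.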

sphinx_train.cfg
The following is an unedited copy of the generated sphinx_train.cfg file.


# Configuration script for sphinx trainer                  -*-mode:Perl-*-

$CFG_VERBOSE = 1;              # Determines how much goes to the screen.

# These are filled in at configuration time
$CFG_DB_NAME = "train1";
$CFG_BASE_DIR = "/root/speechtools/SphinxTrain-1.0/train1";
$CFG_SPHINXTRAIN_DIR = "/root/speechtools/SphinxTrain-1.0";

# Directory containing SphinxTrain binaries
$CFG_BIN_DIR = "$CFG_BASE_DIR/bin";
$CFG_GIF_DIR = "$CFG_BASE_DIR/gifs";
$CFG_SCRIPT_DIR = "$CFG_BASE_DIR/scripts_pl";

# Experiment name, will be used to name model files and log files
$CFG_EXPTNAME = "$CFG_DB_NAME";

# Audio waveform and feature file information
$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'sph';
$CFG_WAVFILE_TYPE = 'nist'; # one of nist, mswav, raw
$CFG_FEATFILES_DIR = "$CFG_BASE_DIR/feat";
$CFG_FEATFILE_EXTENSION = 'mfc';
$CFG_VECTOR_LENGTH = 13;

$CFG_MIN_ITERATIONS = 1;  # BW Iterate at least this many times
$CFG_MAX_ITERATIONS = 10; # BW Don't iterate more than this, somethings likely wrong.

# (none/max) Type of AGC to apply to input files
$CFG_AGC = 'none';
# (current/none) Type of cepstral mean subtraction/normalization
# to apply to input files
$CFG_CMN = 'current';
# (yes/no) Normalize variance of input files to 1.0
$CFG_VARNORM = 'no';
# (yes/no) Use letter-to-sound rules to guess pronunciations of
# unknown words (English, 40-phone specific)
$CFG_LTSOOV = 'no';
# (yes/no) Train full covariance matrices
$CFG_FULLVAR = 'no';
# (yes/no) Use diagonals only of full covariance matrices for
# Forward-Backward evaluation (recommended if CFG_FULLVAR is yes)
$CFG_DIAGFULL = 'no';

# (yes/no) Perform vocal tract length normalization in training.  This
# will result in a "normalized" model which requires VTLN to be done
# during decoding as well.
$CFG_VTLN = 'no';
# Starting warp factor for VTLN
$CFG_VTLN_START = 0.80;
# Ending warp factor for VTLN
$CFG_VTLN_END = 1.40;
# Step size of warping factors
$CFG_VTLN_STEP = 0.05;

# Directory to write queue manager logs to
$CFG_QMGR_DIR = "$CFG_BASE_DIR/qmanager";
# Directory to write training logs to
$CFG_LOG_DIR = "$CFG_BASE_DIR/logdir";
# Directory for re-estimation counts
$CFG_BWACCUM_DIR = "$CFG_BASE_DIR/bwaccumdir";
# Directory to write model parameter files to
$CFG_MODEL_DIR = "$CFG_BASE_DIR/model_parameters";

# Directory containing transcripts and control files for
# speaker-adaptive training
$CFG_LIST_DIR = "$CFG_BASE_DIR/etc";

#*******variables used in main training of models*******
$CFG_DICTIONARY     = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";
$CFG_RAWPHONEFILE   = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
$CFG_FILLERDICT     = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
$CFG_LISTOFFILES    = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";
$CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.trans";
$CFG_FEATPARAMS     = "$CFG_LIST_DIR/feat.params";

#*******variables used in characterizing models*******

$CFG_HMM_TYPE = '.semi.'; # Sphinx II
#$CFG_HMM_TYPE = '.cont.'; # Sphinx III

if (($CFG_HMM_TYPE ne ".semi.") and ($CFG_HMM_TYPE ne ".cont.")) {
    die "Please choose one CFG_HMM_TYPE out of '.cont.' or '.semi.', " .
        "currently $CFG_HMM_TYPE\n";
}

if ($CFG_HMM_TYPE eq '.semi.') {
    $CFG_DIRLABEL = 'semi';
    $CFG_STATESPERHMM = 5;
    $CFG_SKIPSTATE = 'yes';
    # Four (4) stream features for Sphinx II
    $CFG_FEATURE = "s2_4x";
    $CFG_NUM_STREAMS = 4;
    $CFG_INITIAL_NUM_DENSITIES = 256;
    $CFG_FINAL_NUM_DENSITIES = 256;
    die "For semi continuous models, the initial and final models have the same density"
        if ($CFG_INITIAL_NUM_DENSITIES != $CFG_FINAL_NUM_DENSITIES);
} elsif ($CFG_HMM_TYPE eq '.cont.') {
    $CFG_DIRLABEL = 'cont';
    $CFG_STATESPERHMM = 3;
    $CFG_SKIPSTATE = 'no';
    # Single stream features - Sphinx 3
    $CFG_FEATURE = "1s_c_d_dd";
    $CFG_NUM_STREAMS = 1;
    $CFG_INITIAL_NUM_DENSITIES = 1;
    $CFG_FINAL_NUM_DENSITIES = 8;
    die "The initial has to be less than the final number of densities"
        if ($CFG_INITIAL_NUM_DENSITIES > $CFG_FINAL_NUM_DENSITIES);
}

# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise) in the models created
# specifically for forced alignment
$CFG_FALIGN_CI_MGAU = 'no';
# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise)
$CFG_CI_MGAU = 'no';
# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 1000;
# How many parts to run Forward-Backward estimatinon in
$CFG_NPART = 1;

# (yes/no) Train a single decision tree for all phones (actually one
# per state) (useful for grapheme-based models, use 'no' otherwise)
$CFG_CROSS_PHONE_TREES = 'no';

# Use force-aligned transcripts (if available) as input to training
$CFG_FORCEDALIGN = 'no';

# Use a specific set of models for force alignment.  If not defined,
# context-independent models for the current experiment will be used.
$CFG_FORCE_ALIGN_MDEF = "$CFG_BASE_DIR/model_architecture/$CFG_EXPTNAME.falign_ci.mdef";
if ($CFG_FALIGN_CI_MGAU eq 'yes') {
    $CFG_FORCE_ALIGN_MODELDIR = "$CFG_MODEL_DIR/$CFG_EXPTNAME.falign_ci_${CFG_DIRLABEL}_$CFG_FINAL_NUM_DENSITIES";
} else {
    $CFG_FORCE_ALIGN_MODELDIR = "$CFG_MODEL_DIR/$CFG_EXPTNAME.falign_ci_$CFG_DIRLABEL";
}

# Use a specific dictionary and filler dictionary for force alignment.
# If these are not defined, a dictionary and filler dictionary will be
# created from $CFG_DICTIONARY and $CFG_FILLERDICT, with noise words
# removed from the filler dictionary and added to the dictionary (this
# is because the force alignment is not very good at inserting them)

#$CFG_FORCE_ALIGN_DICTIONARY = "$ST::CFG_BASE_DIR/falignout$ST::CFG_EXPTNAME.falign.dict";;
#$CFG_FORCE_ALIGN_FILLERDICT = "$ST::CFG_BASE_DIR/falignout/$ST::CFG_EXPTNAME.falign.fdict";;

# Use a particular beam width for force alignment.  The wider
# (i.e. smaller numerically) the beam, the fewer sentences will be
# rejected for bad alignment.
$CFG_FORCE_ALIGN_BEAM = 1e-60;

# Calculate an LDA/MLLT transform?
$CFG_LDA_MLLT = 'no';
# Dimensionality of LDA/MLLT output
$CFG_LDA_DIMENSION = 29;

# set convergence_ratio = 0.004
$CFG_CONVERGENCE_RATIO = 0.04;

# Queue::POSIX for multiple CPUs on a local machine
# Queue::PBS to use a PBS/TORQUE queue
$CFG_QUEUE_TYPE = "Queue";

# Name of queue to use for PBS/TORQUE
$CFG_QUEUE_NAME = "workq";

# (yes/no) Build questions for decision tree clustering automatically
$CFG_MAKE_QUESTS = "yes";
# If CFG_MAKE_QUESTS is yes, questions are written to this file.
# If CFG_MAKE_QUESTS is no, questions are read from this file.
$CFG_QUESTION_SET = "${CFG_BASE_DIR}/model_architecture/${CFG_EXPTNAME}.tree_questions";
#$CFG_QUESTION_SET = "${CFG_BASE_DIR}/linguistic_questions";

$CFG_CP_OPERATION = "${CFG_BASE_DIR}/model_architecture/${CFG_EXPTNAME}.cpmeanvar";

# This variable has to be defined, otherwise utils.pl will not load.
$CFG_DONE = 1;

return 1;

Analysis
The Sphinx config file consists of multiple stanzas, or sections containing related or otherwise similar statements.

Stanza 1: Config directories.
$CFG_VERBOSE = 1;              # Determines how much goes to the screen.

# These are filled in at configuration time
$CFG_DB_NAME = "train1";
$CFG_BASE_DIR = "/root/speechtools/SphinxTrain-1.0/train1";
$CFG_SPHINXTRAIN_DIR = "/root/speechtools/SphinxTrain-1.0";

# Directory containing SphinxTrain binaries
$CFG_BIN_DIR = "$CFG_BASE_DIR/bin";
$CFG_GIF_DIR = "$CFG_BASE_DIR/gifs";
$CFG_SCRIPT_DIR = "$CFG_BASE_DIR/scripts_pl";

# Experiment name, will be used to name model files and log files
$CFG_EXPTNAME = "$CFG_DB_NAME";

This stanza defines the base experiment directory and where files can be found within it. Most file names and directory paths defined later are built from these values.


 * Important statements:
   * $CFG_VERBOSE
     * Determines the verbosity of the trainer. During normal operation, the trainer can write logging data to standard output. The same data (with HTML tags added) is written to the HTML log file in the base experiment directory.
     * A value of 1 enables verbose mode; a value of 0 disables it.
     * Disabling verbose mode is ideal if you run the training task as a background job using &.
   * $CFG_DB_NAME
     * The name of the experiment. Use your experiment number as its value.
   * $CFG_BASE_DIR
     * The base directory, usually /mnt/main/Exp/
   * $CFG_ROOT_DIR
     * The root experiment directory.
     * Should be /mnt/main/Exp
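
The way later paths are derived from these values relies on Perl's double-quoted string interpolation. The sketch below illustrates this with made-up values (the /tmp/exp path and the name demo are placeholders, not real experiment settings). It also shows why the real file writes `${CFG_DB_NAME}_train`: without the braces, Perl would look for a variable named `$CFG_DB_NAME_train`.

```perl
use strict;
use warnings;

# Placeholder values for illustration only.
my $CFG_DB_NAME  = "demo";
my $CFG_BASE_DIR = "/tmp/exp/$CFG_DB_NAME";
my $CFG_LIST_DIR = "$CFG_BASE_DIR/etc";

# Braces are required: an underscore is a valid identifier character,
# so "$CFG_DB_NAME_train" would refer to a different variable.
my $CFG_LISTOFFILES = "$CFG_LIST_DIR/${CFG_DB_NAME}_train.fileids";

# A dot cannot be part of an identifier, so no braces are needed here.
my $CFG_DICTIONARY = "$CFG_LIST_DIR/$CFG_DB_NAME.dic";

print "$_\n" for $CFG_LIST_DIR, $CFG_LISTOFFILES, $CFG_DICTIONARY;
```

Changing only `$CFG_DB_NAME` or `$CFG_BASE_DIR` therefore moves every derived path with it, which is why these two values matter most when setting up a new experiment.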

Stanza 2: Waveform & Feat properties
# Audio waveform and feature file information
$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'sph';
$CFG_WAVFILE_TYPE = 'nist'; # one of nist, mswav, raw
$CFG_FEATFILES_DIR = "$CFG_BASE_DIR/feat";
$CFG_FEATFILE_EXTENSION = 'mfc';
$CFG_VECTOR_LENGTH = 13;

This stanza determines the location and format of the input corpus data. Among other things, it defines:
 * Where the waveform files are.
 * Their file extension.
 * The waveform file type.
 * Where to put the feats.
 * The file extension of the feats.


 * Important statements:
   * $CFG_WAVFILE_TYPE
     * The value assigned to this variable MUST match the waveform file format used by the corpus.
     * A mismatch will either crash the trainer or produce a very poor word error rate (WER).
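
One way to catch a mismatch before training is to look at the first bytes of a corpus file. The sketch below is a hypothetical helper (guess_wav_type is not part of SphinxTrain) that relies on two facts about the file formats: NIST SPHERE files begin with the text NIST_1A, and Microsoft WAV files begin with RIFF, four size bytes, then WAVE. Anything else is reported as raw, which is only a guess.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Guess the container format from the first bytes of a file.
# Returns 'nist', 'mswav', or 'raw' (raw = no recognizable header).
sub guess_wav_type {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $n = read($fh, my $head, 12);
    close $fh;
    die "cannot read $path: $!" unless defined $n;
    return 'nist'  if $head =~ /\ANIST_1A/;
    return 'mswav' if $head =~ /\ARIFF.{4}WAVE/s;
    return 'raw';
}

# Demonstration on a temporary file that carries only a SPHERE-style header.
my ($out, $tmp) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "NIST_1A\n   1024\n";
close $out or die "cannot write $tmp: $!";

print "$tmp looks like: ", guess_wav_type($tmp), "\n";
```

Comparing the result for a few corpus files against the value of `$CFG_WAVFILE_TYPE` is a cheap check to run before starting a long training job.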

Stanza 3: Input data properties & adjustments
$CFG_MIN_ITERATIONS = 1;  # BW Iterate at least this many times
$CFG_MAX_ITERATIONS = 10; # BW Don't iterate more than this, somethings likely wrong.

# (none/max) Type of AGC to apply to input files
$CFG_AGC = 'none';
# (current/none) Type of cepstral mean subtraction/normalization
# to apply to input files
$CFG_CMN = 'current';
# (yes/no) Normalize variance of input files to 1.0
$CFG_VARNORM = 'no';
# (yes/no) Use letter-to-sound rules to guess pronunciations of
# unknown words (English, 40-phone specific)
$CFG_LTSOOV = 'no';
# (yes/no) Train full covariance matrices
$CFG_FULLVAR = 'no';
# (yes/no) Use diagonals only of full covariance matrices for
# Forward-Backward evaluation (recommended if CFG_FULLVAR is yes)
$CFG_DIAGFULL = 'no';

# (yes/no) Perform vocal tract length normalization in training.  This
# will result in a "normalized" model which requires VTLN to be done
# during decoding as well.
$CFG_VTLN = 'no';
# Starting warp factor for VTLN
$CFG_VTLN_START = 0.80;
# Ending warp factor for VTLN
$CFG_VTLN_END = 1.40;
# Step size of warping factors
$CFG_VTLN_STEP = 0.05;
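
The three VTLN values describe a grid of candidate warp factors. As a rough illustration, the sketch below enumerates that grid, assuming both endpoints are included (the text above does not confirm this). The helper name warp_factors is hypothetical.

```perl
use strict;
use warnings;

# Enumerate warp factors from $start to $end (inclusive) in steps of $step.
# The step count is rounded to an integer first, because repeatedly adding
# 0.05 in floating point can drift and drop the last value.
sub warp_factors {
    my ($start, $end, $step) = @_;
    my $n = int(($end - $start) / $step + 0.5);   # number of steps, rounded
    return map { sprintf "%.2f", $start + $_ * $step } 0 .. $n;
}

my @warps = warp_factors(0.80, 1.40, 0.05);
print scalar(@warps), " candidates: @warps\n";
```

With the default values this gives 13 candidates, so enabling VTLN multiplies the amount of work done per utterance accordingly.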

This stanza defines initial training settings and features that can be applied to the input data to improve the word error rate (WER) in specific situations. For the most part, these settings should not be changed.
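
To make the interplay between $CFG_MIN_ITERATIONS, $CFG_MAX_ITERATIONS and $CFG_CONVERGENCE_RATIO more concrete, here is a minimal sketch of one plausible stopping rule: keep iterating while the relative improvement in likelihood stays above the convergence ratio, but never fewer than the minimum or more than the maximum number of times. This is an assumption made for illustration, not SphinxTrain's actual code. The function name iterations_run is hypothetical and the likelihood values are made up.

```perl
use strict;
use warnings;

# Return how many iterations would run for a given sequence of
# per-iteration log-likelihoods under the assumed stopping rule.
sub iterations_run {
    my ($min, $max, $limit, @loglik) = @_;
    my $iter = 0;
    my $prev;
    for my $cur (@loglik) {
        $iter++;
        last if $iter >= $max;                       # hard upper bound
        if (defined $prev && $iter >= $min) {
            my $gain = ($cur - $prev) / abs($prev);  # relative improvement
            last if $gain < $limit;                  # converged
        }
        $prev = $cur;
    }
    return $iter;
}

# Made-up log-likelihoods that improve quickly and then level off.
my @loglik = (-80.0, -70.0, -66.0, -65.0, -64.9);
print "default limits: ", iterations_run(1, 10, 0.04, @loglik), " iterations\n";
print "maximum of 3:   ", iterations_run(1, 3,  0.04, @loglik), " iterations\n";
```

Under this rule, lowering the maximum cuts training short while the model is still improving, and raising the minimum forces extra passes after it has levelled off.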
 * Important statements:
   * $CFG_MIN_ITERATIONS & $CFG_MAX_ITERATIONS
     * These define the minimum and maximum number of Baum-Welch (BW) iterations the trainer will run for a model.
     * The default values should not be changed!
     * The Sphinx trainer automatically determines how many BW iterations a model needs.
     * Forcing or preventing BW iterations will result in a terrible WER. Sphinx is better at determining what works best than you are.
   * $CFG_AGC
     * Stands for "Automatic Gain Control". It is a type of filtering that tries to maintain a specific volume by raising or lowering the gain as needed.
     * Arguably, this setting could worsen the WER on quiet portions, because amplifying them also picks up background noise.
     * As some telephone recording devices use AGC, it may already have been applied to the corpus when it was recorded.
     * The default is none.
     * Use at your own risk.
     * Please note that AGC has little or no effect on WER, as determined through experiments [Speech:Exps_0119|0119] and [Speech:Exps_0120|0120].
   * $CFG_CMN
     * "Cepstral/Channel Mean Subtraction/Normalization"
     * A good overview is linked from this page. According to it: "CMN is a technique used to reduce distortions that are introduced by the transfer function of the transmission channel."
     * In other words, it is a filtering process used to remove distortion introduced during recording.
     * The default is current, meaning that CMN is applied.
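
As a rough illustration of what mean subtraction does, the sketch below subtracts the average of each cepstral coefficient from every frame. It assumes that the current setting means the mean is computed over the utterance being processed. The function name cmn_current and the input numbers are made up for the example.

```perl
use strict;
use warnings;

# Subtract the per-utterance mean of each cepstral coefficient.
# $frames is a reference to an array of frames; each frame is an array
# reference holding one value per coefficient.
sub cmn_current {
    my ($frames) = @_;
    return [] unless @$frames;
    my $dim  = scalar @{ $frames->[0] };
    my @mean = (0) x $dim;
    for my $frame (@$frames) {
        $mean[$_] += $frame->[$_] for 0 .. $dim - 1;
    }
    $_ /= @$frames for @mean;    # turn the sums into averages
    return [
        map { my $f = $_; [ map { $f->[$_] - $mean[$_] } 0 .. $dim - 1 ] } @$frames
    ];
}

# Three toy frames with two coefficients each (made-up numbers).
my $normalized = cmn_current([ [1, 10], [2, 20], [3, 30] ]);
print join(' ', @$_), "\n" for @$normalized;
```

After this step every coefficient averages to zero over the utterance, so a constant offset introduced by the recording channel no longer affects the features.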