Speech:Spring 2014 Report
- Information - General Project Information
- Experiments - List of speech experiments
The purpose of this project was to give students experience working with large, complex software systems such as Sphinx, and to provide an open work environment in which to collaborate and contribute to a team. Students worked in subgroups while also contributing to the greater whole as each team made progress. In doing so, students were exposed to the complexity of speech recognition systems. Our objective this semester was to continue the work of previous semesters in building a world-class baseline for speech recognition using Switchboard as our data input.
System Status and Configuration
This section describes the current configuration of the systems and file structure, and what was changed to improve the system.
The OpenSUSE version currently installed on the Caesar server and the client drives (Asterix, Obelix, etc.) is 11.3. This system works very well and is able to run the Sphinx speech experiments that the speech project team members conduct on it. However, as noted in last year's log, 11.3 is outdated and no longer supported by its developers. The latest version of Fedora is v20. Fedora 20 provides software to suit a wide variety of applications. The storage, memory and processing requirements vary depending on usage. For example, a high-traffic database server requires much more memory and storage than a business desktop, which in turn has higher requirements than a single-purpose virtual machine. The minimum system configuration is only a baseline: requirements may differ, and most applications will benefit from more than the minimum resources. Fedora 20 can be installed and used on systems with limited resources for some applications. Larger package sets require more memory during installation, so users with less than 768MB of system memory may have better results performing a minimal install and adding to it afterward. For best results on systems with less than 1GB of memory, use the DVD installation image.
The Data group was tasked with verifying and maintaining the speech corpus.
It was discovered that Switchboard had two major releases due to errors in the first release. In checking the headers of our files we confirmed that we are using the latest release.
It was unclear in past semesters what the actual length of the data on Caesar was. The audio data was fairly simple to measure, as the Sound eXchange program (SoX) can easily retrieve audio length. We added the times for the 23 disks of audio and measured roughly 258.81 hours of audio. The transcripts were trickier: they had troubled past semesters, and their actual length was ambiguous.
Some reports put the length of the transcripts at 304 hours and others put it closer to 250. We created a script that measures the length by taking the start and stop times of each utterance and summing the durations. With this method, we measured the full transcripts at about 312 hours. The Modeling group was running similar tests and discovered that crosstalk was a factor. They created a script that accounted for the crosstalk and reported 256.4 hours of transcripts, which is much closer to the audio length.
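The gap between the two measurements can be illustrated with a minimal Python sketch. It assumes each utterance is available as a (conversation id, start, stop) tuple in seconds, which is a hypothetical layout and not the actual Switchboard transcript format, and it is not the script either group wrote. The first function adds every utterance duration; the second merges overlapping intervals within a conversation so that crosstalk is counted only once.

```python
from collections import defaultdict


def summed_seconds(utterances):
    """Add every utterance duration, counting overlapping speech twice."""
    return sum(stop - start for _conv, start, stop in utterances)


def merged_seconds(utterances):
    """Total time after merging overlapping utterances per conversation."""
    by_conversation = defaultdict(list)
    for conv, start, stop in utterances:
        by_conversation[conv].append((start, stop))

    total = 0.0
    for intervals in by_conversation.values():
        intervals.sort()
        current_start, current_stop = intervals[0]
        for start, stop in intervals[1:]:
            if start <= current_stop:
                # Overlap (crosstalk): extend the current interval.
                current_stop = max(current_stop, stop)
            else:
                total += current_stop - current_start
                current_start, current_stop = start, stop
        total += current_stop - current_start
    return total


# Hypothetical utterances: (conversation id, start seconds, stop seconds).
example = [
    ("sw02001", 0.0, 4.0),   # speaker A
    ("sw02001", 3.0, 6.0),   # speaker B talks over A for one second
    ("sw02001", 8.0, 10.0),  # speaker A
]
```

On the example data the summed figure is one second longer than the merged figure, which is the same kind of inflation that separates the 312 hour and 256.4 hour measurements.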
Tracking down the original, unaltered transcripts proved to be an issue this semester. The transcript file in the distribution directory was only about 10 hours long, a fraction of what it should have been. It was assumed that the transcripts in the Full corpus were complete, and their length is very close to the audio length, but this was not certain. We eventually located the latest version of the transcripts and created a master_trans directory in the distribution directory. We experimented with methods to extract and sort the data into a single text file before running out of time. We managed to create files for both the A and B channels but did not merge them into one ordered file. It is worth noting that each channel had roughly 259 hours of data, which is a better match to the audio length than the Full corpus transcripts.
The data directory is currently partially configured and ready for experiments. Towards the end of the semester it was determined that the management of audio files was inefficient and costing a great deal of server space. With this in mind an effort was made to revise the experimentation process and in turn to modify the configuration of the data files. The big change was to use symbolic links for audio files rather than generate them each time. At the very end of the semester it was determined while implementing the prepareExperiment.pl script that no audio files needed to be generated for data subsets. Previously whenever a new data subset was created, the user would first need to create a transcript file and then generate audio files for that data. This was inefficient because Sphinx is smart enough to determine what audio files it needs. Instead, what we did was create one directory containing all the audio files for the entire full corpus. When an experiment is prepared we link to the master audio directory, and Sphinx uses the fileids to determine which audio files it needs and ignores the rest.
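As a rough illustration of the linking step, the Python sketch below creates a symbolic link from an experiment directory to a master audio directory. The function name, the link name `wav` and the example paths are assumptions made for the illustration; the actual prepareExperiment.pl script is written in Perl and may lay the experiment out differently.

```python
import os


def link_master_audio(experiment_dir, master_audio_dir, link_name="wav"):
    """Create <experiment_dir>/<link_name> as a symlink to the master audio directory."""
    os.makedirs(experiment_dir, exist_ok=True)
    link_path = os.path.join(experiment_dir, link_name)
    if os.path.islink(link_path):
        # Replace a stale link rather than fail on the second run.
        os.remove(link_path)
    # Raises FileExistsError if a real file or directory already has this name.
    os.symlink(master_audio_dir, link_path)
    return link_path
```

Because only a link is created, preparing a new data subset costs no extra disk space no matter how many audio files the subset refers to.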
Experiment Directory and Setup
The core of speech recognition is running experiments. Sphinx provides many tools for running an experiment such that one could conceivably run an experiment right out of the box. While this is true, because of the specific nature of our research we need to do some level of preparation before we can run an experiment. To run an experiment, the following files are needed by the system:
- A transcript corresponding to the audio files. The transcript needs to map the audio files to lines in the transcript.
- A dictionary file containing all the words in the transcript and their respective pronunciations (note: only words in the transcript should appear in the dictionary).
- A filler dictionary containing non-words such as silence and laughter which will be filtered out when building an acoustic model.
- A list of audio files which will be used when training.
- A list of phonemes which must contain all phonemes present in the dictionary and filler dictionary.
- A directory containing all the actual .sph audio files.
There are other requirements as well, but Sphinx generates those files using its setup scripts. Assuming that the files listed above exist and that they are properly identified and located in Sphinx's training and decoding configuration files (etc/sphinx_train.cfg and etc/sphinx_decode.cfg respectively), one can begin training. These files change between experiments, both because we have not yet reached a point where we can effectively train on the full data and because we are still optimizing the files to produce the best results. Other groups continue working to improve the quality of these files. Areas such as the filler dictionary are still largely unexplored, and even slight modifications there have shown good results. The transcripts contain many out-of-vocabulary (OOV) words, such as [LAUGHTER] and <PARTIAL-WORD>-[LAUGHTER], which may be affecting our results. What we do with these words has an impact on our results and the quality of our models. Because of this variability, it is necessary to prepare unique experiments with different file modifications and input parameters. This is where experiment setup and configuration come into play. Previous semesters wrote a number of scripts to help automate the processes needed to get an experiment ready to run and generate the files listed above.
This semester these scripts were refined to make preparing an experiment faster and less confusing. At the beginning of the semester, preparing an experiment required over five custom scripts and potentially hours of time for large data sets. Now, after much refinement, someone can prepare an experiment by running two scripts that take only seconds to run (for more information see the Artifacts - Scripts section). This drastically expedited the experimentation process and let us run more experiments and focus on parameters and configurations rather than data preparation. As mentioned, input parameters play a vital role in training and decoding. When an experiment is prepared, two configuration files are generated. These files contain lists of parameters that the trainer and decoder use to fine-tune and improve models in terms of time and accuracy. A big part of experimentation is finding the best parameter list for a data set of a given length and number of utterances. This is the second part of preparing an experiment, and deals with the inner workings of modeling. Reducing the complexity of experiment preparation left more time to devote to this.
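The consistency rules in the file list above (every transcript word has a dictionary entry, the dictionary holds only transcript words, and the phoneme list covers both dictionaries) can be checked mechanically before training. The following Python sketch is one way to do that under simplifying assumptions: dictionary lines are taken to be a word followed by its phonemes, alternate pronunciations are not handled, and all names and sample entries are hypothetical rather than taken from the project's scripts.

```python
def dictionary_words(lines):
    """Map each dictionary word to its list of phonemes."""
    entries = {}
    for line in lines:
        parts = line.split()
        if parts:
            entries[parts[0]] = parts[1:]
    return entries


def check_experiment_files(transcript_words, dictionary, filler, phones):
    """Return (missing words, unused dictionary words, missing phonemes)."""
    words = set(transcript_words)
    # Words in the transcript with no entry in either dictionary.
    missing_words = sorted(words - set(dictionary) - set(filler))
    # Dictionary entries that never occur in the transcript.
    unused_words = sorted(set(dictionary) - words)
    # Phonemes used by either dictionary but absent from the phoneme list.
    used_phones = set()
    for pronunciation in list(dictionary.values()) + list(filler.values()):
        used_phones.update(pronunciation)
    missing_phones = sorted(used_phones - set(phones))
    return missing_words, unused_words, missing_phones


# Hypothetical sample data for illustration only.
sample_dictionary = dictionary_words([
    "HELLO HH AH L OW",
    "WORLD W ER L D",
    "EXTRA EH K S T R AH",
])
sample_filler = dictionary_words(["<sil> SIL", "[LAUGHTER] +LAUGH+"])
sample_phones = ["HH", "AH", "L", "OW", "W", "ER", "D", "SIL"]
sample_transcript = ["HELLO", "WORLD", "[LAUGHTER]", "UNKNOWN"]
```

An experiment is ready for training when all three returned lists are empty.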
The main responsibility this semester was to research possible updates to the software suite that makes up the Speech Recognition program. There are five main tools to be examined this semester, as per Prof. Jonas's instructions: the Trainer, the Decoder, the Language Model Toolkit, the Scorer, and a Java Wrapper program (which is not yet in use). Many of the tools currently in use are outdated, because they worked accurately enough for normal use during previous semesters. Updating the tools has been considered in previous semesters, but it was determined that the downtime and the need to re-configure the system were not worth the minimal performance upgrade. This semester, the decision to upgrade any of the tools will be carefully considered, with the costs and benefits of each possible update weighed. Currently, the main hindrance to upgrading all tools is the continued use of OpenSUSE 11.3. Many of the tools require a more modern OS, or at the very least an update to OpenSUSE 13.x. The use of Fedora should also be considered, as many of the updated tool versions support this operating system as well, and the Systems group showed it to be a success on our Rome system.
The first tool used in the Speech Recognition process is the trainer. This is the actual software procedure, run through a terminal interface, that performs "trains" on the provided data. This procedure analyzes the supplied recordings to "learn" their characteristics; that is, to recognize patterns and acoustic occurrences in the audio files, as determined by several user-defined variables. A user can run a train in many different ways and in several different lengths and sizes. A user who wishes to run a train should refer to the main webpage, under the "information" link, where there are detailed instructions under the "Project Notes" section. Additionally, an index of previously run trains can be found under the "experiments" link. All trains (either performed or not) MUST be documented regardless of outcome, even if they offer no new insight into the system and produce no worthwhile results. This application is used heavily by both the Modeling and Experiments groups. Please see the "Model Building" section of the documentation for more information (it can be found here). The currently installed trainer version is SphinxTrain 1.0, which has been in use for several semesters. The newest version is SphinxTrain 1.0.8, which is openfst-based and best used with Sphinx4. It includes several bug fixes as well.
Language Model Toolkit
Once a user has completed a successful train and documented it both in the Wiki and on the server, a Language Model must be built. A Language Model uses the results of the performed train to analyze the frequency of words within a corpus and predict their future occurrence. Each word is given an integer value as recorded, and the LM allocates memory to its occurrence. Building a language model helps the software to "understand" how any word is used or said, and is necessary to run the decode procedure. Full documentation can be found here. Currently, the CMU-Cambridge Statistical Language Modeling Toolkit v2 is in use. Version 0.7 is the latest version and provides no real benefits other than minor bug fixes. It appears that the CSLMT is no longer being actively developed or updated, and no replacement has been created.
Once a Language Model has been established, the decode process can be run. The decoder uses the information generated through the Language Modeling and Acoustic Modeling steps (which were based on the trains performed earlier) and applies it to a set of audio files whose speech we wish to recognize. These decodes can take anywhere from as little as 1 hour to over 100 hours. Several user parameters are established (all of which can be found in the documentation) and can be tweaked to attempt better success during the decode. After the decode has run and produced its output (what the program actually interpreted the audio file as), the results can be compared to the written transcript of the input audio file. This data is dealt with in the next step. The current decoder in use is Sphinx 3.7, which has been used for quite some time and remains stable. The latest version is Sphinx 4.0, which is an overhaul of the engine in Java. It is faster, slightly more accurate, and more flexible. However, it can only be used with a suitable Java Wrapper (covered below), which poses problems for our OpenSUSE system. It relies on the Java SE 6 Dev Kit (or higher) and Ant 1.6, and the need for these dependencies is a drawback. Documentation of Sphinx 4.0 can be found here: Sphinx4 Documentation
Once the decode has run, the user feeds the results into a scoring program. This program can display several variables the user defines, such as word total, error rate, word insertion, word deletion, etc. These results are the ultimate test of the accuracy of our Speech Recognition System. Several experiments have results posted under the "Experiments" link; to date the most accurate has a 25% word error rate. The current scorer in use is SCLite 2.3. The newest version is SCLite 2.9, which is installed with SCTK 2.4.8. The advantages are unclear, however, as the documentation has not kept up with revisions.
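For readers new to the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The Python sketch below computes it with a standard edit-distance table. It is an illustration of the metric only, not a description of how SCLite is implemented, and the function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions are counted against the reference length, the rate can exceed 100% when the hypothesis contains many extra words.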
While the user may find it satisfactory to perform the decode using Sphinx3, the latest version, Sphinx4, promises more accurate results when running our Speech Recognition System. Unfortunately, the use of Sphinx4 requires several dependencies and upgrades to our system, as it is written in Java (as opposed to C). One way to work around these dependencies is to use a Java Wrapper program. While this tool is not currently installed or in use on the system (and has never been in any previous semester), it provides the ability to wrap Java programs in a "shell" that allows them to execute on a Linux-based system as a daemon process. This would allow the upgrade of the Speech Recognition system to Sphinx4 and subsequently SphinxTrain 1.0.8. The most popular version of this software is Java Service Wrapper by Tanuki Software: JSW Homepage. However, support for openSUSE is unknown, as it is only known to work with SUSE Enterprise Server. While this would be a large change, some investigation is warranted, as a major upgrade to the system could be performed and possibly enhance the overall Speech Recognition System. All factors will have to be considered, including system downtime, upgrade procedures, dependencies, etc. An alternative, Yet Another Java Service Wrapper, found at: YAJSW, provides another option. It provides the same functionality and is tested on openSUSE 11.1, although it is not as developed or as well supported as Java Service Wrapper.
The following section outlines the new artifacts created this semester that are important to the current system. The primary artifacts that were created this semester relate to experiment preparation and our data artifacts. Some items in this list are no longer needed under the current architecture but may still be useful for future semesters. These items have been marked as such.
A few major data artifacts were created this semester which are important to note. The first is the master dictionary file, which contains all dictionary words present in the Switchboard corpus as well as their pronunciations. This dictionary was discovered this semester and allowed us to move forward in building much larger models. While previous semesters had been limited to the five hour and ten hour data sets, we were able to begin training on fifty hours, one hundred hours, and potentially the full data. This file can be found at /mnt/main/corpus/dist/custom/switchboard.dic, and is currently used by all our most recent scripts which require a dictionary. The second important data artifact this semester is the master audio file directory (/mnt/main/corpus/switchboard/audio/utt). This directory contains all the sph files for the entire corpus. All experiments now link to this directory for their audio files. Audio files outside of this master directory are no longer needed.
During this semester there was a large increase in the number and diversity of experiments. To support this and to make producing experiment results more efficient, new scripts were written and existing scripts were improved. These scripts have been documented on the scripts page to provide more clarity as to their purpose. Additionally, that page can be used to identify which scripts are still needed or useful for experimentation. To highlight the progress of this semester, the following scripts were created which may be of use to future classes:
Data
These scripts directly relate to manipulation and evaluation of the data components of an experiment (the transcripts, audio files, dictionaries)
- createSubTranscript.pl -- This script creates a new transcript file, derived from the base transcript provided, of the specified number of hours and starting at the specified hour. This script uses the same time calculation as corpusSize2.pl, which differs from the way Sphinx calculates time (Sphinx does not account for overlap in the audio files).
- copySph2.pl -- This script was used in conjunction with createSubTranscript.pl. Once a new transcript file exists, this script will create symbolic links to the sph files needed for that transcript. This script is no longer needed because experiments now link to the master audio directory rather than to individual audio files.
- copySph3.pl -- This script provides the same function as copySph2.pl, except that it generates the actual sph files using sox rather than creating symbolic links. This script was used to generate the base audio files at /mnt/main/corpus/switchboard/full/audio. There should be no need to run this script again, but it still exists if needed. One thing to note is that a small number of audio files are missing from the master sph directory. Thus far only a couple of files from mini have been found to be missing.
- corpusSize2.pl -- This script takes in a transcript file as an argument and returns the total length of time for that transcript. This script accounts for overlap in the audio files. Using this script the full corpus evaluates to 250 hours of data.
- corpusSize0.pl -- This script also returns the total length of a transcript file but does not account for overlap in the audio files. This is the same time calculation that Sphinx uses when it computes the total training time. Using this script the total size of the full corpus is 308 hours.
Master Scripts
Over the course of the semester a point was made to automate processes to make training easier and less time consuming. Several scripts were produced. Some of these scripts are no longer used but may function as a framework for other potentially useful scripts.
- prepareExperiment2.pl -- This is the most recent automation script and the one that should currently be used to run an experiment. This script automates the entire training process up to feats generation (it does not update the senone value, density or other trainer parameters, which should be modified by hand after running it). The core advantage of this script is that it relies on symlinks for audio files to reduce the amount of space used and to speed up several processes. This script also makes use of the newest versions of genTrans (genTrans10) and pruneDictionary (pruneDictionary4), which drastically improve the performance of generating the transcript and dictionary files.
- generateFeats.pl -- This script calls scripts_pl/makeFeats.pl and removes the symbolic link to the audio files, and creates a new link pointing to the actual audio directory. Run this script after prepareExperiment.pl to fully prepare an experiment up to acoustic model training.
- master_run_train.pl -- This script was the semester's first fully automated solution and is no longer used for training. It was designed to ease the tedious process of Running a Train. The aim of this script was to eliminate redundancies in script parameters and to eliminate the possibility of human error in missing steps or providing incorrect parameters. At each step the process is described, and user input allows varying degrees of customization of the input parameters. While this script is no longer used, it could provide an excellent template for a tutorial script to walk new users through the process of training. The advantage it has over prepareExperiment is that it explains each step in the process in detail. While this sacrifices efficiency for power users who want to prepare an experiment quickly, it is great for new users who are unsure of what steps are needed to prepare an experiment.
Utility Scripts
The following scripts provide additional utility or are used by other scripts in the system.
- genTrans10.pl -- genTrans10 generates a parsed transcript for the experiment given the corpus data subset. It is much faster than any of its predecessors because it no longer runs sox to generate sph files. This step is no longer needed because we create symbolic links. This script is used by prepareExperiment2.pl.
- pruneDictionary4.pl -- This script is a complete rebuild of the pruneDictionary scripts. These scripts are used to read in a transcript and create a list of unique dictionary words mapped to pronunciations. This version of the script runs significantly faster than old versions by using powerful unix commands. This script is used by prepareExperiment2.pl.
- Mono Scripts -- The following scripts are used to generate mono audio data
Findings and Results
The following section outlines the progress actually made this semester. It gives an overview of the methodologies we applied and the results they produced, followed by the actual results of our largest experiments.
Much of the work done this semester was to directly improve the quality of the models we were building. Early in the semester the modeling efforts were primarily expended determining the ideal inputs for the trainer. Early on these were identified as the senone count and the density mixtures. As other researchers in this field have done, one way to determine these optimal parameters is to run many experiments, with variable changes in relation to control experiments. Many experiments were completed over the past several months. In doing so we were able to generate graphs and data charts showing what the optimal input parameters were for a given set of data. Some important results are as follows:
- Density values should be powers of 2 (although they are not required to be) and should be between 8 and 64. For most smaller data sets under 100 hours the density value should be no higher than 32, as using larger densities begins to reduce accuracy and drastically increases the real time factor. The density size should scale with the size of the data, so smaller data uses fewer density mixtures.
- When generating an acoustic model of density X, all density values less than the max number of densities which are powers of two are also created during training. For instance, if one were to create an acoustic model of density 16, acoustic models for a density of 8, 4 and 2 would also be created in the model_parameters directory. This means that training on a higher density results in models for all the smaller densities too.
- The senones value should be somewhere between 1000 and 10000, once again scaling with the size of the data.
- Sphinx provides a script in <exp>/scripts_pl/ called tune_senones.pl. This script will attempt to train and decode a series of acoustic models within a provided senone value range with the provided increment. To run this script one first needs to run a normal experiment. Tune_senones is highly optimized to use the existing model as a baseline for building the others, which allows it to skip several training steps and improve speed. An alternate version of this script exists in experiment 0251/scripts also called tune_senones.pl. This script decodes on all senones and also on all density mixture arrangements (the original script only decodes on the highest density for each senone).
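The parameter search described in the list above can be pictured as a grid of senone counts and density values. The Python sketch below only enumerates that grid; it does not call Sphinx or tune_senones.pl. The function names are hypothetical, the senone range is assumed to include both endpoints, and the density list follows the example given above (a final density of 16 also yields 8, 4 and 2), since this report does not say whether a single-density model is kept as well.

```python
def intermediate_densities(final_density):
    """Powers of two below final_density, e.g. 16 -> [2, 4, 8]."""
    densities = []
    value = 2
    while value < final_density:
        densities.append(value)
        value *= 2
    return densities


def senone_sweep(start, stop, step):
    """Senone counts from start to stop (inclusive) in the given increment."""
    return list(range(start, stop + 1, step))


def decode_grid(senones, final_density):
    """Every (senone count, density) pair a brute-force decode would cover."""
    densities = intermediate_densities(final_density) + [final_density]
    return [(s, d) for s in senones for d in densities]
```

For example, sweeping senones from 1000 to 3000 in steps of 1000 with a final density of 8 gives nine combinations to decode, which shows how quickly a brute-force search grows.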
Towards the end of the semester, our efforts were set on fine-tuning the parameters of interest already identified. Using a variety of scripts such as tune_senones.pl, the group's core objective was to produce a large number of acoustic models and determine the ideal configuration through a brute force method. Early on in the process the decision was made to use a Language Model for the evaluation data set. This provided distinct advantages. The biggest was that it drastically reduced the real time factor, because it cut back on the size of the trees needed to decode while maintaining accuracy.
Other new processes were explored as well which did not necessarily improve our results but did provide more context for the complexity of the system we were working with. Research into areas such as MLLT was done in hope of improving the accuracy of our test on eval experiments. Research initially suggested that we could see up to a 25% increase in performance using MLLT. After going through the process of configuring the system with the pylab libraries we were able to successfully run a train with MLLT. Unfortunately this did not produce the results we expected, and further research proved that based on our data MLLT was not the best way to go.
By building these models we were able to produce a variety of acoustic models with an improved WER over past semesters. Our best results were a five hour model with a WER of 15% and a 100 hour model with a WER of 40%.
The audio data used for our experimentation is from the Switchboard corpus. This data was originally recorded at a sampling rate of 8 kHz with 8 bit precision. This poses a huge problem when training and decoding. In the spring of 2013, audio was converted from 8 bit to 16 bit precision for each utterance; however, the resulting audio still contained both channels at a sampling rate of 8 kHz. This process was done using the GenTrans scripts. This 16 bit format enabled basic training with Sphinx 3. However, this yielded very poor decode results.
As specified by the CMU tutorial for developers on their website, it is critical to have audio files in the PCM (Pulse Code Modulation) linear format. By default and per recommendation of CMU, Sphinx 3 is configured to train acoustic models from 16 kHz and 16 bit single channeled files in MS wav format. However, Sphinx 3 does support other formats such as sphere and raw. For our experiments, we used the sphere format for our audio data.
The basis of our research was comparing CMU's recommendations for audio format to the audio format that is currently in the Switchboard corpus. Upon close inspection, we found that the audio was dual channeled and that many of the utterances created had cross talk in them. Our theory was that this cross talk was interfering with the training and decoding and resulting in the creation of poor acoustic models.
We decided to train with single channeled audio. To do this we needed to split each two channeled audio file in the corpus into separate single channeled files. For all you bio majors out there, a good analogy would be file mitosis. We installed sphere to pipe (sph2pipe), which is a program that manipulates sphere audio files. The split was carried out by a Perl script that used sphere to pipe to split the audio files.
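To show what splitting a two channeled file means at the sample level, the Python sketch below de-interleaves stereo samples into two mono channels. It is a conceptual illustration only: it assumes the audio has already been decoded to raw 16 bit samples stored in alternating order, it ignores the SPHERE header and sample encoding, and it is not the Perl script or the sphere to pipe invocation that was actually used.

```python
from array import array


def split_channels(interleaved):
    """Split interleaved 16 bit stereo samples into two mono channels."""
    if len(interleaved) % 2:
        raise ValueError("stereo data must hold an even number of samples")
    channel_a = interleaved[0::2]  # samples at even positions
    channel_b = interleaved[1::2]  # samples at odd positions
    return channel_a, channel_b


# Hypothetical samples: positive values on one channel, negative on the other.
stereo = array("h", [1, -1, 2, -2, 3, -3])
```

Each resulting channel holds only one speaker's side of the conversation, which is what removes the cross talk from the training audio.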
Another thing that we noticed when running trains in our experiments is that the sphinx_train.cfg file did not specify that the audio was sampled at 8 kHz. We theorized that this was another underlying factor in why our decodes had an error rate of 60% upon evaluation. Sphinx 3 does have parameters to specify 8 kHz audio. Using them requires further alterations to the configuration file, because many of the default values are based on 16 kHz audio. The parts of the configuration file that needed to be altered are the feature extraction parameters. We set $CFG_WAVFILE_SRATE to 8000.0, $CFG_NUM_FILT to 31, $CFG_LO_FILT to 200, and $CFG_HI_FILT to 3500. By using these recommendations from CMU (http://cmusphinx.sourceforge.net/wiki/tutorialam), we further reduced the error rate by another 2%.
This semester the decoding process was refined and updated. The biggest find was that previous semesters were decoding incorrectly. Sphinx provides a series of scripts that can be used to decode. Previous semesters had not realized this and were writing their own scripts to do this instead. By running a setup script (which is now called during prepareExperiment2.pl), a new file will be created in <exp>/etc called sphinx_decode.cfg. This file contains the configurations for a decode. Additionally, several new scripts are created in <exp>/scripts_pl/decode, including slave.pl (the main script called to run a decode by the user) and s3decode which is called by slave.pl to start the decoding process and passing in input parameters.
Several parameters for decoding were determined to be important. First, we discovered that the decoder also accepts an npart argument, which allows the user to parallelize the decode locally across different cores. This cut the total completion time of our decode in half (although it has no impact on the real time factor). Other crucial parameters include the beam width, phone beam and word beam. These parameters have a strong impact on the results of a decode, in terms of both the real time factor and the WER. The key is to balance them to produce the optimal results. Beam values restrict the size of the trees produced by the decoder, which in turn can improve the real time factor. If the beams prune too much, the decode begins to lose accuracy because it drops potentially valid options; if they do not prune enough, the real time factor increases drastically because there are too many options for the decoder to process. To determine the ideal values it is preferable to decode on smaller sets of data. Roughly three hundred sentences, or half an hour of data, is enough to produce decode results rapidly. This means that one can find the ideal beam parameters relatively quickly and then scale the data up. Tests found that results on smaller data carried over to larger data; that is, the ideal configuration for smaller data was also the best for larger data.
We also realized that, with this new decoding method, the scoring tool SCLite was already being used. After slave.pl initializes the decoder and the decoder runs to completion, it performs an alignment on the results and scores them. These files can be found in <exp>/result. Two kinds of files are of interest here: the .match files, which are the hypothesis transcripts, and the .align files, which are the scored alignment results. The final WER is listed in the align file. Logs are also created and can be found in <exp>/logdir/decode. These files contain the real time factor. Typically the log will be split into multiple parts, one for each decode partition the user creates using the npart option of the decoder.
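The idea behind splitting a decode into parts can be sketched as dividing the list of utterances into contiguous, nearly equal chunks, one per core. The Python function below is an assumption about the general approach, written for illustration; it is not taken from the Sphinx decode scripts, which may divide the work differently.

```python
def partition(utterance_ids, npart):
    """Split the list into npart contiguous, nearly equal parts."""
    if npart < 1:
        raise ValueError("npart must be at least 1")
    base, extra = divmod(len(utterance_ids), npart)
    parts = []
    start = 0
    for index in range(npart):
        # The first `extra` parts take one additional utterance each.
        size = base + (1 if index < extra else 0)
        parts.append(utterance_ids[start:start + size])
        start += size
    return parts
```

With two parts each core handles about half of the utterances, which is consistent with the halved completion time reported above, while the amount of work per utterance stays the same.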
One focus for this semester was on giving all members of the project a better experience when using the system to run experiments, particularly the Modeling group, who were tasked with running a significant number of experiments. To do this, members examined the system inherited from previous semesters and analyzed what information we had and what we could add or create. Early on, most of the group's time was spent learning the speech system we had (Sphinx) by reading through the existing articles on the wiki, talking with the Modeling group, and doing external research on CMU's wiki.
Over the semester we began modifying the landscape of the wiki by creating new, updated and informative pages that will significantly help new students in the future. One result was a Master Script for the Training process that acts as an easy to use wizard that walks the user through each step with little confusion.
The first task was to completely redesign the Experiment Information page with high-level information about each process in an Experiment (Train, Language Model, Decode) and a variety of helpful links that will be extremely valuable to new students. We had determined that the documentation was unclear or contradictory because multiple versions of the same process existed and old documentation was not labeled as such.
- Updated all Experiment guides (Training, Language Model building, Decoding) with information directly from the Modeling group / creators of new scripts and methods.
- Got Speak up and running on Rome, accessed through Caesar (https://caesar.unh.edu). The code base lives on Rome located here: /var/www/
- Created a scripts documentation page that includes information about all our scripts, such as the usage, author, description and source code.
100+ Hour Models
Exp: 0252 - 011
Input values for parameters:
Corpus: 100hr/train3 Density: 32 Senone Value: 6000
Aligning results to find error rate
- WORD ERROR RATE: 51.3% (30522/59507)
- REAL TIME FACTOR: 3.17%
- TOTAL TIME DECODE TO RUN: 15.85 Hours
- LANGUAGE MODEL: Built from full transcript
- Size of Test on train set: 121.02
- Size of training set: 5 hours
$DEC_CFG_BEAMWIDTH = "1e-50";
$DEC_CFG_PBEAM = "1e-50";
Exp: 0251 - 012
This Experiment was used for the final results of the Avengers Team. It was the competition's best result with this data set, training configuration and decode configuration.
Experiment Overview Setup
We used our Perl script prepareExperiment2.pl to create all the necessary files, including:
- Training transcript
- Phone list
- Filler dictionary
- Training file IDs
Our Acoustic model uses over 125 hours of data, from conversation 3170 through the end of disk 22. This subset was chosen because it lacks cross talk.
The Testing data was created using genTrans10.pl, which generated the fileids and trans files for two sets (test and test2):
- 5 hours of testing data for the final result
  - Test data is a subset of the training data, with lines containing the following removed:
    - Bracketed words
    - Words with curly braces
    - Words ending in _1
- 300 lines of the test data
  - Small subset for tuning purposes
  - This data is a subset of the test data
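The line filtering described above can be sketched as follows. This is not genTrans10.pl: the function name, the regular expression and the sample lines are hypothetical, and the pattern simply assumes that any square bracket, any curly brace, or any token ending in _1 disqualifies a transcript line.

```python
import re

# Any square bracket, any curly brace, or a token that ends in "_1".
EXCLUDE_RE = re.compile(r"[\[\]{}]|\S_1(?:\s|$)")


def keep_line(line):
    """Return True if a transcript line contains none of the excluded forms."""
    return EXCLUDE_RE.search(line) is None


# Hypothetical transcript lines for illustration only.
lines = [
    "yeah i think so",
    "well [laughter] maybe",
    "that is {yuppiedom} for you",
    "okay_1 then",
    "see you later",
]
kept = [line for line in lines if keep_line(line)]
print(kept)
```

Only the first and last sample lines survive, since each of the others contains one of the three excluded forms.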
We added the following lines to the filler dictionary:
- [NOISE] +noise+
- [LAUGHTER] +laugh+
- [VOCALIZED-NOISE] +vocalnoise+
We added the following lines to the phonelist in alphabetical order:
Acoustic Model Setup
The Acoustic model was created using the 3170/train2 data set located here:
This contains 125+ hours of audio files. The dictionary used was the switchboard corpus dictionary located here:
In the Sphinx Configuration file we changed the density to 64 and the senone value to 8000.
The density of 64 was chosen because during the training process the Acoustic Models for all the Gaussian Mixtures were created as well; essentially this produced models using densities of 32, 16 and 8. This allowed the group to test using different densities instead of having to create multiple experiments with varying densities.
The Senone value of 8000 was chosen based loosely on a chart found on CMU's website.
Language Model Setup
The Language Model was created using the test transcription created during setup. The 5 hours of recording resulted in a trigram language model with a 3683-word vocabulary. We used the ARPA-format Language Model for decoding because of its small size; larger Language Models would use the DMP file for decoding.
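As a rough illustration of what "vocabulary" and "trigram" mean here, the sketch below counts both from a list of transcript sentences. It is not the tool the team used to build the model: the function name, the sentence markers and the sample sentences are assumptions, and a real language model would also estimate smoothed probabilities rather than raw counts.

```python
from collections import Counter


def vocab_and_trigrams(sentences):
    """Return the set of distinct words and a Counter of word trigrams."""
    vocab = set()
    trigrams = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        # Pad with sentence-boundary markers so edge words form trigrams too.
        padded = ["<s>"] + words + ["</s>"]
        for i in range(len(padded) - 2):
            trigrams[tuple(padded[i:i + 3])] += 1
    return vocab, trigrams


# Hypothetical two-sentence transcript.
vocab, trigrams = vocab_and_trigrams(["i do not know", "i do know"])
print(len(vocab), len(trigrams))
```

On the 5 hour test transcript, the size of the word set would correspond to the 3683-word vocabulary quoted above.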
When decoding we started with the default parameters to act as the baseline. From there we ran the test data set to tune the parameters in an effort to reduce the computation time. The test data set was small enough that a high xRT would still quickly produce results.
We tuned the various beams, as well as other parameters, until we reached the best balance of WER and xRT for our hardware setup.
The following are parameters that were entered in the etc/setup_decode.cfg file:
$DEC_CFG_LANGUAGEMODEL_DIR = "$DEC_CFG_BASE_DIR/LM";
$DEC_CFG_LANGUAGEMODEL = "$DEC_CFG_LANGUAGEMODEL_DIR/tmp.arpa";
$DEC_CFG_LANGUAGEWEIGHT = "11";
$DEC_CFG_WORDPENALTY = "0.7";
$DEC_CFG_BEAMWIDTH = "1e-50";
$DEC_CFG_PBEAM = "1e-50";
$DEC_CFG_WORDBEAM = "1e-30";
$DEC_CFG_MAXHMMPF = "2000";
$DEC_CFG_CIPBEAM = "1e-7";
$DEC_CFG_MAXCDSENPF = "2750";
$DEC_CFG_MAXWPF = "10";
The following are parameters that were entered in the scripts_pl/decode/s3decode.pl:
-lw => $ST::DEC_CFG_LANGUAGEWEIGHT,
-beam => $ST::DEC_CFG_BEAMWIDTH,
-pbeam => $ST::DEC_CFG_PBEAM,
-wbeam => $ST::DEC_CFG_WORDBEAM,
-maxhmmpf => $ST::DEC_CFG_MAXHMMPF,
-ci_pbeam => $ST::DEC_CFG_CIPBEAM,
-maxcdsenpf => $ST::DEC_CFG_MAXCDSENPF,
-maxwpf => $ST::DEC_CFG_MAXWPF,
-wip => $ST::DEC_CFG_WORDPENALTY
After obtaining results from the small test data set, we tuned the parameters and re-ran the test to see the relative changes in xRT and WER. We decoded with the 64 density Acoustic Model and held each change to a 1:1 trade-off between xRT and WER: for example, if a parameter change gains 1% accuracy, it may cost at most 1 in xRT, and vice versa.
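One possible reading of this 1:1 rule can be written as a simple acceptance test. The sketch below is our interpretation, not a script the team used: the function name and the sample numbers are hypothetical, and it assumes one percentage point of WER is weighed equally against 1.0 of xRT.

```python
def accept_change(old_wer, old_xrt, new_wer, new_xrt):
    """Accept a parameter change if it is no worse than a 1:1 trade.

    WER is in percentage points and xRT is the real time factor.
    Lower is better for both, so the summed change must not be positive.
    """
    wer_change = new_wer - old_wer
    xrt_change = new_xrt - old_xrt
    return wer_change + xrt_change <= 0


# Hypothetical tuning steps starting from WER 48.0% at xRT 3.0.
even_trade = accept_change(48.0, 3.0, 47.0, 4.0)   # -1.0 WER for +1.0 xRT
too_slow = accept_change(48.0, 3.0, 47.0, 4.5)     # -1.0 WER for +1.5 xRT
faster = accept_change(48.0, 3.0, 48.5, 2.0)       # +0.5 WER for -1.0 xRT

print(even_trade, too_slow, faster)
```

Under this reading, the first and third changes would be kept and the second rejected.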
Using this method we reduced xRT significantly while only sacrificing a small amount of accuracy.
We decided to provide two sets of results using identical parameters and setup, differing only in the density of the Acoustic Model, to show the difference in xRT and WER between densities. Each result is broken into two sections: the summary log of each part the decode was split into, followed by the total overall score of the decode. The first set of results is from the decode on our 32 Density Acoustic Model and the second from the 64 Density model, as the logs confirm: the first set reports 4128 context-independent Gaussians per frame (129 senones × 32) and the second reports 8256 (129 × 64). Density sets the number of Gaussians evaluated per frame, which in turn significantly affects both xRT and WER.
- Computed xRT: 1.3
- WER: 48.67%
Part 1 Log
INFO: stat.c(206): SUMMARY: 983981 fr; 2119 cdsen/fr, 129 cisen/fr, 68060 cdgau/fr, 4128 cigau/fr, 2.25 xCPU 2.25 xClk [Ovhrd 0.14 xCPU 0 xClk]; 1906 hmm/fr, 20 wd/fr, 0.32 xCPU 0.32 xClk; tot: 2.58 xCPU, 2.58 xClk
Part 2 Log
INFO: stat.c(206): SUMMARY: 941489 fr; 2145 cdsen/fr, 129 cisen/fr, 68922 cdgau/fr, 4128 cigau/fr, 2.27 xCPU 2.27 xClk [Ovhrd 0.14 xCPU 0 xClk]; 1946 hmm/fr, 21 wd/fr, 0.33 xCPU 0.34 xClk; tot: 2.61 xCPU, 2.61 xClk
TOTAL Words: 48386 Correct: 35528 Errors: 23550
TOTAL Percent correct = 73.43% Error = 48.67% Accuracy = 51.33%
TOTAL Insertions: 10692 Deletions: 2673 Substitutions: 10185
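The relationship between these totals can be checked with a few lines of arithmetic. The sketch below uses the standard definitions (errors are insertions plus deletions plus substitutions, and correct words are the reference words that were neither deleted nor substituted); the function name is ours and this is not the scoring tool itself.

```python
def score_totals(total_words, insertions, deletions, substitutions):
    """Recompute the summary figures of an alignment from its raw counts."""
    errors = insertions + deletions + substitutions
    correct = total_words - deletions - substitutions
    wer = 100.0 * errors / total_words
    return {
        "errors": errors,
        "correct": correct,
        "percent_correct": round(100.0 * correct / total_words, 2),
        "wer": round(wer, 2),
        "accuracy": round(100.0 - wer, 2),
    }


# Counts taken from the TOTAL lines of the first result set above.
totals = score_totals(48386, insertions=10692, deletions=2673, substitutions=10185)
print(totals)
```

Because insertions count as errors but do not reduce the number of correct words, "Percent correct" and "Accuracy" differ, and the two do not sum with the error rate in the same way.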
- Computed xRT: 2.12
- WER: 40.86%
Part 1 Log
INFO: stat.c(206): SUMMARY: 983981 fr; 2061 cdsen/fr, 129 cisen/fr, 132017 cdgau/fr, 8256 cigau/fr, 3.89 xCPU 3.89 xClk [Ovhrd 0.26 xCPU 0 xClk]; 1732 hmm/fr, 16 wd/fr, 0.28 xCPU 0.28 xClk; tot: 4.18 xCPU, 4.18 xClk
Part 2 Log
INFO: stat.c(206): SUMMARY: 941489 fr; 2090 cdsen/fr, 129 cisen/fr, 133903 cdgau/fr, 8256 cigau/fr, 4.00 xCPU 4.00 xClk [Ovhrd 0.25 xCPU 0 xClk]; 1774 hmm/fr, 17 wd/fr, 0.29 xCPU 0.29 xClk; tot: 4.30 xCPU, 4.30 xClk
TOTAL Words: 48386 Correct: 38877 Errors: 19770
TOTAL Percent correct = 80.35% Error = 40.86% Accuracy = 59.14%
TOTAL Insertions: 10261 Deletions: 2237 Substitutions: 7272
Exp: 0252 - 022 - 002
Input values for parameters:
Corpus: first_5hr/mono Density: 32 Senone Value: 3000
Aligning results to find error rate
- SENTENCE ERROR: 83.4% (3884/4659)
- WORD ERROR RATE: 22.8% (13699/60084)
- REAL TIME FACTOR: 4.73%
- TOTAL TIME DECODE TO RUN: 24.7 hours
- MACHINE: Automatix
- Size of Test on train set: 5 hours
- Size of training set: 5 hours
Exp: 0252 - 022 - 003
Input values for parameters:
Corpus: first_5hr/mono Density: 64 Senone Value: 3000
Aligning results to find error rate
- SENTENCE ERROR: 73.4% (3422/4659)
- WORD ERROR RATE: 16.7% (10046/60084)
- REAL TIME FACTOR: 4.94%
- TOTAL TIME DECODE TO RUN: 24.7 hours
- MACHINE: Automatix
- Size of Test on train set: 5 hours
- Size of training set: 5 hours
The progress made by the class of Spring 2014 has been unprecedented. By week 5, about half the class was running trains with various parameters. Using this brute force approach of running trains and creating models, this class set the record for the lowest word error rate yet for test on train (WER 15%) and test on unseen data (WER 38%).
Recommendations for Future Capstone Groups
- Remove unneeded audio files from the data directories as well as from experiments
- We no longer need trans, audio(utt/conv) directories in each corpus. All we need is the transcript file.
- Add decoder changes to prepareExperiment2.pl
- Change Language Model name to tmp.arpa on line 51
- Generate _test fileids, transcript and feats2 for testing data
- Continue working on parallelization solution