Speech:Spring 2018 Data Group


Group Member Logs

 * Isaac Marsh
 * Tri Nguyen
 * Rosali Salemi

Tasks
The main task for the Data Group in Spring 2018 was modifying three scripts: parseLMTrans.pl, genTrans.pl, and pruneDic.pl. The first two eventually replaced the original versions in /mnt/main/scripts/user/, so all new experiments now use them.

We sought to discover whether we could improve the word error rate (WER) by modifying the regular expressions in each script to better handle words in the original transcript containing special characters such as brackets, dashes, single quotes, and forward slashes. This continued work begun last semester.

A problem encountered early in the semester was that last year's Capstone students had modified only the scripts that generate the transcript and the dictionary, not the language model. The language model assigns each word a probability of being correct. During decoding there would be a mismatch: a word could appear in the transcript and dictionary but not in the language model, so correct words were substituted with other words the language model considered more likely. One of our tasks was therefore to modify parseLMTrans.pl so that it filtered out the same words that genTrans.pl and pruneDic.pl did.
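To illustrate the consistency requirement, here is a minimal Perl sketch; the keep_word subroutine is hypothetical, not code taken from the actual scripts. The point is that whatever test decides which words survive must be applied identically when producing the transcript, the dictionary, and the language model input:

sub keep_word {
    my ($w) = @_;
    return 0 if $w =~ /^\[.*\]$/;   # drop fully bracketed tokens
    return 1;                       # keep everything else
}

If genTrans.pl, pruneDic.pl, and parseLMTrans.pl all applied the same test, a word could never end up in the transcript and dictionary but missing from the language model.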

Examples of problematic words:

 * [LAUGHTER], [NOISE], [VOCALIZED-NOISE]: these three are known as filler words; they have their own dictionary and are treated by the sphinx3 decoder as silence.
 * [laughter-all], [laughter-this], [laughter-time]: the speaker was laughing while saying the associated word.
 * [don'n/don't] and [shun't/shouldn't]: the speaker said the first word but meant the second.
 * an[y]: the bracketed part of the word was not spoken, but the official "truth" transcript includes the rest of the presumed word anyway.

We modified the regular expressions in various ways:

 * removing the brackets, undesired punctuation, and/or special characters such as single apostrophes or forward slashes;
 * replacing a pair of incorrectly and correctly pronounced words, such as [don'n/don't], with just the correctly pronounced word "don't";
 * removing the [laughter- ] wrapper from words spoken while the speaker was laughing, so that only the word it was joined to remained;
 * removing all brackets and anything inside them; in that case a dash usually followed the remaining word, and we tried both leaving it in and removing it, so an[y] became either an- or just an.
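In Perl, the rewrites looked roughly like the following sketch (these regexes are illustrative and the normalize_word subroutine is hypothetical; the exact expressions in genTrans.pl differ):

sub normalize_word {
    my ($w) = @_;
    $w =~ s{^\[laughter-(.+)\]$}{$1};    # [laughter-all] -> all
    $w =~ s{^\[[^/]+/([^\]]+)\]$}{$1};   # [don'n/don't]  -> don't
    $w =~ s{^(\w+)\[\w+\]$}{$1-};        # an[y] -> an-  (replace with {$1} to get just "an")
    return $w;
}

For example, normalize_word('[laughter-this]') returns 'this'.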

The original versions of these scripts are in experiments 0305/011, 0305/012, and 0305/013. There are many more experiments in the directories of the Data Group (0305), Guardians (0309), and Avengers (0310). The wiki page https://foss.unh.edu/projects/index.php/Speech:Exps_0305 lists the experiments that have been run.

You can also use the command expStat.sh, which will show you a list of flags. Use the flags (order counts: they must be given in the order listed) to see a list of experiments that have been run, filtered by user, date, current month, etc. Example:

expStat.sh -u

will show you a list of experiments filtered by user. Be patient: the script takes several seconds to do its work. Also, some flags don't work without others; I needed -u in order to see experiments filtered by the scoring flag -s.
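Going by that note, a combined invocation would presumably look like this (with the flags in the required order):

expStat.sh -u -s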

The scripts themselves are in either the experiment directory itself (e.g., 0305/011) or, in a few cases, the LM directory of each respective experiment on Caesar (you can also access them through any drone).

The parseLMTrans.pl and genTrans.pl currently used for all experiments, per Professor Jonas, are based on 0305/012, which keeps both bracketed words and dashes. The best WER so far on the 300-hour unseen data is from 0310/022, which uses scripts that strip both bracketed words and dashes. Further testing is needed, as the Professor expects that keeping them in should improve the score: the extra context ought to lead to better word choices.

Other Data Group tasks included listening to samples from the audio files and comparing what was spoken against the text transcript to verify the accuracy of the transcription. We also used an open-source application called Audacity to manually clean up background noise in the audio files that needed it.

The pruneDic.pl script and the other two scripts have multiple variations, such as pruneDic_no_brackets.pl. A possible task for next semester's Data Group is to update the pruneDictionary.pl script currently in use in /mnt/main/scripts/user/ with the changes made to pruneDic.pl.