Speech:Exps 0305 012

Description
Author: Prof Jonas (UserID: mcy59)

Date: 3-14-2018

Purpose: Keep instances of [] and - marked words in LM, transcripts and dictionary for 5 hour train and use same set to test.

Details:

Took the standard 5 hour train and made sure that [] and - marked words are in the language model, transcripts and dictionary. For the first two I just relied on the current genTrans.pl (in /mnt/main/scripts/user). For the latter I wrote a script:


 * convertTrainToLM.pl

That just strips out the following:


 * < /s>
 * ( utterance id )
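The stripping step described above can be sketched as follows. This is a Python illustration, not the actual convertTrainToLM.pl. It assumes each training transcript line has the shape `<s> WORDS </s> (utterance id)`, and the function name `train_line_to_lm` and the sample utterance id are hypothetical. Words marked with [] and - are deliberately left untouched.

```python
import re

# Assumed line shape (not confirmed by this page):
#   <s> WORD WORD ... </s> (utterance_id)
# Trailing parenthesized utterance id, with optional surrounding whitespace.
_UTT_ID = re.compile(r"\s*\([^()]*\)\s*$")
# Sentence markers; also tolerates the spaced spelling "< /s>".
_SENT_MARK = re.compile(r"<\s*/?\s*s\s*>")


def train_line_to_lm(line: str) -> str:
    """Strip sentence markers and the trailing utterance id.

    Everything else, including [] and - marked words, is kept as-is.
    """
    line = _UTT_ID.sub("", line)
    line = _SENT_MARK.sub(" ", line)
    # Collapse the whitespace left behind by the removals.
    return " ".join(line.split())


if __name__ == "__main__":
    # Hypothetical sample line; the utterance id is made up for illustration.
    sample = "<s> YEAH [LAUGHTER] I TH- THINK SO </s> (utt_0001)"
    print(train_line_to_lm(sample))
```

Removing the utterance id first keeps the end-of-line anchor simple, since the id is the last thing on the line under the assumed format.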

Since I used the standard pruneDictionary.pl (via /mnt/main/user/scripts/makeTrain.pl), I only needed to check the add.txt file it outputs to ensure that only the filler words are left (which was the case).
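The add.txt check can also be automated. The sketch below is a guess at how that might look, not part of the existing scripts. It assumes add.txt lists one entry per line with the word as the first whitespace-separated field, and that filler words are exactly the bracket-marked tokens such as [LAUGHTER] and [NOISE]. The function name `non_filler_words` is hypothetical.

```python
import re

# Assumption: a filler word is a single token wrapped in square brackets.
_FILLER = re.compile(r"^\[[^\[\]\s]+\]$")


def non_filler_words(lines):
    """Return the entries whose first field is not a bracketed filler word."""
    leftovers = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]
        if not _FILLER.match(word):
            leftovers.append(word)
    return leftovers


if __name__ == "__main__":
    # Hypothetical contents of add.txt, for illustration only.
    sample = ["[LAUGHTER]", "[NOISE]", "", "TH-"]
    print(non_filler_words(sample))
```

An empty result would correspond to the case reported here, where only filler words were left.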

I trained a set of models (i.e. on the 5 hour set) and then decoded, testing with the same training set on caesar.

Results:

This run replaces the now-flawed 0305/007, which incorrectly removed [LAUGHTER] and [NOISE] from the LM transcripts (this needs to be fixed). The new WER of 34.4% is slightly better than 0305/007's 34.5% WER, but it is still 1.2 points worse than 0305/011's result of 33.2% WER, where [] words were removed.

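 ,-----------------------------------------------------------------.
 |                            hyp.trans                            |
 |-----------------------------------------------------------------|
 | SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
 |---------+-------------+-----------------------------------------|
 | Sum/Avg | 4172  60215 | 73.1   19.1    7.8    7.4   34.4   87.5 |
 |=================================================================|
 |  Mean   |  1.3   19.1 | 76.0   18.3    5.8   15.4   39.4   87.9 |
 |  S.D.   |  0.5   16.5 | 18.1   15.3    7.7   29.1   33.0   30.1 |
 | Median  |  1.0   15.0 | 76.2   16.7    2.4    4.2   33.3  100.0 |
 `-----------------------------------------------------------------'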
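As a sanity check on the Sum/Avg row, the Err column should equal Sub + Del + Ins. The sketch below redoes that arithmetic with the rounded percentages from the table and compares this run against the WERs quoted for 0305/007 and 0305/011. The function name `error_rate` is hypothetical. Because the inputs are already rounded to one decimal, the sum can differ from the reported Err in the last digit.

```python
def error_rate(sub: float, dele: float, ins: float) -> float:
    """Word error rate in percent: substitutions + deletions + insertions."""
    return round(sub + dele + ins, 1)


if __name__ == "__main__":
    # Sum/Avg row of the table above (percentages, already rounded).
    recomputed = error_rate(19.1, 7.8, 7.4)
    reported = 34.4
    print("recomputed Err:", recomputed, "reported Err:", reported)

    # Absolute WER differences against the other runs quoted in Results.
    print("vs 0305/007:", round(34.5 - reported, 1))
    print("vs 0305/011:", round(reported - 33.2, 1))
```

The recomputed value is 0.1 below the reported Err, which is what rounding each column independently would be expected to produce.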