Speech:Exps 0305 011

Description
Author: Prof Jonas (UserID: mcy59)

Date: 3-13-2018

Purpose: Remove instances of []-marked words but keep the trailing - in the LM, transcripts and dictionary for the 5 hour train, and use the same set to test.

Details:

Took standard 5 hour train and removed instances of [] marked words as follows:


 * [???]abc -> abc
 * -[???]abc -> -abc
 * abc[???] -> abc
 * abc[???]- -> abc-
 * [laughter-abc] -> abc
 * [laughter-abc[???]] -> abc
 * [laughter-abc[???]-] -> abc-
 * [abc/???] -> abc
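The rewrite rules above can be sketched in a few lines of Python. This is only an illustration of the rules as listed, not the logic of the Perl scripts described in this log. The function names are hypothetical, and leaving a standalone bracketed token such as [noise] untouched is an assumption, since the rules do not cover that case.

```python
import re

def strip_bracket_marks(token: str) -> str:
    """Apply the [] removal rules to one whitespace-delimited token."""
    # [laughter-abc...] -> abc...  (unwrap, then keep cleaning the inner text)
    wrapped = re.fullmatch(r"\[laughter-(.+)\]", token)
    if wrapped:
        token = wrapped.group(1)
    # [abc/???] -> abc  (keep the part before the slash)
    slashed = re.fullmatch(r"\[([^\[\]/]+)/[^\[\]]*\]", token)
    if slashed:
        return slashed.group(1)
    # Assumption: a token that is only a bracketed event, e.g. [noise],
    # is not covered by the rules, so it is returned unchanged.
    if re.fullmatch(r"\[[^\[\]]*\]", token):
        return token
    # [???]abc, -[???]abc, abc[???], abc[???]- : drop the bracketed part
    # and keep any leading or trailing "-".
    return re.sub(r"\[[^\[\]]*\]", "", token)

def strip_line(line: str) -> str:
    """Apply the rules to every token of a transcript line."""
    return " ".join(strip_bracket_marks(tok) for tok in line.split())

# The eight rules from the list, written as input -> expected output.
RULES = {
    "[???]abc": "abc",
    "-[???]abc": "-abc",
    "abc[???]": "abc",
    "abc[???]-": "abc-",
    "[laughter-abc]": "abc",
    "[laughter-abc[???]]": "abc",
    "[laughter-abc[???]-]": "abc-",
    "[abc/???]": "abc",
}

for source, expected in RULES.items():
    assert strip_bracket_marks(source) == expected, source

# Made-up example line, not taken from the corpus.
print(strip_line("[laughter-yeah] i th[ink]- so"))  # prints: yeah i th- so
```

Working token by token keeps each rule anchored to a single word, so a bracket in one word cannot swallow text from its neighbours.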

Created three scripts to help convert the unedited transcript file (i.e. the trans file from the corpus) into 011_train.trans and 011.dic for training, and trans_parsed for the LM. The scripts are in etc/scripts:


 * parseLMTrans_no_brackets.pl
 * parseTrainTrans_no_brackets.pl
 * pruneDic_no_brackets.pl

The first two scripts are basically identical except for two lines in the latter (I grab the ID and then add it, along with <s> and </s>, to the final line). Because the rest of the filtering lines are identical in both scripts, this ensures that training, decoding and LM all use the same words.

Also, I checked trans_parsed by opening it in emacs and paging through all 4172 lines to see if I missed anything. The first few passes turned up issues, but on the last pass I think I caught everything. I also checked that every word in my transcripts (011_train.trans) also exists in my dictionary (011.dic):

% sed "s/..> //" 011_train.trans | sed "s/ <.*//" | sed "s/ /\n/g" | sort | uniq > trans.words
% awk '{print $1}' 011.dic | sort > dic.words
% diff trans.words dic.words
1d0
<
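The same coverage check can be sketched in Python, which reports the two directions of the comparison separately. This is a minimal sketch, assuming each transcript line has the form `<s> WORDS </s> (utterance_id)` and each dictionary line starts with the word followed by its pronunciation. The function names and the sample lines are hypothetical.

```python
import re

def transcript_words(lines):
    """Collect the unique words between <s> and </s> on each transcript line."""
    words = set()
    for line in lines:
        text = re.sub(r"^<s>\s*", "", line.strip())   # drop the leading <s>
        text = re.sub(r"\s*</s>.*$", "", text)        # drop </s> and the ID
        words.update(text.split())
    return words

def dictionary_words(lines):
    """Collect the first field of every non-empty dictionary line."""
    return {line.split()[0] for line in lines if line.strip()}

def coverage(trans_lines, dic_lines):
    """Return (words missing from the dictionary, dictionary words never used)."""
    in_trans = transcript_words(trans_lines)
    in_dic = dictionary_words(dic_lines)
    return sorted(in_trans - in_dic), sorted(in_dic - in_trans)

# Made-up sample data, not taken from 011_train.trans or 011.dic.
trans = ["<s> hello wor- </s> (utt_0001)"]
dic = ["hello HH AH L OW", "wor- W ER", "extra EH K S"]

missing, unused = coverage(trans, dic)
print(missing)  # prints: []
print(unused)   # prints: ['extra']
```

To run it on the real files, pass `open("011_train.trans")` and `open("011.dic")` in place of the sample lists.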

I trained a set of models (i.e. on the 5 hour set) and then decoded, testing with the same training set.

Results:

So if we can trust the results from 0305/007, where we keep both [] and - (i.e. 34.5% WER), then removing the partial markings (i.e. []) seems to do somewhat better (i.e. 32.7% WER, a 1.8 point absolute reduction). This is a bit of a surprise, as I would have expected that a more distinct language model (which is what you get with the extra context added by [] words) would improve performance. With only 5 hours there may not be enough data, so we definitely need to look at the full 300 hour set.

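,-----------------------------------------------------------------.
|                            hyp.trans                            |
|-----------------------------------------------------------------|
| SPKR    | # Snt # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
|---------+-------------+-----------------------------------------|
| Sum/Avg | 4172  60569 | 73.8   18.4    7.8    6.5   32.7   88.3 |
|=================================================================|
|  Mean   |  1.3   19.2 | 76.5   17.8    5.8   15.2   38.7   88.5 |
|  S.D.   |  0.5   16.5 | 17.8   15.0    7.8   29.2   32.4   29.5 |
| Median  |  1.0   15.0 | 76.9   15.9    2.1    3.1   33.3  100.0 |
`-----------------------------------------------------------------'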
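As a sanity check on the Sum/Avg row, the error rate is the sum of the substitution, deletion and insertion percentages, and the correct, substitution and deletion percentages should add up to 100. The short Python sketch below redoes that arithmetic and the comparison with the 34.5% WER from 0305/007. The function names are hypothetical, and the numbers are copied from this page.

```python
def word_error_rate(sub: float, dele: float, ins: float) -> float:
    """WER in percent: substitutions + deletions + insertions."""
    return sub + dele + ins

def relative_reduction(baseline: float, new: float) -> float:
    """Relative WER reduction in percent of the baseline."""
    return (baseline - new) / baseline * 100.0

# Sum/Avg row of the table above.
corr, sub, dele, ins = 73.8, 18.4, 7.8, 6.5

wer = word_error_rate(sub, dele, ins)
print(f"{wer:.1f}")                # prints: 32.7
print(f"{corr + sub + dele:.1f}")  # prints: 100.0

# Comparison with 0305/007 (both [] and - kept), 34.5% WER.
baseline = 34.5
print(f"{baseline - wer:.1f}")                      # prints: 1.8
print(f"{relative_reduction(baseline, wer):.1f}")   # prints: 5.2
```

The 1.8 point absolute gain is about a 5.2% relative reduction, which is small enough that the log's caution about the 5 hour set seems warranted.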