Speech:Spring 2019 Data Group New300



Project Logs


Project Member Logs


Tasks

The goal of this project was to create a new pair of data transcripts: one for training the decoder and one for testing on unseen data, that is, for measuring how well the system had learned to recognize audio. This was needed because previous experiments drew their test data from randomly selected audio files. Some of those files were also used in training, which gave the system an unfair advantage when it was tested on them. The result was an inaccurate picture of how well the system was learning to recognize words.
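The overlap problem described above can be checked directly. The following is a minimal sketch and not part of the original project: the leak_demo directory, the file names, and the sample transcript line format are all hypothetical and exist only to show the idea. It assumes a grep that supports -o (GNU and BSD grep do) and uses only plain commands, so it should behave the same under csh or sh.

```csh
# Scratch directory so no real transcript is touched.
mkdir -p leak_demo

# Hypothetical transcripts; the "(swNNNN...)" line format is an assumption.
printf 'HELLO THERE (sw2010A-0001)\n' > leak_demo/train.trans
printf 'HOW ARE YOU (sw2011A-0001)\n' >> leak_demo/train.trans
printf 'FINE THANKS (sw2022B-0003)\n' >> leak_demo/train.trans
printf 'HELLO THERE (sw2010A-0001)\n' > leak_demo/test.trans
printf 'GOODBYE (sw2999A-0007)\n' >> leak_demo/test.trans

# Pull out the unique file IDs that each transcript refers to.
grep -oE 'sw[0-9]{4}' leak_demo/train.trans | sort -u > leak_demo/train_ids.txt
grep -oE 'sw[0-9]{4}' leak_demo/test.trans | sort -u > leak_demo/test_ids.txt

# comm -12 keeps only the lines present in both sorted lists: the leaked IDs.
comm -12 leak_demo/train_ids.txt leak_demo/test_ids.txt > leak_demo/overlap.txt
cat leak_demo/overlap.txt
```

With this sample data the only ID that appears in both lists is sw2010. An empty overlap.txt would mean the test set is fully protected.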

Approach

The main problem with the old transcripts was that they did not guarantee a protected test set. To correct this, I selected specific audio files to be used only for testing and moved their transcript lines into a new file called Test_train.trans, which is the transcript used for testing. The remaining lines went into a new file called Train_train.trans, which is used for training the decoder.

To choose the audio files, I first had to determine the length and ID of each file in the training audio. I took the information supplied by the people who provided the original files and built an easy-to-follow chart of file lengths. I needed a total of 10 hours of protected audio. Once I knew the length of each file, I sorted the files by duration and settled on files that were 5 minutes long. This gave me a good sampling of audio files from each of the discs and kept the math simple: 10 hours at 5 minutes per file comes to 120 files.

Once I knew which file IDs made up the protected data, I wrote a script to split the original transcript file into the two new files listed above. Unfortunately, the script is quite blunt and works by brute force. I was unable to use a loop because the selected IDs follow no regular pattern.

Note that this script modifies files directly. It should only be run in a protected directory that does not contain the transcripts currently in use, and it is worth backing up the original transcript before running it.
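The selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used: the select_demo directory, the durations.txt table (file ID followed by length in seconds), and the tolerance of 5 seconds around the 5-minute mark are all hypothetical.

```csh
# Scratch directory for the example.
mkdir -p select_demo

# Hypothetical duration table: file ID, then length in seconds.
printf 'sw2001 187\n' > select_demo/durations.txt
printf 'sw2010 300\n' >> select_demo/durations.txt
printf 'sw2015 412\n' >> select_demo/durations.txt
printf 'sw2022 300\n' >> select_demo/durations.txt
printf 'sw2035 299\n' >> select_demo/durations.txt

# Files needed: 10 hours of audio at 5 minutes per file.
awk 'BEGIN { print (10 * 60) / 5 }' > select_demo/needed.txt
cat select_demo/needed.txt

# Keep the IDs of files that run 5 minutes, give or take 5 seconds (assumed tolerance).
awk '$2 >= 295 && $2 <= 305 { print $1 }' select_demo/durations.txt > select_demo/test_ids.txt
cat select_demo/test_ids.txt
```

The first awk call prints 120, the number of 5-minute files that make up 10 hours. The second keeps sw2010, sw2022, and sw2035 from the sample table.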

Important Notes

Script

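#!/bin/csh
# Split train.trans into Test_train.trans (protected test data) and
# Train_train.trans (everything else), one file ID at a time.

# Remove output from any previous run, then work on a copy of the transcript.
rm -i Tmp_train.trans
rm -i Test_train.trans
rm -i Train_train.trans
cp -i train.trans Tmp_train.trans

# For each protected file ID: append its lines to Test_train.trans, write all
# other lines to Train_train.trans, then make that file the new working copy.
grep sw2010 Tmp_train.trans >> Test_train.trans
grep -v sw2010 Tmp_train.trans >> Train_train.trans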

mv Train_train.trans Tmp_train.trans

grep sw2022 Tmp_train.trans >> Test_train.trans
grep -v sw2022 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2035 Tmp_train.trans >> Test_train.trans
grep -v sw2035 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2055 Tmp_train.trans >> Test_train.trans
grep -v sw2055 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2072 Tmp_train.trans >> Test_train.trans
grep -v sw2072 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2079 Tmp_train.trans >> Test_train.trans
grep -v sw2079 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2082 Tmp_train.trans >> Test_train.trans
grep -v sw2082 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2090 Tmp_train.trans >> Test_train.trans
grep -v sw2090 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2095 Tmp_train.trans >> Test_train.trans
grep -v sw2095 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2105 Tmp_train.trans >> Test_train.trans
grep -v sw2105 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2113 Tmp_train.trans >> Test_train.trans
grep -v sw2113 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2161 Tmp_train.trans >> Test_train.trans
grep -v sw2161 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2195 Tmp_train.trans >> Test_train.trans
grep -v sw2195 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2223 Tmp_train.trans >> Test_train.trans
grep -v sw2223 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2265 Tmp_train.trans >> Test_train.trans
grep -v sw2265 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2326 Tmp_train.trans >> Test_train.trans
grep -v sw2326 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2374 Tmp_train.trans >> Test_train.trans
grep -v sw2374 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2383 Tmp_train.trans >> Test_train.trans
grep -v sw2383 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2413 Tmp_train.trans >> Test_train.trans
grep -v sw2413 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2471 Tmp_train.trans >> Test_train.trans
grep -v sw2471 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2519 Tmp_train.trans >> Test_train.trans
grep -v sw2519 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2560 Tmp_train.trans >> Test_train.trans
grep -v sw2560 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans 

grep sw2605 Tmp_train.trans >> Test_train.trans
grep -v sw2605 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2689 Tmp_train.trans >> Test_train.trans
grep -v sw2689 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2741 Tmp_train.trans >> Test_train.trans
grep -v sw2741 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2751 Tmp_train.trans >> Test_train.trans
grep -v sw2751 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2774 Tmp_train.trans >> Test_train.trans
grep -v sw2774 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2800 Tmp_train.trans >> Test_train.trans
grep -v sw2800 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2819 Tmp_train.trans >> Test_train.trans
grep -v sw2819 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2842 Tmp_train.trans >> Test_train.trans
grep -v sw2842 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2876 Tmp_train.trans >> Test_train.trans
grep -v sw2876 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2889 Tmp_train.trans >> Test_train.trans
grep -v sw2889 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2910 Tmp_train.trans >> Test_train.trans
grep -v sw2910 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2930 Tmp_train.trans >> Test_train.trans
grep -v sw2930 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2955 Tmp_train.trans >> Test_train.trans
grep -v sw2955 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2963 Tmp_train.trans >> Test_train.trans
grep -v sw2963 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2988 Tmp_train.trans >> Test_train.trans
grep -v sw2988 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2989 Tmp_train.trans >> Test_train.trans
grep -v sw2989 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw2994 Tmp_train.trans >> Test_train.trans
grep -v sw2994 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3010 Tmp_train.trans >> Test_train.trans
grep -v sw3010 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3044 Tmp_train.trans >> Test_train.trans
grep -v sw3044 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3054 Tmp_train.trans >> Test_train.trans
grep -v sw3054 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3128 Tmp_train.trans >> Test_train.trans
grep -v sw3128 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3155 Tmp_train.trans >> Test_train.trans
grep -v sw3155 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3156 Tmp_train.trans >> Test_train.trans
grep -v sw3156 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3157 Tmp_train.trans >> Test_train.trans
grep -v sw3157 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3160 Tmp_train.trans >> Test_train.trans
grep -v sw3160 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3161 Tmp_train.trans >> Test_train.trans
grep -v sw3161 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3162 Tmp_train.trans >> Test_train.trans
grep -v sw3162 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3305 Tmp_train.trans >> Test_train.trans
grep -v sw3305 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3306 Tmp_train.trans >> Test_train.trans
grep -v sw3306 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3307 Tmp_train.trans >> Test_train.trans
grep -v sw3307 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3308 Tmp_train.trans >> Test_train.trans
grep -v sw3308 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3309 Tmp_train.trans >> Test_train.trans
grep -v sw3309 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3310 Tmp_train.trans >> Test_train.trans
grep -v sw3310 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3311 Tmp_train.trans >> Test_train.trans
grep -v sw3311 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3312 Tmp_train.trans >> Test_train.trans
grep -v sw3312 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3449 Tmp_train.trans >> Test_train.trans
grep -v sw3449 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3450 Tmp_train.trans >> Test_train.trans
grep -v sw3450 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3451 Tmp_train.trans >> Test_train.trans
grep -v sw3451 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3452 Tmp_train.trans >> Test_train.trans
grep -v sw3452 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3453 Tmp_train.trans >> Test_train.trans
grep -v sw3453 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3454 Tmp_train.trans >> Test_train.trans
grep -v sw3454 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3455 Tmp_train.trans >> Test_train.trans
grep -v sw3455 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3456 Tmp_train.trans >> Test_train.trans
grep -v sw3456 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3590 Tmp_train.trans >> Test_train.trans
grep -v sw3590 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3591 Tmp_train.trans >> Test_train.trans
grep -v sw3591 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3592 Tmp_train.trans >> Test_train.trans
grep -v sw3592 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3594 Tmp_train.trans >> Test_train.trans
grep -v sw3594 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3595 Tmp_train.trans >> Test_train.trans
grep -v sw3595 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3596 Tmp_train.trans >> Test_train.trans
grep -v sw3596 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3597 Tmp_train.trans >> Test_train.trans
grep -v sw3597 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3598 Tmp_train.trans >> Test_train.trans
grep -v sw3598 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3732 Tmp_train.trans >> Test_train.trans
grep -v sw3732 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3733 Tmp_train.trans >> Test_train.trans
grep -v sw3733 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3734 Tmp_train.trans >> Test_train.trans
grep -v sw3734 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3735 Tmp_train.trans >> Test_train.trans
grep -v sw3735 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3736 Tmp_train.trans >> Test_train.trans
grep -v sw3736 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3737 Tmp_train.trans >> Test_train.trans
grep -v sw3737 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3739 Tmp_train.trans >> Test_train.trans
grep -v sw3739 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3872 Tmp_train.trans >> Test_train.trans
grep -v sw3872 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3873 Tmp_train.trans >> Test_train.trans
grep -v sw3873 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3874 Tmp_train.trans >> Test_train.trans
grep -v sw3874 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3875 Tmp_train.trans >> Test_train.trans
grep -v sw3875 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3876 Tmp_train.trans >> Test_train.trans
grep -v sw3876 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3877 Tmp_train.trans >> Test_train.trans
grep -v sw3877 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3878 Tmp_train.trans >> Test_train.trans
grep -v sw3878 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3879 Tmp_train.trans >> Test_train.trans
grep -v sw3879 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw3880 Tmp_train.trans >> Test_train.trans
grep -v sw3880 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4015 Tmp_train.trans >> Test_train.trans
grep -v sw4015 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4016 Tmp_train.trans >> Test_train.trans
grep -v sw4016 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4017 Tmp_train.trans >> Test_train.trans
grep -v sw4017 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4018 Tmp_train.trans >> Test_train.trans
grep -v sw4018 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4019 Tmp_train.trans >> Test_train.trans
grep -v sw4019 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4020 Tmp_train.trans >> Test_train.trans
grep -v sw4020 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4021 Tmp_train.trans >> Test_train.trans
grep -v sw4021 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4022 Tmp_train.trans >> Test_train.trans
grep -v sw4022 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4159 Tmp_train.trans >> Test_train.trans
grep -v sw4159 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4160 Tmp_train.trans >> Test_train.trans
grep -v sw4160 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4161 Tmp_train.trans >> Test_train.trans
grep -v sw4161 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4162 Tmp_train.trans >> Test_train.trans
grep -v sw4162 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4163 Tmp_train.trans >> Test_train.trans
grep -v sw4163 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4164 Tmp_train.trans >> Test_train.trans
grep -v sw4164 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4165 Tmp_train.trans >> Test_train.trans
grep -v sw4165 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4301 Tmp_train.trans >> Test_train.trans
grep -v sw4301 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4302 Tmp_train.trans >> Test_train.trans
grep -v sw4302 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4303 Tmp_train.trans >> Test_train.trans
grep -v sw4303 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4304 Tmp_train.trans >> Test_train.trans
grep -v sw4304 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4305 Tmp_train.trans >> Test_train.trans
grep -v sw4305 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4306 Tmp_train.trans >> Test_train.trans
grep -v sw4306 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4307 Tmp_train.trans >> Test_train.trans
grep -v sw4307 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4308 Tmp_train.trans >> Test_train.trans
grep -v sw4308 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4643 Tmp_train.trans >> Test_train.trans
grep -v sw4643 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4644 Tmp_train.trans >> Test_train.trans
grep -v sw4644 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4646 Tmp_train.trans >> Test_train.trans
grep -v sw4646 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4649 Tmp_train.trans >> Test_train.trans
grep -v sw4649 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4650 Tmp_train.trans >> Test_train.trans
grep -v sw4650 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4655 Tmp_train.trans >> Test_train.trans
grep -v sw4655 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

grep sw4659 Tmp_train.trans >> Test_train.trans
grep -v sw4659 Tmp_train.trans >> Train_train.trans

mv Train_train.trans Tmp_train.trans

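# Last ID: no mv follows, so Train_train.trans is the finished training
# transcript and Tmp_train.trans is left behind as a scratch file.
grep sw4660 Tmp_train.trans >> Test_train.trans
grep -v sw4660 Tmp_train.trans >> Train_train.trans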
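
One possible way to avoid repeating the same step for every ID is to keep the protected IDs in a list file and let grep read the patterns from it. The following is a minimal sketch of that idea, not the script that was actually run: the split_demo directory, the test_ids.txt file, and the sample transcript lines are hypothetical. It uses only plain commands, so it should behave the same under csh or sh.

```csh
# Scratch directory so the real transcripts are never touched.
mkdir -p split_demo

# Hypothetical sample transcript; the line format is an assumption.
printf 'HELLO THERE (sw2010A-0001)\n' > split_demo/train.trans
printf 'HOW ARE YOU (sw2011A-0001)\n' >> split_demo/train.trans
printf 'FINE THANKS (sw2022B-0003)\n' >> split_demo/train.trans
printf 'GOODBYE (sw2999A-0007)\n' >> split_demo/train.trans

# One protected ID per line (two of the IDs from the script above).
printf 'sw2010\n' > split_demo/test_ids.txt
printf 'sw2022\n' >> split_demo/test_ids.txt

# -F treats each ID as a fixed string; -f reads the patterns from a file.
grep -F -f split_demo/test_ids.txt split_demo/train.trans > split_demo/Test_train.trans
# -v inverts the match, so every line without a protected ID goes to training.
grep -v -F -f split_demo/test_ids.txt split_demo/train.trans > split_demo/Train_train.trans

# Sanity check: the two outputs together should have as many lines as the input.
wc -l split_demo/train.trans split_demo/Test_train.trans split_demo/Train_train.trans
```

Because the input is read once per output and never overwritten, this variant needs no temporary working copy. Like the script above, it matches an ID anywhere on a line.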