Speech:Spring 2018 Rosali Salemi Log



Week Ending February 5th, 2018
NOTE - A couple of useful commands for this Wiki page:

To put something in bold, either highlight and click the B in the toolbar above, or just type in three single quotes on either side of your text. To make a box around something (usually code snippets or commands) simply make sure your line of text is on its own line and hit the space bar right at the beginning of the line so that there is one space before the text begins. When you save it, you'll see a light gray box around your text.
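For example, typing these two lines into the wiki editor:

```
'''this text comes out bold'''
 this line, starting with one space, gets a light gray box
```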

If you use Notepad++ and like to copy your commands from it, wiki formatting may mangle them unless you do the following:

TEXT EDITING: Use regular Notepad in Windows, which has no background formatting that might carry over when you copy/paste commands. Or, if you use Notepad++, wrap your text in <pre></pre> tags to mark it as pre-formatted, so the wiki doesn't attempt to force its formatting onto your Notepad++ text. Sublime Text may cause problems, though.


 * Task: 2/1/18. This week we established the members of the groups; mine is the Data Group, along with Isaac Marsh, Tri Nguyen and Arias Talari. We are communicating via email for now although we discussed using Google Hangouts and a weekly meeting in addition to our regular class.

I had an issue getting onto the VPN last week; I met with several classmates on Monday to fix this issue. I am familiarizing myself with the file system in Caesar, learning the locations of pertinent files and directories, particularly for the Data Group. I examined the contents of the dictionary files within the corpus directory, noting how both regular spoken words and descriptive words (such as -DASH D AE SH) are listed in two columns, with the word on the left and phonemes on the right.

I have also refreshed my memory for basic UNIX/Linux commands and created a directory within the 0303 directory under my number, 010, in which I will be conducting experiments.
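As a sketch of that setup, here is the same layout rehearsed with a temporary path standing in for the real /mnt/main/Exp tree (the 0303/010 numbers are mine from above; the /tmp path is only for illustration):

```shell
# Stand-in for /mnt/main/Exp/0303/010 on caesar
base="/tmp/Exp/0303/010"

# Create the sub-experiment directory I'll run experiments in
mkdir -p "$base/001"
ls "$base"
```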

2/2/2018 I played a bit with the CMU pronouncing dictionary on the wiki site to see how it displays a word by its phonemes, noting the 39 phonemes used and the 3 types of stresses that can be placed on words (0 - none, 1 - primary and 2 - secondary). I have read through most of this wiki site's information section.

2/4/2018 I've read through all the logs in our class to see what everyone is doing and get an idea of what I can do to make more progress. I like the idea of establishing an inventory of everyone's strengths and weaknesses/areas of expertise, such as Networking, Programming, and Administration. I've also gone through the Spring 2017 semester log for Data Group member Maryjean Emerson, to see what kind of tasks she was working on (basically data cleanup through removal of duplicate lines in the text transcript files, and later, using regular expressions and modified perl scripts to eliminate unwanted items from the transcript). Finally, I signed up for a short course on perl at Lynda.com. (If you have a card to a local library, you ought to be able to sign up for Lynda.com for free. Just Google the Bedford, NH library and Lynda.com and you should find a portal where you use your library card number and PIN to get access to Lynda.com.)

2/5/2018 I met with Arias from my Data Group and Daniel Rubin from the Systems Group to get organized and try to run a successful train.


 * Results: I now have access to the VPN while at home. I have emailed everyone again with my suggestions for a weekly meeting time and am waiting to hear back. My experiment directory is established and I have a general mental map of how caesar is laid out.

2/2/2018 Some of the information is out of date - for example, in Speech>UNIX>Active Directory it says we aren't using it, and there are instructions for setting up Active Directory using OpenSUSE, which Prof Jonas said we aren't using anymore because all the machines run RedHat now. There are instructions for creating soft links from the batch machines to caesar (the only machine with the CMU software on it) so we can run the machines in parallel. If we need to create a new corpus (a database of audio files and transcripts), there are instructions and scripts for that. Our group has decided to meet on Wednesdays at 1pm and also use Discord.

2/4/2018 I will keep the issues my classmates had with running a train in mind when I do mine tomorrow in case I run into similar issues. I am getting a better idea as to what kinds of tasks the Data Group will want to focus on.

2/5/2018 I have joined Discord and was also in contact with Daniel Beitel. I downloaded Filezilla as a means of viewing the various directories and files in caesar through a GUI. BE CAREFUL not to double-click on anything in the bottom two windows of Filezilla or you'll be moving files around, not just opening a folder. I also posted a note about Lynda.com on Discord.

Viewing the train results was more of a challenge than I thought. Besides various text files, an .html file was generated that I wanted to open and view in a web browser. I could not find (via Google) a command to do so without installing additional software, such as a Linux-based text browser called lynx, and I first need to find out whether that would install on my local laptop or on the server (Prof Jonas said not to install software on caesar). I did find a stackexchange answer from 2012 with a command that shows a text version of the information in the .html file: xdg-open <filename> (type this without the <> brackets). The .html results didn't list many specifics, but said there were hundreds of error and warning messages. I was specifically looking for the WER but didn't see it.
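Since installing a text browser like lynx on caesar isn't an option, one crude stopgap is to strip the HTML tags with sed and read what's left. This is only a sketch on a made-up file, not a proper renderer:

```shell
# Tiny stand-in for the generated results page
printf '<html><body><h1>Training Results</h1><p>hundreds of warnings</p></body></html>\n' > /tmp/results.html

# Replace every HTML tag with a space, leaving the readable text
sed 's/<[^>]*>/ /g' /tmp/results.html
```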


 * Plan: 2/1/2018 My main goal this week is to read as much background information on the process of speech recognition as possible, through this wiki site and sites such as CMU Sphinx at https://cmusphinx.github.io/wiki/tutorialconcepts/ and to read through the logs of last year's students, particularly those from the Data Group.

2/2/2018 Get Discord. Read the rest of the wiki site and CMU Sphinx. Watch some youtube demos of running a train, decoding, etc. Find out how to open and view an .html file from the terminal.

2/4/2018 Tomorrow I plan to try to run my first train. If it fails, I want to figure out why and ask others for advice. Several people, such as Camden Marble and Daniel Rubin, have already tried to run a train and run into problems that they might have resolved by now. I also want to take a course or tutorial on perl, since the scripts in the corpus directory are of type .pl and have unfamiliar syntax that I'm guessing is perl. I have signed up for a short course on perl at Lynda.com.

2/5/2018 Continue the course on perl from Lynda.com. Read more on CMU Sphinx and speech recognition in general, and more logs from earlier students.
 * Concerns: 2/1/2018 For the moment I am still unsure as to exactly what the data group actually does, aside from improving the dictionary - and want to know how to accomplish that as well. Also we still need to set up a weekly meeting schedule.

2/2/2018 One of my previous concerns has been addressed. We are now meeting on Wednesdays.

2/4/2018 None yet - once I run a train and see what sort of output I get, I will have a better idea of the actual, desired output and hopefully have ideas to make it better.

2/5/2018 The results are not very easy to understand. I didn't see a WER, and there were a lot of error messages that were not explained; hopefully one of the other logs makes it clearer.

Week Ending February 12, 2018
 * Task: 2/6/2018 Today I spent some time reading the Language Modeling (or LM) section of the CMU Sphinx site. Before class our group met up and discussed what we had accomplished during the week. Arias and I helped Tri and Isaac run trains and get familiar with the caesar directory structure via the Terminal. We are using Filezilla to transfer the train data results to our laptops so we can view the .html page created as a result of the train. During class we got some guidance from Prof Jonas as to our next tasks, including Get Familiar With Linux and Learn to Understand the Data that we will be working with. To that end I have downloaded an open source application called VLC to my laptop and tested it to be sure I can listen to .wav audio files.

2/8/2018 Today I read more about acoustic models on the CMU site. At the group meeting we followed the tutorial for creating a language model and then attempted to run a decode on the trained data from each of our first trains. We also helped Isaac troubleshoot some problems with file permissions, and he needed to do the keygen steps again after he deleted his first train because he had run it in the wrong directory. Finally, we discussed what to do to come up with a proposal draft for the 11th.

2/9/2018 Today I read through the logs of Brendan Collins and Brian Anker and also the proposals for the data groups of 2017 and 2016 to get ideas for our proposal. I also began researching other ways to improve the accuracy of the data other than listening to the audio, reading the transcripts and manually making corrections, as 2016's group did, or by using regular expressions to filter the data to remove unwanted items such as brackets with filler words such as [laughter], as 2017's group did.

Prof Jonas mentioned looking into how the last Data Group dealt with annotations (words in brackets). 2017's data group tried tweaking the perl scripts using regular expressions to automate the deletion of words like "laughter" but keep any word associated with it ex. [laughter-really]. Maryjean's log says they were having issues with partial words and with lines that had words in brackets, as the line would duplicate itself the same number of times as the number of brackets. Unfortunately, they couldn't get the new regular expressions to work correctly without introducing other errors.
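The kind of filtering they attempted can be sketched with sed on a made-up transcript line (the real logic lives in the perl scripts; this only illustrates dropping pure filler like [noise] while keeping the word from a [laughter-word] annotation):

```shell
# Made-up transcript line with two kinds of bracketed annotations
printf 'sw2001A-0001 [noise] okay [laughter-really] fine\n' > /tmp/line.txt

# First keep the word from [laughter-word] annotations, then drop plain filler tags
sed -e 's/\[laughter-\([a-z]*\)\]/\1/g' -e 's/\[[a-z]*\]//g' /tmp/line.txt
```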

If we don't want to pick up where '16 or '17 left off, we could research other ways to improve score accuracy. The last item on the CMU tutorial is https://cmusphinx.github.io/wiki/tutorialtuning/ about getting better accuracy results, which may be a help, but is about pocketsphinx, not our sphinx3.

Over the weekend I had to work but kept in contact with the group through email to offer my suggestions as to what to put in the proposal. Arias submitted our document to Camden, who did an excellent job organizing the proposal into one "voice".

2/12/2018 After a major Windows 10 update, my Pulse Secure VPN was not able to get a connection. I called Tech Support at UNH in Durham (603) 862-4242 and after a lot of troubleshooting (checking the signal strength of my network connection, making sure my Norton antivirus had Pulse Secure on the Allow list, etc.) we determined that I had an older version of Pulse Secure that needed to be uninstalled and the newest version installed. Note that there are actually about 7 Pulse Secure program files that all have to be uninstalled one at a time. Then I tested Putty and Filezilla to make sure I had full access.

Josh Y used Vitali's logs from 2017 to figure out that the instructions for doing the decode are wrong: the directory where you build the LM and run the decode must be the same one where you ran the train. For me this was /mnt/main/Exp/010. I had put the LM directory alongside my 001 directory (so that both were in 010), which meant the LM could not access the information it needed from the train inside the 001 directory.

While my train was running I read through the latest logs. I have also been studying older logs from 2015, such as Dakota Heyman's: https://foss.unh.edu/projects/index.php/Speech:Spring_2015_Dakota_Heyman_Log#Week_Ending_March_3.2C_2015 She describes how to create multiple symbolic (soft) links with the command

 ln -fs /mnt/main/corpus/switchboard/full/train/audio/utt/sw2062B-ms98-a-{0002..0104}.sph -t .

You put a range of files in the {} - hers was for files 2-104. To delete multiple symbolic links:

 find . -maxdepth 1 -type l -delete
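Her commands can be rehearsed safely with throwaway files in /tmp before touching the corpus (the .sph names below are stand-ins):

```shell
mkdir -p /tmp/utt /tmp/linkdemo
cd /tmp/linkdemo

# Fake audio files standing in for the corpus .sph files
for n in 0002 0003 0004 0005; do
  touch "/tmp/utt/sw2062B-ms98-a-$n.sph"
done

# One soft link per file, created in the current directory
for f in /tmp/utt/sw2062B-ms98-a-*.sph; do
  ln -fs "$f" .
done
find . -maxdepth 1 -type l | wc -l    # 4 links

# And remove them all again
find . -maxdepth 1 -type l -delete
```

Deleting the links leaves the original files in /tmp/utt untouched, which is the point of soft links.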

 * Results: 2/6/2018 I feel we have made good progress in getting everyone up to speed. We came up with a short list of tasks to do before our group meeting, which was pushed back to Thursday 2pm due to snow. I also posted the general location of the audio files to Discord, and here, in case anyone is looking for them.

/mnt/main/corpus/noaa/half/adapt/audio/wav

2/8/2018 The tutorial could not be followed exactly as written because we each have a directory just for the train we initially ran, essentially adding one more directory to the file path. The LM scripts appeared to work properly, adding the expected files within my 001 experiment subfolder.

For the LM - Original Step 1: From your Base Experiment folder, make a folder called LM in /mnt/main/Exp/<exp number>, where <exp number> is 0303. Our modified Step 1: work in /mnt/main/Exp/0303/010/001 (my sub-experiment directory path). In this directory I finished the steps for creating the language model.

For the Decode on Trained Test Data: again, there is some ambiguity where file path names don't match the tutorial's exactly. Arias and Tri used the alternate command

 awk '{print $1}' /mnt/main/corpus/switchboard/30hr/test/trans/train.trans >> /mnt/main/Exp/0283/018/etc/018_decode.fileids

(using their own file path names for the destination) and I tried the simpler command

 head -1000 001_train.fileids > 001_decode.fileids

due to an error message (see below). We each did this in our respective etc directory within our sub-experiment folders (mine is 001). The decode appeared to work properly.
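Both commands just produce a list of utterance IDs for the decode. On a made-up two-line transcript (file names hypothetical), the awk version behaves like this:

```shell
# Toy stand-in for train.trans: utterance ID first, transcript text after
printf 'sw2001A-0001 hello there\nsw2001A-0002 okay then\n' > /tmp/train.trans

# awk pulls just the first column (the IDs) into the .fileids list
awk '{print $1}' /tmp/train.trans > /tmp/001_decode.fileids
cat /tmp/001_decode.fileids
```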

The hypothesis transcript command appeared to work, even generating an error message that, according to the tutorial, is expected (it tries to remove any existing hypothesis file first, so the first time the command is run there is none to remove). The last step, running the SCLITE scorer to get the accuracy results, doesn't work. There is a list of potential problems, but our error message was the same despite using different .fileids commands earlier: Segmentation fault (core dumped). Googling this, it's apparently a very common error message for many different issues, and the cause may prove difficult to track down.

2/9/2018 I see there are many ways to improve the quality of the data we are working with. See https://foss.unh.edu/projects/index.php/Speech:Spring_2016_Data_Group where they talk about a different way of doing the SCLite scoring report (which gives us the WER) and give advice on sorting through transcripts and audio files to remove bad data (an unintelligible speaker and the line in the transcript that corresponds to it, duplicate transcripts, or transcripts with no matching audio file at all). They did not complete the project; about halfway through the semester they switched to helping the other groups create a script to build an entirely new corpus database.

2/12/2018 The proposal was submitted last night and looks good. I have been trying to get a train/decode running, but have encountered issues, in part because I accidentally ran the parseTrans command twice due to UNIX pasting on right-click. The resulting hyp.trans file was empty, though I am not sure whether that was the result of my mistake, since other people have been reporting the same issue. I have reached out to the class through Discord.

 * Plan: 2/6/2018 Email and supplement the notes taken in class to the others in our group. Test VLC with the .sph audio files. Look at the logs from 2017's Data Group to figure out how to interpret the results from my first train. Thursday, we should brainstorm ideas on the rough draft of the proposal, which is due next week. Look over previous semesters' proposals to see how they handled it.

2/8/2018 I will go through earlier logs and Google to learn more about this error, and try to get some ideas for the proposal from previous capstone logs and contact my teammates via Discord. Also, I plan to Google ways to improve a language model, dictionary, and acoustic model to see if anything looks feasible to do in roughly three months' time. We also want to reach out to the other teams to see if anyone has any ideas how to fix the error.

2/9/2018 Read more about other ways to improve WER scores and accuracy for speech recognition. Learn to interpret the results of the train we did. Do a comparison of an audio file and its corresponding transcript (how do we find which ones correspond to each other)? Document everything we change.

2/12/2018 Fix my VPN connection. Run a train, build a LM (Language Model) and run a decode. Determine what I should focus on in terms of research and tasks, such as learning how to add words/pronunciations to the dictionary.
 * Concerns: 2/6/2018 None at present.

2/8/2018 The error is troubling. Also I still am not sure what we should focus on for the proposal.

2/9/2018 I would like more communication from some of my teammates. Our proposal draft is due in two days and we still have not settled on a focus.

2/12/2018 I have not been able to get a successful train and decode done. I feel this has been taking too much time and want to dig into other tasks.

Week Ending February 19, 2018
I spent many hours reading students' logs and looking up Linux commands, and came to class early to work with classmates to try to get a successful train/decode. I was not successful.
 * Task: 2/14/2018

Tri and I spent a few hours with Prof Jonas after class fixing a problem with Tri's hyp.trans file (it had twice as many lines of text as it should, as everything got duplicated), and asking questions to get a better handle on what we need to do as a group. Tri has a long list of the commands Prof Jonas used to do his troubleshooting from his Terminal, which will help us, as I at least am not very experienced in Linux. He used sort hyp.trans | uniq | wc to count the unique lines and confirm the duplication.
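On a toy hyp.trans with a duplicated line, that pipeline makes the problem visible by comparing total lines against unique lines:

```shell
# Toy hyp.trans where one line got duplicated
printf 'i think so\ni think so\nokay yeah\n' > /tmp/hyp.trans

wc -l < /tmp/hyp.trans               # 3 total lines
sort /tmp/hyp.trans | uniq | wc -l   # 2 unique lines, so a duplicate exists
```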

Prof Jonas wants the Data Group to fix a problem with the following script (and, in general, be able to explain what it does):

parseLMTrans.pl - it's on this page: https://foss.unh.edu/projects/index.php/Speech:Create_LM#Steps_for_Creating_the_Language_Model

but we must use the copy Prof Jonas created for us in a directory called cur to practice on so we don't mess up everyone's trains/decodes. Prof Jonas also wants us to follow this naming convention:

Whenever we make a change to a script, rename the directory it is in (which up to that point is called 'cur') to be the next number in the set of directories. Right now there are 16 older directories and one cur. So when we change the script, we rename the directory it is in to '17', and create a new 'cur' directory as well. We do this with mv -i cur 17 and then mkdir cur

Like so:

cd /mnt/main/scripts/user/History/parseLMTrans/

mv -i cur 17

mkdir cur

There is a History directory where you put older items. This is different from the Linux command 'history', which gives you a list of all the commands you've typed into your terminal.
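The whole rotation can be rehearsed in a throwaway directory first (the /tmp path is a stand-in for /mnt/main/scripts/user/History/parseLMTrans/):

```shell
hist=/tmp/History/parseLMTrans
mkdir -p "$hist/cur" "$hist/16"   # pretend 'cur' and older version 16 already exist
cd "$hist"

# Retire the current script directory as the next number, then start a fresh cur
mv -i cur 17
mkdir cur
ls
```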

Prof Jonas also wants us to create a DELETE directory inside our sub-experiment directories. This is where you should put anything you want to delete but think you might still need. DELETE acts like the Windows Recycle Bin, because Linux doesn't back anything up when you delete stuff. We will be creating more of these sub-experiment directories - we are expected to add more with the addExp script.

-Within LM is the trans_parsed file that is the result of this command with the script that we will be fixing:

parseLMTrans.pl trans_unedited trans_parsed

'''*Note* When you create a new experiment, the addExp.pl script must be run on Caesar. Don't forget to ssh to a drone server afterward, when you go to make a directory with the three-digit experiment number that the script gives you.'''

2/15/2018 Both Isaac and I have managed to run successful trains/decodes. Camden Marble and others have put together a guide that is on my Experiments page. https://foss.unh.edu/projects/index.php/Speech:Exps_0303_010

2/17/2018 I worked on the proposal for the data group for the last couple of days, emailing Prof Jonas for his critique and advice for more tasks we can do. I also kept in contact with my group to talk about the proposal.

2/18/2018 Today I met with one of my group members, Tri, to prepare for our discussion in class tomorrow about the data for the project. I also read everyone's logs and each group's channel on Discord.


 * Results: 2/14/2018

I improved my knowledge of Linux commands and how to troubleshoot problems using them.

Prof Jonas also identified a Data Group problem with the parseLMTrans.pl script relating to brackets. It is causing the Language Model to substitute incorrect words: the words (or parts of words) appear in brackets [ ] in the dictionary, but not in the Language Model, so the words get a probability of 0% and other words are substituted instead. Fixing this involves using Regular Expressions to filter out the unwanted characters.

2/15/2018 I now have data that I can use for when we begin tweaking the scripts to figure out the problem with the brackets. Tri and I discussed ideas to put in the proposal.

2/17/2018 We now have a viable proposal.

2/18/2018 We have a guide of topics to speak about in class.

 * Plan: 2/14/2018

The Proposal - study the Proposals from 2014 and 2015. The Discord group also has a suggested template based on 2014's proposal. Write our section of the Proposal and suggest edits to the rest of the class's. We need to get this written at our Thursday meeting. The class as a whole has a deadline of Saturday night.

For next week's class be ready to explain:

What the data is, exactly. Dictionaries, audio files, transcripts, what else? Explain the tags/annotations/filler words, give samples of the tagged words and explain them. We must know what scripts are removing/substituting what data. Study genTrans.pl and be able to explain what genTrans does when a word has a tag. (But we don't have to say how genTrans works, line by line, in the code). Does it keep the word, get rid of it, get rid of part of it, suggest a word and/or replace it? Explain the whole process: this particular input -> genTrans.pl -> this particular output. Draw conclusions - "Here is what is being done and why."

Examine the History directory to see exactly what it contains and what we should use it for.

Prof Jonas has tasked us with fixing the problem with the parseLMTrans.pl script. We will need to know more about Regular Expressions, perl, and Linux commands.

I am familiar with regular expressions but it looks like more advanced knowledge is needed, so here's a regular expressions tutorial:

https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux

And here is a regular expression tool. The CheatSheet I find especially useful.

https://regexr.com/
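As a quick warm-up with those regexes, grep can find bracketed annotations in a toy transcript line:

```shell
# Two made-up transcript lines, one with a bracketed filler word
printf 'sw2001A-0001 [noise] okay\nsw2001A-0002 fine thanks\n' > /tmp/toy.trans

# -E enables extended regexes; the pattern matches [noise], [laughter], [laughter-word], etc.
grep -E '\[[a-z-]+\]' /tmp/toy.trans
```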

Run a successful train/decode.

2/15/2018 Begin work on the Data Group's section of the proposal. Look over proposal material that Tri has sent to Isaac and me. Watch a video Camden posted about speech recognition from Microsoft Research https://www.youtube.com/watch?v=q67z7PTGRi8&feature=youtu.be Read everyone's logs for the week so far. Study regular expressions and Linux commands.

2/17/2018 Monday I am doing a presentation of sorts to cement what we need to know for Tuesday's class. Do some reading on the 30-step Speech Corpus Setup and perl, and try out Linux commands and regular expressions. Also, read everyone's logs and follow up on linked information our class has posted on Discord.

2/18/2018 Create a new experiment and modify the regular expressions in the parseLMTrans.pl to see how the results differ from the output we get now.

 * Concerns: 2/14/2018 While I have a better understanding of what the Data Group does, I do not want to spend a lot of time studying the wrong things, which can easily happen when looking for information. Also, I still have not managed to run a successful train/decode.

2/15/2018 I need to figure out more to do with the data than just look at it. Identifying problems my team can address is the hard part.

2/17/2018 We are moving past concepts and into examining the inner workings of the scripts and possibly writing some ourselves. This will be a challenge for me. Also, one of our group members is mostly incommunicado, so my group has gone from 4 members to 2, effectively. I will reach out to him Monday to see if I can get him to contribute something, but if not I will be taking further action.

2/18/2018 It seems as if the other groups are making more progress. I was unable to meet with the other member of my group today; we are meeting tomorrow before class, so I hope we can catch him up on everything that's gone on this week.

Week Ending February 26, 2018

 * Task: 2/20/2018 At the group meeting before class we did two main things: prepared our talk about the data for the next class, and experimented with two pieces of software, Audacity and a site called convertio, to try to clean up background noise from audio files. So far we have had good results getting rid of extraneous pops and static, but it is very time-consuming to do by hand. We discussed maybe using a script if we can figure out how to eliminate everything but human voices.

I spent much of the day revising the proposal.

2/21/2018 I spent several more hours on the revision after a series of emails from Prof Jonas last night. He recommends using the format that 2015's class did. We also used the table formatting from the class of 2017, since you can't use HTML tags except for colors. I also spent much of the day using Discord to coordinate on the proposal with the other group leaders - those of us who volunteered to organize our efforts and be the points of communication between groups.

I met up with my group and tried to run a train, but got an error due to the server being taken off-line without my knowing. Arias helped us and eventually I was able to create an experiment. Later I met up with Daniel Rubin and Tri in the server room, where they were running an experiment. Dan showed us what he was currently doing - cloning Asterix's image by booting into a persistent bootable Debian distro (a version of Linux on a flash drive).

2/22/2018 Attempted to run a train/decode for a modified parseLMTrans.pl script and hit a problem that can be handled in a few different ways. When running a train/decode, if you get an error that says sclite: command not found, you may need to make sure there is a symbolic (soft) link to the sclite program in /mnt/main/local. The drive /mnt/main/local was mounted, but /usr/local should be symbolically linked back to /mnt/main/local as well. The installation of SCLite is on Caesar at /mnt/main/local/usr/local/bin/sclite; because /usr/local wasn't linked, the shell couldn't find it.
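The shape of the fix can be mimicked with temporary paths (the real link runs from /usr/local to /mnt/main/local and needs root; everything below is a stand-in to show the idea):

```shell
# Stand-in for /mnt/main/local, with a fake sclite inside
mkdir -p /tmp/main_local/bin
printf '#!/bin/sh\necho sclite ok\n' > /tmp/main_local/bin/sclite
chmod +x /tmp/main_local/bin/sclite

# Stand-in for the missing /usr/local -> /mnt/main/local soft link
rm -f /tmp/usr_local
ln -s /tmp/main_local /tmp/usr_local

# Now the program is reachable through the link
/tmp/usr_local/bin/sclite    # prints "sclite ok"
```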

If you need to run the experiment without doing the symbolic link, you can: delete your experiment and redo it on a different server, or you can delete the three empty files (hyp.trans, scoring.log and decode.log) and delete the contents of the language model directory (LM), with the rm command, then ssh into a different server and redo just these commands:

cp -i /mnt/main/corpus/switchboard/5hr/train/trans/train.trans trans_unedited

/mnt/main/Exp/0303/026/parseLMTrans_nob.pl trans_unedited trans_parsed

./lm_create.pl trans_parsed

cd ..

cd etc

SKIP THIS COMMAND [awk '{print $1}' /mnt/main/corpus/switchboard/5hr/test/trans/train.trans >> /mnt/main/Exp/0303/037/etc/037_decode.fileids]

nohup run_decode.pl 0303/037 0303/037 1000 &

parseDecode.pl decode.log hyp.trans

sclite -r 037_train.trans -h hyp.trans -i swb >> scoring.log

2/26/2018 I spent the morning trying to run down a question for Steven. He had run an experiment and noticed that, depending on the SPKR (the person who was speaking), the Err rate in the scoring.log file was different, and that some speakers had very high error rates.

 SPKR    | # Snt # Wrd | Corr   Sub    Del    Ins    Err   S.Err
 sw2013a |   1      4  | 50.0   50.0   0.0    50.0  100.0  100.0

I asked him to see if he could get the full file name of the audio clip that went with speaker sw2013a so I could see if the file was corrupt, but ran into a problem with trying to download the file using Filezilla. Steven was using the 300hr corpus, which has a huge amount of files ~ 250,000. Filezilla hangs and times out when trying to open the directory that contains the audio files. I can cd into the

/mnt/main/corpus/switchboard/300hr/train/audio/utt

directory, and a find command (find sw2013A-ms98-a-0022.sph) will locate the file, but I have not been able to figure out how to download it to my laptop. I tried the scp (secure copy) command, but it says my local computer's name is not known. I have other work to finish up, so I will work on that tomorrow, and if I can't get it, ask for help before class.

I read everyone's logs and the daily commentary on discord to see what experiments have been run and what the results were. I also spent an hour studying regular expressions, as most of the modifications we plan to do involve trying to improve the WER by changing what gets filtered out, in regard to brackets and what is inside them. This could be a filler word like [laughter] or [noise] but could also have words being said along with the laughter or noise. Then I went over logs from previous semesters to see what they have already tried and what issues occurred because of the modifications they made.


 * Results: 2/20/2018 Audacity did not recognize the .sph audio file type (stored in /mnt/main/corpus/switchboard/(5hr, 30hr, 145hr or 300hr)/train/audio/utt). We used convertio.co to convert the files from .sph to .wav, and after using Audacity to clean up random background pops and hisses, converted them back from .wav to .sph using convertio.

The resulting .sph file has fewer kilobytes of data than the original, but has the same length, timewise. We will have to experiment to see whether this results in a problem as far as matching up with the corresponding transcript.

2/21/2018 The proposal is finally complete. I spent 22 hours this week on it due to all the revisions.

2/22/2018 I completed the train and decode for experiment number 037. See results on the wiki experiments page; the Avg was:

 Sum/Avg |  22    403 | 76.7   17.9    5.5    6.2   29.5   90.9

Interestingly enough, Isaac ran the exact same experiment on a different server and got different results. His experiment number is 041 and his Avg was:

 Sum/Avg |   8    123 | 79.7   13.0    7.3    5.7   26.0   87.5

We speculate that each experiment chooses random audio files, as the server choice ought not to matter, since all the files actually live on Caesar.

2/26/2018 After listening to the audio files, which I downloaded a few at a time in Filezilla from the 30hr directory, I noticed that a few words have a _1 after them, such as because_1. I believe it may be there to note that the person used a slang or abbreviated pronunciation; for because_1, the person said "'cuz".


 * Plan: 2/20/2018 Clean up a number of audio files and then run a train/decode to see if and how the WER is affected. We may need to create a new corpus for the altered audio files and change the location in the instructions Camden posted for running a train/decode.

All-In-One.txt - look at the 15 Feb 2018 entry. Work on the proposal after class ends with those who are staying late. Tomorrow the Data Group is meeting at 2pm.

2/21/2018 I will finish the experiment tomorrow. We have a group meeting planned for 1pm.

2/22/2018 Study the scripts Tri has modified and try to help fix the problems he is having with adding words to the dictionary. Listen to 60 audio files and compare their corresponding transcripts to the "truth transcripts".

2/26/2018 Figure out the best way to transfer .sph audio files to my laptop. Listen and compare to transcripts. Study the scripts Tri Nguyen modified, as he is reporting issues with errors relating to a lot of words that are not in the dictionary.


 * Concerns: 2/20/2018 I am concerned about using an outside website to do our converting - that may not be allowed due to intellectual property rights. We may need to find a different way to convert the files.

2/21/2018 I hope Prof Jonas likes the proposal after all the work we put into it.

Isaac was unable to run an experiment - he keeps getting:

[iam1002@caesar ~]$ addExp.pl -s

Can't open -s: No such file or directory at /mnt/main/scripts/user/addExp.pl line 81. (This is a known bug that should not stop him from creating an experiment.)

please enter your username -> AD/iam1002

Credentials: AD/iam1002

please enter your password->

Please enter the main experiment number (Ex: 0268)->0303

Please enter the sub experiment number (Ex: 001)->041

What is your sub-experiment's name?->isaac test

ERROR: ({"error":{"code":"badtoken","info":"Invalid token"}}) (We have asked the Experiment group about this)

2/22/2018 None at present. Isaac has found he can manually add experiments to the wiki experiments page.

2/26/2018 I want to accomplish more than I am. I feel like everything is taking much longer than it should just to get things up and running, leaving less time for actual work.

Week Ending March 5, 2018

 * Task: 2/28/2018 Today I had a meeting with Tri Nguyen to go over the changes he has made to several scripts (parseLMTrans_2018, makeTrain_2018 (just changed to point to genTrans_2018) and genTrans_2018). These changes essentially consist of edits to the regular expressions in the scripts. The regular expressions, or regexes, filter out various elements of the transcript in parseLMTrans_2018 and genTrans_2018. We spent some time reviewing the different regex symbols and trying them out on this site: http://rextester.com/l/perl_online_compiler

For example, a longtime problem for the data group has been deciding what to do with words and/or partial words with brackets, some of which have dashes afterward, such as [laughter] or wh[at]- or sh[e]. Whether we choose to filter out just the bracket characters, the brackets and the dashes, or the brackets, the dashes and the words inside, each option causes problems. Getting rid of [at] in wh[at]- means "wh" will throw an error stating that "wh" is not in the dictionary. Get rid of just the brackets and you have "what-", which is not in the dictionary either. There are dozens to hundreds of cases like this, so you can't add them all in by hand. Previous capstone students have tried using scripts that automatically add the words to the dictionary, but not the phonemes, which you also need. So far I know of one site with a tool that will generate phonemes if you manually type words into its text field:

http://www.speech.cs.cmu.edu/cgi-bin/cmudict Since it is an official CMU site, we may be able to create some sort of script to do this task automatically, perhaps with help from the software group.

The dashes are another issue. You can't get rid of all of them because some actually do connect two words together. Yet a word like "what-" will not be in the dictionary.
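The trade-off above can be seen directly with sed. This is only an illustration of the three filtering choices (the sample line is made up); it is not the project's actual script logic.

```shell
# Illustration of the three filtering choices on a made-up transcript line.
line='yeah wh[at]- i mean sh[e] [laughter] said so'

# 1) strip only the bracket characters: leaves "what-" (not in the dictionary)
echo "$line" | sed 's/[][]//g'

# 2) strip brackets and their contents: leaves "wh-" and "sh" (also not in the dictionary)
echo "$line" | sed 's/\[[^]]*\]//g'

# 3) strip brackets, contents, and trailing dashes: leaves bare fragments "wh" and "sh"
echo "$line" | sed 's/\[[^]]*\]-*//g'
```

Each choice produces tokens the dictionary does not know, which is exactly the dilemma described above.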

3/2/2018 I went through this week's logs.

3/3/2018 I did a 2-hour conference call with Isaac to catch him up on what Tri and I went over while he was ill. I also watched the Microsoft Automatic Speech Recognition (ASR) video.

3/5/2018 I spent the day modifying regular expressions to try to get rid of the "-" character and anything in front of it without removing anything that might come after it.

I ran 5hr train/decodes on Idefix and Obelix using the old scripts.


 * Results: 2/28/2018 I understand regexes better and got some practice using the Perl compiler site. I created an experiment but then got a "permission denied" error when trying to mkdir a corresponding directory. Tonight Arias helped me fix it with Linux commands: go in as root, cd into the directory where you want to create a new sub-experiment directory (it must use the matching number from when you ran the addExp.pl -s command), and do a chmod g+w command to give write privileges to your entire group. Use ls -l to actually see the listing of permissions.

Note: addExp.pl needs one of two flags, according to Arias: addExp.pl -r creates the root experiment directory (such as 0305, a four-digit number); then, to add a sub-experiment directory inside the root directory, type addExp.pl -s.
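The fix Arias described can be sketched as follows. The directory number is hypothetical, and the demo runs in a scratch directory rather than /mnt/main/Exp.

```shell
# Sketch of the permissions fix (directory number hypothetical; real dirs live under /mnt/main/Exp)
cd "$(mktemp -d)"    # demo in a scratch directory
mkdir 022            # sub-experiment directory; number must match the one given to addExp.pl -s
chmod g+w 022        # grant write permission to the whole group
ls -ld 022           # verify: the group's "w" bit should now appear, e.g. drwxrwxr-x
```

The key point is checking the group triad in the ls -ld output before assuming teammates can write into the directory.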

3/2/2018 Arias' log in particular was helpful in that he explained some Perl concepts and the fixes to problems I'd had regarding Linux commands, creating directories and experiments.

3/3/2018 Isaac now has the tools, the links and a little experience editing the regular expressions in the Perl scripts. Much of the video covered the same information as the tutorial on the CMU site.

3/5/2018 On Idefix, I got a WER of 27.6; on Obelix, I got a WER of 25.0. So there is a difference in output depending on the server.


 * Plan: 2/28/2018 Do my own modifications to the scripts and run short (5hr) train/decodes. Meet with Isaac Friday night, as he wasn't feeling well today. Begin the free Udemy course in Perl that several others in the class have taken and approved of. https://www.udemy.com/learn-perl-in-just-7-days/

3/2/2018 Meet with Isaac on Saturday, as he still wasn't feeling well. Watch a 1 1/2-hr video Camden recommended on speech recognition to get a better idea of the big picture here and how the data group fits in with it. https://www.youtube.com/watch?v=q67z7PTGRi8&t=10s

3/3/2018 Create a table to record all the regular expressions I try so I don't repeat myself. Prof Jonas wants us to figure out why running identical scripts on different servers gave us three different WERs.

3/5/2018 Do a 5hr train on Idefix and Obelix using the new genTrans_2018.pl and parseLMTrans_2018.pl scripts to see if I also get differing results while using two different servers.


 * Concerns: 2/28/2018 Nothing too urgent. Just a lot of small steps that take a lot of time. I also want to use a chart or something to make it easier to keep track of what I try, so I can avoid repeating myself and can see everything at a glance.

3/2/2018 Beyond manipulating regular expressions, I'm trying to think of other ways to improve the WER, but I feel that will probably require a more in-depth knowledge of the actual process of speech recognition itself.

3/3/2018 How to identify the error Prof Jonas wants us to figure out.

3/5/2018 If the problem is server-related, I'm going to have to seek help from the Systems group and probably the Experiment group. We ought to be able to reproduce the same results no matter which server we are using.

Week Ending March 12, 2018

 * Task: 3/6/2018 It turns out that almost none of the decodes our class has done have run correctly. There are supposed to be 4172 lines/sentences from train.trans being decoded and output to hyp.trans, but the hyp.trans file only has a few lines (fewer than 20) in it even though the corpus (for 5hr) is 4172 lines. To see the results of all the scoring.log files in a directory, cd into 0305 (the parent experiment directory), one level up from where you keep your sub-experiment directories, and type

 tail */etc/scoring.log

to see the last 10 lines of each scoring.log in that directory (the bottom of each table).

3/7/2018 I did some troubleshooting and email correspondence with Prof Jonas. First I tried running "parseDecode.pl decode.log hyp.trans" in my /mnt/main/Exp/0305/001/etc directory and immediately got just my prompt back: [ras1002@idefix etc]$. I don't think it did anything.

I checked to see if there were any processes running by using the top, ps, fg and bg commands for linux. They all said 0 or No current job. Then I used tail decode.log to see the last 10 lines of the decode.log file to see if it completed successfully. (The last line should end with grep sphinx3_decode)

3/8/2018 Prof Jonas assigned the data group some more specific tasks. One is to study the corpora to see if the same speakers are in both the dev and eval test data sets and the training data set. What is the percentage, i.e. are only some speakers in both? We need to come up with fair training and test (dev/eval) sets.

Also, he wants us to recreate Tri's experiment from /mnt/main/Exp/0303/026, and redo a couple of experiments: modify the genTrans.pl and parseLMTrans.pl scripts to keep all dashes "-" but leave out all brackets [ ] in 0305/008, and then redo 0305/009 to take out all markings on any words.

3/11/2018 Read everyone's logs and edited mine to include more thorough explanations and useful commands.

3/12/2018 I redid the documentation for my two experiments in /mnt/main/Exp/0305/001 and 0305/003. I also made contact with my new team, the Guardians.


 * Results: 3/6/2018

 grep 'Sum/Avg' ????/???/etc/scoring.log

I see there are two different results for experiment 0305/002:

 0305/002/etc/scoring.log:     | Sum/Avg |    3     45 | 86.7    8.9    4.4    8.9   22.2  100.0 |
 0305/002/etc/scoring.log:     | Sum/Avg | 4172  60048 | 73.0   19.6    7.5    9.1   36.1   89.6 |

and three results for 0305/005, two of which are the same:

 0305/005/etc/scoring.log:     | Sum/Avg |   22    402 | 76.6   18.7    4.7    7.2   30.6   95.5 |
 0305/005/etc/scoring.log:     | Sum/Avg | 4172  60048 | 73.0   19.6    7.5    9.1   36.1   89.6 |
 0305/005/etc/scoring.log:     | Sum/Avg | 4172  60048 | 73.0   19.6    7.5    9.1   36.1   89.6 |

I thought maybe there might be more than one scoring.log file in the sub-experiment folder, so I checked the contents of the etc directory for 002, 005 and, for comparison, 001. I thought maybe Tri had run a second decode (I will ask him about that), but even if he did, shouldn't there be three decode.log files? And 002 only has a single decode.log file. None of them have more than one scoring.log file, either.

There's one more decode from earlier in the semester that has the right number of lines. It has no duplicates when using the grep command above.

 0303/042/etc/scoring.log:     | Sum/Avg | 4172  60215 | 72.5   19.6    7.9    7.3   34.8   87.5 |

3/7/2018 So it turns out that when you run parseDecode.pl decode.log hyp.trans, it DOES NOT show what is going on after that, unlike when you run the makeTrain command; it just gives you back your prompt. You must make sure the decode is done before running the next command (see the Experiments page, Spring 2018 0305/001, for a guide with commands)! Dan Beitel has done a successful train/decode/scoring on Obelix, which took approximately 2 hrs 40 minutes. To monitor his progress, he opened a second Terminal window, ssh'd into the same server and ran "top" to see the CPU use, which is the only indication that the decode is still ongoing. His results are in /mnt/main/Exp/0308/010. The score was 34.8 using the older scripts from the original wiki instructions, which matches the one in experiment 042 above.

**EDIT** You can view what processes are happening by using one of several commands:

tail -f fileName.extension (if you are in the directory that has the file you want to view).

The -f flag ("follow") keeps printing output to the Terminal as the file grows, starting with the last 10 lines.

For example: tail -f /mnt/main/Exp/0305/001/etc/decode.log, or tail -f -n 20 fileName.extension to start with the last 20 lines, or however many you want. The oldest lines simply scroll off the top of the Terminal window as new output arrives.

tail -n 50 fileName.extension | more to see the last 50 lines, but not as they happen; the "more" part just lets you page down a little at a time by pressing the space bar. Ctrl + c to quit.

3/8/2018 I spent some time working with Steve from the Model group to try to identify whether 4172 is the number of lines for just the 5hr corpus. Using the command "grep FWDVIT decode.log | wc" in the /etc folder of individual sub-experiment folders in 0300, 0303, and 0305: 5hr is 4172, 30hr is 3992, and 300hr is 4034. Which makes no sense; shouldn't the larger corpora have more lines?
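The line-counting check can be wrapped up like this. A sketch only: count_utts is a made-up name and the sample log lines are invented stand-ins, since a real decode.log has much more in it.

```shell
# Sketch: count decoded utterances by counting FWDVIT lines in a decode log.
# count_utts is a hypothetical helper; the sample lines below are invented.
count_utts() {
  grep -c 'FWDVIT' "$1"
}

# demo on a tiny stand-in log (a correct 5hr decode should report 4172, per the counts above)
tmp=$(mktemp)
printf 'FWDVIT: sw4928A-ms98-a-0040 ...\nFWDVIT: sw4936B-ms98-a-0055 ...\nsome other log line\n' > "$tmp"
count_utts "$tmp"
rm -f "$tmp"
```

grep -c is a shade tidier than grep | wc, since it prints only the match count.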

I spent half a day editing regular expressions for the three scripts the Prof wants us to modify, and more time looking for Linux commands to figure out whether there are identical speakers in different corpora. /mnt/main/corpus/switchboard/300hr/test has all three transcripts: dev.trans, eval.trans and train.trans. Using:

 [ras1002@caesar trans]$ diff -sq train.trans dev.trans
 Files train.trans and dev.trans differ
 [ras1002@caesar trans]$ diff -srq train.trans eval.trans
 Files train.trans and eval.trans differ

And to see which lines they share, I used:

 [ras1002@caesar trans]$ diff -s train.trans dev.trans
 [ras1002@caesar trans]$ diff -s train.trans eval.trans

and I got a huge list of lines that are different from each other.

 > sw4914A-ms98-a-0018 117.226000 124.005875 and i just got into the w[e]- FidoNet's now has a thing where we can get in with the internet so i found out
 > sw4917A-ms98-a-0021 73.602375 77.124750 yeah so what do the Miatas run
 > sw4917B-ms98-a-0062 277.921750 280.592375 yeah Minivans d[o]- do pretty well

3/12/2018 /mnt/main/Exp/0305/001 is an example of the issue we had with doing a scoring command before the decode was finished, and /003 (with scoring re-done) has list of all commands used and how I verified my results using the logs and Linux commands. We set up a meeting time for the Guardians.


 * Plan: 3/6/2018 Tomorrow I will do a more thorough investigation of the contents of these three experiments which appear to have successfully completed the decode with all 4172 sentences intact.

3/7/2018 I still need to get familiar with what's actually in the logs that have been produced. This includes the decode.log, hyp.trans, and scoring.log files, the (sub-experiment number).html file, and the logs in the logdir directory. Use wc (to count lines) and grep (to search) for errors, warnings and FWDVIT (decoded utterance) lines, looking for anything that doesn't seem to be the right result.

I will also be trying a train/decode/scoring with the newer 2018 scripts the data group has been modifying. I have developed a regular expression to remove the first half of a compound word that is joined by a dash "-" such as [laughter-then] so it will remove [laughter-  ] but leave the "then". I just need to do the same for some of the other filler words such as [noise-then].

3/8/2018 Tomorrow I will try to track down individual speakers and see if they are represented in more than one data set (training vs. dev and eval). I need to research Linux commands like "diff".

3/12/2018 Isaac and I have been tasked with coming up with a new way to take samples of audio and transcripts to make a set of new corpora. The test sets of data (dev.trans and eval.trans) in /mnt/main/corpus/switchboard/300hr/test/trans were both created from (transcript file) samples taken evenly across the original 311 hours of data (what's left is now just called the 300hr corpus, which is the data set known as "train"). Apparently, this means we have sentences from the same speakers in all three data sets, so we are not getting an unbiased train: we are essentially peeking at some of the information in the two testing data sets, which ought to be unseen (completely unknown) to the speech recognition software if we want a true test of how accurate it is.
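One way to build sets with no shared speakers can be sketched as a hypothetical awk helper (not one of the project's scripts): assign each distinct speaker wholly to one set the first time that speaker appears.

```shell
# Hypothetical sketch: prefix each transcript line with a set name so that all
# of a speaker's utterances land in the same set. Roughly every 10th distinct
# speaker goes to dev; everyone else goes to train. assign_sets is a made-up name.
assign_sets() {
  awk '{
    spk = $1; sub(/-ms98.*$/, "", spk)            # sw4917A-ms98-a-0021 -> sw4917A
    if (!(spk in chosen))
      chosen[spk] = (++n % 10 == 0) ? "dev" : "train"
    print chosen[spk], $0
  }' "$1"
}
```

Splitting then reduces to something like grep '^dev ' and grep '^train ' on the labeled output, and no speaker can end up in both sets.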
 * Concerns: 3/6/2018 Just trying to debug script(s) for the decode process and how new issues keep pushing back progress on our other work.

3/7/2018 I'm hoping for a better score than 34.8.

3/8/2018 Moderate concerns. A lot of work this week is just reading/research, which is not hard, but is very time-consuming and leaves little time to actually do the work needed after the research has proven fruitful.

3/12/2018 We will probably need to develop a script to create the new corpora, which I have not done before. And I need to do that Perl course.

Week Ending March 26, 2018

 * Task: 3/13/2018 Prof Jonas has asked the data group to do some research and come up with a fair way to choose samples of speakers to create a new set of corpora, because many speakers have more than one line, each of which is a separate .sph audio file (with a corresponding transcript). When the previous capstone class created the corpora, they just took every 100th line, meaning some people probably have lines in the 300, 145, 30, and 5hr data sets (the "train" sets) as well as the test sets (dev and eval). Apparently this is cheating, as it means the acoustic model has already been exposed to those people's voices during training, so when we go to do the test, it's not really "unseen" data, and that's not a fair test. A real test in real-world use is when the speech rec system is transcribing somebody live, in real time, so of course that would be genuine, unseen, spontaneous data/voice.


 * Results:3/13/2018 Right now I am trying to figure out how to identify all the speakers in the dev set so I can see if they are all also in the train set. I used the "more" command to verify that the same speaker does have more than one utterance in each of the three files, at least at the beginning of each of the three data sets. To check in all of them, I am thinking of something like a combination of the sort and uniq commands, but uniq would filter out different lines spoken by the same speaker.

So probably I need a script. I spent all morning adapting this one from Professor Jonas, after a few false starts, and created a regular expression using this IDE for Perl: http://rextester.com/l/perl_online_compiler It filtered out everything except the speaker's unique ID. Ex. "sw4917A-ms98-a-0021 73.602375 77.124750 yeah so what do the Miatas run" leaves only 4917A.

 foreach word ('sed "s/sw//" mnt/main/corpus/switchboard/300hr/test/trans/dev.trans | sed "s/-ms98-a-[0-9]* [0-9]*.[0-9]* [0-9]*.[0-9]* .* //" | sort | uniq')  # filter out everything except the speaker's id, sort the resulting set and eliminate duplicates
   echo -n "$word " >> devSpeakers.trans  # create this file and append the value in "word" to it
 foreach word ('sed "s/sw//" mnt/main/corpus/switchboard/300hr/test/trans/train.trans | sed "s/-ms98-a-[0-9]* [0-9]*.[0-9]* [0-9]*.[0-9]* .* //" | sort | uniq')
   echo -n "$word " >> trainSpeakers.trans
 foreach word ('sed "s/sw//" mnt/main/corpus/switchboard/300hr/test/trans/eval.trans | sed "s/-ms98-a-[0-9]* [0-9]*.[0-9]* [0-9]*.[0-9]* .* //" | sort | uniq')
   echo -n "$word " >> evalSpeakers.trans
 end

Unfortunately, when I tried to run it in the terminal I get this message:

 foreach?

I feel like it's waiting for input, even though I specified the path name of the source file that I want it to take input from.

I have emailed the Professor to ask for help, done some online searching and used the man pages to try to figure out how to get it to run. I may simply be using the wrong syntax, as most of what I've seen uses "for", not "foreach"; "foreach" is csh syntax while bash uses "for", so it may come down to which shell I'm in rather than the Windows PuTTY client itself.

So I spent the rest of the day doing research on corpora, trying to find best practices and guidance on how other people have chosen to represent the speakers in their data. All of a speaker's lines are grouped sequentially together, so we could simply take a data set from the beginning or end and thus ensure that we don't have the same speaker represented in more than one data set. But as the utterances include many different U.S. dialects, we would be missing out on some of those, so the corpus might not have an adequate sampling. Another way is to take all the utterances from each speaker, but spread the choices out over the entire corpus, such as every 500th line plus all nearby utterances that also belong to that speaker.
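As a cross-check on the csh loop I have been fighting with, the same idea can be sketched in bash/sed. The function name is made up and the demo transcript is a stand-in for the real files.

```shell
# Bash/sed sketch of what the csh loop is trying to do:
# reduce each line to its speaker ID (sw4917A-ms98-a-0021 ... -> 4917A), sort, dedupe.
# extract_speakers is a hypothetical helper, not one of the project scripts.
extract_speakers() {
  sed -e 's/^sw//' -e 's/-ms98.*$//' "$1" | sort -u
}

# demo on a stand-in transcript
tmp=$(mktemp)
printf 'sw4917A-ms98-a-0021 73.60 77.12 yeah so what do the Miatas run\nsw4917B-ms98-a-0062 277.92 280.59 yeah Minivans do pretty well\nsw4917A-ms98-a-0030 80.10 82.20 more from the same speaker\n' > "$tmp"
extract_speakers "$tmp"
rm -f "$tmp"
```

Running it once per transcript (train/dev/eval) produces one sorted speaker list per set, ready for comparison.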


 * Plan: 3/13/2018 Keep working on the script and the research. Study the older logs and guide to creating a corpus from previous semesters. Begin reading more about modeling, as we have now been assigned to the two Teams for the rest of the semester (mine is the Guardians).


 * Concerns: 3/13/2018 Tri has been put on a different project, so it is just Isaac and me for now. There are a lot of time-consuming things on the agenda.


 * Task:3/16/2018 Spent the evening after work researching Linux commands to debug a script that gets speaker IDs from a transcript file, so I can use that to compare who is in which data set. Also, I read logs.

 foreach word ( echo `sed "s/ .*//" /mnt/main/corpus/switchboard/300hr/test/trans/train.trans` )

and then I gave the foreach? prompt that appeared the command echo $word (the $ gets the value contained in the variable "word"), and then foreach? end. I got a list of just the first part of each utterance:

 sw4928A-ms98-a-0040
 sw4936B-ms98-a-0055
 sw4940B-ms98-a-0027
 sw4940A-ms98-a-0074

 * Results:3/16/2018 Prof Jonas contacted me. It turns out that I had several typos. The command should be `sed "s/ .*//" /mnt/main/corpus/switchboard/300hr/test/trans/train.trans`: I was missing the forward slash in front of /mnt and was using single quotes instead of back-quotes in front of sed and after trans (back-quotes take the command inside, execute it, and pipe the result as input to whatever command is applicable).


 * Plan:3/16/2018 Tomorrow after work I need to modify this script to just get the speaker's id, ex. 4918A. Then I need to do more corpus research on best practices for choosing samples of speakers to make up the new corpora Prof Jonas wants.


 * Concerns:3/16/2018 Nothing pressing at this time.


 * Results:3/18/2018 I kept getting errors while trying to modify the script. I got it to filter out the "-ms98-a-" and the four numbers after it once, but then when I tried to also include the "sw" at the beginning of the line, so that only the four-digit number and A or B would be left, that didn't work. So I tried to do just the "sw" first, as a separate regular expression, and run that with the sed and echo commands above, but I got a repeating line of the letter y (y y y y...), and after that everything I tried either did that or gave Command Not Found.


 * Plan: 3/18/2018 Keep trying to modify the regular expression, and do more research on the switchboard corpora. Also, meet with Isaac at 2:30pm. The next day will be our first Guardians team meeting.


 * Concerns:3/18/2018 Figure out how to get the regular expression to play nice with sed and echo.


 * Task:3/20/2018 Get our new Team Guardians organized. Document my research on the wiki and report on it during class. Ask whether the Prof wants new corpora, given that my research has uncovered conflicting opinions as to whether having some of the same speakers in both the train data set (train.trans) and the two test data sets (dev.trans and eval.trans) matters. See the articles and whitepapers 1 and 2; basically, one paper says it doesn't matter, because the results from IBM's speech recognition software were within one standard deviation of the same results when the transcripting was done by actual humans. Also, the average machine WER of 7.5% for “unseen” speakers is well within one standard deviation of the “seen” speakers’ WER distribution (5.9% ± 2.7). The other paper says it does matter. Ask for permission to copy the 300hr corpus to a flash drive so that the ~250,000 .sph files are more easily accessible/sortable than with Filezilla, which basically chokes on that much data. Ask to go over the parseLMTrans.pl script that the Prof wants us to modify so it stops removing the tags for [LAUGHTER] and [NOISE].


 * Results:3/20/2018 Guardians group meeting went well; we shared general information and set up a regular meeting time. The Prof said we can develop a plan to create new corpora, but due to my research it's not as pressing as fixing the problem with the parseLMTrans.pl script. Isaac received permission to copy over the 300hr corpus. We are also going to copy over the "full" corpus, since that contains the complete set of all data we would use to create new corpora.

Tri and I spent two hours after class with the Prof debugging the regular expressions in the scripts the Prof wrote last week as an alternative to parseLMTrans.pl; those scripts did not do everything parseLMTrans.pl did, they just filtered out various combinations of words that had the brackets and any dashes. While debugging, Tri and I learned more about using Linux commands for debugging and also got to see how to use a code editor called emacs, right in the Terminal. Tri had a pair of scripts to get speakers' IDs, sort them and remove duplicates, then count them to see which speakers are in both the train and test sets. It turns out that 66% of speakers are in both train.trans and dev.trans.
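An overlap count like Tri's can be reproduced with comm on sorted, deduplicated speaker-ID files. This is a sketch with made-up file contents (the demo's 2-of-3 shared speakers is chosen to echo the idea, not Tri's actual data):

```shell
# Sketch: given sorted, deduped speaker-ID files, report what percentage of
# the first file's speakers also appear in the second. overlap_pct and the
# file contents are illustrative only.
overlap_pct() {
  shared=$(comm -12 "$1" "$2" | wc -l)   # lines common to both sorted files
  total=$(wc -l < "$1")
  echo $(( 100 * shared / total ))
}

# demo: 2 of 3 "train" speakers also appear in "dev", so this prints 66
t=$(mktemp); d=$(mktemp)
printf '4917A\n4928B\n4940A\n' > "$t"
printf '4917A\n4940A\n4999B\n' > "$d"
overlap_pct "$t" "$d"
rm -f "$t" "$d"
```

comm requires both inputs to be sorted, which the sort | uniq step in the extraction scripts already guarantees.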


 * Plan:3/20/2018 Top priority is to fix the parseLMTrans.pl script. We can't use the Prof's scripts from experiment 0305/013, because the original parseLMTrans.pl does more than they do, but we can study them to see what the regular expressions do. Research modeling information, including the various parameters that can be modified to improve the WER.


 * Concerns:3/20/2018 Nothing as of yet.


 * Task:3/23/2018 Began editing the online Excel sheet with possible modifiable data-group-related facets for Team Guardians. Reviewed the CMU tutorial and the ASR video.


 * Results:3/23/2018 Worked on the parseLMTrans.pl.


 * Plan:3/23/2018 Begin researching more about Modeling.


 * Concerns:3/23/2018 A LOT of reading...


 * Task:3/24/2018 We had a Team Avengers meeting to get organized and delegate research tasks.


 * Results:3/24/2018 Each of us presented what we had learned and our options moving forward.


 * Plan:3/24/2018 Attend Skype meeting with Team Avengers. Get organized, set a timeline.


 * Concerns:3/24/2018 Back to the beginning of the semester. Just trying to figure it all out.

Week Ending April 2, 2018

 * Task:3/26/2018 Doing a lot of reading about the specifics of Sphinx3 and speech recognition.


 * Results:3/26/2018 Mainly learning a lot more about the modeling process in general and the various options.


 * Plan:3/26/2018 Doing research on Sphinx3 and speech recognition/modeling.


 * Concerns:3/26/2018 Trying to isolate the most important tasks and options from the massive amount of information out there.


 * Task:3/28/2018 Yesterday I asked Steve to help me learn how to modify the sphinx_train.cfg files and where they are located. (They get created in the experiment's etc/ directory when you run a train.) See his log from February 19th, 2018 for a guide:

https://foss.unh.edu/projects/index.php/Speech:Spring_2018_Stephen_Thibault_Log

I had earlier found a file in /mnt/main/root/docs/ that mentions sphinx_train.cfg, but being in "root", I figured that was not something I should touch. I did some editing of regular expressions to fix the bug in the 0305/011 script (which should strip out all bracketed words and the [] characters, but should not strip out the dashes), and then Tri discovered that with a sentence like

 sw2744B-ms98-a-0019 69.227875 76.106875 [noise] -[be]cau[se]- you know up to what the effort they put into it you know i've seen that for years yeah

the second part of the bracketed word doesn't get removed. It should look like -cau-, so we spent some time troubleshooting that. Also did more research on modeling.
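The behavior we want can be sketched with one global sed substitution. This is only a sketch, not the actual script change (which repeats a line of regex instead); strip_brackets is a made-up name.

```shell
# Sketch of the target behavior: delete every [bracketed] fragment but keep
# the surrounding dashes, so -[be]cau[se]- becomes -cau-.
strip_brackets() {
  sed 's/\[[^]]*\]//g'
}

echo 'sw2744B-ms98-a-0019 69.227875 76.106875 [noise] -[be]cau[se]- you know' | strip_brackets
```

Because the substitution is global (/g), it handles words with any number of bracketed parts in one pass, which sidesteps the two-parts-only limitation of the repeated-regex approach.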


 * Results:3/28/2018 I can now modify a .cfg file right in the terminal with vi. Eventually Tri added a line that repeats a previous regular expression, but with dashes added to the beginning of the part that should be matched, so as to strip out the second bracketed part. This probably won't work if a word has more than two bracketed parts, but that should be a very rare occurrence.


 * Plan:3/28/2018 More research. Run a train/decode on 0305/011 and 0305/013 with the newest versions of the scripts. I am meeting with Hannah to begin creating the abstract and poster for the CCSC-NE.


 * Concerns:3/28/2018 Trying to learn enough to begin to come up with a strategy for the team competition.


 * Task:3/29/2018 Met with Hannah to work on the final version of the abstract for the CCSC-NE poster conference, but ran into an issue: there are conflicting instructions about whether references count toward the 300-word limit. I read everyone's logs and have kept up with issues posted on Discord; right now no one is able to run decodes on most of the drones because they are missing some kind of library file. Asterix works, though. Worked with Tri to try to come up with a regular expression that removes everything in front of the forward slash, including what is in the middle set of brackets, but leaves what is after the forward slash in "the [cima[tography]-/cinematography]".


 * Results:3/29/2018 I have emailed Karen Jin, who is on the CCSC-NE committee, about the abstract's conflicting instructions. We had no success at first on the regular expression, but then I suggested Tri use a pair of capture parentheses to grab each half of the word and put them into variables, then just print out the one we want to keep. It worked nicely. There is a problem with our servers; people are getting a missing-library-file message, so I can't run a train right now.
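The capture-parentheses idea can be sketched in sed extended regex as well; the pattern and the keep_after_slash name are illustrative, not the script's actual code.

```shell
# Sketch of the capture-group fix: for patterns like [cima[tography]-/cinematography],
# capture the corrected word after the "/" and keep only that.
keep_after_slash() {
  sed -E 's@\[[^/]*/([^]]*)\]@\1@g'
}

echo 'the [cima[tography]-/cinematography] was great' | keep_after_slash
```

The [^/]* part deliberately swallows the inner brackets and dash before the slash, while the capture group ([^]]*) grabs the corrected word up to the closing bracket.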


 * Plan:3/29/2018 I have to turn my attention toward modeling for a while, as Jonas wants us to spend the bulk of our time doing that now. So, more modeling research.


 * Concerns:3/29/2018 Just pushing on.


 * Task:3/30/2018 I spent the evening after work editing the abstract for the CCSC-NE conference. There are a couple of questions I need answered, as we have to use a template their committee requires. One is what to put in the area marked "Department".


 * Results:3/30/2018 I finished the abstract except for the questions I need answered.


 * Plan:3/30/2018 I emailed Karen Jin, who is in contact with their committee, and also Prof Jonas. Camden found the missing file, so now we can run train/decodes again.


 * Concerns:3/30/2018 Just the usual. Time.


 * Task: 4/1/2018 I ran a train to recreate Prof Jonas' 0305/011 in 0305/022 to try out the three new scripts that we are using in place of parseLMTrans.pl, and to try out the new guide for using them, as it is somewhat different from the one on the wiki.


 * Results: Successful train and decode. Results matched the 0305/017 that Tri did for Team Avengers:

 [ras1002@majestix etc]$ tail -8 scoring.log
 | Sum/Avg | 4172  60569 | 73.8   18.4    7.8    6.5   32.7   88.3 |
 |=================================================================|
 |  Mean   |  1.3   19.2 | 76.5   17.8    5.8   15.2   38.7   88.5 |
 | S.D.    |  0.5   16.5 | 17.8   15.0    7.8   29.2   32.4   29.5 |
 | Median  |  1.0   15.0 | 76.9   15.9    2.1    3.1   33.3  100.0 |
 `-'
 * Plan: Continue with research, both for modeling and to see if I can locate the NIST test set the Prof wants.
 * Concerns: Some of my teammates are having issues with trains/decodes.

Week Ending April 9, 2018

 * Task:4/3/2018 Met with the other group leaders to work on the CCSC-NE poster before class. Worked with Tri, who had finished fixing the parseLMTrans.pl script. Met with the modeling group about running the 300hr train/decode with the new script.


 * Task: 4/5/2018 I got a response to my questions about the CCSC-NE poster:


 * Ensure the text of the final poster abstract (INCLUDING REFERENCES AND ACKNOWLEDGEMENTS) is limited to a MAXIMUM OF 300 WORDS (no exceptions).


 * And, ONE of the students who is among the authors of the posters must register for the conference online by tomorrow (please check the website).


 * For your last question, our department name is “department of applied engineering and sciences”.

Today I am working with Lamia, Dan Beitel and Camden to finish the CCSC-NE poster, which covers the Capstone project as a whole.


 * Results:4/3/2018 It turns out the Prof told Tri to modify the original parseLMTrans.pl to do the same things as the scripts the Prof wrote for 0305/012. So it should not remove either the brackets or the dashes. The Prof also said the CCSC-NE poster should be less about our class and more about what we are trying to get for results, and how we plan to do that.

We explained to the modeling group how to run train/decodes that repeat experiments 0305/011 and 0305/013. 0305/011 and 0305/013 use different scripts than 0305/012. The fixed parseLMTrans.pl that is now in /mnt/main/scripts/user/ and is called by the standard wiki train guide incorporates the changes from 0305/012.

This is the guide to repeat 0305/011, in which we want to remove [bracketed words] but keep the dashes ("-").

(0305/013 is similar, except that there are two versions of the script to prune the dictionary; you want the more recent one, pruneDic_no_brackets_or_minus.pl)

ssh asterix

We do use genTrans.pl: when we call makeTrain.pl, it calls genTrans.pl for us. We just don't want to use genTrans.pl's output as-is, because it didn't change the transcript much (it kept the [] and the -), so we edit _train.trans and .dic ourselves.

(The original _train.trans and .dic were created by genTrans.pl)

REMEMBER TO CHANGE THE DESTINATION DIRECTORIES AND EXPERIMENT NUMBERS FROM THE EXAMPLES SHOWN HERE. REMEMBER WHICH SET OF SCRIPTS YOU ARE USING. 011 AND 013 HAVE SOME DIFFERENT SCRIPT NAMES.

cd /mnt/main/Exp/0305/022

makeTrain.pl switchboard 5hr/train

genFeats.pl -t

cd /mnt/main/Exp/0305/022/etc

cp -i /mnt/main/corpus/switchboard/5hr/train/trans/train.trans trans_unedited

/mnt/main/Exp/0305/011/etc/scripts/parseTrainTrans_no_brackets.pl trans_unedited 022_train.trans

DON'T change the 011 above - it's where the script you're calling is. DO change the 022 to your sub-experiment number.

/mnt/main/Exp/0305/011/etc/scripts/pruneDic_no_brackets.pl /mnt/main/corpus/switchboard /mnt/main/Exp/0305/022/etc/022_train.trans 022

DON'T change the 011 above - it's where the script you're calling is. DO change the 022 to your sub-experiment number.

Note: This command takes three arguments.

cd /mnt/main/Exp/0305/022

nohup scripts_pl/RunAll.pl &

mkdir LM

cd LM

/mnt/main/Exp/0305/011/etc/scripts/convertTrainToLM.pl /mnt/main/Exp/0305/022/etc/022_train.trans trans_parsed

lm_create.pl trans_parsed

cd ..

cd etc

cp 022_train.fileids 022_decode.fileids

nohup run_decode.pl 0305/022 0305/022 1000 &

parseDecode.pl decode.log hyp.trans

sclite -r 022_train.trans -h hyp.trans -i swb >> scoring.log

tail -8 scoring.log
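If you just want the overall WER number out of that table rather than the whole thing, you can pull the Err column from sclite's Sum/Avg row. A small sketch, assuming the usual pipe-delimited sclite layout (the Err value is the fifth number in the third pipe-separated group):

```shell
# Sketch: extract the overall WER (the Err column) from sclite's Sum/Avg row.
# Assumes the standard pipe-delimited sclite table in scoring.log.
grep 'Sum/Avg' scoring.log \
  | awk -F'|' '{ split($4, cols, " "); print cols[5] }'
```

If you've appended several runs to scoring.log with >>, this prints one number per run.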


 * Results:4/3/2018 The CCSC-NE abstract is submitted, and group leaders Lamia, Camden and Dan Beitel are registered. Modeling can finally do the two experiments the Prof wanted on 300hrs.


 * Plan:4/5/2018 Study the difference between the new parseLMTrans.pl and the scripts from 0305/012. Work on the CCSC-NE poster with the new instructions from the Prof. Stay in contact with the modeling group to see if they have any questions as they run the 300hr train/decode with the scripts from 0305/011 and 0305/013. Run experiments for the Guardians and look up more useful modeling information.
 * Concerns:4/3/2018 Data group has new tasks set by the Prof in addition to the time I need to spend on the Guardians work.


 * Concerns:4/5/2018 None at the moment.


 * Task: 4/7/2018 Last revision of the CCSC-NE poster after the Prof reviewed and edited what we sent him. Read logs and responded to two peoples' questions via Discord. Began reading through the Sphinx manual that both Danielle and Wesley posted links to. Reviewed latest results of Guardians' experiments.

Brian wanted to know why the line count of the 300hr is 4032 but the 5hr is 4172. My answer:

3/8/2018 I spent some time working with Steve from the Model group to try to identify whether 4172 is the number of lines for just the 5hr corpus. Using the command "grep FWDVIT decode.log | wc" in the /etc folder of the individual sub-experiment folders in 0300, 0303, and 0305: 5hr is 4172, 30hr is 3992, and 300hr is 4034. Which makes no sense - shouldn't the larger corpora have more lines? I would definitely be interested in hearing the Prof's explanation. A 300hr run really shouldn't have so few lines; it does take a lot longer than a 5hr one....

And Arias, in your log you say: "I tried to find a workaround for nohup (which writes to nohup.out) and added tail -f nohup.out to the script to also show the output on the screen, but it didn't work. I understand doing this might defeat the purpose of nohup, but I still liked the use of nohup while retaining the screen output for the user."

My answer:

Have you tried using tail -f nohup.out in a second terminal window? It worked for Steve. Although his was decode.log, a log file, I don't know for sure if it'd work for a .out file.


 * Results: 4/7/2018 Poster is done. Experiment results are narrowing down where we want to focus. Modeling has run one of the 300hr experiments using the Prof's scripts and is currently running the other.


 * Plan: 4/7/2018 Tomorrow run an experiment following a line of inquiry begun by Danielle. Continue reading the manual and note promising options to try.


 * Concerns: 4/7/2018 Nothing as of yet.


 * Tasks: 4/9/2018 Worked with Tri to gather information for the Prof. Apparently the best scores of last year's class were achieved by using different configurations in the sphinx_train.cfg file for the seen vs. the unseen results.

Ran a train/decode testing T14 changed to 5.


 * Results: 4/9/2018 Last year's class' seen-data experiment is at /mnt/main/Exp/sp17/0301/011/ with a score of 28.4. Their unseen-data experiment is at /mnt/main/Exp/sp17/0301/020/ with a score of 41.3.

diff -w /mnt/main/Exp/sp17/0301/011/ /mnt/main/Exp/sp17/0301/020/ shows they had 6 differences in their .cfg files.
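Since the differences are in the training configs, a narrower diff of just the two sphinx_train.cfg files counts them directly. A sketch, assuming the standard etc/sphinx_train.cfg location inside each sub-experiment:

```shell
# Sketch: count the lines that differ between the two training configs,
# ignoring whitespace-only changes (same -w idea as the directory diff above).
diff -w /mnt/main/Exp/sp17/0301/011/etc/sphinx_train.cfg \
        /mnt/main/Exp/sp17/0301/020/etc/sphinx_train.cfg \
  | grep -c '^[<>]'
```

Each changed line is counted twice (once as a < line and once as a > line); identical files print 0.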


 * Plans: 4/9/2018


 * Concerns: 4/9/2018

Week Ending April 16, 2018
Did experiment /mnt/main/Exp/0309/039 for the Guardians.
 * Task:4/11/18

I spent 3 hours with the Prof yesterday helping him debug the three scripts he created to deal with the brackets-and-dashes questions from Data Group experiment 0305/011 (just get rid of the brackets, with parseTrainTrans_no_brackets.pl and pruneDic_no_brackets.pl; these are in /mnt/main/Exp/0304/039/etc/scripts/). When Brian tried to run them, the train failed and gave a lot of errors about words not being in the dictionary. The 0305 scripts were originally made with 5hr data, and the bigger corpora have more specialty cases with unusual notation, like bracketed words within bracketed words, so we had to write fancier regular expressions to deal with them, particularly words with single quotes in them. Since we want the actual word that was spoken - it is what will be in the acoustic model and the truth transcript (_train.trans) - we wanted to keep words that are not real words, like

'cause instead of because

weren'n instead of weren't

Here is an example of a regular expression that filters out quotes, brackets and anything inside them - even when a word has one or more bracketed parts, with quotes at the beginning, middle or end - and replaces each match with just a dash.

$line =~ s/\[\'*?[A-Z0-9_-]+?\'*?[A-Z0-9_-]*?\'*?\]-/-/g;  # non-greedy [???]-abc -> -abc

'''Note: The Prof thinks that using regular expressions is not the best solution to addressing the "words not in dictionary" problem, and suggests that we look into other ways. With only four weeks left we really don't have time to create a script to handle it, but I bet it would be a good thing to look into for next year's Capstone class.'''

The Prof has tasked the Modeling and Data groups (not something for the Teams) to do seven 300hr train/decodes on various criteria; see the Experiment pages for details.


 * Results:4/11/18 He believes we have fixed the scripts so they will run on 300hr experiments, so he began one on Caesar (which we are NOT supposed to ever do).

Completed experiment /mnt/main/Exp/0309/039 for the Guardians. Average results.


 * Plan:4/11/18 Decode his experiment when it's done training. Begin one of the Prof's 300hr experiments and one or more for the Guardians.


 * Concerns:4/11/18 300hr experiments take a couple of days each to complete, so there's not much time to get all this done.


 * Task:4/14/18 There was a flurry of activity as the Prof discovered that none of our servers were set up properly to do an LDA experiment, including the ones the Prof wants the Data/Modeling groups to handle. It took about two days to get everything working again (Thanks Steve, Camden, Dan B, Arias and Professor Jonas for all your hard work!) but so far we can do them on Caesar and Majestix, possibly Miraculix, maybe others. The 0309/039 has to be redone, as it was supposed to be an LDA run, and my decode of the one from April 11th was aborted when most of the servers lost power on Saturday.


 * Results:4/14/18 We now have a working set of instructions to do LDA experiments:

Dan Beitel's instructions for 5hr LDA train and decode on Majestix

cd into your experiment ex. 0309/043 on Majestix

0309/044 30hr LDA on Majestix

1. makeTrain.pl switchboard 30hr/train
2. cd into etc to locate sphinx_train.cfg; change $CFG_LDA_MLLT='YES' and $CFG_LDA_Dimension=32 (then cd back out to the sub-experiment directory)
3. genFeats.pl -t
4. top to ensure no one else is using the server
5. nohup scripts_pl/RunAll.pl &
6. mkdir LM
7. cd into LM
8. cp -i /mnt/main/corpus/switchboard/30hr/train/trans/train.trans trans_unedited
9. parseLMTrans.pl trans_unedited trans_parsed
10. lm_create.pl trans_parsed
11. cd .. and then run makeTest.pl -t switchboard/30hr 0309/044 0309/044
12. genFeats.pl -d
13. cd into etc
14. /usr/local/bin/sphinx3_decode \
 -hmm /mnt/main/Exp/0309/044/model_parameters/044.mllt_cd_cont_1000 \
 -lm /mnt/main/Exp/0309/044/LM/tmp.arpa \
 -dict /mnt/main/Exp/0309/044/etc/044.dic \
 -fdict /mnt/main/Exp/0309/044/etc/044.filler \
 -ctl /mnt/main/Exp/0309/044/etc/044_decode.fileids \
 -cepdir /mnt/main/Exp/0309/044/feat \
 -cepext .mfc >& decode.log &
15. parseDecode.pl decode.log hyp.trans
16. sclite -r 044_train.trans -h hyp.trans -i swb >> scoring.log

________________________________________________________________________________

'''Since one of the main complaints is lack of documentation and because there doesn't seem to be a PROBLEMS ENCOUNTERED section on this wiki, I'm going to infodump sections of the Prof's long emails to our Capstone 2018 class regarding the LDA fiasco. Sorry they are long, but they are very informative.'''

________________________________________________________________________________

Friday 4/13/18 From Professor Michael Jonas:

So I discovered what had happened in 2017. Miraculix was indeed the machine used to train LDA but you had to be logged on as root. Last year's group installed python2.7 (which had libraries that LDA needed) under /root/usr/local and adjusted root's PATH variable to look in there before looking in /usr/local.

Kind of a bad way of doing it, as that's like installing Word in C:\Windows instead of C:\Program Files. So if you logged on as yourself you'd see python2.6 and as root you'd see python2.7. Of course, root accounts are not all alike, so if you tried to run Miraculix's root on a user directory on /mnt/main (belonging to Caesar, since that's where our main file store lives) you'd have permission problems. So you had to log back into Caesar and change permissions on your training directory, giving other (basically everyone) read & write permissions. Anything created by Miraculix's root would show up elsewhere with "nfsnobody" as the owner. If you wanted to change it you'd have to change its owner to yourself or give "group/other" read & write permission.

Yes, a huge mess and no wonder you guys struggled getting LDA to run. Of course it doesn't seem it was documented anywhere, but again, I haven't gone through last year's logs; I'm guessing you guys would have found it at one point if it existed. So lesson here: please document what you are doing!!!

In the course of all this I also discovered that run_decode_lda.pl does not do anything different other than appending "mllt_" to the -hmm argument. So, the -lda and -ldadim flags that the decoder has are still not being explored. If you type sphinx3_decode on the command line you will see it spit out 50+ flags that you can set. This is to ask everyone not to use run_decode.pl or run_decode_lda.pl anymore. That script was a convenience that I wrote 4 years ago so that students could quickly run a decode. However it is slowing research progress. Unlike training, which runs a ton of perl scripts and python code and java code and executable code and combines them in some iterative fashion (i.e. RunAll.pl is a complex script that does lots and lots of things), decoding is a single executable that you can run on the command line (as I said, decoding is simple and training is hard, and it's the latter we as researchers care more about). However, if you just run a plain old vanilla decode then you are not utilizing all of the things you may have created in a training job, so by running the decode by hand you can now maybe add -lda and -ldadim and look to other flags as well.

Please do not edit or re-write the run_decode.pl/run_decode_lda.pl scripts. We are not here to create user friendly scripts with 4 weeks to go.

Run a decode on the command line and then document in your experiment log what arguments you gave it and why. Doing so will force you to confront the arguments and perhaps change them on the fly. For instance, I could run a decode in experiment 50 and grab language models from experiment 48 and acoustic models from 49 and maybe a dictionary from 38. Using the scripts forces you to use the same experiment throughout. And of course you can then integrate some of those other terrific arguments that those 50+ flags may give you. Makes sense?

Decoding on the command line is simple just type the command and add the flags, i.e.:

sphinx3_decode -lm /mnt/main/Exp/0305/012/LM/tmp.arpa -dict ...

This grabs what you want and puts them in the LM. where ... is the path for the dictionary and more arguments needed for filler dictionary (-fdict) acoustic model (-hmm) wave files to decode (-ctl, -cepdir and -cepext). Then you'll see you can add more flags.

Current state with regard to LDA training. First off, I also discovered that in scripts_pl we have not only RunAll.pl but RunAll_CDMLLT.pl. If you dug into LDA training you know that it does LDA first and then generates MLLT from it, and in fact the (perhaps flawed) run_decode_lda.pl script uses those models by appending "mllt_" to the model path. I ran it successfully as experiment 0304/042 and am presently decoding (it's a 5 hour train only). So we will see what I get. I did run two 5 hour LDA trains, one as root on Miraculix and one as myself on Caesar. They gave different results: Miraculix 27.2% (0304/040) and Caesar 26.8% (0304/042). So that was a bit concerning. Investigating further, it looks like when running on Miraculix as root, it is not using /mnt/main/local but a local copy under /usr/local, whereas on Caesar the /usr/local is linked to /mnt/main/local. This bears some investigation, so please, someone look at how the local copy of Sphinx training differs on Miraculix from our main install on Caesar. Yes, we are doing better using Caesar-based LDA, but still, I would have expected the results to be identical since we haven't recompiled any software since 2012. So this points to a possible error and could mean that my hotfix installation of python2.7 on Caesar is flawed.

So yes, I did get a hotfix installation of python2.7 on Caesar, and that means it should now work on all the drones. Further caveats here, as that is unfortunately not the case (more trouble). Currently it runs on Caesar and Majestix out of the box. It will also run on Miraculix and Traubadix if you re-create the link to /mnt/main/local from /usr/local (right now each of those has its own local /usr/local directory with, presumably, copies of Sphinx). So the other machines (Asterix, Obelix, Idefix, and Rome) are unable to run LDA because they seem to have a 32 bit installation of RedHat. This is concerning, since all drones have 16GB of memory installed, so this suggests those 32 bit machines are only seeing 4GB. This ought to be fixed. I ask that the System group look into cloning and fixing at least Asterix, Obelix and Idefix. One caveat: please be sure to understand if any of these machines are special (i.e. someone last year installed something unique) and if so, pull that disk, mark it with a note and put it aside (we pulled an extra 73GB disk out yesterday that can be used in its place for a clone - Camden, did you check its contents to make sure it was empty?). You can perhaps clone it from Majestix since that seems to work (make sure that Majestix isn't one of those unique installations though).

Also, when cloning, make sure users and groups are set up properly. I had to fix groups as all users were given 500 or 501 for their group ids on Traubadix, Rome and I think Majestix but should be 1001 (i.e. that is what the group for cis790 is). If you aren't consistent then weird file attributes/group permissions start showing in /mnt/main (they did and I fixed some of them).

Finally, Rome is the bigger question. We are compiling Sphinx. We have backup running. We have CVS installed. We can use it as a wireless gateway. It is also running 32 bit RedHat. Cloning would mean undoing all that work. Can we somehow upgrade it to 64 bit while keeping the software base installed? Perhaps this isn't something critical to fix but only to investigate, so I can have someone do it over the summer when things have settled down. In any case, that means that the backup, CVS, and compiling tasks need to be well documented so they can be repeated if someone else over the summer re-installs RedHat (64 bit) on it. (BTW, how much memory does Rome have? It's not on our Hardware page.)

Ok, I hope we can now at least continue last year's research, move it forward, explore some new features (i.e. decoder flags like -lda and -ldadim or the new RunAll_CDMLLT.pl method of training) along with your great team ideas and get good results.

Best,

Mike

P.S. I still want my table filled out (see Rose), which means we still have to run five 300 hour trains (on Caesar, but sequentially) and then decode both seen and unseen. This is not part of your team research but the data/modeling groups' finalized tasks. So with Caesar now working you can start training them. You can start with 0304/039 (just change the sphinx_train.cfg to use LDA).

FYI,

So I re-ran the 5-hour LDA train using RunAll_CDMLLT.pl (on Majestix vs Caesar but since they both use /mnt/main/local I'm going to assume they are the same). The results are better than the same corpus run using the RunAll.pl script:

5 hour train with 5 hour test-on-train (i.e. seen) decode:

0304/041 uses RunAll.pl with LDA turned on (run on Caesar) - WER: 26.8%

0304/042 uses RunAll_CDMLLT.pl with LDA turned on (run on Majestix) - WER: 26.0%

Neither uses the -lda or -ldadim flag of the decoder (but both do use the "mllt" models, i.e. what run_decode_lda.pl does).

Also, there was a bit of confusion on running the decoder by hand. If you look at run_decode.pl at the bottom of the file (I hope people actually look at the scripts they run) you will see that the decode.log file is created by piping the output of the decoder (via >) into it. It's a tried and true Unix/Linux method to capture the output of a running program. So when you run the decoder by hand you'd do:

sphinx3_decode blah blah blah >& my_log_file.log & (This is the new version of this command as per the fix seen below)

where blah blah blah are all the flags with arguments you'd add (so that line will get pretty long, maybe spanning 3 or 4 lines on your screen depending on how wide your ssh terminal window is). You stick an ampersand at the end to make sure it runs in the background.

Also note that you cannot just stick -lda and -ldadim flags into the list of flags as that will not work (a few of you tried that already). Each of these flags may or may not have additional information they need (like a number or a file -- maybe the lda dimension flag wants an integer: -ldadim 4 -- but I don't know what it wants; I'm only using it as an example). So you need to investigate by checking CMU on what they mean and how they are used and if they are needed for LDA models that you built. It seems that one way to use LDA models is by using the MLLT transformation model files created by an LDA train for the acoustic models during decode (i.e. via -hmm), and that is what the run_decode_lda.pl script was doing (but that was all).

Hope that makes a bit more sense to everyone. Private message me if you have any further questions.

Mike

Correction:

Camden noted that > doesn't capture sphinx3_decode's output for the csh/tcsh shell (maybe he propagated that on your discord channel already). You have to capture the stdout/stderr of the decoder. So just use an extra ampersand (i.e. >& for csh/tcsh):

sphinx3_decode blah blah blah >& my_log_file.log &

For the bash shell the syntax is different and you reverse it (i.e. &> in bash). BTW, you use bash when logged on as root and csh/tcsh when logged on as yourself. I use csh/tcsh mostly, so any examples I give are csh (note that tcsh is just an extension of csh).

Mike

P.S. I keep seeing folks use /mnt/main/scripts/user appended to various commands (like lm_create.pl) and now I see /mnt/main/local/bin in front of sphinx3_decode. You don't need either, as your PATH is set up to find both directories. So make it easier on your keyboard and just type the command. You only need to use full pathnames when the OS can't find what you type. Use the "which" command in Unix/Linux to see where a program is coming from (i.e. type which sphinx3_decode and you'll see it's in /usr/local/bin).

Mike __________________________________________________________________________________________________________________________________________________________________________________________________________
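The csh-vs-bash redirection point from the correction above is easy to check on the bash side. A small sketch (the file name is just an example) showing that &> captures both stdout and stderr in bash:

```shell
# Sketch: bash's &> sends both stdout and stderr to the same file
# (csh/tcsh spells this >&, as in the sphinx3_decode examples above).
{ echo "to stdout"; echo "to stderr" >&2; } &> both.log
cat both.log    # both lines end up in the file
```

Run this under bash; in plain sh the &> operator is not valid, and in csh/tcsh you would write >& both.log instead.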


 * Plan:4/14/18 Work for Data/Modeling: redo five 300hr experiments. Work for Guardians: narrow down the best configurations for 300hrs.


 * Concerns:4/14/18 Hoping we can do a bit more with those flags for decode if there's time to hunt down how to use them.


 * Tasks:4/16/18 Spent the morning sorting through the three different sets of commands - the regular wiki seen train/decode, the ones for using the Prof's 3 alternate scripts in place of parseLMTrans.pl, and the new ones to use LDA without nohup run_decode.pl 0309/030 0309/030 3000 & - to come up with what I hope is the correct combined set of commands. Also attempted to contact Tri via Discord to get his input, and read everyone's logs to keep up with everything going on.

The latest commands: Steps to do a 5hr experiment with LDA and the Prof's 3 scripts instead of parseLMTrans.pl

First, add an experiment. You must be on Caesar (no need to cd anywhere) to run addExp.pl:

addExp.pl -s

Example: please enter your username -> AD/ras1002

Credentials: AD/ras1002

please enter your password->

Please enter the main experiment number (Ex: 0268)->0309

What is your sub-experiment's name?->T14 raised to 5

Please enter the author's name->Rose Salemi

Please enter a brief description of your sub-experiment->seen, 5hr, raise T14 to 5

Note: I get a "bad token" error if the description is more than a few words long!

Your sub-experiment number is: 022

Please go to the Exp directory on Caesar and make a folder for this sub-experiment

cd /mnt/main/Exp/0309 and then mkdir 022 (This is an example)

ssh into a drone, check "top" to see if anyone is using it and what the current CPU% is. LDA experiments need much more CPU processing power than the usual 25% for trains or 12.5% for decodes!

cd /mnt/main/Exp/0309/022 (This is an example)

TRAIN TO BUILD ACOUSTIC MODEL

makeTrain.pl switchboard 5hr/train

cd into etc and edit sphinx_train.cfg to change $CFG_LDA_MLLT='YES' and $CFG_LDA_Dimension=32: nano sphinx_train.cfg. Use the arrow keys to scroll down and change what you want, then hit CTRL+X and Y to save.
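If you'd rather script the config edit than use nano, sed can make the same two changes. This is only a sketch: it assumes the file contains Perl-style assignment lines such as $CFG_LDA_MLLT = 'no'; - check the exact variable names and spelling in your own sphinx_train.cfg first, since this page shows both "Dimension" and "Dimention":

```shell
# Sketch: set the two LDA options in sphinx_train.cfg without opening nano.
# Assumes you are in the experiment's etc directory and the file has
# Perl-style lines such as: $CFG_LDA_MLLT = 'no';
sed -i "s/\$CFG_LDA_MLLT = '[a-zA-Z]*'/\$CFG_LDA_MLLT = 'YES'/" sphinx_train.cfg
sed -i "s/\$CFG_LDA_DIMENSION = [0-9]*/\$CFG_LDA_DIMENSION = 32/" sphinx_train.cfg
grep CFG_LDA sphinx_train.cfg    # confirm both edits took
```

The trailing grep is just a sanity check that the new values are in place before you start the train.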

cd .. # go back out to the sub-experiment directory

genFeats.pl -t

cd /mnt/main/Exp/0309/022/etc

REMEMBER to replace the experiment numbers with YOURS - with two exceptions (see below).

cp -i /mnt/main/corpus/switchboard/5hr/train/trans/train.trans trans_unedited #get the truth transcript

/mnt/main/Exp/0304/039/etc/scripts/parseTrainTrans_no_brackets.pl trans_unedited 022_train.trans

DON'T change the 0304/039 above - it's where the script you're calling is. DO change the 022 to your sub-experiment number.

/mnt/main/Exp/0304/039/etc/scripts/pruneDic_no_brackets.pl /mnt/main/corpus/switchboard /mnt/main/Exp/0309/022/etc/022_train.trans 022

DON'T change the 0304/039 above - it's where the script you're calling is. DO change the 0309/022 to your experiment numbers. This script takes three arguments; the third is just your sub-experiment number.

cd .. (back up one level, to /mnt/main/Exp/0309/022 - change this experiment number to match yours)

Now run "top" to check CPU usage. LDA experiments need much more CPU processing power than the usual 25% for trains or 12.5% for decodes!

nohup scripts_pl/RunAll.pl &

BUILD LANGUAGE MODEL

mkdir LM in your sub-experiment directory ex. /mnt/main/Exp/0309/022 (Change this experiment number to match yours.)

cd LM

/mnt/main/Exp/0305/011/etc/scripts/convertTrainToLM.pl /mnt/main/Exp/0309/022/etc/022_train.trans trans_parsed (DON'T change the 0305/011 - it's where the script is. DO change the 0309/022 to match your experiment.)

lm_create.pl trans_parsed

DO DECODE

cd ..

makeTest.pl -t switchboard/5hr 0309/043 0309/043 (Change these experiment numbers to match yours.)

genFeats.pl -d

cd into etc

Now run "top" again to check CPU usage. LDA experiments need much more CPU processing power than the usual 25% for trains or 12.5% for decodes! (Please remember to change the experiment numbers to match yours!)

/usr/local/bin/sphinx3_decode \
 -hmm /mnt/main/Exp/0309/022/model_parameters/022.mllt_cd_cont_1000 \
 -lm /mnt/main/Exp/0309/022/LM/tmp.arpa \
 -dict /mnt/main/Exp/0309/022/etc/022.dic \
 -fdict /mnt/main/Exp/0309/022/etc/022.filler \
 -ctl /mnt/main/Exp/0309/022/etc/022_decode.fileids \
 -cepdir /mnt/main/Exp/0309/022/feat \
 -cepext .mfc >& decode.log &

parseDecode.pl decode.log hyp.trans

DO SCORING

sclite -r 022_train.trans -h hyp.trans -i swb >> scoring.log (Change this experiment number to match yours.)

tail -8 scoring.log in etc to see the table results


 * Results:4/16/18


 * Plan:4/16/18 Do a test of the new commands on Caesar for 5hr to make sure they work before I begin the 300hr set of experiments the Prof wants.


 * Concerns:4/16/18 Here's hoping the 5hr will work.

Week Ending April 23, 2018

 * Task:4/17/18 Last night Tri ran a 5hr test with the instructions I put together (see above) for LDA and using the Prof's 3 scripts in place of parseLMTrans.pl. I checked the results this morning - decode.log finished without erroring out, and the line and word counts for the transcripts and fileids and tmp.arpa are accurate as far as I can tell, so I began a 300hr train this morning. Experiment /mnt/main/Exp/0305/034.

When I ran nohup scripts_pl/RunAll.pl &

I got this error: Something failed: (/mnt/main/Exp/0305/034/scripts_pl/00.verify/verify_all.pl)

Here's how I fixed it:

Phase 6: TRANSCRIPT - Checking that all the words in the transcript are in the dictionary
Words in dictionary: 29748
Words in filler dictionary: 6
WARNING: This word: ARGU- was in the transcript file, but is not in the dictionary ( IT WAS REALLY FUNNY BECAUSE THE CROWDS WOULD LIKE GET INTO ARGU- YOU KNOW GET INTO FIGHTS WHO CAN SCREAM LOUDER AND YOU KNOW EVERYTHING SO IT'S REALLY PRETTY FUNNY TO SEE MY PARENTS GET INTO IT AND THEY'RE NOT REALLY EVEN PARTICIPATING ). Do cases match?
WARNING: This word: LIV- was in the transcript file, but is not in the dictionary ( LIKE I'M IN JOURNALISM I WOULDN'T WALK AROUND OPENING MY BIG MOUTH SAYING THAT'S WHAT I DO FOR A LIV- ). Do cases match?
WARNING: This word: HUMIDI- was in the transcript file, but is not in the dictionary ( YES I DO NOT LIKE THE HUMIDI- ). Do cases match?
WARNING: This word: DEDUC- was in the transcript file, but is not in the dictionary ( WAIT A MINUTE NOW THEY RAISED THE DEDUC- THEY RAISED THE DEDUCTIBLE [VOCALIZED-NOISE] ). Do cases match?
WARNING: This word: CONSER- was in the transcript file, but is not in the dictionary ( THEY KIND OF LIKED ME I LOOKED CONSER- AND ALL THAT STUFF AND THEY I DON'T KNOW WHAT THEY SAW IN ME BUT THEY SAW IT ). Do cases match?

Phase 7: TRANSCRIPT - Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once Something failed: (/mnt/main/Exp/0305/034/scripts_pl/00.verify/verify_all.pl)

[ras1002@caesar 034]$ Something failed: (/mnt/main/Exp/0305/034/scripts_pl/00.verify/verify_all.pl)

________________________________________________________________________________________________________________________

If you got this error, it means you need to add words to the dictionary. Or you could modify the regular expressions in your scripts to account for words with special notation such as brackets or dashes (we've already tried this; leaving the words in causes fewer problems but gives a slightly higher WER). You can use:

nano fileName

to modify the file (CTRL+V will let you go down a page at a time; when you are done, CTRL+X will ask if you want to save - hit Y for yes and Enter to confirm the file name)

I added the words (all but the three filler words that you are supposed to leave in: [NOISE], [VOCALIZED-NOISE] and [LAUGHTER]) to master.dic, added_words.dic and, just to be sure, to my 034.dic, all with the associated phonemes. You can get the phonemes two ways: use similar words already in the dictionary as a guide, or use CMU's lookup tool at http://www.speech.cs.cmu.edu/cgi-bin/cmudict, which gives you the phonemes.
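Rather than waiting for Phase 6 of verify_all.pl to fail, you can list the transcript words that are missing from the dictionary up front. A sketch with made-up demo files, assuming a transcript of space-separated uppercase words and a dictionary whose first whitespace-separated column is the headword:

```shell
# Sketch: find transcript words that are not in the dictionary.
# Build tiny demo files so the pipeline is self-contained; with real data
# you would point these commands at your _train.trans and .dic instead.
printf 'I DO NOT LIKE THE HUMIDI-\n' > 034_train.trans
printf 'I\tAY\nDO\tD UW\nNOT\tN AA T\nLIKE\tL AY K\nTHE\tDH AH\n' > 034.dic

tr ' ' '\n' < 034_train.trans | sort -u > trans_words.txt   # unique transcript words
awk '{ print $1 }' 034.dic | sort -u > dict_words.txt       # dictionary headwords
comm -23 trans_words.txt dict_words.txt                     # words only in the transcript
```

With the demo files above, the last command prints the single missing word HUMIDI-.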

Note: Today in class the Prof said NOT to do experiments using LDA until we run a small subset of experiments to decide which random seed the MLLT parameter needs to give us the best WER, and THEN do the 300hr experiments. So I used top to find the process IDs of the processes associated with my train, typed kill xxxxxxx, and ran top again to make sure it was no longer running.

He also said NOT to add words to the master.dic or the added_words.dic, just the .dic.

Per Prof Jonas, the Model/Data groups are to run three experiments, all 300 hours, using seen data (test-on-train) and the new genTrans.pl script: one is a baseline; one recreates 0304/039 (itself a recreation of 0305/011, but with updated scripts that remove just bracketed words); and one recreates 0305/013, using Tri's newest version of those scripts to remove both bracketed words and dashes. We are to use Automatix.

FILE PERMISSIONS PROBLEMS: I had a new version of the genTrans.pl script and needed to upload it to Caesar. At first I couldn't transfer it in (after moving the old one to /mnt/main/scripts/user/History/17) until I signed in as root and, per Steve, changed the directory permissions to give myself write permission with

chmod o=rwx

So the new genTrans.pl is now in both /mnt/main/scripts/user/History/cur and /mnt/main/scripts/user. Then I changed the directory permissions back with

chmod o=rx

I noticed that the color of the filename was white rather than green (Linux color-codes terminal listings, using green for executables), which concerned me. Then Steve couldn't get a train to run - it didn't populate the transcript and subsequently the dictionary - and that turned out to be a file permission issue as well: the script needed to be made executable. Camden and Dan B helped me with that. As root, I went into the directory genTrans.pl is located in and did

chmod +x genTrans.pl

You can also specify the full path to the file, just to be sure. The filename turned green and Steve was able to get his train started.
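The whole permission fix can be replayed on a scratch directory like this (perm_demo and the toy script contents are made up for the demo; genTrans.pl is just the filename from above):

```shell
# Scratch directory standing in for the scripts directory on Caesar
mkdir -p perm_demo
printf '#!/usr/bin/perl\nprint "ok\n";\n' > perm_demo/genTrans.pl
# Open the directory to others so the file can be copied in
chmod o=rwx perm_demo
# Make the script executable - this is what turns it green in ls output
chmod +x perm_demo/genTrans.pl
# Close directory write access again when done
chmod o=rx perm_demo
# The x bits should now appear, e.g. -rwxr-xr-x
ls -l perm_demo/genTrans.pl
```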


 * Results:4/17/18 Got some practice with Linux file permissions and uploaded the newest genTrans.pl. Our Guardians group discussed our game plan for the next two weeks.


 * Plan:4/17/18 Work on Guardian research and experiments now that modeling/data work is underway.
 * Concerns:4/17/18 None yet.


 * Task:4/18/18 Today we had the URC poster presentation. Later I did some research and started an experiment, based on information I found at one of the linked sites Steve fixed (most of them had shown "dead link", but it turned out that someone had formatted the wiki links incorrectly when they first posted them).


 * Results: 4/18/18 Successful conference. We got to present to Professor Karen Jin and impressed her - she said our Capstone class was able to explain what we were doing very clearly and in good detail, a sign that we had a solid understanding of what we had done. I firmly believe that documenting as you go really helps organize and cement what you're doing in your mind, and many of my classmates agree.


 * Plan: 4/18/18 Finish the decode in the morning and do more experiments while I wait for Steve's 300hour to finish so I can do my own 300hour for the Prof.


 * Concerns: 4/18/18 None right now.


 * Task: 4/19/18 Finished my decode and got a slight improvement: a WER of 33.7 compared to the 5hr baseline's 34.4. Seen data. Modified T14 to 5 and T15 to yes.


 * Results: 4/19/18

 |         | # Snt  # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
 | Sum/Avg |  4172  60569 | 72.8   19.1    8.1    6.5   33.7   87.1 |
 | Mean    |   1.3   19.2 | 75.5   18.2    6.2   14.2   38.7   87.5 |
 | S.D.    |   0.5   16.5 | 18.3   15.2    8.2   28.0   32.0   30.6 |
 | Median  |   1.0   15.0 | 75.6   16.7    3.0    3.0   33.3  100.0 |
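As a sanity check, assuming the standard sclite convention that the overall error is the sum of the substitution, deletion, and insertion percentages in the Sum/Avg row:

```shell
# WER = Sub + Del + Ins, as percentages of reference words
# (values from the Sum/Avg row above)
awk 'BEGIN { printf "%.1f\n", 19.1 + 8.1 + 6.5 }'
# -> 33.7
```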


 * Plan: 4/19/18 Keep testing other facets. Be ready to use Automatix to do the 300hr for the Prof when it becomes available.
 * Concerns: 4/19/18 None right now. Just help out wherever I can.


 * Task: 4/21/18 Did experiment 0309/060. Began my 300hr seen data experiment on Automatix that recreates 0304/039, which is a recreation of 0305/011 that uses the Prof's 3 scripts that remove only words in brackets and the bracket characters. This will be one of just a few 300hr experiments done since I uploaded the modified genTrans.pl script.

 * Results:4/21/18 For 0309/060:

 |         | # Snt  # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
 | Sum/Avg |  4172  60569 | 87.6    6.5    5.9    4.3   16.7   70.0 |
 | Mean    |   1.3   19.2 | 89.7    6.0    4.3   10.0   20.3   70.5 |
 | S.D.    |   0.5   16.5 | 12.5    9.0    7.4   24.5   27.2   42.5 |
 | Median  |   1.0   15.0 | 93.3    1.9    0.0    0.0   13.0  100.0 |


 * Plan:4/21/18 Test more facets, finish my 300hr experiment.


 * Concerns:4/21/18 Steve ran into some problems doing his 300hr baseline with the new genTrans.pl. He also changed the senone count to 8000 and used the manual decode rather than nohup run_decode.pl 0309/054 0309/054 8000 &, so I will be watching to see if I run into the same problems he detailed in 0310/019.


 * Task:4/23/18 Performing two experiments: the 300hr for the Prof as detailed above, and a 30hr unseen to test some configuration settings from a previous experiment.
 * Results:4/23/18

This is a recreation of experiment 0305/011 on seen data, using the three scripts the Prof and the data group modified to remove the brackets and the bracketed words deemed undesirable, while leaving all dashes. This experiment also uses the new version of genTrans.pl, which no longer removes the three filler words [NOISE], [VOCALIZED-NOISE], and [LAUGHTER].

Results:

40.2 is a decent result, especially considering this experiment does not use LDA/MLLT.

 |         | # Snt  # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
 | Sum/Avg |  4034  57411 | 68.9   23.9    7.2    9.1   40.2   90.5 |
 | Mean    |   1.3   18.5 | 72.3   22.4    5.3   17.7   45.4   90.2 |
 | S.D.    |   0.5   16.1 | 19.4   16.8    7.6   31.5   35.2   28.0 |
 | Median  |   1.0   13.0 | 72.1   21.6    0.0    5.9   37.5  100.0 |


 * Plan:4/23/18 Update documentation of all results on wiki and on Guardians'. Ask what else the Prof might need done. Begin work on the final summary and final report.


 * Concerns:4/23/18 None right now.

Week Ending April 30, 2018
Purpose: 30hr, unseen data, with changes to T11 T13 T23
 * Task:4/24/2018 Submit Capstone evaluation. Documentation of results from my last experiment.

Details: Uses the training data from 0309/023

Results: Not a great score; 52.5 is only slightly better than the 53.2 of the only other unseen 30hr experiment the Guardians have done.

 |         | # Snt  # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
 | Sum/Avg |  3912  55592 | 57.6   34.8    7.6   10.1   52.5   89.3 |
 | Mean    |   1.3   18.2 | 65.0   29.4    5.6   17.5   52.5   89.4 |
 | S.D.    |   0.5   16.3 | 21.0   18.7    7.9   37.8   40.3   28.8 |
 | Median  |   1.0   13.0 | 64.9   32.4    0.0    7.1   50.0  100.0 |


 * Results:4/24/2018 Evaluation and wiki documentation done.


 * Plan:4/24/2018 Use Prof's template to begin final summary. Begin final report.


 * Concerns:4/24/2018 None.


 * Task:4/27/2018 Document the results of the three 300hr experiments done based on the Prof's 3 scripts - Steve's baseline, my recreation of 0305/011 and Tri's recreation of 0305/013. Not quite identical recreations, though, as we all used 8000 senones instead of the 1000 baseline senone count. I also went through the Guardians' documentation and marked all experiments that have been done since we uploaded the newest parseLMTrans.pl and genTrans.pl scripts.


 * Results:4/27/2018 Guardians' documentation done.


 * Plan:4/27/2018 After work tomorrow, begin report.


 * Concerns:4/27/2018 None at the moment.


 * Task:4/29/2018 Read logs. Spent several hours on a draft of the Data Group's part of the Final Report and added that plus other information to the Data Group's wiki page. Kept up with Discord communication and the Guardians' newest results.


 * Results:4/29/2018 Draft is done.


 * Plan:4/29/2018 Will ask for the other data group members to contribute suggestions and content.


 * Concerns:4/29/2018 Not much time left to run experiments.


 * Task:4/29/2018 Monitored progress on current experiments in specialized areas such as LDA on Discord. Checked logs for updates. Edited a bit of the final report.


 * Results:4/29/2018 Not much is going on, just waiting for experiment results.


 * Plan:4/29/2018 Study the results and see what else can be done to tweak them.


 * Concerns:4/29/2018 Nothing else right now.

Week Ending May 7, 2018

 * Task: 05/04/2018 I did some editing on the Final Class Report after more communication with the Data Group members.


 * Results:05/04/2018 I hope the changes I made make my explanation of the data group's work clearer.


 * Plan:05/04/2018 We are running one last experiment right now; when we get the results, we will move on to the eval testing.


 * Concerns:05/04/2018 None right now.