Speech:Spring 2015 Dakota Heyman Log


 * Home
 * Semesters
 * Spring 2015
 * Proposal
 * Report
 * Information - General Project Information
 * Experiments - List of speech experiments

Helpful Links
After going through logs of previous students, I have decided to create this 'Helpful Links' section to quickly show future students useful information. These links will be of most help to future Data Group members but there will be details that help everyone.
 * CMUSphinx (Speech Recognition Information)
 * Corpus (General Corpus Information)
 * Run a Train (Scripts to run trains, use Steps 1, 2, and 3)
 * Accessing Drones (Should be done in class but this is useful in case you were absent or the process didn't work for you)
 * Count # of Files (The command 'ls -1 | wc -l' is used)
 * Moving Directories (The command 'mv [source] [destination]' is used)
 * Creating Soft Links (The command 'ln -s ../disk1/swb1/*.sph -t .' is used)

Week Ending February 3, 2015

 * Task:
 * Learn about previous data team's semester efforts

1/29
 * Results:
 * Looked over data group logs from Spring 2014. Jared's is the most complete for the first week, so his log will likely be the most helpful this week.
 * Edited 2015 Spring Data Group page to include links to all of our personal logs.

2/1
 * Installed PuTTY and attempted to log in to Caesar using my wildcats account (as specified by Spring 2014 logs). It didn't work, so I'm assuming that the accounts aren't set up yet.
 * Caesar Login Procedure:
 * Download PuTTY (or use other terminal software)
 * Connect to the host name caesar.unh.edu
 * Use port 22/ensure that you are establishing an ssh connection
 * Continued to read logs from previous semesters.

2/2
 * Met together with group via Google Hangouts to discuss proposal. We plan to meet again tomorrow to work on finalizing it so that we can send it to the Proposal Group.
 * Emailed Mohamed to see if he can let me know how he obtained access to Caesar. As of now I am not sure if it is user error on my part or if his group has early access. I'm assuming the latter since no one else from my group has gained access either.
 * Familiarized myself with the Corpus and Data pages located on the wiki. This seems to be the best approach to understanding the data structure until I have access to Caesar.
 * After reviewing the report from Spring 2014, organization of data appeared to be the focus for future groups. This lines up with our designated goal of using soft links to better structure the data.
 * Reviewed various linux commands. I have basic linux knowledge from working with networking switches that run linux. The command line for the switches was similar to Cisco's IOS, but the switches included a bash shell that allowed for linux commands.
 * Read about linux command to create soft links (How to Create Soft Links in Linux). I have not been able to test out the command for myself but it should be: ln -s {target-filename} {symbolic-filename}.
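The syntax from that article can be rehearsed safely on a PC before touching Caesar. A minimal sketch, run in a throwaway directory with made-up file names:

```shell
# Work in a throwaway directory so nothing real is touched
cd "$(mktemp -d)"

# Create a target file, then a symbolic link pointing at it
echo "hello" > target.txt
ln -s target.txt link.txt

# Reading through the link reaches the target's contents
cat link.txt          # prints: hello

# 'ls -la' shows the link and where it points (link.txt -> target.txt)
ls -la link.txt
```

The first argument to 'ln -s' is the target, the second is the name of the link to create.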

2/3 Current Estimated Timeline:
 * After talking with Russ today, I realized that my description for connecting to Caesar was not clear. This prompted me to go back and explain in further detail the steps I took to connect. It also reminded me to emphasize complete documentation so that our efforts are clear to current and future classmates.
 * Met with group mates to discuss our plans for finalizing the proposal.
 * I wanted to review my linux skills further and remembered a website that I had used in the past to learn some basic commands. The website (Learn Python the Hard Way) is very effective at teaching the user how to program in python but it also includes a small section on terminal commands.
 * The Mac/Linux portion teaches users many different commands to manipulate files and directories.
 * I applied these skills to a soft link tutorial (Symbolic Links) to gain experience working with soft links on my own PC without needing to experiment inside Caesar.
 * Wrote a tentative personal timeline to include in the proposal:

Week Ending Feb 3rd
 * Review personal logs from Data team members of Spring 2014
 * Learn and review linux commands
 * Create a prototype file structure and practice soft links
 * Collaborate with team members to develop project proposal

Week Ending Feb 10th
 * Gain access to Caesar and create personal account if necessary
 * Examine the file structure
 * Ensure that existing documentation is up to date

Week Ending Feb 17th
 * Start reorganizing file structure to make it simpler by creating soft links
 * Document any changes and process used to make changes

Week Ending Feb 24th
 * Further organize data based on current progress and outside feedback
 * Learn how to run trains

Week Ending March 3rd
 * Finish final steps in organization
 * Document results as well as unfinished efforts that need to be completed by future Data teams
 * Attempt to run a train


 * Plan:
 * This week, the plan is to review logs and collaborate with team members to create a project proposal.
 * Concerns:
 * I have not logged into Caesar yet so the current state of data organization is unknown.

Week Ending February 10, 2015

 * Task:
 * Examine files and file structure

2/4
 * Results:
 * Had personal account created to access Caesar (username is identical to wildcats. For my examples, my username is djn96).
 * After receiving access to account, password had to be changed using the 'passwd' command.
 * Gained access to Asterix
 * Asterix Access Procedure
 * At lines 3, 5, and 6 simply press enter.

caesar sp15/djn96> ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/mnt/main/home/sp15/djn96/.ssh/id_rsa):
Created directory '/mnt/main/home/sp15/djn96/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /mnt/main/home/sp15/djn96/.ssh/id_rsa.
Your public key has been saved in /mnt/main/home/sp15/djn96/.ssh/id_rsa.pub.


 * After this step, the key fingerprint and randomart image should be displayed.
 * The last step is to create a soft link called 'authorized_keys' that points to id_rsa.pub. The 'ls' commands used are not mandatory; they are only used to ensure that the creation of the soft link was successful.

caesar sp15/djn96> cd .ssh
Directory: /mnt/main/home/sp15/djn96/.ssh
caesar djn96/.ssh> ls
id_rsa  id_rsa.pub
caesar djn96/.ssh> ln -s id_rsa.pub authorized_keys
caesar djn96/.ssh> ls
authorized_keys  id_rsa  id_rsa.pub


 * Access to Asterix should now be available by using the 'ssh asterix' command.

caesar sp15/djn96> ssh asterix
The authenticity of host 'asterix (192.168.10.2)' can't be established.
RSA key fingerprint is 57:d0:b6:de:e0:6e:f2:ae:fc:52:48:3a:a2:2d:43:fe.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'asterix,192.168.10.2' (RSA) to the list of known hosts.
[ASCII art banner: "Asterix Weclomes you!"]
asterix sp15/djn96>

2/6
 * Switched to Mac and accessed Caesar by using the 'ssh (username)@caesar.unh.edu' command from the Mac terminal (or any other terminal application like iTerm).

ssh djn96@caesar.unh.edu
pam_mount password:
[ASCII art banner: "Caesar Welcomes you!"]
caesar sp15/djn96>
 * The first attempt sometimes fails to resolve the hostname; if that happens, simply run the command a second time.

ssh djn96@caesar.unh.edu
ssh: Could not resolve hostname caesar.unh.edu: nodename nor servname provided, or not known
ssh djn96@caesar.unh.edu
pam_mount password:
[ASCII art banner: "Caesar Welcomes you!"]
caesar sp15/djn96>


 * Read logs from last year.

2/9
 * Today, I decided to look at the file structure of the corpus and document it. This way, future groups can get an understanding of the structure without needing access. I won't go into every little detail of each directory because changes are likely between now and the next group. Instead an overall view should be helpful.
 * If you already have a user account, go to the corpus using the commands below. The 'cd' command is used to change directories. 'cd ..' moves up one directory, and chaining '..' components with slashes (as in 'cd ../../..') moves up multiple directories in one command.

Directory: /mnt/main/home/sp15/djn96
caesar sp15/djn96> cd ../../..
Directory: /mnt/main
caesar /mnt/main> cd corpus
Directory: /mnt/main/corpus


 * From here, you should see four directories:
 * dict
 * This directory contains dictionaries that contain words along with their acoustic models.
 * noaa
 * Both this directory and switchboard contain transcripts and audio files that make up the data being converted.
 * scripts
 * Currently there are two files: the genTrans.pl script and the much larger ms98_icsi_word.text file.
 * switchboard


 * After this brief examination of the file structure, I realize that I need to do more research in order to fully understand the contents of each directory.

2/10
 * Group met over Google Hangouts. We all have had a chance to look at the file structure and have a better understanding of the data organization. We all expressed concern about whether our tasks will provide enough work for four people. We plan to meet again tomorrow before class to further discuss the project.
 * Read logs from each classmate. It was encouraging to see that Mohamed successfully ran a train. I will be curious to hear about his experience with it in class tomorrow.


 * Plan:
 * Discuss our efforts examining the file structure.
 * Assign tasks to each other.
 * Concerns:
 * Are our tasks enough work for four contributors?

Week Ending February 17, 2015

 * Task:
 * Fix known broken soft links


 * Results:

2/11
 * Today, I examined the noaa and switchboard directories to consider their soft link requirements.
 * It appears that noaa is already correctly linked. Both the '40min_split' and 'half' directories have soft links that point to the 'full' directory. The 'ls -la' command was used to determine what files were being linked to.
 * The linking for switchboard is more complicated. The .sph files for each of the hour directories point to /mnt/main/corpus/switchboard/dist/Switchboard/flat which in turn points to each respective disk that the file is located on. It appears to be working correctly but more analysis may be required to verify this.
 * I was also curious whether the highest hour count ('256hr') was identical to 'full'. Stephen and I both went into each of the respective directories ('256hr' and 'full') and determined that they were not the same. '256hr' contains 2285 .sph files whereas 'full' contains 2435 .sph files. The number of files in a directory can be determined using the 'ls -1 | wc -l' command. 'ls -1' tells the terminal to list each item one line at a time, and 'wc -l' then counts the number of lines in the output. I learned these commands from the following tutorial (Counting Files in the Current Directory).
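The counting pipeline can be rehearsed on throwaway files (the three .sph names below are made up for illustration):

```shell
# Scratch directory with three fake .sph files
cd "$(mktemp -d)"
touch sw02001.sph sw02002.sph sw02003.sph

# 'ls -1' prints one name per line; 'wc -l' counts the lines
ls -1 | wc -l   # prints 3
```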

2/12
 * While perusing the switchboard directory yesterday, I noticed that '256hr' used a broken soft link to its .sph files. To compare file counts, I compared the output of 'ls -1 | wc -l' from the correctly linked 'full' directory against the actual path that '256hr' was trying to link to. I tried to fix the link yesterday, but I kept getting kicked off as root since it was during class time, so I decided to wait until today to fix it when it would be less busy.
 * In its incorrect state, the 'conv' directory of '256hr' pointed to /mnt/main/corpus/dist/Switchboard/consolidated instead of /mnt/main/corpus/switchboard/dist/Switchboard/consolidated.

caesar:/mnt/main/corpus/switchboard/256hr/clean/audio # ls -la
lrwxrwxrwx 1 root root 46 2014-09-16 09:47 conv -> /mnt/main/corpus/dist/Switchboard/consolidated


 * To fix the soft link, the 'ln -nfs' command was used. After the command, give the correct path that you wish to link to, followed by the path of the link being replaced.
 * If you attempt to do this with your personal account, caesar will refuse and state that permission is denied. Root access is required.

caesar:/mnt/main/corpus/switchboard/256hr/clean/audio # ln -nfs /mnt/main/corpus/switchboard/dist/Switchboard/consolidated /mnt/main/corpus/switchboard/256hr/clean/audio/conv


 * I confirmed that the correction was successful by once again using the 'ls -la' command.
 * I also asked Russ if he would verify that the change was successful by having him access the newly linked directory with his personal account. He was able to successfully access it.

lrwxrwxrwx 1 root root      58 2015-02-12 10:37 conv -> /mnt/main/corpus/switchboard/dist/Switchboard/consolidated


 * After further examining switchboard, I discovered a discrepancy in how the soft links are set up. In the directory that I fixed with Russ (/mnt/main/corpus/switchboard/256hr/clean/audio/conv), the directory itself is linked to another directory (/mnt/main/corpus/switchboard/dist/Switchboard/consolidated). In all the other hour directories, the conv directory contains .sph files that are linked to individual .sph files in /mnt/main/corpus/dist/Switchboard/flat. This link is broken for all of these files and should instead point to /mnt/main/corpus/switchboard/dist/Switchboard/flat. I am not sure of the advantages and disadvantages of linking directories vs individual files. I will discuss it with the group as well as research further.
 * I was looking through logs of the previous year to see their explanation for linking the way they did. The last modification dates were between March and April, by which point the students were no longer in their originally assigned groups. I will try to look through everyone's blogs for that time period to see if I can find some reasoning.
 * While reading Mitch's Blog, I found him discussing removing unnecessary wav files from the experiments directory (one of our tasks) during the week of April 15. We want to remove everything prior to Spring 2014, so this should be very helpful when we start this task.
 * I was also reviewing Jared's Blog and saw him talking about measuring the size of Switchboard. This should also be very beneficial when we attempt the task.
 * I brought this information to the attention of the group and we will decide together how we want to proceed. Now that we have a better understanding of our goals, we can assign tasks amongst ourselves.

2/16
 * Read logs from teammates to see how everyone was progressing this week.
 * Our group plans to meet via Google Hangouts either today or tomorrow to discuss our progress.

2/17
 * Tonight, our group met via Google Hangouts to discuss our progress.
 * We also assigned tasks amongst ourselves:
 * Russ is going to focus on measuring the size of Switchboard. Various hour counts have been calculated in the past and we have been tasked with finding a definitive number.
 * Krista is going to eliminate redundant .wav files, including any created prior to Spring 2014.
 * Stephen will work on learning how to run experiments. He has already created the necessary directories and will document his process so that each of the other data group members can follow his lead and make child experiments.
 * I plan on fixing the soft link structure. I spent a lot of time this week learning the current status of soft links and found many to be broken. They can be fixed multiple ways (as described in detail on 2/12). I will ask Professor Jonas his opinion in class tomorrow.
 * I read logs from each classmate to see how the class as a whole was progressing. Sam seems to be making good progress on running trains. A lot of people faced issues when Caesar was down on 2/16. I was fortunate to not need access to Caesar that day, as it would have halted my progress as well.
 * I found a link that explains many wiki markup features Wiki Markup Tips. This will be useful when documenting.
 * Plan:
 * Decide upon a plan to fix the rest of the broken soft links
 * Concerns:
 * Unsure if plan to fix soft links is correct

Week Ending February 24, 2015

 * Task:
 * Fix soft links

2/19
 * Results:
 * Today I removed the redundant 'Switchboard' directory that was located inside /mnt/main/corpus/switchboard/dist.
 * I created some test directories to ensure that the process would work correctly before actually moving the directories. Root privilege is needed whenever modifying Caesar outside your own personal directory.
 * For each directory I ran the following command (disk1 is used as an example):

mv /mnt/main/corpus/switchboard/dist/Switchboard/disk1 /mnt/main/corpus/switchboard/dist


 * After doing this for each directory, I decided to fix the soft links for the 'first_5hr' directory. As of now the overall plan for soft linking isn't finalized (waiting on a response from Professor Jonas), but Sam is running into problems training and I want to have at least one working directory so we can deduce whether the issues are caused by broken soft links.
 * For each file I ran the following command (sw02001.sph is used as an example):

caesar:/mnt/main/corpus/switchboard/first_5hr/clean/wav # ln -fs /mnt/main/corpus/switchboard/dist/flat/sw02001.sph /mnt/main/corpus/switchboard/first_5hr/clean/wav/sw02001.sph


 * The file before:

lrwxrwxrwx 1 root root 50 2014-03-01 01:54 sw02001.sph -> /mnt/main/corpus/dist/Switchboard/flat/sw02001.sph


 * Here is the file after:

lrwxrwxrwx 1 root root 50 2015-02-19 16:42 sw02001.sph -> /mnt/main/corpus/switchboard/dist/flat/sw02001.sph


 * After this step, I had to fix the links from the /mnt/main/corpus/switchboard/dist/flat directory that pointed to disk1. The process is the same as above, but with different paths.
 * Now, I'm waiting for Professor Jonas to see how he would like the soft linking to work. Also, I am hoping that the fixes I made will help Sam train.
 * Sam got back to me and said that the fixes worked. He asked for the 125hr directory to be fixed as well so I fixed the links in that directory too.
 * This process is simple yet tedious. If the process we use going forward is to fix individual file soft links, then creating a script should be investigated. The 125hr directory contains 92 files. Since the links need to be fixed in 2 places, this amounts to 184 soft links. The process to fix these took me about an hour. There are 2435 total files in the 'flat' directory that point to all the original files found on disks. I have fixed 132 of them, leaving 2313 broken soft links. Double that number to 4626 to include the links to 'flat' within each of the hour directories. If I fix 200 links/hour, it will take me roughly (4626/200 ≈ 23) 23 hours to fix ALL the links. I am not very good at scripting and the numbers aren't consistent enough to simply auto-increment, so I don't know how long it would take me to write a script. For now, I will wait for Jonas to see how he prefers to go about a solution.

2/22
 * Still waiting for a response from Professor Jonas.
 * Read logs from team members. Stephen seemed to be making good progress on running trains but ran into an issue while decoding. Hopefully next week each team member will be able to do one successfully.
 * I plan to email the group today to hear if they have any updates that haven't been put in their logs yet. Also, we will set up a time where we can all meet via Google Hangouts again.

2/23
 * Emailed Professor Jonas again to see if he can get back to me before class on Wednesday.
 * Read logs from class.
 * Group met via Google Hangouts to discuss progress. We have all started to make headway on our tasks. Stephen has run into an issue running trains, so Russ and I are going to attempt one tomorrow to see if we run into the same issues.

2/24
 * There was some confusion between me and Professor Jonas regarding emails. I hope to get the soft link structure figured out in class tomorrow.
 * I read in Sam's log that he is having some issues because the paths to the audio files are structured differently in each directory. I know how to fix this, but there would need to be a standard set so that all the directories share the same structure. As a group, and as a class, we would need to agree on a proposed structure.
 * I intended to attempt running a train today but was unable to due to time constraints. Instead, I will seek Stephen's help in class tomorrow.
 * Instead of running a train, I researched potentially writing a script. This might be necessary to fix all the soft links. The script needs to be able to run a linux command and increment for each file. In Perl, this can be done with the 'system' function. An example below (where $softLink is the path of the intended soft link and $originalFile is the path where the file is originally located):

system("ln -fs $originalFile $softLink");


 * The issue with a script is figuring out how to account for the filenames. There are large stretches of consecutive numbers and then some that skip numbers. I don't have a ton of programming or scripting experience, so I am worried that writing a script would take away time needed to manually fix the soft links. As of now, I am still undecided on the subject.
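One way around the skipped numbers would be to loop over the file names that already exist instead of auto-incrementing, so gaps in the numbering don't matter. A sketch using made-up directory names (not the real corpus layout):

```shell
# Scratch setup mimicking a 'flat' target directory and an hour directory
cd "$(mktemp -d)"
mkdir flat hours
# Note the gap in numbering: sw02002 and sw02003 don't exist
touch flat/sw02001.sph flat/sw02004.sph

cd hours
# Create one link per file that actually exists in ../flat,
# so skipped numbers are never a problem
for f in ../flat/*.sph; do
    ln -fs "$f" "$(basename "$f")"
done

ls   # sw02001.sph  sw02004.sph
```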


 * Plan:
 * More soft links need to be fixed
 * Run a train
 * Concerns:
 * Fixing the soft links takes a lot of time. I am not sure if I will have enough time to fix them all. The task may need to be split up across the group or I may need help creating a script

Week Ending March 3, 2015

 * Task:
 * Finish fixing soft links

2/25
 * Results:
 * I fixed the 'first_5hr' directory so that it accurately represented the structure of the other hour directories. This was necessary for the script to run trains.
 * In order to transfer the soft links from 'wav' to 'conv', I had to use the following command (the original path is listed first, followed by the destination path):

cp -r /mnt/main/corpus/switchboard/first_5hr/clean/wav/* /mnt/main/corpus/switchboard/first_5hr/clean/audio


 * Then I created the 'conv' directory and used the same command to copy the files once again.


 * The old path of 'first_5hr':

/mnt/main/corpus/switchboard/first_5hr/clean/wav


 * The new path of 'first_5hr':

/mnt/main/corpus/switchboard/first_5hr/clean/audio/conv


 * This new path accurately matches the path of the other hour directories located in switchboard:

/mnt/main/corpus/switchboard/125hr_3170/clean/audio/conv
/mnt/main/corpus/switchboard/256hr/clean/audio/conv


 * The path for the 'full' directory is not fixed yet because trains aren't being run for that data currently. It needs to get done but I am prioritizing fixing the soft links first.
 * I taught Russ and Krista how to fix soft links. They are currently working on fixing the soft links for the 'flat' directory. There are over 2000 files so this task needed to be delegated.
 * With Russ and Krista fixing the 'flat' directory, I am going to work on fixing the soft links for each of the hour directories within switchboard. I am starting with 256hr. After fixing its 'conv' directory (which was a soft link for some reason), I am currently creating soft links for each of the files listed in 'consolidated' (which was what 'conv' originally linked to).


 * Current Progress: 406/2285 (18%)


 * After finishing for the day, I decided to look for any scripts that previous students have created to make soft links. The issue with doing it manually isn't just the time required; it also has a higher chance of user error than automation.
 * I looked through the logs from Summer 2014 but didn't find anything conclusive. In the scripts section of the wiki, I found Convert.pl. This script is focused on creating soft links through text found in transcripts, but I'm curious if it could be adapted to fit our needs. My main worry about creating a script is that I won't be able to complete one by our deadline. I would rather do it manually and ensure its success.

2/26
 * Krista found a much faster way to create all the soft links. She may include her process in her log but if she doesn't then I will post a brief explanation in a later post of mine.
 * I was without internet access for much of the day and was unable to make progress on my own.
 * I plan to implement Krista's method to fix the rest of the soft links in the '256hr' directory tomorrow as well as fix the 'full' directory so that it matches the structure of the other directories. Hopefully all of these tasks can be done before next week so that I can spend next week focusing on running a train.

2/27
 * I tried to use Krista's method but it didn't work for me so I asked her to fix what I had messed up. I think I know how to fix it, but it may take me longer than it would for her so I am seeing if she is able to first.
 * I fixed the 'full' directory so that it now uses the /clean/audio/conv structure that exists in the other directories.
 * The 'full' directory has the 'utt' directory in the wrong place, but when I tried to change its location it didn't work. I remember Sam saying that the 'utt' directory might not matter, so I'm leaving it for now but will probably move it later.

3/3
 * Today I learned that I am a massive idiot and wasted a lot of time. I went to the Corpus info page and found that there is a command that will create soft links for all files that end in .sph. I had been to the corpus page before but not since the first or second week and had forgotten about it. The method that Krista used (copying sections of the soft link command into excel and then using excel to duplicate the command) sped up the process a lot but the new command is a more reliable way to create soft links going forward:

ln -s ../disk1/swb1/*.sph -t .


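The wildcard form above can be rehearsed outside the corpus; in this sketch the directory names only mimic the corpus layout:

```shell
# Scratch directories standing in for disk1/swb1 and a link directory
cd "$(mktemp -d)"
mkdir -p disk1/swb1 links
touch disk1/swb1/sw02001.sph disk1/swb1/sw02002.sph

cd links
# '-t .' means: create the links in the current directory,
# one per file matched by the wildcard
ln -s ../disk1/swb1/*.sph -t .
ls   # sw02001.sph  sw02002.sph
```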
 * More info on the command can be found on the info page itself (linked in the first bullet). This discovery made me decide to create a 'Helpful Links' section at the top of my log so that future students will be able to see the most important information right away when they view my log. I will create this section during the first week of bootcamp and update it whenever I find more important information.
 * I created the NOAA wiki page. Right now it is sparse on information because as of 5:52PM, Caesar is down and I no longer have access to the NOAA directory.
 * Group met up via Google Hangouts. We all discussed our status and it seems like we are all making good progress. After some trouble decoding, Stephen has run a train now and will help the Data team in class tomorrow as we all attempt to run a train.
 * While creating the NOAA wiki page, I found some errors in the Switchboard wiki page and corrected them.
 * I read all of my classmates' logs. The Systems group seems to be very busy working on the migration from Caesar to Brutus. I hadn't had any trouble connecting to Caesar (until right now), so at the very least Caesar had been working correctly. The current outage isn't on the migration schedule located on the System team's wiki page, so I'm not sure whether it is planned.


 * Plan:
 * Hear from class if more soft links need to be fixed
 * Run trains
 * Concerns:
 * I am concerned about running trains. So far, it seems like few people can run them consistently so I don't think we'll be able to skip bootcamp. This will give the class more time so that everyone is comfortable running trains.

Week Ending March 10, 2015

 * Task:
 * Fix directory structure
 * Fix more soft links

3/4
 * Results:
 * Today, Professor Jonas gave us a new task to standardize Switchboard. It is currently very inconsistent in its file structure, so we need to fix that.
 * Every 'clean' directory will now be removed and replaced with a 'train' directory. It is not known why 'clean' was used in the first place, so I will attempt to research why on the wiki. The directories were created last Spring, so at least I have some information on where to start looking.
 * I tested the soft link command (mentioned in a previous post) to see if it could be used exclusively. It works, but as presented it makes soft links for every file in switchboard. This is fine for the 'full' directory but problematic for any of the others. I was able to modify the command to make it work better (but still not perfect).
 * The command was modified to (first_5hr used as an example):

ln -fs /mnt/main/corpus/switchboard/first_5hr/train/audio/conv/sw{02001..02062}.sph -t .

 * In the new command, -f was added to force the command to rewrite soft links if they already exist. If the command is being run in an empty directory, -f is not needed. Instead of using a '*' to include every file, I specified a range instead. I already knew the range needed for first_5hr, so I simply went from the first number (02001) to the last (02062). This created soft links for all the files in between. Broken links that do show up are easily identifiable by their red background. The '-t .' tells ln to treat the current directory as the target directory in which the links are created.
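The brace range is expanded by the shell before ln runs, so every number in the range produces a link whether or not a matching file exists. A sketch with made-up files (requires bash for the zero-padded range):

```shell
# Scratch layout: a 'flat'-style target directory and an hour directory
cd "$(mktemp -d)"
mkdir flat first_5hr
# Only two of the 62 files in the range actually exist
touch flat/sw02001.sph flat/sw02062.sph

cd first_5hr
# bash expands sw{02001..02062}.sph to 62 zero-padded names;
# -f overwrites any links that already exist
ln -fs ../flat/sw{02001..02062}.sph -t .

ls -1 | wc -l   # 62 links, of which 60 are broken (shown in red)
```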

3/5
 * The directories are now all standardized.
 * Each directory contains two main subdirectories: 'test' and 'train'.
 * Within these directories, 'audio', 'info', and 'trans' are located.
 * 'audio' contains .sph files, 'info' contains training information (more research needed to understand purpose of this directory), and 'trans' contains the text transcripts of the conversations.
 * Any directories that do not follow the above format likely did not contain the files necessary for the directory. We didn't want to create empty directories just for the sake of making them. I plan to do more research to see if I can find the appropriate files and place them in their respective directories.
 * More work on the NOAA wiki page will be done later in the week/weekend.

3/7
 * Caesar is currently down so I don't have access to the NOAA directory. When it is back up, I will add more information to the NOAA wiki page.
 * In the meantime, I emailed Marcel to see if there is anything in particular that he wanted me to talk about in the documentation. I also asked him if he knew the reasoning behind the 'clean' directories or if he knew someone who would know anything about them.
 * I still hope to run a train before bootcamp. This would likely occur either Monday or Tuesday next week.
 * I also read classmates logs.

3/9
 * Marcel hasn't gotten back to me yet. I will wait another day before updating the NOAA wiki page.
 * Sam is having trouble running trains and asked me to fix the broken links for the 'utt' directory in 'first_5hr' (/mnt/main/corpus/switchboard/first_5hr/train/audio/utt).
 * This directory had over 4500 broken links, so I decided to try the new soft linking method (mentioned on 3/4). For example:

ln -fs /mnt/main/corpus/switchboard/full/train/audio/utt/sw2062B-ms98-a-{0002..0104}.sph -t .

 * The difference for the 'utt' directory is that the 'sw2062B' part needs to be changed in addition to the range specified (2-104 in this case). After running this command for every file group (changing the 'sw2062B' each time as necessary), I was left with over 7000 links since the command creates links for every number listed in the range - not just for the ones where a soft link already exists. To quickly eliminate the remaining broken links I used this command:

find -L -maxdepth 1 -type l -delete


 * I found the above command here: Removing Multiple Dead Soft Links. The '-L' option tells find to follow symbolic links, so '-type l' then matches only the links that cannot be resolved (the broken ones). '-maxdepth 1' restricts the search to the current directory, and '-delete' removes everything that matched.
 * This new method only took me an hour and would have likely taken over 24 hours with my original method and would have been very difficult using Krista's method because there are a lot more parameters that need to be frequently changed.
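This cleanup can be rehearsed safely on throwaway links first; running the same command without '-delete' previews what would be removed. A sketch:

```shell
# Scratch directory with one good link and two deliberately broken ones
cd "$(mktemp -d)"
touch real.txt
ln -s real.txt ok.sph           # resolvable link
ln -s /nonexistent broken1.sph  # broken on purpose
ln -s /nonexistent broken2.sph  # broken on purpose

# Preview: with -L, find follows links, so '-type l' matches only
# the ones that cannot be resolved
find -L . -maxdepth 1 -type l

# Now actually delete the broken ones; the good link survives
find -L . -maxdepth 1 -type l -delete
ls   # ok.sph  real.txt
```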

3/10
 * I added information about transcripts to the NOAA wiki page. I will add more when Marcel gets back to me.
 * I wanted to verify that the soft links I fixed yesterday were properly corrected. I forgot to count how many files were in the 'utt' directory for 'first_5hr' before I started fixing the directory so I couldn't simply compare the before and after result of 'ls -1 | wc -l'.
 * First, I used 'ls -1 | wc -l' to see how many total files there were: 4659.
 * These soft links all point to the 'full' directory, so I ran 'ls | head -4660' within 'full' to look at the first 4660 files. The first file in the directory was 'nohup.out', not an utterance file, so I had to set the head count one higher than the number of files I actually wanted to see.
 * The command listed 'sw2062B-ms98-a-0104.sph' as the last file in the search - the same as the last file in the 'first_5hr' directory. Since the last files match, the directories line up one-to-one, which means the soft links were fixed correctly.
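The same check can be scripted; a sketch with invented stand-in directories playing the roles of 'full' and 'first_5hr':

```shell
# Invented stand-ins for full/.../utt (real files) and first_5hr/.../utt (links).
mkdir -p /tmp/verify/full /tmp/verify/first5
for i in 1 2 3; do
  touch "/tmp/verify/full/utt$i.sph"
  ln -sf "/tmp/verify/full/utt$i.sph" /tmp/verify/first5/
done

# Count the links, list that many source entries, and compare the last names.
# (If the source holds extras like nohup.out, bump the head count accordingly.)
n=$(ls -1 /tmp/verify/first5 | wc -l)
last_link=$(ls -1 /tmp/verify/first5 | tail -1)
last_src=$(ls -1 /tmp/verify/full | head -"$n" | tail -1)
[ "$last_link" = "$last_src" ] && echo MATCH
```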

Find out any remaining tasks for Data Group
 * Plan:
 * Concerns:

Week Ending March 24, 2015

 * Task:
 * Divide up Data Group tasks.
 * Coordinate with Patriots to exchange contact information and discuss ideas.

3/14
 * Results:
 * I outlined the rest of the tasks for the Data Group to complete:
 * Understand the scripts so that we know what needs to be implemented in NOAA. I believe we need a dist/dict section for NOAA
 * Make sure all the 'info' directories contain correct dictionaries (make sure the transcripts and dictionaries match up). We may need to create some dictionaries from scratch as well.
 * Clean up the dist/dict/custom/old area. After looking in Caesar, /mnt/main/corpus/switchboard/dist/dict/custom is the directory that needs to be organized.
 * Also, we need to clean up the scripts directory (Sam can supervise as needed).
 * Russ started cleaning up the experiments directory to eliminate the wav and feats directories. I emailed Morgan to let him know that we took responsibility for this task.
 * We plan to meet via Hangouts tomorrow or Monday to distribute our tasks.

3/20
 * Data Group met via Hangouts to divide up remaining group tasks.
 * Read logs (pretty empty because it's vacation this week :D ).
 * I want to run a train ASAP, but Caesar is down, so that is on hold. I am awaiting a response from Professor Jonas to see if he knows the cause. For what it's worth, other students are having the same connection issues.

3/21
 * Professor Jonas hasn't gotten back to me but it appears that Mohamed is looking into the issue today. I will run a train once it is up.
 * Read through other logs.

3/24
 * Since Caesar is still down, I decided to create a helpful links section at the top of my log page. This will help future students see the most relevant information immediately.
 * Plan:


 * Concerns:

Week Ending March 31, 2015

 * Task:

3/25 3/27 3/29
 * Results:
 * Access to Caesar has been restored by first ssh'ing to cisunix and then to Caesar.
 * I attempted to run a 5hr train and it seemed to freeze after a while. I will attempt to run another one this week.
 * Tasks for the group have been divided amongst the team, and I sent out an email detailing what everyone needs to do.
 * Yesterday, Professor Jonas and I tried to figure out why the disk 4 and 8 directories were created in all uppercase. To get around this in the past, soft links were specifically created to account for the case mismatch. I did some digging in past logs and found references to the issue, but I couldn't find a specific reason for it. Whatever the cause, the plan is to fix it in case it leads to further problems.
 * The train that I ran in class got interrupted, so I started a new one. I ran into an issue while running the following command:

nohup run_decode5.pl 007 0266/007 1000

 * This produced an error of "run_decode5.pl: Command not found". Garrett suggested using the nohup DECODE/run_decode5.pl 007 0266/007 1000 command instead. The error no longer showed up, but the process then froze, so I'm currently waiting to see if it produces any results.
 * I also did some group work. The results will likely be posted over the weekend.
 * My 5 hour train ran successfully. Since this was just a test, and not part of the group project, I've placed the results below:

|=================================================================|
|         | # Snt  # Wrd | Corr   Sub   Del   Ins    Err  S.Err   |
|=================================================================|
| Sum/Avg | 3506   42940 | 75.0  18.4   6.6  16.9   41.9   93.6   |
|  Mean   | 43.8   536.8 | 75.4  18.5   6.2  19.3   43.9   94.6   |
|  S.D.   | 20.2   247.5 |  7.2   5.7   2.7   9.4   12.1    6.9   |
| Median  | 40.0   486.0 | 76.2  17.0   5.9  17.6   43.3   96.6   |
|=================================================================|

3/31
 * I also contributed to the group project.
 * I read logs from classmates and group mates. Most people have now run a train, which is exciting! It isn't a difficult process, so those who haven't aren't that far behind, but the more people who run them means more ideas to lower the baseline.
 * I can't seem to log in to Caesar. I'm using the same password I always have, but it's not working. It seems like most people are having this issue, so I'm not alone. Trevor said earlier that he was having the same problem but found a solution, so I'll see what he says.
 * Plan:


 * Concerns:

Week Ending April 7, 2015

 * Task:

4/2 4/4 4/5 4/6
 * Results:
 * Yesterday I fixed the soft links for the 125hr directory. I thought everything was fixed, but I guess we missed some. (I have outlined the soft-linking process before, so please refer to those logs if you'd like more detail.)
 * Today, I remembered an earlier task to delete the feat/wav directories for experiments 0001-0140. Russ had started this process (up to 0080). To delete these directories, I used the following commands:

rm -rf /mnt/main/Exp/{0081..0140}/wav
rm -rf /mnt/main/Exp/{0081..0140}/feat

 * We had agreed to initially place the contents of these directories into a 'DELETE' directory since that's what we thought Professor Jonas wanted. He would rather have these files gone completely, so I went back and removed the old 'DELETE' directories:

rm -rf /mnt/main/Exp/{0001..0080}/DELETE
 * I'm currently looking into updating the info directories (they're all empty except for 5hr). I have a plan that I outlined to the data group and I will update my log when we all agree on a process.
 * I am reading up on Speech Recognition from CMUSphinx.
 * Started a 256hr train in Exp 0275.
 * Train is still running, so I am checking on its progress throughout the weekend.
 * Patriots have been assigned clients Automatix, Methusalix, and Verleihnix. I quickly ensured that I could access them by ssh'ing to them from Caesar.
 * So it turns out that running a 256hr train without changing the configuration will take... 256 hours. Per Professor Jonas's request, I killed the process since it had already reached 30 hours (more than enough for results). I ended up scoring the results on Automatix just to give Caesar a break.
 * I won't post the results on our team page since it's insecure. Instead, I will wait until there is a safe page on the wiki to discuss results.
 * Up to this point I have focused on trains but there are some important data tasks that are unresolved. Tomorrow I will look into both the empty info directories (important) as well as the seemingly small size of the 125hr directory (very important).
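Bulk deletes like the rm -rf commands above are unforgiving, so a cautious pattern is to print the doomed paths before removing them. A sketch with invented /tmp directories standing in for the Exp tree:

```shell
# Build a throwaway stand-in for Exp/NNNN/wav and Exp/NNNN/feat (paths invented).
for i in 0001 0002 0003; do
  mkdir -p "/tmp/expdemo/$i/wav" "/tmp/expdemo/$i/feat"
done

# Dry run: echo the exact commands that would run.
for i in 0001 0002 0003; do
  echo rm -rf "/tmp/expdemo/$i/wav"
done

# Real run, once the dry-run list looks right.
for i in 0001 0002 0003; do
  rm -rf "/tmp/expdemo/$i/wav"
done
```

The loop form also sidesteps shells where brace ranges like {0081..0140} don't expand.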
 * Plan:


 * Concerns:

Week Ending April 14, 2015

 * Task:

4/8
 * Results:
 * I've been looking at Caesar to try to figure out the issue with the 125hr directory. I don't know of a way to see the length (in time) of the sphere files, so I instead looked at the file size. I counted the size of each disk from switchboard; all together, it adds up to 14.235GB. On the wiki, switchboard is totaled at 259 hours. This means:

14235 (MB) / 259 (hours) = 54 MB / hour
14235 (MB) / 2435 (files) = 5.85 MB / file
54 / 5.85 = approximately 9 files / hour

 * This math more or less checks out with the current directories: first5hr has 40 files (not 45, or 9x5) and full has 2435 (not 2331, or 9x259), but as you can see the numbers are very close, and we will only be using subsets anyway. All this being said, to recreate the new 125hr directory I need to create links to (9x125) or 1125 files. This is much more than what it currently has (90) and should be close to 125 hours. I plan to ask Professor Jonas in class today to get his opinion before I update the directory.
 * I presented an updated URC poster to the data group based on contributions from each team member. I plan to flesh it out a little more before sending it to Professor Jonas.

4/9
 * Krista noticed that the files for 125hr begin at sw03189.sph and end at sw03285.sph. Since these files are roughly halfway through 256hr, she hypothesizes that the missing 125hr .sph files are all the ones leading up to sw03189.sph. I am going to recreate these soft links, and if the # of files is roughly half of 256hr, then her hypothesis will be correct. The same will need to be done for the utt files.
 * After creating the missing soft links, I am happy to report that the numbers check out: 1147 .sph files (I estimated 1125) and 148271 utt files (there are 250330 in 256hr, again roughly half).
 * To do this, I used the same soft-linking method as before:

ln -fs /mnt/main/corpus/switchboard/125hr_3170/train/audio/conv/sw{02001..03188}.sph -t .

 * Then I deleted the broken links (created because not every number between 2001 and 3188 actually exists as a .sph file):

find -L -maxdepth 1 -type l -delete

 * For the utterances, I used:

ln -fs /mnt/main/corpus/switchboard/full/train/audio/utt/sw{2001..3188}A-ms98-a-0001.sph -t .
ln -fs /mnt/main/corpus/switchboard/full/train/audio/utt/sw{2001..3188}B-ms98-a-0001.sph -t .

 * I had to manually increment the 0001 because using 2 ranges in the same command wouldn't work for me. I incremented up to 230 for now, and I am seeing incremental increases in files each time (around 10 new files per 10 commands). Ideally, I would increment until I reached the limit, but I am not sure how high the number goes. For now, I believe the directory should work fine. If it doesn't, I will add more utterances.

4/13
 * A lot has happened regarding 125 hour trains since my last update. I'll outline the highlights (tl;dr: 125hr is fixed, some more data group work left).
 * Professor Jonas theorized that since the directory is titled '125hr_3170', the .sph files should end at 3170. He also outlined the process he would use to fix the directory:

%cd /mnt/main/corpus/switchboard/full/train/audio/utt     (go to full corpus)
%ls > ~/full.txt                                          (dump ls into a file)
%cd /mnt/main/corpus/switchboard/125hr_3170/train/audio/  (now go to audio in 125)
%mv -i utt OLD                                            (move utt into OLD)
%mkdir utt                                                (make new utt dir)
%cd utt                                                   (go into utt dir)
%foreach id (`head -139985 ~/full.txt`)                   (loop over first 139985)
%ln -s /mnt/main/corpus/switchboard/full/train/audio/utt/$id $id   (create soft link)
%end

 * 139985 files are created because that is the number of utterances covered by 3170 conv files. This was found by using:

% grep -n 3170 ~/full.txt
139985:sw3170B-ms98-a-0045.sph

 * The necessary transcripts needed to be made as well:

% cd /mnt/main/corpus/switchboard/full/train/trans/
% foreach id (`head -139985 ~/full.txt`)                  (loop over first 139985)
% grep $id full_train.trans >> train.trans                (grab relevant trans)
% end
% mv -i train.trans /mnt/main/corpus/switchboard/125hr_3170/train/trans/.

 * I used this process and it didn't work for me. That is because there was a discrepancy in full_train.trans (which at the time was full_transcript.txt). A word count (wc) showed that there were 3 more entries in full_train.trans than in the list of full's utt files. These extra entries were nohup.out, temp.wav, and utt. Removing them corrected the discrepancy. Professor Jonas went through a lot of steps to make this work, so I'll just outline some overall tips that help with data manipulation:
 * Use wc and diff to see whether two files are identical.
 * Use head and tail to view the first/last few files of a directory and spot inconsistencies quickly.
 * A file can be transformed (with sort/awk) so that diff comparisons line up.
 * With all this done, I then renamed the redundant trans files to train.trans.
 * I'm now going to run a 125hr train since everything should be correct. I saw that Kayla attempted one and it failed due to something like 'train.trans not found'. This error shouldn't occur now that I have renamed the file.

4/14
 * I ran through the 125hr train and everything went successfully until the final step. I received the following error:

Segmentation fault (core dumped)

 * I read through Stephen's log and saw that he had the same issue a couple weeks ago. I hope to figure out the cause of this error tomorrow.
 * I haven't spent much time on my group assignment for the week. I plan to do this before class tomorrow (last minute, but better than nothing).
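The manual incrementing of the utterance suffix (the 0001 in the ln commands) can be avoided with nested loops instead of two ranges in one command. A sketch under invented /tmp paths, zero-padding the suffix with printf:

```shell
# Invented stand-ins for the full/ utt directory and the 125hr target.
mkdir -p /tmp/uttdemo/full /tmp/uttdemo/subset
touch /tmp/uttdemo/full/sw2001A-ms98-a-0001.sph \
      /tmp/uttdemo/full/sw2001A-ms98-a-0002.sph

cd /tmp/uttdemo/subset
for conv in 2001 2002; do             # outer range: conversation ids
  for part in 1 2; do                 # inner range: utterance index
    suffix=$(printf '%04d' "$part")   # zero-pad to match the -a-0001 naming
    src="/tmp/uttdemo/full/sw${conv}A-ms98-a-${suffix}.sph"
    [ -e "$src" ] && ln -sf "$src" .  # link only files that actually exist
  done
done
```

The existence test also removes the need for a separate broken-link cleanup pass, since no dangling links get created in the first place.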
 * Plan:


 * Concerns:

Week Ending April 21, 2015

 * Task:

4/16 4/17 4/19 4/20
 * Results:
 * Patriots have a new parameter that we're testing that seems promising. Running a new 125hr train with the new settings.
 * 125hr train is currently decoding on Methusalix, awaiting results...
 * My 125hr train should be finished by now, but it's still running. I definitely remembered to create a smaller subset by only looking at the first 5000 files:

head -5000 008_train.fileids > 008_decode.fileids
 * My original 125hr train ran very quickly, but I now know that the decode was incorrect on that one. However, Krista finished her 125hr train decode faster than mine, and she started around the same time. I'll email her and see how long her train took to decode. A 125hr train without subsets would take 5 days, so it's on track to finish Tuesday regardless, but since I did create a subset it should already be done.
 * My previous train is still running (I have checked decode.log and it is still going). Krista told me that her train only took 12 hours, so in the meantime I have created another 125hr train and will only use a subset of 1000 with the new parameter to see if my results match hers. This one should definitely finish before Wednesday, so at the very least I will have results for one of the currently running trains.
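Carving out a decode subset and confirming its size takes two commands; a sketch with an invented fileids list:

```shell
# Invented stand-in for a train fileids list (20 fake utterance ids).
seq -f 'utt%g' 1 20 > /tmp/008_train.fileids

# Take the first 5 ids as the decode subset, then confirm the size.
head -5 /tmp/008_train.fileids > /tmp/008_decode.fileids
wc -l /tmp/008_decode.fileids
```

Checking the line count right after the head command is a cheap way to confirm the subset was actually created before kicking off a multi-hour decode.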


 * Plan:


 * Concerns:

Week Ending April 28, 2015

 * Task:

4/24 4/26 4/27
 * Results:
 * My previous trains completed, but I forgot some important parameters, so the results weren't as I expected. Luckily, our team (omitting the specific team member so that it's more difficult for the opposing team to figure out our results) reran the trains with the appropriate parameters and got very promising results.
 * I am currently working on a new task for my team this week.
 * I was studying some scripts for my task this week. I understand the scripts better now but couldn't quite get them to function the way I wanted. I'll have to look into it more tomorrow.
 * Read other logs.
 * Made a lot of progress on task. Now just to test it...


 * Plan:


 * Concerns:

Week Ending May 5, 2015

 * Task:

4/29 5/3 5/4 5/5
 * Results:
 * The final game plan has been sent out to the team. Garrett helped me work out a problem I was having and is now testing it to see how it improves the WER.
 * We have results with better WER but a higher RTF. We will need to decide which is the better overall result.
 * The results report is coming together.
 * Read logs to see progress from both teams.
 * I started compiling the report today. The finishing touches will wait until tomorrow so that the most recent results can be included.
 * I compiled report sections from Kyle, Refik, Trevor, and Nathaniel to create the Patriots report. I added in some more information and emailed the rough draft to the team. I will make revisions tonight so that we can have a finalized version tomorrow.
 * Plan:


 * Concerns: