Speech:Spring 2011 James Log



Week Ending March 8th, 2011
 * Task: This past week I was tasked with revising the project proposal subsection for the “Building Models” group. This involved going through the current proposal, adding the details that were lacking, and rewording some passages to make the descriptions more accurate. In addition, I was tasked with performing some task or training on Troubadix but, after asking fellow capstone project members about it, I was unable to determine what the task was or how to complete it. Despite this, the task of updating the entire blade rack was performed during and after the last meeting.
 * Results: Revising the project proposal document was relatively straightforward. I reviewed the current text and then added to or edited it where needed. I then sent the document to my fellow team member KC, who further reviewed and edited it before returning it. After looking over the document and noting the changes she had made, I sent it to Matt to finalize the group section of the proposal. In contrast to the success of the above task, the tasks associated with my assigned server were an utter failure. The majority of Brian's and my time during the last meeting was spent in the server room updating the rack with the required packages. We were not really briefed on the tasks we were expected to perform for next week and were not informed of the status of the configuration of our individual machines. Despite this, after futzing around in SSH for a few hours, I attempted to configure my assigned system to build the SphinxTrain package that I found on it, but failed in that regard. Overall, the tasks associated with my server were complete and utter failures because of the lack of clarity about what they were.
 * Plan: This next week's plan is to catch up as soon as possible on last week's failures and move forward with the tasks outlined in the proposal document. A suitable dictionary must be found, which is a relatively simple task. After that, I need to start building the mini training set and have it running by next week. I also need to find out how to operate the Sphinx system, as I am still cloudy on the details. I plan to do my work Wednesday during the day, late Friday night, and late Saturday night.
 * Concerns: I think that some sort of meeting notes are in order, because the fogginess about what the assigned tasks were resulted in their incomplete status this week. Meeting notes should be checked by all members before leaving at the end of the capstone meeting and then emailed out to everyone so that we're all on the same page for the week.

Week Ending March 22nd, 2011
 * Task: Over the spring break I was tasked with some duties relating to my group leadership. I was responsible for updating the proposal with task assignments and estimated completion dates. I completed the proposal update later than expected, but then received an email with the updated proposal and decided to use the tasks and schedule within that document. During the break I sent an email to my team asking which tasks they would like to perform; however, I never received any response. I then assigned the tasks listed below to those I thought most suited to them, keeping in mind that both Scott and Nick might need to have their workload shared with me or KC. If they cannot complete their tasks, we will support them by breaking the tasks down into smaller ones and dividing them among other team members. In addition, I was responsible for setting up the SVN on Caesar so that it could be accessed through user accounts. This will be done using SSH tunneling, which should give the best overall security for Caesar. Rather than setting up an Apache server, which brings its own host of configuration hassles, the SSH tunnel will give us full functionality through a service already familiar to the less experienced Unix users. Current tasks for the group are as follows:
   * Mini Switchboard train set (James Bartoldus, March 29th)
   * Full train set (James Bartoldus, April 3rd)
   * Mini development test set (James Bartoldus, March 29th)
   * Full test set (James Bartoldus, April 3rd)
   * Full evaluation test set (James Bartoldus, April 3rd)
   * Tool that parses transcriptions from Switchboard to Sphinx (Nick Sandberg, April 3rd)
   * Tool that calls on an application to down sample audio files (KC Ibey, March 29th)
   * Tool that generates new experiment directories according to the experiment directory structure (Scott Innes, April 3rd)
 * Results: The scheduling is complete for the currently known task list of the Building Models group. The SVN has a few kinks to work out before it will be accessible; however, the underlying structure is there and all the functionality is available.
 * Plan: This next week I plan on getting the dataset definitions for the mini sets squared away and getting the SVN server fully configured so that the Perl scripts being written will be under version control. The SVN configuration will be relatively straightforward and just needs some trial-and-error tweaking of the configuration file. I also plan on putting up a wiki page listing the commands needed to operate SVN from the command line, as I am not sure whether SSH tunneling is available in graphical clients such as TortoiseSVN.
 * Concerns: The datasets worry me at this point, as I am not entirely clear on what defining the dataset entails. Additionally, my group leadership worries me, as I didn't receive a single email response from any of my team members over the spring break. Please respond to my emails with at least an “acknowledged” from now on. Is there any way that the Building Models group could communicate more effectively? Could the group perhaps swap AIM screen names so we can actually have a conversation and get in touch when we need to?

Week Ending March 29th, 2011
 * Task: This past week I was tasked with the creation of the mini dataset, consisting of both a mini training set and a mini development set: an hour of training data and a half hour of development test data. The sets pull from the same data that the full dataset will draw from: the //media/data/Switchboard directory on Caesar. For the full dataset, my intention is to create a directory within the Switchboard directory containing symbolic links to individual files spanning the disks, so as to provide samples from the entirety of the set. For the mini dataset I intend to create a similar setup using the data from just one disk, as the mini dataset is meant to function more as a proof of concept than anything else. In addition, I have decided that the SVN can best be accessed through SSH tunneling to put and pull revisions from Caesar. Subversion has built-in secure shell functionality that lets it be accessed in much the same way that users currently access Caesar, which should make the transition easy. The other option is to set up an Apache server; however, I believe that would introduce a host of security issues best avoided.
 * Results: The dataset is still not fully set up, as I need to consult with the group to get more up-to-date information on the nature of the Switchboard data. I plan on using time after the Capstone meeting today to consult with the Building Models group and the rest of the Capstone to get the datasets squared away. It seems at this point that things have gotten away from me a bit, with my senior project commitments taking up much of my attention. However, because this is also an important class, I intend to devote much more effort to it in the coming weeks as my senior project shifts from the development phase into the report writing phase.
 * Plan: The SVN should be ready to use before next class, once I consult with Brian on the user configuration we will be using. After that it will be a simple matter of emailing everyone who will be using the system a brief set of commands for using the repository. For next week I intend to work on the full dataset and to help my team members get accustomed to the Subversion repository so that our work for the project is all under version control. All documents can go under version control should that be deemed necessary. My primary concern is the development of the full dataset, which will use the full 150 hours of data from the Switchboard files. This will undoubtedly take some time to organize, and I will keep the group updated if I run into any major issues that could prevent it from being finished for the next meeting. I also intend to offer as much assistance as I can to any team member who requires it. My AIM screen name is dryraininbetween, and I am almost constantly signed in and available to take on any small tasks team members would like to delegate to me. I realize that I haven't exactly been meeting deadlines myself, but I can offer my services to the best of my ability.
 * Concerns: Currently I don't have any concerns other than time constraints. This can't really be helped, though, as this is my senior year and obviously I'm going to be swamped. If you're waiting on me for anything, don't hesitate to let me know. I try to check my email multiple times daily and always respond to any direct questions as promptly as I can.
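
The symbolic-link layout described in this week's task could be sketched roughly as follows. This is only an illustration in Python, not something that was run on Caesar: the function name `link_subset`, the `.sph` suffix filter, and the `limit` parameter are assumptions made for the example.

```python
import os


def link_subset(source_dir, target_dir, suffix=".sph", limit=None):
    """Create symbolic links in target_dir to audio files found in source_dir.

    source_dir and target_dir are placeholders; the real disk layout under
    the Switchboard directory is not assumed here.
    """
    os.makedirs(target_dir, exist_ok=True)
    names = sorted(n for n in os.listdir(source_dir) if n.endswith(suffix))
    linked = []
    for name in names[:limit]:  # limit=None keeps every matching file
        src = os.path.abspath(os.path.join(source_dir, name))
        dst = os.path.join(target_dir, name)
        if not os.path.lexists(dst):  # do not clobber an existing link
            os.symlink(src, dst)
        linked.append(dst)
    return linked
```

For the mini dataset the function would be called once on a single disk's directory; for the full dataset it would be called once per disk with the same target directory. Because only links are created, the audio itself is never copied.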

Week Ending April 5th, 2011
 * WEDNESDAY: N/A
 * THURSDAY: N/A
 * FRIDAY: N/A
 * SATURDAY: N/A
 * SUNDAY: N/A
 * MONDAY: Worked on getting dangerous with Perl scripting. Wrote a small experimental script that pulls the byte length out of the headers of .sph files, and then attempted to automate that across a single disk. Still working out some issues with that. The script currently generates a few errors and will definitely need to be modified and expanded before it can be used to calculate the entire dataset size, but it is a start.
 * TUESDAY: N/A
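
Monday's entry describes reading a length out of .sph headers. The following is a minimal sketch of that idea, written in Python rather than Perl, and it is not the script from this log. It assumes the files are standard NIST SPHERE files: an ASCII header whose first line is `NIST_1A`, whose second line gives the header size in bytes, and whose remaining lines are `name -type value` fields terminated by `end_head`. The function names are made up for illustration, and treating the "byte length" as file size minus header size is an assumption.

```python
import os


def read_sphere_header(path):
    """Return (header_size, fields) for a NIST SPHERE (.sph) file."""
    with open(path, "rb") as f:
        magic = f.readline().decode("ascii", errors="replace").strip()
        if magic != "NIST_1A":
            raise ValueError("not a NIST SPHERE file: %s" % path)
        header_size = int(f.readline().decode("ascii").strip())
        f.seek(0)
        header = f.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in header.splitlines()[2:]:  # skip the magic and size lines
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, kind, value = parts
        if kind == "-i":      # integer field
            fields[name] = int(value)
        elif kind == "-r":    # real-valued field
            fields[name] = float(value)
        else:                 # string field (-sN)
            fields[name] = value
    return header_size, fields


def data_byte_length(path):
    """Bytes of audio data on disk: file size minus header size (assumption)."""
    header_size, _ = read_sphere_header(path)
    return os.path.getsize(path) - header_size
```

Looping `data_byte_length` over every .sph file on one disk would give the per-disk total that the entry was trying to automate.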

Week Ending April 12th, 2011
 * WEDNESDAY: N/A
 * THURSDAY: N/A
 * FRIDAY: N/A
 * SATURDAY: N/A
 * SUNDAY: N/A
 * MONDAY:
 * TUESDAY:

Week Ending April 19th, 2011
Saturday

Worked on altering my Perl script to parse the full transcript file and determine the cumulative length of the dataset, for a more accurate measurement than the vague >240 hours given by the LDC.

It doesn't work yet; it still needs to be tweaked before it will compile. Will work on it more Sunday.

Monday

Finished working on the script to find the full length of the data with a high degree of accuracy.

Here is output from the script:

    caesar:/home/linux/Documents # perl timeFiles.pl
    Dataset Cumulative Length seconds : 969796Dataset Cumulative Length minutes : 16163.2666666667caesar:/home/linux/Documents # perl timeFiles.pl
    Dataset Cumulative Length seconds : 969796
    Dataset Cumulative Length minutes : 16163.2666666667
    Dataset Cumulative Length hours : 269.387777777778

As you can see, the dataset is approximately 270 hours, a bit off from the LDC's estimate of 240.

The code used to procure the length of the dataset can be seen below.

Accurate subsets of the data can now be constructed for training using this updated length measurement.
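
As a rough illustration of the calculation only (this is not timeFiles.pl, and it is in Python rather than Perl), the cumulative length could be computed as below. The transcript line format is an assumption: each line is taken to look like `<utterance-id> <start-seconds> <end-seconds> <text>`. The function names are hypothetical.

```python
def cumulative_seconds(lines):
    """Sum (end - start) over transcript lines.

    Assumed line format: "<utterance-id> <start-seconds> <end-seconds> <text>".
    Lines that do not match are skipped.
    """
    total = 0.0
    for line in lines:
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue
        try:
            start, end = float(parts[1]), float(parts[2])
        except ValueError:
            continue
        total += end - start
    return total


def length_report(seconds):
    """Express a length in seconds as seconds, minutes, and hours."""
    return {
        "seconds": seconds,
        "minutes": seconds / 60,
        "hours": seconds / 3600,
    }
```

Applied to the figure reported above, `length_report(969796)` gives about 16163.27 minutes and about 269.39 hours, matching the script output.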

Week Ending April 26th, 2011
Modified the script to find the lines of transcript for the mini dataset. I now need to write a script to pair the transcript with the actual audio files, unless that's already been done.

One hour of data runs from the start of the file to line 554; one half hour of data runs from line 554 to line 820.
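
Line cutoffs like these can be found by keeping a running total of utterance lengths and noting the line where it first reaches the target. The sketch below shows the idea in Python; it is not the modified script, it assumes the same hypothetical `<utterance-id> <start-seconds> <end-seconds> <text>` line format, and the function names are invented for the example.

```python
def utterance_seconds(line):
    """Length of one transcript line in seconds, or 0.0 if it cannot be parsed."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return 0.0
    try:
        return max(0.0, float(parts[2]) - float(parts[1]))
    except ValueError:
        return 0.0


def line_reaching(lines, target_seconds, first_line=1):
    """Return the 1-based line number at which the running total, counted
    from first_line, first reaches target_seconds (None if it never does)."""
    total = 0.0
    for number, line in enumerate(lines, start=1):
        if number < first_line:
            continue
        total += utterance_seconds(line)
        if total >= target_seconds:
            return number
    return None
```

Calling it with a target of 3600 seconds from the top of the file, and then with 1800 seconds starting from the first cutoff, would produce a pair of boundaries of the kind recorded above.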

Week Ending May 10th, 2011
Wednesday

Worked on getting the transcript parsing script in order. Had some trouble understanding regular expressions and how to use them effectively.

Thursday

Worked on the script again and added some functionality. The script pulls out the relevant information and then removes all text between brackets using regular expressions. I still need to do some work to get all of the functionality working. Found a regex reference card to use while writing the script.

Friday

Didn't get to work on the script. Hopefully I can get the regex working from the shell script tomorrow night.
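
The bracket-removal step mentioned in Thursday's entry can be expressed with a single regular expression. This is a small Python sketch of the technique, not the Perl script itself; the name `strip_bracketed` is made up, and nested brackets are not handled.

```python
import re

# Matches "[", then any run of characters other than "]", then "]".
BRACKETED = re.compile(r"\[[^\]]*\]")


def strip_bracketed(text):
    """Remove every [bracketed] span and collapse the leftover whitespace."""
    without = BRACKETED.sub(" ", text)
    return " ".join(without.split())
```

For example, `strip_bracketed("[noise] hello [laughter] world")` returns `"hello world"`. The same pattern, `\[[^\]]*\]`, works unchanged in a Perl substitution.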