Speech:Spring 2014 Proposal Group



Summary
I created this group so we can discuss how to write the Proposal so that it reads as a narrative. The project has five separate group sections in total (Systems, Experiment, Tools, Data, Modeling).

Professor has told us time and time again that he wants this narrative rather than what has been done in the past. From what I can see in past proposals, everything is split up into the group sections, and I think a huge reason for that is simply the way MediaWiki structures its pages. That's all fine and nice, but I think we should totally axe the idea of creating separate sections on the Proposal page without totally axing the principle behind them.

// I like the structure, etc., but I know there will need to be more detail in a few areas. I am specifically thinking about the tools section: I would want to describe the tools we use, what version is installed, and what is available (much like I did in the proposal already). This is not only so we know what we are using, but so the next semester will know the current state of what's installed. Are there any other groups that would need something similar (maybe the Systems group)? If there is supposed to be an overall narrative, I think we should avoid words like "our" and "we". For example, the intro of the Experiments group starts with "The experiments group...their..." but then moves to "we...our"; I think it would be better with "they" and "their", etc. If someone needs to go over it for this sort of structure and cohesion, I am happy to. I'll add the tools section to fit this outline tomorrow pm - Justin Alix //

I'm talking about not using multiple sections that force separation and take away the narrative feel he wants. I'm thinking about having ONE section on MediaWiki (one Edit button) that we write the entire proposal in. We can then follow a guideline for each group like the one below (fake data):

Introduction
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Goals
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

Implementation Plan
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

Timeline
Joshua
 * 2/12 - 2/25 - Work on creating a Master Experiment script to create a variety of types of new experiments.

Pauline
 * 2/12 - 2/25 - Work on creating a Master Experiment script to create a variety of types of new experiments.

Brian
 * 2/12 - 2/25 - Work on creating a Master Experiment script to create a variety of types of new experiments.

Ramon
 * 2/12 - 2/25 - Work on creating a Master Experiment script to create a variety of types of new experiments.

By doing this for EVERY one of our groups, we keep things very uniform: the reader can easily move from one group's section to the next, and by the time they reach each group's timeline, they will have full knowledge of what is going on and what the actual plan is for EVERY group.

This is how I've seen project proposals done in the past at work, which is why I am recommending this. It looks formal and organized which in my opinion is what needs to happen. The client NEEDS to know that YOU know what you're talking about and that you are confident you can offer them what they want.

I would love to discuss this when we meet tomorrow and get a feel for what others think. We should also talk about getting all of our groups' proposals onto this page before the beginning of next week, so one of us can format the actual proposal page before next week's class.

Thanks! - JA

2/15/2014 - Josh

Hey all - So below is the proposal the Experiment group plans on using. I decided to nix the Goals part and combine it with the Implementation Plan section. This isn't finished just yet, as we are still aligning the timeline information, but it shows you guys what I did.

Also, say you're in the Imp. Plan section and want to add sub-headers for different sub-sections: you can do that by adding four equal signs before and after the heading title (e.g. ====Josh====). I did this in the Timeline section with all four of our names. The parser renders smaller, bolded headers that stand out more.

Introduction
The Experiment group focuses much of its energy on providing a better experience for other members while producing a variety of different types of experiments. We spend time perfecting our knowledge of the detailed directory structure each new experiment implements, making sure we know what each part is in case any questions arise. Over the past semesters, the number of experiments has grown tremendously, proving just how important a smooth, efficient process for running experiments is. With the Spring 2014 semester already running a significant number of experiments, our group has great motivation to continue learning this process and applying it: writing knowledgeable guides, detailing all parts of the experiment directory, and bringing some scripts up to date.

Implementation Plan
The Spring 2014 Experiment group's plan for this semester is ambitious. As our introduction states, we want to focus on the theory behind what goes into the Sphinx process, because once we grasp that, we'll have a much better understanding of the experiment directory structure we are using. Once we have that understanding, we can apply it by creating a more instructive and intuitive guide on the Experiment Information wiki page. And depending on time, we may also be able to modify existing scripts to make the entire experiment-making process a bit smoother and more efficient for those who have to run a number of them.

Below is a detailed list of what we would like to accomplish this semester:


 * Better understand the theory and application behind the structure when creating a new Exp. directory.
 * We have 8 directories that get used: etc, feat, models, python, scripts, trans, wav, and logs.
 * I think it's best that we all look into these directories and note how they get created, why they get created, and what contents are in them.
 * You can look at the Run a Train (http://foss.unh.edu/projects/index.php/Speech:Training) tutorial on the Wiki again to see what most of those directories are used for.
 * The other two parts of the Model Building section are Creating a Language Model (http://foss.unh.edu/projects/index.php/Speech:Create_LM) and Running a Decode (http://foss.unh.edu/projects/index.php/Speech:Run_Decode); both of these add TWO new folders to your experiment directory (LM and DECODE respectively).
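To make the target concrete, here is a minimal shell sketch of the empty skeleton based only on the eight-directory list above. This is an illustration of the layout, not the actual behavior of create_expdir.pl (which we still need to read):

```shell
#!/bin/sh
# Sketch only: build an empty experiment directory containing the
# eight subdirectories listed above. The real create_expdir.pl may
# do more (permissions, template files, etc.); that is not verified here.
make_expdir() {
    for d in etc feat models python scripts trans wav logs; do
        mkdir -p "$1/$d"
    done
}

# Example: make_expdir 0123 creates 0123/etc, 0123/feat, ... 0123/logs.
# A later Create LM / Run a Decode step adds LM/ and DECODE/ on top.
```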


 * Totally re-format the Main Experiment wiki page: http://foss.unh.edu/projects/index.php/Speech:Exp
 * The current state is confusing and very out of date. Our goal should be to create a more informative and intuitive page to detail the Experiment process.
 * Once we have an understanding, we should organize this page with detailed information about each of the 8 folders that get created:
 * Why they are added
 * What they get used for
 * Detail any scripts that are currently being used and/or scripts that we create during the semester that have a direct relationship to the experiment directory and the process as a whole.
 * Note all the different types of experiments that can be run based on previous entries. For each one, give a little summary of each and link to a past experiment that someone did.


 * Re-visit the script that actually creates a new (empty) Experiment directory: create_expdir.pl
 * I think we can use this script, expand on it, and possibly use it as our base Master script once we can start combining some of the existing pieces used when doing a new Run a Train experiment (that process requires you to create a brand-new experiment directory), including Eric's train_01.pl and train_02.pl (which already need to be combined) and all the other scripts Colby talked about in the email he sent Wednesday shortly after class.


 * Re-visit the move_to_expdir.pl script - it's not clear what the point of this script is or whether it's being used anywhere.

Josh

 * Week ending Feb 18th
 * Finish and publish final proposal.
 * Gain understanding and theory behind the experiment directory structure.


 * Week ending Feb 25th
 * Revisit create_expdir.pl script
 * Understand what it's currently doing, and add any modifications to make it more widely usable in the process of creating a new experiment - i.e. the Run a Train process.


 * Week ending Mar 4th
 * Should have working knowledge of the create_expdir.pl script. Can write up a simple guide that can be added to the Experiment Wiki page.
 * Can now start incorporating this into the Run a Train process by invoking the train_01 and train_02 scripts (if the user chooses to do so when running the create_expdir.pl script).


 * Week ending Mar 18th - spring break Mar 10th - 14th
 * Complete the create_expdir.pl script so that it is functional in creating a new experiment directory and can directly start the Run a Train process
 * Create a sub-page on the Experiment Wiki page that describes this script in detail.
 * Work with team to modify the Run a Train tutorial page to include the steps for this new script.

Pauline

 * Week ending Feb 18th
 * Gain understanding and theory behind the experiment directory structure.


 * Week ending Feb 25th


 * Week ending Mar 4th


 * Week ending Mar 18th - spring break Mar 10th - 14th

Brian

 * Week ending Feb 18th
 * Gain understanding and theory behind the experiment directory structure.


 * Week ending Feb 25th
 * Log details regarding an experiment, for example, the who, what, when, and where


 * Week ending Mar 4th
 * Generate an experiment from start to finish


 * Week ending Mar 18th - spring break Mar 10th - 14th
 * Detail all aspects of an experiment, for example, what it is, what it does, and how to do it

Ramon

 * Week ending Feb 18th
 * Gain understanding and theory behind the experiment directory structure.


 * Week ending Feb 25th


 * Week ending Mar 4th


 * Week ending Mar 18th - spring break Mar 10th - 14th

Introduction
The Data group is primarily concerned with maintaining, updating, and future-proofing the following elements:
 * Transcripts
 * Word Alignment
 * Audio

We have been tasked with several important aspects of our research and development with speech technology. First and foremost, the transcripts were left a cluttered mess: we have to find and organize all 250 hours' worth of transcripts, and once we find them, we need to make sure they're organized correctly and actually usable. We also need to make sure the audio files are in the correct format and able to be used. The Data group is fully committed to gaining a complete understanding of our current and future positions regarding the data gathered, and we have already begun learning the structure of our several corpora. With these and other tasks, such as checking the genTrans6.pl script, verifying dictionary completeness, and learning about .sph files, the Data group has a lot to keep track of.

Implementation Plan
This semester, the Data group fully intends to accomplish all of its tasks; staying on top of our assignments is what will allow us to succeed. We need to fully understand the various scripts that were previously written to maintain and count transcripts for future use. Once we understand them and get all of the kinks out of our transcripts and audio files, we can continue to make progress on our research. We also want to take this semester to learn Perl and how we can utilize it to better perform our tasks.

Below is a detailed list of what we would like to accomplish this semester:


 * Collect and organize all of our transcripts so they are usable
 * Last year, several Perl scripts were written in order to count and organize the transcripts. The success of these scripts is not yet known. The scripts can be found here (http://foss.unh.edu/projects/index.php/Speech:Spring_2013_Matthew_Henninger_Log)
 * It seems as though these various scripts collectively combined all transcripts into one transcript file. We still need to verify that this is accurate.


 * Make sure eval and dev files are separate from train corpus
 * Professor Jonas spoke about the possibility of eval and dev files not being separate in all corpora


 * Add much more useful information about the Data group than previous semesters did
 * The Data group has only existed once before, and unfortunately its members didn't keep very detailed logs, if logs were present at all.
 * We want to organize all of the Perl scripts used, and the locations of our transcripts and audio, into one central location so future Data groups aren't as lost as we were.
 * A new page on the MediaWiki site dedicated to this would be very helpful.


 * Read and understand the genTrans6.pl script
 * Taken from the MediaWiki page: "This is the Perl script that was created to do nearly everything you want. It cleans the transcripts and creates the wav files. It locates the .sph files from the specified directory and it converts each one to a .wav file. It then goes through the transcript and cleans it up. This means that it takes out the header and it leaves the <s> for the start. It also changes all characters to uppercase and deletes any [, ], {, }, and - that it finds. This is done through the use of the "sed" command. It does this all the way through the script, and it leaves the </s> to show that it is the end of the line."
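Based purely on that quoted description, the character cleanup can be approximated in shell like this. The sample line is invented, and the real genTrans6.pl also strips headers and manages the <s>/</s> markers, so treat this as a sketch of the idea, not the script's actual code:

```shell
#!/bin/sh
# Approximation of the cleanup described above: uppercase every
# character, then delete any [, ], {, }, and - characters.
# (Header removal and the <s>/</s> sentence markers that the real
# script handles are omitted here.)
clean_line() {
    printf '%s\n' "$1" \
        | tr '[:lower:]' '[:upper:]' \
        | sed 's/[][{}-]//g'
}

clean_line 'okay - so [um] this is a {test} line'
# Prints: OKAY  SO UM THIS IS A TEST LINE
# (doubled space where the '-' was deleted)
```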


 * Verify completeness of our dictionary
 * We also need to locate the dictionary


 * Understand the experiment and train processes, and successfully run trains and test on trains
 * Media wiki pages are very helpful for this

John

 * Week ending Feb 18th
 * Understand trains and experiments


 * Week ending Feb 25th
 * Use Perl scripts to better understand how they work
 * Specifically genTrans6.pl


 * Week ending Mar 4th
 * Collaborate with group members to create a useful page on Mediawiki for other future Data Group members


 * Week ending Mar 18th - spring break Mar 10th - 14th
 * Finalize any last tasks

Jared

 * Week ending Feb 18th
 * Verify that the corpus is up to date.


 * Week ending Feb 25th
 * Create a way to determine the length of training transcripts (tiny, 5hr, etc).


 * Week ending Mar 4th
 * Find a way to convert .sph files to wav and determine total length.


 * Week ending Mar 18th - spring break Mar 10th - 14th
 * Verify transcripts match audio for each train.
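For the .sph to .wav conversion and total-length tasks in the timeline above, one possible route is sox. Note the assumptions: that sox is installed on our machines, that it was built with NIST SPHERE (.sph) support, and that soxi -D prints a file's duration in seconds. None of that is verified against our actual corpus, so this is only a starting sketch:

```shell
#!/bin/sh
# Sketch for the .sph -> .wav task. ASSUMES sox is installed with
# NIST SPHERE support and that `soxi -D FILE` prints FILE's duration
# in seconds; neither assumption is verified here.

# Convert every .sph file in a directory to a .wav next to it.
convert_sph_dir() {
    for f in "$1"/*.sph; do
        [ -e "$f" ] || continue      # no .sph files: nothing to do
        sox "$f" "${f%.sph}.wav"
    done
}

# Sum durations (seconds, one per line on stdin) to get total length.
sum_durations() {
    awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# Usage sketch:
#   convert_sph_dir /path/to/audio
#   for w in /path/to/audio/*.wav; do soxi -D "$w"; done | sum_durations
```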

Mitchell

 * Week ending Feb 18th


 * Week ending Feb 25th


 * Week ending Mar 4th


 * Week ending Mar 18th - spring break Mar 10th - 14th