Speech:Spring 2013 Matthew Henninger Log



Week Ending February 5th, 2013

 * Task: Determine what data is missing from the current transcript files.

 Feb 1, 2013: Trying to figure out how many hours of transcripts I have. I am using the transcription files in switchboard_word_alignments.tar.gz. I extracted the archive and then copied all of the transcript files into a single folder using the find command. If you use this command, remember to first create the directory the files will be copied into (shown here as transcripts/):

```shell
find . -iname '*trans.text' -exec cp {} transcripts/ \;
```

Once I had copied the files to the directory, I wrote a Perl script that reads all the files in a directory and reports the total time covered by the transcripts.
 * Results:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd;
use File::Find::Rule;  # needed to install libfindbin-libs-perl for this to work

my $seconds = 0;
my $minutes = 0;
my $hours   = 0;
my $days    = 0;

my $cwd = getcwd;  # get current directory

# only read files ending in .text
my $includeFiles = File::Find::Rule->file->name('*.text');

# get all filenames ending in .text and add them to @filenames
my @filenames = File::Find::Rule->or($includeFiles)->in($cwd);

# sort filenames so they are in order
@filenames = sort @filenames;

# open all files and total up each line's duration
foreach (@filenames) {
    open(my $filebuffer, "<:encoding(UTF-8)", $_) or die "cannot open < $_: $!";
    while (my $row = <$filebuffer>) {
        chomp $row;
        # split line by spaces; fields are: utterance-id start-time stop-time words...
        my @splitrow = split(/ /, $row);
        # subtract the line's start timestamp from its stop timestamp
        $seconds += $splitrow[2] - $splitrow[1];
    }
}

# conversions
$minutes = $seconds / 60;
$hours   = $minutes / 60;
$days    = $hours / 24;

# print rounded numbers
printf "Total time in Seconds: %.2f\n", $seconds;
printf "Total time in Minutes: %.2f\n", $minutes;
printf "Total time in Hours: %.2f\n",   $hours;
printf "Total time in Days: %.2f\n",    $days;
```

This script outputs the total time for all the transcripts:

Total time in Seconds: 1865856.03825001
Total time in Minutes: 31097.6006375002
Total time in Hours: 518.293343958337
Total time in Days: 21.595555998264

I noticed that the files were separated into A and B segments, A being the caller and B being the receiver of the call, so I modified my script to get totals for A and B.

Total A:
Total time in Seconds: 932928.019125013
Total time in Minutes: 15548.8003187502
Total time in Hours: 259.14667197917
Total time in Days: 10.7977779991321

Total B:
Total time in Seconds: 932928.019124983
Total time in Minutes: 15548.8003187497
Total time in Hours: 259.146671979162
Total time in Days: 10.7977779991317

Once rounded, the times for A and B are equal. This is to be expected, because the A and B transcripts are records of the same conversations, so their totals should match.

Feb 4, 2013: Used ssh to log into caesar to look for the transcript files. I used the find command to locate all *.text files in order to find the transcription files. All of the files from switchboard_word_alignments.tar.gz were on the server intact, so all of the data seems to be on caesar. I found a script and a text file called ms98_icsi_word.text under /mnt/main/corpus/scripts. I used FileZilla to download the files to my computer. After running my script on them, I received this result:

Total time in Seconds: 36116.702774
Total time in Minutes: 601.945046
Total time in Hours: 10.032417
Total time in Days: 0.418017

This must be the transcript that is missing data. Since the original transcripts are complete on caesar, there must be an error in the script genTrans.pl.
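Quoting matters when locating files this way: an unquoted *trans.text can be expanded by the shell before find ever sees it. A minimal runnable sketch (the directory names and filenames here are made up for the demo, not the real corpus layout):

```shell
# build a tiny demo tree with one transcript file and one unrelated file
mkdir -p demo/corpus/a demo/corpus/b
touch demo/corpus/a/sw02001A-ms98-a-trans.text demo/corpus/b/notes.txt

# quote the pattern so find, not the shell, does the matching
find demo/corpus -iname '*trans.text' | sort
```

The same quoted pattern works with -exec cp to collect the matches into one folder.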


 * Plan: Download the transcript files, then sort them into one folder. Write a Perl script to count the amount of time accounted for in the transcription files. Compare the times with what is stored on caesar. If there are missing files on caesar, figure out why and fix them.
 * Concerns: I do not know Perl, but it seems pretty easy to use. Data could be difficult to track down. This website is slow and hard to navigate; it is taking me more time than it should to retrieve data from it.

Week Ending February 12, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending February 19, 2013

 * Task: Create a single transcript file that contains all the spoken word contents of the transcripts.

 Feb 16, 2013: I have modified my read-time script and found that once you remove the non-verbal lines you get about 170 hours of transcripts. The script now outputs both the total time for the transcripts and the total spoken time.
 * Results:

The new script:

```perl
#!/usr/bin/perl
# Matthew Henninger
# totaltime.pl
use strict;
use warnings;
use Cwd;
use Getopt::Long;

my $seconds         = 0;
my $spokenseconds   = 0;
my $filecount       = 0;
my $totallinecount  = 0;
my $spokenlinecount = 0;
my $round           = .6;
my $help            = 0;
my @filenames;

sub calculateSeconds {
    my ($start, $stop) = @_;
    return $stop - $start;
}

sub printTimeValues {
    my ($timeseconds, $roundto) = @_;
    my $roundstr = "%" . $roundto . "f";
    # conversions
    my $minutes = $timeseconds / 60;
    my $hours   = $minutes / 60;
    my $days    = $hours / 24;
    # print rounded numbers
    printf "Total time in Seconds: $roundstr\n", $timeseconds;
    printf "Total time in Minutes: $roundstr\n", $minutes;
    printf "Total time in Hours: $roundstr\n",   $hours;
    printf "Total time in Days: $roundstr\n",    $days;
}

my $cwd = getcwd;  # get current directory

# command-line options
my $result = GetOptions("round=f" => \$round, 'help|?' => \$help);
if (@ARGV) {
    # use the files named on the command line
    @filenames = @ARGV;
}
else {
    # no arguments: re-run with the default of all *.text files
    exec('./totaltime.pl *.text') or print STDERR "could not run ./totaltime.pl";
}

# sort filenames so they are in order
@filenames = sort @filenames;

# open all files
foreach (@filenames) {
    open(my $filebuffer, "<:encoding(UTF-8)", $_) or die "cannot open < $_: $!";
    while (my $row = <$filebuffer>) {
        chomp $row;
        $row =~ s/\s+$//;  # trim off trailing white space
        # split line by spaces
        my @splitrow = split(/ /, $row);
        # subtract the line's start timestamp from its stop timestamp
        $seconds += calculateSeconds($splitrow[1], $splitrow[2]);
        # count only spoken lines: skip lines whose sole content after the
        # header is a bracketed token such as [silence] or [noise]
        if ($row !~ /^.{38,42}\[\w+\]$/) {
            $spokenseconds += calculateSeconds($splitrow[1], $splitrow[2]);
            $spokenlinecount++;
        }
        $totallinecount++;
    }
    $filecount++;
}

print "file count is $filecount\n";
print "\nTotal line count is $totallinecount\n";
print "Total time:\n";
printTimeValues($seconds, $round);

print "\nSpoken line count is $spokenlinecount\n";
print "Total spoken time:\n";
printTimeValues($spokenseconds, $round);

__END__
```

I have also written a script that creates a single transcript file out of all of the transcripts. It reads every file into an array, removing any newline or trailing white space from each line and excluding any line that does not contain spoken text. The array is then sorted by transcript section number and, within a section, by the start time of each entry. Once this is done, the array is saved to the filename given as a command-line argument.

```perl
#!/usr/bin/perl
# Matthew Henninger
# combine_text.pl
use strict;
use warnings;
use Cwd;

my $dir = getcwd;  # get current directory
my $output_dir;
my @outputfile;

# sorts by file number, then by start time
sub file_sort {
    # get the file number and start time from each line
    # (the dot after the digits matches the A/B channel letter)
    $a =~ /^sw(\d+).-ms98-a-\d+\s(\d+.\d+)/;
    my ($filenumber_a, $starttime_a) = ($1, $2);
    $b =~ /^sw(\d+).-ms98-a-\d+\s(\d+.\d+)/;
    my ($filenumber_b, $starttime_b) = ($1, $2);
    # compare start times if the file numbers are equal
    return $starttime_a <=> $starttime_b if ($filenumber_a == $filenumber_b);
    # else compare file numbers
    return $filenumber_a <=> $filenumber_b;
}

if (@ARGV == 0) {
    print("Error: output filename needed\n");
    exit;
}
else {
    $output_dir = $ARGV[0];
}

# for each .text file in the current directory
for (glob("$dir/*.text")) {
    open(my $filebuffer, "<:encoding(UTF-8)", $_) or die "cannot open < $_: $!";
    while (my $row = <$filebuffer>) {
        chomp $row;        # remove new line
        $row =~ s/\s+$//;  # trim off trailing white space
        # exclude rows that only contain a value between brackets
        if ($row !~ /^.{38,42}\[\w+\]$/) {
            push(@outputfile, $row);
        }
    }
    close($filebuffer);  # close file
}

# sort the combined lines using the custom sort
@outputfile = sort file_sort @outputfile;

# write the sorted lines to the output file
open(my $fb, ">", $output_dir) or die "cannot save > $output_dir: $!";
for (@outputfile) {
    print $fb "$_\n";
}
close($fb);  # close file
```

Uploaded full transcript named full_transcript.text to /mnt/main/corpus/switchboard/full/train/trans on caesar.

 Feb 17, 2013: Looking at caesar I found that the time was wrong on the server; the NTP time service was not configured or running. I added north-america.pool.ntp.org to /etc/ntp.config and started the NTP daemon with "rcntp start". The time on caesar is now correct: it had previously been Sun Feb 17 04:23 EST 2013 instead of the actual time of Sun Feb 17 14:53 EST 2013. I will need to add the NTP service to startup. All other servers should be synced to caesar in the future.
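To make this survive a reboot, something along these lines should work. This is a sketch assuming caesar uses SUSE-style init scripts (which the rcntp command suggests); note that the standard config path is /etc/ntp.conf, so verify the filename on the server before running any of it:

```shell
# append the pool server to the NTP config
# (path may be /etc/ntp.conf rather than /etc/ntp.config -- check on caesar)
echo 'server north-america.pool.ntp.org' >> /etc/ntp.conf
rcntp start        # start the daemon now (SUSE init wrapper)
chkconfig ntp on   # register the service so it starts at boot
```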

 Feb 18, 2013: Talked with the rest of the group; they are working on getting the audio files organized. I informed them that I had talked to Prof. Jonas and he indicated that he did not want to have the files separated. I do not have space on my hard drive to store the audio files, so I cannot look at them on my home computer. I am still concerned that the transcripts have more hours than the 100 hours that Prof. Jonas quoted. I also looked at the genTrans.pl script I downloaded from caesar. It goes through a .text file and creates a transcript file as well as an utterance-id file. I am not sure that this script is entirely correct, judging from the regular expressions it contains. I will have to read the Sphinx website to see what data is needed in a transcript file. This script is also unnecessarily long; its size could be reduced quite a bit.


 * Plan: Use the script I have already written to combine all of the transcript text into one file. Will read the transcript files on caesar to see what files were generated in previous classes.


 * Concerns:

Week Ending February 26, 2013

 * Task: To confirm that we have at least 100 hours of transcripts to match the audio.


 * Results: I have written a script that compares the audio data with the written transcripts and creates a CSV file that can be opened in LibreOffice. Using LibreOffice I converted the CSV file into a spreadsheet, and with it I was able to determine that we actually have around 250 hours of audio and transcripts. The transcripts are separated into A and B files, each containing 295 hours, 8 minutes, and 48 seconds of data; this is because each represents a single channel of the same audio files. Our total audio length is 255 hours, 37 minutes, and 3 seconds. I edited the spreadsheet to show which transcript files did not have a matching audio file and found that 24 audio files were missing. I manually checked and confirmed that those files were indeed missing from our audio. I will have to make a new combined transcript file that excludes transcript files with no corresponding audio data.
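The comparison itself boils down to a join on conversation id between two duration tables. A toy sketch with made-up ids and durations (not the real corpus data; the filenames are illustrative):

```shell
# hypothetical per-conversation duration tables: "id seconds", sorted by id
printf 'sw02001 300.5\nsw02002 290.0\nsw02003 310.2\n' > trans_times.txt
printf 'sw02001 300.5\nsw02003 310.2\n' > audio_times.txt

# -a 1 keeps transcripts that have no audio line; -e MISSING fills the
# empty audio field; tr turns the space-separated output into CSV
join -a 1 -e MISSING -o 0,1.2,2.2 trans_times.txt audio_times.txt | tr ' ' ','
```

Rows whose third column reads MISSING are transcripts with no matching audio, which is exactly what the spreadsheet filter picked out.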


 * Plan: To create a Perl script that compares each transcript file with each audio file using SoX.


 * Concerns: Getting feedback from SoX in Perl is not the easiest thing in the world. SoX writes its output to STDERR instead of STDOUT, so you have to redirect it in your command-line invocation before you can capture the output.
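The redirect trick looks like this in the shell. Here sox_stat is a stand-in function that fakes the report from `sox file.wav -n stat` (which goes to stderr), so the sketch runs even on a box without sox installed:

```shell
# stand-in for `sox file.wav -n stat`, which writes its report to stderr
sox_stat() { printf 'Samples read:        480000\nLength (seconds):     60.000000\n' 1>&2; }

# 2>&1 merges stderr into stdout so the pipeline can read the report
duration=$(sox_stat 2>&1 | awk '/^Length/ {print $3}')
echo "$duration"   # 60.000000
```

In the Perl script the equivalent is backticks with the redirect inside the command string, e.g. `` `sox file.wav -n stat 2>&1` ``, since backticks only capture STDOUT.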

Week Ending March 5, 2013
Injured myself; too loopy on medication to work on code.
 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending March 12, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending March 26, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 2, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 9, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 16, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 23, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending April 30, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns:

Week Ending May 7, 2013

 * Task:


 * Results:


 * Plan:


 * Concerns: