Speech:GenTrans0

Summary
Title: genTrans.pl

Author: unknown

Location: mnt/main/scripts/user/genTrans.pl; also in mnt/main/corpus/scripts/genTRans.pl

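Usage: genTrans.pl takes two arguments: the unedited transcript file and a prefix for the output files. The script writes the transcripts to prefix_train.trans and the utterance IDs to prefix_train.fileids.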

Description

 * GenTrans Perl Script - original version

This version of the GenTrans script strips the header (utterance ID, start time, and stop time) from each transcript line and leaves the <s> marker at the start. It also converts all characters to uppercase, deletes every ", [, ], {, and } that it finds, and replaces each - and / with a space. This is done with sed-style substitution expressions (Perl's s/// operator). The substitutions are applied to every line of the input, and the script leaves the </s> marker to show the end of the line.
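The cleanup described above can be sketched on a single message. This is a minimal illustration, not part of genTrans.pl: the helper name clean_message and the sample text are made up here, while the substitutions are the ones the script applies, in the same order.

```perl
use strict;
use warnings;

# Hypothetical helper (not in genTrans.pl): applies the same cleanup
# substitutions the script runs on each message, in the same order.
sub clean_message {
    my ($message) = @_;
    $message =~ s/"//g;      # drop double quotes
    $message =~ s/\[//g;     # drop opening square brackets
    $message =~ s/\]//g;     # drop closing square brackets
    $message =~ s/-/ /g;     # a hyphen becomes a space
    $message =~ s{/}{ }g;    # a slash becomes a space
    $message =~ s/\{//g;     # drop opening braces
    $message =~ s/\}//g;     # drop closing braces
    $message =~ s/_1//g;     # drop the "_1" suffix
    return uc $message;      # uppercase the result
}

# Made-up sample message, for illustration only.
my $cleaned = clean_message('uh-huh [laughter] {yuppie} types');
print "$cleaned\n";   # prints UH HUH LAUGHTER YUPPIE TYPES
```

Note that only the bracket characters are removed; the word inside them (LAUGHTER here) stays in the transcript.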

The line

    #$message =~ s/\[.*\]//g;

is commented out of the script because it does not work correctly. It is meant to remove anything it finds inside [ ], which is what we want the script to do. However, .* is greedy and matches as much as it can, so if a line contains more than one bracketed word, e.g. [LAUGHTER] words words words words [NOISE], the expression deletes everything from the first [ in [LAUGHTER] through the last ] in [NOISE]. This is not what we want, since it loses data that we cannot afford to lose. The line is left in as a comment so that future classes can see that this expression was tried and failed, and can improve on it.
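The difference can be seen on the example line from the paragraph above. The second pattern is an assumed alternative, not something the script uses and not tested against the full corpus: it uses a negated character class so that a match cannot run past a closing bracket.

```perl
use strict;
use warnings;

my $line = '[LAUGHTER] words words words words [NOISE]';

# Greedy: .* runs from the first [ to the last ] on the line,
# so the words between the two tags are deleted as well.
(my $greedy = $line) =~ s/\[.*\]//g;

# Assumed alternative: [^\]]* cannot cross a closing bracket,
# so each bracketed tag is removed on its own.
(my $per_tag = $line) =~ s/\[[^\]]*\]//g;

print "greedy:  '$greedy'\n";    # prints greedy:  ''
print "per tag: '$per_tag'\n";   # prints per tag: ' words words words words '
```

The non-greedy form s/\[.*?\]//g gives the same result on this line. Either version leaves the spaces that surrounded the removed tags, so a later whitespace cleanup may still be wanted.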

Code
    #!/usr/bin/perl

    if ($#ARGV != 1) {
        print "usage: genTrans.pl  \n";
        exit -1;
    }

    $trans_unedited = $ARGV[0];
    $trans_prefix = $ARGV[1];
    $train_trans = "$trans_prefix"."_train.trans";
    $train_fileids = "$trans_prefix"."_train.fileids";

    # remove output files left over from a previous run
    system ("rm $train_trans");
    system ("rm $train_fileids");

    print "processing.";

    open(MYINPUTFILE, "<$trans_unedited") || die("can't open file: $!");
    open(MYOUTPUTFILE, ">>$train_trans");
    open(MYIDFILE, ">>$train_fileids");

    while(<MYINPUTFILE>)    # read in file line by line
    {
        print ".";
        $line = $_;
        chomp $line;

        $utteranceID = $line;       # copy line to new variable
        ### $utteranceID =~ s/sw[0-9]*//;   # remove all characters prior to the speaker identification
        $utteranceID =~ s/ .*//;    # remove all characters after the speaker and utteranceID, this pulls out the utterance ID

        #get sph name
        $sphName = $line;
        $sphName =~ m/sw[0-9]*/;    #match to substring sw0...?
        $sphName = $&;              #grab match
        $sphName =~ s/^sw/sw0/;     #replace instance of sw with sw0
        $sphName = $sphName. ".sph";

        $start = $line;             # copy line to new variable
        $start =~ s/sw[0-9]*[A-B]-ms98-a-[0-9]* //;    # remove all characters up to and including the first whitespace
        $start =~ s/ .*//;          # remove everything after the whitespace, this pulls out start time

        $stop = $line;              # copy line to new variable
        $stop =~ s/sw[0-9]*[A-B]-ms98-a-[0-9]* \d+\.(\d+) //;    # remove the utterance ID and the start time
        $stop =~ s/ .*//;           # substitute a blank for everything after the whitespace, this pulls out stop time

        $duration = $stop - $start;

        $message = $line;           # copy line to new variable
        $message =~ s/sw[0-9]*[A-B]-ms98-a-[0-9]* [0-9]*.[0-9]* [0-9]*.[0-9]* //;    # remove everything before the message
        $message =~ s/\"//g;
        #$message =~ s/\[.*\]//g;
        $message =~ s/\[//g;
        $message =~ s/\]//g;
        $message =~ s/\-/ /g;
        $message =~ s/\// /g;
        $message =~ s/\{//g;
        $message =~ s/\}//g;
        $message =~ s/\_1//g;
        $message =  uc $message;

        #make me some sph files
        $sysCmd = "sox -U ../wavTemp/" . $sphName . " -a ../wavTemp/" . $sphName . ".wav trim " . $start . " " . $duration;
        system($sysCmd);

        $sysCmd = "sox ../wavTemp/" . $sphName . ".wav -r 8k -c 1 -s ../wavTemp/" . $utteranceID . ".wav";
        system($sysCmd);

        $sysCmd = "sox ../wavTemp/" . $sphName . ".wav ../wav/" . $utteranceID . ".sph";
        system($sysCmd);

        $newTranscript = "<s> $message </s> ($utteranceID)";
        print MYOUTPUTFILE "$newTranscript\n";    # send transcript to new file
        print MYIDFILE "$utteranceID\n";
    }

    close(MYINPUTFILE);
    close(MYOUTPUTFILE);
    close(MYIDFILE);
    print "done\n";
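The script pulls the utterance ID, start time, stop time, and message out of each line with a separate chain of substitutions per field. As a sketch of one possible simplification, the same fields can be captured with a single match. The helper parse_line, its field names, and the sample line are hypothetical; the line layout is only inferred from the patterns in the script, so treat this as an assumption rather than a description of the corpus format.

```perl
use strict;
use warnings;

# Hypothetical helper (not in genTrans.pl): split one transcript line into
# its fields with a single match instead of repeated substitutions.
sub parse_line {
    my ($line) = @_;
    my ($utt_id, $start, $stop, $message) =
        $line =~ /^(sw\d+[AB]-ms98-a-\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(.*)$/
        or return;    # line does not have the expected layout

    # As in the script, the .sph name is the conversation ID with the
    # leading "sw" changed to "sw0".
    (my $sph_name = $utt_id) =~ s/^sw(\d+).*/sw0$1.sph/;

    return {
        utt_id   => $utt_id,
        sph_name => $sph_name,
        start    => $start,
        duration => $stop - $start,
        message  => $message,
    };
}

# Made-up sample line in the layout the script's patterns expect.
my $rec = parse_line('sw2001A-ms98-a-0002 0.50 2.75 hi um yeah');
printf "%s %s %.2f %s\n", @{$rec}{qw(utt_id sph_name duration message)};
```

A line that does not match returns nothing, which lets the caller skip or report it instead of passing empty values on to sox.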