Speech:Summer 2012 genTrans



GenTrans Perl Script
This is a modified version of the genTrans.pl script. Documentation for the original can be found here.

The script performs the same functions as the original: it takes the unedited transcript and formats it for use in training, then cuts the relevant sph files into new sph files that match the utterance entries in the transcript.
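As a rough illustration of the formatting step, the sketch below splits one transcript line into the pieces the script works with. It assumes the line layout implied by the regular expressions in the source code further down (utterance ID, start time, stop time, message). The helper name parse_trans_line and the sample line are hypothetical and are not part of genTrans.pl.

```perl
use strict;
use warnings;

# Split one unedited transcript line into its parts.
# Assumed layout: "<utteranceID> <start> <stop> <message>".
# Returns an empty list when the line does not match that layout.
sub parse_trans_line {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^(sw(\d+)[AB]-ms98-a-\d+)\s+(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(.*)$/) {
        return (
            utterance_id => $1,
            sph_name     => "sw0$2.sph",   # same sw -> sw0 renaming the script applies
            start        => $3,
            duration     => $4 - $3,       # sox trim takes a start and a length
            message      => $5,
        );
    }
    return;
}

# Hypothetical sample line, for illustration only.
my %utt = parse_trans_line("sw2001A-ms98-a-0001 0.000000 0.977625 hi um yeah");
print "$utt{utterance_id} -> $utt{sph_name}, duration $utt{duration}\n";
```

Doing the split with a single anchored match, rather than a chain of substitutions, lets malformed lines be detected and skipped instead of producing empty fields.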

The changes made are as follows:
 * The required input changed. The script now asks the user for the path to the corpus directory that contains the transcript. Each corpus directory contains a trans and a wav directory.
 * The script pulls the transcript from the trans directory.
 * The script uses the sph files in the wav directory.
 * It no longer needs to copy the files to a wavTemp directory. All files are read directly from the source directory, and the output files are saved directly to the wav directory in the current experiment folder.
 * Removed one unused sox statement to speed up processing.
 * Rather than creating a temp copy of each sph file being converted, the script uses a single temp file. It is overwritten with each conversion, so it takes up little disk space.
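The directory layout described above can be sketched as a small helper that derives every path from the two command-line arguments. This is only an illustration: gen_trans_paths is a hypothetical name, and the trailing-slash handling is an addition that the script itself does not perform. The file names match those used in the source code below.

```perl
use strict;
use warnings;

# Build every path the script touches from its two arguments.
# gen_trans_paths is a hypothetical helper, not part of genTrans.pl.
sub gen_trans_paths {
    my ($corpus_dir, $exp_id) = @_;
    $corpus_dir =~ s{/+$}{};    # assumption: tolerate a trailing slash
    return (
        trans_in  => "$corpus_dir/trans/train.trans",    # unedited transcript
        wav_in    => "$corpus_dir/wav",                  # source sph files
        trans_out => "etc/${exp_id}_train.trans",        # formatted transcript
        ids_out   => "etc/${exp_id}_train.fileids",      # list of utterance IDs
        temp_wav  => "wav/temp.wav",                     # single reused temp file
    );
}

my %paths = gen_trans_paths("/mnt/main/corpus/switchboard/tiny/train", "0011");
print "$_ => $paths{$_}\n" for sort keys %paths;
```

The output paths are relative, which is why the script has to be run from the top-level experiment directory.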

Source Code

#!/usr/bin/perl

if ($#ARGV != 1) {
    print "usage: genTrans.pl  \n" . " Example: /mnt/main/corpus/switchboard/tiny/train 0011\n";
    print "should be executed from the top level experiment directory ex: /mnt/main/Exp/0011\n";
    exit -1;
}

$corpus_dir = $ARGV[0];      # set corpus directory

$trans_prefix = $ARGV[1];    # prefix is the exp_id

$trans_unedited = $corpus_dir . "/trans/train.trans";    # append the path to the trans file based on the corpus dir provided

# set the output file names
$train_trans = "etc/" . $trans_prefix . "_train.trans";
$train_fileids = "etc/" . $trans_prefix . "_train.fileids";


#system ("rm $train_trans");
#system ("rm $train_fileids");

print "processing.";

open(MYINPUTFILE, "<$trans_unedited") || die("can't open file: $!");
open(MYOUTPUTFILE, ">>$train_trans") || die("can't open file: $!");
open(MYIDFILE, ">>$train_fileids") || die("can't open file: $!");

while (<MYINPUTFILE>)    # read in file line by line
{
    print ".";

    $line = $_;
    chomp $line;

    $utteranceID = $line;       # copy line to new variable
    $utteranceID =~ s/ .*//;    # remove all characters after the speaker and utteranceID, this pulls out the utterance ID

    # get sph name
    $sphName = $line;
    $sphName =~ m/sw[0-9]*/;    # match to substring sw0...?
    $sphName = $&;              # grab match
    $sphName =~ s/^sw/sw0/;     # replace instance of sw with sw0
    $sphName = $sphName . ".sph";

    $start = $line;                                # copy line to new variable
    $start =~ s/sw[0-9]*[A-B]-ms98-a-[0-9]* //;    # remove all characters up to and including the first whitespace
    $start =~ s/ .*//;                             # remove everything after the whitespace, this pulls out start time

    $stop = $line;                                            # copy line to new variable
    $stop =~ s/sw[0-9]*[A-B]-ms98-a-[0-9]* \d+\.(\d+) //;     # remove the utterance ID and the start time, up to and including the second whitespace
    $stop =~ s/ .*//;                                         # remove everything after the whitespace, this pulls out stop time

    $duration = $stop - $start;

    $message = $line;                                                             # copy line to new variable
    $message =~ s/sw[0-9]*[A-B]-ms98-a-[0-9]* [0-9]*\.[0-9]* [0-9]*\.[0-9]* //;   # remove everything before the message
    $message =~ s/\"//g;
    $message =~ s/\[noise] //g;
    $message =~ s/\[//g;
    $message =~ s/\]//g;
    $message =~ s/\-/ /g;
    $message =~ s/\// /g;
    $message =~ s/\{//g;
    $message =~ s/\}//g;
    $message =~ s/\_1//g;
    $message = uc $message;

    # make me some sph files
    $sysCmd = "sox -U " . $corpus_dir . "/wav/" . $sphName . " -a wav/temp.wav trim " . $start . " " . $duration;
    system($sysCmd);

    $sysCmd = "sox wav/temp.wav wav/" . $utteranceID . ".sph";
    system($sysCmd);

    $newTranscript = " $message ($utteranceID)";
    print MYOUTPUTFILE "$newTranscript\n";    # send transcript to new file

    print MYIDFILE "$utteranceID\n";
}

close(MYINPUTFILE);
close(MYOUTPUTFILE);
close(MYIDFILE);
print "done\n";
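The chain of substitutions that cleans up the message text could be gathered into one function, which makes the rules easier to test on their own. The following is a minimal sketch under that assumption. normalize_message is a hypothetical name, the sample input is invented for illustration, and the rules are meant to mirror those in the script above rather than replace them.

```perl
use strict;
use warnings;

# Apply the transcript clean-up rules used in the script to one message.
# normalize_message is a hypothetical helper, not part of genTrans.pl.
sub normalize_message {
    my ($message) = @_;
    $message =~ s/\[noise\] //g;    # drop noise markers with their trailing space
    $message =~ s/_1//g;            # drop the "_1" suffix
    $message =~ tr/"[]{}//d;        # delete quotes, square brackets, and braces
    $message =~ s{[-/]}{ }g;        # hyphens and slashes become spaces
    return uc $message;             # transcripts are written in upper case
}

# Invented sample input, for illustration only.
print normalize_message('[noise] uh-huh {yeah} okay_1'), "\n";
```

Keeping the rules in one place also makes it simpler to add or drop a marker later without touching the main loop.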