Speech:GenUttAudio.pl

=Summary=
 * Title: genUttAudio.pl -- formerly createUtts.pl
 * Authors: James (primary), Jon (helping) -- Modeling Group SP16
 * Location: /mnt/main/scripts/user/
 * Usage: genUttAudio.pl /absolute/path/to/train.trans /absolute/path/to/directory you want the utts in/ /absolute/path/to/log directory/
 * Example: genUttAudio.pl /mnt/main/corpus/switchboard/256hr/train/trans/train.trans /mnt/main/corpus/switchboard/256hr_new/train/audio/utt/ /mnt/main/corpus/switchboard/256hr_new/info/logs/

IMPORTANT: Don't forget the final forward slashes on the second and third arguments.

Possible Improvement: Look into using soxi -D audioFile to determine the length of an audio file in seconds instead of calculating it manually. This could possibly make the script more flexible, because I'm assuming that the soxi -D command reads all of the information found in the audio file's header data.

Link: soxi -- Found the info here: Jared_Rohrdanz_Log

UPDATE (3/30/16): The above improvement has been implemented.

Reference: Header Info/Converting Info, found in Forrest_Surprenant_Log
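The trailing slashes matter because the script joins each directory argument directly to a file name. A minimal sketch of how a directory argument could be normalized so that a missing slash is harmless is shown here. The helper name `with_trailing_slash` is hypothetical and is not part of genUttAudio.pl.

```perl
use strict;
use warnings;

# Hypothetical helper (not in genUttAudio.pl): make sure a directory
# argument ends with exactly one forward slash.
sub with_trailing_slash {
    my ($dir) = @_;
    $dir =~ s{/*\z}{/};   # replace any run of trailing slashes (or none) with one
    return $dir;
}

print with_trailing_slash("/tmp/utt"), "\n";    # prints /tmp/utt/
print with_trailing_slash("/tmp/utt/"), "\n";   # prints /tmp/utt/
```

Note that an empty string would become "/", so a real script would likely still want to reject empty arguments.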

=Description=
This script takes in a transcript file and, from it, generates audio utterance files in the specified /utt/ directory. It also generates four logs in the specified /logs/ directory.

DEBUGGING: To determine whether any of the per-utterance log files (trans.log, utt.log, conv.log) has "bad" data on a given line, simply grep for WARNING; the bad lines will be output to the terminal.
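On the command line, `grep WARNING trans.log` is the direct way to do this. The same check can be sketched in Perl, for example if the flagged lines need further processing. The helper name `warning_lines` and the sample lines below are made up for illustration; only the trailing WARNING marker comes from the script.

```perl
use strict;
use warnings;

# Return the log lines flagged by the script, i.e. those containing WARNING.
sub warning_lines {
    my @lines = @_;
    return grep { /\bWARNING\b/ } @lines;
}

# Made-up sample lines in the trans.log layout: name, start, end, duration.
my @sample = (
    "sw3041A-ms98-a-0002 1.5 4.0 2.5",
    "sw3041B-ms98-a-0003 9.0 8.0 -1 WARNING",
);

print "$_\n" for warning_lines(@sample);
```

To run it against a real log, the lines could be read from a file handle opened on trans.log, utt.log, or conv.log instead of the sample array.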

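The following logs are generated:
 * trans.log
 * utt.log
 * conv.log
 * corpus.log

Content & Format of the per-utterance logs (one line per utterance, as written by the code below; WARNING is appended to bad lines):
 * trans.log: utterance name, start time, end time, duration
 * utt.log: utterance name, expected duration, actual duration, difference
 * conv.log: utterance name, utterance end time, conversation file name, conversation duration, difference

Content & Format of corpus.log:
 Date:
 Generated utt sph files for:
 Used transcript:
 Created following logs:
 ---trans.log
 ---utt.log
 ---conv.log
 Finished processing, file count:
 Difference between file count and utterances in transcript:
 Time of total audio in hours:

NOTE: For the second to last line:
 * if negative: fewer sphinx audio utterances than utterances in the transcript file
 * if positive: more sphinx audio utterances than utterances in the transcript file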
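The "Time of total audio in hours" value in corpus.log is computed by the script with an awk one-liner that sums the third field of utt.log (the actual utterance duration in seconds) and divides by 3600. A rough Perl equivalent is sketched here. The function name `total_hours` and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Sum the third whitespace-separated field of each line (seconds)
# and convert the total to hours.
sub total_hours {
    my @lines = @_;
    my $total = 0;
    for my $line (@lines) {
        my @fields = split ' ', $line;
        $total += $fields[2] if defined $fields[2];
    }
    return $total / 3600;
}

# Made-up utt.log lines: name, expected duration, actual duration, difference.
my @sample = (
    "uttA 1800.00 1800.00 0",
    "uttB 1800.00 1800.00 0",
);

printf "%.2f\n", total_hours(@sample);   # prints 1.00
```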

=Code=
 #!/usr/bin/perl

 # Authors: James S. (primary), Jon S. (helping) -- Modeling Group SP16

 # -Description
 # Takes in a transcript file (i.e. train.trans) and generates audio utterances from
 # the conversation audio files in /mnt/main/corpus/switchboard/dist/flat
 # Usage: genUttAudio.pl /absolute/path/to/train.trans /absolute/path
 # /to/directory you want the utts in/ /absolute/path/to/log directory/

 # -Pseudocode-
 # Get arguments
 # Open file
 # Start loop
 # Successively read each line
 # Throw full file name into variable
 # Throw a formatted file name with a 0 after the w and the letter taken off the end
 # (i.e. sw02345 instead of sw2345A) into a variable
 # Throw the start time into a variable
 # Throw the end time into a variable
 # Throw the diff between end time and start time into a variable
 # Get the channel
 # Use sox command like so: sox filein fileout trim start duration remix (1 or 2,
 # depending on the channel)
 # Log trans, utt, and conv data
 # End loop
 # Close file
 # Log corpus data

 # -Start of code--

 use POSIX; # Get timezone

 # Check for valid number of arguments
 if (@ARGV != 3)
 {
     die("Three arguments are necessary to run this script!");
 }

 # Get arguments
 $trainFile = $ARGV[0];
 $targetDirectory = $ARGV[1];
 $logDirectory = $ARGV[2];

 # Open transcript file for reading
 open FIN, "<", $trainFile;

 # Process each entry in the transcript file and create a corresponding utterance audio file
 while (my $entry = <FIN>) {
     # Fill array with items in entry (i.e. file name, start time, etc.)
     my @entryItems = split ' ', $entry;

     # Copy full file name (i.e. sw3041A-ms98-a-0002)
     my $fullFileName = $entryItems[0];

     # Creating a formatted file name to find in the flat directory
     my $part1FileName = substr $fullFileName, 0, 2; # sw -- Using the example full file name above
     my $part2FileName = substr $fullFileName, 2, 4; # 3041
     my $formattedFileName = $part1FileName . "0" . $part2FileName; # sw03041

     # Get start and end times and get the duration (the diff)
     my $startTime = $entryItems[1];
     my $endTime = $entryItems[2];
     my $duration = $endTime - $startTime;

     # Get channel
     my $channel = substr $fullFileName, 6, 1; # A or B

     # Use the sox command to create an utterance audio file given the current entry in the transcript
     if ($channel eq "A" || $channel eq "a") # Use channel 1 a.k.a. speaker A
     {
         $soxCmd = "sox /mnt/main/corpus/switchboard/dist/flat/" . $formattedFileName
             . ".sph --bits 16 --encoding signed-integer -4 " . $targetDirectory . $fullFileName
             . ".sph trim " . $startTime . " " . $duration . " remix 1";
         system($soxCmd);
     }
     else # Use channel 2 a.k.a. speaker B
     {
         $soxCmd = "sox /mnt/main/corpus/switchboard/dist/flat/" . $formattedFileName
             . ".sph --bits 16 --encoding signed-integer -4 " . $targetDirectory . $fullFileName
             . ".sph trim " . $startTime . " " . $duration . " remix 2";
         system($soxCmd);
     }

     # Log trans data -- if duration is zero or negative, bad
     if ($duration > 0) # Good
     {
         $transLogData = $fullFileName . "\t" . $startTime . "\t" . $endTime . "\t" . $duration;
     }
     else # Bad
     {
         $transLogData = $fullFileName . "\t" . $startTime . "\t" . $endTime . "\t" . $duration . "\tWARNING";
     }
     $transLogCmd = "echo " . $transLogData . " >> " . $logDirectory . "trans.log";
     system($transLogCmd);

     # Log utt data
     $expUttDuration = $duration;
     $actUttDuration = `soxi -D $targetDirectory$fullFileName.sph`; # In seconds
     $actUttDuration = substr $actUttDuration, 0, length($actUttDuration) - 1; # Strip trailing newline
     $uttDurationDiff = $actUttDuration - $expUttDuration; # If not close in value, bad
     if (abs($uttDurationDiff) < 0.01) # Good
     {
         $uttLogData = $fullFileName . "\t" . $expUttDuration . "\t" . $actUttDuration . "\t" . $uttDurationDiff;
     }
     else # Bad
     {
         $uttLogData = $fullFileName . "\t" . $expUttDuration . "\t" . $actUttDuration . "\t" . $uttDurationDiff . "\tWARNING";
     }
     $uttLogCmd = "echo " . $uttLogData . " >> " . $logDirectory . "utt.log";
     system($uttLogCmd);

     # Log conv data
     $uttEndTime = $endTime;
     $convDuration = `soxi -D /mnt/main/corpus/switchboard/dist/flat/$formattedFileName.sph`;
     $convDuration = substr $convDuration, 0, length($convDuration) - 1; # Strip trailing newline
     $convUttDiff = $convDuration - $uttEndTime; # If not positive or 0, bad
     if ($convUttDiff >= 0) # Good
     {
         $convLogData = $fullFileName . "\t" . $uttEndTime . "\t" . $formattedFileName . "\t" . $convDuration . "\t" . $convUttDiff;
     }
     else # Bad
     {
         $convLogData = $fullFileName . "\t" . $uttEndTime . "\t" . $formattedFileName . "\t" . $convDuration . "\t" . $convUttDiff . "\tWARNING";
     }
     $convLogCmd = "echo " . $convLogData . " >> " . $logDirectory . "conv.log";
     system($convLogCmd);

     # Give something for the user to see
     print $soxCmd . "\n";
 }

 close FIN;

 # Log corpus data
 $date = localtime;
 $date = $date . " " . strftime("%Z", localtime); # Appending the timezone
 $totalUttCount = `wc -l $trainFile`;
 $totalUttCount = substr $totalUttCount, 0, length($totalUttCount) - 1;
 $fileCount = `ls $targetDirectory | wc -l`;
 $fileCount = substr $fileCount, 0, length($fileCount) - 1;
 # Should be 0, the utterances in the transcript should be the same as the
 # number of generated audio utterance files
 $diffUttCount = $fileCount - $totalUttCount;
 $uttLogPath = $logDirectory . "utt.log";
 $totalAudioTime = `awk '{total += \$3} END {print total / 3600}' $uttLogPath`;
 $totalAudioTime = substr $totalAudioTime, 0, length($totalAudioTime) - 1;

 @path = split "/", $trainFile;
 $last = @path - 3; # (i.e. train from /corpusName/train/trans/train.trans)

 $corpus = "/";
 for (my $i = 1; $i < $last; $i++) # Creates a path that ends with the corpus name
 {
     if ($i == $last - 1) # Don't include a "/" at the very end
     {
         $corpus = $corpus . $path[$i];
     }
     else
     {
         $corpus = $corpus . $path[$i] . "/";
     }
 }

 $usedTranscript = ".../";
 for (my $i = $last - 1; $i < @path; $i++) # Creates a path that starts with the corpus name
 {
     if ($i == @path - 1)
     {
         $usedTranscript = $usedTranscript . $path[$i];
     }
     else
     {
         $usedTranscript = $usedTranscript . $path[$i] . "/";
     }
 }

 @corpusLogData = ();
 $corpusLogData[0] = "Date: $date";
 $corpusLogData[1] = "Generated utt sph files for: " . $corpus;
 $corpusLogData[2] = "Used transcript: " . $usedTranscript;
 $corpusLogData[3] = "Created following logs:";
 $corpusLogData[4] = "---trans.log";
 $corpusLogData[5] = "---utt.log";
 $corpusLogData[6] = "---conv.log";
 $corpusLogData[7] = "Finished processing, file count: $fileCount";
 $corpusLogData[8] = "Difference between file count and utterances in transcript: $diffUttCount";
 $corpusLogData[9] = "Time of total audio in hours: $totalAudioTime";
 for (my $i = 0; $i < @corpusLogData; $i++)
 {
     $corpusLogCmd = "echo " . $corpusLogData[$i] . " >> " . $logDirectory . "corpus.log";
     system($corpusLogCmd);
 }

 # -End of code
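The per-entry steps of the script (split the transcript line, derive the conversation file name and channel, compute the duration, assemble the sox call) can be condensed into one function. This is only a sketch and not the script's actual structure: the helper name `build_sox_args`, the regex-based name parsing, and the sample directories are assumptions made for illustration. The option list mirrors the sox command in the script. Returning a list rather than a single string lets the caller use the list form of `system`, which bypasses the shell.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one transcript entry into the argument list for sox.
# $flat_dir and $target_dir are expected to end with a forward slash.
sub build_sox_args {
    my ($entry, $flat_dir, $target_dir) = @_;
    my ($utt, $start, $end) = split ' ', $entry;

    # sw3041A-ms98-a-0002 -> prefix "sw", conversation "3041", channel "A"
    my ($prefix, $conv, $channel) = $utt =~ /^(sw)(\d{4})([ABab])/
        or die "Unexpected utterance name: $utt\n";

    my $duration = $end - $start;
    my $remix    = uc($channel) eq 'A' ? 1 : 2;   # speaker A is channel 1, speaker B is channel 2

    return ('sox', "$flat_dir${prefix}0$conv.sph",
            '--bits', 16, '--encoding', 'signed-integer', '-4',
            "$target_dir$utt.sph",
            'trim', $start, $duration, 'remix', $remix);
}

# Made-up transcript entry and directories, for illustration only.
my @args = build_sox_args("sw3041A-ms98-a-0002 1.5 4.0 hello there", "/flat/", "/utt/");
print join(' ', @args), "\n";
```

A caller could then run `system(@args) == 0 or warn "sox failed for $args[-6]\n";` where `$args[-6]` is the output file in this layout.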