Speech:Spring 2013 updateDict.pl

Summary
updateDict.pl is a Perl script designed to:
 * Merge a large dictionary file with a large set of dictionary additions stored in a separate file.
 * Perform error-checking on both the dictionary and the additions file. It looks for:
    * Words without an associated pronunciation.
    * Redundant word entries.
 * Update an existing word entry in the dictionary when the additions file contains the same word with an updated pronunciation.
 * Create a backup of the original dictionary file before integrating both files.
 * Sort the newly created dictionary file alphabetically.

It is designed not only to save time when updating dictionaries, but also to ensure that the resulting dictionary is in a format that Sphinx will accept. This is more than the similar built-in Unix utilities can offer.
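
The merge-and-sort idea summarized above can be illustrated with a minimal sketch. It is not the script's own method: it swaps the script's nested loops for a hash keyed by word, always lets the additions win (roughly what answering "y" to every prompt would do), and uses made-up sample entries. The helper name merge_entries is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of updateDict.pl: merge two lists of
# "WORD PHONE PHONE ..." lines. Later lines replace earlier ones that
# share the same word, so additions override dictionary entries.
sub merge_entries {
    my ($dict_ref, $add_ref) = @_;
    my %merged;
    for my $line (@$dict_ref, @$add_ref) {
        my ($word, @phones) = split / +/, $line;
        next unless defined $word && @phones;   # skip entries with no pronunciation
        $merged{$word} = join ' ', $word, @phones;
    }
    return sort values %merged;                 # alphabetical order
}

# Sample data invented for illustration only.
my @dict      = ('CAT K AE T', 'DOG D AO G');
my @additions = ('BIRD B ER D', 'DOG D AA G', 'EMPTY');
my @result    = merge_entries(\@dict, \@additions);
print "$_\n" for @result;
```

With this sample data, the DOG entry from the additions replaces the dictionary's entry, and EMPTY is skipped because it has no pronunciation.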

Usage: updateDict.pl -m -

Description: This utility takes an experiment dictionary file and a text file containing a list of additions, and automatically merges the two by inserting the entries from the additions file into the dictionary at the proper alphabetical location. The additions text file must be written in the same format as the dictionary, with a single word on each line. Alphabetical order does NOT matter. If a word with the same spelling but a different pronunciation appears in both the additions file and the dictionary, the application asks whether you wish to overwrite the entry in the experiment dictionary; duplicate words with the same pronunciation are ignored.

Options: The following options are available in this version:
 * -m : Merge. This parameter is REQUIRED; running the tool without it prints the help text and exits. This is intentional, because the order in which you pass the filenames is important: the dictionary file comes first, then the additions file.
 * -v : Verbose. Print out everything that is being done. Useful for debugging and for seeing what the script is doing at each step.
 * -f : Force merge. All pronunciations in the additions file take precedence over the dictionary's pronunciations, and the script does not ask for confirmation. CAUTION: This option may have unintended side effects if you aren't careful!
 * -h : Help. Print the help text. Any other options are ignored.
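
As a rough illustration of how these flags reach the script, the sketch below parses a simulated command line with Getopt::Std, the module the script itself uses. The filenames experiment.dic and additions.txt are placeholders, not names taken from the project.

```perl
use strict;
use warnings;
use Getopt::Std;

# Simulated command line; both filenames are placeholders.
@ARGV = ('-m', '-v', 'experiment.dic', 'additions.txt');

my %flags;
getopts('vfhm', \%flags) or die "Unknown option\n";

# getopts removes the switches, leaving only the filenames in @ARGV.
my ($dict_file, $additions_file) = @ARGV;

print "merge:      ", ($flags{m} ? "yes" : "no"), "\n";
print "verbose:    ", ($flags{v} ? "yes" : "no"), "\n";
print "force:      ", ($flags{f} ? "yes" : "no"), "\n";
print "dictionary: $dict_file\n";
print "additions:  $additions_file\n";
```

Because the switches are stripped before the filenames are read, the dictionary is always the first remaining argument and the additions file the second.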

Source code
#!/usr/bin/perl
#   Title: updateDict.pl
#   Author: Eric Beikman
#   Date : Feb 22, 2013
#   Usage: updateDict.pl -m -
#   Description: This utility will take an experiment dictionary file, and a
#   textfile containing list of additions, automatically merging the two by
#   taking the entries in the additions inserting them into the dictionary
#   based on the proper alphabetic location. The additions textfile must be
#   written in the same format as the dictionary, with a single word on each
#   line. Alphabetical order does NOT matter. If there is a word with the same
#   spelling but different pronunciations in both the addition file and the
#   dictionary, the application will prompt you if you wish to over-write the
#   experiment dictionary; duplicate words with the same pronunciation will be
#   ignored.

use Getopt::Std; #Simplifies CMD line options.
use File::Copy; #Better file tools.

$help = "updateDict.pl
Usage: updateDict.pl -m -
Description: This utility will take an experiment dictionary file, and a
textfile containing a list of additions, automatically merging the two by
taking the entries in the additions and inserting them into the dictionary
based on the proper alphabetic location. The additions textfile must be
written in the same format as the dictionary, with a single word on each
line. Alphabetical order does NOT matter. If there is a word with the same
spelling but different pronunciations in both the addition file and the
dictionary, the application will prompt you if you wish to over-write the
experiment dictionary; duplicate words with the same pronunciation will be
ignored.
Options: The following options are in this version:
 -m : Merge. This parameter is REQUIRED. Meaning that trying to use this tool
      without it will bring you to this screen. This is intentional, the order
      in which you pass the filenames is important!
 -v : Verbose. Print out everything that is being done. Useful for debugging
      and to generally see what the script is thinking.
 -f : Force merge. This option specifies that all pronunciations in the
      additions file take precedence over the dictionary's pronunciation and
      thus won't ask you for confirmation to do so. CAUTION: This option may
      have unintended side effects if you aren't careful!
 -h : Help. Print this page. Will ignore any other options.
";
@addition = (); #Used for storing the additions file.
@dict = (); #Used for storing the dictionary.

getopts('vfhm', \%flags); #Get option Flags

if($ARGV[0] eq 'h' || $ARGV[0] eq '?' || $flags{h}){
    die "$help";
}
if($ARGV[0] eq '' || $ARGV[1] eq ''){ #Both filenames must be present.
    die "Error: Insufficient and/or unknown arguments:\n $help";
}
if(!$flags{m}){
    die "Missing -m, read the manual! The order in which you pass the filenames is important!:\n $help";
}

open(DICT, "<", $ARGV[0]) #Opens file for read.
    or die "Error: Cannot open dictionary: \"$ARGV[0]\" for reading!\nCheck the path/filename and try again.\n";
open(ADDITION, "<", $ARGV[1]) #Opens file for read.
    or die "Error: Cannot open additions file: \"$ARGV[1]\" for reading!\nCheck the path/filename and try again.\n";

#Load and sort the list of additions:
print "Loading Additions...\n" if $flags{v};
while($line = <ADDITION>){
    chomp($line); #Removes the trailing newline.
    push(@addition, $line) if $line ne ''; #Adds an addition, skipping blank lines.
    print "$line\n" if $flags{v};
}
print "Done Loading. Now Sorting Additions...\n" if $flags{v};
@addition = sort(@addition); #Sort the additions.

#Load the dictionary (it is sorted after the merge):
print "Loading Dictionary...\n" if $flags{v};
while($line = <DICT>){
    chomp($line); #Removes the trailing newline.
    push(@dict, $line) if $line ne ''; #Adds a dictionary entry, skipping blank lines.
    print "$line\n" if $flags{v};
}
print "Done Loading Dict.\n Now checking for redundant word entries...\n" if $flags{v};

#Checking for duplicates
for($i = 0; $i < scalar(@dict); $i++){
    print "Checking to see if: $dict[$i] exists in the additions list...\n" if $flags{v};
    @dictEntry = split(/ +/, $dict[$i]);
    if(scalar(@dictEntry) <= 1){
        die "Error! Dictionary entry $dictEntry[0] on line $i does not have a pronunciation!\nNo changes to any file have been made, delete the entry manually and run this script again.\n";
    }
    for($ii = 0; $ii < scalar(@addition); $ii++){
        $entry = $addition[$ii];
        @name = split(/ +/, $entry);
        if(scalar(@name) <= 1){ #The entry in the additions list does not have a pronunciation.
            print "WARNING: Entry $name[0] in the additions list does not have a pronunciation! Ignoring it\n";
            splice(@addition, $ii, 1);
            $ii--; #Stay on this index; the next entry has shifted into it.
        }
        elsif($dictEntry[0] eq $name[0]){
            if(!&samePronounce && !$flags{f}){ #Duplicate entry with differing pronunciations and -f is not set.
                print "Duplicate entry found!\nDictionary entry: $dict[$i]\nAddition entry: $entry\nReplace pronunciation in dictionary with the one found in the additions file? y/n? [y]: ";
                $continue = '';
                while($continue ne "\n" && $continue ne 'y' && $continue ne 'n'){
                    $continue = <STDIN>;
                    if($continue ne "\n"){ #Keep a bare newline so that 'enter' selects the default.
                        chomp($continue);
                    }
                    if($continue ne "\n" && $continue ne 'y' && $continue ne 'n'){
                        print "Incorrect value! Press 'y' or 'n' for 'yes' or 'no' respectively, or press 'enter' for default [y]:";
                    }
                }
                if($continue eq 'n'){ #User wants to keep the dictionary's pronunciation.
                    splice(@addition, $ii, 1);
                    $ii--;
                }
                else { #User wants to use the updated pronunciation.
                    delete $dict[$i];
                }
            }
            elsif(!&samePronounce && $flags{f}){ #Differing pronunciations and -f is set.
                print "Found duplicate dictionary entry with different pronunciations. Assuming version in update is correct due to -f flag.\n";
                delete $dict[$i];
            }
            else { #Same word and same pronunciation.
                print "Duplicate entry found! $dictEntry[0]. Both entries have the same pronunciation, so it's being ignored.\n" if $flags{v};
                splice(@addition, $ii, 1);
                $ii--;
            }
        }
    }
}
print "Done checking for redundant word entries. Now appending new entries to dictionary, followed by sorting the dictionary alphabetically...\n" if $flags{v};

foreach $addLine (@addition){
    if(defined($addLine)){ #If the line isn't undefined.
        print "Adding $addLine to dictionary\n";
        push(@dict, $addLine);
    }
}
@dict = sort(@dict); #Sorting the dictionary.

print "Done appending new entries to dictionary and sorting the dictionary alphabetically. Now renaming file to $ARGV[0].old ...\n" if $flags{v};
close(DICT); #Closing old dictionary handle so we can re-use it.
$append = ''; #Numeric suffix, used only when a .old backup already exists.
while(1){
    if(-e $ARGV[0] . ".old" && -e $ARGV[0] . "$append.old"){
        $append++;
    }
    else {
        #Build the backup name the same way as the existence check above.
        move($ARGV[0], $ARGV[0] . "$append.old") or die "FILE RENAME FAILED\n";
        last;
    }
}

print "Now making a new $ARGV[0] file...\n" if $flags{v};
open(DICT, '>', $ARGV[0]) #Opens a fresh dictionary file for writing.
    or die "Error: Cannot create new dictionary: \"$ARGV[0]\"!\n";

print "Now recreating $ARGV[0]\n with new entries...\n" if $flags{v};
foreach $line (@dict) {
    if(defined($line)){ #Skips entries removed with delete.
        print DICT "$line\n" or die "Error writing in new dict file!\n";
    }
}

print "Done recreating $ARGV[0] with new entries. Closing file...\n" if $flags{v};
close(DICT);
close(ADDITION);

#Compares the pronunciations of the current @dictEntry and @name.
sub samePronounce {
    local(@first, @second, $i);
    @first = @dictEntry;
    @second = @name;
    shift(@first); #Drop the word itself, leaving only the pronunciation.
    shift(@second);
    #First we need to determine if the two arrays are of the same length.
    if(scalar(@first) != scalar(@second)){
        return ""; #False.
    }
    for($i = 0; $i < scalar(@first); $i++){
        if($first[$i] ne $second[$i]){
            return "";
        }
    }
    return "1";
}
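
samePronounce reads the globals @dictEntry and @name rather than taking arguments. A self-contained variant that receives the two entry lines directly might look like the sketch below; the name same_pronunciation and the sample entries are hypothetical and do not appear in the script.

```perl
use strict;
use warnings;

# Hypothetical variant of samePronounce: takes two full entry lines and
# returns 1 if their pronunciations match, 0 otherwise.
sub same_pronunciation {
    my ($entry_a, $entry_b) = @_;
    my (undef, @first)  = split / +/, $entry_a;   # drop the word, keep the phones
    my (undef, @second) = split / +/, $entry_b;
    return 0 if @first != @second;                # different number of phones
    for my $i (0 .. $#first) {
        return 0 if $first[$i] ne $second[$i];
    }
    return 1;
}

# Sample entries invented for illustration only.
my $same = same_pronunciation('DOG D AO G', 'DOG D AO G');
my $diff = same_pronunciation('DOG D AO G', 'DOG D AA G');
print "same: $same, different: $diff\n";
```

Passing the entries explicitly keeps the comparison independent of loop state, at the cost of splitting each line again.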