Speech:Create LM

Project Notes

Create the Language Model

Read the following before starting:

  1. Replace all instances of: <experiment #> with your experiment number!
    • Experiment numbers are 4 digits long, including any leading zeros, and range from 0001 to 9999.
    • Do not include the '<' or '>'.
  2. Similarly, replace all items encapsulated in < and > with the appropriate text.
    • Usually it's a filename or path.
    • Do not include the '<' or '>'.
  3. Pay attention to which directory you execute scripts in!
    • Certain scripts need to be executed in specific directories.
  4. DO copy and paste commands from this page, but do NOT copy and paste multiple commands at once.
    • Most commands and scripts on this page need information specific to your experiment filled in. If you paste several commands into the terminal at once without filling in that information, the shell runs them with the placeholders still in place, and the results are unpredictable.
  5. A leading percent sign (%) marks a command to be executed in the shell.
    • Leave them out when copying a command from this page.
  6. Do NOT execute any of the following commands as root.
    • While this won't cause any of the consequences listed below, it does mess up the permissions of any directories and files created during the process.
      • This effectively blocks others from accessing the data derived from the experiment, which isn't a very nice thing to do.
Please note:
  • The Base Experiment directory is specific to each experiment, and refers to /mnt/main/Exp/<experiment #>
  • The Root Experiment directory is generic to all experiments, and refers to /mnt/main/Exp
Failure to pay heed to the above may result in:
  1. At best: Script failure.
  2. At worst: Data deletion.
  3. Very annoyingly: A mess.
  4. But most annoyingly: A mess in a publicly used directory such as /mnt/main/Exp.
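
The rules above on experiment numbers and on not running as root can be checked mechanically before any command is run. The following is a minimal sketch, not part of the site's scripts: the function names valid_exp_num, base_exp_dir, and not_root are hypothetical and chosen only for illustration.

```shell
# Hypothetical helpers; these names are illustrative, not site scripts.

# Succeeds only for a 4-digit experiment number between 0001 and 9999.
valid_exp_num() {
    case "$1" in
        0000) return 1 ;;                     # 0000 is outside the range
        [0-9][0-9][0-9][0-9]) return 0 ;;     # exactly four digits
        *) return 1 ;;                        # wrong length, or '<' '>' left in
    esac
}

# Prints the Base Experiment directory for a valid experiment number.
base_exp_dir() {
    valid_exp_num "$1" || { echo "bad experiment number: $1" >&2; return 1; }
    printf '%s\n' "/mnt/main/Exp/$1"
}

# Succeeds only when the current user is not root (UID 0).
not_root() {
    [ "$(id -u)" -ne 0 ]
}
```

For example, `not_root && cd "$(base_exp_dir 0042)"` would change into /mnt/main/Exp/0042 only when the number is well formed and the shell is not running as root.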

Steps for Creating the Language Model

September 6th (Cedric Woodbury) - Major changes were made to the entire process during the Summer 2012 semester. To see the revised process, click here.

March 22, 2013 (Eric Beikman). The following instructions are current:

Set up the Language Model folder and copy over the unedited transcript.
  1. From your Base Experiment folder, make a folder called LM.
    • % mkdir LM
  2. Go into this new directory.
    • % cd LM
  3. Copy over the transcript from the corpus directory. Replace <corpus path> with the corpus path you used when creating your transcript (using genTrans.pl)!
    • % cp -i <corpus path>/train/trans/train.trans trans_unedited 
    • FOR EXAMPLE: If you are using the 30hr corpus, whose path is /mnt/main/corpus/switchboard/30hr:
      • % cp -i /mnt/main/corpus/switchboard/30hr/train/trans/train.trans trans_unedited 
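
The three setup steps above can also be collected into one function, which makes it harder to run them in the wrong directory. This is only a sketch under stated assumptions: setup_lm_dir is a hypothetical name, not a site script, and both arguments are assumed to be absolute paths.

```shell
# Hypothetical wrapper around steps 1-3; setup_lm_dir is not a site script.
# $1 = Base Experiment directory (absolute path)
# $2 = corpus path used with genTrans.pl (absolute path)
setup_lm_dir() {
    src="$2/train/trans/train.trans"

    # Check that the transcript exists before creating anything.
    [ -f "$src" ] || { echo "transcript not found: $src" >&2; return 1; }

    # Run in a subshell so the caller's working directory is unchanged.
    (
        cd "$1" || exit 1
        mkdir LM || exit 1              # fails if LM already exists
        cd LM || exit 1
        cp -i "$src" trans_unedited     # -i asks before overwriting
    )
}
```

Called as `setup_lm_dir /mnt/main/Exp/<experiment #> <corpus path>`, with both placeholders replaced as described at the top of this page, it leaves trans_unedited inside the new LM directory.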

Prepare the transcript and execute the script that will build the language model.
  1. Prepare the transcript:
    • % parseLMTrans.pl trans_unedited trans_parsed 
  2. Execute the script:
    • % lm_create.pl trans_parsed
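
The two commands above can be wrapped so that the second runs only if the first succeeded. The sketch below assumes nothing about parseLMTrans.pl and lm_create.pl beyond the arguments shown on this page; build_lm is a hypothetical name, and treating an empty trans_parsed as a failure is an assumption, not documented behavior.

```shell
# Hypothetical wrapper; build_lm is not a site script.
# Run it from inside the LM directory.
build_lm() {
    # Both scripts must be reachable through PATH, as the commands above assume.
    for tool in parseLMTrans.pl lm_create.pl; do
        command -v "$tool" >/dev/null 2>&1 ||
            { echo "$tool not found in PATH" >&2; return 1; }
    done

    # The unedited transcript must already have been copied into this directory.
    [ -s trans_unedited ] ||
        { echo "trans_unedited is missing or empty" >&2; return 1; }

    # Step 1: prepare the transcript.
    parseLMTrans.pl trans_unedited trans_parsed || return 1

    # Assumption: a usable parsed transcript is a non-empty file.
    [ -s trans_parsed ] ||
        { echo "trans_parsed is missing or empty" >&2; return 1; }

    # Step 2: build the language model.
    lm_create.pl trans_parsed
}
```

Stopping after a failed parse avoids handing lm_create.pl a transcript that was never produced.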

The Language Model has been created. Move on to Speech:Run_Decode.