Speech:Spring 2019 Peter Baronas Log



Week Ending February 4, 2019

Task

One of the first goals for this week was to log in to Caesar using the VPN from my house so that I would have constant access. The next thing I wish to do is read in more detail about the experiments completed by last year's Data Group, as well as see what files are stored on Caesar. Overall, I need to get more familiar with navigating a UNIX system through the command line. I have limited experience with the command line interface; I'm used to working in systems with graphical user interfaces. Finally, I need to get more familiar with Sphinx.

Results

I had only partial success trying to log in to Caesar. I was able to set up the VPN correctly. However, for some reason, every time I try to log in with my user account, it times out. Yet when I try to log in as root, I can gain access but still cannot reach any of the files; I cannot even use the ls command to see what files are available. Reading the Sphinx website gave me a better understanding of how it takes and uses audio for speech recognition, and how it sometimes finds more information in the gaps between phones or diphones than in the phones themselves. While I knew that most speech recognition is hampered by trying to get an all-analog communication method understood by an entirely digital system, I did not realize how much of a difference could be caused by using different analog-to-digital conversions. The thing I think will be the biggest challenge to getting a good decode of the audio samples will be finding a good way for the system to differentiate sounds like 'um' and other unimportant noises from the phones and diphones. While working with my group, we found that when you use addExp.pl, the -r flag creates a new main experiment folder while the -s flag creates a sub experiment. So there is now a main directory with my name on it because we found this out the hard way. Also, through trial and error, I found that the Windows 10 ssh client does not handle the backspace key or other non-ASCII keys, but if you use PuTTY it works fine. I created a new train in experiment 0313 sub 007. I plan on running the language model and decode next week once I have had a chance to read about them in more detail. I set up a Slack group for the Data Group to use as a point of contact and communication, and the entire group has made good use of it. We used it to set up a meeting for Monday to go over what we are doing on the project.
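As a reminder for myself, here is roughly how those two flags behave; the exact arguments addExp.pl expects are placeholders, and only the -r versus -s distinction comes from what we found:

 # Hypothetical invocations; the argument layout is a guess, only -r vs -s is from our testing.
 perl addExp.pl -r          # -r creates a brand-new main experiment directory
 perl addExp.pl -s 0313     # -s creates a sub experiment under an existing experiment (e.g. 0313)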

Plan

The plan for fixing the login problems is to contact both the Systems Group and Prof. Jonas to see if either has any suggestions. I also plan on becoming more familiar with UNIX by searching for and reviewing some video tutorials on how to navigate a UNIX system and on basic UNIX commands. To become more familiar with Sphinx, I plan on reading its website and informational materials. I am also going to read more on how to set up the language model by using the wiki, and learn more about how to decode and score before I test the train I created this week.
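A few of the basic navigation commands I expect to lean on while exploring Caesar (the directory path below is just an example):

 pwd                  # print the directory I am currently in
 ls -l                # list the files here with permissions and sizes
 cd /some/directory   # move into a directory (example path)
 cd ..                # move up one level
 less somefile.txt    # page through a file without editing it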

Concerns

Hopefully this is just an error that can be corrected with admin permissions. I am very hesitant to touch and manipulate files in a UNIX system given my unfamiliarity with that operating system in general. Plus, there is the added challenge of only being able to use the command line to navigate, which I have never had to do before. I still don't really know what I am supposed to do on this project, and I am getting the feeling that I am not alone in this. My group is meeting to discuss and compare our initial impressions of the project and hopefully develop a better understanding together. I accidentally created a main experiment folder on Caesar because of an error in using the addExp.pl command; I will need to remove it, but I am not confident about deleting it.

Week Ending February 11, 2019

Task

The task my group was given was to create a new corpus in the 30-minute to one-hour range, to find out if there are any significant differences when using the different types of transcripts, and, if so, which ones to use. Thankfully the 2018 Data Group got some of the groundwork done on this, so we can copy their experiments to confirm results and then move on to longer-duration tests. Finally, we need to find a way to set up the trains and decodes so that they do not share any of the same speakers, in order to keep the unseen decode truly unseen. I also need to get a 5-hour experiment to work so I can better understand how to run an experiment. Then I need to become more familiar with the Switchboard corpus. I am also going to work on creating the draft of the proposal due February 12.

Results

The first draft of the Data Group's proposal is complete, with the exception of the task timeline, due to differences in opinion within the group on when certain jobs need to be completed and which ones will take the most time. After talking with my group over the Slack channel, they said that they liked the proposal and that we should meet to go over the timetable, which I wanted to do anyway because it would not be fair for one person to set up a timetable without at least talking to the others first and getting their input on how long they think the different tasks will take. I worked on the timeline with the group, and we have set preliminary dates for the projects to be done. Everyone felt the timeline was good and that we could get things done without being rushed.

Plan

During the Data Group meeting on Tuesday, I will go over the task timeline with the rest of the group so we can finalize who is doing what and which projects will take priority. We'll also estimate when we think each project can be completed. I will also work with my group to take in any changes they wish to make to the proposal, as well as any that are suggested as part of the first draft being graded. I wish to read more about creating and managing the corpora, since it looks like that is going to be a large part of the Data Group's projects this semester, if not its entirety. I'm still confused about how exactly they are generated by the system, specifically what types of data are being used and how the audio files are being paired with transcripts. Fortunately, there are a lot of helpful wiki entries about this, since the Data Groups in past years have had to do a lot of work with corpora as well. Once I am able to reliably run experiments, I plan to start performing tests to confirm the results of last year's Data Group experiments, as well as expanding the experiments out onto the different trains to make sure that the improvements we see are carried through to the more complicated tests. We are meeting to finish the timetable for the projects so we can use it in the draft of the proposal; this will also be helpful for determining what to focus on rather than just meandering through projects.

Concerns

I am still having trouble getting trains to run reliably, and I have had to start from scratch multiple times now; I'm hopeful that this next attempt will finally produce a full decode and work properly. I also still have a problem connecting to the server from home, even when using the VPN and PuTTY. I'm thinking this might be because my ISP is blocking some of the ports used for ssh; unfortunately, I do not have the password to modify the router settings, so I cannot change those port settings without a lot of hassle. After testing my router settings, I unblocked port 22, and that did not fix the problem, so I will need to meet with IT to fix the VPN. I talked with UNH IT, and they had no suggestions about why the VPN is not letting me reach Caesar. I have also done some testing on campus and found that the VPN is the reason I cannot get in to the server.
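One quick check I can run from home before blaming the VPN is whether port 22 is reachable at all; the hostname below is a placeholder for Caesar's actual address:

 # -z only scans the port, -v reports whether the connection succeeded.
 nc -zv caesar.example.edu 22
 # Or watch the SSH handshake itself in verbose mode.
 ssh -v myusername@caesar.example.edu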

Week Ending February 18, 2019

Task

I worked with some members of the other groups to set up a single standardized style for the proposals so each group would be somewhat consistent. The representatives of the other groups also helped me try to get my computer to connect to Caesar when using the VPN. We found that the server did not see any of the attempts made while using the VPN. This led us to the conclusion that the problem is with the VPN, so I am going to have to call the main UNH IT Helpdesk to see if they can help me. I am also going to try using the VPN on my home desktop (as opposed to my laptop) to see if the problem is only with my laptop or if it has something to do with my user account. Over the past two weeks, I have been focused on preparing my group's proposal. Because of my problems connecting to Caesar, I am unable to work from home on any tasks requiring Caesar. To maximize the group's efficiency, I volunteered to write the proposal so that the others can focus on work with Caesar. Also, by being written by a single principal writer, the proposal can have a consistent voice. I have been seeking and receiving input and feedback from my fellow group members on the writing to ensure it is reflective of everyone's work and opinions.

Results

After meeting with the other groups, I changed the format of the task timeline in the proposal to be more consistent with the rest of the groups. I also expanded the tasks a bit to give more details, provide more specific time frames for when things will be completed, and indicate who will be responsible. I also tried to make the tasks discussion simpler and easier to follow. As part of testing my VPN connection with the other group members, we found that the VPN application on my laptop behaves the same with other valid users as it does with me. When a valid user logs in to the VPN, the application confirms a connection is made; however, any attempt to log in to Caesar through the VPN simply fails, with no error reported, and the Caesar log shows no record of an attempted connection. This leads me to conclude that the VPN is having problems or conflicts with either the hardware or the software on my laptop. I hope I don't run into the same problems when I try my personal desktop. I installed the VPN on my personal desktop and found that I could connect to the server with no problems. The only minor difficulty was needing to install an ssh program because, unlike my laptop, the version of Windows 10 on my desktop does not have one preinstalled.

Plan

Over the coming week, I plan on installing the VPN application on my personal desktop and testing it to see if I am able to make a connection with Caesar. I will contact the UNH IT Helpdesk to hopefully solve this problem with connecting to the server. In addition, I plan to complete the changes to the Data Group proposal that were recommended to me by Prof. Jonas before the due date, so that all the writers of the groups can go over the proposals and make sure there is a consistent voice to the writing. I also plan on learning more about the best approaches to training and testing acoustic models and language models so that I can make informed decisions on how best to filter the data for the training and testing improvements. I am also going to work with the rest of my group to better understand how to navigate the server and use the train and other testing tools, because I have fallen a bit behind due to the challenges with getting a connection to the server.

Concerns

Even if the desktop works, it is a bit of a pain because I generally do not keep any of my school work on that system, so I will have to constantly switch between my two personal computers to get work done. Also, because my desktop is not typically used for school work, it lacks some of the software that I need. I still feel lost about what we are doing and the process for accomplishing the tasks. I am struggling to understand how the data is even sampled and used in building the different models. I also do not yet understand how the scoring is done, what it represents, and how the system determines that information.

Week Ending February 25, 2019

Task

My group asked me to do in-depth research into how speech is modeled or, in other words, what Sphinx is looking at and how it interprets the audio. I will approach this by looking into all the information we have on speech modeling in the wiki, as well as doing some extra research into optimal ways of modeling speech. I will also explore information about the Google and Amazon speech recognition systems, because those seem to be the most accurate from what I have used. Now that I can access Caesar from home, I plan on finally completing my first train and experiment and at least learning how to debug and score it correctly. Unfortunately, I have fallen pretty far behind on that due to the difficulty connecting, as well as taking the lead on writing my group's proposal.

Results

Due to some confusion about where I last left off on my experiment before running into server connection errors and focusing on the proposal, I decided to start from scratch, and because I ran into an error last time, I decided to try the 30-hour train for my first experiment. However, I did not realize how much longer it would take: it has taken over four hours to complete the first step of the experiment. I now see why it was such a high priority for us to generate a one-hour test experiment. Last time, the five-hour trains only took an hour or two at most. I am wondering, though, if the results of this test will be better because it has more data to train from. I have not fully figured out scoring yet. I also read up on some of the grammar models and find it interesting that, overall, Sphinx-4, which is not the version we are using, is considered the better tool. However, it looks like it is considered better largely because it is written in Java, a language that can easily be ported to other devices because it runs on a virtual machine rather than being compiled for specific hardware. This makes me wonder if we should be looking to switch to Sphinx-4 in the future, or if the improvements we are making to the word error rate and language models are going to remain isolated to this specific server environment, given that it is a research-focused platform.

Plan

My plan for the coming weeks is to complete my 30-hour experiment as well as score it. Then I will also continue to do research into how Sphinx works and how speech recognition in general is laid out so I have a better understanding of it for when we approach other tasks for the project.

Concerns

I am concerned that, because we are using speech recognition software that is no longer in active development and is considered less effective than alternatives, we may be falling behind in some aspects of this technology, and this may be keeping us from drastically reducing our word error rates. I have noticed that the groups with the best speech recognition are organizations like Google and Amazon, both of which have a large pool of users as well as a large set of highly varied speech patterns that can be processed through a large network of processors and server farms. This makes me wonder if it is even possible for a small server with very limited amounts of data and processing power to get to the point where it is usable for practical purposes. I have personally used a commercially available speech recognition package known as Dragon for many years, and I have noticed that it has a lot of trouble recognizing my voice even after multiple training sessions. Its performance still does not compare to that of companies like Google and Amazon, which have had relatively little time in the space compared to other companies and yet have much more accurate language models. I think part of the reason is that human speech is still far beyond the capability of the average computer, or even a high-end personal computer, to process without huge quantities of data or huge quantities of processing power, both of which Amazon and Google have easy access to compared to individuals.

Week Ending March 4, 2019

Task

This past week my group asked me to continue researching how Sphinx does its speech recognition, and I am continuing to read the articles that were posted to the wiki. Once I have completed reading all of those articles, I will start looking for new sources. I am also going to update the Data Group wiki with at least our current tasks. When meeting with the group on Monday, I will ask what other content they would like added to the Data Group wiki. Some of that may be rolled into next week's work.

Results

According to the reading I did this week, it would appear that Sphinx is most successful when the data used to train its language and acoustic models does not all come from the same environment. In other words, while we are designing a set of training and test protocols with distinctly different sets of speakers and conversations, the training and test data are still from the same environment and therefore share the same benefits and flaws. According to the paper I read, the researchers found the best results when data from two different environments was used for training and testing; in other words, they took two different types of audio recordings from separate environments, and that gave them the best results. This also seems to fit with the fact that companies like Google and Amazon have really effective speech recognition because they have a much larger user base and are able to get many different types of samples. I have modified the Data Group's log with some basic information about what tasks we are performing. I plan to expand it in the coming weeks after I have had a chance to get more input from my group on what type of information they feel would be best put in that section of the wiki.

Plan

I plan on continuing to read up on speech recognition to learn how Sphinx operates and develops its various models. I will also continue to debug my first test experiment; I am running into an error and just have not figured out how to get around it yet. I will also be working on expanding the Data Group's log with all the work my group has been doing, along with other information that we have found useful. Some of the data we have generated to keep track of information is not well documented on the wiki and cannot be stored on Caesar, such as a SQL database with information about the data for better lookups and organization. This database has been helpful for finding which conversations would be best for training.

Concerns

One thing that I am continuing to notice about speech recognition, in regard to how we are testing, is that audio recorded over traditional telephone lines has a much higher chance of containing interference and other types of problems. This makes speech recognition, already a difficult task for a computer, even harder because of an increase in background noise and other audio distortions, due both to the environment of a phone call and to the processing the audio goes through in order to be transmitted over traditional phone lines, which in many parts of the country are still based on some pretty old standards because of how our telephone network infrastructure was built. In other words, a lot of phones have to compress and filter the audio so that the data can be sent over the telephone lines, and this can make speech recognition harder. This is concerning to me because we are using audio recordings taken over the phone as both training and testing data. In part, my concern is that we are not adequately filtering out the background noise or any distortion caused by the phone line. Using phone audio to create our acoustic model might be the wrong approach, because it may be leading the computer to make some inadequate assumptions or associations. It may be best to look into finding a way to get some good, clean audio samples from which to generate the acoustic model. This could lead to a marked decrease in word error rate.

Week Ending March 11, 2019

Task

During a recent class meeting, Prof. Jonas suggested that I take over the work of determining which audio files we should use for the five-hour unseen test and the five-hour seen test, for a total of 10 hours of data that will be reserved for testing only. This task was previously headed up by Brendan, who made a lot of early progress on finding the sources of this type of information and was very helpful in pointing me in the right direction on where to look, both on Caesar and in online resources, to find the needed files. The goal is to have, by March 26, a list of files and their locations on Caesar that total 10 hours of audio we can safely earmark for testing only.

Results

I have found all the files that contain information about both conversations and conversation length and have collated them into an easier-to-use format in an Excel file. I have found the lengths of all the conversations: there are consistently large numbers of five-minute, ten-minute, and 45-minute conversations, with a smattering of conversations longer than 45 minutes, a couple reaching the 50- or 55-minute mark. After scanning through the data, it appears that the most common conversation length is 10 minutes. I have a feeling that this was the goal for a lot of this research, with a few other data points for longer and shorter conversations. So I determined that it would be best to pick 60 unique 10-minute conversations evenly distributed across the CDs. This is going to take a significant amount of time because the CDs each hold a variable number of conversations and there are 2,438 conversations in total. This is a mix of good and bad news: the good news is that it should be easy to take 60 ten-minute conversations from across the 23 CDs; however, it is going to take a lot of time to determine which conversations to choose from each CD, and what will take even longer is determining which conversation IDs link to 10-minute conversations. I will be working with some Excel files in the hope of making this task a little easier, but it has to be done by hand and will take several hours, possibly even a day or two. After doing some more analysis of the data, I found that there are 379 conversations that are 10 minutes in length. This is a wonderful number to see because it means I will have no trouble finding enough conversations to reach 10 hours. However, I am a little concerned that the highest conversation ID linked to a 10-minute conversation is in the 3000s and some of the CDs do not have conversations in the 3000s; in particular, some of the higher-numbered CDs, such as 23, start with conversation IDs in the 4000s. So I may need to do more analysis to find a combination of 10-minute conversations and maybe 5- or 20-minute conversations that show up on other CDs. Upon further analysis, it appears that there are 1,279 conversations that are 5 minutes in length, and they have a much wider range of ID numbers, including some in the high 4000s. I think using five-minute conversations might actually be better. This means it is less likely that we will be able to eliminate specific speakers, but we will be able to get a better distribution across all 23 CDs. It may also have the unforeseen benefit of giving a wider variety of speech patterns to analyze.
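Most of this tallying was done by hand in Excel, but the counts could be sanity-checked with a quick shell pipeline, assuming I export the spreadsheet to a comma-separated file with the conversation ID in the first column and the length in minutes in the second (the file name and layout here are assumptions):

 # Count how many conversations there are of each length (minutes in column 2).
 cut -d, -f2 conversation_lengths.csv | sort -n | uniq -c
 # List just the IDs of the 5-minute conversations.
 awk -F, '$2 == 5 {print $1}' conversation_lengths.csv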

Plan

Over spring break and the week after, I am planning on working through the spreadsheets that hold the conversation information and determining which conversation IDs link to 5-minute conversations. Once that is established, which will be a much shorter list than the 2,000-plus conversations in total, I will see if there is any chance that we can exclude the same speakers every time while still evenly distributing the number of audio files taken from each CD. I also plan on continuing my reading into how Sphinx speech recognition works, because it has been helpful in informing me about best practices for speech recognition and about what steps might be worth taking to improve the system in the future.

Concerns

I did have some concerns about finding enough 10-minute conversations on each CD, given that each CD seems to hold a variable number of conversations. For example, the CD labeled "disk 9" had exactly 80 conversations on it, whereas disc 23 had well above 80. Also, at least one of the CDs must have an extra conversation selected from it because 60 does not divide evenly by 23. However, I think this is a minor problem unless we get really unlucky and find a CD with no 10-minute conversations, which is highly unlikely given the total number of conversations and how frequently 10-minute conversations appear. Overall, I feel that the hardest part of creating the clean tests will be setting up the scripts in such a way that they never access the reserved files.

Week Ending March 25, 2019

Task

This week's task is to find all the filenames for the audio files on the discs and then cross-reference that list with my list of five-minute conversations. The goal is to establish a set of test data for both the unseen and seen tests.

Results

I managed to create an Excel file that lists all of the filenames in columns indicating which disc contains which audio files. I then went through and highlighted the file IDs of files that are 5 minutes in length. I plan on talking with the rest of my group to determine which file IDs we want to use in the testing data, and we will then change those to be highlighted in green. I can then work with people who are more familiar with how the test script works to blacklist these files from being used in the trains and to set up the test parameters. Overall, most of the time-consuming work has been completed and the data has been organized for the improved testing parameters. All that is left is to create new scripts or modify the old ones; I do not know which would be easier, as I am unfamiliar with the scripting language and have not reviewed the scripts. I will probably talk with the scripting team about this. I will also be posting all the spreadsheets I have made, since I feel the information will be useful to anyone looking into the audio files in the future; it will save someone from having to repeat the work that I have done. I just need to consult with Prof. Jonas about which file format would be best for posting the lists.

Plan

I plan on at least having picked out the 120 files that will be used for the test parameters by the end of next week, but my target is to have all of this work done. Then I will work with people who are more familiar with the scripting language we are using to set up the training parameters so that they do not grab the preselected files. That way, we know we are always testing on data that has not been used in the process of training the language model.

Concerns

I do have some concerns. While looking at the files to write down all the IDs, I came across a situation several times where my normal user was not allowed access to the disc folders, particularly discs 6 and 8. I could access them when I switched to superuser; however, I could access other discs without being superuser. I wonder if this has any impact on how the script works and whether or not we should change the file permissions. Furthermore, some of the discs have a differently named directory holding the files. For example, one had a directory that looked like disk1/swlb, whereas disc 6 had a directory that looked like disk6/SWLB. In addition, when I listed the files on disc 1 versus disc 6, the files on disc 1 appeared in the gray text used for input commands, but when I opened the uppercase directory on disc 6, all the files had extensions of .SSH rather than the .ssh that everything on disc 1 had, and the command prompt window highlighted these in green text, unlike on disc 1. I wonder if this has an effect on how the files are interpreted. If they are treated differently, we should look to normalize the directories and decide whether we want uppercase or lowercase file extensions. Another concern I have is that, unfortunately, some of the discs do not have any files that are 5 minutes long, and some have only one 5-minute conversation, so we will not even be able to easily sample across all the discs. I am also a bit concerned that setting up files to be excluded will be a little more difficult than setting up files to be picked specifically, since we are dealing with a system that is supposed to have some randomization in the files it selects. I am wondering if it might be more efficient to move the files that are going to be used for the two types of tests to a protected directory so that they cannot be randomly selected. This approach may be better than building the exclusion into a script; that way, we do not need to change any of the scripts and can continue to use them as is, or, if we are changing them, we are changing them for other reasons rather than just to protect 120 files.
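If we do decide to normalize the uppercase extensions mentioned above, a small loop like the following would do it; the path is a placeholder, and I would want the Systems Group's blessing before renaming anything on Caesar:

 # Rename every .SSH file in one disc directory to use the lowercase extension.
 cd /path/to/disk6/SWLB    # placeholder location
 for f in *.SSH; do
     mv "$f" "${f%.SSH}.ssh"
 done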

Week Ending April 1, 2019

Task

This week, I have had two tasks to work on. The first is for the Data Group: to create a new directory under my home folder on Caesar and then split the transcript into two separate transcripts. One transcript will cover the 10 hours that I am separating out for testing, and the other will cover the 290 hours of training audio. This way, we will have the transcripts ready for implementation if we decide to switch over to the new method of testing. However, since this is unlikely to happen until sometime over the summer, I am going to work on creating the two transcripts slowly over the next few weeks so that I can spend more time focusing on the work for the alliance group. The task for the alliance group was to research different ways to improve the word error rate in our testing so we can have a better idea of what we want to do as a group project. I think I am going to continue to focus on what we could do to improve the audio, since this is where most of my research for the Data Group has gone. It is something I am already interested in, and having a background in electrical engineering and knowing how A/D converters work gives me a bit of a leg up.

Results

Efforts at creating the new transcript have started. I have created a new directory under my home folder labeled "new 300 hour" and copied in the old full transcript. I have also been doing research on how the grep command works to better understand what it does and how it does it. Hopefully I will find a way to avoid running 120 slight variations of the grep command; instead, I would rather be able to run it all at once. So far I have not found a way to make grep search for more than one regular expression at a time, and I am still a bit fuzzy on how to transfer the results to a new file.
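From the reading I have done so far, it looks like grep can take several patterns in one call, and that the redirection, not grep itself, is what writes the new file; a rough sketch with made-up IDs and output names:

 # Multiple -e options let one grep call search for several IDs at once.
 grep -e 'sw2001' -e 'sw2005' train.trans > matched_lines.trans
 # Or keep the IDs in a file (one per line) and hand the whole list to grep.
 grep -f id_list.txt train.trans > matched_lines.trans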

Plan

I plan on naming the new transcripts "test transcript" and "train transcript", respectively. The test transcript will have the 10 hours of audio that was separated out for testing purposes, whereas the train transcript will have the 290 hours of training data. This should be easy for future students to follow and use. I plan on volunteering to do a lot of the writing for the alliance group project because I feel like I am good at taking a lot of people's thoughts about a project and condensing them into an easy-to-follow form, as well as combining the thoughts and methodologies of multiple people into a cohesive whole that is easy to follow and understand. I also feel that I do not have the best understanding of the goals of this project, so writing is a good way to help without feeling like I am far behind the curve.

Concerns

One concern that has been bugging me for a while is that I am not sure what the point of this research is other than to lower the word error rate. What is the end goal of this speech recognition project? Is it meant to recognize phone audio, or is it meant to be used in some type of speech-to-text conversion software? Because I am not sure of the purpose of this project, I am not sure how best to improve it. One way we could improve it would be to use better audio than phone recordings, but if the whole point of this project is to work with phone audio, then it makes sense to train on phone audio. However, if this project is meant to produce some type of speech-to-text software, then we should not be using phone audio, since in most cases people using speech-to-text software have higher-quality microphones and higher-bandwidth A/D converters than you can find in a typical landline phone, not to mention the distortion that occurs in a landline phone due to having much more limited bandwidth than other audio sources.

Week Ending April 8, 2019

Task

This week I am continuing to work on creating new transcripts, as well as doing research into topics to pursue for the large group project. I am still focusing on problems in the audio realm of the system because my background is in electrical engineering technology and I did a lot of work with signal-to-noise ratios, as well as analog-to-digital converters. These topics are both highly relevant when working with audio quality and computer interpretation of all of the audio.

Results

This week I read one of the papers linked on the wiki that focused on how different signal-to-noise ratios affected the word error rate, as well as on the algorithms used to try to minimize the impact of the signal-to-noise ratio. Something I found interesting was that the lower the signal-to-noise ratio in decibels, the harder a time the computer had recognizing the speech, whereas most humans can easily recognize speech in noisy environments at that level. I speculate that part of this is that humans are much better at discerning key sounds and ignoring extraneous background noise, while a computer, due to its limitations and binary processes, has to process all sounds to the same degree before it can determine whether a sound is important or noise. However, the computer excelled at recognizing audio with a high signal-to-noise ratio in cases where humans would generally struggle, and I think this has to do with the computer being able to more easily recognize the distinct differences without having them drowned out the way a human would. This information furthers my belief that we should be looking into getting higher-quality audio, not recorded over a phone, for training the baseline language models and acoustic models. By reserving filtering for newly entered audio after the models are established, we would end up with more accurate results because the system would not have to work through filters twice.

Plan

I plan on continuing to generate the transcript; however, I am running into problems with the grep command. I am not really sure how it is meant to create new transcripts if it only searches for information and displays the lines that information is on. Furthermore, I am not sure how to combine multiple transcripts into a single file, and I am not sure whether the grep command takes the found lines out of the original file or not. I plan on talking with some of my group members who know more about UNIX to learn how the grep command works. This may help me understand how best to approach merging multiple transcripts into one and how to copy information from one transcript to another. I also plan on meeting with the alliance group to determine whether we should look into using different algorithms for filtering out background noise as a possible topic for our group project. The reason I think this would be a good topic is that a lot of what the data modeling group has been doing is treating transcripts so that the computer can read them better; however, the problems we are seeing in the transcripts are brought about by audio that has a large amount of background noise and asides in the conversation. This makes me feel that the audio we are using is not as clean as we would like and that applying filtering algorithms might help improve the word error rate.
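One thing I want to confirm with the group is my current understanding that grep only prints matching lines and never removes them from the source file, and that combining partial transcripts is just a matter of concatenation; a sketch, with the file names made up:

 # grep reads train.trans but never changes it; the > writes the matches to a new file.
 grep 'sw2010' train.trans > part1.trans
 # Appending with >> (or using cat) merges several partial results into one file.
 grep 'sw2020' train.trans >> part1.trans
 cat part1.trans part2.trans > combined.trans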


Concerns

After completing this reading, I was concerned about how we are treating the audio. The big problem the data modeling group has been working on is fixing the transcripts, and the sheer number of references to extraneous noise or breaks in conversation for things like laughter, or guesses at what a word was, leads me to believe that the audio we are using does not have a high signal-to-noise ratio. However, it does not look like we are applying any of the methods for filtering out low-signal-to-noise-ratio audio. I think this might be a place to start for a large group project, because if using the various built-in algorithms for filtering out background noise improves the word error rate significantly, it will lead us either to look at using different, higher-quality audio or to simply keep the algorithms and start fine-tuning and tweaking them as needed.

Week Ending April 15, 2019

Task

One of my tasks for this week is to continue working on using the grep command to create the new transcripts that reflect the changes in what audio is being used where. My second task is to do additional research into how the statistical model is made, to determine whether we should look into changing it for the large group project.

Results

I have been reading up on the various versions of the Cambridge Statistical Language Modeling Toolkit and made some interesting discoveries. My first discovery is that it looks like the toolkit has a built-in method for ignoring particular parts of a transcript. Dealing with those parts has been a large portion of the work the Data Group has been doing the entire time we have been working on the project; because this was listed as part of the statistical model, we did not think to look into the toolkit we were using to create the model. If we had come across this earlier, we might have been able to save a lot of time and testing by simply turning on the different back-off options to determine which one gives the best results, rather than physically editing the transcripts and creating scripts to do so. Also as a result of this reading, I think it would be interesting for us to try building 4-gram or 5-gram language models, because these will likely lead to better results, and our machines should be able to handle this type of processing thanks to the recent upgrades the Systems Group has implemented. Also, in version 2 of the toolkit, the data is compressed in a slightly better manner so it takes up less space. I think this would be an interesting way to see if we could get improved results without having to change too much. I am not sure how difficult it would be to implement this change, but it would be a worthy project goal.
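As a note for whoever tries this, the v2 toolkit pipeline for a higher-order model would look roughly like the following; I am writing this from my reading rather than from a working run, so the exact flags should be checked against the toolkit's own documentation:

 # Build a word-frequency list and vocabulary from the training transcript.
 text2wfreq < train.trans | wfreq2vocab > train.vocab
 # Convert the transcript to id n-grams of order 4 instead of the default 3.
 text2idngram -vocab train.vocab -n 4 < train.trans > train.idngram
 # Estimate the 4-gram language model and write it out in ARPA format.
 idngram2lm -idngram train.idngram -vocab train.vocab -n 4 -arpa train.arpa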

Plan

My plan for my first task is to continue to work with the grep command to create the new transcript files and then merge them. I am still somewhat uncomfortable with this due to my lack of experience with UNIX, but work is progressing. I plan on discussing the idea of switching to a 4-gram, or even a 5-gram, language model with the rest of the alliance group during our in-class meeting time, and will talk with people about how we should best approach that going forward. I will also see what other classmates have come up with and continue to discuss the pros and cons of each idea so we can decide as a group which idea to pursue.

Concerns

One of my biggest concerns is that I am unable to verify that we are using version 2 of the toolkit. If we are not running version 2, I do not think most of my ideas will pan out well. However, if next year's class could upgrade to version 2, it would provide a lot more functionality, and it does a better job of compressing the data so that it is easier for the computer to process, especially given that we are using somewhat lower-end machines; any type of improved data management and processing we can get our hands on is a great boon. Furthermore, the v2 toolkit provides built-in support for ignoring and backing off from specific parts of the language model. I think that, with some more experimentation, it could easily be used to ignore parts of the transcript that we do not want the computer to process, such as the [laughter] markers and other such instances, which the Data Group has spent a lot of time and effort editing out of the current transcripts. Relying on the toolkit's built-in back-off to ignore that type of information may be better than the rather inconsistent results we have seen with the hand-edited transcripts. The other big concern is that I am still heavily struggling with getting the grep command to create new documents. I have talked with people who know UNIX systems better than I do, and they agree that the grep command by itself is just a search tool; I am unsure how to combine grep with other commands to create new files.

Week Ending April 22, 2019

Task

This week I decided to focus on finishing up the transcript modifications in order to have them done before the end of the semester. That way I can talk with Prof. Jonas about any further changes or tweaks needed. I am also continuing to research various topics for improving the word error rate.

Results

I managed to write a .sh file that has all the commands for creating the new transcripts. Unfortunately, it is a rather long file because I could not find a good way to create loops: the IDs being searched for do not follow a regular numeric sequence, which makes it difficult to create an iteration scheme that would hit all the needed values without missing some or grabbing unwanted ones. I have not been able to run the file yet because I am unable to transfer it due to connection issues with the server from home, as well as being unfamiliar with how to transfer files from a Windows machine to a Linux machine over SSH.

Plan

The plan for creating the new transcripts is rather straightforward: use the grep and grep -v commands, respectively, to search the train.trans file for the IDs of the audio files that are being pulled out for the clean test, move those lines to a file called Test_train.trans, and then remove them from train.trans, putting everything that remains in a new file named Train_train.trans. This is repeated for the 100-plus file IDs identified as part of the previous work on the audio files. Now that the file is complete, I plan on using the secure shell program to transfer it to the server and then run it, hopefully without any errors. I did not use any interactive versions of the commands, so I do not have to constantly enter input to keep the program going, considering it is over 600 lines. To avoid potentially catastrophic problems if this script misbehaves, I have created a backup of the original transcript file, saved in the same folder as a file called "safe". The script also does not touch the raw transcript file, which is saved in a completely different directory that the script never points to; it will only use the transcript file in the folder where I plan to run and store the script. I also ran some individual commands to check that I will get the results I want, and that testing was successful.
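For comparison, if the reserved IDs were kept in a plain text file, one per line, the same split could in principle be done with just two commands instead of 600-plus lines; this is only a sketch (the ID list file is hypothetical), but it sidesteps the nonlinear-ID looping problem entirely:

 # test_ids.txt holds one conversation ID per line (e.g. sw2010).
 grep -f test_ids.txt train.trans > Test_train.trans       # lines reserved for the clean tests
 grep -v -f test_ids.txt train.trans > Train_train.trans   # everything else stays for training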

Concerns

A minor concern I have is that the file IDs listed for the audio files and those listed in the transcript files are slightly different. I do not know if this is just a miscommunication between those creating the transcripts and those creating the audio files, or if it is something that arose in the process of editing the transcripts that previous classes worked on. Overall, it is a rather small problem: the file ID sw02010 for the audio file is simply linked to sw2010 in the transcript. I am also concerned that there may be some syntax errors in my .sh file due to my lack of familiarity with UNIX and Linux-based systems, as well as some of the quirks of these types of files. Another minor concern is that I am unsure what the best topic would be for improving the word error rate for the alliance team's main research. It seems like everyone is working on a few different ideas rather than consolidating on one and focusing on a single way to improve the word error rate. While none of the ideas anyone has proposed are bad, and in fact most of them are pretty good, I feel that we are wasting time researching multiple ideas when we could all focus on one and develop a better understanding of it as a group. That would allow us to start working with it sooner, rather than having a few people knowledgeable about whichever topic we end up choosing while everyone else has to catch up to follow the experts and, as a result, is less useful for the main part of the project.

Week Ending April 29, 2019

Task

My tasks for this week were to continue to look into how the dictionary works for the language model and find ways to improve it, to finish up work on the script for creating the new 300-hour transcripts, and finally to update the wiki project labeled "new 300 hour transcript" so next year's class has more information to work from when trying to figure out what the new transcripts are for and why they were created.

Results

I spent the majority of my time creating a new and detailed set of documentation regarding the new 300-hour transcripts, which is in fact a poor name for the project since it is technically two transcripts now, one that is 290 hours in duration and one that is 10 hours. I did manage to write up what I think is a good summary describing what the goal of the project was and how I went about it, as well as mentioning some of the dangers of using the script I created to build the new transcripts. I am also looking into how to upload documents to the wiki page so that people can have access to the Excel document and the .sh script I created in the process of working on this project. I am particularly focused on the Excel document because it has information that might prove useful to future data groups: it is not only a list of the audio file IDs but also holds much of the detailed information about length, conversation, and conversation participants, which may be useful in future Data Group tasks. Thanks to some help from Prof. Jonas, I was able to complete the new script, but I have not yet run it due to some networking issues I have from home in trying to access Caesar. The changes helped me better understand both how Linux works and what I was doing wrong. I think this new script will work better and be safer by not directly affecting some of the more vital files.

Plan

The alliance group is currently working on running three different experiments using three different dictionaries. One will use the original dictionary from last year's experiments regarding dictionaries, from file location SP18/0310/083. Another will use a modified version of that dictionary merged with the current dictionary we are using. Finally, we will run an experiment using the current dictionary edited to remove all instances of laughter modifications, to see if any of these leads to an improvement in the word error rate, especially in regard to how the system handles background noise. We will be running most of these tests as 145-hour trains, since these give the most useful results without taking an inordinate amount of time. We also have a group working on adjusting various variables in the decoder settings to determine whether any of those lead to better results for longer tests, especially since we are still running into an issue where we get a worse word error rate for longer trains, which should not be happening given the way the system is supposed to work. We are hoping the reason we keep getting higher word error rates on longer trains is that we are not changing a certain variable we are simply unaware of.
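For the third experiment, the laughter-free dictionary could probably be produced with a single filtering pass rather than hand edits; this assumes the laughter entries actually contain the word "laughter" in their token, which I would verify first, and the file names here are placeholders:

 # Copy the current dictionary, dropping any entry whose line mentions laughter.
 grep -v -i 'laughter' current.dic > current_no_laughter.dic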

Concerns

One of my biggest concerns is that we only have two weeks of class left and I still feel like we are just starting our base experimentation, so we may not have time to do much in-depth analysis beyond saying which tests give the best results. I worry that this will make it difficult to create a comprehensive report of what we did. Furthermore, I feel a bit lost in the conversation, since most of my time on the Data Group tasks was devoted to taking charge of writing our portion of the proposal and to organizing and analyzing information about the audio files, leaving little time to focus on how the decoder works and how the data is handled by the various systems involved in Sphinx. I am also somewhat wary of whether the new transcripts will actually do anything to improve the word error rate. The original problem they were supposed to solve was that the audio files for the original testing were randomly selected, which led to some of the same speakers being involved in both training and testing. Unfortunately, due to how the conversations are formatted, I was unable to guarantee that none of the speakers in the testing transcript appear in the training transcript.

Week Ending May 6, 2019

Task

This week, the tasks that I worked on were finishing up the documentation and scripting for the new 300-hour transcripts, doing research into ways to improve our dictionary, and looking at the dictionary itself to find various problems and solutions to those problems. Another task was reviewing different pronunciations of common words to make sure we are not overlooking a common pronunciation.

Results

This week I worked with Prof. Jonas to figure out why the script for creating the new 300-hour transcripts was not working even though the written code seemed to have no errors we could identify; the error message we kept getting indicated that a terminator or hidden character was interfering. After a lot of head scratching and struggling, we found that it had something to do with my installation of PuTTY on my personal laptop, or at least that is what we think, because when Prof. Jonas ran the same script from his computer with his installation of the SSH shell, it worked perfectly with none of the hidden characters causing errors. The end result is that the new transcripts have finally been built and will be ready for testing by next year's class. I am still working on finalizing some of the documentation about this and will definitely note this problem in the log write-up for this specific project.
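If the hidden characters really were Windows-style carriage returns picked up while editing the script on my laptop, they should be visible and removable on the server side; the file name below is a placeholder:

 # Show any stray carriage returns: lines edited on Windows will end in ^M.
 cat -v make_transcripts.sh | head
 # Strip the trailing carriage return from every line, in place.
 sed -i 's/\r$//' make_transcripts.sh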

Plan

I plan on pulling the dictionary from my standard testing experiment because that is the base dictionary we have been using throughout the entire semester. I will review it to see if I can find any obvious problems with either phonetic or actual spelling, as well as any odd entries for laughter that might be causing interference. Also, I have an Excel file with information about all of the audio files, including audio duration and file ID, in order to link the audio files with parts of the transcripts. I plan to place this Excel file on the main Caesar server under my directory in its own folder because I feel it contains critical information; if I had had this information at the beginning of the semester, it would have made a lot of the Data Group tasks significantly easier. At least half of the work I did on creating the new transcripts was just figuring out which IDs linked which transcripts and audio files, and which files were of a certain duration. Part of the struggle was that the documentation for the discs we were given had extra files that we are not using, plus a clunky ID system that is not consistent between the transcript files and the listed documentation or audio files. However, once I was able to get everything into an Excel file, it was a lot easier to sort through, since I could use the built-in sort features and other functions to make sure I had files of the correct duration and other such information.

Concerns

One of my biggest concerns is that my PuTTY installation seems to have been causing problems that could have been avoided this whole time, because it added unnecessary hidden characters that caused errors. However, I did not realize it was causing this problem until I ran the script for creating the new transcripts. This leads me to wonder if this problem is why I had difficulty connecting to the VPN through my laptop this entire time, and whether, if I had just reinstalled PuTTY or used a different SSH client, it might have worked. This is an incredibly frustrating problem to come across this late in the semester because, even knowing about it, I am not sure there are any actual solutions: I have reinstalled PuTTY multiple times on the same machine and it still seems to have a lot of VPN connection issues. I eventually got around this by using a different computer, but it is not my school laptop, so it was a hassle to transfer files back and forth between the two machines. I also wonder if the same PuTTY issue caused me to have problems with running experiments and many of the other tasks involved with this project, because I seemed to have an inordinate number of struggles to get some of the basic tasks done. I am wondering if it had to do with this hidden-character error the whole time.