Speech:Spring 2014 Sinisa Vidic Log

From Openitware


Week Ending February 4, 2014

Plan
Task
  • Read Wiki documentation
  • Read all the logs from the System Group members of last year's Capstone class.
  • Research into Fedora 20 for possible OS change for Caesar and all the droids.
  • Familiarize myself with UNIX/Linux OS and commands.
Results

1/30

  • Read Logs
  • Tried to log in to Caesar earlier today but was unsuccessful. I'm assuming the accounts haven't been created yet; I will try to log in again tomorrow.

2/2

  • Downloaded and installed VirtualBox on my local computer.
  • Downloaded and installed Fedora 20 (64-bit) in VirtualBox.
  • Learned some UNIX/Linux commands with the help of Google and the Speech:Unix page.
  • Used the terminal in Fedora 20 to SSH into Caesar with my account. Changed my password with the command passwd and generated an RSA key pair for my account with the command ssh-keygen -t rsa.
  • Was able to SSH into Automatix from Caesar without a password once the key pair was in place.

2/3

2/4

  • Downloaded and installed openSuse 13.1 (64-bit) to get a feel for it compared to Fedora 20.
  • The openSuse download was 4.3 GB compared to only 953 MB for Fedora 20.
  • At first glance Fedora seems to have a lot less software installed than openSuse, which comes with a bunch of games and financial software. This could be good or bad: good in the sense that openSuse probably comes with a lot of the software that Sphinx needs to run properly; bad in that much of this software will be useless for our needs, and a lot of time would be spent uninstalling it to save disk space and improve performance.
  • More research is needed before concluding whether it is worth upgrading to a newer version of the current OS or moving to a completely new OS in Fedora. I plan on researching Sphinx next week, which will give me a better picture of the requirements for running it.
Concerns
  • Having so little knowledge of UNIX/Linux, it is hard to compare the two on the basis of which will be more suitable for our needs without actually installing Sphinx (which needs more research) and running it on both operating systems.

Week Ending February 11, 2014

Task
  • Read logs from last year's class concerning the Fedora upgrade.
  • Read info on running trains and creating experiments
  • Work on the Systems Group portion of the Capstone proposal
Results

2/8

  • Read logs

2/9

  • Read logs
  • Did some research on the KeyGen issue on Fedora for Rome without success. Still waiting for access to Rome; as of right now my account on it hasn't been created.

2/10

  • Worked on the proposal
  • Contacted Valerie Therrien and Arwa Hamdi in regards to tasks assigned.
  • Updated Systems Group log to include the time line and task responsibilities for the group. (All dates are approximated)
  • Googled some more for solutions to our Fedora keygen issue with no luck.
  • Began to set up local Virtual Box to simulate our Caesar and Rome machines.

2/11

  • Spent a lot of time trying to find a solution to the Fedora keygen issue.
  • Setting up SSH on the openSuse and Fedora virtual machines was simple and easy. Installed OpenSSH with the command yum install openssh-server; technically it only updated the version that came pre-installed with Fedora.
  • Started the sshd service with /sbin/service sshd start, which allowed the openSuse box to connect to Fedora over SSH.
  • Created the RSA keys on openSuse with ssh-keygen -t rsa, then made a copy of id_rsa.pub named authorized_keys. Command used: cp -i id_rsa.pub authorized_keys
  • Tried to upload the authorized_keys file to the same user (viper) on Fedora using cat .ssh/id_rsa.pub | ssh viper@192.168.1.5 'cat >> .ssh/authorized_keys' but it didn't work, so I transferred the file the old-fashioned way, using a flash drive.
  • Tried logging in to Fedora but it still asked for the password.
  • Disabled "PasswordAuthentication" in the /etc/ssh/sshd_config file on Fedora. Restarted the SSH service with ssh restart and tried to log in from openSuse. This time an error showed up: Permission denied (publickey,gssapi-with-mic). Googled a lot but was unable to find a working solution.
  • Went a step further and attempted to recreate a full Caesar/Rome network simulation. Read a lot of tutorials on how to set up an NFS server and client. With a lot of trial and error I was able to mount an openSuse /home directory on Fedora to act as a shared user directory.
  • Unfortunately, I made a critical error by accidentally removing Fedora's root files, leaving Fedora usable only on a limited basis. At this point the only way to get it working properly is to re-install Fedora, which I did not have time for.
  • Even though it was an unsuccessful day, I had the privilege of learning a lot in the process.
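The key-copy step above can be sketched as a small shell function. This is only an illustration under assumptions: the function name install_pubkey is made up here, and it is meant to be run on the target machine once the public key file has been transferred there. It creates the .ssh directory if it is missing (a missing directory is one possible reason a remote 'cat >> .ssh/authorized_keys' fails), appends the key only once, and applies the tight permissions that sshd normally insists on.

```shell
#!/bin/sh
# install_pubkey PUBKEY_FILE HOME_DIR   (hypothetical helper, not part of OpenSSH)
# Appends the key in PUBKEY_FILE to HOME_DIR/.ssh/authorized_keys and sets
# the directory and file permissions expected by sshd's default StrictModes.
install_pubkey() {
    pubkey_file=$1
    home_dir=$2
    ssh_dir="$home_dir/.ssh"
    auth_file="$ssh_dir/authorized_keys"

    # Create the directory and file if they do not exist yet.
    mkdir -p "$ssh_dir" || return 1
    touch "$auth_file" || return 1

    # Append only if this exact key line is not already present.
    if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
        cat "$pubkey_file" >> "$auth_file"
    fi

    # Owner-only access on both the directory and the key file.
    chmod 700 "$ssh_dir"
    chmod 600 "$auth_file"
}

# Example (run as the target user on the Fedora box):
#   install_pubkey /path/to/id_rsa.pub "$HOME"
```

Appending instead of overwriting keeps any keys that are already authorized for the account.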
Plan
  • Will try to recreate our KeyGen issue using VirtualBox with openSuse and Fedora running. (by 2/11/14)
Concerns

Week Ending February 18, 2014

Task
  • Finish Brainstorming for the proposal
  • Write the proposal by Sunday(2/16) evening
  • Read information on doing experiments/running a train
  • Attempt to run an experiment on Automatix to get experience and share the information with my teammates
  • Do a little research on the best hard drive format for backing up Linux files
Results

2/14

  • Spent an hour finishing the brainstorming for the Systems Group portion of the proposal that I started a few days ago. I expect to finish and upload the proposal to the wiki by Sunday evening. That will give the rest of the proposal group members time to review it and make any necessary changes before the due date.

2/16

Unfortunately, I was unable to work on the proposal this weekend. I apologize to all my teammates in the Systems Group and the Proposal Group. I will have the proposal done sometime tomorrow (Monday the 17th).

2/17

Due to the setback from yesterday, my plan for today changed. I was planning to start reading the guides on how to create an experiment and run a train using the Sphinx speech recognition software. Since I needed to finish the Systems Group portion of the proposal, that plan is pushed to tomorrow. Today I spent time writing the proposal and uploaded it to Speech:Spring 2014 Proposal#Systems_Group for review by the other Proposal Group members. I have also asked my Systems Group teammates to read it and recommend any changes or additions to the proposal if needed.

2/18

Today I spent time reading all about creating experiment directories and how to set up and run a train and a decoder and create a language model on the following pages: Speech:Exp, Speech:Training, Speech:Run Decode, Speech:Create LM. Creating an experiment directory and a language model is straightforward, unlike running a train and a decoder. The train and decoder seem to be complex beasts and will need to be handled delicately so as not to cause major file and system issues. I will ask the Modeling Group to confirm that the steps laid out in the above four guides are correct, so that the Systems Group can start the first of many experiments to test how well Fedora and openSuse run Sphinx Speech Recognition.

Another task I did was to search for the experiment done on Fedora that produced an "Error Percentage in 20's". First I looked at the Systems Group proposal from last year to see whose responsibility it was to do such an experiment; unfortunately, no such information was posted. Then I decided to search closely through the logs of the four members of last year's Systems Group in the hope that one of them had posted about it if such an experiment was done. That search was unsuccessful as well. I was left with only one more option: manually looking through a hundred or so experiment logs in the Speech:Exps database. In the end no such experiment was found. This could be because it never got recorded (a bunch of experiments are missing), because it was deleted, or because it was never run. The only four recorded experiments that mention Fedora are from Eric Beikman and were done last summer. One of them was setting up the Fedora environment to be able to run a train, and another was adding words to the dictionary to be used for one of the experiments. So technically only two experiments were run that produced an SCLite score, Speech:Exps 0115 and Speech:Exps 0117. Experiment 0115 produced an average error percentage of 40.5, while experiment 0117 did a bit better, producing a score of 33.8. Since no Fedora experiment with an error percentage in the 20's exists, I will need to ask Prof. Jonas for permission to use Eric's experiments to run our tests on the Automatix and Rome machines.

Concerns

Week Ending February 25, 2014

Task
  • Research the experiment logs in hope of finding an experiment with error rate in 20's
  • Use the 20's % error experiment for a new experiment that will be used as a test method for OS war between openSuse and Fedora
Results

2/22

There was a miscommunication between Manager Jonas and me about which experiment to look for as the basis for my Fedora and openSuse testing. I was under the impression that I had to find an experiment that was run on Fedora with an error rate in the 20's, which I was unable to find. In the last Capstone meeting it became clear to me that Manager Jonas was looking for an experiment that had an error rate in the 20's regardless of which OS it was run on. So I went back and researched the experiment logs again. Not only did I find one experiment with a word error rate in the 20's, I found four of them, and on top of that there was an experiment that produced a word error rate in the 10's.

Following Experiments fit the requirement:

2/23

I have been busy all day so I haven't had the time to attempt to run my first experiment but I did manage to create a new experiment log Speech:Exps 0189 for my practice experiment for tomorrow (2/24). I decided to use Speech:Exps 0110 as my base for testing openSuse and Fedora performance of running Sphinx Speech Recognition Software in order to compare the two Operating Systems.

2/24

I noticed that the Speech:Exps 0189 log I created last night was taken over by Forrset. I'm not sure why, but I assume that he had already started an experiment 0189 and didn't create the log beforehand. I had only created the log, so no major damage was done, and he was nice enough to move my log to Speech:Exps 0190.

Here is also a screenshot of some of the issues, including the one where Automatix seemed to restart itself; I don't know why.

Went to log into Automatix with my user name and was prompted to enter a password, which shouldn't have happened since we have passwordless logins on this machine and it worked fine the other day. When I entered my password it wouldn't log in; it replied that the rsa key is wrong, permission denied. Before I could continue with my first experiment attempt I needed to troubleshoot these two bugs. First I reran the keygen command on Caesar to rebuild my rsa keys, as they might have been corrupted. Tried to log in to Automatix again and got the same problem: denied.

Then I decided to log in as root on Automatix and was able to do so. I checked whether my user name is in the /etc/passwd file and it was there. I followed with changing my user's password with the command passwd sfm32 and was successful in doing so. Logged off as root and attempted to log in as my user, and this time I was successful, yet another error was displayed on the terminal: Could not chdir /mnt/main/home/sp14/sfm32: No such file or directory. This time I was sure the problem was that the shared folder from Caesar had been disconnected on Automatix. I ran the command df -h to look at the mounted file systems on the machine, and sure enough there was no shared file system from Caesar. I needed to log back in as root with the command su and then ran the command mount caesar:/mnt/main /mnt/main, which mounted the /mnt/main directory from Caesar and solved all the issues encountered.
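The check done by eye with df -h can also be scripted. The sketch below is an assumption-laden illustration: is_mounted is a made-up helper name, and it reads the kernel's mount table from /proc/mounts (available on Linux) instead of parsing df output. The optional second argument exists only so the function can be tried against a sample file.

```shell
#!/bin/sh
# is_mounted MOUNT_POINT [MOUNTS_FILE]   (hypothetical helper)
# Succeeds (exit status 0) if MOUNT_POINT appears as a mount point in
# MOUNTS_FILE, which defaults to the kernel's mount table.
is_mounted() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # Field 2 of each /proc/mounts line is the mount point.
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}

# Example (as root on a droid), remounting the share only when it is missing:
#   is_mounted /mnt/main || mount caesar:/mnt/main /mnt/main
```

Running a check like this before starting an experiment would surface a dropped share before the confusing login and chdir errors appear.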

Week Ending March 4, 2014

Task
  • Update my Exp 0190 log with results of the experiment
  • Run a new experiment and this time truly based on the Exp 0110
Results

3/3

Last week I set up experiment 0190 to run as my first test experiment. I had planned on running a duplicate of experiment 0110, which was run by Eric last year and produced a best score of 25.2% error, but at the last second I decided to run a brand new experiment. I followed the guidelines on the Speech:Training page to set up the experiment. It took some time to follow the guide, as I was trying to do all the steps carefully. In the end I was able to create an experiment 0190 directory in /mnt/main/Exp/0190, and I successfully updated the Sphinx training configuration file /mnt/main/Exp/0190/etc/sphinx_train.cfg to my experiment's needs. Next I ran the genTrans6.pl script to generate transcriptions and audio files for my experiment. I went with the /mnt/main/corpus/switchboard/mini/train corpus. Then the time came to generate a dictionary with the pruneDictionary2.pl script; this process took some time to complete. The last two steps were to generate a phone list and feats data. To generate the phone list, I copied the /mnt/main/scripts/user/genPhones.csh script into my experiment folder and then ran it. This produced a 0190.phone file in my /mnt/main/Exp/0190/etc directory. Generating feats was even simpler: I ran the make_feats.pl script as follows: /mnt/main/scripts/train/scripts_pl/make_feats.pl -ctl /mnt/main/Exp/0190/etc/0190_train.fileids. This created a 0190_train.fileids file to be used in running a train.

After all that, the time finally came to run the 0190 experiment and hope that no errors had been made during the setup phase. I ran the train script /mnt/main/scripts/train/scripts_pl/RunAll.pl. At first it seemed as if it would be a clean run, but then WARNING after WARNING started popping up on the terminal. All of the warnings came from the dictionary file not having all the words in it, so the script refused to run unless all the words were in the dictionary file. Having looked at the failed experiment log, I found that the dictionary was missing 500 or so words. I did not have the time to add all the missing words to the dictionary, so I decided to copy a dictionary file from experiment 0028, which the Run a Train guide is based on, but that also turned out to be a failure, making my first experiment a failure as well.
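The set of missing words can be listed up front instead of being discovered from training warnings. The following sketch rests on assumptions: missing_words is a made-up name, the transcript is assumed to follow the usual Sphinx layout of "<s> WORDS </s> (utterance_id)" per line, the dictionary is assumed to have the word in the first column with alternate pronunciations marked like WORD(2), and both files are assumed to use the same letter case.

```shell
#!/bin/sh
# missing_words TRANSCRIPT_FILE DICTIONARY_FILE   (hypothetical helper)
# Prints every transcript word that has no entry in the dictionary.
missing_words() {
    trans_file=$1
    dict_file=$2
    dict_words=$(mktemp)
    trans_words=$(mktemp)

    # First column of the dictionary, with "(2)"-style suffixes removed.
    awk '{ sub(/\([0-9]+\)$/, "", $1); print $1 }' "$dict_file" |
        LC_ALL=C sort -u > "$dict_words"

    # One transcript word per line, skipping <s>, </s> and "(utt_id)" fields.
    tr -s ' \t' '\n' < "$trans_file" |
        grep -v -e '^$' -e '^<s>$' -e '^</s>$' -e '^(.*)$' |
        LC_ALL=C sort -u > "$trans_words"

    # Lines that appear only in the transcript word list.
    LC_ALL=C comm -23 "$trans_words" "$dict_words"
    rm -f "$dict_words" "$trans_words"
}

# Example with the file names used by experiment 0190 (assumed layout):
#   missing_words etc/0190_train.trans etc/0190.dic | wc -l
```

An empty result would suggest the dictionary covers the transcript, at least under the format assumptions above.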

3/4

After gaining some experience with setting up an experiment, it was finally time to attempt to run a duplicate of Speech:Exps 0110 on the Rome machine in order to test the Fedora OS with Sphinx Speech Recognition. Right off the bat, there are no instructions or guides on the wiki that deal with duplicating an experiment and running it. So my train of thought was to copy the entire Exp/0110 directory into a new experiment directory, which I named 0206. The copying took a while, but all the files were successfully copied into the new experiment directory. Next, I edited the sphinx_train.cfg file to include the new experiment #. Then I renamed all the files starting with 0110 to start with 0206. I was a bit surprised that there were only two such files, because when I was doing my test experiment from scratch there were five such files. I proceeded to run the train /mnt/main/scripts/train/scripts_pl/RunAll.pl for this new-old experiment. As expected, I received an error stating that it couldn't find a dictionary file; of course it couldn't find it when experiment 0110 never had one. At this point I went back and looked at the experiment 0110 log, which stated that it was based on experiment 0107. I went into the experiment 0107 directory and inside the /etc folder I found the dictionary, phone, and fillers files that were missing from experiment 0110. I copied all three files and placed them into my experiment 0206/etc folder. I ran the train script again from my base experiment folder and this time it worked. A number of WARNINGS were showing, but no major ERROR that would cause the train script to fail. Well, it seems I started my celebration a bit prematurely, as after five or so minutes the train script stopped with an ERROR claiming that a file was missing; to be specific, it was the /mnt/main/Exp/0206/trees/0206.unpruned/ER2-0.dtree file.
Looking inside the /trees folder, there was a file with the name 0206.unpruned but not with the name given in the terminal, 0206.unpruned/ER2-0.dtree. This is where I was forced to stop any further testing, as I didn't know how to fix such an issue.

I feel like there is a simpler way of duplicating and running an experiment without copying files from multiple previous experiments. Hopefully someone from Modeling Group will have the time to explain and show me how to do it in our next meeting.

Week Ending March 18, 2014

Task
  • Contact Modeling Group about an experiment that System Group could use to do our experiments with
  • Use the suggested experiment from Modeling Group to base my new experiment on Rome machine
  • Update the group on my findings and help them run their own experiments
Results

3/13

I emailed my group colleagues to find out where they are at the moment with their tasks. I suggested a group plan for this week that includes finishing the updates to the necessary wiki guides, finding a solution to our Fedora key-gen issue, fixing the previously installed backup drive on Rome, attempting a full Caesar backup (if time permits after Wednesday's meeting), and successfully running an experiment on Rome.

I have also emailed Colby Johnson regarding an experiment that the System Group could use to test the Fedora OS on Rome. Unlike my previous attempt with experiment 0110, which was missing a lot of files and whose train script threw an error while running, this new experiment needs to be complete and require only minimal modifications in order to run successfully.

3/16

Colby J. suggested that we at the Systems Group use Speech:Exps 0168 to do our Fedora OS experiments on Rome. Today I created a new Exp entry in the Experiment Logs section, Speech:Exps 0210, and I set up Exp 0210 inside the /mnt/main/Exp directory. Colby wasn't sure about the script that supposedly helps in duplicating an experiment, so I decided to manually copy the Exp 0168 directory into a new Exp 0210 directory using the cp -r command. As usual, this process took a few minutes to complete due to the vast number of files. Once it completed, I edited the Exp parameters inside sphinx_train.cfg to reflect Exp 0210. Then I changed the prefix of the dictionary, phone, trans, filler, and fileids files from 0168 to 0210. Tomorrow I will run the train on this experiment and hope it completes successfully, unlike my previous two attempts.
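The manual copy-and-rename procedure described above can be sketched as one function. This is not the duplication script Colby mentioned (that script was not located); clone_exp is a made-up name, and the blanket replacement of the old ID inside sphinx_train.cfg is an assumption that should be checked by reviewing the edited file, since it would also change any unrelated value that happens to contain the same digits.

```shell
#!/bin/sh
# clone_exp EXP_ROOT OLD_ID NEW_ID   (hypothetical helper)
# Copies EXP_ROOT/OLD_ID to EXP_ROOT/NEW_ID, renames the etc/ files that
# start with OLD_ID, and rewrites the ID inside etc/sphinx_train.cfg.
clone_exp() {
    exp_root=$1
    old_id=$2
    new_id=$3
    src="$exp_root/$old_id"
    dst="$exp_root/$new_id"

    if [ ! -d "$src" ]; then
        echo "no such experiment: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "refusing to overwrite: $dst" >&2
        return 1
    fi

    cp -r "$src" "$dst" || return 1

    # Rename dictionary, phone, transcript, filler and fileids files.
    for f in "$dst"/etc/"$old_id"*; do
        [ -e "$f" ] || continue
        base=${f##*/}
        suffix=${base#"$old_id"}
        mv "$f" "$dst/etc/$new_id$suffix"
    done

    # Point the training configuration at the new experiment ID.
    cfg="$dst/etc/sphinx_train.cfg"
    if [ -f "$cfg" ]; then
        sed "s/$old_id/$new_id/g" "$cfg" > "$cfg.tmp" && mv "$cfg.tmp" "$cfg"
    fi
}

# Example:
#   clone_exp /mnt/main/Exp 0168 0210
```

The source experiment is left untouched, so a failed attempt can be deleted and repeated safely.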

3/17

I ran a train on the Exp 0210 that I set up yesterday. The train finished successfully, or at least I think it did (there was no confirmation message). During the train process there were a lot of ERROR and WARNING messages, but they didn't seem to bother the train process, as it ran for about an hour and ten minutes, the same time span as Exp 0168. I'm glad that I was finally able to run a train from start to finish. On to running a decoder, which I have no experience with; hopefully there won't be any errors.

One concern: Exp 0168 already had a language model and it was copied over to my Exp 0210, so now I don't know whether I need to create a new language model or run with the one already there. I'll email Colby for clarification before I continue with Exp 0210.

3/18

My first attempt at creating a new language model by running the lm_create.pl script generated only three files, compared to Exp 0168's six files. Rerunning the script without changing anything generated all the files; I don't know why the first attempt failed.

Trying to run a decoder was futile, as it turned out Rome was missing files in the /usr/local directory. Running the run_decode2.pl script produced an error saying that sphinx3_decode doesn't exist inside /usr/local/bin; in fact the folder was completely empty. Colby was also baffled by this, since Caesar had the file at that location. At this point I decided to manually copy Caesar's /usr/local directory over to Rome. Now, when running the run_decode2.pl script, a new error showed up in decode.log: /usr/local/bin/sphinx3_decode: error while loading shared libraries: libs3decoder.so.0: cannot open shared object file: No such file or directory. This was a confusing error considering that libs3decoder.so.0 was inside its normal /usr/local/lib directory. While looking through the /mnt/main/ directory, a local folder caught my eye. The files inside it looked to be the same as in Caesar's /usr/local, and then it hit me that /usr/local is most likely a soft link to /mnt/main/local/ that somehow got disconnected. I logged back in as root, removed the local folder I had previously copied into Rome's /usr/ directory, and ran the command ln -s /mnt/main/local /usr/local, which created the soft link to /mnt/main/local.

Of course, this didn't solve the issue of not finding libs3decoder.so.0. I knew from reading his logs earlier in the semester that Eric was the one who installed Fedora on Rome, so I went back to his log in the hope of finding a solution to this issue. In less than five minutes I hit the JACKPOT. He had indeed had the same issue and provided a solution to it.

Eric Beikman Solution:

  • When starting the decode script, I encountered another issue:
    • /usr/local/bin/sphinx3_decode: error while loading shared libraries: libs3decoder.so.0: cannot open shared object file: No such file or directory
  • Previously I was able to resolve this issue on rome by setting the "$LD_LIBRARY_PATH" to /usr/local/lib
    • I have a better solution now:
      1. Make a file located at and called: /etc/ld.so.conf.d/sphinx.conf
      2. Add /usr/local/lib to the file.
      3. Execute sudo ldconfig to reload the shared libraries.

I didn't need to do the first two steps, since Rome already had the /etc/ld.so.conf.d/sphinx.conf file with the /usr/local/lib entry inside it. I only ran sudo ldconfig, which solved the issue. Now I was able to run the run_decode2.pl script.
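Whether the first two steps of Eric's fix are already in place can be checked before touching anything. The sketch below is illustrative only: lib_dir_registered is a made-up name, and the second argument exists just so the function can be tried against a scratch directory instead of the real /etc/ld.so.conf.d.

```shell
#!/bin/sh
# lib_dir_registered LIB_DIR [CONF_DIR]   (hypothetical helper)
# Succeeds if LIB_DIR is listed on a line of its own in any *.conf file
# under CONF_DIR (default: /etc/ld.so.conf.d).
lib_dir_registered() {
    lib_dir=$1
    conf_dir=${2:-/etc/ld.so.conf.d}
    # -x whole-line match, -F fixed string, -q quiet, -s hide file errors
    grep -qsxF "$lib_dir" "$conf_dir"/*.conf
}

# Example on Rome: rebuild the linker cache only if the entry exists,
# then confirm the decoder library is visible to the dynamic linker.
#   lib_dir_registered /usr/local/lib && sudo ldconfig
#   ldconfig -p | grep libs3decoder
```

If the check fails, the conf file from steps 1 and 2 still has to be created before ldconfig will pick up /usr/local/lib.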

Special Thanks To

Colby Johnson

Week Ending March 25, 2014

Task
  • Look into key-gen issue on Rome
  • Look into "Broken Pipe" issue while running the Decoder on Rome from my VirtualBox System
  • Solve the backup drive issue on Rome after class on Wednesday
Results

3/24

  • Summary for the last few days as I was unable to log them in during that time period.
    • 3/19
      • After having my experiment 0210 crash twice while decoding from my VirtualBox system, I was able to successfully complete the decode during the afternoon. The issue that I encountered at home while running the decoder is Write failed: Broken pipe. The decoder would run fine for a couple of hours but then it would crash; specifically, I would get booted off the Rome and Caesar systems with only the "Broken Pipe" message in my terminal. While I was discussing the issue with the Modeling Group, who were baffled by it as they had never experienced such an issue, I decided to rerun the decoder from one of the P132 room computers. After five or so hours the decoder successfully finished its job and I was able to score it with SClite. The results were similar to the original experiment 0168.
    • 3/22
      • I looked into the Broken Pipe issue, and it seems to occur when neither the host nor the client sends any data over the SSH connection for a specific period of time, at which point the connection is dropped. The suggested solution is to add ServerAliveInterval 120 to /etc/ssh/ssh_config on the client side, or ClientAliveInterval 120 to /etc/ssh/sshd_config on the server side. I have not tried the solution yet, so I'm not sure if it will solve my Broken Pipe issue.
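A client-side version of the keepalive setting could be applied as sketched below. The assumptions are labeled: enable_keepalive is a made-up name, and it writes to the per-user ~/.ssh/config file instead of the system-wide /etc/ssh/ssh_config mentioned above, so no root access is needed. It has not been confirmed that this cures the Broken Pipe crashes seen during decoding.

```shell
#!/bin/sh
# enable_keepalive [CONFIG_FILE] [SECONDS]   (hypothetical helper)
# Adds a ServerAliveInterval default to an OpenSSH client config file
# unless one is already set there.
enable_keepalive() {
    config_file=${1:-$HOME/.ssh/config}
    interval=${2:-120}

    # Leave the file alone if the option is already configured.
    if grep -qi '^[[:space:]]*ServerAliveInterval' "$config_file" 2>/dev/null; then
        return 0
    fi

    mkdir -p "$(dirname "$config_file")"
    # "Host *" applies the setting to every host without a more specific value.
    printf 'Host *\n    ServerAliveInterval %s\n' "$interval" >> "$config_file"
}

# Example:
#   enable_keepalive                 # 120 seconds in ~/.ssh/config
#   ssh -o ServerAliveInterval=120 sfm32@caesar   # one-off alternative
```

For multi-hour decodes, starting the job under nohup on the server is another commonly used safeguard, since the job then keeps running even if the SSH session drops.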

As for today, 3/24, I spent a few hours trying to solve the passwordless login on Rome. I tried changing a bunch of parameters inside the /etc/ssh/sshd_config file, changed permissions on the ~/.ssh folder and the files inside it, and on top of that disabled the firewall (I think I did), but the result was the same: it keeps asking for a password. Reading forums that deal with setting up keygen didn't provide me with any working solution. I don't know what the problem could possibly be. The sshd_config files on Caesar and Automatix seem to have settings identical to Rome's.

3/25

Spent some time researching how to format a new hard drive in Linux using the terminal. While using some suggested commands to check the current hard drives, I found that the hard drive we installed several weeks back is in fact being recognized by the system, which is weird because on the day we installed it the system didn't recognize it. It reads as 300 GB, so it is the one we installed. I'm hoping that after our meeting tomorrow we will be able to format it, as it seems to be a straightforward process, nothing too fancy.

Concerns

The System Group had agreed to work on the backups after the meeting on Wednesday, but because the rest of my colleagues left, I was unable to do any work on the Rome backup drive.

Week Ending April 1, 2014

Task
  • Work on getting passwordless login on Rome fixed (keygen)
Results

3/29

Systems Group successfully formatted the 300 GB hard-drive that was installed several weeks ago on Rome. This hard-drive will be used to backup our Sphinx Speech Recognition Experiments located in the /mnt/main/Exp directory.

I took Prof. Jonas's suggestion of changing my user's home directory on Rome in order to test whether the keygen issue might be related to the shared user directory on the system. I edited the /etc/passwd file to change my home directory from /mnt/main/home/sp14/sfm32 to /tmp/sfm32, then copied the authorized_keys file over to the .ssh directory in my new home. I tried logging in from Caesar and was automatically logged in; the keygen worked. I have finally made progress on the keygen issue: it seems that there is a problem with our shared home directory. While openSuse seems to work fine with our current setup, I'm not exactly sure why Fedora is having an issue with it. Maybe it doesn't like some of the permissions on the folders in which our home directories are located. I'll have a look at them tomorrow.

3/30

I started my day where I left off last night, checking directory and file permissions. Changing directory permissions inside /mnt/ did not solve the issue as I had hoped. After researching more and looking around Rome's logs, I think I have narrowed down what is causing our passwordless issue. Unlike openSuse, Fedora comes with the SELinux access control manager, which controls security policies on the files and folders of the system. One particular error message involving SELinux and the authorized_keys file kept popping up in the /var/log/messages log.

 SELinux is preventing /usr/sbin/sshd from read access on the file authorized_keys.

Unfortunately, SELinux is new to me; it will take some time to research it and hopefully understand enough to solve the issue. The little research that I have done on the above error message turned up a lot of solutions, but none of them seemed to work when I implemented them on Rome. I'm afraid that it might not be possible to change SELinux policies on the /mnt/main directory, because it is a shared directory that belongs to a different Linux OS.

3/31

Well, I have finally tracked down the culprit behind the passwordless issue on Rome. It was in fact SELinux, the access control policy mechanism on Fedora, that was preventing a network-shared user directory from being used as such. After many hours spent researching, changing permissions, and changing ssh_config and sshd_config file parameters in the hope of solving the issue, it turns out only one parameter needed to be changed in the SELinux policy. The parameter is use_nfs_home_dirs. By default this parameter is turned off in Fedora; a simple one-line command changes it to true/on.

setsebool -P use_nfs_home_dirs 1 <--- the 1 at the end signals that the parameter should be switched on.
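A quick way to confirm this kind of diagnosis is to count the matching SELinux denials in the system log before and after flipping the boolean. The sketch below is illustrative: count_sshd_denials is a made-up name, and the pattern is taken from the error message quoted in the 3/30 entry. The getsebool and setsebool commands in the comments are the standard SELinux tools and must be run as root on the Fedora host.

```shell
#!/bin/sh
# count_sshd_denials [LOG_FILE]   (hypothetical helper)
# Prints how many log lines report SELinux blocking sshd from reading
# an authorized_keys file.
count_sshd_denials() {
    log_file=${1:-/var/log/messages}
    grep -c 'SELinux is preventing .*sshd.*authorized_keys' "$log_file"
}

# On Rome (as root), the boolean can be inspected and then made persistent:
#   getsebool use_nfs_home_dirs
#   setsebool -P use_nfs_home_dirs 1
# If the count stops growing after new login attempts, the boolean was
# very likely the cause.
```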

Week Ending April 8, 2014

Task
  • Solve Gnome issue on Obelix - login screen won't load.
  • Help my Justice League group with running a 100 hour train
Results

4/6

  • The Obelix machine is experiencing an issue when booting. It will only boot in "fail" mode and hangs when attempting to boot in regular GUI mode. This is a particular inconvenience if we need to reboot the machine for whatever reason: we lose the ability to ssh into the machine because it never finishes loading the system, which prevents the ssh port from opening. The only solution is for someone to be in the server room to manually start ssh and mount our NFS share in fail safe mode to restore remote access to the machine.

I have looked into all the possible logs on Obelix in search of any clues as to why it suddenly stopped loading in regular mode. Based on the logs and the Google searches I have conducted, I feel the issue has to do with either gnome, the gdm service, xorg, a bad video card, or a bad video driver. Here are some interesting log outputs I found.

  • /var/log/Xorg.0.log - command ran egrep "EE|WW" /var/log/Xorg.0.log
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[245398.665] (WW) The directory "/usr/share/fonts/TTF/" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/OTF/" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/Type1/" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/100dpi" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/cyrillic" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/misc/sgi" does not exist.
[245398.675] (II) Loading extension MIT-SCREEN-SAVER
[245398.680] (WW) Warning, couldn't open module fglrx
[245398.680] (EE) Failed to load module "fglrx" (module does not exist, 0)
[245398.725] (WW) Falling back to old probe method for fbdev
[245398.726] (WW) Falling back to old probe method for vesa
[245398.773] (WW) MACH64(0): Cannot shadow an accelerated frame buffer.
[245398.788] (WW) MACH64(0): DRI static buffer allocation failed -- need at least 12800 kB video memory
[245398.901] (EE) No input driver/identifier specified (ignoring)
[245398.901] (EE) No input driver/identifier specified (ignoring)

fglrx is the Linux driver for ATI video cards, which is what the Obelix server uses. The full name of the graphics card is: ATI Technologies Inc Rage XL (rev 27)
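One detail of the egrep command used above: the pattern "EE|WW" matches any line that merely contains those letters, which is why the (II) line about MIT-SCREEN-SAVER and the legend line show up in the output. A tighter filter could look like the sketch below; xorg_summary is a made-up name, and the pattern assumes the "[timestamp] (EE)" line layout seen in the excerpt.

```shell
#!/bin/sh
# xorg_summary [LOG_FILE]   (hypothetical helper)
# Counts Xorg log lines tagged (EE) or (WW) directly after the timestamp,
# ignoring the legend line and lines that only contain "EE" or "WW".
xorg_summary() {
    log_file=${1:-/var/log/Xorg.0.log}
    errors=$(grep -c '^\[[^]]*] (EE)' "$log_file")
    warnings=$(grep -c '^\[[^]]*] (WW)' "$log_file")
    echo "errors=$errors warnings=$warnings"
}

# To list the error lines themselves instead of counting them:
#   grep '^\[[^]]*] (EE)' /var/log/Xorg.0.log
```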

  • /var/log/boot.msg
<notice -- Apr  3 21:15:30.708364000> service cron donedone
<notice -- Apr  3 21:15:30.708697000> service smartd startStarting smartd 
<notice -- Apr  3 21:15:30.915243000> service smartd donedone
<notice -- Apr  3 21:15:30.915740000> service stoppreload start<notice -- Apr  3 21:15:30.951351000> service stoppreload donedone
Master Resource Control: runlevel 5 has been reached
Failed services in runlevel 5: vboxadd vmtoolsd
Skipped services in runlevel 5: cifs xdm

The vboxadd, vmtoolsd and cifs services shouldn't be causing the Obelix issue. xdm, on the other hand, is involved in the graphical process of the system. Information on xdm suggests that any error with it will be logged in the /var/log/xdm.log file, but no such file exists on Obelix.

  • /var/log/gdm/:0-slave.log
Driver not XRANDR 1.2 capable, ignoring DISPLAYMANAGER_RANDR_MODE_* settings /etc/X11/xdm/Xsetup: line 147: /usr/bin/hal-find-by-property: No such file or directory


The only way to test any of the solutions I found online is to reboot the machine and see if I'm able to log in to it via ssh. This is a huge risk, because if the solution is unsuccessful I lose the connection to Obelix. So I decided to find a copy of openSuse 11.3 and run it on my virtual machine to see if I can recreate the Obelix problem and, at the same time, see what happens when I apply some of the online solutions to my copy of the OS. So far I'm unable to recreate the problem, but installing a fresh copy of openSuse 11.3 in VirtualBox gave me the ability to compare the two systems' config files. The result was that the gnome and xorg files are identical. This makes me believe that the issue might be a bad graphics card or a bad graphics driver.

4/7

Changing the display manager from gdm to xdm in /etc/sysconfig/displaymanager allowed a GUI login on Obelix. This tells us that the issue is most likely Gnome related. This will be looked at in more detail tomorrow. The bad news is that the same issue is occurring on other machines, which suggests that Torque might be the reason behind it.

The second issue Obelix faces is that sshd isn't running on boot. Looking at the service scripts inside the /etc/init.d/ directory, the sshd script was there but it was empty. I'm not sure why or how the contents of the sshd file got removed. I copied Caesar's sshd script over to Obelix and tested it with the command service sshd status, which reported that sshd is running. The only way to test whether it works on boot is to reboot Obelix, and that has not been done yet.

4/8

It turns out that the missing sshd service script was indeed the reason why ssh wasn't running at boot on Obelix. With the sshd service fixed, my focus shifts to figuring out how this happened and what is causing the gnome issue on 8 out of 10 machines. Unfortunately, I wasn't able to work on these issues because a new and more dangerous issue has appeared. For the last 3-4 days Caesar has been acting slow: ssh authentication was taking longer than usual, and once inside, the system was sluggish. I decided to look in Caesar's /var/log/messages log to see if anything there could explain the sluggish performance. What I saw was a huge red flag: thousands of failed logins for root and for fake users. The attack was coming from multiple IP addresses, and an IP look-up website showed that the addresses are located in China. The next step is to block those IP addresses and change the root password just in case Caesar is compromised.

Week Ending April 15, 2014

Task
  • Investigate the cause of Gnome failing to load on 8 machines
  • Look into the best method of blocking attacks on Caesar
  • Research Sphinx parameters in hope of finding the right balance for our group's experiments
  • Try to run my own experiment to help the Justice League group find the best balance between parameters
Results

4/12

Caesar is still being bombarded with fake SSH login attempts. Looking through some of the older logs on Caesar, these attacks seem to have been going on for months. Even though it doesn't look like any of the attempts were successful, action needs to be taken. I have already made a list of 10 or so IP addresses from which the attacks were coming, but I feel that manual IP banning isn't the right approach, as it requires constant log checking and adding IPs to a ban list. What is needed is an application that automatically bans an IP that has multiple failed login attempts in a very short time span. Linux has such an application and it's called fail2ban. This little helper constantly parses one or more logs in search of failed logins, and if there are multiple failed logins in a short period of time it adds the source IP address to the firewall's ban list. First, I'll test the application on my virtual system before suggesting it to Prof. Jonas for implementation on Caesar.
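To make the idea concrete, here is a minimal Python sketch of the kind of check described above: count failed logins per source IP inside a sliding time window and flag the IP once a limit is reached. This is not fail2ban's actual code. The log line format, the function name find_offenders, and the default limit and window are assumptions for illustration, and the sketch only reports IPs instead of touching the firewall.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Assumed syslog-style sshd line; real formats differ between distributions.
FAILED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8}) .*sshd\[\d+\]: Failed password for "
    r"(?:invalid user )?\S+ from (?P<ip>\d+\.\d+\.\d+\.\d+)"
)

def find_offenders(lines, max_retry=5, find_time=timedelta(minutes=10), year=2014):
    """Return the set of IPs with max_retry failures inside find_time."""
    recent = defaultdict(deque)   # ip -> timestamps of recent failures
    offenders = set()
    for line in lines:
        m = FAILED.match(line)
        if not m:
            continue
        # Syslog timestamps carry no year, so one is supplied (assumption).
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        window = recent[m.group("ip")]
        window.append(ts)
        # Drop failures that fell out of the time window.
        while window and ts - window[0] > find_time:
            window.popleft()
        if len(window) >= max_retry:
            offenders.add(m.group("ip"))
    return offenders
```

A real tool would follow the log as it grows and call the firewall for every flagged address; here the caller simply passes in the lines it wants checked.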

Fail2Ban website www.fail2ban.org

4/14

Spent the day researching Sphinx in hope of finding any additional information that will help us lower our error rate. Found an interesting Sphinx guide which goes into detail on how Sphinx works (train and decode) and on its file system. Due to the current competition with the Avengers, I will not be posting the link to the guide in this log for now. I have shared the information with my group and will update this log after the competition to include the link to the guide.

4/15

  • Downloaded and installed the fail2ban application on my virtual openSuse 11.3 machine. The process was easy and took about 15 minutes from start to finish. Configuring it was also simple, as it already comes with a lot of preset configurations for different protocols. Testing the product by trying fake logins from my other Virtual Box machine was a success. It automatically banned the IP address after 5 failed attempts. It will be a great application to have on Caesar in order to protect it from all of these SSH attacks.
  • For some odd reason all of our machines have the wrong time and some even have the wrong date. So I went to each machine and updated the date and time using two commands: date +%D -s 2014-04-15 && date +%T -s 01:17:00.
  • Researched some more on possible causes of our gnome issue and haven't found anything concrete yet. There are multiple errors and warnings inside the log files on each machine, but without being in the server room to test some of the online suggestions, it's difficult to pinpoint what is preventing the gdm display manager from loading. One constant error that I see across all machines is the GTK library issue.
The command gnome-about --gnome-version should return only the gnome version, but it also returns all of these warnings. Googling some of them hasn't turned up any solution that I trust.
/usr/lib/python2.6/site-packages/gtk-2.0/gtk/__init__.py:57: GtkWarning: could not open display
  warnings.warn(str(e), _gtk.Warning)
/usr/bin/gnome-about:828: Warning: invalid (NULL) pointer instance
  gtk.RESPONSE_CLOSE))
/usr/bin/gnome-about:828: Warning: g_signal_connect_data: assertion `G_TYPE_CHECK_INSTANCE (instance)' failed
  gtk.RESPONSE_CLOSE))
/usr/bin/gnome-about:828: GtkWarning: gtk_settings_get_for_screen: assertion `GDK_IS_SCREEN (screen)' failed
  gtk.RESPONSE_CLOSE))
/usr/bin/gnome-about:828: Warning: g_object_get: assertion `G_IS_OBJECT (object)' failed
  gtk.RESPONSE_CLOSE))
/usr/bin/gnome-about:828: Warning: value "TRUE" of type `gboolean' is invalid or out of range for property `visible' of type `gboolean'
  gtk.RESPONSE_CLOSE))
Version: 2.30.0
Distributor: SUSE
  • Looked at 2 dozen log files from our group's 100hr experiment to find out why we are getting a high number of errors and warnings while decoding. Unfortunately it seems that some of the logs aren't being recorded as they should be; maybe there is a parameter responsible for this. Nonetheless, I did manage to find some errors and warnings that might be of help. I have shared that information with my group members.

Week Ending April 22, 2014

Task
  • Report to Jonas a list of the IPs attackers are using for SSH login attempts
  • Run mini train experiments in order to figure out best parameters for train and decode
Results

4/18

On Wednesday I had a chat with Jonas about the fail2ban app and how it could help control our Caesar problem. He said there was no need to use it at the local level because the UNH system already has such an application at the system-wide level. Jonas then asked me to compile a list of all the suspected IP addresses and email it to him. He would then forward it to the UNH System Administrators so that the IP addresses can be banned.

So I parsed the /var/log/messages log for the latest attacks on Caesar, recorded the IP addresses which produced multiple failed logins, and sent the report to Jonas so the IPs can be banned at the UNH level.
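A report like this can be produced with a short script. The Python sketch below is only an illustration of the approach, not the exact steps I used: the log line pattern, the function name suspicious_ips and the min_failures threshold are assumptions, and the real sshd messages in /var/log/messages may be worded differently.

```python
import re
from collections import Counter

# Assumed wording of a failed sshd login; adjust to the actual log format.
FAILED_FROM = re.compile(
    r"sshd\[\d+\]: Failed password for .* from (\d{1,3}(?:\.\d{1,3}){3})"
)

def suspicious_ips(lines, min_failures=3):
    """Return (ip, count) pairs with at least min_failures, busiest first."""
    counts = Counter()
    for line in lines:
        m = FAILED_FROM.search(line)
        if m:
            counts[m.group(1)] += 1
    return [(ip, n) for ip, n in counts.most_common() if n >= min_failures]

if __name__ == "__main__":
    # Reading the log usually requires root privileges.
    with open("/var/log/messages", errors="replace") as log:
        for ip, n in suspicious_ips(log):
            print(f"{ip}\t{n} failed logins")
```

Sorting by count puts the most active sources at the top of the list, which makes the report easier to act on.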

4/19

  • Researched Gnome in hope of shedding some light on our issues with it. So far still nothing concrete.
  • Read about Sphinx parameters

4/21

Thanks to David, the new way of creating experiments is simple and needs very little configuration. I created an experiment with the mini/train corpus data and ran a train, which completed successfully. Looking at the 010.html log file, I see that there are still errors and warnings that occur while training. I researched some of them and wasn't able to find any useful information on how to solve those problems. Sphinx 3 logs aren't the best, as they lack many details about errors, making it hard to figure out the cause.

A couple of files (baum_welch.c & accum.c) are mentioned in /logdir/30.cd_hmm_untied/010.1-1.bw.log whenever there is an error or warning, such as these:

ERROR: "baum_welch.c", line 331: sw2001A-ms98-a-0049 ignored
WARNING: "accum.c", line 626: The following senones never occur in the input data
       120 121 122 123 124 125 126 127 128 132
       133 134 135 136 137 138 139 140 141 142
       143 150 151 152 153 154 155 174 175 176
       177 178 179 180 181 182 183 184 185 189
       190 191 195 196 197 201 202 203 204 205
       206 216 217 218 219 220 221 222 223 224

So I went through the Sphinx3 source code on sourceforge.net in order to find those files and look at the code at the reported lines. I was able to find the files, but they are written in C, which will take me some time to understand. I have shared my findings with my group in case Forrest, David or somebody else is more familiar with the C language and is able to understand the errors that are occurring during training.

Here is the link to the source files http://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/sphinxtrain/src/programs/bw/

4/22

  • Testing a new training method suggested by Forrest. The process is similar to the original experiment setup, but this time we have a special script which inserts the mono channel data into our experiment to be trained on. I think the process could be even simpler if it were incorporated into David's script, which worked great when I tested it yesterday.

Week Ending April 29, 2014

Task
  • This week all of the focus is on running experiments.
Results

4/23

Ran 2 experiments, one with the regular mini/train data set and the other with the mini/mono data set, created by eliminating the stereo channel. The results were not satisfying: the error rate from mini/mono was almost double the error rate of mini/train.

Update with results:

Exp: 0252/010
Input
Data: mini/train
Density: 16
Senones: 570
Results
SENTENCE ERROR: 87.6% (481/549)   WORD ERROR RATE: 20.4% (1975/9699)
RTx: 0.34 
Exp: 0252/014
Input
Data: mini/mono
Density: 16
Senones: 570
Results
TOTAL Words: 9762 Correct: 6086 Errors: 3703
TOTAL Percent correct = 62.34% Error = 37.93% Accuracy = 62.07%
TOTAL Insertions: 27 Deletions: 2283 Substitutions: 1393
RTx: 0.38
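
For reference, the TOTAL lines above are consistent with the usual word-alignment scoring, where Errors is the sum of insertions, deletions and substitutions, and the percentages are taken against the total number of reference words. The short Python sketch below reproduces the 0252/014 numbers under that assumption. The function name score is made up for this log and is not part of Sphinx.

```python
def score(total_words, correct, insertions, deletions, substitutions):
    """Recompute the summary percentages from the raw alignment counts."""
    errors = insertions + deletions + substitutions
    return {
        "errors": errors,
        "percent_correct": round(100 * correct / total_words, 2),
        "error": round(100 * errors / total_words, 2),
        # Accuracy is 100% minus the error rate, so insertions lower it
        # even though they do not reduce the number of correct words.
        "accuracy": round(100 * (1 - errors / total_words), 2),
    }

# Counts copied from Exp 0252/014 (mini/mono).
print(score(total_words=9762, correct=6086,
            insertions=27, deletions=2283, substitutions=1393))
```

This is also why Percent correct and Accuracy differ: Percent correct ignores insertions, while Accuracy counts them as errors.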

4/25

Since the mini data set experiments were not what we expected, and David, Forrest and Pauline are at a competition for the next couple of days, I decided to run a few more experiments comparing the mono audio to the original stereo audio. This time I went with the first_5hr data set, and the results of the two experiments are almost identical, with the first_5hr/mono data giving slightly better results than the first_5hr/train data.

Exp: 0252/020
Input
Data: first_5hr/mono
Density: 25
Senones: 1777
Results
TOTAL Words: 60084 Correct: 49485 Errors: 18287
TOTAL Percent correct = 82.36% Error = 30.44% Accuracy = 69.56%
TOTAL Insertions: 7688 Deletions: 3763 Substitutions: 6836
RTx: 1.22
Exp: 0252/021
Input
Data: first_5hr/train
Density: 25
Senones: 1777
Results
TOTAL Words: 60084 Correct: 49348 Errors: 18484
TOTAL Percent correct = 82.13% Error = 30.76% Accuracy = 69.24%
TOTAL Insertions: 7748 Deletions: 3801 Substitutions: 6935
RTx: 1.21
I'm not sure about the real-time (RTx) numbers. I might be looking at the wrong log data, so I will have to check with the group to make sure.

Due to the results above I went ahead and began another first_5hr/mono experiment, but this time I set the density and senone values based on experiment 0199/d32/s3000. That experiment produced an Error Rate of 17.2 with 32 density and 3000 senones. If the results of my experiment 0252/022 are better than those of 0199/d32/s3000, then we are on the right track with splitting stereo audio into mono and training on it.

4/28

Exp: 0252/022/001
Input
Data: first_5hr/mono
Density: 32
Senone: 3000
NPART: 1
Results
TOTAL Words: 60084 Correct: 53123 Errors: 13626
TOTAL Percent correct = 88.41% Error = 22.68% Accuracy = 77.32%
TOTAL Insertions: 6665 Deletions: 2830 Substitutions: 4131
RTx: 1.82
Exp: 0252/022/002
Input
Data: first_5hr/mono
Density: 32
Senone: 3000
NPART: 2
Results
TOTAL Words: 60084 Correct: 53097 Errors: 13702
TOTAL Percent correct = 88.37% Error = 22.80% Accuracy = 77.20%
TOTAL Insertions: 6715 Deletions: 2889 Substitutions: 4098
RTx: 1.77
Exp: 0252/022/003
Input
Data: first_5hr/mono
Density: 64
Senone: 3000
Results
TOTAL Words: 60084 Correct: 55083 Errors: 10044
TOTAL Percent correct = 91.68% Error = 16.72% Accuracy = 83.28%
TOTAL Insertions: 5043 Deletions: 2177 Substitutions: 2824
RTx: 4.94
Exp: 0252/022/004
Input
Data: first_5hr/mono
Density: 32
Senone: 4000
Results
TOTAL Words: 60084 Correct: 53597 Errors: 12943
TOTAL Percent correct = 89.20% Error = 21.54% Accuracy = 78.46%
TOTAL Insertions: 6456 Deletions: 2790 Substitutions: 3697
RTx: 3.89
Exp: 0252/022/005
Input
Data: first_5hr/mono
Density: 64
Senone: 4000
Results
TOTAL Words: 60084 Correct: 53543 Errors: 11209
TOTAL Percent correct = 89.11% Error = 18.66% Accuracy = 81.34%
TOTAL Insertions: 4668 Deletions: 3041 Substitutions: 3500
RTx: 4.90

Week Ending May 6, 2014

Task


Results


Plan


Concerns