Well, I have finally tracked down the culprit behind the passwordless SSH issue on Rome. It was in fact SELinux, the access control policy mechanism on Fedora, which was preventing a network-shared user directory from being used as such. After many hours spent researching, changing permissions, and changing ssh_config and sshd_config parameters in the hope of solving the issue, it turns out that only one parameter needed to be changed in the SELinux policy: use_nfs_home_dirs. By default this parameter is turned off in Fedora; a simple one-line command changes it to true/on.
- The Obelix machine is experiencing an issue when booting. It will only boot in failsafe mode and hangs when attempting to boot in regular GUI mode. This is particularly inconvenient if we need to reboot the machine for whatever reason: we lose the ability to ssh into it because the system never finishes loading, which prevents the ssh port from opening. The only solution is for someone to be in the server room to boot into failsafe mode and manually start ssh and mount our NFS share so that the machine can be accessed remotely.
I have looked through all the logs I could find on Obelix in search of clues as to why it suddenly stopped loading in regular mode. Based on the logs and the Google searches I have conducted, I feel the issue has to do with either gnome, the gdm service, xorg, a bad video card, or a bad video driver. Here are some interesting log outputs I found.
- /var/log/Xorg.0.log - command ran
egrep "EE|WW" /var/log/Xorg.0.log
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[245398.665] (WW) The directory "/usr/share/fonts/TTF/" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/OTF/" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/Type1/" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/100dpi" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/cyrillic" does not exist.
[245398.665] (WW) The directory "/usr/share/fonts/misc/sgi" does not exist.
[245398.675] (II) Loading extension MIT-SCREEN-SAVER
[245398.680] (WW) Warning, couldn't open module fglrx
[245398.680] (EE) Failed to load module "fglrx" (module does not exist, 0)
[245398.725] (WW) Falling back to old probe method for fbdev
[245398.726] (WW) Falling back to old probe method for vesa
[245398.773] (WW) MACH64(0): Cannot shadow an accelerated frame buffer.
[245398.788] (WW) MACH64(0): DRI static buffer allocation failed -- need at least 12800 kB video memory
[245398.901] (EE) No input driver/identifier specified (ignoring)
[245398.901] (EE) No input driver/identifier specified (ignoring)
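One detail worth noting about the output above: the pattern "EE|WW" matches those letters anywhere in a line, which is why the (II) line for MIT-SCREEN-SAVER shows up (SCREEN contains "EE"). A minimal Python sketch of a stricter filter follows; it only assumes that severity markers appear in parentheses, as in the log legend, and the function name is made up for illustration.

```python
import re

# Match only the parenthesised severity markers "(EE)" and "(WW)",
# not the bare letters, so words like "SCREEN" no longer match.
MARKER = re.compile(r"\((EE|WW)\)")


def warnings_and_errors(lines):
    """Return only the lines tagged as errors (EE) or warnings (WW)."""
    return [line for line in lines if MARKER.search(line)]


# Three lines copied from the Xorg.0.log output above.
sample = [
    '[245398.665] (WW) The directory "/usr/share/fonts/TTF/" does not exist.',
    "[245398.675] (II) Loading extension MIT-SCREEN-SAVER",
    '[245398.680] (EE) Failed to load module "fglrx" (module does not exist, 0)',
]

for line in warnings_and_errors(sample):
    print(line)
```

To run it against the real file, the list could be replaced with open("/var/log/Xorg.0.log").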
fglrx is the Linux driver for ATI video cards, and Obelix has an ATI card. The full name of the graphics card is:
ATI Technologies Inc Rage XL (rev 27)
<notice -- Apr 3 21:15:30.708364000> service cron donedone
<notice -- Apr 3 21:15:30.708697000> service smartd startStarting smartd
<notice -- Apr 3 21:15:30.915243000> service smartd donedone
<notice -- Apr 3 21:15:30.915740000> service stoppreload start<notice -- Apr 3 21:15:30.951351000> service stoppreload donedone
Master Resource Control: runlevel 5 has been reached
Failed services in runlevel 5: vboxadd vmtoolsd
Skipped services in runlevel 5: cifs xdm
The vboxadd, vmtoolsd and cifs services shouldn't be causing the Obelix issue. xdm, on the other hand, is involved in the graphical startup of the system. What I have read about xdm suggests that any error with it is logged in the /var/log/xdm.log file, but no such file exists on Obelix.
Driver not XRANDR 1.2 capable, ignoring DISPLAYMANAGER_RANDR_MODE_* settings
/etc/X11/xdm/Xsetup: line 147: /usr/bin/hal-find-by-property: No such file or directory
The only way to test any of the solutions I found online is to reboot the machine and see if I'm able to log in to it via ssh. This is a huge risk: if the solution is unsuccessful, I lose the connection to Obelix. So I decided to find a copy of openSUSE 11.3 and run it in a virtual machine to see if I could recreate the Obelix problem and, at the same time, see what happens when I apply some of the online solutions to my copy of the OS. So far I'm unable to recreate the problem, but installing a fresh copy of openSUSE 11.3 in a virtual machine gave me the ability to compare the two systems' config files. The result was identical gnome and xorg files. This makes me believe that the issue might be a bad graphics card or a bad graphics driver.
Changing the display manager in /etc/sysconfig/displaymanager allowed a GUI login on Obelix. This tells us that the issue is most likely Gnome related. This will be looked at in more detail tomorrow. The bad news is that the same issue is occurring on other machines, which suggests that Torque could be the reason behind it.
The second issue Obelix faces is that sshd isn't running on boot. Looking at the service scripts inside the /etc/init.d/ directory, sshd was there but it was empty. I'm not sure why or how the contents of the sshd file got removed. I copied Caesar's sshd script over to Obelix and tested it with the command service sshd status, which returned that sshd is running. The only way to test whether it works on boot is to reboot Obelix, which has not been done yet.
It turns out that the missing sshd service script was indeed the reason why ssh wasn't running at boot on Obelix. With the sshd service fixed, my focus shifts to figuring out how this happened and what is causing the gnome issue on 8 out of 10 machines. Unfortunately, I wasn't able to work on these issues because a new and more dangerous issue has appeared. For the last 3-4 days Caesar has been acting slow: ssh authentication was taking longer than usual, and once inside, the system was sluggish. I decided to look in Caesar's /var/log/messages log to see if anything there could explain the sluggish performance. What I saw was a huge red flag: thousands of failed logins for root and for fake users. The attack was coming from multiple IP addresses, and one of the IP look-up websites showed that the attacks were coming from China. The next step is to block those IP addresses and change the root password in case Caesar is compromised.
Week Ending April 15, 2014
- Investigate the cause of Gnome failing to load on 8 machines
- Look into the best method of blocking attacks on Caesar
- Research Sphinx parameters in the hope of finding the right balance for our group's experiments
- Try to run my own experiment to help Justice League group find the perfect balance between parameters
Caesar is still being bombarded with fake SSH login attempts. Looking through some of the older logs on Caesar, these attacks seem to have been going on for months. Even though it doesn't look like any of the attempts were successful, action needs to be taken. I have already made a list of 10 or so IP addresses from which attacks were coming, but I feel this isn't a job for manual IP banning, as that requires constantly checking logs and adding IPs to a ban list. An application is needed that automatically bans an IP that has multiple failed login attempts in a very short time span. Linux has such an application and it's called fail2ban. This little helper constantly parses one or more logs in search of failed logins, and if there are multiple failed logins in a short period of time it adds the source IP address to the firewall's ban list. First, I'll test the application on my virtual system before suggesting it to Prof. Jonas for implementation on Caesar.
Fail2Ban website www.fail2ban.org
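To make the idea concrete, here is a minimal Python sketch of the "too many failures in a short time span" rule described above. It is not fail2ban's actual code; the function name, the threshold of 5 failures and the 600-second window are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


def find_bans(events, max_failures=5, window_seconds=600):
    """Return the set of IPs that reach max_failures within window_seconds.

    events: iterable of (timestamp_in_seconds, ip) pairs for failed logins,
    sorted by timestamp. Both limits are illustrative assumptions.
    """
    recent = defaultdict(deque)  # ip -> timestamps of its recent failures
    banned = set()
    for ts, ip in events:
        attempts = recent[ip]
        attempts.append(ts)
        # Forget failures that have fallen out of the time window.
        while ts - attempts[0] > window_seconds:
            attempts.popleft()
        if len(attempts) >= max_failures:
            banned.add(ip)
    return banned


# Made-up events using documentation-only IP addresses.
fast = [(t, "203.0.113.7") for t in (0, 10, 20, 30, 40)]         # 5 failures in 40 s
slow = [(t, "198.51.100.9") for t in (0, 1000, 2000, 3000, 4000)]  # spread out
print(find_bans(sorted(fast + slow)))
```

The real tool also lifts bans after a configured time and talks to the firewall, which this sketch leaves out.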
Spent the day researching Sphinx in the hope of finding additional information that will help us lower our error rates. Found an interesting Sphinx guide which goes into detail on how Sphinx works (train and decode) and on its file system. Due to the current competition with the Avengers, I will not be posting the link to the guide in this log for now. I have shared the information with my group and will update this log after the competition to include the link to the guide.
- Downloaded and installed the fail2ban application on my virtual openSUSE 11.3 machine. The process was easy; it took about 15 minutes from start to finish. Configuring it was also simple, as it comes with a lot of preset configurations for different protocols. Testing it by trying fake logins from my other virtual machine was a success: it automatically banned the IP address after 5 failed attempts. It will be a great application to have on Caesar to protect it from all of these SSH attacks.
- For some odd reason all of our machines have the wrong time, and some even have the wrong date. So I went to each machine and updated its date and time using two commands:
date +%D -s 2014-04-15 && date +%T -s 01:17:00
- Researched some more on possible causes of our gnome issue and haven't found anything concrete yet. There are multiple errors and warnings inside the log files on each machine, but without being in the server room to test some of the online suggestions, it's difficult to pinpoint what is preventing the gdm display manager from loading. One constant error that I see across all machines is the GTK library issue.
gnome-about --gnome-version - this command should return the gnome version, but it also returns all of the warnings below.
Googling some of them hasn't turned up any solid solution that I trust.
/usr/lib/python2.6/site-packages/gtk-2.0/gtk/__init__.py:57: GtkWarning: could not open display
/usr/bin/gnome-about:828: Warning: invalid (NULL) pointer instance
/usr/bin/gnome-about:828: Warning: g_signal_connect_data: assertion `G_TYPE_CHECK_INSTANCE (instance)' failed
/usr/bin/gnome-about:828: GtkWarning: gtk_settings_get_for_screen: assertion `GDK_IS_SCREEN (screen)' failed
/usr/bin/gnome-about:828: Warning: g_object_get: assertion `G_IS_OBJECT (object)' failed
/usr/bin/gnome-about:828: Warning: value "TRUE" of type `gboolean' is invalid or out of range for property `visible' of type `gboolean'
- Looked at 2 dozen log files from our group's 100hr experiment to find out why we are getting a high number of errors and warnings while decoding. Unfortunately, it seems that some of the logs aren't being recorded as they should be; maybe there is a parameter responsible for this. Nonetheless, I did manage to find some errors and warnings that might be of help. I have shared that information with my group members.
Week Ending April 22, 2014
- Report to Jonas a list of the IP addresses attackers are using for SSH login attempts
- Run mini train experiments in order to figure out best parameters for train and decode
On Wednesday I had a chat with Jonas about the fail2ban app and how it could help control our Caesar problem. He said there was no need to use it at the local level because the UNH system already has such an application at the system-wide level. Jonas then asked me to compile a list of all the suspected IP addresses and email it to him. He would then forward it to the UNH System Administrators so that they could ban the IP addresses.
So I parsed the /var/log/messages log for the latest attacks on Caesar, recorded the IP addresses which produced multiple failed logins, and sent the report to Jonas so the IPs can be banned at the UNH level.
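The parsing step can be sketched in a few lines of Python. This is only an illustration of the approach, not the exact method I used: it assumes the usual OpenSSH "Failed password for ... from <ip>" message wording, and the sample lines, host name and IP addresses are made up.

```python
import re
from collections import Counter

# Typical sshd failure lines look like:
#   "Failed password for root from 1.2.3.4 port 22 ssh2"
#   "Failed password for invalid user admin from 1.2.3.4 port 22 ssh2"
FAILED = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\d{1,3}(?:\.\d{1,3}){3})"
)


def failed_logins_by_ip(lines, threshold=2):
    """Count failed logins per source IP; keep IPs with at least `threshold`."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


# Made-up sample lines (documentation-only IP ranges, hypothetical host name).
sample = [
    "Apr 16 02:11:01 caesar sshd[4321]: Failed password for root from 203.0.113.7 port 51122 ssh2",
    "Apr 16 02:11:04 caesar sshd[4323]: Failed password for invalid user admin from 203.0.113.7 port 51130 ssh2",
    "Apr 16 02:12:40 caesar sshd[4330]: Failed password for root from 198.51.100.9 port 40022 ssh2",
    "Apr 16 02:13:02 caesar sshd[4335]: Accepted password for jdoe from 192.0.2.15 port 50111 ssh2",
]

print(failed_logins_by_ip(sample))
```

On the real system the sample list would be replaced by open("/var/log/messages"), with a higher threshold to avoid reporting users who simply mistyped a password.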
- Researched Gnome in the hope of shedding some light on our issues with it. So far still nothing concrete.
- Read about Sphinx parameters
Thanks to David, the new way of creating experiments is simple and needs only a little configuration. I created an experiment with the mini/train corpus data and ran a train, which completed successfully. Looking at the 010.html log file, I see that there are still errors and warnings that occur while training. I researched some of them and wasn't able to find any useful information on how to solve those problems. Sphinx 3 logs aren't the best, as they lack many details about errors, making it hard to figure out the cause.
A couple of files (baum_welch.c & accum.c) are mentioned in /logdir/30.cd_hmm_untied/010.1-1.bw.log whenever there is an error. The errors look like these:
ERROR: "baum_welch.c", line 331: sw2001A-ms98-a-0049 ignored
WARNING: "accum.c", line 626: The following senones never occur in the input data
120 121 122 123 124 125 126 127 128 132
133 134 135 136 137 138 139 140 141 142
143 150 151 152 153 154 155 174 175 176
177 178 179 180 181 182 183 184 185 189
190 191 195 196 197 201 202 203 204 205
206 216 217 218 219 220 221 222 223 224
So I went through the Sphinx3 source code on sourceforge.net to find those files and look at the code at the reported lines. I was able to find the files, but they are written in C, which will take me some time to understand. I have shared my findings with my group, so maybe Forrest, David or somebody else who is more familiar with C will be able to understand the errors that are occurring during training.
Here is the link to the source files http://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/sphinxtrain/src/programs/bw/
- Testing new training method suggested by Forrest. The process is similar to the original experiment setup but this time we have a special script which inserts the mono channel data into our experiment to be trained on. I think the process could be even simpler if it's incorporated with David's script which worked great when I tested it yesterday.
Week Ending April 29, 2014
- This week all of the focus is on running experiments.
Ran 2 experiments, one with the regular mini/train data set and the other with mini/mono data created by eliminating the stereo channel. The results were not satisfying, as the error rate from mini/mono was almost double the error rate of mini/train.
Update with results:
SENTENCE ERROR: 87.6% (481/549) WORD ERROR RATE: 20.4% (1975/9699)
TOTAL Words: 9762 Correct: 6086 Errors: 3703
TOTAL Percent correct = 62.34% Error = 37.93% Accuracy = 62.07%
TOTAL Insertions: 27 Deletions: 2283 Substitutions: 1393
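For reference, the TOTAL lines above are consistent with the usual definitions: errors are insertions plus deletions plus substitutions, and each percentage is taken relative to the total word count. A small Python sketch of that arithmetic follows; the function name is my own, and this is a check on the numbers, not the scoring script itself.

```python
def score(words, correct, insertions, deletions, substitutions):
    """Recompute the TOTAL summary figures from the raw counts."""
    errors = insertions + deletions + substitutions
    error_pct = 100.0 * errors / words
    return {
        "errors": errors,
        "percent_correct": round(100.0 * correct / words, 2),
        "error": round(error_pct, 2),
        "accuracy": round(100.0 - error_pct, 2),
    }


# Counts taken from the TOTAL lines directly above.
totals = score(words=9762, correct=6086,
               insertions=27, deletions=2283, substitutions=1393)
print(totals)
```

With these inputs the function reproduces the logged values: 3703 errors, 62.34% correct, 37.93% error and 62.07% accuracy. Because insertions count as errors but do not reduce the number of correct words, percent correct and accuracy do not have to match.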
Since the mini data set experiments were not what we expected, and David, Forrest and Pauline are at a competition for the next couple of days, I decided to run a few more experiments comparing the mono audio to the original stereo audio. This time I went with the first_5hr data set, and the results of the two experiments are almost identical, with the first_5hr/mono data giving slightly better results than the first_5hr/train data.
TOTAL Words: 60084 Correct: 49485 Errors: 18287
TOTAL Percent correct = 82.36% Error = 30.44% Accuracy = 69.56%
TOTAL Insertions: 7688 Deletions: 3763 Substitutions: 6836
TOTAL Words: 60084 Correct: 49348 Errors: 18484
TOTAL Percent correct = 82.13% Error = 30.76% Accuracy = 69.24%
TOTAL Insertions: 7748 Deletions: 3801 Substitutions: 6935
I'm not sure about the real-time (RTx) numbers. I might be looking at the wrong log data, so I will have to check with the group to make sure.
Due to the results above, I went ahead and began another first_5hr/mono experiment, but this time I set the density and senone values based on experiment 0199/d32/s3000. That experiment produced an error rate of 17.2 with a density of 32 and 3000 senones. If the results of my experiment /0252/022 are better than those of 0199/d32/s3000, then we are on the right track with splitting stereo audio into mono and training on it.
TOTAL Words: 60084 Correct: 53123 Errors: 13626
TOTAL Percent correct = 88.41% Error = 22.68% Accuracy = 77.32%
TOTAL Insertions: 6665 Deletions: 2830 Substitutions: 4131
TOTAL Words: 60084 Correct: 53097 Errors: 13702
TOTAL Percent correct = 88.37% Error = 22.80% Accuracy = 77.20%
TOTAL Insertions: 6715 Deletions: 2889 Substitutions: 4098
TOTAL Words: 60084 Correct: 55083 Errors: 10044
TOTAL Percent correct = 91.68% Error = 16.72% Accuracy = 83.28%
TOTAL Insertions: 5043 Deletions: 2177 Substitutions: 2824
TOTAL Words: 60084 Correct: 53597 Errors: 12943
TOTAL Percent correct = 89.20% Error = 21.54% Accuracy = 78.46%
TOTAL Insertions: 6456 Deletions: 2790 Substitutions: 3697
TOTAL Words: 60084 Correct: 53543 Errors: 11209
TOTAL Percent correct = 89.11% Error = 18.66% Accuracy = 81.34%
TOTAL Insertions: 4668 Deletions: 3041 Substitutions: 3500
Week Ending May 6, 2014