Speech:Spring 2017 Andrew George Log



Week Ending February 7, 2017

Plan

2/2 - Have impromptu meeting over Google Hangouts.

2/3 - Checking in

2/4 - Go live with the collaboration tool for the Systems team on local ESXi server. Run first experiment.

2/5 - Checking in

2/7 - Have a meeting over Google Hangouts to run a decode on the recently run train and discuss the proposal.

Task

2/2 - Discuss what needs to be done and how we are going to do it, focusing mainly on the proposal and a method of collaborating that doesn't rely on a third-party cloud service.

2/3 - Checking in

2/4 - Setup file sharing server (OwnCloud) on ESXi host. Run first experiment.

2/5 - Checking in

2/7 - Run a decode on the recently run train. Research Ansible and Nagios for configuration management and system monitoring.

Concerns

2/2 - Figuring out and deciding on what needs to be done.

2/3 - Checking in

2/4 - Getting file and text document sharing working for everyone on the team without the use of a third-party cloud service. Also, not being able to get the first experiment to work.

2/5 - Checking in

2/7 - Requesting funding for the purchase of iDRAC cards for all functioning servers.

Results

2/2 - Julian volunteered to organize the proposal, and we decided on building a MediaWiki server on my home server. Built the MediaWiki server on an Ubuntu VM that resides on my VMware ESXi server. SSH'd into the Ubuntu VM and followed the linked guide to install MediaWiki (https://www.hiroom2.com/2016/08/05/ubuntu-16-04-install-mediawiki/).
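For reference, a minimal sketch of the kind of install the linked guide covers (Ubuntu 16.04 package names; the MediaWiki version shown is an assumption, not necessarily the one I used):

# install the LAMP stack MediaWiki depends on
sudo apt-get update
sudo apt-get install -y apache2 mysql-server php libapache2-mod-php php-mysql php-xml php-mbstring
# unpack a MediaWiki release into the web root (version illustrative)
wget https://releases.wikimedia.org/mediawiki/1.27/mediawiki-1.27.1.tar.gz
tar -xzf mediawiki-1.27.1.tar.gz
sudo mv mediawiki-1.27.1 /var/www/html/mediawiki
# then finish the setup wizard in a browser at http://<vm-ip>/mediawiki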

2/3 - Checking in

2/4 - Decided on scrapping the MediaWiki server and went with OwnCloud (https://owncloud.com/). Mark Tollick and I installed OwnCloud on my ESXi Ubuntu virtual machine and enabled access for all team members via HTTPS. We then built a VPN tunnel to my home network using OpenVPN to give all team members remote access to the OwnCloud server. Tested VPN and OwnCloud server access successfully. Added the first experiment and ran our first train.

2/5 - Checking in

2/7 - Came across same error Vitali did when trying to decode the train that we had run on 2/4.

Week Ending February 14, 2017

Plan

2/9/2017

  • Spontaneous Google Hangout meeting to discuss potential monitoring methods.
  • Check service status of Caesar.
  • Read more Wiki.
  • Research system commands for Red Hat.

2/10/2017

  • Checking in and reading documentation

2/13/2017

  • Checking in and reading documentation

2/14/2017

  • Reply to Professor Jonas's email about how to parse out the drone machines.
  • Reach out to Jon Shallow for more details on what the Modeling group did at the end of last year and the summer.
  • Read more documentation
Task

2/9/2017

  • Discuss monitoring alternatives.
  • Obtain system service status of Caesar.
  • Read more on Speech Wiki.
  • Check for nmap on Caesar.
  • Log into all drone machines and check service status

2/10/2017

  • Checking in and reading documentation

2/13/2017

  • Checking in and reading documentation

2/14/2017

  • Research LCD display statuses on the servers
  • Verify connection statuses of the servers via Caesar
  • Reply to Professor Jonas's email about how to parse out the drone machines.
  • Reach out to Jon Shallow for more details on what the Modeling group did at the end of last year and the summer.
  • Read more documentation
Concerns

2/9/2017

  • No concerns so far. Just need to figure out the current status of the servers.

2/10/2017

  • Checking in and reading documentation

2/13/2017

  • Checking in and reading documentation

2/14/2017

  • My concerns are not being able to provide the support that the other teams need from us as the Systems team.
  • I would like to have visibility on the switch that supplies the connection to all the servers.
  • Not being able to setup a monitoring solution for the servers in case the school is closed or none of our team members are able to make it to school that day.
Results

2/9/2017

  • Discovered that none of the drone servers have access to the Red Hat Subscription Service, and therefore they cannot perform any yum install commands.
  • I was unable to log into obelix.unh.edu with root password via ssh from Caesar.
  • Same result as above when trying to access miraculix.unh.edu via ssh from Caesar.
  • obelix.unh.edu was not responding to pings or ssh via Caesar (Connection timed out).
  • Found that nmap was not installed on Caesar or any of the other machines on the 192.168.10.0/24 subnet.
[root@caesar ~]# nmap 192.168.10.0/24
Starting Nmap 5.51 ( http://nmap.org ) at 2017-02-09 22:40 EST
Nmap scan report for caesar (192.168.10.1)
Host is up (0.000012s latency).
Not shown: 997 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
111/tcp  open  rpcbind
2049/tcp open  nfs
Nmap scan report for asterix (192.168.10.2)
Host is up (0.00011s latency).
Not shown: 999 filtered ports
PORT   STATE SERVICE
22/tcp open  ssh
MAC Address: 00:19:B9:E7:2A:4E (Dell)
Nmap scan report for idefix (192.168.10.7)
Host is up (0.00011s latency).
Not shown: 999 filtered ports
PORT   STATE SERVICE
22/tcp open  ssh
MAC Address: 00:19:B9:E7:3E:13 (Dell)
Nmap scan report for rome (192.168.10.11)
Host is up (0.00016s latency).
Not shown: 999 filtered ports
PORT   STATE SERVICE
22/tcp open  ssh
MAC Address: 00:22:19:25:8F:C8 (Dell)
Nmap scan report for brutus (192.168.10.12)
Host is up (0.00010s latency).
Not shown: 997 closed ports
PORT     STATE SERVICE
22/tcp   open  ssh
111/tcp  open  rpcbind
2049/tcp open  nfs
MAC Address: 00:0F:1F:6D:25:3F (WW Pcba Test)
Nmap done: 256 IP addresses (5 hosts up) scanned in 10.77 seconds
  • Found a service status command for Red Hat (service --status-all) and ran it on Caesar.
[root@caesar ~]# service --status-all
abrt-ccpp hook is installed
abrtd (pid  2183) is running...
abrt-dump-oops is stopped
acpid (pid  1899) is running...
atd (pid  2202) is running...
auditd (pid  1692) is running...
automount (pid  1971) is running...
Usage: /etc/init.d/bluetooth {start|stop}
certmonger (pid  2233) is running...
Frequency scaling enabled using ondemand governor
crond (pid  2191) is running...
cupsd (pid  1873) is running...
dnsmasq is stopped
firstboot is not scheduled to run
hald (pid  1908) is running...
htcacheclean is stopped
httpd is stopped
irqbalance (pid  1768) is running...
Kdump is not operational
lvmetad is stopped
mdmonitor is stopped
messagebus (pid  1797) is running...
netconsole module not loaded
Configured devices:
lo br0.old eth0 eth1 eth1.old
Currently active devices:
lo eth0 eth1 br0
NetworkManager (pid  1808) is running...
rpc.svcgssd is stopped
rpc.mountd (pid 2017) is running...
nfsd (pid 2032 2031 2030 2029 2028 2027 2026 2025) is running...
rpc.rquotad (pid 2013) is running...
rpc.statd (pid  1841) is running...
ntpd is stopped
oddjobd is stopped
portreserve is stopped
master (pid  2159) is running...
Process accounting is disabled.
quota_nld is stopped
rdisc is stopped
restorecond is stopped
rhnsd (pid  2212) is running...
rhsmcertd (pid 2220) is running...
rngd is stopped
rpcbind (pid  1782) is running...
rpc.gssd is stopped
rpc.idmapd (pid 2053) is running...
rpc.svcgssd is stopped
rsyslogd (pid  1717) is running...
sandbox is stopped
saslauthd is stopped
smartd is stopped
snmpd is stopped
snmptrapd is stopped
spice-vdagentd is stopped
openssh-daemon (pid  2071) is running...
sssd is stopped
wdaemon is stopped
winbindd is stopped
wpa_supplicant (pid  1874) is running...
xinetd (pid  2079) is running...
ypbind is stopped

2/10/2017

  • Checking in and reading documentation

2/13/2017

  • Checking in and reading documentation

2/14/2017

  • Replied to Professor Jonas' email with as much information as I could provide on the statuses of the servers.
  • Read through Jon Shallow's log to determine which machine they used most. Found that Miraculix was one they worked with because it was having issues.
  • Emailed Jon Shallow to ask about any further information he could possibly provide.
  • Logged into Caesar to try to access all six machines and recorded the data for the email sent to Professor Jonas and the other teammates and team leaders.

Week Ending February 21, 2017

Plan

2/15/2017

  • Meet in the server room to troubleshoot machines.
  • Figure out SSH-keygen.

2/16/2017

  • Meet in the server room to recover root password on Majestix.
  • Create a language model from our previously run train.
  • Run decode on the train.

2/18/2017

  • Have Google Hangout meeting with group.

2/20/2017

  • Meet with group in server room.

2/21/2017

  • Meet with Mark in server room at 12PM
  • Meet with Julian in server room at 4PM
Task

2/15/2017

  • Our task was to troubleshoot the servers that were not responding to ping or SSH access.
  • Run ssh-keygen on our accounts to gain access to the servers without needing our password every time (see the sketch below).
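For reference, a minimal sketch of that key setup (run from Caesar; the account and target host are examples from our environment):

# generate an RSA key pair, accepting the default path and an empty passphrase
ssh-keygen -t rsa
# append the public key to the target machine's authorized_keys
ssh-copy-id acg12@rome
# subsequent logins should no longer prompt for a password
ssh acg12@rome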

2/16/2017

  • Recover password on Majestix.
  • Finish language model and decode.

2/18/2017

  • Discuss and review final draft of our group proposal.

2/20/2017

  • Give Internet access to Rome for the installation of GCC & G++.

2/21/2017

  • Troubleshoot Miraculix (was unable to ssh from Caesar).
  • Look into Rome's backup configuration to see if any maintenance is needed.
Concerns

2/15/2017

  • My only concern was running into some unfixable error and not being able to get the down servers back online.

2/16/2017

  • My only concern was that we would not be able to figure out how to proceed with the language model and decode due to directory permissions.

2/18/2017

  • Not sure about expectations as far as specific tasks or direction of the Systems group.

2/20/2017

  • Not being able to give Internet access to any of the drone machines.

2/21/2017

  • Not being able to retrieve backup files or recover deleted files due to backup configuration not being correct.
Results

2/15/2017

  • A couple of the servers were not connected to the network properly, so we reseated the network cables on the machines, after which they responded to pings and SSH.
  • Set up SSH keys on our accounts.
  • Found that Majestix needs its root password recovered since there were no other user accounts added and the root password was unknown.
  • Gave Idefix to the Modeling group.
  • Gave Majestix to the Tools group.
  • Asterix is not going to be used.
  • Obelix and Miraculix will be used by the Systems team to install Torque on.
  • Watched Professor Jonas remove the soft link from Asterix.
  • Watched Professor Jonas run the scripts to add all the student accounts to each server except for Majestix.

2/16/2017

  • Using chmod 777 fixed the permissions issue we were having in the LM directory.
    • Created the language model.
    • Ran the decode on the trained data.
    • Successfully completed our first train.
  • Recovered the root password on Majestix by booting from the Red Hat disk and statically assigned its IP address (192.168.10.6) to the eth0 interface (see the sketch after this list).
    • Majestix is now accessible via Caesar.
    • Informed the Tools group that they can now access Majestix and need to check the system for the files that they need.
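For reference, a sketch of what the static assignment looks like on Red Hat 6 (the address is from the log; the netmask and remaining fields are assumptions):

# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)
DEVICE=eth0
BOOTPROTO=static
IPADDR=192.168.10.6
NETMASK=255.255.255.0
ONBOOT=yes

Then restart networking with service network restart for the change to take effect.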

2/18/2017

  • Refined implementation tasks on the final proposal.

2/20/2017

  • Unfortunately, I had to go to class, but the group was unable to gain Internet access on Rome when using the WAN cable from Caesar.
  • Updated our group's final proposal on the Wiki to match up with the rest of the class.

2/21/2017

  • Rebooted Miraculix, which failed because the server could not unmount /mnt/main/.
  • Hard rebooted Miraculix and it came back online.

Week Ending February 28, 2017

Plan

2/22/2017

  • Have meeting in server room to discuss tasks for the upcoming week.

2/23/2017

  • Have meeting over Hangouts to troubleshoot Rome's DNS issues.

2/24/2017

  • Research methods of potentially configuring Red Hat to route network traffic.

2/27/2017

  • Meet in server room to configure drone servers.

2/28/2017

  • Meet in server room to attempt to give Internet access to 3 drones at once.
Task

2/22/2017

  • Discuss tasks that need to be completed this week.
  • Came up with the following tasks:
    • Figure out how to gain Internet access on the drone servers.
    • Configure Rome to pull a public IP address from the network cable that Professor Jonas let us use.
    • Figure out how to get a hold of the Red Hat license keys.
    • Register each drone server to the Red Hat Satellite Network so they can do yum installs of the packages we need.
    • Find alternatives to obtaining the Red Hat license keys in case we are unable to get them.

2/23/2017

  • Troubleshoot Rome's DNS issues to gain access to the Internet.
  • Stare and compare all network configurations between Caesar and Rome.
  • Research network troubleshooting issues on Red Hat.
  • Seek help outside of school.

2/24/2017

  • Research methods of potentially configuring Red Hat to route network traffic.

2/27/2017

  • Register MAC addresses of drone servers to the UNH IT network webpage (https://networking.unh.edu/nonbrowser).
  • Check all network configurations on drone servers.
  • Configure all network configurations on drone servers.
  • Restart network services.
  • Reboot server.

2/28/2017

  • Configure the Linksys wireless router as a switch to give access to 3 drones at a time.
Concerns

2/22/2017

  • Feeling a little concerned about obtaining the Red Hat activation keys for the drone servers.

2/23/2017

  • Unsure how to proceed when running into issues, out of worry about doing the wrong thing or something our professor would not approve of.

2/24/2017

  • Feeling a little discouraged due to the restrictions on the network and hardware configurations.

2/27/2017

  • Concerned that we may not be able to give Internet access to the drone servers that need it.

2/28/2017

  • Concerned that configuring a switch may not be a viable option.
Results

2/22/2017

  • We were able to configure Rome's eth0 interface to pull a valid IP via DHCP.
  • Once Rome pulled an IP, we were excited by the success. We were able to ping out to the default gateway of the public subnet as well as Google's DNS (8.8.8.8).
  • Our excitement did not last long because we soon found that DNS was not working.

2/23/2017

  • nslookup was not resolving DNS.
  • We checked the network config files and made sure everything was configured the same exact way as Caesar was. Still no resolution.

2/24/2017

  • Found that we are not allowed to configure Caesar to NAT the traffic through its WAN interface to give access to all the drone servers.

2/27/2017

  • Successfully registered all drone servers to the UNH IT department website and confirmed that we were able to get online with the drones.
  • Once registered, we were able to browse out to the Internet.
  • Now we just need to figure out how to register the servers to the Red Hat network so we can start doing yum installs.

2/28/2017

  • Successfully gave Internet access to 3 drone machines by configuring the Linksys router.
  • Found that we would be charged more money to have more than one drone connected to the network at a time.
  • Unplugged all drones except for one.
  • Will have to float the one orange Ethernet cable to the drones as needed.

Week Ending March 7, 2017

Task

3/1/2017

  • Attempt to register the drones to the Red Hat network using the subscription manager.
  • Professor Jonas had the idea of using a WiFi dongle to connect to Rome and configure it to bridge the connection to the other drones.
  • The Tools team was having difficulties decoding on Majestix, which was giving them errors that appeared to be caused by a missing file or path from Sphinx.

3/2/2017

  • Run a 5 hour train on Miraculix under my AD username

3/3/2017

  • Checking in

3/4/2017

  • Checking in

3/5/2017

  • Attempt to run another 5 hour train.

3/6/2017

  • Another attempt to run a 5 hour train all the way through to completion on Miraculix.
  • Run yum update on Miraculix.

3/7/2017

  • Run 5hr train with Mark to see if he gets the same error I did.
  • Install SCLite on Miraculix.
  • Attempt to install WiFi dongle.
  • Look into disabling the RAID configuration on the drone machines.
Results

3/1/2017

See the Systems Group link for more detail.

  • While troubleshooting the activation key issue, we found that once the server had full Internet access without DNS problems, we were able to activate the license using the information Bruce had emailed us.
    • First we had to register the drone to the Red Hat Satellite server cinnabar.unh.edu using a wget command.
    • Then we had to remove the yum repo metadata left over from the other sources.
    • Lastly, we used the subscription-manager command from the Systems group log to activate the key (see the sketch below).
  • Successfully registered the license on the rest of the drone servers except for Majestix.
  • We installed gcc and g++ on Rome using a yum install.
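A rough sketch of those three registration steps (the bootstrap RPM name, org, and key are placeholders; the real values came from Bruce's email and the Systems group log):

# register the drone against the Red Hat Satellite server
wget http://cinnabar.unh.edu/pub/katello-ca-consumer-latest.noarch.rpm
rpm -Uvh katello-ca-consumer-latest.noarch.rpm
# clear the yum metadata left over from the old sources
yum clean all
# activate with the key from the Systems group log
subscription-manager register --org="<org>" --activationkey="<key>"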

3/2/2017

  • Ran a train but encountered permissions issues and had to delete 003 to start all over.

3/3/2017

  • Checking in

3/4/2017

  • Checking in

3/5/2017

  • Logged in as AD username and recreated directory 003 inside the Systems team's Experiment directory (0297).
  • Ran train successfully
  • Created language model successfully
  • A decode error occurred when running "sclite -r 003_train.trans -h hyp.trans -i swb >> scoring.log".
  • The command was reported as not existing.
  • Ran it from Caesar instead, but it returned "Segmentation fault (core dumped)".

3/6/2017

3/7/2017

  • Ran 5hr train with Mark and he got the same segmentation fault error.
  • Tried to install SCLite, but ran into issues (file format not recognized when running 'makefile' after extracting the SCLite executable).
  • Extracted USB dongle tar.gz file and stopped there.
  • Haven't had a chance to look at the RAID config of Miraculix.
  • Successfully ran and decoded a 5 hour train on sub-experiment 008 via Caesar.
Plan

3/1/2017

  • Need to attempt to install WiFi dongle on Rome to see if it would be possible to use that as a bridge instead.
  • Wait for the Tools group to find the files they need on Majestix and transfer them over to Obelix, which is now what we will be giving them because we were also unable to register Majestix.
  • Will need to rebuild Majestix once the Tools group is done with it.
  • Install IRC on Rome.
  • Install Torque.

3/2/2017

  • Will need to attempt to run another 5 hour train tomorrow.

3/3/2017

  • Checking in

3/4/2017

  • Checking in

3/5/2017

  • Going to attempt to run another train tomorrow.

3/6/2017

  • Attempt to install the WiFi dongle on Rome to see if it would be possible to get the server to connect to the Internet through the dongle.
  • Look into disabling the RAID configuration on the drone machines.
  • Run 5hr train from Caesar from start to finish to see if that works.

3/7/2017

  • Need to figure out the task of transferring/creating the symbolic link from /usr/local-OFF to /usr/local (see the sketch after this list).
  • Continue to try installing WiFi dongle on Rome.
  • Continue to look into RAID config of Miraculix.
  • Figure out what the problem is with decoding a 5hr train and why it won't work for us.
  • Figure out how to get SCLite to work on Miraculix.
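A sketch of what the symbolic link task likely amounts to, assuming the live directory was set aside as /usr/local-OFF:

# confirm the current state before touching anything
ls -ld /usr/local /usr/local-OFF
# recreate /usr/local as a symbolic link to the preserved directory
# (any leftover empty /usr/local directory would need to be moved aside first)
ln -s /usr/local-OFF /usr/local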
Concerns

3/1/2017

  • No concerns. Feeling better now that we were able to register all the drones to the Red Hat network.

3/2/2017

  • Need to figure out permissions issue.

3/3/2017

  • Checking in

3/4/2017

  • Checking in

3/5/2017

  • Concerned that Miraculix needs to be rebuilt.

3/6/2017

  • Concerned that the WiFi dongle will take too much time to configure, and then more time to configure Rome to NAT the traffic out its wireless interface.
  • More concerned that Miraculix needs to be fixed.

3/7/2017

  • Concerned that I keep missing something key during the process of running a 5hr train and decoding it.

Week Ending March 21, 2017

Task

3/16/2017

  • Today my task is to go into school and do a fresh install of Red Hat Linux Server 6.6 to rebuild Majestix. Once the fresh install has been completed, we will need to reconfigure the networking profile, register it to the RHN, and reconnect the link to /mnt/main/.
  • Configure host file.
  • Configure network file.
  • Configure NICs.
  • Configure DNS.
  • Activate Red Hat.
  • Run yum update.
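For reference, the Red Hat 6 files that checklist maps to (paths are standard; values shown are illustrative):

# /etc/sysconfig/network                    -> hostname (e.g. HOSTNAME=majestix)
# /etc/hosts                                -> e.g. 192.168.10.6  majestix
# /etc/sysconfig/network-scripts/ifcfg-eth* -> per-NIC addressing
# /etc/resolv.conf                          -> nameserver entries
service network restart   # apply the networking changes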

3/17/2017

  • Checking in.

3/18/2017

  • Mount Caesar NFS on Majestix.

3/19/2017

3/20/2017

  • Finish installing Torque on Majestix.

3/21/2017

  • Start rebuilding Miraculix
  • Research more about Torque installation
Results

3/16/2017

  • Followed Forrest Suprenant's guide to installing Red Hat (Fall 2014 Redhat notes by Forrest). Successfully reinstalled Red Hat and was able to get it online with a public IP, as well as statically assign the local IP address (192.168.10.6) to it. Now, with outside remote access, we can do the rest of the configuration from home.
  • Updated the DNS nameservers in the resolv.conf file in Forrest's guide.
  • Followed the guide and was able to successfully register and update the system.

3/17/2017

  • Checking in.

3/18/2017

  • Followed Forrest Suprenant's guide to installing Red Hat and started at step 4.
  • Everything went according to the guide and I was able to successfully mount Caesar's NFS to Majestix.
  • I did not proceed to step 5 because I wanted to be in the server room for when I tested out the possible bug.

3/19/2017

  • Completed the following steps:
    • Open Necessary Ports
On the Torque Server Host:
Red Hat 6-based systems using iptables
[root]# iptables-save > /tmp/iptables.mod
[root]# vi /tmp/iptables.mod
# Add the following line immediately *before* the line matching
# "-A INPUT -j REJECT --reject-with icmp-host-prohibited"
-A INPUT -p tcp --dport 15001 -j ACCEPT
[root]# iptables-restore < /tmp/iptables.mod				
[root]# service iptables save
  • Verified the hostname.
    • On the Torque Server Host, confirm your host (with the correct IP address) is in your /etc/hosts file. To verify that the hostname resolves correctly, make sure that hostname and hostname -f report the correct name for the host.
  • Installed Packages
    • On the Torque Server Host, use the following commands to install the libxml2-devel, openssl-devel, and boost-devel packages.
Red Hat 6-based or Red Hat 7-based systems
[root]# yum install libtool openssl-devel libxml2-devel boost-devel gcc gcc-c++
  • Install Torque Server
    • Installed git
# Red Hat 6-based or Red Hat 7-based systems
[root]# yum install git

[root]# git clone https://github.com/adaptivecomputing/torque.git -b 6.0.2 6.0.2 
[root]# cd 6.0.2
[root]# ./autogen.sh
  • Get the tarball source distribution.
[root]# yum install wget
[root]# wget http://www.adaptivecomputing.com/download/torque/torque-6.0.2-<filename>.tar.gz -O torque-6.0.2.tar.gz
[root]# tar -xzvf torque-6.0.2.tar.gz
[root]# cd torque-6.0.2/

This is where I ran into issues:

[root@majestix ~]# wget https://www.adaptivecomputing.com/download/torque/torque-6.0.2.tar.gz -O torque-6.0.2.tar.gz
--2017-03-19 23:02:34--  https://www.adaptivecomputing.com/download/torque/torque-6.0.2.tar.gz
Resolving www.adaptivecomputing.com... 104.27.168.86, 104.27.169.86, 2400:cb00:2048:1::681b:a856, ...
Connecting to www.adaptivecomputing.com|104.27.168.86|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.adaptivecomputing.com/downloading/?file=/torque/torque-6.0.2.tar.gz [following]
--2017-03-19 23:02:34--  https://www.adaptivecomputing.com/downloading/?file=/torque/torque-6.0.2.tar.gz
Connecting to www.adaptivecomputing.com|104.27.168.86|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: “torque-6.0.2.tar.gz”
   [ <=>                                                                        ] 56,065      --.-K/s   in 0.03s
2017-03-19 23:02:35 (1.83 MB/s) - “torque-6.0.2.tar.gz” saved [56065]
[root@majestix ~]# tar -xzvf torque-6.0.2.tar.gz
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now


3/20/2017

  • Worked with Julian and Mark to resolve yesterday's issue. The problem was that the URL used in the link for downloading Torque was off. Julian found that the source of the download was actually Amazon's S3 storage. The command we used was:
wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.0.2-1469811694_d9a3483.tar.gz -O torque-6.0.2.tar.gz
  • After the download, we extracted the tar file and proceeded on to the next step of the guide (step 2 of 2.2.4 Install Torque Server).
  • We made it all the way through the install guide and configured the server.
  • Next we tried turning Miraculix into a node using the 2.2.5 Install Torque MOMs step of the guide.
    • Unfortunately, we ran into issues when trying to start the services. We ended up determining that Miraculix is going to need to be rebuilt from the ground up before we try Torque on it again.

3/21/2017

  • Started a fresh install of Red Hat 6.6 server on Miraculix
  • Created five virtual machines on my desktop using VMware workstation (https://gyazo.com/fea2b6bd68fed51a1b38e0e38d6f65f7).
    • Hoping it might be quicker to learn the ins and outs of Torque by trying to setup a test environment at home.
    • Once I setup the node VM, cloning it into four separate unlinked full clones took almost no time (Hardware specs on each node https://gyazo.com/3aff453a39409a9dd50933e45c6b90b1).
    • Ready to configure cluster now.
Plan

3/16/2017

  • Mount the symbolic link to /mnt/main.
  • Make sure that Majestix can successfully run a train.

3/17/2017

  • Checking in.

3/18/2017

  • My plan is to get into the server room and try rebooting Majestix to see if the "NO APIC bug" still prevents the machine from booting past the BIOS. If the bug still exists, I will implement the bug fix as step 5 shows.
  • Next, I would like to test a 5 hour train on Majestix to see if the drone machine is fully functional.

3/19/2017

  • Need to figure out how to resolve the tarball source distribution issue when installing Torque on Majestix.
  • May need to rebuild Miraculix.
  • Still need to look into installing WiFi dongle on Rome and bridge it.

3/20/2017

  • Definitely need to rebuild Miraculix.
  • Still need to look into installing WiFi dongle on Rome and bridge it.

3/21/2017

  • Configure Torque cluster on home environment.
  • Finish rebuilding Miraculix.
  • Still need to look into installing WiFi dongle on Rome and bridge it.
Concerns

3/16/2017

  • No concerns at this time.

3/17/2017

  • Checking in.

3/18/2017

  • My only concern is that some unforeseen issue occurred when reinstalling Majestix and I'd have to start over again.

3/19/2017

  • Only concern is that Torque will not be operational before class.

3/20/2017

  • Slight concerns with WiFi dongle...not sure if it will work...

3/21/2017

  • Still concerned with the WiFi dongle.

Week Ending March 28, 2017

Task

3/22

  • After we reported to the professor about what was done over break, we were given tasks to complete. Mark was given the rsync issue with Rome, Julian was given the WiFi dongle issue on Rome, and Bonnie and I were given the task of installing Torque on Rome and Majestix. Professor Jonas then tried showing me how to add the student AD accounts to the machine so he didn't have to keep doing it. I had a hard time trying to follow what he was doing, but I think I can figure it out. Professor Jonas also wanted me to research why it throws an error when trying to SSH into a newly built system and fix it. He also wanted me to figure out why "Unbound DNS resolver" was the name of an account within Majestix and Miraculix and how it somehow made its way into Caesar's account directory. Another task will be to add a Banner to both Majestix and Miraculix since they were just completely rebuilt.

3/23

  • I came into school today to work in the server room. I was unable to get into Majestix remotely, so I had to troubleshoot its connectivity; I just had to restart the network services to get it back online. Another task I wanted to get started on was uninstalling Torque server from Majestix and installing it on Rome (I was mistakenly under the impression that Majestix was to be the server host and Miraculix the MOM compute node. Instead, Rome is going to be the pbs_server host and Majestix is going to be the MOM compute node.).

3/24

  • Checking in.

3/25

  • Checking in.

3/26

  • Checking in.

3/27

  • Majestix was not responding to ping or SSH from Caesar again so I had to troubleshoot the interface connecting the machine to the 192.168.10.x network. I am not sure why it keeps losing its connection on that interface. I also wanted to configure the welcome banners on Majestix and Miraculix as well as fix the font on Idefix and Obelix. Dylan L. informed the team (Empire) that Miraculix was running into issues with the decode on a 5 hour train and that he will give it another try with a 30 hour corpus.

3/28

  • Finished installing Torque server on Rome; everything seemed to have gone smoothly, and the Torque services are now running. The next task is to install the Torque MOM node on Majestix. Another task was to help Julian figure out why the wlan0 interface disappeared on Rome after rebooting it; I got on a Google Hangouts call with Julian and Mark to troubleshoot the issue. I also needed to figure out the RSA key error when logging into Majestix from one of the other nodes, as well as create the ssh-keygen for logging into the drone via our AD accounts.
Results

3/22

  • We had our first official team meeting as Team Empire. We discussed what needs to be done before we really start playing with the Sphinx configuration file to get better decode results. We decided that we would split up the machines we were given and make sure each one was able to successfully run a train and decode. The rest of the class time was spent working with the professor trying to figure out why there was an unbound DNS account showing on Miraculix, Majestix, and somehow on Caesar as well. Some more time was spent trying to figure out why an error would be thrown when SSHing into the newly rebuilt drone machines I set up (Miraculix & Majestix). Found that you had to clear the known-hosts configuration file inside /etc (see the sketch below).
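The known-hosts fix boils down to removing the stale key for the rebuilt machine, e.g. (hostname illustrative):

# delete the old key for the rebuilt host from the known_hosts file
ssh-keygen -R majestix
# a system-wide copy, if one exists, lives at /etc/ssh/ssh_known_hosts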

3/23

  • Restarting the network services seemed to have fixed the issue temporarily. I may need to look into creating a crontab entry to restart the network services on a timed schedule (a sketch follows below). Next, I got started on uninstalling Torque server from Majestix, following a guide from the following website (https://www.webmo.net/support/torque.html). After uninstalling Torque on Majestix, I started installing Torque server on Rome. I made it to the third step of the server install and will have to pick back up where I left off another day.
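If the crontab idea pans out, a minimal sketch would be a drop-in file like this (the schedule is an assumption):

# /etc/cron.d/restart-network (sketch) - bounce networking nightly at 3 AM
0 3 * * * root /sbin/service network restart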

3/24

  • Checking in.

3/25

  • Checking in.

3/26

  • Checking in.

3/27

  • Restarted the network services again and the link came back up. I tried using the GUI to create a static connection to eth0, but I am not sure if it has resolved the issue (side note - it's later in the day and the interface has gone back down - issue not resolved). I compared the ifcfg-eth0 config file with caesar and everything appeared to be the same. I am not sure why it keeps losing its connection. I will have to reboot the machine and try troubleshooting it another day. I was able to add banners to Majestix and Miraculix as well as fix the font issues that Professor Jonas was having with Idefix and Obelix.

3/28

  • The issue I ran into was when I was trying to install Torque MOM on Majestix. I was able to successfully scp the install files over to Majestix and install them, but when I tried initializing the pbs_mom service, it failed to start with an error that said: "Starting TORQUE Mom: /usr/local/sbin/pbs_mom: error while loading shared libraries: libxml2.so.2: cannot open shared object file: No such file or directory [FAILED]". I have a feeling that I either didn't uninstall the Torque server correctly on Majestix or never updated the shared libraries via yum update/install libxml2.so.2. I also noticed that when looking at all the running services on Majestix, the pbs_mom service showed "pbs_mom dead but subsys locked [FAILED]". After some Googling, I found this: "This means the service was running at one time, but has crashed. When you start a service, it creates a "lock" file to indicate that the service is running. This helps avoid multiple instances of the service. When you stop a service, this lock file is removed. When a running service crashes, the lock file exists but the process no longer exists. Thus, the message." I will have to go into the server room to see if updating the shared libraries fixes the issue. If that doesn't work, I think I may have to rebuild Majestix and start all over again.
  • I found that in the /.ssh/known_hosts file, the RSA key was old and no longer valid, so I commented out the old key to have a new one created when trying to access the machine. It appeared to have worked, because I was then able to get into it. Another thing I looked into was the unbound DNS resolver account that made it into Caesar's passwd file. Going through the yum update logs, I surmise that it ended up there because of the yum updates I did when I rebuilt Majestix and Miraculix.
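As a follow-up to the libxml2 error above, one way to surface every missing shared library at once instead of hitting them one at a time (binary path taken from the error message):

# list pbs_mom's shared-library dependencies; missing ones print "not found"
ldd /usr/local/sbin/pbs_mom
# yum on Red Hat 6 accepts a library name directly, e.g.:
yum install libxml2.so.2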
Plan

3/22

  • Professor Jonas also wanted me to research why it throws an error when trying to SSH into a newly built system and fix it. He also wanted me to figure out why "Unbound DNS resolver" was the name of an account within Majestix and Miraculix and how it somehow made its way into Caesar's account directory. Another task will be to add a Banner to both Majestix and Miraculix since they were just completely rebuilt.

3/23

  • The plan is to finish installing Torque Server on Rome and install Torque MOM node on Majestix. Also, still need to figure out the unbound issue with the account file on Majestix and Miraculix as well as add the AD accounts to both rebuilt drones.

3/24

  • Checking in.

3/25

  • Checking in.

3/26

  • Checking in.

3/27

  • The plan is to finish installing Torque Server on Rome and install Torque MOM node on Majestix. Also, still need to figure out the unbound issue with the account file on Majestix and Miraculix as well as add the AD accounts to both rebuilt drones.

3/28

  • The plan now is to troubleshoot the pbs_mom install on Majestix and figure out whether updating the shared libraries fixes the issue.
Concerns

3/22

  • My concerns are that Torque is going to take too much time setting up and working out all the bugs.

3/23

  • Still concerned about Torque.

3/24

  • Checking in.

3/25

  • Checking in.

3/26

  • Checking in.

3/27

  • Concerned about Torque.

3/28

  • Slowly but surely making progress on Torque, but still concerned that it will not work for some reason unknown.

Week Ending April 4, 2017

Task

3/29

  • Continue to look into why the Unbound DNS Resolver was created on Caesar. Troubleshoot install errors on Torque MOM node: Majestix.

3/30

  • Configure the Cisco switch and reorganize the rack and cabling. Continue to troubleshoot install errors on the Torque MOM node: Majestix.

4/3

  • Today, one of my tasks was to troubleshoot the connectivity to Asterix and Miraculix. I am not sure why certain machines keep losing their connection to the 192.168.10.0/24 subnet; I may need to enable a cron job to restart the network services on a daily schedule. Asterix was responding to pings but not SSH. Vitali said the session would hang when trying to initialize, and I logged onto Caesar and confirmed that it was giving me the same issue. Then the Tools team told me that Miraculix was not responding to anything at all, which I also confirmed on my end.

4/4

  • Task was to try one more attempt at gaining management access to the Enterasys switch. I had a suspicion that the RJ45 to DB9 adapter I was first using may not have been fully functioning. I luckily had one more adapter at home to try out.
Results

3/29

  • Defaulted the Cisco switch and was able to log into it. We should be able to create multiple VLANs to use for connecting all the drones together.

3/30

  • Going off the error thrown when I tried running the Torque MOM service (pbs_mom), I gave Internet access to Majestix and did a yum install of libxml2.so.2. When I tried running the service after that, I got the same error, but the shared library that was missing was libxml2.so.3, so I ran another yum install for that package instead. I got the same error a third time, but with a different library (libxml2.so.6). After installing that one, there was no error the next time and the Torque MOM service came right up. After that, installing the Torque Client Host went smoothly.
  • I was able to get VLAN 1 to work on the Cisco switch, but the purpose of the Cisco switch was to be able to use multiple VLANs so it looks like we're going to have to stick with the Enterasys. There was an email from the professor saying that he'd rather stick with the 48 port switch instead just in case they want to add more drones in the future. So now we're back to troubleshooting how to manage the Enterasys switch.
  • I noticed that there was a label on the top of the Enterasys that had a name and IP address (192.168.188.2). So the first thing I tried was plugging my laptop into it and giving it a static IP address on the same subnet. I gave my laptop an IP of 192.168.188.100/24. The only thing I was able to do was successfully ping the .2 address. I was unable to SSH, telnet, or browse via http/https. Unfortunately, I had to stop for the day at that point.

4/3

  • I stopped by the server room before class to troubleshoot the issues with the two drones. To resolve the issues with Miraculix, I restarted the network services and it started responding to ping and management. As for Asterix, I was not able to get the desktop to respond to the monitor, keyboard, or mouse. Nothing was coming up on the screen. So I manually rebooted Asterix. It booted up fine and I logged into root on the desktop. Once Asterix was fully operational, the network services were working and responding to ping and management.
  • First I tried plugging the USB to DB9 adapter into my laptop with a female-to-female DB9 adapter between the console port and the USB adapter (just like it looks in the picture). I opened PuTTY, set it to serial COM3 at 9600 baud, and tried opening a terminal. Nothing happened; the switch did not respond to any of the inputs I was sending through the wire. Next I tried the following configuration: COM port => PC adapter => RJ45 Cisco rollover cable => RJ45 to DB9 adapter => Console port. No dice. No response from the switch.

4/4

  • I plugged in the new null modem adapter and opened the console session in PuTTY. The terminal came to life and asked for a username and password. Unfortunately, I was not able to guess the password, so I did a quick Google search and found that there was a password reset button on the back of the switch. I used a pen to press the button and saw the terminal light up again. The button had reset the username to admin and removed the password, so all you have to do is press Enter at the password prompt. Once I was in the switch, I set its IP address to 192.168.10.100 and pointed it to Caesar. Next, I enabled SSH on the switch because telnet was not installed on any of the servers. Then I was able to SSH into the switch from Caesar. Once that was all set, I reinstalled the Enterasys switch and removed the Cisco. I also cleaned up the cable management and made it look somewhat decent. Later at night, I was messing around with the configuration of the switch from home when I made a mistake while trying to reconfigure some of the VLAN tags on the switch ports. This mistake caused loss of management access to the switch and access to all of the drones from Caesar (Caesar itself was still accessible though). Easy enough fix though; I just need to reboot the switch since I did not save the configuration change I made.
Plan

3/29

  • Finish configuring Cisco switch and move over all the cables.

3/30

  • With Torque now installed on two machines, it's now time for initializing the system and testing it out. I started with making sure all node(s) were reporting by entering the command # pbsnodes -a which returned:
 state = free
power_state = Running
np = 4
properties = cluster01
ntype = cluster
status = rectime=1491191315,macaddr=00:19:b9:e7:51:7a,cpuclock=UserSpace:2333MHz,varattr=,jobs=,state=free,netload=36801607,gres=,loadave=0.80,ncpus=8,physmem=16333236kb,availmem=22856460kb,totmem=23435696kb,idletime=288972,nusers=1,nsessions=1,sessions=9932,uname=Linux majestix 2.6.32-642.15.1.el6.x86_64 #1 SMP Mon Feb 20 02:26:38 EST 2017 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
  • There is a console port on the side of the Enterasys switch. It appears that the console port is our last hope at gaining access to the switch and hopefully gain access to multiple VLANs. I have a couple different adapters to try out on the switch. Will have to try them next time I go into school.

4/3

  • Tomorrow, I am going to try to go to school early to get some work done in the server room. I would like to power cycle the Enterasys switch and try to gain management access to it. I received a message from Huong Ha @ 8:20 PM: "Hey Andrew, Jonas wanted to coordinate with the systems group before we go forward with our C++ install on Caesar. He just wanted us to know if Brutus could be rebuilt to be Caesar if the g++ install messes up anything... even though that's a very small chance... Just need the okay from the systems group." So now we need to make a backup of Caesar and rebuild Brutus to be a Caesar backup.

4/4

  • Wake up early to go to school so I can reboot the switch and regain remote access to it from Caesar and the drones. Then, I would like to split up the ports onto different VLANs so we can attempt to give Internet access to all of the drones on a separate subnet that points out of the WiFi dongle that Julian installed on Rome.
  • Bonnie and I also need to look into initializing Torque and getting something to actually work on it and prove that the concept works on the setup we have.
Concerns

3/29

  • Concerned that I will have to rebuild Majestix because I did not uninstall Torque correctly.

3/30

  • Problems with the drone interfaces keep coming up. Asterix is responding to pings but stalls on any SSH access attempt. Miraculix is not responding to pings or SSH. Will need to look at both machines on Monday.

4/3

  • No concerns at this time.

4/4

  • A little concerned that Torque will end up taking more work than it is worth or that we will not be able to finish it by the end of this semester.

Week Ending April 11, 2017

Task

4/5

  • After we gave the professor our updates from the previous week, Jonas released us to work with our groups. Some of our tasks included testing Torque, configuring Rome to be a router, setting up separate VLANs on the Enterasys switch, and making sure Brutus can replace Caesar if Caesar gets messed up by some installs we have to do on it. We also need to add our user profiles to Miraculix, make sure all the ssh-keygens work, and look into why the unbound DNS resolver made it into the /etc/passwd config file. Lastly, we need to troubleshoot why there is so much packet loss between Rome and the backup server.

4/10

  • One of my tasks today was to take pictures of the hardware rack and upload them to the URC poster that Bonnie made for our group. I also needed to add the class's active directory accounts to Rome and Miraculix because Rome was never assigned to any of the groups except Systems and did not need all the active directory accounts added. Miraculix had just been rebuilt and did not have any of our student accounts added either. Professor Jonas showed me where the script was located in /mnt/main and how to execute it.

4/11

  • I ended up helping out Julian with the WiFi dongle issue in the server room for some time today. I also needed to continue my work on Torque. We need to properly test the Torque system and try to run a test batch job. First we need to verify that all queues are properly configured. Next, we have to look over the additional server configuration. Then, we need to verify that all nodes are correctly reporting. Next, we need to submit a basic job as a non-root user and verify jobs display.
Results

4/5

  • I was able to clean up the power cable mess in the back of the server cabinet. I used velcro ties I had bought for work a long time ago. I also organized the Ethernet cables so it looks somewhat decent for the URC poster pictures. I came into school today to fix the issue I had caused with the Enterasys switch. I first tried rebooting it while running a constant ping from Caesar to Rome to see if power cycling the switch would revert back to the old config before I wrote it to memory. No luck. Next, I took out my console cable and plugged it into the switch to gain direct access. I reset the port config changes I had made last night and everything came back online and was responding to pings and management from Caesar. I then labeled some of the ports in the switch. One thing I tried was to see if I could plug the orange Internet cable into one of the open ports on the switch and put it on its own separate VLAN (400) which I named as "DMZ". My idea was that if you changed a port config to only be on VLAN 400, then you could give Internet access to any one of the drone machines remotely one at a time. This would of course be only a temporary solution until we can get Rome's WiFi dongle to work and get it to route all the traffic through the wireless interface. Also, the professor told us that he will not always have access to the orange Internet cable. Unfortunately, I could not get my idea to work as I intended it to and eventually had to give up.
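For the record, the VLAN 400 idea above would look roughly like this in Enterasys SecureStack syntax; I'm writing this from memory, so the exact commands and the port name are assumptions:

set vlan create 400
set vlan name 400 DMZ
# move a single port onto VLAN 400 untagged (port designation hypothetical)
set port vlan ge.1.10 400 modify-egress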

4/10

  • I was able to take a picture of both the front and the back of the equipment rack and uploaded them to the Google drive where the URC poster was being shared. The URC poster we are presenting is focusing on our Torque install (https://gyazo.com/5e8475bd1b312ba14d1a92bc3c98e334). Next, I logged into Caesar as root and ssh'd into Rome. I found the admin script to add all of our student active directory accounts. The path to the script is /mnt/main/scripts/admin/redhat/sp17/cis790_users.csh. At first I was a little nervous to execute the script because when professor Jonas showed me, he went through the process very quickly and I was not sure if there was something I may have missed. However, I knew where the script was located and I knew that you had to be logged into the machine that you wanted to add the accounts to, but I wasn't sure if you had to be in a specific directory or what. So when I asked the professor, he just said that you have to be logged into the machine as root and it doesn't matter what directory you are in as long as you are logged in as root. So that is what I did. I logged into Rome and ran the following command as root: tcsh /mnt/main/scripts/admin/redhat/sp17/cis790_users.csh. I received an error saying that the group "cis790" did not exist. So I ran the following command: groupadd cis790. Then I ran the script again and it gave a different error for every account it added: useradd: warning: the home directory already exists. Not copying any file from skel directory into it. However, when I checked /etc/passwd I saw that all of our accounts had been added. So I went through the same process on Miraculix and got the same results. To test it all out, I logged into Caesar as acg12 (my AD account) and ssh'd into Rome and boom, got right in without a password. I backed out of Rome and ssh'd into Miraculix with the same result!

4/11

  • Julian was able to configure Rome to route the traffic from the 172.16.10.0/24 subnet, through the WiFi dongle. However, when trying to ping out from Asterix, we received an error that said destination host prohibited. Julian is going to mess with the firewall rules and iptables configurations within Rome to see if he can get it to work. The real problem we have with this WiFi dongle is that it only appears to have the ability to connect to UNH-Public which needs to be authenticated via a web browser and then the session has a timeout of about 30 mins. Not only that, but the connection is very intermittent and extremely unreliable. It will disconnect randomly and sometimes loses its IP. The dongle is plugged into the back of Rome where it is completely surrounded by metal and noise (things that are not conducive to receiving a clean signal from an access point that is not even in the same room as the server rack).
  • For Torque, I verified that all queues are properly configured by running the following command:
[acg12@rome ~]$ qstat -q

which then outputted:

server: rome

Queue   Memory   CPU Time   Walltime   Node   Run  Que  Lm  State
-----   ------   --------   --------   ----   ---  ---  --  -----
batch   --        --        --         --     0    0    --  ER
                                              ---  ---
                                              0    0

Next, I ran the following to view the additional server configuration:

[acg12@rome ~]$ qmgr -c 'p s'

That returned:

# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = rome
set server managers = acg12@rome
set server operators = acg12@rome
set server default_queue = batch
set server log_events = 2047
set server mail_from = adm
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 300
set server poll_jobs = True
set server down_on_error = True
set server mom_job_sync = True
set server keep_completed = 300
set server next_job_number = 1
set server moab_array_compatible = True
set server nppcu = 1
set server timeout_for_job_delete = 120
set server timeout_for_job_requeue = 120

Then, I verified that the node was correctly reporting:

[acg12@rome ~]$ pbsnodes -a

Which returned:

majestix
    state = free
    power_state = Running
    np = 1
    ntype = cluster
    status = 
rectime=1491971979,macaddr=00:19:b9:e7:51:7a,cpuclock=UserSpace:2333MHz,varattr=,jobs=,state=free,netload=9391433
1,gres=,loadave=0.96,ncpus=8,physmem=16333236kb,availmem=22849124kb,totmem=23435696kb,idletime=514680,nusers=1,ns
essions=1,sessions=9932,uname=Linux majestix 2.6.32-642.15.1.el6.x86_64 #1 SMP Mon Feb 20 02:26:38 EST 2017 
x86_64,opsys=linux
    mom_service_port = 15002
    mom_manager_port = 15003

Now it was time to submit a basic job as a non-root user.

[acg12@rome ~]$ echo "sleep 30" | qsub

Returning:

2.rome

The reason it was at 2 is that I had run that command 3 times. Lastly, I verified the jobs display:

[acg12@rome ~]$ qstat

Returning:

 Job ID                    Name             User            Time Use S Queue
 ------------------------- ---------------- --------------- -------- - -----
0.rome                     STDIN            acg12                  0 C batch
1.rome                     STDIN            acg12                  0 C batch
2.rome                     STDIN            acg12                  0 Q batch

The configuration guide finished with the following:

At this point, the job should be in the Q state and will not run because a scheduler is not running yet. 
Torque can use its native scheduler by running pbs_sched or an advanced scheduler (such as Moab Workload 
Manager). See Integrating schedulers for details on setting up an advanced scheduler.
Plan

4/5

  • The plan now is to continue to look into testing Torque and reach out to Zach from last year who wrote a report on Torque and had it successfully set up on Brutus. Bonnie has volunteered to do the URC poster so I will assist with pictures and anything else she might need help on. Next, I need to make sure I know what I am doing when executing the csh script for adding the class usernames to a drone's /etc/passwd config file.

4/10

  • Now, the plan is to figure out how to get my AD account ssh key-gen to work between Rome and Majestix because apparently it is needed to communicate between the Torque server and node. The next thing to focus on is Torque itself. Bonnie and I will need to dig into the configuration of both the server host and the node to see if we can get it to actually run a test batch job successfully. Considering that it has been installed and I have verified that the node is reporting, it shouldn't be too bad to get it to test out alright, but who knows. There are always issues that pop up when not expected.

4/11

  • The plan now is to figure out how to configure the scheduler and get it up and running. Either we try to use Torque's native scheduler by running pbs_sched or an advanced scheduler. We will have to read through the Integrating schedulers section to learn more about it or just look into how to use pbs_sched.
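A minimal sketch of the native-scheduler route, assuming the default Torque install prefix on the server host:

# start Torque's bundled FIFO scheduler on Rome
/usr/local/sbin/pbs_sched
# queued jobs should then move out of the Q state
qstat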
Concerns

4/5

  • No concerns at this time.

4/10

  • No concerns at this time.

4/11

  • Concerned that the WiFi dongle was a waste of time. It was still a learning experience, but in the end I believe it is a failed solution to the problem of giving Internet access to all of the drones.

Week Ending April 18, 2017

4/13
Task
  • We still need to figure out how to configure, enable, initialize, and manage the scheduler for Torque. Then, we need to troubleshoot the connection from the server room to the tech consultant office, where the backup computer is kept. A ticket had been opened with UNH IT, and we wanted to make sure that there really was an issue with the cable run before we have a tech show up and charge the school money to get something replaced. Another thing we have to do is figure out how to ssh into Majestix from Rome with the SSH keygen. Lastly, we still needed to figure out why Asterix was getting a destination host prohibited error when trying to ping out to the Internet.
Results
  • Bonnie and I did some research online to see if we could figure out how to continue with Torque. We were unable to get pbs_sched to enable and start as a service. I even tried creating a new directory path (/usr/lib/systemd/system/) on Rome to see if the "configure the scheduler" section of the config guide (https://www.webmo.net/support/torque.html) would work. Rome did not recognize the systemctl command, so I tried doing a yum install of it, but to no avail. There was next to no documentation on how to access or do anything with pbs_sched. After many failed attempts to get pbs_sched to initialize, we decided to see if Maui (a different scheduler) might be an easier method.
  • To test the connection between the server room and the tech consultant office, Julian connected his laptop on one end and I connected mine to the other. We statically assigned IP addresses on the same subnet to our laptops' Ethernet interfaces and tried pinging each other. We did not see any latency or major packet loss. Next we plugged into the little dummy switch on the patch panel and tested: no packet loss or latency either. Then we plugged in Rome and saw a clean test. Once we plugged in the backup computer, we saw the heavy packet loss and high latency. The network interface on that computer was either failing or faulty, so Mark emailed IT and cancelled the ticket.
  • I found that I was unable to SSH into Rome as my account. To solve that, I ran the cis790_passwd_batch.csh script on Rome, and then I was able to access it. Next, I found that I was able to SSH from Majestix to Rome but not the other way around, even after redoing the ssh-keygen process.
  • Julian solved the problem with not being able to ping out to the Internet on Asterix. He adjusted a rule within the iptables config (see Julian Consoli's logs).
Plan
  • Now we need to look into installing Maui. We also need to make a backup of Caesar using rsync which will most likely be done on Rome for now until we can rebuild the backup computer or fix the interface. Brutus needs to be combed through to make sure it can take over as Caesar. May need to install Red Hat on Brutus since openSUSE is currently installed on it. I would also like to just configure Rome to route everything out the orange Internet cable for now until we can figure out a more reliable way of connecting to the WiFi as well as the UNH-Secure SSID. I don't think UNH IT will see any of the drone MAC addresses behind Rome since it is routing the traffic instead of bridging it. They should only be seeing the MAC address of Rome's Internet interface.
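A sketch of what the Caesar backup might look like, run from Rome (the destination path and exclusion list are assumptions):

# pull Caesar's filesystem into a local backup directory, preserving attributes
rsync -aAXv --exclude="/proc/*" --exclude="/sys/*" --exclude="/dev/*" \
      --exclude="/tmp/*" --exclude="/mnt/*" root@caesar:/ /backup/caesar/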
Concerns
  • Only concern is that we might not be able to get Torque fully operational before the end of this semester.
4/15
Task
  • Need to read through the installation guide for Maui (http://docs.adaptivecomputing.com/maui/mauistart.php) to see if we can get it running on Rome. I am not sure if Rome is supposed to be the machine the scheduler is installed on, but I didn't think it would be a good idea to install it on the node in case there was a conflict between the scheduler and the pbs_mom software. Nothing in the documentation says which machine to install Maui on, so I made the executive decision to install it on Rome.
Results
  • I followed the quick start guide to install Maui on Rome. As a note, I ran the installation as the root user. I am not 100% sure I was supposed to do that, because the guide never specifically says it has to be a non-root user; all it says is that the user needs admin access to the resource manager that will be used, so I assumed root would be the best option since it has global access to everything on the system. The first thing I did was download the Maui package from Adaptive Computing (http://www.adaptivecomputing.com/support/download-center/maui-cluster-scheduler/), which requires registering on the website to access the download. I downloaded it directly from Rome in the server room after giving Rome Internet access (I'm sure there is another way to do it with a wget command, but I was already in the server room at the time and figured it would be easier that way). I used the browser to get to the website, signed in, and copied the file into the Torque directory. The rest of the installation process I did remotely from home, following step one of the quick start guide. Here are some links to the logs of my command prompt:
https://drive.google.com/file/d/0B64eqpTcga_dMWxib09SYmpBVVU/view?usp=sharing - configure
https://drive.google.com/file/d/0B64eqpTcga_dMFltUTBxOGt1VGM/view?usp=sharing - make
https://drive.google.com/file/d/0B64eqpTcga_dMmlBSTQwS1Zjbnc/view?usp=sharing - make install
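
For anyone retracing step one, it boils down to the standard build sequence below. This is a sketch based on the quick start guide; the source path is taken from where the Maui binaries ended up, and the --with-pbs prefix is an assumption based on a default Torque install:

cd /usr/local/src/maui-3.3.1        # where the Maui source was unpacked
./configure --with-pbs=/usr/local   # point Maui at the Torque (PBS) install; prefix assumed
make
make install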

Step two was to configure the resource manager. I looked through some of the integration guides, but I was not familiar enough with the configuration of Torque to determine whether I should be messing with any of those settings. The guide also says that the initial Maui install lays down a default configuration to at least start it off (I may have to go back and comb through it more thoroughly). Step three was to configure Maui. The guide says there is a Maui configuration file named maui.cfg; however, I did not see a file with that name. The only file that looked like the example was named maui.cfg.dist. The only thing I changed in the config file was the SERVERMODE parameter, which I set to TEST. The next step after changing it to test mode was to export the PATH and run the command maui. The issue I ran into was that when I ran the maui command, nothing happened. The guide said to check the maui.log file, but I could not find such a file. I also found that the commands the guide recommended were not working; I had to change directory into bin and execute them in there with "./" in front of them. I was able to get two commands to respond with something, so I may have gotten Maui at least enabled. Here are my results:

1st command:

[root@rome bin]# ./showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of    1 Processors Active (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0

2nd command:

[root@rome bin]# ./showstats
maui active for      00:02:04  stats initialized on Wed Dec 31 19:00:00

Eligible/Idle Jobs:                    0/0         (0.000%)
Active Jobs:                           0
Successful/Completed Jobs:             0/0         (0.000%)
Avg/Max QTime (Hours):              0.00/0.00
Avg/Max XFactor:                    0.00/0.00

Dedicated/Total ProcHours:          0.00/0.03      (0.000%)

Current Active/Total Procs:            0/1         (0.000%)

Avg WallClock Accuracy:            <N/A>
Avg Job Proc Efficiency:           <N/A>
Est/Avg Backlog (Hours):           <N/A>/<N/A>

After seeing this, I set the config back to NORMAL mode from TEST mode.
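
Pulling the configuration steps above together, they would look roughly like this. This is a hedged sketch, not a record of exact commands: copying the .dist file to the name the guide expects is my guess at why maui.log never appeared, and the bin path is the one later confirmed under 4/25.

cd /usr/local/src/maui-3.3.1
cp maui.cfg.dist maui.cfg                        # guess: maui may expect the non-.dist name
vi maui.cfg                                      # set: SERVERMODE TEST  (reverted to NORMAL afterward)
export PATH=$PATH:/usr/local/src/maui-3.3.1/bin  # avoids having to prefix every command with ./
maui                                             # start the scheduler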

Plan
  • There is going to need to be some more testing and configuration changes before I feel completely comfortable using this scheduler. I am still having some difficulty wrapping my head around the entire concept: I understand that Torque distributes a job over an array of combined resources, but how it all works together and what runs in the background is still a mystery to me. I feel like I need to see a fully configured system running on multiple nodes in an environment much like ours, and get a hands-on tutorial from someone who built it all from the ground up. The next phase is to keep digging in and experimenting with the system and its configuration. We also still need to make the rsync backup of Caesar (stored on Rome for now, until we can rebuild the backup computer or fix its interface), comb through Brutus to make sure it can take over as Caesar (which may mean installing Red Hat over the current openSUSE), and configure Rome to route everything out the orange Internet cable until we find a more reliable way of connecting to the WiFi and the UNH-Secure SSID.
Concerns
  • I am concerned I will not get a complete grasp on Torque and Maui before the end of the semester. The only thing I can do is make sure my documentation is good enough for next semester's class to take over from where we left off.
4/17
Task
  • The task for today was to work with Mark on backing up Caesar. The reason we need to back up Caesar is that the Tools group needs to do some yum installs on it. Professor Jonas said he would like a backup of Caesar just in case something goes wrong and it becomes faulty (though he says it should be fine), and he would like Brutus to be able to replace Caesar if that were to happen. Currently, Brutus has openSUSE installed, so I would like to rebuild it with Red Hat 6.6 to match the rest of the machines.
Results
  • First, I found that I was having problems logging into Rome from Caesar using my personal account. I logged into Rome via root and noticed that /mnt/main/ was not mounted. After re-mounting /mnt/main/ on Rome, I was able to SSH into Rome from Caesar under my account using the SSH key. Over a Hangouts call, Mark informed me that rsync should be able to back up the files we need from Caesar well enough to rebuild it on Brutus if we had to. We decided that since the backup computer in the tech consultant office is having issues, we would store the backup on Rome for now. I checked the disk size and usage with the following command:
[root@caesar ~]# df -h

Found that roughly 5GB of the 50GB was in use on both Caesar and Rome. On Rome, I created a directory under /root called BACKUP, and inside it two more directories called caesar and fullcaesar. Mark said we should back up Caesar's root directory (/root) into one folder and all of the directories under / into the other. Caesar's root directory was backed up to root@rome:/root/BACKUP/caesar using the following command:

[root@caesar ~]# rsync -a -e ssh /root/ root@rome:/root/BACKUP/caesar/

Next, we ran the following commands to attempt to fully backup Caesar:

rsync -a -e ssh /bin/ root@rome:/root/BACKUP/fullcaesar/bin/
rsync -a -e ssh /boot/ root@rome:/root/BACKUP/fullcaesar/boot/
rsync -a -e ssh /dev/ root@rome:/root/BACKUP/fullcaesar/dev/
rsync -a -e ssh /Downloads/ root@rome:/root/BACKUP/fullcaesar/Downloads/
rsync -a -e ssh /etc/ root@rome:/root/BACKUP/fullcaesar/etc/
rsync -a -e ssh /home/ root@rome:/root/BACKUP/fullcaesar/home/
rsync -a -e ssh /lib/ root@rome:/root/BACKUP/fullcaesar/lib/
rsync -a -e ssh /lib64/ root@rome:/root/BACKUP/fullcaesar/lib64/
rsync -a -e ssh /lost+found/ root@rome:/root/BACKUP/fullcaesar/lost+found/
rsync -a -e ssh /media/ root@rome:/root/BACKUP/fullcaesar/media/
rsync -a -e ssh /misc/ root@rome:/root/BACKUP/fullcaesar/misc/
rsync -a -e ssh /net/ root@rome:/root/BACKUP/fullcaesar/net/
rsync -a -e ssh /nfsshare/ root@rome:/root/BACKUP/fullcaesar/nfsshare/
rsync -a -e ssh /opt/ root@rome:/root/BACKUP/fullcaesar/opt/
rsync -a -e ssh /root/ root@rome:/root/BACKUP/fullcaesar/root/
rsync -a -e ssh /sbin/ root@rome:/root/BACKUP/fullcaesar/sbin/
rsync -a -e ssh /selinux/ root@rome:/root/BACKUP/fullcaesar/selinux/
rsync -a -e ssh /srv/ root@rome:/root/BACKUP/fullcaesar/srv/
rsync -a -e ssh /testtest/ root@rome:/root/BACKUP/fullcaesar/testtest/
rsync -a -e ssh /tmp/ root@rome:/root/BACKUP/fullcaesar/tmp/
rsync -a -e ssh /user/ root@rome:/root/BACKUP/fullcaesar/user/
rsync -a -e ssh /usr/ root@rome:/root/BACKUP/fullcaesar/usr/
rsync -a -e ssh /var/ root@rome:/root/BACKUP/fullcaesar/var/

I should have spent the time to write a bash script, but because the backup directories lived under root's home on Rome, every rsync ran as root@rome and prompted for the root password anyway (a sketch of what such a script would look like follows). Once the backup was complete, I posted a message in Slack saying that Caesar had been backed up and was ready for installs.
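
For the record, here is a sketch of the script I had in mind (hypothetical; the directory list is copied from the commands above, and it would still prompt for the password on each transfer unless key-based SSH were set up for root):

#!/bin/bash
# Back up each top-level directory of Caesar to Rome.
for dir in bin boot dev Downloads etc home lib lib64 lost+found media misc \
           net nfsshare opt root sbin selinux srv testtest tmp user usr var; do
    rsync -a -e ssh "/$dir/" "root@rome:/root/BACKUP/fullcaesar/$dir/"
done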

Plan
  • The plan is to continue testing and playing with the scheduler, Maui. I would like to get to a point where we could run some legitimate batch jobs, and potentially even run a train with it (although I doubt that will happen before the end of the semester). I would also like to rebuild Brutus with Red Hat 6.6, to the point where it would be easy to cut over to it in the case of a disaster with Caesar.
Concerns
  • My concerns lie only with Torque. I feel like there is way too much to research and tinker with before fully understanding it and getting it to do what you want. I honestly think that at this point, the Tools group or another team should take over, since it has been fully installed and somewhat configured.
4/18
Task
  • Read over Torque/Maui documentation
Results
  • Read over Torque/Maui documentation
Plan
  • Read over Torque/Maui documentation
Concerns
  • Read over Torque/Maui documentation

Week Ending April 25, 2017

4/20

Task
  • Read over Torque/Maui documentation
Results
  • Read over Torque/Maui documentation
Plan
  • Read over Torque/Maui documentation
Concerns
  • Read over Torque/Maui documentation

4/23

Task
  • Checking in
Results
  • Checking in
Plan
  • Checking in
Concerns
  • Checking in

4/24

Task
  • Remove the backup of Caesar's root directory that we made on Rome.
  • Fix the SSH key for my account on Obelix; I could not SSH in as acg12 from Caesar.
Results
  • Deleted the directory BACKUP on Rome.
  • Updated the SSH key for my account on Caesar (a sketch of the standard key setup we were attempting follows this list).
    • Unsuccessful
  • Ran the following commands:
[root@obelix ~]# tcsh /mnt/main/scripts/admin/redhat/sp17/cis790_passwd_batch.csh
[root@obelix ~]# tcsh /mnt/main/scripts/admin/redhat/sp17/cis790_passwd_caesar.csh
  • Unsuccessful
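
For reference, the standard OpenSSH key setup between two machines looks like this (a sketch, not the exact commands we ran; the host and account names are taken from the log):

ssh-keygen -t rsa              # on Caesar, as acg12; accept the default file locations
ssh-copy-id acg12@obelix       # append the public key to ~/.ssh/authorized_keys on Obelix
ssh acg12@obelix               # should now log in without a password prompt
# If it still prompts, permissions are a common culprit on the remote side:
# chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys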
Plan
  • Work on backup machine in tech consultant office.
  • Test Torque and Maui more.
Concerns
  • Not sure how to get to a point where we can run trains and decodes via Torque/Maui.

4/25

Task
  • My task for today is to dig through the configuration of Torque and Maui to see how they work together and figure out how to integrate them into something the class can use to process their trains and decodes faster.
Results
  • I tried going through the guide on testing the server configuration again (http://bit.ly/2oHm3Mo). I submitted a job under my userid (acg12) on Rome and it returned a job id of 5.rome (I had run several other jobs previously). I then logged into Rome via root and changed directory into Maui's bin folder to try some commands. The path to the Maui command directory is as follows:
[root@rome ~]# cd /usr/local/src/maui-3.3.1/bin/

The commands in this directory are:

-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 canceljob
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 changeparam
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 checkjob
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 checknode
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 diagnose
-rwx--x--x.  1 root root 2758442 Apr 15 13:12 maui
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 mbal
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 mclient
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 mdiag
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 mjobctl
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 mnodectl
-rwxr-x--x.  1 root root 2476055 Apr 15 13:12 mprof
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 mschedctl
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 mstat
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 releasehold
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 releaseres
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 resetstats
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 runjob
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 schedctl
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 sethold
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 setqos
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 setres
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 setspri
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showbf
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showconfig
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showgrid
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showhold
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showq
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showres
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showstart
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showstate
-rwxr-x--x. 30 root root 2554020 Apr 15 13:12 showstats

I couldn't figure out a way to run the commands from anywhere other than this directory, presumably because it is not in root's PATH; I had to put "./" in front of each command to execute it. Here are some descriptions of what the commands do (https://gyazo.com/513609ded6786f02ccffa41c18de352e). I tried running several of them, but I was unable to find anything useful. The "showq" command returned the following:

[root@rome bin]# ./showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


     0 Active Jobs       0 of    1 Processors Active (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0

However, I was able to get some interesting information out of the "showstats" command:

[root@rome bin]# ./showstats

maui active for   10:02:26:05  stats initialized on Wed Dec 31 19:00:00

Eligible/Idle Jobs:                    0/0         (0.000%)
Active Jobs:                           0
Successful/Completed Jobs:             1/1         (100.000%)
Avg/Max QTime (Hours):              0.00/0.00
Avg/Max XFactor:                    0.01/0.01

Dedicated/Total ProcHours:          0.01/242.43    (0.004%)

Current Active/Total Procs:            0/1         (0.000%)

Avg WallClock Accuracy:           0.833%
Avg Job Proc Efficiency:          0.000%
Est/Avg Backlog (Hours):            0.00/0.00

As the "Successful/Completed Jobs" line shows, 1/1 jobs completed (100%). This tells me that Maui is communicating with Torque on some level, but I am still having difficulty figuring out where to go from here.
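
For whoever picks this up, an end-to-end sanity check along the lines of the guide would look something like this (a sketch; the sleep job and job id are illustrative, qsub and qstat come with Torque, and showq/checkjob are the Maui commands listed above):

echo "sleep 60" | qsub    # as a normal user on Rome; returns a job id such as 6.rome
qstat                     # Torque's view of the queue
./showq                   # Maui's view (run from the Maui bin directory)
./checkjob 6              # detailed Maui state for that job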

Plan
  • The plan now is to do some more research and digging into the documentation of Maui and Torque. We also need to take care of the backup issue between Caesar and the backup machine in the tech consultant room. Mark said it is running a version of Hyper-V, which neither of us has any experience with, so we may need to consult Professor Jonas about alternatives for the backup system.
Concerns
  • My only remaining concern is not being able to fully integrate Torque and Maui into something the Sphinx decoder can use.

Week Ending May 2, 2017

4/26

Task
  • Work with Mark and Julian on troubleshooting the backup machine located in the tech consultant's office.
Results
  • Found that none of the NICs on the backup machine were functioning properly. We changed the subnet on the link to the backup machine to 172.16.0.0/24, then tested with a laptop plugged into the port in the tech consultant's office and were able to successfully ping the default gateway, Rome (172.16.0.11). That test told us the cable, link, and subnet were all working properly, so we just needed to troubleshoot the backup machine itself (a sketch of the test follows this list).
  • It eventually came down to completely rebuilding the backup machine, and we decided that ESXi would be the best OS for it, because we could then install Red Hat 6.6 in a VM to match Caesar. Having ESXi on the backup machine would also allow future virtual machines to be created for other purposes. Mark wiped the backup machine and installed ESXi on it; that was when he confirmed that none of the machine's NICs worked. Mark is now working on obtaining a new NIC for the backup machine to see if that will fix it.
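
The laptop test above amounts to statically addressing the wired interface and pinging the gateway (a sketch, assuming a Linux laptop; the laptop address is made up within the subnet from the log):

ifconfig eth0 172.16.0.50 netmask 255.255.255.0   # static address on the laptop's wired NIC
ping -c 10 172.16.0.11                            # Rome, the default gateway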
Plan
  • Work with Mark to get the backup server configured and initialized.
  • Work on Torque documentation.
  • Work on group log documentation.
Concerns
  • No concerns at this time.

4/27

Task
  • Work with Bonnie over a Google Hangouts call to write up and go over the Torque documentation.
Results
  • We worked on a shared Google document and wrote out a step-by-step guide on what we did to install and configure Torque as well as Maui.
Plan
  • Keep working on Torque and Maui documentation.
Concerns
  • No concerns at this time.

4/28

Task
  • Torque documentation.
Results
  • Worked on Torque documentation.
Plan
  • Group log documentation.
Concerns
  • No concerns at this time.

4/29

Task
  • Checking in.
Results
  • Checking in.
Plan
  • Checking in.
Concerns
  • Checking in.

5/2

Task
  • Group log documentation.
Results
  • Worked on group log documentation.
Plan
  • Review all documentation for final report.
Concerns
  • No concerns at this time.

Week Ending May 9, 2017

5/3

Task
  • Continue to work on Torque and Maui documentation.
  • Continue to troubleshoot the backup server situation.
Results
  • Worked on Torque and Maui documentation.
  • Found that both the T100 and T105 servers that were given to us to use as the backup server had faulty hardware: both PowerEdge machines experienced heavy packet loss when pinged from Rome. Worked with Mark, Julian, and Professor Jonas to troubleshoot the servers before discovering the faulty hardware.
Plan
  • Will need to continue troubleshooting backup server and/or come up with an alternative solution.
Concerns
  • Running out of time before getting the backup server running with an active backup of /mnt/main.

5/4

Task
  • Work with Mark remotely to set up and install a new server besides the T100 or T105. I believe Mark was able to get his hands on a server that the tech consultants have.
Results
  • Mark installed ESXi and configured its network interface as 172.16.0.22. He then built a VM using the Red Hat CD, which he had turned into an ISO using InfraRecorder (http://infrarecorder.org/). Once the VM was built, he configured the interface on the Red Hat VM as 172.16.0.25. Rome was able to ping both the ESXi host and the Red Hat VM with no packet loss whatsoever. With the networking portion of the task complete, the next step was to partition the hard drive into two sections so we could keep the Red Hat OS on a smaller partition and the /mnt/main backup on a larger one (a sketch of one way to do that is below).
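
One way to do the partitioning from inside the Red Hat VM, assuming a second virtual disk /dev/sdb is added through vSphere (device names and the mount point are assumptions, not what was actually done):

fdisk /dev/sdb       # interactively create a single partition, sdb1, on the second virtual disk
mkfs.ext4 /dev/sdb1  # format it
mkdir -p /backup
mount /dev/sdb1 /backup
echo "/dev/sdb1 /backup ext4 defaults 0 2" >> /etc/fstab   # remount automatically at boot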
Plan
  • The plan will most likely be to use vSphere to partition the hard drive and then reinstall Red Hat on a rebuilt VM.
Concerns
  • Running out of time before getting the backup server running with an active backup of /mnt/main.

5/5

Task
  • Get on a Hangouts call with Mark and provide support while he partitions the hard drive and sets up the rsync snapshot tool.
Results
  • Once the virtual machine's hard drive was successfully partitioned, Mark installed the rsync snapshot tool and configured it to point at /mnt/main (a sketch of what that configuration might look like follows). We ran into issues when trying to set up the SSH key between the machines.
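
Assuming the tool in question was rsnapshot (my best guess at the name; the paths and values below are assumptions, not Mark's actual config), a minimal configuration pulling /mnt/main from Caesar would look like:

# /etc/rsnapshot.conf -- fields must be TAB-separated in a real config
cmd_ssh         /usr/bin/ssh                       # enable pulls over SSH
snapshot_root   /backup/snapshots/
retain          daily   7                          # keep seven daily snapshots
backup          root@caesar:/mnt/main/    caesar/  # needs key-based SSH login

It would then typically run unattended from cron (e.g. "rsnapshot daily"), which is consistent with the SSH key issue blocking the setup.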
Plan
  • Will need to troubleshoot the virtual machine on site tomorrow.
Concerns
  • No concerns at this time.

5/6

Task
  • Checking in.
Results
  • Checking in.
Plan
  • Checking in.
Concerns
  • Checking in.

5/7

Task
  • Continue to work on Torque and Maui documentation.
Results
  • Worked on Torque and Maui documentation.
Plan
  • Work on Final report.
Concerns
  • No concerns at this time.

5/8

Task
  • Work on Final report.
Results
  • Worked on Final report.
Plan
  • Continue to work on Final report.
Concerns
  • No concerns at this time.

5/9

Task
  • Work on Final report.
Results
  • Worked on Final report.
Plan
  • Continue to work on Final report.
Concerns
  • No concerns at this time.