Cluster Administration - tum-t38/firefly GitHub Wiki


The information below is currently out of date and may be inaccurate. See MAAS Notes for relevant details.


General Info

Daily reports

Important Servers

pig 10.152.219.21 /zfs/s01/z0[1-4]
cow 10.152.219.24 /zfs/s02/z0[5-7]
dragon 10.152.219.26 /work
hen 10.152.219.28 login
sheep 10.152.219.30 DNS, DHCP, WikiJS, MagicINFO

Troubleshooting

After a reboot of sheep, DNS may not start properly:

sudo service nis restart
sudo service rpcbind restart

Firewall

https://cust-fw199-a.fw.lrz.de

Use port forwarding to access it from outside the T38 network:

ssh sheep -L 4433:cust-fw199-a.fw.lrz.de:443

Then go to https://localhost:4433

Useful Sections
  • System / User Manager / Authentication Servers
  • VPN / OpenVPN / Servers

10.152.219.30
129.187.254.32
10.156.33.123

Backups

Management

Backups are performed with the IBM Tivoli software. Data is stored off-site at the LRZ on tape. Backup nodes are repositories of files; it is recommended that each node store less than 10 million files, so each ZFS raid has its own node. Nodes can be managed through the LRZ website http://datweb.lrz.de/

Backups and restores are initiated on the command line with the dsmc command.
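Typical invocations look like the following. These are illustrative, not commands from this cluster's runbook: the stanza name and paths are examples, and `-se=` selects a server stanza from dsm.sys (an assumption about how the per-node stanzas here are wired up). They must be run on the machine that owns the TSM node, against a reachable TSM server.

```shell
# Run an incremental backup of the domains configured in dsm.sys
# for a given stanza (stanza name below is an example):
dsmc incremental -se=t38-s01-z01-tsm

# List the backed-up versions of a file (path is an example):
dsmc query backup "/zfs/s01/z01/home/someuser/file.txt"

# Restore a directory tree to its original location, recursing
# into subdirectories:
dsmc restore "/zfs/s01/z01/home/someuser/" -subdir=yes
```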

sheep

WARNING: No backup of complete system!

  • TSM Nodes:
    • T38-SHEEP-1

Crontab:

00 02 * * *     test -x /etc/tsm-backup && /etc/tsm-backup T38-SHEEP-1

/opt/tivoli/tsm/client/ba/bin/dsm.sys:

/var/yp/ypfiles/ 
/etc/ 
/home/ 
/opt/t38www/ 
/opt/wiki/ 
/opt/wikijs-home/

pig (s01)

WARNING: No backup of complete system!

  • TSM Nodes:
    • T38-s01-z01
    • T38-s01-z02
    • T38-s01-z03
    • T38-s01-z04

Crontab:

00 02 * * *     test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z01-tsm
00 00 * * *     test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z02-tsm
00 09 * * *     test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z03-tsm
00 20 * * *     test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z04-tsm

/opt/tivoli/tsm/client/ba/bin/dsm.sys:

domain                  "/zfs/s01/z01/home"
domain                  "/zfs/s01/z02/home"
domain                  "/zfs/s01/z03/home"
domain                  "/zfs/s01/z04/home"

cow (s02)

WARNING: No backup of complete system!

  • TSM Nodes (node names are NOT consistent with the host machine anymore):
    • T38-s01-z05
    • T38-s01-z06
    • T38-s01-z07

Crontab:

00 18 * * *     test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z05-tsm
00 09 * * *     test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z06-tsm
00 22 * * *     test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z07-tsm

/opt/tivoli/tsm/client/ba/bin/dsm.sys:

domain                  "/zfs/s02/z05/home"
domain                  "/zfs/s02/z06/home"
domain                  "/zfs/s02/z07/home"

DNS Entries

Initial settings are in /etc/bind/

Adding entries to DNS records

sudo rndc freeze
vi /var/cache/bind/...
# Increment serial numbers!
sudo rndc reload
sudo rndc thaw
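Zone serials commonly follow the YYYYMMDDnn convention (date plus a two-digit revision). Assuming the zone files here use that convention, a small helper for computing the next serial could look like this (`bump_serial` is a hypothetical name, not an existing script on sheep):

```shell
#!/bin/sh
# Print the next YYYYMMDDnn-style serial given the current one.
bump_serial() {
    old="$1"
    today=$(date +%Y%m%d)
    if [ "${old%??}" = "$today" ]; then
        # Same day: increment the two-digit revision. Prefixing with 1
        # and subtracting 100 avoids octal interpretation of a leading 0.
        rev=$(( 1${old#????????} - 100 + 1 ))
        printf '%s%02d\n' "$today" "$rev"
    else
        # New day: restart the revision counter at 01.
        printf '%s01\n' "$today"
    fi
}

bump_serial 2023051402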

Clear DNS cache on laptop

MacOSX:

sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder

Switch Configuration

Main switch is located at 10.152.219.1

Accessing the main T38 switch

The switch can only be accessed via the Management port. Either plug directly into that port and set an IP address manually on the 10.152.219.x network, or plug the management port into another port of the router.
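For the direct-connection case, configuring the laptop can look like this (the interface name and the chosen free address are placeholders, not values from this cluster):

```shell
# Assumption: the laptop's wired interface is named enp0s31f6;
# pick any unused 10.152.219.x address for it.
sudo ip addr add 10.152.219.99/24 dev enp0s31f6

# Then reach the switch's management address:
ssh [email protected]
```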

Activating SFP28 ports

To use 10G transceivers in the 25G SFP28 ports, the port speed must be reduced to 10G. This can only be done through the terminal interface by ssh'ing into the switch. For example:

$ ssh [email protected]
[email protected]'s password:

FS#config
Enter configuration commands, one per line.  End with CNTL/Z.
FS(config)#interface TFGigabitEthernet 0/29
FS(config-if-TFGigabitEthernet 0/29)#port speed-mode 10G
Warning: Ports TF0/29, TF0/30, TF0/31, TF0/32 will be set speed mode 10G. Continue? [Y/N]:Y
FS(config-if-TFGigabitEthernet 0/29)#end
FS#show interfaces TFGigabitEthernet 0/29

Adding a new node

General steps:

  1. Enable ssh root access on the new nodes. This is necessary to ensure configuration files are copied with the correct permissions.
  2. Create patch directories in the cluster configuration setup
  3. Run the installation script over the list of new nodes

In ~admin/CLUSTER_CONFIG the directory t38cn-patch contains a subset of configuration files that should be broadcast to cluster nodes. There is also a directory for each node containing a snapshot of the files on that node. The idea is to first modify the files in t38cn-patch, copy them to the snapshot directories, and then upload those files to each node.

gather-all <NODE-list-file>: copies files from each node to its snapshot directory
scatter-all <NODE-list-file>: copies files from each snapshot directory to its node and runs /etc/T38-UPDATE-SCRIPTS/update
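The actual scatter-all script in ~admin/CLUSTER_CONFIG is not reproduced in this wiki; as a rough sketch of the mechanism it describes (rsync flags and the snapshot-directory layout are assumptions), it presumably amounts to something like:

```shell
#!/bin/sh
# Hypothetical sketch of scatter-all, run from ~admin/CLUSTER_CONFIG.
# Assumes one snapshot directory per node, named after the node.
set -e
NODELIST="$1"
while read -r node; do
    # Push the node's snapshot directory onto the node itself.
    rsync -a "./$node/" "root@$node:/"
    # Run the per-node update hook.
    ssh "root@$node" /etc/T38-UPDATE-SCRIPTS/update
done < "$NODELIST"
```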

When a node has just been configured with the OS, the default is to not allow root to log in via ssh. On T38 machines there is a user account jc with sudo rights; we use that account to enable root access via ssh. The script enable-root-ssh does this by changing /etc/ssh/sshd_config to set PermitRootLogin yes and restarting the ssh server.
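The contents of enable-root-ssh are not shown here; a minimal reconstruction of the edit it describes could look like this (the function is written against an arbitrary config path so it can be exercised on a scratch copy; on a node it would target /etc/ssh/sshd_config with sudo and be followed by a restart of the ssh service):

```shell
#!/bin/sh
# Hypothetical reconstruction of the sshd_config edit made by
# enable-root-ssh; the real script is not reproduced in this wiki.
enable_root_ssh() {
    conf="$1"
    if grep -q '^#\?PermitRootLogin' "$conf"; then
        # Replace an existing (possibly commented-out) directive.
        sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' "$conf"
    else
        # No directive present: append one.
        echo 'PermitRootLogin yes' >> "$conf"
    fi
}

# Demonstrate on a scratch copy rather than the live config:
tmp=$(mktemp)
printf '#PermitRootLogin prohibit-password\n' > "$tmp"
enable_root_ssh "$tmp"
grep PermitRootLogin "$tmp"   # -> PermitRootLogin yes
rm -f "$tmp"
```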

run-new-install connects to each new node, copies over the files in the snapshot directory and runs /etc/T38-UPDATE-SCRIPTS/new-install.

Installing OS on a new node

Ensure tftp on sheep is configured to boot the correct image:

vi /etc/default/tftpd-hpa
service tftpd-hpa restart
service tftpd-hpa status

Add a MAC entry for the node and its IPMI interface to the TUM machine list Google Sheet, then copy the generated host lines to t38-network.dhcp.

vi /etc/dhcp/dhcpd.conf
service isc-dhcp-server restart
service isc-dhcp-server status
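A generated host line typically looks like the following. The hostname, MAC, and address are placeholders, and the boot filename is an assumption; next-server points at the tftp server on sheep.

```
host t38cn50 {
    hardware ethernet aa:bb:cc:dd:ee:ff;
    fixed-address 10.152.219.50;
    next-server 10.152.219.30;
    filename "pxelinux.0";
}
```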

Boot the new machine with the network boot option (default on new machines). The BIOS should get the expected BMC address and find the expected IP address for the network boot. The installation menu will appear.

During the configuration the network hardware will be detected. At this point the suggested name of the machine should be the same as listed in the dhcp configuration. If not, the dhcp settings were not correctly detected.

Partition with 96 GB for swap and the rest for the OS, with swap at the beginning. Don't forget the bootable flag on the OS partition.

Install OpenSSH server and Basic Ubuntu Server.

Install GRUB to the master boot record. Complete the installation and reboot.

Change the BIOS boot settings to boot from the hard drive before the network.

Link speed too low

The network cards all support at least 1 Gb/s. If the link fails to auto-negotiate at this speed, try forcing it with:

sudo ethtool -s eno1 speed 1000 duplex full autoneg off

Replace eno1 with the appropriate device.