Cluster Administration - tum-t38/firefly GitHub Wiki
The information below is currently out of date and may be inaccurate. See MAAS Notes for relevant details.
General Info
Important Servers
| Host | IP address | Role / storage |
| --- | --- | --- |
| pig | 10.152.219.21 | /zfs/s01/z0[1-4] |
| cow | 10.152.219.24 | /zfs/s02/z0[5-7] |
| dragon | 10.152.219.26 | /work |
| hen | 10.152.219.28 | login |
| sheep | 10.152.219.30 | DNS, DHCP, WikiJS, MagicINFO |
Troubleshooting
After a reboot of sheep, DNS may not start properly. If that happens, restart the following services:
sudo service nis restart
sudo service rpcbind restart
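The two restarts above can be wrapped in a small helper so that they are applied in one step after a reboot. This is only a minimal sketch: the function name `restart_services` and the `DRY_RUN` variable are assumptions made for illustration and are not part of the existing setup.

```shell
# Hypothetical helper: restart the listed services in the given order.
# With DRY_RUN=1 the commands are only printed, not executed.
restart_services() {
    for svc in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sudo service $svc restart"
        else
            sudo service "$svc" restart || echo "restart of $svc failed" >&2
        fi
    done
}
```

Calling `DRY_RUN=1 restart_services nis rpcbind` prints the two commands without running them.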
Firewall
https://cust-fw199-a.fw.lrz.de
Use port forwarding to access it from outside the T38 network:
ssh sheep -L 4433:cust-fw199-a.fw.lrz.de:443
Then go to https://localhost:4433
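For repeated use, the tunnel can be started in the background without opening a shell on sheep. The sketch below only builds the command line. The function name and the optional port argument are illustrative assumptions; `-N` (no remote command) and `-f` (go to the background after authentication) are standard OpenSSH options.

```shell
# Hypothetical helper: print an ssh command that forwards a local port
# to the firewall web interface through sheep.
fw_tunnel_cmd() {
    local_port="${1:-4433}"   # local port to listen on (4433 as in the text)
    echo "ssh -f -N -L ${local_port}:cust-fw199-a.fw.lrz.de:443 sheep"
}
```

Run the printed command, then open https://localhost:4433 (or whichever local port was chosen).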
Useful Sections
- System / User Manager / Authentication Servers
- VPN / OpenVPN / Servers
10.152.219.30
129.187.254.32
10.156.33.123
Backups
Management
Backups are performed with the IBM Tivoli Storage Manager (TSM) software. Data is stored off-site on tape at the LRZ. A backup node is a repository of files, and it is recommended that each node store fewer than 10 million files. Each ZFS RAID therefore has its own node. Nodes can be managed through the LRZ website http://datweb.lrz.de/
Backups and restores are initiated on the command line with the dsmc command.
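As a rough illustration of typical dsmc usage, the sketch below prints an incremental backup, a query, and a restore command for one server stanza. It assumes that the names passed to the backup scripts in the crontabs (for example t38-s01-z01-tsm) are stanza names in dsm.sys, which this page does not confirm. The helper name and the restore target /tmp/restore/ are hypothetical.

```shell
# Hypothetical helper: print typical dsmc invocations for one server stanza.
# Assumption: $1 names a stanza in dsm.sys and $2 is a directory below
# one of the domains backed up by that stanza.
dsmc_examples() {
    stanza="$1"
    path="$2"
    echo "dsmc incremental -servername=$stanza"
    echo "dsmc query backup \"$path/*\" -subdir=yes -servername=$stanza"
    echo "dsmc restore \"$path/*\" /tmp/restore/ -subdir=yes -servername=$stanza"
}
```

Restoring into a scratch directory first, rather than over the original path, makes it possible to inspect the files before copying them back.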
sheep
WARNING: No backup of complete system!
- TSM Nodes:
- T38-SHEEP-1
Crontab:
00 02 * * * test -x /etc/tsm-backup && /etc/tsm-backup T38-SHEEP-1
/opt/tivoli/tsm/client/ba/bin/dsm.sys:
/var/yp/ypfiles/
/etc/
/home/
/opt/t38www/
/opt/wiki/
/opt/wikijs-home/
pig (s01)
WARNING: No backup of complete system!
- TSM Nodes:
- T38-s01-z01
- T38-s01-z02
- T38-s01-z03
- T38-s01-z04
Crontab:
00 02 * * * test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z01-tsm
00 00 * * * test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z02-tsm
00 09 * * * test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z03-tsm
00 20 * * * test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z04-tsm
/opt/tivoli/tsm/client/ba/bin/dsm.sys:
domain "/zfs/s01/z01/home"
domain "/zfs/s01/z02/home"
domain "/zfs/s01/z03/home"
domain "/zfs/s01/z04/home"
cow (s02)
WARNING: No backup of complete system!
- TSM Nodes (node names are NOT consistent with the host machine anymore):
- T38-s01-z05
- T38-s01-z06
- T38-s01-z07
Crontab:
00 18 * * * test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z05-tsm
00 09 * * * test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z06-tsm
00 22 * * * test -x /etc/tsm-backup-zfs && /etc/tsm-backup-zfs t38-s01-z07-tsm
/opt/tivoli/tsm/client/ba/bin/dsm.sys:
domain "/zfs/s02/z05/home"
domain "/zfs/s02/z06/home"
domain "/zfs/s02/z07/home"
DNS Entries
Initial settings in /etc/bind/
Adding entries to DNS records
sudo rndc freeze
vi /var/cache/bind/...
# Increment serial numbers!
sudo rndc reload
sudo rndc thaw
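The serial increment is easy to get wrong by hand. The sketch below computes the next serial under the common YYYYMMDDNN convention. Whether the T38 zones follow that convention is an assumption, and the function name is hypothetical. If a zone uses a plain counter instead, the result is still larger than the old serial, but it switches the zone to the date-based format.

```shell
# Compute the next zone serial, assuming the YYYYMMDDNN convention.
# $1: current serial
# $2: today's date as YYYYMMDD (defaults to the output of `date +%Y%m%d`)
next_serial() {
    current="$1"
    today="${2:-$(date +%Y%m%d)}"
    base="${today}00"
    if [ "$current" -ge "$base" ]; then
        # Already changed today (or the serial is ahead): add one.
        echo $((current + 1))
    else
        # First change today: start a new date prefix.
        echo "$base"
    fi
}
```

For example, `next_serial 2024010203 20240102` prints 2024010204, and `next_serial 2023123105 20240102` prints 2024010200.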
Clear DNS cache on laptop
macOS:
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
Switch Configuration
Main switch is located at 10.152.219.1
Accessing the main T38 switch
The switch can only be accessed via the management port. Either plug directly into that port and manually set an IP address on the 10.152.219.x network, or connect the management port to another port of the router.
Activating SFP28 ports
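To use 10G transceivers in the 25G SFP28 ports, the port speed must be reduced to 10G. This can only be done through the terminal interface, by connecting to the switch over ssh. As the warning in the session below shows, the setting applies to a group of four ports at once. For example: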
$ ssh [email protected]
[email protected]'s password:
FS#config
Enter configuration commands, one per line. End with CNTL/Z.
FS(config)#interface TFGigabitEthernet 0/29
FS(config-if-TFGigabitEthernet 0/29)#port speed-mode 10G
Warning: Ports TF0/29, TF0/30, TF0/31, TF0/32 will be set speed mode 10G. Continue? [Y/N]:Y
FS(config-if-TFGigabitEthernet 0/29)#end
FS#show interfaces TFGigabitEthernet 0/29
Adding a new node
General steps:
- Enable ssh root access on new nodes. This is necessary to ensure that configuration files are copied with the correct permissions.
- Create patch directories in the cluster configuration setup
- Run the installation script over the list of new nodes
In ~admin/CLUSTER_CONFIG, the directory t38cn-patch contains a subset of configuration files that should be broadcast to the cluster nodes. There is also one directory per node, containing a snapshot of the files on that node. The idea is to first modify the files in t38cn-patch, copy them to the snapshot directories, and then upload those files to each node.
gather-all <NODE-list-file> copies files from each node to its snapshot directory
scatter-all <NODE-list-file> copies files from each snapshot directory to its node and runs /etc/T38-UPDATE-SCRIPTS/update
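To make the scatter step concrete, here is a hedged sketch of what such a loop could look like. It is not the actual scatter-all script. It assumes the node list has one hostname per line with `#` comments, that snapshots live in `$CONFIG_DIR/<node>/`, and that files are pushed with rsync as root. The function name is hypothetical, and the commands are printed rather than executed.

```shell
# Hypothetical sketch of a scatter-style loop (NOT the actual scatter-all).
# Assumptions: one hostname per line, '#' starts a comment, and each
# node's snapshot lives in $CONFIG_DIR/<node>/.
CONFIG_DIR="${CONFIG_DIR:-$HOME/CLUSTER_CONFIG}"

scatter_sketch() {
    list_file="$1"
    while IFS= read -r line || [ -n "$line" ]; do
        node="${line%%#*}"                                  # drop comments
        node="$(printf '%s' "$node" | tr -d '[:space:]')"   # drop whitespace
        [ -z "$node" ] && continue                          # skip empty lines
        # Print instead of executing so the sketch is safe to try.
        echo "rsync -a $CONFIG_DIR/$node/ root@$node:/"
        echo "ssh root@$node /etc/T38-UPDATE-SCRIPTS/update"
    done < "$list_file"
}
```

Printing the commands first gives a chance to review which nodes would be touched before anything is copied.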
When a node has just been configured with the OS, the default is to not allow root to log in via ssh. On T38 machines, there is a user account jc with sudo rights. We use that account to enable root access via ssh. The script enable-root-ssh does this by changing /etc/ssh/sshd_config to set PermitRootLogin yes and restarting the ssh server.
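The edit to sshd_config can be sketched as follows. This is not the actual enable-root-ssh script, and the function name is hypothetical. It replaces an existing (possibly commented-out) PermitRootLogin directive, or appends one if none is present.

```shell
# Hypothetical sketch of the sshd_config edit (NOT the actual script).
# Sets "PermitRootLogin yes" in the file given as $1.
set_permit_root_login() {
    cfg="$1"
    tmp="$(mktemp)" || return 1
    pattern='^[[:space:]]*#?[[:space:]]*PermitRootLogin[[:space:]]'
    if grep -Eq "$pattern" "$cfg"; then
        # Replace every existing directive line, commented out or not.
        sed -E "s/${pattern}.*/PermitRootLogin yes/" "$cfg" > "$tmp"
    else
        # No directive present: keep the file and append one.
        cat "$cfg" > "$tmp"
        echo 'PermitRootLogin yes' >> "$tmp"
    fi
    # Overwrite in place so the original file's owner and mode are kept.
    cat "$tmp" > "$cfg" && rm -f "$tmp"
}
```

On the node this would be run with sudo against /etc/ssh/sshd_config, followed by `sudo sshd -t` to check the syntax and a restart of the ssh service.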
run-new-install connects to each new node, copies over the files in the snapshot directory and runs /etc/T38-UPDATE-SCRIPTS/new-install.
Installing OS on a new node
Ensure tftp on sheep is configured to boot the correct image:
vi /etc/default/tftpd-hpa
service tftpd-hpa restart
service tftpd-hpa status
Add a MAC entry for the node and its IPMI service to the TUM machine list Google Sheet, then copy the generated host lines to t38-network.dhcp.
vi /etc/dhcp/dhcpd.conf
service isc-dhcp-server restart
service isc-dhcp-server status
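For orientation, a host declaration in ISC dhcpd syntax has the shape produced by the sketch below. The exact lines generated by the Google Sheet are not shown on this page, so the layout here is an assumption. The function name and the example values (t38cn-example, 00:11:22:33:44:55, 10.152.219.200) are hypothetical.

```shell
# Print an ISC dhcpd host declaration.
# $1: host name, $2: MAC address, $3: fixed IPv4 address
dhcp_host_entry() {
    printf 'host %s {\n    hardware ethernet %s;\n    fixed-address %s;\n}\n' \
        "$1" "$2" "$3"
}

# Example with made-up values:
#   dhcp_host_entry t38cn-example 00:11:22:33:44:55 10.152.219.200
```

Running `dhcpd -t` before restarting the service checks the configuration syntax without starting the server.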
Boot the new machine with the network boot option (default on new machines). The BIOS should get the expected BMC address and find the expected IP address for the network boot. The installation menu will appear.
During the configuration the network hardware will be detected. At this point the suggested name of the machine should be the same as listed in the dhcp configuration. If not, the dhcp settings were not correctly detected.
Partition the disk with 96 GB for swap, placed at the beginning, and the rest for the OS. Don't forget to set the bootable flag on the OS partition.
Install OpenSSH server and Basic Ubuntu Server.
Install GRUB to the master boot record. Complete the installation and reboot.
Change the BIOS boot order so that the hard drive boots before the network.
Link speed too low
The network cards all support at least 1 Gb/s. If the link fails to auto-detect this, try forcing the speed with the command:
sudo ethtool -s eno1 speed 1000 duplex full autoneg off
Replace eno1 with the appropriate device.
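To check the speed before and after forcing it, the value can be pulled out of the `ethtool` output. The helper below is a small sketch: its name is hypothetical, and it assumes the usual `Speed: 1000Mb/s` line format.

```shell
# Extract the link speed in Mb/s from `ethtool <device>` output on stdin.
# Prints nothing if the speed is not reported as a number.
link_speed_mbps() {
    sed -n 's/^[[:space:]]*Speed:[[:space:]]*\([0-9][0-9]*\)Mb\/s.*/\1/p'
}
```

Typical usage is `ethtool eno1 | link_speed_mbps`, again replacing eno1 with the appropriate device.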