# Rocks
We performed a clean installation of Rocks 6.2 on the Hyades cluster in November 2015. Here are my notes.
When installing the frontend (master node) Hyades using the Rocks 6.2 DVD, I had to bypass the DHCP process[1]:

```
build ksdevice=em1 asknetwork
```

where em1 is the public 10GbE interface; otherwise the installation would fail.
We installed the following rolls:
The coordinates of UC Santa Cruz, which the frontend installer prompts for, are 36.97° N, 122.03° W.

After the installation is completed and the frontend is rebooted, update /etc/motd:

```
The Hyades cluster is for authorized users only! No Trespassing!
Wiki for Hyades is located at http://pleiades.ucsc.edu/hyades/
```
Rocks 6.2 is based on CentOS 6.6[2]:
```
# uname -a
Linux hyades.ucsc.edu 2.6.32-504.16.2.el6.x86_64 #1 SMP Wed Apr 22 06:48:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
```
The root password for MySQL is in /root/.rocks.my.cnf. The MySQL configuration file is /opt/rocks/mysql/my.cnf.
Create a GUID Partition Table (GPT) on /dev/sdb, which is a RAID-6 volume on 6 physical disks:
```
# parted /dev/sdb
GNU Parted 2.1
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Warning: The existing disk label on /dev/sdb will be destroyed and all data on this disk will be lost. Do you want to continue?
Yes/No? Yes
(parted) mkpart primary 0% 100%
(parted) print free
Model: DELL PERC H710P (scsi)
Disk /dev/sdb: 7809794048s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start        End          Size         File system  Name     Flags
        34s          2047s        2014s        Free Space
 1      2048s        7809791999s  7809789952s               primary
        7809792000s  7809794014s  2015s        Free Space

(parted) quit
Information: You may need to update /etc/fstab.
Create an XFS file system on /dev/sdb1. There are 4 data disks in the 6-disk RAID-6 volume, so with a 64 kB stripe element we pass su=64k,sw=4 (a full stripe is 4 × 64 kB = 256 kB):
```
# yum install xfsprogs
# mkfs.xfs -d su=64k,sw=4 /dev/sdb1
meta-data=/dev/sdb1              isize=256    agcount=32, agsize=30506992 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=976223744, imaxpct=5
         =                       sunit=16     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=476672, version=2
         =                       sectsz=512   sunit=16 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
# xfs_admin -L scratch /dev/sdb1
writing all SBs
new label = "scratch"
# ls -l /dev/disk/by-uuid/d8513df1-2211-4f1c-ae22-204e5860145f
lrwxrwxrwx 1 root root 10 Nov  9 14:56 /dev/disk/by-uuid/d8513df1-2211-4f1c-ae22-204e5860145f -> ../../sdb1
```
Add the following line to /etc/fstab:
```
UUID=d8513df1-2211-4f1c-ae22-204e5860145f /scratch xfs inode64,nobarrier 1 2
```
Mount /scratch:
```
# mkdir /scratch
# mount /scratch
# chmod 1777 /scratch
```
List all the network interfaces on Hyades:
```
# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 90:b1:1c:1c:56:3d brd ff:ff:ff:ff:ff:ff
3: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 90:b1:1c:1c:56:3f brd ff:ff:ff:ff:ff:ff
4: em3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 90:b1:1c:1c:56:41 brd ff:ff:ff:ff:ff:ff
5: em4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 90:b1:1c:1c:56:43 brd ff:ff:ff:ff:ff:ff
6: ib0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:2a:4a:e7 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
```

NOTE: the MTU of the IPoIB interface ib0 was reported as 4092. The IPoIB MTU must be 4 bytes less than the InfiniBand link-layer MTU[3]. The link-layer MTU of the Mellanox ConnectX-2 VPI single-port adapter is 4096:
```
# ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.9.1000
        node_guid:                      0002:c903:002a:4ae6
        sys_image_guid:                 0002:c903:002a:4ae9
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       MT_0D90110009
        phys_port_cnt:                  1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         188
                        port_lid:       16
                        port_lmc:       0x00
                        link_layer:     InfiniBand
```
List networks:
```
# rocks list network
NETWORK  SUBNET          NETMASK         MTU  DNSZONE  SERVEDNS
private: 10.6.0.0        255.255.0.0     1500 local    True
public:  128.114.126.224 255.255.255.224 1500 ucsc.edu False
```
Add networks:
```
# rocks add network ib subnet=10.8.0.0 netmask=255.255.0.0 mtu=4092
# rocks add network 10g subnet=10.7.0.0 netmask=255.255.0.0 mtu=9000
# rocks add network ipmi subnet=10.9.0.0 netmask=255.255.0.0
# rocks list network
NETWORK  SUBNET          NETMASK         MTU  DNSZONE  SERVEDNS
10g:     10.7.0.0        255.255.0.0     9000 10g      False
ib:      10.8.0.0        255.255.0.0     4092 ib       False
ipmi:    10.9.0.0        255.255.0.0     1500 ipmi     False
private: 10.6.0.0        255.255.0.0     1500 local    True
public:  128.114.126.224 255.255.255.224 1500 ucsc.edu False
```
Set network interfaces on Hyades:
```
# rocks set host interface subnet hyades iface=ib0 subnet=ib
# rocks set host interface ip hyades iface=ib0 ip=10.8.8.1
# rocks set host interface subnet hyades iface=em2 subnet=10g
# rocks set host interface ip hyades iface=em2 ip=10.7.8.1
# rocks set host interface subnet hyades iface=em4 subnet=ipmi
# rocks set host interface ip hyades iface=em4 ip=10.9.8.111
# rocks list host interface hyades
SUBNET  IFACE MAC                                                         IP              NETMASK         MODULE NAME   VLAN OPTIONS CHANNEL
private em3   90:B1:1C:1C:56:41                                           10.6.8.1        255.255.0.0     ------ hyades ---- ------- -------
10g     em2   90:B1:1C:1C:56:3F                                           10.7.8.1        255.255.0.0     ------ hyades ---- ------- -------
public  em1   90:B1:1C:1C:56:3D                                           128.114.126.225 255.255.255.224 ------ hyades ---- ------- -------
ipmi    em4   90:B1:1C:1C:56:43                                           10.9.8.111      255.255.0.0     ------ hyades ---- ------- -------
ib      ib0   80:00:00:48:FE:80:00:00:00:00:00:00:00:02:C9:03:00:2A:4A:E7 10.8.8.1        255.255.0.0     ------ hyades ---- ------- -------
```
Update the settings:
```
# rocks sync config
# rocks sync host network hyades
```

The latter command updated /etc/sysconfig/network-scripts/ifcfg-* and brought up the interfaces.
To my surprise, the MTU of ib0 became 2044 after the interface was brought up:
```
# ip addr
6: ib0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc pfifo_fast state UP qlen 256
    link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:2a:4a:e7 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 10.8.8.1/16 brd 10.8.255.255 scope global ib0
    inet6 fe80::202:c903:2a:4ae7/64 scope link
       valid_lft forever preferred_lft forever
```

although it is set to 4092 in /etc/sysconfig/network-scripts/ifcfg-ib0:
```
DEVICE=ib0
HWADDR=80:00:00:48:FE:80:00:00:00:00:00:00:00:02:C9:03:00:2A:4A:E7
IPADDR=10.8.8.1
NETMASK=255.255.0.0
BOOTPROTO=none
ONBOOT=yes
MTU=4092
```

I suspect this is due to settings in the Subnet Manager (SM); to be investigated.
Tune network performance on Hyades, following procedures in Linux Network Tuning.
Tune the Mellanox ConnectX-2 VPI adapter for Open MPI by appending the following line to /etc/modprobe.d/mlx4.conf:

```
options mlx4_core log_num_mtt=23 log_mtts_per_seg=3
```

so that max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE = 2^23 * 2^3 * (4 kB) = 256 GB, twice the amount of physical memory on Hyades[4].
Here are the default values of the two parameters (we'll reboot Hyades for the new values to take effect):
```
# cat /sys/module/mlx4_core/parameters/log_num_mtt
0
# cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
3
```
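After the reboot, the arithmetic can be sanity-checked against the live parameters; a sketch (it assumes the 4 kB page size noted above and prints whole GB):

```bash
#!/bin/bash
# max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)
seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)
page=$(getconf PAGE_SIZE)
# with log_num_mtt=23, log_mtts_per_seg=3 and 4 kB pages: 256 GB
echo "max_reg_mem = $(( (1 << mtt) * (1 << seg) * page / 1024**3 )) GB"
```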
We note in passing that max locked memory is set to unlimited, by /etc/security/limits.d/91-infiniband.conf.
The new installation of Rocks 6.2 created new SSH host keys (DSA & RSA) for Hyades. We'll revert to the old SSH host keys.
Copy the old SSH host keys for Hyades to /etc/ssh/.
Update SSH host keys for the frontend Hyades in Rocks Secure Attributes database[5]:
```
# rocks set host sec_attr hyades attr=ssh_host_rsa_key crypted=true value=/etc/ssh/ssh_host_rsa_key
# rocks set host sec_attr hyades attr=ssh_host_rsa_key.pub crypted=true value=/etc/ssh/ssh_host_rsa_key.pub
# rocks set host sec_attr hyades attr=ssh_host_dsa_key crypted=true value=/etc/ssh/ssh_host_dsa_key
# rocks set host sec_attr hyades attr=ssh_host_dsa_key.pub crypted=true value=/etc/ssh/ssh_host_dsa_key.pub
```
Set the cluster keys to the old ones as well, which will be the SSH host keys for all the other nodes in the cluster, including Eudora & Aesyle:
```
# rocks set sec_attr attr=ssh_host_rsa_key crypted=true value=/tmp/etc/ssh/ssh_host_rsa_key
# rocks set sec_attr attr=ssh_host_rsa_key.pub crypted=true value=/tmp/etc/ssh/ssh_host_rsa_key.pub
# rocks set sec_attr attr=ssh_host_dsa_key crypted=true value=/tmp/etc/ssh/ssh_host_dsa_key
# rocks set sec_attr attr=ssh_host_dsa_key.pub crypted=true value=/tmp/etc/ssh/ssh_host_dsa_key.pub
```
Disable root login with a password over SSH (key-based root login remains permitted):
```
# sed -i -e 's/#PermitRootLogin yes/PermitRootLogin without-password/' /etc/ssh/sshd_config
```
Modify /etc/ssh/ssh_config, from:
```
Host *
        CheckHostIP no
        ForwardX11 yes
        ForwardAgent yes
        StrictHostKeyChecking no
        UsePrivilegedPort no
        Protocol 2,1
        HostbasedAuthentication yes
        EnableSSHKeySign yes
```

to:
```
Host *
        CheckHostIP no
        ForwardX11 no
        ForwardAgent no
        StrictHostKeyChecking no
        UsePrivilegedPort no
        Protocol 2
        HostbasedAuthentication yes
        EnableSSHKeySign yes
```
Update settings:
```
# rocks sync config
```

which will update /etc/ssh/ssh_known_hosts, among other things.
Restart sshd:
```
# /etc/init.d/sshd restart
```
The default password policy on RHEL/CentOS 6 is too lenient. We'll use passwdqc to enforce strong passwords. Change the following line in /etc/pam.d/system-auth:
```
password    requisite     pam_cracklib.so try_first_pass retry=3 type=
```

to:

```
password    requisite     pam_passwdqc.so min=disabled,24,16,12,8 max=40 passphrase=3 match=4 similar=deny random=42 enforce=everyone retry=3
```

(The five min values are minimum lengths for passwords of increasing character-class complexity; disabled rejects passwords made of a single character class.)
Modify /etc/gdm/gdm.schemas, changing the following lines:

```
<schema>
  <key>greeter/Include</key>
  <signature>s</signature>
  <default></default>
</schema>
<schema>
  <key>greeter/Exclude</key>
  <signature>s</signature>
  <default>bin,root,daemon,adm,lp,sync,shutdown,halt,mail,news,uucp,operator,nobody,nobody4,noaccess,postgres,pvm,rpm,nfsnobody,pcap</default>
</schema>
<schema>
  <key>greeter/IncludeAll</key>
  <signature>b</signature>
  <default>true</default>
</schema>
```

to:

```
<schema>
  <key>greeter/Include</key>
  <signature>s</signature>
  <default>root,dong</default>
</schema>
<schema>
  <key>greeter/Exclude</key>
  <signature>s</signature>
  <default>bin,daemon,adm,lp,sync,shutdown,halt,mail,news,uucp,operator,nobody,nobody4,noaccess,postgres,pvm,rpm,nfsnobody,pcap</default>
</schema>
<schema>
  <key>greeter/IncludeAll</key>
  <signature>b</signature>
  <default>false</default>
</schema>
```

This limits the GDM greeter's user list to root and dong.
Update /etc/sysconfig/iptables.

Update /etc/sysconfig/iptables-config:

```
# sed -i -e '/^IPTABLES_MODULES_UNLOAD/s/yes/no/' /etc/sysconfig/iptables-config
```
Restart iptables:
```
# /etc/init.d/iptables restart
```
Append the following lines to /etc/fstab:
```
10.7.7.1:/export/home   /home   nfs  noatime,nosuid,nolock  0 0
10.7.7.1:/export/trove  /trove  nfs  noatime,nosuid,nolock  0 0
10.7.7.2:/export/zang   /zang   nfs  noatime,nosuid,nolock  0 0
```
Mount the NFS shares:
```
# mkdir /trove /zang
# mount -a
```
Comment out the following line in /etc/auto.master, so that autofs no longer manages /home (it is now a static NFS mount via /etc/fstab):

```
/home /etc/auto.home --timeout=1200
```
We added a few packages to the Rocks distribution:
- Additional packages for InfiniBand: librdmacm-utils, ibacm, libibverbs-utils, perftest, & qperf. They can be downloaded from CentOS Vault or from our local CentOS 6 mirror:
  ```
  # cd /export/rocks/install/contrib/6.2/x86_64/RPMS/
  # wget http://vault.centos.org/6.6/os/x86_64/Packages/librdmacm-utils-1.0.18.1-1.el6.x86_64.rpm
  # wget http://pleiades.ucsc.edu/centos/6/os/x86_64/Packages/ibacm-1.0.9-1.el6.x86_64.rpm
  # wget http://vault.centos.org/6.6/os/x86_64/Packages/libibverbs-utils-1.1.8-3.el6.x86_64.rpm
  # wget http://vault.centos.org/6.6/os/x86_64/Packages/perftest-2.2-1.el6.x86_64.rpm
  # wget http://vault.centos.org/6.6/os/x86_64/Packages/qperf-0.4.9-1.el6.x86_64.rpm
  ```
- xz-devel:
  ```
  # wget http://vault.centos.org/6.6/os/x86_64/Packages/xz-devel-4.999.9-0.5.beta.20091007git.el6.x86_64.rpm
  ```
- subversion, neon, pakchois, fftw, fftw-devel, gnuplot, gnuplot-common, gnuplot-doc, & gnuplot-latex from our local CentOS 6 mirror
- screen from local CentOS 6 mirror & tmux from local EPEL mirror
- Official HDF5 packages:
  ```
  # wget https://www.hdfgroup.org/ftp/HDF5/current/bin/RPMS/hdf5-1.8.15.patch1-1.with.szip.encoder.el6.x86_64.rpm
  # wget https://www.hdfgroup.org/ftp/HDF5/current/bin/RPMS/hdf5-devel-1.8.15.patch1-1.with.szip.encoder.el6.x86_64.rpm
  ```
- Packages in Dell OpenManage Deployment Toolkit Version 4.4:
  ```
  # mkdir /mnt/cdrom
  # mount -o loop /scratch/dtk_4.4_1294_Linux64_A01.iso /mnt/cdrom
  # cp /mnt/cdrom/RPMs/rhel6/x86_64/*.rpm /export/rocks/install/contrib/6.2/x86_64/RPMS/
  ```
- Packages for Lustre client 2.5.34.1 (see Building Lustre Client from Source):
  ```
  # cp /root/rpmbuild/RPMS/x86_64/lustre-*.rpm /export/rocks/install/contrib/6.2/x86_64/RPMS/
  ```
- Packages for Globus - ToDo
- Maybe valgrind, valgrind-devel, valgrind-openmpi, openmotif, openmotif-devel, libXp, libXp-devel, brctl, screen & tmux - ToDo
Then rebuild the Rocks distribution:

```
# cd /export/rocks/install
# rocks create distro
```
Install those packages on the frontend:
```
# yum clean all
# yum install librdmacm-utils ibacm libibverbs-utils perftest qperf
# yum install xz-devel fftw fftw-devel gnuplot gnuplot-doc gnuplot-latex
# yum install hdf5 hdf5-devel
# yum install syscfg srvadmin-idracadm7
# yum install lustre-client-modules lustre-client lustre-client-tests lustre-iokit
# yum install screen tmux
# yum install zsh subversion libXp
```

Note libXp is required by IDL.
Add /etc/modprobe.d/lustre.conf:
```
options lnet networks=o2ib0(ib0)
```
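Once the lnet module is loaded with this option, the resulting LNet NID can be checked; a sketch (the exact output format varies with the Lustre version; 10.8.8.1 is the frontend's ib0 address from above):

```bash
# modprobe lnet
# lctl network up
# lctl list_nids        # expect something like: 10.8.8.1@o2ib
```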
Add my own init script, /etc/init.d/lustre.
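The script itself isn't reproduced in these notes; a minimal SysV sketch that just mounts and unmounts /pfs might look like the following (the MGS NID and file system name here are hypothetical placeholders):

```bash
#!/bin/bash
# lustre       Mount/unmount the Lustre file system on /pfs
# chkconfig: 345 99 01
# description: client mount for the Lustre parallel file system
. /etc/init.d/functions

MGS="10.8.0.1@o2ib"    # hypothetical MGS NID; substitute the real one
FSNAME="pfs"           # hypothetical file system name
MNT="/pfs"

case "$1" in
    start)
        echo -n "Mounting Lustre on ${MNT}: "
        mount -t lustre ${MGS}:/${FSNAME} ${MNT} && success || failure
        echo
        ;;
    stop)
        echo -n "Unmounting Lustre from ${MNT}: "
        umount ${MNT} && success || failure
        echo
        ;;
    status)
        mount | grep -q " ${MNT} type lustre" && echo "${MNT} is mounted" \
            || echo "${MNT} is not mounted"
        ;;
    restart)
        $0 stop
        $0 start
        ;;
    *)
        echo "Usage: $0 {start|stop|status|restart}"
        exit 1
        ;;
esac
```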
Make it executable and register it:

```
# chmod +x /etc/init.d/lustre
# chkconfig --add lustre
```
Mount the Lustre file system:
```
# mkdir /pfs
# service lustre start
```
Create a few symbolic links:
```
# ln -s /pfs/sw/intel /opt/intel
# ln -s /pfs/sw/pgi /opt/pgi
# ln -s /pfs/sw/exelis /opt/exelis
# ln -s /pfs/sw/allinea /opt/allinea
```
Append the following line to /usr/share/Modules/init/.modulespath:
```
/pfs/sw/modulefiles
```
Set up the default modules:
```
# /bin/cp -vf /pfs/sw/etc/profile.d/rocks-hpc.csh /etc/profile.d/
# /bin/cp -vf /pfs/sw/etc/profile.d/rocks-hpc.sh /etc/profile.d/
```
Stop and turn off unnecessary services:
```
# chkconfig --list
# /etc/init.d/fcoe stop
# chkconfig fcoe off
# /etc/init.d/iscsi stop
# chkconfig iscsi off
# /etc/init.d/iscsid stop
# chkconfig iscsid off
# /etc/init.d/lldpad stop
# chkconfig lldpad off
# /etc/init.d/lvm2-monitor force-stop
# chkconfig lvm2-monitor off
# /etc/init.d/mdmonitor stop
# chkconfig mdmonitor off
# /etc/init.d/rpcgssd stop
# chkconfig rpcgssd off
```
Maybe autofs, nfs, rpcbind, & snmpd - ToDo
Add and start lmgrd (license manager for PGI compilers and tools):
```
# chkconfig --add lmgrd
# /etc/init.d/lmgrd start
```
PGI compilers 14.10 seem to be hard-wired to depend on gcc-4.4.6:
```
$ module load pgi
$ pgf95 hello.f90 -o hello.pgi
/usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/4.4.6/crtbegin.o: No such file: No such file or directory
```
But the gcc version on Rocks 6.2 is 4.4.7:
```
# ls -l /usr/lib/gcc/x86_64-redhat-linux/
total 4
drwxr-xr-x 5 root root 4096 Nov  4 11:37 4.4.4
lrwxrwxrwx 1 root root    5 Nov  4 11:37 4.4.7 -> 4.4.4
```
Here is a quick fix:
```
# cd /usr/lib/gcc/x86_64-redhat-linux/
# ln -s 4.4.4 4.4.6
```
Create XML configuration files extend-compute.xml[6] & replace-partition.xml[7] in /export/rocks/install/site-profiles/6.2/nodes/.
```
# cd /scratch
# wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
# wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
# chmod +x cuda_7.0.28_linux.run cuda_7.5.18_linux.run
# ./cuda_7.0.28_linux.run -extract=/pfs/sw/NVIDIA
# ./cuda_7.5.18_linux.run -extract=/pfs/sw/NVIDIA
```

The runfile for the latest NVIDIA driver is NVIDIA-Linux-x86_64-352.39.run, extracted in /pfs/sw/NVIDIA.
Create gpu.xml in /export/rocks/install/site-profiles/6.2/nodes/, and gpu-appliance.xml in /export/rocks/install/site-profiles/6.2/graphs/default/.
Add a new appliance gpu[8]:
```
# rocks add appliance gpu membership='GPU' node='gpu'
```
Set the x11 attribute for the gpu appliance to true[9]:
```
# rocks set appliance attr gpu x11 true
```

so that the x11 node XML file will be included in the kickstart files for GPU nodes. See the Base Roll graph XML file /export/rocks/install/rocks-dist/x86_64/build/graphs/default/base.xml:

```
<edge from="client" to="x11" cond="x11"/>
```
Create viz.xml in /export/rocks/install/site-profiles/6.2/nodes/, and viz-appliance.xml in /export/rocks/install/site-profiles/6.2/graphs/default/.
Add a new appliance viz:
```
# rocks add appliance viz membership='Viz' node='viz'
```
Set the x11 attribute for the viz appliance to true:
```
# rocks set appliance attr viz x11 true
```
Create mic.xml in /export/rocks/install/site-profiles/6.2/nodes/, and mic-appliance.xml in /export/rocks/install/site-profiles/6.2/graphs/default/.
Add a new appliance mic:
```
# rocks add appliance mic membership='MIC' node='mic'
```
Rebuild the Rocks distribution:
```
# rocks list appliance
# cd /export/rocks/install
# rocks create distro
```
Rocks 6.2 has introduced many bugs. Although insert-ethers has the option --ipaddr:
```
# insert-ethers --help
Insert Ethernet Addresses - version 6.2
Usage: insert-ethers [-hv] [-p password] [-u host] [-d database]
       [--help] [--list-rcfiles] [--list-project-info] [--verbose]
       [--update] [--staticip] [--rcfile arg] [--host host]
       [--password password] [--db database] [--user host]
       [--baseip ip address] [--basename basename] [--hostname hostname]
       [--ipaddr ip address] [--cabinet number] [--rack number]
       [--inc number] [--rank number] [--replace hostname]
       [--remove hostname] [--membership membership]
       [--os the OS to install on the machines]
```

it doesn't honor it! For instance,

```
# insert-ethers --membership="Compute" --hostname="astro-1-1" --rack=1 --rank=1 --ipaddr="10.6.1.1"
```

won't assign astro-1-1 the IP address 10.6.1.1, but 10.6.255.254!
Fortunately, this can be easily corrected:
```
# rocks set host interface ip astro-1-1 em1 10.6.1.1
```
Assign an IP address to the IPoIB interface:
```
# rocks set host interface subnet astro-1-1 iface=ib0 ib
# rocks set host interface ip astro-1-1 ib0 10.8.1.1
```
Force a re-install at the next PXE boot[10]:
```
# rocks sync config
# rocks set host boot astro-1-1 action=install
```

or sync the network settings:
```
# rocks sync host network astro-1-1
```
The above procedure is a bit cumbersome. The silver lining is that it is very amenable to scripting.
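For example, a short script can set the missing addresses for many nodes at once. Here is a sketch using only the rocks commands shown above; the rack/rank ranges and the astro-R-N naming scheme are illustrative, not the actual cluster layout:

```bash
#!/bin/bash
# Work around the insert-ethers --ipaddr bug: set the private (em1) and
# IPoIB (ib0) addresses of each compute node from its rack/rank position,
# then mark the node for re-install.
for rack in $(seq 1 6); do
    for rank in $(seq 1 16); do
        node="astro-${rack}-${rank}"
        rocks set host interface ip $node em1 10.6.${rack}.${rank}
        rocks set host interface subnet $node iface=ib0 subnet=ib
        rocks set host interface ip $node ib0 10.8.${rack}.${rank}
        rocks set host boot $node action=install
    done
done
rocks sync config
```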
One can examine the kickstart file of a node, for example, with:
```
# rocks list host profile astro-1-1 | tee /tmp/astro-1-1.ks
```
Install a GPU node with:
```
# insert-ethers --membership="GPU" --hostname="gpu-1" --rack=7 --rank=11
```
Because of the insert-ethers bug described in the section Compute Nodes, we had to manually fix the IP addresses for the GPU nodes as well.
There is another bug introduced by Rocks 6.2. The boot order in BIOS had been set to PXE first on the GPU nodes, and it worked fine with Rocks 6.1. However, after the installation of Rocks 6.2, PXE first would fail, getting stuck at the localboot phase! See the PXELINUX configuration file /tftpboot/pxelinux/pxelinux.cfg/0A06070B for gpu-1 (its private IP address, 10.6.7.11, in uppercase hexadecimal[11]):

```
default rocks
prompt 0
label rocks
        localboot 0
```

which may be related to this bug. For the time being, we set the boot order in BIOS to Hard Disk first on the GPU nodes.
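The PXELINUX file name is simply the node's IPv4 address rendered octet by octet in uppercase hexadecimal, which is easy to compute in the shell:

```bash
# PXELINUX config file name for a node's private IP (here gpu-1's 10.6.7.11)
ip=10.6.7.11
IFS=. read -r a b c d <<< "$ip"
printf '%02X%02X%02X%02X\n' $a $b $c $d    # -> 0A06070B
```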
The buggy insert-ethers in Rocks 6.2 failed outright with Eudora (insert-ethers in Rocks 6.1 used to work fine):
```
# insert-ethers --membership="Viz" --hostname="eudora" --rack=7 --rank=3
```

complaining "eudora.ucsc.edu is not in the cluster", likely because the presence of the 10GbE interfaces utterly bewildered the buggy software!
But we can still rescue it:
```
# rocks set host cpus eudora cpus=32
# rocks add host interface eudora iface=em1 ip=128.114.126.226 mac=d4:ae:52:9e:65:2e subnet=public
# rocks add host interface eudora iface=em2 ip=10.7.7.3 mac=d4:ae:52:9e:65:30 subnet=10g
# rocks add host interface eudora iface=em3 ip=10.6.7.3 mac=d4:ae:52:9e:65:32 subnet=private
# rocks add host interface eudora iface=em4 mac=d4:ae:52:9e:65:34
# rocks add host interface eudora iface=ib0 ip=10.8.7.3 mac=80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:2a:4a:c3 subnet=ib
```
Force a re-install at the next PXE boot:
```
# rocks sync config
# rocks set host boot eudora action=install
```
We can examine the PXELINUX configuration file for Eudora, /tftpboot/pxelinux/pxelinux.cfg/0A060703 (Eudora's private IP address is 10.6.7.3 -> 0A060703):
```
default rocks
prompt 0
label rocks
        kernel vmlinuz-6.2-x86_64
        append ks ramdisk_size=150000 lang= devfs=nomount pxe kssendmac selinux=0 noipv6 ksdevice=bootif ip=10.6.7.3 gateway=10.6.8.1 netmask=255.255.0.0 dns=10.6.8.1 nextserver=10.6.8.1 initrd=initrd.img-6.2-x86_64
        ipappend 2
```
I was too lazy to reinstall Aesyle, so I simply added the host to the Rocks database:
```
# rocks add host aesyle cpus=12 membership=MIC rack=7 rank=4
# rocks add host interface aesyle iface=em1 ip=128.114.126.227 mac=90:b1:1c:45:4d:74 subnet=public
# rocks add host interface aesyle iface=em2 mac=90:b1:1c:45:4d:76
# rocks add host interface aesyle iface=em3 ip=10.6.7.4 mac=90:b1:1c:45:4d:78 subnet=private
# rocks add host interface aesyle iface=em4 mac=90:b1:1c:45:4d:7a
# rocks set host boot aesyle action=os
# rocks sync config
```
But we had to replace the old 411 keys:
```
# cd /etc/411-security/
# scp master.pub shared.key aesyle:/etc/411-security/
```
Set the node attribute for compute nodes with Hyper-Threading disabled, e.g.[12]:

```
# rocks set host attr astro-1-1 torque_properties normal
```
Set the node attribute for compute nodes with Hyper-Threading enabled, e.g.:

```
# rocks set host attr astro-6-1 torque_properties hyper
```
Set the node attribute for GPU nodes, e.g.:

```
# rocks set host attr gpu-1 torque_properties gpu
```
Unfortunately the official method failed to generate the PBS nodes file:
```
# rocks report pbsnodes
Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 259, in <module>
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/__init__.py", line 1899, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/report/pbsnodes/__init__.py", line 132, in run
    self.db.execute(query)
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/__init__.py", line 1176, in execute
    return self.database.execute(command)
  File "/opt/rocks/lib/python2.6/site-packages/rocks/db/database.py", line 296, in execute
    self.results = self.conn.execute(command)
  File "/opt/rocks/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 709, in execute
    return self._execute_text(object, multiparams, params)
  File "/opt/rocks/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 858, in _execute_text
    statement, parameters
  File "/opt/rocks/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 927, in _execute_context
    context)
  File "/opt/rocks/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1076, in _handle_dbapi_exception
    exc_info
  File "/opt/rocks/lib/python2.6/site-packages/sqlalchemy/util/compat.py", line 185, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/opt/rocks/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 920, in _execute_context
    context)
  File "/opt/rocks/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 425, in do_execute
    cursor.execute(statement, parameters)
  File "/opt/rocks/lib/python2.6/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/opt/rocks/lib/python2.6/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
sqlalchemy.exc.ProgrammingError: (ProgrammingError) (1146, "Table 'cluster.node_attributes' doesn't exist") 'select nodes.name, node_attributes.value from nodes, node_attributes where (node_attributes.attr = "torque_properties" and node_attributes.node = nodes.id)' ()
```

but we can easily use a simple script to generate the PBS nodes file /var/spool/torque/server_priv/nodes.
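A sketch of such a script, under the assumption that `rocks list host attr attr=torque_properties` prints one "host: attr value source" row per node and that `rocks list host` reports the CPU count in its third column (both column layouts may need adjusting):

```bash
#!/bin/bash
# Generate the Torque nodes file from torque_properties attributes in the
# Rocks database, working around the broken `rocks report pbsnodes`.
NODES=/var/spool/torque/server_priv/nodes
: > $NODES
rocks list host attr attr=torque_properties | tail -n +2 | \
while read -r host attr value rest; do
    host=${host%:}        # strip the trailing colon from the host name
    # CPU count for the np= field (assumed to be the 3rd column)
    np=$(rocks list host "$host" | tail -n +2 | awk '{print $3}')
    echo "$host np=$np $value" >> $NODES
done
```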
Restart PBS server:
```
# service pbs_server restart
```
Use qmgr to tune the batch system.
Modify the Maui config file /opt/maui/maui.cfg.
Restart Maui:
```
# service maui restart
```
1. Rocks 6.2 - Install and Configure Your Frontend
2. Red Hat Enterprise Linux Release Dates
3. Configuring IPoIB
4. Open MPI is warning me about limited registered memory; what does this mean?
5. Rocks 6.2 - Hostbased vs. Key-based SSH Authentication
6. Rocks 6.2 - Customizing Configuration of Compute Nodes
7. Rocks 6.2 - Compute Node Disk Partitioning
8. Rocks 6.2 - Adding a New Appliance Type to the Cluster
9. Rocks 6.2 - Configuring Software with XML Files
10. Rocks 6.2 - Forcing a Re-install at Next PXE Boot
11. PXELINUX
12. Torque Roll Documentation