Initial daq cluster setup checklist - art-daq/artdaq GitHub Wiki
Initial DAQ cluster setup checklist.
Objective: To reduce the number of service desk tickets during the initial setup of DAQ development / production clusters.
Networking
- define subnets for IPMI, fnal/public and data/daq interfaces
- How many cables are needed? Shared IPMI/public?
- Name interfaces by function.
- define host names for all network interfaces and make them
consistent
- mydaq-br01, mydaq-eb01, mydaq-br01-ipmi, mydaq-br01-daq
- the list of host names should be complete as if all hardware is
available
- Reserve a few IPs for upcoming computer installs
- put all host names into /etc/hosts and distribute it across all servers
- How do we automate generation of the hosts file?
- Right now /etc/hosts is managed by Puppet. Remove this?
- make a consistent IP address assignment across all subnets
- use address blocks for the same server roles
- make the last octet of the IP address the same across all
NICs of the same host
- Discussion with networking
- configure authentication
- Kerberos for the public interface
- public key for the data interface
- Access everything over private network for the daq user
- User testing artdaq will get instructions to set up their own public key
- create instructions for rebooting servers using IPMI
- enable 9000 MTU frames on all DAQ interfaces and networking equipment by default
- Switch configuration by networking.
- Just on DAQ network, not all interfaces.
- NFS on public network - jumbo frames for performance? No jumbo on public.
- configure and verify that multicasting is enabled and working on all
networking equipment
- Need testing software to verify the configuration.
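The multicast item above asks for testing software. A minimal Python sketch of such a probe is given here, assuming plain UDP multicast over IPv4. The group address, port, and probe payload format are hypothetical choices for illustration, not artdaq settings. The idea is to start `receive_probes` on one host and then call `send_probes` on another, each passing the IP address of its own DAQ interface.

```python
import socket
import struct
import time

GROUP = "239.255.42.99"  # hypothetical administratively scoped multicast group
PORT = 50000             # hypothetical UDP port
MAGIC = b"MCST"          # hypothetical marker identifying our probe packets


def make_probe(seq):
    """Build an 8-byte probe: 4-byte marker plus a 32-bit sequence number."""
    return struct.pack("!4sI", MAGIC, seq)


def parse_probe(data):
    """Return the sequence number of a probe, or None for foreign packets."""
    if len(data) != 8:
        return None
    magic, seq = struct.unpack("!4sI", data)
    return seq if magic == MAGIC else None


def send_probes(iface_ip, count=10, interval=0.1, group=GROUP, port=PORT):
    """Send `count` probes to the group through the interface owning iface_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                        socket.inet_aton(iface_ip))
        # TTL 1 keeps the probes on the local subnet.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        for seq in range(count):
            sock.sendto(make_probe(seq), (group, port))
            time.sleep(interval)
    finally:
        sock.close()


def receive_probes(iface_ip, expected=10, timeout=10.0, group=GROUP, port=PORT):
    """Join the group on iface_ip and return the set of sequence numbers seen."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    seen = set()
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        mreq = socket.inet_aton(group) + socket.inet_aton(iface_ip)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        deadline = time.monotonic() + timeout
        while len(seen) < expected:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                data, _sender = sock.recvfrom(64)
            except socket.timeout:
                break
            seq = parse_probe(data)
            if seq is not None:
                seen.add(seq)
    finally:
        sock.close()
    return seen
```

If the receiver returns fewer sequence numbers than were sent, multicast is likely being filtered or dropped somewhere between the two hosts. Note that a pass between two hosts on one switch says nothing about the rest of the network equipment.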
Users
- define a shared user for
- managing UPS products
- running daq, dcs, databases
- Experiments manage .k5logins for the shared accounts
- add all people from the RSI group to the /root/.k5login
- add all known daq users to the daq and dcs shared accounts
- shared user profiles are not expected to have any customizations
- Control room accounts - shared
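Since experiments maintain the `.k5login` files for the shared accounts, the file contents could be rendered from a single user list. The sketch below assumes the standard `.k5login` format of one Kerberos principal per line. The default realm `FNAL.GOV` and the user names in the usage note are assumptions for illustration.

```python
def render_k5login(usernames, realm="FNAL.GOV"):
    """Return .k5login contents: one principal per line, sorted, no duplicates.

    The realm default is an assumption; pass the realm your site actually uses.
    """
    principals = sorted(
        {f"{name.strip()}@{realm}" for name in usernames if name.strip()}
    )
    return "".join(principal + "\n" for principal in principals)
```

For example, `render_k5login(["alice", "bob"])` produces the text to write to the shared account's `~/.k5login`. The same list could feed both the daq and dcs accounts so they stay in step.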
Storage areas
- set up a reliable NFS server for /home, /daq/software, /daq/log,
/daq/run_records, /daq/scratch
- No mounts from the lab's central storage or pnfs.
- cvmfs requires additional configuration to optimize
- reserve adequate disk space for each area
- raid 10 for nfs server
- create a designated scratch area for doing builds on a local NVMe
drive, preferably on the fastest server
- a fast NVMe drive, such as a Samsung 970 Pro or faster, is preferred
- Pick a current SSD model; larger capacities have faster write speeds.
- Backups
- set up a nightly backup for /home
- set up a weekly backup for /daq areas as needed
- the performance of the NFS should be monitored
- Develop monitoring of NFS performance
- Collect metrics
- /data is a local file system on data logger computers
- raid 10 for performance
- lose half the disk space
- Hardware raid cards
- Turn off raid checking on /data
- Raid checking for home areas on nfs space
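For the NFS performance monitoring item above, one possible metric is sequential write throughput measured from a client. The following is a minimal sketch, assuming a timed write followed by `fsync` is an acceptable probe. The function name and default sizes are hypothetical, and the probe adds load to the server, so it should be kept small and run sparingly.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=64, block_kb=1024):
    """Write a temporary file in `directory`, fsync it, and return MiB/s.

    The file is removed afterwards. Random data is used so that any
    compression along the way does not inflate the result.
    """
    block = os.urandom(block_kb * 1024)
    n_blocks = max(1, (size_mb * 1024) // block_kb)
    fd, path = tempfile.mkstemp(dir=directory, prefix="nfsprobe-")
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(n_blocks):
                handle.write(block)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out to the server
        elapsed = time.perf_counter() - start
    finally:
        os.unlink(path)
    written_mb = n_blocks * block_kb / 1024
    return written_mb / max(elapsed, 1e-9)
```

Calling `write_throughput_mb_s("/daq/scratch")` periodically and recording the value would give a trend line for the collected metrics. Read performance and metadata latency would need separate probes.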
Software
- any base software such as the OS and productivity RPMs should be
identical on all servers
- At Fermilab puppet is used
- the default list of installed software packages should not impede
development/testing work; include tools such as emacs, vim, mc, tmux, perf,
iperf, strace, dstat, etc. VNC/MATE should be installed by default
- Generate a list of packages to install
- Support for MOSH?
- We should try it. Might be blocked by ACLs.
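The requirement that base software be identical on all servers can be checked by comparing installed package names across hosts. The sketch below assumes RPM-based systems. `local_rpm_names` and `package_drift` are hypothetical helper names, and collecting the per-host sets (for example over ssh) is left out.

```python
import subprocess


def local_rpm_names():
    """Return the set of RPM package names installed on this host."""
    result = subprocess.run(
        ["rpm", "-qa", "--qf", "%{NAME}\n"],
        capture_output=True, text=True, check=True,
    )
    return set(result.stdout.split())


def package_drift(installed_by_host):
    """Map each host to the packages it lacks but some other host has.

    `installed_by_host` maps host name -> set of package names.
    Hosts that have every package seen anywhere are omitted.
    """
    all_packages = set().union(*installed_by_host.values())
    drift = {}
    for host, packages in installed_by_host.items():
        missing = all_packages - packages
        if missing:
            drift[host] = sorted(missing)
    return drift
```

An empty result means every host has the same package names. This compares names only, not versions, and the RPM name of a tool may differ from its command name, so the generated package list should use the names the package manager reports.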
System Services
- Optional: DNS, Kerberos, NIS, Supervisord, InfluxDB, Prometheus, Ganglia, Graphite, Grafana
- system monitoring - check_mk, Netdata
- singularity container to distribute monitoring software
- graphite/grafana - part of standard installation
- Keep separate hardware monitoring for system administration
- Combined hardware monitoring for DAQ - DAQ monitoring + hardware monitoring
- Disable SELinux enforcing; use permissive mode
- Disable firewall on private networks
Other topics
- Buffer sizes in network switches
- General recommendation
- Database computer specs
- System parameters
- Set in puppet
- Standard scripts to verify settings
- Monitoring scripts
- Lightweight, so they can run during operations
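A standard settings check of the kind listed above can stay lightweight by reading values straight from sysfs and procfs on Linux instead of spawning external tools. This is a minimal sketch, assuming that approach is acceptable. The function names are hypothetical, and the interface name and expected values in the usage note are placeholders.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys"):
    """Return the MTU of a network interface as reported by sysfs."""
    return int(Path(sysfs_root, "class", "net", iface, "mtu").read_text().strip())


def read_sysctl(name, proc_root="/proc"):
    """Return a kernel parameter such as 'net.core.rmem_max' as a string.

    Names whose components contain dots (some interface names) are not handled.
    """
    return Path(proc_root, "sys", *name.split(".")).read_text().strip()


def check_settings(expected_mtu, expected_sysctl,
                   sysfs_root="/sys", proc_root="/proc"):
    """Compare live values with expectations; return a list of problem strings."""
    problems = []
    for iface, want in expected_mtu.items():
        try:
            got = read_mtu(iface, sysfs_root)
        except (OSError, ValueError) as exc:
            problems.append(f"{iface}: cannot read MTU ({exc})")
            continue
        if got != want:
            problems.append(f"{iface}: MTU {got}, expected {want}")
    for name, want in expected_sysctl.items():
        try:
            got = read_sysctl(name, proc_root)
        except OSError as exc:
            problems.append(f"{name}: cannot read value ({exc})")
            continue
        # Multi-valued parameters are whitespace separated; compare token-wise.
        if got.split() != str(want).split():
            problems.append(f"{name}: {' '.join(got.split())}, expected {want}")
    return problems
```

For example, `check_settings({"daq0": 9000}, {"net.core.rmem_max": "16777216"})` returns an empty list when the hypothetical `daq0` interface uses jumbo frames and the buffer limit matches. The root-directory parameters exist so the check can be exercised against a fake directory tree.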
Geoff
- Turn off checking of RAID arrays.
- RAID arrays must be RAID 10? You lose half the disk size?
- Do we really need a hosts file?
- If we use a hosts file we should use a script to create the file.
- ntp from fermilab servers works well. No need for an experiment ntp server.
- Who is in the RSI group?
- Use Ansible to verify the settings from puppet are correct?
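On the question of scripting the hosts file: a generator can also enforce the addressing rules from the Networking section, namely the same last octet on every NIC of a host and name suffixes by function. The sketch below is one way to do that under stated assumptions. The subnets, octet assignments, and the `hosts_lines` helper are hypothetical; only the naming pattern (`mydaq-br01`, `mydaq-br01-ipmi`, `mydaq-br01-daq`) comes from the checklist.

```python
import ipaddress

# Hypothetical subnets, keyed by the host-name suffix of each interface.
SUBNETS = {
    "": "192.168.10.0/24",       # public interface, no suffix
    "-ipmi": "192.168.20.0/24",  # IPMI interface
    "-daq": "192.168.30.0/24",   # data/daq interface
}

# Hypothetical last-octet assignments, one address block per server role.
HOSTS = {
    "mydaq-br01": 11,
    "mydaq-br02": 12,
    "mydaq-eb01": 21,
}


def hosts_lines(hosts, subnets):
    """Return /etc/hosts lines giving each host the same last octet everywhere."""
    lines = []
    for suffix, cidr in subnets.items():
        net = ipaddress.ip_network(cidr)
        for name, octet in sorted(hosts.items(), key=lambda item: item[1]):
            # Reject the network and broadcast addresses and anything outside.
            if not 0 < octet < net.num_addresses - 1:
                raise ValueError(f"offset {octet} is not usable in {cidr}")
            address = net.network_address + octet
            lines.append(f"{address}\t{name}{suffix}")
    return lines
```

Printing `"\n".join(hosts_lines(HOSTS, SUBNETS))` gives the block to distribute to all servers. Listing hosts that are not yet installed keeps the file complete, as the checklist recommends.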