# slurm_quick
This guide provides the minimum steps to get a Slurm cluster running, consisting of a Controller and an arbitrary number of Workers.
First, make sure that the OpenNebula cloud has a running OneGate server. This is a hard requirement, since the Slurm Controller node uses it to share the cluster Munge key. The OneGate server must be reachable from the Slurm Controller VM, either via direct IP visibility or through OpenNebula transparent proxies.
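As a quick sanity check on the Front-end, you can verify that the OneGate service is running and note the endpoint that VMs will use. The service and attribute names below are the standard ones on most installations:

```shell
# On the OpenNebula Front-end: check that the OneGate server is running.
$ systemctl status opennebula-gate

# The endpoint advertised to VMs is set in oned.conf (default port 5030).
$ grep ONEGATE_ENDPOINT /etc/one/oned.conf
```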
Steps to deploy the Slurm Controller:
- Download the Slurm Controller appliance from the OpenNebula Marketplace. This will download the VM template and VM disk image.

      $ onemarketapp export 'Service SlurmController' SlurmController --datastore default
- Adjust the new SlurmController VM template as desired (e.g. CPU, memory, disk size, vNet attachment). The amount of CPU and memory required will depend on the load of the Slurm cluster (the number of worker nodes and the number of jobs received); see the template tuning sketch after this list.
- Instantiate the SlurmController VM template via the FireEdge web interface or through the CLI:

      $ onetemplate instantiate SlurmController
- Access your new SlurmController instance via SSH. Its configuration progress is printed to the terminal; once all configuration steps have finished, you should see the message "All set and ready to serve 8)".
- Get the controller Munge key with:

      $ onevm show <VM-ID> | grep MUNGE

  This key is critical, as it must be shared between all cluster nodes and is used to sign and verify communications. See the key extraction sketch after this list.
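A minimal sketch of the template adjustment step from the CLI, assuming illustrative capacity values (the file name and numbers below are examples, not appliance requirements):

```shell
# Append/override capacity attributes in the SlurmController template.
$ cat > controller_capacity.txt <<'EOF'
CPU    = "4"
VCPU   = "4"
MEMORY = "8192"
EOF
$ onetemplate update SlurmController --append controller_capacity.txt
```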
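If you want just the base64 key value rather than the whole matching line, a small sketch like this works against the `ATTRIBUTE="value"` lines that `onevm show` prints (the VM ID `42` is illustrative):

```shell
# Print only the quoted value of the Munge key attribute.
$ onevm show 42 | grep MUNGE | awk -F'"' '{print $2}'
```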
You can now proceed to deploy an arbitrary number of Slurm Workers:
- Download the Slurm Worker appliance from the OpenNebula Marketplace. This will download the VM template and VM disk image.

      $ onemarketapp export 'Service SlurmWorker' SlurmWorker --datastore default

- Adjust the new SlurmWorker VM template as desired (e.g. CPU, memory, disk size, vNet attachment).
- Instantiate the SlurmWorker VM template via the OpenNebula FireEdge web interface (for a CLI alternative, see the sketch after this list):
  - Set the number of instances to the number of worker nodes to be deployed.
  - Click "Next".
  - You will be prompted for two appliance parameters:
    - Slurm Controller IP address. Make sure the new worker node has IP visibility towards the Slurm Controller.
    - Slurm Controller Munge key (base64). This is the key that the Slurm Controller exposes via OneGate, obtained in the last step of the Slurm Controller deployment.
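For a CLI alternative to the FireEdge dialog, a sketch along these lines should work; the context parameter names below are hypothetical, so check the actual user inputs defined in the template with `onetemplate show SlurmWorker` before adapting it:

```shell
# Instantiate 3 workers, passing the two appliance parameters as context
# attributes merged from an extra template file.
# ONEAPP_SLURM_CONTROLLER_IP and ONEAPP_SLURM_MUNGE_KEY are hypothetical names.
$ cat > worker_context.txt <<'EOF'
CONTEXT = [
  ONEAPP_SLURM_CONTROLLER_IP = "10.0.0.10",
  ONEAPP_SLURM_MUNGE_KEY     = "<base64-munge-key>"
]
EOF
$ onetemplate instantiate SlurmWorker --multiple 3 worker_context.txt
```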
Once the Slurm workers have been instantiated, they will be automatically added to the Slurm cluster after a few minutes. You can check this by running the following command on the Slurm Controller node:

    $ scontrol show nodes
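For a more compact view, the standard Slurm `sinfo` command summarizes partitions and node states; healthy, freshly joined workers should eventually appear in the `idle` state:

```shell
# Run on the Slurm Controller: per-partition summary of node states.
$ sinfo
```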
Next: Slurm Features and usage