# Throwaway Slurm cluster in AWS (Feb'21)
We have created a small throwaway Slurm cluster, using Cluster-in-the-Cloud, to experiment with the EESSI pilot repository, thanks to generous sponsorship by AWS.
## Lifetime
This is a throwaway Slurm cluster; it will only be available for a short amount of time!
We currently plan to destroy the cluster one week after the EESSI update meeting of Feb'21, so on Thu Feb 11th 2021.
DO NOT USE THESE RESOURCES FOR PRODUCTION WORK!
Also, use these resources sparingly: make sure to cancel jobs as soon as you're done experimenting.
## Getting access
To get access, please contact Kenneth Hoste via the EESSI Slack or email ([email protected]), so he can create an account for you.
Required information:

- desired login name (for example `kehoste`)
- first and last name
- GitHub account (which is only used to grab the SSH public keys associated with it, see for example https://github.com/boegel.keys)
  - or, alternatively, an SSH public key
## Logging in
To log in, use your personal account to SSH into the login node of the cluster (you should be informed which hostname to use):

    ssh YOUR_USER_NAME_GOES_HERE@HOSTNAME_GOES_HERE
## Available resources
### Compute nodes
To get a list of all compute nodes, use the `list_nodes` command.

Nodes marked with `idle~` are idle but currently not booted, so it will take a couple of minutes to start them when Slurm directs a job to them.
### Node types
There are 16 compute nodes available in this throwaway cluster, of 4 different node types:

- 4x `c4.xlarge`: Intel Haswell, 4 cores, 7.5GB RAM
- 4x `c5.xlarge`: Intel Skylake or Cascade Lake, 4 cores, 8GB RAM
- 4x `c5a.xlarge`: AMD Rome, 4 cores, 8GB RAM
- 4x `c6g.xlarge`: AWS Graviton 2 (64-bit Arm), 4 cores, 8GB RAM
See https://aws.amazon.com/ec2/instance-types/#Compute_Optimized for detailed information.
### Slurm configuration
The compute nodes are only started when needed, i.e. when jobs are submitted to run on them.

Keep this in mind when submitting jobs: it may take a couple of minutes before your job starts if there are no (matching) active compute nodes!

To check for active nodes, use the `sinfo` command.
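For example, this is a quick way to inspect node states (a sketch; the exact partition names, node names, and feature labels depend on how this Cluster-in-the-Cloud setup was configured):

```bash
# overview of partitions and node states;
# nodes in state "idle~" are powered down and will be booted on demand
sinfo

# per-node view; the AVAIL_FEATURES column should show the node "shapes"
# that can be used with --constraint (see "Submitting jobs" below)
sinfo --Node --long
```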
### Network
There is no high-performance network interconnect for these compute nodes, so be careful with multi-node jobs!
### Shared home directory
Your home directory is on a shared filesystem, so any files you create there are also accessible from the compute nodes.
Keep in mind that this is a slow NFS filesystem; it is not well suited for I/O-intensive work!
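If you do need to run something I/O-intensive, one option is to stage data through node-local storage rather than working directly in your NFS home directory. A minimal sketch, assuming the compute nodes have a local `/tmp` with enough free space (`my_input_data` and `results` are just placeholder names):

```bash
# create a private scratch directory on node-local storage
# (assumes /tmp is local to the node and large enough)
SCRATCH=$(mktemp -d /tmp/${USER}_scratch_XXXX)

# copy input from the shared home directory, do the heavy I/O locally,
# and copy only the results back at the end
cp -r "$HOME/my_input_data" "$SCRATCH/"
cd "$SCRATCH"
# ... run the I/O-intensive work here ...
cp -r results "$HOME/"

# clean up the node-local scratch directory
rm -rf "$SCRATCH"
```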
## Submitting jobs
To start a quick interactive job, use:

    srun --pty /bin/bash
To submit a job to a specific type of node, use the `--constraint` option and specify the node "shape".

For example, to submit to a Graviton 2 compute node:

    srun --constraint=shape=c6g.xlarge --pty /bin/bash
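The same `--constraint` option works for batch jobs. A minimal sketch of a job script (the job name, resource requests, and time limit are just illustrative; adjust them to what you actually need):

```bash
#!/bin/bash
#SBATCH --job-name=eessi-test            # illustrative job name
#SBATCH --constraint=shape=c5a.xlarge    # request an AMD Rome node (see node types above)
#SBATCH --nodes=1
#SBATCH --ntasks=4                       # the xlarge instances have 4 cores
#SBATCH --time=00:30:00                  # keep jobs short on this shared throwaway cluster

# set up the EESSI pilot environment (see the next section)
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

module avail
```

Submit it with `sbatch name_of_your_script.sh`, and check on it with `squeue -u $USER`.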
## Access to EESSI pilot repository
The EESSI pilot repository is available on each compute node.

Keep in mind that it may not be mounted yet, since CernVM-FS uses `autofs`, so `ls /cvmfs` may show nothing.
To get started, just source the EESSI initialization script:

    source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

and then use `module avail` to check the available software.
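For instance, in an interactive job on a compute node, you could do something along these lines (`GROMACS` is only an example module name; pick one from the actual `module avail` output):

```bash
# initialize the EESSI pilot environment
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

# list the available modules, then load one and check that it works
# (GROMACS is just an example; use a module that actually shows up in `module avail`)
module avail
module load GROMACS
gmx --version
```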
We recommend trying the demo scripts available at https://github.com/EESSI/eessi-demo (see the `run.sh` script in each subdirectory).
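For example, from an interactive job on a compute node (assuming `git` is available; `TensorFlow` is just one of the demo subdirectories, pick whichever you like):

```bash
# clone the demo scripts; the home directory is shared, so cloning once is enough
git clone https://github.com/EESSI/eessi-demo.git
cd eessi-demo

# make sure the EESSI environment is set up first
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

# run one of the demos ("TensorFlow" is just an example subdirectory)
cd TensorFlow
./run.sh
```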