# Throwaway Slurm cluster in AWS (Feb'21)
We have created a small throwaway Slurm cluster, using Cluster-in-the-Cloud, to experiment with the EESSI pilot repository, thanks to generous sponsorship by AWS.
## Lifetime
This is a throwaway Slurm cluster; it will only be available for a short amount of time!
We currently plan to destroy the cluster one week after the EESSI update meeting of Feb'21, so on Thu Feb 11th 2021.
DO NOT USE THESE RESOURCES FOR PRODUCTION WORK!
Also, use these resources sparingly: make sure to cancel jobs as soon as you're done experimenting.
## Getting access
To get access, please contact Kenneth Hoste via the EESSI Slack or email ([email protected]), so he can create an account for you.
Required information:

- desired login name (for example `kehoste`)
- first and last name
- GitHub account (which is only used to grab the SSH public keys associated with it, see for example https://github.com/boegel.keys)
  - or, alternatively, an SSH public key
## Logging in
To log in, use your personal account to SSH into the login node of the cluster (you should be informed which hostname to use):

    ssh YOUR_USER_NAME_GOES_HERE@HOSTNAME_GOES_HERE
## Available resources
### Compute nodes
To get a list of all compute nodes, use the `list_nodes` command.

Nodes marked with `idle~` are idle but currently not booted, so it will take a couple of minutes to start them when Slurm directs a job to them.
### Node types
There are 16 compute nodes available in this throwaway cluster, of 4 different node types:

- 4x `c4.xlarge`: Intel Haswell, 4 cores, 7.5GB RAM
- 4x `c5.xlarge`: Intel Skylake or Cascade Lake, 4 cores, 8GB RAM
- 4x `c5a.xlarge`: AMD Rome, 4 cores, 8GB RAM
- 4x `c6g.xlarge`: AWS Graviton 2 (64-bit Arm), 4 cores, 8GB RAM
See https://aws.amazon.com/ec2/instance-types/#Compute_Optimized for detailed information.
### Slurm configuration
The compute nodes are only started when needed, i.e. when jobs are submitted to run on them.

Keep this in mind when submitting jobs: it may take a couple of minutes before your job starts if there are no (matching) active compute nodes!

To check for active nodes, use the `sinfo` command.
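For example, this is a quick way to inspect node states (a sketch; the exact partition names, node names, and feature labels depend on how this Cluster-in-the-Cloud setup was configured):

```bash
# overview of partitions and node states;
# nodes in state "idle~" are powered down and will be booted on demand
sinfo

# per-node view; the AVAIL_FEATURES column should show the node "shapes"
# that can be used with --constraint (see "Submitting jobs" below)
sinfo --Node --long
```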
### Network
There is no high-performance network interconnect for these compute nodes, so be careful with multi-node jobs!
### Shared home directory
Your home directory is on a shared filesystem, so any files you create there are also accessible from the compute nodes.
Keep in mind that this is a slow NFS filesystem; it is not well suited for I/O-intensive work!
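If you do need to run something I/O-intensive, one option is to stage data through node-local storage rather than working directly in your NFS home directory. A minimal sketch, assuming the compute nodes have a local `/tmp` with enough free space (`my_input_data` and `results` are just placeholder names):

```bash
# create a private scratch directory on node-local storage
# (assumes /tmp is local to the node and large enough)
SCRATCH=$(mktemp -d /tmp/${USER}_scratch_XXXX)

# copy input from the shared home directory, do the heavy I/O locally,
# and copy only the results back at the end
cp -r "$HOME/my_input_data" "$SCRATCH/"
cd "$SCRATCH"
# ... run the I/O-intensive work here ...
cp -r results "$HOME/"

# clean up the node-local scratch directory
rm -rf "$SCRATCH"
```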
## Submitting jobs
To start a quick interactive job, use:

    srun --pty /bin/bash
To submit a job to a specific type of node, use the `--constraint` option and specify the node "shape".

For example, to submit to a Graviton 2 compute node:

    srun --constraint=shape=c6g.xlarge --pty /bin/bash
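The same `--constraint` option works for batch jobs. A minimal sketch of a job script (the job name, resource requests, and time limit are just illustrative; adjust them to what you actually need):

```bash
#!/bin/bash
#SBATCH --job-name=eessi-test            # illustrative job name
#SBATCH --constraint=shape=c5a.xlarge    # request an AMD Rome node (see node types above)
#SBATCH --nodes=1
#SBATCH --ntasks=4                       # the xlarge instances have 4 cores
#SBATCH --time=00:30:00                  # keep jobs short on this shared throwaway cluster

# set up the EESSI pilot environment (see the next section)
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

module avail
```

Submit it with `sbatch name_of_your_script.sh`, and check on it with `squeue -u $USER`.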
## Access to EESSI pilot repository
The EESSI pilot repository is available on each compute node.

Keep in mind that it may not be mounted yet, since CernVM-FS uses `autofs`, so `ls /cvmfs` may show nothing.
To get started, just source the EESSI initialization script:

    source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

and then use `module avail` to check the available software.
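For instance, in an interactive job on a compute node, you could do something along these lines (`GROMACS` is only an example module name; pick one from the actual `module avail` output):

```bash
# initialize the EESSI pilot environment
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

# list the available modules, then load one and check that it works
# (GROMACS is just an example; use a module that actually shows up in `module avail`)
module avail
module load GROMACS
gmx --version
```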
We recommend trying the demo scripts available at https://github.com/EESSI/eessi-demo (see the `run.sh` script in each subdirectory).
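For example, from an interactive job on a compute node (assuming `git` is available; `TensorFlow` is just one of the demo subdirectories, pick whichever you like):

```bash
# clone the demo scripts; the home directory is shared, so cloning once is enough
git clone https://github.com/EESSI/eessi-demo.git
cd eessi-demo

# make sure the EESSI environment is set up first
source /cvmfs/pilot.eessi-hpc.org/latest/init/bash

# run one of the demos ("TensorFlow" is just an example subdirectory)
cd TensorFlow
./run.sh
```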