How to use the LCH Cluster

Request an account from Riccardo

Log in to the "head" node LCH01

> ssh LCH01

Change your password

> passwd

Check that you can run a TensorFlow Python test script ( test/ ) directly on the head node

IMPORTANT: For TensorFlow, remember to type "module add cuda/9.0" at the shell and/or in any script ...

For example, to multiply a 5 x 6 matrix by a 6 x 5 matrix on the GPU (giving the 5 x 5 result shown):

> cp -R /users/test .
> cd test
> module add cuda/9.0
> python3.5 gpu 5 6
LCH01 2019-03-07 15:48:26
TensorFlow Version 1.12.0
Using GPUs:  name: GeForce GTX 1080 Ti
device: 0
[[ 275.  290.  305.  320.  335.]
 [ 725.  776.  827.  878.  929.]
 [1175. 1262. 1349. 1436. 1523.]
 [1625. 1748. 1871. 1994. 2117.]
 [2075. 2234. 2393. 2552. 2711.]]

Try running some basic Linux commands via SLURM

> srun hostname

> srun ls

> srun df -H

How about running that on one of the other compute nodes? Easy:

> srun -wLCH02 df -H
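
As a minimal sketch, the per-node check above can be wrapped in a small shell function that loops over node names. The function name `df_on_nodes` is hypothetical; the node names are the ones listed on this page, and `-w` is the same option used above.

```shell
# Hypothetical helper: run "df -H" on each named node via srun.
df_on_nodes() {
    for node in "$@"; do
        echo "== $node =="
        srun -w "$node" df -H
    done
}

# Only call it where SLURM is actually installed.
if command -v srun >/dev/null 2>&1; then
    df_on_nodes LCH01 LCH02 LCH03 LCH04
fi
```

Each node is queried in turn, so one slow or busy node delays the ones after it. For a quick disk-space check that is usually acceptable.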

You can check the status of the cluster with

> sinfo -N -l
Wed Aug 22 11:41:15 2018
LCH01          1    debug*        idle   12    1:6:2      1        0      1   (null) none                
LCH02          1    debug*        idle   12    1:6:2      1        0      1   (null) none                
LCH03          1    debug*        idle   12    1:6:2      1        0      1   (null) none                
LCH04          1    debug*        idle   12    1:6:2      1        0      1   (null) none                
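
If you only want the names of the idle nodes, one possible approach is to ask sinfo for a two-column format and filter it. This is a sketch, not part of the cluster setup: the helper name `idle_nodes` is hypothetical, and it matches the state "idle" exactly, so nodes flagged with a suffix (for example "idle*") are deliberately left out.

```shell
# Hypothetical helper: read "<node> <state>" pairs on stdin, print idle nodes.
idle_nodes() {
    awk '$2 == "idle" { print $1 }' | sort -u
}

if command -v sinfo >/dev/null 2>&1; then
    # -N: one line per node, -h: no header line, -o: custom output format
    # %N = node name, %t = node state (compact form)
    sinfo -N -h -o '%N %t' | idle_nodes
fi
```

The output can be fed to "srun -w" to target a node that is currently free.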

To dynamically monitor the cluster status you can use "watch", here refreshing every 0.5 seconds (Ctrl-C to stop)

> watch -n 0.5 sinfo -N -l

Jobs that are in the SLURM Cluster queue can be seen using the "squeue" command

> squeue

To dynamically monitor the queue use "watch" (Ctrl-C to stop)

> watch -n 0.5 squeue
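
In a script, where "watch" is of little use, a polling loop around squeue can serve the same purpose. The sketch below assumes you want to wait until none of your own jobs remain in the queue. The function name `wait_for_my_jobs` and the default 5-second interval are hypothetical choices.

```shell
# Hypothetical helper: block until the current user has no jobs in the queue.
wait_for_my_jobs() {
    interval="${1:-5}"                 # seconds between polls
    user="${USER:-$(id -un)}"
    # -h drops the header line, so empty output means "no jobs left".
    while [ -n "$(squeue -h -u "$user")" ]; do
        sleep "$interval"
    done
}
```

For example, "wait_for_my_jobs 2 && cat out.txt" would print the results once the queue no longer lists any of your jobs.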

Submitting a SLURM job

Try submitting the Python SLURM TensorFlow test script /users/test/

This script runs a TensorFlow GPU matrix multiply on each of the four LCH compute nodes


  • This script has several lines that begin with a "#" character: these are NOT comments but directives to the SLURM job manager
  • At the heart of the script, "srun" is invoked to run several Python scripts, enumerated (from 0) in an auxiliary file called multi.conf
  • Because "srun" is invoked, you could target your script at specific compute nodes by using, for example, "srun -wLCH0[3-4]"
  • To submit this to the SLURM job manager, always use "sbatch", NOT "sh"
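
The structure described in the list above can be sketched as follows. This is not the actual test script. The file names `my_job.sh` and `my_gpu_test.py` are hypothetical, the "5 5" arguments are an assumption, and the job name, partition and output file are only guessed from the squeue line and the out.txt name shown on this page.

```shell
# Task list for "srun --multi-prog": <task rank> <program> <arguments>.
# my_gpu_test.py is a hypothetical stand-in for the real test script.
cat > multi.conf <<'EOF'
0 python3.5 my_gpu_test.py 5 5
1 python3.5 my_gpu_test.py 5 5
2 python3.5 my_gpu_test.py 5 5
3 python3.5 my_gpu_test.py 5 5
EOF

# Batch script: the "#SBATCH" lines are read by sbatch, not skipped as comments.
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=testTF
#SBATCH --partition=debug
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --output=out.txt

module add cuda/9.0
srun --multi-prog multi.conf
EOF

# Submit with: sbatch my_job.sh
```

Running "sh my_job.sh" instead would treat the "#SBATCH" lines as ordinary comments, so the node count and output file would be ignored. That is why sbatch is required.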

> sbatch

squeue should show the job running on all four nodes:

              1159     debug   testTF     mike CG       0:08      4 LCH[01-04]

When complete, the results will be found in your current directory in "out.txt"

> cat out.txt
LCH01 2019-03-07 15:46:28
TensorFlow Version 1.12.0
Using GPUs:  name: GeForce GTX 1080 Ti
device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
LCH04 2019-03-07 15:46:29
TensorFlow Version 1.12.0
Using GPUs:  name: GeForce GTX 1080 Ti
device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
LCH02 2019-03-07 15:46:30
TensorFlow Version 1.12.0
Using GPUs:  name: GeForce GTX 1080 Ti
device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
LCH03 2019-03-07 15:46:30
TensorFlow Version 1.12.0
Using GPUs:  name: GeForce GTX 1080 Ti
device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
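
To confirm quickly that every node reported back, you could count the header lines in the output file. This sketch assumes each task starts its output with a line of the form "LCHnn date time", as in the listing above. The helper name `nodes_reported` is hypothetical.

```shell
# Hypothetical helper: list the distinct node names that wrote a header line.
nodes_reported() {
    grep -E '^LCH[0-9]+ ' "$1" | cut -d' ' -f1 | sort -u
}

# Print the reporting nodes if a results file is present.
if [ -f out.txt ]; then
    nodes_reported out.txt
fi
```

For the listing above this would print LCH01 to LCH04, one per line. A missing name points to a task that failed or has not finished writing.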

Useful SLURM Commands

Explore SLURM

