How to use the LCH Cluster
Request an account from Riccardo
Log in to the "head" node LCH01
> ssh LCH01
Change your password
> passwd
Check that you can run a TensorFlow Python test script (test/testTF.py) directly on the head node
IMPORTANT: For TensorFlow, remember to run "module add cuda/9.0" in your shell and/or in any job script before starting Python.
For example, to multiply a 5 x 6 matrix on the GPU:
> cp -R /users/test .
> cd test
> module add cuda/9.0
> python3.5 testTF.py gpu 5 6
================================================================================
LCH01   2019-03-07 15:48:26
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 275.  290.  305.  320.  335.]
 [ 725.  776.  827.  878.  929.]
 [1175. 1262. 1349. 1436. 1523.]
 [1625. 1748. 1871. 1994. 2117.]
 [2075. 2234. 2393. 2552. 2711.]]
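If the script reports that no GPU was found, it may help to confirm that the CUDA module is loaded and that the node's GPU is visible (this assumes the standard module and NVIDIA driver tools are available on the node):

> module list
> nvidia-smi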
Try running some basic Linux commands via SLURM
> srun hostname
> srun ls
> srun df -H
How about running that on one of the other compute nodes? Easy:
> srun -wLCH02 df -H
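Beyond -w, srun accepts other resource options; a few examples are sketched below (the --gres line assumes GPUs are configured as a SLURM generic resource on this cluster, which may not be the case here):

> srun -N2 hostname
> srun -wLCH0[2-4] hostname
> srun --gres=gpu:1 nvidia-smi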
You can check the status of the cluster with
> sinfo -N -l
Wed Aug 22 11:41:15 2018
NODELIST   NODES PARTITION  STATE  CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  FEATURES  REASON
LCH01          1    debug*   idle    12  1:6:2       1         0       1    (null)  none
LCH02          1    debug*   idle    12  1:6:2       1         0       1    (null)  none
LCH03          1    debug*   idle    12  1:6:2       1         0       1    (null)  none
LCH04          1    debug*   idle    12  1:6:2       1         0       1    (null)  none
To dynamically monitor the cluster status you can use "watch" (Ctrl-C to stop)
> watch --i 0.5 sinfo -N -l
Jobs that are in the SLURM Cluster queue can be seen using the "squeue" command
> squeue
To dynamically monitor the queue use "watch" (Ctrl-C to stop)
> watch --i 0.5 squeue
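To list only your own jobs, squeue can filter by user:

> squeue -u $USER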
Try submitting the SLURM TensorFlow test script /users/test/testTF.sh
This script runs a TensorFlow GPU matrix multiply on each of the four LCH compute nodes
Notes:
- This script has several lines that begin with "#SBATCH" -- these are NOT ordinary comments but directives to the SLURM job manager
- At the heart of the script, "srun" is invoked to run several Python scripts enumerated (from 0) in an auxiliary file called multi.conf (see the sketch after these notes)
- Because "srun" is used, you could target your script at specific compute nodes by using, for example, "srun -wLCH0[3-4]"
- Remember: to submit this to the SLURM job manager, always use sbatch, NOT sh
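The real files are in /users/test; purely as a sketch (the directive values, file names and matrix dimensions below are assumptions, not copied from the actual files), testTF.sh might look something like:

#!/bin/bash
# Directive lines read by the SLURM job manager (values here are guesses):
#SBATCH --job-name=testTF
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --output=out.txt

# CUDA must be available for TensorFlow
module add cuda/9.0

# Run one command per task; task numbers (from 0) are mapped to
# command lines in the auxiliary file multi.conf
srun --multi-prog multi.conf

and multi.conf would then enumerate one Python command per task number, for example:

0 python3.5 testTF.py gpu 5 5
1 python3.5 testTF.py gpu 5 5
2 python3.5 testTF.py gpu 5 5
3 python3.5 testTF.py gpu 5 5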
> sbatch testTF.sh
squeue should show the job running on all four nodes (the ST column is the job state, e.g. R = running, CG = completing):
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1159     debug   testTF     mike CG       0:08      4 LCH[01-04]
When complete, the results will be found in your current directory in "out.txt"
> cat out.txt
================================================================================
LCH01   2019-03-07 15:46:28
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
================================================================================
LCH04   2019-03-07 15:46:29
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
================================================================================
LCH02   2019-03-07 15:46:30
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
================================================================================
LCH03   2019-03-07 15:46:30
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
- sinfo
- srun
- sbatch
- scancel (see the example after this list)
- squeue
- More Useful SLURM Commands
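Of these, scancel is the only one not shown above; it removes a job from the queue by the job ID reported in the squeue JOBID column, for example:

> scancel 1159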
Tutorials