How to use the LCH Cluster
Request an account from Riccardo
Log in to the "head" node LCH01
> ssh LCH01
Change your password
> passwd
Check that you can run a TensorFlow Python test script (test/testTF.py) directly on the head node
IMPORTANT: For TensorFlow, remember to run "module add cuda/9.0" in your shell and/or in any job script before starting Python.
For example, to multiply a 5 x 6 matrix on the GPU:
> cp -R /users/test .
> cd test
> module add cuda/9.0
> python3.5 testTF.py gpu 5 6
================================================================================
LCH01   2019-03-07 15:48:26
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 275.  290.  305.  320.  335.]
 [ 725.  776.  827.  878.  929.]
 [1175. 1262. 1349. 1436. 1523.]
 [1625. 1748. 1871. 1994. 2117.]
 [2075. 2234. 2393. 2552. 2711.]]
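If the script reports that no GPU was found, it may help to confirm that the CUDA module is loaded and that the node's GPU is visible (this assumes the standard module and NVIDIA driver tools are available on the node):

> module list
> nvidia-smi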
Try running some basic Linux commands via SLURM
> srun hostname
> srun ls
> srun df -H
How about running that on one of the other compute nodes? Easy:
> srun -wLCH02 df -H
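Beyond -w, srun accepts other resource options; a few examples are sketched below (the --gres line assumes GPUs are configured as a SLURM generic resource on this cluster, which may not be the case here):

> srun -N2 hostname
> srun -wLCH0[2-4] hostname
> srun --gres=gpu:1 nvidia-smi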
You can check the status of the cluster with
> sinfo -N -l
Wed Aug 22 11:41:15 2018
NODELIST   NODES PARTITION  STATE  CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  FEATURES  REASON
LCH01          1    debug*   idle    12  1:6:2       1         0       1    (null)  none
LCH02          1    debug*   idle    12  1:6:2       1         0       1    (null)  none
LCH03          1    debug*   idle    12  1:6:2       1         0       1    (null)  none
LCH04          1    debug*   idle    12  1:6:2       1         0       1    (null)  none
To dynamically monitor the cluster status you can use "watch" (Ctrl-C to stop)
> watch --i 0.5 sinfo -N -l
Jobs that are in the SLURM Cluster queue can be seen using the "squeue" command
> squeue
To dynamically monitor the queue use "watch" (Ctrl-C to stop)
> watch --i 0.5 squeue
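To list only your own jobs, squeue can filter by user:

> squeue -u $USER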
Try submitting the SLURM TensorFlow test script /users/test/testTF.sh
This script runs a TensorFlow GPU matrix multiply on each of the four LCH compute nodes
Notes:
- This script has several lines that begin with "#SBATCH" -- these are NOT ordinary comments but directives to the SLURM job manager
- At the heart of the script, "srun" is invoked to run several Python scripts enumerated (from 0) in an auxiliary file called multi.conf (see the sketch after these notes)
- Because "srun" is used, you could target your script at specific compute nodes by using, for example, "srun -wLCH0[3-4]"
- Remember: to submit this to the SLURM job manager, always use sbatch, NOT sh
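The real files are in /users/test; purely as a sketch (the directive values, file names and matrix dimensions below are assumptions, not copied from the actual files), testTF.sh might look something like:

#!/bin/bash
# Directive lines read by the SLURM job manager (values here are guesses):
#SBATCH --job-name=testTF
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --output=out.txt

# CUDA must be available for TensorFlow
module add cuda/9.0

# Run one command per task; task numbers (from 0) are mapped to
# command lines in the auxiliary file multi.conf
srun --multi-prog multi.conf

and multi.conf would then enumerate one Python command per task number, for example:

0 python3.5 testTF.py gpu 5 5
1 python3.5 testTF.py gpu 5 5
2 python3.5 testTF.py gpu 5 5
3 python3.5 testTF.py gpu 5 5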
> sbatch testTF.sh
squeue should show the job running on all four nodes (the ST column is the job state, e.g. R = running, CG = completing):
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1159     debug   testTF     mike CG       0:08      4 LCH[01-04]
When complete, the results will be found in your current directory in "out.txt"
> cat out.txt
================================================================================
LCH01   2019-03-07 15:46:28
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
================================================================================
LCH04   2019-03-07 15:46:29
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
================================================================================
LCH02   2019-03-07 15:46:30
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
================================================================================
LCH03   2019-03-07 15:46:30
TensorFlow Version 1.12.0
Using GPUs: name: GeForce GTX 1080 Ti device: 0
[[ 150.  160.  170.  180.  190.]
 [ 400.  435.  470.  505.  540.]
 [ 650.  710.  770.  830.  890.]
 [ 900.  985. 1070. 1155. 1240.]
 [1150. 1260. 1370. 1480. 1590.]]
- sinfo
- srun
- sbatch
- scancel (see the example after this list)
- squeue
- More Useful SLURM Commands
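Of these, scancel is the only one not shown above; it removes a job from the queue by the job ID reported in the squeue JOBID column, for example:

> scancel 1159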
Tutorials