H.T.U Deep Learning - refraction-ray/TH2-demos GitHub Wiki

Examples of running deep learning programs on Tianhe-2:

GPU node configuration:

CPU: 2 x Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz, 20 CPU cores in total
MEM: 256 GB main memory
GPU: 2 x Nvidia Tesla K80 accelerator cards, each with 24 GB of memory
Nodes are interconnected via an InfiniBand network

Installed software:

Run module avail to list the preconfigured modules, then load one with module load xxx. The deeplearning/18Q2 module bundles several deep learning packages on top of python2.7 (anaconda), including CUDA/8.0, cudnn/6.0, caffe/1.0, caffe2, pytorch/0.5a(20180509), tensorflow/1.6, opencv/3, Keras/2.1, scikit-learn/0.19, and others.
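After loading a module, it can be useful to confirm which of the listed packages are actually importable from the active Python. The following is a minimal sketch, not part of the original page: the helper name report_versions is hypothetical, and the import names (cv2 for opencv, sklearn for scikit-learn, torch for pytorch) are assumptions about how the bundle exposes them.

```python
import importlib


def report_versions(names):
    """Return {import_name: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            mod = importlib.import_module(name)
        except Exception:
            # Covers ImportError as well as failures while loading shared libraries.
            versions[name] = None
            continue
        # Not every package defines __version__, so fall back to "unknown".
        versions[name] = getattr(mod, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # Import names assumed for the packages listed in deeplearning/18Q2.
    wanted = ["tensorflow", "torch", "keras", "sklearn", "cv2", "caffe"]
    for name, version in sorted(report_versions(wanted).items()):
        print("%-12s %s" % (name, version if version else "not importable"))
```

The script sticks to syntax that works on both Python 2.7 and Python 3, since the module described above is based on python2.7.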

Method 1: Batch mode

0. Make sure you are logged in to ln41

[myname@ln41%tianhe2-G test]$ hostname
ln41

1. Edit the job script

[myname@ln41%tianhe2-G test]$ cat myjob.sh
#!/bin/bash
source /BIGDATA1/app_GPU/toolshs/moduleenv.sh
module load TensorFlow/1.3-gpu-py2.7
python train.py

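The script above sets up the module environment, loads the TensorFlow module, and then runs the training program.

If other software is needed, add the corresponding module names to the script above. Software that is not provided as a module can be installed in your own directory.

2. Submit the job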

[myname@ln41%tianhe2-G test]$ yhbatch -p gpu myjob.sh

Once the job starts, it writes its output to a file named slurm-xxxxx.out, where xxxxx is the job ID.
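To look at the result without looking up the job ID by hand, one option is to pick the most recently modified slurm-*.out file. The following is a minimal sketch, assuming the output files land in the directory the job was submitted from; latest_slurm_output is a hypothetical helper name.

```python
import glob
import os


def latest_slurm_output(directory="."):
    """Return the path of the newest slurm-*.out file in directory, or None."""
    candidates = glob.glob(os.path.join(directory, "slurm-*.out"))
    if not candidates:
        return None
    # The most recently modified file belongs to the most recently active job.
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    path = latest_slurm_output()
    if path is None:
        print("no slurm-*.out file found yet")
    else:
        print("==> %s <==" % path)
        with open(path) as handle:
            # Show only the last 20 lines of the job output.
            for line in handle.readlines()[-20:]:
                print(line.rstrip("\n"))
```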

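Submission may involve other parameters (such as the number of nodes or CPU cores); see the Tianhe user manual and the Slurm documentation for details.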
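As an illustration of such extra parameters, the sketch below assembles a yhbatch command line. It assumes that yhbatch accepts the standard Slurm sbatch options -N (node count) and -c (CPU cores per task) alongside the -p option used above; this is an assumption to verify against the Tianhe manual. The helper build_submit_command is hypothetical.

```python
def build_submit_command(script, partition="gpu", nodes=1, cpus_per_task=None):
    """Return the yhbatch argument list for the given job script."""
    cmd = ["yhbatch", "-p", partition, "-N", str(nodes)]
    if cpus_per_task is not None:
        # Assumed to mirror sbatch's -c / --cpus-per-task option.
        cmd += ["-c", str(cpus_per_task)]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    cmd = build_submit_command("myjob.sh", nodes=1, cpus_per_task=4)
    # Only prints the command; on the login node it could be passed to
    # subprocess.check_call(cmd) to actually submit the job.
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if the script path contains spaces.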

Method 2: Interactive mode

0. Make sure you are logged in to ln41

[myname@ln41%tianhe2-G test]$ hostname
ln41

1. Request resources and log in to the compute node

[myname@ln41%tianhe2-G test]$ yhalloc -p gpu -N 1
yhalloc: Granted job allocation 26259
[myname@ln41%tianhe2-G test]$ yhq
         JOBID PARTITION   NAME     USER ST     TIME  NODES NODELIST(REASON)
         12345       gpu   bash   myname  R  INVALID      1 gn06
[myname@ln41%tianhe2-G test]$ ssh gn06
[myname@gn06%tianhe2-G ~]$
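The node to log in to is the one listed in the NODELIST column of the running job. If this step needs to be scripted, the yhq table can be parsed as in the sketch below. It assumes the column layout shown above and job names without spaces; allocated_nodes is a hypothetical helper.

```python
def allocated_nodes(yhq_output, user):
    """Return NODELIST entries of the running (ST == 'R') jobs owned by user."""
    lines = [line.split() for line in yhq_output.splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    user_col = header.index("USER")
    state_col = header.index("ST")
    nodes = []
    for fields in rows:
        if fields[user_col] == user and fields[state_col] == "R":
            # NODELIST(REASON) is the last column of the table.
            nodes.append(fields[-1])
    return nodes


if __name__ == "__main__":
    sample = (
        "         JOBID PARTITION   NAME     USER ST     TIME  NODES NODELIST(REASON)\n"
        "         12345       gpu   bash   myname  R  INVALID      1 gn06\n"
    )
    print(allocated_nodes(sample, "myname"))
```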

2. Load the module

[myname@gn06%tianhe2-G test]$ source /BIGDATA1/app_GPU/toolshs/moduleenv.sh
[myname@gn06%tianhe2-G test]$ module load TensorFlow/1.3-gpu-py2.7 # load the module

3. Run the program

[myname@gn06%tianhe2-G test]$ cat train.py
#!/usr/bin/env python
import tensorflow as tf
print('tensorflow: %s' % tf.__version__)
[myname@gn06%tianhe2-G test]$ python train.py
tensorflow: 1.3.0
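Printing the version only shows that TensorFlow imports. To also confirm that the GPUs are visible from the compute node, a check along the following lines can be added. It is a sketch that relies on device_lib.list_local_devices() from TensorFlow's tensorflow.python.client package, which is available in the 1.x releases but is not a documented public API; visible_gpus is a hypothetical helper.

```python
def visible_gpus():
    """Return names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("tensorflow is not importable; load the module first")
    elif not gpus:
        print("tensorflow loaded, but no GPU is visible")
    else:
        print("visible GPUs: %s" % ", ".join(gpus))
```

An empty list on a GPU node usually points at the environment (for example a CPU-only TensorFlow build or a missing CUDA module) rather than at the training script.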