H.T.U Caffe - refraction-ray/TH2-demos GitHub Wiki

Use Caffe on TH-2 (Demo)

A demonstration of how to use Caffe on Tianhe-2.

2016/5/20 by: lijiang; revised 2016/12/16 by: lijiang

Caffe is a currently popular deep learning framework.

Caffe is better suited to GPU computation. However, a customer is working on a CPU optimization of Caffe, which may later be released on TH-2.

1. Load the environment

Use module load to load the Caffe environment. If you are not yet familiar with the system, first run module avail to see which versions are available.

[nscc-gz_jiangli@ln2 ~]$ module avail caffe

---------------------------------- /WORK/app/modulefiles -----------------------------------
caffe/v20160510-cpu3        caffe/v20160511-gpu-icc     caffe/v20161130-gpu-cudnn
caffe/v20160511-gpu         caffe/v20161129-24d2f67-gpu
[nscc-gz_jiangli@ln2 ~]$ 

On the fund partition (NSFC) the list differs. Different versions correspond to different Caffe source code and build options.

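The demonstration below was recorded earlier on the NSFC partition; when working on the regular partition, substitute paths and partition names accordingly.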

Load the environment with module load.

$ module load caffe/v20160510-cpu3
$ which caffe
/NSFCGZ/app/caffe/v20160510-cpu3/bin/caffe

This shows that the caffe command and its environment have been loaded successfully.
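A job script can repeat this check before doing any work. The following is a minimal sketch, not part of the original demo: the helper name require_cmd is hypothetical, and it relies only on the standard command -v builtin.

```shell
#!/bin/sh
# require_cmd is a hypothetical helper (not part of Caffe or TH-2).
# It prints the resolved path if the command is on PATH; otherwise it
# prints a hint to stderr and returns non-zero.
require_cmd() {
    if resolved=$(command -v "$1"); then
        printf '%s\n' "$resolved"
    else
        printf '%s not found; was "module load" run first?\n' "$1" >&2
        return 1
    fi
}

# Typical use after "module load caffe/<version>":
#   require_cmd caffe || exit 1
```

Failing early this way gives a clearer message than a "command not found" error in the middle of a batch job.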

2. Data preparation

$ ls /NSFCGZ/app/caffe/v20160510-cpu3
bin  caffe-master  include  lib  python  share

Users can upload their own caffe-master folder, or copy one directly from /NSFCGZ/app/caffe/v20160510-cpu3/caffe-master.

$ cp -r /NSFCGZ/app/caffe/v20160510-cpu3/caffe-master ~/.
$ cd ~/caffe-master 
$ ls 
CMakeLists.txt	 Makefile		  caffe.cloc  examples	scripts
CONTRIBUTING.md  Makefile.config	  cmake       include	src
CONTRIBUTORS.md  Makefile.config.example  data	      matlab	tools
INSTALL.md	 README.md		  docker      models
LICENSE		 build			  docs	      python

I had already downloaded the data for the cifar10 example; for a complete demonstration, however, the mnist example is used here.

$ ls data/mnist
get_mnist.sh

As shown, this folder contains only a single script. Running it the way you would on your own computer fails:

$ data/mnist/get_mnist.sh

Downloading...
--2016-05-20 15:58:24--  http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Resolving yann.lecun.com... failed: Temporary failure in name resolution.
wget: unable to resolve host address `yann.lecun.com'
gzip: train-images-idx3-ubyte.gz: No such file or directory
--2016-05-20 15:58:24--  http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Resolving yann.lecun.com... failed: Temporary failure in name resolution.
wget: unable to resolve host address `yann.lecun.com'
gzip: train-labels-idx1-ubyte.gz: No such file or directory
--2016-05-20 15:58:24--  http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Resolving yann.lecun.com... failed: Temporary failure in name resolution.
wget: unable to resolve host address `yann.lecun.com'
gzip: t10k-images-idx3-ubyte.gz: No such file or directory
--2016-05-20 15:58:24--  http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Resolving yann.lecun.com... failed: Temporary failure in name resolution.
wget: unable to resolve host address `yann.lecun.com'
gzip: t10k-labels-idx1-ubyte.gz: No such file or directory

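The reason is simple: the machine is not connected to the Internet. The archive or data files have to be placed here manually. A copy can be taken from /NSFCGZ/app/share.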

$ cp /NSFCGZ/app/share/mnist_data.tar.gz data/mnist/.

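Then modify the script data/mnist/get_mnist.sh. The editing process itself cannot be shown here, so make the change yourself; diff is used to compare the file before and after the change, and the same applies to the file modifications later on: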

$ diff data/mnist/get_mnist.sh data/mnist/get_mnist-th.sh

8c8
< 
---
> tar xvf *.tar.gz
12c12
<         wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
---
>         #wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
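The effect of this change can be sketched as a standalone function. This is an illustration under assumptions, not the actual get_mnist-th.sh: the name prepare_mnist is hypothetical, and the archive is assumed to hold the four .gz files at its top level, as the transcript suggests.

```shell
#!/bin/sh
# prepare_mnist is a hypothetical helper: an offline variant of the MNIST
# fetch step. The file names are the ones that appear in the transcript.
prepare_mnist() {
    dir=$1
    (
        cd "$dir" || exit 1
        # Unpack every archive that was copied in by hand (replaces wget).
        for archive in *.tar.gz; do
            [ -e "$archive" ] && tar xf "$archive"
        done
        for fname in train-images-idx3-ubyte train-labels-idx1-ubyte \
                     t10k-images-idx3-ubyte t10k-labels-idx1-ubyte; do
            # Skip files that are already unpacked.
            if [ ! -e "$fname" ]; then
                gunzip "$fname.gz" || exit 1
            fi
        done
    )
}

# Usage (from the caffe-master directory):
#   prepare_mnist data/mnist
```

The subshell keeps the cd local, so the caller's working directory is unchanged.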

Use the modified script to process the input files.

$ data/mnist/get_mnist-th.sh

Downloading...
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz

$ ls data/mnist/

get_mnist-th.sh   mnist_data.tar.gz	  train-images-idx3-ubyte
get_mnist.sh	  t10k-images-idx3-ubyte  train-labels-idx1-ubyte
get_mnist.sh-org  t10k-labels-idx1-ubyte

After the script finishes, the data is in place. Proceed to the next step.

$ examples/mnist/create_mnist.sh

Creating lmdb...
examples/mnist/create_mnist.sh: line 16: build/examples/mnist/convert_mnist_data.bin: Permission denied
examples/mnist/create_mnist.sh: line 18: build/examples/mnist/convert_mnist_data.bin: Permission denied
Done.

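This run fails with a permission error on the binaries under build/. Again the script needs a small modification, so that it calls the convert_mnist_data command found on PATH instead: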

$ diff examples/mnist/create_mnist.sh  examples/mnist/create_mnist_th.sh

16c16,17
< $BUILD/convert_mnist_data.bin $DATA/train-images-idx3-ubyte \
---
> #$BUILD/
> convert_mnist_data $DATA/train-images-idx3-ubyte \
18c19,20
< $BUILD/convert_mnist_data.bin $DATA/t10k-images-idx3-ubyte \
---
> #$BUILD/
> convert_mnist_data $DATA/t10k-images-idx3-ubyte \

$ examples/mnist/create_mnist_th.sh

Creating lmdb...
I0520 16:10:11.905948 10730 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb
I0520 16:10:11.914786 10730 convert_mnist_data.cpp:88] A total of 60000 items.
I0520 16:10:11.914836 10730 convert_mnist_data.cpp:89] Rows: 28 Cols: 28
I0520 16:10:13.172996 10730 convert_mnist_data.cpp:108] Processed 60000 files.
I0520 16:10:13.472939 10737 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_test_lmdb
I0520 16:10:13.474231 10737 convert_mnist_data.cpp:88] A total of 10000 items.
I0520 16:10:13.474262 10737 convert_mnist_data.cpp:89] Rows: 28 Cols: 28
I0520 16:10:13.711577 10737 convert_mnist_data.cpp:108] Processed 10000 files.
Done.

The modified script runs without problems.

3. Computation

The script demonstrated in this example is examples/mnist/train_lenet.sh; it also needs a slight modification:

$ diff examples/mnist/train_lenet.sh  examples/mnist/train_lenet_th.sh

3c3,4
< ./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt
---
> #./build/tools/
>  yhrun -p nsfc2 -n 1 caffe train --solver=examples/mnist/lenet_solver.prototxt &> examples/mnist/lenet.log

This change launches the computation through yhrun under the SLURM job system and writes a log file, examples/mnist/lenet.log.
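One portability detail: the &> redirection in the modified script is a bash extension. Under a strictly POSIX sh it is parsed as a background & followed by a separate redirection, so the portable spelling is > file 2>&1. The sketch below writes an equivalent job script using that form; write_train_job is a hypothetical helper name, and the partition, solver and log values are the ones used in this demo.

```shell
#!/bin/sh
# write_train_job is a hypothetical helper that writes a job script
# equivalent to the modified train_lenet_th.sh shown above.
write_train_job() {
    out=$1 partition=$2 solver=$3 log=$4
    cat > "$out" <<EOF
#!/usr/bin/env sh
# Launch one Caffe task through yhrun; stdout and stderr go to the log.
yhrun -p $partition -n 1 caffe train --solver=$solver > $log 2>&1
EOF
    chmod +x "$out"
}

# Example:
#   write_train_job examples/mnist/train_lenet_th.sh nsfc2 \
#       examples/mnist/lenet_solver.prototxt examples/mnist/lenet.log
#   yhbatch -p nsfc2 examples/mnist/train_lenet_th.sh
```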

In addition, because the computation runs on the CPU, change solver_mode: GPU to solver_mode: CPU in examples/mnist/lenet_solver.prototxt; this edit is not shown here.
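The solver edit can also be scripted. A minimal sketch, assuming the file contains a single uncommented solver_mode line; set_solver_mode is a hypothetical helper name, and it goes through a temporary file rather than relying on sed -i, whose syntax differs between sed implementations.

```shell
#!/bin/sh
# set_solver_mode is a hypothetical helper: it rewrites the value of the
# solver_mode line in a solver prototxt, leaving all other lines untouched.
set_solver_mode() {
    file=$1 mode=$2
    sed "s/^\([[:space:]]*solver_mode:[[:space:]]*\).*/\1$mode/" "$file" \
        > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage:
#   set_solver_mode examples/mnist/lenet_solver.prototxt CPU
```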

$ yhbatch -p nsfc2 examples/mnist/train_lenet_th.sh

Submitted batch job 383308

The job is submitted here with yhbatch.

$ yhqueue

             JOBID PARTITION     NAME         USER ST       TIME  NODES NODELIST(REASON)
            383308     nsfc2 train_le nscc-gz_jian  R       0:17      1 cn8035

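The job is already running. Use tail to view the log (adding the -f option follows the log continuously, but the demo environment does not allow it).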


$ tail examples/mnist/lenet.log

I0520 16:29:50.440232 19425 sgd_solver.cpp:106] Iteration 1400, lr = 0.00906403
I0520 16:29:55.268719 19425 solver.cpp:337] Iteration 1500, Testing net (#0)
I0520 16:29:58.517141 19425 solver.cpp:404]     Test net output #0: accuracy = 0.9833
I0520 16:29:58.517194 19425 solver.cpp:404]     Test net output #1: loss = 0.0494627 (* 1 = 0.0494627 loss)
I0520 16:29:58.576706 19425 solver.cpp:228] Iteration 1500, loss = 0.0795374
I0520 16:29:58.576748 19425 solver.cpp:244]     Train net output #0: loss = 0.0795374 (* 1 = 0.0795374 loss)
I0520 16:29:58.576759 19425 sgd_solver.cpp:106] Iteration 1500, lr = 0.00900485
I0520 16:30:03.441340 19425 solver.cpp:228] Iteration 1600, loss = 0.115878
I0520 16:30:03.441392 19425 solver.cpp:244]     Train net output #0: loss = 0.115878 (* 1 = 0.115878 loss)
I0520 16:30:03.441401 19425 sgd_solver.cpp:106] Iteration 1600, lr = 0.00894657
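To pull only the test accuracy out of a log like the one above, a short awk filter is enough. This is a sketch that assumes the log line format shown in this transcript; last_accuracy is a hypothetical helper name.

```shell
#!/bin/sh
# last_accuracy is a hypothetical helper: it prints the most recent test
# accuracy found in a Caffe training log, or nothing if none was logged yet.
last_accuracy() {
    awk '/Test net output #0: accuracy = / { acc = $NF }
         END { if (acc != "") print acc }' "$1"
}

# Usage:
#   last_accuracy examples/mnist/lenet.log
```

For the excerpt shown above it prints 0.9833.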

The computation is running normally. This concludes the demonstration.

4. Using Caffe in the LN41 environment

Use the module tool to list the available environments and load one.

[nscc-gz_jiangli@ln41%tianhe2-G caffe]$ module avail caffe

------------------------------------------------------------------------------------------ /BIGDATA/app/modulefiles -------------------------------------------------------------------------------------------
caffe/cudnn
[nscc-gz_jiangli@ln41%tianhe2-G caffe]$ module load caffe/cudnn

Demonstration run

[nscc-gz_jiangli@ln41%tianhe2-G caffe]$ yhrun -p gpu --cpu_bind=none -n 1 caffe train --solver=examples/cifar10/cifar10_quick_solver.prototxt
I0428 16:55:13.732276  1691 caffe.cpp:217] Using GPUs 0
I0428 16:55:25.313719  1691 caffe.cpp:222] GPU 0: Tesla K80
I0428 16:55:27.366025  1691 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10/cifar10_quick"
solver_mode: GPU
device_id: 0
net: "examples/cifar10/cifar10_quick_train_test.prototxt"
train_state {
  level: 0
  stage: ""
}
snapshot_format: HDF5
I0428 16:55:27.391957  1691 solver.cpp:91] Creating training net from net file: examples/cifar10/cifar10_quick_train_test.prototxt
I0428 16:55:27.430917  1691 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0428 16:55:27.430982  1691 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0428 16:55:27.431417  1691 net.cpp:58] Initializing net from parameters:
name: "CIFAR10_quick"

...

I0428 16:56:16.794607  1691 solver.cpp:317] Iteration 4000, loss = 0.613817
I0428 16:56:16.794644  1691 solver.cpp:337] Iteration 4000, Testing net (#0)
I0428 16:56:17.164870  1691 solver.cpp:404]     Test net output #0: accuracy = 0.7087
I0428 16:56:17.164904  1691 solver.cpp:404]     Test net output #1: loss = 0.861607 (* 1 = 0.861607 loss)
I0428 16:56:17.164909  1691 solver.cpp:322] Optimization Done.
I0428 16:56:17.164913  1691 caffe.cpp:254] Optimization Done.
[nscc-gz_jiangli@ln41%tianhe2-G caffe]$ 

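As the log shows, the computation finishes in about one minute. yhrun is used here only for ease of demonstration; in actual use, please submit computations with yhbatch. Note also that each GPU node actually has two K80 GPU cards, so please make good use of them.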
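A batch submission that uses more than one GPU could look like the sketch below. It is an assumption-laden illustration, not a tested TH-2 recipe: write_gpu_job is a hypothetical helper name, the yhrun options are copied from the demo run above, and the GPU selection uses Caffe's --gpu command-line flag (a comma-separated device list, or all). How many device IDs are visible on a node is not stated here; check on the node, for example with nvidia-smi.

```shell
#!/bin/sh
# write_gpu_job is a hypothetical helper that writes a batch script for
# the gpu partition. The third argument is passed to Caffe's --gpu flag.
write_gpu_job() {
    out=$1 solver=$2 gpus=$3
    cat > "$out" <<EOF
#!/bin/sh
yhrun -p gpu --cpu_bind=none -n 1 caffe train --solver=$solver --gpu=$gpus
EOF
    chmod +x "$out"
}

# Example:
#   write_gpu_job cifar10_job.sh examples/cifar10/cifar10_quick_solver.prototxt all
#   yhbatch -p gpu cifar10_job.sh
```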

Copying and installing

You can download the latest Caffe source from the official website, or copy /BIGDATA/app/caffe/cudnn/ to a local folder. Then, inside that folder:

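  1. Load the library environment with module load caffe/cudnn. If these library settings need to be changed, they can be reset through environment variables.

/BIGDATA/app/caffe/caffe-cp is a copy of the third-party environment that Caffe depends on; for details, see the modulefile caffe/v20161130-gpu-cudnn on ln2.

  2. Modify Makefile.config as needed; most settings can follow /BIGDATA/app/caffe/cudnn/Makefile.config.

  3. Run make -j 12.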
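The copy-and-build steps above can be collected into one script. The following is only a sketch: run and build_caffe are hypothetical helper names, and the DRY_RUN switch is an addition for previewing the commands without executing them.

```shell
#!/bin/sh
# run is a hypothetical wrapper: with DRY_RUN set it only prints the
# command; otherwise it executes it in the current shell.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# build_caffe is a hypothetical helper wrapping the steps listed above:
# copy the source tree, load the library environment, then build.
build_caffe() {
    src=$1 dest=$2 jobs=${3:-12}
    run cp -r "$src" "$dest" || return 1
    run module load caffe/cudnn || return 1
    run cd "$dest" || return 1
    # Adjust Makefile.config here if the copied settings do not fit.
    run make -j "$jobs"
}

# Preview the commands without running them:
#   DRY_RUN=1; export DRY_RUN
#   build_caffe /BIGDATA/app/caffe/cudnn "$HOME/caffe"
```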