H.T.U Caffe - refraction-ray/TH2-demos GitHub Wiki
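Use Caffe on TH-2 (DEMO)
A demonstration of how to use Caffe on Tianhe
2016/5/20 by: lijiang; revised 2016/12/16 by: lijiang
Caffe is a currently popular deep learning framework.
Caffe is better suited to running on GPUs. However, a customer is optimizing Caffe for CPUs, and that build may be released on TH-2 in the future.
1. Load the environment
Use module load to load the Caffe environment. If you are not yet familiar with the system, first run module avail to see which versions are available.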
[nscc-gz_jiangli@ln2 ~]$ module avail caffe
---------------------------------- /WORK/app/modulefiles -----------------------------------
caffe/v20160510-cpu3 caffe/v20160511-gpu-icc caffe/v20161130-gpu-cudnn
caffe/v20160511-gpu caffe/v20161129-24d2f67-gpu
[nscc-gz_jiangli@ln2 ~]$
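On the fund partition (NSFC) the list will differ. Different versions correspond to different Caffe source revisions and build options.
The demonstration below was carried out earlier on the NSFC partition; when working on the regular partition, substitute the paths and partition names accordingly.
Load the environment with module load.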
$ module load caffe/v20160510-cpu3
$ which caffe
/NSFCGZ/app/caffe/v20160510-cpu3/bin/caffe
As shown here, the caffe command and its environment have been loaded successfully.
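A small guard at the top of a job script can catch a forgotten module load early. The following is a minimal sketch, assuming a POSIX shell; the helper name have_cmd is hypothetical and is not part of the module system or of Caffe.

```shell
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd caffe; then
    echo "caffe found at: $(command -v caffe)"
else
    # Assumption: a missing caffe command means the module was not loaded.
    echo "caffe not found on PATH; run 'module load caffe/<version>' first"
fi
```

The same check can be reused for the other tools this walkthrough relies on, such as convert_mnist_data.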
2. Data preparation
$ ls /NSFCGZ/app/caffe/v20160510-cpu3
bin caffe-master include lib python share
Users can upload their own caffe-master folder, or copy one directly from /NSFCGZ/app/caffe/v20160510-cpu3/caffe-master.
$ cp -r /NSFCGZ/app/caffe/v20160510-cpu3/caffe-master ~/.
$ cd ~/caffe-master
$ ls
CMakeLists.txt Makefile caffe.cloc examples scripts
CONTRIBUTING.md Makefile.config cmake include src
CONTRIBUTORS.md Makefile.config.example data matlab tools
INSTALL.md README.md docker models
LICENSE build docs python
I had already downloaded the data for the cifar10 example, but for a complete demonstration the mnist example is used here.
$ ls data/mnist
get_mnist.sh
As you can see, this folder contains only a single script. Running it the way you would on your own computer fails:
$ data/mnist/get_mnist.sh
Downloading...
--2016-05-20 15:58:24-- http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Resolving yann.lecun.com... failed: Temporary failure in name resolution.
wget: unable to resolve host address `yann.lecun.com'
gzip: train-images-idx3-ubyte.gz: No such file or directory
--2016-05-20 15:58:24-- http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Resolving yann.lecun.com... failed: Temporary failure in name resolution.
wget: unable to resolve host address `yann.lecun.com'
gzip: train-labels-idx1-ubyte.gz: No such file or directory
--2016-05-20 15:58:24-- http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Resolving yann.lecun.com... failed: Temporary failure in name resolution.
wget: unable to resolve host address `yann.lecun.com'
gzip: t10k-images-idx3-ubyte.gz: No such file or directory
--2016-05-20 15:58:24-- http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Resolving yann.lecun.com... failed: Temporary failure in name resolution.
wget: unable to resolve host address `yann.lecun.com'
gzip: t10k-labels-idx1-ubyte.gz: No such file or directory
The reason is simple: the machine has no Internet access, so the archive or data files have to be placed here by hand. A copy is available in /NSFCGZ/app/share.
$ cp /NSFCGZ/app/share/mnist_data.tar.gz data/mnist/.
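Then modify the data/mnist/get_mnist.sh script. The editing itself cannot be shown here, so make the changes yourself; diff is used to compare the file before and after, and the same applies to the file modifications later on: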
$ diff data/mnist/get_mnist.sh data/mnist/get_mnist-th.sh
8c8
<
---
> tar xvf *.tar.gz
12c12
< wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
---
> #wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
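The two changes in the diff above can also be applied non-interactively with sed. This is only a sketch: the author edited the file by hand, and the stand-in script below is an assumed reconstruction of get_mnist.sh (blank line 8, wget inside the loop) written to a scratch directory so that nothing under data/mnist is touched. The variable name workdir is hypothetical.

```shell
# Work in a scratch directory; on the cluster the real file is data/mnist/get_mnist.sh.
workdir=$(mktemp -d)

# Stand-in for the original script (assumed layout, see the note above).
cat > "$workdir/get_mnist.sh" <<'EOF'
#!/usr/bin/env sh
# This scripts downloads the mnist data and unzips it.

DIR="$( cd "$(dirname "$0")" ; pwd -P )"
cd "$DIR"

echo "Downloading..."

for fname in train-images-idx3-ubyte train-labels-idx1-ubyte t10k-images-idx3-ubyte t10k-labels-idx1-ubyte
do
    if [ ! -e $fname ]; then
        wget --no-check-certificate http://yann.lecun.com/exdb/mnist/${fname}.gz
        gunzip ${fname}.gz
    fi
done
EOF

# Change 1: put "tar xvf *.tar.gz" on the blank line 8.
# Change 2: comment out the wget call, keeping its indentation.
sed -e '8s/^$/tar xvf *.tar.gz/' \
    -e 's/^\([[:space:]]*\)wget /\1#wget /' \
    "$workdir/get_mnist.sh" > "$workdir/get_mnist-th.sh"
chmod +x "$workdir/get_mnist-th.sh"

# diff exits non-zero when the files differ, which is expected here.
diff "$workdir/get_mnist.sh" "$workdir/get_mnist-th.sh" || true
```

Because the first substitution is tied to line 8, check the diff output against the one shown above before using the patched script on a different Caffe revision.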
Process the input files with the modified script:
$ data/mnist/get_mnist-th.sh
Downloading...
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz
$ ls data/mnist/
get_mnist-th.sh mnist_data.tar.gz train-images-idx3-ubyte
get_mnist.sh t10k-images-idx3-ubyte train-labels-idx1-ubyte
get_mnist.sh-org t10k-labels-idx1-ubyte
After the script finishes, the data is in place. Move on to the next step.
$ examples/mnist/create_mnist.sh
Creating lmdb...
examples/mnist/create_mnist.sh: line 16: build/examples/mnist/convert_mnist_data.bin: Permission denied
examples/mnist/create_mnist.sh: line 18: build/examples/mnist/convert_mnist_data.bin: Permission denied
Done.
This run fails as well, and the file likewise needs a small modification:
$ diff examples/mnist/create_mnist.sh examples/mnist/create_mnist_th.sh
16c16,17
< $BUILD/convert_mnist_data.bin $DATA/train-images-idx3-ubyte \
---
> #$BUILD/
> convert_mnist_data $DATA/train-images-idx3-ubyte \
18c19,20
< $BUILD/convert_mnist_data.bin $DATA/t10k-images-idx3-ubyte \
---
> #$BUILD/
> convert_mnist_data $DATA/t10k-images-idx3-ubyte \
$ examples/mnist/create_mnist_th.sh
Creating lmdb...
I0520 16:10:11.905948 10730 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb
I0520 16:10:11.914786 10730 convert_mnist_data.cpp:88] A total of 60000 items.
I0520 16:10:11.914836 10730 convert_mnist_data.cpp:89] Rows: 28 Cols: 28
I0520 16:10:13.172996 10730 convert_mnist_data.cpp:108] Processed 60000 files.
I0520 16:10:13.472939 10737 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_test_lmdb
I0520 16:10:13.474231 10737 convert_mnist_data.cpp:88] A total of 10000 items.
I0520 16:10:13.474262 10737 convert_mnist_data.cpp:89] Rows: 28 Cols: 28
I0520 16:10:13.711577 10737 convert_mnist_data.cpp:108] Processed 10000 files.
Done.
With the modified file, everything runs smoothly.
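The edit to create_mnist.sh swaps the local build/ binary for the convert_mnist_data command, which is presumably provided on PATH by the loaded module. A sed one-liner gives a similar result on one line instead of two; this is a sketch on a stand-in excerpt, and the continuation lines in it are illustrative rather than copied from the real file.

```shell
workdir=$(mktemp -d)

# Stand-in excerpt of examples/mnist/create_mnist.sh (only the two converter calls).
cat > "$workdir/create_mnist.sh" <<'EOF'
$BUILD/convert_mnist_data.bin $DATA/train-images-idx3-ubyte \
  $DATA/train-labels-idx1-ubyte $EXAMPLE/mnist_train_${BACKEND} --backend=${BACKEND}
$BUILD/convert_mnist_data.bin $DATA/t10k-images-idx3-ubyte \
  $DATA/t10k-labels-idx1-ubyte $EXAMPLE/mnist_test_${BACKEND} --backend=${BACKEND}
EOF

# Replace the build-tree binary with the command name found on PATH.
sed 's|^\$BUILD/convert_mnist_data\.bin|convert_mnist_data|' \
    "$workdir/create_mnist.sh" > "$workdir/create_mnist_th.sh"

# Count how many calls were rewritten.
patched_calls=$(grep -c '^convert_mnist_data ' "$workdir/create_mnist_th.sh")
echo "patched calls: $patched_calls"
```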
3. Computation
The script demonstrated in this example is examples/mnist/train_lenet.sh; it also needs a small modification:
$ diff examples/mnist/train_lenet.sh examples/mnist/train_lenet_th.sh
3c3,4
< ./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt
---
> #./build/tools/
> yhrun -p nsfc2 -n 1 caffe train --solver=examples/mnist/lenet_solver.prototxt &> examples/mnist/lenet.log
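This change makes the script use the SLURM job system: the computation is launched through yhrun, and its output is written to the log file examples/mnist/lenet.log.
In addition, because the computation runs on CPUs, solver_mode: GPU in examples/mnist/lenet_solver.prototxt must be changed to solver_mode: CPU; that edit is not shown here.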
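The solver_mode change can be made with a single sed substitution. The sketch below runs on a stand-in solver file in a scratch directory; the fields other than solver_mode are illustrative, and on the cluster the target would be examples/mnist/lenet_solver.prototxt.

```shell
workdir=$(mktemp -d)

# Stand-in solver file; only the solver_mode line matters for this sketch.
cat > "$workdir/lenet_solver.prototxt" <<'EOF'
net: "examples/mnist/lenet_train_test.prototxt"
max_iter: 10000
# solver mode: CPU or GPU
solver_mode: GPU
EOF

# Switch the solver to CPU mode, keeping a .bak copy of the original file.
sed -i.bak 's/^solver_mode:[[:space:]]*GPU/solver_mode: CPU/' "$workdir/lenet_solver.prototxt"

grep '^solver_mode' "$workdir/lenet_solver.prototxt"
```

The .bak copy makes it easy to switch back when the same example is later run on a GPU partition.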
$ yhbatch -p nsfc2 examples/mnist/train_lenet_th.sh
Submitted batch job 383308
The job is submitted here with yhbatch.
$ yhqueue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
383308 nsfc2 train_le nscc-gz_jian R 0:17 1 cn8035
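The job is already running. Use tail to view the log (adding the -f option follows the log continuously, but the demo environment does not allow that).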
$ tail examples/mnist/lenet.log
I0520 16:29:50.440232 19425 sgd_solver.cpp:106] Iteration 1400, lr = 0.00906403
I0520 16:29:55.268719 19425 solver.cpp:337] Iteration 1500, Testing net (#0)
I0520 16:29:58.517141 19425 solver.cpp:404] Test net output #0: accuracy = 0.9833
I0520 16:29:58.517194 19425 solver.cpp:404] Test net output #1: loss = 0.0494627 (* 1 = 0.0494627 loss)
I0520 16:29:58.576706 19425 solver.cpp:228] Iteration 1500, loss = 0.0795374
I0520 16:29:58.576748 19425 solver.cpp:244] Train net output #0: loss = 0.0795374 (* 1 = 0.0795374 loss)
I0520 16:29:58.576759 19425 sgd_solver.cpp:106] Iteration 1500, lr = 0.00900485
I0520 16:30:03.441340 19425 solver.cpp:228] Iteration 1600, loss = 0.115878
I0520 16:30:03.441392 19425 solver.cpp:244] Train net output #0: loss = 0.115878 (* 1 = 0.115878 loss)
I0520 16:30:03.441401 19425 sgd_solver.cpp:106] Iteration 1600, lr = 0.00894657
The computation runs normally, which concludes this demonstration.
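When the log grows long, it can be handy to pull out just the most recent test accuracy instead of reading the tail by eye. The following sketch uses grep, tail and awk on a stand-in log whose lines are copied from the excerpt above; the variable names log and latest_acc are hypothetical.

```shell
workdir=$(mktemp -d)
log="$workdir/lenet.log"

# Stand-in log; on the cluster this would be examples/mnist/lenet.log.
cat > "$log" <<'EOF'
I0520 16:29:55.268719 19425 solver.cpp:337] Iteration 1500, Testing net (#0)
I0520 16:29:58.517141 19425 solver.cpp:404] Test net output #0: accuracy = 0.9833
I0520 16:29:58.517194 19425 solver.cpp:404] Test net output #1: loss = 0.0494627 (* 1 = 0.0494627 loss)
I0520 16:30:03.441340 19425 solver.cpp:228] Iteration 1600, loss = 0.115878
EOF

# Take the last line that reports an accuracy and print its final field.
latest_acc=$(grep 'accuracy = ' "$log" | tail -n 1 | awk '{print $NF}')
echo "latest test accuracy: $latest_acc"
```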
4. Using Caffe in the LN41 environment
Use the module tool to list the available environments and load one:
[nscc-gz_jiangli@ln41%tianhe2-G caffe]$ module avail caffe
------------------------------------------------------------------------------------------ /BIGDATA/app/modulefiles -------------------------------------------------------------------------------------------
caffe/cudnn
[nscc-gz_jiangli@ln41%tianhe2-G caffe]$ module load caffe/cudnn
Demonstration run
[nscc-gz_jiangli@ln41%tianhe2-G caffe]$ yhrun -p gpu --cpu_bind=none -n 1 caffe train --solver=examples/cifar10/cifar10_quick_solver.prototxt
I0428 16:55:13.732276 1691 caffe.cpp:217] Using GPUs 0
I0428 16:55:25.313719 1691 caffe.cpp:222] GPU 0: Tesla K80
I0428 16:55:27.366025 1691 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10/cifar10_quick"
solver_mode: GPU
device_id: 0
net: "examples/cifar10/cifar10_quick_train_test.prototxt"
train_state {
level: 0
stage: ""
}
snapshot_format: HDF5
I0428 16:55:27.391957 1691 solver.cpp:91] Creating training net from net file: examples/cifar10/cifar10_quick_train_test.prototxt
I0428 16:55:27.430917 1691 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0428 16:55:27.430982 1691 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0428 16:55:27.431417 1691 net.cpp:58] Initializing net from parameters:
name: "CIFAR10_quick"
...
I0428 16:56:16.794607 1691 solver.cpp:317] Iteration 4000, loss = 0.613817
I0428 16:56:16.794644 1691 solver.cpp:337] Iteration 4000, Testing net (#0)
I0428 16:56:17.164870 1691 solver.cpp:404] Test net output #0: accuracy = 0.7087
I0428 16:56:17.164904 1691 solver.cpp:404] Test net output #1: loss = 0.861607 (* 1 = 0.861607 loss)
I0428 16:56:17.164909 1691 solver.cpp:322] Optimization Done.
I0428 16:56:17.164913 1691 caffe.cpp:254] Optimization Done.
[nscc-gz_jiangli@ln41%tianhe2-G caffe]$
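As you can see, the computation finished in about 1 minute. yhrun was used to submit the job here only for ease of demonstration; in actual use, please submit computations with yhbatch. Also note that each GPU node actually has two K80 GPUs, so please make sensible use of them.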
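For batch submission, the same yhrun command can be wrapped in a small job script, following the pattern of train_lenet_th.sh in section 3. The sketch below only writes the script and prints a suggested submit command. The file name train_cifar10_quick_th.sh, the log path, the JOB_DIR variable and the use of yhbatch -p gpu are assumptions made by analogy with the earlier CPU example, not taken from the original demonstration.

```shell
# Set JOB_DIR to your Caffe working directory; a scratch directory is used otherwise.
job_dir=${JOB_DIR:-$(mktemp -d)}
job_script="$job_dir/train_cifar10_quick_th.sh"   # hypothetical file name

# Write the batch script: the interactive command from above, logged to a file.
cat > "$job_script" <<'EOF'
#!/bin/bash
yhrun -p gpu --cpu_bind=none -n 1 caffe train \
    --solver=examples/cifar10/cifar10_quick_solver.prototxt \
    &> examples/cifar10/cifar10_quick.log
EOF
chmod +x "$job_script"

# Assumption: the gpu partition accepts yhbatch the same way nsfc2 did in section 3.
echo "Submit with: yhbatch -p gpu $job_script"
```

Run the submit command from the Caffe directory so that the relative examples/ paths resolve.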
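Copying and installing
You can download the latest Caffe code from the official site, or copy it from /BIGDATA/app/caffe/cudnn/ to a local folder. After entering the folder:
- Load the library environment with module load caffe/cudnn. If you need to change these libraries, you can override the settings through environment variables. /BIGDATA/app/caffe/caffe-cp is a copy of the third-party environment that Caffe depends on; for details, see the modulefile caffe/v20161130-gpu-cudnn on ln2.
- Edit Makefile.config as needed; most settings can follow /BIGDATA/app/caffe/cudnn/Makefile.config.
- Run make -j 12.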