YOLO: Model Training - zhonglong/TPV GitHub Wiki

Datasets

	Pascal VOC dataset	COCO dataset
Training set	VOCtrainval_11-May-2012.tar(1.86G) VOCtrainval_06-Nov-2007.tar(438M)	train2014.zip(12.5G)
Test set	VOCtest_06-Nov-2007.tar(430M)	val2014.zip(6.18G)
Labels	NA	labels.tgz(17.1M)
Object classes	20	80
Image count	16551 (training), 4952 (test)	117264 (training), 5000 (test)
Iterations	50200	500200
Training time	21 hours	?

Dataset mirrors

Pascal VOC Dataset Mirror
https://pjreddie.com/projects/pascal-voc-dataset-mirror/
Common Objects in Context Dataset Mirror
https://pjreddie.com/projects/coco-mirror/

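Note

One image in the COCO dataset is corrupted and must be removed from the training set: COCO_train2014_000000167126.jpg

Configuration files

Configuration files live in the cfg directory, for example yolov3.cfg, yolov3-tiny.cfg and yolov3-voc.cfg. Files with the voc suffix target the VOC dataset; the others target the COCO dataset.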

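	Pascal VOC dataset	COCO dataset	Description
max_batches	50200	500200	Maximum number of iterations
steps	40000,45000	400000,450000	Learning-rate schedule: once the iteration count passes each threshold, the learning rate is multiplied by a scale factor (which lowers it)
filters	75	255	Formula: (classes + 5) x 3
classes	20	80	Number of object classes
ignore_thresh	.5	.7	Loss-related: in the YOLO layer, once the IoU exceeds this threshold, the delta is treated as 0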
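
The filters formula in the table can be checked with a few lines of Python. This is only an illustrative sketch: the function and parameter names are hypothetical, and it assumes the usual YOLOv3 layout in which the 5 stands for four box coordinates plus one objectness score and the 3 for the number of anchors predicted at each scale.

```python
def yolo_filters(classes: int, box_terms: int = 5, anchors_per_scale: int = 3) -> int:
    """Filters for the convolutional layer feeding a YOLO layer.

    Implements the table's formula: (classes + 5) x 3.
    """
    return (classes + box_terms) * anchors_per_scale


if __name__ == "__main__":
    print("VOC :", yolo_filters(20))   # 20 classes
    print("COCO:", yolo_filters(80))   # 80 classes
```

When training on a custom dataset, classes and filters have to be changed together so that this relation still holds.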

Training process

Training can follow the official documentation below, which uses the VOC dataset as its example:

Training YOLO on VOC
https://pjreddie.com/darknet/yolo/

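However, the official documentation leaves a few points incomplete:

First, training and testing need different settings in the configuration file (cfg). The cfg file itself says so in its comments; edit it before starting training.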

# Testing
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=2
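
Switching between the two modes amounts to moving the comment markers. A minimal sketch, assuming the cfg file lays out the two pairs exactly as in the excerpt above; use_training_settings is a hypothetical helper and not part of Darknet.

```python
def use_training_settings(cfg_text: str) -> str:
    """Comment out the pair under '# Testing' and enable the pair under '# Training'."""
    keys = ("batch", "subdivisions")
    section = None  # "testing", "training", or None before either marker is seen
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped == "# Testing":
            section = "testing"
        elif stripped == "# Training":
            section = "training"
        else:
            setting = stripped.lstrip("# ")          # drop any leading comment marker
            key = setting.split("=")[0].strip()
            if section and key in keys:
                line = setting if section == "training" else "# " + setting
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "# Testing\nbatch=1\nsubdivisions=1\n# Training\n# batch=64\n# subdivisions=2"
    print(use_training_settings(sample))
```

The helper only touches batch and subdivisions lines that follow one of the two markers, so the rest of the file passes through unchanged.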

Second, the pretrained convolutional weights file darknet53.conv.74 is meant for the yolov3 model. For the yolov3-tiny model, generate a suitable weights file by following the method at the link below.

How to train tiny-yolo (to detect your custom objects):
https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

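Finally, training can resume from a checkpoint; only the weights file path changes. Compare the initial command (first) with the resume command (second):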

./darknet detector train data/voc.data cfg/yolov3-voc.cfg darknet53.conv.74
./darknet detector train data/voc.data cfg/yolov3-voc.cfg backup/yolov3-voc.backup
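
The choice between the two commands can be automated. The sketch below is a hypothetical wrapper, not something Darknet provides: it assumes the paths shown above and picks the backup file when one exists, falling back to the pretrained weights otherwise.

```python
import os


def train_command(data="data/voc.data",
                  cfg="cfg/yolov3-voc.cfg",
                  pretrained="darknet53.conv.74",
                  backup="backup/yolov3-voc.backup",
                  exists=os.path.exists):
    """Build the darknet argument list, resuming from the backup file if present."""
    weights = backup if exists(backup) else pretrained
    return ["./darknet", "detector", "train", data, cfg, weights]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to subprocess.run to launch.
    print(" ".join(train_command()))
```

The exists parameter is there only so the selection logic can be exercised without touching the file system.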

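During training, the current state is saved to the backup file automatically every 100 iterations, and a weights file is written every 10000 iterations as well as during the first 900 iterations.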
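
The save schedule described above can be sketched as a small predicate. This is an illustration under assumptions rather than Darknet's own code: it reads "the first 900 iterations" as iterations 100, 200, ..., 900, and checkpoint_actions is a hypothetical name.

```python
def checkpoint_actions(iteration: int) -> list:
    """Return which files would be written after the given iteration."""
    actions = []
    if iteration % 100 == 0:
        actions.append("backup")     # rolling backup file, overwritten each time
    if iteration % 10000 == 0 or (iteration < 1000 and iteration % 100 == 0):
        actions.append("weights")    # separate weights snapshot
    return actions


if __name__ == "__main__":
    for i in (50, 100, 900, 1000, 10000):
        print(i, checkpoint_actions(i))
```

Under this reading, a run that is interrupted and resumed from the backup file loses at most 100 iterations of work.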

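Training time

During training, the batch size per iteration is 64 (batch=64), meaning each iteration processes 64 images. As a rule of thumb, each object class needs about 2000 iterations, so in theory the VOC dataset needs about 40000 iterations; yolov3-voc.cfg sets 50200.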
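
The arithmetic behind these numbers is shown below as a rough sketch. The function names are hypothetical, the 2000-iterations-per-class figure is the rule of thumb quoted above, and the image count of 16551 comes from the dataset table.

```python
def rule_of_thumb_iterations(classes: int, per_class: int = 2000) -> int:
    """Suggested iteration count: about 2000 iterations per object class."""
    return classes * per_class


def passes_over_dataset(iterations: int, batch: int, train_images: int) -> float:
    """How many times the training set is seen, given batch images per iteration."""
    return iterations * batch / train_images


if __name__ == "__main__":
    print("VOC rule of thumb:", rule_of_thumb_iterations(20))
    print("passes over VOC at 50200 iterations: %.1f"
          % passes_over_dataset(50200, 64, 16551))
```

The second function makes it easier to compare the configured max_batches against the dataset size when moving to a dataset with a different number of images.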

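On the .7 server, each iteration takes about 1 minute, though this is not constant: a fast iteration takes thirty-odd seconds and a slow one ninety-odd seconds. The time per iteration depends on image resolution. As the code below shows, when random resizing is enabled the training loop picks a new input size every 10 iterations, a multiple of 32 in the range 320 to 608.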

    if(l.random && count++%10 == 0){
        printf("Resizing\n");
        int dim = (rand() % 10 + 10) * 32;
        if (get_current_batch(net)+200 > net->max_batches) dim = 608;
        //int dim = (rand() % 4 + 16) * 32;
        printf("%d\n", dim);
        args.w = dim;
        args.h = dim;
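
To make the range explicit, here is a Python rendering of the size selection in the C excerpt above. It is a sketch for illustration only: pick_dim is a hypothetical name, and Python's random module stands in for C's rand().

```python
import random


def pick_dim(current_batch: int, max_batches: int, rng=random) -> int:
    """Mirror of the excerpt: random multiple of 32, forced to 608 near the end."""
    dim = (rng.randrange(10) + 10) * 32      # rand() % 10 + 10 gives 10..19
    if current_batch + 200 > max_batches:    # last 200 iterations use the largest size
        dim = 608
    return dim


# Every size the expression can produce: 320, 352, ..., 608.
POSSIBLE_DIMS = sorted({(k + 10) * 32 for k in range(10)})

if __name__ == "__main__":
    print(POSSIBLE_DIMS)
    print(pick_dim(1, 50200), pick_dim(50100, 50200))
```

Since the largest input has 608 x 608 pixels and the smallest 320 x 320, the pixel count per image varies by a factor of roughly 3.6, which is consistent with the spread in iteration times noted above.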

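At 60 seconds per iteration, 50200 iterations take:

60 x 50200 / 3600 / 24 = 34.9 days

So training one model takes about five weeks. Training does refine the model progressively, though: a model I tried after 800 iterations could already detect the car in dog.jpg, but none of the other objects.

For questions about training time, see the official documentation: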
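
The same estimate can be parameterized to see how the observed spread in iteration time affects the total. This is a simple sketch; training_time is a hypothetical helper, and the 30 and 90 second bounds are rounded from the "thirty-odd" and "ninety-odd" seconds reported above.

```python
def training_time(seconds_per_iteration: float, iterations: int):
    """Return (hours, days) for a run with a fixed per-iteration cost."""
    total_seconds = seconds_per_iteration * iterations
    return total_seconds / 3600, total_seconds / 86400


if __name__ == "__main__":
    for secs in (30, 60, 90):
        hours, days = training_time(secs, 50200)
        print(f"{secs:>2} s/iteration: {hours:7.1f} hours = {days:4.1f} days")
```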

Hardware Guide: Neural Networks on GPUs (Updated 2016-1-30)
https://pjreddie.com/darknet/hardware-guide/

Probably the most important and most expensive part of your build will be the GPUs, and for good reason. GPUs are more than 100x faster for training and testing neural networks than a CPU. The bulk of our computation will be multiplying big matrices together so we want a card with high single precision performance.
Titan X
This is probably what you want. Designed to be NVIDIA's highest-end gaming GPU, they pack almost 7 TFLOPS of processing power for only $1,000 and you can fit 4 of them into a single machine. With 12 GB of VRAM they can run all the big models with plenty of room to spare.

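In short, GPUs are more than 100x faster than CPUs for training, and the guide recommends Titan X cards (about $1,000 each), up to four of which fit in a single machine.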

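OpenBLAS optimization

Earlier tests on a phone showed that image detection with OpenBLAS was about twice as fast as with OpenMP, so the aim was to use OpenBLAS for training as well.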

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
https://github.com/xianyi/OpenBLAS/

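Since no Fortran compiler is available, set the build option ONLY_CBLAS = 1.
To link OpenBLAS into Darknet as a static library, set the build option NO_SHARED = 1.

Run make to build libopenblas.a. Then switch to Darknet, replace Darknet's gemm_cpu function with OpenBLAS's cblas_sgemm function, and rebuild Darknet.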
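
For readers unfamiliar with GEMM, the operation being swapped is the general matrix multiply C = alpha * A * B + beta * C. The pure-Python reference below only illustrates that math for row-major matrices without transposition; it is not the actual C change, and gemm_reference is a hypothetical name.

```python
def gemm_reference(m, n, k, alpha, a, b, beta, c):
    """Compute C = alpha * A @ B + beta * C in place.

    a is m x k, b is k x n, c is m x n, each a flat row-major list.
    """
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            c[i * n + j] = alpha * acc + beta * c[i * n + j]
    return c


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]   # 2 x 2
    b = [5.0, 6.0, 7.0, 8.0]   # 2 x 2
    print(gemm_reference(2, 2, 2, 1.0, a, b, 0.0, [0.0] * 4))
```

A triple loop like this is what an optimized BLAS replaces with blocked, vectorized kernels, which is where the speed-up comes from.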

SGEMM, DGEMM, CGEMM, and ZGEMM
https://www.ibm.com/support/knowledgecenter/en/SSFHY8_5.5.0/com.ibm.cluster.essl.v5r5.essl100.doc/am5gr_hsgemm.htm

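Running Darknet training again and timing a few iterations gave 30 to 40 seconds each, at least 50% faster than before OpenBLAS was linked in.

CUDA acceleration

A PC attached to the PTA router is fitted with an NVidia GTX1060 graphics card to accelerate model training. Address: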

172.20.30.50
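Username and password: xmic.

When building Darknet, set the GPU and CUDNN switches to 1 to enable CUDA acceleration, which speeds training up by at least an order of magnitude. With CUDA enabled, and taking the VOC dataset as the example, each iteration takes only about 1 second, so the total training time drops to:

1 x 50200 / 3600 = 14 hours

Training finishes in a little over half a day.

In practice, training ran into an "Out Of Memory" error before long. The material consulted suggests editing the cfg file to increase the value of subdivisions, as follows: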

# Testing
# batch=1
# subdivisions=1
# Training
batch=64
# subdivisions=2
subdivisions=8
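
The reason this helps is that Darknet splits each batch into subdivisions smaller chunks and sends one chunk through the GPU at a time, so fewer images have to fit in memory at once. A minimal sketch of the arithmetic, with images_per_pass as a hypothetical name:

```python
def images_per_pass(batch: int, subdivisions: int) -> int:
    """Images processed in a single forward/backward pass on the GPU."""
    if batch % subdivisions != 0:
        raise ValueError("batch should be divisible by subdivisions")
    return batch // subdivisions


if __name__ == "__main__":
    print("subdivisions=2:", images_per_pass(64, 2))
    print("subdivisions=8:", images_per_pass(64, 8))
```

Each iteration still covers 64 images in total, so the iteration counts in the cfg file keep their meaning; only the peak memory use per pass goes down.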