2018 06 - PaddlePaddle/continuous_evaluation GitHub Wiki
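- On duty: 严春伟
- Continued debugging yesterday's vgg16 issue; no new KPI regressions so far
- The conclusions are recorded in https://github.com/PaddlePaddle/continuous_evaluation/issues/88
- The offending PR has been reverted, and subsequent CE runs pass normally.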
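- On duty: 严春伟
- Issue 1
- The analysis module failed to build because of a dependency problem; resolved
- https://github.com/PaddlePaddle/Paddle/pull/11803
- Issue 2, under investigation
- The vgg16 model consistently shows a regression in flowers_32_train_speed
- Tests to locate the suspected offending PR
- Debug results for individual commits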
Below are the results of debugging the vgg16 model on several adjacent commits individually, with the commits listed in chronological order.
commit | status |
---|---|
bc28cf613f9e | PASS |
a2e43ae5ce69 | PASS |
19e877ffdb4f | FAIL |
Corresponding issue: https://github.com/PaddlePaddle/continuous_evaluation/issues/88
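The per-commit debugging above amounts to searching a chronologically ordered commit list for the first failing commit. A minimal sketch of that search follows. It assumes a hypothetical `passes(commit)` callback that would build the commit and run the vgg16 KPI check (replaced here by a lookup of the results recorded in the table), and it assumes that every commit after the first failure also fails.

```python
def first_failing_commit(commits, passes):
    """Return the first commit for which passes(commit) is False, or None.

    Assumes `commits` is ordered oldest to newest and that once a commit
    fails, every later commit fails too (the usual bisection assumption).
    """
    lo, hi = 0, len(commits)
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid + 1   # first failure must be after mid
        else:
            hi = mid       # mid fails, so the first failure is mid or earlier
    return commits[lo] if lo < len(commits) else None


# Results recorded in the table above; a real `passes` would run the model.
recorded = {"bc28cf613f9e": True, "a2e43ae5ce69": True, "19e877ffdb4f": False}
commits = ["bc28cf613f9e", "a2e43ae5ce69", "19e877ffdb4f"]
culprit = first_failing_commit(commits, lambda c: recorded[c])
print(culprit)
```

With only three commits a linear scan is just as good; the binary search pays off when the suspect range is long and each evaluation is expensive.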
- On duty: 郭超容
- 2 failures
- Issue 1:
- http://18.222.34.7:8080/viewLog.html?tab=buildLog&buildTypeId=Paddle_ContinuousEvaluation&buildId=659
- resnet50 single-GPU segmentation fault
commit id 106ee9d1cc8f71a2e961d69cc4c3277a5460a2d4
[13:37:35] ----------- Configuration Arguments -----------
[13:37:35] batch_size: 64
[13:37:35] data_format: NCHW
[13:37:35] data_set: flowers
[13:37:35] device: GPU
[13:37:35] gpu_id: 0
[13:37:35] infer_only: False
[13:37:35] iterations: 80
[13:37:35] log_dir: ./
[13:37:35] model: resnet_imagenet
[13:37:35] pass_num: 3
[13:37:35] skip_batch_num: 5
[13:37:35] use_cprof: False
[13:37:35] use_fake_data: False
[13:37:35] use_nvprof: False
[13:37:35] ------------------------------------------------
[13:37:35] del memory.txt
[13:37:37] /usr/local/lib/python2.7/dist-packages/paddle/fluid/average.py:42: Warning: The WeightedAverage is deprecated, please use fluid.metrics.Accuracy instead.
[13:37:37] (self.__class__.__name__), Warning)
[13:38:28] Pass:0, Loss:5.186864, Train Accuray:0.063672, Test Accuray:0.047059, Handle Images Duration: 38.863647
[13:38:28]
[13:39:17] Pass:1, Loss:3.916058, Train Accuray:0.148828, Test Accuray:0.118627, Handle Images Duration: 41.423943
[13:39:17]
[13:40:06] Pass:2, Loss:3.371016, Train Accuray:0.205469, Test Accuray:0.134314, Handle Images Duration: 41.409965
[13:40:06]
[13:40:06] Total examples: 15040, total time: 121.69756
[13:40:06] 123.58506 examples/sec, 0.51786 sec/batch
[13:40:06]
[13:40:07] *** Aborted at 1528292407 (unix time) try "date -d @1528292407" if you are using GNU date ***
[13:40:07] PC: @ 0x0 (unknown)
[13:40:07] *** SIGSEGV (@0x58) received by PID 36315 (TID 0x7f12c45c7700) from PID 88; stack trace: ***
[13:40:07] @ 0x7f134cafa390 (unknown)
[13:40:07] @ 0x4bc5bb PyEval_EvalFrameEx
[13:40:07] @ 0x4b9ab6 PyEval_EvalCodeEx
[13:40:07] @ 0x4d55f3 (unknown)
[13:40:07] @ 0x4a577e PyObject_Call
[13:40:07] @ 0x4bed3d PyEval_EvalFrameEx
[13:40:07] @ 0x4c136f PyEval_EvalFrameEx
[13:40:07] @ 0x4c136f PyEval_EvalFrameEx
[13:40:07] @ 0x4b9ab6 PyEval_EvalCodeEx
[13:40:07] @ 0x4d54b9 (unknown)
[13:40:07] @ 0x4eebee (unknown)
[13:40:07] @ 0x4a577e PyObject_Call
[13:40:07] @ 0x4c5e10 PyEval_CallObjectWithKeywords
[13:40:07] @ 0x589172 (unknown)
[13:40:07] @ 0x7f134caf06ba start_thread
[13:40:07] @ 0x7f134c82641d clone
[13:40:07] @ 0x0 (unknown)
[13:40:07] ./run.xsh: line 13: 36315 Segmentation fault (core dumped) FLAGS_benchmark=true FLAGS_fraction_of_gpu_memory_to_use=0.0 python model.py --device=GPU --batch_size=64 --data_set=flowers --model=resnet_imagenet --pass_num=3 --gpu_id=$cudaid
The error did not recur in later versions; it was most likely a problem with the CE environment at the time.
- Issue 2
- log url: http://18.222.34.7:8080/viewLog.html?buildId=700&tab=buildResultsDiv&buildTypeId=Paddle_ContinuousEvaluation
- issue: https://github.com/PaddlePaddle/Paddle/issues/11322
- Description: the transformer model hangs
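- On duty: 卫科
- None
- On duty: 董志宏
- None
- On duty: 严春伟
- Issue 1: fixed the transformer hang
- On duty: 武毅
- Issue 1: the multi-machine vgg model has a problem
- On duty: 党青青
- Issue 1:
- On duty: 刘毅冰
- Issue 1: the lstm memory KPI exceeded its threshold
- Issue 2: the mnist training duration exceeded its threshold
- On duty: 巩伟宝
- Issue: environment problem, the wheel package could not be found. Resolved by: 严春伟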
[12:00:59][Step 2/3] Requirement 'python/dist/*.whl' looks like a filename, but the file does not exist
[12:00:59][Step 2/3] *.whl is not a valid wheel filename.
[12:00:59][Step 2/3] You are using pip version 9.0.3, however version 10.0.1 is available.
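The pip message above most likely means the pattern `python/dist/*.whl` matched no file, so the literal pattern was handed to pip. A minimal sketch of a pre-install check that fails with a clearer message is shown below; the helper name `find_wheels` is hypothetical and is not part of the CE scripts.

```python
import glob
import os


def find_wheels(dist_dir):
    """Return the sorted .whl paths in dist_dir, or raise if there are none."""
    wheels = sorted(glob.glob(os.path.join(dist_dir, "*.whl")))
    if not wheels:
        # Fail early instead of passing an unmatched pattern on to pip.
        raise FileNotFoundError(
            "no wheel found in %r; did the build step produce one?" % dist_dir)
    return wheels
```

A caller could run `find_wheels("python/dist")` right after the build step and pass the returned paths to `pip install`, so that a missing wheel is reported as a build problem rather than as an invalid filename.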
- On duty: 郭晟
- 7 failures
- Issue 1:
- http://18.222.34.7:8080/viewLog.html?buildId=785&buildTypeId=Paddle_ContinuousEvaluation
- Description: the CE Paddle build had a problem with anakin support, which caused the build to fail.
- Issue 2
- http://18.222.34.7:8080/viewLog.html?buildId=786&buildTypeId=Paddle_ContinuousEvaluation
- Description: the CE Paddle build had a problem with anakin support, which caused the build to fail. Fixed.
- Issue 3:
- http://18.222.34.7:8080/viewLog.html?buildId=787&buildTypeId=Paddle_ContinuousEvaluation
- Description:
- language_model:[imikolov_20_avg_ppl_card4] failed;
- transformer:[train_pass_duration_kpi] failed;[test_avg_ppl_kpi_card4] failed
- commit id: f4dce5674dc639150e2a498bdcb30c2eefa4da24
- https://github.com/PaddlePaddle/Paddle/pull/11437
- Adding a Python wrapper should not affect CE evaluation results; this is most likely fluctuation
- Issue 4:
- http://18.222.34.7:8080/viewLog.html?buildId=788&buildTypeId=Paddle_ContinuousEvaluation
- Description:
- image_classification:[train_acc_top5_kpi] failed;
- language_model:[imikolov_20_avg_ppl] failed;[imikolov_20_avg_ppl_card4] failed;[imikolov_20_pass_duration_card4] failed
- transformer:[train_pass_duration_kpi] failed;[test_avg_ppl_kpi_card4] failed
- commit id: 745ea4dcf0a63cda83f1e908c69ec9b22cf82995
- https://github.com/PaddlePaddle/Paddle/pull/11354
- Documentation improvements should not affect CE evaluation results; this is most likely fluctuation
- Issue 5:
- http://18.222.34.7:8080/viewLog.html?buildId=794&buildTypeId=Paddle_ContinuousEvaluation
- Description:
- image_classification:[train_acc_top5_kpi] failed;
- language_model:[imikolov_20_avg_ppl] failed;[imikolov_20_avg_ppl_card4] failed;[imikolov_20_pass_duration_card4] failed
- transformer:[train_pass_duration_kpi] failed;[test_avg_ppl_kpi_card4] failed
- commit id: 916e863f85df467280857f89d27bad4aeaa25d92, 5fd142c3fd5cd673802593befd0f27a2257134f4, and others
- https://github.com/PaddlePaddle/Paddle/pull/11504, https://github.com/PaddlePaddle/Paddle/pull/11487, and others
- Documentation improvements and tensorrt inference work should not affect CE evaluation results; this is most likely fluctuation
- When multi-GPU scenarios were added for transformer and language_model, the per-card batch size was set incorrectly, so the data did not match expectations. This has been fixed.
- Issue 6:
- http://18.222.34.7:8080/viewLog.html?buildId=797&buildTypeId=Paddle_ContinuousEvaluation
- Description: Evaluate [0ddc5d86319f33e560afe274ce17f038cdfe498a] successed! However, there was a separate environment problem
[12:37:50]Evaluate [0ddc5d86319f33e560afe274ce17f038cdfe498a] successed!
[12:37:50]updating baseline
[12:37:50]current kpi imikolov_20_pass_duration_card4_factor.txt better than history by 0.465918, update baseline
[12:37:51]current kpi train_pass_duration_kpi_card4_factor.txt better than history by 0.665026, update baseline
[12:37:51]update github baseline
[12:37:51]To git@github.com:PaddlePaddle/paddle-ce-latest-kpis.git
[12:37:51] ! [rejected] master -> master (fetch first)
[12:37:51]error: failed to push some refs to 'git@github.com:PaddlePaddle/paddle-ce-latest-kpis.git'
- When several tasks update the baseline data concurrently, the problem above can occur. Fix: https://github.com/PaddlePaddle/continuous_evaluation/pull/73/files
- Issue 7:
- http://18.222.34.7:8080/viewLog.html?buildId=807&buildTypeId=Paddle_ContinuousEvaluation
- Description:
- lstm:[imdb_32_gpu_memory] failed;
- commit id: 3a4b6cdaa00de1d6f204a032b6ccf6f329d6a05c
- https://github.com/PaddlePaddle/Paddle/pull/11488
- Documentation improvements should not affect CE evaluation results; this is most likely fluctuation
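The rejected push in Issue 6 is the usual symptom of two writers racing to update the same branch. One common way to handle it is to bring the local branch up to date and retry. The sketch below illustrates that pattern only; it is not necessarily what the linked fix does, and `push` and `sync` are hypothetical callables supplied by the caller.

```python
def push_with_retry(push, sync, max_attempts=3):
    """Call push(); if it is rejected, call sync() and try again.

    `push` returns True on success and False if the remote rejected the push.
    `sync` brings the local branch up to date (for example, fetch and rebase).
    Both are hypothetical callables, not functions from the CE code base.
    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        if push():
            return attempt
        if attempt < max_attempts:
            sync()  # pick up the other task's baseline update before retrying
    raise RuntimeError("push still rejected after %d attempts" % max_attempts)
```

In practice `push` could wrap `git push origin master` and `sync` could wrap `git pull --rebase origin master` through `subprocess`; a rebase can still conflict if two tasks touch the same baseline file, which this sketch does not handle.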
- On duty: 王豪爽
- 2 failures
- Issue 1:
- http://18.222.34.7:8080/viewLog.html?buildId=828&buildTypeId=Paddle_ContinuousEvaluation
- Description: the image_classification task failed randomly.
- commit id: 69827f305b45214b99795166babac47fe057dca8
- This commit modifies the concat op and should not affect the image_classification task; it was judged to be a random failure and needs follow-up.
- Issue 2:
- http://18.222.34.7:8080/viewLog.html?buildId=830&buildTypeId=Paddle_ContinuousEvaluation
- Description: the mnist task failed
- commit id: e8f5757d6692642c22bd8b9081e1021cdd76ecb6
- Needs confirmation from @闵启阳
- On duty: 于洋
- None
- On duty: 于洋
- Issue:
- A logic error in the code caused abnormal model behavior
- https://github.com/PaddlePaddle/Paddle/issues/11615
- (the problematic code is at line 233) https://github.com/PaddlePaddle/Paddle/pull/11102/files#diff-2f61feb27730d0f42a351131c7077ec2
- Fixed
- Potential issue: the NER model should not use the deprecated Evaluator API
- Temporary fix: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/57
- On duty: 冯佳宜
- http://18.222.34.7:8080/viewLog.html?buildId=972&buildTypeId=Paddle_ContinuousEvaluation
- commit id: dbca7f166ddb62fbe3bd5dc50230f6347f0863ca
- Issue 1:
- Description: the mnist task failed
- Cause: the failure threshold was set too small; it has been increased
- pr: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/60
- Issue 2:
- Description: the image_classification task failed
- The suspected cause is randomness introduced by shuffle; shuffle has been turned off and the task is still being monitored
- pr: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/59
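Issue 1 above came down to a failure threshold that was too small for the run-to-run noise. A minimal sketch of a relative-difference check makes the trade-off concrete. It assumes a KPI where smaller is better and a hypothetical `threshold` expressed as a fraction of the baseline; it is an illustration, not the CE framework's actual implementation.

```python
def kpi_passes(current, baseline, threshold):
    """Return True if `current` is at most `threshold` (a fraction of
    `baseline`) worse than `baseline`, for a smaller-is-better KPI."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    relative_change = (current - baseline) / baseline
    return relative_change <= threshold


# Hypothetical numbers: a 3% slowdown fails a 2% threshold but passes a 5% one.
print(kpi_passes(10.3, 10.0, 0.02))
print(kpi_passes(10.3, 10.0, 0.05))
```

Raising the threshold removes false alarms from noise at the cost of letting through real regressions of the same size, which is why turning off sources of randomness such as shuffle is the complementary fix.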
- On duty: 乔龙飞
- Issue 1: build failure; fixed
- Issue 2: the sequence model exceeded its threshold
- On duty: 汤伟
- Issue 1:
- http://18.222.34.7:8080/viewLog.html?tab=buildChangesDiv&buildId=1123&buildTypeId=Paddle_ContinuousEvaluation
- commit id: b20fa022ed3f8b86f246081b61591d79c5f6fabe
- Description: build failure
- A merge updated the code and changed a variable name
- pr: https://github.com/PaddlePaddle/Paddle/pull/11717
- Issue 2:
- Description: the acc metric of the image_classification model does not stabilize
- https://github.com/PaddlePaddle/paddle-ce-latest-kpis/issues/70
- This metric has been disabled in CE; 青青 is asked to investigate and re-enable it
- pr: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/69
- On duty: 李青晟
- None
- 74ca73b80d29870a2931d853cc26c6465102808d
- Failed tasks
- image_classification and others
- Other jobs running on the CE machine may have interfered with the performance data