2018 07 - PaddlePaddle/continuous_evaluation GitHub Wiki
On duty: 王豪爽
- transformer task failed
Evaluate [91fb0156ca1a97a247f571960440bae993c28a1c] failed!
[04:29:09]W: [Step 1/1] The details:
[04:29:09]W: [Step 1/1] task: transformer
[04:29:09]W: [Step 1/1] passed: False
[04:29:09]W: [Step 1/1] infos Task is disabled, [test_avg_ppl_kpi] pass
[04:29:09]W: [Step 1/1] [train_pass_duration_kpi] failed, diff ratio: 0.03915986083400443 larger than 0.03.
[04:29:09]W: [Step 1/1] [test_avg_ppl_kpi_card4] pass
[04:29:09]W: [Step 1/1] [train_pass_duration_kpi_card4] failed, diff ratio: 0.036419812602559304 larger than 0.03.
[04:29:09]W: [Step 1/1] kpis keys ['test_avg_ppl_kpi_card4', 'train_pass_duration_kpi', 'train_pass_duration_kpi_card4', 'test_avg_ppl_kpi']
[04:29:09]W: [Step 1/1] kpis values [[[77.48257446289062]], [[91.40244913101196]], [[35.20487093925476]], [[23.3564453125]]]
[04:29:21]W: [Step 1/1] Process exited with code 255
[04:29:21]E: [Step 1/1] Process exited with code 255
[04:29:21]E: [Step 1/1] Step build and run paddle (Command Line) failed
- Job URL: http://ce.paddlepaddle.org:8080/viewLog.html?buildId=524&buildTypeId=PaddleCe_CEBuild&tab=buildLog
- Suspected culprit commit: 91fb0156ca1a97a247f571960440bae993c28a1c (by @dongzhihong01)
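The log above rejects both duration KPIs because their "diff ratio" is larger than 0.03. As a minimal sketch of what such a check might look like, the snippet below assumes the ratio is the absolute deviation from a stored baseline divided by that baseline. The function names `diff_ratio` and `kpi_passes` are hypothetical, and the CE framework's actual formula may differ.

```python
def diff_ratio(current: float, baseline: float) -> float:
    """Relative deviation of `current` from `baseline` (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(current - baseline) / abs(baseline)


def kpi_passes(current: float, baseline: float, threshold: float = 0.03) -> bool:
    """Return True when the deviation stays within the threshold."""
    return diff_ratio(current, baseline) <= threshold


# Hypothetical numbers for illustration, not taken from the job above.
print(kpi_passes(103.9, 100.0))  # deviation of about 0.039 exceeds 0.03
print(kpi_passes(101.0, 100.0))  # deviation of 0.01 is within 0.03
```

Under this definition a duration KPI fails whether the run got slower or faster, so a threshold breach is a prompt to investigate rather than proof of a regression.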
On duty: 郭晟
- Error while running git clone on the paddle-ce-latest-kpis repository
[03:12:27][Step 1/1] Initialized empty Git repository in /workspace/modelce/tasks/.git/
[03:12:30][Step 1/1] Host key verification failed.
[03:12:30][Step 1/1] fatal: The remote end hung up unexpectedly
[03:12:30][Step 1/1] Traceback (most recent call last):
[03:12:30][Step 1/1] File "/usr/local/bin/xonsh", line 3, in <module>
[03:12:30][Step 1/1] main()
[03:12:30][Step 1/1] File "/usr/local/lib/python3.5/site-packages/xonsh/main.py", line 344, in main
[03:12:30][Step 1/1] _failback_to_other_shells(args, err)
[03:12:30][Step 1/1] File "/usr/local/lib/python3.5/site-packages/xonsh/main.py", line 308, in _failback_to_other_shells
[03:12:30][Step 1/1] File "./main.xsh", line 243, in <module>
[03:12:30][Step 1/1] main()
[03:12:30][Step 1/1] File "./main.xsh", line 35, in main
[03:12:30][Step 1/1] refresh_baseline_workspace()
[03:12:30][Step 1/1] File "./main.xsh", line 89, in refresh_baseline_workspace
[03:12:30][Step 1/1] git clone @(config.baseline_repo_url) @(config.baseline_path)
[03:12:30][Step 1/1] File "/usr/local/lib/python3.5/site-packages/xonsh/built_ins.py", line 881, in subproc_captured_hiddenobject
[03:12:30][Step 1/1] File "/usr/local/lib/python3.5/site-packages/xonsh/proc.py", line 2154, in _raise_subproc_error
[03:12:30][Step 1/1] output=self.output)
[03:12:30][Step 1/1] subprocess.CalledProcessError: Command '['/usr/bin/git', 'clone', 'git@github.com:PaddlePaddle/paddle-ce-latest-kpis.git', '/workspace/modelce/tasks']' returned non-zero exit status 128
[03:12:43][Step 1/1] Process exited with code 1
- Job URL: http://ce.paddlepaddle.org:8080/viewLog.html?buildId=484&buildTypeId=PaddleCe_CEBuild&tab=buildLog
- Most likely a network problem in the environment at the time; it has not recurred since.
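If the clone failure was indeed transient, as the note above suggests, one possible mitigation is to retry the command a few times before giving up. The sketch below is an assumption, not part of `main.xsh`. `run_with_retries` and its parameters are hypothetical names, and retrying would not help if the SSH host key were genuinely missing from the build machine.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_seconds=5.0, runner=subprocess.run):
    """Run `cmd`, retrying when it exits with a non-zero status.

    `runner` defaults to subprocess.run and can be swapped out in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # check=True makes subprocess.run raise CalledProcessError on failure.
            return runner(cmd, check=True)
        except subprocess.CalledProcessError as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

A call such as `run_with_retries(["git", "clone", repo_url, target_dir])` would then replace the single clone attempt, where `repo_url` and `target_dir` stand in for the configured baseline repository and path.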
- Issue 2: text_classification failed in the evening; tomorrow's on-duty engineer should keep an eye on it.
- Job URL: http://ce.paddlepaddle.org:8080/viewLog.html?buildId=501&tab=buildResultsDiv&buildTypeId=PaddleCe_CEBuild
On duty: 巩伟宝
Issue: the image classification model exceeded its KPI threshold
- Job URL: http://ce.paddlepaddle.org:8080/viewLog.html?buildId=428&buildTypeId=PaddleCe_CEBuild
- Issue: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/issues/70
- Status: fixed, pending verification
On duty: 刘毅冰
Issue: LSTM KPI anomaly
On duty: 党青青
Issue: none
On duty: 赵成舵
Issue: none
On duty: 骆涛
Issue: none
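- On duty: 闫旭
- Issue 1: vgg16 performance regression; performance dropped gradually over the aws -> p40 -> v100 migration
- Issue: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/issues/96
- Status: root cause identified, fix pending
- Issue 2: the transformer performance problem still exists
- Issue: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/issues/80
- Status: root cause identified, fix pending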
- On duty: 董志宏
- Issue 1: transformer performance regression, fluctuating between 6% and 10%
- Issue: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/issues/80
- Status: root cause identified, fix pending
- On duty: 郭超容
- Issue 1: transformer performance dropped by about 8% (reproduces every time)
- Issue 2: over the past two days, single-card vgg16 runs hit a Segmentation fault twice
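- On duty: 卫科
- Issue 1:
- A change to the fluid.layers.get_places API caused some models to fail during training.
- Job URL:
- The affected model code needs to be updated: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/77/files, to be verified after the PR is merged.
- Issue 2:
- The lstm and resnet50 models each crashed with a core dump once during training.
- Job URL:
- The relevant log is below; a tracking issue has been opened: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/issues/76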
*** Aborted at 1531370445 (unix time) try "date -d @1531370445" if you are using GNU date ****** Aborte
[04:40:45] PC: @ 0x0 (unknown)
[04:40:45] *** SIGSEGV (@0x58) received by PID 64205 (TID 0x7f78c0291700) from PID 88; stack trace: ***
[04:40:45] @ 0x7f79497d4390 (unknown)
[04:40:45] @ 0x4bc5bb PyEval_EvalFrameEx
[04:40:45] @ 0x4b9ab6 PyEval_EvalCodeEx
[04:40:45] @ 0x4d55f3 (unknown)
[04:40:45] @ 0x4a577e PyObject_Call
[04:40:45] @ 0x4bed3d PyEval_EvalFrameEx
[04:40:45] @ 0x4c136f PyEval_EvalFrameEx
[04:40:45] @ 0x4c136f PyEval_EvalFrameEx
[04:40:45] @ 0x4b9ab6 PyEval_EvalCodeEx
[04:40:45] @ 0x4d54b9 (unknown)
[04:40:45] @ 0x4eebee (unknown)
[04:40:45] @ 0x4a577e PyObject_Call
[04:40:45] @ 0x4c5e10 PyEval_CallObjectWithKeywords
[04:40:45] @ 0x589172 (unknown)
[04:40:45] @ 0x7f79497ca6ba start_thread
[04:40:45] @ 0x7f794950041d clone
[04:40:45] @ 0x0 (unknown)
[04:40:46] ./run.xsh: line 13: 64205 Segmentation fault (core dumped) FLAGS_benchmark=true FLAGS_fraction_of_gpu_memory_to_use=0.0 python model.py --device=GPU --batch_size=64 --data_set=flowers --model=resnet_imagenet --pass_num=3 --gpu_id=$cudaid
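The first line of the crash log prints the abort time as a raw unix timestamp and suggests decoding it with GNU `date`. The same conversion in Python, shown here as a small convenience sketch, confirms that the crash time lines up with the `[04:40:45]` prefixes when read as UTC:

```python
from datetime import datetime, timezone

# Timestamp copied from the "*** Aborted at ..." line in the log above.
aborted_at = datetime.fromtimestamp(1531370445, tz=timezone.utc)
print(aborted_at.isoformat())  # 2018-07-12T04:40:45+00:00
```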
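On duty: 卫科
- KPI fluctuation occurred once for mnist and image_classification; the deviation was small, exceeding the threshold by 0.01 and 0.008 respectively.
- Job URL:
- Recovered after a rerun.
On duty: 卫科
- A compile error related to inference occurred once; it was resolved after 骆涛 made a fix.
- Job URL:
On duty: 武毅
- vgg16 hit a segmentation fault once; it did not reproduce on later PRs.
- Job URL:
- To be tracked
On duty: 武毅
- Build failed
- Job URL:
- Failure message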
[06:15:21] sh: 1: patchelf: not found
[06:15:21] Traceback (most recent call last):
[06:15:21] File "setup.py", line 134, in <module>
[06:15:21] raise Exception("patchelf --set-rpath for libmkldnn.so.0 fails")
[06:15:21] Exception: patchelf --set-rpath for libmkldnn.so.0 fails
[06:15:21] make[2]: *** [python/build/.timestamp] Error 1
[06:15:21] make[1]: *** [python/CMakeFiles/paddle_python.dir/all] Error 2
[06:15:21] make[1]: *** Waiting for unfinished jobs....
- Recovered after installing the patchelf package in the image (apt-get install patchelf)
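The build above only discovered the missing `patchelf` binary partway through `make`. One way to surface this earlier, sketched here as an assumption rather than something the CE scripts are known to do, is a preflight check that looks up required tools on `PATH` before the build starts. `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# patchelf is the tool the failed build needed; extend the list as required.
missing = missing_tools(["patchelf"])
if missing:
    print("missing build tools: " + ", ".join(missing))
```

Failing fast with the list of missing tools would point directly at the image rather than at `setup.py`.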
On duty: 武毅
- The imdb_32_gpu_memory KPI of the lstm model fluctuated.
- Job URL:
- Recovered
On duty: 邱学忠
- The pass duration value of the sequence_tagging_for_ner model fluctuated; it returned to normal on later PRs.
- Job URL:
- Recovered
On duty: 曾锦乐
Issue: none
On duty: 闵启阳
Issue: none