2018 07 - PaddlePaddle/continuous_evaluation GitHub Wiki

On-call log, July 2018

Note: the most recent entries are at the top.

20180731

On duty: 王豪爽
Issue 1:

  • transformer task failed
Evaluate [91fb0156ca1a97a247f571960440bae993c28a1c] failed!
[04:29:09]W:	 [Step 1/1] The details:
[04:29:09]W:	 [Step 1/1] task: transformer
[04:29:09]W:	 [Step 1/1] passed:  False
[04:29:09]W:	 [Step 1/1] infos Task is disabled, [test_avg_ppl_kpi] pass
[04:29:09]W:	 [Step 1/1] [train_pass_duration_kpi] failed, diff ratio: 0.03915986083400443 larger than 0.03.
[04:29:09]W:	 [Step 1/1] [test_avg_ppl_kpi_card4] pass
[04:29:09]W:	 [Step 1/1] [train_pass_duration_kpi_card4] failed, diff ratio: 0.036419812602559304 larger than 0.03.
[04:29:09]W:	 [Step 1/1] kpis keys ['test_avg_ppl_kpi_card4', 'train_pass_duration_kpi', 'train_pass_duration_kpi_card4', 'test_avg_ppl_kpi']
[04:29:09]W:	 [Step 1/1] kpis values [[[77.48257446289062]], [[91.40244913101196]], [[35.20487093925476]], [[23.3564453125]]]
[04:29:21]W:	 [Step 1/1] Process exited with code 255
[04:29:21]E:	 [Step 1/1] Process exited with code 255
[04:29:21]E:	 [Step 1/1] Step build and run paddle (Command Line) failed
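
The log above reports a "diff ratio" compared against a 0.03 threshold. The following is a minimal sketch of how such a check could work, assuming the ratio is the relative deviation from a stored baseline. The function names `diff_ratio` and `kpi_passes` and the sample numbers are hypothetical; this is not the actual continuous_evaluation implementation.

```python
def diff_ratio(current: float, baseline: float) -> float:
    """Relative deviation of the current KPI value from its baseline (assumed formula)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(current - baseline) / abs(baseline)


def kpi_passes(current: float, baseline: float, threshold: float = 0.03) -> bool:
    """True when the relative deviation does not exceed the threshold."""
    return diff_ratio(current, baseline) <= threshold


# Hypothetical numbers, not taken from the log above.
print(kpi_passes(103.9, 100.0))  # roughly 3.9% deviation, above the 3% threshold
print(kpi_passes(102.0, 100.0))  # 2.0% deviation, within the 3% threshold
```

Under this assumption, a duration KPI that drifts only slightly past 3% fails the whole task, which matches the two `train_pass_duration_kpi` failures reported here.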

20180730

On duty: 郭晟
Issue 1:

  • Error while running git clone on the paddle-ce-latest-kpis repository
[03:12:27][Step 1/1] Initialized empty Git repository in /workspace/modelce/tasks/.git/
[03:12:30][Step 1/1] Host key verification failed.
[03:12:30][Step 1/1] fatal: The remote end hung up unexpectedly
[03:12:30][Step 1/1] Traceback (most recent call last):
[03:12:30][Step 1/1]   File "/usr/local/bin/xonsh", line 3, in <module>
[03:12:30][Step 1/1]     main()
[03:12:30][Step 1/1]   File "/usr/local/lib/python3.5/site-packages/xonsh/main.py", line 344, in main
[03:12:30][Step 1/1]     _failback_to_other_shells(args, err)
[03:12:30][Step 1/1]   File "/usr/local/lib/python3.5/site-packages/xonsh/main.py", line 308, in _failback_to_other_shells
......
[03:12:30][Step 1/1]   File "./main.xsh", line 243, in <module>
[03:12:30][Step 1/1]     main()
[03:12:30][Step 1/1]   File "./main.xsh", line 35, in main
[03:12:30][Step 1/1]     refresh_baseline_workspace()
[03:12:30][Step 1/1]   File "./main.xsh", line 89, in refresh_baseline_workspace
[03:12:30][Step 1/1]     git clone @(config.baseline_repo_url) @(config.baseline_path)
[03:12:30][Step 1/1]   File "/usr/local/lib/python3.5/site-packages/xonsh/built_ins.py", line 881, in subproc_captured_hiddenobject
......
[03:12:30][Step 1/1]   File "/usr/local/lib/python3.5/site-packages/xonsh/proc.py", line 2154, in _raise_subproc_error
[03:12:30][Step 1/1]     output=self.output)
[03:12:30][Step 1/1] subprocess.CalledProcessError: Command '['/usr/bin/git', 'clone', 'git@github.com:PaddlePaddle/paddle-ce-latest-kpis.git', '/workspace/modelce/tasks']' returned non-zero exit status 128
[03:12:43][Step 1/1] Process exited with code 1
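
"Host key verification failed" usually means the Git server's SSH host key is not in the build machine's `known_hosts` file. As a rough sketch, the helper below derives the host from an SSH-style Git URL and builds an `ssh-keyscan` command whose output could be appended to `~/.ssh/known_hosts`. The function names and the example URL are hypothetical; the log does not say how the problem was actually resolved.

```python
import re
from typing import List


def ssh_host_from_git_url(url: str) -> str:
    """Extract the host from an scp-style (user@host:path) or ssh:// Git URL."""
    if "://" in url and not url.startswith("ssh://"):
        raise ValueError(f"not an SSH Git URL: {url!r}")
    match = re.match(r"^(?:ssh://)?(?:[^@/]+@)?([^:/]+)[:/]", url)
    if match is None:
        raise ValueError(f"not an SSH Git URL: {url!r}")
    return match.group(1)


def keyscan_command(url: str) -> List[str]:
    """Build the ssh-keyscan invocation for the host behind a Git URL."""
    return ["ssh-keyscan", "-t", "rsa,ed25519", ssh_host_from_git_url(url)]


# Hypothetical repository URL used only for illustration.
print(" ".join(keyscan_command("git@example.com:org/repo.git")))
```

Keys fetched this way should be compared against fingerprints published by the hosting service before they are trusted.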

Issue 2:

  • text_classification failed in the evening; whoever is on duty tomorrow should keep an eye on it
  • job: http://ce.paddlepaddle.org:8080/viewLog.html?buildId=501&tab=buildResultsDiv&buildTypeId=PaddleCe_CEBuild

20180727

On duty: 巩伟宝
Issue: the image classification model exceeded its KPI threshold.
Job: http://ce.paddlepaddle.org:8080/viewLog.html?buildId=428&buildTypeId=PaddleCe_CEBuild
Issue link: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/issues/70 [Fixed, pending verification]

20180724

On duty: 刘毅冰
Issue: abnormal LSTM KPI, see http://ce.paddlepaddle.org:8080/viewLog.html?buildId=321&buildTypeId=PaddleCe_CEBuild&tab=buildResultsDiv

20180723

On duty: 党青青
Issues: none

20180720

On duty: 赵成舵
Issues: none

20180719

On duty: 骆涛
Issues: none

20180718

20180716

20180713

20180712

	*** Aborted at 1531370445 (unix time) try "date -d @1531370445" if you are using GNU date ***
[04:40:45]	PC: @                0x0 (unknown)
[04:40:45]	*** SIGSEGV (@0x58) received by PID 64205 (TID 0x7f78c0291700) from PID 88; stack trace: ***
[04:40:45]	    @     0x7f79497d4390 (unknown)
[04:40:45]	    @           0x4bc5bb PyEval_EvalFrameEx
[04:40:45]	    @           0x4b9ab6 PyEval_EvalCodeEx
[04:40:45]	    @           0x4d55f3 (unknown)
[04:40:45]	    @           0x4a577e PyObject_Call
[04:40:45]	    @           0x4bed3d PyEval_EvalFrameEx
[04:40:45]	    @           0x4c136f PyEval_EvalFrameEx
[04:40:45]	    @           0x4c136f PyEval_EvalFrameEx
[04:40:45]	    @           0x4b9ab6 PyEval_EvalCodeEx
[04:40:45]	    @           0x4d54b9 (unknown)
[04:40:45]	    @           0x4eebee (unknown)
[04:40:45]	    @           0x4a577e PyObject_Call
[04:40:45]	    @           0x4c5e10 PyEval_CallObjectWithKeywords
[04:40:45]	    @           0x589172 (unknown)
[04:40:45]	    @     0x7f79497ca6ba start_thread
[04:40:45]	    @     0x7f794950041d clone
[04:40:45]	    @                0x0 (unknown)
[04:40:46]	./run.xsh: line 13: 64205 Segmentation fault      (core dumped) FLAGS_benchmark=true FLAGS_fraction_of_gpu_memory_to_use=0.0 python model.py --device=GPU --batch_size=64 --data_set=flowers --model=resnet_imagenet --pass_num=3 --gpu_id=$cudaid 
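
The abort message suggests converting the Unix timestamp with GNU `date`. A small Python equivalent is sketched below; the helper name `unix_to_utc` is only illustrative.

```python
from datetime import datetime, timezone


def unix_to_utc(ts: int) -> str:
    """Format a Unix timestamp as a UTC date and time string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


print(unix_to_utc(1531370445))  # 2018-07-12 04:40:45
```

The result lines up with the `[04:40:45]` prefixes in the stack trace, which suggests the CI log timestamps are in UTC.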

20180711

On duty: 卫科
Issue 1:

20180710

On duty: 卫科
Issue 1:

20180709

On duty: 武毅
Issue 1:

20180706

On duty: 武毅
Issue 1:

[06:15:21]	sh: 1: patchelf: not found
[06:15:21]	Traceback (most recent call last):
[06:15:21]	  File "setup.py", line 134, in <module>
[06:15:21]	    raise Exception("patchelf --set-rpath for libmkldnn.so.0 fails")
[06:15:21]	Exception: patchelf --set-rpath for libmkldnn.so.0 fails
[06:15:21]	make[2]: *** [python/build/.timestamp] Error 1
[06:15:21]	make[1]: *** [python/CMakeFiles/paddle_python.dir/all] Error 2
[06:15:21]	make[1]: *** Waiting for unfinished jobs.... 
  • Recovered after installing the patchelf package in the image (apt-get install patchelf)
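
A failure like this could be caught before the build starts by checking that required command-line tools are on `PATH`. The sketch below uses the standard-library `shutil.which`; the helper name `missing_tools` and the idea of running it as a pre-build step are assumptions, not part of the existing CE scripts.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# patchelf is the tool the build above could not find.
missing = missing_tools(["patchelf"])
if missing:
    print("missing build tools:", ", ".join(missing))
```

Failing fast with a clear message is cheaper than discovering the missing tool partway through `make`.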

20180705

On duty: 武毅
Issue 1:

20180704

On duty: 邱学忠
Issue 1:

20180703

On duty: 曾锦乐
Issues: none

20180702

On duty: 闵启阳
Issues: none
