2018 10 - PaddlePaddle/continuous_evaluation GitHub Wiki
- On duty: 邱学忠. No issues to report. CI fails intermittently; for distributed-training issues, please ask 武毅 to take a look.
- On duty: 唐舰
- CE issues
- language_model: the imikolov_20_pass_duration KPI slightly exceeded its threshold once mid-run and returned to normal after a rerun. @guocheng has submitted a PR to adjust it: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/166
- On duty: 曾锦乐
- CE issues
- Task: model_icnet
- model repo url:
- Failure: incorrect model KPI
- Root cause: the model did not fix its random seed. A PR has been submitted to fix it: https://github.com/PaddlePaddle/models/pull/1267
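A minimal sketch of why an unfixed random seed makes KPI values drift between runs. It uses only Python's standard `random` module; `fake_training_loss` is a hypothetical stand-in for a training run, not code from model_icnet or from the linked PR.

```python
import random


def fake_training_loss(seed=None, steps=5):
    """Hypothetical stand-in for a training run: a 'loss' built from random draws."""
    # seed=None lets Python seed from OS entropy or the clock, so every run differs.
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(steps)) / steps


if __name__ == "__main__":
    # Fixed seed: two runs give the same value, so a stored KPI baseline is comparable.
    print(fake_training_loss(seed=1) == fake_training_loss(seed=1))
    # No seed: the two values will almost certainly differ.
    print(fake_training_loss(), fake_training_loss())
```

With a fixed seed the only remaining run-to-run differences come from the code under test, which is what a KPI threshold is meant to detect.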
- On duty: 王冠中
- CI issue 1
- Task: vis demo se_resnext50 runs fail
- ci url: http://ci.paddlepaddle.org/viewLog.html?buildId=22700&tab=buildResultsDiv&buildTypeId=Manylinux1_CpuAvxOpenblas
- 召龙 has submitted a PR to fix it
- CE issue 1
- Task: model_language_model
- model repo url: http://180.76.57.222:8111/viewLog.html?buildId=811&buildTypeId=PaddleModesl_Build
- Failure: eval language_model crashes with SIGSEGV
- The offending PR is being tracked down
- CE issue 2
- Task: model_object_detection, sequence_tagging_for_ner
- model repo url: http://180.76.57.222:8111/viewLog.html?buildId=812&buildTypeId=PaddleModesl_Build; http://ce.paddlepaddle.org:8080/viewLog.html?buildId=2040&buildTypeId=PaddleCe_CEBuild&tab=buildLog
- Failure: variable type error
- 潘欣 has already reverted the change
- On duty: 何荞至
- Issue 1
- Task: the inference unit test on the python35 CI seems to fail randomly quite often
- ci url: http://ci.paddlepaddle.org/viewLog.html?buildId=22406&buildTypeId=Paddle_PrCiPython35&tab=buildLog
- 骆涛 has reverted the change and the revert has been merged.
- On duty: 王贵豹
- Issue 1
- Task: model_object_detection
- ce teamcity url: http://180.76.57.222:8111/viewLog.html?buildId=793&buildTypeId=PaddleModesl_Build
- Failure: train.py fails with the following output:
[05:57:01]W: [Step 1/1] % (self.__class__.__name__, self.__class__.__name__), Warning)
[05:57:01]W: [Step 1/1] Traceback (most recent call last):
[05:57:01]W: [Step 1/1] File "train.py", line 305, in <module>
[05:57:01]W: [Step 1/1] val_file_list=val_file_list)
[05:57:01]W: [Step 1/1] File "train.py", line 181, in train
[05:57:01]W: [Step 1/1] train_py_reader.decorate_paddle_reader(train_reader)
[05:57:01]W: [Step 1/1] File "/opt/python/cp27-cp27mu/lib/python2.7/site-packages/paddle/fluid/layers/io.py", line 590, in set_paddle_reader
[05:57:01]W: [Step 1/1] data(
[05:57:01]W: [Step 1/1] NameError: free variable 'data' referenced before assignment in enclosing scope
[05:57:01] : [Step 1/1] *****
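- Analysis and handling: after talking with 青青 and 龙飞, the issue has already been fixed and is waiting on CI.
- Follow-up: the problem should be resolved once 龙飞's CI takes effect. Build #224 on 2018-10-10 has passed.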
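For readers unfamiliar with the error class in the traceback, here is a minimal sketch of how Python raises a NameError for a free variable. It is a generic illustration only: `make_reader`, `reader`, and `provider` are hypothetical names, and this is not the code in paddle/fluid/layers/io.py. The exact message wording depends on the Python version.

```python
def make_reader(provider=None):
    def reader():
        # 'data' is a free variable here: it lives in make_reader's scope.
        return data()

    if provider is not None:
        data = provider  # the only assignment; skipped when provider is None

    return reader


def describe(provider=None):
    """Call the reader and report whether looking up 'data' failed."""
    try:
        return make_reader(provider)()
    except NameError as exc:
        return "NameError: %s" % exc


if __name__ == "__main__":
    print(describe(lambda: [1, 2, 3]))  # 'data' was bound, so the call succeeds
    print(describe())                   # 'data' was never bound in the enclosing scope
```

The inner function compiles fine because `data` is assigned somewhere in the enclosing function; the failure only appears at call time when that assignment was skipped.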
- Issue 2
- Task: model_image_classification
- ce teamcity url: http://180.76.57.222:8111/viewLog.html?buildId=793&buildTypeId=PaddleModesl_Build
- Failed KPIs:
details: [test_cost] failed, diff ratio: 0.06443547270471289 larger than 0.02
details: [train_acc_top1_card4] failed, diff ratio: 0.04000974827231589 larger than 0.02.
details: [train_acc_top5_card4] failed, diff ratio: 0.038513745905762564 larger than 0.02.
details: [train_cost_card4] failed, diff ratio: 0.03390059741586375 larger than 0.02.
details: [test_acc_top1_card4] failed, diff ratio: 0.14102562327829346 larger than 0.02.
details: [test_acc_top5_card4] failed, diff ratio: 0.10873440306454428 larger than 0.02.
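- Analysis and handling: after talking with 青青, this issue is related to 骆涛's earlier change to the FAST_MATH switch. 骆涛 asked in the Hi group whether the with_fast_math switch was enabled in the build's CMakeCache.txt; 卫科 checked on the build machine and found WITH_FAST_MATH=off, although it should have been on. The suspected cause is that the build environment did not clear its cache. After clearing the CMake cache and rebuilding, this model evaluation passed.
- Follow-up: QA, please keep an eye on the cache problem on the build cluster.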
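The manual check of CMakeCache.txt described above could be scripted along these lines. This is a sketch, assuming the usual `NAME:TYPE=VALUE` cache entry format; `read_cmake_cache` and `option_is_on` are hypothetical helper names and the sample cache content is made up for the demo.

```python
import os
import tempfile


def read_cmake_cache(path):
    """Parse CMakeCache.txt entries of the form NAME:TYPE=VALUE into a dict."""
    entries = {}
    with open(path) as cache_file:
        for line in cache_file:
            line = line.strip()
            # Skip blank lines and the '//' and '#' comment lines CMake writes.
            if not line or line.startswith("//") or line.startswith("#"):
                continue
            key, sep, value = line.partition("=")
            if not sep:
                continue
            entries[key.split(":", 1)[0]] = value
    return entries


def option_is_on(entries, name):
    """True if the cached option holds one of CMake's common 'true' spellings."""
    return entries.get(name, "").upper() in ("ON", "1", "TRUE", "YES", "Y")


if __name__ == "__main__":
    # Hypothetical cache content, written to a temporary file for the demo.
    sample = "// made-up sample\nWITH_FAST_MATH:BOOL=OFF\nCMAKE_BUILD_TYPE:STRING=Release\n"
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(sample)
    try:
        cache = read_cmake_cache(tmp.name)
        print("WITH_FAST_MATH on:", option_is_on(cache, "WITH_FAST_MATH"))
    finally:
        os.remove(tmp.name)
```

Running such a check before each CE build would flag a stale cache before any KPI is evaluated.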
- Issue 3
- Task: model_sequence_tagging_for_ner
- ce teamcity url: http://180.76.57.222:8111/viewLog.html?buildId=793&buildTypeId=PaddleModesl_Build
- Failed KPIs:
details: [train_duration] failed, diff ratio: 0.13093519877122206 larger than 0.05.
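- Analysis and handling: training took longer than expected. After talking with 卫科, the suspected cause is again that WITH_FAST_MATH was not enabled. The cache was cleared and the build redone, after which this model evaluation passed on rerun.
- Follow-up: QA, please keep an eye on the cache problem on the build cluster.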
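The "diff ratio ... larger than ..." messages in the failed KPI lists above can be read as a relative-deviation check against a stored baseline. The sketch below assumes the ratio is `abs(current - baseline) / abs(baseline)`; that formula, the function names, and the sample numbers are assumptions for illustration, not taken from the CE framework's source.

```python
def diff_ratio(current, baseline):
    """Relative deviation of the current value from the baseline (assumed formula)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(current - baseline) / abs(baseline)


def check_kpi(name, current, baseline, threshold):
    """Return (passed, message) for one KPI, mimicking the log wording."""
    ratio = diff_ratio(current, baseline)
    if ratio <= threshold:
        return True, "[%s] passed, diff ratio: %s" % (name, ratio)
    return False, "[%s] failed, diff ratio: %s larger than %s" % (name, ratio, threshold)


if __name__ == "__main__":
    # Hypothetical numbers: a run that takes 113 s against a 100 s baseline.
    print(check_kpi("train_duration", 113.0, 100.0, 0.05)[1])
```

Note that this check is symmetric: it fails on a large improvement as well as on a regression.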
- Issue 4
- Task: build failure
- ce teamcity url: http://180.76.57.222:8111/viewLog.html?buildId=793&buildTypeId=PaddleModesl_Build
- Symptom:
[10:25:29]W: [Step 1/1] CMake Error at cmake/generic.cmake:245 (add_library):
[10:25:29]W: [Step 1/1] Cannot find source file:
[10:25:29]W: [Step 1/1]
[10:25:29]W: [Step 1/1] brpc_sendrecvop_utils.cc
[10:25:29]W: [Step 1/1]
[10:25:29]W: [Step 1/1] Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
[10:25:29]W: [Step 1/1] .hxx .in .txx
[10:25:29]W: [Step 1/1] Call Stack (most recent call first):
[10:25:29]W: [Step 1/1] cmake/generic.cmake:703 (cc_library)
[10:25:29]W: [Step 1/1] paddle/fluid/operators/distributed/CMakeLists.txt:33 (brpc_library)
[10:25:29]W: [Step 1/1]
[10:25:29]W: [Step 1/1]
[10:25:29]W: [Step 1/1] CMake Error at cmake/generic.cmake:302 (add_executable):
[10:25:29]W: [Step 1/1] Cannot find source file:
[10:25:29]W: [Step 1/1]
[10:25:29]W: [Step 1/1] brpc_serde_test.cc
[10:25:29]W: [Step 1/1]
[10:25:29]W: [Step 1/1] Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp
[10:25:29]W: [Step 1/1] .hxx .in .txx
[10:25:29]W: [Step 1/1] Call Stack (most recent call first):
[10:25:29]W: [Step 1/1] paddle/fluid/operators/distributed/CMakeLists.txt:43 (cc_test)
[10:25:29]W: [Step 1/1]
[10:25:29]W: [Step 1/1]
[10:25:32]W: [Step 1/1] CMake Error at cmake/generic.cmake:265 (add_dependencies):
[10:25:32]W: [Step 1/1] The dependency target "gpr" of target "executor" does not exist.
[10:25:32]W: [Step 1/1] Call Stack (most recent call first):
[10:25:32]W: [Step 1/1] paddle/fluid/framework/CMakeLists.txt:162 (cc_library)
[10:25:32]W: [Step 1/1]
[10:25:32]W: [Step 1/1]
[10:25:32]W: [Step 1/1] CMake Error at cmake/generic.cmake:265 (add_dependencies):
[10:25:32]W: [Step 1/1] The dependency target "grpc++_unsecure" of target "executor" does not
[10:25:32]W: [Step 1/1] exist.
[10:25:32]W: [Step 1/1] Call Stack (most recent call first):
[10:25:32]W: [Step 1/1] paddle/fluid/framework/CMakeLists.txt:162 (cc_library)
[10:25:32]W: [Step 1/1]
[10:25:32]W: [Step 1/1]
[10:25:32]W: [Step 1/1] CMake Error at cmake/generic.cmake:265 (add_dependencies):
[10:25:32]W: [Step 1/1] The dependency target "grpc_unsecure" of target "executor" does not exist.
[10:25:32]W: [Step 1/1] Call Stack (most recent call first):
[10:25:32]W: [Step 1/1] paddle/fluid/framework/CMakeLists.txt:162 (cc_library)
[10:25:32]W: [Step 1/1]
[10:25:32]W: [Step 1/1]
[10:25:32]W: [Step 1/1] CMake Error at cmake/generic.cmake:265 (add_dependencies):
[10:25:32]W: [Step 1/1] The dependency target "sendrecvop_grpc" of target "executor" does not
[10:25:32]W: [Step 1/1] exist.
[10:25:32]W: [Step 1/1] Call Stack (most recent call first):
[10:25:32]W: [Step 1/1] paddle/fluid/framework/CMakeLists.txt:162 (cc_library)
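- Analysis and handling: reading the source shows that brpc_library in paddle/fluid/operators/distributed/CMakeLists.txt lists the source files brpc_sendrecvop_utils.cc and brpc_serde_test.cc, which do not exist. Discussed with 伟宝 on 2018-10-11: if the WITH_DISTRIBUTED switch is on, the brpc_library target is not built. The current suspicion is again that the build cluster's cache was not fully cleared.
- Follow-up: since QA has already cleared the cache once and the last build on October 10 succeeded, this issue will not be followed up for now.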
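- On duty: 张文慧
- The widespread failures happened because someone was debugging in the build configuration and had selected only one task. For debugging, use this link instead (http://180.76.57.222:8111/project.html?projectId=PaddleModesl_ModelsDebug&branch_PaddleModesl_ModelsDebug)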
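- On duty: 刘毅 (actually covered by 程耀, who swapped shifts with 刘毅)
- The recurring image_classification timeout alarms were traced to an incorrect CE KPI type. 青青 has fixed it; see the PR link for details.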
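- On duty: 党青青 (actually covered by 骆涛, who swapped shifts with 党青青)
- Performance issue: image classification became slower on 4 cards: "[train_speed_card4] failed, diff ratio: 0.0799147480018833 larger than 0.05." The issue has persisted since around the 26th; the PR that caused it has yet to be identified.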
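- On duty: 骆涛 (actually covered by 党青青, who swapped shifts with 骆涛)
- Performance issue: image classification became slower on 4 cards: "[train_speed_card4] failed, diff ratio: 0.08058202954940144 larger than 0.05." The issue has persisted since around the 26th; the PR that caused it was still to be identified. Cause: the time dropped from 0.0101 to about 0.0093. The change exceeds 5%, but a shorter time is an improvement for this kind of KPI, which is the desired outcome. Inspection showed that the wrong KPI was used in _ce.py. Fix by https://github.com/PaddlePaddle/models/pull/1343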
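The root cause above comes down to whether a KPI check knows which direction counts as worse. A minimal sketch of a direction-aware check follows; `regression_ratio`, `kpi_passes`, and the `lower_is_better` flag are hypothetical names, and this is not the code in _ce.py or in the linked PR.

```python
def regression_ratio(current, baseline, lower_is_better):
    """Signed relative change; a positive result means the KPI got worse."""
    change = (current - baseline) / abs(baseline)
    return change if lower_is_better else -change


def kpi_passes(current, baseline, threshold, lower_is_better):
    """Fail only when the KPI regressed by more than the threshold."""
    return regression_ratio(current, baseline, lower_is_better) <= threshold


if __name__ == "__main__":
    baseline, current = 0.0101, 0.0093  # approximate times quoted in the log
    # Treated as a time (lower is better): the drop is an improvement and passes.
    print(kpi_passes(current, baseline, 0.05, lower_is_better=True))
    # Treated with the wrong direction: the same drop is reported as a failure.
    print(kpi_passes(current, baseline, 0.05, lower_is_better=False))
```

With the rounded values quoted in the log the relative change is about 7.9%, which exceeds the 5% threshold in magnitude but only counts as a failure when the direction is configured wrongly.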