2018 10 - PaddlePaddle/continuous_evaluation GitHub Wiki

On-duty log, October 2018

20181024

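  • On duty: Qiu Xuezhong (邱学忠). No new issues. CI has intermittent failures; for distributed-training issues, please ask Wu Yi (武毅) to take a look.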

20181022

20181013

20181011

Luo Tao (骆涛) has already reverted the change and merged the revert.

20181010

  • On duty: Wang Guibao (王贵豹)

  • Issue 1

    • Task: model_object_detection

    • ce teamcity url: http://180.76.57.222:8111/viewLog.html?buildId=793&buildTypeId=PaddleModesl_Build

    • Failure symptom: running train.py fails with the following output:

      [05:57:01]W: [Step 1/1] % (self.__class__.__name__, self.__class__.__name__), Warning)

      [05:57:01]W: [Step 1/1] Traceback (most recent call last):

      [05:57:01]W: [Step 1/1] File "train.py", line 305, in <module>

      [05:57:01]W: [Step 1/1] val_file_list=val_file_list)

      [05:57:01]W: [Step 1/1] File "train.py", line 181, in train

      [05:57:01]W: [Step 1/1] train_py_reader.decorate_paddle_reader(train_reader)

      [05:57:01]W: [Step 1/1] File "/opt/python/cp27-cp27mu/lib/python2.7/site-packages/paddle/fluid/layers/io.py", line 590, in __set_paddle_reader__

      [05:57:01]W: [Step 1/1] data(

      [05:57:01]W: [Step 1/1] NameError: free variable 'data' referenced before assignment in enclosing scope

      [05:57:01] : [Step 1/1] *****

    • Analysis and handling: per discussion with Qingqing (青青) and Longfei (龙飞), this issue has already been fixed; waiting for CI.

    • ISSUE: https://github.com/PaddlePaddle/Paddle/pull/13787

    • Follow-up: the problem should be resolved once Longfei's (龙飞) fix passes CI and takes effect. Build #224 on 20181010 has passed.

  • Issue 2

    • Task: model_image_classification

    • ce teamcity url: http://180.76.57.222:8111/viewLog.html?buildId=793&buildTypeId=PaddleModesl_Build

    • fail KPI:

      details: [test_cost] failed, diff ratio: 0.06443547270471289 larger than 0.02

      details: [train_acc_top1_card4] failed, diff ratio: 0.04000974827231589 larger than 0.02.

      details: [train_acc_top5_card4] failed, diff ratio: 0.038513745905762564 larger than 0.02.

      details: [train_cost_card4] failed, diff ratio: 0.03390059741586375 larger than 0.02.

      details: [test_acc_top1_card4] failed, diff ratio: 0.14102562327829346 larger than 0.02.

      details: [test_acc_top5_card4] failed, diff ratio: 0.10873440306454428 larger than 0.02.

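    • Analysis and handling: per discussion with Qingqing (青青), this issue is related to Luo Tao's (骆涛) earlier change to the FAST_MATH switch. Luo Tao asked in the Hi group chat whether the with_fast_math switch in CMakeCache.txt was turned on at build time; Weike (卫科) checked on the build machine and found WITH_FAST_MATH=off, although it should have been on. The suspected cause is that the build environment did not clear its cache. After clearing the CMake cache and rebuilding, this model evaluation passed.

    • Follow-up: QA, please keep an eye on stale-cache problems on the build cluster.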

  • Issue 3

    • Task: model_sequence_tagging_for_ner

    • ce teamcity url: http://180.76.57.222:8111/viewLog.html?buildId=793&buildTypeId=PaddleModesl_Build

    • fail KPI:

      details: [train_duration] failed, diff ratio: 0.13093519877122206 larger than 0.05.

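    • Analysis and handling: training took longer than expected. Per discussion with Weike (卫科), the suspected cause is again that WITH_FAST_MATH was not turned on. After clearing the cache, rebuilding, and rerunning, this model evaluation passed.

    • Follow-up: QA, please keep an eye on stale-cache problems on the build cluster.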

  • Issue 4

    [10:25:29]W: [Step 1/1] CMake Error at cmake/generic.cmake:245 (add_library):

    [10:25:29]W: [Step 1/1] Cannot find source file:

    [10:25:29]W: [Step 1/1]

    [10:25:29]W: [Step 1/1] brpc_sendrecvop_utils.cc

    [10:25:29]W: [Step 1/1]

    [10:25:29]W: [Step 1/1] Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp

    [10:25:29]W: [Step 1/1] .hxx .in .txx

    [10:25:29]W: [Step 1/1] Call Stack (most recent call first):

    [10:25:29]W: [Step 1/1] cmake/generic.cmake:703 (cc_library)

    [10:25:29]W: [Step 1/1] paddle/fluid/operators/distributed/CMakeLists.txt:33 (brpc_library)

    [10:25:29]W: [Step 1/1]

    [10:25:29]W: [Step 1/1]

    [10:25:29]W: [Step 1/1] CMake Error at cmake/generic.cmake:302 (add_executable):

    [10:25:29]W: [Step 1/1] Cannot find source file:

    [10:25:29]W: [Step 1/1]

    [10:25:29]W: [Step 1/1] brpc_serde_test.cc

    [10:25:29]W: [Step 1/1]

    [10:25:29]W: [Step 1/1] Tried extensions .c .C .c++ .cc .cpp .cxx .m .M .mm .h .hh .h++ .hm .hpp

    [10:25:29]W: [Step 1/1] .hxx .in .txx

    [10:25:29]W: [Step 1/1] Call Stack (most recent call first):

    [10:25:29]W: [Step 1/1] paddle/fluid/operators/distributed/CMakeLists.txt:43 (cc_test)

    [10:25:29]W: [Step 1/1]

    [10:25:29]W: [Step 1/1]

    [10:25:32]W: [Step 1/1] CMake Error at cmake/generic.cmake:265 (add_dependencies):

    [10:25:32]W: [Step 1/1] The dependency target "gpr" of target "executor" does not exist.

    [10:25:32]W: [Step 1/1] Call Stack (most recent call first):

    [10:25:32]W: [Step 1/1] paddle/fluid/framework/CMakeLists.txt:162 (cc_library)

    [10:25:32]W: [Step 1/1]

    [10:25:32]W: [Step 1/1]

    [10:25:32]W: [Step 1/1] CMake Error at cmake/generic.cmake:265 (add_dependencies):

    [10:25:32]W: [Step 1/1] The dependency target "grpc++_unsecure" of target "executor" does not

    [10:25:32]W: [Step 1/1] exist.

    [10:25:32]W: [Step 1/1] Call Stack (most recent call first):

    [10:25:32]W: [Step 1/1] paddle/fluid/framework/CMakeLists.txt:162 (cc_library)

    [10:25:32]W: [Step 1/1]

    [10:25:32]W: [Step 1/1]

    [10:25:32]W: [Step 1/1] CMake Error at cmake/generic.cmake:265 (add_dependencies):

    [10:25:32]W: [Step 1/1] The dependency target "grpc_unsecure" of target "executor" does not exist.

    [10:25:32]W: [Step 1/1] Call Stack (most recent call first):

    [10:25:32]W: [Step 1/1] paddle/fluid/framework/CMakeLists.txt:162 (cc_library)

    [10:25:32]W: [Step 1/1]

    [10:25:32]W: [Step 1/1]

    [10:25:32]W: [Step 1/1] CMake Error at cmake/generic.cmake:265 (add_dependencies):

    [10:25:32]W: [Step 1/1] The dependency target "sendrecvop_grpc" of target "executor" does not

    [10:25:32]W: [Step 1/1] exist.

    [10:25:32]W: [Step 1/1] Call Stack (most recent call first):

    [10:25:32]W: [Step 1/1] paddle/fluid/framework/CMakeLists.txt:162 (cc_library)

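    • Analysis and handling: reading the source shows that brpc_library in paddle/fluid/operators/distributed/CMakeLists.txt lists source files that do not exist, brpc_sendrecvop_utils.cc and brpc_serde_test.cc. On 20181011, Weibao (伟宝) confirmed that when the WITH_DISTRIBUTED switch is on, the brpc_library target is not built at all. The current suspicion is again that the build cluster's cache was not fully cleared.
    • Follow-up: QA has already cleared the cache once, and the last build on October 10 succeeded, so this issue will not be followed up for now.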
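
Issues 2 and 3 above were both attributed to a stale WITH_FAST_MATH value left in CMakeCache.txt. The following is a minimal sketch of how a build job could detect such a mismatch before compiling. It relies only on the `NAME:TYPE=VALUE` entry format of CMakeCache.txt; the helper names `read_cmake_cache` and `stale_options`, the sample cache text, and the expected option values are hypothetical and are not part of the CE scripts.

```python
def read_cmake_cache(text):
    """Parse the text of a CMakeCache.txt into a {name: value} dict."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the two comment styles used in the cache file.
        if not line or line.startswith(("#", "//")):
            continue
        name_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        name, _, _type = name_and_type.partition(":")
        entries[name] = value
    return entries


def stale_options(cache_text, expected):
    """Return {name: (cached, expected)} for options whose cached value differs.

    Values are compared as case-insensitive strings, so "ON" and "1" are
    treated as different even though CMake considers both true.
    """
    cache = read_cmake_cache(cache_text)
    mismatches = {}
    for name, want in expected.items():
        have = cache.get(name)
        if have is None or have.upper() != want.upper():
            mismatches[name] = (have, want)
    return mismatches


# Hypothetical cache content resembling what was found on the build machine.
sample = """\
# This is the CMakeCache file.
//Illustrative help text
WITH_FAST_MATH:BOOL=OFF
WITH_DISTRIBUTED:BOOL=ON
"""

print(stale_options(sample, {"WITH_FAST_MATH": "ON", "WITH_DISTRIBUTED": "ON"}))
```

A non-empty result would be a signal to delete the build directory's cache and reconfigure rather than trust the incremental build.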

20181009

On duty: Zhang Wenhui (张文慧)

CE

20181008

On duty: Liu Yi (刘毅) (Cheng Yao (程耀) actually covered the shift, swapped with Liu Yi)

CE

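  • The issue that kept triggering image_classification timeout alarms was traced to an incorrect CE KPI type. Qingqing (青青) has fixed it; see the PR link for details.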

20181003

On duty: Dang Qingqing (党青青) (Luo Tao (骆涛) actually covered the shift, swapped with Dang Qingqing)

CE

  • Performance issue: image classification is slower on 4 cards: "[train_speed_card4] failed, diff ratio: 0.0799147480018833 larger than 0.05." This has persisted since around the 26th; which PR caused it is still to be determined.

20181001

On duty: Luo Tao (骆涛) (Dang Qingqing (党青青) actually covered the shift, swapped with Luo Tao)

CE

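  • Performance issue: image classification is slower on 4 cards: "[train_speed_card4] failed, diff ratio: 0.08058202954940144 larger than 0.05." This has persisted since around the 26th; which PR caused it is still to be determined. Cause: the time dropped from 0.0101 to about 0.0093, a change of more than 5%. Because the time got shorter, this KPI actually improved, which is the desired direction. On inspection, the wrong KPI was used in _ce.py. Fixed by https://github.com/PaddlePaddle/models/pull/1343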
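
The false alarm above comes down to whether a KPI check knows which direction counts as an improvement. The sketch below illustrates the idea with the approximate numbers from this entry (0.0101 and 0.0093) and the 0.05 threshold. The functions `diff_ratio` and `check_kpi` and the `less_is_better` flag are hypothetical names used for illustration; they are not the actual KPI classes of the CE framework, and treating any improvement as a pass is an assumption of this sketch.

```python
def diff_ratio(current, baseline):
    """Relative change of the current value with respect to the baseline."""
    return abs(current - baseline) / abs(baseline)


def check_kpi(current, baseline, threshold, less_is_better):
    """Return (passed, ratio) for one KPI.

    A change larger than the threshold only fails the check when it moves
    in the wrong direction (assumption: improvements always pass).
    """
    ratio = diff_ratio(current, baseline)
    if less_is_better:
        improved = current < baseline
    else:
        improved = current > baseline
    passed = ratio <= threshold or improved
    return passed, ratio


# Per-step time fell from 0.0101 to about 0.0093 (values from the log entry).
as_cost_kpi = check_kpi(0.0093, 0.0101, 0.05, less_is_better=True)
as_speed_kpi = check_kpi(0.0093, 0.0101, 0.05, less_is_better=False)
print(as_cost_kpi[0], as_speed_kpi[0])
```

With the time treated as a "less is better" KPI the check passes, while the same numbers fail when the KPI is declared as "greater is better", which matches the kind of mismatch described in this entry.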