System Log: Burn In Test - calab-ntu/gpu-cluster GitHub Wiki


2020/04/03

Node Logs
eureka00
eureka01 [2020/05/21] Passed.
eureka02 [2020/04/07] Failed. Inconsistent Data_000007. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/04/23] Failed (after replacing a broken RAM). Inconsistent Data_000006. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/04/25] Passed.
         [2020/05/07] Failed. Inconsistent Data_000001. Differences in Record__Time: Hydro_Acc/Par_Acc begin on the order of round-off errors.
         [2020/05/09] Passed.
eureka03 [2020/04/07] Passed.
eureka04 [2020/04/07] Passed.
eureka05 [2020/04/07] Crashed. Inconsistent Data_000005. ERROR : incorrect time-step (dTime_min = 0.00000000000000e+00) !! at Time: 1.1443741e-01 -> 1.1490857e-01, Step: 192 -> 193, dt_base: 4.7116380e-04.
         [2020/04/24] Crashed. ERROR : AutoReduceDtCoeff (8.5899346e-02) < AUTO_REDUCE_DT_FACTOR_MIN (1.0000000e-01) !! --> AUTO_REDUCE_DT failed, and the program will be terminated ...... at Time: 6.9780206e-02 -> 7.0379401e-02, Step: 107 -> 108, dt_base: 5.9919502e-04.
         [2020/07/08] Failed (after MB replacement and RAM testing). Inconsistent Data_000006. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/07/08] Crashed. ERROR : incorrect time-step (dTime_min = 0.00000000000000e+00) !! at Time: 7.6882900e-02 -> 7.7465633e-02, Step: 119 -> 120, dt_base: 5.8273278e-04.
         [2020/07/11] Crashed. ERROR : incorrect time-step (dTime_min = 0.00000000000000e+00) !! at Time: 1.4682939e-01 -> 1.4720581e-01, Step: 272 -> 273, dt_base: 3.7641457e-04.
         [2020/07/14] Passed.
         [2020/07/16] Passed (after replacing all 8 DIMMs).
         [2020/07/19] Passed (after switching back to 2080 Super GPU).
         [2020/07/22] Passed.
         [2020/08/06] Passed (after reinstalling all hardware and OS).
eureka06 [2020/04/07] Passed.
         [2020/07/16] Passed (after replacing all 8 DIMMs).
eureka07 [2020/04/05] Failed. Inconsistent Data_000003. Differences in Record__Time: Hydro_Acc begin on the order of round-off errors.
         [2020/04/05] Failed. Inconsistent Data_000002. Differences in Record__Time: Hydro_Acc/Par_Acc begin on the order of round-off errors.
         [2020/04/12] Failed. Inconsistent Data_000006. Differences in Record__Time: Hydro_CFL begin on the order of round-off errors.
         [2020/04/24] Failed. Inconsistent Data_000001. Differences in Record__Time: Hydro_Acc/Par_Acc begin on the order of round-off errors.
eureka08 [2020/05/21] Passed.
eureka09 [2020/05/21] Passed.
eureka10 [2020/05/15] Passed.
eureka11 [2020/04/07] Passed.
         [2020/04/24] Passed (after replacing a broken RAM).
         [2020/07/16] Passed (after replacing all 8 DIMMs).
         [2020/08/07] Passed (after reinstalling all hardware and OS).
         [2020/09/24] Passed (after replacing CPU).
eureka12 [2020/05/24] Passed.
eureka13 [2020/07/16] Passed (after replacing all 8 DIMMs).
eureka14 [2020/04/13] Passed.
         [2020/05/15] Failed. Inconsistent Data_000006. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/05/18] Passed.
         [2020/05/21] Passed.
         [2020/05/24] Passed.
eureka15 [2020/04/13] Passed.
         [2020/05/15] Passed.
eureka16 [2020/04/11] Failed (after power strip replacement). Inconsistent Data_000001. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/04/12] Failed. Inconsistent Data_000003. Differences in Record__Time: Hydro_CFL begin on the order of round-off errors.
         [2020/04/12] Failed. Inconsistent Data_000002. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/05/07] Passed (after RAM test but without any replacement).
         [2020/05/09] Passed.
         [2020/05/12] Passed.
         [2020/06/14] Passed.
         [2020/06/17] Passed.
eureka17 [2020/05/21] Passed.
eureka18 [2020/05/21] Passed.
eureka19 [2020/05/21] Passed.
eureka20 [2020/05/21] Passed.
eureka21 [2020/06/16] Passed.
eureka22 [2020/04/13] Passed.
eureka23 [2020/05/06] Passed.
eureka24 [2020/04/04] Failed. Inconsistent Data_000001. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/04/04] Failed. Inconsistent Data_000001. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/04/07] Passed (after reboot).
         [2020/04/11] Failed (after power strip replacement). Inconsistent Data_000002. Differences in Record__Time: Hydro_Acc/Par_Acc begin on the order of round-off errors.
         [2020/04/12] Failed. Inconsistent Data_000006. Differences in Record__Time: Hydro_CFL begin on the order of round-off errors.
         [2020/04/27] Failed. Inconsistent Data_000008. Differences in Record__Time: Hydro_Acc/Par_Acc begin on the order of round-off errors.
eureka25 [2020/04/13] Passed.
         [2020/07/16] Passed (after replacing all 8 DIMMs).
eureka26 [2020/04/13] Passed.
         [2020/05/07] Passed.
         [2020/06/09] Crashed (after the [2020/06/06] hot crash, which occurred because the pump power was not connected). ERROR : AutoReduceDtCoeff (8.5899346e-02) < AUTO_REDUCE_DT_FACTOR_MIN (1.0000000e-01) !! --> AUTO_REDUCE_DT failed, and the program will be terminated ...... at Time: 4.3585089e-03 -> 5.0807114e-03, Step: 6 -> 7, dt_base: 7.2220259e-04.
         [2020/06/10] Crashed. ERROR : AutoReduceDtCoeff (8.5899346e-02) < AUTO_REDUCE_DT_FACTOR_MIN (1.0000000e-01) !! --> AUTO_REDUCE_DT failed, and the program will be terminated ...... at Time: 9.9243560e-02 -> 9.9765945e-02, Step: 160 -> 161, dt_base: 5.2238471e-04.
         [2020/09/20] Passed (after replacing all RAM and CPU).
         [2020/09/23] Passed.
eureka27 [2020/05/07] Failed. Inconsistent Data_000006. Differences in Record__Time: Par_Acc begin on the order of round-off errors.
         [2020/05/09] Failed. Inconsistent Data_000007. Differences in Record__Time: Hydro_Acc/Par_Acc begin on the order of round-off errors.
eureka28 [2020/04/12] Failed. Inconsistent Data_000006. Differences in Record__Time: Hydro_CFL begin on the order of round-off errors.
         [2020/04/12] Failed. Inconsistent Data_000001. Differences in Record__Time: Hydro_Acc/Par_Acc begin on the order of round-off errors.
         [2020/05/07] Passed (after RAM test but without any replacement).
         [2020/05/09] Passed.
         [2020/05/12] Passed.
eureka29 [2020/05/07] Passed.
eureka30 [2020/05/07] Passed.
eureka31 [2020/04/07] Passed.
eureka32 [2020/04/07] Passed.
eureka33 [2020/04/06] Crashed. ERROR : NPar_Lv_Sum (646758) != expect (646750) ! at Time: 3.6350235e-03 -> 4.3585089e-03, Step: 5 -> 6, dt_base: 7.2348531e-04.
         [2020/04/07] Crashed. ERROR : NPar_Lv_Sum (622661) != expect (622653) !! at Time: 7.2169275e-02 -> 7.2763142e-02, Step: 111 -> 112, dt_base: 5.9386715e-04.
         [2020/04/13] No crash, but unclear whether it passed because no reference solution was available.
         [2020/04/13] Crashed. ERROR : NPar_Lv_Sum (636498) != expect (636490) !! at Time: 1.7177931e-02 -> 1.7878936e-02, Step: 24 -> 25, dt_base: 7.0100452e-04.
         [2020/09/24] Passed (after replacing CPU).
         [2020/09/26] Passed.
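
For a quick overview of which nodes are currently healthy, the rows above could be summarized with a small script. The following is only a minimal sketch: the helper names `parse_node_line` and `latest_status` are hypothetical, and it assumes each row starts with the node name followed by dated entries of the form `[YYYY/MM/DD] Passed|Failed|Crashed`, listed in chronological order. Entries with no explicit status word (such as the inconclusive eureka33 run) are skipped.

```python
import re

# Matches one dated entry such as "[2020/04/07] Failed (after ...)." and
# captures the date and the status word that immediately follows it.
# Dates that appear inside a remark (not followed by a status) do not match.
ENTRY_RE = re.compile(r"\[(\d{4}/\d{2}/\d{2})\]\s+(Passed|Failed|Crashed)")


def parse_node_line(row):
    """Return (node, [(date, status), ...]) for the text of one node's row."""
    node, _, rest = row.strip().partition(" ")
    return node, ENTRY_RE.findall(rest)


def latest_status(row):
    """Return (node, date, status) for the last dated entry in the row.

    Date and status are None when the row has no recognizable entry.
    """
    node, entries = parse_node_line(row)
    if not entries:
        return node, None, None
    date, status = entries[-1]
    return node, date, status


if __name__ == "__main__":
    print(latest_status("eureka10 [2020/05/15] Passed."))
    # ('eureka10', '2020/05/15', 'Passed')
```

Because the sketch only reports the last recognized entry, a node whose final run passed after earlier failures is reported as passed; the full history is still available from `parse_node_line`.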

Links