eureka00 |
[2020/04/03] IB card is missing. lspci -v | grep Mellanox shows nothing. [2022/03/07] The df command or mounted directories hang. kernel: INFO: task scp:48089 blocked for more than 120 seconds. kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: scp D ffff88937b43d230 0 48089 48088 0x00000080 Solution: Reboot the problematic node. |
* Failure count: 0 * RAM frequency is 2667 MHz instead of 2933 MHz |
eureka01 |
[2020/04/20] Reboot due to GPU hang. |
* Failure count: 0 |
eureka02 |
[2020/04/21] Replaced RAM that failed the test with healthy RAM from eureka33. [2021/04/13] Rebooted because the system hung while using the GPU. [2022/09/26] Fan broken. |
* Failure count: 0 * Burn-in test failed |
eureka03 |
[2021/04/07] Down unexpectedly. |
* Failure count: 1 |
eureka04 |
[2020/05/13] Down unexpectedly. [2020/05/14] Down unexpectedly. [2020/05/15] Kernel: perf: interrupt took too long (2502>2500), lowering kernel.perf_event_max_sample_rate to 79000 [2020/05/26] Down unexpectedly. [2020/06/19] Down unexpectedly. [2020/06/22] Down unexpectedly. [2020/06/28] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x1e8 kernel: MG_CPU[2818]: segfault at 0 ip 0000000000405e48 sp 00007fff55c688c0 error 4 in MG_CPU[400000+7000] [2020/06/30] Down unexpectedly. [2020/08/17] pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/TORQUE/spool/7172.eureka00.gpucluster.calab.OU [email protected]:/work1/jared/gamer-fork/bin/CDM/gamer.o7172' failed with status=1, giving up after 4 attempts <br> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/spool/TORQUE/spool/7172.eureka00.gpucluster.calab.OU to [email protected]:/work1/jared/gamer-fork/bin/CDM/gamer.o7172 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/TORQUE/spool/7172.eureka00.gpucluster.calab.ER [email protected]:/work1/jared/gamer-fork/bin/CDM/gamer.e7172' failed with status=1, giving up after 4 attempts pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/spool/TORQUE/spool/7172.eureka00.gpucluster.calab.ER to [email protected]:/work1/jared/gamer-fork/bin/CDM/gamer.e7172 pbs_mom: LOG_ERROR::req_cpyfile, #012#012Unable to copy file /var/spool/TORQUE/spool/7172.eureka00.gpucluster.calab.OU to [email protected]:/work1/jared/gamer-fork/bin/CDM/gamer.o7172#012*** error from copy#012Host key verification failed.#015#012lost connection#012*** end error output#012Output retained on that host in: /var/spool/TORQUE/undelivered/7172.eureka00.gpucluster.calab.OU#012#012Unable to copy file /var/spool/TORQUE/spool/7172.eureka00.gpucluster.calab.ER to [email protected]:/work1/jared/gamer-fork/bin/CDM/gamer.e7172#012*** error from copy#012Host key verification failed.#015#012lost connection#012*** end error 
output#012Output retained on that host in: /var/spool/TORQUE/undelivered/7172.eureka00.gpucluster.calab.ER [2020/08/18] Down unexpectedly. [2020/08/20] kernel:NMI watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [rngd:5449] [2020/08/21] Down unexpectedly due to system panic. [2020/08/31] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x1cd6 [2020/09/01] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x2494 [2020/09/01] Down unexpectedly [2020/09/01] Down unexpectedly [2020/09/18] Down unexpectedly [2020/09/22] Down unexpectedly. [2020/09/28] Down unexpectedly. [2020/09/29] Down unexpectedly. [2020/09/29] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x178 kernel: perf: interrupt took too long (3168 > 3155), lowering kernel.perf_event_max_sample_rate to 63000 kernel: MUSIC[19162]: segfault at 7f6a5e9ff810 ip 0000000000579e1e sp 00007f7caeffca20 error 4 [2020/09/30] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x350 kernel: enzo.exe[57668]: segfault at 81f7808 ip 00007fd02a41333e sp 00007fd02a61ba60 error 4 in ld-2.17.so[7fd02a3ff000+22000] [2020/09/30] Down unexpectedly. [2020/10/06] Down unexpectedly. [2020/11/17] Down unexpectedly. [2020/11/21] Down unexpectedly. [2020/11/23] Down unexpectedly. [2021/01/15] Down unexpectedly. **[2021/01/20] Down unexpectedly. [2021/02/03] Down unexpectedly. [2021/04/20] Down unexpectedly. |
* Failure count: 23 |
eureka05 |
[2020/04/06] Memory test passes (120Hr). [2020/04/06] Marked online with the testq queue. [2020/07/06] Replace motherboard with new one. [2020/07/09] GPU replaced by EVGA RTX 2080 [2020/07/13] Replace all RAMs with new ones. [2020/07/17] Switch GPU back to 2080 super. [2020/07/22] Switch from maintenanceq to workq . [2020/07/23] kernel: enzo.exe[17758]: segfault at 73f2cf8 ip 00007f942a8d756b sp 00007f942cdeb7b0 error 6 in libc-2.17.so[7f942a855000+f5000] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x5ba [2020/07/23] Down unexpectedly [2020/07/25] Down unexpectedly [2020/08/02] Down unexpectedly [2020/08/03] Reinstall hard wares and OS, started burn in test. [2020/08/06] Burn in test passed, marked as workq [2020/08/12] Down unexpectedly. [2020/08/20] Change Ethernet cable and switch port. [2020/09/08] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x249e [2020/09/08] Down unexpectedly. [2020/10/15] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x3f38 [2020/10/16] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x3fbe [2020/10/16] pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/TORQUE/spool/11043.eureka00.gpucluster.calab.OU [email protected]:/projectY/clifflin/gamer-fork/bin/LSSHalo_light_Test/plot_script/gamer.o11043' failed with status=1, giving up after 4 attempts pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/spool/TORQUE/spool/11043.eureka00.gpucluster.calab.OU to [email protected]:/projectY/clifflin/gamer-fork/bin/LSSHalo_light_Test/plot_script/gamer.o11043 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/TORQUE/spool/11043.eureka00.gpucluster.calab.ER [email protected]:/projectY/clifflin/gamer-fork/bin/LSSHalo_light_Test/plot_script/gamer.e11043' failed with status=1, giving up after 4 attempts pbs_mom: 
LOG_ERROR::req_cpyfile, Unable to copy file /var/spool/TORQUE/spool/11043.eureka00.gpucluster.calab.ER to [email protected]:/projectY/clifflin/gamer-fork/bin/LSSHalo_light_Test/plot_script/gamer.e11043 pbs_mom: LOG_ERROR::req_cpyfile, #012#012Unable to copy file /var/spool/TORQUE/spool/11043.eureka00.gpucluster.calab.OU to [email protected]:/projectY/clifflin/gamer-fork/bin/LSSHalo_light_Test/plot_script/gamer.o11043#012*** error from copy#012Host key verification failed.#015#012lost connection#012*** end error output#012Output retained on that host in: /var/spool/TORQUE/undelivered/11043.eureka00.gpucluster.calab.OU#012#012Unable to copy file /var/spool/TORQUE/spool/11043.eureka00.gpucluster.calab.ER to [email protected]:/projectY/clifflin/gamer-fork/bin/LSSHalo_light_Test/plot_script/gamer.e11043#012*** error from copy#012Host key verification failed.#015#012lost connection#012*** end error output#012Output retained on that host in: /var/spool/TORQUE/undelivered/11043.eureka00.gpucluster.calab.ER **[2020/10/16] Down unexpectedly. ** **[2020/10/26] Down unexpectedly. ** **[2020/10/28] Down unexpectedly. ** **[2020/10/30] Down unexpectedly. ** **[2020/11/11] Down unexpectedly. ** **[2020/12/10] Down unexpectedly. ** **[2020/12/15] Down unexpectedly. ** **[2021/01/15] Down unexpectedly. ** **[2021/02/04] Down unexpectedly. ** **[2021/03/05] Down unexpectedly. ** [2021/06/15] Pass the MEMtest. |
* Failure count: 16 * Burn-in test failed |
eureka06 |
**[2020/04/11] Down unexpectedly.** [2020/04/13] Checked that the PSU line connection was good and rebooted. [2020/04/13] Corrected BOOTPROTO in /etc/sysconfig/network-scripts/ifcfg-enp4s0 from dhcp to static. (Not sure if this was related to the system crash on 04/11, but note that only nodes 06 and 09 had this incorrect network setup and both nodes crashed on 04/11.) [2020/04/26] Down unexpectedly. [2020/05/04] Down unexpectedly. [2020/05/17] Down unexpectedly. [2020/06/05] Cleaned and reinstalled all components. Rebooted with only 16G memory. [2020/07/14] Rebooted with new RAM, 128G memory in total. [2020/07/17] Marked as workq. [2020/08/17] Down unexpectedly. [2020/09/03] Down unexpectedly. **[2020/09/22] Down unexpectedly.** [2020/09/27] Down unexpectedly. [2020/11/13] Down unexpectedly. [2020/12/10] Down unexpectedly. [2021/02/03] Down unexpectedly. [2021/02/07] Down unexpectedly. [2021/02/10] Down unexpectedly. [2022/10/27] GPU malfunction. Oct 26 22:45:07 eureka06 kernel: NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x24:0xffff:1200) Oct 26 22:45:07 eureka06 kernel: NVRM: GPU 0000:41:00.0: rm_init_adapter failed, device minor number 0 Recovered to normal after reboot. [2022/11/21] GPU malfunction. nvidia-smi reports Unable to determine the device handle for GPU 0000:41:00.0: Recovered to normal after reboot. |
* Failure count: 13 |
eureka07 |
[2020/09/03] Down unexpectedly. [2020/09/27] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x1a03 [2020/09/28] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x1b22 [2020/10/01] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x202e [2020/10/02] Down unexpectedly. [2020/10/12] Down unexpectedly. [2020/11/12] Down unexpectedly. [2021/03/05] Down unexpectedly. |
* Failure count: 5 * Large CPU temperature oscillation * Burn-in test failed |
eureka08 |
[2020/05/11] Down unexpectedly. [2020/07/08] Reboot due to slow CPU performance (back to normal after that). [2020/10/11] Reboot due to zombie process. |
* Failure count: 1 |
eureka09 |
[2020/04/11] Down unexpectedly. [2020/04/13] Checked that the PSU line connection was good and rebooted. [2020/04/13] Corrected BOOTPROTO in /etc/sysconfig/network-scripts/ifcfg-enp4s0 from dhcp to static. (Not sure if this was related to the system crash on 04/11, but note that only nodes 06 and 09 had this incorrect network setup and both nodes crashed on 04/11.) [2020/08/14] Down unexpectedly while running gamer. [2020/08/31] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x18d3 [2020/09/01] Down unexpectedly. [2020/10/06] Down unexpectedly. [2020/10/08] Down unexpectedly. [2020/10/11] Zombie process occurred. |
* Failure count: 5 |
eureka10 |
[2020/04/21] Error messages: BUG: Bad page map in process gamer pte:8000001e2820a867 pmd:1d7e0c1067 and NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [gamer:70948]. [2020/04/21] Rebooted by hand. [2020/05/13] Replaced RAM that failed the test with healthy RAM from eureka21. [2020/09/07] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x1d39 [2020/09/07] Down unexpectedly. [2020/10/06] Down unexpectedly. [2024/08/05] Memtest passed. |
* Failure count: 2 * Large CPU temperature oscillation |
eureka11 |
[2020/04/21] Replaced RAM which fail test by health RAM on eureka33 . [2020/04/24] Down unexpectedly. [2020/04/25] Down unexpectedly. [2020/04/27] Down unexpectedly. [2020/05/16] Down unexpectedly. [2020/05/16] Kernel: perf: interrupt took too long (3133>3131), lowering kernel.perf_event_max_sample_rate to 63000 [2020/05/19] Down unexpectedly. [2020/05/19] Kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0):Async event for bogus resource 0x4d7 [2020/06/08] Down unexpectedly, tune memory speed to 2666 MHz cause it was using different memory. [2020/06/08] Clean and reinstall all components. Reboot with only 16G memory. [2020/07/14] Reboot with new RAMs in total 128 G memory. [2020/07/17] Marked as workq [2020/07/20] Down unexpectedly. [2020/07/20] kernel: gamer (6552): Using mlock ulimits for SHM_HUGETLB is deprecated kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x16e pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/TORQUE/spool/5636.eureka00.gpucluster.calab.OU [email protected]:/work1/jared/gamer-fork/bin/plummer/gamer.o5636' failed with status=1, giving up after 4 attempts <br> pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file /var/spool/TORQUE/spool/5636.eureka00.gpucluster.calab.OU to [email protected]:/work1/jared/gamer-fork/bin/plummer/gamer.o5636 [2020/07/22] Down unexpectedly. [2020/07/24] Down unexpectedly. [2020/07/24] Down unexpectedly. [2020/07/27] Replaced motherboard. [2020/07/28] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x1ab [2020/07/30] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x31b [2020/07/30] Down unexpectedly while running gamer. [2020/07/30] Down unexpectedly while running python yt_scrip [2020/07/30] Down unexpectedly while running python plot [2020/08/02] Down unexpectedly. [2020/08/05] Replaced RAM and reinstall hard wares and OS, start burn in test. 
[2020/08/10] Burn-in test passed, marked as workq. [2020/08/11] Down unexpectedly twice while running gamer. [2020/08/12] Replaced IB card with the one from eureka26. [2020/08/12] Down unexpectedly while running gamer. [2020/08/14] Down unexpectedly while running nothing. [2020/08/17] Rebooted after replacing the GPU with an EVGA RTX 2080. [2020/08/20] Changed Ethernet cable and switch port. [2020/08/24] Down unexpectedly. [2020/08/24] Installed a new SSD and reinstalled CentOS7. [2020/08/24] Down unexpectedly. [2020/09/21] Replaced CPU. Burn-in test passed, put back in workq. [2020/09/27] Down unexpectedly. [2020/11/29] Down unexpectedly. [2020/12/18] Down unexpectedly. [2021/04/13] Rebooted because the system hung while using the GPU. [2024/08/05] Memtest failed: 248 errors. -> Retest |
* Failure count: 26 |
eureka12 |
|
* Failure count: 0 * Froze frequently before swapping CPUs with eureka33 |
eureka13 |
[2020/04/13] Down unexpectedly. [2020/04/13] Could not reboot; the boot buttons on the motherboard and rack did not respond. Reinserting the RAM didn't help. Resetting the BIOS didn't help. [2020/06/05] Replaced motherboard, cleaned and reinstalled all components. Rebooted with only 16G memory. [2020/07/14] Rebooted with new RAM, 128G memory in total. [2020/07/17] Marked as workq. [2020/07/24] Down unexpectedly. [2021/01/08] Down unexpectedly. [2021/03/17] Down unexpectedly. |
* Failure count: 4 |
eureka14 |
[2020/05/13] Replaced RAM that failed the test with healthy RAM from eureka21. [2020/11/19] kernel: NVRM: GPU at PCI:0000:41:00: GPU-1006f770-a585-decd-4d5f-f2f23b708ff6 <br> kernel: NVRM: GPU Board Serial Number: <br> kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=6280, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x2_06600000. Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_WRITE <br> kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=90173, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fd5_bad4b000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ <br> kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=90727, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fd5_bad4b000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ <br> kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=90880, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7fd5_bad4b000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ [2020/11/19] Down unexpectedly. [2020/11/19] Down unexpectedly. [2021/03/09] Down unexpectedly. [2021/03/10] Down unexpectedly. [2021/04/07] Down unexpectedly. [2021/04/15] Down unexpectedly. |
* Failure count: 7 |
eureka15 |
[2020/04/07] Down unexpectedly. Marked offline. [2020/04/08] The PSU line connections were good. Rebooted after reinserting the memory. [2020/04/20] Rebooted due to GPU hang. [2020/05/13] Replaced RAM that failed the test with healthy RAM from eureka21. [2020/10/29] Down unexpectedly. [2020/11/06] Down unexpectedly. [2020/11/24] Down unexpectedly. [2021/03/09] Down unexpectedly. |
* Failure count: 7 |
eureka16 |
[2020/04/01] Down unexpectedly. [2020/04/06] The PSU line connections were good. Rebooted after reinserting the memory. [2020/05/15] Down unexpectedly. [2020/05/17] Kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0):Async event for bogus resource 0x328 [2020/05/19] Down unexpectedly. [2020/05/20] Down unexpectedly. [2020/06/10] Down unexpectedly. [2020/06/12] Cleaned and reinstalled all components. [2020/06/25] Down unexpectedly. [2020/10/29] Down unexpectedly. [2020/12/05] Down unexpectedly. [2020/12/25] Down unexpectedly. [2021/01/08] Down unexpectedly. [2021/01/08] Down unexpectedly. [2021/02/05] Down unexpectedly. [2021/03/08] Down unexpectedly. [2021/03/10] Down unexpectedly. [2021/06/21] Passed the MEMtest. |
* Failure count: 14 |
eureka17 |
[2020/04/17] Down unexpectedly. [2020/04/17] The PSU line connections were good. Booted directly. [2020/06/02] Kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0):Async event for bogus resource 0x16b [2020/06/05] Down unexpectedly. [2020/07/15] kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0): Async event for bogus resource 0x19ae. [2020/07/17] kernel: python invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0 [2020/07/17] Down unexpectedly. [2020/11/07] Down unexpectedly. [2020/12/31] Down unexpectedly. [2021/01/15] Down unexpectedly. [2021/01/15] Down unexpectedly. [2021/03/10] Down unexpectedly. [2021/06/17] Passed the MEMtest. |
* Failure count: 8 |
eureka18 |
[2020/05/13] Replaced RAM that failed the test with healthy RAM from eureka21. |
* Failure count: 0 |
eureka19 |
[2021/05/20] Replace MB [2022/11/01] Replace thermal paste |
* Failure count: 0 |
eureka20 |
[2022/10/31] Replace thermal paste |
* Failure count: 0 |
eureka21 |
|
* Failure count: 0 |
eureka22 |
|
* Failure count: 0 |
eureka23 |
[2020/05/04] Replaced RAM modules that failed memtest with healthy ones from eureka33 (P000258). [2020/11/06] Down unexpectedly. |
* Failure count: 2 |
eureka24 |
[2020/03/??] IB port_rcv_errors overflow. ibdiagnet shows port_rcv_errors : 65535 (overflow) . [2020/06/13] Reboot to enable NUMA. [2021/04/22] Down unexpectedly. [2021/06/21] Pass the MEMtest. |
* Failure count: 2 * Burn-in test failed |
eureka25 |
[2020/04/02] Down unexpectedly. [2020/04/06] The connection of PSU lines are good. Reboot with reinsert memories. [2020/04/07] Down unexpectedly. Marked offline [2020/04/19] Down unexpectedly. [2020/04/20] Reboot with check PSU connect(good connection) and reinsert RAM. [2020/04/28] Down unexpectedly during memory test. [2020/04/28] Down unexpectedly during memory test. [2020/04/29] Down unexpectedly during memory test. [2020/05/04] Replace RAMs fail memtest with health ones from eureka33 (P000264, P000268) [2020/05/06] Down unexpectedly [2020/06/02] Replaced motherboard, clean and reinstall all components. Reboot with only 16G memory. [2020/07/14] Reboot with new RAMs in total 128 G memory. [2020/07/17] Marked as workq [2020/08/12] Down unexpectedly while nothing running. [2020/09/25] pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/TORQUE/spool/9575.eureka00.gpucluster.calab.OU [email protected]:/projectY/clifflin/fit_data/gamer.o9575' failed with status=1, giving up after 4 attempts pbs_mom: LOG_ERROR::req_cpyfile, #012#012Unable to copy file /var/spool/TORQUE/spool/9575.eureka00.gpucluster.calab.OU to [email protected]:/projectY/clifflin/fit_data/gamer.o9575#012*** error from copy#012Host key verification failed.#015#012lost connection#012*** end error output#012Output retained on that host in: /var/spool/TORQUE/undelivered/9575.eureka00.gpucluster.calab.OU#012#012Unable to copy file /var/spool/TORQUE/spool/9575.eureka00.gpucluster.calab.ER to [email protected]:/projectY/clifflin/fit_data/gamer.e9575#012*** error from copy#012Host key verification failed.#015#012lost connection#012*** end error output#012Output retained on that host in: /var/spool/TORQUE/undelivered/9575.eureka00.gpucluster.calab.ER [2020/09/25] Down unexpectedly [2020/11/06] Down unexpectedly. [2020/11/24] Down unexpectedly. [2021/03/13] Down unexpectedly. [2021/04/21] Down unexpectedly. [2021/06/21] Pass the MEMtest. |
* Failure count: 14 |
eureka26 |
[2020/05/04] Replaced RAM modules that failed memtest with healthy ones from eureka33 (P000262, P000263). [2020/06/02] kernel: gamer (62304): Using mlock ulimits for SHM_HUGETLB is deprecated kernel: pcieport 0000:00:03.1: AER: Corrected error received: id=0000 kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID) kernel: pcieport 0000:00:03.1: device [1022:1453] error status/mask=00001000/00006000 kernel: pcieport 0000:00:03.1: [12] Replay Timer Timeout [2020/06/05] Down unexpectedly. [2020/06/06] Crashed from overheating because the pump power was not connected. [2020/09/18] Replaced CPU, motherboard, and RAM, and reinstalled the OS. [2020/11/06] Down unexpectedly. [2020/12/10] Down unexpectedly. [2020/12/22] Down unexpectedly. [2021/01/10] Down unexpectedly. |
* Failure count: 7 |
eureka27 |
[2021/03/20] Down unexpectedly. |
* Failure count: 1 |
eureka28 |
[2021/03/20] Down unexpectedly. |
* Failure count: 1 |
eureka29 |
[2020/05/13] Down unexpectedly. [2020/05/13] Kernel: mlx5_core 0000:09:00.0: rsc_event_notifier:190:(pid 0):Async event for bogus resource 0x8d5 [2021/01/15] Down unexpectedly. [2021/01/16] Down unexpectedly. [2021/02/19] Down unexpectedly. [2021/03/12] Dropped on the floor during cleaning. [2021/03/29] ssh error: ssh exited with exit code 15. Rebooted to solve it. [2021/03/31] ssh error happened again, marked as maintenanceq. [2021/04/22] After a visual inspection, a 48-hour Memtest86+ test, a GPU burn-in test using https://github.com/Microway/gpu-burn.git, and a gamer run test, put it back into the queue system and marked it as workq again. [2022/10/27] Replaced thermal paste, but still hot. [2022/11/17] Replaced cooler with the one from eureka02. |
* Failure count: 6 |
eureka30 |
[2021/05/31] Replace MB |
* Failure count: 0 |
eureka31 |
[2020/09/02] Down unexpectedly. kernel: NVRM: GPU at PCI:0000:41:00: GPU-0c76122e-2182-1889-cf06-cc971d55ae15 Nov 19 09:40:18 eureka31 kernel: NVRM: GPU Board Serial Number: Nov 19 09:40:18 eureka31 kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=6289, Ch 00000020, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f56_e8e00000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ Nov 19 09:40:21 eureka31 abrt-server: Executable '/projectY/tseng/gamer/bin/srhydro/selfgravity/gamer' doesn't belong to any package and ProcessUnpackaged is set to 'no' Nov 19 09:40:21 eureka31 abrt-server: 'post-create' on '[2020/11/19] /var/spool/abrt/ccpp-2020-11-19-09:40:18-43563' exited with 1<br> abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2020-11-19-09:40:18-43563' <br> kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=43562, Ch 00000021, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_1 faulted @ 0x7f57_1f000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ <br> kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=6292, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_7 faulted @ 0x2_06600000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE <br> kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=43826, Ch 00000011, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7f8a_4ad4b000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ <br> kernel: NVRM: Xid (PCI:0000:41:00): 31, pid=44125, Ch 00000010, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x7f8a_4ad4b000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ [2020/11/19] Down unexpectedly. [2020/12/23] Down unexpectedly. [2020/12/23] Down unexpectedly. [2021/01/24] Down unexpectedly. [2021/06/17] Pass the MEMtest. [2022/10/28] Replace thermal paste. But still hot |
* Failure count: 5 |
eureka32 |
[2020/04/24] Switched from workq to titanxq. [2020/05/01] GPU failed. Error messages: NVRM: GPU 0000:41:00.0: GPU has fallen off the bus. ... NVRM: A GPU crash dump has been created. If possible, please run#012NVRM: nvidia-bug-report.sh as root to collect this data before#012NVRM: the NVIDIA kernel module is unloaded. [2020/05/04] Rebooted due to the GPU failure event on 05/01. [2020/05/25] nvidia-smi shows Unable to determine the device handle for GPU 0000:41:00.0: GPU is lost. Reboot the system to recover this GPU. [2020/06/10] Disabled MPS for tensorflow since stream callbacks are not supported on pre-Volta MPS clients. [2021/06/11] Passed RAMtest86 test (4 rounds). |
* Failure count: 0 |
eureka33 |
[2020/03/26] One Titan X GPU is missing. /software/cuda/10.2/NVIDIA_CUDA-10.2_Samples/1_Utilities/deviceQuery/deviceQuery shows Detected 1 CUDA Capable device(s) but nvidia-smi -q reports Attached GPUs : 2. [2020/03/27] Marked offline. [2020/04/06] The missing Titan X GPU came back after setting export CUDA_VISIBLE_DEVICES=0,1 in /etc/rc.local and rebooting. [2020/04/06] Marked online with the testq queue. Began burn-in test. [2020/04/06] Down unexpectedly immediately after initiating the burn-in test. Switched the power input to an isolated 15A plug and rebooted. [2020/04/23] Down unexpectedly. [2020/04/24] Down unexpectedly. [2020/06/03] Cleaned and reinstalled all components. Rebooted with only 16G memory. [2020/06/10] Down unexpectedly. [2020/06/10] Disabled MPS for tensorflow since stream callbacks are not supported on pre-Volta MPS clients. [2020/06/10] Down unexpectedly. [2020/06/11] Replaced motherboard. [2020/09/16] Replaced RAM and CPU, and reinstalled the OS. [2021/01/03] Down unexpectedly. [2021/01/07] Down unexpectedly. [2021/01/11] kernel: pcieport 0000:40:01.1: AER: Corrected error received: id=0000 kernel: pcieport 0000:40:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=4009(Receiver ID) kernel: pcieport 0000:40:01.1: device [1022:1453] error status/mask=00000040/00006000 kernel: pcieport 0000:40:01.1: [ 6] Bad TLP [2021/01/11] Down unexpectedly. [2021/01/11] Down unexpectedly. [2021/01/12] Down unexpectedly. [2021/02/20] Down unexpectedly. |
* Failure count: 11 * One memory stick is missing * One memory slot malfunctioned before switching CPU with eureka12 * Burn-in test failed |
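
The failure counts above are kept by hand, and this page does not say exactly which events are counted. As a rough cross-check, the following Python sketch splits one log cell into dated entries and tallies those that begin with "Down unexpectedly". The function names, the `[YYYY/MM/DD]` entry pattern, and the counting criterion are all assumptions made for illustration; entries dated in other forms (for example `[2020/03/??]`) are not matched, and a date that appears inside a quoted log message would be misread as a new entry.

```python
import re

# Assumed entry marker: a bracketed date such as [2020/04/11].
DATE_RE = re.compile(r"\[(\d{4}/\d{2}/\d{2})\]")


def split_entries(cell):
    """Split one log cell into (date, text) pairs, in order of appearance."""
    matches = list(DATE_RE.finditer(cell))
    entries = []
    for i, m in enumerate(matches):
        # An entry runs until the next date marker or the end of the cell.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(cell)
        # Drop surrounding spaces, bold markers and the trailing table pipe.
        text = cell[m.end():end].strip(" *|")
        entries.append((m.group(1), text))
    return entries


def count_unexpected_downs(cell):
    """Count entries starting with 'Down unexpectedly' (assumed criterion)."""
    return sum(
        1
        for _, text in split_entries(cell)
        if text.lower().startswith("down unexpectedly")
    )


# The eureka08 cell from the table above.
cell = (
    "[2020/05/11] Down unexpectedly. [2020/07/08] Reboot due to slow CPU "
    "performance (back to normal after that). [2020/10/11] Reboot due to "
    "zombie process. |"
)
print(count_unexpected_downs(cell))  # prints 1
```

For eureka08 the tally agrees with the recorded failure count of 1. Other rows may differ, either because the recorded count uses a different criterion or because the log and the count were updated at different times.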