Power Cycle Server - lyulyul/shine-cluster GitHub Wiki

Instead of going to the server room to force a shutdown, you can power-cycle the server remotely. With IPMIView installed on a client machine, you can temporarily switch dagobah off and power it back on without physical access. IPMIView also shows the server's status, temperature, and power readings remotely, with graphical views, which makes it a very useful tool. See the following URL for how to install it:

(Documentation) https://www.boston.co.uk/blog/2022/02/09/supermicro-ipmi-how-to-series-part-one.aspx

The IPMI address is the server's own IP address with 200 added to the last octet. For example, if the server IP is 0.0.0.1, then the IPMI control address is 0.0.0.201.
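The address rule above can be sketched as a small shell function. This is a minimal illustration: the name ipmi_addr is hypothetical, and it assumes that "add 200" applies to the last octet only, as in the example.

```shell
# Derive the IPMI address from a server's IPv4 address by adding 200
# to the last octet. ipmi_addr is a hypothetical helper name.
# The body runs in a subshell, so the IFS change does not leak out.
ipmi_addr() (
    ip=$1
    # Accept only digits and dots
    case "$ip" in
        ''|*[!0-9.]*) echo "invalid IPv4 address: $ip" >&2; return 1 ;;
    esac
    # Split on dots into $1..$4
    IFS=.
    set -- $ip
    if [ "$#" -ne 4 ]; then
        echo "invalid IPv4 address: $ip" >&2; return 1
    fi
    # Each octet must be 0-255 with no leading zeros
    for octet in "$@"; do
        case "$octet" in
            ''|0?*|????*) echo "invalid IPv4 address: $ip" >&2; return 1 ;;
        esac
        [ "$octet" -le 255 ] || { echo "invalid IPv4 address: $ip" >&2; return 1; }
    done
    last=$(( $4 + 200 ))
    # The rule only works while the result still fits in one octet
    if [ "$last" -gt 255 ]; then
        echo "last octet of $ip is above 55; adding 200 would overflow" >&2
        return 1
    fi
    echo "$1.$2.$3.$last"
)
```

Calling ipmi_addr 0.0.0.1 prints 0.0.0.201. A server whose last octet is above 55 cannot follow the rule, so the function reports an error instead of guessing.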

1. Download the control software:

(Optional downloadable app) https://www.supermicro.com/zh_tw/solutions/management-software/ipmi-utilities (supermicro)

sudo apt install ipmitool

2. View the user list

sudo ipmitool user list

3. Set the user password (2 is the user ID from the list above; replace PASSWORD with the new password)

sudo ipmitool user set password 2 PASSWORD

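4. Set the IPMI address in the BIOS

5. Enter the IPMI address in a browser and log in with the username and password

6. Remote Control -> Power Control -> Power Cycle Server, or in two steps: Power Off Server - Immediate (shut down) -> Power On Server (power on)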
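As a possible alternative to the BIOS in step 4, ipmitool can also set the BMC's LAN address from the running OS. The sketch below only prints the commands so they can be reviewed first. The function name ipmi_lan_commands is hypothetical, and LAN channel 1 is an assumption; sudo ipmitool lan print 1 shows whether that channel is the right one on a given board.

```shell
# Print (do not run) the ipmitool commands that give the BMC a static
# address. ipmi_lan_commands is a hypothetical helper; channel 1 is an
# assumed default and can be overridden with a fourth argument.
ipmi_lan_commands() {
    addr=$1 netmask=$2 gateway=$3 channel=${4:-1}
    # Switch the BMC from DHCP to a static address
    printf 'sudo ipmitool lan set %s ipsrc static\n' "$channel"
    printf 'sudo ipmitool lan set %s ipaddr %s\n' "$channel" "$addr"
    printf 'sudo ipmitool lan set %s netmask %s\n' "$channel" "$netmask"
    printf 'sudo ipmitool lan set %s defgw ipaddr %s\n' "$channel" "$gateway"
    # Show the resulting configuration for verification
    printf 'sudo ipmitool lan print %s\n' "$channel"
}
```

For example, ipmi_lan_commands 0.0.0.201 255.255.255.0 0.0.0.254 prints five commands; the netmask and gateway here are placeholders. After checking them, run the lines by hand or pipe the output to sh.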

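What to do when a GPU drops out?

The (sudo) reboot we normally use performs a warm reboot, also called a soft reboot: the motherboard never loses power, so reboot cannot clear bad data in GPU RAM.

To perform a cold reboot (also called a hard reboot or physical reboot), press Power Cycle Server in the IPMI interface. This is a cold reboot that cuts power to the GPU completely, which is why it restores the GPU to a usable state.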
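The same power cycle can likely be triggered from a client's command line with ipmitool instead of the web interface. The wrapper below is a sketch under assumptions: the name ipmi_power is hypothetical, the BMC must have IPMI over LAN enabled for the lanplus interface, and the password is read from the IPMI_PASSWORD environment variable through ipmitool's -E option so it does not appear on the command line.

```shell
# Hypothetical wrapper around "ipmitool chassis power" for a remote BMC.
# Usage: ipmi_power IPMI_ADDRESS USER [status|cycle|off|on]
# Requires IPMI_PASSWORD to be exported in the environment (-E reads it).
ipmi_power() {
    host=$1 user=$2 action=${3:-status}
    # Allow only the actions that match the web interface's power controls
    case "$action" in
        status|cycle|off|on) ;;
        *) echo "unsupported action: $action" >&2; return 1 ;;
    esac
    ipmitool -I lanplus -H "$host" -U "$user" -E chassis power "$action"
}
```

With IPMI_PASSWORD exported, ipmi_power 0.0.0.201 USER status reports the current power state and ipmi_power 0.0.0.201 USER cycle requests the cold reboot; USER and the address are placeholders. Checking status first is a cheap way to confirm the address and credentials before cutting power.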

The IPMI address is shared on a need-to-know basis, following security through obscurity.