watchdog - victronenergy/venus GitHub Wiki
Be aware that there are multiple things that can all be referred to as a 'watchdog':
- the watchdog-counter in the silicon/SoC
- the watchdog-driver, loaded in the kernel
- the watchdog kernel module
- the watchdog-process in userland.
- the watchdog IC in hardware: the boards hosting a nanopi SoM, and the Cerbo GX, which has the Einstein SoM, have an external watchdog. It was added because a bad power supply can wedge the entire SoC, including the watchdog embedded in it.
Configuration
- The watchdog is started early, so that it already watches over the system during boot.
  - For the CCGX, it is enabled in u-boot.
  - For the CANvu, it is enabled in u-boot.
  - For the Beaglebone, it also seems to be enabled early: u-boot prints `Watchdog enabled`.
  - For the Cerbo GX, EasySolar and all other nanopi / Einstein devices, i.e. devices with an Allwinner SoC, there is an external hardware IC, see schematics. It is always on, so unlike an SoC-integrated watchdog it doesn't need enabling from u-boot or other software.
- When the kernel starts up, it normally checks whether the watchdog is already running and, if so, automatically keeps updating it. That is a feature we do not want: we want to start the watchdog with a long timeout (e.g. 60 seconds) in u-boot, and then expect the device to reboot in case the kernel fails to load. As of kernel version 4.13, a config option is available to disable this automatic updating. To use it, any kernel with version 4.13 or later needs
# CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED is not set
in its config. And yes, that's a confusing way to disable a feature, but that's just how it's done. Issue.
- Our kernels are configured with CONFIG_WATCHDOG_NOWAYOUT. This ensures that accidentally stopping, rather than killing, the watchdog-process can never lead to a locked-up system. The difference between stopping and killing is that stopping the userland process makes it also (try to) stop the watchdog. Google CONFIG_WATCHDOG_NOWAYOUT for more info.
- When initiating a reboot, the watchdog-process is stopped first, to make sure that a problem during a subsequent step in the reboot process can never lead to a system hang.
- The watchdog-process is stopped/killed under certain special conditions, to make sure the system resets (dbus daemon, connman, anything else?)
- The watchdog-process is configured to
  - watch for max free memory as well as watch average system load.
  - (in progress) check that connectivity to the VRM servers is up. https://github.com/victronenergy/venus/issues/287
- In case the watchdog-process sees a problem, it will make sure the system reboots and will also attempt to:
  - write the reason for the reboot to a file, for VRM. See below.
  - write the process list to /data/log/watchdog_processlist.txt
- The watchdog-process will append a line to /data/wtmp in case of a repair.
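
The two kernel options mentioned above can be checked against a kernel config with a small script. This is only a sketch: the function names are made up for illustration, and where the config text comes from is left to the caller (for example the build's `.config`, or `/proc/config.gz` on kernels that were built with that feature, which is an assumption and not something this page confirms for Venus kernels).

```python
def parse_kernel_config(text):
    """Parse kernel config text into {option: value}.

    Options written as '# CONFIG_X is not set' are stored with value None.
    """
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def watchdog_config_problems(text):
    """Return a list of human-readable problems (hypothetical helper)."""
    opts = parse_kernel_config(text)
    problems = []
    # The kernel must NOT keep an already running watchdog alive by itself.
    if opts.get("CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED") == "y":
        problems.append("CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED should not be set")
    # Stopping the userland process must not stop the watchdog.
    if opts.get("CONFIG_WATCHDOG_NOWAYOUT") != "y":
        problems.append("CONFIG_WATCHDOG_NOWAYOUT should be y")
    return problems
```

An empty list means both settings match what is described above. Kernels older than 4.13 do not have the first option at all, in which case the sketch does not complain about it.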
Boot type codes as sent to VRM
Software originated, hence available on all machines:
code | description | origin |
---|---|---|
-3 | tmp file already read, vrmlogger restarted!? | vrmlogger |
-2 | Reading tmp file failed | vrmlogger |
-1 | Reading watchdog register failed | get_boot_type.c |
29997 | Max load avg exceeded | watchdog-process |
30012 | watchdog-ENOMEM | watchdog-process |
30253 | watchdog-EMAXLOAD | watchdog-process |
Hardware related:
code | description | ccgx | beaglebone | nanopi | CANvu500 |
---|---|---|---|---|---|
0 | Hw without boottype support or old ccgx kernel | x | x | x | |
1 | Cold boot or reboot | x | | | |
2 | Unreproducible reset on CCGX | x | | | |
3 | Reset button | x | | | |
4 | Cold boot | x | | | |
5 | Reboot command | x | | | |
17 | Watchdog reboot | x | x | | |
- For the nanopi we do use the watchdog in the SoC, but its driver doesn't export a boot type.
- https://github.com/victronenergy/venus/issues/10
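
When reading boot type codes by hand, a lookup table saves some scrolling. The sketch below simply transcribes the two tables above into Python; the function name `describe_boot_type` is made up for this example and is not part of any Venus tool. Which hardware codes can occur depends on the platform, see the table.

```python
# Software originated codes: code -> (description, origin)
SOFTWARE_CODES = {
    -3: ("tmp file already read, vrmlogger restarted!?", "vrmlogger"),
    -2: ("Reading tmp file failed", "vrmlogger"),
    -1: ("Reading watchdog register failed", "get_boot_type.c"),
    29997: ("Max load avg exceeded", "watchdog-process"),
    30012: ("watchdog-ENOMEM", "watchdog-process"),
    30253: ("watchdog-EMAXLOAD", "watchdog-process"),
}

# Hardware related codes: code -> description
HARDWARE_CODES = {
    0: "Hw without boottype support or old ccgx kernel",
    1: "Cold boot or reboot",
    2: "Unreproducible reset on CCGX",
    3: "Reset button",
    4: "Cold boot",
    5: "Reboot command",
    17: "Watchdog reboot",
}


def describe_boot_type(code):
    """Return a readable description for a boot type code."""
    if code in SOFTWARE_CODES:
        description, origin = SOFTWARE_CODES[code]
        return f"{description} (software, from {origin})"
    if code in HARDWARE_CODES:
        return f"{HARDWARE_CODES[code]} (hardware related)"
    return f"unknown boot type code {code}"
```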
Files available for diagnostics
When logged in to a system that has rebooted, you might well see this:
root@ccgx:~# cat /tmp/last_boot_type
-3
root@ccgx:~# cat /tmp/last_boot_type.orig
30253
The process list written by the watchdog-process is in /data/log/watchdog_processlist.txt.
And then there is also /data/wtmp, see man page.
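
The session shown above suggests that, once /tmp/last_boot_type contains -3, the code of interest is in /tmp/last_boot_type.orig. A minimal sketch of that lookup follows. It assumes both files hold a single integer and that the `.orig` file keeps the original value, which is inferred from the example output rather than documented; the function names are made up.

```python
from pathlib import Path

ALREADY_READ = -3  # "tmp file already read", see the table of software codes


def read_code(path):
    """Return the integer stored in the file, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def last_boot_code(tmp_dir="/tmp"):
    """Return the most informative boot type code found in tmp_dir."""
    base = Path(tmp_dir)
    current = read_code(base / "last_boot_type")
    if current == ALREADY_READ:
        # Assumption: the original value was kept in the .orig file.
        original = read_code(base / "last_boot_type.orig")
        if original is not None:
            return original
    return current
```

Passing the directory as a parameter keeps the helper usable on a copy of the files taken from a device.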
Details on how the reporting to VRM works
In the VRM Portal we want to store the reason for a (re)boot, for diagnostics purposes.
If the watchdog-process decided to reboot, we are interested in its reason, and otherwise we are interested in the data in the microprocessor register.
The code we want to go to VRM is selected here, and written to /tmp/last_boot_type.
Vrmlogger reads the code, sends it, and then writes -3 at the top of the file, meaning 'tmp file already read'. Seeing that on VRM usually means that vrmlogger has restarted, either on purpose or because it crashed. In case it can't read the file, it will report -2 to VRM.
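
The read-then-mark behaviour can be sketched as follows. This is not the vrmlogger source: the function name is made up, and putting -3 on the first line while keeping the previous content below it is just one possible reading of "the top of the file". The real implementation may store the original value elsewhere.

```python
from pathlib import Path

ALREADY_READ = -3  # tmp file already read
READ_FAILED = -2   # reading tmp file failed


def take_boot_type(path="/tmp/last_boot_type"):
    """Return the code to report and mark the file as read."""
    p = Path(path)
    try:
        content = p.read_text()
        code = int(content.split()[0])
    except (OSError, ValueError, IndexError):
        return READ_FAILED
    # Mark the file as read, so that a restarted logger reports -3
    # instead of sending the same boot reason a second time.
    p.write_text(f"{ALREADY_READ}\n{content}")
    return code
```

Calling the function twice on the same file returns the real code the first time and -3 the second time, which matches what shows up on VRM after a vrmlogger restart.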
More details
From watch_err.h in the software watchdog project (note: 253 -> 30253):
#define EREBOOT 255 /* unconditional reboot (255 = -1 as unsigned 8-bit) */
#define ERESET 254 /* unconditional hard reset */
#define EMAXLOAD 253 /* load average too high */
#define ETOOHOT 252 /* too hot inside */
#define ENOLOAD 251 /* /proc/loadavg contains no data */
#define ENOCHANGE 250 /* file wasn't changed in the given interval */
#define EINVMEM 249 /* /proc/meminfo contains invalid data */
#define ECHKILL 248 /* child was killed by signal */
#define ETOOLONG 247 /* child didn't return in time */
#define EUSERVALUE 246 /* reserved for user error code */
#define EDONTKNOW 245 /* unknown, not "no error" (i.e. success) but implies test still running */
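
The codes in the table of software originated boot types look like an offset of 30000 added to the watchdog error code. The offset is inferred from the table, not taken from the source code, so treat this sketch as an assumption. ENOMEM (12) is the standard Linux errno, not a watch_err.h value. Reading 253 as a signed 8-bit number gives -3, which would explain the 29997 entry, but that too is a guess.

```python
VRM_OFFSET = 30000  # assumption: inferred from the table, not from the code

ENOMEM = 12     # standard Linux errno
EMAXLOAD = 253  # from watch_err.h


def vrm_boot_code(error_code, as_signed_8bit=False):
    """Map a watchdog error code to the value that shows up on VRM."""
    if as_signed_8bit and error_code > 127:
        # Interpret the value as a signed 8-bit number: 253 becomes -3.
        error_code -= 256
    return VRM_OFFSET + error_code


print(vrm_boot_code(EMAXLOAD))                       # 30253
print(vrm_boot_code(ENOMEM))                         # 30012
print(vrm_boot_code(EMAXLOAD, as_signed_8bit=True))  # 29997
```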
Getting a list from the database:
select valueEnum, nameEnum, count(*) as nr_of_sites, min(l.secondsAgo) / 60 as min_minutes_ago, max(l.secondsAgo) / 60 as max_minutes_ago from vwLastLogData l where idDataAttribute = 237 group by valueEnum;