Hub crashes & logs - mtholyoke/JupyterHub-on-AWS GitHub Wiki
TLJH has documentation on application logs here.
First, gather the date and time (in UTC - here's a random Eastern time <-> UTC time converter!) of when the crash occurred. Two good datestamps to have are the time the error was noticed and the time the system recovered. When looking in the logs, you may need to start a little earlier than when the error was detected (half an hour or so) in order to see and understand what happened.
Hint: you can see what the server thinks is the current time with uptime.
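As a rough sketch of the time bookkeeping above, GNU date can do the Eastern-to-UTC conversion and the half-hour padding right on the hub. The helper names eastern_to_utc and minutes_before are made up for this example, and the sketch assumes GNU date is installed (plus the tz database for the Eastern conversion).

```shell
# Hypothetical helpers; both assume GNU date.

# Convert an Eastern wall-clock time (e.g. "2022-10-04 14:39") to UTC.
# Needs the tz database (tzdata) so that America/New_York is known.
eastern_to_utc() {
  date -u -d "TZ=\"America/New_York\" $1" '+%Y-%m-%d %H:%M'
}

# Step a UTC timestamp back by N minutes, to pad the start of the search window.
minutes_before() {
  mb_epoch=$(date -u -d "$1 UTC" +%s)
  date -u -d "@$((mb_epoch - $2 * 60))" '+%Y-%m-%d %H:%M'
}

# Example: an error noticed at 18:39 UTC, padded by half an hour.
minutes_before "2022-10-04 18:39" 30
```

The last line prints 2022-10-04 18:09, which is a reasonable place to start reading the logs. Running date -u on its own shows the server's current time in UTC.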
Then, log in:
- Go to https://dsjupyterhub.mtholyoke.edu (or whichever hub you want to investigate) and log in as yourself.
- From the "New v" button, select "Terminal"
- You're in!
Note: You will have to be configured as an admin in order to run sudo.
These systems are configured to centralize their logs, and those logs can be viewed using a command called journalctl.
You can get pretty much every log ever like this:
sudo journalctl
You can limit which logs you care about by specifying a "unit" (basically the application of interest) using -u unit. Two of the main options here are jupyterhub and traefik. Traefik is a load-balancing/networking application commonly used when deploying cloud-native applications like JupyterHub. For example:
sudo journalctl -u jupyterhub
For any journalctl command, you can optionally use --since "yyyy-MM-dd HH:mm" --until "yyyy-MM-dd HH:mm" to get only the logs between those datestamps. For example:
sudo journalctl -u jupyterhub --since "2022-10-04 00:00" --until "2022-10-04 17:00"
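Since a crash often shows up in both units, it can help to look at the same window in each. The following is only a sketch: window_cmds is a made-up helper that prints (rather than runs) one journalctl command per unit, so the commands can be checked and then pasted into the terminal.

```shell
# Hypothetical helper: print one journalctl command per unit for the same
# time window. Nothing is executed; the output is meant to be reviewed first.
window_cmds() {
  for wc_unit in jupyterhub traefik; do
    printf 'sudo journalctl -u %s --since "%s" --until "%s"\n' "$wc_unit" "$1" "$2"
  done
}

# Example: the window used above.
window_cmds "2022-10-04 00:00" "2022-10-04 17:00"
```

Printing instead of running keeps sudo out of the helper and makes it easy to adjust the window before committing to a long query.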
You can also send the log to a file to look through later instead of standard out by adding > file.out at the very end of your command. For example:
sudo journalctl > journal.out
Further, you can search for specific phrases in the logs by adding | grep "search text" after the journalctl command. For example:
sudo journalctl | grep "Out of memory"
You can combine all of these things like this:
sudo journalctl -u jupyterhub --since "2022-10-04 00:00" --until "2022-10-04 17:00" | grep "Out of memory" > jupyterhub-out-of-memory.out
If you see a UID (a numeric user ID) in a log message (like Oct 04 18:39:14 ip-172-31-76-2 kernel: Out of memory: Killed process 2419 (jupyterhub-sing) total-vm:22034816kB, anon-rss:21789220kB, file-rss:1516kB, shmem-rss:0kB, UID:<UID> pgtables:42920kB oom_score_adj:0), you can determine who the associated user was:
id -nu UID
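If there are a lot of out-of-memory lines, pulling the UIDs out by hand gets tedious. Here is a minimal sketch of one way to automate it; oom_uids is a hypothetical helper, and the UID 1001 in the sample line is invented for illustration.

```shell
# Hypothetical helper: read log text on stdin and print each distinct UID
# found in a kernel "Out of memory: Killed process" line.
oom_uids() {
  grep 'Out of memory: Killed process' | grep -oE 'UID:[0-9]+' | cut -d: -f2 | sort -u
}

# Sample line modelled on the message above; the UID value 1001 is made up.
sample='Oct 04 18:39:14 ip-172-31-76-2 kernel: Out of memory: Killed process 2419 (jupyterhub-sing) total-vm:22034816kB, anon-rss:21789220kB, file-rss:1516kB, shmem-rss:0kB, UID:1001 pgtables:42920kB oom_score_adj:0'
printf '%s\n' "$sample" | oom_uids

# On the hub, the same helper could feed id to turn UIDs into usernames:
#   sudo journalctl | oom_uids | while read -r uid; do id -nu "$uid"; done
```

The sample run prints 1001. The helper only reads stdin, so it works equally well on a saved file such as journal.out.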
- -- Boot <some hash> --: the system rebooted
- SIG something: A low-level system signal was thrown. Usually something like SIGABRT or SIGTERM.
- Out of memory: Killed process or oom-kill: Usually this is some notebook that went out of control, but the system or resource limits caught it and killed it before it could take over the entire system.
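The markers listed above can be searched for in one pass. This is a sketch only: crash_markers is a made-up helper, the pattern is just a starting point, and the sample log lines (apart from their general shape) are invented.

```shell
# Hypothetical helper: keep only the lines on stdin that contain one of the
# markers above (boot separators, SIG* signal names, OOM kills).
crash_markers() {
  grep -E -e '-- Boot |SIG[A-Z]+|Out of memory: Killed process|oom-kill'
}

# Invented sample lines so the sketch runs anywhere.
printf '%s\n' \
  '-- Boot 0123abcd --' \
  'Oct 04 18:39:14 host kernel: Out of memory: Killed process 2419 (jupyterhub-sing)' \
  'Oct 04 18:40:02 host jupyterhub[812]: Received signal SIGTERM, stopping' \
  'Oct 04 18:41:00 host jupyterhub[812]: 200 GET /hub/health' |
  crash_markers

# On the hub: sudo journalctl --since "2022-10-04 00:00" | crash_markers
```

The sample run keeps the first three lines and drops the health-check line. The -e flag is needed because the pattern itself starts with a dash.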
top or htop can be useful for real-time monitoring of processes, users, and resource utilization.
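For a quick non-interactive snapshot, per-user memory can also be totalled from ps output. The following is a rough sketch: mem_by_user is a hypothetical helper, the usernames in the sample input are invented, and the ps invocation in the comment assumes the procps version of ps found on most Linux systems.

```shell
# Hypothetical helper: sum resident memory (RSS, in KiB) per user from
# "user rss" lines on stdin and print MiB per user, largest first.
mem_by_user() {
  awk '{ rss[$1] += $2 } END { for (u in rss) printf "%s %d\n", u, rss[u] / 1024 }' |
    sort -k2,2nr
}

# On the hub (procps ps assumed):
#   ps -eo user:32,rss --no-headers | mem_by_user
# Invented sample input so the sketch runs anywhere:
printf '%s\n' 'jupyter-alice 1048576' 'jupyter-bob 524288' 'jupyter-alice 1048576' | mem_by_user
```

With the sample input this prints jupyter-alice 2048 followed by jupyter-bob 512, i.e. the user holding the most memory comes first.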