Hub crashes & logs - mtholyoke/JupyterHub-on-AWS GitHub Wiki

Resources

The Littlest JupyterHub (TLJH) documentation has a page on application logs.

Looking around

First, gather the date and time of the crash in UTC (convert from Eastern time if you need to). Two good timestamps to have are the time the error was noticed and the time the system recovered. When looking in the logs, start about half an hour before the error was detected so you can see what led up to it.

Hint: you can see what the server thinks is the current time with uptime.
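To work out where to start looking, you can subtract the half hour from the time the error was noticed. This is a minimal sketch, assuming GNU date (the default on Ubuntu); shift_utc is a hypothetical helper name, not something installed on the server.

```shell
# Hypothetical helper: shift a UTC timestamp by N minutes (negative = earlier).
# Assumes GNU date, which understands -d and @epoch input.
shift_utc() {
  local ts="$1" minutes="$2" epoch
  epoch=$(date -u -d "$ts UTC" +%s) || return 1
  date -u -d "@$((epoch + minutes * 60))" '+%Y-%m-%d %H:%M'
}

noticed="2022-10-04 18:39"          # when the error was noticed, in UTC
since=$(shift_utc "$noticed" -30)   # start looking 30 minutes earlier
echo "search from: $since"
```

The resulting value can be passed to journalctl with --since. journalctl reads timestamps in the server's local time zone unless a zone is appended, so writing "$since UTC" removes any ambiguity.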

Then, log in:

Note: You will have to be configured as an admin in order to run sudo.

These systems are configured to centralize their logs, and those logs can be viewed using a command called journalctl.

All logs

You can get every log entry the journal has kept like this:

sudo journalctl

Specific application logs

You can limit which logs you see by specifying a "unit" (basically the application of interest) using -u unit. Two of the main units here are jupyterhub and traefik. Traefik is a reverse proxy and load balancer commonly used when deploying cloud-native applications like JupyterHub. For example:

sudo journalctl -u jupyterhub 
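The same pattern works for the other units. The sketch below wraps it in a small function; unit_logs is a hypothetical name, and it assumes TLJH's usual unit names (jupyterhub, traefik, and jupyter-<username> for an individual user's server).

```shell
# Hypothetical helper: show logs for one systemd unit without a pager.
# Any extra arguments are passed straight through to journalctl.
unit_logs() {
  local unit="$1"
  shift
  sudo journalctl -u "$unit" --no-pager "$@"
}

# Examples (run on the hub server; "someuser" is a placeholder):
#   unit_logs traefik
#   unit_logs jupyter-someuser -n 100    # last 100 lines only
```

Turning the pager off with --no-pager makes the output easier to pipe into grep or redirect to a file.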

Specific dates and times

For any journalctl command, you can optionally use --since "yyyy-MM-dd HH:mm" --until "yyyy-MM-dd HH:mm" to get only the logs between those datestamps. For example:

sudo journalctl -u jupyterhub --since "2022-10-04 00:00" --until "2022-10-04 17:00"

Sending output to a file

You can also send the log to a file to look through later instead of standard out by adding > file.out at the very end of your command. For example:

sudo journalctl > journal.out

Looking for specific phrases

Further, you can search for specific phrases in the logs by adding | grep "search text" after the journalctl command. For example:

sudo journalctl | grep "Out of memory"

Put all of that together

You can combine all of these things like this:

sudo journalctl -u jupyterhub --since "2022-10-04 00:00" --until "2022-10-04 17:00" | grep "Out of memory" > jupyterhub-out-of-memory.out

Investigating UIDs

If you see a numeric user ID (UID) in a log message (like Oct 04 18:39:14 ip-172-31-76-2 kernel: Out of memory: Killed process 2419 (jupyterhub-sing) total-vm:22034816kB, anon-rss:21789220kB, file-rss:1516kB, shmem-rss:0kB, UID:<UID> pgtables:42920kB oom_score_adj:0), you can determine who the associated user was:

id -nu UID
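If there are many such messages, the number after UID: can be pulled out with sed rather than copied by hand. This is a sketch only: oom_uid is a hypothetical helper, and the sample line is modeled on the message above with a made-up UID of 1001.

```shell
# Hypothetical helper: read log lines on stdin and print the numeric UID
# from any kernel "Killed process" line.
oom_uid() {
  sed -n 's/.*Killed process.*UID:\([0-9][0-9]*\).*/\1/p'
}

# Illustrative line; UID 1001 is invented for this example.
sample='Oct 04 18:39:14 ip-172-31-76-2 kernel: Out of memory: Killed process 2419 (jupyterhub-sing) total-vm:22034816kB, anon-rss:21789220kB, file-rss:1516kB, shmem-rss:0kB, UID:1001 pgtables:42920kB oom_score_adj:0'

uid=$(printf '%s\n' "$sample" | oom_uid)
echo "UID: $uid"

# On the hub server, resolve it to a username:
#   id -nu "$uid"
```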

Things that might be flags for an issue:

  • -- Boot <some hash> --: the system rebooted
  • SIGsomething: a low-level signal was sent to a process, usually something like SIGABRT or SIGTERM.
  • Out of memory: Killed process or oom-kill: Usually this is some notebook that went out of control, but the system or resource limits caught it and killed it before it could take over the entire system.
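The warning signs listed above can be checked in one pass over a saved log file. The following is a rough sketch: find_flags is a hypothetical helper, and the patterns are only approximations of those messages, so expect some false positives.

```shell
# Hypothetical helper: print lines matching the warning signs above,
# prefixed with line numbers so they are easy to find again in the file.
# -e is needed because the pattern itself begins with a dash.
find_flags() {
  grep -n -E -e '-- Boot [0-9a-f]+ --|SIG[A-Z]+|Out of memory: Killed process|oom-kill' "$1"
}

# Example (run on the hub server):
#   sudo journalctl --since "2022-10-04 00:00" --until "2022-10-04 17:00" > journal.out
#   find_flags journal.out
```

grep exits with a non-zero status when nothing matches, which here simply means none of the flags were found in that time window.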

Real-time debugging

top or htop can be useful for real-time monitoring of processes, users, and resource utilization.
