Error monitoring and triaging workflow

Error monitoring with Sentry

Sentry is the error monitoring platform used within the project. It gathers and groups the errors (called events in Sentry) that it receives; grouped events are called issues on Sentry. Each event comes with context information, e.g. tags and breadcrumbs (the trail of events preceding the error).

You need to have permissions to access the Magma Sentry. See the documentation on how to enable Sentry for an AGW.

We distinguish two types of errors: platform errors and application errors. Magma's Sentry is dedicated to the detection and monitoring of platform errors.

Platform Errors

Platform errors are errors that have to be handled by developers, for example bugs or otherwise unexpected behavior.

Application Errors

Application errors are errors that have to be resolved by Magma users, i.e. failures of a correctly working program caused, for example, by a faulty configuration. These errors are not unexpected and have to be fixed by the user.

Sentry workflow for new errors

  1. Trigger: a new issue appears on Sentry (unknown issue, regression issue or unignored issue).
  2. Sentry automatically posts a message to the respective Magma Slack channel (devops-sentry, devops-sentry-python).
  3. Create a GitHub issue for the new Sentry issue.
  4. Assign a fitting owner to the new GitHub issue. The owner of the GitHub issue is also the owner of the Sentry issue.
  5. Handle the GitHub issue with one of two outcomes:
    1. A bug was fixed in ${sentry-release}.
    2. An application error was excluded from reporting.
  6. Close the Sentry issue:
    1. Resolve in ${sentry-release}, or
    2. Ignore with a suitable condition.
  7. Close the GitHub issue, naming the action taken (this can be a linked PR, of course).

Event volume and quota

:warning: Event volume can vary for existing issues and affect the Sentry subscription, as the subscription is based on event volume. Event volume should therefore be monitored regularly.

The event volume can be assessed in the Issues section or the Stats section of Sentry.

The event quota (which is connected to the billing subscription) can be accessed in the Billing section, but only if you have the owner or billing role. Project Magma has an event quota of 4,000,000 (4M) as of January 2022.

Your billing section is accessible at https://sentry.io/settings/${YOUR_ORGANISATION}/billing/overview/. You can find more information on quota management in the official Sentry documentation.

Rate Limiting

The official Sentry documentation covers quota management in detail. One specific measure is rate limiting. Rate limits are project specific and have to be configured for every client key (DSN). Your client keys are accessible at https://sentry.io/settings/${YOUR_ORGANISATION}/projects/${YOUR_PROJECT}/keys/ by expanding a specific key with the "Configure" button.

For Magma we recommend a rate limit of 100 events per minute; this should be configured for the Magma Python project.

Sentry Documentation

Sentry has good documentation and you are encouraged to use it.

Best practices

  • If you resolve an issue manually, give a commit or version for which it is resolved. Do not resolve unconditionally.
  • If you decide to ignore a Sentry issue, you should use a sensible ignore condition.
  • If you decide to ignore a Sentry issue, leave a justification in the "Activity" tab.

Assigning issues to developers

An owner should be assigned to each GitHub issue created via Sentry. If it is unclear whom to assign the issue to, choose the respective code owners of the affected part of the code.

Handling issues

Every GitHub issue created via Sentry needs to be handled. How an issue is handled depends on its error category.

Platform Error

If an issue is a platform error, it has to be fixed. Once a fix is merged, the assignee of the GitHub issue also resolves the issue on Sentry, either manually by giving a commit or automatically via the GitHub integration.
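As an illustration (the commit text and the ${SENTRY_ISSUE_SHORT_ID} placeholder below are hypothetical; the exact resolve-by-commit syntax is described in Sentry's GitHub integration documentation), a merged fix can reference the Sentry issue in its commit message so that Sentry resolves the issue automatically:

Fix crash in subscriberdb state replication

Fixes ${SENTRY_ISSUE_SHORT_ID}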

Application Error

If an issue is an application error that does not require action by developers, apply one of the following strategies. Once an application error is handled by one of the options below, it should be ignored on Sentry.

Opt-out

You can use the opt-out functionality to exclude the error from Sentry reporting by tagging the error log message:

import logging

from magma.common.sentry import EXCLUDE_FROM_ERROR_MONITORING

logging.error("An application error happened...", extra=EXCLUDE_FROM_ERROR_MONITORING)

Log level reduction

You can reduce the log level of the error to warning or info. This should only be done if it is clear that the issue does not need to be communicated to users as an error.
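A minimal sketch (the log message is illustrative): a call that previously produced a Sentry error event is simply demoted to a warning.

import logging

# Before: reported to Sentry as an error event
# logging.error("Could not load configuration, falling back to defaults")

# After: demoted to a warning, no longer captured as a Sentry error
logging.warning("Could not load configuration, falling back to defaults")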

Client side filters

In some cases it can also make sense to make use of the existing filter options that block error messages from showing up on Sentry. Error messages can be filtered on the client side, which prevents them from being sent to Sentry at all. This option is helpful for third-party code, for example.

:warning: This option should only be applied with care, if none of the above can be utilized.
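As a generic sketch of client-side filtering with the Python Sentry SDK (the before_send hook is the SDK's standard mechanism, but the filter list and wiring shown here are illustrative and not Magma's actual implementation; see magma.common.sentry for that):

import sentry_sdk

# Illustrative list of message fragments that should never be sent to Sentry
FILTERED_MESSAGE_FRAGMENTS = ("error from third-party library",)

def filter_event(event, hint):
    # Returning None drops the event on the client before it is sent to Sentry
    message = event.get("logentry", {}).get("formatted", "") or ""
    if any(fragment in message for fragment in FILTERED_MESSAGE_FRAGMENTS):
        return None
    return event

sentry_sdk.init(dsn="${YOUR_DSN}", before_send=filter_event)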

Server side filters

Issues can be filtered on the Sentry side as well. This option can be helpful if an issue is causing a lot of noise and should be blocked quickly without going through the review process on GitHub, but it should not be the permanent solution.

With one exception: if the issue comes from an old release that does not contain the opt-out and client-filter functionality, server side filters can be used to exclude the issue from Sentry.

This solution will require elevated rights in Sentry.

:warning: This option should only be applied with care, if the event volume spikes and the Sentry quota is likely to be exceeded without immediate action.

AGWs connected to the project Sentry account

| AGW name | Contact | Comment |
| --- | --- | --- |
| hil-1, hil-2, phy-u4, phy-u5, phy-u7, phy-u13 | @milapjoshi | Hardware-in-the-loop lab |
| TVM3-SA-regression, TVM3-SA-regression2, TVM3-SA-baseline | @vaishaliCI | |
| u-160-161-nat, u-160-161-nonnat | ? | |
| agw-aws | ? | |
| fre-agw7 | ? | |
| regressionAGW | ? | |
| agwNSA | ? | |
| agw-dev | ? | |
| ip-10-23-4-93 | ? | |
| AGW-non-nat | ? | |
| agw-stateless | ? | |
| agw-stateful | ? | |
| agw-mvp-stateles | ? | |

Inbound filters on Sentry

For each Sentry project, inbound filters can be configured to exclude events with matching error messages. The filtered-out events do not count towards the quota. The 'Stats' section provides statistics on the number of events that are excluded on the server side by these filters.

The filters are needed since some AGWs that are connected to Sentry do not run on versions in which these errors have been explicitly excluded (via opt-out).

The filters of the Python project can be configured here. For the Native project, no inbound filters are currently configured. Filters can be configured by accounts with 'Manager' permissions or higher.

All applied filters are documented in the table below. If you add or remove a filter, please also update the table accordingly.

| pattern | reason for exclusion | date added |
| --- | --- | --- |
| `*GetServiceInfo*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*GetOperationalStates*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*ConnectionError*` | frequent connection error, not an issue in 1.7 | |
| `*Checkin Error*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*GetChallenge error!*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*Connection to FluentBit*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*[SyncRPC]*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*Metrics upload error*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*Streaming from the cloud failed!*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*Fetch subscribers error!*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*GRPC call failed for state replication*` | frequent connection error, excluded via opt-out in 1.7 | |
| `*database or disk is full*` | high volume error message from TVM3-SA-baseline agw | 2022-02-28 |

Sentry integrations

Slack

Assigning, resolving and ignoring Sentry issues can be done in the Slack channel directly if you link your Slack account to your Sentry account (type /sentry link in Slack).

Debugging C/C++ with Sentry

If you want to use log files for debugging, the Sentry SDK supports log file upload. The Swagger UI integrates configuration options for activating and deactivating file upload. There are two configuration options for sending log files to Sentry after a C++ service crash occurs. The first option is upload_mme_log. If it is set to true, as in the following example, the MME service log file located in /var/log/mme.log will be submitted with the crash report.

"upload_mme_log": true

The second option, number_of_lines_in_log, controls the transmission of the journal syslog file /var/log/syslog. It selects the last n entries in the file and sends them to Sentry attached to the crash report. The attached file contains log entries of all services marked with magma@ and additionally of the service sctpd. In the example below, n is 1000. If n is set to 0, number_of_lines_in_log is disabled and no log history is sent.

"number_of_lines_in_log": 1000 

If the options are activated, the log files are available on sentry.io in the corresponding project.

The Sentry SDK for C++ contains BreakPad, which captures information when a process crashes and generates minidumps. Subsequently, the Sentry SDK uploads them to the URL specified in the Sentry configuration. Minidumps are small memory dumps that typically include the runtime stack of all active threads at the time of the crash as well as relevant meta-information about the application.

More information is given in the minidump documentation provided by Sentry.

:warning: Minidumps as well as converted coredumps cannot be used to resume processes. Moreover, it can happen that no meaningful stacktraces can be generated even if all required crash report files are available.

Correct linking with Bazel

While minidumps are uploaded automatically, debug information files generated with objcopy must be uploaded manually via sentry-cli beforehand. When using Bazel for the build process, gold is set as the linker by default, which may lead to missing unwind information and may reduce the quality of the extractable stacktrace. Switching to lld as the linker solves the issue and additionally provides the opportunity to combine unwind and symbol information in one executable when using the objcopy command. Since both kinds of information are combined in one file, the debugger can benefit from them in a later step.
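A rough sketch of that manual upload (the paths and the organisation and project names are placeholders; the exact objcopy invocation depends on your build setup):

  VM:~/magma$ objcopy --only-keep-debug ${YOUR_SERVICE_EXECUTABLE} ${YOUR_SERVICE_EXECUTABLE}.debug
  VM:~/magma$ sentry-cli upload-dif --org ${YOUR_ORGANISATION} --project ${YOUR_PROJECT} ${YOUR_SERVICE_EXECUTABLE}.debug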

Debug crash reports

In order to debug crash reports, we have some requirements.

  1. We recommend using the Magma VM for debugging since all required system and third-party libraries are available there.
  2. You need to download two files, the executable equipped with debug information and the minidump file.
    1. On sentry.io, go to your sentry project, select Issues and choose one issue.
    2. Download the executable
      1. Scroll down to Image Loaded and select Show Details.
      2. Select View for ${YOUR_SERVICE_EXECUTABLE} labelled with elf debug companion.
      3. Click on the Download button to get ${YOUR_SERVICE_EXECUTABLE}.debug.
    3. Download the minidump
      1. In the issue view, go down to Attachments below Image Loaded.
      2. Download the .dmp file by clicking the download button.
    4. Store the downloaded files inside the Magma Repo in a temporary folder, e.g. minidump. Then, if you want to know which dynamic libraries are required, you can run ldd on the downloaded ${EXECUTABLE_IDENTIFIER}.debug.
      VM:~/magma/minidump$ ldd ${EXECUTABLE_IDENTIFIER}.debug
      
  3. We provide two alternative ways to debug minidump files, using either LLDB or GDB. However, we recommend LLDB as our preferred debugger. In general, LLDB shows more usable information than GDB and additionally provides a usable UI out of the box.

LLDB (recommended)

LLDB debugging is compatible with both the Magma VM and the DevContainer.

When using the LLDB debugger, you start the debugger inside the minidump folder with the following command line.

  VM:~/magma/minidump$ lldb --core ${IDENTIFIER}.dmp ${EXECUTABLE_IDENTIFIER}.debug

Inside LLDB, use the thread backtrace or the bt command to show the backtrace of the thread that caused the exception. It starts at the top of the stack. You can move up the backtrace by using the up command one or several times until you reach the position of interest. If available, the corresponding part of the source code is displayed. The command frame variable lists all local variables and their current values if available. target variable does the same for global and static variables.

Alternatively, you can start the text-based UI mode with the command gui while LLDB is running. It offers the same functions as the commands described above.

For further suitable LLDB commands, have a look at the table below. It is a selection of essential commands.

GDB

While the debugger GDB is already installed in the Magma VM, it has an additional requirement: GDB works on coredump files, so you must convert the minidump file into a GDB-readable coredump format. That can be achieved with the program minidump-2-core. The program is not installed in the VM; you can get it by following the instructions in BreakPad. The following command converts the minidump file into a coredump.

  VM:~/magma/minidump$ ~/magma/dev_tools/minidump-2-core ${IDENTIFIER}.dmp -o core.dmp

Afterwards, you can test core.dmp by using readelf. The program is already installed in the VM. In general, it displays information about the contents of ELF format files; the generated coredump file follows this format. If the coredump file is corrupt, the execution of the following line will show warnings.

  VM:~/magma/minidump$ readelf -a  core.dmp 2>&1 | grep -i warn 

Now you can start the debugger inside the minidump folder with the following command line.

  VM:~/magma/minidump$ gdb --core core.dmp ${EXECUTABLE_IDENTIFIER}.debug 

The debugging procedure inside GDB corresponds to that of LLDB; only some functions are invoked with different commands. The table below lists the LLDB commands and their GDB counterparts.

Alternatively, you can activate a separate source code view inside the GDB environment by using -tui as an additional start parameter.

  VM:~/magma/minidump$ gdb --core core.dmp ${EXECUTABLE_IDENTIFIER}.debug -tui

If you want more visual features, you can, for example, install GDB dashboard. However, there are a couple of UIs available on GitHub for LLDB as well as for GDB.

Debug Tool Commands for LLDB and GDB

The table below shows suitable GDB commands and their LLDB counterparts.

| Commands | GDB | LLDB |
| --- | --- | --- |
| Show the stack backtrace of the current frame | `backtrace` | `bt` |
| Select the frame that calls the current frame | `up` | `up` |
| Select the frame that is called by the current frame | `down` | `down` |
| Show local variables of the current frame | `info locals` | `frame variable` |
| Show content of the local variable ${var} | `p ${var}` | `frame variable ${var}` |
| Show all global/static variables | - | `target variable` |
| Show contents of the global variable ${var} | `p ${var}` | `target variable ${var}` |
| List all threads | `info threads` | `thread list` |
| Select the thread with the number ${nr} | `thread ${nr}` | `thread select ${nr}` |
| Show the backtrace for all threads | `thread apply all bt` | `thread backtrace all` |
| List the executable and all dependent shared libraries | `info shared` | `image list` |
| Disassemble the current function for the current frame | `disassemble` | `di` |
| Disassemble 10 instructions from a given address | `x/10i ${address}` | `di -s ${address} -c 10` |
| Show mixed source and disassembly for the current function | - | `di -f -m` |
| Disassemble the current function and show the opcode bytes | - | `di -f -b` |
| Show all registers for the current thread | `info all-registers` | `register read --all` |
| Show values for the register named "rip" | `p/t $rip` | `p/t $rip` |