VDK Logging And Error Handling: Technical Analysis - vmware/versatile-data-kit GitHub Wiki

Goals

Create a high-level overview of existing code
Identify potential areas of improvement for VDK logging and error handling

Logging

Logging configuration is implemented as a plugin. The plugin hooks into the vdk_configure and initialize_job methods. The vdk_configure hook configures the host, port, enabled flag and socket type for SYSLOG. The initialize_job hook fetches all the relevant job data and passes it to the configure_loggers method.
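To give a rough idea of what the SYSLOG settings above translate to, here is a minimal sketch that uses only the Python standard library. The helper name `build_syslog_handler` and its parameters are hypothetical and are not part of VDK; the real plugin reads these values from VDK configuration.

```python
import logging
import logging.handlers
import socket
from typing import Optional


def build_syslog_handler(
    host: str, port: int, enabled: bool, protocol: str = "UDP"
) -> Optional[logging.Handler]:
    """Hypothetical helper: return a SysLogHandler, or None when SYSLOG is disabled."""
    if not enabled:
        return None
    # UDP is connectionless, so creating the handler does not need a reachable server.
    # TCP (SOCK_STREAM) connects immediately and fails if nothing is listening.
    socktype = socket.SOCK_DGRAM if protocol.upper() == "UDP" else socket.SOCK_STREAM
    return logging.handlers.SysLogHandler(address=(host, port), socktype=socktype)
```

The returned handler would then be attached to the root logger with `logging.getLogger().addHandler(...)`.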

Line 207

configure_loggers handles the rest of the SYSLOG configuration and configures logging for different environments, e.g. CLOUD vs. LOCAL.

There is only one logging formatter, and it is hardcoded. It is used in all environments.

    DETAILED_FORMAT = (
        f"%(asctime)s [VDK] {job_name} [%(levelname)-5.5s] %(name)-30.30s %(filename)20.20s:%("
        f"lineno)-4.4s %(funcName)-16.16s[id:{attempt_id}]- %(message)s"
    )
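To see what a line produced by this format looks like, the sketch below feeds a hand-built log record through `logging.Formatter`. The values of `job_name` and `attempt_id`, as well as the logger name, file name and message, are placeholders chosen for illustration.

```python
import logging

job_name = "example-job"    # placeholder value
attempt_id = "attempt-001"  # placeholder value

DETAILED_FORMAT = (
    f"%(asctime)s [VDK] {job_name} [%(levelname)-5.5s] %(name)-30.30s %(filename)20.20s:%("
    f"lineno)-4.4s %(funcName)-16.16s[id:{attempt_id}]- %(message)s"
)

formatter = logging.Formatter(DETAILED_FORMAT)

# Build a record by hand so the example does not depend on any logger setup.
record = logging.LogRecord(
    name="vdk.example",
    level=logging.INFO,
    pathname="step.py",
    lineno=42,
    msg="Step finished",
    args=None,
    exc_info=None,
    func="run",
)
line = formatter.format(record)
print(line)
```

Note that the job name and attempt id are baked into the format string when it is built, so they appear on every line regardless of the environment.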

Line 110

Line 148

Line 160

Error Handling

Error handling is built into VDK core.

https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/internal/core/errors.py

There are three types of errors thrown by VDK:

PLATFORM_ERROR - infrastructure errors
USER_ERROR - errors in user code/configuration
CONFIG_ERROR - errors in the configuration provided to VDK

Each error type has a corresponding accountable party:

PLATFORM_ERROR - should be fixed by the PLATFORM (SRE Team, Platform team, operating the infrastructure and services).
USER_ERROR - should be fixed by the end USER (or data job owner), for example: supplied bad arguments, bug in user code.
CONFIG_ERROR - who fixes it depends on where the error occurred:
- platform run (the data job runs on platform infrastructure): handled by the PLATFORM;
- local run (the data job runs on the end user's local infrastructure): handled by the USER.
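The accountability rules above can be sketched as a small lookup. The names `ErrorType`, `Accountable` and `accountable_for` are hypothetical and only mirror the lists above; they are not the identifiers used in errors.py.

```python
from enum import Enum


class ErrorType(Enum):  # hypothetical name
    PLATFORM_ERROR = "platform"
    USER_ERROR = "user"
    CONFIG_ERROR = "config"


class Accountable(Enum):  # hypothetical name
    PLATFORM = "Platform"
    USER = "User"


def accountable_for(error_type: ErrorType, runs_on_platform: bool) -> Accountable:
    """Return who should fix an error of the given type."""
    if error_type is ErrorType.PLATFORM_ERROR:
        return Accountable.PLATFORM
    if error_type is ErrorType.USER_ERROR:
        return Accountable.USER
    # CONFIG_ERROR: accountability depends on where the data job runs.
    return Accountable.PLATFORM if runs_on_platform else Accountable.USER
```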

Errors that are caught are wrapped in the Resolvable class. It describes who is responsible for handling the error and how it should be handled.

https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/internal/core/errors.py#L80

resolvable_by: indicates the resolvable type
resolvable_by_actual: who is actually responsible for resolving it
error_message: the error message
exception: the exception related to the error
resolved: indicates whether the error is resolved (for example, an error may be handled in user code, in which case it is considered resolved). It should be used for informational purposes only. It may be None/empty (for example, if the error originates from a new thread spawned by a job step).

Errors are reported by the following functions:

log_exception

log_and_throw

log_and_rethrow

They behave similarly when it comes to logging.

Each of them first builds the error message, which has the following format.

    error_message = __build_message_for_end_user(
        to_be_fixed_by,
        resolvable_by_actual,
        what_happened,
        why_it_happened,
        consequences,
        countermeasures,
    )

Then the error is pushed to the resolvable context and the method decides what to do with the underlying exception, e.g. throw a new one, re-throw or do nothing. The resolvable context is used to determine who is accountable for fixing the error.
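The flow described above (build the message, record the error in the context, then decide what to do with the exception) can be sketched as follows. This is a simplified illustration, not the code in errors.py: `ResolvableContext`, `build_message` and `log_and_throw_sketch` are hypothetical stand-ins, and the message layout is an assumption.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger(__name__)


@dataclass
class Resolvable:
    """Simplified stand-in with the fields listed earlier on this page."""
    resolvable_by: str
    resolvable_by_actual: str
    error_message: str
    exception: BaseException
    resolved: Optional[bool] = False


class ResolvableContext:
    """Hypothetical stand-in for the resolvable context: collects reported errors."""

    def __init__(self) -> None:
        self.resolvables: List[Resolvable] = []

    def add(self, resolvable: Resolvable) -> None:
        self.resolvables.append(resolvable)


CONTEXT = ResolvableContext()


def build_message(what_happened: str, why_it_happened: str,
                  consequences: str, countermeasures: str) -> str:
    # Assumed layout; the real message format may differ.
    return "\n".join([
        f"What happened: {what_happened}",
        f"Why it happened: {why_it_happened}",
        f"Consequences: {consequences}",
        f"Countermeasures: {countermeasures}",
    ])


def log_and_throw_sketch(to_be_fixed_by: str, what_happened: str, why_it_happened: str,
                         consequences: str, countermeasures: str) -> None:
    message = build_message(what_happened, why_it_happened, consequences, countermeasures)
    exception = RuntimeError(message)
    # Record the error so accountability can be determined later, at exit.
    CONTEXT.add(Resolvable(to_be_fixed_by, to_be_fixed_by, message, exception))
    log.error(message)
    raise exception
```

A "log only" variant would stop after `log.error`, and a "rethrow" variant would take an existing exception and re-raise it instead of creating a new one.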

The get_blamee_overall function uses the resolvable context to determine who is accountable for fixing the error. It is called in the termination messages plugin and the notifications plugin. Both of these plugins hook into the vdk_exit method.

Several exception classes are also defined in the same module. They are reused in plugins and user code.

https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-core/src/vdk/internal/core/errors.py#L160

Potential Areas of Improvement

log formatting - we should let users choose between several different log formatters. We could also let users specify their own. This matters because VDK is used in different ways in different environments:
- someone using vdk locally might not care about attempt ids
- someone using vdk in a production environment might want advanced treatment of new lines
- someone using dags may or may not care about parent jobs
dags - we currently have no mechanism for distinguishing which job an error came from if the job was part of a DAG
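One possible shape for selectable log formats is sketched below. The format names (`local`, `cloud`), their layouts, and the `build_formatter` helper are hypothetical and are not an existing VDK feature or configuration option.

```python
import logging

# Hypothetical format names and layouts, chosen only for illustration.
LOG_FORMATS = {
    # Local runs: short lines, no attempt id.
    "local": "%(asctime)s [%(levelname)-5.5s] %(message)s",
    # Cloud runs: include the job name and attempt id, passed as extra record fields.
    "cloud": (
        "%(asctime)s [VDK] %(job_name)s [%(levelname)-5.5s] "
        "%(name)s [id:%(attempt_id)s] %(message)s"
    ),
}


def build_formatter(name_or_format: str) -> logging.Formatter:
    """Return a named formatter, or treat the value as a user-supplied format string."""
    fmt = LOG_FORMATS.get(name_or_format, name_or_format)
    return logging.Formatter(fmt)
```

With this approach, the "cloud" layout expects `job_name` and `attempt_id` on every record, for example via `logger.info("msg", extra={"job_name": ..., "attempt_id": ...})` or a `logging.Filter` that sets them.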

TODO: Provide examples

error handling - currently, the burden of providing accurate information on who is accountable for an error and how to fix it falls on the caller. This causes two problems:
1. The initial error message in the logs is far removed from the actual error.
2. It is hard to find the actual root cause in the logs using the stack trace.
Ideally, we would like error messages similar to those of npm, where the root cause is stated immediately and a detailed stack trace is available somewhere else.
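A minimal sketch of the "root cause first, stack trace elsewhere" idea, using only the standard library. The function names `root_cause` and `report`, the summary wording, and the choice of writing the trace to a file are assumptions made for illustration.

```python
import traceback
from pathlib import Path


def root_cause(exc: BaseException) -> BaseException:
    """Follow the exception chain down to the original error."""
    while True:
        # Prefer an explicit cause ("raise ... from ..."), else the implicit context.
        nxt = exc.__cause__ if exc.__cause__ is not None else exc.__context__
        if nxt is None:
            return exc
        exc = nxt


def report(exc: BaseException, trace_file: Path) -> str:
    """Write the full chained traceback to a file and return a short summary."""
    full_trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    trace_file.write_text(full_trace)
    cause = root_cause(exc)
    return f"ERROR: {type(cause).__name__}: {cause}\nFull stack trace: {trace_file}"
```

The short summary would go to the console or the top of the log, while the file keeps the complete chain of wrapped exceptions for debugging.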

https://lucasfcosta.com/2022/06/01/ux-patterns-cli-tools.html

TODO: Provide examples

multi-threading and the resolvable context

TODO: Sync with @tozka about this

Next Steps

Analyze existing customer data
Check for other areas of improvement based on data
Prioritize identified areas of improvement based on data