VDK Logging And Error Handling: Technical Analysis - vmware/versatile-data-kit GitHub Wiki
Goals
- Create a high-level overview of existing code
- Identify potential areas of improvement for VDK logging and error handling
Logging
Logging configuration is implemented as a plugin. The plugin hooks into the vdk_configure and initialize_job methods. The vdk_configure hook configures the host, port, enabled flag and socket type for SYSLOG. The initialize_job hook fetches all the relevant job data and passes it to the configure_loggers method.
configure_loggers handles the remaining SYSLOG configuration and sets up logging for the different environments, e.g. CLOUD vs. LOCAL.
There is only one log format, and it is hardcoded. It is used in all environments:
DETAILED_FORMAT = (
f"%(asctime)s [VDK] {job_name} [%(levelname)-5.5s] %(name)-30.30s %(filename)20.20s:%("
f"lineno)-4.4s %(funcName)-16.16s[id:{attempt_id}]- %(message)s"
)
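To see what this format produces, the minimal sketch below feeds it to the standard library's logging.Formatter and formats a single record. The helper name build_detailed_format and the values "example-job" and "attempt-1" are assumptions made for illustration; they are not taken from the VDK code.

```python
import logging


def build_detailed_format(job_name: str, attempt_id: str) -> str:
    # Same layout as DETAILED_FORMAT above. job_name and attempt_id are
    # interpolated once, when the string is built, not per log record.
    return (
        f"%(asctime)s [VDK] {job_name} [%(levelname)-5.5s] %(name)-30.30s %(filename)20.20s:%("
        f"lineno)-4.4s %(funcName)-16.16s[id:{attempt_id}]- %(message)s"
    )


# Hypothetical job name and attempt id, for illustration only.
formatter = logging.Formatter(build_detailed_format("example-job", "attempt-1"))

# Build a record by hand so the example does not depend on logger setup.
record = logging.LogRecord(
    name="vdk.example",
    level=logging.INFO,
    pathname="/tmp/step.py",
    lineno=12,
    msg="Step %s started",
    args=("10_load.py",),
    exc_info=None,
    func="run_step",
)

line = formatter.format(record)
print(line)
```

Because the job name and attempt id are baked into the format string, every environment gets the same columns, whether or not they are useful there.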
Error Handling
Error handling is built into VDK core.
VDK throws three types of errors:
- PLATFORM_ERROR - infrastructure errors
- USER_ERROR - errors in user code/configuration
- CONFIG_ERROR - errors in the configuration provided to VDK
Each error type has a corresponding accountable party:
- PLATFORM_ERROR - should be fixed by the PLATFORM (the SRE team or platform team operating the infrastructure and services).
- USER_ERROR - should be fixed by the end USER (or data job owner), for example when bad arguments were supplied or the user code has a bug.
- CONFIG_ERROR - depends on where the error occurred:
  - platform run (the data job runs on platform infrastructure): handled by the PLATFORM;
  - local run (the data job runs on the end user's local infrastructure): handled by the USER.
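The mapping above can be written as a small function. This is only a sketch of the rule as described here: the enum names, the function name accountable_for and the is_platform_run flag are hypothetical and do not mirror the actual VDK API.

```python
from enum import Enum


class ErrorType(Enum):
    PLATFORM_ERROR = "PLATFORM_ERROR"
    USER_ERROR = "USER_ERROR"
    CONFIG_ERROR = "CONFIG_ERROR"


class Accountable(Enum):
    PLATFORM = "PLATFORM"
    USER = "USER"


def accountable_for(error_type: ErrorType, is_platform_run: bool) -> Accountable:
    """Return who should fix an error of the given type (hypothetical helper)."""
    if error_type is ErrorType.PLATFORM_ERROR:
        return Accountable.PLATFORM
    if error_type is ErrorType.USER_ERROR:
        return Accountable.USER
    # CONFIG_ERROR: accountability depends on where the data job runs.
    return Accountable.PLATFORM if is_platform_run else Accountable.USER


print(accountable_for(ErrorType.CONFIG_ERROR, is_platform_run=False))
```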
Errors that are caught are wrapped in the Resolvable class. It describes who is responsible for handling the error and how it should be handled.
- resolvable_by: Indicates the resolvable type.
- resolvable_by_actual: Who is actually responsible for resolving it
- error_message: the error message
- exception: the exception related to the error
- resolved: indicates whether the error has been resolved (for example, an error may be handled in user code, in which case it is considered resolved). It should be used for informational purposes only. It may be None/empty (for example, if the error originates from a new thread spawned by a job step).
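As a rough picture of the fields listed above, the sketch below models them as a dataclass. The class name ResolvableSketch, the field types and the default value of resolved are assumptions; the real Resolvable class in VDK core may be defined differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResolvableSketch:
    """Illustrative stand-in for the Resolvable class, not the VDK definition."""

    resolvable_by: str          # the error type, e.g. "USER_ERROR"
    resolvable_by_actual: str   # who actually has to fix it, e.g. "USER"
    error_message: str
    exception: BaseException
    # May be None when the resolution state is unknown, e.g. for an error
    # raised in a thread spawned by a job step.
    resolved: Optional[bool] = False


try:
    int("not-a-number")
except ValueError as e:
    error = ResolvableSketch(
        resolvable_by="USER_ERROR",
        resolvable_by_actual="USER",
        error_message="Could not parse the input value.",
        exception=e,
    )

# User code handled the error, so it is marked as resolved.
error.resolved = True
print(error.resolvable_by_actual, error.resolved)
```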
Errors are reported by the following functions
They share similar logging behaviour.
Each one builds the error message as follows:
error_message = __build_message_for_end_user(
to_be_fixed_by,
resolvable_by_actual,
what_happened,
why_it_happened,
consequences,
countermeasures,
)
Then the error is pushed to the resolvable context and the method decides what to do with the underlying exception, e.g. throw a new one, re-throw or do nothing. The resolvable context is used to determine who is accountable for fixing the error.
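The message-building step can be pictured as follows. This is a hedged sketch: the function name build_message_sketch and the exact layout of the output are assumptions, chosen only to show how the six arguments could be combined into one message. The actual __build_message_for_end_user output may look different.

```python
def build_message_sketch(
    to_be_fixed_by: str,
    resolvable_by_actual: str,
    what_happened: str,
    why_it_happened: str,
    consequences: str,
    countermeasures: str,
) -> str:
    # Assumed layout: a header naming the error type and the accountable
    # party, followed by one labelled line per remaining field.
    lines = [
        f"{to_be_fixed_by} (to be fixed by: {resolvable_by_actual})",
        f"  What happened: {what_happened}",
        f"  Why it happened: {why_it_happened}",
        f"  Consequences: {consequences}",
        f"  Countermeasures: {countermeasures}",
    ]
    return "\n".join(lines)


message = build_message_sketch(
    "USER_ERROR",
    "USER",
    "Failed to parse the job arguments.",
    "The arguments are not valid JSON.",
    "The data job run is aborted.",
    "Fix the arguments and re-run the job.",
)
print(message)
```

Note that the caller has to supply all six pieces of information at the point where the error is reported.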
The get_blamee_overall function uses the resolvable context to determine who is accountable for fixing the error. It is called in the termination messages plugin and the notifications plugin. Both of these plugins hook into the vdk_exit method.
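One way to picture how an overall blamee could be derived from the recorded errors is sketched below. Everything here is an assumption made for illustration: the names RecordedError and overall_blamee, the rule that only unresolved errors count, and the precedence order passed as a parameter. It is not the actual get_blamee_overall implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class RecordedError:
    """Hypothetical entry in the resolvable context."""

    resolvable_by_actual: str
    resolved: Optional[bool] = False


def overall_blamee(
    errors: Iterable[RecordedError],
    precedence: Sequence[str] = ("PLATFORM", "USER"),  # assumed order
) -> Optional[str]:
    # Assumption: errors with resolved=None are treated as unresolved.
    unresolved = {e.resolvable_by_actual for e in errors if not e.resolved}
    for candidate in precedence:
        if candidate in unresolved:
            return candidate
    return None  # nothing left to blame anyone for


errors = [
    RecordedError("USER", resolved=True),
    RecordedError("PLATFORM", resolved=None),
]
print(overall_blamee(errors))
```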
Several error types are also defined. These types are re-used in plugins and user code.
Potential Areas of Improvement
- log formatting - we should let users choose between several log formatters, and possibly specify their own. This matters because VDK is used in different ways in different environments:
  - someone using VDK locally might not care about attempt ids
  - someone using VDK in a production environment might want advanced treatment of new lines
  - someone using DAGs may or may not care about parent jobs
- DAGs - we currently have no mechanism for distinguishing which job an error came from when the job is part of a DAG.
  TODO: Provide examples
- error handling - currently, the burden of providing accurate information on who is accountable for fixing errors and how to fix them is delegated to the caller. This causes two problems:
  1. The initial error message in the logs is far removed from the actual error.
  2. It's hard to find the actual root cause in the logs using the stack trace.
  Ideally, we would like error messages similar to npm's, where the root cause is stated immediately and a detailed stack trace is available elsewhere. See https://lucasfcosta.com/2022/06/01/ux-patterns-cli-tools.html
  TODO: Provide examples
- multi-threading and the resolvable context
  TODO: Sync with @tozka about this
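The log formatting idea above could be prototyped with a small registry of named formats that falls back to a user-supplied format string. This is a sketch only: the format names, the format strings and the make_formatter helper are hypothetical and are not part of VDK.

```python
import logging

# Hypothetical named formats; the real set would be decided by the VDK team.
FORMATS = {
    "detailed": (
        "%(asctime)s [%(levelname)-5.5s] %(name)s "
        "%(filename)s:%(lineno)s %(funcName)s - %(message)s"
    ),
    "minimal": "%(levelname)s %(message)s",
}


def make_formatter(name_or_format: str) -> logging.Formatter:
    # A known name selects a predefined format; anything else is treated
    # as a custom format string supplied by the user.
    return logging.Formatter(FORMATS.get(name_or_format, name_or_format))


record = logging.LogRecord(
    name="vdk.example",
    level=logging.INFO,
    pathname="/tmp/step.py",
    lineno=1,
    msg="hello",
    args=(),
    exc_info=None,
)

print(make_formatter("minimal").format(record))
print(make_formatter("%(name)s: %(message)s").format(record))
```

With this shape, a local user could pick a short format without attempt ids, while a production deployment keeps the detailed one.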
Next Steps
- Analyze existing customer data
- Check for other areas of improvement based on data
- Prioritize identified areas of improvement based on data.