VDK Structlog Integration - vmware/versatile-data-kit GitHub Wiki
Use Structlog in vdk-core, plugins and data jobs
If we make structlog a dependency of vdk-core, it raises two follow-up questions:
- How do we make sure plugins use the correct logger?
- How do we make sure user code uses the correct logger?
The naive solution is to use structlog as is in vdk-core code and pass it inside a wrapper for plugins and user code. For plugins, the wrapper can live inside CommonContext. For user code, we can expose an object instance, similar to job_input, and have user logs inside data jobs go through that custom object instance.
PRO: Simple and straightforward, doesn't add too much complexity to our existing code base.
CON: There is absolutely no guarantee that users and plugin devs will use the objects we provide and not opt for logging.getLogger().
CON: This is too much of a breaking change for existing setups. We would have to create the configuration mechanism as well and ship the two features together.
Use structlog for vdk-core, use native logging for plugins and user code
https://www.structlog.org/en/stable/api.html#structlog.stdlib.render_to_log_kwargs https://www.structlog.org/en/stable/standard-library.html#rendering-using-logging-based-formatters
We can use structlog in vdk-core and benefit from the built-in configuration system, event dicts, etc. At the same time, we can do the formatting with regular logging by writing and passing our own formatters, or reusing existing ones, e.g. JSON. This allows plugins and user code to keep using logging.getLogger().
PRO: Users always get some of the benefits of structlog, even if they only call logging.getLogger(). They can also opt into structlog directly if they want more, for example when a plugin wants to add to the event dict.
CON: Structlog is still a dependency of core
CON: We potentially have to write our own formatters
Create a vdk-structlog plugin and use native logging for vdk-core, plugins and user code
This approach is similar to the above one, but exports all the structlog configuration to a vdk-plugin. We can create the plugin that adds structlog and then use native logging in the whole vdk-codebase.
PRO: Structlog is not a dependency of vdk-core
PRO: Ensures users will always benefit from structlog (not necessarily true for vdk-core devs 😢)
CON: Hard to pass values in event dicts outside of configuration, e.g. we can add stuff in the processor chain, but for passing key-value pairs on the fly, we would have to somehow fetch the extra params, get the event dict and add to it.
This also requires adding a custom processor to the processor chain that reads extra and adds its entries to the event dict. The event dict is then passed back into extra because of render_to_log_kwargs.
CON: Requires a POC. We are not sure whether the configuration will work as a plugin, or whether passing values from regular logging into the event dict is possible.
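The custom processor mentioned in the CON above could look roughly like the sketch below. A structlog processor is a plain callable taking (logger, method_name, event_dict), so the sketch needs only the standard library. It assumes the stdlib LogRecord is available in the event dict under the `_record` key, which is how structlog's ProcessorFormatter appears to hand over records from regular logging; that assumption is exactly what the POC would need to confirm. The function name `add_extra_to_event_dict` is hypothetical. Recent structlog releases also seem to ship a similar processor (structlog.stdlib.ExtraAdder), which would be worth checking before writing our own.

```python
import logging

# Attribute names found on every LogRecord; anything else was passed via `extra`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {
    "message",
    "asctime",
}


def add_extra_to_event_dict(logger, method_name, event_dict):
    """Processor sketch: copy `extra` fields of a stdlib record into the event dict."""
    # Assumption: the originating LogRecord is stored under the "_record" key.
    record = event_dict.get("_record")
    if record is not None:
        for key, value in record.__dict__.items():
            # Do not overwrite keys that are already in the event dict.
            if key not in _STANDARD_ATTRS and key not in event_dict:
                event_dict[key] = value
    return event_dict


# Simulate a plugin or data job logging with plain `logging` and `extra`.
record = logging.getLogger("plugin").makeRecord(
    "plugin", logging.INFO, "job.py", 1, "ingesting", (), None,
    extra={"step": "30_ingest_to_table.py"},
)
event = add_extra_to_event_dict(
    None, "info", {"event": record.getMessage(), "_record": record}
)
```

After the call, `event` carries the `step` key next to the message, without any of the standard record attributes.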
Configuration
Logging configuration should be relatively straightforward with vdk's existing mechanism and structlog configuration. We should support some new config.ini values, for example
vdk_log_metadata=[timestamp, level, other stuff]
vdk_log_format=json|console|something else
We should also set the log level using existing log level configuration.
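As a rough illustration of how these values could be consumed, the sketch below parses them with configparser and turns the metadata list into a stdlib format string. The `[vdk]` section name, the `log_level` key, the metadata names and the `METADATA_FIELDS` mapping are all assumptions made for the example, not existing vdk configuration.

```python
import configparser
import logging

# Hypothetical config.ini content; section and key names are assumptions.
SAMPLE_CONFIG = """
[vdk]
vdk_log_metadata = timestamp, level, logger_name
vdk_log_format = console
log_level = DEBUG
"""

# Hypothetical mapping from metadata names to stdlib format fields.
METADATA_FIELDS = {
    "timestamp": "%(asctime)s",
    "level": "[%(levelname)s]",
    "logger_name": "%(name)s",
    "file_name": "%(filename)s",
}

SUPPORTED_FORMATS = ("json", "console")


def parse_log_config(text):
    """Return (metadata list, format name, numeric log level) from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["vdk"]

    raw = section.get("vdk_log_metadata", "")
    metadata = [item.strip() for item in raw.split(",") if item.strip()]
    unknown = [item for item in metadata if item not in METADATA_FIELDS]
    if unknown:
        raise ValueError(f"Unsupported log metadata: {unknown}")

    log_format = section.get("vdk_log_format", "console")
    if log_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported log format: {log_format}")

    # getLevelName maps a known level name to its numeric value.
    level = logging.getLevelName(section.get("log_level", "INFO").upper())
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {level}")
    return metadata, log_format, level


def build_console_format(metadata):
    """Build a logging format string with the message always last."""
    return " ".join([METADATA_FIELDS[name] for name in metadata] + ["%(message)s"])


metadata, log_format, level = parse_log_config(SAMPLE_CONFIG)
console_format = build_console_format(metadata)
```

Validating the metadata names up front means a typo in the configuration fails at startup instead of silently producing logs with missing fields.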
Custom Filters
We should write our own filters for our custom metadata values. For example, if we want to filter out the step name, there is no structlog filter that will do it for us out of the box.
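A minimal sketch of such a filter, using the standard logging.Filter interface, is shown below. It reads "filter out the step name" as dropping records that come from particular steps, which is one possible interpretation. It also assumes the step name reaches the record as a `step` attribute (for example through `extra`); both the attribute name and the `StepNameFilter` class are hypothetical.

```python
import logging


class StepNameFilter(logging.Filter):
    """Drop records emitted by any of the excluded data job steps."""

    def __init__(self, excluded_steps):
        super().__init__()
        self.excluded_steps = set(excluded_steps)

    def filter(self, record):
        # `step` is a hypothetical attribute set via extra={"step": ...}.
        # Records without a step are always kept.
        return getattr(record, "step", None) not in self.excluded_steps


class ListHandler(logging.Handler):
    """Collects messages in memory so the example is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
handler.addFilter(StepNameFilter(["10_create_table.sql"]))

log = logging.getLogger("vdk.filter_example")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info("creating table", extra={"step": "10_create_table.sql"})
log.info("ingesting rows", extra={"step": "30_ingest_to_table.py"})
log.info("no step attached")
```

Only the second and third messages reach the handler; the record from the excluded step is dropped.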
Custom Formatters
We should support a json formatter for the regular logger, or structlog. How to do this is explicitly stated in the structlog documentation and is relatively straightforward. We should still look into it in case we go for the custom event dict values in Option 3.
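For the regular-logger side, a JSON formatter can be sketched with the standard library alone, as below. This is not the recipe from the structlog documentation referred to above; it is a stdlib-only stand-in to show the shape of the output. The `JsonFormatter` class, the field names in the payload and the `extra_keys` parameter are assumptions made for the example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render a record as one JSON object per line."""

    def __init__(self, extra_keys=()):
        super().__init__()
        # Names of custom record attributes to include, e.g. "step".
        self.extra_keys = tuple(extra_keys)

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.extra_keys:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


record = logging.LogRecord(
    "vdk.job", logging.INFO, "job.py", 1, "Loaded %d rows", (42,), None
)
record.step = "30_ingest_to_table.py"  # hypothetical custom metadata
line = JsonFormatter(extra_keys=["step"]).format(record)
```

Listing the extra keys explicitly keeps the output stable. If we go for custom event dict values in Option 3, the list could instead be derived from the configured metadata.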
Additionally, whatever option we pick for integrating, we should also write our own console formatter. Custom entries to the event dict are logged after the logging message, which is not very convenient.
We want something like
2023-09-26 18:19:58 [info ] 30_ingest_to_table.py Doing something awesome over here, boss
instead of
2023-09-26 18:19:58 [info ] Doing something awesome over here, boss step=30_ingest_to_table.py
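A console formatter that produces the first layout could be sketched as follows with a plain logging.Formatter. The `StepFirstConsoleFormatter` name and the `step` record attribute are hypothetical, the level is padded to five characters to match the sample above, and exception rendering is left out for brevity.

```python
import calendar
import logging
import time


class StepFirstConsoleFormatter(logging.Formatter):
    """Print custom metadata (here: the step name) before the message."""

    # UTC timestamps keep the example reproducible; local time is the default.
    converter = time.gmtime

    def format(self, record):
        timestamp = self.formatTime(record, "%Y-%m-%d %H:%M:%S")
        level = f"[{record.levelname.lower():<5}]"
        parts = [timestamp, level]
        # `step` is a hypothetical attribute set via extra={"step": ...}.
        step = getattr(record, "step", None)
        if step:
            parts.append(step)
        parts.append(record.getMessage())
        return " ".join(parts)


record = logging.LogRecord(
    "vdk.job", logging.INFO, "30_ingest_to_table.py", 1,
    "Doing something awesome over here, boss", (), None,
)
record.step = "30_ingest_to_table.py"
# Pin the creation time to the timestamp used in the sample line.
record.created = calendar.timegm((2023, 9, 26, 18, 19, 58, 0, 0, 0))

line = StepFirstConsoleFormatter().format(record)
# line == "2023-09-26 18:19:58 [info ] 30_ingest_to_table.py Doing something awesome over here, boss"
```

Records without a step fall back to timestamp, level and message, so the same formatter works for vdk-core logs emitted outside a step.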
Metadata
We should decide what kind of metadata to support for logs, e.g. timestamps, level, file name, class, step number...