VDK Structlog Integration - vmware/versatile-data-kit GitHub Wiki

Use Structlog in vdk-core, plugins and data jobs

If we make structlog a dependency of vdk-core, it raises two follow-up questions.

  • how do we make sure plugins use the correct logger
  • how do we make sure user code uses the correct logger

The naive solution is to use structlog as is in vdk-core code and pass it inside a wrapper for plugins and user code. For plugins, the wrapper can live inside CommonContext. For user code, we can expose an object instance, similar to job_input, and have data jobs do their logging through that instance.

PRO: Simple and straightforward, doesn't add too much complexity to our existing code base.

CON: There is absolutely no guarantee that users and plugin devs will use the objects we provide and not opt for logging.getLogger().

CON: This is too much of a breaking change for existing setups. We would have to create the configuration mechanism as well and ship the two features together.

Use structlog for vdk-core, use native logging for plugins and user code

https://www.structlog.org/en/stable/api.html#structlog.stdlib.render_to_log_kwargs
https://www.structlog.org/en/stable/standard-library.html#rendering-using-logging-based-formatters

We can use structlog in vdk-core and benefit from the built-in configuration system, event dicts, etc. At the same time, we can do the formatting with regular logging by writing and passing our own formatters, or already existing ones, e.g. JSON. This allows plugins and user code to still use logging.getLogger().

PRO: Ensures that users will always benefit from structlog. They can use structlog directly IF they want to, for example if plugins want to add to the event dict, but they still get some of the benefits if they keep using logging.getLogger().

CON: Structlog is still a dependency of core

CON: We potentially have to write our own formatters

Create a vdk-structlog plugin and use native logging for vdk-core, plugins and user code

This approach is similar to the one above, but moves all the structlog configuration into a vdk plugin. We can create a plugin that adds structlog and then use native logging in the whole vdk codebase.

PRO: Structlog is not a dependency of vdk-core

PRO: Ensures users will always benefit from structlog (not necessarily true for vdk-core devs 😢)

CON: Hard to pass values in event dicts outside of configuration, e.g. we can add stuff in the processor chain, but for passing key-value pairs on the fly, we would have to somehow fetch the extra params, get the event dict and add to it.

This also requires adding a custom processor to the processor chain that reads extra and adds its entries to the event dict. The event dict is then passed back into extra because of render_to_log_kwargs.
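A minimal sketch of such a processor, to make the idea concrete. It relies only on the fact that a structlog processor is a callable taking (logger, method_name, event_dict). It assumes the stdlib record is available under the `_record` key, which is how structlog's ProcessorFormatter appears to expose records coming from regular logging; the function name and the `step` key are hypothetical. Recent structlog versions also seem to ship a similar processor, structlog.stdlib.ExtraAdder, which is worth checking before writing our own.

```python
import logging

# Attribute names every LogRecord carries; anything else came in through `extra`.
_STANDARD_ATTRS = set(
    logging.LogRecord("", 0, "", 0, "", (), None).__dict__
) | {"message", "asctime"}


def add_extra_to_event_dict(logger, method_name, event_dict):
    """structlog-style processor: copy `extra` values into the event dict."""
    # Assumption: the stdlib record is stored under "_record".
    record = event_dict.get("_record")
    if record is None:
        return event_dict
    for key, value in record.__dict__.items():
        if key not in _STANDARD_ATTRS:
            # Do not overwrite keys that are already in the event dict.
            event_dict.setdefault(key, value)
    return event_dict


# Simulate what logging.getLogger("demo").info("ingesting", extra={...}) produces.
record = logging.getLogger("demo").makeRecord(
    "demo", logging.INFO, "demo.py", 0, "ingesting", (), None,
    extra={"step": "30_ingest_to_table.py"},
)
event_dict = add_extra_to_event_dict(
    None, "info", {"event": "ingesting", "_record": record}
)
```

After the call, `event_dict` carries the `step` entry next to `event`, while standard record attributes such as `levelname` are left out.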

CON: Requires POC, not sure if config will work as a plugin or if passing from regular logging to the dict is a thing.

Configuration

Logging configuration should be relatively straightforward with vdk's existing mechanism and structlog configuration. We should support some new config.ini values, for example:

vdk_log_metadata=[timestamp, level, other stuff]
vdk_log_format=json|console|something else

We should also set the log level using existing log level configuration.
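To illustrate how the two proposed values could be read, here is a rough sketch using only the standard library. It does not use vdk's real configuration mechanism; the `[vdk]` section name, the defaults, the `step_name` entry and the set of supported formats are all assumptions made for the example.

```python
import configparser

# Hypothetical set of supported formats; the final list is still undecided.
SUPPORTED_FORMATS = {"json", "console"}

SAMPLE_CONFIG = """
[vdk]
vdk_log_metadata = timestamp, level, step_name
vdk_log_format = json
"""


def read_logging_config(text):
    """Return (metadata fields, log format) parsed from ini-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw_metadata = parser.get("vdk", "vdk_log_metadata", fallback="timestamp, level")
    log_format = parser.get("vdk", "vdk_log_format", fallback="console")
    # Accept both "a, b" and "[a, b]" spellings of the list.
    metadata = [
        field.strip() for field in raw_metadata.strip("[] ").split(",") if field.strip()
    ]
    log_format = log_format.strip().lower()
    if log_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported vdk_log_format: {log_format!r}")
    return metadata, log_format


metadata, log_format = read_logging_config(SAMPLE_CONFIG)
```

Failing fast on an unknown format is one possible design choice; falling back to the console format with a warning would be another.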

Custom Filters

We should write our own filters for our custom metadata values. For example, if we want to filter out the step name, there is no structlog filter that will do it for us out of the box.
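One possible reading of this is a filter that drops records coming from particular steps. The sketch below uses a plain logging.Filter; the `vdk_step_name` record attribute and the class name are hypothetical and would depend on how we end up attaching metadata to records.

```python
import logging


class StepNameFilter(logging.Filter):
    """Drop records that were emitted from one of the excluded steps."""

    def __init__(self, excluded_steps):
        super().__init__()
        self.excluded_steps = set(excluded_steps)

    def filter(self, record):
        # `vdk_step_name` is a hypothetical attribute; records without it pass.
        return getattr(record, "vdk_step_name", None) not in self.excluded_steps


step_filter = StepNameFilter(["10_noisy_step.py"])
noisy = logging.makeLogRecord(
    {"msg": "chatter", "vdk_step_name": "10_noisy_step.py"}
)
useful = logging.makeLogRecord(
    {"msg": "progress", "vdk_step_name": "30_ingest_to_table.py"}
)
plain = logging.makeLogRecord({"msg": "no step attached"})
```

Attaching it with `handler.addFilter(step_filter)` would apply it to every record that reaches that handler, including records from plugins and user code.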

Custom Formatters

We should support a json formatter for the regular logger, or structlog. How to do this is explicitly stated in the structlog documentation and is relatively straightforward. We should still look into it in case we go for the custom event dict values in the third option (the vdk-structlog plugin).
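For the regular logger, a json formatter could look roughly like the sketch below. It is stdlib-only and is not the approach from the structlog documentation; the class name, the output keys and the `step` metadata key are assumptions for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render a record as one JSON object per line."""

    def __init__(self, metadata_keys=("step",)):
        super().__init__()
        # Hypothetical custom metadata keys copied from the record if present.
        self.metadata_keys = tuple(metadata_keys)

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.metadata_keys:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        # default=str keeps non-serializable values from breaking logging.
        return json.dumps(payload, default=str)


record = logging.makeLogRecord({
    "name": "demo",
    "levelname": "INFO",
    "levelno": logging.INFO,
    "msg": "loaded %d rows",
    "args": (42,),
    "step": "30_ingest_to_table.py",
})
line = JsonFormatter().format(record)
```

Because this is an ordinary logging.Formatter, it also applies to anything logged through logging.getLogger() in plugins and data jobs.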

Additionally, whatever option we pick for integrating, we should also write our own console formatter. Custom entries to the event dict are logged after the logging message, which is not very convenient.

We want something like

2023-09-26 18:19:58 [info     ] 30_ingest_to_table.py Doing something awesome over here, boss

instead of

2023-09-26 18:19:58 [info     ] Doing something awesome over here, boss step=30_ingest_to_table.py
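A console formatter that puts the step name before the message could be sketched as follows. This is a stdlib logging.Formatter rather than a structlog renderer, and it assumes the step name arrives on the record as a hypothetical `step` attribute (for example via `extra`).

```python
import logging


class StepFirstConsoleFormatter(logging.Formatter):
    """Console format: timestamp, padded level, step name, then the message."""

    def format(self, record):
        timestamp = self.formatTime(record, "%Y-%m-%d %H:%M:%S")
        # Pad the level to 9 characters so messages line up: "[info     ]".
        level = f"[{record.levelname.lower():<9}]"
        parts = [timestamp, level]
        step = getattr(record, "step", None)  # hypothetical metadata attribute
        if step:
            parts.append(str(step))
        parts.append(record.getMessage())
        return " ".join(parts)


record = logging.makeLogRecord({
    "name": "demo",
    "levelname": "INFO",
    "levelno": logging.INFO,
    "msg": "Doing something awesome over here, boss",
    "step": "30_ingest_to_table.py",
})
line = StepFirstConsoleFormatter().format(record)
```

In a data job this would be triggered by something like `log.info("Doing something awesome over here, boss", extra={"step": "30_ingest_to_table.py"})`, with the formatter set on the console handler. Records without a step are rendered with just the timestamp, level and message.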

Metadata

We should decide what kind of metadata to support for logs, e.g. timestamps, level, file name, class, step number...