Interfaces

Versatile Data Kit provides the following interfaces to enable building data pipelines:

Data Jobs Development Kit (VDK)

It enables engineers to develop, test, and run Data Jobs on a local machine. It comes with common functionality for data ingestion and processing, such as the following:

Data Jobs

Data Jobs are the central interface around which all the others revolve. A Data Job represents a single unit of work - it can be a project (a job can have many files) or a single SQL query. This is where the main business logic is written. Data Jobs provide the ability to write SQL-only transformations or complex Python steps and to run them both locally and in the cloud.
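
A minimal Python step might look like the sketch below. VDK calls the `run(job_input)` function of every Python step file; the query and table name here are illustrative and assume a configured database connection.

```python
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # VDK discovers the run() function in each Python step file and calls it,
    # passing a job_input object that exposes queries, ingestion, properties, etc.
    # The query and table name are illustrative.
    rows = job_input.execute_query("SELECT COUNT(1) FROM example_source_table")
    print(f"example_source_table currently has {rows[0][0]} rows")
```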

CLI for running and developing data jobs

The CLI can be used to run a data job locally, or it can be attached to a debugger to debug jobs. Users can use plugins to mock almost any part of a data job, enabling easier debugging.

See more in the hello world example

For more, also check the CLI help after installing the SDK using `vdk --help`

Sending data for ingestion from different sources to different destinations

It's easy to build plugins that send data to different destinations like Snowflake, Redshift, or ETL tools like Fivetran/Stitch.
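
As a sketch, a Python step can queue a payload for ingestion roughly like this (the payload and destination table are illustrative; where the data actually lands depends on the configured ingestion plugin):

```python
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    payload = {"id": 1, "name": "example", "ingested_at": "2021-01-01T00:00:00Z"}
    # Queues the object for ingestion; the actual destination is decided by
    # the configured ingestion plugin (e.g. a warehouse or an HTTP endpoint).
    job_input.send_object_for_ingestion(
        payload=payload,
        destination_table="example_destination_table",
    )
```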

See more in SDK IIngester and SDK IIngesterPlugin

See the list of currently developed plugins in the plugins directory

Transforming raw data into a dimensional model

Some common templates VDK provides are for Slowly Changing Dimension type 1 and type 2.

Users can also build their own templates, enabling them to easily establish common ways of transforming data.
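
As an illustrative sketch (the template name and argument names vary by database plugin - check your plugin's documentation for the exact contract), a Python step could invoke such a template like this:

```python
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Executes a pre-built transformation template registered by a plugin.
    # 'scd1' and the argument names below are illustrative.
    job_input.execute_template(
        template_name="scd1",
        template_args={
            "source_schema": "staging",
            "source_view": "vw_customers",
            "target_schema": "dw",
            "target_table": "dim_customers",
        },
    )
```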

See more in SDK ITemplate; new templates can be registered as plugins using ITemplateRegistry

See usage examples in Data Processing Templates Examples

Plugin interfaces

The extensibility system is designed to allow plugging in at almost any point of the CLI's execution. It is meant to make it easy to customize the CLI for any organization's use case. It is also meant to make it easy to integrate with any open source tool - like dbt for SQL-only transformations, Dagster for functional dataflow, or great_expectations for data quality checks.
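
As a sketch of the mechanism, a plugin implements hook functions marked with the `hookimpl` decorator from vdk-core (the configuration option below is made up):

```python
from vdk.api.plugin.hookimpl import hookimpl


class ExamplePlugin:
    @hookimpl
    def vdk_configure(self, config_builder) -> None:
        # Contribute a new configuration option that jobs and other hooks can read.
        # The option name is purely illustrative.
        config_builder.add(
            key="example_option",
            default_value="some-default",
            description="Illustrative option contributed by ExamplePlugin.",
        )
```

A plugin like this is packaged as a separate Python package and registered via a setuptools entry point so that the vdk CLI discovers it at startup; the Plugin README describes the packaging steps.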

See more in the Plugin README

Keeping state, configuration and secrets

Properties are variables that are passed to your data job and can be updated by your data job.

You can use them to:

  • Control the behavior of jobs and pipelines.
  • Store state you want to re-use. For example, you can store the last ingestion timestamp for an incremental ingestion job.
  • Keep credentials like passwords and API keys.
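
For example, an incremental ingestion job could read and update its watermark as in the sketch below (the property name and values are illustrative):

```python
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Read the last processed timestamp, falling back to a default on the first run.
    last_ts = job_input.get_property("last_ingested_timestamp", "2020-01-01T00:00:00Z")

    # ... ingest only records newer than last_ts ...

    # Persist the new watermark so the next run continues from here.
    properties = job_input.get_all_properties()
    properties["last_ingested_timestamp"] = "2021-01-01T00:00:00Z"
    job_input.set_all_properties(properties)
```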

See more in Properties REST API
See more in SDK IProperties
See more in SDK Plugin interface for properties

Reusable Job Templates

Templates are pieces of reusable code that are common across the use cases of different users of Versatile Data Kit. Templates are executed in the context of a Data Job. They provide an easy solution to common tasks like loading data into a data warehouse.
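
As a sketch of how a plugin can register its own template (the template name and directory are illustrative; the hook and registry follow the SDK interfaces linked below):

```python
import pathlib

from vdk.api.plugin.hookimpl import hookimpl


class ExampleTemplatePlugin:
    @hookimpl
    def initialize_job(self, context) -> None:
        # Register a directory of steps as a reusable template under a name
        # that data jobs can later pass to job_input.execute_template().
        # The template name and directory layout here are illustrative.
        context.templates.add_template(
            "example/load-dimension",
            pathlib.Path(__file__).parent / "templates" / "load-dimension",
        )
```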

See more in SDK ITemplateRegistry
See more in SDK ITemplate

See usage examples in Data Processing Templates Examples

Parameterized SQL

SQL steps can be parameterized

select * from {table_name}

The table_name placeholder will be automatically replaced by looking up Data Job Arguments or Data Job Properties.
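
For instance (a sketch; names are illustrative), the value can come from a property set in an earlier Python step, or it can be passed as a job argument on the command line (e.g. via the `--arguments` option of `vdk run`):

```python
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # A later SQL step that contains {table_name} will automatically pick up
    # this value. The property name and value are illustrative.
    properties = job_input.get_all_properties()
    properties["table_name"] = "example_table"
    job_input.set_all_properties(properties)
```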

Control Service

The Control Service provides a REST API that enables creating, deploying, managing, and executing Data Jobs in a Kubernetes runtime environment.

It also provides a CLI for interacting with it in a more user-friendly way.

Job Lifecycle API

The API reflects the usual Data Application (or in our case Data Job) lifecycle that we've observed:

  1. Create a new data job (a webhook can further configure the job in a common way - e.g., authorize its creation, set up permissions, etc.).
  2. Develop and run the data job locally.
  3. Deploy the data job in a cloud runtime environment to run on a scheduled basis.
  4. Monitor and operate the job.

See more in API docs

Job Deployment API

Data Jobs can be automatically versioned, deployed, and managed. The API can currently be used through the CLI.

The API is designed (but not yet implemented) to support strict separation of config from code, where a single source code version can have multiple deployments with different configurations.

The API is meant to allow integration with a UI (e.g., a UI IDE) to enable non-engineers to contribute.

See more in Deployment API docs

Job Execution API

The Execution API enables managing executions of data jobs. It keeps track of which version of the code was used and which configuration it was started with. It is also designed to be the integration point with workflow tools like Airflow.
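
As a rough sketch of how a workflow tool could start an execution over the REST API - the host, endpoint path, request body, and authentication below are assumptions for illustration only; consult the Execution API docs for the actual contract:

```python
import requests

# All values below are illustrative assumptions - replace them with your
# Control Service URL, team, job name, deployment id, and OAuth2 access token.
BASE_URL = "https://control-service.example.com/data-jobs"
TEAM, JOB, DEPLOYMENT = "my-team", "my-job", "production"

response = requests.post(
    f"{BASE_URL}/for-team/{TEAM}/jobs/{JOB}/deployments/{DEPLOYMENT}/executions",
    headers={"Authorization": "Bearer <access-token>"},
    json={"args": {}},
)
response.raise_for_status()
print("Execution started:", response.headers.get("Location"))
```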

See more in Execution API docs

GraphQL Jobs Query API

List Data Jobs with a GraphQL-like query, enabling you to inspect job definitions, deployments, and executions for your team or across teams. The API is designed to make it easy to develop a UI on top of it.
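
A sketch of querying it from Python; the endpoint path, query fields, and authentication are illustrative assumptions - see the Jobs Query API docs for the real schema:

```python
import requests

# Illustrative values - replace them with your Control Service URL, team,
# and OAuth2 access token; field names depend on the actual GraphQL schema.
BASE_URL = "https://control-service.example.com/data-jobs"
TEAM = "my-team"

query = """
{
  jobs(pageNumber: 1, pageSize: 10) {
    content {
      jobName
      deployments { enabled }
    }
  }
}
"""

response = requests.get(
    f"{BASE_URL}/for-team/{TEAM}/jobs",
    params={"query": query},
    headers={"Authorization": "Bearer <access-token>"},
)
response.raise_for_status()
print(response.json())
```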

See more in Jobs Query API docs

Teams support

Group data jobs that belong to a single team together. You can also apply authorization rules using access control webhooks. This enables multiple teams to collaborate and yet work independently - think Distributed Data Mesh.

SSO support

The Control Service supports OAuth2-based authorization of all operations, making it easy to integrate with a company's SSO.

See more in the security section of the Control Service Helm chart

Access Control

Access control can be configured using OAuth2, based on claim values. See more in the Helm chart configuration

For more complex use cases, Access Control Webhooks enable the creation of custom rules for who is allowed to perform which operations in the Control Service. See more in the Authorization webhook configuration

Auditing

One can subscribe to webhooks for all operations executed by the Control Service.

See more in Webhooks configuration

Monitoring metrics

The Control Service is designed to integrate easily with different monitoring systems like Prometheus, Wavefront, etc. It is easiest to integrate with Prometheus due to its tight integration with Kubernetes.

See the list of supported metrics here
See more in the monitoring configuration