Install VDK Control Service with custom SDK - vmware/versatile-data-kit GitHub Wiki
Overview
In this tutorial, we will install the Versatile Data Kit Control Service using a custom-built SDK.
This SDK will be used automatically by all Data Jobs deployed to the Control Service, and any change to the SDK is applied to all deployed data jobs automatically (starting from their next run).
Prerequisites
Listed here are the minimum prerequisites for installing the VDK Control Service with a custom SDK.
- 1. Git and Docker repository.
- 2. Python (PyPi) repository
- 3. Kubernetes and Helm
- Optional integrations
More details, and one example of how each can be set up, follow below.
1. Git and Docker repository.
This tutorial assumes GitHub will be used, as GitHub provides both a Docker (container) registry and Git repositories. Any other Docker registry and Git repository would also work.
Go to https://github.com/new and create a repository. For this example, we have created "github.com/tozka/demo-vdk.git"
1.2. Generate GitHub Token.
You will need this GitHub token later, so make sure to save it in a known place.
Make sure you grant permissions for both repo and packages (as we will use the token for both the Git repository and the container registry).
2. Python (PyPi) repository
This is where we will release (upload) our custom SDK. For POC purposes we will use https://test.pypi.org
- Create an account using https://test.pypi.org/account/register/
- Go to https://test.pypi.org/manage/account/
- Click Add API Token and generate new API Token (you will need it later, save it for now)
3. Kubernetes and Helm
We need a Kubernetes cluster to run the Control Service, and Helm to install it.
In production, you may want to use a managed cloud offering like GKE, TKG, EKS or another three-letter abbreviation ...
In this example though, we will use kind and set up things locally.
- First, install kind
- Create a demo cluster using:
kind create cluster --name demo
Optional integrations
VDK comes with optional integrations with third-party systems that provide additional value and can be enabled through configuration alone.
These will not be covered in this tutorial. Start a new discussion or contact us on Slack about how to integrate them, since the options are not as clearly documented as we'd like.
1. External Logging
All job logs can be forwarded to a centralized logging system.
Prerequisites: syslog or Fluentd
2. Notifications
An SMTP server for mail notifications. It is configured in both the SDK and the Control Service.
Prerequisites: SMTP Server
3. Integration with a monitoring system (e.g. Prometheus).
See the list of supported metrics here, and more in the monitoring configuration.
Prerequisites: Prometheus or Wavefront or similar
4. Advanced Alerting rules
You can define more advanced monitoring rules. The Helm chart comes with prepared PrometheusRules (e.g. Job Delay alerting) that can be used with Alertmanager and Prometheus.
Prerequisites: the out-of-the-box rules require Alertmanager
5. SSO Support
It supports OAuth2-based authorization of all operations, making it easy to integrate with a company SSO. Authorization using claims is also supported.
See more in security section of Control Service Helm chart
Prerequisites: OAuth2
6. Access Control Webhooks
Access Control Webhooks enable creating more complex rules about who is allowed to perform which operations in the Control Service (for cases where OAuth2 is not enough).
Prerequisites: Webhook endpoint
Install Versatile Data Kit with custom SDK
Here we will install the Versatile Data Kit.
First, we will create our custom SDK. This is a very simple process; if you are familiar with Python packaging using setuptools, you will find these steps trivial.
1. Create custom VDK
NOTE: You can skip this if you do not want to create a custom SDK. Quickstart VDK is such a custom SDK, which can be used to get started quickly.
1. Create a directory for our SDK
mkdir my-org-vdk
cd my-org-vdk
Note that you should change the my-org-vdk name to something appropriate to your organisation.
2. Create and edit setup.py
Open setup.py in your favorite IDE.
We want to create an SDK that will support
- Database queries to both Postgres and Snowflake
- Ingesting data into Postgres and Snowflake, as well as over HTTP and into files.
- Control Service Operations - deploying data jobs.
In install_requires we specify the plugins we need to achieve that:
import setuptools

setuptools.setup(
    name="my-org-vdk",
    version="1.0",
    install_requires=[
        "vdk-core",
        "vdk-plugin-control-cli",
        "vdk-postgres",
        "vdk-snowflake",
        "vdk-ingest-http",
        "vdk-ingest-file",
    ],
)
Note that you should change the package name to something appropriate to your organisation, and amend subsequent commands to refer to that name instead of my-org-vdk.
3. Upload our SDK distribution to a PyPI repository
In order for our Python SDK to be installable and usable, we need to release it.
- First, we build and package it:
python setup.py sdist --formats=gztar
- Then we upload it to test.pypi.org. Fill in PIP_REPO_UPLOAD_USER_NAME and PIP_REPO_UPLOAD_USER_PASSWORD using the account from step 2 of the Prerequisites section (for an API token, the username is __token__ and the password is the token value).
twine upload --repository-url https://test.pypi.org/legacy/ -u "$PIP_REPO_UPLOAD_USER_NAME" -p "$PIP_REPO_UPLOAD_USER_PASSWORD" dist/my-org-vdk-1.0.tar.gz
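Optionally, before or after uploading you can sanity-check which files went into the archive. A small sketch (the dist/ path below follows the name and version in setup.py; adjust it to match yours):

```python
# Inspect the built source distribution produced by `python setup.py sdist`.
import os
import tarfile

def list_sdist(path):
    """Return the member names packaged inside a .tar.gz sdist."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()

sdist = "dist/my-org-vdk-1.0.tar.gz"
if os.path.exists(sdist):
    for name in list_sdist(sdist):
        print(name)
```

You should see setup.py and your package files listed; if something is missing, it will also be missing for everyone who installs the SDK.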
2. Create SDK Docker image
We need to create a simple docker image with our SDK installed which will be used by all jobs managed by VDK Control Service.
1. Create Dockerfile with our SDK installed
Open an empty Dockerfile-vdk-base with a text editor or IDE.
The content of the Dockerfile is simply this:
FROM python:3.7-slim
WORKDIR /vdk
ARG vdk_version=1.0
ENV VDK_VERSION=$vdk_version
# Install VDK
RUN pip install --extra-index-url https://test.pypi.org/simple "my-org-vdk==$VDK_VERSION"
As you can see it's pretty basic. We just want to install VDK.
2. Build and publish the Docker image
First, we need to log in to the Github Container Registry. Export the following environment variable:
export CR_PAT=*Github Personal Access Token*
and replace *GitHub Personal Access Token* with the token you created earlier.
Then, run the following command:
echo $CR_PAT | docker login ghcr.io -u USERNAME --password-stdin
Make sure to tag it both with the version of the SDK and with the tag "release".
For example (replace with your own GitHub repo created in prerequisite):
docker build -t ghcr.io/tozka/my-org-vdk:1.0 -t ghcr.io/tozka/my-org-vdk:release -f Dockerfile-vdk-base .
docker push ghcr.io/tozka/my-org-vdk:release
docker push ghcr.io/tozka/my-org-vdk:1.0
3. Install Versatile Data Kit Control Service with Helm.
Here it is time to put everything together.
1. Create and edit new file values.yaml
Here we will use the GitHub token, account name, and repo created in step 2 of the Prerequisites.
We need to export the following variables:
export GITHUB_ACCOUNT_NAME=*your account name*
export GITHUB_URL=*URL of the repo you created earlier*
export GITHUB_TOKEN=*the GitHub token you generated earlier*
The content of the values.yaml is:
resources:
  limits:
    memory: 0
  requests:
    memory: 0
cockroachdb:
  statefulset:
    resources:
      limits:
        memory: 0
      requests:
        memory: 0
  init:
    resources:
      limits:
        cpu: 0
        memory: 0
      requests:
        cpu: 0
        memory: 0
deploymentGitUrl: "${GITHUB_URL}"
deploymentGitUsername: "${GITHUB_ACCOUNT_NAME}"
deploymentGitPassword: "${GITHUB_TOKEN}"
uploadGitReadWriteUsername: "${GITHUB_ACCOUNT_NAME}"
uploadGitReadWritePassword: "${GITHUB_TOKEN}"
deploymentDockerRegistryType: generic
deploymentDockerRegistryUsernameReadOnly: "${GITHUB_ACCOUNT_NAME}"
deploymentDockerRegistryPasswordReadOnly: "${GITHUB_TOKEN}"
deploymentDockerRegistryUsername: "${GITHUB_ACCOUNT_NAME}"
deploymentDockerRegistryPassword: "${GITHUB_TOKEN}"
deploymentDockerRepository: "ghcr.io/${GITHUB_ACCOUNT_NAME}/data-jobs/demo-vdk"
proxyRepositoryURL: "ghcr.io/${GITHUB_ACCOUNT_NAME}/data-jobs/demo-vdk"
deploymentVdkDistributionImage:
  registryUsernameReadOnly: "${GITHUB_ACCOUNT_NAME}"
  registryPasswordReadOnly: "${GITHUB_TOKEN}"
  registry: ghcr.io/${GITHUB_ACCOUNT_NAME}
  repository: "my-org-vdk"
  tag: "release"
security:
  enabled: False
2. Install VDK Helm chart
helm repo add vdk-gitlab https://gitlab.com/api/v4/projects/28814611/packages/helm/stable
helm repo update
helm install my-vdk-runtime vdk-gitlab/pipelines-control-service -f values.yaml
3. Expose Control Service API
In order to access the application from our browser, we need to expose it using the kubectl port-forward command:
kubectl port-forward service/my-vdk-runtime-svc 8092:8092
Note that this command does not return, and you will need to open a new terminal window to proceed.
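With the port-forward running, you can quickly check from the other terminal that the API is reachable. Any HTTP status code in the response means the tunnel works (the exact endpoints are described by the service's API spec):

```shell
# Print the HTTP status code returned by the Control Service root URL.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8092/
```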
Use
Now let's see how data or analytics engineers in our organization would use it to create, develop and deploy jobs:
Install custom VDK
pip install --extra-index-url https://test.pypi.org/simple/ my-org-vdk
Configure VDK to know about Control Service
export VDK_CONTROL_SERVICE_REST_API_URL=http://localhost:8092
Create a sample data job
This will create a data job and register it in the Control Service. Locally it will create a directory with sample files of a data job:
vdk create --name example --team my-team --path .
Develop the data job
Browse the files in the example directory
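For illustration, a step inside the example job might look like the sketch below. It assumes VDK's convention of running each Python step's run(job_input) function and the send_object_for_ingestion ingestion API; the step filename, payload, and table name are hypothetical:

```python
# 20_ingest_example.py -- hypothetical step inside the "example" data job.
# VDK executes steps in filename order, calling run(job_input) in each one.

def run(job_input):
    # Build a sample payload; a real job might fetch this from an API or query.
    payload = {"user": "demo", "score": 42}
    # Hand the record to VDK's ingestion pipeline; "demo_scores" is a
    # hypothetical destination table in the configured ingest target.
    job_input.send_object_for_ingestion(
        payload=payload,
        destination_table="demo_scores",
    )
```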
Deploy the data job
It's a single "click" (or CLI command). Behind the scenes, VDK will package and install all dependencies, create Docker images and containers, release and version the job, and finally schedule it for execution (if configured).
vdk deploy --job-path example --reason "reason"
We can see some details about our job
vdk show --name example --team my-team
Note how there is both a VDK version and a Job version. They are deployed independently: the VDK version is taken from the Control Service configuration and managed centrally, while the Job version is separate and under the control of the data engineer developing the job.
Both the VDK version and the Job version can be changed if needed with the vdk deploy --update command.