Scheduling a Data Job for automatic execution - vmware/versatile-data-kit GitHub Wiki

The Data Job Deployment Cycle consists of three parts: creating the Data Job, deploying it, and executing it automatically on a schedule.

Overview

This guide includes:

  • information about each stage of scheduling and executing Data Jobs automatically
  • an example that walks through the process.

We will use a local installation of the Versatile Data Kit Control Service to create and schedule a regularly running Data Job.

The job itself will merely print a message in the logs.

Prerequisites

To follow this guide, you need to have the Control Service installed. To install the Versatile Data Kit Control Service locally, follow the Installation guide.

Part 1: Data Job

After the Control Service is installed, you can create a new Data Job by running the vdk create command:

Run vdk create --help to see all available options and examples.

If you run

vdk create 

it will prompt you for all the necessary information. The rest of this example assumes that the selected job name is hello-world.

To verify that the job was indeed created in the Control Service, list all jobs:

vdk list --all

This should produce the following output:

job_name     job_team    status
-----------  ----------  ------------
hello-world  my-team     NOT_DEPLOYED

You can also observe the code of the newly created Data Job by inspecting the content of the hello-world folder in the current directory. The code will be organized in the following structure:

hello-world/
├── 10_python_step.py
├── 20_sql_step.sql
├── config.ini
├── README.md
└── requirements.txt

You can modify this sample Data Job to customize it to your needs. For more information on the structure of Data Jobs, please check the Data-Job page.

For the purpose of this example, let's delete the SQL step (20_sql_step.sql) and keep a single Python step file, 10_python_step.py, with the following content:

def run(job_input):
    print('\n============ HELLO WORLD! ============\n')
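The run(job_input) function is the entry point that VDK calls for each Python step. As a quick sanity check before deploying, you can invoke it yourself; the sketch below uses a hypothetical stand-in object in place of VDK's real job_input, since this step never touches it:

```python
class StubJobInput:
    """Hypothetical stand-in for the job_input object VDK passes to run().

    Our hello-world step ignores its argument, so an empty stub suffices
    for a quick local smoke test.
    """
    pass


def run(job_input):
    # Same body as the step file 10_python_step.py above.
    print('\n============ HELLO WORLD! ============\n')


# Simulate what VDK does: discover run() in the step and call it once.
run(StubJobInput())
```

This only verifies that the step function executes without errors; a real execution in the Control Service would pass a fully featured job_input object.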

Finally, modify the schedule_cron property inside the config.ini file as follows:

schedule_cron = */2 * * * *

This property specifies the execution schedule for the Data Job once it is deployed: */2 * * * * indicates that the Data Job will be executed every 2 minutes.
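To see why */2 in the minute field means "every 2 minutes", here is a small illustration (not part of VDK) that expands a single cron field into the set of values it matches, covering only the `*`, `*/step`, and plain-number forms:

```python
def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand one cron field into the list of values it matches.

    Supports '*' (every value), '*/N' (every N-th value starting at lo),
    and a plain number. Real cron syntax also allows lists and ranges,
    which this sketch omits.
    """
    if field == '*':
        return list(range(lo, hi + 1))
    if field.startswith('*/'):
        step = int(field[2:])
        return list(range(lo, hi + 1, step))
    return [int(field)]


# The minute field of "*/2 * * * *" over its valid range 0..59:
minutes = expand_field('*/2', 0, 59)
print(minutes[:5])  # → [0, 2, 4, 6, 8]
```

Since the other four fields are all `*`, the job fires at every even minute of every hour, i.e. every 2 minutes.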

After the changes we have the following file structure:

hello-world/
├── 10_python_step.py
├── config.ini
├── README.md
└── requirements.txt

➡️ Next Section: Deployment
