Test VDK Data Jobs with pytest - vmware/versatile-data-kit GitHub Wiki

This tutorial guides you through the process of testing Versatile Data Kit (VDK) data jobs to ensure they run correctly and efficiently. Testing is crucial for maintaining data integrity, ensuring data processing logic is correct, and catching errors early in the development cycle. Upon completion, you'll understand how to implement tests for VDK data jobs, improving your data pipelines' reliability and robustness.

Audience: This tutorial is intended for data engineers and developers working with VDK to build and manage data processing jobs.

Estimated Time: Completing this tutorial should take less than 30 minutes.

Prerequisites

Familiarity with Python and the pytest framework.
Basic understanding of VDK and its components.
VDK and vdk-test-utils installed in your development environment.

Install Necessary Tools

Before writing tests, make sure vdk-test-utils and pytest are installed. Together they provide what you need to write and run tests for VDK data jobs.

pip install vdk-test-utils pytest

Step 1: Understanding the Basics

Part 1: Setting Up Your Testing Environment

VDK provides CliEntryBasedTestRunner, a tool for creating unit or functional tests for your plugins or data jobs in an isolated environment. Because it lets you stub external dependencies, you can test data jobs without affecting your production environment.

Part 2: Writing Your First Test

Let's say we want to test a simple data transformation job. First, initialize the test runner with any plugins you're using:

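from vdk.plugin.test_utils.util_funcs import cli_assert_equal
from vdk.plugin.test_utils.util_funcs import CliEntryBasedTestRunner

# Plugins to load for the test run; leave empty if the job needs none.
list_of_plugins_i_am_using = []
# The runner takes plugins as positional arguments, so unpack the list.
runner = CliEntryBasedTestRunner(*list_of_plugins_i_am_using)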

Then, invoke the data job you wish to test:

result = runner.invoke(["run", "path/to/your-data-job"])
cli_assert_equal(0, result)
assert 'expected_output' in result.output

Tip: Use cli_assert_equal(0, result) to verify that the job exited successfully. If the exit code differs, the assertion failure includes the job's output, which makes the logs easier to debug.
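To illustrate why a helper that reports the captured output is easier to debug than a bare exit-code check, here is a minimal sketch that uses only the standard library. It is not the vdk-test-utils implementation: FakeResult and assert_exit_code are hypothetical stand-ins for the runner's result object and for cli_assert_equal.

```python
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for the result object returned by a CLI test runner."""

    exit_code: int
    output: str


def assert_exit_code(expected: int, result: FakeResult) -> None:
    """Fail with the captured output in the message when the exit code differs."""
    assert result.exit_code == expected, (
        f"expected exit code {expected}, got {result.exit_code}; "
        f"output:\n{result.output}"
    )


# A passing result raises nothing.
assert_exit_code(0, FakeResult(exit_code=0, output="job finished"))

# A failing result surfaces the captured output in the assertion message.
failure_message = ""
try:
    assert_exit_code(0, FakeResult(exit_code=1, output="step 10_load.sql failed"))
except AssertionError as error:
    failure_message = str(error)

print(failure_message)
```

With a bare `assert result.exit_code == 0`, the failure would only report that 1 is not 0; including the output points straight at the failing step.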

Step 2: Advanced Testing Scenarios

Testing HTTP Ingestion

To test a job that ingests data over HTTP:

Mock an HTTP server response.
Use CliEntryBasedTestRunner with the HTTP ingestion plugin.
Run your data job and assert the expected outcome.

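import os
from unittest import mock

from vdk.plugin.test_utils.util_funcs import cli_assert_equal
from vdk.plugin.test_utils.util_funcs import CliEntryBasedTestRunner
from werkzeug import Response

# ingest_http_plugin and expected_batches are placeholders: supply the HTTP ingestion
# plugin you load and the number of requests you expect the job to send.
# The ... in the dict stands for further settings that are not shown here.
# The httpserver fixture is provided by the pytest-httpserver package.
def test_http_ingestion(httpserver):
    httpserver.expect_request("/ingest").respond_with_response(Response(status=200))
    # Set up environment variables for the test; they are restored when the block exits
    with mock.patch.dict(os.environ, {"VDK_INGEST_METHOD_DEFAULT": "http", ...}):
        runner = CliEntryBasedTestRunner(ingest_http_plugin)
        result = runner.invoke(["run", "path/to/ingest-job"])
        cli_assert_equal(0, result)
        # Every request the job sent is recorded in httpserver.log
        assert len(httpserver.log) == expected_batches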
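The same pattern (stub the HTTP endpoint, point the job at it through an environment variable, then count the requests) can be tried without VDK installed. The sketch below is a standard-library illustration only: fake_ingest_job, DEMO_INGEST_TARGET, and run_ingest_against_stub are hypothetical names, and the stub server stands in for the httpserver fixture rather than reproducing it.

```python
import json
import os
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from unittest import mock


class RecordingHandler(BaseHTTPRequestHandler):
    """Accepts every POST and records its JSON body on the server object."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.server.received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep test output quiet


def fake_ingest_job(rows, batch_size):
    """Hypothetical job: POST rows in batches to the URL in DEMO_INGEST_TARGET."""
    target = os.environ["DEMO_INGEST_TARGET"]
    # Bypass any system proxy so requests reach the local stub directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    for start in range(0, len(rows), batch_size):
        body = json.dumps(rows[start:start + batch_size]).encode()
        request = urllib.request.Request(
            target, data=body, headers={"Content-Type": "application/json"}
        )
        with opener.open(request) as response:
            assert response.status == 200


def run_ingest_against_stub(rows, batch_size):
    """Start a stub server, run the job against it, and return the recorded batches."""
    server = HTTPServer(("127.0.0.1", 0), RecordingHandler)  # port 0: pick a free port
    server.received = []
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/ingest"
        # The variable exists only inside the with block and is removed afterwards.
        with mock.patch.dict(os.environ, {"DEMO_INGEST_TARGET": url}):
            fake_ingest_job(rows, batch_size)
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
    return server.received


received_batches = run_ingest_against_stub([{"id": i} for i in range(5)], batch_size=2)
print(len(received_batches))
```

Counting recorded requests, as the test above does with httpserver.log, checks the batching behavior without needing a real ingestion endpoint.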

Testing SQL Transformations

For SQL transformation jobs:

Run the job using CliEntryBasedTestRunner.
Execute a query against the transformed data.
Assert the expected data is present.

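from vdk.plugin.test_utils.util_funcs import cli_assert_equal
from vdk.plugin.test_utils.util_funcs import CliEntryBasedTestRunner

def test_sql_transformation():
    # Pass the database plugin your job uses; none are loaded here
    runner = CliEntryBasedTestRunner()
    result = runner.invoke(["run", "path/to/sql-job"])
    cli_assert_equal(0, result)

    # Query the table the job wrote to (my_table is a placeholder name)
    query_result = runner.invoke(["sql-query", "--query", "SELECT * FROM my_table"])
    cli_assert_equal(0, query_result)
    assert "expected_value" in query_result.output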
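The run, query, assert sequence can also be practiced in isolation with an in-memory SQLite database. This is a sketch under stated assumptions, not VDK code: the raw_orders and customer_totals tables, TRANSFORM_SQL, and transform_and_query are hypothetical, and executing the SQL directly stands in for running the data job.

```python
import sqlite3

# Hypothetical transformation step: aggregate raw orders into per-customer totals.
TRANSFORM_SQL = """
CREATE TABLE customer_totals AS
SELECT customer, SUM(amount) AS total
FROM raw_orders
GROUP BY customer
"""


def transform_and_query(seed_rows):
    """Seed the source table, run the transformation, and return the result rows."""
    connection = sqlite3.connect(":memory:")  # isolated database, discarded on close
    try:
        # Arrange: create the source table with known rows.
        connection.execute("CREATE TABLE raw_orders (customer TEXT, amount INTEGER)")
        connection.executemany("INSERT INTO raw_orders VALUES (?, ?)", seed_rows)
        # Act: run the transformation (stand-in for running the data job).
        connection.execute(TRANSFORM_SQL)
        # Query the transformed table in a deterministic order.
        return connection.execute(
            "SELECT customer, total FROM customer_totals ORDER BY customer"
        ).fetchall()
    finally:
        connection.close()


def test_sql_transformation_pattern():
    rows = transform_and_query([("alice", 10), ("bob", 5), ("alice", 7)])
    # Assert: the expected aggregated rows are present.
    assert rows == [("alice", 17), ("bob", 5)]


test_sql_transformation_pattern()
```

Comparing whole rows rather than searching for a substring in printed output makes the assertion stricter; the substring check in the test above is a reasonable fallback when only CLI output is available.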

Wrap-up

Congratulations on completing this tutorial! You've learned how to set up a testing environment for VDK data jobs, write basic and advanced tests, and ensure your data processing logic works as expected.

Conclusion

Testing is a vital part of the data job development process, helping to catch errors early and ensure data integrity. By following this tutorial, you've equipped yourself with the knowledge to write effective tests for your VDK data jobs.

What's Next?

Explore further by testing different types of data jobs, incorporating continuous integration processes for automated testing, and reviewing the VDK documentation for more advanced features and best practices.

Take a look at more examples here.