Test VDK Data Jobs with pytest - vmware/versatile-data-kit GitHub Wiki
This tutorial guides you through the process of testing Versatile Data Kit (VDK) data jobs to ensure they run correctly and efficiently. Testing is crucial for maintaining data integrity, ensuring data processing logic is correct, and catching errors early in the development cycle. Upon completion, you'll understand how to implement tests for VDK data jobs, improving your data pipelines' reliability and robustness.
Audience: This tutorial is intended for data engineers and developers working with VDK to build and manage data processing jobs.
Estimated Time: Completing this tutorial should take less than 30 minutes.
Prerequisites
- Familiarity with Python and the pytest framework.
- Basic understanding of VDK and its components.
- VDK and vdk-test-utils installed in your development environment.
Install Necessary Tools
Before writing tests, ensure you have vdk-test-utils and pytest installed. These tools provide the functionality needed to write and run tests for VDK data jobs.
pip install vdk-test-utils pytest
Step 1: Understanding the Basics
Part 1: Setting Up Your Testing Environment
VDK provides CliEntryBasedTestRunner, a tool for creating unit or functional tests for your plugins or data jobs in an isolated environment. By stubbing external dependencies, it lets you test data jobs without affecting your production environment.
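The idea of stubbing an external dependency can be sketched without VDK at all. The example below is a minimal illustration using only the standard library: it temporarily overrides an environment variable (the VDK_INGEST_METHOD_DEFAULT name also appears in the HTTP ingestion test in Step 2) and shows that the override disappears once the block exits. The helper names current_ingest_method and run_with_stubbed_environment, and the value "memory", are hypothetical and chosen only for illustration.

```python
import os
from typing import Tuple
from unittest import mock


def current_ingest_method() -> str:
    """Hypothetical helper: read the ingestion method from the environment."""
    return os.environ.get("VDK_INGEST_METHOD_DEFAULT", "unset")


def run_with_stubbed_environment() -> Tuple[str, str]:
    """Return the value seen inside the patch and the value seen after it."""
    # patch.dict applies the override only inside the with block
    # and restores the original environment on exit.
    with mock.patch.dict(os.environ, {"VDK_INGEST_METHOD_DEFAULT": "memory"}):
        inside = current_ingest_method()
    after = current_ingest_method()
    return inside, after
```

Because the override is undone automatically, a test written this way should not leak configuration into other tests or into the shell that launched pytest.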
Part 2: Writing Your First Test
Let's say we want to test a simple data transformation job. First, initialize the test runner with any plugins you're using:
from vdk.plugin.test_utils.util_funcs import cli_assert_equal, CliEntryBasedTestRunner

list_of_plugins_i_am_using = []
# The runner takes plugins as positional arguments, so unpack the list
runner = CliEntryBasedTestRunner(*list_of_plugins_i_am_using)
Then, invoke the data job you wish to test:
result = runner.invoke(["run", "path/to/your-data-job"])
cli_assert_equal(0, result)
assert 'expected_output' in result.output
Tip: Use cli_assert_equal(0, result) to check that the job exited successfully. When the check fails, the assertion message includes the job's output, which makes the logs easier to debug.
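To see why such a helper is more convenient than a bare exit-code comparison, here is a minimal sketch of the pattern. It is not the vdk-test-utils implementation: FakeResult is a hypothetical stand-in for the object returned by runner.invoke, and assert_exit_code is a hypothetical name for a helper of this kind.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for the result object returned by runner.invoke."""

    exit_code: int
    output: str
    exception: Optional[BaseException] = None


def assert_exit_code(expected: int, result: FakeResult) -> None:
    """Fail with a message that carries the captured output, not just the codes."""
    assert result.exit_code == expected, (
        f"expected exit code {expected}, got {result.exit_code}; "
        f"output: {result.output!r}; exception: {result.exception!r}"
    )
```

With a plain `assert result.exit_code == 0`, a failure reports only the two numbers; with a helper like this one, the captured output appears directly in the pytest failure report.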
Step 2: Advanced Testing Scenarios
Testing HTTP Ingestion
To test a job that ingests data over HTTP:
- Mock an HTTP server response.
- Use CliEntryBasedTestRunner with the HTTP ingestion plugin.
- Run your data job and assert the expected outcome.
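import os
from unittest import mock

from vdk.plugin.ingest_http import ingest_http_plugin
from vdk.plugin.test_utils.util_funcs import cli_assert_equal, CliEntryBasedTestRunner
from werkzeug import Response

def test_http_ingestion(httpserver):
    # httpserver is the fixture provided by the pytest-httpserver package
    httpserver.expect_request("/ingest").respond_with_response(Response(status=200))
    # Set up environment variables for the test ("..." marks settings elided here)
    with mock.patch.dict(os.environ, {"VDK_INGEST_METHOD_DEFAULT": "http", ...}):
        runner = CliEntryBasedTestRunner(ingest_http_plugin)
        result = runner.invoke(["run", "path/to/ingest-job"])
        cli_assert_equal(0, result)
        # expected_batches is a placeholder for the number of requests the job should send
        assert len(httpserver.log) == expected_batches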
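To make the mechanics of the mocked server concrete, the sketch below reproduces the same pattern with only the Python standard library: a local server records every request it receives, a client posts a few batches to /ingest, and the caller checks how many arrived. It stands in for the httpserver fixture and is not how that fixture or VDK is implemented; RecordingHandler and send_batches are hypothetical names.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Accept POST requests and record each request body on the server object."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.server.received.append(self.rfile.read(length))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep test output quiet


def send_batches(batches):
    """Start a local server, POST each batch to /ingest, return what it received."""
    server = HTTPServer(("127.0.0.1", 0), RecordingHandler)  # port 0: pick a free port
    server.received = []
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # Ignore any proxy settings so the request always reaches the local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{server.server_port}/ingest"
        for batch in batches:
            request = urllib.request.Request(
                url,
                data=json.dumps(batch).encode("utf-8"),
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            with opener.open(request) as response:
                assert response.status == 200
    finally:
        server.shutdown()
        server.server_close()
    return [json.loads(body) for body in server.received]
```

Counting the recorded requests plays the same role as the `len(httpserver.log)` check in the test above: it verifies that the sender produced the expected number of batches.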
Testing SQL Transformations
For SQL transformation jobs:
- Run the job using CliEntryBasedTestRunner.
- Execute a query against the transformed data.
- Assert the expected data is present.
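from vdk.plugin.test_utils.util_funcs import cli_assert_equal, CliEntryBasedTestRunner

def test_sql_transformation():
    # No extra plugins are needed here, so the runner is created without arguments
    runner = CliEntryBasedTestRunner()
    result = runner.invoke(["run", "path/to/sql-job"])
    cli_assert_equal(0, result)
    query_result = runner.invoke(["sql-query", "--query", "SELECT * FROM table"])
    assert "expected_value" in query_result.output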
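The run, query, assert sequence can also be tried in isolation. The following sketch uses an in-memory SQLite database from the standard library in place of a VDK job, so it illustrates the testing pattern only; the orders and customer_totals tables and the function names are hypothetical.

```python
import sqlite3


def run_transformation(connection: sqlite3.Connection) -> None:
    """Hypothetical transformation: total the order amounts per customer."""
    connection.executescript(
        """
        CREATE TABLE orders (customer TEXT, amount INTEGER);
        INSERT INTO orders VALUES ('alice', 10), ('alice', 5), ('bob', 7);
        CREATE TABLE customer_totals AS
            SELECT customer, SUM(amount) AS total
            FROM orders
            GROUP BY customer;
        """
    )


def query_totals(connection: sqlite3.Connection):
    """Read the transformed data back in a deterministic order."""
    return connection.execute(
        "SELECT customer, total FROM customer_totals ORDER BY customer"
    ).fetchall()


def test_sql_transformation_sketch():
    connection = sqlite3.connect(":memory:")  # isolated, discarded after the test
    try:
        run_transformation(connection)
        assert query_totals(connection) == [("alice", 15), ("bob", 7)]
    finally:
        connection.close()
```

Asserting on exact rows, rather than on a substring of printed output, tends to make failures easier to interpret when the transformation logic changes.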
Wrap-up
Congratulations on completing this tutorial! You've learned how to set up a testing environment for VDK data jobs, write basic and advanced tests, and ensure your data processing logic works as expected.
Conclusion
Testing is a vital part of the data job development process, helping to catch errors early and ensure data integrity. By following this tutorial, you've equipped yourself with the knowledge to write effective tests for your VDK data jobs.
What's Next?
Explore further by testing different types of data jobs, incorporating continuous integration processes for automated testing, and reviewing the VDK documentation for more advanced features and best practices.