Ray in VDK

(Diagram: ml-ray-diagram-vdk)

Motivation

Data engineers and analytics engineers are essential to any organization that heavily relies on data. They're responsible for designing, creating, and maintaining the data architecture of a company. This typically includes creating and managing databases (tables and datasets), data pipelines, and ETL (extract, transform, load) processes. They also work with data scientists to ensure that they have the necessary data to perform their analyses.

For instance, an analytics engineer may need to run a Python job to ingest data from a database, REST API, or other service, a SQL job to perform some data manipulation, then a Spark job to process large amounts of data in parallel, and finally a Ray job for distributed computing. Each of these jobs typically requires a different platform or tool for management, creating a fragmented and complex workflow. This complexity and fragmentation lead to inefficiencies and increased overhead, as engineers need to switch between different platforms and keep track of their jobs and code versions across multiple systems.

By integrating the capability to run Ray jobs into VDK, VDK can function as a "single pane of glass" for managing and observing various types of jobs. A unified job management and version control system can significantly improve the efficiency of data engineers and scientists. With this proposal, VDK will provide a platform where users can manage diverse job types, keep track of their deployed code versions, and have a single, unified UI for all types of jobs.

Proposal

This proposal suggests extending VDK to enable it to handle Ray jobs. Here's how it works:

(Diagram: ray-vdk)

See the original diagram here.

  1. Deployment & Versioning

Users deploy jobs using `vdk deploy`. This command initiates the job deployment process, including versioning, thereby releasing the users' code.

```
vdk deploy --name your-ray-job
```

or, to revert to a previous version:

```
vdk deploy --name your-ray-job --job-version <prev-version>
```
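
For context, `your-ray-job` above is a regular VDK data job directory. A hypothetical layout could look like the sketch below (the step file name is illustrative):

```
your-ray-job/
├── config.ini        # job configuration (team, schedule, etc.)
├── requirements.txt  # declares ray as a Python dependency
└── 10_ray_step.py    # Python step that submits work to the Ray cluster
```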

  2. Source Code Tracking

All source code related to the different types of jobs will be tracked and maintained in the VDK Source Repository. This repository serves as a single, centralized location for the source code of all job types. It acts as a read-only catalog, enabling root cause analysis, reproducibility, and reuse.

  3. Automatic Ray client initialization and configuration

Depending on the cluster where the data job runs, VDK can automatically initialize and shut down the remote Ray connection with optimal settings and configuration:

```python
# Sample implementation of a vdk-ray plugin that automatically initializes Ray
# before the user's job runs and shuts it down at the end of the job
import ray
from vdk.api.plugin.hook_markers import hookimpl
from vdk.internal.builtin_plugins.run.job_context import JobContext


class RunRayJob:
    @hookimpl(hookwrapper=True)
    def run_job(self, context: JobContext):
        ray.init()  # Initialize Ray

        yield  # yields back to the user's job execution

        ray.shutdown()  # Shutdown Ray
```
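
For illustration, a step inside such a Ray data job might look like the sketch below. It assumes the plugin above has already called `ray.init()` before the step runs; the file name `10_ray_step.py` and the destination table name are hypothetical, while `IJobInput` and `send_object_for_ingestion` are VDK's standard job API.

```python
# 10_ray_step.py - a step inside the deployed Ray data job (name is illustrative)
import ray
from vdk.api.job_input import IJobInput


@ray.remote
def square(x: int) -> int:
    # A trivial Ray task executed on the Ray cluster
    return x * x


def run(job_input: IJobInput) -> None:
    # Ray was already initialized by the vdk-ray plugin, so the step
    # can submit tasks directly and collect the results.
    results = ray.get([square.remote(i) for i in range(10)])

    # Hand the computed results to VDK's standard ingestion API.
    for i, value in enumerate(results):
        job_input.send_object_for_ingestion(
            payload={"x": i, "x_squared": value},
            destination_table="ray_example_results",
        )
```

Because initialization and shutdown live in the plugin, the step itself stays focused on the actual computation.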

  4. VDK Operations UI Integration

The VDK Operations UI is updated to provide monitoring for all types of jobs, including Ray jobs, and is able to distinguish between job types. While the Operations UI provides a general overview and monitoring capabilities, it can redirect users to more specialized UIs, such as the Ray Dashboard, for more detailed insights.