VDK Jupyter Integration Convert Job Operation - vmware/versatile-data-kit GitHub Wiki

Overview

In the text below, "transform" and "convert" are used interchangeably.

With the integration of VDK (Versatile Data Kit) and Jupyter, we introduce a new kind of data job that incorporates notebook files. In this setup, a "notebook job" is typically built around a single notebook file that contains all the SQL and Python code the data job needs. A mix of file types, such as .sql, .py, and .ipynb, is possible but less common, given the standard development flow in the Jupyter notebook environment.

This new data job style opens up an interesting use case: converting a traditional data job composed solely of .py and .sql files (each representing a distinct job step) into the new notebook-based design.

This document walks through three practical strategies for performing the conversion, with an analysis of the pros and cons of each to determine the most suitable choice. The goal is to help you leverage the new data job format optimally for your specific needs.

Make sure you are familiar with the VDK dictionary and the VDK Jupyter Integration before reading further.

Data Jobs

As previously outlined, we differentiate between two data job types: the initial variant excludes notebook files while the second variant includes them. To facilitate user understanding of these differences, we'll dissect a specific data job example from those provided in VDK.

To familiarize yourself with the data job, please refer to this rest_job. This source offers a comprehensive explanation of the data job, demonstrating its structure using only .py and .sql files.

Next, we will navigate through the notebook rendition of that job. Here's how the notebook version appears:

The contrasts between these two versions are quite evident:

  • In the notebook job, the code from three distinct files is consolidated into a single notebook file.
  • The original variant requires a run function to access the job_input. In contrast, in the notebook variant, job_input can be accessed directly.
  • In the original version, each file represents a separate data job step, whereas in the notebook version a single file contains multiple steps. (Note: cells tagged with "VDK" are treated as individual steps.)
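The second difference can be sketched in a few lines of Python. The JobInputStub class below is a hypothetical stand-in for the job_input object that VDK supplies at runtime; only the shape of the calls mirrors the real integration:

```python
# Hypothetical stand-in for VDK's job_input object (not the real API surface).
class JobInputStub:
    def __init__(self):
        self.queries = []

    def execute_query(self, sql):
        self.queries.append(sql)
        return []


# Traditional .py step: the code lives inside a run() function,
# and the VDK runtime calls run(job_input) for us.
def run(job_input):
    job_input.execute_query("SELECT 1")


job_input = JobInputStub()
run(job_input)  # what the runtime does for a .py step

# In a notebook cell tagged "VDK", the same call is written directly,
# because job_input is already available in the notebook's scope:
job_input.execute_query("SELECT 2")

print(job_input.queries)  # ['SELECT 1', 'SELECT 2']
```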

Transform job

Let's delve into the transformation use case - specifically, how we can convert a job comprised solely of .py and .sql files into a notebook job.

We have two possible routes for this transformation. The first approach involves compiling all contents from the initial variant's files into a single comprehensive notebook file. In contrast, the second approach entails the creation of multiple notebooks - with several .sql files consolidated into one notebook, and each .py file encapsulated within its own notebook.

Regardless of the chosen method, it's important to emphasize that the sequence of code execution remains consistent in both scenarios. The operational flow of the job will be preserved in its original order, ensuring continuity and consistency of results.

The single notebook approach has two variants: one refactors the Python code into classes, and the other involves no code modifications.

  • Single notebook approach with classes

As previously discussed, one potential approach is to encapsulate each job step's code, which is originally contained in separate files, into a single comprehensive Jupyter notebook. The code from each individual file will be placed into a separate cell, maintaining the execution order from the original setup. Therefore, the cell corresponding to the first file to be executed will be positioned at the top of the notebook, while the cell associated with the last file will be located at the bottom. To facilitate tracking, all cells will be tagged with VDK.

For .sql files that contain only a single SQL statement, the cell corresponding to this step will execute the command job_input.execute_query(statement_from_file). This straightforward translation ensures that no issues arise from .sql files.
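A minimal sketch of that translation (the sql_step_to_cell helper name is ours, not part of VDK):

```python
def sql_step_to_cell(sql_text: str) -> str:
    """Render the contents of a .sql step file as the source of a
    VDK-tagged notebook cell that delegates to job_input.execute_query,
    preserving the step's behaviour one-to-one."""
    statement = sql_text.strip()
    return 'job_input.execute_query("""\n' + statement + '\n""")'


cell = sql_step_to_cell("DELETE FROM stocks WHERE price < 0")
print(cell)
```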

However, the process becomes more complex when consolidating multiple Python files into a single Jupyter notebook due to potential scoping issues. If different files share function or variable names, complications can arise. To overcome this obstacle and maintain proper scoping, we propose a refactoring approach. By encapsulating the code within classes, we create separate scopes naturally. This technique ensures that overlapping names across different Python files do not cause conflict or confusion in the consolidated notebook.
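A small self-contained sketch of the idea, using illustrative step names rather than real code from a VDK job. Two original step files both define a helper named clean; pasted naively into one notebook, the second definition would shadow the first, but wrapping each file in its own class keeps the scopes separate:

```python
class Step10:  # originally e.g. 10_ingest.py (illustrative name)
    @staticmethod
    def clean(value):
        return value.strip()

    @staticmethod
    def run(queries):
        queries.append(Step10.clean("  SELECT 1  "))


class Step20:  # originally e.g. 20_transform.py (illustrative name)
    @staticmethod
    def clean(value):  # same helper name, but no conflict: separate class scope
        return value.upper()

    @staticmethod
    def run(queries):
        queries.append(Step20.clean("select 2"))


queries = []
Step10.run(queries)
Step20.run(queries)
print(queries)  # ['SELECT 1', 'SELECT 2']
```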

The output we can expect from the transformation operation would closely resemble the following representation:

The code is from the life-expectancy example.

Adopting this strategy may give rise to several challenges, a few of which are listed below:

  • Loss of Context: When transforming Python files into classes, there is a risk of losing the context provided by global variables or state. Class methods won't have access to these variables unless they are passed as arguments or converted into class variables.
  • Conflicts with Other Classes or Modules: Converting files into classes might unintentionally create classes with the same name as existing classes or modules in your project or Python's standard library. This can lead to naming conflicts and cause issues.
  • Increased Complexity: Introducing classes can make the code more complex, especially for beginners or less experienced programmers. Some Python files may be simple scripts that don't benefit from the object-oriented features provided by classes.
  • Incompatibility with Functional Code: If your original Python files follow a functional programming style, translating them into classes may not be suitable and can introduce unnecessary complications.
  • Static Methods Limitations: In the example provided, if you're using static methods, it's important to note that they don't have access to instance-specific data or methods. This limitation can restrict their functionality, especially if your original Python files heavily rely on interdependent functions.
  • Dependency Inversion: If one file uses functions from another file, those functions would now become methods within the class of the first file. This change can disrupt dependencies since methods within a class cannot be imported in the same way as standalone functions. You would need to instantiate the class or adjust the code structure to maintain access to those functions.
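The dependency-inversion point can be illustrated with hypothetical step classes (the names below are ours, not from the example job). A helper that was previously imported as a standalone function must now be reached through the other step's class:

```python
class IngestStep:  # originally a file whose normalize() was imported elsewhere
    @staticmethod
    def normalize(name):
        return name.strip().lower()


class TransformStep:
    @staticmethod
    def run(rows):
        # was: `from ingest_helpers import normalize` + `normalize(row)`;
        # after the refactoring, the call must go through the class instead.
        return [IngestStep.normalize(r) for r in rows]


print(TransformStep.run(["  Alice ", "BOB"]))  # ['alice', 'bob']
```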

Despite the challenges associated with this approach, it also presents numerous advantages, as outlined below:

  • Improved Accessibility: Having all the code in one place can make it easier to understand the entire project, as all the elements are immediately accessible.
  • Easier Debugging: With all the code in a single notebook, it can be simpler to trace and debug issues.
  • Better Understanding of Code: One of the main advantages is that having all the code in one place provides a unified view of your data job. This makes it easier to grasp the overall flow and dependencies of the job.
  • Easier Project Navigation: Having everything in a single location simplifies navigating through the project. You don't need to constantly switch between different files to understand how the data and logic are connected.
  • Consistent and Clear Code: Integrating all files into one ensures consistency and coherence in the code. This improves readability and makes it easier to maintain the codebase over time.
  • Preservation of Context: Keeping all related information together in one place helps preserve the context that might otherwise be lost when switching between different files. This allows for a better understanding of the code and its purpose.
  • Faster Prototyping: Having everything in one place facilitates quick prototyping and iteration. It allows for faster development of new features and testing of ideas since all the necessary code is readily accessible.

No existing tools were found that convert files into classes, which means we would need to implement the operation ourselves.

  • Single notebook approach without classes

In the non-class approach, the code is not encapsulated within classes to create distinct scopes; everything else remains the same.

The output we can expect from the transformation operation would closely resemble the following representation:

Challenges that may arise from merging Python files into a single notebook without using classes:

  • Limited Modularity: Without using classes or separate files, the modularity of your code may decrease.
  • Loss of Namespace Isolation: When different Python files are merged into a single notebook, it can lead to conflicts between variables or functions with the same name. This can cause unexpected behaviour in your code.
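A minimal demonstration of the namespace problem when two step files are pasted into one shared scope (the function bodies are illustrative):

```python
# From the first step file: defines load() and uses it immediately.
def load():
    return "raw rows"

result_a = load()

# From the second step file: a function with the same name silently
# shadows the first definition for the rest of the notebook.
def load():
    return "transformed rows"

result_b = load()  # every later call now resolves to the second load()
print(result_a, "|", result_b)
```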

Most of the advantages of the single-notebook-with-classes approach (those not specific to the class-based implementation) also apply here, so we will not repeat them.

  • Multiple notebook approach

Now let's examine the second method, which involves utilizing multiple notebooks. The principal concept of this strategy is to keep the code in its original form, meaning each Python file would need to correspond to a separate notebook. As for the .sql files, they are flexible—they can share a notebook with a Python file or inhabit their own individual notebook. The entirety of a file's content would be situated within a single cell, which will carry a VDK tag.
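As a sketch of what "one file per notebook" could look like programmatically, the helper below builds a minimal one-cell notebook dictionary. The helper name and the exact "vdk" tag spelling are assumptions of this sketch, not the integration's actual converter:

```python
import json


def file_to_notebook(source: str) -> dict:
    """Build a minimal nbformat-4 notebook dict holding the whole step
    file in a single code cell tagged as a VDK step."""
    return {
        "nbformat": 4,
        "nbformat_minor": 5,
        "metadata": {},
        "cells": [
            {
                "cell_type": "code",
                # The "vdk" tag marks the cell as a job step (assumed spelling).
                "metadata": {"tags": ["vdk"]},
                "source": source.splitlines(keepends=True),
                "outputs": [],
                "execution_count": None,
            }
        ],
    }


nb = file_to_notebook(
    "def run(job_input):\n    job_input.execute_query('SELECT 1')\n"
)
print(json.dumps(nb["cells"][0]["metadata"]))
```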

The output we can expect from the transformation operation would closely resemble the following representation:

The code is from the life-expectancy example.

Adopting this strategy may give rise to several challenges, a few of which are listed below:

  • Organizing Your Code: Using multiple notebooks can make your codebase less structured and organized. It may become challenging to navigate and keep track of different notebooks, especially as your project grows larger.
  • Consistency Maintenance: Ensuring consistency across multiple notebooks can be difficult. It requires careful coordination to track changes made in one notebook and apply those changes to related notebooks if needed.
  • Managing Dependencies: Handling dependencies or shared functions between different Python files becomes complex with multiple notebooks. Direct imports between notebooks are not supported, adding complexity to managing these dependencies.
  • Reusability and Modularity Concerns: Splitting code into multiple notebooks can hinder code reusability and modularity. Extracting and reusing specific functions or modules from different notebooks becomes less straightforward (they need to be extracted into separate Python files).
  • Keeping Data Consistent: When working with multiple notebooks simultaneously, ensuring data consistency across the notebooks can be challenging. Each notebook operates in its own isolated context, making it difficult to synchronize and update shared data between notebooks.
  • Limited Interactivity: Interactive exploration and manipulation of data across notebooks can be hindered by the lack of shared state. Changes made in one notebook may not be immediately visible or accessible in another notebook.
  • Complex Data Sharing: Sharing data between notebooks often requires explicit data transfer methods like file storage, databases, or inter-notebook communication libraries. This adds complexity and overhead to the development process.
  • Dealing with Complex Dependencies: If notebooks have interdependencies or rely on shared variables or functions, managing these dependencies without shared state becomes more intricate. Coordinating changes and ensuring consistency across notebooks can be challenging.
  • Overhead from Context Switching: Constantly switching between notebooks to access or update specific data disrupts the workflow.
  • Challenges in Debugging: Debugging issues that involve interactions between multiple notebooks can be more complex. Identifying and resolving errors stemming from differences in state or inconsistent data can be time-consuming.

Despite the challenges associated with this approach, it also presents numerous advantages, as outlined below:

  • Easier Transformation: By not creating classes for each file, converting standalone Python files into notebooks becomes simpler and more straightforward. You can keep the original code structure as it is without needing extensive changes.
  • Maintaining Familiarity: Each notebook can closely resemble the structure of the original Python file, preserving the logical organization and flow of the code. This makes it easier for developers to understand and adapt to the notebooks without having to learn a new class-based structure.
  • Minimal Changes: Without the need to convert code into classes, you can minimize modifications to the original code. This reduces the chances of introducing errors or unintended changes during the transformation process.
  • Reduced Conflicts: By avoiding code encapsulation within classes, you decrease the risk of conflicts related to variable scope.
  • Simplified Maintenance: The absence of classes for each file makes code maintenance less complex. Developers can directly modify and update the code within the notebooks without the additional complexity introduced by class methods.
  • Comparison table

Below is a table for comparison outlining the disadvantages of the different approaches. The table compares the likelihood of each problem being present in the respective solutions on a scale of 1 to 3.

| Problem | Single notebook (with classes) | Multiple notebooks | Single notebook without classes |
|---|---|---|---|
| Loss of context provided by global variables or state | 3 | 2 | 1 |
| Increased complexity (because of classes) | 2 | 1 | 1 |
| Dependency inversion (functions become class methods) | 3 | 1 | 1 |
| Code organization, maintaining consistency | 1 | 2 | 1 |
| Dependency management (between files) | 1 | 3 | 1 |
| Interactivity limitations (lack of shared state between notebooks) | 1 | 3 | 1 |
| Challenges in debugging | 1 | 3 | 2 |
| Conflicts with other classes or modules | 3 | 1 | 1 |
| Loss of namespace isolation | 1 | 1 | 3 |
| Overall score (lower is better) | 16 | 17 | 12 |
  • Decision

Decided approach: single notebook without classes