Guide to Implement a Python Native Operator (converting from a Python UDF)

In the page for PythonUDF, we introduced the basic concepts of PythonUDF and described each API. To make a Python operator available to other users, it must be implemented as a native operator.

In this section, we discuss how to implement a Python native operator so that future users can drag and drop it on the UI. We start by implementing a sample UDF, then show how to convert it into a native operator.

Starting with a Sample Python UDF

Suppose we have a sample Python UDF named Treemap Visualizer, as presented below:

The UDF takes a CSV file as its input. For this example, we use a dataset of geo-location information of tweets. A sample of the dataset is shown below:

The Treemap Visualizer UDF takes the CSV file as a table (using the Table API) and outputs an HTML page that contains a treemap figure. The HTML page will be consumed by the HTML visualizer operator, and the View Result operator eventually displays the figure in the browser. The visualization is presented below:

Now, let's take a closer look at the Treemap Visualizer UDF. As shown in the following code block, the UDF performs three steps, which are listed after the code:

from pytexera import *

import plotly.express as px
import plotly.io
import plotly
import numpy as np


class ProcessTableOperator(UDFTableOperator):

    @overrides
    def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
        # Step 1: count the rows for each (county, state) pair.
        table = table.groupby(['geo_tag.countyName','geo_tag.stateName']).size().reset_index(name='counts')
        # Step 2: build the treemap, nesting counties inside their states.
        fig = px.treemap(table, path=['geo_tag.stateName','geo_tag.countyName'], values='counts',
                         color='counts', hover_data=['geo_tag.countyName','geo_tag.stateName'],
                         color_continuous_scale='RdBu',
                         color_continuous_midpoint=np.average(table['counts'], weights=table['counts']))
        fig.update_layout(margin=dict(t=50, l=25, r=25, b=25))
        # Step 3: serialize the figure to an HTML string and emit it as the output.
        html = plotly.io.to_html(fig, include_plotlyjs='cdn', auto_play=False)
        yield {'html': html}

  1. It first performs an aggregation with a groupby to count the rows (tweets) for each pair of county and US state.
  2. Then it invokes the Plotly library to create a treemap figure based on the aggregated dataset.
  3. Lastly, it converts the treemap figure object into an HTML string by invoking the to_html function in the Plotly library, and yields it as the output.
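
To make step 1 and the color-scale midpoint more concrete, the sketch below reproduces both in plain Python, without pandas or Plotly. The sample rows are made up for illustration and are not taken from the tweet dataset, and the helper names count_by_county and weighted_midpoint are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows: (county, state) pairs standing in for the
# 'geo_tag.countyName' and 'geo_tag.stateName' columns of the dataset.
rows = [
    ("Orange", "California"),
    ("Orange", "California"),
    ("Los Angeles", "California"),
    ("King", "Washington"),
    ("Orange", "California"),
    ("King", "Washington"),
]


def count_by_county(pairs):
    """Count rows per (county, state) pair, like groupby(...).size()."""
    return Counter(pairs)


def weighted_midpoint(counts):
    """Average of the counts weighted by the counts themselves,
    i.e. sum(c * c) / sum(c), as np.average(c, weights=c) computes."""
    values = list(counts.values())
    return sum(c * c for c in values) / sum(values)


counts = count_by_county(rows)
midpoint = weighted_midpoint(counts)
```

With these rows the counts are 3, 1, and 2, so the midpoint is 14/6 (about 2.33), slightly above the plain mean of 2. Weighting the counts by themselves pulls the midpoint of the color scale toward the larger counties.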

Convert the UDF into a Python Native Operator

Next, we convert the Treemap Visualizer UDF into a native operator. As described in the page for Java native operator, a native operator requires the definitions of a descriptor (Desc), an executor (Exec), and a configuration (OpConfig). A Python native operator also requires these definitions, with a few differences. We use the Treemap Visualization operator as an example to explain the differences:

Operator Descriptor (Desc)

  • Operator information: The operator information is the same as for a Java native operator. It contains the name, description, group, input port, and output port information.

  • Extending interface: Instead of implementing the OperatorDescriptor interface, a Python native operator implements the PythonOperatorDescriptor interface and overrides the generatePythonCode method. Because our example is a visualization operator, it also needs to extend VisualizationOperator.

  • Python content: The generatePythonCode method returns the actual Python code as a string, as shown below:

    [Figure: the generatePythonCode implementation in the operator descriptor]

    Now, let's compare the code in the Python UDF with the code in the descriptor. Both generate the treemap figure and convert it into an HTML page. The descriptor version additionally includes null-value handling and error alerts to make the operator more robust.

  • Output schema: A Python UDF needs to define its output Schema in the property editor, while a native operator defines it by implementing getOutputSchema. To do so, we use a Schema builder and add a STRING attribute named “html-content”.

    override def getOutputSchema(schemas: Array[Schema]): Schema = {
      Schema.newBuilder.add(new Attribute("html-content", AttributeType.STRING)).build
    }
    
  • Chart type: Since this operator is a visualization operator, we need to register its chart type as HTML_VIZ.

    override def chartType(): String = VisualizationConstants.HTML_VIZ
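
Returning to the Python content item above, the idea of returning Python code as a string can be sketched as follows. This is only an illustration written in Python: in Texera the string is built inside the descriptor, and the template text, the generate_python_code helper, the dropna-based null handling, and the error message are all assumptions rather than the actual operator code. The sketch also assumes that Table behaves like a pandas DataFrame, as the groupby call in the UDF suggests.

```python
from string import Template

# Hypothetical template for the Python source a descriptor could return.
# $path is replaced with the list of attribute names chosen by the user.
OPERATOR_TEMPLATE = Template('''\
from pytexera import *

import plotly.express as px
import plotly.io


class ProcessTableOperator(UDFTableOperator):

    @overrides
    def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
        # Null-value handling: drop rows that miss a grouping attribute.
        table = table.dropna(subset=$path)
        if table.empty:
            # Error alert: report the problem as HTML instead of failing.
            yield {'html-content': '<h1>Treemap is not available.</h1>'}
            return
        table = table.groupby($path).size().reset_index(name='counts')
        fig = px.treemap(table, path=$path, values='counts')
        html = plotly.io.to_html(fig, include_plotlyjs='cdn', auto_play=False)
        yield {'html-content': html}
''')


def generate_python_code(path_attributes):
    """Fill the template with the attribute names, outermost level first."""
    return OPERATOR_TEMPLATE.substitute(path=repr(list(path_attributes)))


source = generate_python_code(['geo_tag.stateName', 'geo_tag.countyName'])

# compile() only checks that the generated text is valid Python syntax;
# it does not import pytexera or run the operator.
compile(source, '<generated operator>', 'exec')
```

Note that the generated code yields the key 'html-content', so that it matches the attribute name declared in getOutputSchema.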
    

Executor (Exec)

In all Python native operators, the executor is simply the PythonUDFExecutor.

Operator Configuration

A Python native operator shares the same configuration as a Java native operator.

Registration

The registration process is the same as for a Java native operator.

Test

After following all the steps above, you should be able to drag and drop the operator onto the canvas. When the workflow is executed, the operator should output the expected result.