Getting Started and Flow Basics - CoxAutomotiveDataSolutions/waimak GitHub Wiki

This page describes the initial steps for getting a basic Waimak application up and running. It assumes you know the concepts behind Waimak, so please read the Introduction and High Level Concepts page before beginning.

Contents

  • Getting Waimak
  • Importing Waimak
  • Handling the Spark dependency
  • Defining and Running a Flow
  • Creating a Flow
  • Adding Actions
  • Labels as References to Underlying Data
  • Executing a Flow
  • Putting it all Together
  • Where To Go Next

Getting Waimak

Waimak consists of a number of modules, with the core functionality being provided by the waimak-core_2.11 module. A complete list of all the Waimak modules and the functionality they provide is given in the Project README.md.

Importing Waimak

Waimak is published to Maven Central and is available to all POM-compatible build tools and applications.

Maven

You can import Waimak into your Maven project by adding the following dependency to your POM:

        <dependency>
            <groupId>com.coxautodata</groupId>
            <artifactId>waimak-core_2.11</artifactId>
            <version>${waimak.version}</version>
        </dependency>
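The same coordinates work in other POM-compatible build tools. For example, a minimal sketch of the equivalent sbt dependency (assuming a `waimakVersion` value you define elsewhere in your build; the `%%` operator appends your project's Scala binary version, e.g. `_2.11`, to the artifact name):

    // build.sbt (sketch) — waimakVersion is a placeholder you define
    libraryDependencies += "com.coxautodata" %% "waimak-core" % waimakVersion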

Spark Shell

Waimak can also be imported into the interactive Spark Shell by providing the --packages option:

spark-shell --packages "com.coxautodata:waimak-core_2.11:2.0"

Handling the Spark dependency

Waimak uses a bring-your-own-Spark approach to handle the Spark dependency, meaning that Waimak is not tied to any particular version or release of Apache Spark. This allows you to pick the version of Spark you wish to use for your application, and Waimak will typically be happy to use that version. Waimak achieves this by marking its Spark dependency as optional. Waimak should be compatible with any version and distribution of Spark >= 2.2; however, the list of currently tested versions can be found under the profile section in the Waimak POM.
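Because the Spark dependency is marked optional, your own project must declare Spark explicitly. A minimal sketch in Maven, typically using `provided` scope since Spark is usually supplied by the cluster at runtime (`spark.version` here is a placeholder property you define in your POM):

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>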

Defining and Running a Flow

Creating an application using Waimak typically consists of three stages:

  1. Creating an initial empty flow
    All Waimak flows begin with an empty DataFlow object on which actions can be added

  2. Adding actions to the flow to build up an application
    Many Waimak actions can be added to a flow to create a complete flow object that describes your application

  3. Executing the flow
    The fully described flow application can then be executed by Waimak, with each action being executed in turn (and in parallel where possible) depending on the dependencies between Waimak actions

Creating a Flow

You can create an empty Waimak flow by including the following import and calling the Waimak flow builder:

import com.coxautodata.waimak.dataflow.Waimak

val emptyFlow = Waimak.sparkFlow(spark)

The Waimak.sparkFlow function requires that you pass in a SparkSession object. In most interactive shells this will be the spark variable by default.

Some actions such as the Cache as Parquet Action and Stage and Commit Actions require a temporary folder for their underlying staging mechanisms. If you need to use these actions you will also have to provide an additional temporary folder path to the builder function:

val emptyFlow = Waimak.sparkFlow(spark, "/tmp/project_name")

Note: When providing a temporary folder it is advisable to provide a unique path for each project to prevent conflicting writes between projects.

Adding Actions

Once you have a flow object, actions can then be added to the flow to form a complete description of your application. It is important to note that the actions are not performed when they are added to the flow; they are performed later, when the flow object is executed.

Actions are added directly onto the flow object by calling methods on the flow:

val flowWithAction = emptyFlow.addAction(..)

The flow object is immutable, meaning that it cannot change once created, so when actions are added onto a flow a new flow containing the additional actions is created. You must assign the new flow object returned by adding an action to a value to save its state. Multiple actions can also be chained in a row to produce a new flow containing all the actions that were added:

val flowWithThreeActions = flowWithAction 
                            .addAction(..)
                            .addAction(..)

All of the standard actions available are described in the Actions page. To make them available to use on a flow object you must include the following import statement:

import com.coxautodata.waimak.dataflow.spark.SparkActions._

val flowWithRead = emptyFlow.openFileCSV("/files/customer.csv", "customer")

Labels as References to Underlying Data

In Waimak, actions use labels (i.e. String values) to reference underlying Datasets. These labels are used in all Waimak actions and the executor forms a dependency graph at execution time to allow as much computation as possible to be done in parallel.

For example, in the following code snippet a CSV file is opened and assigned the label "customer". A subsequent action then references the "customer" label, transforms the underlying data and assigns the result to a new label "unique_customer_ids". A write action then references the transformed data using the label "unique_customer_ids" and writes the result to a path. The write action does not require an output label as there is no result to assign:

import com.coxautodata.waimak.dataflow.spark.SparkActions._

val writeUniqueCustomerIdsFlow = emptyFlow
                                  .openFileCSV("/files/customer.csv", "customer")
                                  .transform("customer")("unique_customer_ids")(df => df.select('id).distinct)
                                  .writeParquet("/output")("unique_customer_ids")                              

The data referenced by the labels is also immutable, so every transformation will produce a new label, leaving the previous label unchanged. This concept is explained on the Actions page under the Immutable Labels heading.

Executing a Flow

Once you have finished constructing your Waimak flow you can execute the flow using an Executor. The Executor will analyse the flow passed to it and execute the underlying Spark transformations and actions. The Executor manages all the dependencies between labels, ensures all actions are executed in the correct order and will try to execute independent actions in parallel.

To execute a flow with a specific executor, construct an executor object and pass the flow to its execute method:

import com.coxautodata.waimak.dataflow.Waimak

val executor = Waimak.sparkExecutor()
executor.execute(writeUniqueCustomerIdsFlow)

However, by default flows will already include an executor and can therefore be executed in the following simplified way:

writeUniqueCustomerIdsFlow.execute()

The executors can be configured and tuned for an application, however the default executor configuration is typically fine for most use cases. Executor configuration and tuning is covered in the Executor page.

Putting it all Together

To put everything covered above together, we can write a simple application that reads in a CSV file containing customer data, finds the unique IDs in the file and writes them to an output location as Parquet:

// Import Waimak builders and actions
import com.coxautodata.waimak.dataflow.Waimak
import com.coxautodata.waimak.dataflow.spark.SparkActions._

// Create an empty flow
val emptyFlow = Waimak.sparkFlow(spark)

// Build up our application by adding actions
val writeUniqueCustomerIdsFlow = emptyFlow
                                  .openFileCSV("/files/customer.csv", "customer")
                                  .transform("customer")("unique_customer_ids")(df => df.select('id).distinct)
                                  .writeParquet("/output")("unique_customer_ids")                 

// Execute the flow
writeUniqueCustomerIdsFlow.execute()

Where To Go Next

You should now be able to construct Spark applications using a structured approach with Waimak. There is much more to Waimak, some of which is covered in the following pages:

  • Actions and Advanced Actions
    Waimak includes many built-in Actions, most of which are listed with examples in the above pages

  • Flow Best Practices
    Best practices to use when defining and structuring your flow

  • Creating Custom Actions
    This page describes how to create custom actions for situations when the built-in actions do not cover required functionality

  • Executors
    Configuring and tuning the Executor object
