Getting Started - Texera/texera GitHub Wiki

Texera Documentation

Texera is an open-source system that supports collaborative data analytics at scale using Web-based workflows. This page explains how to install the system and build a simple workflow.

1. Prerequisites:

Java JDK
- Install JDK 11 (Java Development Kit; we recommend AdoptOpenJDK) to run the Texera backend engine, and set the JAVA_HOME environment variable to its installation directory.
- Check version:
```
java -version
```
Git
- Verify installation:
```
git --version
```
- On Windows, install the software from https://gitforwindows.org/. Git Bash is available after installing Git.
- On Mac and Linux, see https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
sbt (Scala Build Tool)
- Install sbt to build the project; see https://www.scala-sbt.org/1.x/docs/Setup.html. We recommend using sdkman to install sbt if you are using Java 8. An sbt installed with brew has problems with Java 8, as documented here.
- If the sbt --version command fails on Windows after installation, it is recommended to restart your computer.
Node.js & npm
- Install an LTS version (not the latest). Currently, we require LTS version > 18.x
- Check version:
```
node -v
```
- On Windows, install from https://nodejs.org/en/.
- On Mac and Linux, use NVM to install Node.js, which avoids permission issues when using node.
yarn
- Install the yarn package manager: https://classic.yarnpkg.com/en/docs/install/. Use yarn@4.5.1.
```
npm install -g yarn
corepack enable
corepack prepare yarn@4.5.1 --activate
```
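The prerequisite checks above can also be scripted. The following is a minimal sketch, not part of Texera: the function names are hypothetical, and it assumes that "LTS version > 18.x" means a Node.js major version of 18 or newer. It only checks that each tool is on the PATH, that JAVA_HOME is set, and that the reported Node.js version looks recent enough.

```python
import os
import re
import shutil
import subprocess

# Tools named in the prerequisites section.
REQUIRED_TOOLS = ["java", "git", "sbt", "node", "npm", "yarn"]

# Assumption: "LTS version > 18.x" is read as "major version 18 or newer".
MIN_NODE_MAJOR = 18


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def node_major(version_text):
    """Extract the major version from `node -v` output such as 'v20.11.1'."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    return int(match.group(1)) if match else None


def node_is_supported(version_text, minimum=MIN_NODE_MAJOR):
    """Return True when the major version is at least `minimum`."""
    major = node_major(version_text)
    return major is not None and major >= minimum


def check_prerequisites():
    """Print a short report and return True when nothing looks wrong."""
    ok = True
    for tool in missing_tools():
        print(f"missing: {tool}")
        ok = False
    if not os.environ.get("JAVA_HOME"):
        print("JAVA_HOME is not set")
        ok = False
    if shutil.which("node"):
        result = subprocess.run(["node", "-v"], capture_output=True, text=True)
        if not node_is_supported(result.stdout):
            print(f"unsupported Node.js version: {result.stdout.strip()}")
            ok = False
    return ok
```

Calling check_prerequisites() from a Python prompt prints one line per problem found. It does not verify the JDK or yarn versions, so the manual version checks above still apply.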

2. Texera Initialization:

Open a command line (Git Bash on Windows) and navigate to a directory where you want to install the Texera project.
Clone the project from GitHub by executing:

git clone https://github.com/Texera/texera.git

Navigate to the project directory:

cd texera

Configure Yarn version for the core/gui workspace:

yarn --cwd core/gui set version 4.5.1

3. Start Texera (Skip this step if you're using Windows):

Open a command line and navigate to the cloned repository. If you are on Windows, you need to use Git Bash as a Linux bash shell in order to run shell scripts.
Navigate to the core directory:
```
cd core
```

Then build the project:

./scripts/build.sh

Depending on your environment, this may take a few minutes (around 2 to 6 minutes).

If the shell script outputs an error message related to pylsp: command not found, install pylsp first and try again.
```
pip install python-lsp-server "python-lsp-server[websockets]"
```

Start the Texera Web server. In the core directory:
```
./scripts/server.sh
```

Wait until you see the message org.eclipse.jetty.server.Server: Started

Start the Texera computing unit process. Open a new terminal window. In the core directory:
```
./scripts/workflow-computing-unit.sh
```

Wait until you see the message ---------Now we have 1 nodes in the cluster---------

Note: if ./scripts/workflow-computing-unit.sh fails with a "permission denied" error, run chmod 755 scripts/workflow-computing-unit.sh to make the file executable.
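Waiting for the two startup messages above can be automated instead of watching each terminal. The sketch below is one possible approach, not something shipped with Texera: it assumes the scripts print these messages to standard output or standard error, and the function names and log file names are hypothetical. Output is redirected to a log file, which is then polled for the expected text.

```python
import subprocess
import time
from pathlib import Path

# Messages quoted from the steps above (matched as substrings).
SERVER_MARKER = "org.eclipse.jetty.server.Server: Started"
CLUSTER_MARKER = "Now we have 1 nodes in the cluster"


def log_contains(log_path, marker):
    """Return True if the log file exists and contains the marker text."""
    path = Path(log_path)
    if not path.exists():
        return False
    return marker in path.read_text(errors="replace")


def wait_for_marker(log_path, marker, timeout=600.0, interval=1.0):
    """Poll the log file until the marker appears or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if log_contains(log_path, marker):
            return True
        time.sleep(interval)
    return log_contains(log_path, marker)


def start_and_wait(script, marker, log_path, cwd="."):
    """Start a script with its output sent to log_path, then wait for marker.

    Assumption: the script writes the marker to stdout or stderr.
    Returns the process handle and whether the marker was seen.
    """
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            [script], cwd=cwd, stdout=log, stderr=subprocess.STDOUT
        )
    return proc, wait_for_marker(log_path, marker)
```

For example, start_and_wait("./scripts/server.sh", SERVER_MARKER, "server.log", cwd="core") would start the Web server from the repository root and return once the Jetty message appears. If the script is not executable, Popen raises PermissionError, which corresponds to the "permission denied" case in the note above.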

Start the Texera compiling unit process. Open a new terminal window. In the core directory:
```
./scripts/workflow-compiling-service.sh
```
Start the shared-editing server locally. To do so, open a new terminal window. In the core directory:
```
./scripts/shared-editing-server.sh
```
Start the frontend. Open a new terminal window. In the core/gui directory:
```
ng serve
```
Open a browser and access http://localhost:4200.
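Because ng serve can take a while to compile, it may help to check that something is listening on port 4200 before opening the browser. This is a small sketch using only the Python standard library; the function names are hypothetical, and the only value taken from this page is the frontend address localhost:4200.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host="localhost", port=4200, total_timeout=120.0, interval=2.0):
    """Retry until the port accepts connections or total_timeout elapses."""
    deadline = time.monotonic() + total_timeout
    while True:
        if port_is_open(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

wait_for_port() returns True as soon as the port accepts a connection. An open port only shows that a process is listening, not that the page has finished building, so a browser refresh may still be needed.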

4. Use Texera:

Web UI Overview

Operator Library/Menu:

It is separated into multiple dropdown menus based on the operator type, e.g., Source Operator, Search Operator, etc. You can drag and drop an operator from these dropdown menus onto the Workflow Canvas.
Workflow Canvas:

It is the main workspace, onto which you drag and drop operators from the Operator Library. Each operator is shown as a square box and is connected to other operators by arrowed links, which indicate the data flow.
Properties Editor Panel:

The panel appears when you select an operator (by clicking on it) in the Workflow Canvas. You can customize the properties of the selected operator, for example, set the keyword for a filter. When the selected operator is configured correctly, a green ring surrounds it; a red ring usually indicates an error in its configuration or in its connections to other operators.
Result Panel:

The panel is hidden by default and whenever there is no result. You can click the small up arrow to expand it. When a workflow finishes running, the Result Panel pops up with the data. You can scroll vertically and horizontally to view the data inside the panel.

Create the first workflow

The following instructions create a workflow that analyzes data from a CSV file using Texera. Specifically, the workflow calculates the average units sold per item type for Europe from CountrySalesData.csv (make sure the downloaded file has the .csv extension). The sales data was downloaded from eforexcel.com and has 100 rows.

We will create a workflow in the Texera Web UI to:

- read the data from the file;
- filter the relevant data based on keywords;
- perform an aggregation.

1. Read Data

Drag and drop the CSV File Scan operator from the Source operator type onto the Workflow Canvas.
Select the CSV File Scan operator on the Workflow Canvas. On the right-hand side, the Properties Editor Panel for the CSV File Scan operator should appear.
Fill in the absolute file path of the downloaded CountrySalesData.csv file on your OS.
The delimiter is set to a comma (,) by default.
Check the header option to indicate that the file has a header row at the top.
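To make the delimiter and header options concrete, here is a rough Python analogue of what a CSV scan with these two settings does. It is an illustration only, not Texera's implementation: read_rows is a hypothetical helper, the sample rows are made up, and the column-N names used when there is no header are an assumption of this sketch.

```python
import csv
import io

# Made-up sample in the same shape as the sales file (not real data).
SAMPLE = "Region,Item Type,Units Sold\nEurope,Snacks,100\nAsia,Snacks,50\n"


def read_rows(text_stream, delimiter=",", has_header=True):
    """Parse delimited text into a list of dicts keyed by column name."""
    rows = list(csv.reader(text_stream, delimiter=delimiter))
    if not rows:
        return []
    if has_header:
        # The first row supplies the column names and is not data.
        header, data = rows[0], rows[1:]
    else:
        # Hypothetical placeholder names when the file has no header row.
        header = [f"column-{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]


rows = read_rows(io.StringIO(SAMPLE))
print(rows[0]["Region"])
```

To read the real file, the same function can be applied to open(path, newline=""). With has_header=False, the first line would be treated as a data row, which is why the header option needs to be checked for this file.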

2. Filtering Using Regex

Drag and drop a Regular Expression operator from the Search operator type and place it to the right of the CSV File Scan operator on the Workflow Canvas.
The two operators should be connected automatically (an arrow from the CSV File Scan operator to the Regular Expression operator). If not, connect them manually by clicking the grey dot of the first operator and dragging to the second operator. The connection represents the flow of data from the CSV File Scan operator to the Regular Expression operator.
Select the Regular Expression operator on the Workflow Canvas. On the right-hand side, the Properties Editor Panel for the Regular Expression operator should appear.
In the dropdown menu, select the data column to search. The Regex property is the expression to search for. For this workflow, set the column to Region and the regex to Europe.
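Conceptually, this step keeps only the rows whose Region value matches the expression. The sketch below shows that idea in plain Python. It assumes "contains a match" semantics (re.search), which may differ from how the Regular Expression operator is actually implemented; regex_filter and the sample rows are hypothetical.

```python
import re


def regex_filter(rows, column, pattern):
    """Keep rows whose value in `column` contains a match for `pattern`."""
    compiled = re.compile(pattern)
    return [row for row in rows if compiled.search(str(row.get(column, "")))]


# Made-up rows for illustration.
sample_rows = [
    {"Region": "Europe"},
    {"Region": "Asia"},
    {"Region": "Eastern Europe"},
]

print(regex_filter(sample_rows, "Region", "Europe"))
```

Under this assumption, the pattern Europe also keeps values that merely contain the word, such as the made-up "Eastern Europe" row. An anchored pattern like ^Europe$ would keep only exact matches.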

3. Aggregation

Drag and drop an Aggregate operator from the Utilities type onto the Workflow Canvas and connect it to the Regular Expression operator.
On the properties panel for the Aggregate operator, set the Aggregation Function to average.
In the dropdown menu below, set the data column to be averaged as Units Sold.
In the Result Attribute input box, enter a name for the aggregated value. For example, let's use units-sold-per-type.
Because the average is computed per item type, click the + under Group By Keys and type Item Type.
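The aggregation configured here is a group-by average: rows are grouped by Item Type, and Units Sold is averaged within each group. A minimal Python sketch of that computation follows. The helper name average_by and the sample rows are hypothetical, and the sketch assumes the values arrive as numeric strings.

```python
from collections import defaultdict


def average_by(rows, group_key, value_key, result_name):
    """Average `value_key` within each distinct value of `group_key`."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, row count]
    for row in rows:
        bucket = totals[row[group_key]]
        bucket[0] += float(row[value_key])
        bucket[1] += 1
    return [
        {group_key: key, result_name: total / count}
        for key, (total, count) in totals.items()
    ]


# Made-up rows for illustration.
sample_rows = [
    {"Item Type": "Snacks", "Units Sold": "100"},
    {"Item Type": "Snacks", "Units Sold": "300"},
    {"Item Type": "Clothes", "Units Sold": "50"},
]

print(average_by(sample_rows, "Item Type", "Units Sold", "units-sold-per-type"))
```

Each output row holds one group key and one aggregated value named by the Result Attribute, which is the shape to expect in the Result Panel for this workflow.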

4. Run the workflow

The workflow is now complete. Click the Run button to run it. The results will appear in the Result Panel. Green status labels appear on the operators to indicate the progress of the execution.

The system saves the workflow automatically, so if you accidentally close the browser, the workflow is loaded again the next time you visit the same site.

If the workflow doesn't work as expected, try refreshing the page and clicking the Run button again.