NoETL

NoETL (Not Only ETL) is a system that manages the sequence of process execution by controlling forks and child processes. It provides job scheduling and exit-status handling.

Differing Approaches to Data Pipelining

Basic Approach: API Calls

Data is usually processed through API calls issued from the command line. Separate calls to Hive, Hadoop, relational databases, and NoSQL databases are executed to generate the results.
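As a hedged illustration of this manual approach (the command strings below are placeholders, not taken from any NoETL source), each CLI call is issued by hand or from a one-off script, checking its exit status before moving on:

    import subprocess

    # Run each engine's CLI in sequence; any non-zero exit status aborts the run.
    # The commands are illustrative placeholders.
    for cmd in (
        ["hive", "-e", "INSERT OVERWRITE TABLE daily SELECT * FROM raw_events"],
        ["mysql", "-e", "LOAD DATA INFILE '/tmp/daily.csv' INTO TABLE daily"],
    ):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            raise SystemExit(f"step failed: {' '.join(cmd)}")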

Automated Approach: Data Pipelining

The eventual goal is to automate data processing through data pipelining, in which data is ingested and processed to generate tables and structured data sets, all within the cloud. NoETL takes the basic approach and automates it by loading a sequence of steps and executing them, accounting for success and failure cases. NoETL invokes processes on the data (Spark, Hive, Hadoop, HBase, BigQuery, Redshift, MongoDB, MySQL, etc.) by making calls through the operating system. It loads a configuration file specified by the user, searches it for the starting step, and executes each step until the end of the sequence or a failure case is reached.
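A minimal sketch of that control flow, assuming an illustrative config layout with a start field and per-step success pointers (the field names are assumptions, not necessarily NoETL's actual schema; parallel execution of a step's command list is sketched further below):

    import json
    import subprocess

    def run_pipeline(config_path):
        # Load the user-specified configuration and locate the starting step.
        with open(config_path) as f:
            config = json.load(f)
        current = config["start"]  # illustrative field names throughout
        while current:
            step = config["steps"][current]
            # Each command in the step is executed through the operating system.
            ok = all(subprocess.run(cmd, shell=True).returncode == 0
                     for cmd in step["commands"])
            if not ok:
                raise SystemExit(f"step '{current}' failed")
            # A success points to the next step; a missing pointer ends the run.
            current = step.get("success")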

Components

NoETL & the Command Line

NoETL directly interacts with the command line. It reads steps from a JSON configuration file; each step has a list of commands that are run in parallel and joined before the next step is called, in the case of success. In the JSON file, each step also specifies the next action to take. Exit codes are currently only true/false: a success points to the next step, and a failure exits the program.
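For concreteness, a configuration in the shape this paragraph describes might look like the following. The runShell action name comes from the section below; the remaining field names are illustrative assumptions, not necessarily NoETL's actual schema:

    {
      "start": "extract",
      "steps": {
        "extract": {
          "action": "runShell",
          "commands": ["hive -e 'INSERT OVERWRITE TABLE daily ...'",
                       "hadoop fs -ls /data/raw"],
          "success": "load"
        },
        "load": {
          "action": "runShell",
          "commands": ["mysql -e 'LOAD DATA ...'"]
        }
      }
    }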

Hive, Hadoop MapReduce, MySQL/Redshift/Postgres

Each step in the JSON config file has a command list, which is executed by calling the runShell function in NoETL as the action. The command lists under the steps involve calls to Hive, Hadoop MapReduce, MySQL, and other applications. Hive is a data warehousing infrastructure that facilitates data summarization and analysis. Hadoop distributes data processing across clusters of computers using HDFS (the Hadoop Distributed File System). MySQL, Redshift, and Postgres are relational database management systems (RDBMSs) that support SQL and are used to store the processed data.
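A hedged sketch of what a runShell-style action could do with a step's command list, running the commands in parallel and joining them before the next step; this is an illustration, not NoETL's actual implementation:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run_shell(commands):
        """Run a step's commands in parallel; succeed only if all succeed."""
        with ThreadPoolExecutor(max_workers=len(commands) or 1) as pool:
            # Each command (e.g. a Hive, Hadoop MR, or MySQL CLI call) runs as
            # a child process; the executor joins them before results are read.
            codes = list(pool.map(
                lambda cmd: subprocess.run(cmd, shell=True).returncode,
                commands))
        return all(code == 0 for code in codes)

    # Illustrative command list; real steps would come from the JSON config.
    ok = run_shell(["hive -e 'SHOW TABLES'", "hadoop fs -ls /"])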

User Interface

Developing a user interface will help users interact more directly with their data and see immediate results. The goal of the user interface is to allow data uploads and to support machine-learning processing of the data, leading to faster visualization of results. Users will be able to upload new data, which NoETL will automatically process. A faster, more efficient implementation of this process will be paired with the ability to manipulate the data in the UI, helping users learn to work with their own data and gain more insight from the results. The components to be developed include a data modeling UI to manage the NoETL configuration files and execution scenarios, and an integrated UI for SQL queries.

Approaches to File Maintenance

  1. Edit the agent and the configuration manually
  2. Use a UI on the client machine to configure files, but upload files manually
  3. Control the execution process from the UI
    • With this approach, a connection between the UI and an API needs to be made in order to get the status of the current process and report it back (see the sketch after this list).
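
As a hedged sketch of the third approach, here is a minimal status endpoint the UI could poll, using only the Python standard library; the route, port, and payload are assumptions, not an existing NoETL API:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical in-memory view of the running pipeline.
    STATUS = {"current_step": "extract", "state": "running"}

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Report the current process status back to the UI as JSON.
            if self.path == "/status":
                body = json.dumps(STATUS).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), StatusHandler).serve_forever()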