PSTL
Disclaimer: The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of my current or past employers.
https://twitter.com/_jg/status/977958771057201153
Update: Open Source project for PSTL
My new industry acronym, PSTL, the Parallelized Streaming Transformation Loader (pronounced "PiSToL"), is an architecture for highly scalable and reliable data ingestion pipelines.
While there is guidance on using Apache Kafka™ for streaming (or non-streaming) data, Apache Spark™ for transformations, and loading data (e.g., via COPY) into an HP-Vertica™ columnar Data Warehouse, there is very little prescriptive guidance on how to truly parallelize a unified data pipeline - until now.
Update (2017-05-16), Spark Summit West 2017: HPE Professional Services now has a PSTL implementation that is an end-user, self-service, no-code-required “ETL” framework. It is an extensible, operationally robust Spark Structured Streaming app for Kafka, Hadoop/Hive (ORC, Parquet), OpenTSDB/HBase, and Vertica data pipelines.
Update: HP Big Data 2015 - Kafka, Spark, Vertica PSTL slide deck
Background
In a traditional ETL system, we Extract from data sources, then Transform and/or restructure the data, and then Load it into a data store (e.g., a data warehouse). We may execute each individual phase in parallel; however, each phase is typically run as an independently executing “parallel” process. That is, Extract, Transform, and Load can run independently of each other, giving some performance benefit through the separation of concerns.
Note that nothing in the following PiSToL method and process inherently requires Apache Kafka™, Apache Spark™, and/or HP-Vertica™; however, prescriptive guidance on parallelizing each will be the focus of this and future articles/code.
PiSToL
The PiSToL model borrows concepts from the traditional ETL model and the Lambda Architecture, but was designed for modern-day Big Data pipelines, where reading, writing, and transforming large amounts of data should have the following semantics:
a) Data should be ingested and processed in near real-time
a. Streaming data should be ingested into a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable
b. We prescribe: Apache Kafka™ (a minimal producer sketch follows this item)
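For concreteness, here is a hedged sketch of an upstream producer publishing events to Kafka from Scala using the standard Kafka Java client. The broker addresses, topic name ("events"), and JSON payload are illustrative placeholders, not part of PSTL.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092,broker2:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all") // wait for replication so each message is durable

    val producer = new KafkaProducer[String, String](props)
    try {
      // Keying by a stable id keeps related events in the same Kafka partition.
      producer.send(new ProducerRecord("events", "user-42",
        """{"user_id": 42, "event_type": "click", "event_ts": "2015-06-01T12:00:00Z"}"""))
    } finally {
      producer.close()
    }
  }
}
```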
b) Streaming data should be distributed and partitioned on ingestion to enable parallelized (read) processing into in-memory Resilient Distributed Datasets
a. We prescribe: Spark Datasets
b. We prescribe: Spark Structured Streaming
See also: exactly-once Spark Streaming from Apache Kafka (Koeninger's GitHub, Cloudera blog). A minimal read sketch follows below.
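A hedged sketch of the read side, assuming Spark 2.x with the spark-sql-kafka-0-10 package on the classpath. The broker list, topic name, and console sink are placeholders, not the PSTL implementation.

```scala
import org.apache.spark.sql.SparkSession

object KafkaReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pstl-read-sketch").getOrCreate()

    // Each Kafka topic partition maps to Spark tasks, so the read is
    // parallelized across the topic's partitions.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")
      .load()

    // Kafka records arrive as binary key/value columns plus metadata
    // (topic, partition, offset, timestamp).
    val messages = raw.selectExpr(
      "CAST(value AS STRING) AS json", "partition", "offset")

    val query = messages.writeStream
      .format("console") // placeholder sink; PSTL routes to Vertica/HDFS instead
      .option("checkpointLocation", "/tmp/pstl-checkpoint") // offsets tracked for recovery
      .start()

    query.awaitTermination()
  }
}
```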
c) Transformations should be done using Spark SQL and UDFs/expressions, and perhaps some of the custom PSTL expressions we provide 👍, with parallelism on partitioned RDDs
a. We prescribe: Spark transformations and functional programming in Scala on RDDs/DataFrames (see the transformation sketch below)
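A hedged sketch of a transformation step, continuing from the read sketch above. A from_json expression and a simple filter stand in for the custom PSTL expressions; the event schema and column names are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, to_date}
import org.apache.spark.sql.types._

object TransformSketch {
  // Assumed shape of each message payload (illustrative only).
  val eventSchema: StructType = new StructType()
    .add("user_id", LongType)
    .add("event_type", StringType)
    .add("event_ts", TimestampType)

  def transform(spark: SparkSession, messages: DataFrame): DataFrame = {
    import spark.implicits._
    messages
      .select(from_json($"json", eventSchema).as("e"), $"partition", $"offset")
      .select(
        $"e.user_id", $"e.event_type", $"e.event_ts",
        to_date($"e.event_ts").as("event_date"), // convenient partition column
        $"partition", $"offset")                 // keep Kafka lineage for auditing
      .filter(col("user_id").isNotNull)          // simple data-quality rule
  }
}
```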
UPDATE: The remaining guidance was based on a custom Vertica UDL. If you do not want to invest that much effort to achieve crazy ingest numbers like 36 TB/hr, we recommend using the Vertica COPY from Kafka with the Avro parser.
d) Loading from RDDs into a Data Warehouse should be done via parallel processing using a method and process that affinitizes records to the specific Data Storage (write) node
a. We prescribe: pre-hashing the partitioned record columns using the target data store's hashing function (e.g., MurmurHash). This method avoids shuffling/redistributing data on load into the Data Warehouse (see HP-Vertica partitioning and segmentation). A hedged affinitization sketch follows the COPY example below.
b. We prescribe: parallelized writes to the data store via TCP socket streams, with the DW acting as a TCP server that listens for and receives data directly on each node.
c. Example HP-Vertica COPY command:
copy schema.tableName with source SPARK(port='12345', nodes='node0001:4,node0002:4,node0003:4,node0004:4,node0005:4,node0006:4,node0007:4,node0008:4') direct;
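To illustrate the affinitization idea (not the PSTL UDL itself), here is a hedged sketch that buckets rows by a hash of the segmentation columns and repartitions so that each Spark partition holds only rows bound for a single Vertica node. Spark's built-in hash() (Murmur3-based) is used as a stand-in; a real pipeline must reproduce the target store's own segmentation hash for the pre-hashing to line up with the projection's segmentation.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{hash, lit, pmod}

object NodeAffinitySketch {
  def affinitize(rows: DataFrame, segmentationCols: Seq[String], numNodes: Int): DataFrame = {
    // Bucket each row by a hash of its segmentation columns, then repartition
    // by the bucket so all rows destined for node N sit in the same partitions.
    // Each partition can then be streamed over a TCP socket directly to the
    // matching node listed in the COPY command above.
    val bucketed = rows.withColumn(
      "vertica_node",
      pmod(hash(segmentationCols.map(rows(_)): _*), lit(numNodes)))
    bucketed.repartition(numNodes, bucketed("vertica_node"))
  }
}
```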
e) Pre-parsed & pre-transformed data (i.e., the raw records) in the RDDs should be persisted to a Data Store
a. We prescribe: Saving the original RDD to Hadoop HDFS
f) Parsed & transformed Data in the RDDs should be persisted to a Data Store
a. In addition to the writes to the DW, we prescribe saving the structured data from the RDD to Hadoop HDFS as compressed Apache Parquet™ files (see the sketch below). This allows you to manage the retention policy in your DW without losing historical data.
- Note: We currently support the ORC & Parquet HDFS formats, and we recommend using Vertica external tables with libhdfs++ to read HDFS/Hive tables.
- We have also been investigating the ORC file format, since it is richly supported in HP-Vertica (ORC file format).
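A brief sketch of the HDFS persistence step, assuming the transformed DataFrame from the sketches above; the HDFS path and the event_date partition column are placeholders.

```scala
import org.apache.spark.sql.DataFrame

object PersistSketch {
  def saveParsed(parsed: DataFrame): Unit = {
    parsed.write
      .mode("append")
      .partitionBy("event_date")                  // prune by date when reading back
      .option("compression", "snappy")
      .parquet("hdfs:///data/pstl/events_parsed") // later readable via Vertica external tables
  }
}
```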
Note: Saving both pre-parsed and parsed data also allows you to do Data Quality checks in your PSTL pipeline.
We prescribe: all write operations should be done using idempotent methods (more on that later), with exactly-once semantics from source through transforms to sinks. A sketch of one common idempotence pattern follows below.
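As a closing illustration, here is a hedged sketch of one common idempotence pattern using Structured Streaming's foreachBatch (Spark 2.4+): each micro-batch is tagged with its batchId and skipped if the sink's commit log shows it was already loaded. The alreadyLoaded check and the paths are hypothetical placeholders, not the PSTL mechanism.

```scala
import org.apache.spark.sql.DataFrame

object IdempotentWriteSketch {
  // Placeholder lookup against whatever commit log the sink provides.
  def alreadyLoaded(batchId: Long): Boolean = false

  def writeBatch(batch: DataFrame, batchId: Long): Unit = {
    if (!alreadyLoaded(batchId)) {
      // Writing each batch under a batch_id directory means a replayed batch
      // overwrites its earlier, possibly partial, output instead of duplicating it.
      batch.write
        .mode("overwrite")
        .parquet(s"hdfs:///data/pstl/events_parsed/batch_id=$batchId")
      // Record batchId in the commit log here, ideally atomically with the write.
    }
  }
}

// Usage from a streaming query (messages from the read sketch above):
//   messages.writeStream.foreachBatch(IdempotentWriteSketch.writeBatch _).start()
```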