# Shredding
Snowplow has a Shredding process which consists of two phases:
- Extracting unstructured event JSONs and context JSONs from enriched event files into their own files
- Loading those files into corresponding tables in Redshift
Hive and Postgres are not currently supported for shredding.
The shredding flow and its main components are highlighted in blue on the right-hand side of the Snowplow technical architecture diagram.
Iglu is a schema repository technology which holds the JSON Schemas against which unstructured events and context JSONs are validated.
For more information on Iglu, see the [Iglu wiki][iglu-wiki].
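To make this concrete, every unstructured event and context is a *self-describing JSON*: a wrapper that pairs the payload with the Iglu URI of the JSON Schema it must validate against. A minimal sketch, with a made-up vendor and event name:

```scala
// Illustrative only: "com.acme" and "button_click" are invented names.
// A self-describing JSON pairs a payload ("data") with the Iglu URI
// ("schema") of the JSON Schema used to validate it.
val selfDescribingEvent: String =
  """{
    |  "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
    |  "data": { "buttonId": "checkout", "timesClicked": 2 }
    |}""".stripMargin
```

The URI follows the pattern `iglu:vendor/name/format/model-revision-addition`, which is what lets the shredder route each JSON to a schema-specific folder and, later, a schema-specific table.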
Scala Hadoop Shred is a dedicated Scalding job that performs the JSON validation and extraction. This is a five-step process (a sketch of the folder derivation follows the list):
- Reads Snowplow enriched events from S3
- Extracts any unstructured event JSONs and context JSONs found
- Validates that these JSONs conform to their JSON Schemas
- Adds metadata to these JSONs to track their origins
- Writes these JSONs out to nested folders derived from their schema
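As a hedged sketch of the last step (this is not the actual Scalding code; the object and method names are invented), the nested output folder can be derived directly from the event's Iglu schema URI:

```scala
// Minimal sketch, assuming the Iglu URI format
// iglu:vendor/name/format/model-revision-addition.
object ShredSketch {

  private val IgluUri = "iglu:([^/]+)/([^/]+)/([^/]+)/(\\d+)-(\\d+)-(\\d+)".r

  /** Derive the nested output folder for a shredded type,
    * e.g. "com.acme/button_click/jsonschema/1-0-0". */
  def shreddedPath(schemaUri: String): Option[String] =
    schemaUri match {
      case IgluUri(vendor, name, format, model, revision, addition) =>
        Some(s"$vendor/$name/$format/$model-$revision-$addition")
      case _ => None // not a valid Iglu URI; the job would fail this JSON
    }
}
```

Grouping output by schema in this way is what allows each shredded type to be loaded into its own Redshift table downstream.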
Configuring this is covered in [[Configuring shredding]].
The StorageLoader loads shredded types into corresponding tables in Redshift using Redshift's native COPY FROM JSON support. This is a multi-step process (a sketch follows the list):
- Find folders of shredded types in S3
- For each folder of shredded types:
- Find the JSON Paths file that corresponds to the shredded type
- Determine the Redshift table name from the shredded type
- Load the shredded type files into the Redshift table using the JSON Paths file
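As a hedged sketch of the last two steps (in Scala for consistency with the examples above; this is not StorageLoader's actual implementation, and all paths, buckets, and the IAM role below are placeholders):

```scala
// Minimal sketch: map a shredded type to a Redshift table name and
// build the COPY statement that loads it via a JSON Paths file.
object LoadSketch {

  /** Simplified naming convention: dots in the vendor become underscores
    * and the schema model number is appended, e.g.
    * ("com.acme", "button_click", 1) => "com_acme_button_click_1". */
  def tableName(vendor: String, name: String, model: Int): String =
    s"${vendor.replace('.', '_')}_${name}_$model"

  /** Redshift COPY using a JSON Paths file to map JSON properties
    * to table columns. */
  def copyStatement(table: String, s3Path: String,
                    jsonPaths: String, iamRole: String): String =
    s"""COPY $table FROM '$s3Path'
       |CREDENTIALS 'aws_iam_role=$iamRole'
       |JSON '$jsonPaths'
       |MAXERROR 0""".stripMargin
}

// Example (all values hypothetical):
// LoadSketch.copyStatement(
//   LoadSketch.tableName("com.acme", "button_click", 1),
//   "s3://my-shredded-bucket/good/com.acme/button_click/jsonschema/1-",
//   "s3://my-jsonpaths-bucket/com.acme/button_click_1.json",
//   "arn:aws:iam::123456789012:role/RedshiftLoadRole")
```

Because COPY is given a JSON Paths file rather than matching on column names, the JSON Paths file must list its expressions in the same order as the target table's columns.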
Configuring this is covered in [[Loading shredded types]].