# Shredding
Snowplow has a Shredding process which consists of two phases:
- Extracting unstructured event JSONs and context JSONs from enriched event files into their own files
- Loading those files into corresponding tables in Redshift
Hive and Postgres are not currently supported for shredding.
The shredding flow and its main components are highlighted in blue on the right-hand side of the Snowplow technical architecture diagram.
Iglu is a schema repository technology which holds the JSON Schemas against which unstructured events and context JSONs are validated.
For more information on Iglu, see the [Iglu wiki][iglu-wiki].
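To make this concrete, every unstructured event and context is a *self-describing JSON*: a wrapper that pairs the payload with the Iglu URI of the JSON Schema it must validate against. A minimal sketch, with a made-up vendor and event name:

```scala
// Illustrative only: "com.acme" and "button_click" are invented names.
// A self-describing JSON pairs a payload ("data") with the Iglu URI
// ("schema") of the JSON Schema used to validate it.
val selfDescribingEvent: String =
  """{
    |  "schema": "iglu:com.acme/button_click/jsonschema/1-0-0",
    |  "data": { "buttonId": "checkout", "timesClicked": 2 }
    |}""".stripMargin
```

The URI follows the pattern `iglu:vendor/name/format/model-revision-addition`, which is what lets the shredder route each JSON to a schema-specific folder and, later, a schema-specific table.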
Scala Hadoop Shred is a dedicated Scalding job that performs the JSON validation and extraction. This is a five-step process (a sketch of the folder derivation follows the list):
- Reads Snowplow enriched events from S3
- Extracts any unstructured event JSONs and context JSONs found
- Validates that these JSONs conform to their JSON Schemas
- Adds metadata to these JSONs to track their origins
- Writes these JSONs out to nested folders derived from their schema
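As a hedged sketch of the last step (this is not the actual Scalding code; the object and method names are invented), the nested output folder can be derived directly from the event's Iglu schema URI:

```scala
// Minimal sketch, assuming the Iglu URI format
// iglu:vendor/name/format/model-revision-addition.
object ShredSketch {

  private val IgluUri = "iglu:([^/]+)/([^/]+)/([^/]+)/(\\d+)-(\\d+)-(\\d+)".r

  /** Derive the nested output folder for a shredded type,
    * e.g. "com.acme/button_click/jsonschema/1-0-0". */
  def shreddedPath(schemaUri: String): Option[String] =
    schemaUri match {
      case IgluUri(vendor, name, format, model, revision, addition) =>
        Some(s"$vendor/$name/$format/$model-$revision-$addition")
      case _ => None // not a valid Iglu URI; the job would fail this JSON
    }
}
```

Grouping output by schema in this way is what allows each shredded type to be loaded into its own Redshift table downstream.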
Configuring this is covered in [[Configuring shredding]].
The StorageLoader loads shredded types into corresponding tables in Redshift using Redshift's native COPY FROM JSON support. This is a multi-step process (a sketch follows the list):
- Find folders of shredded types in S3
- For each folder of shredded types:
- Find the JSON Paths file that corresponds to the shredded type
- Determine the Redshift table name from the shredded type
- Load the shredded type files into the Redshift table using the JSON Paths file
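As a hedged sketch of the last two steps (in Scala for consistency with the examples above; this is not StorageLoader's actual implementation, and all paths, buckets, and the IAM role below are placeholders):

```scala
// Minimal sketch: map a shredded type to a Redshift table name and
// build the COPY statement that loads it via a JSON Paths file.
object LoadSketch {

  /** Simplified naming convention: dots in the vendor become underscores
    * and the schema model number is appended, e.g.
    * ("com.acme", "button_click", 1) => "com_acme_button_click_1". */
  def tableName(vendor: String, name: String, model: Int): String =
    s"${vendor.replace('.', '_')}_${name}_$model"

  /** Redshift COPY using a JSON Paths file to map JSON properties
    * to table columns. */
  def copyStatement(table: String, s3Path: String,
                    jsonPaths: String, iamRole: String): String =
    s"""COPY $table FROM '$s3Path'
       |CREDENTIALS 'aws_iam_role=$iamRole'
       |JSON '$jsonPaths'
       |MAXERROR 0""".stripMargin
}

// Example (all values hypothetical):
// LoadSketch.copyStatement(
//   LoadSketch.tableName("com.acme", "button_click", 1),
//   "s3://my-shredded-bucket/good/com.acme/button_click/jsonschema/1-",
//   "s3://my-jsonpaths-bucket/com.acme/button_click_1.json",
//   "arn:aws:iam::123456789012:role/RedshiftLoadRole")
```

Because COPY is given a JSON Paths file rather than matching on column names, the JSON Paths file must list its expressions in the same order as the target table's columns.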
Configuring this is covered in [[Loading shredded types]].