6 Configuring shredding - winlinvip/snowplow GitHub Wiki
## 1. Overview
Snowplow has a Shredding process for Redshift which consists of two phases:
- Extracting unstructured event JSONs and context JSONs from enriched event files into their own files
- Loading those files into corresponding tables in Redshift
The first phase is orchestrated by EmrEtlRunner; this page explains how to configure the shredding process to operate smoothly with EmrEtlRunner.
Note: this guide is ONLY required if you want to shred your own unstructured event JSONs and context JSONs. If you are only shredding Snowplow-authored JSONs like link clicks and ad impressions, then you can skip this page and go straight to Loading shredded types.
## 2. Pre-requisites
First off, we assume that all JSONs you are sending as unstructured events and contexts are self-describing JSONs. To find out more about self-describing JSONs:
- Iglu documentation on self-describing JSONs
- JavaScript Tracker 2.0.0 release on self-describing JSONs
- SchemaVer for semantic schema versioning
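As a rough illustration of the shape these documents take: a self-describing JSON wraps its payload in a `data` field and names its schema in a `schema` field, using an Iglu URI of the form `iglu:vendor/name/format/version`, where the version follows SchemaVer (`MODEL-REVISION-ADDITION`). The Python sketch below is not part of Snowplow; the `com.acme` vendor, the `button_press` event, and the helpers `parse_iglu_uri` and `is_self_describing` are all hypothetical names chosen for the example.

```python
import re

# Iglu schema URI: iglu:<vendor>/<name>/<format>/<MODEL-REVISION-ADDITION>
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[A-Za-z0-9_.-]+)/(?P<name>[A-Za-z0-9_-]+)/"
    r"(?P<format>[A-Za-z0-9_-]+)/(?P<version>\d+-\d+-\d+)$"
)


def parse_iglu_uri(uri):
    """Split an Iglu URI into its four parts, or raise ValueError."""
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError("not a valid Iglu schema URI: %r" % (uri,))
    return match.groupdict()


def is_self_describing(doc):
    """True if doc has a well-formed 'schema' URI and a 'data' payload."""
    if not isinstance(doc, dict) or "schema" not in doc or "data" not in doc:
        return False
    try:
        parse_iglu_uri(doc["schema"])
    except (ValueError, TypeError):
        return False
    return True


# Hypothetical unstructured event, as a tracker might send it.
event = {
    "schema": "iglu:com.acme/button_press/jsonschema/1-0-0",
    "data": {"button_id": "checkout"},
}

parts = parse_iglu_uri(event["schema"])
```

Here `parts["vendor"]` is `com.acme`, which is the value the `:vendor_prefixes:` setting later on this page is matched against.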
Secondly, we assume that you have defined self-describing JSON Schemas for each of your JSONs.
Thirdly, we assume that you have set up your own Iglu schema repository to host your schemas.
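To make the second and third assumptions more concrete, here is a minimal Python sketch of a self-describing JSON Schema and of where it might live in a repository. It assumes the conventional `self` block (`vendor`, `name`, `format`, `version`) and a static repository layout of `schemas/<vendor>/<name>/<format>/<version>`; check both against the Iglu documentation for your version. The schema contents and the helpers `schema_path` and `schema_url` are hypothetical.

```python
# Hypothetical self-describing JSON Schema for a com.acme event.
schema = {
    "$schema": "http://iglucentral.com/schemas/"
               "com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Hypothetical schema for a button press event",
    "self": {
        "vendor": "com.acme",
        "name": "button_press",
        "format": "jsonschema",
        "version": "1-0-0",
    },
    "type": "object",
    "properties": {"button_id": {"type": "string"}},
    "required": ["button_id"],
    "additionalProperties": False,
}


def schema_path(schema):
    """Relative path of the schema, assuming the static repository layout."""
    return "schemas/{vendor}/{name}/{format}/{version}".format(**schema["self"])


def schema_url(repo_uri, schema):
    """URL at which a repository rooted at repo_uri would serve the schema."""
    return repo_uri.rstrip("/") + "/" + schema_path(schema)


url = schema_url("http://internal.acme.com/iglu", schema)
```

Fetching `url` with a browser or `curl` is a quick way to confirm the repository is reachable before running EmrEtlRunner.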
You are now ready to configure EmrEtlRunner for shredding.
## 3. Configuring EmrEtlRunner
The first relevant section of the EmrEtlRunner's `config.yml` is:

    :shredded:
      :good: ADD HERE # e.g. s3://my-out-bucket/shredded/good
      :bad: ADD HERE # e.g. s3://my-out-bucket/shredded/bad
      :errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below

Please make sure that these shredded buckets are set correctly.
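One way to catch a placeholder left in place is a small pre-flight check. The Python sketch below is not something EmrEtlRunner provides: `check_shredded_buckets` is a hypothetical helper, and it assumes the `:shredded:` section has already been loaded into a plain dict keyed by `good`, `bad` and `errors`.

```python
from urllib.parse import urlparse

PLACEHOLDER = "ADD HERE"


def check_shredded_buckets(shredded, continue_on_unexpected_error=False):
    """Return a list of problems found; an empty list means the paths look sane."""
    problems = []
    required = ["good", "bad"]
    # :errors: only needs a value when :continue_on_unexpected_error: is true.
    if continue_on_unexpected_error:
        required.append("errors")
    for key in required:
        value = (shredded.get(key) or "").strip()
        if not value or value == PLACEHOLDER:
            problems.append("%s: not set" % key)
            continue
        parsed = urlparse(value)
        if parsed.scheme != "s3" or not parsed.netloc:
            problems.append(
                "%s: expected an s3://bucket/path URI, got %r" % (key, value)
            )
    return problems


ok = check_shredded_buckets({
    "good": "s3://my-out-bucket/shredded/good",
    "bad": "s3://my-out-bucket/shredded/bad",
    "errors": None,
})
```

The check is purely syntactic; it does not confirm that the buckets exist or that your credentials can write to them.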
Next, we let EmrEtlRunner know about your Iglu schema repository, so that schemas can be retrieved from there as well as from Iglu Central. The relevant section of `config.yml` is:

    :iglu:
      :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
      :data:
        :cache_size: 500
        :repositories:
          - :name: "Iglu Central"
            :priority: 0
            :vendor_prefixes:
              - com.snowplowanalytics
            :connection:
              :http:
                :uri: http://iglucentral.com

You must add an extra entry in the `:repositories:` array pointing to your own Iglu schema repository.
For more information on how to do this, please review the Iglu client configuration wiki page. The EmrEtlRunner converts the YAML format given above into an Iglu client configuration JSON automatically.
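The sketch below illustrates what such a conversion could look like; it is not EmrEtlRunner's actual code (EmrEtlRunner is written in Ruby). It assumes the Iglu client configuration JSON uses camelCase keys such as `cacheSize` and `vendorPrefixes`, and that the YAML has been loaded with its symbol-style keys preserved as strings with a leading colon. The helpers `to_camel` and `to_resolver` are hypothetical.

```python
import json


def to_camel(key):
    """':vendor_prefixes' -> 'vendorPrefixes': drop the colon, camelCase the rest."""
    head, *rest = key.lstrip(":").split("_")
    return head + "".join(part.capitalize() for part in rest)


def to_resolver(node):
    """Recursively rename keys in dicts and lists; leave scalar values alone."""
    if isinstance(node, dict):
        return {to_camel(k): to_resolver(v) for k, v in node.items()}
    if isinstance(node, list):
        return [to_resolver(item) for item in node]
    return node


# The :iglu: section from config.yml, as a parsed structure.
iglu_section = {
    ":schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
    ":data": {
        ":cache_size": 500,
        ":repositories": [
            {
                ":name": "Iglu Central",
                ":priority": 0,
                ":vendor_prefixes": ["com.snowplowanalytics"],
                ":connection": {":http": {":uri": "http://iglucentral.com"}},
            }
        ],
    },
}

resolver = to_resolver(iglu_section)
print(json.dumps(resolver, indent=2))
```

The printed JSON is itself a self-describing JSON: its `schema` field points at the resolver-config schema and its `data` field carries the repository list.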
Your updated `config.yml` will end up looking something like:

    :iglu:
      :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
      :data:
        :cache_size: 500
        :repositories:
          - :name: "Iglu Central"
            :priority: 0
            :vendor_prefixes:
              - com.snowplowanalytics
            :connection:
              :http:
                :uri: http://iglucentral.com
          - :name: "Acme's Iglu repository"
            :priority: 0
            :vendor_prefixes:
              - com.acme
            :connection:
              :http:
                :uri: http://internal.acme.com/iglu

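
To see why the `:vendor_prefixes:` entry matters, the following Python sketch checks which configured repositories claim a given schema vendor. It is a simplification under stated assumptions: it only performs a string-prefix match, and it does not model how the real Iglu client orders lookups by priority or falls back to other repositories. `repositories_for` is a hypothetical helper.

```python
def repositories_for(vendor, repositories):
    """Names of repositories with a vendor prefix matching `vendor`, in listed order."""
    return [
        repo["name"]
        for repo in repositories
        if any(vendor.startswith(prefix) for prefix in repo["vendor_prefixes"])
    ]


# The two repositories from the example configuration above.
repositories = [
    {
        "name": "Iglu Central",
        "priority": 0,
        "vendor_prefixes": ["com.snowplowanalytics"],
        "uri": "http://iglucentral.com",
    },
    {
        "name": "Acme's Iglu repository",
        "priority": 0,
        "vendor_prefixes": ["com.acme"],
        "uri": "http://internal.acme.com/iglu",
    },
]

acme_matches = repositories_for("com.acme", repositories)
```

If a vendor you track under matches no prefix here, that is a hint to revisit `:vendor_prefixes:` for your own repository.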
That's it for configuring EmrEtlRunner for shredding. Next, please refer to the Loading shredded types wiki page to understand how to configure StorageLoader.