6 Configuring shredding - OXYGEN-MARKET/oxygen-market.github.io GitHub Wiki
HOME » SNOWPLOW SETUP GUIDE » Step 3: Setting up Enrich » Step 3.1: setting up EmrEtlRunner » 5: Configuring shredding
Snowplow has a Shredding process for Redshift which contributes to the following three phases:
- Extracting unstructured event JSONs and context JSONs from enriched event files into their own files
- Removing endogenous duplicate records, which are sometimes introduced within the Snowplow pipeline (feature added to r76)
- Loading those files into corresponding tables in Redshift
The first two phases are instrumented by EmrEtlRunner; in this page we will explain how to configure the shredding process to operate smoothly with EmrEtlRunner.
Note: Even though the first phase is required only if you want to shred your own unstructured event JSONs and context JSONs, the second phase will be beneficial to data modeling and analysis. If none of it is required and you are only shredding Snowplow-authored JSONs like link clicks and ad impressions, then you can skip this page and go straight to Loading shredded types.
First off, we assume that all JSONs you are sending as unstructured events and contexts are self-describing JSONs. To find out more about self-describing JSONs:
- Iglu documentation on self-describing JSONs
- JavaScript Tracker 2.0.0 release on self-describing JSONs
- SchemaVer for semantic schema versioning
Secondly, we assume that you have defined self-describing JSON Schemas for each of your JSONs. Resources:
Thirdly, we assume that you have setup your own Iglu schema registry to host your schemas. Resources:
» Read more about the topics related to events and contexts:
You are now ready to configure EmrEtlRunner for shredding.
The relevant section of the EmrEtlRunner's config.yml
is:
shredded:
good: s3://my-out-bucket/shredded/good # e.g. s3://my-out-bucket/shredded/good
bad: s3://my-out-bucket/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
errors: s3://my-out-bucket/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
archive: s3://my-out-bucket/shredded/archive # Not required for Postgres currently
The configuration file is referenced with --config
option to EmrEtlRunner.
Please make sure that these shredded buckets are set correctly.
Next, we let EmrEtlRunner know about your Iglu schema registry, so that schemas can be retrieved from there as well as from Iglu Central. Add your own registry to the repositories array in iglu_resolver.json
file:
{
"schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
"data": {
"cacheSize": 500,
"repositories": [
{
"name": "Iglu Central",
"priority": 0,
"vendorPrefixes": [ "com.snowplowanalytics" ],
"connection": {
"http": {
"uri": "http://iglucentral.com"
}
}
}
#custom section starts here -->
,
{
...
}
#custom section ends here <--
]
}
}
You must add an extra entr(-y/ies) in the repositories:
array pointing to your own Iglu schema registry. If you are not submitting custom events and contexts and are not interested in shredding then there's no need in adding the custom section but the iglu_resolver.json
file is still required and is referenced with --resolver
option to EmrEtlRunner.
For more information on how to customize the iglu_resolver.json
file, please review the Iglu client configuration wiki page.
That's it for configuring EmrEtlRunner for shredding. Next, please refer to the Loading shredded types wiki page to understand how to configure StorageLoader.