1 Installing EmrEtlRunner - OXYGEN-MARKET/oxygen-market.github.io GitHub Wiki
This guide assumes that you have administrator access to a Unix-based server (e.g. Ubuntu, OS X, Fedora) on which you can install EmrEtlRunner and schedule a regular cronjob.
You might wish to try out the steps showing you how an EC2 instance could be set up via AWS CLI.
In theory EmrEtlRunner can be deployed onto a Windows-based server, using the Windows Task Scheduler instead of cron, but this has not been tested or documented.
You will need to set up EmrEtlRunner on your own server. A number of people choose to do so on an EC2 instance (thereby keeping all of Snowplow in the Amazon Cloud). If you do so, please note that you must not use a t1.micro instance. You should at the very least use an m1.small instance.
The EmrEtlRunner jar is available for download. For more information, see the Hosted assets page.
* If you prefer, an alternative Ruby manager such as chruby or rbenv should work fine too.
You will also need an EC2 key pair set up in your Amazon EMR account.
For details on how to do this, please see Create a Key Pair. Make sure that you set up the EC2 key pair inside the region in which you will be running your ETL jobs.
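If you have the AWS CLI installed and configured, the key pair can also be created from the command line. The following is a minimal sketch rather than the documented procedure: the function name create_emr_key_pair, the key name and the region are illustrative assumptions, and the region you pass should be the one in which your ETL jobs will run.

```shell
#!/bin/sh
# Hypothetical helper (not part of Snowplow): create an EC2 key pair in the
# given region and save the private key to <key-name>.pem in the current directory.
create_emr_key_pair() {
  key_name="$1"   # assumed name, e.g. snowplow-emr
  region="$2"     # must be the region where the ETL jobs will run
  pem_file="${key_name}.pem"

  # `aws ec2 create-key-pair` returns the private key in the KeyMaterial field.
  if aws ec2 create-key-pair \
       --key-name "$key_name" \
       --region "$region" \
       --query 'KeyMaterial' \
       --output text > "$pem_file"; then
    # Restrict permissions so SSH clients will accept the key file.
    chmod 400 "$pem_file"
  else
    # Remove the empty file left behind by the failed redirect.
    rm -f "$pem_file"
    return 1
  fi
}
```

For example, create_emr_key_pair snowplow-emr eu-west-1 would write snowplow-emr.pem to the current directory; both arguments here are placeholders.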
EmrEtlRunner processes data through three distinct states:
- :raw - raw Snowplow event logs are the input to the EmrEtlRunner process
- :enriched - EmrEtlRunner validates and enriches the raw event logs into enriched events
- :shredded - EmrEtlRunner shreds JSONs found in enriched events ready for loading into dedicated Redshift tables
For :raw:in, specify the Amazon S3 path you configured for your Snowplow collector.
For all other S3 locations, you can specify paths within a single S3 bucket that you set up now. This bucket must be in the same AWS region as your :raw:in bucket.
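As a rough sketch of the single-bucket approach, the helpers below create a bucket in a chosen region with the AWS CLI and print one candidate path per location. The function names and the good/bad/errors/archive prefixes are illustrative assumptions, not names required by EmrEtlRunner; use whichever paths you later enter in your configuration file.

```shell
#!/bin/sh
# Hypothetical helper: create the data bucket in the same region as the
# :raw:in bucket. Bucket name and region are supplied by the caller.
make_data_bucket() {
  bucket="$1"
  region="$2"
  aws s3 mb "s3://${bucket}" --region "$region"
}

# Hypothetical helper: print one S3 path per location under a single bucket.
# The prefix names are assumptions chosen for illustration only.
print_s3_layout() {
  bucket="$1"
  for state in enriched shredded; do
    for kind in good bad errors archive; do
      printf 's3://%s/%s/%s\n' "$bucket" "$state" "$kind"
    done
  done
}
```

Running print_s3_layout with your bucket name lists eight paths (four per state) that you can copy into the matching entries of the configuration file.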
Done? Right, now we can install EmrEtlRunner.
We host EmrEtlRunner on the distribution platform JFrog Bintray. You can get a copy of it as shown below.
Note: follow this link to choose your version of the EmrEtlRunner. The distribution name follows the pattern snowplow_emr_{{RELEASE_VERSION}}.zip.
$ wget http://dl.bintray.com/snowplow/snowplow-generic/snowplow_emr_{{RELEASE_VERSION}}.zip
The archive contains both EmrEtlRunner and StorageLoader. Unzip the archive:
$ unzip snowplow_emr_{{RELEASE_VERSION}}.zip
You will see two files, snowplow-emr-etl-runner and snowplow-storage-loader, of which the first is the actual EmrEtlRunner.
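The download and unzip steps can be combined into one small script. This is a sketch under assumptions: the function names are hypothetical, the release version is passed in by the caller, and the final chmod is a precaution in case the executable bit is not preserved on your system.

```shell
#!/bin/sh
# Hypothetical helper: build the download URL for a given release version,
# following the snowplow_emr_{{RELEASE_VERSION}}.zip naming pattern.
emr_archive_url() {
  printf 'http://dl.bintray.com/snowplow/snowplow-generic/snowplow_emr_%s.zip\n' "$1"
}

# Hypothetical helper: download the archive, unpack it, and check that both
# expected files are present in the current directory.
install_emr_etl_runner() {
  version="$1"
  archive="snowplow_emr_${version}.zip"

  wget -O "$archive" "$(emr_archive_url "$version")" || return 1
  unzip -o "$archive" || return 1

  for f in snowplow-emr-etl-runner snowplow-storage-loader; do
    if [ ! -f "$f" ]; then
      echo "missing expected file: $f" >&2
      return 1
    fi
  done

  # Precaution (assumption): make sure both files are executable.
  chmod +x snowplow-emr-etl-runner snowplow-storage-loader
}
```

Call install_emr_etl_runner with the release version you chose from the download page; the function returns a non-zero status if the download, the unzip, or the file check fails.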
EmrEtlRunner requires a YAML format configuration file to run. There is a configuration file template available in the Snowplow GitHub repository at /3-enrich/emr-etl-runner/config/config.yml.sample. See Common configuration for more information on how to write this file.
Storage for data can be configured using storage target JSONs. Configuration file templates are available in the Snowplow GitHub repository in the /4-storage/config/targets directory.
You will also need an Iglu resolver configuration file. This is where we list the schema repositories to use to retrieve JSON Schemas for validation. For more information on this, see the wiki page for Configuring shredding.
If you wish to use Snowplow enrichments, see the wiki page for configuring enrichments.
All done installing EmrEtlRunner? Then learn how to use it.