# 1: Installing EmrEtlRunner

1. Assumptions
2. Dependencies
3. Installation
4. Configuration
5. Configuring enrichments
6. Next steps

## 1. Assumptions

This guide assumes that you have administrator access to a Unix-based server (e.g. Ubuntu, OS X, Fedora) on which you can install EmrEtlRunner and schedule a regular cronjob.

In theory EmrEtlRunner can be deployed onto a Windows-based server, using the Windows Task Scheduler instead of cron, but this has not been tested or documented.

## 2. Dependencies

### 2.1 Hardware

You will need to set up EmrEtlRunner on your own server. A number of people choose to do so on an EC2 instance (thereby keeping all of Snowplow in the Amazon cloud). If you do so, please note that you must not use a t1.micro instance. You should at the very least use an m1.small instance.

### 2.2 Software

The EmrEtlRunner jar is available for download. For more information, see the Hosted assets page.

Alternatively, to build EmrEtlRunner yourself, first make sure that your server has all of the following installed:

- Git - see the Git Installation Guide
- Ruby and RVM* - see our Ruby and RVM setup guide. Both EmrEtlRunner and StorageLoader require Ruby 1.9.3.

* If you prefer, an alternative Ruby manager such as chruby or rbenv should work fine too.
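
Before building, it can help to confirm that the prerequisites are actually on your `PATH`. The following is a minimal sketch, not part of the Snowplow tooling: the helper name `have_tool` is hypothetical, and the script only reports what it finds rather than enforcing Ruby 1.9.3.

```shell
#!/bin/sh
# have_tool is a hypothetical helper for this sketch:
# it returns 0 if the named command is on PATH, non-zero otherwise.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report each build prerequisite named in this guide.
for tool in git ruby; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

# Print the active Ruby version, if any, so you can compare it with 1.9.3.
if have_tool ruby; then
  ruby -e 'puts RUBY_VERSION'
fi
```

If Ruby is reported as missing or at a different version, revisit the Ruby and RVM setup guide before continuing.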

### 2.3 EC2 key

You will also need an EC2 key pair set up in your Amazon EMR account.

For details on how to do this, please see the section "Configuring the client" in the Setting up EMR command line tools wiki page. Make sure that you set up the EC2 key pair inside the region in which you will be running your ETL jobs.

### 2.4 S3 locations

EmrEtlRunner processes data through three distinct states:

:raw - raw Snowplow event logs are the input to the EmrEtlRunner process
:enriched - EmrEtlRunner validates and enriches the raw event logs into enriched events
:shredded - EmrEtlRunner shreds JSONs found in enriched events ready for loading into dedicated Redshift tables

For :raw:in, specify the Amazon S3 path you configured for your Snowplow collector.

For all other S3 locations, you can specify paths within a single S3 bucket that you set up now. This bucket must be in the same AWS region as your :raw:in bucket.
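
As an illustration of the single-bucket layout, the sketch below derives one prefix per processing state from a bucket name. The bucket name, region, and prefix names are all placeholders chosen for this example, not values required by EmrEtlRunner. The bucket creation command needs the AWS CLI and credentials, so it is left commented out.

```shell
#!/bin/sh
# Hypothetical names: replace with your own bucket and region.
BUCKET="my-snowplow-etl"
REGION="us-east-1"   # must match the region of your :raw:in bucket

# One prefix per processing state, all inside the same bucket.
ENRICHED_PATH="s3://${BUCKET}/enriched"
SHREDDED_PATH="s3://${BUCKET}/shredded"

# Creating the bucket requires the AWS CLI and valid credentials:
# aws s3 mb "s3://${BUCKET}" --region "${REGION}"

echo "enriched: ${ENRICHED_PATH}"
echo "shredded: ${SHREDDED_PATH}"
```

Keep a note of the paths you choose; you will need them when you write the configuration file.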

Done? Right, now we can install EmrEtlRunner.

## 3. Installation

To build EmrEtlRunner yourself, check out the Snowplow repository and navigate to the EmrEtlRunner root:

    $ git clone https://github.com/snowplow/snowplow.git
    $ cd snowplow/3-enrich/emr-etl-runner

Next you are ready to build the application on your system:

    $ ./build.sh

Check it worked okay:

    $ ./deploy/snowplow-emr-etl-runner --version
    snowplow-emr-etl-runner 0.17.0

If you have any problems installing, please double-check that you have successfully completed our Ruby and RVM setup guide.

## 4. Configuration

EmrEtlRunner requires a YAML configuration file to run. There is a configuration file template available in the Snowplow GitHub repository at /3-enrich/emr-etl-runner/config/config.yml.sample. See Common configuration for more information on how to write this file.
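
One way to start from the template is to copy it to a working file and edit the copy. The sketch below is a suggestion only: the function name `init_config` and the target name `config.yml` are choices made for this example, and the `chmod 600` assumes the file will end up holding credentials you do not want other users to read.

```shell
#!/bin/sh
# init_config is a hypothetical helper for this sketch.
# $1 is the EmrEtlRunner root (e.g. snowplow/3-enrich/emr-etl-runner).
# It copies the sample configuration without overwriting an existing file.
init_config() {
  sample="$1/config/config.yml.sample"
  target="$1/config/config.yml"
  if [ ! -f "$sample" ]; then
    echo "sample not found: $sample" >&2
    return 1
  fi
  if [ -f "$target" ]; then
    echo "already exists, leaving untouched: $target" >&2
    return 0
  fi
  # Restrict permissions on the copy (assumption: it will contain secrets).
  cp "$sample" "$target" && chmod 600 "$target"
}
```

From the directory where you cloned the repository, you would call it as `init_config snowplow/3-enrich/emr-etl-runner` and then edit `config/config.yml`.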

### Iglu

You will also need an Iglu resolver configuration file. This is where we list the schema repositories to use to retrieve JSON Schemas for validation. For more information on this, see the wiki page for Configuring shredding.
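
To give a feel for the shape of this file, the sketch below writes a minimal resolver configuration that lists a single repository, Iglu Central. The contents are reproduced from memory of Snowplow's sample resolver and should be treated as an assumption; check the field names and schema version against the Configuring shredding page before relying on them. The file name `resolver.json` is an arbitrary choice for this example.

```shell
#!/bin/sh
# Write a minimal Iglu resolver file to the current directory.
# Assumption: the structure below matches the resolver-config 1-0-0 schema;
# verify it against the Configuring shredding wiki page.
cat > resolver.json <<'EOF'
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}
EOF
echo "wrote resolver.json"
```

If you host your own schemas, you would add further entries to the `repositories` array alongside this one.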

## 5. Configuring enrichments

If you wish to use Snowplow enrichments, see the wiki page for configuring enrichments.

## 6. Next steps

All done installing EmrEtlRunner? Then learn how to use it.