1 Installing EmrEtlRunner - ClaraVista-IT/snowplow GitHub Wiki
## 1. Assumptions

This guide assumes that you have administrator access to a Unix-based server (e.g. Ubuntu, OS X, Fedora) on which you can install EmrEtlRunner and schedule a regular cron job.
In theory EmrEtlRunner can be deployed onto a Windows-based server, using the Windows Task Scheduler instead of cron, but this has not been tested or documented.
## 2. Dependencies

You will need to set up EmrEtlRunner on your own server. A number of people choose to do so on an EC2 instance (thereby keeping all of Snowplow in the Amazon Cloud). If you do so, please note that you must not use a `t1.micro` instance. You should at the very least use an `m1.small` instance.
The EmrEtlRunner jar is available for download. For more information, see the Hosted assets page.
Alternatively, to build EmrEtlRunner yourself, first make sure that your server has all of the following installed:
- Git - see the Git Installation Guide
- Ruby and RVM* - see our Ruby and RVM setup guide. Both EmrEtlRunner and StorageLoader require Ruby 1.9.3
* If you prefer, an alternative Ruby manager such as chruby or rbenv should work fine too.
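Before building, it can help to confirm that the prerequisites listed above are actually on the `PATH`. The following is a minimal sketch only: `check_tool` and `is_ruby_193` are hypothetical helper names introduced here for illustration, not part of Snowplow, and the version test assumes that `ruby -v` prints a line beginning with `ruby 1.9.3` for that series.

```shell
#!/bin/sh
# Sketch: report whether the build prerequisites are available.
# check_tool and is_ruby_193 are hypothetical helpers, not part of Snowplow.

# Print "ok: <tool>" if the tool is on the PATH, otherwise "missing: <tool>".
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Succeed only if the given "ruby -v" output reports the 1.9.3 series.
is_ruby_193() {
  case "$1" in
    "ruby 1.9.3"*) return 0 ;;
    *) return 1 ;;
  esac
}

for tool in git ruby; do
  check_tool "$tool" || true
done

if command -v ruby >/dev/null 2>&1 && is_ruby_193 "$(ruby -v)"; then
  echo "ok: Ruby 1.9.3 is active"
else
  echo "warning: Ruby 1.9.3 is not the active Ruby"
fi
```

If the warning appears while Ruby 1.9.3 is installed, the Ruby manager (RVM, chruby or rbenv) has probably not selected it in the current shell.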
You will also need an EC2 key pair set up in your Amazon EMR account.
For details on how to do this, please see the section "Configuring the client" in the Setting up EMR command line tools wiki page. Make sure that you set up the EC2 key pair inside the region in which you will be running your ETL jobs.
### 2.4 S3 locations

EmrEtlRunner processes data through three distinct states:
- :raw - raw Snowplow event logs are the input to the EmrEtlRunner process
- :enriched - EmrEtlRunner validates and enriches the raw event logs into enriched events
- :shredded - EmrEtlRunner shreds JSONs found in enriched events ready for loading into dedicated Redshift tables
For `:raw:in`, specify the Amazon S3 path you configured for your Snowplow collector.

For all other S3 locations, you can specify paths within a single S3 bucket that you set up now. This bucket must be in the same AWS region as your `:raw:in` bucket.
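The same-region requirement can be checked from the command line. The sketch below assumes the AWS CLI (`aws`) is installed and configured, which this guide does not otherwise require. The bucket names in the usage comment are placeholders, and `normalise_region`, `bucket_region` and `same_region` are hypothetical helper names. It relies on `aws s3api get-bucket-location`, which reports an empty location constraint (printed as `None` in text output) for buckets in `us-east-1`.

```shell
#!/bin/sh
# Sketch: compare the region of the :raw:in bucket with the ETL bucket.
# Assumes the AWS CLI is installed and configured; helper names are hypothetical.

# Buckets in us-east-1 report no location constraint, shown as "None".
normalise_region() {
  case "$1" in
    None|null|"") echo "us-east-1" ;;
    *) echo "$1" ;;
  esac
}

# Look up the region of one bucket (needs AWS credentials).
bucket_region() {
  normalise_region "$(aws s3api get-bucket-location --bucket "$1" \
    --query LocationConstraint --output text)"
}

# Succeed only if both region strings refer to the same region.
same_region() {
  [ "$(normalise_region "$1")" = "$(normalise_region "$2")" ]
}

# Usage (placeholder bucket names, not run here):
#   same_region "$(bucket_region my-collector-logs)" \
#               "$(bucket_region my-snowplow-etl)" && echo "regions match"
```

Keeping the comparison separate from the lookup means the logic can be exercised without credentials.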
Done? Right, now we can install EmrEtlRunner.
## 3. Installation

To build EmrEtlRunner yourself, check out the Snowplow repository and navigate to the EmrEtlRunner root:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow/3-enrich/emr-etl-runner
Next you are ready to build the application on your system:
$ ./build.sh
Check it worked okay:
$ ./deploy/snowplow-emr-etl-runner --version
snowplow-emr-etl-runner 0.17.0
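If the build is part of a provisioning script, the version check can be automated. This is a sketch under the assumption that `--version` prints the application name followed by the version number, as in the output shown above. `parse_runner_version` is a hypothetical helper, not part of Snowplow.

```shell
#!/bin/sh
# Sketch: confirm a build by inspecting the text printed by --version.
# parse_runner_version is a hypothetical helper, not part of Snowplow.

# Expect "<name> <version>", e.g. "snowplow-emr-etl-runner 0.17.0",
# and print only the version part.
parse_runner_version() {
  case "$1" in
    "snowplow-emr-etl-runner "*) echo "${1#snowplow-emr-etl-runner }" ;;
    *) echo "unexpected output: $1" >&2; return 1 ;;
  esac
}

RUNNER=./deploy/snowplow-emr-etl-runner
if [ -x "$RUNNER" ]; then
  parse_runner_version "$("$RUNNER" --version)" || echo "build check failed"
else
  echo "not built yet: $RUNNER"
fi
```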
If you have any problems installing, please double-check that you have successfully completed our Ruby and RVM setup guide.
## 4. Configuration

EmrEtlRunner requires a YAML format configuration file to run. There is a configuration file template available in the Snowplow GitHub repository at `/3-enrich/emr-etl-runner/config/config.yml.sample`. See Common configuration for more information on how to write this file.
You will also need an Iglu resolver configuration file. This is where we list the schema repositories to use to retrieve JSON Schemas for validation. For more information on this, see the wiki page for Configuring shredding.
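For orientation, a resolver file listing a single repository might be written as follows. This is a sketch rather than an authoritative template. The field names and the schema URI follow the Iglu `resolver-config` 1-0-0 format as best understood here, the file name `resolver.json` is a placeholder, and Iglu Central is used as the only repository. Check the details against the Configuring shredding page before relying on them.

```shell
#!/bin/sh
# Sketch: write a minimal Iglu resolver configuration to a file.
# The file name is a placeholder; the field names are an assumption based on
# the resolver-config 1-0-0 format and should be verified against the wiki.
RESOLVER_FILE="${RESOLVER_FILE:-resolver.json}"

cat > "$RESOLVER_FILE" <<'EOF'
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": { "uri": "http://iglucentral.com" }
        }
      }
    ]
  }
}
EOF

echo "wrote $RESOLVER_FILE"
```

Each additional schema repository would be a further entry in the `repositories` array.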
## 5. Configuring enrichments

If you wish to use Snowplow enrichments, see the wiki page for configuring enrichments.
## 6. Next steps

All done installing EmrEtlRunner? Then learn how to use it.