4: Self-hosting Hadoop Enrich


  1. Overview
  2. Bucket and directory setup
  3. Uploading files
  4. Configuring EmrEtlRunner
  5. Next steps

1. Overview

EmrEtlRunner runs the Hadoop Enrich process using assets that Snowplow hosts publicly on Amazon S3 - please see the Hosted assets page for details. For most users this is fine. However, there are some cases where you will need to self-host the Hadoop Enrich assets in your own Amazon S3 bucket. Two examples are:

  1. You are using a custom fork of the Hadoop Enrich process
  2. You are using a commercial version of the MaxMind GeoIP City database

For self-hosting instructions, read on.

2. Bucket and directory setup

First create a new S3 bucket, for example:

s3://[mycompanyname]-snowplow-hosted-assets

You do not need to give any public permissions on this bucket.

Now create the following two directory structures:

s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich
s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind
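
Note that S3 has no real directories - these paths are just key prefixes that come into existence when you upload objects under them. If you use the AWS CLI, a minimal sketch of the bucket setup might look like this (the bucket name is the same placeholder as above, and the region is an example):

# Create the private assets bucket; prefixes are created implicitly on upload
aws s3 mb s3://[mycompanyname]-snowplow-hosted-assets --region us-east-1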

That's it - you are now ready to upload your files.

3. Uploading files

3.1 MaxMind database

If you are using the free GeoLite City version of the MaxMind database, then download it from the Hosted assets page and upload it to:

s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind/GeoLiteCity.dat

If you are using a commercial version of the MaxMind GeoIP City database, then download it from your MaxMind account and upload it into this directory:

s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind/

Please note: MaxMind releases an updated version of the GeoCity database each month, so be sure to keep your version up-to-date.
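
As a rough sketch, the upload with the AWS CLI might look like this (the local file name assumes you are using the free database; adjust it for a commercial download):

# Upload the MaxMind database into the third-party/maxmind prefix
aws s3 cp GeoLiteCity.dat s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind/GeoLiteCity.dat

Re-running the same copy after downloading each monthly release keeps your hosted copy current.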

3.2 Hadoop Enrich process

If you are using the standard version of Hadoop Enrich, then download it from Hosted assets and upload to:

s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/snowplow-hadoop-enrich-[version].jar

If you are using a custom fork of the Hadoop Enrich process, then upload your assembled fatjar to:

s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/snowplow-hadoop-enrich-[version]-[fork].jar
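
In either case the upload is a single copy. For example, with the AWS CLI (the jar name, version and fork are placeholders matching the paths above):

# Upload the assembled fatjar; the trailing slash preserves the file name
aws s3 cp snowplow-hadoop-enrich-[version]-[fork].jar s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/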

4. Configuring EmrEtlRunner

In your config.yml file, set s3:buckets:assets to your own bucket name:

buckets:
  assets: s3://[mycompanyname]-snowplow-hosted-assets

If you are using a custom fork of the Hadoop Enrich process, make sure to update enrich:versions:hadoop_enrich to your own Hadoop Enrich version:

enrich:
  ...
  versions:
    ...
    hadoop_enrich: [version]-[fork]
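
Putting the two settings together, the relevant fragment of config.yml might look like the sketch below; the bucket name, version and fork are placeholders, and all unrelated keys are elided:

s3:
  buckets:
    assets: s3://[mycompanyname]-snowplow-hosted-assets
    ...
enrich:
  versions:
    hadoop_enrich: [version]-[fork]
  ...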

5. Next steps

And that's it - you should now be able to run EmrEtlRunner against the custom/commercial assets hosted in your own dedicated S3 bucket.
