# 4: Self-hosting Hadoop Enrich


  1. Overview
  2. Bucket and directory setup
  3. Uploading files
  4. Configuring EmrEtlRunner
  5. Next steps
## 1. Overview

EmrEtlRunner runs the Hadoop Enrich process using assets that Snowplow hosts publicly on Amazon S3 - see the Hosted assets page for details. For most users this is fine. However, in some cases you will need to self-host the Hadoop Enrich assets in your own Amazon S3 bucket. Two examples are:

  1. You are using a custom fork of the Hadoop Enrich process
  2. You are using a commercial version of the MaxMind GeoIP City database

For self-hosting instructions, read on.

## 2. Bucket and directory setup

First create a new S3 bucket, for example:

    s3://[mycompanyname]-snowplow-hosted-assets

You do not need to grant any public permissions on this bucket.

Now create the following two directory structures:

    s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich
    s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind

That's it - you are now ready to upload your files.
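
If you prefer to work from the command line, here is a minimal sketch of the bucket setup using the AWS CLI - the bucket name and region below are placeholders, so substitute your own:

```bash
# Create the private assets bucket; S3 buckets are private by default,
# so no further permissions work is needed.
aws s3 mb s3://mycompanyname-snowplow-hosted-assets --region us-east-1

# Note: S3 has no real directories. The two "directory structures" above
# are just key prefixes, and they come into existence automatically when
# the first file is uploaded under them.
```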

## 3. Uploading files

### 3.1 MaxMind database

If you are using the free GeoLite City version of the MaxMind database, then download it from the Hosted assets page and upload it to:

    s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind/GeoLiteCity.dat

If you are using a commercial version of the MaxMind GeoIP City database, then download it from your MaxMind account and upload it into this directory:

    s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind/

Please note: MaxMind releases an updated version of the database each month, so be sure to keep your self-hosted copy up-to-date.
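
As a hedged sketch, the upload can be done with the AWS CLI - the free database file name comes from above, while the commercial file name below is purely illustrative:

```bash
# Free database: upload under the exact file name given above.
aws s3 cp GeoLiteCity.dat \
  s3://mycompanyname-snowplow-hosted-assets/third-party/maxmind/GeoLiteCity.dat

# Commercial database: upload the file from your MaxMind account into the
# same prefix (GeoIPCity.dat is an illustrative file name).
aws s3 cp GeoIPCity.dat \
  s3://mycompanyname-snowplow-hosted-assets/third-party/maxmind/GeoIPCity.dat
```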

### 3.2 Hadoop Enrich process

If you are using the standard version of Hadoop Enrich, then download it from the Hosted assets page and upload it to:

    s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/snowplow-hadoop-enrich-[version].jar

If you are using a custom fork of the Hadoop Enrich process, then upload your assembled fat jar to:

    s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/snowplow-hadoop-enrich-[version]-[fork].jar
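
Again as a command-line sketch, assuming an illustrative version of 0.14.0 and a fork suffix of myfork - substitute your own values:

```bash
# Standard jar, downloaded from the Hosted assets page:
aws s3 cp snowplow-hadoop-enrich-0.14.0.jar \
  s3://mycompanyname-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/

# Custom fork: upload your assembled fat jar instead.
aws s3 cp snowplow-hadoop-enrich-0.14.0-myfork.jar \
  s3://mycompanyname-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/
```
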
## 4. Configuring EmrEtlRunner

In your config.yml file, set s3:buckets:assets to your own bucket name:

    buckets:
      assets: s3://[mycompanyname]-snowplow-hosted-assets

If you are using a custom fork of the Hadoop Enrich process, make sure to update enrich:versions:hadoop_enrich to your own Hadoop Enrich version:

    enrich:
      ...
      versions:
        ...
        hadoop_enrich: [version]-[fork]

## 5. Next steps

And that's it - you should now be able to run EmrEtlRunner against the custom/commercial assets hosted in your own dedicated S3 bucket.
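
For example, if you run EmrEtlRunner from a source checkout with Bundler, the invocation looks something like this - a sketch, with paths adjusted to your own setup:

```bash
# Kick off the EMR job using the config that points at your own assets bucket.
bundle exec bin/snowplow-emr-etl-runner --config config/config.yml
```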
