4: Self-hosting Hadoop Enrich
By default, EmrEtlRunner runs the Hadoop Enrich assets that Snowplow hosts publicly on Amazon S3 - please see the Hosted assets page for details. For most users this is fine. However, in some cases you will need to self-host the Hadoop Enrich assets in your own Amazon S3 bucket. Two examples are:
- You are using a custom fork of the Hadoop Enrich process
- You are using a commercial version of the MaxMind GeoCity database
For self-hosting instructions, read on.
First create a new S3 bucket, for example:
s3://[mycompanyname]-snowplow-hosted-assets
You do not need to give any public permissions on this bucket.
Now create the following two directory structures:
s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich
s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind
That's it - you are now ready to upload your files.
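S3 has no real directories, so the prefixes above appear implicitly when you upload the first object under them; only the bucket itself needs creating. A minimal AWS CLI sketch (the bucket name is a placeholder, and the command is echoed as a dry run so you can review it before executing):

```shell
#!/bin/sh
# Placeholder bucket name -- substitute your own.
BUCKET="mycompanyname-snowplow-hosted-assets"

# Build the command as a string and echo it (dry run); drop the echo to run it.
MB_CMD="aws s3 mb s3://${BUCKET}"
echo "${MB_CMD}"
# No public permissions are needed on this bucket, and the two prefixes
# are created automatically by the first upload under each of them.
```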
If you are using the free GeoLite City version of the MaxMind database, then download it from Hosted assets and upload it to:
s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind/GeoLiteCity.dat
If you are using a commercial version of the MaxMind GeoCity database, then download it from your MaxMind account and upload it into this directory:
s3://[mycompanyname]-snowplow-hosted-assets/third-party/maxmind/
Please note: MaxMind releases an updated version of the GeoCity database each month, so be sure to keep your version up-to-date.
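Either way, the upload itself is a single copy into the maxmind prefix. A sketch with the AWS CLI (bucket and file names are placeholders; the command is echoed for review rather than run):

```shell
#!/bin/sh
# Placeholders -- substitute your own bucket and downloaded database file.
BUCKET="mycompanyname-snowplow-hosted-assets"
DB_FILE="GeoLiteCity.dat"

# Dry-run preview of the upload command; remove the echo to execute it.
CP_CMD="aws s3 cp ./${DB_FILE} s3://${BUCKET}/third-party/maxmind/${DB_FILE}"
echo "${CP_CMD}"
```

Because MaxMind refreshes the database monthly, you could run the same copy command from a monthly cron job, overwriting the object in place.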
If you are using the standard version of Hadoop Enrich, then download it from Hosted assets and upload to:
s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/snowplow-hadoop-enrich-[version].jar
If you are using a custom fork of the Hadoop Enrich process, then upload your assembled fatjar to:
s3://[mycompanyname]-snowplow-hosted-assets/3-enrich/scala-hadoop-enrich/snowplow-hadoop-enrich-[version]-[fork].jar
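In both cases the jar lands in the scala-hadoop-enrich prefix; only the file name differs. A sketch, again with echoed (dry-run) commands - the version and fork values here are purely illustrative placeholders:

```shell
#!/bin/sh
# All values below are illustrative placeholders -- substitute your own.
BUCKET="mycompanyname-snowplow-hosted-assets"
VERSION="0.14.0"   # hypothetical Hadoop Enrich version
FORK="mycompany"   # leave empty if you are not running a fork

# Append "-${FORK}" to the jar name only when FORK is non-empty.
JAR="snowplow-hadoop-enrich-${VERSION}${FORK:+-${FORK}}.jar"

CP_CMD="aws s3 cp ./target/${JAR} s3://${BUCKET}/3-enrich/scala-hadoop-enrich/${JAR}"
echo "${CP_CMD}"
```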
In your config.yml file, set s3.buckets.assets to your own bucket name:
buckets:
  assets: s3://[mycompanyname]-snowplow-hosted-assets
If you are using a custom fork of the Hadoop Enrich process, make sure to update enrich:versions:hadoop_enrich to your own Hadoop Enrich version:
enrich:
  ...
  versions:
    ...
    hadoop_enrich: [version]-[fork]
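Taken together, the relevant parts of config.yml might look like the following sketch. The nesting follows the s3.buckets.assets and enrich:versions:hadoop_enrich paths given above; the bucket name and version string are illustrative placeholders:

```yaml
s3:
  buckets:
    assets: s3://mycompanyname-snowplow-hosted-assets  # your own bucket
enrich:
  versions:
    hadoop_enrich: 0.14.0-mycompany  # illustrative [version]-[fork]
```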
And that's it - you should now be able to run EmrEtlRunner against the custom/commercial assets hosted in your own dedicated S3 bucket.