EmrEtlRunner Input Formats
Supported input formats for the EmrEtlRunner are as follows:
Use this when you are running the CloudFront Collector.
Documentation:
Use this when you are running the Clojure Collector on Elastic Beanstalk.
Documentation:
Use this when you are using the Scala Stream Collector plus Kinesis LZO S3 Sink.
Documentation:
Use this when you are analyzing Amazon CloudFront access logs (web distribution format only).
If you use CloudFront as your CDN for web content, you can use Snowplow to process your CloudFront access logs. Snowplow will enrich these logs with the user-agent, page URI fragments and geo-location as standard.
To process CloudFront access logs, first create a new EmrEtlRunner config.yml:
- Set your :raw:in: bucket to where your logs are written
- Set your :etl:collector_format: to tsv/com.amazon.aws.cloudfront/wd_access_log
- Provide new bucket paths and a new job name, to prevent this job from clashing with your existing Snowplow job(s)
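As a rough illustration, the settings listed above might look like the following fragment of config.yml. This is a minimal sketch, not a complete configuration: the bucket names and job name are placeholders, and the nesting under :s3: and :buckets:, the :processing: and :archive: keys, and the :job_name: key are assumptions based on older EmrEtlRunner config layouts, so check them against the sample config shipped with your version.

```yaml
# Hypothetical fragment of a dedicated config.yml for CloudFront access logs.
# Only the keys discussed above are shown. All bucket names and the job name
# are placeholders; the key nesting is an assumption that depends on your
# EmrEtlRunner version.
:s3:
  :buckets:
    :raw:
      :in: s3://my-cloudfront-logs             # bucket where CloudFront writes its access logs
      :processing: s3://my-cf-etl/processing   # new path, not shared with your main Snowplow job
      :archive: s3://my-cf-etl/archive         # new path, not shared with your main Snowplow job
:etl:
  :job_name: CloudFront access logs ETL        # distinct name so the jobs are easy to tell apart
  :collector_format: tsv/com.amazon.aws.cloudfront/wd_access_log
```

Keeping this in its own file, separate from the config used by your main Snowplow job, makes it harder for the two jobs to write to the same buckets by accident.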
If you are running the Snowplow batch (Hadoop) flow with Amazon Redshift, then you should deploy the relevant event table into your Amazon Redshift database. You can find the table definition here:
You can either load these events into the database that holds your existing atomic.events table or, if you prefer, load them into an all-new database or schema. If you load them alongside your existing atomic.events table, make sure to schedule these loads so that they don't clash with your existing loads.
Use this when you are working with Urban Airship Connect. For more details see Urban Airship Connect webhook setup.