Using EmrEtlRunner
There are two usage modes for EmrEtlRunner:
- Rolling mode where EmrEtlRunner processes whatever raw Snowplow event logs it finds in the In Bucket
- Timespan mode where EmrEtlRunner only processes those raw Snowplow event logs whose timestamp is within a timespan specified on the command-line
Timespan mode can be useful if you have a large backlog of raw Snowplow event logs and you want to start by processing just a small subset of those logs.
The EmrEtlRunner is an executable jar:
```
$ ./snowplow-emr-etl-runner
```
The command-line options for EmrEtlRunner look like this:
```
Usage: snowplow-emr-etl-runner [options]

Specific options:
    -c, --config CONFIG              configuration file
    -n, --enrichments ENRICHMENTS    enrichments directory
    -r, --resolver RESOLVER          Iglu resolver file
    -t, --targets TARGETS            targets directory
    -d, --debug                      enable EMR Job Flow debugging
    -s, --start YYYY-MM-DD           optional start date *
    -e, --end YYYY-MM-DD             optional end date *
    -x, --skip staging,s3distcp,emr{enrich,shred,elasticsearch},archive_raw
                                     skip work step(s)
    -E, --process-enrich LOCATION    run enrichment only on specified location. Implies --skip staging,shred,archive_raw
    -S, --process-shred LOCATION     run shredding only on specified location. Implies --skip staging,enrich,archive_raw

    * filters the raw event logs processed by EmrEtlRunner by their timestamp. Only
      supported with 'cloudfront' collector format currently.

Common options:
    -h, --help                       Show this message
    -v, --version                    Show version
```
A note on the `--skip` option: it takes a comma-separated list of individual steps to skip.
So for example you could run only the EMR job with the command-line option:
```
$ ./snowplow-emr-etl-runner --skip staging,archive_raw --config config/config.yml --targets config/targets/ --resolver resolver.json --enrichments config/enrichments
```
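As a further illustration (a common recovery pattern, not something this guide prescribes): if an earlier run failed after the staging step had completed, the raw files are already sitting in the processing location, so you might resume by skipping only staging:

```
# Hypothetical recovery run - skips staging because the files were already
# staged by the failed run; all other steps execute as normal
$ ./snowplow-emr-etl-runner --skip staging --config config/config.yml --targets config/targets/ --resolver resolver.json --enrichments config/enrichments
```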
Instead of using the `--config` option, you can pass the configuration to EmrEtlRunner via stdin. You need to set `--config -` to signal that the config is to be read from stdin rather than from a file:
```
$ cat config/config.yml | ./snowplow-emr-etl-runner --config - --targets config/targets/ --resolver resolver.json --enrichments config/enrichments
```
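One benefit of reading the config from stdin is that you can generate it on the fly. A minimal sketch, assuming a templated file named config/config.yml.tmpl and the envsubst utility (both assumptions, not part of this guide):

```
# Hypothetical: render a config template by substituting environment variables,
# then pipe the result straight into EmrEtlRunner via --config -
$ envsubst < config/config.yml.tmpl | ./snowplow-emr-etl-runner --config - --targets config/targets/ --resolver resolver.json --enrichments config/enrichments
```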
Invoking EmrEtlRunner with just the `--config` option (and no `--start` or `--end` date) puts it into rolling mode, in which it processes all the raw Snowplow event logs it can find in your In Bucket:
```
$ ./snowplow-emr-etl-runner --config config/config.yml --resolver config/resolver.json --enrichments config/enrichments --targets config/targets/
```
To run EmrEtlRunner in timespan mode, you need to specify the `--start` and/or `--end` dates as well as the `--config` option, like so:
```
$ ./snowplow-emr-etl-runner \
  --config config.yml \
  --resolver config/resolver.json \
  --start 2012-06-20 \
  --end 2012-06-24 \
  --targets config/targets/
```
This will run EmrEtlRunner on log files which have timestamps in the period 20 June 2012 to 24 June 2012 inclusive.
Note that you do not have to specify both the start and end dates:
- Specify `--start` only and the timespan will run from your start date up to today, inclusive (see the sketch after this list)
- Specify `--end` only and the timespan will run from the beginning of time up to your end date, inclusive
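For instance, a start-only run might look like this (dates and paths are illustrative, reusing those from the example above):

```
# Hypothetical start-only invocation: processes all raw logs timestamped
# from 20 June 2012 up to and including today
$ ./snowplow-emr-etl-runner \
  --config config.yml \
  --resolver config/resolver.json \
  --start 2012-06-20 \
  --targets config/targets/
```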
If your raw Snowplow logs are generated by the Amazon CloudFront collector, please note that CloudFront timestamps its logs in UTC.
Once you have run EmrEtlRunner, you should be able to inspect in S3 the folder specified by the `:out:` parameter in your `config.yml` file and see newly generated files. These contain the cleaned data, ready either for uploading into a storage target (e.g. Redshift or Infobright) or for analysing directly on EMR using Hive (or Pig, Mahout, or another Hadoop querying tool).
Note: most Snowplow users run the 'hadoop' version of the ETL process, in which case the generated data is saved into subfolders with names of the form `part-000...`. If, however, you are running the legacy 'hive' ETL (because e.g. you want to use Hive or Infobright as your storage target, rather than Redshift, which is currently the only storage target the 'hadoop' ETL supports), the subfolder names will be of the form `dt=...`.
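As a quick way to do that inspection, you could list the output location with the AWS CLI; the bucket and prefix below are placeholders for whatever your `:out:` parameter points to:

```
# Hypothetical bucket/prefix - substitute the location configured under :out: in config.yml
$ aws s3 ls s3://my-snowplow-out-bucket/ --recursive
```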
Comfortable using EmrEtlRunner? Then schedule it so that it regularly takes new data generated by the collector, processes it, cleans it, enriches it, and writes it back to S3.
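A minimal sketch of such a schedule, assuming cron is your scheduler and that EmrEtlRunner lives in /opt/snowplow with the config layout used in the examples above (both assumptions, not part of this guide):

```
# Hypothetical crontab entry: run EmrEtlRunner at 04:00 UTC every day.
# /opt/snowplow and the log path are illustrative - adjust to your installation.
0 4 * * * cd /opt/snowplow && ./snowplow-emr-etl-runner --config config/config.yml --resolver config/resolver.json --enrichments config/enrichments --targets config/targets/ >> /var/log/snowplow-emr-etl-runner.log 2>&1
```

Whichever scheduler you use, you will generally want to ensure that runs do not overlap, since each run moves files out of the In Bucket while it works.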