Scheduling EmrEtlRunner

  1. Overview
  2. cron
  3. Jenkins
  4. Windows Task Scheduler
  5. Next steps

Scheduling

1. Overview

Once you have the ETL process working smoothly, you can schedule a daily (or more frequent) task to run it automatically.

We run our daily ETL jobs at 3 AM UTC so that we are sure that we have processed all of the events from the day before (CloudFront logs can take some time to arrive).

The sections below cover each scheduling option in turn.

2. cron

Running EmrEtlRunner as a Ruby app (rather than as a JRuby app) is no longer actively supported. The latest version of EmrEtlRunner is available from our Bintray.

The recommended way of scheduling the ETL process is as a daily cronjob.

Note: The reference below to the snowplow-emr-etl-runner.sh script is provided in case you are still using the older version of EmrEtlRunner. A better solution is to use the snowplow-runner-and-loader.sh script, which runs EmrEtlRunner and StorageLoader in sequence. You can skip these instructions altogether and return to this topic on the Scheduling the StorageLoader page.

If you are still using the shell script available in the Snowplow GitHub repository at /3-enrich/emr-etl-runner/bin/snowplow-emr-etl-runner.sh, you need to edit this script and update the following four variables:

rvm_path=/path/to/.rvm # Typically in the $HOME of the user who installed RVM
RUNNER_PATH=/path/to/snowplow/3-enrich/snowplow-emr-etl-runner
RUNNER_CONFIG=/path/to/your-config.yml
RUNNER_ENRICHMENTS=/path/to/your-enrichment-jsons

So, for example, if you installed RVM as the admin user, you would set:

rvm_path=/home/admin/.rvm
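
Before handing the script to cron, it can help to confirm that all four values point at something real. The following is a minimal sketch of such a check, not part of the Snowplow repository: the function name check_runner_paths is hypothetical, and it assumes that rvm_path, RUNNER_PATH and RUNNER_ENRICHMENTS are directories and RUNNER_CONFIG is a regular file.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; this helper is an illustration only
# and does not ship with Snowplow.

# Usage: check_runner_paths RVM_PATH RUNNER_PATH RUNNER_CONFIG RUNNER_ENRICHMENTS
# Prints one line per problem found and returns non-zero if any path is missing.
check_runner_paths() {
  local rvm_path=$1 runner_path=$2 runner_config=$3 runner_enrichments=$4
  local status=0

  # Assumption: the RVM install and the runner checkout are directories.
  [ -d "$rvm_path" ] || { echo "rvm_path is not a directory: $rvm_path"; status=1; }
  [ -d "$runner_path" ] || { echo "RUNNER_PATH is not a directory: $runner_path"; status=1; }

  # Assumption: the config is a single YAML file.
  [ -f "$runner_config" ] || { echo "RUNNER_CONFIG is not a file: $runner_config"; status=1; }

  # Assumption: the enrichment JSONs live together in one directory.
  [ -d "$runner_enrichments" ] || { echo "RUNNER_ENRICHMENTS is not a directory: $runner_enrichments"; status=1; }

  return "$status"
}
```

You might call it once by hand with the same values you put in the script, for example check_runner_paths /home/admin/.rvm /path/to/snowplow/3-enrich/snowplow-emr-etl-runner /path/to/your-config.yml /path/to/your-enrichment-jsons, and only enable the cronjob once it prints nothing.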

Now, assuming you're using the excellent cronic as a wrapper for your cronjobs, and that both cronic and Bundler are on your path, you can configure your cronjob like so:

0 4   * * *   root    cronic /path/to/snowplow/3-enrich/emr-etl-runner/bin/snowplow-emr-etl-runner.sh

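This will run the ETL job daily at 4 AM in the server's time zone. Because cronic stays silent when the job succeeds, cron only emails you when a run fails. The root user field means this line belongs in /etc/crontab or a file under /etc/cron.d rather than in a per-user crontab.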
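
To make the role of the wrapper concrete: cron mails whatever a job prints, so a wrapper that prints nothing on success turns that mail into a failure alert. The sketch below illustrates the idea only. It is not cronic's implementation, the function name run_quiet_unless_failed is hypothetical, and the real cronic does more (for instance, it also treats error output as a failure).

```shell
#!/usr/bin/env bash
# Simplified illustration of a "quiet unless it fails" cron wrapper.
# NOT cronic's actual code; use cronic itself in your crontab.

# Usage: run_quiet_unless_failed COMMAND [ARGS...]
# Runs the command, prints its captured output only if it exits non-zero,
# and returns the command's exit code.
run_quiet_unless_failed() {
  local log status=0
  log=$(mktemp) || return 1

  # Capture stdout and stderr together so a successful run prints nothing.
  "$@" >"$log" 2>&1 || status=$?

  if [ "$status" -ne 0 ]; then
    echo "Command failed with exit code $status: $*"
    cat "$log"
  fi

  rm -f "$log"
  return "$status"
}
```

With a wrapper like this, a successful nightly run generates no mail at all, while a failed run sends you the exit code followed by the job's full output.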

3. Jenkins

Some developers use the Jenkins continuous integration server (or Hudson, which is very similar) to schedule their Hadoop and Hive jobs.

Describing how to do this is out of scope for this guide, but the blog post Lowtech Monitoring with Jenkins is a great tutorial on using Jenkins for non-CI-related tasks, and could be easily adapted to schedule EmrEtlRunner.

4. Windows Task Scheduler

For Windows servers, in theory it should be possible to use a Windows PowerShell script plus Windows Task Scheduler instead of bash and cron. However, this has not been tested or documented.

If you get this working, please let us know!

5. Next steps

Now that you have installed and scheduled EmrEtlRunner, you have all your data ready for analysis in S3. Learn how to set up the StorageLoader to regularly load your data into a database such as Infobright or Redshift (for OLAP analysis, for example), or how to analyse it on S3 via EMR.
