3 Scheduling the StorageLoader

  1. Overview
  2. Scheduling StorageLoader only
  3. Scheduling EmrEtlRunner and StorageLoader
  4. Alternatives to cron
  5. Next steps

1. Overview

Once you have the load process working smoothly, you can schedule a daily (or more frequent) task to automate the storage process.

The standard way of scheduling the load process is as a daily cronjob. We provide two shell scripts for you to use in your scheduling:

  1. snowplow-storage-loader.sh - this (obsolete) script just runs the StorageLoader
  2. snowplow-runner-and-loader.sh - this script runs the EmrEtlRunner immediately followed by the StorageLoader

The second script is recommended, assuming you want to run the StorageLoader immediately after EmrEtlRunner has completed its work.

To consider each scheduling option in turn:

2. Scheduling StorageLoader only

Note: the steps below apply to the obsolete script snowplow-storage-loader.sh. Running EmrEtlRunner as a Ruby (rather than JRuby) app is no longer actively supported. The latest version of the EmrEtlRunner is available from our Bintray.

The shell script /4-storage/storage-loader/bin/snowplow-storage-loader.sh runs the StorageLoader app only.

You need to edit this script and update the three variables at the top:

rvm_path=/path/to/.rvm # Typically in the $HOME of the user who installed RVM
LOADER_PATH=/path/to/snowplow/4-storage/snowplow-storage-loader
LOADER_CONFIG=/path/to/your-loader-config.yml

So for example if you installed RVM as the admin user, then you would set:

rvm_path=/home/admin/.rvm

Now, assuming you're using the excellent cronic as a wrapper for your cronjobs, and that both cronic and Bundler are on your path, you can configure your cronjob like so:

0 6   * * *   root    cronic /path/to/snowplow/4-storage/storage-loader/bin/snowplow-storage-loader.sh

This will run the StorageLoader daily at 6am, emailing any failures to you via cronic. Please make sure that your Snowplow events have been safely generated and stored in your In Bucket prior to 6am.
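To make the "emailing any failures" behaviour more concrete: cron mails whatever a job prints, so a wrapper that stays silent on success and prints only on failure turns that mail into a failure report. The function below is a simplified illustration of that idea and is not cronic itself. The name quiet_unless_failed is hypothetical, and this sketch only looks at the exit code of the wrapped command.

```shell
#!/bin/bash
# Simplified illustration of a "quiet unless failed" wrapper.
# This is NOT cronic; quiet_unless_failed is a hypothetical name.

# quiet_unless_failed COMMAND [ARGS...]
# Runs COMMAND, capturing stdout and stderr together.
# On success nothing is printed, so cron has nothing to mail.
# On failure the exit code and the captured output are printed.
quiet_unless_failed() {
    local output
    local status=0
    output=$("$@" 2>&1) || status=$?
    if [ "$status" -ne 0 ]; then
        echo "Command failed with exit code $status: $*"
        echo "$output"
    fi
    return "$status"
}
```

Running quiet_unless_failed /path/to/snowplow/4-storage/storage-loader/bin/snowplow-storage-loader.sh by hand is one way to see roughly what would be reported before relying on the cronjob.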

3. Scheduling EmrEtlRunner and StorageLoader

The shell script /4-storage/storage-loader/bin/snowplow-runner-and-loader.sh runs EmrEtlRunner, immediately followed by StorageLoader - i.e. it chains them together. At Snowplow, this is the scheduling option we use.

If you use this script, you can delete any separate cronjob for the EmrEtlRunner alone.
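As a rough sketch of what chaining means here, the function below runs one command and starts the second only if the first exits with status 0. It is not the contents of snowplow-runner-and-loader.sh. The name run_chain is hypothetical, and skipping the load when the ETL step fails is an assumption about the desired behaviour.

```shell
#!/bin/bash
# Hypothetical sketch of chaining two steps; not the shipped script.

# run_chain RUNNER_CMD LOADER_CMD
# Each argument is a command line passed to `sh -c`.
# The loader command runs only if the runner command exits with 0
# (an assumption made for this sketch).
run_chain() {
    local runner_cmd="$1"
    local loader_cmd="$2"
    local status=0

    sh -c "$runner_cmd" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "Runner failed with exit code $status; loader not run" >&2
        return "$status"
    fi

    sh -c "$loader_cmd"
}
```

Returning the exit code of the failed step means that a wrapper such as cronic still sees the failure and can report it.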

You need to edit this script and update the six variables at the top:

RUNNER_PATH=/path/to/snowplow-emr-etl-runner
LOADER_PATH=/path/to/snowplow-storage-loader
RUNNER_CONFIG=/path/to/your-runner-config.yml
RESOLVER=/path/to/your-resolver.json
RUNNER_ENRICHMENTS=/path/to/your/enrichment-jsons
LOADER_CONFIG=/path/to/your-loader-config.yml

So for example if you installed the StorageLoader as the admin user, then you would set:

LOADER_PATH=/home/admin/snowplow-storage-loader
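Before scheduling the script, it can help to confirm that all six settings point at something that exists. The helpers below are a hypothetical pre-flight check and are not part of the Snowplow distribution. They assume that RUNNER_PATH, LOADER_PATH and RUNNER_ENRICHMENTS are directories and that the other three variables are readable files.

```shell
#!/bin/bash
# Hypothetical pre-flight check for the six variables; not shipped with Snowplow.

# require_dir NAME PATH: report if PATH is not an existing directory.
require_dir() {
    if [ ! -d "$2" ]; then
        echo "$1 is not a directory: $2"
        return 1
    fi
}

# require_file NAME PATH: report if PATH is not a readable regular file.
require_file() {
    if [ ! -f "$2" ] || [ ! -r "$2" ]; then
        echo "$1 is not a readable file: $2"
        return 1
    fi
}

# check_settings: validate all six variables from the current shell,
# printing every problem found and returning non-zero if there is any.
# Treating RUNNER_ENRICHMENTS as a directory is an assumption.
check_settings() {
    local status=0
    require_dir  RUNNER_PATH        "$RUNNER_PATH"        || status=1
    require_dir  LOADER_PATH        "$LOADER_PATH"        || status=1
    require_file RUNNER_CONFIG      "$RUNNER_CONFIG"      || status=1
    require_file RESOLVER           "$RESOLVER"           || status=1
    require_dir  RUNNER_ENRICHMENTS "$RUNNER_ENRICHMENTS" || status=1
    require_file LOADER_CONFIG      "$LOADER_CONFIG"      || status=1
    return "$status"
}
```

To try it, set the six variables in your shell to the same values you put in the script and call check_settings; no output means every path was found.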

Using cronic as a wrapper, and with cronic and Bundler on your path, configure your cronjob like so:

0 4   * * *   root    cronic /path/to/snowplow-runner-and-loader.sh

This will run the ETL job and then the database load daily at 4am, emailing any failures to you via cronic.

4. Alternatives to cron

In place of cron, you could schedule StorageLoader using a continuous integration server such as Jenkins, or potentially use the Windows Task Scheduler.

These options are explored in a little more detail in the Scheduling EmrEtlRunner guide.

5. Next steps

You have now set up the StorageLoader! You are ready to do some analysis.
