2 Using the StorageLoader - OXYGEN-MARKET/oxygen-market.github.io GitHub Wiki
HOME > SNOWPLOW SETUP GUIDE > Step 4: setting up alternative data stores > 1: Installing the StorageLoader > 2: Using the StorageLoader
Running the StorageLoader is very straightforward - please review the command-line options in the next section.
The StorageLoader is an executable jar:
$ ./snowplow-storage-loader
The command-line options for StorageLoader look like this:
Specific options:
-c, --config CONFIG configuration fgle
-b CONFIG, base64-encoded configuration string
--base64-config-string
-t, --targets TARGETS_DIR targets directory
-i, --include compupdate,vacuum include optional work step(s)
-s download|delete,load,shred,analyze,archive_enriched,
--skip skip work step(s)
-r, --resolver RESOLVER Iglu resolver config file
Common options:
-h, --help Show this message
-v, --version Show version
A note on the --include
option: this includes optional work steps
which are otherwise not used. --include vacuum
runs a VACUUM
operation on the table following the load. --include compupdate
runs
the load then determines the best compression encoding format to use for
each each of the fields in your Redshift event table, using the :comprows:
setting for the sample size. For more information on Amazon’s comprows
functionality, see the Redshift documentation.
A note on the --skip
option: this skips the work steps listed. So, for example --skip download,load
would only run the final archive step. This is useful if you have an error in your load and need to re-run only part of it.
Instead of using the --config option, you can pass the configuration to the EmrEtlRunner via stdin. You need to set --config -
to signal that the config is to be read from stdin rather than from a file:
$ cat config/config.yml | ./snowplow-storage-loader --config -
As per the above, running StorageLoader is a matter of populating
your storage targets configurations (config/targets/
) and configuration file, (config/config.yml
) for this
example, and then invoking StorageLoader like so:
$ ./snowplow-storage-loader --config config/config.yml --resolver config/resolver.json --targets config/targets/ --skip analyze
The --skip analyze
is required because in Redshift only the table owner or superuser can ANALYZE a table, not our storageloader
user.
StorageLoader depends on Snowplow's [Infobright Ruby Loader][irl], which in turn uses the locate
shell command. If your shell complains that this is missing, in which case you can install it separately.
To install and configure locate
on Debian/Ubuntu:
$ sudo apt-get install mlocate
$ sudo updatedb
All done? Then schedule the StorageLoader to regularly migrate new data into your data store (e.g. Infobright or Redshift).