Configuring storage targets - OXYGEN-MARKET/oxygen-market.github.io GitHub Wiki
Snowplow offers the option to configure certain storage targets. This is done using configuration JSONs.
When running EmrEtlRunner or StorageLoader, the --targets argument should be populated with the filepath of a directory containing your configuration JSONs.
Each storage target JSON file can have an arbitrary name, but must conform to its JSON Schema.
Some targets are handled by EmrEtlRunner (duplicate tracking, failure tracking) and some by StorageLoader (enriched data).
Here's a list of currently supported targets, grouped by purpose:
- Enriched data: Redshift, PostgreSQL
- Failures: Elasticsearch
- Duplicate tracking: Amazon DynamoDB
Redshift
Schema: iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/1-0-0
- name, a descriptive name for this Snowplow storage target
- host, the host (endpoint in Redshift parlance) of the database to load
- database, the name of the database to load
- port, the port of the database to load. 5439 is the default Redshift port
- schema, the name of the database schema which will store your Snowplow tables
- username, the database user to load your Snowplow events with. You can leave this blank to default to the user running the script
- password, the password for the database user. Leave blank if there is no password
- maxError, a Redshift-specific setting governing how many load errors should be permitted before failing the overall load. See the Redshift COPY documentation for more details
- compRows, a Redshift-specific setting defining the number of rows to be used as the sample size for compression analysis. Should be between 1000 and 1000000000
- purpose, common for all targets. Redshift supports only ENRICHED_DATA
- sslMode, determines how to handle encryption for client connections and server certificate verification. The following sslMode values are supported:
  - DISABLE: SSL is disabled and the connection is not encrypted.
  - REQUIRE: SSL is required.
  - VERIFY_CA: SSL must be used and the server certificate must be verified.
  - VERIFY_FULL: SSL must be used. The server certificate must be verified and the server hostname must match the hostname attribute on the certificate.
Note: The difference between VERIFY_CA and VERIFY_FULL depends on the policy of the root CA. If a public CA is used, VERIFY_CA allows connections to a server that somebody else may have registered with the CA. In this case, VERIFY_FULL should always be used. If a local CA is used, or even a self-signed certificate, VERIFY_CA often provides enough protection.
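Putting the Redshift fields above together, a target file might look like the sketch below. It assumes the usual Iglu self-describing layout (a top-level schema key holding the URI above and a data key holding the fields), which this page does not spell out. The value types (integers for port, maxError and compRows), the atomic schema name, the sample numbers and the "ADD HERE" placeholders are all illustrative assumptions, so check them against the JSON Schema before use.

```json
{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/1-0-0",
    "data": {
        "name": "AWS Redshift enriched events storage",
        "host": "ADD HERE",
        "database": "ADD HERE",
        "port": 5439,
        "schema": "atomic",
        "username": "ADD HERE",
        "password": "ADD HERE",
        "maxError": 1,
        "compRows": 20000,
        "purpose": "ENRICHED_DATA",
        "sslMode": "VERIFY_FULL"
    }
}
```

Since the file name is arbitrary, something like redshift.json inside the directory passed to --targets would do.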
PostgreSQL
Schema: iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/1-0-0
- name, a descriptive name for this Snowplow storage target
- host, the host (endpoint in Redshift parlance) of the database to load
- database, the name of the database to load
- port, the port of the database to load. 5439 is the default Redshift port; 5432 is the default Postgres port
- schema, the name of the database schema which will store your Snowplow tables
- username, the database user to load your Snowplow events with. You can leave this blank to default to the user running the script
- password, the password for the database user. Leave blank if there is no password
- sslMode, determines how to handle encryption for client connections and server certificate verification. The following sslMode values are supported:
  - DISABLE: SSL is disabled and the connection is not encrypted.
  - REQUIRE: SSL is required.
  - VERIFY_CA: SSL must be used and the server certificate must be verified.
  - VERIFY_FULL: SSL must be used. The server certificate must be verified and the server hostname must match the hostname attribute on the certificate.
- purpose, common for all targets. PostgreSQL supports only ENRICHED_DATA
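A PostgreSQL target built from the fields above could be sketched as follows. As before, the schema/data wrapper follows the usual Iglu self-describing layout and is an assumption here. The integer port, the atomic schema name and the "ADD HERE" placeholders are illustrative rather than taken from this page.

```json
{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/postgresql_config/jsonschema/1-0-0",
    "data": {
        "name": "PostgreSQL enriched events storage",
        "host": "ADD HERE",
        "database": "ADD HERE",
        "port": 5432,
        "schema": "atomic",
        "username": "ADD HERE",
        "password": "ADD HERE",
        "sslMode": "REQUIRE",
        "purpose": "ENRICHED_DATA"
    }
}
```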
Elasticsearch
Schema: iglu:com.snowplowanalytics.snowplow.storage/elastic_config/jsonschema/1-0-0
- name: a descriptive name for this Snowplow storage target
- port: the port to load. Normally 9200; should be 80 for Amazon Elasticsearch Service
- index: the Elasticsearch index to load
- nodesWanOnly: if this is set to true, the EMR job will disable node discovery. This option is necessary when using Amazon Elasticsearch Service
- type: the name of the Elasticsearch type to load
- purpose: common for all targets. Elasticsearch supports only FAILED_EVENTS
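A rough sketch of an Elasticsearch target using the fields above is shown here, assuming the usual Iglu self-describing schema/data wrapper. The host entry is an assumption: it is not in the field list on this page, but a target presumably needs an endpoint, so confirm against the JSON Schema whether it is required. The integer port and boolean nodesWanOnly types are likewise assumed.

```json
{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/elastic_config/jsonschema/1-0-0",
    "data": {
        "name": "Elasticsearch failure storage",
        "host": "ADD HERE",
        "port": 9200,
        "index": "ADD HERE",
        "nodesWanOnly": false,
        "type": "ADD HERE",
        "purpose": "FAILED_EVENTS"
    }
}
```

For Amazon Elasticsearch Service, the descriptions above suggest setting port to 80 and nodesWanOnly to true.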
For information on setting up Elasticsearch itself, see Setting up Amazon Elasticsearch Service.
Amazon DynamoDB
Schema: iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/1-0-0
- name: a descriptive name for this Snowplow storage target
- accessKeyId: AWS Access Key ID
- secretAccessKey: AWS Secret Access Key
- awsRegion: AWS region
- dynamodbTable: the DynamoDB table used to store information about processed events
- purpose: common for all targets. DynamoDB supports only DUPLICATE_TRACKING
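The DynamoDB fields above might be assembled as in this sketch, again assuming the usual Iglu self-describing schema/data wrapper. All "ADD HERE" values are placeholders to replace with your own credentials, region and table name.

```json
{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/amazon_dynamodb_config/jsonschema/1-0-0",
    "data": {
        "name": "AWS DynamoDB duplicates storage",
        "accessKeyId": "ADD HERE",
        "secretAccessKey": "ADD HERE",
        "awsRegion": "ADD HERE",
        "dynamodbTable": "ADD HERE",
        "purpose": "DUPLICATE_TRACKING"
    }
}
```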