Upgrade Guide
This page lists the steps to upgrade after each Snowplow release, with the latest version at the top. "Sequentially" here means moving from one release to the next in order.
You can also use the Snowplow Version Matrix as a guide to the internal component dependencies of a particular release.
For easier navigation, please follow the links below.
- Snowplow 88 Angkor Wat released (r88) 2017-04-27
- Snowplow 87 Chichen Itza (r87) 2017-02-21
- Snowplow 86 Petra (r86) 2016-12-20
- Snowplow 85 Metamorphosis (r85) 2016-11-15
- Snowplow 84 Steller's Sea Eagle (r84) 2016-10-07
- Snowplow 83 Bald Eagle (r83) 2016-09-06
- Snowplow 82 Tawny Eagle (r82) 2016-08-08
- Snowplow 81 Kangaroo Island Emu (r81) 2016-06-16
- Snowplow 80 Southern Cassowary (r80) 2016-05-30
- Snowplow 79 Black Swan (r79) 2016-05-12
- Snowplow 78 Great Hornbill (r78) 2016-03-15
- Snowplow 77 Great Auk (r77) 2016-02-29
- Snowplow 76 Changeable Hawk-Eagle (r76) 2016-01-26
- Snowplow 75 Long-Legged Buzzard (r75) 2016-01-02
- Snowplow 74 European Honey Buzzard (r74) 2015-12-22
- Snowplow 73 Cuban Macaw (r73) 2015-12-04
- Snowplow 72 Great Spotted Kiwi (r72) 2015-10-15
- Snowplow 71 Stork-Billed Kingfisher (r71) 2015-10-02
- Snowplow 70 Bornean Green Magpie (r70) 2015-08-19
- Snowplow 69 Blue-Bellied Roller (r69) 2015-07-24
- Snowplow 68 Turquoise Jay (r68) 2015-07-23
- Snowplow 67 Bohemian Waxwing (r67) 2015-07-13
- Snowplow 66 Oriental Skylark (r66) 2015-06-16
- Snowplow 65 Scarlet Rosefinch (r65) 2015-05-08
- Snowplow 64 Palila (r64) 2015-04-16
- Snowplow 63 Red-Cheeked Cordon-Bleu (r63) 2015-04-02
- Snowplow 62 Tropical Parula (r62) 2015-03-17
- Snowplow 61 Pygmy Parrot (r61) 2015-03-02
- Snowplow 60 Bee Hummingbird (r60) 2015-02-03
- Snowplow 0.9.14 (v0.9.14) 2014-12-31
- Snowplow 0.9.13 (v0.9.13) 2014-12-01
- Snowplow 0.9.12 (v0.9.12) 2014-11-26
- Snowplow 0.9.11 (v0.9.11) 2014-11-10
- Snowplow 0.9.10 (v0.9.10) 2014-11-06
- Snowplow 0.9.9 (v0.9.9) 2014-10-27
- Snowplow 0.9.8 (v0.9.8) 2014-09-18
- Snowplow 0.9.7 (v0.9.7) 2014-09-02
- Snowplow 0.9.6 (v0.9.6) 2014-07-26
- Snowplow 0.9.5 (v0.9.5) 2014-07-09
- Snowplow 0.9.4 (v0.9.4) 2014-05-30
- Snowplow 0.9.3 (v0.9.3) 2014-05-21
- Snowplow 0.9.2 (v0.9.2) 2014-04-30
- Snowplow 0.9.1 (v0.9.1) 2014-04-11
- Snowplow 0.9.0 (v0.9.0) 2014-02-04
This release introduces event de-duplication across different pipeline runs, powered by DynamoDB, along with an important refactoring of the batch pipeline configuration.
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
Storage target configuration JSONs can be generated from your existing `config.yml` using the `3-enrich/emr-etl-runner/config/convert_targets.rb` script. These files should be stored in a folder, for example called `targets`, alongside your existing `enrichments` folder.
When complete, your folder layout will look something like this:
snowplow_config
├── config.yml
├── enrichments
│ ├── campaign_attribution.json
│ ├── ...
│ ├── user_agent_utils_config.json
├── iglu_resolver.json
├── targets
│ ├── duplicate_dynamodb.json
│ ├── enriched_redshift.json
For complete examples, see our storage target configuration JSONs. The properties are explained on the wiki page.
- Remove the whole `storage.targets` section (leaving `storage.download.folder`) from your `config.yml` file
- Update the `hadoop_shred` job version in your configuration YAML like so:
versions:
hadoop_enrich: 1.8.0 # UNCHANGED
hadoop_shred: 0.11.0 # WAS 0.10.0
hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample `config.yml` template.
- Append the option `--targets $TARGETS_DIR` to both the `snowplow-emr-etl-runner` and `snowplow-storage-loader` applications (see the sketch after this list)
- Append the option `--resolver $IGLU_RESOLVER` to the `snowplow-storage-loader` application. This is required to validate the storage target configurations
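As an illustration only, the upgraded invocations might look like the sketch below. The directory paths are placeholders for your own layout; the two new options are `--targets` (both applications) and `--resolver` (StorageLoader), appended to whatever options you already pass:

```bash
# Placeholder paths - adjust to your own deployment layout
CONFIG_DIR=/opt/snowplow/snowplow_config

./snowplow-emr-etl-runner \
  --config $CONFIG_DIR/config.yml \
  --resolver $CONFIG_DIR/iglu_resolver.json \
  --enrichments $CONFIG_DIR/enrichments \
  --targets $CONFIG_DIR/targets

./snowplow-storage-loader \
  --config $CONFIG_DIR/config.yml \
  --resolver $CONFIG_DIR/iglu_resolver.json \
  --targets $CONFIG_DIR/targets
```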
Please be aware that enabling this will have a potentially high cost and performance impact on your Snowplow batch pipeline.
If you want to start deduplicating events across batches, you need to add a new DynamoDB config target to your newly created `targets` directory.
Optionally, before first run of Shred job with cross-batch deduplication, you may want to run Event Manifest Populator to back-fill the DynamoDB table.
When Hadoop Shred runs, if the table doesn’t exist then it will be automatically created with provisioned throughput by default set to 100 write capacity units and 100 read capacity units and the required schema to store and deduplicate events.
For relatively low-volume cases (around 1M events per run), the default settings will likely just work. However, we do strongly recommend monitoring the EMR job, and its AWS billing impact, closely and tweaking DynamoDB provisioned throughput and your EMR cluster specification accordingly.
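If you do need to adjust the provisioned throughput later, this can be done with the AWS CLI. A minimal sketch, assuming your deduplication table is called `snowplow-dedup` (the table name and capacity values below are placeholders; use whatever name you configured in your DynamoDB target JSON):

```bash
# Placeholder table name and capacity values - adjust to your own table and load
aws dynamodb update-table \
  --table-name snowplow-dedup \
  --provisioned-throughput ReadCapacityUnits=200,WriteCapacityUnits=200
```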
This release contains a wide array of new features, stability enhancements and performance improvements for EmrEtlRunner and StorageLoader. As of this release EmrEtlRunner lets you specify EBS volumes for your Hadoop worker nodes; meanwhile StorageLoader now writes to a dedicated manifest table to record each load.
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
To make use of the new ability to specify EBS volumes for your EMR cluster’s core nodes, update your configuration YAML like so:
jobflow:
  master_instance_type: m1.medium
  core_instance_count: 1
  core_instance_type: c4.2xlarge
  core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
    volume_size: 200    # Gigabytes
    volume_type: "io1"
    volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
    ebs_optimized: false # Optional. Will default to true
The above configuration will attach an EBS volume of 200 GiB to each core instance in your EMR cluster; the volumes will be Provisioned IOPS (SSD) with a performance of 400 IOPS, and they will not be EBS optimized. Note that this configuration finally allows us to use the EBS-only `c4` instance types for our core nodes.
For a complete example, see our sample `config.yml` template.
You will also need to deploy the following manifest table for Redshift:
This table should be deployed into the same schema as your `events` and other tables.
This release introduces additional event de-duplication functionality for our Redshift load process, plus a brand new data model that makes it easier to get started with web data. It also adds support for AWS’s newest regions: Ohio, Montreal and London.
Upgrading is simple - update the `hadoop_shred` job version in your configuration YAML like so:
versions:
hadoop_enrich: 1.8.0 # UNCHANGED
hadoop_shred: 0.10.0 # WAS 0.9.0
hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample `config.yml` template.
You will also need to deploy the following table for Redshift:
This release brings initial beta support for using Apache Kafka with the Snowplow real-time pipeline, as an alternative to Amazon Kinesis.
Please note that this Kafka support is extremely beta - we want you to use it and test it; do not use it in production.
The real-time apps for R85 Metamorphosis are available in the following zipfiles:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_scala_stream_collector_0.9.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.10.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_1x.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_elasticsearch_sink_0.8.0_2x.zip
Or you can download all of the apps together in this zipfile:
https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r85_metamorphosis.zip
To upgrade the Stream Collector application:
- Install the new Collector on each server in your auto-scaling group
- Upgrade your config by:
  - Moving the `collector.sink.kinesis.buffer` section down to `collector.sink.buffer`, as this section will be used to configure limits for both Kinesis and Kafka
  - Adding a new section within the `collector.sink` block:
collector {
  ...
  sink {
    ...
    buffer {
      byte-limit:
      record-limit: # Not supported by Kafka; will be ignored
      time-limit:
    }
    ...
    kafka {
      brokers: ""
      # Data will be stored in the following topics
      topic {
        good: ""
        bad: ""
      }
    }
    ...
  }
}
To upgrade the Stream Enrich application:
- Install the new Stream Enrich on each server in your auto-scaling group
- Upgrade your config by:
- Adding a new section within the enrich block:
enrich {
  ...
  # Kafka configuration
  kafka {
    brokers: "localhost:9092"
  }
  ...
}
Note: The app-name defined in your config will be used as your Kafka consumer group ID.
The Kinesis apps for R84 Steller's Sea Eagle are available in the following zipfiles:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_collector_0.8.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.9.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_elasticsearch_sink_0.8.0.zip
Or you can download all of the apps together in this zipfile:
https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r84_stellers_sea_eagle.zip
Only the Elasticsearch Sink app config has changed. The change does not include breaking config changes. To upgrade the Elasticsearch Sink:
- Install the new Elasticsearch Sink app on each server in your Elasticsearch Sink auto-scaling group
- Update your Elasticsearch Sink config with the new `elasticsearch.client.http` section:
  - `elasticsearch.client.http.conn-timeout`
  - `elasticsearch.client.http.read-timeout`

NOTE: These timeouts are optional and will default to 300000 if they cannot be found in your config.
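For illustration, the new section might look like the sketch below; the nesting shown and the assumption that the values are in milliseconds are ours, so check the sample `config.hocon` referenced below for the authoritative layout:

```hocon
elasticsearch {
  client {
    http {
      conn-timeout: 300000  # Optional; defaults to 300000 (assumed milliseconds)
      read-timeout: 300000  # Optional; defaults to 300000 (assumed milliseconds)
    }
    # ... existing client settings stay as they are
  }
  # ... rest of the elasticsearch section unchanged
}
```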
See our sample `config.hocon` template.
This release introduces our powerful new SQL Query Enrichment, long-awaited support for the EU Frankfurt AWS region (eu-central-1), plus `POST` support for our Iglu webhook adapter.
Update the `hadoop_enrich` job version in your configuration YAML like so:
versions:
hadoop_enrich: 1.8.0 # WAS 1.7.0
hadoop_shred: 0.9.0 # UNCHANGED
hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample `config.yml` template.
This is a real-time pipeline release. This release updates the Kinesis Elasticsearch Sink with support for sending events via HTTP, allowing us to support Amazon Elasticsearch Service.
The Kinesis apps for 82 Tawny Eagle are all available in a single zip file here:
https://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r82_tawny_eagle.zip
The individual Kinesis apps for R82 Tawny Eagle are also available in the following zipfiles:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_collector_0.7.0.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_stream_enrich_0.8.1.zip
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_elasticsearch_sink_0.7.0.zip
Only the Elasticsearch Sink app has actually changed. The change does, however, include breaking config changes, so you will need to make some minor changes to your configuration file. To upgrade the Elasticsearch Sink:
- Install the new Elasticsearch Sink app on each server in your Elasticsearch Sink auto-scaling group
- Update your Elasticsearch Sink config with the new `elasticsearch` section (see the sketch after this list):
  - The only new fields are `elasticsearch.client.type` and `elasticsearch.client.port`
  - The following fields have been renamed:
    - `elasticsearch.cluster-name` is now `elasticsearch.cluster.name`
    - `elasticsearch.endpoint` is now `elasticsearch.client.endpoint`
    - `elasticsearch.max-timeout` is now `elasticsearch.client.max-timeout`
    - `elasticsearch.index` is now `elasticsearch.cluster.index`
    - `elasticsearch.type` is now `elasticsearch.cluster.type`
- Update your supervisor process to point to the new Kinesis Elasticsearch Sink app
- Restart the supervisor process on each server running the sink
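Putting the new and renamed fields together, the reorganized section might look roughly like the following sketch. All values are illustrative placeholders; only the field layout is taken from the rename list above:

```hocon
elasticsearch {
  client {
    type: "http"            # New field - e.g. the HTTP client for Amazon Elasticsearch Service
    endpoint: "localhost"   # Was elasticsearch.endpoint
    port: 9200              # New field - value is a placeholder
    max-timeout: 10000      # Was elasticsearch.max-timeout
  }
  cluster {
    name: "elasticsearch"   # Was elasticsearch.cluster-name
    index: "snowplow"       # Was elasticsearch.index
    type: "enriched"        # Was elasticsearch.type
  }
}
```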
This is a real-time pipeline release. At the heart of it is the Hadoop Event Recovery project, which allows you to fix up Snowplow bad rows and make them ready for reprocessing.
The Kinesis apps for R81 Kangaroo Island Emu are all available in a single zip file here:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r81_kangaroo_island_emu.zip
Only the Stream Enrich app has actually changed. The change is not breaking, so you don’t have to make any changes to your configuration file. To upgrade Stream Enrich:
- Install the new Stream Enrich app on each server in your Stream Enrich auto-scaling group
- Update your supervisor process to point to the new Stream Enrich app
- Restart the supervisor process on each server running Stream Enrich
This is a real-time pipeline release which improves stability and brings the real-time pipeline up-to-date with our Hadoop pipeline.
As a result, you can now use R79 Black Swan's API Request Enrichment and the HTTP Header Extractor Enrichment in your real-time pipeline. Also, you can now configure the number of records that the Kinesis Client Library should retrieve with each call to `GetRecords`.
The Kinesis apps for R80 Southern Cassowary are all available in a single zip file here:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r80_southern_cassowary.zip
There are no breaking changes in this release - you can upgrade the individual Kinesis apps without worrying about having to update the configuration files or indeed the Kinesis streams.
If you want to configure how many records Stream Enrich should read from Kinesis at a time, update its configuration file to add a `maxRecords` property like so:
enrich {
...
streams {
in: {
...
maxRecords: 5000 # Default is 10000
...
If you want to configure how many records Kinesis Elasticsearch Sink should read from Kinesis at a time, again update its configuration file to add a `maxRecords` property:
sink {
...
kinesis {
in: {
...
maxRecords: 5000 # Default is 10000
...
This release introduces our powerful new API Request Enrichment, plus a new HTTP Header Extractor Enrichment and several other improvements on the enrichments side.
It also updates the Iglu client used by our Hadoop Enrich and Hadoop Shred components. Version 1.4.0 lets you fetch your schemas from Iglu registries with authentication support, allowing you to keep your proprietary schemas private.
The recommended AMI version to run Snowplow is now 4.5.0 - update your configuration YAML as follows:
emr:
ami_version: 4.5.0 # WAS 4.3.0
Next, update your `hadoop_enrich` and `hadoop_shred` job versions like so:
versions:
hadoop_enrich: 1.7.0 # WAS 1.6.0
hadoop_shred: 0.9.0 # WAS 0.8.0
hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample `config.yml` template.
If you want to use an Iglu registry with authentication, add a private `apikey` to the registry's configuration entry and set the schema version to 1-0-1, as in the example below.
{
"schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
"data": {
"cacheSize": 500,
"repositories": [
{
"name": "Iglu Central",
"priority": 0,
"vendorPrefixes": [ "com.snowplowanalytics" ],
"connection": {
"http": {
"uri": "http://iglucentral.com"
}
}
},
{
"name": "Private Acme repository for com.acme",
"priority": 1,
"vendorPrefixes": [ "com.acme" ],
"connection": {
"http": {
"uri": "http://iglu.acme.com/api",
"apikey": "APIKEY-FOR-ACME"
}
}
}
]
}
}
This release brings our Kinesis pipeline functionally up-to-date with our Hadoop pipeline, and makes various further improvements to the Kinesis pipeline.
The Kinesis apps for R78 Great Hornbill are now all available in a single zip file here:
http://dl.bintray.com/snowplow/snowplow-generic/snowplow_kinesis_r78_great_hornbill.zip
Scala Kinesis Enrich has been renamed to Stream Enrich. The name of the artifact has changed to "snowplow-stream-enrich".
Upgrading will require the following configuration changes to the applications' individual HOCON configuration files.
Add a `collector.cookie.name` field to the HOCON and set its value to `"sp"`.
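A minimal sketch of where this lands in the collector HOCON; the surrounding settings are elided and shown only for orientation:

```hocon
collector {
  # ...
  cookie {
    # ... existing cookie settings (e.g. expiration/domain, if present) stay as they are
    name: "sp"   # New field required by this release
  }
}
```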
Also, note that the configuration file no longer supports loading AWS credentials from the classpath using ClasspathPropertiesFileCredentialsProvider. If your configuration looks like this:
{
"aws": {
"access-key": "cpf",
"secret-key": "cpf"
}
}
then you should change "cpf" to "default" to use the DefaultAWSCredentialsProviderChain. You will need to ensure that your credentials are available in one of the places the AWS Java SDK looks. For more information about this, see the Javadoc.
Replace the `sink.kinesis.out` string with an object with two fields:
{
  "sink": {
    "good": "elasticsearch", # or "stdout"
    "bad": "kinesis"         # or "stderr" or "none"
  }
}
Additionally, move the `stream-type` setting from the `sink.kinesis.in` section to the `sink` section.
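Taken together, a rough sketch of the reshaped `sink` section follows. The `stream-type` value shown is only a guess at a typical setting; keep whatever value you have today, simply moved to its new home:

```hocon
sink {
  good: "elasticsearch"   # or "stdout"
  bad: "kinesis"          # or "stderr" or "none"
  stream-type: "good"     # Moved here from sink.kinesis.in; keep your existing value
  # ... the rest of the sink section (kinesis, elasticsearch, buffer) is unchanged
}
```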
If you are loading Snowplow bad rows into, for example, Elasticsearch, please make sure to update all applications.
For a complete example, see our sample `config.hocon` template.
This release focuses on the command-line applications used to orchestrate Snowplow, bringing Snowplow up-to-date with the new 4.x series of Elastic MapReduce releases.
Running EmrEtlRunner and StorageLoader as Ruby (rather than JRuby) apps is no longer actively supported.
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
Note that the `snowplow-runner-and-loader.sh` script has also been updated to use the JRuby binaries rather than the raw Ruby project.
The recommended AMI version to run Snowplow is now 4.3.0 - update your configuration YAML as follows:
emr:
ami_version: 4.3.0 # WAS 3.7.0
You will need to update the jar versions in the same section:
versions:
hadoop_enrich: 1.6.0 # WAS 1.5.1
hadoop_shred: 0.8.0 # WAS 0.7.0
hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample `config.yml` template.
This release introduces an event de-duplication process which runs on Hadoop, and also includes an important bug fix for our SendGrid webhook support.
Upgrading to this release is simple - the only changed components are the jar versions for Hadoop Enrich and Hadoop Shred.
In the `config.yml` file for your EmrEtlRunner, update your `hadoop_enrich` and `hadoop_shred` job versions like so:
versions:
hadoop_enrich: 1.5.1 # WAS 1.5.0
hadoop_shred: 0.7.0 # WAS 0.6.0
hadoop_elasticsearch: 0.1.0 # Unchanged
For a complete example, see our sample `config.yml` template.
This release lets you warehouse the event streams generated by Urban Airship and SendGrid, and also updates our web-recalculate data model.
The corresponding versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
In your EmrEtlRunner's `config.yml` file, update your `hadoop_enrich` job's version to 1.5.0, like so:
versions:
hadoop_enrich: 1.5.0 # WAS 1.4.0
For a complete example, see our sample `config.yml` template.
You'll need to deploy the Redshift tables for any webhooks you plan on ingesting into Snowplow. You can find the Redshift table deployment instructions on the corresponding webhook setup wiki pages:
This release adds a Weather Enrichment to the Hadoop pipeline - making Snowplow the first event analytics platform with built-in weather analytics!
Data provider: OpenWeatherMap
To take advantage of this new enrichment, update the `hadoop_enrich` jar version in the `emr` section of your configuration YAML:
versions:
hadoop_enrich: 1.4.0 # WAS 1.3.0
hadoop_shred: 0.6.0 # UNCHANGED
hadoop_elasticsearch: 0.1.0 # UNCHANGED
For a complete example, see our sample `config.yml` template.
Make sure to add a `weather_enrichment_config.json`, configured as explained here, into your `enrichments` folder too. The file should conform to this JSON Schema.
The corresponding JSONPaths file can be found here.
If you are using Snowplow with Amazon Redshift, you will need to deploy the org_openweathermap_weather_1 table into your database.
This release adds the ability to automatically load bad rows from the Snowplow Elastic MapReduce jobflow into Elasticsearch for analysis and formally separates the Snowplow enriched event format from the TSV format used to load Redshift.
The corresponding versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
You will need to update the jar versions in the `emr` section of your configuration YAML:
versions:
hadoop_enrich: 1.3.0 # Version of the Hadoop Enrichment process
hadoop_shred: 0.6.0 # Version of the Hadoop Shredding process
hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
In order to start loading bad rows from the EMR jobflow into Elasticsearch, you will need to add an Elasticsearch target to the `targets` section of your configuration YAML.
targets:
  - name: "Our Elasticsearch cluster" # Name for the target - used to label the corresponding jobflow step
    type: elasticsearch # Marks the database type as Elasticsearch
    host: "ec2-43-1-854-22.compute-1.amazonaws.com" # Elasticsearch host
    database: snowplow # The Elasticsearch index
    port: 9200 # Port used to connect to Elasticsearch
    table: bad_rows # The Elasticsearch type
    es_nodes_wan_only: false # Set to true if using Amazon Elasticsearch Service
    username: # Not required for Elasticsearch
    password: # Not required for Elasticsearch
    sources: # Leave blank or specify: ["s3://out/enriched/bad/run=xxx", "s3://out/shred/bad/run=yyy"]
    maxerror: # Not required for Elasticsearch
    comprows: # Not required for Elasticsearch
Note that the `database` and `table` fields actually contain the index and type respectively where bad rows will be stored.
The `sources` field is an array of buckets from which to load bad rows. If you leave this field blank, then the bad rows buckets created by the current run of the EmrEtlRunner will be loaded. Alternatively, you can explicitly specify an array of bad row buckets to load.
For a complete example, see our sample `config.yml` template.
Note these updates to EmrEtlRunner's command-line arguments (see the sketch after this list):
- You can skip loading data into Elasticsearch by running EmrEtlRunner with the `--skip elasticsearch` option
- To run just the Elasticsearch load without any other EmrEtlRunner steps, explicitly skip all other steps using `--skip staging,s3distcp,enrich,shred,archive_raw`
- Note that running EmrEtlRunner with `--skip enrich,shred` will no longer skip the EMR job, since there is still the Elasticsearch step to run
- If you are using Postgres rather than Redshift, you should no longer pass the `--skip shred` option to EmrEtlRunner. This is because the shred step now removes JSON fields from the enriched event TSV.
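For example, a minimal sketch of running only the Elasticsearch load step; the config path is a placeholder, and the `--skip` list is exactly the one described above:

```bash
# Placeholder config path - the --skip list leaves only the Elasticsearch step to run
./snowplow-emr-etl-runner \
  --config /path/to/config.yml \
  --skip staging,s3distcp,enrich,shred,archive_raw
```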
Use the appropriate migration script to update your version of the `atomic.events` table to the relevant schema:
If you are upgrading to this release from an older version of Snowplow, we also provide Redshift migration scripts to `atomic.events` version 0.8.0 from the 0.4.0, 0.5.0 and 0.6.0 versions.
Warning: these migration scripts will alter your `atomic.events` table in-place, deleting the `unstruct_event`, `contexts`, and `derived_contexts` columns. We recommend that you make a full backup before running these scripts.
This release adds the ability to track clicks through the Snowplow Clojure Collector, adds a cookie extractor enrichment and introduces new de-duplication queries leveraging R71's event fingerprint.
This release bumps the Clojure Collector to version 1.1.0.
To upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting “Save As…”
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector’s application
- Click the “Upload New Version” and upload your warfile
You need to update the version of the Enrich jar in your configuration file:
hadoop_enrich: 1.2.0 # Version of the Hadoop Enrichment process
If you wish to use the new cookie extractor enrichment, write a configuration JSON and add it to your `enrichments` folder. The example JSON can be found here.
This default configuration captures the Scala Stream Collector's own `sp` cookie - in practice, you would probably extract other, more valuable cookies available on your domain. Each extracted cookie will end up as a single derived context following the JSON Schema `org.ietf/http_cookie/jsonschema/1-0-0`.
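As an illustration, a configuration capturing the `sp` cookie might look like the sketch below. The layout follows the same vendor/name/enabled/parameters pattern as the other enrichment JSONs in this guide, but treat the exact schema URI and parameter name as assumptions and prefer the example JSON linked above:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/cookie_extractor_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "cookie_extractor_config",
    "enabled": true,
    "parameters": {
      "cookies": ["sp"]
    }
  }
}
```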
Note: This enrichment only works with events recorded by the Scala Stream Collector - the CloudFront and Clojure Collectors do not capture HTTP headers.
If you are using Snowplow with Amazon Redshift and wish to use the new cookie extractor enrichment, you will need to deploy the `org_ietf_http_cookie_1` table into your database.
For the new URI redirect functionality, install the `com_snowplowanalytics_snowplow_uri_redirect_1` table.
This release significantly overhauls Snowplow's handling of time and introduces event fingerprinting to support de-duplication efforts. It also brings our validation of unstructured events and custom context JSONs "upstream" from our Hadoop Shred process into our Hadoop Enrich process.
The latest versions of EmrEtlRunner and StorageLoader are available from our Bintray here.
Unzip this file to a sensible location (e.g. `/opt/snowplow-r71`).
You should update the versions of the Enrich and Shred jars in your [configuration file](https://github.com/snowplow/snowplow/blob/r71-stork-billed-kingfisher/3-enrich/emr-etl-runner/config/config.yml.sample):
hadoop_enrich: 1.1.0 # Version of the Hadoop Enrichment process
hadoop_shred: 0.5.0 # Version of the Hadoop Shredding process
You should also update the AMI version field:
ami_version: 3.7.0
For each of your database targets, you must add the new `ssl_mode` field:
targets:
- name: "My Redshift database"
...
ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
If you wish to use the new event fingerprint enrichment, write a configuration JSON and add it to your `enrichments` folder. The example JSON can be found here.
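For orientation only, such a configuration might look like the sketch below. The parameter names and values are as we recall them and should be treated as illustrative; the example JSON linked above is the source of truth:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/event_fingerprint_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "event_fingerprint_config",
    "enabled": true,
    "parameters": {
      "excludeParameters": ["eid", "stm"],
      "hashAlgorithm": "MD5"
    }
  }
}
```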
Use the appropriate migration script to update your version of the `atomic.events` table to the corresponding schema:
If you are ingesting CloudFront access logs with Snowplow, use the CloudFront access log migration script to update your `com_amazon_aws_cloudfront_wd_access_log_1` table.
This release focuses on improving our StorageLoader and EmrEtlRunner components and is the first step towards combining the two into a single CLI application.
Download the EmrEtlRunner and StorageLoader from Bintray.
Unzip this file to a sensible location (e.g. `/opt/snowplow-r70`).
Check that you have a compatible JRE (1.7+) installed:
$ ./snowplow-emr-etl-runner --version
snowplow-emr-etl-runner 0.17.0
Your two old configuration files will no longer work. Use the aforementioned `combine_configurations.rb` script to turn them into a unified configuration file and a resolver JSON.
For reference:
- `config/iglu_resolver.json` - example resolver JSON
- `emr-etl-runner/config/config.yml.sample` - example unified configuration YAML

Note that field names in the unified configuration file no longer start with a colon - so `region: us-east-1`, not `:region: us-east-1`.
The EmrEtlRunner now requires a `--resolver` argument, which should be the path to your new resolver JSON.
Also note that when specifying steps to skip using the `--skip` option, the "archive" step has been renamed to "archive_raw" in the EmrEtlRunner and "archive_enriched" in the StorageLoader. This is in preparation for merging the two applications into one.
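A minimal sketch of the new EmrEtlRunner invocation, with placeholder paths and the renamed step used purely as an example:

```bash
# Placeholder paths - note the new mandatory --resolver argument
./snowplow-emr-etl-runner \
  --config /path/to/config.yml \
  --resolver /path/to/iglu_resolver.json \
  --skip archive_raw   # "archive" is now "archive_raw" in EmrEtlRunner
```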
This release contains new and updated SQL data models.
The SQL data models are a standalone and optional part of the Snowplow pipeline. Users who don't use the SQL data models are therefore not affected by this release.
To implement the SQL data models, first execute the relevant setup queries in Redshift. Then use SQL Runner to execute the queries on a regular basis. SQL Runner is an open source app that makes it easy to execute SQL statements programmatically as part of the Snowplow data pipeline.
The web and mobile data models come in two variants: `recalculate` and `incremental`.
The `recalculate` models drop and recalculate the derived tables using all events, and can therefore be replaced without having to upgrade the tables.
The `incremental` models update the derived tables using only the events from the most recent batch. The updated `incremental` model comes with a migration script.
This is a small release which adapts the EmrEtlRunner to use the new Elastic MapReduce API.
You need to update EmrEtlRunner to the version 0.16.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout r68-turquoise-jay
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
This release brings a host of upgrades to our real-time Amazon Kinesis pipeline as well as the embedding of Snowplow tracking into this pipeline.
The Kinesis apps for r67 Bohemian Waxwing are now all available in a single zip file here. Upgrading will require various configuration changes to each of the three applications’ HOCON configuration files.
- Change `collector.sink.kinesis.stream.name` to `collector.sink.kinesis.stream.good` in the HOCON
- Add `collector.sink.kinesis.stream.bad` to the HOCON
If you want to include Snowplow tracking for this application please append the following:
enrich {
...
monitoring {
snowplow {
collector-uri: ""
collector-port: 80
app-id: ""
method: "GET"
}
}
}
Note that this is a wholly optional section; if you do not want to send application events to a second Snowplow instance, simply do not add this to your configuration file.
For a complete example, see our `config.hocon.sample` file.
- Add `max-timeout` into the `elasticsearch` fields
- Merge location fields into the `elasticsearch` section
- If you want to include Snowplow Tracking for this application please append the following:
sink {
...
monitoring {
snowplow {
collector-uri: ""
collector-port: 80
app-id: ""
method: "GET"
}
}
}
Again, note that Snowplow tracking is a wholly optional section.
For a complete example, see our `config.hocon.sample` file.
This release upgrades our Hadoop Enrichment process to run on Hadoop 2.4, re-enables our Kinesis-Hadoop lambda architecture and also introduces a new scriptable enrichment powered by JavaScript.
You need to update EmrEtlRunner to the version 0.15.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout r66-oriental-skylark
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
You need to update your EmrEtlRunner's `config.yml` file to reflect the new Hadoop 2.4.0 and AMI 3.6.0 support:
:emr:
:ami_version: 3.6.0 # WAS 2.4.2
And:
:versions:
:hadoop_enrich: 1.0.0 # WAS 0.14.1
You can enable this enrichment by creating a self-describing JSON and adding it into your `enrichments` folder. The configuration JSON should validate against the `javascript_script_config` schema.
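For orientation, such a configuration JSON might look like the sketch below, with your JavaScript function base64-encoded into the `script` parameter. The parameter name is as we recall it; validate your file against the `javascript_script_config` schema mentioned above:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/javascript_script_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "javascript_script_config",
    "enabled": true,
    "parameters": {
      "script": "BASE64-ENCODED-JAVASCRIPT-GOES-HERE"
    }
  }
}
```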
This release greatly improves the speed, efficiency, and reliability of Snowplow’s real-time Kinesis pipeline.
The Kinesis apps for r65 Scarlet Rosefinch are all available in a single zip file here.
Upgrading will require various configuration changes to each of the four applications.
Add `backoffPolicy` and `buffer` fields to the configuration HOCON.
- Add `backoffPolicy` and `buffer` fields to the configuration HOCON
- Update the command line arguments as detailed here
- Rename the outermost key in the configuration HOCON from "connector" to "sink"
- Replace the "s3/endpoint" field with an "s3/region" field (such as
us-east-1
)
Rename the outermost key in the configuration HOCON from "connector" to "sink"
This is a major release which adds a new data modeling stage to the Snowplow pipeline, as well as fixes a small number of important bugs across the rest of Snowplow.
You need to update EmrEtlRunner to version 0.14.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout r64-palila
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
From this release onwards, you must specify IAM roles for Elastic MapReduce to use. If you have not already done so, you can create these default EMR roles using the AWS Command Line Interface, like so:
$ aws emr create-default-roles
Now update your EmrEtlRunner's `config.yml` file to add the default roles you just created:
:emr:
:ami_version: 2.4.2 # Choose as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html
:region: eu-west-1 # Always set this
:jobflow_role: EMR_EC2_DefaultRole # NEW LINE
:service_role: EMR_DefaultRole # NEW LINE
This release also bumps the Hadoop Enrichment process to version 0.14.1. Update `config.yml` like so:
:versions:
:hadoop_enrich: 0.14.1 # WAS 0.14.0
For a complete example, see our sample `config.yml` template.
This release widens the `mkt_clickid` field in `atomic.events`. You need to use the appropriate migration script to update to the new table definition:
This is a major release which adds two new enrichments, upgrades existing enrichments and significantly extends and improves our Canonical Event Model for loading into Redshift, Elasticsearch and Postgres.
The new and upgraded enrichments are as follows:
- New enrichment: parsing useragent strings using the `ua_parser` library
- New enrichment: converting the money amounts in e-commerce transactions into a base currency using Open Exchange Rates
- Upgraded: extracting click IDs in our campaign attribution enrichment, so that Snowplow event data can be more precisely joined with campaign data
- Upgraded: our existing MaxMind-powered IP lookups
- Upgraded: useragent parsing using the `user_agent_utils` library can now be disabled

To continue parsing useragent strings using the `user_agent_utils` library, you must add a new JSON configuration file into your folder of enrichment JSONs:
{
"schema": "iglu:com.snowplowanalytics.snowplow/user_agent_utils_config/jsonschema/1-0-0",
"data": {
"vendor": "com.snowplowanalytics.snowplow",
"name": "user_agent_utils_config",
"enabled": true,
"parameters": {}
}
}
The name of the file is not important but must end in `.json`.
Configuring other enrichments is at your discretion. Useful resources here are:
There are two steps to upgrading the EMR pipeline:
- Upgrade your EmrEtlRunner to use the latest Hadoop job versions
- Upgrade your Redshift and/or Postgres `atomic.events` table to the relevant definitions
This release bumps:
- The Hadoop Enrichment process to version 0.14.0
- The Hadoop Shredding process to version 0.4.0
In your EmrEtlRunner's `config.yml` file, update your Hadoop job versions like so:
:versions:
:hadoop_enrich: 0.14.0 # WAS 0.13.0
:hadoop_shred: 0.4.0 # WAS 0.3.0
For a complete example, see our sample `config.yml` template.
You need to use the appropriate migration script to update to the new table definition:
If you want to make use of the new `ua_parser`-based useragent parsing enrichment in Redshift, you must also deploy the new table into your `atomic` schema:
This release updates:
- Scala Kinesis Enrich, to version 0.4.0
- Kinesis Elasticsearch Sink, to version 0.2.0
The new version of the Kinesis pipeline is available on Bintray. The download contains the latest versions of all of the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink).
Our recommended approach for upgrading is as follows:
- Kill your Scala Kinesis Enrich cluster
- Leave your Kinesis Elasticsearch Sink cluster running until all remaining enriched events are loaded, then kill this cluster too
- Upgrade your Scala Kinesis Enrich cluster to the new version
- Upgrade your Kinesis Elasticsearch Sink cluster to the new version
- Restart your Scala Kinesis Enrich cluster
- Restart your Kinesis Elasticsearch Sink cluster
This release is designed to fix an incompatibility issue between r61's EmrEtlRunner and some older Elastic Beanstalk configurations. It also includes some other EmrEtlRunner improvements.
You need to update EmrEtlRunner to version 0.13.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout r62-tropical-parula
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
You must also update your EmrEtlRunner's configuration file, or else you will get a Contract failure on start. See the next section for details.
Whether or not you use the new bootstrap option, you must update your EmrEtlRunner's `config.yml` file to include an entry for it:
In the `:emr:` section of your EmrEtlRunner's `config.yml` file, add in a `:bootstrap:` property like so:
:emr:
...
:ec2_key_name: ADD HERE
:bootstrap: [] # No custom bootstrap actions
:software:
...
For a complete example, see our sample `config.yml` template.
This release has a variety of new features, operational enhancements and bug fixes. The major additions are:
- You can now parse Amazon CloudFront access logs using Snowplow
- The latest Clojure Collector version supports Tomcat 8 and CORS, ready for cross-domain `POST` from JavaScript and ActionScript
- EmrEtlRunner's failure handling and Clojure Collector log handling have been improved
You need to update EmrEtlRunner to version 0.12.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout r61-pygmy-parrot
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
If you currently use `snowplow-runner-and-loader.sh`, upgrade to the relevant version too.
This release bumps the Hadoop Enrichment process to version 0.13.0.
In your EmrEtlRunner's `config.yml` file, update your `hadoop_enrich` and `hadoop_shred` jobs' versions like so:
:versions:
:hadoop_enrich: 0.13.0 # WAS 0.12.0
For a complete example, see our sample `config.yml` template.
This release bumps the Clojure Collector to version 1.0.0.
You will not be able to upgrade an existing Tomcat 7 cluster to use this version. Instead, to upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting "Save As…"
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector's application
- Click the "Launch New Environment" action
- Click the "Upload New Version" and upload your warfile
When you are confident that the new collector is performing as expected, you can choose the "Swap Environment URLs" action to put the new collector live.
This release focuses on the Snowplow Kinesis flow, and includes:
- A new Kinesis “sink app” that reads the Scala Stream Collector’s Kinesis stream of raw events and stores these raw events in Amazon S3 in an optimized format
- An updated version of our Hadoop Enrichment process that supports as an input format the events stored in S3 by the new Kinesis sink app
Together, these two features let you robustly archive your Kinesis event stream in S3, and process and re-process it at will using our tried-and-tested Hadoop Enrichment process.
Up until now, all Snowplow releases have used semantic versioning. We will continue to use semantic versioning for Snowplow's many constituent applications and libraries, but our releases of the Snowplow platform as a whole will be known by their release number plus a codename. The codenames for 2015 will be birds in ascending order of size, starting with the Bee Hummingbird.
We recommend upgrading EmrEtlRunner to the version 0.11.0, given the bugs fixed in this release. You also must upgrade if you want to use Hadoop to process the events stored by the Kinesis LZO S3 Sink.
Upgrade is as follows:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout r60-bee-hummingbird
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
This release bumps the Hadoop Enrichment process to version 0.12.0.
In your EmrEtlRunner's `config.yml` file, update your `hadoop_enrich` job's version like so:
:versions:
:hadoop_enrich: 0.12.0 # WAS 0.11.0
If you want to run the Hadoop Enrichment process against the output of the Kinesis LZO S3 Sink, you will have to change the `collector_format` field in the configuration file to `thrift`:
:collector_format: thrift
For a complete example, see our sample `config.yml` template.
We are steadily moving over to Bintray for hosting binaries and artifacts which don't have to be hosted on S3. To make deployment easier, the Kinesis apps (Scala Stream Collector, Scala Kinesis Enrich, Kinesis Elasticsearch Sink, and Kinesis S3 Sink) are now all available in a single zip file.
This release contains a variety of important bug fixes, plus support for three new event streams which can be loaded into your Snowplow event warehouse and unified log:
- Mandrill - for tracking email and email-related events delivered by Mandrill
- PagerDuty - for tracking incidents generated by PagerDuty
- Pingdom - for tracking site outages detected by Pingdom
You need to update EmrEtlRunner to version 0.10.0 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.9.14
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
This release bumps the Hadoop Enrichment process to version 0.11.0 and the Hadoop Shredding process to version 0.3.0.
In your EmrEtlRunner's `config.yml` file, update your `hadoop_enrich` and `hadoop_shred` jobs' versions like so:
:versions:
:hadoop_enrich: 0.11.0 # WAS 0.10.1
:hadoop_shred: 0.3.0 # WAS 0.2.1
For a complete example, see our sample `config.yml` template.
This release bumps the Clojure Collector to version 0.9.1.
To upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting "Save As…"
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector’s application
- Click the "Upload New Version" and upload your warfile
You can find the new pixel in our GitHub repository as `2-collectors/cloudfront-collector/static/i` - upload this to S3, overwriting your existing pixel.
Remember to invalidate the pixel in your CloudFront distribution.
Make sure to deploy Redshift tables for any of the new webhooks that you plan on ingesting into Snowplow. You can find the Redshift table deployment instructions on the corresponding webhook setup wiki pages:
This release is fixing two bugs found in the previous release:
- Safer URI parsing
- Dependency conflict with the version of Specs2 in Kinesis Enrich
This release bumps Common Enrich to 0.9.1, Hadoop Enrich to version 0.10.1, and Kinesis Enrich to 0.2.1, with the latter two publicly available on S3.
In your EmrEtlRunner's `config.yml` file, update your Hadoop enrich job's version to 0.10.1:
:versions:
:hadoop_enrich: 0.10.1
For a complete example, see our sample `config.yml` template.
This release significantly improves and extends our Kinesis support. The major new feature is our all new Kinesis Elasticsearch Sink, which streams event data from Kinesis into Elasticsearch in real-time. The data is then available to power real-time dashboards and analysis (e.g. using Kibana).
In addition to enabling real-time loading of data into Elasticsearch, we have made a number of other improvements to the real-time flow:
- Bad rows of data are now loaded into a dedicated bad rows stream in Kinesis
- The real-time flow now runs the latest version of Scala Common Enrich, making it possible to employ the same configurable enrichments in the real-time flow that are currently available in the batch flow.
This release also makes some improvements to Snowplow Common Enrich and Hadoop Enrich which should be invaluable for users of our batch-based event pipeline.
There are several changes you need to make to move to the new versions of the Scala Stream Collector and Scala Kinesis Enrich:
- You must provide a "region" field (with a value like “us-east-1”) in the configuration files
- You must provide a "resolver" field in the Scala Kinesis Enrich containing the data used to configure the Iglu resolver
- If you run Scala Kinesis Enrich without the `--enrichments` option, the IP anonymization enrichment and the IP address lookup enrichment will not run automatically
New templates for the two configuration files can be found on GitHub (you will need to edit the AWS credentials and the stream names):
And a sample enrichment directory containing sensible configuration JSONs can be found here.
This release bumps the Hadoop Enrichment process to version 0.10.0.
In your EmrEtlRunner's `config.yml` file, update your Hadoop enrich job's version to 0.10.0, like so:
:versions:
:hadoop_enrich: 0.10.0 # WAS 0.9.0
For a complete example, see our sample `config.yml` template.
For the first time, you can now use Snowplow to collect, store and analyze event streams generated by supported third-party software.
Many Software-as-a-Service vendors publish their own internal event streams for customers to consume - these event stream APIs are often referred to as "webhooks", sometimes as "streaming APIs", "postbacks" or "HTTP response APIs". Snowplow 0.9.11 adds first-class support for an initial set of these third-party webhooks.
For our initial 0.9.11 release we are adding support for three different webhook sources:
- MailChimp - for tracking email and email-related events delivered by MailChimp
- CallRail - for tracking completed telephone calls recorded by CallRail
- Iglu - for tracking Iglu-compatible self-describing events, enabling you to use schema-less webhook APIs such as AD-X Tracking
This release bumps the Hadoop Enrichment process to version 0.9.0.
In your EmrEtlRunner's `config.yml` file, update your Hadoop enrich job's version to 0.9.0, like so:
:versions:
:hadoop_enrich: 0.9.0 # WAS 0.8.0
For a complete example, see our sample `config.yml` template.
This release bumps the Clojure Collector to version 0.9.0.
To upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting "Save As…"
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector's application
- Click the “Upload New Version” and upload your warfile
If you have installed the `com_snowplowanalytics_snowplow_change_form_1` table following the 0.9.10 release, then please upgrade it by using the upgrade script, `migrate_change_form_1_r1_to_r2.sql`.
Also, make sure to deploy Redshift tables for any webhooks you plan on ingesting into Snowplow. You can find the Redshift table deployment instructions on the corresponding webhook setup wiki pages:
This is a minimalistic release designed to support the new events and context of the Snowplow JavaScript Tracker v2.1.1.
This release is primarily targeted at Snowplow users of Amazon Redshift who are upgrading to the Snowplow JavaScript Tracker (v2.1.0+).
You will need to deploy the tables for any new events/contexts you want to support into your Amazon Redshift database. Make sure to deploy these into the same schema as your `events` table resides in.
You can find all Redshift table definitions in our GitHub repository under `4-storage/redshift-storage/sql`.
The StorageLoader will automatically pick up the new JSONPaths files - you do not need to deploy these.
This is primarily a comprehensive bug fix release, although it also adds the new `campaign_attribution` enrichment to our enrichment registry.
You need to update EmrEtlRunner and StorageLoader to versions 0.9.2 and 0.3.3 respectively on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.9.9
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
This release bumps the Hadoop Enrichment process to version 0.8.0.
In your EmrEtlRunner's `config.yml` file, update your Hadoop enrich job's version to 0.8.0, like so:
:versions:
:hadoop_enrich: 0.8.0 # WAS 0.7.0
For a complete example, see our sample `config.yml` template.
If you upgrade Hadoop Enrich to version 0.8.0 as above, you must also follow these steps, or else campaign attribution will be disabled.
To use the new enrichment, add a "campaign_attribution.json" file containing a `campaign_attribution` enrichment JSON to your enrichments directory. Note that the previously automatic behaviour of populating the `mkt_` fields based on the `utm_` querystring fields no longer occurs by default. To reproduce it you must use the Google-like manual tagging configuration (see the sketch below).
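As an illustration only, a Google-style manual tagging configuration might look roughly like the following. The exact parameter layout (in particular the `fields` block and the `mkt*` key names) is our best recollection, so use the example `campaign_attribution.json` in the Snowplow repository as the authoritative reference:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow/campaign_attribution/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "campaign_attribution",
    "enabled": true,
    "parameters": {
      "fields": {
        "mktMedium": ["utm_medium"],
        "mktSource": ["utm_source"],
        "mktTerm": ["utm_term"],
        "mktContent": ["utm_content"],
        "mktCampaign": ["utm_campaign"]
      }
    }
  }
}
```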
This release bumps the Clojure Collector to version 0.8.0.
To upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting "Save As…"
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector's application
- Click the "Upload New Version" and upload your warfile
With this release, we are adding event analytics support for iOS and Android applications. Mobile event analytics is a major step in Snowplow’s journey from a web analytics tool to a general-purpose event analytics platform.
Adding mobile support for Snowplow is really a few different releases:
- Snowplow 0.9.8, which adds POST support to our Clojure Collector and upgrades our Enrichment process to support POST payloads containing multiple events
- A new event tracker for iOS, see today’s accompanying iOS Tracker blog post
- A new event tracker for Android, see today’s accompanying Android Tracker blog post
- New mobile-specific JSON Schemas available in Iglu Central, mobile_context and geolocation_context
This release bumps the Hadoop Enrichment process to version 0.7.0.
In your EmrEtlRunner's `config.yml` file, update your Hadoop enrich job's version to 0.7.0, like so:
:versions:
:hadoop_enrich: 0.7.0 # WAS 0.6.0
For a complete example, see our sample `config.yml` template.
Please make sure that you upgrade the Hadoop Enrichment process to 0.7.0 before upgrading your collector.
This release bumps the Clojure Collector to version 0.7.0.
To upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting "Save As…"
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector's application
- Click the "Upload New Version" and upload your warfile
Both of the new trackers send mobile-related context conforming to the mobile_context JSON Schema, as a custom context automatically attached to each event.
If you are running Redshift, you can deploy the `mobile_context` table into your database using this script.
The Android Tracker also optionally sends a geolocation-related context relating to the geolocation_context JSON Schema; support for this in the iOS Tracker is planned soon.
This release is a "tidy-up" release which fixes some important bugs, particularly:
- A bug in v0.9.5 onwards which was preventing events containing multiple JSONs from being shredded successfully
- Our Hive table definition falling behind Snowplow 0.9.6’s enriched event format updates
- A bug in EmrEtlRunner causing issues running Snowplow inside some VPC environments
As well as these important fixes, 0.9.7 comes with a set of smaller bug fixes plus two new features:
- The ability to perform shredding without prior enrichment (i.e. shred an existing folder of enriched events)
- The ability to load Redshift from an S3 bucket in a region different to Redshift's own region
You need to update EmrEtlRunner and StorageLoader to the 0.9.7 code release on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.9.7
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
In your EmrEtlRunner's `config.yml` file, update your Hadoop shred job's version to 0.2.1, like so:
:versions:
...
:hadoop_shred: 0.2.1 # WAS 0.2.0
For a complete example, see our sample `config.yml` template.
Hive users can find the updated Hive file in our repository as 4-storage/hive-storage/hiveql/table-def.q.
Note that enriched events generated by pre-0.9.6 Snowplow are not compatible with this updated Hive definition, and will need to be re-generated.
This release does four things:
- It fixes some important bugs discovered in Snowplow 0.9.5, related to our new shredding functionality
- It introduces new JSON-based configurations for Snowplow's existing enrichments
- It extends our geo-IP lookup enrichment to support all five of MaxMind's commercial databases
- It extends our referer-parsing enrichment to support a user-configurable list of internal domains
You need to update EmrEtlRunner and StorageLoader to the 0.9.6 code release on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.9.6
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
$ cd ../../4-storage/storage-loader
$ bundle install --deployment
Update your EmrEtlRunner's `config.yml` file. First update both of your Hadoop job versions to, respectively:
:versions:
:hadoop_enrich: 0.6.0 # WAS 0.5.0
:hadoop_shred: 0.2.0 # WAS 0.1.0
Next, completely delete the `:enrichments:` section at the bottom:
:enrichments:
:anon_ip:
:enabled: true
:anon_octets: 2
For a complete example, see our sample `config.yml` template.
Finally, if you wish to use any of the configurable enrichments, you need to create a directory of configuration JSONs and pass that directory to the EmrEtlRunner using the new `--enrichments` option, as sketched below.
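A minimal sketch of the updated invocation, with placeholder paths:

```bash
# Placeholder paths - pass your directory of enrichment JSONs with --enrichments
./snowplow-emr-etl-runner \
  --config /path/to/config.yml \
  --enrichments /path/to/enrichments
```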
For help on this, please read our release blog; also check out the example enrichments directory, and review the configuration guide for the new JSON-based enrichments.
Important: don't forget to update any Bash script that you use to run your EmrEtlRunner job to include the `--enrichments` argument. If you forget to do this, then all of your enrichments will be switched off. You can see updated versions of these Bash files here:
You need to use the appropriate migration script to update to the new table definition:
This release makes Snowplow the first event analytics system to validate incoming event and context JSONs (using JSON Schema), and then automatically shred those JSONs into dedicated tables in Amazon Redshift.
You need to update EmrEtlRunner to the code release 0.9.5 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.9.5
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle install --deployment
You also need to update the `config.yml` file for EmrEtlRunner. For more information on how to populate the new configuration file correctly, see the Configuration section of the EmrEtlRunner setup guide.
You need to upgrade your StorageLoader installation to version 0.9.5 on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ git checkout 0.9.5
$ cd snowplow/4-storage/storage-loader
$ bundle install --deployment
You also need to update the `config.yml` file for StorageLoader.
If you want to add support for the new Snowplow-authored events e.g. link clicks to your Snowplow installation, this is a two-step process:
- Deploy the Redshift table definition available in the Snowplow repo into your Redshift database (same schema as `atomic.events`)
- (If using Looker) deploy the LookML model available in the Snowplow repo into your Looker instance
Snowplow 0.9.5 lets you define your own custom unstructured events and contexts, and configure Snowplow to process these from collection through into Redshift and even Looker.
Setting this up is outside of the scope of this release blog post. We have documented the process on our wiki, split into two pages:
This release includes a new base LookML data model and dashboard to get Snowplow users started with Looker.
The new base model has some significant improvements over the old one:
- Querying the data is much faster. When new Snowplow event data is loaded into Redshift, Looker automatically detects it and generates the relevant session-level and visitor-level derived tables, so that they are ready to be queried directly. We’ve tuned the derived tables with the relevant dist keys and sort keys to make sure any underlying table joins in Redshift are performant
- New visualizations are now supported including geographic plots
- Looker's new functionality around global filters: this makes it possible to drill into subsets of visitors by a range of dimensions, and see a wide range of different visualizations for that subset of users on the same screen, opening up new creative ways of exploring your Snowplow data
- Metrics and dimensions have been renamed to make it easier for a new user unfamiliar with Snowplow to explore the data through Looker
To make use of the new models, you'll need to have a Looker license or be on a Looker trial.
First, you will need to load a new country codes dataset into Redshift / Postgres: this maps two character ISO country codes (outputted by our Maxmind enrichment) to three-character ISO country codes (used by Looker for geographic visualizations) and country names.
Clone the Snowplow repo:
$ git clone https://github.com/snowplow/snowplow.git
You need to run the contents of `snowplow/5-data-modeling/reference-data/redshift/iso-country-codes.sql` in your Redshift database. This can be done using PSQL e.g.
psql -U $username -p $port -h $host -d $database -f snowplow/5-data-modeling/reference-data/redshift/iso-country-codes.sql
Alternatively, you can copy and paste the content of the file into your favorite SQL editor.
You then need to make sure that your Looker user has access to the new data. In psql, execute:
GRANT USAGE ON SCHEMA reference_data TO looker;
GRANT SELECT ON TABLE reference_data.country_codes TO looker;
This assumes that the user credentials you share with Looker have the username "looker".
Next, you need to transfer our LookML files from the Snowplow repo into the repo you use for Looker, either directly (via Git) or by creating the files in the Looker UI (in the models section) and then copying and pasting the contents. Note that you may need to update snowplow.model.lookml
so that it references the Redshift connection for your Snowplow dataset: the example file assumes that your connection is called "snowplow", which may not be the case.
Once copied over, you should be able to start exploring the "events", "sessions" and "visitors" views, and playing around directly with the "Traffic Pulse" dashboard.
This release deals with incremental improvements to EmrEtlRunner, plus two important bug fixes for Clojure Collector users.
The first Clojure Collector issue was a problem in the file move functionality in EmrEtlRunner, which was preventing Clojure Collector users from scaling beyond a single instance without data loss.
The second Clojure Collector issue involved the IP address(es) of Elastic Beanstalk's Apache proxy showing up in the atomic.events
table in place of the expected end users' IP addresses. We were unable to reproduce this issue when running multiple instances, so we do not believe this problem is as widespread as the first.
Upgrading is a two-step process:
- Update EmrEtlRunner
- Update Clojure Collector [optional]
You need to update EmrEtlRunner to the 0.9.3 release code on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.3
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
You also need to update your EmrEtlRunner's config.yml
file in a few places. First add a logging section at the top:
:logging:
:level: DEBUG # You can optionally switch to INFO for production
Next you need to replace this:
:emr:
:hadoop_version: 1.0.3
with this:
:emr:
:ami_version: 2.4.2
If you need to use a different Hadoop version, check out this handy table to determine the correct AMI version.
Finally, add the region in:
:emr:
:ami_version: 2.4.2
:region: us-east-1 # Or your region
Your :region:
will be your existing :placement:
without the availability-zone letter on the end; for example, a :placement: of us-east-1a corresponds to a :region: of us-east-1. Note that if you are running your EMR job in an EC2 subnet, you no longer need to set the :placement:
field.
Once you have made these changes, do check your final version against the updated config.yml
template.
This release bumps the Clojure Collector to version 0.6.0. Upgrading to this release is only necessary if you have been encountering the issue with proxy IPs appearing in atomic.events
, as discussed in this email thread (issue #719).
To upgrade to this release:
- Download the new warfile by right-clicking on this link and selecting “Save As…”
- Log in to your Amazon Elastic Beanstalk console
- Browse to your Clojure Collector's application
- Click the “Upload New Version” button and upload your warfile
This release adds Snowplow support for the updated CloudFront access log file format introduced by Amazon on the morning of 29th April 2014.
If you currently use the Snowplow CloudFront-based event collector, we recommend that you upgrade to this release as soon as possible.
As well as support for the new log file format, this release also features a new standalone Scalding job to make re-processing “bad” rows easier, and also some Hive script updates to bring our Hive support in step with our Postgres and Redshift schemas.
Before upgrading, please ensure that you are on Snowplow version 0.9.1, which introduced changes to the Snowplow enriched event format.
If you attempt to jump straight to 0.9.2 (from versions before 0.9.1), your enriched events will not load into your legacy Redshift or Postgres schema.
Upgrading is super simple: just update the config.yml
file for EmrEtlRunner to use version 0.5.0 of the Hadoop ETL:
:snowplow:
:hadoop_etl_version: 0.5.0
Important: since releasing this version of Snowplow, we have learnt that the suggested upgrade process listed below has the unfortunate side effect of URL-encoding all string columns in the recovered data. For that reason, we recommend updating to Snowplow 0.9.3, where this bug is addressed.
Any Snowplow batch runs after the CloudFront change but before your upgrade to 0.9.2 will have resulted in valid events ending up in your bad
rows bucket. Happily, we can use the Snowplow Hadoop Bad Rows job to recover them.
For each run that you want to recover data from, you can run the Hadoop Bad Rows job using the Amazon Ruby EMR client:
$ elastic-mapreduce --create --name "Extract raw events from Snowplow bad row JSONs" \
--instance-type m1.xlarge --instance-count 3 \
--jar s3://snowplow-hosted-assets/3-enrich/scala-bad-rows/snowplow-bad-rows-0.1.0.jar \
--arg com.snowplowanalytics.hadoop.scalding.SnowplowBadRowsJob \
--arg --hdfs \
--arg --input --arg s3n://[[PATH_TO_YOUR_FIXABLE_BAD_ROWS]] \
--arg --output --arg s3n://[[PATH_WILL_BE_STAGING_FOR_EMRETLRUNNER]]
Replace the [[...]]
placeholders above with the appropriate bucket paths. Please note: if you have multiple runs to fix, then we suggest running the above multiple times, one per run to fix, rather than running it against your whole bad rows bucket - it should be much faster.
Now you are ready to process the recovered raw events with Snowplow. Unfortunately, the filenames generated by the Bad Rows job are not compatible with the EmrEtlRunner currently (we will fix this in a future release). In the meantime, here is a workaround:
- Edit config.yml and change collector_format: cloudfront to collector_format: clj-tomcat
- Edit config.yml and point the :processing: bucket setting to wherever your extracted bad rows are located
- Run EmrEtlRunner with --skip staging (see the sketch below)
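As a sketch of that final step, assuming you run EmrEtlRunner from its checkout directory and keep your configuration at the path shown (both assumptions):
$ cd snowplow/3-enrich/emr-etl-runner
$ bundle exec bin/snowplow-emr-etl-runner \
  --config /path/to/your/config.yml \
  --skip staging # don't move raw logs; process what is already in the :processing: bucket
Because staging is skipped, EmrEtlRunner picks the recovered events up directly from the :processing: bucket you pointed it at.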
If you are a Qubole and/or Hive user, you can find an alternative approach to recovering the bad rows in our blog post, Reprocessing bad rows of Snowplow data using Hive, the JSON Serde and Qubole.
This release introduces initial support for JSON-based custom unstructured events and custom contexts in the Snowplow Enrichment and Storage processes; this is the most-requested feature from our community and a key building block for mobile and app event tracking in Snowplow.
Snowplow’s event trackers have supported custom unstructured events and custom contexts for some time, but prior to 0.9.1 there had been no way of working with these JSON-based objects “downstream” in the rest of the Snowplow data pipeline. This release adds preliminary support as follows:
- Parse incoming custom unstructured events and contexts to ensure that they are valid JSON
- Where possible, clean up the JSON (e.g. remove whitespace)
- Store the JSON as json-type fields in Postgres, and in large varchar fields in Redshift
As well as this new JSON-based functionality, 0.9.1 also includes a host of additional features and updates.
You need to update EmrEtlRunner to the 0.9.1 release code on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.1
$ cd 3-enrich/emr-etl-runner
$ bundle install --deployment
You also need to update the config.yml
file for EmrEtlRunner to use the Hadoop ETL version 0.4.0:
:snowplow:
:hadoop_etl_version: 0.4.0
Don't forget to add in the new subnet (VPC) argument too:
:emr:
...
:ec2_subnet_id: ADD HERE # Leave blank if not running in VPC
See a complete example of the EmrEtlRunner config.yml
file in the GitHub repo.
You need to upgrade your StorageLoader installation to the 0.9.1 release code on GitHub:
$ git clone git://github.com/snowplow/snowplow.git
$ cd snowplow
$ git checkout 0.9.1
$ cd 4-storage/storage-loader
$ bundle install --deployment
We have updated the Redshift and Postgres table definitions for atomic.events
. You can find the latest versions in the GitHub repository, along with migration scripts to handle the upgrade from recent prior versions. Please review any migration script carefully before running and check that you are happy with how it handles the upgrade.
Database | Table definition | Migration script |
---|---|---|
Redshift | 0.3.0 | Migrate from 0.2.2 to 0.3.0 |
Postgres | 0.2.0 | Migrate from 0.1.x to 0.2.0 |
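For instance, a hypothetical sketch of reviewing and then applying the Redshift migration with psql; the [[MIGRATION_SCRIPT]] placeholder and the 4-storage/redshift-storage/sql path are assumptions, so use the actual script name from the GitHub repository:
$ less snowplow/4-storage/redshift-storage/sql/[[MIGRATION_SCRIPT]].sql # review the script before running it
$ psql -U $username -p $port -h $host -d $database \
  -f snowplow/4-storage/redshift-storage/sql/[[MIGRATION_SCRIPT]].sql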
This release introduces our initial beta support for Amazon Kinesis in the Snowplow Collector and Enrichment components.
At Snowplow we are hugely excited about Kinesis's potential, not just to enable near-real-time event analytics, but more fundamentally to serve as a business’s unified log, aka its “digital nervous system”. This is a concept we introduced recently in our blog post The three eras of business data processing, and further explored at the Inaugural Kinesis London meetup.
There are no upgrade steps, as this release introduces entirely new components; if you want to take them onboard, you will need to set up a new environment.