Enrichment

The Snowplow Enrichment step takes the raw log files generated by the Snowplow collectors, tidies the data up and enriches it so that it is:

  1. Ready to be analysed using EMR
  2. Ready to be uploaded into Amazon Redshift, PostgreSQL or another storage mechanism for analysis

The current enrichment process offers developers two options:

  1. Using Scalding, a Scala API for Cascading, an ETL library written on top of Hadoop. This is the Hadoop Enrichment. Snowplow uses Amazon's EMR to run the Enrichment process. EmrEtlRunner, a Ruby application, manages the regular running of the process, which is necessary to ensure that up-to-date Snowplow data is available for analysis.

  2. Using Scala and Amazon Kinesis for real-time processing of data.

In this guide, we cover:

  1. The Enrichment Process itself
  2. How EmrEtlRunner orchestrates the regular running of the Enrichment Process
  3. Stream