Enrichment - thebeansgroup/snowplow GitHub Wiki

The Snowplow Enrichment step takes the raw log files generated by the Snowplow collectors, tidies the data up and enriches it so that it is:

  1. Ready in S3 to be analysed using EMR
  2. Ready to be uploaded into Amazon Redshift, PostgreSQL or some other alternative storage mechanism for analysis

The current Enrichment process is written using Scalding, a Scala API for Cascading, which is an ETL library built on top of Hadoop. This is the Hadoop Enrichment; we are also working on a Kinesis-based enrichment process.

Snowplow uses Amazon's EMR to run the Enrichment process. The regular running of the process (which is necessary to ensure that up-to-date Snowplow data is available for analysis) is managed by EmrEtlRunner, a Ruby application.

In this guide, we cover:

  1. How EmrEtlRunner orchestrates the regular running of the Enrichment Process
  2. [The Enrichment Process itself][The-enrichment-process]