Enrichment - thebeansgroup/snowplow GitHub Wiki
The Snowplow Enrichment step takes the raw log files generated by the Snowplow collectors, tidies the data up and enriches it so that it is:
- Ready in S3 to be analysed using EMR
- Ready to be uploaded into Amazon Redshift, PostgreSQL or some other alternative storage mechanism for analysis
The current Enrichment process is written using Scalding, a Scala API for Cascading, which is an ETL library built on top of Hadoop. We refer to this as the Hadoop Enrichment process; we are now also working on a Kinesis-based enrichment process.
Snowplow uses Amazon's EMR to run the Enrichment process. The regular running of the process (which is necessary to ensure that up-to-date Snowplow data is available for analysis) is managed by EmrEtlRunner, a Ruby application.
In this guide, we cover:
- How the EmrEtlRunner orchestrates the regular running of the Enrichment process
- [The Enrichment process itself][The-enrichment-process]