The StorageLoader - chuwy/snowplow-ci GitHub Wiki
- Data from enriched Snowplow event files generated by the Scalding process on EMR is read and written to Amazon Redshift
- The enriched event files are then moved from the in-bucket (which was the archive bucket for the EmrEtlRunner) to the archive bucket (for the StorageLoader)
The StorageLoader is configured via the configuration file shared with EmrEtlRunner. For more information, see the guide to setting up the StorageLoader.
## The StorageLoader role in the ETL process
The enriched files contain the tab-separated values contributing to `atomic.events` and custom tables. The shredding process:
- reads Snowplow enriched events from enriched good files (produced and temporarily stored in HDFS as a result of the enrichment process);
- extracts any unstructured (self-describing) event JSONs and contexts JSONs found;
- validates that these JSONs conform to the corresponding schemas located in the Iglu registry;
- adds metadata to these JSONs to track their origins;
- writes these JSONs out to nested folders on S3 dependent on their schema.
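
The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the actual Scalding job: the `shred` function, the dict view of an enriched row, the `origin` metadata shape and the folder layout are all assumptions made for the example, and `is_valid` merely stands in for validation against an Iglu registry. The field names `unstruct_event`, `contexts` and `derived_contexts` are the enriched event fields that carry self-describing JSONs.

```python
import json


def parse_iglu_uri(uri):
    """Split 'iglu:vendor/name/format/version' into its four parts."""
    if not uri.startswith("iglu:"):
        raise ValueError(f"not an Iglu schema URI: {uri}")
    vendor, name, fmt, version = uri[len("iglu:"):].split("/")
    return vendor, name, fmt, version


def shred(event, is_valid=lambda sd_json: True):
    """Return {folder: [json_line, ...]} for one enriched event.

    `event` is a hypothetical dict view of one enriched TSV row.
    `is_valid` is a placeholder for schema validation against Iglu.
    """
    found = []
    # The unstructured event wraps a single self-describing JSON.
    if event.get("unstruct_event"):
        found.append(json.loads(event["unstruct_event"])["data"])
    # Contexts and derived contexts wrap a list of self-describing JSONs.
    for field in ("contexts", "derived_contexts"):
        if event.get(field):
            found.extend(json.loads(event[field])["data"])

    shredded = {}
    for sd_json in found:
        if not is_valid(sd_json):
            continue  # a real job would route this to a "bad" output
        vendor, name, fmt, version = parse_iglu_uri(sd_json["schema"])
        record = {
            "schema": {"vendor": vendor, "name": name,
                       "format": fmt, "version": version},
            "data": sd_json["data"],
            # Illustrative metadata tying the JSON back to its parent event.
            "origin": {"event_id": event["event_id"],
                       "collector_tstamp": event["collector_tstamp"]},
        }
        # Assumed layout: one nested folder per schema.
        folder = f"{vendor}/{name}/{fmt}/{version}"
        shredded.setdefault(folder, []).append(json.dumps(record, sort_keys=True))
    return shredded
```

Grouping the output by schema is the point of the exercise: every folder then holds JSONs of a single shape, which is what makes it possible to load each folder into its own table.
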
As a result, the enriched good file is "shredded" into a few shredded good files (provided the event file contained data from at least one of the following: custom self-describing events, contexts, configurable enrichments):
- a TSV-formatted file containing the data for the `atomic.events` table;
- possibly one or more JSON files related to custom user-specific (self-describing) events extracted from the `unstruct_event` field of the enriched good file;
- possibly one or more JSON files related to custom contexts extracted from the `contexts` field of the enriched good file;
- possibly one or more JSON files related to configurable enrichments (if any were enabled) extracted from the `derived_contexts` field of the enriched good file.
These files end up in S3, and the StorageLoader orchestrates loading them into Redshift, where a dedicated table exists for each of the file types above.
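
To make the load step more concrete, the sketch below builds Redshift `COPY` statements for the two kinds of shredded files: tab-delimited data for `atomic.events` and JSON data mapped through a JSONPaths file. It only shows the general shape of such a load, not the statements the StorageLoader itself issues. The helper `copy_statement`, the bucket paths, the IAM role ARN and the custom table name are hypothetical.

```python
def copy_statement(table, s3_path, iam_role, jsonpaths=None):
    """Build a Redshift COPY statement for one shredded output.

    Without `jsonpaths` the source is treated as tab-separated values;
    with it, the source is treated as JSON mapped via a JSONPaths file.
    """
    if jsonpaths is None:
        data_format = "DELIMITER '\\t'"
    else:
        data_format = f"JSON '{jsonpaths}'"
    return f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}' {data_format};"


# Hypothetical role and locations, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/loader"

statements = [
    copy_statement("atomic.events",
                   "s3://my-bucket/shredded/atomic-events/", ROLE),
    copy_statement("atomic.com_acme_link_click_1",
                   "s3://my-bucket/shredded/com.acme/link_click/", ROLE,
                   jsonpaths="s3://my-bucket/jsonpaths/link_click_1.json"),
]
```

Each statement targets exactly one table, which mirrors the one-table-per-file-type arrangement described above.
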
The whole process could be depicted with the following dataflow diagram.