Glossary - OXYGEN-MARKET/oxygen-market.github.io GitHub Wiki
Once we have our data modeled into tidy users, sessions and content item tables, we are ready to perform analysis on them.
Most companies that use Snowplow will perform analytics using a number of different types of tools:
- It is common to implement a Business Intelligence tool on top of Snowplow data to enable users (particularly non-technical users) to slice and dice (pivot) on the data. For many companies, the BI tool will be the primary way that most users interface with Snowplow data.
- Often a data scientist or data science team will crunch the underlying event-level data to perform more sophisticated analysis, including building predictive models, performing marketing attribution etc. The data scientist(s) will use one or more specialist tools, e.g. Python for data science, or R.
A collector receives data in the form of GET or POST requests from the trackers, and writes the data to either logs or streams (for further processing).
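As a simplified, illustrative sketch (the exact endpoint and parameter names depend on the tracker and collector version in use, and the hostname here is made up), a page view sent by the JavaScript tracker might arrive at the collector as a request along these lines:
GET http://collector.acme.com/i?e=pv&page=Homepage&p=web&tv=js-2.5.0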
A context is the group of entities associated with or describing the setting in which an event has taken place. What makes contexts interesting is that they are common across multiple different event types. Thus, contexts provide a convenient way in Snowplow to schema common entities once, and then use those schemas across all the different events where those entities are relevant.
Across all our trackers, the approach is the same. Each context is a self-describing JSON. We create an array of all the different contexts that we wish to pass into Snowplow, and then we generally pass that array in as the final argument on any track method that we call to capture the event, as sketched below.
See Analytics
At data collection time, we aim to capture all the data required to accurately represent a particular event that has just occurred.
At this stage, the data that is collected should describe the events as they happened, including as much rich information as possible about:
- The event itself
- The individual/entity that performed the action - that individual or entity is a "context"
- Any "objects" of the action - those objects are also "context"
- The wider context that the event has occurred in
For each of the above we want to collect as much data describing the event and associated contexts as possible.
See Enrichment
The data collection and enrichment process generates a data set that is an "event stream": a long list of packets of data, where each packet represents a single event.
Whilst it is possible to do analysis directly on this event stream, it is very common to:
- Join the event-stream data set with other data sets (e.g. customer data, product data, media data, marketing data, financial data).
- Aggregate the event-level data into smaller data sets that are easier and faster to run analyses against.
- Apply "business logic" i.e. definitions to the data as part of that aggregation step.
What tables are produced, and the different fields available in each, varies widely between companies in different sectors, and surprisingly even varies within the same vertical. That is because part of putting together these aggregate tables involves implementing business-specific logic.
We call this aggregation process "data modeling". At the end of the data modeling process, a clean set of tables is available, making it easier to perform analysis on the data.
In event modeling terms, an entity is a thing or object which is somehow relevant to the event that we are observing. We use the word "entity" because the word "object" is too loaded - it has too many connotations.
There is a lot of confusion around the role of entities within events - even to the extent of one analytics company arguing that entity data is completely distinct from event data. In fact nothing could be further from the truth - as we see it, our events consist of almost nothing but entities.
An event is something that occurred at a particular point in time. Examples of events include:
- Load a web page
- Add an item to basket
- Enter a destination
- Check a balance
- Search for an item
- Share a video
Snowplow is an event analytics platform. Once you have set up one or more Snowplow trackers, every time an event occurs, Snowplow should spot the event, generate a packet of data to describe it, and send that packet into the Snowplow data pipeline.
When we set up Snowplow, we need to make sure that we track all the events that are meaningful to our business, so that the data associated with those events is available in Snowplow for analysis.
When we come to analyse Snowplow data, we need to be able to look at the event data and understand, in an unambiguous way, what that data actually means, i.e. what it represents.
An event dictionary is a crucial tool in both cases. It is a document that defines the universe of events that a company is interested in tracking.
Data enrichment is sometimes referred to as "dimension widening": we use third-party sources of data to enrich the data we originally collected about an event, so that we have more context available for understanding that event and can perform richer analysis.
Snowplow supports the following enrichments out-of-the-box. We are working on making our enrichment framework pluggable, so that users and partners can extend the list of enrichments performed as part of the data processing pipeline:
- IP -> Geographic location
- Referrer query string -> source of traffic
- User agent string -> classifying devices, operating systems and browsers
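For example, the IP address on an event might be resolved into a country and city, a visit arriving from a Google search results page might have its referrer classified as search traffic from Google, and the user agent string might be parsed into browser family, operating system and device type.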
Huskimo is an open-source product from the Snowplow team. It connects to third-party SaaS platforms (e.g. Singular, Twilio), exports their data via API, and then uploads that data into your Redshift instance. Huskimo has a simple goal: to make essential datasets currently locked away inside various SaaS platforms available for analysis inside Redshift.
Huskimo is a complement to Snowplow's built-in webhook support. It came about because not all SaaS services offer webhooks which expose their internal data as a stream of events. Note that you do not need to use Snowplow to use Huskimo.
Iglu is a machine-readable, open-source schema repository for JSON Schema from the team at Snowplow Analytics. A schema repository (sometimes called a registry) is like npm or Maven or git but holds data schemas instead of software or code.
Snowplow uses Iglu to store all the schemas associated with the different events and contexts that are captured via Snowplow. When an event or context is sent into Snowplow, it is sent with a reference to its schema, which points to the location of that schema in Iglu.
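For example, the schema reference iglu:com.snowplowanalytics/ad_click/jsonschema/1-0-0 used later in this glossary identifies the vendor (com.snowplowanalytics), the schema name (ad_click), the schema format (jsonschema) and the SchemaVer version (1-0-0), which together tell the pipeline where to find the corresponding JSON Schema in Iglu.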
The Snowplow pipeline is built to enable a very clean separation of the following steps in the data processing flow:
- Data collection
- Data enrichment
- Data modeling
- Data analysis
Sauna is an open-source decisioning and response framework from the Snowplow Analytics team. Analysts and data scientists (and some data engineers) are the end users of Sauna: you use Sauna to automate responses to your event streams in third-party systems.
Schema DDL is a set of generators for producing various DDL formats from JSON Schemas. It is tightly coupled with other tools from the Snowplow platform, such as Iglu and Self-describing JSON, and is used mostly in Schema Guru.
Schema Guru is a tool (CLI, Spark job and web) allowing you to derive JSON Schemas from a set of JSON instances, and to process and transform those schemas into different data definition formats.
Current primary features include:
- derivation of JSON Schema from a set of JSON instances
- generation of Redshift table DDL and JSONPaths file
Unlike other tools for deriving JSON Schemas, Schema Guru allows you to derive a schema from an unlimited set of instances (making schemas much more precise), and supports many more JSON Schema validation properties.
Schema Guru is used heavily in association with Snowplow Analytics' own Snowplow, Iglu and Schema DDL projects.
SchemaVer is the Snowplow team's own schema versioning scheme. It is defined as follows: MODEL-REVISION-ADDITION
- MODEL: increment when you make a breaking schema change which will prevent interaction with any historical data
- REVISION: increment when you introduce a schema change which may prevent interaction with some historical data
- ADDITION: increment when you make a schema change that is compatible with all historical data
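For example, a schema first published as 1-0-0 would move to 1-0-1 after an ADDITION, to 1-1-0 after a REVISION, and to 2-0-0 after a MODEL change.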
A self-describing JSON is an individual JSON bundled with a reference to its JSON Schema. It generally looks like the example below:
{
  "schema": "iglu:com.snowplowanalytics/ad_click/jsonschema/1-0-0",
  "data": {
    "bannerId": "4732ce23d345"
  }
}
It differs from standard JSON due to the following important changes:
- We have added a new top-level field, schema, which contains (in a space-efficient format) all the information required to uniquely identify the associated JSON Schema
- We have moved the JSON’s original property inside a data field. This sandboxing will prevent any accidental collisions should the JSON already have a schema field
Snowplow has a shredding process (as part of the Enrichment and Storage processes) which consists of two phases:
- Extracting unstructured event JSONs and context JSONs from enriched event files into their own files
- Loading those files into corresponding tables in Redshift
There are three great use cases for the shredding functionality:
- Adding support into your Snowplow installation for new Snowplow event types with no software upgrade required - simply add new tables to your Redshift database.
- Defining your own custom unstructured event types, and processing these through the Snowplow pipeline into dedicated tables in Redshift. Retailers can define their own "product view" or "add to basket" events, for example. Media companies can define their own "video play" events.
- Defining your own custom context types, and processing these through the Snowplow pipeline into dedicated tables in Redshift. You can define your own "user" type, for example, including whatever fields you capture and want to store related to a user.
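For example, a custom "ad_click" unstructured event defined against the schema iglu:com.snowplowanalytics/ad_click/jsonschema/1-0-0 (shown earlier in this glossary) would typically be shredded into its own dedicated Redshift table, conventionally named after the vendor, event name and schema MODEL (e.g. com_snowplowanalytics_ad_click_1), alongside the main atomic events table.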
Snowplow is an enterprise-strength marketing and product analytics platform. It does three things:
- Identifies your users, and tracks the way they engage with your website or application
- Stores your users' behavioural data in a scalable "event data warehouse" you control: in Amazon S3 and (optionally) Amazon Redshift or Postgres
- Lets you leverage the widest range of tools to analyze that behavioural data, including big data tools (e.g. Hive, Pig, Mahout) via EMR, as well as more traditional tools, e.g. Tableau, R, Looker, Chartio
The enrichment process takes raw Snowplow collector logs, tidies them up, enriches them (e.g. by adding Geo-IP data, and performing referrer parsing) and then writes the output of that process back to S3 as a cleaned up set of Snowplow event files. The data in these files can be analysed directly by any big data tool that runs on EMR.
In addition, Snowplow data from those event files could be copied into Amazon Redshift, where it can be analysed using any tool that talks to PostgreSQL.
There are therefore a number of different storage options in which Snowplow users can store their data.
We follow Google's five-variable tracking event structure. When you track a structured event, you get five parameters:
- Category: The name for the group of objects you want to track.
- Action: A string that defines the type of action the user performed on the category of object.
- Label: An optional string which identifies the specific object being actioned.
- Property: An optional string describing the object or the action performed on it.
- Value: Optional numeric data used to quantify or further describe the user action.
For example, when tracking a custom structured event, a call to the trackStructEvent method (JavaScript tracker) would follow the pattern:
snowplow_name_here('trackStructEvent', 'category','action','label','property','value');
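For instance, a video play could be recorded with purely illustrative values like these (label, property and value are optional):
snowplow_name_here('trackStructEvent', 'Videos', 'Play', 'Front-page promo', 'HD', 0.0);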
A tracker is a client- or server-side library which tracks customer behaviour by sending Snowplow events to a Collector.
You may wish to track events on your website or application which are not directly supported by Snowplow and which structured event tracking does not adequately capture. Your event may have more than the five fields offered by trackStructEvent, or its fields may not fit into the category-action-label-property-value model. The solution is Snowplow's custom unstructured events. Unstructured events use self-describing JSON which can have arbitrarily many fields.
For example, to track an unstructured event with the JavaScript tracker, you make use of the trackUnstructEvent method with the pattern shown below:
snowplow_name_here('trackUnstructEvent', <<SELF-DESCRIBING EVENT JSON>>);
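Using the self-describing ad_click JSON shown earlier in this glossary, such a call might look like:
snowplow_name_here('trackUnstructEvent', {
  "schema": "iglu:com.snowplowanalytics/ad_click/jsonschema/1-0-0",
  "data": {
    "bannerId": "4732ce23d345"
  }
});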
Snowplow allows you to collect events via the adapters (webhooks) of supported third-party software.
Webhooks allow this third-party software to send its own internal event streams to Snowplow collectors for further processing. Webhooks are sometimes referred to as "streaming APIs" or "HTTP response APIs".