Glossary - OXYGEN-MARKET/oxygen-market.github.io GitHub Wiki
Once we have our data modeled into tidy users, sessions and content item tables, we are ready to perform analysis on them.
Most companies that use Snowplow will perform analytics using a number of different types of tools:
- It is common to implement a Business Intelligence tool on top of Snowplow data to enable users (particularly non-technical users) to slice and dice (pivot) on the data. For many companies, the BI tool will be the primary way that most users interface with Snowplow data.
- Often a data scientist or data science team will crunch the underlying event-level data to perform more sophisticated analysis, including building predictive models, performing marketing attribution etc. The data scientist(s) will use one or more specialist tools, e.g. Python for data science, or R.
A collector receives data in the form of GET or POST requests from the trackers, and writes the data to either logs or streams (for further processing).
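As a simplified, illustrative sketch (the exact endpoint and parameter names depend on the tracker and collector version in use, and the hostname here is made up), a page view sent by the JavaScript tracker might arrive at the collector as a request along these lines:
GET http://collector.acme.com/i?e=pv&page=Homepage&p=web&tv=js-2.5.0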
A context is the group of entities associated with or describing the setting in which an event has taken place. What makes contexts interesting is that they are common across multiple different event types. Thus, contexts provide a convenient way in Snowplow to schema common entities once, and then use those schemas across all the different events where those entities are relevant.
Across all our trackers, the approach is the same. Each context is a self-describing JSON. We create an array of all the different contexts that we wish to pass into Snowplow, and then we generally pass that array in as the final argument on any track method that we call to capture the event, as sketched below.
See Analytics
At data collection time, we aim to capture all the data required to accurately represent a particular event that has just occurred.
At this stage, the data that is collected should describe the events as they happened, including as much rich information as possible about:
- The event itself
- The individual/entity that performed the action - that individual or entity is a "context"
- Any "objects" of the action - those objects are also "context"
- The wider context that the event has occurred in
For each of the above we want to collect as much data describing the event and associated contexts as possible.
See Enrichment
The data collection and enrichment process generates a data set that is an "event stream": a long list of packets of data, where each packet represents a single event.
Whilst it is possible to do analysis directly on this event stream, it is very common to:
- Join the event-stream data set with other data sets (e.g. customer data, product data, media data, marketing data, financial data).
- Aggregate the event-level data into smaller data sets that are easier and faster to run analyses against.
- Apply "business logic" i.e. definitions to the data as part of that aggregation step.
What tables are produced, and the different fields available in each, varies widely between companies in different sectors, and surprisingly even varies within the same vertical. That is because part of putting together these aggregate tables involves implementing business-specific logic.
We call this aggregation process "data modeling". At the end of the data modeling process, a clean set of tables is available, making it easier to perform analysis on the data.
In event modeling terms, an entity is a thing or object which is somehow relevant to the event that we are observing. We use the word "entity" because the word "object" is too loaded - it has too many connotations.
There is a lot of confusion around the role of entities within events - even to the extent of one analytics company arguing that entity data is completely distinct from event data. In fact nothing could be further from the truth - as we see it, our events consist of almost nothing but entities.
An event is something that occurred at a particular point in time. Examples of events include:
- Load a web page
- Add an item to basket
- Enter a destination
- Check a balance
- Search for an item
- Share a video
Snowplow is an event analytics platform. Once you have set up one or more Snowplow trackers, every time an event occurs, Snowplow should spot the event, generate a packet of data to describe it, and send that packet into the Snowplow data pipeline.
When we set up Snowplow, we need to make sure that we track all the events that are meaningful to our business, so that the data associated with those events is available in Snowplow for analysis.
When we come to analyse Snowplow data, we need to be able to look at the event data and understand, in an unambiguous way, what that data actually means, i.e. what it represents.
An event dictionary is a crucial tool in both cases. It is a document that defines the universe of events that a company is interested in tracking.
Data enrichment is sometimes referred to as "dimension widening": we use third-party sources of data to enrich the data we originally collected about an event, so that we have more context available for understanding that event and can perform richer analysis.
Snowplow supports the following enrichments out-of-the-box. We are working on making our enrichment framework pluggable, so that users and partners can extend the list of enrichments performed as part of the data processing pipeline:
- IP -> Geographic location
- Referrer query string -> source of traffic
- User agent string -> classifying devices, operating systems and browsers
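For example, the IP address on an event might be resolved into a country and city, a visit arriving from a Google search results page might have its referrer classified as search traffic from Google, and the user agent string might be parsed into browser family, operating system and device type.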
Huskimo is an open-source product from the Snowplow team. It connects to third-party SaaS platforms (e.g. Singular, Twilio), exports their data via API, and then uploads that data into your Redshift instance. Huskimo has a simple goal: to make essential datasets currently locked away inside various SaaS platforms available for analysis inside Redshift.
Huskimo is a complement to Snowplow's built-in webhook support. It came about because not all SaaS services offer webhooks which expose their internal data as a stream of events. Note that you do not need to use Snowplow to use Huskimo.
Iglu is a machine-readable, open-source schema repository for JSON Schema from the team at Snowplow Analytics. A schema repository (sometimes called a registry) is like npm or Maven or git but holds data schemas instead of software or code.
Snowplow uses Iglu to store all the schemas associated with the different events and contexts that are captured via Snowplow. When an event or context is sent into Snowplow, it is sent with a reference to its schema, which points to the location of that schema in Iglu.
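For example, the schema reference iglu:com.snowplowanalytics/ad_click/jsonschema/1-0-0 used later in this glossary identifies the vendor (com.snowplowanalytics), the schema name (ad_click), the schema format (jsonschema) and the SchemaVer version (1-0-0), which together tell the pipeline where to find the corresponding JSON Schema in Iglu.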
The Snowplow pipeline is built to enable a very clean separation of the following steps in the data processing flow:
- Data collection
- Data enrichment
- Data modeling
- Data analysis
Sauna is an open-source decisioning and response framework from the Snowplow Analytics team. Analysts and data scientists (and some data engineers) are the end users of Sauna: you use Sauna to automate responses to your event streams in third-party systems.
Schema DDL is a set of generators for producing various DDL formats from JSON Schemas. It is tightly coupled with other tools from the Snowplow platform, such as Iglu and Self-describing JSON, and is used mostly in Schema Guru.
Schema Guru is a tool (CLI, Spark job and web) allowing you to derive JSON Schemas from a set of JSON instances, and to process and transform those schemas into different data definition formats.
Current primary features include:
- derivation of JSON Schema from a set of JSON instances
- generation of Redshift table DDL and JSONPaths file
Unlike other tools for deriving JSON Schemas, Schema Guru allows you to derive a schema from an unlimited set of instances (making schemas much more precise), and supports many more JSON Schema validation properties.
Schema Guru is used heavily in association with Snowplow Analytics' own Snowplow, Iglu and Schema DDL projects.
SchemaVer is the Snowplow team's own schema versioning scheme. It is defined as follows: MODEL-REVISION-ADDITION
- MODEL: increment when you make a breaking schema change which will prevent interaction with any historical data
- REVISION: increment when you introduce a schema change which may prevent interaction with some historical data
- ADDITION: increment when you make a schema change that is compatible with all historical data
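For example, a schema first published as 1-0-0 would move to 1-0-1 after an ADDITION, to 1-1-0 after a REVISION, and to 2-0-0 after a MODEL change.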
A self-describing JSON is an individual JSON bundled with a reference to its JSON Schema. It generally looks like the example below:
{
  "schema": "iglu:com.snowplowanalytics/ad_click/jsonschema/1-0-0",
  "data": {
    "bannerId": "4732ce23d345"
  }
}
It differs from standard JSON due to the following important changes:
- We have added a new top-level field, schema, which contains (in a space-efficient format) all the information required to uniquely identify the associated JSON Schema
- We have moved the JSON’s original property inside a data field. This sandboxing will prevent any accidental collisions should the JSON already have a schema field
Snowplow has a shredding process (as part of the Enrichment and Storage processes) which consists of two phases:
- Extracting unstructured event JSONs and context JSONs from enriched event files into their own files
- Loading those files into corresponding tables in Redshift
There are three great use cases for the shredding functionality:
- Adding support into your Snowplow installation for new Snowplow event types with no software upgrade required - simply add new tables to your Redshift database.
- Defining your own custom unstructured event types, and processing these through the Snowplow pipeline into dedicated tables in Redshift. Retailers can define their own "product view" or "add to basket" events, for example. Media companies can define their own "video play" events.
- Defining your own custom context types, and processing these through the Snowplow pipeline into dedicated tables in Redshift. You can define your own "user" type, for example, including whatever fields you capture and want to store related to a user.
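For example, a custom "ad_click" unstructured event defined against the schema iglu:com.snowplowanalytics/ad_click/jsonschema/1-0-0 (shown earlier in this glossary) would typically be shredded into its own dedicated Redshift table, conventionally named after the vendor, event name and schema MODEL (e.g. com_snowplowanalytics_ad_click_1), alongside the main atomic events table.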
Snowplow is an enterprise-strength marketing and product analytics platform. It does three things:
- Identifies your users, and tracks the way they engage with your website or application
- Stores your users' behavioural data in a scalable "event data warehouse" you control: in Amazon S3 and (optionally) Amazon Redshift or Postgres
- Lets you leverage the widest range of tools to analyze that behavioural data, including big data tools (e.g. Hive, Pig, Mahout) via EMR, as well as more traditional tools, e.g. Tableau, R, Looker, Chartio
The enrichment process takes raw Snowplow collector logs, tidies them up, enriches them (e.g. by adding Geo-IP data, and performing referrer parsing) and then writes the output of that process back to S3 as a cleaned up set of Snowplow event files. The data in these files can be analysed directly by any big data tool that runs on EMR.
In addition, Snowplow data from those event files could be copied into Amazon Redshift, where it can be analysed using any tool that talks to PostgreSQL.
There are therefore a number of different storage options in which Snowplow users can store their data.
We follow Google's five-variable tracking event structure. When you track a structured event, you get five parameters:
- Category: The name for the group of objects you want to track.
- Action: A string that defines the type of action the user performed on the category of object.
- Label: An optional string which identifies the specific object being actioned.
- Property: An optional string describing the object or the action performed on it.
- Value: Optional numeric data used to quantify or further describe the user action.
For example, when tracking a custom structured event, a call to the trackStructEvent method (JavaScript tracker) would follow the pattern:
snowplow_name_here('trackStructEvent', 'category','action','label','property','value');
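For instance, a video play could be recorded with purely illustrative values like these (label, property and value are optional):
snowplow_name_here('trackStructEvent', 'Videos', 'Play', 'Front-page promo', 'HD', 0.0);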
A tracker is a client- or server-side library which tracks customer behaviour by sending Snowplow events to a Collector.
You may wish to track events on your website or application which are not directly supported by Snowplow and which structured event tracking does not adequately capture. Your event may have more than the five fields offered by trackStructEvent, or its fields may not fit into the category-action-label-property-value model. The solution is Snowplow's custom unstructured events. Unstructured events use self-describing JSON which can have arbitrarily many fields.
For example, to track an unstructured event with the JavaScript tracker, you make use of the trackUnstructEvent method with the pattern shown below:
snowplow_name_here('trackUnstructEvent', <<SELF-DESCRIBING EVENT JSON>>);
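Using the self-describing ad_click JSON shown earlier in this glossary, such a call might look like:
snowplow_name_here('trackUnstructEvent', {
  "schema": "iglu:com.snowplowanalytics/ad_click/jsonschema/1-0-0",
  "data": {
    "bannerId": "4732ce23d345"
  }
});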
Snowplow allows you to collect events via the adapters (webhooks) of supported third-party software.
Webhooks allow this third-party software to send its own internal event streams to Snowplow collectors for further processing. Webhooks are sometimes referred to as "streaming APIs" or "HTTP response APIs".