Scala Stream Collector
The Scala Stream Collector is a Snowplow event collector written in Scala. It allows near-real-time processing (Enrichment, Storage, Analytics) of a Snowplow raw event stream.
The Scala Stream Collector receives raw Snowplow events over HTTP, serializes them to a Thrift record format, and then writes them to a sink. Currently supported sinks are:
- Amazon Kinesis
- stdout, for a custom stream collection process
Support for Apache Kafka may be added in the future - please see ticket #xxx for details.
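To make the idea of a pluggable sink concrete, here is a minimal sketch in Scala. The names `Sink`, `StdoutSink` and `BufferSink` are hypothetical and chosen for illustration only; they are not taken from the collector's source. The sketch also assumes the event has already been serialized to bytes (the Thrift step is not shown), and that a stdout sink would Base64-encode each record so that binary data fits on one line.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64
import scala.collection.mutable.ArrayBuffer

// Hypothetical interface: anything that can accept one serialized event.
trait Sink {
  def store(record: Array[Byte]): Unit
}

// Assumed behaviour: print each record as a single Base64 line,
// so a downstream process can read stdout line by line.
object StdoutSink extends Sink {
  def encode(record: Array[Byte]): String =
    Base64.getEncoder.encodeToString(record)

  def store(record: Array[Byte]): Unit = println(encode(record))
}

// In-memory sink, handy for exercising collector logic in tests.
final class BufferSink extends Sink {
  private val buffer = ArrayBuffer.empty[Array[Byte]]

  def store(record: Array[Byte]): Unit = buffer += record

  def records: List[Array[Byte]] = buffer.toList
}

object SinkDemo {
  def main(args: Array[String]): Unit = {
    val record = "hi".getBytes(StandardCharsets.UTF_8)
    StdoutSink.store(record) // prints one Base64 line
  }
}
```

Keeping the sink behind a single small interface means a further sink can be added without touching the request-handling code.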
Like the Clojure Collector, the Scala Stream Collector supports cross-domain Snowplow deployments, setting a user_id (used to identify unique visitors) server-side to reliably identify the same user across domains.
The Scala Stream Collector allows the use of a third-party cookie, making user tracking across domains possible. The CloudFront Collector does not support cross-domain tracking of users because it sets user IDs client-side, whereas the Scala Stream Collector sets them server-side.
In a nutshell: the Scala Stream Collector receives events from the Snowplow JavaScript tracker, sets/updates a third-party user tracking cookie, and returns the pixel to the client. The ID in this third-party user tracking cookie is stored in the network_userid field in Snowplow events.
In pseudocode terms:

    if (request contains an "sp" cookie) {
        Record that cookie as the user identifier
        Set that cookie with a now+1 year cookie expiry
        Add the headers and payload to the output array
    } else {
        Set the "sp" cookie with a now+1 year cookie expiry
        Add the headers and payload to the output array
    }
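The pseudocode above can be sketched in Scala as follows. This is an illustration under stated assumptions, not the collector's actual implementation: the types `Request`, `SetCookie` and `Collected` and the function `handle` are hypothetical, the new identifier is assumed to be a random UUID (the pseudocode does not say how it is generated), and "one year" is approximated as 365 days expressed as a cookie max-age.

```scala
import java.util.UUID

// Hypothetical types for illustration; not the collector's real classes.
final case class Request(cookies: Map[String, String], headers: List[String], payload: String)
final case class SetCookie(name: String, value: String, maxAgeSeconds: Long)
final case class Collected(networkUserId: String, cookie: SetCookie, headers: List[String], payload: String)

object CookieLogic {
  val CookieName = "sp"
  val OneYearSeconds: Long = 365L * 24 * 60 * 60

  // newId is a parameter so callers can supply a deterministic generator.
  // Assumption: a random UUID is used when the request has no "sp" cookie.
  def handle(req: Request, newId: () => String = () => UUID.randomUUID().toString): Collected = {
    // Reuse the existing identifier if the cookie is present, otherwise mint one.
    val userId = req.cookies.getOrElse(CookieName, newId())
    // In both branches the cookie is (re)set with a fresh one-year expiry,
    // and the headers and payload are passed through to the output.
    Collected(userId, SetCookie(CookieName, userId, OneYearSeconds), req.headers, req.payload)
  }
}
```

Because both branches of the pseudocode end the same way, the sketch collapses them into a single path: only the source of the identifier differs.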
The Scala Stream Collector is built on top of Spray and Akka actors.