Collector logging formats

Overview

Different Snowplow collectors write Snowplow data to logs of different formats.

Here we document the different formats, and show which collectors generate what. This document should be used by anyone:

  1. Building a new collector, who would like to ensure it logs to a Snowplow supported format.
  2. Building an ETL module, to ensure that the ETL module can successfully read the raw Snowplow logs generated by each collector, and write them to the data structures used by the storage modules.

Logging formats

  1. Be it a record in a logfile or a raw stream event, each entry serves as an envelope containing the event data, encapsulated within either the GET query string or the POST request body.
  2. If the events are sent via POST, they are contained within the JSON body of the request.
  3. With either GET or POST, the lowest level is a set of name/value pairs that conform to the Snowplow Tracker Protocol, as illustrated in the sketch below.
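
To make the last point concrete, below is a minimal sketch (not taken from any Snowplow component; the payload_data schema URI and the parameter values are illustrative assumptions) showing how both envelopes reduce to the same set of tracker-protocol name/value pairs.

```python
# A minimal sketch (not part of any Snowplow component): the payload_data
# schema URI and the parameter values below are illustrative assumptions.
import json
from urllib.parse import parse_qs

# Events sent via GET arrive as a query string of name/value pairs:
get_querystring = "e=pv&url=http%3A%2F%2Fexample.com%2F&tv=js-2.5.3&p=web"
get_event = {name: values[0] for name, values in parse_qs(get_querystring).items()}

# Events sent via POST arrive as a JSON envelope whose "data" array holds one
# dictionary of the same name/value pairs per event:
post_body = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
    "data": [
        {"e": "pv", "url": "http://example.com/", "tv": "js-2.5.3", "p": "web"},
        {"e": "pp", "url": "http://example.com/", "tv": "js-2.5.3", "p": "web"},
    ],
})
post_events = json.loads(post_body)["data"]

print(get_event["e"], [event["e"] for event in post_events])  # pv ['pv', 'pp']
```
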
| Logging format | Description | Status | Collector |
| --- | --- | --- | --- |
| Cloudfront logs | Amazon's Cloudfront log format with the Amazon Cloudfront filename naming convention | Supported (both pre and post Sept 2012 formats) | Cloudfront Collector |
| Tomcat access logs | Tomcat access logs with the Amazon Elastic Beanstalk filename naming convention | Supported | Clojure Collector |
| Snowplow Thrift raw event | Binary-serialized Thrift events | Supported | Scala Stream Collector |

The Cloudfront logging format (with Cloudfront naming convention)

For the Cloudfront logfile naming convention, please refer to the official Amazon documentation. The main point to note is that the logfiles are stored in gzip format and thus bear a .gz extension. They are prefixed with the distribution ID:

distribution-ID.YYYY-MM-DD-HH.unique-ID.gz
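
For illustration, a filename following this convention can be recognised with a sketch like the one below (the example filename is invented, not a real distribution):

```python
# A minimal sketch: matching the Cloudfront logfile naming convention above.
# The filename used here is made up for illustration.
import re

CLOUDFRONT_LOG_NAME = re.compile(
    r"^(?P<distribution>[A-Z0-9]+)\."                  # distribution ID
    r"(?P<date>\d{4}-\d{2}-\d{2})-(?P<hour>\d{2})\."   # YYYY-MM-DD-HH
    r"(?P<unique>[^.]+)\.gz$"                          # unique ID plus .gz extension
)

match = CLOUDFRONT_LOG_NAME.match("EMLARXS9EXAMPLE.2016-01-20-20.RT4KCN4SGK9.gz")
if match:
    print(match.group("distribution"), match.group("date"), match.group("hour"))
```
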

The logging format is described in detail in the corresponding section of the Amazon documentation. Each entry in a log file gives details about a single user request, and the files have the following characteristics:

  • Use the W3C extended log file format.
  • Contain tab-separated values.
  • Contain records that are not necessarily in chronological order.
  • Contain two header lines: one with the file-format version, and another that lists the W3C fields included in each record.
  • Substitute URL-encoded equivalents (per the RFC 1738 standard) for spaces and non-standard characters in field values; these non-standard characters consist of all ASCII codes below 32 and above 127.

Note that the actual field containing the key/value pairs from the GET requests initiated by the trackers is cs-uri-query (the 12th field).

Below is an example of a single record in the logfile.

2016-01-20 20:22:55 IND6 480 174.2.224.27 GET d2gtrjee5bqfpl.cloudfront.net /i 200 https://www.properweb.ca/hosting/ Mozilla/5.0%2520(Windows%2520NT%25206.1)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 e=ue&ue_px=eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy91bnN0cnVjdF9ldmVudC9qc29uc2NoZW1hLzEtMC0wIiwiZGF0YSI6eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy9saW5rX2NsaWNrL2pzb25zY2hlbWEvMS0wLTEiLCJkYXRhIjp7InRhcmdldFVybCI6Imh0dHBzOi8vd3d3LnByb3BlcndlYi5jYS9ob3N0aW5nL2NvbXBhcmUtcGVyc29uYWwtcGxhbnMvIiwiZWxlbWVudElkIjoiIiwiZWxlbWVudFRhcmdldCI6IiJ9fX0&tv=js-2.5.3&tna=cf&aid=cfpweb&p=web&tz=America%252FGuatemala&lang=en-US&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=1&f_java=0&f_gears=0&f_ag=0&res=1152x864&cd=24&cookie=1&eid=a8451163-d056-4a6c-a8ef-c612aab3c252&dtm=1453321369503&vp=1152x329&ds=1135x2601&vid=3&sid=a2e39d3f-af4d-48f7-b153-8ca79942a552&duid=830e4863d85df04a&fp=1354193749&refr=https%253A%252F%252Fwww.properweb.ca%252Fdomain-name-registration%252F&url=https%253A%252F%252Fwww.properweb.ca%252Fhosting%252F - Hit yavbRZy0qwso0j-8VBYB-VHIaJjo8K4eaARnXiseXDvKSH8vZ-_Mlg== d2gtrjee5bqfpl.cloudfront.net https 1268 0.001 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Hit
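
A minimal sketch of how such a record might be read and its tracker parameters extracted is shown below (the gzipped filename is a placeholder).

```python
# A minimal sketch (the gzipped filename is a placeholder): read a Cloudfront
# logfile, skip the two W3C header lines, split the tab-separated fields and
# decode the tracker parameters held in cs-uri-query. Cloudfront URL-encodes
# field values on top of the trackers' own encoding (e.g. %2520 for a space),
# so one extra round of unquoting is applied after parse_qs.
import gzip
from urllib.parse import parse_qs, unquote

with gzip.open("distribution-ID.2016-01-20-20.unique-ID.gz", "rt") as logfile:
    for line in logfile:
        if line.startswith("#"):          # "#Version:" and "#Fields:" header lines
            continue
        fields = line.rstrip("\n").split("\t")
        cs_uri_query = fields[11]         # cs-uri-query is the 12th field
        params = {
            name: unquote(values[0])      # undo Cloudfront's extra encoding layer
            for name, values in parse_qs(cs_uri_query).items()
        }
        print(params.get("e"), params.get("tv"), params.get("aid"))
```
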

The main points regarding Cloudfront logging are:

  1. Supports single events sent via GET only
  2. No support for network_userid

Please refer to the Snowplow Tracker Protocol for the comprehensive list of individual parameters that can be submitted with the GET request (and thus contained in the cs-uri-query field of the Cloudfront logfile).

The Tomcat access log format (with Amazon Elastic Beanstalk filename naming convention)

To ensure persistence of the logs, you have to configure your environment to publish logs to Amazon S3 automatically after they have been rotated. Elastic Beanstalk creates a bucket with the naming pattern elasticbeanstalk-region-account-id for each region in which you create environments. Within this bucket, logs are stored under the path resources/environments/logs/logtype/environment-id/instance-id.

For example, logs from instance i-0a1fd158, in Elastic Beanstalk environment e-mpcwnwheky in region us-west-2 in account 0123456789012, are stored in the following location:

s3://elasticbeanstalk-us-west-2-0123456789012/resources/environments/logs/publish/e-mpcwnwheky/i-0a1fd158

Please refer to the corresponding Amazon article for more details.

Bear in mind that if you reconfigure the environment, the instance-id could change as well.

Access logging in a Tomcat environment is performed by valves that implement the org.apache.catalina.AccessLog interface. The formatting layout, which identifies the request and response fields to be logged, is determined by the valve's pattern attribute.

Our Clojure Collector uses a customized Tomcat access log valve. Compared to the standard AccessLogValve, this valve:

  1. Introduces a new pattern, 'I', to escape an incoming header.
  2. Introduces a new pattern, 'C', to fetch a cookie stored on the response.
  3. Re-implements the pattern 'i' to ensure that "" (empty string) is replaced with "-".
  4. Re-implements the pattern 'q' to remove the "?" and ensure "" (empty string) is replaced with "-".
  5. Overwrites the 'v' pattern, to write the version of this AccessLogValve, rather than the local server name.
  6. Introduces a new pattern, 'w', to capture the request's body.
  7. Introduces a new pattern, '~', to capture the request's content type.
  8. Re-implements the pattern 'a' to get remote IP more reliably, even through proxies.

Thus the pattern we use ensures that the access log format matches that produced by the Cloudfront Collector (so that the same ETL process can be employed for both collectors).

<Valve ... pattern="%{yyyy-MM-dd}t&#9;%{HH:mm:ss}t&#9;-&#9;%b&#9;%a&#9;%m&#9;%h&#9;%U&#9;%s&#9;%{Referer}i&#9;%{User-Agent}I&#9;%q&amp;cv=clj-1.1.0-%v&amp;nuid=%{sp}C&#9;-&#9;-&#9;-&#9;%~&#9;%w" />

As a result, a logfile record will look like the one below:

2016-11-27      07:16:07        -       43      185.124.153.90  GET     185.124.153.90  /i      200     http://chuwy.me/scala-blocks.html       Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_11_6%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F54.0.2840.98+Safari%2F537.36 stm=1480230967340&e=pv&url=http%3A%2F%2Fchuwy.me%2Fscala-blocks.html&page=Scala%20Code%20Blocks&refr=http%3A%2F%2Fchuwy.me%2F&tv=js-2.7.0-rc2&tna=blogTracker&aid=blog&p=web&tz=Asia%2FOmsk&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=1&f_java=0&f_gears=0&f_ag=0&res=1280x800&cd=24&cookie=1&eid=1799a90f-f570-4414-b91a-b0db8f39cc2e&dtm=1480230967333&vp=1280x726&ds=1280x4315&vid=18&sid=395e4506-37a3-4074-8de2-d8c75fb17d4a&duid=1f9b3980-6619-4d75-a6c9-8253c76c3bfb&fp=531497290&cv=clj-1.1.0-tom-0.2.0&nuid=5beb1f92-d4fb-4020-905c-f659929c8ab5      -       -       -       -       -
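
A small sketch of pulling the collector-added fields back out of the record is shown below (the querystring is shortened from the record above).

```python
# A minimal sketch: the valve appends cv (the collector/valve version, via the
# '%v' pattern) and nuid (the network user id, read from the "sp" cookie via
# '%{sp}C') to the querystring, so both can be recovered alongside the ordinary
# tracker parameters. The querystring is shortened from the record above.
from urllib.parse import parse_qs

querystring = (
    "e=pv&tv=js-2.7.0-rc2&aid=blog"
    "&cv=clj-1.1.0-tom-0.2.0"
    "&nuid=5beb1f92-d4fb-4020-905c-f659929c8ab5"
)
params = {name: values[0] for name, values in parse_qs(querystring).items()}

print(params["cv"])    # clj-1.1.0-tom-0.2.0 -> collector + valve version
print(params["nuid"])  # 5beb1f92-... -> becomes network_userid downstream
```
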

For your convenience, the pattern codes utilized are summarised in the table below.

| Code | Implementation Type | Description |
| --- | --- | --- |
| %a | Customized | Reimplemented to get the remote IP more reliably, even through proxies. |
| %b | Standard | Bytes sent, excluding HTTP headers, or '-' if zero. |
| %{xxx}C | New | Introduced to fetch a cookie stored on the response. |
| %h | Standard | Remote host name (or IP address if enableLookups for the connector is false). |
| %{xxx}I | New | Introduced to escape an incoming header. |
| %{xxx}i | Customized | Reimplemented to ensure that "" (empty string) is replaced with "-". |
| %m | Standard | Request method (GET, POST). |
| %q | Customized | Reimplemented to remove the "?" and ensure that "" (empty string) is replaced with "-". |
| %s | Standard | HTTP status code of the response. |
| %{xxx}t | Standard | Timestamp formatted using the enhanced SimpleDateFormat pattern. |
| %U | Standard | Requested URL path. |
| %v | Customized | Overwritten to write the version of this AccessLogValve, rather than the local server name. |
| %w | New | Introduced to capture the request's body. |
| %~ | New | Introduced to capture the request's content type. |

The main characteristics of Tomcat access logs:

  1. A custom textual format based on the Apache logfile format.
  2. Supports single events sent via GET as well as multiple events sent via POST.
  3. Supports network_userid.

The Snowplow Thrift raw event format

As its name suggests, the Stream collector differs from the batch-pipeline collectors in that it produces streams of Snowplow events (records). The data (payload) is therefore serialized using the Apache Thrift framework.

Streaming data is data that is generated continuously. It includes a wide variety of data, such as log files, events generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, and so on.

Binary serialization allows for:

  • a simpler data structure
  • smaller size
  • faster transfer
  • easier (programmatic) parsing

The Snowplow Thrift raw event format conforms to this Thrift schema. For clarity, the structure of the collector payload is shown below.

struct CollectorPayload {
	31337: string schema

	// Required fields which are intrinsic properties of HTTP
	100: string ipAddress

	// Required fields which are Snowplow-specific
	200: i64 timestamp
	210: string encoding
	220: string collector

	// Optional fields which are intrinsic properties of HTTP
	300: optional string userAgent
	310: optional string refererUri
	320: optional string path
	330: optional string querystring
	340: optional string body
	350: optional list<string> headers
	360: optional string contentType

	// Optional fields which are Snowplow-specific
	400: optional string hostname
	410: optional string networkUserId
}
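
As an illustration of working with this format, here is a minimal sketch assuming Python bindings have been generated from the schema above with the standard Thrift compiler; the module path (collector_payload.ttypes), the schema URI and the field values are assumptions made for the example.

```python
# A minimal sketch, assuming bindings generated from the schema above, e.g.
# `thrift --gen py collector-payload.thrift`; the module path, schema URI and
# field values are illustrative assumptions.
from thrift.protocol.TBinaryProtocol import TBinaryProtocolFactory
from thrift.TSerialization import deserialize, serialize

from collector_payload.ttypes import CollectorPayload  # generated bindings

payload = CollectorPayload(
    schema="iglu:com.snowplowanalytics.snowplow/CollectorPayload/thrift/1-0-0",
    ipAddress="185.124.153.90",
    timestamp=1480230967000,       # milliseconds since epoch
    encoding="UTF-8",
    collector="scala-stream-collector",
    querystring="e=pv&tv=js-2.7.0-rc2",
)

# Serialize to the binary Thrift representation and read it back.
factory = TBinaryProtocolFactory()
raw_bytes = serialize(payload, protocol_factory=factory)
roundtrip = deserialize(CollectorPayload(), raw_bytes, protocol_factory=factory)
assert roundtrip.querystring == "e=pv&tv=js-2.7.0-rc2"
```
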

It's important to note that we built stream data processing on the idea of the Lambda architecture, which implies both a speed (real-time) layer and a batch layer. As a result, we provide two consumers: Stream Enrich and the Kinesis-S3 sink.

In line with their respective purposes:

  • Stream Enrich reads raw Snowplow events off a Kinesis stream and writes the enriched Snowplow event to another Kinesis stream
  • Kinesis-S3 reads records from an Amazon Kinesis stream, encodes and wraps them into Protocol Buffers (PB) by means of ElephantBird library, compresses the PB arrays using splittable LZO, and writes them to S3

Thus, the output of Kinesis-S3 is a projection of raw event data (serialized Thrift records, not enriched) in the form of a compressed LZO file. Each .lzo file has a corresponding .lzo.index file containing the byte offsets for the LZO blocks, so that the blocks can be processed in parallel using Hadoop.

Note: We also provide an option to GZIP the Thrift records rather than produce LZO files. However, those files are not meant to be processed in Hadoop. They are archives of Thrift records without the protocol buffers layer being applied.

Generally, the LZO file generated by Kinesis-S3 can be thought of as an "onion-like" layered object: splittable LZO compression on the outside, wrapping the Protocol Buffers arrays, which in turn wrap the serialized Thrift records.

The main characteristics of stream-based raw events:

  1. A serialized Thrift record format.
  2. Supports both GET and POST requests.
  3. Supports network_userid.