Collector logging formats
Different Snowplow collectors write Snowplow data to logs of different formats.
Here we document the different formats, and show which collectors generate what. This document should be used by anyone:
- Building a new collector, who would like to ensure it logs to a Snowplow-supported format.
- Building an ETL module, to ensure that it can successfully read the raw Snowplow logs generated by each collector and write them to the data structures used by the storage modules.
Regardless of the format, a few points are common to all of them:
- Be it a record in a logfile or a raw event, it serves as an envelope containing the event data encapsulated within either the `GET` query string or the `POST` request body.
- If the events are inside the `POST` request body, they will be contained within that JSON.
- With either `GET` or `POST`, the lowest level will be a set of name/value pairs that respect the Snowplow Tracker Protocol.
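As a rough illustration of that last point (not part of any Snowplow library; the object and method names below are made up), here is a minimal Scala sketch that splits a `GET` querystring into Tracker Protocol name/value pairs:

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

// Illustrative only: split a raw GET querystring into
// Tracker Protocol name/value pairs.
object TrackerProtocolExample {
  private def decode(s: String): String =
    URLDecoder.decode(s, StandardCharsets.UTF_8.name)

  def parseQuerystring(qs: String): Map[String, String] =
    qs.split("&")
      .filter(_.nonEmpty)
      .flatMap { pair =>
        pair.split("=", 2) match {
          case Array(k, v) => Some(decode(k) -> decode(v))
          case Array(k)    => Some(decode(k) -> "")
          case _           => None
        }
      }
      .toMap

  def main(args: Array[String]): Unit =
    // prints e -> pv, page -> Scala Code Blocks, p -> web
    parseQuerystring("e=pv&page=Scala%20Code%20Blocks&p=web")
      .foreach { case (k, v) => println(s"$k -> $v") }
}
```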
Logging formats | Description | Status | Collector |
---|---|---|---|
Cloudfront logs | Amazon's Cloudfront log formats with Amazon Cloudfront filename naming convention | Supported (both pre and post Sept 2012 formats) | Cloudfront Collector |
Tomcat access logs | Tomcat access logs with Amazon Elastic Beanstalk filename naming convention | Supported | Clojure Collector |
Snowplow Thrift raw event | Binary serialized Thrift events | Supported | Scala Stream Collector |
For the Cloudfront logfile naming convention, please refer to the official Amazon documentation. The main point to note here is that the logfiles are stored in `gzip` format, and thus bear the `.gz` extension. They will be prefixed with the distribution ID:
distribution-ID.YYYY-MM-DD-HH.unique-ID.gz
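For illustration only, the Scala sketch below pulls the distribution ID and the date/hour out of a filename that follows this convention; the regular expression and the sample filename are assumptions, not part of any Snowplow component:

```scala
// Illustrative only: parse a Cloudfront logfile name of the form
// distribution-ID.YYYY-MM-DD-HH.unique-ID.gz
object CloudfrontLogName {
  private val LogName = """([^.]+)\.(\d{4}-\d{2}-\d{2})-(\d{2})\.([^.]+)\.gz""".r

  def parse(name: String): Option[(String, String, String)] = name match {
    case LogName(distributionId, date, hour, _) => Some((distributionId, date, hour))
    case _                                      => None
  }

  def main(args: Array[String]): Unit =
    // prints Some((EXAMPLEDISTID,2016-01-20,20))
    println(parse("EXAMPLEDISTID.2016-01-20-20.UNIQUEID123.gz"))
}
```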
The logging format is well described in this section of the Amazon article. Each entry in a log file gives details about a single user request, and the files have the following characteristics:
- Use the W3C extended log file format.
- Contain tab-separated values.
- Contain records that are not necessarily in chronological order.
- Contain two header lines: one with the file-format version, and another that lists the W3C fields included in each record.
- Substitute URL-encoded equivalents for spaces and non-standard characters in field values.
- These non-standard characters consist of all ASCII codes below 32 and above 127. The URL encoding standard is RFC 1738.
Note that the actual field containing the key/value pairs from the `GET` requests initiated by the trackers is `cs-uri-query` (the 12th field).
Below is an example of a single record in the logfile.
2016-01-20 20:22:55 IND6 480 174.2.224.27 GET d2gtrjee5bqfpl.cloudfront.net /i 200 https://www.properweb.ca/hosting/ Mozilla/5.0%2520(Windows%2520NT%25206.1)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 e=ue&ue_px=eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy91bnN0cnVjdF9ldmVudC9qc29uc2NoZW1hLzEtMC0wIiwiZGF0YSI6eyJzY2hlbWEiOiJpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy9saW5rX2NsaWNrL2pzb25zY2hlbWEvMS0wLTEiLCJkYXRhIjp7InRhcmdldFVybCI6Imh0dHBzOi8vd3d3LnByb3BlcndlYi5jYS9ob3N0aW5nL2NvbXBhcmUtcGVyc29uYWwtcGxhbnMvIiwiZWxlbWVudElkIjoiIiwiZWxlbWVudFRhcmdldCI6IiJ9fX0&tv=js-2.5.3&tna=cf&aid=cfpweb&p=web&tz=America%252FGuatemala&lang=en-US&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=1&f_java=0&f_gears=0&f_ag=0&res=1152x864&cd=24&cookie=1&eid=a8451163-d056-4a6c-a8ef-c612aab3c252&dtm=1453321369503&vp=1152x329&ds=1135x2601&vid=3&sid=a2e39d3f-af4d-48f7-b153-8ca79942a552&duid=830e4863d85df04a&fp=1354193749&refr=https%253A%252F%252Fwww.properweb.ca%252Fdomain-name-registration%252F&url=https%253A%252F%252Fwww.properweb.ca%252Fhosting%252F - Hit yavbRZy0qwso0j-8VBYB-VHIaJjo8K4eaARnXiseXDvKSH8vZ-_Mlg== d2gtrjee5bqfpl.cloudfront.net https 1268 0.001 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Hit
The main points regarding Cloudfront logging are:
- Supports single events sent via `GET` only.
- No support for `network_userid`.
Please refer to the Snowplow Tracker Protocol for the comprehensive list of individual parameters that can be submitted with the `GET` request (and thus contained in the `cs-uri-query` field of the Cloudfront logfile).
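As a minimal sketch of how an ETL step might isolate that field, assuming the record is tab-separated as described above (the object and method names are hypothetical):

```scala
// Illustrative only: extract cs-uri-query (the 12th tab-separated field)
// from a Cloudfront access log record.
object CloudfrontRecord {
  def csUriQuery(line: String): Option[String] = {
    // limit -1 keeps trailing empty fields instead of dropping them
    val fields = line.split("\t", -1)
    if (fields.length >= 12) Some(fields(11)) else None
  }
}
```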
To ensure persistence of the logs, you have to configure your environment to publish logs to Amazon S3 automatically after they have been rotated. Elastic Beanstalk creates a bucket with the naming pattern `elasticbeanstalk-region-account-id` for each region in which you create environments. Within this bucket, logs are stored under the path `resources/environments/logs/logtype/environment-id/instance-id`.
For example, logs from instance `i-0a1fd158`, in Elastic Beanstalk environment `e-mpcwnwheky`, in region `us-west-2`, in account `0123456789012`, are stored in the following location:
s3://elasticbeanstalk-us-west-2-0123456789012/resources/environments/logs/publish/e-mpcwnwheky/i-0a1fd158
Please refer to the following Amazon article for more details.
Bear in mind that if you reconfigure the environment, the `instance-id` could change too.
Access logging in a Tomcat environment is performed by valves that implement the org.apache.catalina.AccessLog interface. The formatting layout identifying the various information fields from the request and response to be logged is determined by the `pattern` attribute.
Our Clojure Collector uses a customized Tomcat access log valve. Compared to the standard AccessLogValve, this valve:
- Introduces a new pattern, 'I', to escape an incoming header.
- Introduces a new pattern, 'C', to fetch a cookie stored on the response.
- Re-implements the pattern 'i' to ensure that "" (empty string) is replaced with "-".
- Re-implements the pattern 'q' to remove the "?" and ensure "" (empty string) is replaced with "-".
- Overwrites the 'v' pattern, to write the version of this AccessLogValve, rather than the local server name.
- Introduces a new pattern, 'w' to capture the request's body.
- Introduces a new pattern, '~' to capture the request's content type.
- Re-implements the pattern 'a' to get remote IP more reliably, even through proxies.
Thus the pattern we use ensures that the access log format matches that produced by the Cloudfront Collector (so that the same ETL process can be employed for both collectors).
<Valve ... pattern="%{yyyy-MM-dd}t	%{HH:mm:ss}t	-	%b	%a	%m	%h	%U	%s	%{Referer}i	%{User-Agent}I	%q&cv=clj-1.1.0-%v&nuid=%{sp}C	-	-	-	%~	%w" />
As a result, a logfile record will look like the one below:
2016-11-27 07:16:07 - 43 185.124.153.90 GET 185.124.153.90 /i 200 http://chuwy.me/scala-blocks.html Mozilla%2F5.0+%28Macintosh%3B+Intel+Mac+OS+X+10_11_6%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F54.0.2840.98+Safari%2F537.36 stm=1480230967340&e=pv&url=http%3A%2F%2Fchuwy.me%2Fscala-blocks.html&page=Scala%20Code%20Blocks&refr=http%3A%2F%2Fchuwy.me%2F&tv=js-2.7.0-rc2&tna=blogTracker&aid=blog&p=web&tz=Asia%2FOmsk&cs=UTF-8&f_pdf=1&f_qt=0&f_realp=0&f_wma=0&f_dir=0&f_fla=1&f_java=0&f_gears=0&f_ag=0&res=1280x800&cd=24&cookie=1&eid=1799a90f-f570-4414-b91a-b0db8f39cc2e&dtm=1480230967333&vp=1280x726&ds=1280x4315&vid=18&sid=395e4506-37a3-4074-8de2-d8c75fb17d4a&duid=1f9b3980-6619-4d75-a6c9-8253c76c3bfb&fp=531497290&cv=clj-1.1.0-tom-0.2.0&nuid=5beb1f92-d4fb-4020-905c-f659929c8ab5 - - - - -
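For illustration only (the names below are hypothetical, not Clojure Collector code), here is a Scala sketch that reads back the `cv` and `nuid` parameters the valve appends, assuming tab-separated fields with the querystring in the 12th position, mirroring Cloudfront's `cs-uri-query`:

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets

// Illustrative only: pull the cv (collector version) and nuid (network user id)
// parameters out of a Clojure Collector access log record.
object ClojureCollectorRecord {
  def collectorParams(line: String): Map[String, String] = {
    val fields = line.split("\t", -1)
    if (fields.length < 12) Map.empty
    else
      fields(11)
        .split("&")
        .flatMap { pair =>
          pair.split("=", 2) match {
            case Array(k, v) if k == "cv" || k == "nuid" =>
              Some(k -> URLDecoder.decode(v, StandardCharsets.UTF_8.name))
            case _ => None
          }
        }
        .toMap
  }
}
```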
For your convenience, the `pattern` codes utilized are summarised in the table below.
Code | Implementation Type | Description |
---|---|---|
%a | Customized | Reimplemented to get remote IP more reliably, even through proxies. |
%b | Standard | Bytes sent, excluding HTTP headers, or '-' if zero. |
%{xxx}C | New | Introduced to fetch a cookie stored on the response. |
%h | Standard | Remote host name (or IP address if `enableLookups` for the connector is false). |
%{xxx}I | New | Introduced to escape an incoming header. |
%{xxx}i | Customized | Reimplemented to ensure that "" (empty string) is replaced with "-". |
%m | Standard | Request method (`GET`, `POST`). |
%q | Customized | Reimplemented to remove the "?" and ensure that "" (empty string) is replaced with "-". |
%s | Standard | HTTP status code of the response. |
%{xxx}t | Standard | Timestamp formatted using the enhanced `SimpleDateFormat` pattern. |
%U | Standard | Requested URL path. |
%v | Customized | Overwritten to write the version of this `AccessLogValve`, rather than the local server name. |
%w | New | Introduced to capture the request's body. |
%~ | New | Introduced to capture the request's content type. |
The main characteristics of Tomcat access logs:
- A custom textual format based on Apache logfile format.
- Supports `GET`, but also multiple events sent via `POST`.
- Supports `network_userid`.
As its name suggests, the Stream collector differs from the batch-pipeline collectors in that it produces streams of Snowplow events (records). As such, the data (payload) is serialized using the Apache Thrift framework.
Streaming Data is data that is generated continuously. Streaming data includes a wide variety of data such as log files and events generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, etc.
Binary serialization allows for:
- a simpler data structure
- smaller size
- faster transfer
- easier (programmatic) parsing
The Snowplow Thrift raw event format conforms to this Thrift schema. For reference, the structure of the collector payload is shown below.
struct CollectorPayload {
31337: string schema
// Required fields which are intrinsic properties of HTTP
100: string ipAddress
// Required fields which are Snowplow-specific
200: i64 timestamp
210: string encoding
220: string collector
// Optional fields which are intrinsic properties of HTTP
300: optional string userAgent
310: optional string refererUri
320: optional string path
330: optional string querystring
340: optional string body
350: optional list<string> headers
360: optional string contentType
// Optional fields which are Snowplow-specific
400: optional string hostname
410: optional string networkUserId
}
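As a sketch of how a downstream consumer might decode one of these records, assuming a `CollectorPayload` class has been generated from the schema above (the package name shown is illustrative and may differ in your build):

```scala
import org.apache.thrift.TDeserializer
import org.apache.thrift.protocol.TBinaryProtocol

// Assumed to be generated from the Thrift schema above;
// the package name is illustrative.
import com.snowplowanalytics.snowplow.CollectorPayload.thrift.model1.CollectorPayload

object RawEventReader {
  // Deserialize a single Thrift-encoded collector payload from raw bytes.
  def fromBytes(bytes: Array[Byte]): CollectorPayload = {
    val payload      = new CollectorPayload()
    val deserializer = new TDeserializer(new TBinaryProtocol.Factory())
    deserializer.deserialize(payload, bytes)
    payload
  }
}
```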
It's important to note that we built stream data processing on the idea of the Lambda architecture, which implies both a speed (real-time) layer and a batch layer. As a result, we provide two consumers: Stream Enrich and the Kinesis-S3 sink.
Reflecting their respective purposes:
- Stream Enrich reads raw Snowplow events off a Kinesis stream and writes the enriched Snowplow event to another Kinesis stream
- Kinesis-S3 reads records from an Amazon Kinesis stream, encodes and wraps them into Protocol Buffers (PB) by means of ElephantBird library, compresses the PB arrays using splittable LZO, and writes them to S3
Thus, the output of Kinesis-S3 is a projection of raw event data (serialized Thrift records, not enriched) in the form of a compressed LZO file. Each `.lzo` file has a corresponding `.lzo.index` file containing the byte offsets for the LZO blocks, so that the blocks can be processed in parallel using Hadoop.
Note: We also provide an option to GZIP the Thrift records rather than produce LZO files. However, those files are not meant to be processed in Hadoop. They are archives of Thrift records without the protocol buffers layer being applied.
Generally, the LZO file generated by Kinesis-S3 can be depicted as an "onion-like" layered object: an LZO-compressed file wrapping Protocol Buffers arrays, which in turn wrap the serialized Thrift records.
The main characteristics of stream-based raw events:
- A serialized Thrift record format.
- Supports both `GET` and `POST` requests.
- Supports `network_userid`.