Troubleshooting
This is a page of hints, tips and explanations to help you work with Snowplow. If something looks like a bug in Snowplow but isn't, it will end up on this page too.
- EmrEtlRunner failed. What do I do now?
- Why are browser features missing in IE?
- Hive problem: I upgraded and now queries are not working
- I need to recreate my table of Snowplow events, how?
- I want to recompute my Snowplow events, how?
- My database load process died during an S3 file copy, help!
- Shredding is failing with File does not exist: hdfs:/local/snowplow/shredded-events
- How do I terminate a Clojure Collector instance without losing event logs?
- My EMR master instance starts but my core instances timeout during the bootstrap process
EmrEtlRunner has three different ways of failing:
- The ETL job on Elastic MapReduce fails to start
- The ETL job starts on Elastic MapReduce but errors part way through
- One or more S3 file copy operations fail
For help diagnosing and fixing these problems, please see our dedicated Troubleshooting jobs on Elastic MapReduce wiki page.
With the exception of cookies and Java, our JavaScript tracker cannot detect what browser features (PDF, Flash etc) a given instance of Internet Explorer has. This is because IE, unlike the other major browsers, does not populate the window.navigator.mimeTypes[] and navigator.plugins[] properties.
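To illustrate the problem, here is a minimal, hypothetical sketch of the kind of mimeTypes-based check that works in other browsers but always comes back empty in IE (this is not the tracker's actual code):

```javascript
// Hypothetical illustration only; not the Snowplow JavaScript tracker's actual code.
// In Chrome, Firefox or Safari this can detect an installed PDF viewer; in IE,
// navigator.mimeTypes is left unpopulated, so the check always returns false.
function hasPdfSupport() {
  var mimeType = window.navigator.mimeTypes && window.navigator.mimeTypes['application/pdf'];
  return !!(mimeType && mimeType.enabledPlugin);
}
```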
There are other ways of detecting some browser features (via ActiveX), but these are not advised as they can trigger UAC warnings on Windows.
If your Hive queries stopped working after you upgraded, the most likely reason is that you have configured your ETL process to output your Snowplow event files in the non-Hive format (used to feed Infobright etc). This is typically configured with the following configuration option to EmrEtlRunner:
:etl:
  :storage_format: non-hive
Unlike the Hive format output, the non-Hive format output for Snowplow event files is not backwards compatible for Hive queries. In other words, with the non-Hive format, running a HiveQL query across Snowplow event files generated by two different versions of the ETL process will probably not work.
The solution is to re-run the ETL process across all of your raw Snowplow logs when you upgrade your ETL process.
If you have somehow lost or corrupted your Snowplow event store (in Infobright or Redshift), don't panic!
Fortunately, Snowplow does not delete any data at any stage of its processing, so it's all available for you to restore from your archive buckets.
Here is a simple workflow to use with StorageLoader to re-populate Infobright or Redshift with all of your events:
- Create a new events table in your database, let's call it events2
- Create a new S3 bucket, let's call it events-archive2
- Edit your StorageLoader's config.yml file (see the illustrative fragment below):
  - Change :table: to point to your events2 table
  - Change :in: to point to your existing archive bucket
  - Change :archive: to point to your new events-archive2 bucket
- Rerun StorageLoader

This should load all of your events into your new events2 table, archiving all events after loading into events-archive2.
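As a rough guide, the updated entries in config.yml would end up looking something like this. The bucket and table names are placeholders, and the exact nesting of these keys depends on your StorageLoader version, so treat it as a sketch rather than a drop-in configuration:

```yaml
# Sketch only: placeholder names, and the surrounding structure of your config.yml is omitted
:in:      s3://my-existing-event-archive   # the archive bucket already holding your processed events
:archive: s3://events-archive2             # the NEW bucket to archive events into after loading
:table:   events2                          # the NEW events table to load into
```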
You may well want to recompute all of your Snowplow events, for example if we release a new enrichment (such as geo-IP lookup) and you want it to be run against all of your historical data.
Fortunately, Snowplow does not delete any data at any stage of its processing, so the raw data is still available in your archive bucket for you to regenerate your Snowplow events from.
Here is a simple workflow to use with EmrEtlRunner to regenerate your Snowplow events from your raw collector logs:
- Create a new S3 bucket, let's call it events2
- Create a new S3 bucket, let's call it logs-archive2
- Edit your EmrEtlRunner's config.yml file (see the illustrative fragment below):
  - Change :in: to point to your existing archive bucket
  - Change :out: to point to your new events2 bucket
  - Change :archive: to point to your new logs-archive2 bucket
- Rerun EmrEtlRunner

This should recompute all of your events into your new events2 bucket, archiving the raw collector logs into logs-archive2 after processing. From there you can reload your recomputed events into Infobright or Redshift using StorageLoader.
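Again as a rough sketch (placeholder bucket names; the exact nesting of these keys depends on your EmrEtlRunner version):

```yaml
# Sketch only: placeholder names, and the surrounding structure of your config.yml is omitted
:in:      s3://my-existing-log-archive   # the archive bucket already holding your raw collector logs
:out:     s3://events2                   # the NEW bucket to write recomputed Snowplow events to
:archive: s3://logs-archive2             # the NEW bucket to archive the raw logs into after processing
```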
Occasionally Amazon S3 fails repeatedly to perform a file operation, eventually causing StorageLoader to die. When this happens, you may see "500 InternalServerErrors", reported by Sluice, which is the library we use to handle S3 file operations.
If this happens, you will need to rerun your StorageLoader process, using the following guidance:
If the job died during the download-to-local step, then:
- Delete any files in your download folder
- Rerun StorageLoader
If the job died during the archiving step, rerun StorageLoader with the command-line option --skip download,delete,load (see the example invocation below)
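For example, assuming a typical StorageLoader setup where the executable and config file live at the paths shown (both paths are assumptions, so adjust them to match your installation):

```bash
# Paths are assumptions: point these at your own StorageLoader executable and config file
./snowplow-storage-loader --config config/config.yml --skip download,delete,load
```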
If your shredding step is failing, you are probably seeing an error like this in your EMR job's syslog:
2014-07-17 02:31:42,198 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Running with args: [Ljava.lang.String;@471719b6
2014-07-17 02:31:45,975 FATAL com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
java.io.FileNotFoundException: File does not exist: hdfs:/local/snowplow/shredded-events
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
The Hadoop job step that is failing is the copy (using Amazon's S3DistCp utility) of shredded JSONs from your EMR cluster's HDFS file system back to Amazon S3, ready for loading into Redshift. Due to an unfortunate limitation of S3DistCp, this copy fails if the shredding step produced no files at all. Possible reasons for this:
- You are not generating any custom contexts or unstructured events, and have not enabled link click tracking. Solution: run EmrEtlRunner with --skip shred (see the example invocation after this list). Remove this --skip as and when you know that you do have JSONs to shred.
- You are trying to send contexts/unstructured events from your tracker, but something is going wrong. You can validate that this is the case by doing a text search on your collector logs and confirming you can't see the querystring parameters ue_pr or ue_px, nor co or cx. Solution: review your tracker implementation to fix it.
- You are sending contexts/unstructured events from your tracker, but the JSONs are failing schema validation for some reason. In this case you should be able to find the data in your shredded bad rows bucket, along with the reason(s) for the validation failure. Solution: update your JSON Schemas in Iglu, or your JSON instances, so that they pass validation.
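For the first case, the rerun would look something like this (the executable and config paths are assumptions, so adjust them to match your installation):

```bash
# Paths are assumptions: point these at your own EmrEtlRunner executable and config file
./snowplow-emr-etl-runner --config config/config.yml --skip shred
```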
The Clojure Collector is configured to upload logs of raw events to Amazon S3 every hour (typically at 10 minutes past the hour). If you want to terminate an instance running the Clojure Collector, you need to follow a strict process to ensure the most recent event logs are not lost when the instance is terminated.
For the process to follow, please see our dedicated Troubleshooting Clojure Collector instances to prevent data loss wiki page.
If your EMR master instance starts but your core instances time out during the bootstrap process, you are most likely running EMR in a VPC. If EMR cannot launch the core EC2 instances, then you may have a misconfigured VPC: you must set Enable DNS Hostnames to true for the VPC.
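One way to check and fix this setting is via the AWS CLI; the VPC ID below is a placeholder for the VPC your EMR cluster launches into:

```bash
# Placeholder VPC ID: substitute the VPC that your EMR cluster launches into
aws ec2 describe-vpc-attribute --vpc-id vpc-1a2b3c4d --attribute enableDnsHostnames
aws ec2 modify-vpc-attribute --vpc-id vpc-1a2b3c4d --enable-dns-hostnames "{\"Value\":true}"
```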