Getting Started with Spork - sigmoidanalytics/spork GitHub Wiki

Setting up Spork

git clone https://github.com/apache/pig -b spark    # Uses Spark-1.3.0
ant -Dhadoopversion=23 jar
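After the build completes, the Pig jar should land under build/ (the exact version in the filename may differ in your checkout; a quick sketch to confirm the build succeeded):

```shell
# Verify that the build produced the Pig jar, e.g. pig-0.14.0-SNAPSHOT.jar
ls build/pig-*.jar
```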

Required environment variables:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_USER_CLASSPATH_FIRST="true"
export PIG_HOME=/root/pig
export SPARK_MASTER=local    # OR spark://localhost:7077 OR yarn-client 
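Before launching Pig it can be handy to confirm the variables above are actually exported. A minimal POSIX-shell check (the check_env helper name and output format are illustrative only, not part of Pig):

```shell
# check_env VAR... -> prints the name of each variable that is unset or empty
check_env() {
  for v in "$@"; do
    eval "val=\${$v}"
    if [ -z "$val" ]; then
      echo "missing: $v"
    fi
  done
}

# Check the variables required above
check_env JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR PIG_HOME SPARK_MASTER
```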

Running a Pig script in Spark mode

bin/pig -x spark 
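This starts the interactive Grunt shell. A script file can also be run directly in Spark mode (myscript.pig is a placeholder name here):

```shell
bin/pig -x spark myscript.pig
```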

Running a sample script

Put the data into HDFS

hadoop fs -mkdir /pig-test/input/
hadoop fs -put ./tutorial/data/excite-small.log /pig-test/input/
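A quick listing confirms the upload landed where the script below expects it:

```shell
hadoop fs -ls /pig-test/input/
```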

Start Pig and paste the following script

bin/pig -x spark
raw = LOAD '/pig-test/input/excite-small.log' USING PigStorage('\t') AS (user: chararray, time:chararray, query:chararray);
queries = FOREACH raw GENERATE query;
distinct_queries = DISTINCT queries;
STORE distinct_queries INTO '/pig-test/output/';
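Once the STORE completes, the results are written as part files under the output directory and can be inspected from HDFS:

```shell
# List the part files written by STORE, then sample their contents
hadoop fs -ls /pig-test/output/
hadoop fs -cat /pig-test/output/part-* | head
```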

Common issues

1. java.lang.IllegalStateException: unread block data

Usually caused by misconfigured environment variables. Check that HADOOP_HOME, PIG_HOME and HADOOP_CONF_DIR are set correctly.

2. java.lang.RuntimeException: Cannot instantiate: org.apache.pig.......

Try adding the additional jar to PIG_CLASSPATH and SPARK_JARS. If the issue persists, specify the class with its full package name (e.g.: org.apache.pig.udf.JSONStorage) and use REGISTER as the first line of the script (e.g.: REGISTER /usr/lib/Tutorial.jar;).
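For example, to make a UDF jar visible on both classpaths (the /usr/lib/Tutorial.jar path follows the example above; whether SPARK_JARS takes a comma-separated list is an assumption, so adjust for your setup):

```shell
# Add the UDF jar to Pig's classpath and ship it to the Spark workers
export PIG_CLASSPATH=$PIG_CLASSPATH:/usr/lib/Tutorial.jar
export SPARK_JARS=/usr/lib/Tutorial.jar
```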

3. ClassNotFoundException

This is usually seen on YARN. Try adding the additional UDF jars to PIG_OPTS, for example:

export PIG_OPTS="-Dspark.yarn.dist.files=$PIG_HOME/legacy/pig-0.14.0-SNAPSHOT-withouthadoop-h2.jar,$PIG_HOME/lib/pig-udfs-new.jar"

4. Error creating job configuration

Point PIG_JAR at the jar produced by the build:

export PIG_JAR=$PIG_HOME/build/pig-0.14.0-SNAPSHOT.jar