Spork Setup on CDH - sigmoidanalytics/spork GitHub Wiki

Setting up Spork

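git clone https://github.com/apache/pig -b spark       # Head of the spark branch uses Spark-1.3.0
cd pig                                                 # The remaining commands run inside the checkout
git checkout ee9cce3079ce3f7fb1fceb5a334c98d5d44d4426  # Check out this commit to use Spark-1.2.0 instead
ant -Dhadoopversion=23 jar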

Required environment variables:

export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_USER_CLASSPATH_FIRST="true"
export PIG_HOME=/root/pig
export SPARK_MASTER=local    # OR spark://localhost:7077 OR yarn-client 
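
Before starting Pig, it can help to confirm that none of these variables was left unset. The following is a minimal sketch, not part of Spork itself: the function name `missing_spork_env` is hypothetical, and the variable list simply mirrors the exports above.

```shell
# Hypothetical helper (not shipped with Pig or Spork):
# prints the name of every required variable that is unset or empty.
missing_spork_env() {
    for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR PIG_HOME SPARK_MASTER; do
        # eval-based indirection keeps this usable in plain POSIX sh
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
        fi
    done
}
```

Calling `missing_spork_env` after the exports should print nothing; any name it prints still needs to be exported.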

Running a Pig script in Spark mode

bin/pig -x spark 

Run a sample script

Put the data into HDFS

hadoop fs -mkdir -p /pig-test/input/   # -p also creates the parent /pig-test if it does not exist
hadoop fs -put ./tutorial/data/excite-small.log /pig-test/input/

Start pig and paste the script

bin/pig -x spark
raw = LOAD '/pig-test/input/excite-small.log' USING PigStorage('\t') AS (user: chararray, time:chararray, query:chararray);
queries = FOREACH raw GENERATE query;
distinct_queries = DISTINCT queries;
STORE distinct_queries INTO '/pig-test/output/';
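
As a rough local cross-check of what the script above computes (the distinct values of the third tab-separated column), the same list can be approximated with standard tools on the local copy of the log. This is only a sketch: the function name `distinct_queries_local` is hypothetical, the output is sorted whereas Pig makes no ordering promise, and lines with missing fields may be treated differently than Pig treats them.

```shell
# Hypothetical helper: prints the distinct values of column 3 (the query)
# from a tab-separated log file on the local filesystem.
distinct_queries_local() {
    # cut splits on tabs by default; -s skips lines that contain no tab at all
    cut -s -f3 "$1" | LC_ALL=C sort -u
}
```

For example, `distinct_queries_local ./tutorial/data/excite-small.log | wc -l` gives a line count to compare against the files Pig writes under /pig-test/output/.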

Running a Pig script in embedded mode

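import java.io.IOException;

import org.apache.pig.PigServer;

public class Test {
    public static void main(String[] args) {
        try {
            // "spark" selects the Spark execution engine, like `bin/pig -x spark`
            PigServer ps = new PigServer("spark");
            // Parse and register the statements in script.pig
            // (the path is resolved relative to the working directory)
            ps.registerScript("script.pig");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}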

Steps to run code:

javac -cp "$PIG_HOME/build/pig-0.14.0-SNAPSHOT.jar" Test.java
jar -cvf script.jar Test.class
export HADOOP_CLASSPATH="$PIG_HOME/build/pig-0.14.0-SNAPSHOT.jar:/usr/lib/spark/assembly/lib/spark-assembly-1.2.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar:$PIG_HOME/build/ivy/lib/Pig/*"   # quoted so the shell passes the trailing * through literally
hadoop jar script.jar Test

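Setting up Spork on CDH YARN

In addition to the environment variables above, set the following. SPARK_HOME must point at the Spark installation (the HADOOP_CLASSPATH example above uses /usr/lib/spark).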

export SPARK_MASTER=yarn-client
export PIG_CLASSPATH=$SPARK_HOME/assembly/lib/spark-assembly-1.2.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar

Common issues and their solutions

**Issue:** java.lang.IllegalStateException: unread block data

This usually points to misconfigured environment variables. Check HADOOP_HOME, PIG_HOME, and HADOOP_CONF_DIR.
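
One quick way to narrow this down is to check that each of the three variables points at a directory that actually exists. The sketch below assumes nothing beyond a POSIX shell; the function name `report_bad_dirs` is hypothetical and is not part of Pig or Hadoop.

```shell
# Hypothetical helper: reports each variable whose value is not an existing directory.
report_bad_dirs() {
    for name in HADOOP_HOME PIG_HOME HADOOP_CONF_DIR; do
        # eval-based indirection keeps this usable in plain POSIX sh
        eval "value=\${$name:-}"
        if [ ! -d "$value" ]; then
            echo "$name='$value' is not a directory"
        fi
    done
}
```

Running `report_bad_dirs` in the shell that launches Pig prints one line per suspicious variable and nothing when all three resolve. It cannot tell whether a directory holds the right installation, only that the path exists.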

**Issue:** java.lang.RuntimeException: Cannot instantiate: org.apache.pig.......

Try adding the additional jar to PIG_CLASSPATH and SPARK_JARS. If the issue persists, specify the full class name including its package (e.g. org.apache.pig.udf.JSONStorage) and use REGISTER as the first line of the script (e.g. REGISTER /usr/lib/Tutorial.jar;).

**Issue:** ClassNotFoundException

This is usually seen on YARN. Try adding the additional UDF jars to PIG_OPTS, for example:

export PIG_OPTS="-Dspark.yarn.dist.files=$PIG_HOME/legacy/pig-0.14.0-SNAPSHOT-withouthadoop-h2.jar,$PIG_HOME/lib/pig-udfs-new.jar"
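
When several UDF jars are involved, typing the comma-separated list by hand gets error-prone. The following sketch builds the list from one or more directories. The function name `join_jars` is hypothetical, and which directories hold your UDF jars is an assumption you need to adapt; it ships every jar it finds, which may be more than you want.

```shell
# Hypothetical helper: joins every *.jar found in the given directories
# into a single comma-separated list, the format spark.yarn.dist.files expects.
join_jars() {
    list=""
    for dir in "$@"; do
        for jar in "$dir"/*.jar; do
            # skip the unexpanded pattern when a directory contains no jars
            [ -e "$jar" ] || continue
            list="${list:+$list,}$jar"
        done
    done
    echo "$list"
}
```

A possible use is `export PIG_OPTS="-Dspark.yarn.dist.files=$(join_jars "$PIG_HOME/lib")"`, where `$PIG_HOME/lib` stands in for wherever the UDF jars live.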

**Issue:** Error creating job configuration

export PIG_JAR=$PIG_HOME/build/pig-0.14.0-SNAPSHOT.jar

### Performance tuning steps

--TODO--