Spork Setup on CDH - sigmoidanalytics/spork GitHub Wiki
Setting up spork
git clone https://github.com/apache/pig -b spark # Head of the spark branch uses Spark-1.3.0
cd pig
git checkout ee9cce3079ce3f7fb1fceb5a334c98d5d44d4426 # Check out this commit to build against Spark-1.2.0
ant -Dhadoopversion=23 jar
Environment Variables required:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_USER_CLASSPATH_FIRST="true"
export PIG_HOME=/root/pig
export SPARK_MASTER=local # OR spark://localhost:7077 OR yarn-client
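Before starting Pig it can help to confirm that every variable listed above is actually set in the current shell. The following is a minimal sketch; `check_spork_env` is a hypothetical helper name, not something shipped with Pig or Spark, and it only checks that the variables are non-empty, not that the paths are valid.

```shell
# check_spork_env: hypothetical helper, not part of Pig or Spark.
# Reports every required variable that is unset or empty.
check_spork_env() {
    missing=""
    for name in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR \
                HADOOP_USER_CLASSPATH_FIRST PIG_HOME SPARK_MASTER; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing environment variables:$missing" >&2
        return 1
    fi
    echo "All required environment variables are set."
}
```

One way to use it is `check_spork_env && bin/pig -x spark`, so Pig only starts when the check passes.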
Running a Pig script in Spark mode
bin/pig -x spark
Run sample script
Put data into hdfs
hadoop fs -mkdir -p /pig-test/input/
hadoop fs -put ./tutorial/data/excite-small.log /pig-test/input/
Start pig and paste the script
bin/pig -x spark
raw = LOAD '/pig-test/input/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:chararray, query:chararray);
queries = FOREACH raw GENERATE query;
distinct_queries = DISTINCT queries;
STORE distinct_queries INTO '/pig-test/output/';
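Instead of pasting the statements into the interactive shell, the same script can be saved to a file and run in batch mode. This is a sketch under a few assumptions: `distinct_queries.pig` is a hypothetical file name, PIG_HOME is set as described above, and the launcher is only invoked if it is found.

```shell
# Write the sample script to a file. The quoted 'EOF' keeps '\t' literal.
# distinct_queries.pig is a hypothetical file name.
cat > distinct_queries.pig <<'EOF'
raw = LOAD '/pig-test/input/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:chararray, query:chararray);
queries = FOREACH raw GENERATE query;
distinct_queries = DISTINCT queries;
STORE distinct_queries INTO '/pig-test/output/';
EOF

# Run in batch mode only if the pig launcher is available.
if [ -x "${PIG_HOME:-}/bin/pig" ]; then
    "$PIG_HOME/bin/pig" -x spark distinct_queries.pig
    # Show the first lines of the result; hadoop expands the glob itself.
    hadoop fs -cat '/pig-test/output/part-*' | head
else
    echo "pig launcher not found under PIG_HOME; script saved to distinct_queries.pig"
fi
```

Pig refuses to store into a directory that already exists, so remove /pig-test/output (for example with `hadoop fs -rm -r /pig-test/output`) before running the script a second time.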
Running a Pig script in embedded mode
import java.io.IOException;
import org.apache.pig.PigServer;

public class Test {
    public static void main(String[] args) {
        try {
            // "spark" selects the Spark execution engine, as with bin/pig -x spark
            PigServer ps = new PigServer("spark");
            // Register the statements in script.pig with the server
            ps.registerScript("script.pig");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Steps to compile and run the code:
javac -cp "$PIG_HOME/build/pig-0.14.0-SNAPSHOT.jar" Test.java
jar -cvf script.jar Test.class
export HADOOP_CLASSPATH=$PIG_HOME/build/pig-0.14.0-SNAPSHOT.jar:/usr/lib/spark/assembly/lib/spark-assembly-1.2.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar:$PIG_HOME/build/ivy/lib/Pig/*
hadoop jar script.jar Test
Setting up spork on CDH YARN
In addition to the environment variables above, set the following (SPARK_HOME must point to the Spark installation):
export SPARK_MASTER=yarn-client
export PIG_CLASSPATH=$SPARK_HOME/assembly/lib/spark-assembly-1.2.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar
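The assembly jar name above hard-codes one Spark and CDH version, so it will not match other installations. As a sketch, the jar can be located with a glob instead. `find_spark_assembly` is a hypothetical helper name, and it assumes the assembly jar lives under assembly/lib inside the Spark home, as in the path above.

```shell
# find_spark_assembly: hypothetical helper, not part of Spark or Pig.
# Prints the first spark-assembly jar found under <spark home>/assembly/lib.
find_spark_assembly() {
    spark_home="$1"
    for jar in "$spark_home"/assembly/lib/spark-assembly-*.jar; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    echo "No spark-assembly jar under $spark_home/assembly/lib" >&2
    return 1
}
```

It could then be used as `export PIG_CLASSPATH="$(find_spark_assembly "$SPARK_HOME")"`. If several assembly jars are present, only the first match is used, which may not be the intended one.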
Common issues and their solutions
**Issue:** java.lang.IllegalStateException: unread block data
This points to misconfigured environment variables. Check HADOOP_HOME, PIG_HOME and HADOOP_CONF_DIR.
**Issue:** java.lang.RuntimeException: Cannot instantiate: org.apache.pig.......
Try adding the additional jar to PIG_CLASSPATH and SPARK_JARS. If the issue persists, specify the full class name including its package (e.g. org.apache.pig.udf.JSONStorage) and make REGISTER the first line of the script (e.g. REGISTER /usr/lib/Tutorial.jar;).
**Issue:** ClassNotFoundException
This is usually seen on YARN. Try adding the additional UDF jars to PIG_OPTS, for example:
export PIG_OPTS="-Dspark.yarn.dist.files=$PIG_HOME/legacy/pig-0.14.0-SNAPSHOT-withouthadoop-h2.jar,$PIG_HOME/lib/pig-udfs-new.jar"
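When several UDF jars are involved, the comma-separated value for spark.yarn.dist.files is easy to get wrong by hand. The sketch below builds it from a list of paths and skips any that do not exist. `join_jars` is a hypothetical helper name introduced here for illustration.

```shell
# join_jars: hypothetical helper, not part of Pig or Spark.
# Joins its arguments with commas, skipping paths that are not files.
join_jars() {
    list=""
    for jar in "$@"; do
        if [ ! -f "$jar" ]; then
            echo "Skipping missing jar: $jar" >&2
            continue
        fi
        # Prepend a comma only when the list already has an entry.
        list="${list:+$list,}$jar"
    done
    echo "$list"
}
```

For example: `export PIG_OPTS="-Dspark.yarn.dist.files=$(join_jars "$PIG_HOME/legacy/pig-0.14.0-SNAPSHOT-withouthadoop-h2.jar" "$PIG_HOME/lib/pig-udfs-new.jar")"`. The skip messages go to stderr, so a mistyped path is visible without ending up in PIG_OPTS.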
**Issue:** Error creating job configuration
Set PIG_JAR to the built Pig jar:
export PIG_JAR=$PIG_HOME/build/pig-0.14.0-SNAPSHOT.jar
### Performance tuning steps
--TODO--