# Getting Started with Spork - sigmoidanalytics/spork GitHub Wiki
## Setting up Spork

```shell
# Clone the Spark branch of Apache Pig (uses Spark-1.3.0) and build it.
git clone https://github.com/apache/pig -b spark
cd pig
ant -Dhadoopversion=23 jar
```
Environment variables required:

```shell
export JAVA_HOME=/usr/lib/jvm/java-7-oracle-cloudera
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_USER_CLASSPATH_FIRST="true"
export PIG_HOME=/root/pig
export SPARK_MASTER=local # OR spark://localhost:7077 OR yarn-client
```
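Several of the errors listed later on this page come down to one of these variables being unset, so it is worth checking them before launching Pig. A minimal bash sketch (the loop and variable list are illustrative, not part of Spork itself; `${!v}` is bash indirection):

```shell
#!/usr/bin/env bash
# Warn about any required variable that is unset or empty.
for v in JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR HADOOP_USER_CLASSPATH_FIRST PIG_HOME SPARK_MASTER; do
  if [ -z "${!v}" ]; then
    echo "missing: $v"
  fi
done
```

If the loop prints nothing, all of the variables above are set.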
## Running a Pig script in Spark mode

```shell
bin/pig -x spark
```
## Run a sample script

Put the data into HDFS:

```shell
hadoop fs -mkdir /pig-test/input/
hadoop fs -put ./tutorial/data/excite-small.log /pig-test/input/
```
Start Pig and paste the script:

```shell
bin/pig -x spark
```

```pig
raw = LOAD '/pig-test/input/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:chararray, query:chararray);
queries = FOREACH raw GENERATE query;
distinct_queries = DISTINCT queries;
STORE distinct_queries INTO '/pig-test/output/';
```
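The FOREACH and DISTINCT steps above amount to extracting the third (query) column and deduplicating it. As a local sanity check of what the script computes, here is a sketch on a tiny inline sample (hypothetical data, not the real excite log):

```shell
# Fields mirror the schema above: user, time, query (tab-separated).
printf 'u1\t1\tfoo\nu2\t2\tbar\nu3\t3\tfoo\n' \
  | cut -f3 | sort -u
# prints:
# bar
# foo
```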
## Common issues

1. `java.lang.IllegalStateException: unread block data`

   Usually caused by misconfigured environment variables. Check that HADOOP_HOME, PIG_HOME, and HADOOP_CONF_DIR are set correctly.

2. `java.lang.RuntimeException: Cannot instantiate: org.apache.pig.......`

   Try adding the missing jar to PIG_CLASSPATH and SPARK_JARS. If the issue persists, specify the class with its full package name (e.g. org.apache.pig.udf.JSONStorage) and use REGISTER as the first line of the script (e.g. `REGISTER /usr/lib/Tutorial.jar;`).

3. `ClassNotFoundException`

   Usually seen in YARN mode. Try adding the UDF jars to PIG_OPTS, e.g.:

   export PIG_OPTS="-Dspark.yarn.dist.files=$PIG_HOME/legacy/pig-0.14.0-SNAPSHOT-withouthadoop-h2.jar,$PIG_HOME/lib/pig-udfs-new.jar"

4. Error creating job configuration

   Point PIG_JAR at the jar built earlier:

   export PIG_JAR=$PIG_HOME/build/pig-0.14.0-SNAPSHOT.jar
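For issue 2 above, a minimal sketch of making a UDF jar visible to both Pig and Spark before starting the shell (the jar path is the hypothetical one from the REGISTER example, substitute your own):

```shell
# Hypothetical UDF jar path; append to Pig's classpath and list it for Spark.
export PIG_CLASSPATH=$PIG_CLASSPATH:/usr/lib/Tutorial.jar
export SPARK_JARS=/usr/lib/Tutorial.jar
```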