# FAQ: How to run Hadoop-free Spark in PadoGrid

How do I run Hadoop-free Spark in PadoGrid? I want to include my own Hadoop installation. Where do I set `SPARK_DIST_CLASSPATH`?

First, make sure you have Hadoop installed. You can install Hadoop in PadoGrid using `install_padogrid` as follows.

```bash
# Download and install Hadoop
install_padogrid -product hadoop

# Update the workspace with the desired Hadoop version
update_product -product hadoop
```
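Once the workspace has been updated, a quick sanity check is to confirm that `HADOOP_HOME` is set and that the `hadoop` command runs. This is a minimal sketch; it assumes you have reloaded the workspace environment, e.g., by running `switch_workspace` or opening a new shell.

```bash
# Verify that PadoGrid set HADOOP_HOME for the workspace
echo "$HADOOP_HOME"

# Verify the Hadoop installation itself
"$HADOOP_HOME/bin/hadoop" version
```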

There are two ways to add the Hadoop class path to Spark in PadoGrid.

## 1. `bin_sh/setenv.sh`

Edit the `bin_sh/setenv.sh` file.

```bash
switch_cluster myspark
vi bin_sh/setenv.sh
```

In `bin_sh/setenv.sh`, uncomment the following line. `HADOOP_HOME` is set by PadoGrid when you update the workspace with `update_product`, as shown above.

```bash
CLASSPATH="$CLASSPATH:$($HADOOP_HOME/bin/hadoop classpath)"
```
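To see exactly what this line appends, you can run the `hadoop classpath` command yourself. It prints the colon-separated list of Hadoop configuration and jar directories; the output shape below is illustrative, and the actual paths depend on your installation.

```bash
# Print the Hadoop class path entries that get appended to CLASSPATH
$HADOOP_HOME/bin/hadoop classpath

# Illustrative output (truncated):
# /opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:...
```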

## 2. `spark-env.sh`

Another way is to set `SPARK_DIST_CLASSPATH` in `spark-env.sh`, as described in the Spark documentation [1]. If set, it takes precedence over the class path configured in `bin_sh/setenv.sh`.

Edit the `etc/spark-env.sh` file.

```bash
switch_cluster myspark
vi etc/spark-env.sh
```

In `etc/spark-env.sh`, add the following line. `HADOOP_HOME` is set by PadoGrid when you update the workspace with `update_product`, as shown above.

```bash
export SPARK_DIST_CLASSPATH=$CLASSPATH:$($HADOOP_HOME/bin/hadoop classpath)
```
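Changes to `spark-env.sh` only take effect when Spark processes start, so restart the cluster members afterward. A minimal sketch using PadoGrid's cluster commands, assuming your Spark cluster is named `myspark`:

```bash
# Restart the cluster so the new SPARK_DIST_CLASSPATH is picked up
stop_cluster -cluster myspark
start_cluster -cluster myspark
```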

:pencil2: The `CLASSPATH` set by PadoGrid includes your cluster and workspace libraries located in their respective `plugins` and `lib` directories. You can drop your jar files into any of these directories to make them part of the Spark class path, as shown below.
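For example, to make an application jar visible to the `myspark` cluster, you could copy it into the cluster's `plugins` directory. The jar name below is hypothetical, and the path assumes PadoGrid's standard workspace layout under `$PADOGRID_WORKSPACE`.

```bash
# Copy a hypothetical application jar into the cluster's plugins directory.
# $PADOGRID_WORKSPACE points to the active workspace and is set by PadoGrid.
cp my-app.jar "$PADOGRID_WORKSPACE/clusters/myspark/plugins/"
```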

## References

1. Using Spark's "Hadoop Free" Build, https://spark.apache.org/docs/latest/hadoop-provided.html