Building Apache Spark
The instructions provided below specify the steps to build Apache Spark version 3.5.0 in Standalone Mode on Linux on IBM Z for the following distributions:
- RHEL (7.8, 7.9, 8.6, 8.8, 8.9, 9.0, 9.2, 9.3)
- SLES (12 SP5, 15 SP5)
- Ubuntu (20.04, 22.04, 23.10)
Limitation: Hive and ORC do not currently fully support big-endian systems. The use of these components should be avoided if possible.
The binary for Apache Spark version 3.5.0 can be downloaded from here. It works after installing Java and building LevelDB JNI from source, as described in Step 2.4 and Step 2.6. Please note that the steps in Documentation are the only verification performed on the binary.
General Notes:
- When following the steps below, please use a standard permission user unless otherwise specified.
- A directory /<source_root>/ will be referred to in these instructions; this is a temporary writable directory that can be placed anywhere you like.
Step 1. Build using script
If you want to build Spark using manual steps, go to STEP 2.
Use the following commands to build Spark using the build script. Please make sure you have wget installed.
wget -q https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/3.5.0/build_spark.sh
# Build Spark
bash build_spark.sh   # provide the -h option to print the help menu
If the build completes successfully, go to STEP 5. In case of an error, check the logs for details or go to STEP 2 to follow the manual build steps.
Step 2. Build Prerequisites for Apache Spark
2.1) Install the dependencies
export SOURCE_ROOT=/<source_root>/
- RHEL (7.8, 7.9, 8.6, 8.8, 8.9)
sudo yum groupinstall -y 'Development Tools'
sudo yum install -y wget tar git libtool autoconf make curl python3
- RHEL (9.0, 9.2, 9.3)
sudo yum install -y rpmdevtools wget tar git libtool autoconf make curl python3 flex gcc redhat-rpm-config rpm-build pkgconfig gettext automake gdb bison gcc-c++ binutils
- SLES (12 SP5)
sudo zypper install -y wget tar git libtool autoconf curl libnghttp2-devel gcc make gcc-c++ zip unzip gzip gawk python36
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 40
- SLES (15 SP5)
sudo zypper install -y wget tar git libtool autoconf curl gcc make gcc-c++ zip unzip gzip gawk python3
- Ubuntu (20.04, 22.04, 23.10)
sudo apt-get update
sudo apt-get install -y wget tar git libtool autoconf build-essential curl apt-transport-https
2.2) Build and Install GCC 9.2.0 (for RHEL 7.x only)
cd "${SOURCE_ROOT}"
sudo yum install -y hostname tar zip gcc-c++ unzip python3 cmake curl wget gcc vim patch binutils-devel tcl gettext
GCC_VERSION=9.2.0
wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.gz
tar -xf gcc-${GCC_VERSION}.tar.gz
cd gcc-${GCC_VERSION}
./contrib/download_prerequisites
mkdir objdir
cd objdir
../configure --prefix=/opt/gcc --enable-languages=c,c++ --with-arch=zEC12 --with-long-double-128 \
--build=s390x-linux-gnu --host=s390x-linux-gnu --target=s390x-linux-gnu \
--enable-threads=posix --with-system-zlib --disable-multilib
make -j $(nproc)
sudo make install
sudo ln -sf /opt/gcc/bin/gcc /usr/bin/gcc
sudo ln -sf /opt/gcc/bin/g++ /usr/bin/g++
sudo ln -sf /opt/gcc/bin/g++ /usr/bin/c++
export PATH=/opt/gcc/bin:"$PATH"
export LD_LIBRARY_PATH=/opt/gcc/lib64:"$LD_LIBRARY_PATH"
export C_INCLUDE_PATH=/opt/gcc/lib/gcc/s390x-linux-gnu/${GCC_VERSION}/include
export CPLUS_INCLUDE_PATH=/opt/gcc/lib/gcc/s390x-linux-gnu/${GCC_VERSION}/include
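After the install completes, it can be worth confirming that the symlinked gcc really is the freshly built 9.2.0 before moving on. The helper below is an illustrative sketch (not part of the official steps); it parses the first line of `gcc --version` output:

```shell
# Illustrative helper (not in the original instructions): extract the
# version number from `gcc --version` output, e.g. "9.2.0".
gcc_version_of() {
  # The first line looks like "gcc (GCC) 9.2.0"; print its last field.
  printf '%s\n' "$1" | awk 'NR==1 {print $NF}'
}

# Example usage (uncomment once GCC 9.2.0 is installed):
# [ "$(gcc_version_of "$(gcc --version)")" = "9.2.0" ] || echo "unexpected gcc on PATH" >&2
```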
2.3) Install the ZSTD-JNI library (for Ubuntu 20.04, RHEL 7.x/8.x and SLES distributions)
2.3.1) Install Temurin 11 (required for building ZSTD-JNI)
cd "${SOURCE_ROOT}"
sudo mkdir -p /opt/openjdk/11/
curl -SL -o jdk11.tar.gz "https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.19%2B7/OpenJDK11U-jdk_s390x_linux_hotspot_11.0.19_7.tar.gz"
sudo tar -zxf jdk11.tar.gz -C /opt/openjdk/11/ --strip-components 1
2.3.2) Install ZSTD-JNI library
sudo mkdir -p /usr/lib64/  # Only for Ubuntu 20.04
cd "${SOURCE_ROOT}"
git clone -b c1.5.2-5 https://github.com/luben/zstd-jni.git
cd zstd-jni/
JAVA_HOME="/opt/openjdk/11" PATH="/opt/openjdk/11/bin/:${PATH}" ./sbt compile package
sudo cp target/classes/linux/s390x/libzstd-jni-1.5.2-5.so /usr/lib64/
export LD_LIBRARY_PATH=/usr/lib64/:"$LD_LIBRARY_PATH" #Only for Ubuntu 20.04
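Since the JVM locates libzstd-jni through the library search path at runtime, a quick path check can catch a misplaced `.so` early. This is a hedged sketch; the helper name is invented for illustration:

```shell
# Illustrative helper: succeed if the named library file exists in any
# directory of a colon-separated search path (mimics the JVM lookup).
lib_on_path() {
  local IFS=':' dir
  for dir in $1; do
    [ -e "${dir}/$2" ] && return 0
  done
  return 1
}

# Example usage:
# lib_on_path "/usr/lib64:${LD_LIBRARY_PATH}" "libzstd-jni-1.5.2-5.so" \
#   || echo "zstd-jni native library not found" >&2
```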
2.4) Install Java
Install Java 8 (required for building LevelDB JNI):
cd "${SOURCE_ROOT}"
curl -SL -o jdk8.tar.gz "https://github.com/ibmruntimes/semeru8-binaries/releases/download/jdk8u372-b07_openj9-0.38.0/ibm-semeru-open-jdk_s390x_linux_8u372b07_openj9-0.38.0.tar.gz"
sudo mkdir -p /opt/openjdk/8/
sudo tar -zxf jdk8.tar.gz -C /opt/openjdk/8/ --strip-components 1
Install Java:
- With OpenJDK
- RHEL (7.8, 7.9, 8.6, 8.8, 8.9, 9.0, 9.2, 9.3)
sudo yum install -y java-11-openjdk java-11-openjdk-devel
sudo yum install -y java-17-openjdk java-17-openjdk-devel  # For RHEL 8.x and 9.x only
- SLES (12 SP5, 15 SP5)
sudo zypper install -y java-11-openjdk java-11-openjdk-devel
sudo zypper install -y java-17-openjdk java-17-openjdk-devel  # For SLES 15 SP5 only
- Ubuntu (20.04, 22.04, 23.10)
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-11-jdk
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-17-jdk
- With Eclipse Adoptium Temurin Runtime (previously known as AdoptOpenJDK hotspot)
- RHEL (7.8, 7.9, 8.6, 8.8, 8.9, 9.0, 9.2, 9.3), SLES (12 SP5, 15 SP5) and Ubuntu (20.04, 22.04, 23.10)
- Download and install Eclipse Adoptium Temurin Runtime (Java 11, 17) from here.
- For all of the above distributions, set the environment variables after installing Java:
export JAVA_HOME=/<Path to OpenJDK>/
export PATH="${JAVA_HOME}/bin:${PATH}"
Note: At the time of writing, Apache Spark version 3.5.0 was verified with OpenJDK 11 and 17 (latest distro-provided versions) and Eclipse Adoptium Temurin Runtime (builds 11.0.19+7 and 17.0.10+7) on all of the above distributions.
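Before starting a long Maven build it can help to fail fast when JAVA_HOME does not point at a full JDK (java plus javac). A minimal sketch, with an invented helper name:

```shell
# Illustrative helper: check that a directory looks like a usable JDK.
require_jdk() {
  [ -n "$1" ] && [ -x "$1/bin/java" ] && [ -x "$1/bin/javac" ]
}

# Example usage:
# require_jdk "${JAVA_HOME}" || { echo "JAVA_HOME (${JAVA_HOME}) is not a JDK" >&2; exit 1; }
```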
2.5) Install Maven
cd "${SOURCE_ROOT}"
wget -O apache-maven-3.8.8.tar.gz "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz"
tar -xzf apache-maven-3.8.8.tar.gz
export PATH="${PATH}:${SOURCE_ROOT}/apache-maven-3.8.8/bin"
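If another Maven is already installed, the PATH ordering above may not win; parsing `mvn -version` is a cheap way to confirm which one resolves. A hedged sketch (the helper name is illustrative):

```shell
# Illustrative helper: pull the version out of `mvn -version` output.
maven_version_of() {
  # The first line looks like "Apache Maven 3.8.8 (...)"; print field 3.
  printf '%s\n' "$1" | awk 'NR==1 {print $3}'
}

# Example usage:
# [ "$(maven_version_of "$(mvn -version)")" = "3.8.8" ] || echo "unexpected mvn on PATH" >&2
```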
2.6) Build LevelDB JNI
- Download and configure Snappy
cd "${SOURCE_ROOT}"
wget https://github.com/google/snappy/releases/download/1.1.3/snappy-1.1.3.tar.gz
tar -zxvf snappy-1.1.3.tar.gz
export SNAPPY_HOME="${SOURCE_ROOT}/snappy-1.1.3"
cd "${SNAPPY_HOME}"
./configure --disable-shared --with-pic
make
sudo make install
- Download the source code for LevelDB and LevelDB JNI
cd "${SOURCE_ROOT}"
git clone -b s390x https://github.com/linux-on-ibm-z/leveldb.git
git clone -b leveldbjni-1.8-s390x https://github.com/linux-on-ibm-z/leveldbjni.git
- Set the environment variables
export LEVELDB_HOME="${SOURCE_ROOT}/leveldb"
export LEVELDBJNI_HOME="${SOURCE_ROOT}/leveldbjni"
export LIBRARY_PATH="${SNAPPY_HOME}"
export C_INCLUDE_PATH="${LIBRARY_PATH}"
export CPLUS_INCLUDE_PATH="${LIBRARY_PATH}"
- Apply the LevelDB patch and build the static library
cd "${LEVELDB_HOME}"
git apply "${LEVELDBJNI_HOME}/leveldb.patch"
make libleveldb.a
- Build the jar file and extract the native library
cd "${LEVELDBJNI_HOME}"
JAVA_HOME="/opt/openjdk/8/" PATH="/opt/openjdk/8/bin/:${PATH}" mvn clean install -P download -Plinux64-s390x -DskipTests
JAVA_HOME="/opt/openjdk/8/" PATH="/opt/openjdk/8/bin/:${PATH}" jar -xvf "${LEVELDBJNI_HOME}/leveldbjni-linux64-s390x/target/leveldbjni-linux64-s390x-1.8.jar"
export LD_LIBRARY_PATH="${LEVELDBJNI_HOME}/META-INF/native/linux64/s390x${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
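A missing or misplaced native library here tends to surface much later as an UnsatisfiedLinkError during Spark tests, so verifying the extraction now is cheap insurance. A sketch with an invented helper name:

```shell
# Illustrative helper: succeed if a matching native library exists
# anywhere under the given directory tree.
native_lib_present() {
  find "$1" -name "$2" 2>/dev/null | grep -q .
}

# Example usage:
# native_lib_present "${LEVELDBJNI_HOME}/META-INF/native" 'libleveldbjni*' \
#   || echo "leveldbjni native library missing" >&2
```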
2.7) Set Environment Variables
export MAVEN_OPTS="-Xss128m -Xmx3g -XX:ReservedCodeCacheSize=1g"
export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8"
Step 3. Build Apache Spark
3.1) Clone Spark Repository
cd "${SOURCE_ROOT}"
git clone -b v3.5.0 https://github.com/apache/spark.git
3.2) Apply Patches to Fix Known Issues
cd spark
curl -sSL "https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/3.5.0/patch/spark.diff" | git apply
3.3) Build Spark
cd "${SOURCE_ROOT}/spark"
./build/mvn -DskipTests clean package
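A successful package build leaves the runtime jars under assembly/target (the scala-2.12 path below assumes the default Scala profile for Spark 3.5.0). Counting them is a simple post-build sanity check; the helper is illustrative only:

```shell
# Illustrative helper: count jar files directly inside a directory.
count_jars() {
  find "$1" -maxdepth 1 -name '*.jar' 2>/dev/null | wc -l
}

# Example usage:
# [ "$(count_jars "${SOURCE_ROOT}/spark/assembly/target/scala-2.12/jars")" -gt 0 ] \
#   || echo "no jars produced; check the Maven log" >&2
```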
Step 4. Run the test cases (Optional)
- Run the Whole Java Test Suite
cd "${SOURCE_ROOT}/spark"
./build/mvn test -fn -DwildcardSuites=none
- Run an Individual Java Test (for example, JavaAPISuite)
cd "${SOURCE_ROOT}/spark"
./build/mvn -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite test
- Run the Whole Scala Test Suite
cd "${SOURCE_ROOT}/spark"
./build/mvn test -fn -Dtest=none -pl '!sql/hive'
- Run an Individual Scala Test (for example, DataFrameCallbackSuite)
cd "${SOURCE_ROOT}/spark"
./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.util.DataFrameCallbackSuite test
Note: Hive does not currently support big-endian systems, so its test suites are skipped.
Note: The following Scala test cases have been observed to fail intermittently; they should pass on rerun.
- CoarseGrainedExecutorBackendSuite
  - SPARK-40320 Executor should exit when initialization failed for fatal error
- SQLAppStatusListenerWithInMemoryStoreSuite
  - driver side SQL metrics
- Any test suite failing with the error java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
- DAGSchedulerSuite
  - SPARK-40082: recomputation of shuffle map stage with no pending partitions should finalize the stage. pushBasedShuffleEnabled = false
- ClientE2ETestSuite
  - spark deep recursion
- ReplE2ESuite
  - Simple query
  - UDF containing 'def'
  - UDF containing in-place lambda
  - Updating UDF properties
  - SPARK-43198: Filter does not throw ammonite-related class initialization exception
  - Client-side JAR
  - Java UDF
  - Java UDF Registration
  - UDF Registration
  - UDF closure registration
  - call_udf
  - Single Cell Compilation
  - Local relation containing REPL generated class
  - Collect REPL generated class
  - REPL class in encoder
  - REPL class in UDF
  - streaming works with REPL generated code
Note: The following Scala test failure on RHEL 8.x, 9.x and SLES 15 SP5 is observed on both s390x and amd64 when building Spark with OpenJDK 11/17:
- Spark Project Core
  - UISuite
    - http -> https redirect applies to all URIs
Note: The following Scala test failures are being investigated:
- Spark Project SQL
  - SQLQueryTestSuite
    - ansi/cast.sql_analyzer_test
    - ansi/interval.sql
    - ansi/interval.sql_analyzer_test
    - cast.sql_analyzer_test
    - interval.sql
    - interval.sql_analyzer_test
    - try_cast.sql_analyzer_test
  - FlatMapGroupsWithStateDistributionSuite
    - SPARK-38204: flatMapGroupsWithState should require ClusteredDistribution from children if the query starts from checkpoint in 3.2.x - with initial state
  - SparkScriptTransformationSuite
    - SPARK-32400: TRANSFORM should support more data types (interval, array, map, struct and udt) as input (no serde)
  - NullableColumnBuilderSuite
    - CALENDAR_INTERVAL column builder: null values
Step 5. Start Apache Spark Shell
cd "${SOURCE_ROOT}/spark"
./bin/spark-shell
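For a non-interactive smoke test of the build, the bundled SparkPi example can be run instead of the interactive shell; its output contains a line starting with "Pi is roughly". The checking helper below is an illustrative sketch, not part of the official steps:

```shell
# Illustrative helper: succeed if SparkPi's result line is present in
# the captured output.
pi_line_ok() {
  printf '%s\n' "$1" | grep -q '^Pi is roughly 3\.'
}

# Example usage:
# pi_line_ok "$(./bin/run-example SparkPi 10 2>/dev/null)" || echo "smoke test failed" >&2
```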