Building Apache Spark
The instructions provided below specify the steps to build Apache Spark version 3.5.4 in Standalone Mode on Linux on IBM Z for the following distributions:
- RHEL (8.8, 8.10, 9.2, 9.4, 9.5)
- SLES 15 SP6
- Ubuntu (20.04, 22.04, 24.04, 24.10)
Limitation: Hive does not currently fully support big-endian systems. The use of this component should be avoided.
General Notes:
- When following the steps below, please use a standard permission user unless otherwise specified.
- A directory /<source_root>/ will be referred to in these instructions; it is a temporary writable directory that can be placed anywhere you like.
1. Build using script
If you want to build Spark using manual steps, go to STEP 2.
Use the following commands to build Spark using the build script. Please make sure you have wget installed.
wget -q https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/3.5.4/build_spark.sh
# Build Spark
bash build_spark.sh    # pass the -h option to print the help menu
If the build completes successfully, go to STEP 5. In case of error, check the logs for more details or go to STEP 2 to follow the manual build steps.
2. Build Prerequisites for Apache Spark
2.1. Install the dependencies
export SOURCE_ROOT=/<source_root>/
- RHEL (8.8, 8.10)
sudo yum groupinstall -y 'Development Tools'
sudo yum install -y wget tar git libtool autoconf make curl python3 procps-ng
- RHEL (9.2, 9.4, 9.5)
sudo yum groupinstall -y 'Development Tools'
sudo yum install -y --allowerasing rpmdevtools wget tar git libtool autoconf make curl python3 flex gcc redhat-rpm-config rpm-build pkgconfig gettext automake gdb bison gcc-c++ binutils procps-ng
- SLES 15 SP6
sudo zypper install -y wget tar git libtool autoconf curl gcc make gcc-c++ zip unzip gzip gawk python3 procps
- Ubuntu (20.04, 22.04, 24.04, 24.10)
sudo apt-get update
sudo apt-get install -y wget tar git libtool autoconf build-essential curl apt-transport-https cmake python3 procps
2.2. Install Java
- Install Java 8 (required for building LevelDB JNI):
cd "${SOURCE_ROOT}"
curl -SL -o jdk8.tar.gz "https://github.com/ibmruntimes/semeru8-binaries/releases/download/jdk8u432-b06_openj9-0.48.0/ibm-semeru-open-jdk_s390x_linux_8u432b06_openj9-0.48.0.tar.gz"
sudo mkdir -p /opt/openjdk/8/
sudo tar -zxf jdk8.tar.gz -C /opt/openjdk/8/ --strip-components 1
- With Eclipse Adoptium Temurin Runtime (previously known as AdoptOpenJDK hotspot)
  - Download and install the Eclipse Adoptium Temurin Runtime (Java 11 or 17) from the Eclipse Adoptium website (https://adoptium.net/).
- With OpenJDK 11
- RHEL (8.8, 8.10, 9.2, 9.4, 9.5)
sudo yum install -y java-11-openjdk java-11-openjdk-devel
- SLES 15 SP6
sudo zypper install -y java-11-openjdk java-11-openjdk-devel
- Ubuntu (20.04, 22.04, 24.04, 24.10)
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-11-jdk
- With OpenJDK 17
- RHEL (8.8, 8.10, 9.2, 9.4, 9.5)
sudo yum install -y java-17-openjdk java-17-openjdk-devel
- SLES 15 SP6
sudo zypper install -y java-17-openjdk java-17-openjdk-devel
- Ubuntu (20.04, 22.04, 24.04, 24.10)
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-17-jdk
Note: At the time of creation of these build instructions, Apache Spark version 3.5.4 was verified with OpenJDK 11 and 17 (latest distro-provided versions) and Eclipse Adoptium Temurin Runtime (builds 11.0.25+9 and 17.0.13+11) on all of the above distributions.
2.3. Set JAVA_HOME and PATH
export JAVA_HOME=/<Path to Java>/
export PATH="${JAVA_HOME}/bin:${PATH}"
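For example, with the distro-provided OpenJDK 11 on Ubuntu, the settings would typically look like the sketch below; the JVM path is an assumption and may differ on your system, so adjust it to the actual install location.
# Example only: typical Ubuntu location for OpenJDK 11 on IBM Z (verify the path on your system)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-s390x
export PATH="${JAVA_HOME}/bin:${PATH}"
java -version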
2.4. Install Maven
cd "${SOURCE_ROOT}"
wget -O apache-maven-3.8.8.tar.gz "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz"
tar -xzf apache-maven-3.8.8.tar.gz
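Optionally, verify that the extracted Maven binary works and reports version 3.8.8:
"${SOURCE_ROOT}/apache-maven-3.8.8/bin/mvn" -version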
2.5. Build LevelDB JNI
- Download and configure Snappy
cd "${SOURCE_ROOT}"
wget https://github.com/google/snappy/releases/download/1.1.4/snappy-1.1.4.tar.gz
tar -zxvf snappy-1.1.4.tar.gz
export SNAPPY_HOME="${SOURCE_ROOT}/snappy-1.1.4"
cd "${SNAPPY_HOME}"
./configure --disable-shared --with-pic
make
sudo make install
- Download the source code for LevelDB and LevelDB JNI
cd "${SOURCE_ROOT}"
git clone -b s390x https://github.com/linux-on-ibm-z/leveldb.git
git clone -b leveldbjni-1.8-s390x https://github.com/linux-on-ibm-z/leveldbjni.git
- Set the environment variables
export LEVELDB_HOME="${SOURCE_ROOT}/leveldb"
export LEVELDBJNI_HOME="${SOURCE_ROOT}/leveldbjni"
export LIBRARY_PATH="${SNAPPY_HOME}"
export C_INCLUDE_PATH="${LIBRARY_PATH}"
export CPLUS_INCLUDE_PATH="${LIBRARY_PATH}"
- Apply the LevelDB patch
cd "${LEVELDB_HOME}"
git apply "${LEVELDBJNI_HOME}/leveldb.patch"
make libleveldb.a
- Build the jar file
cd "${LEVELDBJNI_HOME}"
JAVA_HOME="/opt/openjdk/8/" PATH="/opt/openjdk/8/bin/:${SOURCE_ROOT}/apache-maven-3.8.8/bin:${PATH}" mvn clean install -P download -Plinux64-s390x -DskipTests
JAVA_HOME="/opt/openjdk/8/" PATH="/opt/openjdk/8/bin/:${SOURCE_ROOT}/apache-maven-3.8.8/bin:${PATH}" jar -xvf "${LEVELDBJNI_HOME}/leveldbjni-linux64-s390x/target/leveldbjni-linux64-s390x-1.8.jar"
export LD_LIBRARY_PATH="${LEVELDBJNI_HOME}/META-INF/native/linux64/s390x${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
2.6. Build AirCompressor
cd "${SOURCE_ROOT}"
git clone -b "0.27" --single-branch https://github.com/airlift/aircompressor.git
cd aircompressor
curl -sSL "https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/3.5.4/patch/aircompressor.diff" | git apply -
PATH="${SOURCE_ROOT}/apache-maven-3.8.8/bin:${PATH}" mvn install -B -V -DskipTests -Dair.check.skip-all
2.7. Set Environment Variables
export LANG="C.UTF-8"
3. Build Apache Spark
3.1. Clone Spark Repository
cd "${SOURCE_ROOT}"
git clone -b v3.5.4 --depth 1 https://github.com/apache/spark.git
3.2. Apply Patches to Fix Known Issues
cd "${SOURCE_ROOT}/spark"
curl -sSL "https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/3.5.4/patch/spark.diff" | git apply -
curl -sSL "https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/3.5.4/patch/disabledTests.diff" | git apply -
curl -sSL https://patch-diff.githubusercontent.com/raw/apache/spark/pull/49606.patch | git apply -
3.3. Build Spark
cd "${SOURCE_ROOT}/spark"
./build/mvn -DskipTests clean install
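Once the build completes, a quick way to sanity-check the resulting binaries is to print the version and run one of the bundled examples in local mode; SparkPi with 10 partitions is used here purely as an illustration.
./bin/spark-submit --version
# Run a bundled example job as a smoke test
./bin/run-example SparkPi 10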
4. Run the test cases (Optional)
# Fix for TTY related issues when launching the Ammonite REPL in tests.
ORIG_TERM="$TERM"
export TERM=vt100
cd "${SOURCE_ROOT}/spark"
./build/mvn -B test -fn -pl '!sql/hive'
export TERM="$ORIG_TERM"
# The Ammonite REPL tests can leave the terminal in a bad state, so try to reset it to sane values.
stty sane || true
Note: Hive does not currently support big-endian systems so its test suites are skipped.
Note: Some test cases have been observed to fail intermittently. They should pass on rerun.
To run an individual Scala test:
cd "${SOURCE_ROOT}/spark"
build/mvn -pl :spark-connect_2.12 -Dtest=none -Dsuites="org.apache.spark.sql.connect.execution.ReattachableExecuteSuite @reattach after connection expired" test
To run an individual Java test:
cd "${SOURCE_ROOT}/spark"
build/mvn -pl :spark-streaming_2.12 -Dtest="org.apache.spark.streaming.JavaMapWithStateSuite#testBasicFunction" -DwildcardSuites=none test
Note: Some tests have been disabled because they load native-endian files that were generated on a little-endian system.
Note: Some tests may fail with distro-provided OpenJDK JVMs due to system-wide crypto-policy settings on RHEL and SLES.
If a test fails with an exception like:
java.security.cert.CertPathValidatorException: Algorithm constraints check failed on keysize limits: RSA 1024 bit key used with certificate: CN=spark, OU=spark, O=spark, L=spark, ST=spark, C=spark
set the crypto-policy to LEGACY with the command update-crypto-policies --set LEGACY and rerun the test.
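For example, on RHEL or SLES the active policy can be inspected, relaxed for the test run, and then restored; DEFAULT is assumed to be the policy you want to return to afterwards.
# Show the active system-wide crypto policy
update-crypto-policies --show
# Relax the policy so the 1024-bit test certificates are accepted
sudo update-crypto-policies --set LEGACY
# After the tests, restore the previous policy (DEFAULT is assumed here)
sudo update-crypto-policies --set DEFAULT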
5. Start Apache Spark Shell
cd "${SOURCE_ROOT}/spark"
./bin/spark-shell
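As a quick check, a short job can be piped into the shell in local mode; the statement below is only an illustrative example.
# Example: compute the sum of 0..999 in a non-interactive spark-shell session
echo 'spark.range(1000).selectExpr("sum(id) as total").show()' | ./bin/spark-shell --master local[2]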