Building Hadoop from source - shawfdong/hyades GitHub Wiki
In this article we describe how to build Apache Hadoop 2.5.2 from source on Hyades.
$ cd /scratch
$ wget http://apache.spinellicreations.com/hadoop/common/hadoop-2.5.2/hadoop-2.5.2-src.tar.gz
$ tar xvfz hadoop-2.5.2-src.tar.gz
$ cd hadoop-2.5.2-src
According to BUILDING.txt, the requirements for building Hadoop 2.5.2 are:
- Unix System
- JDK 1.6+
- Maven 3.0 or later
- Findbugs 1.3.9 (if running findbugs)
- ProtocolBuffer 2.5.0
- CMake 2.6 or newer (if compiling native code)
- Zlib devel (if compiling native code)
- openssl devel ( if compiling native hadoop-pipes )
- Internet connection for first build (to fetch all Maven and Hadoop dependencies)
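Before starting, it can help to confirm which of these tools are already installed and whether their versions meet the minimums. The following is a minimal sketch and is not part of BUILDING.txt: the helper name `version_ge` is hypothetical, and it assumes GNU `sort` with the `-V` (version sort) option.

```shell
#!/bin/sh
# version_ge A B: succeed if version A >= version B (hypothetical helper).
# Relies on GNU sort -V, which orders dotted version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report which of the required tools are present on PATH.
for tool in java mvn protoc cmake; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# Example comparison against the minimum Maven version from the list above.
if version_ge 3.2.3 3.0; then
    echo "Maven 3.2.3 satisfies the 3.0 minimum"
fi
```

The same helper can be used for the CMake minimum, for example `version_ge 2.8 2.6`.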
Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a Java development project's build, reporting and documentation from a central piece of information.
Download Maven 3.2.3 from one of the mirrors:
$ cd /scratch/
$ wget http://mirror.metrocast.net/apache/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
Unpack the tarball:
$ tar xvfz apache-maven-3.2.3-bin.tar.gz -C /pfs/sw/java
Create a module file (/pfs/sw/modulefiles/maven/3.2.3) that sets the following environment variables:
M2_HOME=/pfs/sw/java/apache-maven-3.2.3
PATH=$M2_HOME/bin:$PATH
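For reference, the module file has the same effect as the shell exports below. This is only a sketch for a shell session without Environment Modules; it is not the module file itself, and the paths are the ones used above.

```shell
# Shell equivalent of the maven/3.2.3 module file (a sketch, not the module file).
export M2_HOME=/pfs/sw/java/apache-maven-3.2.3
export PATH="$M2_HOME/bin:$PATH"

# Print the Maven version only if the binary is actually reachable on PATH.
if command -v mvn >/dev/null 2>&1; then
    mvn -version
fi
```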
Load the module:
$ module load maven
As of December 1st, 2014, the latest release of Protocol Buffers is 2.6.0. However, Hadoop 2.5.2 requires exactly Protocol Buffers 2.5.0!
Download Protocol Buffers 2.5.0:
$ cd /scratch/
$ wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2
Build and install Protocol Buffers 2.5.0:
$ module load python
$ tar xvfj protobuf-2.5.0.tar.bz2
$ cd protobuf-2.5.0
$ ./configure --prefix=/pfs/sw/serial/gcc/protobuf-2.5.0
$ make
$ make check
$ make install
Create a module file (/pfs/sw/modulefiles/protobuf/2.5.0) that sets the following environment variables:
PATH=/pfs/sw/serial/gcc/protobuf-2.5.0/bin:$PATH
PKG_CONFIG_PATH=/pfs/sw/serial/gcc/protobuf-2.5.0/lib/pkgconfig:$PKG_CONFIG_PATH
Load the module:
$ module load protobuf
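Because the build requires exactly version 2.5.0, it is worth checking which `protoc` is picked up after loading the module. The sketch below assumes that `protoc --version` prints a banner of the form `libprotoc 2.5.0`; the helper name `is_protoc_2_5_0` is hypothetical.

```shell
# is_protoc_2_5_0 STRING: succeed if STRING is the banner that
# `protoc --version` prints for release 2.5.0 (hypothetical helper).
is_protoc_2_5_0() {
    [ "$1" = "libprotoc 2.5.0" ]
}

if command -v protoc >/dev/null 2>&1; then
    banner=$(protoc --version)
    if is_protoc_2_5_0 "$banner"; then
        echo "OK: $banner"
    else
        echo "Unexpected protoc version: $banner" >&2
    fi
else
    echo "protoc not found on PATH" >&2
fi
```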
CMake 2.8 is available in the CentOS 6 repositories:
$ yum install cmake
$ cd /scratch/hadoop-2.5.2-src
Create binary distribution with native libraries:
$ mvn package -Pdist,native -DskipTests=true -Dtar
NOTE: there is a typo in the Hadoop documentation. The option to skip tests should be -DskipTests, not -Dskiptests!
The resulting distribution is stored in /scratch/hadoop-2.5.2-src/hadoop-dist/target/.
In order to improve performance, Hadoop tries to load native implementations of certain components[3]. These components are available in dynamically-linked native libraries, located in the lib/native directory. Although the native libraries provided by the official Hadoop 2.5.2 release are 64-bit, they are linked with GLIBC_2.14. The glibc on RHEL/CentOS 6, however, is version 2.12. Thus the stock native libraries can't be loaded and we'll get the following warning:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
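One way to confirm the mismatch is to compare the highest glibc symbol version a native library references with the glibc installed on the system. The sketch below makes two assumptions: the library to inspect is `libhadoop.so` in the official release's `lib/native` directory, and `objdump` from binutils is available. The helper name `max_glibc_version` is hypothetical.

```shell
# max_glibc_version: read `objdump -T` output on stdin and print the highest
# GLIBC_x.y symbol version it references (hypothetical helper).
max_glibc_version() {
    grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}

# Assumed location of the stock native library.
lib=/pfs/sw/bigdata/hadoop-2.5.2/lib/native/libhadoop.so
if [ -r "$lib" ] && command -v objdump >/dev/null 2>&1; then
    echo "libhadoop.so needs glibc >= $(objdump -T "$lib" | max_glibc_version)"
    # The first line of `ldd --version` reports the system glibc version.
    ldd --version | head -n 1
fi
```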
Let's fix it by overwriting the native libraries in the official Hadoop 2.5.2 release with the ones we just built:
# cp /scratch/hadoop-2.5.2-src/hadoop-dist/target/hadoop-2.5.2/lib/native/* /pfs/sw/bigdata/hadoop-2.5.2/lib/native/
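After copying, the result can be checked with Hadoop's `checknative` subcommand, which reports whether each native component loads. This is a sketch that assumes the installation path used in the copy step; the variable name `hadoop_dir` is only for illustration.

```shell
# Assumes the Hadoop installation targeted by the copy step above.
hadoop_dir=/pfs/sw/bigdata/hadoop-2.5.2

if [ -x "$hadoop_dir/bin/hadoop" ]; then
    # Lists the native components and whether each one could be loaded.
    "$hadoop_dir/bin/hadoop" checknative -a
else
    echo "hadoop not found under $hadoop_dir" >&2
fi
```

If the rebuilt libraries load correctly, the NativeCodeLoader warning should no longer appear when running Hadoop commands.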