ODP Command Line Installation

1. Preparing to Install ODP Manually

This chapter describes how to prepare to install Open Source Data Platform (ODP) manually. You must complete the following tasks before you deploy a Hadoop cluster using ODP:

  1. Meeting Minimum System Requirements
  2. Configuring Remote Repositories
  3. Deciding on a Deployment Type
  4. Collect Information
  5. Prepare the Environment
  6. Download Companion Files
  7. Define Environment Parameters
  8. [Optional] Create System Users and Groups
  9. Determining ODP Memory Configuration Settings
  10. Allocating Adequate Log Space for ODP
  11. Download ODP Maven Artifacts

Important

See the ODP Release Notes for the ODP 3.2.2.0-1 repo information.

1.1. Meeting Minimum System Requirements

To run Open Source Data Platform, your system must meet minimum requirements.

1.1.1. Hardware Recommendations

Although there is no single hardware requirement for installing ODP, there are some basic guidelines. A complete installation of ODP 3.2.2 consumes about 8 GB of disk space.

1.1.2. Operating System Requirements

Refer to the Acceldata Support Matrix for information regarding supported operating systems.

1.1.3. Software Requirements

You must install the following software on each of your hosts:

  • apt-get (for Ubuntu 18/20)
  • chkconfig (Ubuntu 18/20)
  • curl
  • reposync
  • rpm (for RHEL, CentOS 7)
  • scp
  • tar
  • unzip
  • wget
  • yum (for RHEL or CentOS 7)

In addition, if you are creating local mirror repositories as part of the installation process and you are using RHEL or CentOS 7, you need the following utilities on the mirror repo server:

  • createrepo
  • reposync
  • yum-utils

See Deploying ODP in Production Data Centers with Firewalls.

1.1.4. JDK Requirements

Your system must have the correct Java Development Kit (JDK) installed on all cluster nodes.

Refer to the Support Matrix for information regarding supported JDKs.

Important

Before enabling Kerberos in the cluster, you must deploy the Java Cryptography Extension (JCE) security policy files on all hosts in the cluster. See Installing the JCE for more information.

The following sections describe how to install and configure the JDK.

1.1.4.1. Manually Installing Oracle JDK 1.8

Use the following instructions to manually install JDK 1.8:

  1. If you do not have a /usr/java directory, create one:

    mkdir /usr/java

  2. Download the Oracle 64-bit JDK (jdk-8u202-linux-x64.tar.gz) from the Oracle download site: open a web browser and navigate to http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.

  3. Copy the downloaded jdk-8u202-linux-x64.tar.gz file to the /usr/java directory.

  4. Navigate to the /usr/java directory and extract the jdk-8u202-linux-x64.tar.gz file:

    cd /usr/java
    tar zxvf jdk-8u202-linux-x64.tar.gz

    The JDK files are extracted into a /usr/java/jdk1.8.0_202 directory.

  5. Create a symbolic link (symlink) to the JDK:

    ln -s /usr/java/jdk1.8.0_202 /usr/java/default

  6. Set the JAVA_HOME and PATH environment variables:

    export JAVA_HOME=/usr/java/default 
    export PATH=$JAVA_HOME/bin:$PATH 

  7. Verify that Java is installed in your environment:

    java -version

You should see output similar to the following:

java version "1.8.0_202" 
Java(TM) SE Runtime Environment (build 1.8.0_202-b01) 
Java HotSpot(TM) 64-Bit Server VM (build 24.67-b01, mixed mode)
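
To make these settings persistent across logins, you can optionally add the exports to a system-wide profile script. The following is a minimal sketch that assumes a RHEL/CentOS-style /etc/profile.d directory and the /usr/java/default symlink created above:

echo 'export JAVA_HOME=/usr/java/default' > /etc/profile.d/java.sh   # assumed location; adjust for your distribution
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile.d/java.sh
source /etc/profile.d/java.sh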

1.1.4.2. Manually Installing the JCE

Unless you are using OpenJDK with unlimited-strength JCE, you must manually install the Java Cryptography Extension (JCE) security policy files on all hosts in the cluster:

  1. Obtain the JCE policy file appropriate for the JDK version in your cluster:
  • Oracle JDK 1.8

https://www.oracle.com/java/technologies/javase-jce8-downloads.html

  2. Save the policy file archive in a temporary location.

  3. On each host in the cluster, add the unlimited security policy JCE jars to $JAVA_HOME/jre/lib/security/.

For example, run the following command to extract the policy jars into the JDK installed on your host:

unzip -o -j -q jce_policy-8.zip -d /usr/jdk64/jdk1.8.0_202/jre/lib/security/
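
To confirm that the policy files are in place, you can list the security directory; this check assumes the example JDK path used above and the jar names shipped in jce_policy-8.zip:

ls -l /usr/jdk64/jdk1.8.0_202/jre/lib/security/local_policy.jar /usr/jdk64/jdk1.8.0_202/jre/lib/security/US_export_policy.jar   # both jars should be present on every host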

1.1.5. Metastore Database Requirements

If you are installing the Apache Hive and HCatalog, Apache Oozie, Hue, or Apache Ranger projects, you must install a database to store metadata information in the metastore. You can either use an existing database instance or install a new instance manually.

Refer to the Support Matrix for information regarding supported metastore databases.

The following sections describe how to install and configure the metastore database.

1.1.5.1. Metastore Database Prerequisites

The database administrator must create the following users and specify the following values:

  • For Apache Hive: hive_dbname, hive_dbuser, and hive_dbpasswd.
  • For Apache Oozie: oozie_dbname, oozie_dbuser, and oozie_dbpasswd.

Note

By default, Hive uses the Derby database for the metastore. However, Derby is not supported for production systems.

  • For Hue: Hue user name and Hue user password
  • For Apache Ranger: RANGER_ADMIN_DB_NAME

1.1.5.2. Installing and Configuring PostgreSQL

The following instructions explain how to install PostgreSQL as the metastore database. See your third-party documentation for instructions on how to install other supported databases.

Important

Prior to using PostgreSQL as your Hive metastore, consult the official PostgreSQL documentation and ensure you are using a JDBC 4+ driver that corresponds to your implementation of PostgreSQL.

1.1.5.2.1. Installing PostgreSQL on RHEL and CentOS

Use the following instructions to install a new instance of PostgreSQL:

  1. Using a terminal window, connect to the host machine where you plan to deploy a PostgreSQL instance:

yum install postgresql-server

  2. Start the instance:

/etc/init.d/postgresql start

For some newer versions of PostgreSQL, you might need to execute the command /etc/init.d/postgresql initdb.

  3. Reconfigure PostgreSQL server:
  • Edit the /var/lib/pgsql/data/postgresql.conf file.

Change the value of #listen_addresses = 'localhost' to listen_addresses = '*'.

  • Edit the /var/lib/pgsql/data/postgresql.conf file.

Remove comments from the "port = " line and specify the port number (default 5432).

  • Edit the /var/lib/pgsql/data/pg_hba.conf file by adding the following:

host all all 0.0.0.0/0 trust

  • If you are using PostgreSQL v9.1 or later, add the following to the /var/lib/pgsql/data/postgresql.conf file:

standard_conforming_strings = off

  4. Create users for PostgreSQL server by logging in as the root user and entering the following syntax:

echo "CREATE DATABASE $dbname;" | sudo -u $postgres psql -U postgres 
echo "CREATE USER $user WITH PASSWORD '$passwd';" | sudo -u $postgres psql -U postgres 
echo "GRANT ALL PRIVILEGES ON DATABASE $dbname TO $user;" | sudo -u $postgres psql -U postgres 

The previous syntax should have the following values:
• $postgres is the postgres user.
• $user is the user you want to create.
• $dbname is the name of your PostgreSQL database.

Note

For access to the Hive metastore, you must create hive_dbuser after Hive has been installed, and for access to the Oozie metastore, you must create oozie_dbuser after Oozie has been installed.

  5. On the Hive metastore host, install the connector:

yum install postgresql-jdbc*

  6. Confirm that the .jar file is in the Java share directory:

ls -l /usr/share/java/postgresql-jdbc.jar
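
As a concrete illustration of step 4, the following sketch creates a Hive metastore database and user after Hive has been installed; hive_db, hive, and hivepassword are placeholder values, not required names:

echo "CREATE DATABASE hive_db;" | sudo -u postgres psql -U postgres                          # placeholder database name
echo "CREATE USER hive WITH PASSWORD 'hivepassword';" | sudo -u postgres psql -U postgres    # placeholder user and password
echo "GRANT ALL PRIVILEGES ON DATABASE hive_db TO hive;" | sudo -u postgres psql -U postgres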

1.1.5.2.2. Installing PostgreSQL on Ubuntu 18/20

To install a new instance of PostgreSQL:

  1. Connect to the host machine where you plan to deploy PostgreSQL instance. At a terminal window, enter:

apt-get install postgresql

  2. Start the instance.

Note

For some newer versions of PostgreSQL, you might need to execute the command:

/etc/init.d/postgresql initdb

  3. Reconfigure PostgreSQL server:
  • Edit the /var/lib/pgsql/data/postgresql.conf file.

Change the value of #listen_addresses = 'localhost' to listen_addresses = '*'

  • Edit the /var/lib/pgsql/data/postgresql.conf file.

Change the port setting from #port = 5432 to port = 5432

  • Edit the /var/lib/pgsql/data/pg_hba.conf

Add the following:

host all all 0.0.0.0/0 trust

  • Optional: If you are using PostgreSQL v9.1 or later, add the following to the /var/lib/pgsql/data/postgresql.conf file:

standard_conforming_strings = off

  4. Create users for PostgreSQL server.

Log in as the root user and enter:

echo "CREATE DATABASE $dbname;" | sudo -u $postgres psql -U postgres 
echo "CREATE USER $user WITH PASSWORD '$passwd';" | sudo -u $postgres psql -U postgres 
echo "GRANT ALL PRIVILEGES ON DATABASE $dbname TO $user;" | sudo -u $postgres psql -U postgres

Where: $postgres is the postgres user, $user is the user you want to create, and $dbname is the name of your PostgreSQL database.

Note

For access to the Hive metastore, create hive_dbuser after Hive has been installed, and for access to the Oozie metastore, create oozie_dbuser after Oozie has been installed.

  5. On the Hive Metastore host, install the connector.

apt-get install -y libpostgresql-jdbc-java

  6. Copy the connector .jar file to the Java share directory.

cp /usr/share/java/postgresql-*jdbc3.jar /usr/share/java/postgresql-jdbc.jar

  7. Confirm that the .jar is in the Java share directory.

ls /usr/share/java/postgresql-jdbc.jar

  8. Change the access mode of the .jar file to 644.

chmod 644 /usr/share/java/postgresql-jdbc.jar
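
Optionally, you can verify that the PostgreSQL server is running and accepting local connections; this is a minimal check, not a required step:

sudo -u postgres psql -c "SELECT version();"   # prints the PostgreSQL version if the server is reachable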

1.1.5.3. Installing and Configuring MariaDB

This section describes how to install MariaDB as the metastore database. For instructions on how to install other supported databases, see your third-party documentation.

For additional information regarding MariaDB, see MariaDB.

1.1.5.3.1. Installing MariaDB on RHEL and CentOS

Important

If you are installing on CentOS or RHEL, it is highly recommended that you install from a repository using yum.

Follow these steps to install a new instance of MariaDB on RHEL and CentOS:

  1. There are YUM repositories for several YUM-based Linux distributions. Use the MariaDB Downloads page to generate the YUM repository.

  2. Move the MariaDB repo file to the directory /etc/yum.repos.d/.

It is suggested that you name your file MariaDB.repo.

The following is an example MariaDB.repo file for CentOS 7:

[mariadb] 
name=MariaDB 
baseurl=http://yum.mariadb.org/10.1/centos7-amd64 
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB 
gpgcheck=1 

In this example, the gpgkey line automatically fetches the GPG key that is used to sign the repositories. gpgkey enables yum and rpm to verify the integrity of the packages that they download. The ID of MariaDB's signing key is 0xcbcb082a1bb943db. The short form of the ID is 0x1BB943DB, and the full key fingerprint is: 1993 69E5 404B D5FC 7D2F E43B CBCB 082A 1BB9 43DB.

If you want to fix the version to an older version, follow the instructions on Adding the MariaDB YUM Repository.

  3. If you do not have the MariaDB GPG signing key installed, YUM prompts you to install it after downloading the packages. If you are prompted to do so, install the MariaDB GPG signing key.

  4. Use the following command to install MariaDB:

sudo yum install MariaDB-server MariaDB-client

  5. If you already have the MariaDB-Galera-server package installed, you might need to remove it prior to installing MariaDB-server. If you need to remove MariaDB-Galera-server, use the following command:

sudo yum remove MariaDB-Galera-server

No databases are removed when the MariaDB-Galera-server rpm package is removed, though with any upgrade, it is best to have backups.

  6. Install MariaDB with YUM by following the directions at Enabling MariaDB.

  7. Use one of the following commands to start MariaDB:

  • If your system is using systemctl:

sudo systemctl start mariadb

  • If your system is not using systemctl:

sudo /etc/init.d/mysql start
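
Optionally, you can confirm that MariaDB started and harden the default installation; mysql_secure_installation is the standard MariaDB helper script, shown here only as a suggested follow-up:

sudo systemctl status mariadb      # on systemctl systems, confirm the service is active
sudo mysql_secure_installation     # optional: set the root password and remove test data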

1.1.5.4. Installing and Configuring MySQL

This section describes how to install MySQL as the metastore database. For instructions on how to install other supported databases, see your third-party documentation.

Important

When you use MySQL as your Hive metastore, you must use the mysql-connector-java-5.1.35.zip or later JDBC driver.

1.1.5.4.1. Installing MySQL on RHEL and CentOS

To install a new instance of MySQL:

  1. Connect to the host machine you plan to use for Hive and HCatalog.

  2. Install MySQL server.

From a terminal window, enter:

yum install mysql-community-release

For CentOS 7, install MySQL server from the ODP-Utils repository.

  3. Start the instance.

/etc/init.d/mysqld start

  4. Set the root user password using the following command format:

mysqladmin -u root password $mysqlpassword

For example, use the following command to set the password to "root":

mysqladmin -u root password root

  5. Remove unnecessary information from log and STDOUT:

mysqladmin -u root 2>&1> /dev/null

  6. Log in to MySQL as the root user:

mysql -u root -proot

In this syntax, "root" is the root user password.

  7. Log in as the root user, create the "dbuser" user, and grant it adequate privileges:
[root@c6402 /]# mysql -u root -proot 
Welcome to the MySQL monitor. Commands end with ; or \g. 
Your MySQL connection id is 11 
Server version: 5.1.73 Source distribution 
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved. 
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. 
Type 'help;' or '\h' for help. Type '\c' to clear the current input  statement. 
mysql> CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'dbuser';
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost';
Query OK, 0 rows affected (0.00 sec) 
mysql> CREATE USER 'dbuser'@'%' IDENTIFIED BY 'dbuser'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> FLUSH PRIVILEGES; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost' WITH GRANT  OPTION; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec) 
mysql>
  8. Use the exit command to exit MySQL.

  9. You should now be able to reconnect to the database as "dbuser" by using the following command:

mysql -u dbuser -pdbuser

After testing the dbuser login, use the exit command to exit MySQL.

  10. Install the MySQL connector .jar file:

yum install mysql-connector-java*
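
You can then confirm that the connector was installed; on most RHEL/CentOS systems the package places the driver in the Java share directory, as in this sketch:

ls -l /usr/share/java/mysql-connector-java.jar   # the JDBC driver Hive uses for its MySQL metastore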

1.1.5.4.2. Installing MySQL on Ubuntu 18/20

To install a new instance of MySQL:

  1. Connect to the host machine you plan to use for Hive and HCatalog.

  2. Install MySQL server.

From a terminal window, enter:

apt-get install mysql-server

  3. Start the instance.

/etc/init.d/mysql start

  4. Set the root user password using the following command format:

mysqladmin -u root password $mysqlpassword

For example, to set the password to "root":

mysqladmin -u root password root

  5. Remove unnecessary information from log and STDOUT.

mysqladmin -u root 2>&1> /dev/null

  6. Log in to MySQL as the root user:

mysql -u root -proot

  7. Log in as the root user, create the dbuser, and grant it adequate privileges. This user provides access to the Hive metastore. Use the following series of commands (shown here with the returned responses) to create dbuser with password dbuser.
[root@c6402 /]# mysql -u root -proot 
Welcome to the MySQL monitor. Commands end with ; or \g. 
Your MySQL connection id is 11 
Server version: 5.1.73 Source distribution 
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved. 
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. 
Type 'help;' or '\h' for help. Type '\c' to clear the current input  statement. 
mysql> CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'dbuser'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> CREATE USER 'dbuser'@'%' IDENTIFIED BY 'dbuser'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> FLUSH PRIVILEGES; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost' WITH GRANT  OPTION; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%' WITH GRANT OPTION; 
Query OK, 0 rows affected (0.00 sec) 
mysql> 
  8. Use the exit command to exit MySQL.

  9. You should now be able to reconnect to the database as dbuser, using the following command:

mysql -u dbuser -pdbuser

After testing the dbuser login, use the exit command to exit MySQL.

  10. Install the MySQL connector JAR file.

apt-get install mysql-connector-java*
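
After the dbuser account is working, a typical next step is to create the metastore database that Hive will use; the following sketch uses the example dbuser credentials from step 7 and a placeholder database name, hive_metastore:

mysql -u dbuser -pdbuser -e "CREATE DATABASE hive_metastore;"   # hive_metastore is a placeholder name
mysql -u dbuser -pdbuser -e "SHOW DATABASES;"                   # confirm the database was created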

1.1.5.5. Configuring Oracle as the Metastore Database

You can select Oracle as the metastore database. For instructions on how to install the databases, see your third-party documentation. To configure Oracle as the Hive Metastore, install ODP and Hive, and then follow the instructions in "Set up Oracle DB for use with Hive Metastore" in this guide.

1.2. Virtualisation and Cloud Platforms

Open Source Data Platform (ODP) is certified and supported when running on virtual or cloud platforms (for example, VMware vSphere or Amazon Web Services EC2) if the respective guest operating system is supported by ODP and any issues detected on these platforms are reproducible on the same supported operating system installed elsewhere.

See the Support Matrix for the list of supported operating systems for ODP.

1.3. Configuring Remote Repositories

The standard ODP install fetches the software from a remote yum repository over the Internet. To use this option, you must set up access to the remote repository and have an available Internet connection for each of your hosts. To download the ODP maven artifacts and build your own repository, see Download the ODP Maven Artifacts.

Important

See the ODP Release Notes for the ODP 3.2.2.0-1 repo information.

Note

If your cluster does not have access to the Internet, or if you are creating a large cluster and you want to conserve bandwidth, you can instead provide a local copy of the ODP repository that your hosts can access.

  • RHEL/CentOS 7

wget -nv http://public-repo-1.acceldata.com/ODP/centos7/3.2.2.0-1/odp.repo -O /etc/yum.repos.d/odp.repo

  • Ubuntu 18/20

wget http://public-repo-1.acceldata.com/ODP/ubuntu<version>/3.2.2.0-1/odp.list -O /etc/apt/sources.list.d/odp.list 
apt-get update
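
After adding the repository definition, you can optionally confirm that your hosts can see it; a minimal check, assuming the hadoop package name used later in this guide:

yum repolist              # RHEL/CentOS 7: the ODP repository should appear in the output
apt-cache policy hadoop   # Ubuntu 18/20: should list a candidate package from the ODP repository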

1.4. Deciding on a Deployment Type

While it is possible to deploy all of ODP on a single host, you should use at least four hosts: one master host and three slaves.

1.5. Collect Information

To deploy your ODP instance, you need the following information:

  • The fully qualified domain name (FQDN) for each host in your system, and the components you want to set up on each host. You can use hostname -f to check for the FQDN.

  • If you install Apache Hive, HCatalog, or Apache Oozie, you need the host name, database name, user name, and password for the metastore instance.

Note

If you are using an existing instance, the dbuser you create for ODP must be granted ALL PRIVILEGES permissions on that instance.
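
For example, on a MySQL instance the grant might look like the following; dbuser and the '%' host wildcard are placeholders that should match the account you actually created:

GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%';   # run from the mysql client as the root user
FLUSH PRIVILEGES;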

1.6. Prepare the Environment

To deploy your ODP instance, you must prepare your deployment environment:

  • Enable NTP on Your Cluster
  • Disable SELinux
  • Disable IPTables

1.6.1. Enable NTP on Your Cluster

The clocks of all the nodes in your cluster must be synchronized. If your system does not have access to the Internet, you should set up a master node as an NTP server to achieve this synchronization.

Use the following instructions to enable NTP for your cluster:

  1. Configure NTP clients by executing the following command on each node in your cluster:
  • For RHEL/CentOS 7:

a. Configure the NTP clients:

yum install ntp

b. Enable the service:

systemctl enable ntpd

c. Start NTPD:

systemctl start ntpd

  2. Enable the service by executing the following command on each node in your cluster:
  • For RHEL/CentOS

chkconfig ntpd on

  • For Ubuntu 18/20:

chkconfig ntp on

  3. Start NTP. Execute the following command on all the nodes in your cluster:
  • For RHEL/CentOS 7:

/etc/init.d/ntpd start

  • For Ubuntu 18/20

/etc/init.d/ntp start

  4. If you want to use an existing NTP server in your environment, complete the following steps:

    a. Configure the firewall on the local NTP server to enable UDP input traffic on Port 123 and replace 192.168.1.0/24 with the IP addresses in the cluster, as shown in the following example using RHEL hosts:

    # iptables -A RH-Firewall-1-INPUT -s 192.168.1.0/24 -m state --state NEW -p udp --dport 123 -j ACCEPT

    b. Save and restart iptables. Execute the following command on all the nodes in your cluster:

     # service iptables save 
     # service iptables restart 
    

    c. Finally, configure clients to use the local NTP server. Edit the /etc/ntp.conf file and add the following line:

    server $LOCAL_SERVER_IP OR HOSTNAME
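
After the clients restart ntpd, you can optionally confirm that each node is synchronizing with the intended server; a minimal check:

ntpq -p   # the configured NTP server should appear in the peer list with a nonzero reach value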
    

1.6.2. Disable SELinux

The Security-Enhanced Linux (SELinux) feature should be disabled during the installation process.

  1. Check the state of SELinux. On all the host machines, execute the following command:

getenforce

If the command returns "disabled" or "permissive" as the response, no further actions are required. If the result is "enforcing", proceed to Step 2.

  2. Disable SELinux either temporarily for each session or permanently.
  • Disable SELinux temporarily by executing the following command:

setenforce 0

  • Disable SELinux permanently in the /etc/sysconfig/selinux file by changing the value of the SELINUX field to permissive or disabled. Restart your system.
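
As a sketch of the permanent change, the following command edits the SELINUX field in place; it assumes the file path named above and requires a restart to take effect:

sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/sysconfig/selinux   # or set it to permissive, then reboot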

1.6.3. Disable IPTables

Because certain ports must be open and available during installation, you should temporarily disable iptables. If the security protocols at your installation do not allow you to disable iptables, you can proceed with iptables enabled, provided all of the relevant ports are open and available; otherwise, the cluster installation fails.

  • On all RHEL/CentOS 6 host machines, execute the following commands to disable iptables:
chkconfig iptables off 
service iptables stop

Restart iptables after your setup is complete.

  • On RHEL/CENTOS 7 host machines, execute the following commands to disable firewalld:
systemctl stop firewalld
systemctl mask firewalld

Restart firewalld after your setup is complete.

On Ubuntu 18/20 host machines, execute the following command to disable the ufw firewall:

service ufw stop

Restart ufw after your setup is complete.

Important

If you leave iptables enabled and do not set up the necessary ports, the cluster installation fails.

1.7. Download Companion Files

You can download and extract a set of companion files, including script files and configuration files, that you can then modify to match your own cluster environment:

To download and extract the files:

wget http://public-repo-1.acceldata.com/ODP/tools/3.2.2.0-1/odp_manual_install_rpm_helper_files-3.2.2.0.1.tar.gz 
tar zxvf odp_manual_install_rpm_helper_files-3.2.2.0.1.tar.gz

Important

See the ODP Release Notes for the ODP 3.2.2.0 repo information.

1.8. Define Environment Parameters

You must set up specific users and directories for your ODP installation by using the following instructions:

  1. Define directories.

The following table describes the directories you need for installation, configuration, data storage, process IDs, and log information based on the Apache Hadoop Services you plan to install. Use this table to define what you are going to use to set up your environment.

Note

The scripts.zip file that you downloaded in the supplied companion files includes a script, directories.sh, for setting directory environment parameters.

You should edit and source this file (or copy its contents to your ~/.bash_profile) to set up these environment variables in your environment.

Table 1.1. Directories Needed to Install Core Hadoop

Hadoop Service Parameter Definition
HDFS DFS_NAME_DIR Space-separated list of directories where NameNode should store the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn.
HDFS DFS_DATA_DIR Space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn.
HDFS FS_CHECKPOINT_DIR Space separated list of directories where SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn
HDFS HDFS_LOG_DIR Directory for storing the HDFS logs. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs, where hdfs is the $HDFS_USER
HDFS HDFS_PID_DIR Directory for storing the HDFS process ID. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs, where hdfs is the $HDFS_USER
HDFS HADOOP_CONF_DIR Directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf
YARN YARN_LOCAL_DIR Space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn /grid1/hadoop/yarn /grid2/hadoop/yarn
YARN YARN_LOG_DIR Directory for storing the YARN logs. For example, /var/log/hadoop/yarn. This directory name is a combination of a directory and the $YARN_USER. In the example yarn is the $YARN_USER.
YARN YARN_LOCAL_LOG_DIR Space-separated list of directories where YARN stores container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/log.
YARN YARN_PID_DIR Directory for storing the YARN process ID. For example, /var/run/hadoop/yarn. This directory name is a combination of a directory and the $YARN_USER. In the example, yarn is the $YARN_USER.
MapReduce MAPRED_LOG_DIR Directory for storing the JobHistory Server logs. For example, /var/log/hadoop/mapred. This directory name is a combination of a directory and the $MAPRED_USER. In the example, mapred is the $MAPRED_USER

Table 1.2. Directories Needed to Install Ecosystem Components

Hadoop Service Parameter Definition
Oozie OOZIE_CONF_DIR Directory to store the Oozie configuration files. For example, /etc/oozie/conf.
Oozie OOZIE_DATA Directory to store the Oozie data. For example, /var/db/oozie.
Oozie OOZIE_LOG_DIR Directory to store the Oozie logs. For example, /var/log/oozie.
Oozie OOZIE_PID_DIR Directory to store the Oozie process ID. For example, /var/run/oozie.
Oozie OOZIE_TMP_DIR Directory to store the Oozie temporary files. For example, /var/tmp/oozie.
Hive HIVE_CONF_DIR Directory to store the Hive configuration files. For example, /etc/hive/conf.
Hive HIVE_LOG_DIR Directory to store the Hive logs. For example, /var/log/hive.
Hive HIVE_PID_DIR Directory to store the Hive process ID. For example, /var/run/hive.
WebHCat WEBHCAT_CONF_DIR Directory to store the WebHCat configuration files. For example, /etc/hcatalog/conf/webhcat.
WebHCat WEBHCAT_LOG_DIR Directory to store the WebHCat logs. For example, /var/log/webhcat.
WebHCat WEBHCAT_PID_DIR Directory to store the WebHCat process ID. For example, /var/run/webhcat.
HBase HBASE_CONF_DIR Directory to store the Apache HBase configuration files. For example, /etc/hbase/conf.
HBase HBASE_LOG_DIR Directory to store the HBase logs. For example, /var/log/hbase.
HBase HBASE_PID_DIR Directory to store the HBase process ID. For example, /var/run/hbase.
ZooKeeper ZOOKEEPER_DATA_DIR Directory where Apache ZooKeeper stores data. For example, /grid/hadoop/zookeeper/data
ZooKeeper ZOOKEEPER_CONF_DIR Directory to store the ZooKeeper configuration files. For example, /etc/zookeeper/conf.
ZooKeeper ZOOKEEPER_LOG_DIR Directory to store the ZooKeeper logs. For example, /var/log/zookeeper.
ZooKeeper ZOOKEEPER_PID_DIR Directory to store the ZooKeeper process ID. For example, /var/run/zookeeper.
Sqoop SQOOP_CONF_DIR Directory to store the Apache Sqoop configuration files. For example, /etc/sqoop/conf.

If you use the companion files, the following screen provides a snapshot of how your directories.sh file should look after you edit the TODO variables:

#!/bin/sh 

# 
# Directories Script 
# 
# 1. To use this script, you must edit the TODO variables below for your  environment. 
# 
# 2. Warning: Leave the other parameters as the default values. Changing  these default values requires you to 
# change values in other configuration files. 
# 

# 
# Hadoop Service - HDFS 
# 

# Space separated list of directories where NameNode stores the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn 
DFS_NAME_DIR="TODO-LIST-OF-NAMENODE-DIRS"; 

# Space separated list of directories where DataNodes store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn 
DFS_DATA_DIR="TODO-LIST-OF-DATA-DIRS"; 

# Space separated list of directories where SecondaryNameNode stores the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn 
FS_CHECKPOINT_DIR="TODO-LIST-OF-SECONDARY-NAMENODE-DIRS"; 

# Directory to store the HDFS logs. 
HDFS_LOG_DIR="/var/log/hadoop/hdfs";

# Directory to store the HDFS process ID. 
HDFS_PID_DIR="/var/run/hadoop/hdfs"; 

# Directory to store the Hadoop configuration files. 
HADOOP_CONF_DIR="/etc/hadoop/conf"; 

# 
# Hadoop Service - YARN 
# 

# Space separated list of directories where YARN stores temporary data. For  example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local 
YARN_LOCAL_DIR="TODO-LIST-OF-YARN-LOCAL-DIRS"; 

# Directory to store the YARN logs. 
YARN_LOG_DIR="/var/log/hadoop/yarn";

# Space separated list of directories where YARN stores container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs 
YARN_LOCAL_LOG_DIR="TODO-LIST-OF-YARN-LOCAL-LOG-DIRS"; 

# Directory to store the YARN process ID. 
YARN_PID_DIR="/var/run/hadoop/yarn"; 

# 
# Hadoop Service - MAPREDUCE 
# 

# Directory to store the MapReduce daemon logs. 
MAPRED_LOG_DIR="/var/log/hadoop/mapred"; 
# Directory to store the MapReduce JobHistory process ID. 
MAPRED_PID_DIR="/var/run/hadoop/mapred"; 

# 
# Hadoop Service - Hive 
# 

# Directory to store the Hive configuration files. 
HIVE_CONF_DIR="/etc/hive/conf"; 

# Directory to store the Hive logs. 
HIVE_LOG_DIR="/var/log/hive"; 

# Directory to store the Hive process ID. 
HIVE_PID_DIR="/var/run/hive"; 

# 
# Hadoop Service - WebHCat (Templeton) 
# 

# Directory to store the WebHCat (Templeton) configuration files. 
WEBHCAT_CONF_DIR="/etc/hcatalog/conf/webhcat"; 

# Directory to store the WebHCat (Templeton) logs. 
WEBHCAT_LOG_DIR="/var/log/webhcat"; 

# Directory to store the WebHCat (Templeton) process ID.
WEBHCAT_PID_DIR="/var/run/webhcat"; 

# 
# Hadoop Service - HBase 
# 

# Directory to store the HBase configuration files. 
HBASE_CONF_DIR="/etc/hbase/conf"; 

# Directory to store the HBase logs. 
HBASE_LOG_DIR="/var/log/hbase"; 

# Directory to store the HBase process ID. 
HBASE_PID_DIR="/var/run/hbase";

# 
# Hadoop Service - ZooKeeper 
# 

# Directory where ZooKeeper stores data. For example, /grid1/hadoop/zookeeper/data 
ZOOKEEPER_DATA_DIR="TODO-ZOOKEEPER-DATA-DIR"; 

# Directory to store the ZooKeeper configuration files. 
ZOOKEEPER_CONF_DIR="/etc/zookeeper/conf"; 

# Directory to store the ZooKeeper logs. 
ZOOKEEPER_LOG_DIR="/var/log/zookeeper"; 

# Directory to store the ZooKeeper process ID. 
ZOOKEEPER_PID_DIR="/var/run/zookeeper"; 

# 
# Hadoop Service - Oozie 
# 

# Directory to store the Oozie configuration files. 
OOZIE_CONF_DIR="/etc/oozie/conf" 

# Directory to store the Oozie data. 
OOZIE_DATA="/var/db/oozie" 

# Directory to store the Oozie logs. 
OOZIE_LOG_DIR="/var/log/oozie" 

# Directory to store the Oozie process ID.
OOZIE_PID_DIR="/var/run/oozie" 

# Directory to store the Oozie temporary files. 
OOZIE_TMP_DIR="/var/tmp/oozie" 

# 
# Hadoop Service - Sqoop 
# 
SQOOP_CONF_DIR="/etc/sqoop/conf" 
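
After you replace the TODO values, source the edited script in every shell you use for the installation (or append its contents to your ~/.bash_profile); a minimal sketch, assuming directories.sh is in the current directory:

chmod +x directories.sh   # make the edited script executable (optional)
. ./directories.sh        # load the directory variables into the current shell
echo $HADOOP_CONF_DIR     # spot-check that a variable is set, for example /etc/hadoop/conf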


  2. The following table describes system user accounts and groups. Use this table to define what you are going to use in setting up your environment. These users and groups should reflect the accounts you create in Create System Users and Groups. The scripts.zip file you downloaded includes a script, usersAndGroups.sh, for setting user and group environment parameters.

Table 1.3. Define Users and Groups for Systems

Parameter Definition
HDFS_USER User that owns the Hadoop Distributed File System (HDFS) services. For example, hdfs.
YARN_USER User that owns the YARN services. For example, yarn.
ZOOKEEPER_USER User that owns the ZooKeeper services. For example, zookeeper.
HIVE_USER User that owns the Hive services. For example, hive.
WEBHCAT_USER User that owns the WebHCat services. For example, hcat.
HBASE_USER User that owns the HBase services. For example, hbase.
SQOOP_USER User owning the Sqoop services. For example, sqoop.
KAFKA_USER User owning the Apache Kafka services. For example, kafka.
OOZIE_USER User owning the Oozie services. For example, oozie.
HADOOP_GROUP A common group shared by services. For example, hadoop.
KNOX_USER User that owns the Knox Gateway services. For example, knox.

1.9. Creating System Users and Groups

In general, Apache Hadoop services should be owned by specific users and not by root or application users. The following table shows the typical users for Hadoop services. If you choose to install the ODP components using the RPMs, these users are automatically set up.

If you do not install with the RPMs, or want different users, then you must identify the users that you want for your Hadoop services and the common Hadoop group and create these accounts on your system.

To create these accounts manually, you must follow this procedure:

Add the user to the group.

useradd -G <groupname> <username>

Table 1.4. Typical System Users and Groups

Hadoop Service User Group
HDFS hdfs hadoop
YARN yarn hadoop
MapReduce mapred hadoop, mapred
Hive hive hadoop
HCatalog/WebHCatalog hcat hadoop
HBase hbase hadoop
Sqoop sqoop hadoop
ZooKeeper zookeeper hadoop
Oozie oozie hadoop
Knox Gateway knox hadoop
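
For example, a minimal sketch that creates the common group and a few of the accounts from Table 1.4, following the useradd pattern shown above; extend it with the remaining services you plan to install:

groupadd hadoop                   # common group shared by all Hadoop services
groupadd mapred                   # additional group for the MapReduce user
useradd -G hadoop hdfs            # HDFS service account
useradd -G hadoop yarn            # YARN service account
useradd -G hadoop,mapred mapred   # MapReduce service account, in both groups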

1.10. Determining ODP Memory Configuration Settings

You can use either of two methods to determine YARN and MapReduce memory configuration settings:

  • Running the YARN Utility Script
  • Calculating YARN and MapReduce Memory Requirements

The ODP utility script is the recommended method for calculating ODP memory configuration settings, but information about manually calculating YARN and MapReduce memory configuration settings is also provided for reference.

1.10.1. Running the YARN Utility Script

This section describes how to use the yarn-utils.py script to calculate YARN, MapReduce, Hive, and Tez memory allocation settings based on the node hardware specifications. The yarn-utils.py script is included in the ODP companion files. See Download Companion Files.

To run the yarn-utils.py script, execute the following command from the folder containing the script: python yarn-utils.py options, where options are as follows:

Table 1.5. yarn-utils.py Options

Option Description
-c CORES The number of cores on each host
-m MEMORY The amount of memory on each host, in gigabytes
-d DISKS The number of disks on each host
-k HBASE "True" if HBase is installed; "False" if not

Notes

The script requires Python 2.6 to run.

You can also use the -h or --help option to display a Help message that describes the options.

Example: Running the following command from the odp_manual_install_rpm_helper_files-3.2.2.0.$BUILD directory:

python yarn-utils.py -c 16 -m 64 -d 4 -k True

Returns

Using cores=16 memory=64GB disks=4 hbase=True 
Profile: cores=16 memory=49152MB reserved=16GB usableMem=48GB disks=4 Num Container=8 
Container Ram=6144MB 
Used Ram=48GB 
Unused Ram=16GB 
yarn.scheduler.minimum-allocation-mb=6144 
yarn.scheduler.maximum-allocation-mb=49152 
yarn.nodemanager.resource.memory-mb=49152 
mapreduce.map.memory.mb=6144 
mapreduce.map.java.opts=-Xmx4096m 
mapreduce.reduce.memory.mb=6144 
mapreduce.reduce.java.opts=-Xmx4096m 
yarn.app.mapreduce.am.resource.mb=6144 
yarn.app.mapreduce.am.command-opts=-Xmx4096m 
mapreduce.task.io.sort.mb=1792 
tez.am.resource.memory.mb=6144 
tez.am.launch.cmd-opts =-Xmx4096m 
hive.tez.container.size=6144 
hive.tez.java.opts=-Xmx4096m

1.10.2. Calculating YARN and MapReduce Memory Requirements

This section describes how to manually configure YARN and MapReduce memory allocation settings based on the node hardware specifications.

YARN takes into account all of the available compute resources on each machine in the cluster. Based on the available resources, YARN negotiates resource requests from applications running in the cluster, such as MapReduce. YARN then provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements such as memory and CPU.

In an Apache Hadoop cluster, it is vital to balance the use of memory (RAM), processors (CPU cores), and disks so that processing is not constrained by any one of these cluster resources. As a general recommendation, allowing for two containers per disk and per core gives the best balance for cluster utilization.

When determining the appropriate YARN and MapReduce memory configurations for a cluster node, you should start with the available hardware resources. Specifically, note the following values on each node:

  • RAM (amount of memory)
  • CORES (number of CPU cores)
  • DISKS (number of disks)

The total available RAM for YARN and MapReduce should take into account the Reserved Memory. Reserved memory is the RAM needed by system processes and other Hadoop processes (such as HBase):

reserved memory = stack memory reserve + HBase memory reserve (if HBase is on the same node)

You can use the values in the following table to determine what you need for reserved memory per node:

Table 1.6. Reserved Memory Recommendations

Total Memory per Node Recommended Reserved System Memory Recommended Reserved HBase Memory
4 GB 1 GB 1 GB
8 GB 2 GB 1 GB
16 GB 2 GB 2 GB
24 GB 4 GB 4 GB
48 GB 6 GB 8 GB
64 GB 8 GB 8 GB
72 GB 8 GB 8 GB
96 GB 12 GB 16 GB
128 GB 24 GB 24 GB
256 GB 32 GB 32 GB
512 GB 64 GB 64 GB

After you determine the amount of memory you need per node, you must determine the maximum number of containers allowed per node:

Number of containers = min (2 * CORES, 1.8 * DISKS, (total available RAM) / MIN_CONTAINER_SIZE)

DISKS is the value for dfs.data.dirs (number of data disks) per machine.

MIN_CONTAINER_SIZE is the minimum container size (in RAM). This value depends on the amount of RAM available; in smaller memory nodes, the minimum container size should also be smaller.

The following table provides the recommended values:

Table 1.7. Recommended Container Size Values

Total RAM per Node Recommended Minimum Container Size
Less than 4 GB 256 MB
Between 4 GB and 8 GB 512 MB
Between 8 GB and 24 GB 1024 MB
Above 24 GB 2048 MB

Finally, you must determine the amount of RAM per container:

RAM-per-container = max(MIN_CONTAINER_SIZE, (total available RAM) / containers)

Using the results of all the previous calculations, you can configure YARN and MapReduce.

Table 1.8. YARN and MapReduce Configuration Values

Configuration File Configuration Setting Value Calculation
yarn-site.xml yarn.nodemanager.resource.memory-mb = containers * RAM-per-container
yarn-site.xml yarn.scheduler.minimum-allocation-mb = RAM-per-container
yarn-site.xml yarn.scheduler.maximum-allocation-mb = containers * RAM-per-container
mapred-site.xml mapreduce.map.memory.mb = RAM-per-container
mapred-site.xml mapreduce.reduce.memory.mb = 2 * RAM-per-container
mapred-site.xml mapreduce.map.java.opts = 0.8 * RAM-per-container
mapred-site.xml mapreduce.reduce.java.opts = 0.8 * 2 * RAM-per-container
mapred-site.xml yarn.app.mapreduce.am.resource.mb = 2 * RAM-per-container
mapred-site.xml yarn.app.mapreduce.am.command-opts = 0.8 * 2 * RAM-per-container

Note

After installation, both yarn-site.xml and mapred-site.xml are located in the /etc/hadoop/conf folder.

Example: Assume that your cluster nodes have 12 CPU cores, 48 GB RAM, and 12 disks:

Reserved memory = 6 GB system memory reserve + 8 GB for HBase 
Minimum container size = 2 GB

If there is no HBase, then you can use the following calculation:

Number of containers = min (2 * 12, 1.8 * 12, (48-6)/2) = min (24, 21.6, 21) = 21 
RAM-per-container = max (2, (48-6)/21) = max (2, 2) = 2

Table 1.9. Example Value Calculations Without HBase

Configuration Value Calculation
yarn.nodemanager.resource.memory-mb = 21 * 2 = 42*1024 MB
yarn.scheduler.minimum-allocation-mb = 2*1024 MB
yarn.scheduler.maximum-allocation-mb = 21 * 2 = 42*1024 MB
mapreduce.map.memory.mb = 2*1024 MB
mapreduce.reduce.memory.mb = 2 * 2 = 4*1024 MB
mapreduce.map.java.opts = 0.8 * 2 = 1.6*1024 MB
mapreduce.reduce.java.opts = 0.8 * 2 * 2 = 3.2*1024 MB
yarn.app.mapreduce.am.resource.mb = 2 * 2 = 4*1024 MB
yarn.app.mapreduce.am.command-opts = 0.8 * 2 * 2 = 3.2*1024 MB

If HBase is included:

Number of containers = min (2 * 12, 1.8 * 12, (48-6-8)/2) = min (24, 21.6, 17) = 17 
RAM-per-container = max (2, (48-6-8)/17) = max (2, 2) = 2

Table 1.10. Example Value Calculations with HBase

Configuration Value Calculation
yarn.nodemanager.resource.memory-mb = 17 * 2 = 34*1024 MB
yarn.scheduler.minimum-allocation-mb = 2*1024 MB
yarn.scheduler.maximum-allocation-mb = 17 * 2 = 34*1024 MB
mapreduce.map.memory.mb = 2*1024 MB
mapreduce.reduce.memory.mb = 2 * 2 = 4*1024 MB
mapreduce.map.java.opts = 0.8 * 2 = 1.6*1024 MB
mapreduce.reduce.java.opts = 0.8 * 2 * 2 = 3.2*1024 MB
yarn.app.mapreduce.am.resource.mb = 2 * 2 = 4*1024 MB
yarn.app.mapreduce.am.command-opts = 0.8 * 2 * 2 = 3.2*1024 MB

Notes:

  • Updating values for yarn.scheduler.minimum-allocation-mb without also changing yarn.nodemanager.resource.memory-mb, or changing yarn.nodemanager.resource.memory-mb without also changing yarn.scheduler.minimum-allocation-mb changes the number of containers per node.
  • If your installation has a large amount of RAM but not many disks or cores, you can free RAM for other tasks by lowering both yarn.scheduler.minimum-allocation-mb and yarn.nodemanager.resource.memory-mb.
  • With MapReduce on YARN, there are no longer preconfigured static slots for Map and Reduce tasks.

The entire cluster is available for dynamic resource allocation of Map and Reduce tasks as needed by each job. In the previous example cluster, with the previous configurations, YARN is able to allocate up to 10 Mappers (40/4) or 5 Reducers (40/8) on each node (or some other combination of Mappers and Reducers within the 40 GB per node limit).

1.11. Configuring NameNode Heap Size

NameNode heap size depends on many factors, such as the number of files, the number of blocks, and the load on the system. The following table provides recommendations for NameNode heap size configuration. These settings should work for typical Hadoop clusters in which the number of blocks is very close to the number of files (generally, the average ratio of number of blocks per file in a system is 1.1 to 1.2).

Some clusters might require further tweaking of the following settings. Also, it is generally better to set the total Java heap to a higher value.

Table 1.11. Recommended NameNode Heap Size Settings

Number of Files, in Millions Total Java Heap (Xmx and Xms) Young Generation Size (-XX:NewSize, -XX:MaxNewSize)
< 1 million files 1126m 128m
1-5 million files 3379m 512m
5-10 5913m 768m
10-20 10982m 1280m
20-30 16332m 2048m
30-40 21401m 2560m
40-50 26752m 3072m
50-70 36889m 4352m
70-100 52659m 6144m
100-125 65612m 7680m
125-150 78566m 8960m
150-200 104473m 8960m

Note

Acceldata recommends a maximum of 300 million files on the NameNode. You should also set -XX:PermSize to 128m and -XX:MaxPermSize to 256m.

Following are the recommended settings for HADOOP_NAMENODE_OPTS in the hadoop-env.sh file (replacing the ##### placeholders for -XX:NewSize, -XX:MaxNewSize, -Xms, and -Xmx with the recommended values from the table):

-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=##### -XX:MaxNewSize=##### -Xms##### -Xmx##### -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_NAMENODE_OPTS}

If the cluster uses a secondary NameNode, you should also set HADOOP_SECONDARYNAMENODE_OPTS to HADOOP_NAMENODE_OPTS in the hadoop-env.sh file:

HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS

Another useful HADOOP_NAMENODE_OPTS setting is -XX:+HeapDumpOnOutOfMemoryError.

This option specifies that a heap dump should be executed when an out-of-memory error occurs. You should also use -XX:HeapDumpPath to specify the location for the heap dump file:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./etc/heapdump.hprof

1.12. Allocating Adequate Log Space for ODP

Logs are an important part of managing and operating your ODP cluster. The directories and disks that you assign for logging in ODP must have enough space to maintain logs during ODP operations. Allocate at least 10 GB of free space for any disk you want to use for ODP logging.

1.13. Downloading the ODP Maven Artifacts

The Acceldata Release Engineering team hosts all the released ODP maven artifacts at http://repo.acceldata.com/content/repositories/releases/

Other than the release artifacts, some non-Acceldata artifacts are necessary for building the ODP stack. These third-party artifacts are hosted in the Acceldata nexus repository:

http://repo.acceldata.com/content/repositories/jetty-hadoop/

and

http://repo.acceldata.com/content/repositories/re-hosted/

If developers want to develop an application against the ODP stack, and they also have a maven repository manager in-house, then they can proxy these three repositories and continue referring to the internal maven groups repo.

If developers do not have access to their in-house maven repos, they can directly use the Acceldata public groups repo http://repo.acceldata.com/content/groups/public/ and continue to develop applications.

2. Installing Apache ZooKeeper

This section describes installing and testing Apache ZooKeeper, a centralized tool for providing services to highly distributed systems.

Note

HDFS and YARN depend on ZooKeeper, so install ZooKeeper first.

  1. Install the ZooKeeper Package
  2. Securing ZooKeeper with Kerberos (optional)
  3. Securing ZooKeeper Access
  4. Set Directories and Permissions
  5. Set Up the Configuration Files
  6. Start ZooKeeper

2.1. Install the ZooKeeper Package

Note

In a production environment, Acceldata recommends installing ZooKeeper server on three (or a higher odd number) nodes to ensure that ZooKeeper service is available.

On all nodes of the cluster that you have identified as ZooKeeper servers, type:

  • For RHEL/CentOS 7

yum install zookeeper-server

  • For Ubuntu 18/20:

apt-get install zookeeper

Note

Grant the zookeeper user shell access on Ubuntu 18/20.

usermod -s /bin/bash zookeeper

2.2. Securing ZooKeeper with Kerberos (optional)

Note

Before starting the following steps, refer to Setting up Security for Manual Installs.

(Optional) To secure ZooKeeper with Kerberos, perform the following steps on the host that runs KDC (Kerberos Key Distribution Center):

  1. Start the kadmin.local utility:

/usr/sbin/kadmin.local

  2. Create a principal for ZooKeeper:

sudo kadmin.local -q 'addprinc zookeeper/<ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM'

  3. Create a keytab for ZooKeeper:

sudo kadmin.local -q "ktadd -k /tmp/zk.keytab zookeeper/<ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM"

  4. Copy the keytab to all ZooKeeper nodes in the cluster.

Note

Verify that only the ZooKeeper and Storm operating system users can access the ZooKeeper keytab.

  5. Administrators must add the following properties to the zoo.cfg configuration file located at /etc/zookeeper/conf:

authProvider.1 = org.apache.zookeeper.server.auth.SASLAuthenticationProvider 
kerberos.removeHostFromPrincipal = true 
kerberos.removeRealmFromPrincipal = true 


2.3. Securing ZooKeeper Access

The default value of yarn.resourcemanager.zk-acl allows anyone to have full access to the znode. Acceldata recommends that you modify this permission to restrict access by performing the steps in the following sections.

  • ZooKeeper Configuration
  • YARN Configuration
  • HDFS Configuration

2.3.1. ZooKeeper Configuration

Note

The steps in this section only need to be performed once for the ODP cluster. If this task has been done to secure HBase for example, then there is no need to repeat these ZooKeeper steps if the YARN cluster uses the same ZooKeeper server.

  1. Create a keytab for ZooKeeper called zookeeper.service.keytab and save it to /etc/security/keytabs.

sudo kadmin.local -q "ktadd -k /tmp/zk.keytab zookeeper/<ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM" 

  2. Add the following to the zoo.cfg file:

authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider 
jaasLoginRenew=3600000 
kerberos.removeHostFromPrincipal=true 
kerberos.removeRealmFromPrincipal=true 
  3. Create the zookeeper_client_jaas.conf file.
Client { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=false 
useTicketCache=true; 
}; 
  4. Create the zookeeper_jaas.conf file.
Server { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=true 
storeKey=true 
useTicketCache=false 
keyTab="$PATH_TO_ZOOKEEPER_KEYTAB" 
(such as"/etc/security/keytabs/zookeeper.service.keytab") 
principal="zookeeper/$HOST"; 
(such as "zookeeper/[email protected]";) 
}; 
  5. Add the following information to the zookeeper-env.sh file:
export CLIENT_JVMFLAGS="-Djava.security.auth.login.config=/etc/zookeeper/conf/zookeeper_client_jaas.conf" 
export SERVER_JVMFLAGS="-Xmx1024m -Djava.security.auth.login.config=/etc/zookeeper/conf/zookeeper_jaas.conf"

2.3.2. YARN Configuration

Note

The following steps must be performed on all nodes that launch the ResourceManager.

  1. Create a new configuration file called yarn_jaas.conf in the directory that contains the Hadoop Core configurations (typically, /etc/hadoop/conf).
Client { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=true 
storeKey=true 
useTicketCache=false 
keyTab="$PATH_TO_RM_KEYTAB" 
(such as "/etc/security/keytabs/rm.service.keytab") 
principal="rm/$HOST"; 
(such as "rm/[email protected]";) 
}; 
  2. Add a new property to the yarn-site.xml file.
<property> 
<name>yarn.resourcemanager.zk-acl</name> 
<value>sasl:rm:rwcda</value> 
</property>

Note

Because yarn.resourcemanager.zk-acl is set to sasl, you do not need to set any value for yarn.resourcemanager.zk-auth.

Setting the value to sasl also means that you cannot run the command addauth <scheme> <auth> in the ZooKeeper client CLI.

  3. Add a new YARN_OPTS to the yarn-env.sh file and make sure this YARN_OPTS is picked up when you start your ResourceManagers.
YARN_OPTS="$YARN_OPTS -Dzookeeper.sasl.client=true 
-Dzookeeper.sasl.client.username=zookeeper 
-Djava.security.auth.login.config=/etc/hadoop/conf/yarn_jaas.conf 
-Dzookeeper.sasl.clientconfig=Client"

2.3.3. HDFS Configuration

  1. In the hdfs-site.xml file, set the following property to secure the ZooKeeper-based failover controller when NameNode HA is enabled:
<property> 
<name>ha.zookeeper.acl</name> 
 <value>sasl:nn:rwcda</value> 
</property>

2.4. Set Directories and Permissions

Create directories and configure ownership and permissions on the appropriate hosts as described below. If any of these directories already exist, we recommend deleting and recreating them.

Acceldata provides a set of configuration files that represent a working ZooKeeper configuration. (See Download Companion Files.) You can use these files as a reference point, however, you need to modify them to match your own cluster environment.

If you choose to use the provided configuration files to set up your ZooKeeper environment, complete the following steps to create the appropriate directories.

  1. Execute the following commands on all ZooKeeper nodes:
mkdir -p $ZOOKEEPER_LOG_DIR; 
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_LOG_DIR; 
chmod -R 755 $ZOOKEEPER_LOG_DIR; 

mkdir -p $ZOOKEEPER_PID_DIR; 
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_PID_DIR; 
chmod -R 755 $ZOOKEEPER_PID_DIR; 

mkdir -p $ZOOKEEPER_DATA_DIR; 
chmod -R 755 $ZOOKEEPER_DATA_DIR; 
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_DATA_DIR 

where:
• $ZOOKEEPER_USER is the user owning the ZooKeeper services. For example, zookeeper.
• $ZOOKEEPER_LOG_DIR is the directory to store the ZooKeeper logs. For example, /var/log/zookeeper.
• $ZOOKEEPER_PID_DIR is the directory to store the ZooKeeper process ID. For example, /var/run/zookeeper.
• $ZOOKEEPER_DATA_DIR is the directory where ZooKeeper stores data. For example, /grid/hadoop/zookeeper/data.

  2. Initialize the ZooKeeper data directories with the 'myid' file. Create one file per ZooKeeper server, and put the number of that server in each file:

vi $ZOOKEEPER_DATA_DIR/myid

  • In the myid file on the first server, enter the corresponding number: 1
  • In the myid file on the second server, enter the corresponding number: 2
  • In the myid file on the third server, enter the corresponding number: 3
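
Instead of editing each file by hand, you can write the myid files directly; a minimal sketch, run on the corresponding server:

echo 1 > $ZOOKEEPER_DATA_DIR/myid   # on the first ZooKeeper server
echo 2 > $ZOOKEEPER_DATA_DIR/myid   # on the second ZooKeeper server
echo 3 > $ZOOKEEPER_DATA_DIR/myid   # on the third ZooKeeper server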

2.5. Set Up the Configuration Files

You must set up several configuration files for ZooKeeper. Acceldata provides a set of configuration files that represent a working ZooKeeper configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.

If you choose to use the provided configuration files to set up your ZooKeeper environment, complete the following steps:

  1. Extract the ZooKeeper configuration files to a temporary directory.

The files are located in the configuration_files/zookeeper directories where you decompressed the companion files.

  2. Modify the configuration files.

In the respective temporary directories, locate the zookeeper-env.sh file and modify the properties based on your environment including the JDK version you downloaded.

  3. Edit the zookeeper-env.sh file to match the Java home directory, ZooKeeper log directory, and ZooKeeper PID directory in your cluster environment, and the directories you set up above.

See below for an example configuration:

export JAVA_HOME=/usr/jdk64/jdk1.8.0_202 
export ZOOKEEPER_HOME=/usr/odp/current/zookeeper-server 
export ZOOKEEPER_LOG_DIR=/var/log/zookeeper 
export ZOOKEEPER_PID_DIR=/var/run/zookeeper/zookeeper_server.pid 
export SERVER_JVMFLAGS=-Xmx1024m 
export JAVA=$JAVA_HOME/bin/java 
CLASSPATH=$CLASSPATH:$ZOOKEEPER_HOME/* 
  4. Edit the zoo.cfg file to match your cluster environment. Below is an example of a typical zoo.cfg file:
dataDir=$zk.data.directory.path 
server.1=$zk.server1.full.hostname:2888:3888 
server.2=$zk.server2.full.hostname:2888:3888 
server.3=$zk.server3.full.hostname:2888:3888 
  5. Copy the configuration files.
  • On all hosts create the config directory:
rm -r $ZOOKEEPER_CONF_DIR ; 
mkdir -p $ZOOKEEPER_CONF_DIR ; 
  • Copy all the ZooKeeper configuration files to the $ZOOKEEPER_CONF_DIR directory. 
  • Set appropriate permissions: 
chmod a+x $ZOOKEEPER_CONF_DIR/; 
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_CONF_DIR/../; 
chmod -R 755 $ZOOKEEPER_CONF_DIR/../

Note:

  • $ZOOKEEPER_CONF_DIR is the directory to store the ZooKeeper configuration files. For example, /etc/zookeeper/conf.
  • $ZOOKEEPER_USER is the user owning the ZooKeeper services. For example, zookeeper.

2.6. Start ZooKeeper

To install and configure HBase and other Hadoop ecosystem components, you must start the ZooKeeper service and the ZKFC:

sudo -E -u zookeeper bash -c "export ZOOCFGDIR=$ZOOKEEPER_CONF_DIR ; export ZOOCFG=zoo.cfg; source $ZOOKEEPER_CONF_DIR/zookeeper-env.sh ; $ZOOKEEPER_HOME/bin/zkServer.sh start" 

For example:

su - zookeeper -c "export ZOOCFGDIR=/usr/odp/current/zookeeper-server/conf ; export ZOOCFG=zoo.cfg; source /usr/odp/current/zookeeper-server/conf/zookeeper-env.sh ; /usr/odp/current/zookeeper-server/bin/zkServer.sh start" 
su -l hdfs -c "/usr/odp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start zkfc"
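
To confirm that ZooKeeper came up, you can check its status on each server; a minimal sketch using the same example paths as above:

su - zookeeper -c "export ZOOCFGDIR=/usr/odp/current/zookeeper-server/conf ; export ZOOCFG=zoo.cfg; /usr/odp/current/zookeeper-server/bin/zkServer.sh status" 
echo ruok | nc localhost 2181   # returns "imok" if four-letter-word commands are enabled and nc is installed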

3. Installing HDFS, YARN, and MapReduce

This section describes how to install the Hadoop Core components, HDFS, YARN, and MapReduce.

Complete the following instructions to install Hadoop Core components:

  1. Set Default File and Directory Permissions
  2. Install the Hadoop Packages
  3. Install Compression Libraries
  4. Create Directories

3.1. Set Default File and Directory Permissions

Set the default operating system file and directory permissions to 0022 (022).

Use the umask command to confirm that the permissions are set as necessary. For example, to see what the current umask setting is, enter:

umask

If you want to set a default umask for all users of the OS, edit the /etc/profile file, or other appropriate file for system-wide shell configuration.

Ensure that the umask is set for all terminal sessions that you use during installation.
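
For example, to make 022 the default for all login shells (assuming /etc/profile is the appropriate place on your systems), you might append:

echo umask 022 >> /etc/profile
# Takes effect for new login shells; run "umask 022" in your current session as well.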

3.2. Install the Hadoop Packages

Execute the following command on all cluster nodes.

  • For RHEL/CentOS 7

yum install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl

  • For Ubuntu 18/20:

apt-get install hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-mapreduce hadoop-client openssl

3.3. Install Compression Libraries

Make the following compression libraries available on all the cluster nodes.

3.3.1. Install Snappy

Install Snappy on all the nodes in your cluster. At each node:

  • For RHEL/CentOS 7

yum install snappy snappy-devel

  • For Ubuntu 18/20:

apt-get install libsnappy1 libsnappy-dev

3.3.2. Install LZO

Execute the following command at all the nodes in your cluster:

  • RHEL/CentOS 7

yum install lzo lzo-devel hadooplzo hadooplzo-native

• For Ubuntu 18/20:

apt-get install liblzo2-2 liblzo2-dev hadooplzo

3.4. Create Directories

Create directories and configure ownership + permissions on the appropriate hosts as described below.

Before you begin:

  • If any of these directories already exist, we recommend deleting and recreating them.

  • Acceldata provides a set of configuration files that represent a working core Hadoop configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.

Use the following instructions to create appropriate directories:

  1. Create the NameNode Directories
  2. Create the SecondaryNameNode Directories
  3. Create DataNode and YARN NodeManager Local Directories
  4. Create the Log and PID Directories
  5. Symlink Directories with odp-select

3.4.1. Create the NameNode Directories

On the node that hosts the NameNode service, execute the following commands:

mkdir -p $DFS_NAME_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $DFS_NAME_DIR; 
chmod -R 755 $DFS_NAME_DIR;

Where:

  • $DFS_NAME_DIR is the space-separated list of directories where NameNode stores the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.
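
For instance, substituting the example values above (hypothetical paths and the default hdfs/hadoop accounts), the commands expand to:

mkdir -p /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn
chown -R hdfs:hadoop /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn
chmod -R 755 /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn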

3.4.2. Create the SecondaryNameNode Directories

On all the nodes that can potentially run the SecondaryNameNode service, execute the following commands:

mkdir -p $FS_CHECKPOINT_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $FS_CHECKPOINT_DIR; 
chmod -R 755 $FS_CHECKPOINT_DIR; 

where:

  • $FS_CHECKPOINT_DIR is the space-separated list of directories where SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.3. Create DataNode and YARN NodeManager Local Directories

At each DataNode, execute the following commands:

mkdir -p $DFS_DATA_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $DFS_DATA_DIR; 
chmod -R 750 $DFS_DATA_DIR; 

where:

  • $DFS_DATA_DIR is the space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

At each ResourceManager and all DataNodes, execute the following commands:
mkdir -p $YARN_LOCAL_DIR; 
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_DIR; 
chmod -R 755 $YARN_LOCAL_DIR;

where:

  • $YARN_LOCAL_DIR is the space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local.
  • $YARN_USER is the user owning the YARN services. For example, yarn.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

At each ResourceManager and all DataNodes, execute the following commands:
mkdir -p $YARN_LOCAL_LOG_DIR; 
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_LOG_DIR; 
chmod -R 755 $YARN_LOCAL_LOG_DIR;

where:

  • $YARN_LOCAL_LOG_DIR is the space-separated list of directories where YARN should store container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs.
  • $YARN_USER is the user owning the YARN services. For example, yarn.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4. Create the Log and PID Directories

Each Hadoop service requires a log and PID directory. In this section, you create directories for each service. If you choose to use the companion file scripts, these environment variables are already defined and you can copy and paste the examples into your terminal window.

3.4.4.1. HDFS Logs

At all nodes, execute the following commands:

mkdir -p $HDFS_LOG_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_LOG_DIR; 
chmod -R 755 $HDFS_LOG_DIR;

where:

  • $HDFS_LOG_DIR is the directory for storing the HDFS logs.
    This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs, where hdfs is the $HDFS_USER.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.2. Yarn Logs

At all nodes, execute the following commands:

mkdir -p $YARN_LOG_DIR; 
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOG_DIR; 
chmod -R 755 $YARN_LOG_DIR; 

where:

  • $YARN_LOG_DIR is the directory for storing the YARN logs.
    This directory name is a combination of a directory and the $YARN_USER. For example, /var/log/hadoop/yarn, where yarn is the $YARN_USER.
  • $YARN_USER is the user owning the YARN services. For example, yarn.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.3. HDFS Process

At all nodes, execute the following commands:

mkdir -p $HDFS_PID_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_PID_DIR; 
chmod -R 755 $HDFS_PID_DIR; 

where:

  • $HDFS_PID_DIR is the directory for storing the HDFS process ID. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs where hdfs is the $HDFS_USER.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.4. Yarn Process ID

At all nodes, execute the following commands:

mkdir -p $YARN_PID_DIR; 
chown -R $YARN_USER:$HADOOP_GROUP $YARN_PID_DIR; 
chmod -R 755 $YARN_PID_DIR;

where:

  • $YARN_PID_DIR is the directory for storing the YARN process ID.
    This directory name is a combination of a directory and the $YARN_USER. For example, /var/run/hadoop/yarn where yarn is the $YARN_USER.
  • $YARN_USER is the user owning the YARN services. For example, yarn.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.5. JobHistory Server Logs

At all nodes, execute the following commands:

mkdir -p $MAPRED_LOG_DIR; 
chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_LOG_DIR; 
chmod -R 755 $MAPRED_LOG_DIR; 

where:

  • $MAPRED_LOG_DIR is the directory for storing the JobHistory Server logs. This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/log/hadoop/mapred where mapred is the $MAPRED_USER.
  • $MAPRED_USER is the user owning the MAPRED services. For example, mapred.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.6. JobHistory Server Process ID

At all nodes, execute the following commands:

mkdir -p $MAPRED_PID_DIR; 
chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_PID_DIR; 
chmod -R 755 $MAPRED_PID_DIR; 

where:

  • $MAPRED_PID_DIR is the directory for storing the JobHistory Server process ID. This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/run/hadoop/mapred where mapred is the $MAPRED_USER.
  • $MAPRED_USER is the user owning the MAPRED services. For example, mapred.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.5. Symlink Directories with odp-select

Important

ODP 3.2.2.0 installs odp-select automatically with the installation or upgrade of the first ODP component.

To prevent version-specific directory issues for your scripts and updates, Acceldata provides odp-select, a script that symlinks directories to odp-current and modifies paths for configuration directories.

Determine the version number of the odp-select installed package:

yum list | grep odp (on CentOS 7) 
rpm -qa | grep odp (on CentOS 7) 
dpkg -l | grep odp (on Ubuntu)

For example:

/usr/bin/odp-select set all 3.2.2.0-<$BUILD>

Run odp-select set all on the NameNode and on all DataNodes. If YARN is deployed separately, also run odp-select on the Resource Manager and all Node Managers.

odp-select set all 3.2.2.0-<$BUILD>
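
To confirm that the symlinks now point at the expected build (a simple sanity check; the exact set of component links depends on which packages you installed), list the odp-current links:

ls -l /usr/odp/current/ | head
readlink /usr/odp/current/hadoop-client    # should resolve under /usr/odp/3.2.2.0-<$BUILD>/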

4. Setting Up the Hadoop Configuration

This section describes how to set up and edit the deployment configuration files for HDFS and MapReduce.

You must set up several configuration files for HDFS and MapReduce. Acceldata provides a set of configuration files that represent a working HDFS and MapReduce configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.

If you choose to use the provided configuration files to set up your HDFS and MapReduce environment, complete the following steps:

  1. Extract the core Hadoop configuration files to a temporary directory.

The files are located in the configuration_files/core_hadoop directory where you decompressed the companion files.

  2. Modify the configuration files.

In the temporary directory, locate the following files and modify the properties based on your environment. Search for TODO in the files for the properties to replace. For further information, see "Define Environment Parameters" in this guide.

  • Edit core-site.xml and modify the following properties:
<property> 
 <name>fs.defaultFS</name> 
 <value>hdfs://$namenode.full.hostname:8020</value> 
 <description>Enter your NameNode hostname</description> 
</property> 

<property> 
 <name>odp.version</name> 
 <value>${odp.version}</value> 
 <description>Replace with the actual ODP version</description> 
</property> 
  • Edit hdfs-site.xml and modify the following properties:
<property> 
 <name>dfs.namenode.name.dir</name> 
 <value>/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn</value> 
 <description>Comma-separated list of paths. Use the list of 
 directories from $DFS_NAME_DIR. For example, /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn.</description> 
</property> 

<property> 
 <name>dfs.datanode.data.dir</name> 
 <value>file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn</value>
 <description>Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR. For example, file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn.</description> 
</property>

<property> 
 <name>dfs.namenode.http-address</name> 
 <value>$namenode.full.hostname:50070</value> 
 <description>Enter your NameNode hostname for http access.</description> 
</property> 

<property> 
 <name>dfs.namenode.secondary.http-address</name> 
 <value>$secondary.namenode.full.hostname:50090</value> 
 <description>Enter your Secondary NameNode hostname.</description> 
</property> 

<property> 
 <name>dfs.namenode.checkpoint.dir</name> 
 <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value> 
 <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description> 
</property> 

<property> 
 <name>dfs.namenode.checkpoint.edits.dir</name> 
 <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value> 
 <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
 </property> 

<property> 
 <name>dfs.namenode.rpc-address</name> 
 <value>namenode_host_name:8020</value> 
 <description>The RPC address that handles all client requests.</description> 
</property> 

<property> 
 <name>dfs.namenode.https-address</name> 
 <value>namenode_host_name:50470</value> 
 <description>The NameNode secure http server address and port.</description> 
</property>

Note

The maximum value of the NameNode new generation size (-XX:MaxNewSize) should be 1/8 of the maximum heap size (-Xmx). Ensure that you check the default setting for your environment.

  • Edit yarn-site.xml and modify the following properties:
<property> 
 <name>yarn.resourcemanager.scheduler.class</name> 
 <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> 
</property> 

<property> 
 <name>yarn.resourcemanager.resource-tracker.address</name> 
 <value>$resourcemanager.full.hostname:8025</value> 
 <description>Enter your ResourceManager hostname.</description> 
</property> 

<property> 
 <name>yarn.resourcemanager.scheduler.address</name> 
 <value>$resourcemanager.full.hostname:8030</value> 
 <description>Enter your ResourceManager hostname.</description> 
</property> 

<property> 
 <name>yarn.resourcemanager.address</name> 
 <value>$resourcemanager.full.hostname:8050</value> 
 <description>Enter your ResourceManager hostname.</description> 
</property> 

<property> 
 <name>yarn.resourcemanager.admin.address</name> 
 <value>$resourcemanager.full.hostname:8141</value> 
 <description>Enter your ResourceManager hostname.</description> 
</property> 

<property> 
 <name>yarn.nodemanager.local-dirs</name> 
 <value>/grid/hadoop/yarn/local,/grid1/hadoop/yarn/local</value> 
 <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/yarn/local,/grid1/hadoop/yarn/local.</description> 
</property> 

<property> 
 <name>yarn.nodemanager.log-dirs</name> 
 <value>/grid/hadoop/yarn/log</value> 
 <description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/hadoop/yarn/log,/grid1/hadoop/yarn/log,/grid2/hadoop/yarn/log.</description> 
</property> 

<property> 
 <name>yarn.nodemanager.recovery.dir</name> 
 <value>${hadoop.tmp.dir}/yarn-nm-recovery</value> 
</property> 

<property> 
 <name>yarn.log.server.url</name> 
 <value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value> 
 <description>URL for job history server</description> 
</property>
 
<property>
<name>yarn.resourcemanager.webapp.address</name> 
 <value>$resourcemanager.full.hostname:8088</value> 
 <description>URL for the ResourceManager web UI</description> 
</property> 

<property> 
 <name>yarn.timeline-service.webapp.address</name> 
 <value><Resource_Manager_full_hostname>:8188</value> 
</property>
  • Edit mapred-site.xml and modify the following properties:
<property> 
 <name>mapreduce.jobhistory.address</name> 
 <value>$jobhistoryserver.full.hostname:10020</value> 
 <description>Enter your JobHistoryServer hostname.</description>
 </property> 

<property> 
 <name>mapreduce.jobhistory.webapp.address</name> 
 <value>$jobhistoryserver.full.hostname:19888</value> 
 <description>Enter your JobHistoryServer hostname.</description>
 </property>
  3. On each node of the cluster, create an empty file named dfs.exclude inside $HADOOP_CONF_DIR. Append the following to /etc/profile:
touch $HADOOP_CONF_DIR/dfs.exclude 
JAVA_HOME=<java_home_path> 
export JAVA_HOME 
HADOOP_CONF_DIR=/etc/hadoop/conf/ 
export HADOOP_CONF_DIR 
export PATH=$PATH:$JAVA_HOME:$HADOOP_CONF_DIR 
  4. Optional: Configure MapReduce to use Snappy Compression.

To enable Snappy compression for MapReduce jobs, edit core-site.xml and mapred-site.xml.

  • Add the following properties to mapred-site.xml:
<property> 
 <name>mapreduce.admin.map.child.java.opts</name> 
 <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/odp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value> 
 <final>true</final> 
</property> 

<property> 
 <name>mapreduce.admin.reduce.child.java.opts</name> 
 <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/odp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value> 
 <final>true</final> 
</property>
  • Add the SnappyCodec to the codecs list in core-site.xml:
<property> 
 <name>io.compression.codecs</name>
 <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value> 
</property>
  5. Optional: If you are using the LinuxContainerExecutor, you must set up container-executor.cfg in the config directory. The file must be owned by root:root. The settings are in the form of key=value with one key per line. There must be entries for all keys. If you do not want to assign a value for a key, you can leave it unset in the form of key=#.

The keys are defined as follows:

  • yarn.nodemanager.linux-container-executor.group - the configured value of yarn.nodemanager.linux-container-executor.group. This must match the value of yarn.nodemanager.linux-container-executor.group in yarn-site.xml.
  • banned.users - a comma-separated list of users who cannot run container-executor.
  • min.user.id - the minimum user ID; this prevents system users from running container-executor.
  • allowed.system.users - a comma separated list of allowed system users.
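
A minimal container-executor.cfg illustrating these keys (example values only; allowed.system.users is deliberately left unset here):

yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred
min.user.id=1000
allowed.system.users=#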
  6. Replace the default memory configuration settings in yarn-site.xml and mapred-site.xml with the YARN and MapReduce memory configuration settings you calculated previously. Fill in the memory/CPU values that match what the documentation or helper scripts suggest for your environment.

  7. Copy the configuration files.

  • On all hosts in your cluster, create the Hadoop configuration directory:
rm -rf $HADOOP_CONF_DIR 
mkdir -p $HADOOP_CONF_DIR 

where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

  • Copy all the configuration files to $HADOOP_CONF_DIR.
  • Set the appropriate permissions:
chown -R $HDFS_USER:$HADOOP_GROUP $HADOOP_CONF_DIR/../ 
chmod -R 755 $HADOOP_CONF_DIR/../ 

where:

  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.
  8. Set the Concurrent Mark-Sweep (CMS) Garbage Collector (GC) parameters.

On the NameNode host, open the /etc/hadoop/conf/hadoop-env.sh file. Locate export HADOOP_NAMENODE_OPTS=<parameters> and add the following parameters:

-XX:+UseCMSInitiatingOccupancyOnly 
-XX:CMSInitiatingOccupancyFraction=70

By default CMS GC uses a set of heuristic rules to trigger garbage collection. This makes garbage collection less predictable and tends to delay collection until the old generation is almost fully occupied. Initiating it in advance allows garbage collection to complete before the old generation is full, and thus avoid Full GC (i.e. a stop-the-world pause).

  • -XX:+UseCMSInitiatingOccupancyOnly prevents the use of GC heuristics.

-XX:CMSInitiatingOccupancyFraction=<percent> tells the Java VM when CMS should be triggered. Basically, it allows the creation of a buffer in heap, which can be filled with data while CMS is running. This percent should be back calculated from the speed with which memory is consumed in the old generation during production load. If this percent is set too low, the CMS runs too often; if it is set too high, the CMS is triggered too late and concurrent mode failure may occur. The recommended setting for -XX:CMSInitiatingOccupancyFraction is 70, which means that the application should utilize less than 70% of the old generation.
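
As an illustration, the resulting line in hadoop-env.sh could look like the following (a sketch only; keep any environment-specific flags already present in your HADOOP_NAMENODE_OPTS):

export HADOOP_NAMENODE_OPTS="-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 ${HADOOP_NAMENODE_OPTS}"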

5. Validating the Core Hadoop Installation

Use the following instructions to start core Hadoop and perform the smoke tests.

  1. Format and Start HDFS
  2. Smoke Test HDFS
  3. Configure YARN and MapReduce
  4. Start YARN
  5. Start MapReduce JobHistory Server
  6. Smoke Test MapReduce

5.1. Format and Start HDFS

  1. Modify the JAVA_HOME value in the hadoop-env.sh file:

export JAVA_HOME=/usr/java/default

  2. Execute the following commands on the NameNode host machine:
su - $HDFS_USER 
/usr/odp/current/hadoop-hdfs-namenode/../hadoop/bin/hdfs namenode -format 
/usr/odp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode 
  3. Execute the following commands on the SecondaryNameNode:
su - $HDFS_USER 
/usr/odp/current/hadoop-hdfs-secondarynamenode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode 
  4. Execute the following commands on all DataNodes:
su - $HDFS_USER 
/usr/odp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode 

Where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

Where $HDFS_USER is the HDFS user, for example, hdfs.
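
To verify that the daemons started (an optional check, assuming the JDK's jps tool is on the PATH), inspect the Java processes on each host:

su - $HDFS_USER -c jps
# Expect NameNode on the NameNode host, SecondaryNameNode on the secondary host,
# and DataNode on every DataNode.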

5.2. Smoke Test HDFS

  1. Determine if you can reach the NameNode server with your browser:

http://$namenode.full.hostname:50070

  2. Create the hdfs user directory in HDFS:
su - $HDFS_USER 
hdfs dfs -mkdir -p /user/hdfs 
  3. Try copying a file into HDFS and listing that file:
su - $HDFS_USER 
hdfs dfs -copyFromLocal /etc/passwd passwd 
hdfs dfs -ls 
  4. Use the NameNode web UI and the Utilities menu to browse the file system.
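
As an additional check (run as the HDFS user), the dfsadmin report should list all of your DataNodes as live:

su - $HDFS_USER 
hdfs dfs -cat passwd | head    # the copied file should be readable
hdfs dfsadmin -report          # "Live datanodes" should match the number of DataNodes you started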

5.3. Configure YARN and MapReduce

After you install Hadoop, modify your configs.

  1. As the HDFS user, for example 'hdfs', upload the MapReduce tarball to HDFS.
su - $HDFS_USER 
hdfs dfs -mkdir -p /odp/apps/<odp_version>/mapreduce/ 
hdfs dfs -put /usr/odp/current/hadoop-client/mapreduce.tar.gz /odp/apps/<odp_version>/mapreduce/ 
hdfs dfs -chown -R hdfs:hadoop /odp 
hdfs dfs -chmod -R 555 /odp/apps/<odp_version>/mapreduce 
hdfs dfs -chmod 444 /odp/apps/<odp_version>/mapreduce/mapreduce.tar.gz

Where $HDFS_USER is the HDFS user, for example hdfs, and <odp_version> is the current ODP version, for example 3.2.2.0.

  2. Copy mapred-site.xml from the companion files and make the following changes to mapred-site.xml:
  • Add
<property> 
 <name>mapreduce.admin.map.child.java.opts</name> 
 <value>-server -Djava.net.preferIPv4Stack=true -Dodp.version=${odp.version}</value> 
 <final>true</final> 
</property>

Note

You do not need to modify ${odp.version}.

  • Modify the following existing properties to include ${odp.version}:
<property> 
 <name>mapreduce.admin.user.env</name> 
 <value>LD_LIBRARY_PATH=/usr/odp/${odp.version}/hadoop/lib/native:/usr/odp/${odp.version}/hadoop/lib/native/Linux-amd64-64</value> 
</property>

<property> 
 <name>mapreduce.application.framework.path</name> 
 <value>/odp/apps/${odp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value> 
</property> 

<property> 
<name>mapreduce.application.classpath</name> 
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/odp/${odp.version}/hadoop/lib/hadoop-lzo-0.6.0.${odp.version}.jar:/etc/hadoop/conf/secure</value> 
</property>

Note

You do not need to modify ${odp.version}.

  3. Copy yarn-site.xml from the companion files and modify:
<property> 
 <name>yarn.application.classpath</name> 
 <value>$HADOOP_CONF_DIR,/usr/odp/${odp.version}/hadoop-client/*,  /usr/odp/${odp.version}/hadoop-client/lib/*, 
 /usr/odp/${odp.version}/hadoop-hdfs-client/*, 
 /usr/odp/${odp.version}/hadoop-hdfs-client/lib/*, 
 /usr/odp/${odp.version}/hadoop-yarn-client/*, 
 /usr/odp/${odp.version}/hadoop-yarn-client/lib/*</value> 
</property>
  4. For secure clusters, you must create and configure the container-executor.cfg configuration file:
  • Create the container-executor.cfg file in /etc/hadoop/conf/

  • Insert the following properties:

yarn.nodemanager.linux-container-executor.group=hadoop 
banned.users=hdfs,yarn,mapred 
min.user.id=1000 
  • Set the /etc/hadoop/conf/container-executor.cfg file permissions so that it is readable only by root:
chown root:hadoop /etc/hadoop/conf/container-executor.cfg 
chmod 400 /etc/hadoop/conf/container-executor.cfg 
  • Set the container-executor program so that only root or hadoop group users can execute it:
chown root:hadoop /usr/odp/${odp.version}/hadoop-yarn/bin/container-executor 
chmod 6050 /usr/odp/${odp.version}/hadoop-yarn/bin/container-executor

5.4. Start YARN

Note

To install and configure the Timeline Server, see Configuring the Timeline Server.

  1. As $YARN_USER, run the following command from the ResourceManager server:

su -l yarn -c "/usr/odp/current/hadoop-yarn-resourcemanager/sbin/yarn daemon.sh --config $HADOOP_CONF_DIR start resourcemanager"

  2. As $YARN_USER, run the following command from all NodeManager nodes:

su -l yarn -c "/usr/odp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager"

where: $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
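
To confirm that the NodeManagers registered with the ResourceManager (a quick optional check, run as the YARN user):

su - yarn -c "yarn node -list"
# Every NodeManager host should be listed with state RUNNING.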

5.5. Start MapReduce JobHistory Server

  1. Change permissions on the container-executor file.

chown -R root:hadoop /usr/odp/current/hadoop-yarn*/bin/container-executor 
chmod -R 6050 /usr/odp/current/hadoop-yarn*/bin/container-executor

Note

If these permissions are not set, the healthcheck script returns an error stating that the DataNode is UNHEALTHY.

  2. Execute these commands from the JobHistory server to set up directories on HDFS:
su $HDFS_USER 
hdfs dfs -mkdir -p /mr-history/tmp 
hdfs dfs -mkdir -p /mr-history/done 

hdfs dfs -chmod 1777 /mr-history 
hdfs dfs -chmod 1777 /mr-history/tmp 
hdfs dfs -chmod 1770 /mr-history/done 

hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history 
hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history/tmp 
hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history/done 

Where 

$MAPRED_USER : mapred 
$MAPRED_USER_GROUP: mapred or hadoop 
hdfs dfs -mkdir -p /app-logs 
hdfs dfs -chmod 1777 /app-logs 
hdfs dfs -chown $YARN_USER:$HADOOP_GROUP /app-logs 

Where

$YARN_USER : yarn 
$HADOOP_GROUP: hadoop
  3. Run the following command from the JobHistory server:
su -l $YARN_USER -c 
"/usr/odp/current/hadoop-mapreduce-historyserver/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver" 

$HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

5.6. Smoke Test MapReduce

  1. Browse to the ResourceManager:

http://$resourcemanager.full.hostname:8088/

  2. Create a $CLIENT_USER in all of the nodes and add it to the users group.
useradd client 
usermod -a -G users client 
  3. As the HDFS user, create a /user/$CLIENT_USER.
sudo su - $HDFS_USER 
hdfs dfs -mkdir /user/$CLIENT_USER 
hdfs dfs -chown $CLIENT_USER:$CLIENT_USER /user/$CLIENT_USER 
hdfs dfs -chmod -R 755 /user/$CLIENT_USER 
  4. Run the smoke test as the $CLIENT_USER, using TeraGen to generate sample data and TeraSort to sort it:
su - $CLIENT_USER 
/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teragen 10000 tmp/teragenout 
/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar terasort tmp/teragenout tmp/terasortout
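
Optionally, you can validate the sorted output with TeraValidate from the same examples JAR (a follow-on check; it writes its report to tmp/teravalidateout):

/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teravalidate tmp/terasortout tmp/teravalidateout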

6. Deploying ODP In Production Data Centers With Firewalls

A typical Open source Data Platform (ODP) install requires access to the Internet in order to fetch software packages from a remote repository. Because corporate networks typically have various levels of firewalls, these firewalls may limit or restrict Internet access, making it impossible for your cluster nodes to access the ODP repository during the install process.

The solution for this is to either:

  • Create a local mirror repository inside your firewall hosted on a local mirror server inside your firewall; or
  • Provide a trusted proxy server inside your firewall that can access the hosted repositories.

Note

Many of the descriptions in this section assume you are using RHEL/CentOS 7.

This document will cover these two options in detail, discuss the trade-offs, provide configuration guidelines, and will also provide recommendations for your deployment strategy.

In general, before installing Open source Data Platform in a production data center, it is best to ensure that both the Data Center Security team and the Data Center Networking team are informed and engaged to assist with these aspects of the deployment.

6.1. Terminology

The table below lists the various terms used throughout this section.

Table 6.1. Terminology

Item Description
Yum Package Manager (yum) A package management tool that fetches and installs software packages and performs automatic dependency resolution.
Local Mirror Repository The yum repository hosted on your Local Mirror Server that will serve the ODP software.
Local Mirror Server The server in your network that will host the Local Mirror Repository. This server must be accessible from all hosts in your cluster where you will install ODP.
ODP Repositories A set of repositories hosted by Acceldata that contains the ODP software packages. ODP software packages include the ODP Repository and the ODP-UTILS Repository.
ODP Repository Tarball A tarball image that contains the complete contents of the ODP Repositories.

6.2. Mirroring or Proxying

ODP uses yum (RHEL/CentOS) or apt-get (Ubuntu) to install software, and this software is obtained from the ODP Repositories. If your firewall prevents Internet access, you must mirror or proxy the ODP Repositories in your Data Center.

Mirroring a repository involves copying the entire repository and all its contents onto a local server and enabling an HTTPD service on that server to serve the repository locally. Once the local mirror server setup is complete, the *.repo configuration files on every cluster node must be updated, so that the given package names are associated with the local mirror server instead of the remote repository server.

There are two options for creating a local mirror server. Each of these options is explained in detail in a later section.

  • Mirror server has no access to Internet at all: Use a web browser on your workstation to download the ODP Repository Tarball, move the tarball to the selected mirror server using scp or a USB drive, and extract it to create the repository on the local mirror server.

  • Mirror server has temporary access to Internet: Temporarily configure a server to have Internet access, download a copy of the ODP Repository to this server using the reposync command, then reconfigure the server so that it is back behind the firewall.

Note

Option I is probably the least effort and, in some respects, the most secure deployment option.

Option II is best if you want to be able to update your Hadoop installation periodically from the Acceldata Repositories.

Trusted proxy server: Proxying a repository involves setting up a standard HTTP proxy on a local server to forward repository access requests to the remote repository server and route responses back to the original requestor. Effectively, the proxy server makes the repository server accessible to all clients, by acting as an intermediary.

Once the proxy is configured, change the /etc/yum.conf file on every cluster node, so that when the client attempts to access the repository during installation, the request goes through the local proxy server instead of going directly to the remote repository server.

6.3. Considerations for choosing a Mirror or Proxy solution

The following table lists some benefits provided by these alternative deployment strategies:

Advantages of repository mirroring:

  • The install process is faster, more reliable, and more cost effective (reduced WAN bandwidth minimizes the data center costs).
  • Allows security-conscious data centers to qualify a fixed set of repository files. It also ensures that the remote server will not change these repository files.
  • Large data centers may already have existing repository mirror servers for the purpose of OS upgrades and software maintenance. You can easily add the ODP Repositories to these existing servers.

Advantages of creating a proxy:

  • Avoids managing the repository files yourself (including updates, new versions, and bug fixes).
  • Almost all data centers already have a set of well-known proxies. In such cases, you can simply add the local proxy server to the existing proxy configurations. This approach is easier compared to creating local mirror servers in data centers with no mirror server setup.
  • The network access is the same as that required when using a mirror repository, but the source repository handles file management.

However, each of the above approaches are also known to have the following disadvantages:

  • Mirrors have to be managed for updates, upgrades, new versions, and bug fixes.
  • Proxy servers rely on the repository provider to not change the underlying files without notice.
  • Caching proxies are necessary, because non-caching proxies do not decrease WAN traffic and do not speed up the install process.

6.4. Recommendations for Deploying ODP

This section provides information on the various components of the Apache Hadoop ecosystem.

In many datacenters, using a mirror for the ODP Repositories can be the best deployment strategy. The ODP Repositories are small and easily mirrored, allowing you secure control over the contents of the Hadoop packages accepted for use in your data center.

Note

The installer pulls many packages from the base OS repositories (repos). If you do not have a complete base OS available to all your machines at the time of installation, you may run into issues. If you encounter problems with base OS repos being unavailable, please contact your system administrator to arrange for these additional repos to be proxied or mirrored.

6.5. Detailed Instructions for Creating Mirrors and Proxies

6.5.1. Option I - Mirror server has no access to the Internet

Complete the following instructions to set up a mirror server that has no access to the Internet:

  1. Check Your Prerequisites.

Select a mirror server host with the following characteristics:

  • The server OS is CentOS (7), RHEL (7), or Ubuntu (18,20), and has several GB of storage available.
  • This server and the cluster nodes are all running the same OS.

Note

Supporting repository mirroring for heterogeneous clusters requires a more complex procedure than the one documented here.

  • The firewall lets all cluster nodes (the servers on which you want to install ODP) access this server.
  2. Install the Repos.

    a. Use a workstation with access to the Internet and download the tarball image of the appropriate Acceldata yum repository.

Table 6.2. Acceldata Yum Repositories

Cluster OS ODP Repository Tarballs
RHEL/CentOS 7 wget [INSERT_URL]
RHEL/CentOS 7 wget [INSERT_URL]
Ubuntu 18 wget [INSERT_URL]
wget [INSERT_URL]
Ubuntu 20 wget [INSERT_URL]
wget [INSERT_URL]

b. Create an HTTP server.
• On the mirror server, install an HTTP server (such as Apache httpd) using the instructions provided here.
• Activate this web server.
• Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server.

Note

If you are using EC2, make sure that SELinux is disabled.

c. On your mirror server, create a directory for your web server.

For example, from a shell window, type:

  • For RHEL/CentOS 7:

mkdir -p /var/www/html/odp/

  • For Ubuntu 18/20:

mkdir -p /var/www/html/odp/

If you are using a symlink, enable the FollowSymLinks option on your web server.

d. Copy the ODP Repository Tarball to the directory created in the previous step, and untar it.

e. Verify the configuration.

  • The configuration is successful if you can access the above directory through your web browser.

To test this out, browse to the following location: http://$yourwebserver/odp/$os/ODP-3.2.2.0-1/.

You should see a directory listing for all the ODP components along with the RPMs at: $os/ODP-3.2.2.0-1.

Note

$os can be centos7, ubuntu18, or ubuntu20. Use the following table for the $os parameter:

Table 6.3. ODP Component Options

Operating System Value
RHEL 7 centos7
CentOs 7 centos7
Ubuntu 18 ubuntu18
Ubuntu 20 ubuntu20

f. Configure the yum clients on all the nodes in your cluster.

  • Fetch the yum configuration file from your mirror server.
  • Store the odp.repo file to a temporary location.
  • Edit the odp.repo file, changing the value of the baseurl property to point to your local repositories based on your cluster OS.

where

  • $yourwebserver is the FQDN of your local mirror server.
  • $os can be centos7, ubuntu18, or ubuntu20. Use the following table for the $os parameter:

Table 6.4. Yum Client Options

Operating System Value
RHEL 7 centos7
CentOs 7 centos7
Ubuntu 18 ubuntu18
Ubuntu 20 ubuntu20
  • Use scp or pdsh to copy the client yum configuration file to /etc/yum.repos.d/ directory on every node in the cluster.

  • [Conditional]: If you have multiple repositories configured in your environment, deploy the following plugin on all the nodes in your cluster.

  • Install the plugin.

  • For RHEL and CentOS

yum install yum-plugin-priorities

  • Edit the /etc/yum/pluginconf.d/priorities.conf file to add the following:
[main] 
enabled=1 
gpgcheck=0 

6.5.2. Option II - Mirror server has temporary or continuous access to the Internet

Complete the following instructions to set up a mirror server that has temporary access to the Internet:

  1. Check Your Prerequisites.

Select a local mirror server host with the following characteristics:

  • The server OS is CentOS (7), RHEL (7), or Ubuntu (18,20), and has several GB of storage available.
  • The local mirror server and the cluster nodes must have the same OS. If they are not running CentOS or RHEL, the mirror server must not be a member of the Hadoop cluster.

Note

Supporting repository mirroring for heterogeneous clusters requires a more complex procedure than the one documented here.

  • The firewall allows all cluster nodes (the servers on which you want to install ODP) to access this server.
  • Ensure that the mirror server has yum installed.
  • Add the yum-utils and createrepo packages on the mirror server:

yum install yum-utils createrepo
  2. Install the Repos.
  • Temporarily reconfigure your firewall to allow Internet access from your mirror server host.
  • Execute the following command to download the appropriate Acceldata yum client configuration file and save it in /etc/yum.repos.d/ directory on the mirror server host.

Table 6.5. Yum Client Configuration Commands

Cluster OS ODP Repository Tarballs
RHEL/CentOS 7 wget [INSERT_URL]
Ubuntu 18 wget [INSERT_URL]
Ubuntu 20 wget [INSERT_URL]
  • Create an HTTP server.
    • On the mirror server, install an HTTP server (such as Apache httpd) using the instructions provided here.
    • Activate this web server.
    • Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server.

Note

If you are using EC2, make sure that SELinux is disabled.

Optional - If your mirror server uses SLES, modify the default-server.conf file to enable the docs root folder listing.

sed -e "s/Options None/Options Indexes MultiViews/ig" /etc/apache2/default-server.conf > /tmp/tempfile.tmp 
mv /tmp/tempfile.tmp /etc/apache2/default-server.conf

On your mirror server, create a directory for your web server.

• For example, from a shell window, type:

• For RHEL/CentOS 7:

mkdir -p /var/www/html/odp/

• For Ubuntu 18/20:

mkdir -p /var/www/html/odp/

• If you are using a symlink, enable the FollowSymLinks option on your web server.

• Copy the contents of the entire ODP repository for your desired OS from the remote repository to your local mirror server.

  • Continuing the previous example, from a shell window, type:

  • For RHEL/CentOS 7/Ubuntu 18/20:

cd /var/www/html/odp

Then for all hosts, type:

  • ODP Repository
reposync -r ODP 
reposync -r ODP-3.2.2.0-1 
reposync -r ODP-UTILS-1.1.0.21

You should see both an ODP-3.2.2.0-1 directory and an ODP-UTILS-1.1.0.21 directory, each with several subdirectories.

  • Generate appropriate metadata.

This step defines each directory as a yum repository. From a shell window, type:

  • For RHEL/CentOS 7:

    • ODP Repository:
createrepo /var/www/html/odp/ODP-3.2.2.0-1
createrepo /var/www/html/odp/ODP-UTILS-1.1.0.21

You should see a new folder called repodata inside both ODP directories.

  • Verify the configuration.

  • The configuration is successful if you can access the above directory through your web browser.

To test this out, browse to the following location:

  • ODP: http://$yourwebserver/odp/ODP-3.2.2.0-1/

  • You should now see a directory listing for all the ODP components.

  • At this point, you can disable external Internet access for the mirror server, so that the mirror server is again entirely within your data center firewall.

  • Depending on your cluster OS, configure the yum clients on all the nodes in your cluster

  • Edit the repo files, changing the value of the baseurl property to the local mirror URL.

  • Edit the /etc/yum.repos.d/odp.repo file, changing the value of the baseurl property to point to your local repositories based on your cluster OS.

[ODP-3.x] 
name=Open source Data Platform Version - ODP-3.x 
baseurl=http://$yourwebserver/ODP/$os/3.x/GA 
gpgcheck=1 
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins 
enabled=1 
priority=1 

[ODP-UTILS-1.1.0.21] 
name=Open source Data Platform Utils Version - ODP-UTILS-1.1.0.21
baseurl=http://$yourwebserver/ODP-UTILS-1.1.0.21/repos/$os 
gpgcheck=1 
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins 
enabled=1 
priority=1 

[ODP-3.2.2.0-1] 
name=Open source Data Platform ODP-3.2.2.0-1 
baseurl=http://$yourwebserver/ODP/$os/3.x/updates/3.2.2.0-1 
gpgcheck=1 
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins 
enabled=1 
priority=1

where

  • $yourwebserver is the FQDN of your local mirror server.
  • $os can be centos7, ubuntu18, or ubuntu20. Use the following table for the $os parameter:

Table 6.6. $OS Parameter Values

Operating System Value
RHEL 7 centos7
CentOs 7 centos7
Ubuntu 18 ubuntu18
Ubuntu 20 ubuntu20
  • Copy the client repository configuration file to all nodes in your cluster.

    • RHEL/CentOS 7:

      Use scp or pdsh to copy the client yum configuration file to /etc/yum.repos.d/ directory on every node in the cluster.

  • For Ubuntu 18/20:

    On every node, invoke the following command:

    • ODP Repository:

      sudo add-apt-repository deb [INSERT_URL]

    • Optional - Ambari Repository

      sudo add-apt-repository deb [INSERT_URL]

    • If using Ambari, verify the configuration by deploying an Ambari server on one of the cluster nodes.

      apt-get install ambari-server

  • If your cluster runs CentOS 7 or RHEL, and you have multiple repositories configured in your environment, deploy the following plugin on all the nodes in your cluster.

    • Install the plugin.

      • For RHEL and CentOs v7.x

        yum install yum-plugin-priorities

      • Edit the /etc/yum/pluginconf.d/priorities.conf file to add the following:

        [main] 
        enabled=1 
        gpgcheck=0
        

6.6. Set up a trusted proxy server

Complete the following instructions to set up a trusted proxy server:

  1. Check Your Prerequisites.

Select a mirror server host with the following characteristics:

  • This server runs on either CentOS 7/RHEL or Ubuntu 18/20, and has several GB of storage available.

  • The firewall allows all cluster nodes (the servers on which you want to install ODP) to access this server, and allows this server to access the Internet (at least those Internet servers for the repositories to be proxied).

  2. Create a caching HTTP proxy server on the selected host.

• It is beyond the scope of this document to show how to set up an HTTP proxy server, given the many variations that may be required, depending on your data center's network security policy. If you choose to use the Apache HTTPD server, start by installing httpd, using the instructions provided here, and then add the mod_proxy and mod_cache modules, as stated here. Please engage your network security specialists to correctly set up the proxy server.

  • Activate this proxy server and configure its cache storage location.

  • Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server, and outbound access to the desired repo sites, including: public-repo-1.acceldata.com.

If you are using EC2, make sure that SELinux is disabled.
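
As one possible illustration only (assuming Apache httpd 2.4 with the mod_proxy, mod_proxy_http, mod_cache, and mod_cache_disk modules loaded; the file name, subnet, and cache path are placeholders, and your security team should review any forward proxy before it is enabled), the proxy and its cache could be configured along these lines:

# /etc/httpd/conf.d/forward-proxy.conf (hypothetical file name)
ProxyRequests On
ProxyVia On
<Proxy "*">
    # Restrict the forward proxy to your cluster subnet
    Require ip 10.0.0.0/8
</Proxy>
# Cache proxied HTTP content on local disk
CacheQuickHandler off
CacheRoot "/var/cache/httpd/proxy"
CacheEnable disk "http://"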

  • Depending on your cluster OS, configure the yum clients on all the nodes in your cluster.

The following description is taken from the CentOS documentation. On each cluster node, add the following lines to the /etc/yum.conf file. (As an example, the settings below enable yum to use the proxy server mycache.mydomain.com, connecting to port 3128, with the following credentials: yum-user/qwerty.)

# proxy server:port number 
proxy=http://mycache.mydomain.com:3128  
# account details for secure yum proxy connections 
proxy_username=yum-user 
proxy_password=qwerty
  • Once all nodes have their /etc/yum.conf file updated with appropriate configuration info, you can proceed with the ODP installation just as though the nodes had direct access to the Internet repositories.

  • If this proxy configuration does not seem to work, try adding a / at the end of the proxy URL. For example:

proxy=http://mycache.mydomain.com:3128/
