ODP Command Line Installation

1. Preparing to Install ODP Manually

This chapter describes how to prepare to install Open Source Data Platform (ODP) manually. You must complete the following tasks before you deploy a Hadoop cluster using ODP:

  1. Meeting Minimum System Requirements
  2. Configuring Remote Repositories
  3. Deciding on a Deployment Type
  4. Collect Information
  5. Prepare the Environment
  6. Download Companion Files
  7. Define Environment Parameters
  8. [Optional] Create System Users and Groups
  9. Determining ODP Memory Configuration Settings
  10. Allocating Adequate Log Space for ODP
  11. Download ODP Maven Artifacts

Important

See the ODP Release Notes for the ODP 3.2.2.0-1 repo information.

1.1. Meeting Minimum System Requirements

To run Open Source Data Platform, your system must meet minimum requirements.

1.1.1. Hardware Recommendations

Although there is no single hardware requirement for installing ODP, there are some basic guidelines. A complete installation of ODP 3.2.2 consumes about 8 GB of disk space.

1.1.2. Operating System Requirements

Refer to the Acceldata Support Matrix for information regarding supported operating systems.

1.1.3. Software Requirements

You must install the following software on each of your hosts:

  • apt-get (for Ubuntu 18/20)
  • chkconfig (Ubuntu 18/20)
  • curl
  • reposync
  • rpm (for RHEL, CentOS 7)
  • scp
  • tar
  • unzip
  • wget
  • yum (for RHEL or CentOS 7)

In addition, if you are creating local mirror repositories as part of the installation process and you are using RHEL or CentOS 7, you need the following utilities on the mirror repo server:

  • createrepo
  • reposync
  • yum-utils

See Deploying ODP in Production Data Centers with Firewalls.

1.1.4. JDK Requirements

Your system must have the correct Java Development Kit (JDK) installed on all cluster nodes.

Refer to the Support Matrix for information regarding supported JDKs.

Important

Before enabling Kerberos in the cluster, you must deploy the Java Cryptography Extension (JCE) security policy files on all hosts in the cluster. See Installing the JCE for more information.

The following sections describe how to install and configure the JDK.

1.1.4.1. Manually Installing Oracle JDK 1.8

Use the following instructions to manually install JDK 1.8:

  1. If you do not have a /usr/java directory, create one:

    mkdir /usr/java

  2. Download the Oracle 64-bit JDK (jdk-8u202-linux-x64.tar.gz) from the Oracle download site: open a web browser and navigate to http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.

  3. Copy the downloaded jdk-8u202-linux-x64.tar.gz file to the /usr/java directory.

  4. Navigate to the /usr/java directory and extract the jdk-8u202-linux-x64.tar.gz file:

    cd /usr/java
    tar zxvf jdk-8u202-linux-x64.tar.gz

    The JDK files are extracted into a /usr/java/jdk1.8.0_202 directory.

  5. Create a symbolic link (symlink) to the JDK:

    ln -s /usr/java/jdk1.8.0_202 /usr/java/default

  6. Set the JAVA_HOME and PATH environment variables:

    export JAVA_HOME=/usr/java/default 
    export PATH=$JAVA_HOME/bin:$PATH 

  7. Verify that Java is installed in your environment:

    java -version

You should see output similar to the following:

java version "1.8.0_202" 
Java(TM) SE Runtime Environment (build 1.8.0_202-b01) 
Java HotSpot(TM) 64-Bit Server VM (build 24.67-b01, mixed mode)
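
To make these settings persistent across logins, you can optionally add the exports to a system-wide profile script. The following is a minimal sketch that assumes a RHEL/CentOS-style /etc/profile.d directory and the /usr/java/default symlink created above:

echo 'export JAVA_HOME=/usr/java/default' > /etc/profile.d/java.sh   # assumed location; adjust for your distribution
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile.d/java.sh
source /etc/profile.d/java.sh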

1.1.4.2. Manually Installing the JCE

Unless you are using OpenJDK with unlimited-strength JCE, you must manually install the Java Cryptography Extension (JCE) security policy files on all hosts in the cluster:

  1. Obtain the JCE policy file appropriate for the JDK version in your cluster:
  • Oracle JDK 1.8

https://www.oracle.com/java/technologies/javase-jce8-downloads.html

  2. Save the policy file archive in a temporary location.

  3. On each host in the cluster, add the unlimited security policy JCE jars to $JAVA_HOME/jre/lib/security/.

For example, run the following command to extract the policy jars into the JDK installed on your host:

unzip -o -j -q jce_policy-8.zip -d /usr/jdk64/jdk1.8.0_202/jre/lib/security/
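
To confirm that the policy files are in place, you can list the security directory; this check assumes the example JDK path used above and the jar names shipped in jce_policy-8.zip:

ls -l /usr/jdk64/jdk1.8.0_202/jre/lib/security/local_policy.jar /usr/jdk64/jdk1.8.0_202/jre/lib/security/US_export_policy.jar   # both jars should be present on every host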

1.1.5. Metastore Database Requirements

If you are installing the Apache Hive and HCatalog, Apache Oozie, Hue, or Apache Ranger projects, you must install a database to store metadata information in the metastore. You can either use an existing database instance or install a new instance manually.

Refer to the Support Matrix for information regarding supported metastore databases.

The following sections describe how to install and configure the metastore database.

1.1.5.1. Metastore Database Prerequisites

The database administrator must create the following users and specify the following values:

  • For Apache Hive: hive_dbname, hive_dbuser, and hive_dbpasswd.
  • For Apache Oozie: oozie_dbname, oozie_dbuser, and oozie_dbpasswd.

Note

By default, Hive uses the Derby database for the metastore. However, Derby is not supported for production systems.

  • For Hue: Hue user name and Hue user password
  • For Apache Ranger: RANGER_ADMIN_DB_NAME

1.1.5.2. Installing and Configuring PostgreSQL

The following instructions explain how to install PostgreSQL as the metastore database. See your third-party documentation for instructions on how to install other supported databases.

Important

Prior to using PostgreSQL as your Hive metastore, consult the official PostgreSQL documentation and ensure you are using a JDBC 4+ driver that corresponds to your implementation of PostgreSQL.

1.1.5.2.1. Installing PostgreSQL on RHEL and CentOS

Use the following instructions to install a new instance of PostgreSQL:

  1. Using a terminal window, connect to the host machine where you plan to deploy a PostgreSQL instance:

yum install postgresql-server

  2. Start the instance:

/etc/init.d/postgresql start

For some newer versions of PostgreSQL, you might need to execute the command /etc/init.d/postgresql initdb.

  3. Reconfigure PostgreSQL server:
  • Edit the /var/lib/pgsql/data/postgresql.conf file.

Change the value of #listen_addresses = 'localhost' to listen_addresses = '*'.

  • Edit the /var/lib/pgsql/data/postgresql.conf file.

Remove comments from the "port = " line and specify the port number (default 5432).

  • Edit the /var/lib/pgsql/data/pg_hba.conf file by adding the following:

host all all 0.0.0.0/0 trust

  • If you are using PostgreSQL v9.1 or later, add the following to the /var/lib/pgsql/data/postgresql.conf file:

standard_conforming_strings = off

  4. Create users for PostgreSQL server by logging in as the root user and entering the following syntax:

echo "CREATE DATABASE $dbname;" | sudo -u $postgres psql -U postgres 
echo "CREATE USER $user WITH PASSWORD '$passwd';" | sudo -u $postgres psql -U postgres 
echo "GRANT ALL PRIVILEGES ON DATABASE $dbname TO $user;" | sudo -u $postgres psql -U postgres 

The previous syntax should have the following values:
• $postgres is the postgres user.
• $user is the user you want to create.
• $dbname is the name of your PostgreSQL database.

Note

For access to the Hive metastore, you must create hive_dbuser after Hive has been installed, and for access to the Oozie metastore, you must create oozie_dbuser after Oozie has been installed.

  5. On the Hive metastore host, install the connector:

yum install postgresql-jdbc*

  6. Confirm that the .jar file is in the Java share directory:

ls -l /usr/share/java/postgresql-jdbc.jar
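
As a concrete illustration of step 4, the following sketch creates a Hive metastore database and user after Hive has been installed; hive_db, hive, and hivepassword are placeholder values, not required names:

echo "CREATE DATABASE hive_db;" | sudo -u postgres psql -U postgres                          # placeholder database name
echo "CREATE USER hive WITH PASSWORD 'hivepassword';" | sudo -u postgres psql -U postgres    # placeholder user and password
echo "GRANT ALL PRIVILEGES ON DATABASE hive_db TO hive;" | sudo -u postgres psql -U postgres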

1.1.5.2.2. Installing PostgreSQL on Ubuntu 18/20

To install a new instance of PostgreSQL:

  1. Connect to the host machine where you plan to deploy PostgreSQL instance. At a terminal window, enter:

apt-get install postgresql

  2. Start the instance.

Note

For some newer versions of PostgreSQL, you might need to execute the command:

/etc/init.d/postgresql initdb

  3. Reconfigure PostgreSQL server:
  • Edit the /var/lib/pgsql/data/postgresql.conf file.

Change the value of #listen_addresses = 'localhost' to listen_addresses = '*'

  • Edit the /var/lib/pgsql/data/postgresql.conf file.

Change the port setting from #port = 5432 to port = 5432

  • Edit the /var/lib/pgsql/data/pg_hba.conf

Add the following:

host all all 0.0.0.0/0 trust

  • Optional: If you are using PostgreSQL v9.1 or later, add the following to the /var/lib/pgsql/data/postgresql.conf file:

standard_conforming_strings = off

  4. Create users for PostgreSQL server.

Log in as the root user and enter:

echo "CREATE DATABASE $dbname;" | sudo -u $postgres psql -U postgres 
echo "CREATE USER $user WITH PASSWORD '$passwd';" | sudo -u $postgres psql -U postgres 
echo "GRANT ALL PRIVILEGES ON DATABASE $dbname TO $user;" | sudo -u $postgres psql -U postgres

Where: $postgres is the postgres user, $user is the user you want to create, and $dbname is the name of your PostgreSQL database.

Note

For access to the Hive metastore, create hive_dbuser after Hive has been installed, and for access to the Oozie metastore, create oozie_dbuser after Oozie has been installed.

  5. On the Hive Metastore host, install the connector.

apt-get install -y libpostgresql-jdbc-java

  6. Copy the connector .jar file to the Java share directory.

cp /usr/share/java/postgresql-*jdbc3.jar /usr/share/java/postgresql-jdbc.jar

  7. Confirm that the .jar is in the Java share directory.

ls /usr/share/java/postgresql-jdbc.jar

  8. Change the access mode of the .jar file to 644.

chmod 644 /usr/share/java/postgresql-jdbc.jar
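
Optionally, you can verify that the PostgreSQL server is running and accepting local connections; this is a minimal check, not a required step:

sudo -u postgres psql -c "SELECT version();"   # prints the PostgreSQL version if the server is reachable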

1.1.5.3. Installing and Configuring MariaDB

This section describes how to install MariaDB as the metastore database. For instructions on how to install other supported databases, see your third-party documentation.

For additional information regarding MariaDB, see MariaDB.

1.1.5.3.1. Installing MariaDB on RHEL and CentOS

Important

If you are installing on CentOS or RHEL, it is highly recommended that you install from a repository using yum.

Follow these steps to install a new instance of MariaDB on RHEL and CentOS:

  1. There are YUM repositories for several YUM-based Linux distributions. Use the MariaDB Downloads page to generate the YUM repository.

  2. Move the MariaDB repo file to the directory /etc/yum.repos.d/.

It is suggested that you name your file MariaDB.repo.

The following is an example MariaDB.repo file for CentOS 7:

[mariadb] 
name=MariaDB 
baseurl=http://yum.mariadb.org/10.1/centos7-amd64 
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB 
gpgcheck=1 

In this example, the gpgkey line automatically fetches the GPG key that is used to sign the repositories. gpgkey enables yum and rpm to verify the integrity of the packages that they download. The ID of MariaDB's signing key is 0xcbcb082a1bb943db. The short form of the ID is 0x1BB943DB, and the full key fingerprint is: 1993 69E5 404B D5FC 7D2F E43B CBCB 082A 1BB9 43DB.

If you want to fix the version to an older version, follow the instructions on Adding the MariaDB YUM Repository.

  3. If you do not have the MariaDB GPG signing key installed, YUM prompts you to install it after downloading the packages. If you are prompted to do so, install the MariaDB GPG signing key.

  4. Use the following command to install MariaDB:

sudo yum install MariaDB-server MariaDB-client

  5. If you already have the MariaDB-Galera-server package installed, you might need to remove it prior to installing MariaDB-server. If you need to remove MariaDB-Galera-server, use the following command:

sudo yum remove MariaDB-Galera-server

No databases are removed when the MariaDB-Galera-server rpm package is removed, though with any upgrade, it is best to have backups.

  6. Install MariaDB with YUM by following the directions at Enabling MariaDB.

  7. Use one of the following commands to start MariaDB:

  • If your system is using systemctl:

sudo systemctl start mariadb

  • If your system is not using systemctl:

sudo /etc/init.d/mysql start
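
Optionally, you can confirm that MariaDB started and harden the default installation; mysql_secure_installation is the standard MariaDB helper script, shown here only as a suggested follow-up:

sudo systemctl status mariadb      # on systemctl systems, confirm the service is active
sudo mysql_secure_installation     # optional: set the root password and remove test data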

1.1.5.4. Installing and Configuring MySQL

This section describes how to install MySQL as the metastore database. For instructions on how to install other supported databases, see your third-party documentation.

Important

When you use MySQL as your Hive metastore, you must use the mysql-connector-java-5.1.35.zip or later JDBC driver.

1.1.5.4.1. Installing MySQL on RHEL and CentOS

To install a new instance of MySQL:

  1. Connect to the host machine you plan to use for Hive and HCatalog.

  2. Install MySQL server.

From a terminal window, enter:

yum install mysql-community-release

For CentOS 7, install MySQL server from the ODP-Utils repository.

  3. Start the instance.

/etc/init.d/mysqld start

  4. Set the root user password using the following command format:

mysqladmin -u root password $mysqlpassword

For example, use the following command to set the password to "root":

mysqladmin -u root password root

  5. Remove unnecessary information from log and STDOUT:

mysqladmin -u root 2>&1> /dev/null

  6. Log in to MySQL as the root user:

mysql -u root -proot

In this syntax, "root" is the root user password.

  7. Log in as the root user, create the "dbuser" user, and grant it adequate privileges:
[root@c6402 /]# mysql -u root -proot 
Welcome to the MySQL monitor. Commands end with ; or \g. 
Your MySQL connection id is 11 
Server version: 5.1.73 Source distribution 
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved. 
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. 
Type 'help;' or '\h' for help. Type '\c' to clear the current input  statement. 
mysql> CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'dbuser';
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost';
Query OK, 0 rows affected (0.00 sec) 
mysql> CREATE USER 'dbuser'@'%' IDENTIFIED BY 'dbuser'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> FLUSH PRIVILEGES; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost' WITH GRANT  OPTION; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec) 
mysql>
  8. Use the exit command to exit MySQL.

  9. You should now be able to reconnect to the database as "dbuser" by using the following command:

mysql -u dbuser -pdbuser

After testing the dbuser login, use the exit command to exit MySQL.

  10. Install the MySQL connector .jar file:

yum install mysql-connector-java*
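
You can then confirm that the connector was installed; on most RHEL/CentOS systems the package places the driver in the Java share directory, as in this sketch:

ls -l /usr/share/java/mysql-connector-java.jar   # the JDBC driver Hive uses for its MySQL metastore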

1.1.5.4.2. Installing MySQL on Ubuntu 18/20

To install a new instance of MySQL:

  1. Connect to the host machine you plan to use for Hive and HCatalog.

  2. Install MySQL server.

From a terminal window, enter:

apt-get install mysql-server

  3. Start the instance.

/etc/init.d/mysql start

  4. Set the root user password using the following command format:

mysqladmin -u root password $mysqlpassword

For example, to set the password to "root":

mysqladmin -u root password root

  5. Remove unnecessary information from log and STDOUT.

mysqladmin -u root 2>&1> /dev/null

  6. Log in to MySQL as the root user:

mysql -u root -proot

  7. Log in as the root user, create the dbuser, and grant it adequate privileges. This user provides access to the Hive metastore. Use the following series of commands (shown here with the returned responses) to create dbuser with password dbuser.
[root@c6402 /]# mysql -u root -proot 
Welcome to the MySQL monitor. Commands end with ; or \g. 
Your MySQL connection id is 11 
Server version: 5.1.73 Source distribution 
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved. 
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. 
Type 'help;' or '\h' for help. Type '\c' to clear the current input  statement. 
mysql> CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'dbuser'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> CREATE USER 'dbuser'@'%' IDENTIFIED BY 'dbuser'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%'; 
Query OK, 0 rows affected (0.00 sec) 
mysql> FLUSH PRIVILEGES; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost' WITH GRANT  OPTION; 
Query OK, 0 rows affected (0.00 sec) 
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%' WITH GRANT OPTION; 
Query OK, 0 rows affected (0.00 sec) 
mysql> 
  8. Use the exit command to exit MySQL.

  9. You should now be able to reconnect to the database as dbuser, using the following command:

mysql -u dbuser -pdbuser

After testing the dbuser login, use the exit command to exit MySQL.

  10. Install the MySQL connector JAR file.

apt-get install mysql-connector-java*
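
After the dbuser account is working, a typical next step is to create the metastore database that Hive will use; the following sketch uses the example dbuser credentials from step 7 and a placeholder database name, hive_metastore:

mysql -u dbuser -pdbuser -e "CREATE DATABASE hive_metastore;"   # hive_metastore is a placeholder name
mysql -u dbuser -pdbuser -e "SHOW DATABASES;"                   # confirm the database was created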

1.1.5.5. Configuring Oracle as the Metastore Database

You can select Oracle as the metastore database. For instructions on how to install the databases, see your third-party documentation. To configure Oracle as the Hive Metastore, install ODP and Hive, and then follow the instructions in "Set up Oracle DB for use with Hive Metastore" in this guide.

1.2. Virtualisation and Cloud Platforms

Open Source Data Platform (ODP) is certified and supported when running on virtual or cloud platforms (for example, VMware vSphere or Amazon Web Services EC2) if the respective guest operating system is supported by ODP and any issues detected on these platforms are reproducible on the same supported operating system installed elsewhere.

See the Support Matrix for the list of supported operating systems for ODP.

1.3. Configuring Remote Repositories

The standard ODP install fetches the software from a remote yum repository over the Internet. To use this option, you must set up access to the remote repository and have an available Internet connection for each of your hosts. To download the ODP maven artifacts and build your own repository, see Download the ODP Maven Artifacts.

Important

See the ODP Release Notes for the ODP 3.2.2.0-1 repo information.

Note

If your cluster does not have access to the Internet, or if you are creating a large cluster and you want to conserve bandwidth, you can instead provide a local copy of the ODP repository that your hosts can access.

  • RHEL/CentOS 7

wget -nv http://public-repo-1.acceldata.com/ODP/centos7/3.2.2.0-1/odp.repo -O /etc/yum.repos.d/odp.repo

  • Ubuntu 18/20

wget http://public-repo-1.acceldata.com/ODP/ubuntu<version>/3.2.2.0-1/odp.list -O /etc/apt/sources.list.d/odp.list 
apt-get update
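
After adding the repository definition, you can optionally confirm that your hosts can see it; a minimal check, assuming the hadoop package name used later in this guide:

yum repolist              # RHEL/CentOS 7: the ODP repository should appear in the output
apt-cache policy hadoop   # Ubuntu 18/20: should list a candidate package from the ODP repository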

1.4. Deciding on a Deployment Type

While it is possible to deploy all of ODP on a single host, you should use at least four hosts: one master host and three slaves.

1.5. Collect Information

To deploy your ODP instance, you need the following information:

  • The fully qualified domain name (FQDN) for each host in your system, and the components you want to set up on each host. You can use hostname -f to check for the FQDN.

  • If you install Apache Hive, HCatalog, or Apache Oozie, you need the host name, database name, user name, and password for the metastore instance.

Note

If you are using an existing instance, the dbuser you create for ODP must be granted ALL PRIVILEGES permissions on that instance.
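
For example, on a MySQL instance the grant might look like the following; dbuser and the '%' host wildcard are placeholders that should match the account you actually created:

GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%';   # run from the mysql client as the root user
FLUSH PRIVILEGES;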

1.6. Prepare the Environment

To deploy your ODP instance, you must prepare your deployment environment:

  • Enable NTP on Your Cluster
  • Disable SELinux
  • Disable IPTables

1.6.1. Enable NTP on Your Cluster

The clocks of all the nodes in your cluster must be synchronized. If your system does not have access to the Internet, you should set up a master node as an NTP server to achieve this synchronization.

Use the following instructions to enable NTP for your cluster:

  1. Configure NTP clients by executing the following command on each node in your cluster:
  • For RHEL/CentOS 7:

a. Configure the NTP clients:

yum install ntp

b. Enable the service:

systemctl enable ntpd

c. Start NTPD:

systemctl start ntpd

  2. Enable the service by executing the following command on each node in your cluster:
  • For RHEL/CentOS

chkconfig ntpd on

  • For Ubuntu 18/20:

chkconfig ntp on

  3. Start NTP. Execute the following command on all the nodes in your cluster:
  • For RHEL/CentOS 7:

/etc/init.d/ntpd start

  • For Ubuntu 18/20

/etc/init.d/ntp start

  4. If you want to use an existing NTP server in your environment, complete the following steps:

    a. Configure the firewall on the local NTP server to enable UDP input traffic on Port 123 and replace 192.168.1.0/24 with the IP addresses in the cluster, as shown in the following example using RHEL hosts:

    # iptables -A RH-Firewall-1-INPUT -s 192.168.1.0/24 -m state --state NEW -p udp --dport 123 -j ACCEPT

    b. Save and restart iptables. Execute the following command on all the nodes in your cluster:

     # service iptables save 
     # service iptables restart 
    

    c. Finally, configure clients to use the local NTP server. Edit the /etc/ntp.conf file and add the following line:

    server $LOCAL_SERVER_IP OR HOSTNAME
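
After the clients restart ntpd, you can optionally confirm that each node is synchronizing with the intended server; a minimal check:

ntpq -p   # the configured NTP server should appear in the peer list with a nonzero reach value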
    

1.6.2. Disable SELinux

The Security-Enhanced Linux (SELinux) feature should be disabled during the installation process.

  1. Check the state of SELinux. On all the host machines, execute the following command:

getenforce

If the command returns "disabled" or "permissive" as the response, no further actions are required. If the result is "enforcing", proceed to Step 2.

  2. Disable SELinux either temporarily for each session or permanently.
  • Disable SELinux temporarily by executing the following command:

setenforce 0

  • Disable SELinux permanently in the /etc/sysconfig/selinux file by changing the value of the SELINUX field to permissive or disabled. Restart your system.
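
As a sketch of the permanent change, the following command edits the SELINUX field in place; it assumes the file path named above and requires a restart to take effect:

sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/sysconfig/selinux   # or set it to permissive, then reboot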

1.6.3. Disable IPTables

Because certain ports must be open and available during installation, you should temporarily disable iptables. If the security protocols at your installation do not allow you to disable iptables, you can proceed with iptables enabled, provided all of the relevant ports are open and available; otherwise, the cluster installation fails.

  • On all RHEL/CentOS 6 host machines, execute the following commands to disable iptables:
chkconfig iptables off 
service iptables stop

Restart iptables after your setup is complete.

  • On RHEL/CENTOS 7 host machines, execute the following commands to disable firewalld:
systemctl stop firewalld
systemctl mask firewalld

Restart firewalld after your setup is complete.

On Ubuntu 18/20 host machines, execute the following command to disable the ufw firewall:

service ufw stop

Restart ufw after your setup is complete.

Important

If you leave iptables enabled and do not set up the necessary ports, the cluster installation fails.

1.7. Download Companion Files

You can download and extract a set of companion files, including script files and configuration files, that you can then modify to match your own cluster environment:

To download and extract the files:

wget http://public-repo-1.acceldata.com/ODP/tools/3.2.2.0-1/odp_manual_install_rpm_helper_files-3.2.2.0.1.tar.gz 
tar zxvf odp_manual_install_rpm_helper_files-3.2.2.0.1.tar.gz

Important

See the ODP Release Notes for the ODP 3.2.2.0 repo information.

1.8. Define Environment Parameters

You must set up specific users and directories for your ODP installation by using the following instructions:

  1. Define directories.

The following table describes the directories you need for installation, configuration, data storage, process IDs, and log information based on the Apache Hadoop Services you plan to install. Use this table to define what you are going to use to set up your environment.

Note

The scripts.zip file that you downloaded in the supplied companion files includes a script, directories.sh, for setting directory environment parameters.

You should edit and source this file (or copy its contents to your ~/.bash_profile) to set up these environment variables in your environment.

Table 1.1. Directories Needed to Install Core Hadoop

Hadoop Service Parameter Definition
HDFS DFS_NAME_DIR Space-separated list of directories where NameNode should store the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn.
HDFS DFS_DATA_DIR Space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn.
HDFS FS_CHECKPOINT_DIR Space separated list of directories where SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn
HDFS HDFS_LOG_DIR Directory for storing the HDFS logs. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs, where hdfs is the $HDFS_USER
HDFS HDFS_PID_DIR Directory for storing the HDFS process ID. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs, where hdfs is the $HDFS_USER
HDFS HADOOP_CONF_DIR Directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf
YARN YARN_LOCAL_DIR Space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn /grid1/hadoop/yarn /grid2/hadoop/yarn
YARN YARN_LOG_DIR Directory for storing the YARN logs. For example, /var/log/hadoop/yarn. This directory name is a combination of a directory and the $YARN_USER. In the example yarn is the $YARN_USER.
YARN YARN_LOCAL_LOG_DIR Space-separated list of directories where YARN stores container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/log.
YARN YARN_PID_DIR Directory for storing the YARN process ID. For example, /var/run/hadoop/yarn. This directory name is a combination of a directory and the $YARN_USER. In the example, yarn is the $YARN_USER.
MapReduce MAPRED_LOG_DIR Directory for storing the JobHistory Server logs. For example, /var/log/hadoop/mapred. This directory name is a combination of a directory and the $MAPRED_USER. In the example, mapred is the $MAPRED_USER

Table 1.2. Directories Needed to Install Ecosystem Components

Hadoop Service Parameter Definition
Oozie OOZIE_CONF_DIR Directory to store the Oozie configuration files. For example, /etc/oozie/conf.
Oozie OOZIE_DATA Directory to store the Oozie data. For example, /var/db/oozie.
Oozie OOZIE_LOG_DIR Directory to store the Oozie logs. For example, /var/log/oozie.
Oozie OOZIE_PID_DIR Directory to store the Oozie process ID. For example, /var/run/oozie.
Oozie OOZIE_TMP_DIR Directory to store the Oozie temporary files. For example, /var/tmp/oozie.
Hive HIVE_CONF_DIR Directory to store the Hive configuration files. For example, /etc/hive/conf.
Hive HIVE_LOG_DIR Directory to store the Hive logs. For example, /var/log/hive.
Hive HIVE_PID_DIR Directory to store the Hive process ID. For example, /var/run/hive.
WebHCat WEBHCAT_CONF_DIR Directory to store the WebHCat configuration files. For example, /etc/hcatalog/conf/webhcat.
WebHCat WEBHCAT_LOG_DIR Directory to store the WebHCat logs. For example, /var/log/webhcat.
WebHCat WEBHCAT_PID_DIR Directory to store the WebHCat process ID. For example, /var/run/webhcat.
HBase HBASE_CONF_DIR Directory to store the Apache HBase configuration files. For example, /etc/hbase/conf.
HBase HBASE_LOG_DIR Directory to store the HBase logs. For example, /var/log/hbase.
HBase HBASE_PID_DIR Directory to store the HBase process ID. For example, /var/run/hbase.
ZooKeeper ZOOKEEPER_DATA_DIR Directory where Apache ZooKeeper stores data. For example, /grid/hadoop/zookeeper/data
ZooKeeper ZOOKEEPER_CONF_DIR Directory to store the ZooKeeper configuration files. For example, /etc/zookeeper/conf.
ZooKeeper ZOOKEEPER_LOG_DIR Directory to store the ZooKeeper logs. For example, /var/log/zookeeper.
ZooKeeper ZOOKEEPER_PID_DIR Directory to store the ZooKeeper process ID. For example, /var/run/zookeeper.
Sqoop SQOOP_CONF_DIR Directory to store the Apache Sqoop configuration files. For example, /etc/sqoop/conf.

If you use the companion files, the following screen provides a snapshot of how your directories.sh file should look after you edit the TODO variables:

#!/bin/sh 

# 
# Directories Script 
# 
# 1. To use this script, you must edit the TODO variables below for your  environment. 
# 
# 2. Warning: Leave the other parameters as the default values. Changing  these default values requires you to 
# change values in other configuration files. 
# 

# 
# Hadoop Service - HDFS 
# 

# Space separated list of directories where NameNode stores the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn 
DFS_NAME_DIR="TODO-LIST-OF-NAMENODE-DIRS"; 

# Space separated list of directories where DataNodes store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn 
DFS_DATA_DIR="TODO-LIST-OF-DATA-DIRS"; 

# Space separated list of directories where SecondaryNameNode stores the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn 
FS_CHECKPOINT_DIR="TODO-LIST-OF-SECONDARY-NAMENODE-DIRS"; 

# Directory to store the HDFS logs. 
HDFS_LOG_DIR="/var/log/hadoop/hdfs";

# Directory to store the HDFS process ID. 
HDFS_PID_DIR="/var/run/hadoop/hdfs"; 

# Directory to store the Hadoop configuration files. 
HADOOP_CONF_DIR="/etc/hadoop/conf"; 

# 
# Hadoop Service - YARN 
# 

# Space separated list of directories where YARN stores temporary data. For  example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local 
YARN_LOCAL_DIR="TODO-LIST-OF-YARN-LOCAL-DIRS"; 

# Directory to store the YARN logs. 
YARN_LOG_DIR="/var/log/hadoop/yarn";

# Space separated list of directories where YARN stores container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs 
YARN_LOCAL_LOG_DIR="TODO-LIST-OF-YARN-LOCAL-LOG-DIRS"; 

# Directory to store the YARN process ID. 
YARN_PID_DIR="/var/run/hadoop/yarn"; 

# 
# Hadoop Service - MAPREDUCE 
# 

# Directory to store the MapReduce daemon logs. 
MAPRED_LOG_DIR="/var/log/hadoop/mapred"; 
# Directory to store the MapReduce JobHistory process ID. 
MAPRED_PID_DIR="/var/run/hadoop/mapred"; 

# 
# Hadoop Service - Hive 
# 

# Directory to store the Hive configuration files. 
HIVE_CONF_DIR="/etc/hive/conf"; 

# Directory to store the Hive logs. 
HIVE_LOG_DIR="/var/log/hive"; 

# Directory to store the Hive process ID. 
HIVE_PID_DIR="/var/run/hive"; 

# 
# Hadoop Service - WebHCat (Templeton) 
# 

# Directory to store the WebHCat (Templeton) configuration files. 
WEBHCAT_CONF_DIR="/etc/hcatalog/conf/webhcat"; 

# Directory to store the WebHCat (Templeton) logs. 
WEBHCAT_LOG_DIR="/var/log/webhcat"; 

# Directory to store the WebHCat (Templeton) process ID.
WEBHCAT_PID_DIR="/var/run/webhcat"; 

# 
# Hadoop Service - HBase 
# 

# Directory to store the HBase configuration files. 
HBASE_CONF_DIR="/etc/hbase/conf"; 

# Directory to store the HBase logs. 
HBASE_LOG_DIR="/var/log/hbase"; 

# Directory to store the HBase process ID. 
HBASE_PID_DIR="/var/run/hbase";

# 
# Hadoop Service - ZooKeeper 
# 

# Directory where ZooKeeper stores data. For example, /grid1/hadoop/zookeeper/data 
ZOOKEEPER_DATA_DIR="TODO-ZOOKEEPER-DATA-DIR"; 

# Directory to store the ZooKeeper configuration files. 
ZOOKEEPER_CONF_DIR="/etc/zookeeper/conf"; 

# Directory to store the ZooKeeper logs. 
ZOOKEEPER_LOG_DIR="/var/log/zookeeper"; 

# Directory to store the ZooKeeper process ID. 
ZOOKEEPER_PID_DIR="/var/run/zookeeper"; 

# 
# Hadoop Service - Oozie 
# 

# Directory to store the Oozie configuration files. 
OOZIE_CONF_DIR="/etc/oozie/conf" 

# Directory to store the Oozie data. 
OOZIE_DATA="/var/db/oozie" 

# Directory to store the Oozie logs. 
OOZIE_LOG_DIR="/var/log/oozie" 

# Directory to store the Oozie process ID.
OOZIE_PID_DIR="/var/run/oozie" 

# Directory to store the Oozie temporary files. 
OOZIE_TMP_DIR="/var/tmp/oozie" 

# 
# Hadoop Service - Sqoop 
# 
SQOOP_CONF_DIR="/etc/sqoop/conf" 
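
After you replace the TODO values, source the edited script in every shell you use for the installation (or append its contents to your ~/.bash_profile); a minimal sketch, assuming directories.sh is in the current directory:

chmod +x directories.sh   # make the edited script executable (optional)
. ./directories.sh        # load the directory variables into the current shell
echo $HADOOP_CONF_DIR     # spot-check that a variable is set, for example /etc/hadoop/conf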


  2. The following table describes system user accounts and groups. Use this table to define what you are going to use in setting up your environment. These users and groups should reflect the accounts you create in Create System Users and Groups. The scripts.zip file you downloaded includes a script, usersAndGroups.sh, for setting user and group environment parameters.

Table 1.3. Define Users and Groups for Systems

Parameter Definition
HDFS_USER User that owns the Hadoop Distributed File System (HDFS) services. For example, hdfs.
YARN_USER User that owns the YARN services. For example, yarn.
ZOOKEEPER_USER User that owns the ZooKeeper services. For example, zookeeper.
HIVE_USER User that owns the Hive services. For example, hive.
WEBHCAT_USER User that owns the WebHCat services. For example, hcat.
HBASE_USER User that owns the HBase services. For example, hbase.
SQOOP_USER User owning the Sqoop services. For example, sqoop.
KAFKA_USER User owning the Apache Kafka services. For example, kafka.
OOZIE_USER User owning the Oozie services. For example, oozie.
HADOOP_GROUP A common group shared by services. For example, hadoop.
KNOX_USER User that owns the Knox Gateway services. For example, knox.

1.9. Creating System Users and Groups

In general, Apache Hadoop services should be owned by specific users and not by root or application users. The following table shows the typical users for Hadoop services. If you choose to install the ODP components using the RPMs, these users are automatically set up.

If you do not install with the RPMs, or want different users, then you must identify the users that you want for your Hadoop services and the common Hadoop group and create these accounts on your system.

To create these accounts manually, you must follow this procedure:

Add the user to the group.

useradd -G <groupname> <username>

Table 1.4. Typical System Users and Groups

Hadoop Service User Group
HDFS hdfs hadoop
YARN yarn hadoop
MapReduce mapred hadoop, mapred
Hive hive hadoop
HCatalog/WebHCatalog hcat hadoop
HBase hbase hadoop
Sqoop sqoop hadoop
ZooKeeper zookeeper hadoop
Oozie oozie hadoop
Knox Gateway knox hadoop
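
For example, a minimal sketch that creates the common group and a few of the accounts from Table 1.4, following the useradd pattern shown above; extend it with the remaining services you plan to install:

groupadd hadoop                   # common group shared by all Hadoop services
groupadd mapred                   # additional group for the MapReduce user
useradd -G hadoop hdfs            # HDFS service account
useradd -G hadoop yarn            # YARN service account
useradd -G hadoop,mapred mapred   # MapReduce service account, in both groups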

1.10. Determining ODP Memory Configuration Settings

You can use either of two methods to determine YARN and MapReduce memory configuration settings:

  • Running the YARN Utility Script
  • Calculating YARN and MapReduce Memory Requirements

The ODP utility script is the recommended method for calculating ODP memory configuration settings, but information about manually calculating YARN and MapReduce memory configuration settings is also provided for reference.

1.10.1. Running the YARN Utility Script

This section describes how to use the yarn-utils.py script to calculate YARN, MapReduce, Hive, and Tez memory allocation settings based on the node hardware specifications. The yarn-utils.py script is included in the ODP companion files. See Download Companion Files.

To run the yarn-utils.py script, execute the following command from the folder containing the script: python yarn-utils.py options, where options are as follows:

Table 1.5. yarn-utils.py Options

Option Description
-c CORES The number of cores on each host
-m MEMORY The amount of memory on each host, in gigabytes
-d DISKS The number of disks on each host
-k HBASE "True" if HBase is installed; "False" if not

Notes

The script requires Python 2.6 to run.

You can also use the -h or --help option to display a Help message that describes the options.

Example: Running the following command from the odp_manual_install_rpm_helper_files-3.2.2.0.$BUILD directory:

python yarn-utils.py -c 16 -m 64 -d 4 -k True

Returns

Using cores=16 memory=64GB disks=4 hbase=True 
Profile: cores=16 memory=49152MB reserved=16GB usableMem=48GB disks=4 Num Container=8 
Container Ram=6144MB 
Used Ram=48GB 
Unused Ram=16GB 
yarn.scheduler.minimum-allocation-mb=6144 
yarn.scheduler.maximum-allocation-mb=49152 
yarn.nodemanager.resource.memory-mb=49152 
mapreduce.map.memory.mb=6144 
mapreduce.map.java.opts=-Xmx4096m 
mapreduce.reduce.memory.mb=6144 
mapreduce.reduce.java.opts=-Xmx4096m 
yarn.app.mapreduce.am.resource.mb=6144 
yarn.app.mapreduce.am.command-opts=-Xmx4096m 
mapreduce.task.io.sort.mb=1792 
tez.am.resource.memory.mb=6144 
tez.am.launch.cmd-opts =-Xmx4096m 
hive.tez.container.size=6144 
hive.tez.java.opts=-Xmx4096m

1.10.2. Calculating YARN and MapReduce Memory Requirements

This section describes how to manually configure YARN and MapReduce memory allocation settings based on the node hardware specifications.

YARN takes into account all of the available compute resources on each machine in the cluster. Based on the available resources, YARN negotiates resource requests from applications running in the cluster, such as MapReduce. YARN then provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements such as memory and CPU.

In an Apache Hadoop cluster, it is vital to balance the use of memory (RAM), processors (CPU cores), and disks so that processing is not constrained by any one of these cluster resources. As a general recommendation, allowing for two containers per disk and per core gives the best balance for cluster utilization.

When determining the appropriate YARN and MapReduce memory configurations for a cluster node, you should start with the available hardware resources. Specifically, note the following values on each node:

  • RAM (amount of memory)
  • CORES (number of CPU cores)
  • DISKS (number of disks)

The total available RAM for YARN and MapReduce should take into account the Reserved Memory. Reserved memory is the RAM needed by system processes and other Hadoop processes (such as HBase):

reserved memory = stack memory reserve + HBase memory reserve (if HBase is on the same node)

You can use the values in the following table to determine what you need for reserved memory per node:

Table 1.6. Reserved Memory Recommendations

Total Memory per Node Recommended Reserved System Memory Recommended Reserved HBase Memory
4 GB 1 GB 1 GB
8 GB 2 GB 1 GB
16 GB 2 GB 2 GB
24 GB 4 GB 4 GB
48 GB 6 GB 8 GB
64 GB 8 GB 8 GB
72 GB 8 GB 8 GB
96 GB 12 GB 16 GB
128 GB 24 GB 24 GB
256 GB 32 GB 32 GB
512 GB 64 GB 64 GB

After you determine the amount of memory you need per node, you must determine the maximum number of containers allowed per node:

Number of containers = min (2 * CORES, 1.8 * DISKS, (total available RAM) / MIN_CONTAINER_SIZE)

DISKS is the value for dfs.data.dirs (number of data disks) per machine.

MIN_CONTAINER_SIZE is the minimum container size (in RAM). This value depends on the amount of RAM available; in smaller memory nodes, the minimum container size should also be smaller.

The following table provides the recommended values:

Table 1.7. Recommended Container Size Values

Total RAM per Node Recommended Minimum Container Size
Less than 4 GB 256 MB
Between 4 GB and 8 GB 512 MB
Between 8 GB and 24 GB 1024 MB
Above 24 GB 2048 MB

Finally, you must determine the amount of RAM per container:

RAM-per-container = max(MIN_CONTAINER_SIZE, (total available RAM) / containers)

Using the results of all the previous calculations, you can configure YARN and MapReduce.

Table 1.8. YARN and MapReduce Configuration Values

Configuration File Configuration Setting Value Calculation
yarn-site.xml yarn.nodemanager.resource.memory-mb = containers * RAM-per-container
yarn-site.xml yarn.scheduler.minimum-allocation-mb = RAM-per-container
yarn-site.xml yarn.scheduler.maximum-allocation-mb = containers * RAM-per-container
mapred-site.xml mapreduce.map.memory.mb = RAM-per-container
mapred-site.xml mapreduce.reduce.memory.mb = 2 * RAM-per-container
mapred-site.xml mapreduce.map.java.opts = 0.8 * RAM-per-container
mapred-site.xml mapreduce.reduce.java.opts = 0.8 * 2 * RAM-per-container
mapred-site.xml yarn.app.mapreduce.am.resource.mb = 2 * RAM-per-container
mapred-site.xml yarn.app.mapreduce.am.command-opts = 0.8 * 2 * RAM-per-container

Note

After installation, both yarn-site.xml and mapred-site.xml are located in the /etc/hadoop/conf folder.

Example: Assume that your cluster nodes have 12 CPU cores, 48 GB RAM, and 12 disks:

Reserved memory = 6 GB system memory reserve + 8 GB for HBase 
Minimum container size = 2 GB

If there is no HBase, then you can use the following calculation:

Number of containers = min (2 * 12, 1.8 * 12, (48-6)/2) = min (24, 21.6, 21) = 21 
RAM-per-container = max (2, (48-6)/21) = max (2, 2) = 2

Table 1.9. Example Value Calculations Without HBase

Configuration Value Calculation
yarn.nodemanager.resource.memory-mb = 21 * 2 = 42*1024 MB
yarn.scheduler.minimum-allocation-mb = 2*1024 MB
yarn.scheduler.maximum-allocation-mb = 21 * 2 = 42*1024 MB
mapreduce.map.memory.mb = 2*1024 MB
mapreduce.reduce.memory.mb = 2 * 2 = 4*1024 MB
mapreduce.map.java.opts = 0.8 * 2 = 1.6*1024 MB
mapreduce.reduce.java.opts = 0.8 * 2 * 2 = 3.2*1024 MB
yarn.app.mapreduce.am.resource.mb = 2 * 2 = 4*1024 MB
yarn.app.mapreduce.am.command-opts = 0.8 * 2 * 2 = 3.2*1024 MB

If HBase is included:

Number of containers = min (2 * 12, 1.8 * 12, (48-6-8)/2) = min (24, 21.6, 17) = 17 
RAM-per-container = max (2, (48-6-8)/17) = max (2, 2) = 2

Table 1.10. Example Value Calculations with HBase

Configuration Value Calculation
yarn.nodemanager.resource.memory-mb = 17 * 2 = 34*1024 MB
yarn.scheduler.minimum-allocation-mb = 2*1024 MB
yarn.scheduler.maximum-allocation-mb = 17 * 2 = 34*1024 MB
mapreduce.map.memory.mb = 2*1024 MB
mapreduce.reduce.memory.mb = 2 * 2 = 4*1024 MB
mapreduce.map.java.opts = 0.8 * 2 = 1.6*1024 MB
mapreduce.reduce.java.opts = 0.8 * 2 * 2 = 3.2*1024 MB
yarn.app.mapreduce.am.resource.mb = 2 * 2 = 4*1024 MB
yarn.app.mapreduce.am.command-opts = 0.8 * 2 * 2 = 3.2*1024 MB

Notes:

  • Updating values for yarn.scheduler.minimum-allocation-mb without also changing yarn.nodemanager.resource.memory-mb, or changing yarn.nodemanager.resource.memory-mb without also changing yarn.scheduler.minimum-allocation-mb changes the number of containers per node.
  • If your installation has a large amount of RAM but not many disks or cores, you can free RAM for other tasks by lowering both yarn.scheduler.minimum-allocation-mb and yarn.nodemanager.resource.memory-mb.
  • With MapReduce on YARN, there are no longer preconfigured static slots for Map and Reduce tasks.

The entire cluster is available for dynamic resource allocation of Map and Reduce tasks as needed by each job. In the previous example cluster, with the previous configurations, YARN is able to allocate up to 10 Mappers (40/4) or 5 Reducers (40/8) on each node (or some other combination of Mappers and Reducers within the 40 GB per node limit).

1.11. Configuring NameNode Heap Size

NameNode heap size depends on many factors, such as the number of files, the number of blocks, and the load on the system. The following table provides recommendations for NameNode heap size configuration. These settings should work for typical Hadoop clusters in which the number of blocks is very close to the number of files (generally, the average ratio of number of blocks per file in a system is 1.1 to 1.2).

Some clusters might require further tweaking of the following settings. Also, it is generally better to set the total Java heap to a higher value.

Table 1.11. Recommended NameNode Heap Size Settings

Number of Files, in Millions Total Java Heap (Xmx and Xms) Young Generation Size (-XX:NewSize, -XX:MaxNewSize)
< 1 million files 1126m 128m
1-5 million files 3379m 512m
5-10 5913m 768m
10-20 10982m 1280m
20-30 16332m 2048m
30-40 21401m 2560m
40-50 26752m 3072m
50-70 36889m 4352m
70-100 52659m 6144m
100-125 65612m 7680m
125-150 78566m 8960m
150-200 104473m 8960m

Note

Acceldata recommends a maximum of 300 million files on the NameNode. You should also set -XX:PermSize to 128m and -XX:MaxPermSize to 256m.

Following are the recommended settings for HADOOP_NAMENODE_OPTS in the hadoop-env.sh file (replacing the ##### placeholders for -XX:NewSize, -XX:MaxNewSize, -Xms, and -Xmx with the recommended values from the table):

-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=##### -XX:MaxNewSize=##### -Xms##### -Xmx##### -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_NAMENODE_OPTS}

If the cluster uses a secondary NameNode, you should also set HADOOP_SECONDARYNAMENODE_OPTS to HADOOP_NAMENODE_OPTS in the hadoop-env.sh file:

HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS

Another useful HADOOP_NAMENODE_OPTS setting is -XX:+HeapDumpOnOutOfMemoryError.

This option specifies that a heap dump should be executed when an out-of-memory error occurs. You should also use -XX:HeapDumpPath to specify the location for the heap dump file:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./etc/heapdump.hprof

1.12. Allocating Adequate Log Space for ODP

Logs are an important part of managing and operating your ODP cluster. The directories and disks that you assign for logging in ODP must have enough space to maintain logs during ODP operations. Allocate at least 10 GB of free space for any disk you want to use for ODP logging.

1.13. Downloading the ODP Maven Artifacts

The Acceldata Release Engineering team hosts all the released ODP maven artifacts at http://repo.acceldata.com/content/repositories/releases/

Other than the release artifacts, some non-Acceldata artifacts are necessary for building the ODP stack. These third-party artifacts are hosted in the Acceldata nexus repository:

http://repo.acceldata.com/content/repositories/jetty-hadoop/

and

http://repo.acceldata.com/content/repositories/re-hosted/

If developers want to develop an application against the ODP stack, and they also have a maven repository manager in-house, then they can proxy these three repositories and continue referring to the internal maven groups repo.

If developers do not have access to their in-house maven repos, they can directly use the Acceldata public groups repo http://repo.acceldata.com/content/groups/public/ and continue to develop applications.

2. Installing Apache ZooKeeper

This section describes installing and testing Apache ZooKeeper, a centralized tool for providing services to highly distributed systems.

Note

HDFS and YARN depend on ZooKeeper, so install ZooKeeper first.

  1. Install the ZooKeeper Package
  2. Securing ZooKeeper with Kerberos (optional)
  3. Securing ZooKeeper Access
  4. Set Directories and Permissions
  5. Set Up the Configuration Files
  6. Start ZooKeeper

2.1. Install the ZooKeeper Package

Note

In a production environment, Acceldata recommends installing ZooKeeper server on three (or a higher odd number) nodes to ensure that ZooKeeper service is available.

On all nodes of the cluster that you have identified as ZooKeeper servers, type:

  • For RHEL/CentOS 7

yum install zookeeper-server

  • For Ubuntu 18/20:

apt-get install zookeeper

Note

Grant the zookeeper user shell access on Ubuntu 18/20.

usermod -s /bin/bash zookeeper

2.2. Securing ZooKeeper with Kerberos (optional)

Note

Before starting the following steps, refer to Setting up Security for Manual Installs.

(Optional) To secure ZooKeeper with Kerberos, perform the following steps on the host that runs KDC (Kerberos Key Distribution Center):

  1. Start the kadmin.local utility:

/usr/sbin/kadmin.local

  2. Create a principal for ZooKeeper:

sudo kadmin.local -q 'addprinc zookeeper/<ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM'

  3. Create a keytab for ZooKeeper:

sudo kadmin.local -q "ktadd -k /tmp/zk.keytab zookeeper/<ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM"

  4. Copy the keytab to all ZooKeeper nodes in the cluster.

Note

Verify that only the ZooKeeper and Storm operating system users can access the ZooKeeper keytab.

  5. Administrators must add the following properties to the zoo.cfg configuration file located at /etc/zookeeper/conf:

authProvider.1 = org.apache.zookeeper.server.auth.SASLAuthenticationProvider 
kerberos.removeHostFromPrincipal = true 
kerberos.removeRealmFromPrincipal = true 


2.3. Securing ZooKeeper Access

The default value of yarn.resourcemanager.zk-acl allows anyone to have full access to the znode. Acceldata recommends that you modify this permission to restrict access by performing the steps in the following sections.

  • ZooKeeper Configuration
  • YARN Configuration
  • HDFS Configuration

2.3.1. ZooKeeper Configuration

Note

The steps in this section only need to be performed once for the ODP cluster. If this task has been done to secure HBase for example, then there is no need to repeat these ZooKeeper steps if the YARN cluster uses the same ZooKeeper server.

  1. Create a keytab for ZooKeeper called zookeeper.service.keytab and save it to /etc/security/keytabs.

sudo kadmin.local -q "ktadd -k /tmp/zk.keytab zookeeper/<ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM" 

  2. Add the following to the zoo.cfg file:

authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider 
jaasLoginRenew=3600000 
kerberos.removeHostFromPrincipal=true 
kerberos.removeRealmFromPrincipal=true 
  3. Create the zookeeper_client_jaas.conf file.
Client { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=false 
useTicketCache=true; 
}; 
  4. Create the zookeeper_jaas.conf file.
Server { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=true 
storeKey=true 
useTicketCache=false 
keyTab="$PATH_TO_ZOOKEEPER_KEYTAB" 
(such as"/etc/security/keytabs/zookeeper.service.keytab") 
principal="zookeeper/$HOST"; 
(such as "zookeeper/[email protected]";) 
}; 
  5. Add the following information to the zookeeper-env.sh file:
export CLIENT_JVMFLAGS="-Djava.security.auth.login.config=/etc/zookeeper/conf/zookeeper_client_jaas.conf" 
export SERVER_JVMFLAGS="-Xmx1024m -Djava.security.auth.login.config=/etc/zookeeper/conf/zookeeper_jaas.conf"

2.3.2. YARN Configuration

Note

The following steps must be performed on all nodes that launch the ResourceManager.

  1. Create a new configuration file called yarn_jaas.conf in the directory that contains the Hadoop Core configurations (typically, /etc/hadoop/conf).
Client { 
com.sun.security.auth.module.Krb5LoginModule required 
useKeyTab=true 
storeKey=true 
useTicketCache=false 
keyTab="$PATH_TO_RM_KEYTAB" 
(such as "/etc/security/keytabs/rm.service.keytab") 
principal="rm/$HOST"; 
(such as "rm/[email protected]";) 
}; 
  2. Add a new property to the yarn-site.xml file.
<property> 
<name>yarn.resourcemanager.zk-acl</name> 
<value>sasl:rm:rwcda</value> 
</property>

Note

Because yarn.resourcemanager.zk-acl is set to sasl, you do not need to set any value for yarn.resourcemanager.zk-auth.

Setting the value to sasl also means that you cannot run the command addauth <scheme> <auth> in the ZooKeeper client CLI.

  3. Add a new YARN_OPTS to the yarn-env.sh file and make sure this YARN_OPTS is picked up when you start your ResourceManagers.
YARN_OPTS="$YARN_OPTS -Dzookeeper.sasl.client=true 
-Dzookeeper.sasl.client.username=zookeeper 
-Djava.security.auth.login.config=/etc/hadoop/conf/yarn_jaas.conf 
-Dzookeeper.sasl.clientconfig=Client"

2.3.3. HDFS Configuration

  1. In the hdfs-site.xml file, set the following property to secure the ZooKeeper-based failover controller when NameNode HA is enabled:
<property> 
<name>ha.zookeeper.acl</name> 
 <value>sasl:nn:rwcda</value> 
</property>

2.4. Set Directories and Permissions

Create directories and configure ownership and permissions on the appropriate hosts as described below. If any of these directories already exist, we recommend deleting and recreating them.

Acceldata provides a set of configuration files that represent a working ZooKeeper configuration. (See Download Companion Files.) You can use these files as a reference point, however, you need to modify them to match your own cluster environment.

If you choose to use the provided configuration files to set up your ZooKeeper environment, complete the following steps to create the appropriate directories.

  1. Execute the following commands on all ZooKeeper nodes:
mkdir -p $ZOOKEEPER_LOG_DIR; 
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_LOG_DIR; 
chmod -R 755 $ZOOKEEPER_LOG_DIR; 

mkdir -p $ZOOKEEPER_PID_DIR; 
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_PID_DIR; 
chmod -R 755 $ZOOKEEPER_PID_DIR; 

mkdir -p $ZOOKEEPER_DATA_DIR; 
chmod -R 755 $ZOOKEEPER_DATA_DIR; 
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_DATA_DIR 

where:
• $ZOOKEEPER_USER is the user owning the ZooKeeper services. For example, zookeeper.
• $ZOOKEEPER_LOG_DIR is the directory to store the ZooKeeper logs. For example, /var/log/zookeeper.
• $ZOOKEEPER_PID_DIR is the directory to store the ZooKeeper process ID. For example, /var/run/zookeeper.
• $ZOOKEEPER_DATA_DIR is the directory where ZooKeeper stores data. For example, /grid/hadoop/zookeeper/data.

  2. Initialize the ZooKeeper data directories with the 'myid' file. Create one file per ZooKeeper server, and put the number of that server in each file:

vi $ZOOKEEPER_DATA_DIR/myid

  • In the myid file on the first server, enter the corresponding number: 1
  • In the myid file on the second server, enter the corresponding number: 2
  • In the myid file on the third server, enter the corresponding number: 3
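
Instead of editing each file by hand, you can write the myid files directly; a minimal sketch, run on the corresponding server:

echo 1 > $ZOOKEEPER_DATA_DIR/myid   # on the first ZooKeeper server
echo 2 > $ZOOKEEPER_DATA_DIR/myid   # on the second ZooKeeper server
echo 3 > $ZOOKEEPER_DATA_DIR/myid   # on the third ZooKeeper server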

2.5. Set Up the Configuration Files

You must set up several configuration files for ZooKeeper. Acceldata provides a set of configuration files that represent a working ZooKeeper configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.

If you choose to use the provided configuration files to set up your ZooKeeper environment, complete the following steps:

  1. Extract the ZooKeeper configuration files to a temporary directory.

The files are located in the configuration_files/zookeeper directories where you decompressed the companion files.

  2. Modify the configuration files.

In the respective temporary directories, locate the zookeeper-env.sh file and modify the properties based on your environment including the JDK version you downloaded.

  3. Edit the zookeeper-env.sh file to match the Java home directory, ZooKeeper log directory, and ZooKeeper PID directory in your cluster environment, and the directories you set up above.

See below for an example configuration:

export JAVA_HOME=/usr/jdk64/jdk1.8.0_202 
export ZOOKEEPER_HOME=/usr/odp/current/zookeeper-server 
export ZOOKEEPER_LOG_DIR=/var/log/zookeeper 
export ZOOKEEPER_PID_DIR=/var/run/zookeeper/zookeeper_server.pid 
export SERVER_JVMFLAGS=-Xmx1024m 
export JAVA=$JAVA_HOME/bin/java 
CLASSPATH=$CLASSPATH:$ZOOKEEPER_HOME/* 
  4. Edit the zoo.cfg file to match your cluster environment. Below is an example of a typical zoo.cfg file:
dataDir=$zk.data.directory.path 
server.1=$zk.server1.full.hostname:2888:3888 
server.2=$zk.server2.full.hostname:2888:3888 
server.3=$zk.server3.full.hostname:2888:3888 
  5. Copy the configuration files.
  • On all hosts create the config directory:
rm -r $ZOOKEEPER_CONF_DIR ; 
mkdir -p $ZOOKEEPER_CONF_DIR ; 
  • Copy all the ZooKeeper configuration files to the $ZOOKEEPER_CONF_DIR directory. 
  • Set appropriate permissions: 
chmod a+x $ZOOKEEPER_CONF_DIR/; 
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_CONF_DIR/../; 
chmod -R 755 $ZOOKEEPER_CONF_DIR/../

Note:

  • $ZOOKEEPER_CONF_DIR is the directory to store the ZooKeeper configuration files. For example, /etc/zookeeper/conf.
  • $ZOOKEEPER_USER is the user owning the ZooKeeper services. For example, zookeeper.

2.6. Start ZooKeeper

To install and configure HBase and other Hadoop ecosystem components, you must start the ZooKeeper service and the ZKFC:

sudo -E -u zookeeper bash -c "export ZOOCFGDIR=$ZOOKEEPER_CONF_DIR ; export ZOOCFG=zoo.cfg; source $ZOOKEEPER_CONF_DIR/zookeeper-env.sh ; $ZOOKEEPER_HOME/bin/zkServer.sh start" 

For example:

su - zookeeper -c "export ZOOCFGDIR=/usr/odp/current/zookeeper-server/conf ; export ZOOCFG=zoo.cfg; source /usr/odp/current/zookeeper-server/conf/zookeeper-env.sh ; /usr/odp/current/zookeeper-server/bin/zkServer.sh start" 
su -l hdfs -c "/usr/odp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start zkfc"
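
To confirm that ZooKeeper came up, you can check its status on each server; a minimal sketch using the same example paths as above:

su - zookeeper -c "export ZOOCFGDIR=/usr/odp/current/zookeeper-server/conf ; export ZOOCFG=zoo.cfg; /usr/odp/current/zookeeper-server/bin/zkServer.sh status" 
echo ruok | nc localhost 2181   # returns "imok" if four-letter-word commands are enabled and nc is installed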

3. Installing HDFS, YARN, and MapReduce

This section describes how to install the Hadoop Core components, HDFS, YARN, and MapReduce.

Complete the following instructions to install Hadoop Core components:

  1. Set Default File and Directory Permissions
  2. Install the Hadoop Packages
  3. Install Compression Libraries
  4. Create Directories

3.1. Set Default File and Directory Permissions

Set the default operating system file and directory permissions to 0022 (022).

Use the umask command to confirm that the permissions are set as necessary. For example, to see what the current umask setting is, enter:

umask

If you want to set a default umask for all users of the OS, edit the /etc/profile file, or other appropriate file for system-wide shell configuration.

Ensure that the umask is set for all terminal sessions that you use during installation.
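
For example, to make 022 the default for all login shells (assuming /etc/profile is the appropriate place on your systems), you might append:

echo umask 022 >> /etc/profile
# Takes effect for new login shells; run "umask 022" in your current session as well.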

3.2. Install the Hadoop Packages

Execute the following command on all cluster nodes.

  • For RHEL/CentOS 7

yum install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl

  • For Ubuntu 18/20:

apt-get install hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-mapreduce hadoop-client openssl

3.3. Install Compression Libraries

Make the following compression libraries available on all the cluster nodes.

3.3.1. Install Snappy

Install Snappy on all the nodes in your cluster. At each node:

  • For RHEL/CentOS 7

yum install snappy snappy-devel

  • For Ubuntu 18/20:

apt-get install libsnappy1 libsnappy-dev

3.3.2. Install LZO

Execute the following command at all the nodes in your cluster:

  • RHEL/CentOS 7

yum install lzo lzo-devel hadooplzo hadooplzo-native

• For Ubuntu 18/20:

apt-get install liblzo2-2 liblzo2-dev hadooplzo

3.4. Create Directories

Create directories and configure ownership + permissions on the appropriate hosts as described below.

Before you begin:

  • If any of these directories already exist, we recommend deleting and recreating them.

  • Acceldata provides a set of configuration files that represent a working core Hadoop configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.

Use the following instructions to create appropriate directories:

  1. Create the NameNode Directories
  2. Create the SecondaryNameNode Directories
  3. Create DataNode and YARN NodeManager Local Directories
  4. Create the Log and PID Directories
  5. Symlink Directories with odp-select

3.4.1. Create the NameNode Directories

On the node that hosts the NameNode service, execute the following commands:

mkdir -p $DFS_NAME_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $DFS_NAME_DIR; 
chmod -R 755 $DFS_NAME_DIR;

Where:

  • $DFS_NAME_DIR is the space-separated list of directories where NameNode stores the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.
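
For instance, substituting the example values above (hypothetical paths and the default hdfs/hadoop accounts), the commands expand to:

mkdir -p /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn
chown -R hdfs:hadoop /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn
chmod -R 755 /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn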

3.4.2. Create the SecondaryNameNode Directories

On all the nodes that can potentially run the SecondaryNameNode service, execute the following commands:

mkdir -p $FS_CHECKPOINT_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $FS_CHECKPOINT_DIR; 
chmod -R 755 $FS_CHECKPOINT_DIR; 

where:

  • $FS_CHECKPOINT_DIR is the space-separated list of directories where SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.3. Create DataNode and YARN NodeManager Local Directories

At each DataNode, execute the following commands:

mkdir -p $DFS_DATA_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $DFS_DATA_DIR; 
chmod -R 750 $DFS_DATA_DIR; 

where:

  • $DFS_DATA_DIR is the space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

At each ResourceManager and all DataNodes, execute the following commands:
mkdir -p $YARN_LOCAL_DIR; 
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_DIR; 
chmod -R 755 $YARN_LOCAL_DIR;

where:

  • $YARN_LOCAL_DIR is the space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local.
  • $YARN_USER is the user owning the YARN services. For example, yarn.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

At each ResourceManager and all DataNodes, execute the following commands:
mkdir -p $YARN_LOCAL_LOG_DIR; 
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_LOG_DIR; 
chmod -R 755 $YARN_LOCAL_LOG_DIR;

where:

  • $YARN_LOCAL_LOG_DIR is the space-separated list of directories where YARN should store container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs.
  • $YARN_USER is the user owning the YARN services. For example, yarn.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4. Create the Log and PID Directories

Each Hadoop service requires a log and PID directory. In this section, you create directories for each service. If you choose to use the companion file scripts, these environment variables are already defined and you can copy and paste the examples into your terminal window.

3.4.4.1. HDFS Logs

At all nodes, execute the following commands:

mkdir -p $HDFS_LOG_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_LOG_DIR; 
chmod -R 755 $HDFS_LOG_DIR;

where:

  • $HDFS_LOG_DIR is the directory for storing the HDFS logs.
    This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs, where hdfs is the $HDFS_USER.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.2. Yarn Logs

At all nodes, execute the following commands:

mkdir -p $YARN_LOG_DIR; 
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOG_DIR; 
chmod -R 755 $YARN_LOG_DIR; 

where:

  • $YARN_LOG_DIR is the directory for storing the YARN logs.
    This directory name is a combination of a directory and the $YARN_USER. For example, /var/log/hadoop/yarn, where yarn is the $YARN_USER.
  • $YARN_USER is the user owning the YARN services. For example, yarn.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.3. HDFS Process

At all nodes, execute the following commands:

mkdir -p $HDFS_PID_DIR; 
chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_PID_DIR; 
chmod -R 755 $HDFS_PID_DIR; 

where:

  • $HDFS_PID_DIR is the directory for storing the HDFS process ID. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs where hdfs is the $HDFS_USER.
  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.4. Yarn Process ID

At all nodes, execute the following commands:

mkdir -p $YARN_PID_DIR; 
chown -R $YARN_USER:$HADOOP_GROUP $YARN_PID_DIR; 
chmod -R 755 $YARN_PID_DIR;

where:

  • $YARN_PID_DIR is the directory for storing the YARN process ID.
    This directory name is a combination of a directory and the $YARN_USER. For example, /var/run/hadoop/yarn where yarn is the $YARN_USER.
  • $YARN_USER is the user owning the YARN services. For example, yarn.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.5. JobHistory Server Logs

At all nodes, execute the following commands:

mkdir -p $MAPRED_LOG_DIR; 
chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_LOG_DIR; 
chmod -R 755 $MAPRED_LOG_DIR; 

where:

  • $MAPRED_LOG_DIR is the directory for storing the JobHistory Server logs. This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/log/hadoop/mapred where mapred is the $MAPRED_USER.
  • $MAPRED_USER is the user owning the MAPRED services. For example, mapred.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.4.6. JobHistory Server Process ID

At all nodes, execute the following commands:

mkdir -p $MAPRED_PID_DIR; 
chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_PID_DIR; 
chmod -R 755 $MAPRED_PID_DIR; 

where:

  • $MAPRED_PID_DIR is the directory for storing the JobHistory Server process ID. This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/run/hadoop/mapred where mapred is the $MAPRED_USER.
  • $MAPRED_USER is the user owning the MAPRED services. For example, mapred.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.

3.4.5. Symlink Directories with odp-select

Important

ODP 3.2.2.0 installs odp-select automatically with the installation or upgrade of the first ODP component.

To prevent version-specific directory issues for your scripts and updates, Acceldata provides odp-select, a script that symlinks directories to odp-current and modifies paths for configuration directories.

Determine the version number of the odp-select installed package:

yum list | grep odp (on CentOS 7) 
rpm -qa | grep odp (on CentOS 7) 
dpkg -l | grep odp (on Ubuntu)

For example:

/usr/bin/odp-select set all 3.2.2.0-<$BUILD>

Run odp-select set all on the NameNode and on all DataNodes. If YARN is deployed separately, also run odp-select on the Resource Manager and all Node Managers.

odp-select set all 3.2.2.0-<$BUILD>
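
To confirm that the symlinks now point at the expected build (a simple sanity check; the exact set of component links depends on which packages you installed), list the odp-current links:

ls -l /usr/odp/current/ | head
readlink /usr/odp/current/hadoop-client    # should resolve under /usr/odp/3.2.2.0-<$BUILD>/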

4. Setting Up the Hadoop Configuration

This section describes how to set up and edit the deployment configuration files for HDFS and MapReduce.

You must set up several configuration files for HDFS and MapReduce. Acceldata provides a set of configuration files that represent a working HDFS and MapReduce configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.

If you choose to use the provided configuration files to set up your HDFS and MapReduce environment, complete the following steps:

  1. Extract the core Hadoop configuration files to a temporary directory.

The files are located in the configuration_files/core_hadoop directory where you decompressed the companion files.

  2. Modify the configuration files.

In the temporary directory, locate the following files and modify the properties based on your environment. Search for TODO in the files for the properties to replace. For further information, see "Define Environment Parameters" in this guide.

  • Edit core-site.xml and modify the following properties:
<property> 
 <name>fs.defaultFS</name> 
 <value>hdfs://$namenode.full.hostname:8020</value> 
 <description>Enter your NameNode hostname</description> 
</property> 

<property> 
 <name>odp.version</name> 
 <value>${odp.version}</value> 
 <description>Replace with the actual ODP version</description> 
</property> 
  • Edit hdfs-site.xml and modify the following properties:
<property> 
 <name>dfs.namenode.name.dir</name> 
 <value>/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn</value> 
 <description>Comma-separated list of paths. Use the list of 
 directories from $DFS_NAME_DIR. For example, /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn.</description> 
</property> 

<property> 
 <name>dfs.datanode.data.dir</name> 
 <value>file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn</value>
 <description>Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR. For example, file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn.</description> 
</property>

<property> 
 <name>dfs.namenode.http-address</name> 
 <value>$namenode.full.hostname:50070</value> 
 <description>Enter your NameNode hostname for http access.</description> 
</property> 

<property> 
 <name>dfs.namenode.secondary.http-address</name> 
 <value>$secondary.namenode.full.hostname:50090</value> 
 <description>Enter your Secondary NameNode hostname.</description> 
</property> 

<property> 
 <name>dfs.namenode.checkpoint.dir</name> 
 <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value> 
 <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description> 
</property> 

<property> 
 <name>dfs.namenode.checkpoint.edits.dir</name> 
 <value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value> 
 <description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
 </property> 

<property> 
 <name>dfs.namenode.rpc-address</name> 
 <value>namenode_host_name:8020</value> 
 <description>The RPC address that handles all client requests.</description> 
</property> 

<property> 
 <name>dfs.namenode.https-address</name> 
 <value>namenode_host_name:50470</value> 
 <description>The NameNode secure http server address and port.</description> 
</property>

Note

The maximum value of the NameNode new generation size (-XX:MaxNewSize) should be 1/8 of the maximum heap size (-Xmx). Ensure that you check the default setting for your environment.

  • Edit yarn-site.xml and modify the following properties:
<property> 
 <name>yarn.resourcemanager.scheduler.class</name> 
 <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> 
</property> 

<property> 
 <name>yarn.resourcemanager.resource-tracker.address</name> 
 <value>$resourcemanager.full.hostname:8025</value> 
 <description>Enter your ResourceManager hostname.</description> 
</property> 

<property> 
 <name>yarn.resourcemanager.scheduler.address</name> 
 <value>$resourcemanager.full.hostname:8030</value> 
 <description>Enter your ResourceManager hostname.</description> 
</property> 

<property> 
 <name>yarn.resourcemanager.address</name> 
 <value>$resourcemanager.full.hostname:8050</value> 
 <description>Enter your ResourceManager hostname.</description> 
</property> 

<property> 
 <name>yarn.resourcemanager.admin.address</name> 
 <value>$resourcemanager.full.hostname:8141</value> 
 <description>Enter your ResourceManager hostname.</description> 
</property> 

<property> 
 <name>yarn.nodemanager.local-dirs</name> 
 <value>/grid/hadoop/yarn/local,/grid1/hadoop/yarn/local</value> 
 <description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/yarn/local,/grid1/hadoop/yarn/local.</description> 
</property> 

<property> 
 <name>yarn.nodemanager.log-dirs</name> 
 <value>/grid/hadoop/yarn/log</value> 
 <description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/hadoop/yarn/log,/grid1/hadoop/yarn/log,/grid2/hadoop/yarn/log.</description> 
</property> 

<property> 
 <name>yarn.nodemanager.recovery.dir</name> 
 <value>${hadoop.tmp.dir}/yarn-nm-recovery</value> 
</property> 

<property> 
 <name>yarn.log.server.url</name> 
 <value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value> 
 <description>URL for job history server</description> 
</property>
 
<property>
<name>yarn.resourcemanager.webapp.address</name> 
 <value>$resourcemanager.full.hostname:8088</value> 
 <description>URL for the ResourceManager web UI</description> 
</property> 

<property> 
 <name>yarn.timeline-service.webapp.address</name> 
 <value><Resource_Manager_full_hostname>:8188</value> 
</property>
  • Edit mapred-site.xml and modify the following properties:
<property> 
 <name>mapreduce.jobhistory.address</name> 
 <value>$jobhistoryserver.full.hostname:10020</value> 
 <description>Enter your JobHistoryServer hostname.</description>
 </property> 

<property> 
 <name>mapreduce.jobhistory.webapp.address</name> 
 <value>$jobhistoryserver.full.hostname:19888</value> 
 <description>Enter your JobHistoryServer hostname.</description>
 </property>
  3. On each node of the cluster, create an empty file named dfs.exclude inside $HADOOP_CONF_DIR. Append the following to /etc/profile:
touch $HADOOP_CONF_DIR/dfs.exclude 
JAVA_HOME=<java_home_path> 
export JAVA_HOME 
HADOOP_CONF_DIR=/etc/hadoop/conf/ 
export HADOOP_CONF_DIR 
export PATH=$PATH:$JAVA_HOME:$HADOOP_CONF_DIR 
  4. Optional: Configure MapReduce to use Snappy Compression.

To enable Snappy compression for MapReduce jobs, edit core-site.xml and mapred-site.xml.

  • Add the following properties to mapred-site.xml:
<property> 
 <name>mapreduce.admin.map.child.java.opts</name> 
 <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/odp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value> 
 <final>true</final> 
</property> 

<property> 
 <name>mapreduce.admin.reduce.child.java.opts</name> 
 <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/odp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value> 
 <final>true</final> 
</property>
  • Add the SnappyCodec to the codecs list in core-site.xml:
<property> 
 <name>io.compression.codecs</name>
 <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value> 
</property>
  5. Optional: If you are using the LinuxContainerExecutor, you must set up container-executor.cfg in the config directory. The file must be owned by root:root. The settings are in the form of key=value with one key per line. There must be entries for all keys. If you do not want to assign a value for a key, you can leave it unset in the form of key=#.

The keys are defined as follows:

  • yarn.nodemanager.linux-container-executor.group - the configured value of yarn.nodemanager.linux-container-executor.group. This must match the value of yarn.nodemanager.linux-container-executor.group in yarn-site.xml.
  • banned.users - a comma-separated list of users who cannot run container-executor.
  • min.user.id - the minimum user ID; this prevents system users from running container-executor.
  • allowed.system.users - a comma separated list of allowed system users.
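
A minimal container-executor.cfg illustrating these keys (example values only; allowed.system.users is deliberately left unset here):

yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred
min.user.id=1000
allowed.system.users=#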
  6. Replace the default memory configuration settings in yarn-site.xml and mapred-site.xml with the YARN and MapReduce memory configuration settings you calculated previously. Fill in the memory/CPU values that match what the documentation or helper scripts suggest for your environment.

  7. Copy the configuration files.

  • On all hosts in your cluster, create the Hadoop configuration directory:
rm -rf $HADOOP_CONF_DIR 
mkdir -p $HADOOP_CONF_DIR 

where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

  • Copy all the configuration files to $HADOOP_CONF_DIR.
  • Set the appropriate permissions:
chown -R $HDFS_USER:$HADOOP_GROUP $HADOOP_CONF_DIR/../ 
chmod -R 755 $HADOOP_CONF_DIR/../ 

where:

  • $HDFS_USER is the user owning the HDFS services. For example, hdfs.
  • $HADOOP_GROUP is a common group shared by services. For example, hadoop.
  8. Set the Concurrent Mark-Sweep (CMS) Garbage Collector (GC) parameters.

On the NameNode host, open the /etc/hadoop/conf/hadoop-env.sh file. Locate export HADOOP_NAMENODE_OPTS=<parameters> and add the following parameters:

-XX:+UseCMSInitiatingOccupancyOnly 
-XX:CMSInitiatingOccupancyFraction=70

By default CMS GC uses a set of heuristic rules to trigger garbage collection. This makes garbage collection less predictable and tends to delay collection until the old generation is almost fully occupied. Initiating it in advance allows garbage collection to complete before the old generation is full, and thus avoid Full GC (i.e. a stop-the-world pause).

  • -XX:+UseCMSInitiatingOccupancyOnly prevents the use of GC heuristics.

-XX:CMSInitiatingOccupancyFraction=<percent> tells the Java VM when CMS should be triggered. Basically, it allows the creation of a buffer in heap, which can be filled with data while CMS is running. This percent should be back calculated from the speed with which memory is consumed in the old generation during production load. If this percent is set too low, the CMS runs too often; if it is set too high, the CMS is triggered too late and concurrent mode failure may occur. The recommended setting for -XX:CMSInitiatingOccupancyFraction is 70, which means that the application should utilize less than 70% of the old generation.
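
As an illustration, the resulting line in hadoop-env.sh could look like the following (a sketch only; keep any environment-specific flags already present in your HADOOP_NAMENODE_OPTS):

export HADOOP_NAMENODE_OPTS="-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 ${HADOOP_NAMENODE_OPTS}"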

5. Validating the Core Hadoop Installation

Use the following instructions to start core Hadoop and perform the smoke tests.

  1. Format and Start HDFS
  2. Smoke Test HDFS
  3. Configure YARN and MapReduce
  4. Start YARN
  5. Start MapReduce JobHistory Server
  6. Smoke Test MapReduce

5.1. Format and Start HDFS

  1. Modify the JAVA_HOME value in the hadoop-env.sh file:

export JAVA_HOME=/usr/java/default

  2. Execute the following commands on the NameNode host machine:
su - $HDFS_USER 
/usr/odp/current/hadoop-hdfs-namenode/../hadoop/bin/hdfs namenode -format 
/usr/odp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode 
  3. Execute the following commands on the SecondaryNameNode:
su - $HDFS_USER 
/usr/odp/current/hadoop-hdfs-secondarynamenode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode 
  4. Execute the following commands on all DataNodes:
su - $HDFS_USER 
/usr/odp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode 

Where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

Where $HDFS_USER is the HDFS user, for example, hdfs.
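
To verify that the daemons started (an optional check, assuming the JDK's jps tool is on the PATH), inspect the Java processes on each host:

su - $HDFS_USER -c jps
# Expect NameNode on the NameNode host, SecondaryNameNode on the secondary host,
# and DataNode on every DataNode.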

5.2. Smoke Test HDFS

  1. Determine if you can reach the NameNode server with your browser:

http://$namenode.full.hostname:50070

  2. Create the hdfs user directory in HDFS:
su - $HDFS_USER 
hdfs dfs -mkdir -p /user/hdfs 
  3. Try copying a file into HDFS and listing that file:
su - $HDFS_USER 
hdfs dfs -copyFromLocal /etc/passwd passwd 
hdfs dfs -ls 
  4. Use the NameNode web UI and the Utilities menu to browse the file system.
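
As an additional check (run as the HDFS user), the dfsadmin report should list all of your DataNodes as live:

su - $HDFS_USER 
hdfs dfs -cat passwd | head    # the copied file should be readable
hdfs dfsadmin -report          # "Live datanodes" should match the number of DataNodes you started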

5.3. Configure YARN and MapReduce

After you install Hadoop, modify your configs.

  1. As the HDFS user, for example 'hdfs', upload the MapReduce tarball to HDFS.
su - $HDFS_USER 
hdfs dfs -mkdir -p /odp/apps/<odp_version>/mapreduce/ 
hdfs dfs -put /usr/odp/current/hadoop-client/mapreduce.tar.gz /odp/apps/<odp_version>/mapreduce/ 
hdfs dfs -chown -R hdfs:hadoop /odp 
hdfs dfs -chmod -R 555 /odp/apps/<odp_version>/mapreduce 
hdfs dfs -chmod 444 /odp/apps/<odp_version>/mapreduce/mapreduce.tar.gz

Where $HDFS_USER is the HDFS user, for example hdfs, and <odp_version> is the current ODP version, for example 3.2.2.0.

  2. Copy mapred-site.xml from the companion files and make the following changes to mapred-site.xml:
  • Add
<property> 
 <name>mapreduce.admin.map.child.java.opts</name> 
 <value>-server -Djava.net.preferIPv4Stack=true -Dodp.version=${odp.version}</value> 
 <final>true</final> 
</property>

Note

You do not need to modify ${odp.version}.

  • Modify the following existing properties to include ${odp.version}:
<property> 
 <name>mapreduce.admin.user.env</name> 
 <value>LD_LIBRARY_PATH=/usr/odp/${odp.version}/hadoop/lib/native:/usr/odp/${odp.version}/hadoop/lib/native/Linux-amd64-64</value> 
</property>

<property> 
 <name>mapreduce.application.framework.path</name> 
 <value>/odp/apps/${odp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value> 
</property> 

<property> 
<name>mapreduce.application.classpath</name> 
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/odp/${odp.version}/hadoop/lib/hadoop-lzo-0.6.0.${odp.version}.jar:/etc/hadoop/conf/secure</value> 
</property>

Note

You do not need to modify ${odp.version}.

  3. Copy yarn-site.xml from the companion files and modify:
<property> 
 <name>yarn.application.classpath</name> 
 <value>$HADOOP_CONF_DIR,/usr/odp/${odp.version}/hadoop-client/*,  /usr/odp/${odp.version}/hadoop-client/lib/*, 
 /usr/odp/${odp.version}/hadoop-hdfs-client/*, 
 /usr/odp/${odp.version}/hadoop-hdfs-client/lib/*, 
 /usr/odp/${odp.version}/hadoop-yarn-client/*, 
 /usr/odp/${odp.version}/hadoop-yarn-client/lib/*</value> 
</property>
  4. For secure clusters, you must create and configure the container-executor.cfg configuration file:
  • Create the container-executor.cfg file in /etc/hadoop/conf/

  • Insert the following properties:

yarn.nodemanager.linux-container-executor.group=hadoop 
banned.users=hdfs,yarn,mapred 
min.user.id=1000 
  • Set the /etc/hadoop/conf/container-executor.cfg file permissions so that it is readable only by root:
chown root:hadoop /etc/hadoop/conf/container-executor.cfg 
chmod 400 /etc/hadoop/conf/container-executor.cfg 
  • Set the container-executor program so that only root or hadoop group users can execute it:
chown root:hadoop /usr/odp/${odp.version}/hadoop-yarn/bin/container-executor 
chmod 6050 /usr/odp/${odp.version}/hadoop-yarn/bin/container-executor

5.4. Start YARN

Note

To install and configure the Timeline Server, see Configuring the Timeline Server.

  1. As $YARN_USER, run the following command from the ResourceManager server:

su -l yarn -c "/usr/odp/current/hadoop-yarn-resourcemanager/sbin/yarn daemon.sh --config $HADOOP_CONF_DIR start resourcemanager"

  2. As $YARN_USER, run the following command from all NodeManager nodes:

su -l yarn -c "/usr/odp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager"

where: $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
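
To confirm that the NodeManagers registered with the ResourceManager (a quick optional check, run as the YARN user):

su - yarn -c "yarn node -list"
# Every NodeManager host should be listed with state RUNNING.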

5.5. Start MapReduce JobHistory Server

  1. Change permissions on the container-executor file.

chown -R root:hadoop /usr/odp/current/hadoop-yarn*/bin/container-executor 
chmod -R 6050 /usr/odp/current/hadoop-yarn*/bin/container-executor

Note

If these permissions are not set, the healthcheck script returns an error stating that the DataNode is UNHEALTHY.

  2. Execute these commands from the JobHistory server to set up directories on HDFS:
su $HDFS_USER 
hdfs dfs -mkdir -p /mr-history/tmp 
hdfs dfs -mkdir -p /mr-history/done 

hdfs dfs -chmod 1777 /mr-history 
hdfs dfs -chmod 1777 /mr-history/tmp 
hdfs dfs -chmod 1770 /mr-history/done 

hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history 
hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history/tmp 
hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history/done 

Where 

$MAPRED_USER : mapred 
$MAPRED_USER_GROUP: mapred or hadoop 
hdfs dfs -mkdir -p /app-logs 
hdfs dfs -chmod 1777 /app-logs 
hdfs dfs -chown $YARN_USER:$HADOOP_GROUP /app-logs 

Where

$YARN_USER : yarn 
$HADOOP_GROUP: hadoop
  3. Run the following command from the JobHistory server:
su -l $YARN_USER -c 
"/usr/odp/current/hadoop-mapreduce-historyserver/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver" 

$HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.

5.6. Smoke Test MapReduce

  1. Browse to the ResourceManager:

http://$resourcemanager.full.hostname:8088/

  2. Create a $CLIENT_USER in all of the nodes and add it to the users group.
useradd client 
usermod -a -G users client 
  3. As the HDFS user, create a /user/$CLIENT_USER.
sudo su - $HDFS_USER 
hdfs dfs -mkdir /user/$CLIENT_USER 
hdfs dfs -chown $CLIENT_USER:$CLIENT_USER /user/$CLIENT_USER 
hdfs dfs -chmod -R 755 /user/$CLIENT_USER 
  4. Run the smoke test as the $CLIENT_USER, using TeraGen to generate sample data and TeraSort to sort it:
su - $CLIENT_USER 
/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teragen 10000 tmp/teragenout 
/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar terasort tmp/teragenout tmp/terasortout
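
Optionally, you can validate the sorted output with TeraValidate from the same examples JAR (a follow-on check; it writes its report to tmp/teravalidateout):

/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teravalidate tmp/terasortout tmp/teravalidateout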

6. Deploying ODP In Production Data Centers With Firewalls

A typical Open source Data Platform (ODP) install requires access to the Internet in order to fetch software packages from a remote repository. Because corporate networks typically have various levels of firewalls, these firewalls may limit or restrict Internet access, making it impossible for your cluster nodes to access the ODP repository during the install process.

The solution for this is to either:

  • Create a local mirror repository inside your firewall hosted on a local mirror server inside your firewall; or
  • Provide a trusted proxy server inside your firewall that can access the hosted repositories.

Note

Many of the descriptions in this section assume you are using RHEL/CentOS 7.

This document will cover these two options in detail, discuss the trade-offs, provide configuration guidelines, and will also provide recommendations for your deployment strategy.

In general, before installing Open source Data Platform in a production data center, it is best to ensure that both the Data Center Security team and the Data Center Networking team are informed and engaged to assist with these aspects of the deployment.

6.1. Terminology

The table below lists the various terms used throughout this section.

Table 6.1. Terminology

Item Description
Yum Package Manager (yum) A package management tool that fetches and installs software packages and performs automatic dependency resolution.
Local Mirror Repository The yum repository hosted on your Local Mirror Server that will serve the ODP software.
Local Mirror Server The server in your network that will host the Local Mirror Repository. This server must be accessible from all hosts in your cluster where you will install ODP.
ODP Repositories A set of repositories hosted by Acceldata that contains the ODP software packages. ODP software packages include the ODP Repository and the ODP-UTILS Repository.
ODP Repository Tarball A tarball image that contains the complete contents of the ODP Repositories.

6.2. Mirroring or Proxying

ODP uses yum (RHEL/CentOS) or apt-get (Ubuntu) to install software, and this software is obtained from the ODP Repositories. If your firewall prevents Internet access, you must mirror or proxy the ODP Repositories in your Data Center.

Mirroring a repository involves copying the entire repository and all its contents onto a local server and enabling an HTTPD service on that server to serve the repository locally. Once the local mirror server setup is complete, the *.repo configuration files on every cluster node must be updated, so that the given package names are associated with the local mirror server instead of the remote repository server.

There are two options for creating a local mirror server. Each of these options is explained in detail in a later section.

  • Mirror server has no access to Internet at all: Use a web browser on your workstation to download the ODP Repository Tarball, move the tarball to the selected mirror server using scp or a USB drive, and extract it to create the repository on the local mirror server.

  • Mirror server has temporary access to Internet: Temporarily configure a server to have Internet access, download a copy of the ODP Repository to this server using the reposync command, then reconfigure the server so that it is back behind the firewall.

Note

Option I is probably the least effort and, in some respects, the most secure deployment option.

Option II is best if you want to be able to update your Hadoop installation periodically from the Acceldata Repositories.

Trusted proxy server: Proxying a repository involves setting up a standard HTTP proxy on a local server to forward repository access requests to the remote repository server and route responses back to the original requestor. Effectively, the proxy server makes the repository server accessible to all clients, by acting as an intermediary.

Once the proxy is configured, change the /etc/yum.conf file on every cluster node, so that when the client attempts to access the repository during installation, the request goes through the local proxy server instead of going directly to the remote repository server.

6.3. Considerations for choosing a Mirror or Proxy solution

The following table lists some benefits provided by these alternative deployment strategies:

Advantages of repository mirroring:

  • The install process is faster, more reliable, and more cost effective (reduced WAN bandwidth minimizes the data center costs).
  • Allows security-conscious data centers to qualify a fixed set of repository files. It also ensures that the remote server will not change these repository files.
  • Large data centers may already have existing repository mirror servers for the purpose of OS upgrades and software maintenance. You can easily add the ODP Repositories to these existing servers.

Advantages of creating a proxy:

  • Avoids managing the repository files yourself (including updates, new versions, and bug fixes).
  • Almost all data centers already have a set of well-known proxies. In such cases, you can simply add the local proxy server to the existing proxy configurations. This approach is easier compared to creating local mirror servers in data centers with no mirror server setup.
  • The network access is the same as that required when using a mirror repository, but the source repository handles file management.

However, each of the above approaches are also known to have the following disadvantages:

  • Mirrors have to be managed for updates, upgrades, new versions, and bug fixes.
  • Proxy servers rely on the repository provider to not change the underlying files without notice.
  • Caching proxies are necessary, because non-caching proxies do not decrease WAN traffic and do not speed up the install process.

6.4. Recommendations for Deploying ODP

This section provides information on the various components of the Apache Hadoop ecosystem.

In many datacenters, using a mirror for the ODP Repositories can be the best deployment strategy. The ODP Repositories are small and easily mirrored, allowing you secure control over the contents of the Hadoop packages accepted for use in your data center.

Note

The installer pulls many packages from the base OS repositories (repos). If you do not have a complete base OS available to all your machines at the time of installation, you may run into issues. If you encounter problems with base OS repos being unavailable, please contact your system administrator to arrange for these additional repos to be proxied or mirrored.

6.5. Detailed Instructions for Creating Mirrors and Proxies

6.5.1. Option I - Mirror server has no access to the Internet

Complete the following instructions to set up a mirror server that has no access to the Internet:

  1. Check Your Prerequisites.

Select a mirror server host with the following characteristics:

  • The server OS is CentOS (7), RHEL (7), or Ubuntu (18,20), and has several GB of storage available.
  • This server and the cluster nodes are all running the same OS.

Note

Supporting repository mirroring for heterogeneous clusters requires a more complex procedure than the one documented here.

  • The firewall lets all cluster nodes (the servers on which you want to install ODP) access this server.
  2. Install the Repos.

    a. Use a workstation with access to the Internet and download the tarball image of the appropriate Acceldata yum repository.

Table 6.2. Acceldata Yum Repositories

Cluster OS ODP Repository Tarballs
RHEL/CentOS 7 wget [INSERT_URL]
RHEL/CentOS 7 wget [INSERT_URL]
Ubuntu 18 wget [INSERT_URL]
wget [INSERT_URL]
Ubuntu 20 wget [INSERT_URL]
wget [INSERT_URL]

b. Create an HTTP server.
• On the mirror server, install an HTTP server (such as Apache httpd) using the instructions provided here.
• Activate this web server.
• Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server.

Note

If you are using EC2, make sure that SELinux is disabled.

c. On your mirror server, create a directory for your web server.

For example, from a shell window, type:

  • For RHEL/CentOS 7:

mkdir -p /var/www/html/odp/

  • For Ubuntu 18/20:

mkdir -p /var/www/html/odp/

If you are using a symlink, enable the FollowSymLinks option on your web server.

d. Copy the ODP Repository Tarball to the directory created in the previous step, and untar it.

e. Verify the configuration.

  • The configuration is successful if you can access the above directory through your web browser.

To test this out, browse to the following location: http://$yourwebserver/odp/$os/ODP-3.2.2.0-1/.

You should see a directory listing for all the ODP components along with the RPMs at: $os/ODP-3.2.2.0-1.

Note

$os can be centos7, ubuntu18, or ubuntu20. Use the following table for the $os parameter:

Table 6.3. ODP Component Options

Operating System Value
RHEL 7 centos7
CentOs 7 centos7
Ubuntu 18 ubuntu18
Ubuntu 20 ubuntu20

f. Configure the yum clients on all the nodes in your cluster.

  • Fetch the yum configuration file from your mirror server.
  • Store the odp.repo file to a temporary location.
  • Edit the odp.repo file, changing the value of the baseurl property to point to your local repositories based on your cluster OS.

where

  • $yourwebserver is the FQDN of your local mirror server.
  • $os can be centos7, ubuntu18, or ubuntu20. Use the following table for the $os parameter:

Table 6.4. Yum Client Options

Operating System Value
RHEL 7 centos7
CentOs 7 centos7
Ubuntu 18 ubuntu18
Ubuntu 20 ubuntu20
  • Use scp or pdsh to copy the client yum configuration file to /etc/yum.repos.d/ directory on every node in the cluster.

  • [Conditional]: If you have multiple repositories configured in your environment, deploy the following plugin on all the nodes in your cluster.

  • Install the plugin.

  • For RHEL and CentOS

yum install yum-plugin-priorities

  • Edit the /etc/yum/pluginconf.d/priorities.conf file to add the following:
[main] 
enabled=1 
gpgcheck=0 

6.5.2. Option II - Mirror server has temporary or continuous access to the Internet

Complete the following instructions to set up a mirror server that has temporary access to the Internet:

  1. Check Your Prerequisites.

Select a local mirror server host with the following characteristics:

  • The server OS is CentOS (7), RHEL (7), or Ubuntu (18,20), and has several GB of storage available.
  • The local mirror server and the cluster nodes must have the same OS. If they are not running CentOS or RHEL, the mirror server must not be a member of the Hadoop cluster.

Note

Supporting repository mirroring for heterogeneous clusters requires a more complex procedure than the one documented here.

  • The firewall allows all cluster nodes (the servers on which you want to install ODP) to access this server.
  • Ensure that the mirror server has yum installed.
  • Add the yum-utils and createrepo packages on the mirror server:

yum install yum-utils createrepo
  2. Install the Repos.
  • Temporarily reconfigure your firewall to allow Internet access from your mirror server host.
  • Execute the following command to download the appropriate Acceldata yum client configuration file and save it in /etc/yum.repos.d/ directory on the mirror server host.

Table 6.5. Yum Client Configuration Commands

Cluster OS ODP Repository Tarballs
RHEL/CentOS 7 wget [INSERT_URL]
Ubuntu 18 wget [INSERT_URL]
Ubuntu 20 wget [INSERT_URL]
  • Create an HTTP server.
    • On the mirror server, install an HTTP server (such as Apache httpd) using the instructions provided here.
    • Activate this web server.
    • Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server.

Note

If you are using EC2, make sure that SELinux is disabled.

Optional - If your mirror server uses SLES, modify the default-server.conf file to enable the docs root folder listing.

sed -e "s/Options None/Options Indexes MultiViews/ig" /etc/apache2/default-server.conf > /tmp/tempfile.tmp 
mv /tmp/tempfile.tmp /etc/apache2/default-server.conf

On your mirror server, create a directory for your web server.

• For example, from a shell window, type:

• For RHEL/CentOS 7:

mkdir -p /var/www/html/odp/

• For Ubuntu 18/20:

mkdir -p /var/www/html/odp/

• If you are using a symlink, enable the FollowSymLinks option on your web server.

• Copy the contents of the entire ODP repository for your desired OS from the remote repository to your local mirror server.

  • Continuing the previous example, from a shell window, type:

  • For RHEL/CentOS 7/Ubuntu 18/20:

cd /var/www/html/odp

Then for all hosts, type:

  • ODP Repository
reposync -r ODP 
reposync -r ODP-3.2.2.0-1 
reposync -r ODP-UTILS-1.1.0.21

You should see both an ODP-3.2.2.0-1 directory and an ODP-UTILS-1.1.0.21 directory, each with several subdirectories.

  • Generate appropriate metadata.

This step defines each directory as a yum repository. From a shell window, type:

  • For RHEL/CentOS 7:

    • ODP Repository:
createrepo /var/www/html/odp/ODP-3.2.2.0-1
createrepo /var/www/html/odp/ODP-UTILS-1.1.0.21

You should see a new folder called repodata inside both ODP directories.

  • Verify the configuration.

  • The configuration is successful if you can access the above directory through your web browser.

To test this out, browse to the following location:

  • ODP: http://$yourwebserver/odp/ODP-3.2.2.0-1/

  • You should now see a directory listing for all the ODP components.

  • At this point, you can disable external Internet access for the mirror server, so that the mirror server is again entirely within your data center firewall.

  • Depending on your cluster OS, configure the yum clients on all the nodes in your cluster

  • Edit the repo files, changing the value of the baseurl property to the local mirror URL.

  • Edit the /etc/yum.repos.d/odp.repo file, changing the value of the baseurl property to point to your local repositories based on your cluster OS.

[ODP-3.x] 
name=Open source Data Platform Version - ODP-3.x 
baseurl=http://$yourwebserver/ODP/$os/3.x/GA 
gpgcheck=1 
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins 
enabled=1 
priority=1 

[ODP-UTILS-1.1.0.21] 
name=Open source Data Platform Utils Version - ODP-UTILS-1.1.0.21
baseurl=http://$yourwebserver/ODP-UTILS-1.1.0.21/repos/$os 
gpgcheck=1 
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins 
enabled=1 
priority=1 

[ODP-3.2.2.0-1] 
name=Open source Data Platform ODP-3.2.2.0-1 
baseurl=http://$yourwebserver/ODP/$os/3.x/updates/3.2.2.0-1 
gpgcheck=1 
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins 
enabled=1 
priority=1

where

  • $yourwebserver is the FQDN of your local mirror server.
  • $os can be centos7, ubuntu18, or ubuntu20. Use the following table for the $os parameter:

Table 6.6. $OS Parameter Values

Operating System Value
RHEL 7 centos7
CentOs 7 centos7
Ubuntu 18 ubuntu18
Ubuntu 20 ubuntu20
  • Copy the client repository configuration file to all nodes in your cluster.

    • RHEL/CentOS 7:

      Use scp or pdsh to copy the client yum configuration file to /etc/yum.repos.d/ directory on every node in the cluster.

  • For Ubuntu 18/20:

    On every node, invoke the following command:

    • ODP Repository:

      sudo add-apt-repository deb [INSERT_URL]

    • Optional - Ambari Repository

      sudo add-apt-repository deb [INSERT_URL]

    • If using Ambari, verify the configuration by deploying an Ambari server on one of the cluster nodes.

      apt-get install ambari-server

  • If your cluster runs CentOS 7 or RHEL, and you have multiple repositories configured in your environment, deploy the following plugin on all the nodes in your cluster.

    • Install the plugin.

      • For RHEL and CentOs v7.x

        yum install yum-plugin-priorities

      • Edit the /etc/yum/pluginconf.d/priorities.conf file to add the following:

        [main] 
        enabled=1 
        gpgcheck=0
        

6.6. Set up a trusted proxy server

Complete the following instructions to set up a trusted proxy server:

  1. Check Your Prerequisites.

Select a mirror server host with the following characteristics:

  • This server runs on either CentOS 7/RHEL or Ubuntu 18/20, and has several GB of storage available.

  • The firewall allows all cluster nodes (the servers on which you want to install ODP) to access this server, and allows this server to access the Internet (at least those Internet servers for the repositories to be proxied).

  2. Create a caching HTTP proxy server on the selected host.

• It is beyond the scope of this document to show how to set up an HTTP proxy server, given the many variations that may be required, depending on your data center's network security policy. If you choose to use the Apache HTTPD server, start by installing httpd, using the instructions provided here, and then add the mod_proxy and mod_cache modules, as stated here. Please engage your network security specialists to correctly set up the proxy server.

  • Activate this proxy server and configure its cache storage location.

  • Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server, and outbound access to the desired repo sites, including: public-repo-1.acceldata.com.

If you are using EC2, make sure that SELinux is disabled.
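
As one possible illustration only (assuming Apache httpd 2.4 with the mod_proxy, mod_proxy_http, mod_cache, and mod_cache_disk modules loaded; the file name, subnet, and cache path are placeholders, and your security team should review any forward proxy before it is enabled), the proxy and its cache could be configured along these lines:

# /etc/httpd/conf.d/forward-proxy.conf (hypothetical file name)
ProxyRequests On
ProxyVia On
<Proxy "*">
    # Restrict the forward proxy to your cluster subnet
    Require ip 10.0.0.0/8
</Proxy>
# Cache proxied HTTP content on local disk
CacheQuickHandler off
CacheRoot "/var/cache/httpd/proxy"
CacheEnable disk "http://"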

  • Depending on your cluster OS, configure the yum clients on all the nodes in your cluster.

The following description is taken from the CentOS documentation. On each cluster node, add the following lines to the /etc/yum.conf file. (As an example, the settings below enable yum to use the proxy server mycache.mydomain.com, connecting to port 3128, with the following credentials: yum-user/qwerty.)

# proxy server:port number 
proxy=http://mycache.mydomain.com:3128  
# account details for secure yum proxy connections 
proxy_username=yum-user 
proxy_password=qwerty
  • Once all nodes have their /etc/yum.conf file updated with appropriate configuration info, you can proceed with the ODP installation just as though the nodes had direct access to the Internet repositories.

  • If this proxy configuration does not seem to work, try adding a / at the end of the proxy URL. For example:

proxy=http://mycache.mydomain.com:3128/
