ODP Command Line Installation
This chapter describes how to prepare to install Open Source Data Platform (ODP) manually. You must complete the following tasks before you deploy a Hadoop cluster using ODP:
- Meeting Minimum System Requirements
- [Preparing to Manually Install ODP](https://github.com/acceldata-io/odpdocumentation/wiki/ODP-Command-Line-Installation#1-preparing-to-manually-install-odp)
- Deciding on a Deployment Type
- Collect Information
- Prepare the Environment
- Download Companion Files
- Define Environment Parameters
- [Optional] Create System Users and Groups
- Determining ODP Memory Configuration Settings
- Allocating Adequate Log Space for ODP
- Download ODP Maven Artifacts
Important
See the ODP Release Notes for the ODP 3.2.2.0-1 repo information.
To run Open Source Data Platform, your system must meet minimum requirements.
Although there is no single hardware requirement for installing ODP, there are some basic guidelines. A complete installation of ODP 3.2.2 consumes about 8 GB of disk space.
Refer to the Acceldata Support Matrix for information regarding supported operating systems.
You must install the following software on each of your hosts:
- apt-get (for Ubuntu 18/20)
- chkconfig (for Ubuntu 18/20)
- curl
- reposync
- rpm (for RHEL or CentOS 7)
- scp
- tar
- unzip
- wget
- yum (for RHEL or CentOS 7)
In addition, if you are creating local mirror repositories as part of the installation process and you are using RHEL or CentOS 7, you need the following utilities on the mirror repo server:

- createrepo
- reposync
- yum-utils
See Deploying ODP in Production Data Centers with Firewalls.
Your system must have the correct Java Development Kit (JDK) installed on all cluster nodes.
Refer to the Support Matrix for information regarding supported JDKs.
Important
Before enabling Kerberos in the cluster, you must deploy the Java Cryptography Extension (JCE) security policy files on all hosts in the cluster. See Installing the JCE for more information.
The following sections describe how to install and configure the JDK.
Use the following instructions to manually install JDK 1.8:
- If you do not have a /usr/java directory, create one:

  mkdir /usr/java

- Download the Oracle 64-bit JDK (jdk-8u202-linux-x64.tar.gz) from the Oracle download site: open a web browser and navigate to http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html.

- Copy the downloaded jdk-8u202-linux-x64.tar.gz file to the /usr/java directory.

- Navigate to the /usr/java directory and extract the archive:

  cd /usr/java
  tar zxvf jdk-8u202-linux-x64.tar.gz

  The JDK files are extracted into a /usr/java/jdk1.8.0_202 directory.

- Create a symbolic link (symlink) to the JDK:

  ln -s /usr/java/jdk1.8.0_202 /usr/java/default

- Set the JAVA_HOME and PATH environment variables (a sketch for making these settings persistent follows the verification step):

  export JAVA_HOME=/usr/java/default
  export PATH=$JAVA_HOME/bin:$PATH

- Verify that Java is installed in your environment:

  java -version
You should see output similar to the following:
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.67-b01, mixed mode)
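The export commands in the previous step apply only to the current shell session. As a minimal sketch (assuming the /usr/java/default symlink created above), you can persist the variables for all users with a profile script:

```bash
# Hypothetical profile script; adjust the path if your JDK lives elsewhere.
cat > /etc/profile.d/java.sh <<'EOF'
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH
EOF
chmod 644 /etc/profile.d/java.sh
```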
Unless you are using OpenJDK with unlimited-strength JCE, you must manually install the Java Cryptography Extension (JCE) security policy files on all hosts in the cluster:
- Obtain the JCE policy file appropriate for the JDK version in your cluster:
  - Oracle JDK 1.8: https://www.oracle.com/java/technologies/javase-jce8-downloads.html
- Save the policy file archive in a temporary location.
- On each host in the cluster, add the unlimited security policy JCE jars to $JAVA_HOME/jre/lib/security/. For example, run the following command to extract the policy jars into the JDK installed on your host:
unzip -o -j -q jce_policy-8.zip -d /usr/jdk64/jdk1.8.0_202/jre/lib/security/
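To confirm that the unlimited-strength policy is active, one option (a quick check, assuming $JAVA_HOME points at the JDK you just patched) is to query the maximum allowed AES key length:

```bash
# Prints 2147483647 when the unlimited-strength JCE policy is in effect;
# 128 means the default (limited) policy files are still installed.
"$JAVA_HOME/bin/jrunscript" -e 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES"));'
```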
If you are installing Apache projects Hive and HCatalog, Oozie, Hue, or Ranger, you must install a database to store metadata information in the metastore. You can either use an existing database instance or install a new instance manually.
Refer to the Support Matrix for information regarding supported metastore databases.
The following sections describe how to install and configure the metastore database.
The database administrator must create the following users and specify the following values:
- For Apache Hive: hive_dbname, hive_dbuser, and hive_dbpasswd.
- For Apache Oozie: oozie_dbname, oozie_dbuser, and oozie_dbpasswd.
Note
By default, Hive uses the Derby database for the metastore. However, Derby is not supported for production systems.
- For Hue: Hue user name and Hue user password
- For Apache Ranger: RANGER_ADMIN_DB_NAME
The following instructions explain how to install PostgreSQL as the metastore database. See your third-party documentation for instructions on how to install other supported databases.
Important
Prior to using PostgreSQL as your Hive metastore, consult the official PostgreSQL documentation and ensure you are using a JDBC 4+ driver that corresponds to your implementation of PostgreSQL.
Use the following instructions to install a new instance of PostgreSQL:
- Using a terminal window, connect to the host machine where you plan to deploy a PostgreSQL instance:
yum install postgresql-server
- Start the instance:
/etc/init.d/postgresql start
For some newer versions of PostgreSQL, you might need to execute the command /etc/init.d/postgresql initdb.
- Reconfigure the PostgreSQL server:
  - Edit the /var/lib/pgsql/data/postgresql.conf file and change the value of #listen_addresses = 'localhost' to listen_addresses = '*'.
  - Edit the /var/lib/pgsql/data/postgresql.conf file, remove the comment from the "port = " line, and specify the port number (default 5432).
  - Edit the /var/lib/pgsql/data/pg_hba.conf file by adding the following:

    host all all 0.0.0.0/0 trust

  - If you are using PostgreSQL v9.1 or later, add the following to the /var/lib/pgsql/data/postgresql.conf file:

    standard_conforming_strings = off
- Create users for the PostgreSQL server by logging in as the root user and entering the following syntax (a concrete example follows the note below):

  echo "CREATE DATABASE $dbname;" | sudo -u $postgres psql -U postgres
  echo "CREATE USER $user WITH PASSWORD '$passwd';" | sudo -u $postgres psql -U postgres
  echo "GRANT ALL PRIVILEGES ON DATABASE $dbname TO $user;" | sudo -u $postgres psql -U postgres

  In the previous syntax:
  - $postgres is the postgres user.
  - $user is the user you want to create.
  - $dbname is the name of your PostgreSQL database.
Note
For access to the Hive metastore, you must create hive_dbuser after Hive has been installed, and for access to the Oozie metastore, you must create oozie_dbuser after Oozie has been installed.
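For example, a minimal sketch for a Hive metastore using hypothetical names (hive, hive_dbuser, hive_dbpasswd; substitute your own values, and remember that hive_dbuser is created only after Hive is installed):

```bash
# Placeholder database, user, and password for the Hive metastore.
echo "CREATE DATABASE hive;"                                  | sudo -u postgres psql -U postgres
echo "CREATE USER hive_dbuser WITH PASSWORD 'hive_dbpasswd';" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE hive TO hive_dbuser;"  | sudo -u postgres psql -U postgres
```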
- On the Hive metastore host, install the connector:
yum install postgresql-jdbc*
- Confirm that the .jar file is in the Java share directory:
ls -l /usr/share/java/postgresql-jdbc.jar
To install a new instance of PostgreSQL:
- Connect to the host machine where you plan to deploy PostgreSQL instance. At a terminal window, enter:
apt-get install postgresql-server
- Start the instance.
Note
For some newer versions of PostgreSQL, you might need to execute the command:
/etc/init.d/postgresql initdb
- Reconfigure the PostgreSQL server:
  - Edit the /var/lib/pgsql/data/postgresql.conf file and change the value of #listen_addresses = 'localhost' to listen_addresses = '*'.
  - Edit the /var/lib/pgsql/data/postgresql.conf file and change the port setting from #port = 5432 to port = 5432.
  - Edit the /var/lib/pgsql/data/pg_hba.conf file by adding the following:

    host all all 0.0.0.0/0 trust

  - Optional: If you are using PostgreSQL v9.1 or later, add the following to the /var/lib/pgsql/data/postgresql.conf file:

    standard_conforming_strings = off
- Create users for the PostgreSQL server. Log in as the root user and enter:

  echo "CREATE DATABASE $dbname;" | sudo -u $postgres psql -U postgres
  echo "CREATE USER $user WITH PASSWORD '$passwd';" | sudo -u $postgres psql -U postgres
  echo "GRANT ALL PRIVILEGES ON DATABASE $dbname TO $user;" | sudo -u $postgres psql -U postgres

  Where $postgres is the postgres user, $user is the user you want to create, and $dbname is the name of your PostgreSQL database.
Note
For access to the Hive metastore, create hive_dbuser after Hive has been installed, and for access to the Oozie metastore, create oozie_dbuser after Oozie has been installed.
- On the Hive Metastore host, install the connector.
apt-get install -y libpostgresql-jdbc-java
- Copy the connector .jar file to the Java share directory.
cp /usr/share/java/postgresql-*jdbc3.jar /usr/share/java/postgresql-jdbc.jar
- Confirm that the .jar is in the Java share directory.
ls /usr/share/java/postgresql-jdbc.jar
- Change the access mode of the .jar file to 644.
chmod 644 /usr/share/java/postgresql-jdbc.jar
This section describes how to install MariaDB as the metastore database. For instructions on how to install other supported databases, see your third-party documentation.
For additional information regarding MariaDB, see MariaDB.
Important
If you are installing on CentOS or RHEL, it is highly recommended that you install from a repository using yum.
Follow these steps to install a new instance of MariaDB on RHEL and CentOS:
- There are YUM repositories for several YUM-based Linux distributions. Use the MariaDB Downloads page to generate the YUM repository.
- Move the MariaDB repo file to the directory /etc/yum.repos.d/. It is suggested that you name your file MariaDB.repo. The following is an example MariaDB.repo file for CentOS 7:
[mariadb]
name=MariaDB
baseurl=http://yum.mariadb.org/10.1/centos7-amd64
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1
In this example, the gpgkey line automatically fetches the GPG key that is used to sign the repositories; gpgkey enables yum and rpm to verify the integrity of the packages that they download. The id of MariaDB's signing key is 0xcbcb082a1bb943db, the short form of the id is 0x1BB943DB, and the full key fingerprint is 1993 69E5 404B D5FC 7D2F E43B CBCB 082A 1BB9 43DB.
If you want to fix the version to an older version, follow the instructions on Adding the MariaDB YUM Repository.
- If you do not have the MariaDB GPG signing key installed, YUM prompts you to install it after downloading the packages. If you are prompted to do so, install the MariaDB GPG signing key.
- Use the following command to install MariaDB:
sudo yum install MariaDB-server MariaDB-client
- If you already have the MariaDB-Galera-server package installed, you might need to remove it prior to installing MariaDB-server. If you need to remove MariaDB-Galera-server, use the following command:

  sudo yum remove MariaDB-Galera-server

  No databases are removed when the MariaDB-Galera-server rpm package is removed, though with any upgrade, it is best to have backups.
- Install MariaDB with YUM by following the directions at Enabling MariaDB.
- Use one of the following commands to start MariaDB:
- If your system is using systemctl:
sudo systemctl start mariadb
- If your system is not using systemctl:
sudo /etc/init.d/mysql start
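When MariaDB is your metastore database, the same databases and users described earlier in this chapter still need to be created. A minimal sketch, using hypothetical names for a Hive metastore (adjust the names, password, and host scope for your environment):

```bash
# Creates a placeholder Hive metastore database and user in MariaDB.
mysql -u root -p <<'EOF'
CREATE DATABASE hive;
CREATE USER 'hive_dbuser'@'%' IDENTIFIED BY 'hive_dbpasswd';
GRANT ALL PRIVILEGES ON hive.* TO 'hive_dbuser'@'%';
FLUSH PRIVILEGES;
EOF
```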
Installing and Configuring MySQL
This section describes how to install MySQL as the metastore database. For instructions on how to install other supported databases, see your third-party documentation.
Important
When you use MySQL as your Hive metastore, you must use mysql-connector-java-5.1.35.zip or a later JDBC driver.
To install a new instance of MySQL:
- Connect to the host machine you plan to use for Hive and HCatalog.
- Install MySQL server. From a terminal window, enter:
yum install mysql-community-release
For CentOS7, install MySQL server from the ODP-Utils repository.
- Start the instance.
/etc/init.d/mysqld start
- Set the root user password using the following command format:
mysqladmin -u root password $mysqlpassword
For example, use the following command to set the password to "root":
mysqladmin -u root password root
- Remove unnecessary information from log and STDOUT:
mysqladmin -u root 2>&1> /dev/null
- Log in to MySQL as the root user:
mysql -u root -proot
In this syntax, "root" is the root user password.
- Log in as the root user, create the “dbuser,” and grant dbuser adequate privileges:
[root@c6402 /]# mysql -u root -proot
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11
Server version: 5.1.73 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'dbuser';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE USER 'dbuser'@'%' IDENTIFIED BY 'dbuser';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%';
Query OK, 0 rows affected (0.00 sec)
mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
mysql>
- Use the exit command to exit MySQL.
- You should now be able to reconnect to the database as "dbuser" by using the following command:

  mysql -u dbuser -pdbuser
After testing the dbuser login, use the exit command to exit MySQL.
- Install the MySQL connector .jar file:
yum install mysql-connector-java*
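As a quick sanity check (a sketch, assuming the dbuser account created above and the default JAR location used by the yum package), you can verify remote login for dbuser and confirm that the connector JAR is present:

```bash
# Log in as dbuser over the network interface rather than the local socket.
mysql -u dbuser -pdbuser -h "$(hostname -f)" -e "SELECT 1;"

# The connector is usually placed under /usr/share/java; the exact file name may vary.
ls -l /usr/share/java/mysql-connector-java*.jar
```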
To install a new instance of MySQL:
- Connect to the host machine you plan to use for Hive and HCatalog.
- Install MySQL server. From a terminal window, enter:
apt-get install mysql-server
- Start the instance.
/etc/init.d/mysql start
- Set the root user password using the following command format:
mysqladmin -u root password $mysqlpassword
For example, to set the password to "root":
mysqladmin -u root password root
- Remove unnecessary information from log and STDOUT.
mysqladmin -u root 2>&1> /dev/null
- Log in to MySQL as the root user:
mysql -u root -proot
- Log in as the root user, create the dbuser, and grant it adequate privileges. This user provides access to the Hive metastore. Use the following series of commands (shown here with the returned responses) to create dbuser with password dbuser.
[root@c6402 /]# mysql -u root -proot
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11
Server version: 5.1.73 Source distribution
Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE USER 'dbuser'@'localhost' IDENTIFIED BY 'dbuser';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost';
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE USER 'dbuser'@'%' IDENTIFIED BY 'dbuser';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%';
Query OK, 0 rows affected (0.00 sec)
mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'localhost' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT ALL PRIVILEGES ON *.* TO 'dbuser'@'%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
mysql>
- Use the exit command to exit MySQL.
- You should now be able to reconnect to the database as dbuser by using the following command:

  mysql -u dbuser -pdbuser
After testing the dbuser login, use the exit command to exit MySQL.
- Install the MySQL connector JAR file:
apt-get install mysql-connector-java*
You can select Oracle as the metastore database. For instructions on how to install the databases, see your third-party documentation. To configure Oracle as the Hive Metastore, install ODP and Hive, and then follow the instructions in "Set up Oracle DB for use with Hive Metastore" in this guide.
Open source Data Platform (ODP) is certified and supported when running on virtual or cloud platforms (for example, VMware vSphere or Amazon Web Services EC2) if the respective guest operating system is supported by ODP and any issues detected on these platforms are reproducible on the same supported operating system installed elsewhere.
See the Support Matrix for the list of supported operating systems for ODP.
The standard ODP install fetches the software from a remote yum repository over the Internet. To use this option, you must set up access to the remote repository and have an available Internet connection for each of your hosts. To download the ODP maven artifacts and build your own repository, see Download the ODP Maven Artifacts.
Important
See the ODP Release Notes for the ODP 3.2.2.0-1 repo information.
Note
If your cluster does not have access to the Internet, or if you are creating a large cluster and you want to conserve bandwidth, you can instead provide a local copy of the ODP repository that your hosts can access.
- RHEL/CentOS 7

wget -nv http://public-repo-1.acceldata.com/ODP/centos7/3.2.2.0-1/odp.repo -O /etc/yum.repos.d/odp.repo
- Ubuntu 18/20
apt-get update
wget http://public-repo-1.acceldata.com/ODP/ubuntu<version>/3.2.2.0-1/odp.list -O /etc/apt/sources.list.d/odp.list
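After the repository file is in place, it is worth confirming that the package manager can see it. A minimal sketch (the repository id containing "odp" is an assumption; check the repo file for the exact id):

```bash
# RHEL/CentOS 7: the ODP repository should appear among the enabled repos.
yum repolist enabled | grep -i odp

# Ubuntu 18/20: confirm the list file exists and that the index refreshes cleanly.
cat /etc/apt/sources.list.d/odp.list
apt-get update
```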
While it is possible to deploy all of ODP on a single host, you should use at least four hosts: one master host and three slaves.
To deploy your ODP, you need the following information:
- The fully qualified domain name (FQDN) for each host in your system, and the components you want to set up on each host. You can use hostname -f to check for the FQDN (a small sketch for checking this across many hosts follows this list).
- If you install Apache Hive, HCatalog, or Apache Oozie, you need the host name, database name, user name, and password for the metastore instance.
Note
If you are using an existing instance, the dbuser you create for ODP must be granted ALL PRIVILEGES permissions on that instance.
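A small sketch for collecting FQDNs across many hosts (hosts.txt is a hypothetical file with one host name per line, and passwordless SSH is assumed):

```bash
# Print each host alongside its fully qualified domain name.
while read -r host; do
  echo -n "$host -> "
  ssh "$host" 'hostname -f'
done < hosts.txt
```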
To deploy your ODP instance, you must prepare your deployment environment:
- Enable NTP on Your Cluster
- Disable SELinux
- Disable IPTables
The clocks of all the nodes in your cluster must be synchronized. If your system does not have access to the Internet, you should set up a master node as an NTP server to achieve this synchronization.
Use the following instructions to enable NTP for your cluster:
- Configure NTP clients by executing the following command on each node in your cluster:
- For RHEL/CentOS 7:
a. Configure the NTP clients:
yum install ntp
b. Enable the service:
systemctl enable ntpd
c. Start NTPD:
systemctl start ntpd
- Enable the service by executing the following command on each node in your cluster:
- For RHEL/CentOS
chkconfig ntpd on
- For Ubuntu 18/20:
chkconfig ntp on
- Start NTP. Execute the following command on all the nodes in your cluster:
  - For RHEL/CentOS 7:
/etc/init.d/ntpd start
- For Ubuntu 18/20
/etc/init.d/ntp start
- If you want to use an existing NTP server in your environment, complete the following steps:
a. Configure the firewall on the local NTP server to enable UDP input traffic on Port 123 and replace 192.168.1.0/24 with the IP addresses in the cluster, as shown in the following example using RHEL hosts:
# iptables -A RH-Firewall-1-INPUT -s 192.168.1.0/24 -m state --state NEW -p udp --dport 123 -j ACCEPT
b. Save and restart iptables. Execute the following command on all the nodes in your cluster:
# service iptables save
# service iptables restart
c. Finally, configure clients to use the local NTP server. Edit the /etc/ntp.conf file and add the following line:
server $LOCAL_SERVER_IP OR HOSTNAME
The Security-Enhanced Linux (SELinux) feature should be disabled during the installation process.

- Check the state of SELinux. On all the host machines, execute the following command:
getenforce
If the command returns "disabled" or "permissive" as the response, no further actions are required. If the result is "enabled", proceed to the next step.
- Disable SELinux either temporarily for each session or permanently.
- Disable SELinux temporarily by executing the following command:
setenforce 0
- Disable SELinux permanently in the /etc/sysconfig/selinux file by changing the value of the SELINUX field to permissive or disabled, and then restart your system. A sketch of this edit follows.
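A minimal sketch of the permanent change (GNU sed on RHEL/CentOS; /etc/sysconfig/selinux is a symlink to /etc/selinux/config, so --follow-symlinks preserves it):

```bash
# Set SELINUX=disabled (use "permissive" instead if you prefer), then reboot.
sed -i --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/' /etc/sysconfig/selinux
reboot
```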
Because certain ports must be open and available during installation, you should temporarily disable iptables. If the security protocols at your installation do not allow you to disable iptables, you can proceed with it enabled, provided that all of the relevant ports are open and available; otherwise, the cluster installation fails.
- On all RHEL/CentOS 6 host machines, execute the following commands to disable iptables:
chkconfig iptables off
service iptables stop
Restart iptables after your setup is complete.
- On RHEL/CENTOS 7 host machines, execute the following commands to disable firewalld:
systemctl stop firewalld
systemctl mask firewalld
Restart firewalld after your setup is complete.
- On Ubuntu 18/20 host machines, execute the following command to disable the firewall:

  service ufw stop

  Restart ufw after your setup is complete.
Important
If you leave iptables enabled and do not set up the necessary ports, the cluster installation fails.
You can download and extract a set of companion files, including script files and configuration files, that you can then modify to match your own cluster environment:
To download and extract the files:
wget http://public-repo-1.acceldata.com/ODP/tools/3.2.2.0-1/odp_manual_install_rpm_helper_files-3.2.2.0.1.tar.gz
tar zxvf odp_manual_install_rpm_helper_files-3.2.2.0.1.tar.gz
Important
See the ODP Release Notes for the ODP 3.2.2.0 repo information.
You must set up specific users and directories for your ODP installation by using the following instructions:
- Define directories.
The following table describes the directories you need for installation, configuration, data storage, process IDs, and log information based on the Apache Hadoop Services you plan to install. Use this table to define what you are going to use to set up your environment.
Note
The scripts.zip file that you downloaded in the supplied companion files includes a script, directories.sh, for setting directory environment parameters.
You should edit and source this file (or copy its contents to your ~/.bash_profile) to set up these environment variables in your environment.
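A usage sketch (the paths are placeholders; use the location where you extracted the companion files):

```bash
# Load the directory environment variables into the current shell.
source /path/to/directories.sh

# Spot-check that the variables are now defined.
echo "$HDFS_LOG_DIR"
echo "$HADOOP_CONF_DIR"
```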
Table 1.1. Directories Needed to Install Core Hadoop
Hadoop Service | Parameter | Definition |
---|---|---|
HDFS | DFS_NAME_DIR | Space-separated list of directories where the NameNode should store the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn. |
HDFS | DFS_DATA_DIR | Space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn. |
HDFS | FS_CHECKPOINT_DIR | Space-separated list of directories where the SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn. |
HDFS | HDFS_LOG_DIR | Directory for storing the HDFS logs. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs, where hdfs is the $HDFS_USER. |
HDFS | HDFS_PID_DIR | Directory for storing the HDFS process ID. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs, where hdfs is the $HDFS_USER. |
HDFS | HADOOP_CONF_DIR | Directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf. |
YARN | YARN_LOCAL_DIR | Space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn /grid1/hadoop/yarn /grid2/hadoop/yarn. |
YARN | YARN_LOG_DIR | Directory for storing the YARN logs. For example, /var/log/hadoop/yarn. This directory name is a combination of a directory and the $YARN_USER. In the example, yarn is the $YARN_USER. |
YARN | YARN_LOCAL_LOG_DIR | Space-separated list of directories where YARN stores container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/log. |
YARN | YARN_PID_DIR | Directory for storing the YARN process ID. For example, /var/run/hadoop/yarn. This directory name is a combination of a directory and the $YARN_USER. In the example, yarn is the $YARN_USER. |
MapReduce | MAPRED_LOG_DIR | Directory for storing the JobHistory Server logs. For example, /var/log/hadoop/mapred. This directory name is a combination of a directory and the $MAPRED_USER. In the example, mapred is the $MAPRED_USER. |
Table 1.2. Directories Needed to Install Ecosystem Components
Hadoop Service | Parameter | Definition |
---|---|---|
Oozie | OOZIE_CONF_DIR | Directory to store the Oozie configuration files. For example, /etc/oozie/conf. |
Oozie | OOZIE_DATA | Directory to store the Oozie data. For example, /var/db/oozie. |
Oozie | OOZIE_LOG_DIR | Directory to store the Oozie logs. For example, /var/log/oozie. |
Oozie | OOZIE_PID_DIR | Directory to store the Oozie process ID. For example, /var/run/oozie. |
Oozie | OOZIE_TMP_DIR | Directory to store the Oozie temporary files. For example, /var/tmp/oozie. |
Hive | HIVE_CONF_DIR | Directory to store the Hive configuration files. For example, /etc/hive/conf. |
Hive | HIVE_LOG_DIR | Directory to store the Hive logs. For example, /var/log/hive. |
Hive | HIVE_PID_DIR | Directory to store the Hive process ID. For example, /var/run/hive. |
WebHCat | WEBHCAT_CONF_DIR | Directory to store the WebHCat configuration files. For example, /etc/hcatalog/conf/webhcat. |
WebHCat | WEBHCAT_LOG_DIR | Directory to store the WebHCat logs. For example, /var/log/webhcat. |
WebHCat | WEBHCAT_PID_DIR | Directory to store the WebHCat process ID. For example, /var/run/webhcat. |
HBase | HBASE_CONF_DIR | Directory to store the Apache HBase configuration files. For example, /etc/hbase/conf. |
HBase | HBASE_LOG_DIR | Directory to store the HBase logs. For example, /var/log/hbase. |
HBase | HBASE_PID_DIR | Directory to store the HBase process ID. For example, /var/run/hbase. |
ZooKeeper | ZOOKEEPER_DATA_DIR | Directory where Apache ZooKeeper stores data. For example, /grid/hadoop/zookeeper/data |
ZooKeeper | ZOOKEEPER_CONF_DIR | Directory to store the ZooKeeper configuration files. For example, /etc/zookeeper/conf. |
ZooKeeper | ZOOKEEPER_LOG_DIR | Directory to store the ZooKeeper logs. For example, /var/log/zookeeper. |
ZooKeeper | ZOOKEEPER_PID_DIR | Directory to store the ZooKeeper process ID. For example, /var/run/zookeeper. |
Sqoop | SQOOP_CONF_DIR | Directory to store the Apache Sqoop configuration files. For example, /etc/sqoop/conf. |
If you use the companion files, the following screen provides a snapshot of how your directories.sh file should look after you edit the TODO variables:
#!/bin/sh
#
# Directories Script
#
# 1. To use this script, you must edit the TODO variables below for your environment.
#
# 2. Warning: Leave the other parameters as the default values. Changing these default values requires you to
# change values in other configuration files.
#
#
# Hadoop Service - HDFS
#
# Space separated list of directories where NameNode stores file system image.
# For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn
DFS_NAME_DIR="TODO-LIST-OF-NAMENODE-DIRS";
# Space separated list of directories where DataNodes store the blocks.
# For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn
DFS_DATA_DIR="TODO-LIST-OF-DATA-DIRS";
# Space separated list of directories where SecondaryNameNode stores checkpoint image.
# For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn
FS_CHECKPOINT_DIR="TODO-LIST-OF-SECONDARY-NAMENODE-DIRS";
# Directory to store the HDFS logs.
HDFS_LOG_DIR="/var/log/hadoop/hdfs";
# Directory to store the HDFS process ID.
HDFS_PID_DIR="/var/run/hadoop/hdfs";
# Directory to store the Hadoop configuration files.
HADOOP_CONF_DIR="/etc/hadoop/conf";
#
# Hadoop Service - YARN
#
# Space separated list of directories where YARN stores temporary data. For example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local
YARN_LOCAL_DIR="TODO-LIST-OF-YARN-LOCAL-DIRS";
# Directory to store the YARN logs.
YARN_LOG_DIR="/var/log/hadoop/yarn";
# Space separated list of directories where YARN stores container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs
YARN_LOCAL_LOG_DIR="TODO-LIST-OF-YARN-LOCAL-LOG-DIRS";
# Directory to store the YARN process ID.
YARN_PID_DIR="/var/run/hadoop/yarn";
#
# Hadoop Service - MAPREDUCE
#
# Directory to store the MapReduce daemon logs.
MAPRED_LOG_DIR="/var/log/hadoop/mapred";
# Directory to store the MapReduce JobHistory process ID.
MAPRED_PID_DIR="/var/run/hadoop/mapred";
#
# Hadoop Service - Hive
#
# Directory to store the Hive configuration files.
HIVE_CONF_DIR="/etc/hive/conf";
# Directory to store the Hive logs.
HIVE_LOG_DIR="/var/log/hive";
# Directory to store the Hive process ID.
HIVE_PID_DIR="/var/run/hive";
#
# Hadoop Service - WebHCat (Templeton)
#
# Directory to store the WebHCat (Templeton) configuration files.
WEBHCAT_CONF_DIR="/etc/hcatalog/conf/webhcat";
# Directory to store the WebHCat (Templeton) logs.
WEBHCAT_LOG_DIR="/var/log/webhcat";
# Directory to store the WebHCat (Templeton) process ID.
WEBHCAT_PID_DIR="/var/run/webhcat";
#
# Hadoop Service - HBase
#
# Directory to store the HBase configuration files.
HBASE_CONF_DIR="/etc/hbase/conf";
# Directory to store the HBase logs.
HBASE_LOG_DIR="/var/log/hbase";
# Directory to store the HBase process ID.
HBASE_PID_DIR="/var/run/hbase";
#
# Hadoop Service - ZooKeeper
#
# Directory where ZooKeeper stores data. For example, /grid1/hadoop/zookeeper/data
ZOOKEEPER_DATA_DIR="TODO-ZOOKEEPER-DATA-DIR";
# Directory to store the ZooKeeper configuration files.
ZOOKEEPER_CONF_DIR="/etc/zookeeper/conf";
# Directory to store the ZooKeeper logs.
ZOOKEEPER_LOG_DIR="/var/log/zookeeper";
# Directory to store the ZooKeeper process ID.
ZOOKEEPER_PID_DIR="/var/run/zookeeper";
#
# Hadoop Service - Oozie
#
# Directory to store the Oozie configuration files.
OOZIE_CONF_DIR="/etc/oozie/conf"
# Directory to store the Oozie data.
OOZIE_DATA="/var/db/oozie"
# Directory to store the Oozie logs.
OOZIE_LOG_DIR="/var/log/oozie"
# Directory to store the Oozie process ID.
OOZIE_PID_DIR="/var/run/oozie"
# Directory to store the Oozie temporary files.
OOZIE_TMP_DIR="/var/tmp/oozie"
#
# Hadoop Service - Sqoop
#
SQOOP_CONF_DIR="/etc/sqoop/conf"
- The following table describes system user accounts and groups. Use this table to define what you are going to use in setting up your environment. These users and groups should reflect the accounts you create in Create System Users and Groups. The scripts.zip file you downloaded includes a script, usersAndGroups.sh, for setting user and group environment parameters.
Table 1.3. Define Users and Groups for Systems
Parameter | Definition |
---|---|
HDFS_USER | User that owns the Hadoop Distributed File System (HDFS) services. For example, hdfs. |
YARN_USER | User that owns the YARN services. For example, yarn. |
ZOOKEEPER_USER | User that owns the ZooKeeper services. For example, zookeeper. |
HIVE_USER | User that owns the Hive services. For example, hive. |
WEBHCAT_USER | User that owns the WebHCat services. For example, hcat. |
HBASE_USER | User that owns the HBase services. For example, hbase. |
SQOOP_USER | User owning the Sqoop services. For example, sqoop. |
KAFKA_USER | User owning the Apache Kafka services. For example, kafka. |
OOZIE_USER | User owning the Oozie services. For example oozie. |
HADOOP_GROUP | A common group shared by services. For example, hadoop. |
KNOX_USER | User that owns the Knox Gateway services. For example, knox. |
In general, Apache Hadoop services should be owned by specific users and not by root or application users. The following table shows the typical users for Hadoop services. If you choose to install the ODP components using the RPMs, these users are automatically set up.
If you do not install with the RPMs, or want different users, then you must identify the users that you want for your Hadoop services and the common Hadoop group and create these accounts on your system.
To create these accounts manually, follow this procedure (a scripted sketch for all of the service accounts follows Table 1.4):

Add the user to the group:

useradd -G <groupname> <username>
Table 1.4. Typical System Users and Groups
Hadoop Service | User | Group |
---|---|---|
HDFS | hdfs | hadoop |
YARN | yarn | hadoop |
MapReduce | mapred | hadoop, mapred |
Hive | hive | hadoop |
HCatalog/WebHCatalog | hcat | hadoop |
HBase | hbase | hadoop |
Sqoop | sqoop | hadoop |
ZooKeeper | zookeeper | hadoop |
Oozie | oozie | hadoop |
Knox Gateway | knox | hadoop |
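A scripted sketch for the accounts in Table 1.4 (user and group names match the table; on most distributions each user also receives its own primary group automatically):

```bash
# Create the shared hadoop group, then the service accounts with hadoop as a supplementary group.
groupadd hadoop
for svc_user in hdfs yarn mapred hive hcat hbase sqoop zookeeper oozie knox; do
  useradd -G hadoop "$svc_user"
done
```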
You can use either of two methods to determine YARN and MapReduce memory configuration settings:
- Running the YARN Utility Script
- Calculating YARN and MapReduce Memory Requirements
The ODP utility script is the recommended method for calculating ODP memory configuration settings, but information about manually calculating YARN and MapReduce memory configuration settings is also provided for reference.
This section describes how to use the yarn-utils.py script to calculate YARN, MapReduce, Hive, and Tez memory allocation settings based on the node hardware specifications. The yarn-utils.py script is included in the ODP companion files. See Download Companion Files.
To run the yarn-utils.py script, execute the following command from the folder containing the script yarn-utils.py options, where options are as follows:
Table 1.5. yarn-utils.py Options
Option | Description |
---|---|
-c CORES | The number of cores on each host |
-m MEMORY | The amount of memory on each host, in gigabytes |
-d DISKS | The number of disks on each host |
-k HBASE | "True" if HBase is installed; "False" if not |
Notes
Requires python26 to run.
You can also use the -h or --help option to display a Help message that describes the options.
Example: Running the following command from the odp_manual_install_rpm_helper_files-3.2.2.0.$BUILD directory:
python yarn-utils.py -c 16 -m 64 -d 4 -k True
Returns
Using cores=16 memory=64GB disks=4 hbase=True
Profile: cores=16 memory=49152MB reserved=16GB usableMem=48GB disks=4 Num Container=8
Container Ram=6144MB
Used Ram=48GB
Unused Ram=16GB
yarn.scheduler.minimum-allocation-mb=6144
yarn.scheduler.maximum-allocation-mb=49152
yarn.nodemanager.resource.memory-mb=49152
mapreduce.map.memory.mb=6144
mapreduce.map.java.opts=-Xmx4096m
mapreduce.reduce.memory.mb=6144
mapreduce.reduce.java.opts=-Xmx4096m
yarn.app.mapreduce.am.resource.mb=6144
yarn.app.mapreduce.am.command-opts=-Xmx4096m
mapreduce.task.io.sort.mb=1792
tez.am.resource.memory.mb=6144
tez.am.launch.cmd-opts =-Xmx4096m
hive.tez.container.size=6144
hive.tez.java.opts=-Xmx4096m
This section describes how to manually configure YARN and MapReduce memory allocation settings based on the node hardware specifications.
YARN takes into account all of the available compute resources on each machine in the cluster. Based on the available resources, YARN negotiates resource requests from applications running in the cluster, such as MapReduce. YARN then provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements such as memory and CPU.
In an Apache Hadoop cluster, it is vital to balance the use of memory (RAM), processors (CPU cores), and disks so that processing is not constrained by any one of these cluster resources. As a general recommendation, allowing for two containers per disk and per core gives the best balance for cluster utilization.
When determining the appropriate YARN and MapReduce memory configurations for a cluster node, you should start with the available hardware resources. Specifically, note the following values on each node:
- RAM (amount of memory)
- CORES (number of CPU cores)
- DISKS (number of disks)
The total available RAM for YARN and MapReduce should take into account the Reserved Memory. Reserved memory is the RAM needed by system processes and other Hadoop processes (such as HBase):
reserved memory = stack memory reserve + HBase memory reserve (if HBase is on the same node)
You can use the values in the following table to determine what you need for reserved memory per node:
Table 1.6. Reserved Memory Recommendations
Total Memory per Node | Recommended Reserved System Memory | Recommended Reserved HBase Memory |
---|---|---|
4 GB | 1 GB | 1 GB |
8 GB | 2 GB | 1 GB |
16 GB | 2 GB | 2 GB |
24 GB | 4 GB | 4 GB |
48 GB | 6 GB | 8 GB |
64 GB | 8 GB | 8 GB |
72 GB | 8 GB | 8 GB |
96 GB | 12 GB | 16 GB |
128 GB | 24 GB | 24 GB |
256 GB | 32 GB | 32 GB |
512 GB | 64 GB | 64 GB |
After you determine the amount of memory you need per node, you must determine the maximum number of containers allowed per node:
Number of containers = min (2*CORES, 1.8*DISKS, (total available RAM) / MIN_CONTAINER_SIZE)

DISKS is the value of dfs.datanode.data.dir (number of data disks) per machine.
MIN_CONTAINER_SIZE is the minimum container size (in RAM). This value depends on the amount of RAM available; in smaller memory nodes, the minimum container size should also be smaller.
The following table provides the recommended values:
Table 1.7. Recommended Container Size Values
Total RAM per Node | Recommended Minimum Container Size |
---|---|
Less than 4 GB | 256 MB |
Between 4 GB and 8 GB | 512 MB |
Between 8 GB and 24 GB | 1024 MB |
Above 24 GB | 2048 MB |
Finally, you must determine the amount of RAM per container:
RAM-per-container = max(MIN_CONTAINER_SIZE, (total available RAM) / containers)
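The two formulas can be scripted if you prefer not to work them out by hand. A minimal sketch in shell/awk, preloaded with the hypothetical node used in the worked example below (12 cores, 48 GB RAM, 12 disks, no HBase):

```bash
# Node profile (hypothetical values; edit to match your hardware).
CORES=12; RAM_GB=48; DISKS=12; RESERVED_GB=6; MIN_CONTAINER_GB=2
USABLE_GB=$((RAM_GB - RESERVED_GB))

# Number of containers = min(2*CORES, 1.8*DISKS, usable RAM / MIN_CONTAINER_SIZE)
CONTAINERS=$(awk -v c="$CORES" -v d="$DISKS" -v r="$USABLE_GB" -v m="$MIN_CONTAINER_GB" \
  'BEGIN { a=2*c; b=1.8*d; e=r/m; x=(a<b)?a:b; x=(x<e)?x:e; print int(x) }')

# RAM per container = max(MIN_CONTAINER_SIZE, usable RAM / containers)
RAM_PER_CONTAINER=$(awk -v m="$MIN_CONTAINER_GB" -v r="$USABLE_GB" -v n="$CONTAINERS" \
  'BEGIN { v=r/n; if (v < m) v = m; print v }')

echo "containers=$CONTAINERS  ram-per-container=${RAM_PER_CONTAINER}GB"
# With the values above this prints: containers=21  ram-per-container=2GB
```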
Using the results of all the previous calculations, you can configure YARN and MapReduce.
Table 1.8. YARN and MapReduce Configuration Values
Configuration File | Configuration Setting | Value Calculation |
---|---|---|
yarn-site.xml | yarn.nodemanager.resource.memory-mb | = containers * RAM-per-container |
yarn-site.xml | yarn.scheduler.minimum-allocation-mb | = RAM-per-container |
yarn-site.xml | yarn.scheduler.maximum-allocation-mb | = containers * RAM-per-container |
mapred-site.xml | mapreduce.map.memory.mb | = RAM-per-container |
mapred-site.xml | mapreduce.reduce.memory.mb | = 2 * RAM-per-container |
mapred-site.xml | mapreduce.map.java.opts | = 0.8 * RAM-per-container |
mapred-site.xml | mapreduce.reduce.java.opts | = 0.8 * 2 * RAM-per-container |
mapred-site.xml | yarn.app.mapreduce.am.resource.mb | = 2 * RAM-per-container |
mapred-site.xml | yarn.app.mapreduce.am.command-opts | = 0.8 * 2 * RAM-per-container |
Note
After installation, both yarn-site.xml and mapred-site.xml are located in the /etc/hadoop/conf folder.
Examples: Assume that your cluster nodes have 12 CPU cores, 48 GB RAM, and 12 disks:

Reserved memory = 6 GB system memory reserve + 8 GB for HBase
Min container size = 2 GB

If there is no HBase, then you can use the following calculation:

Number of containers = min (2*12, 1.8*12, (48-6)/2) = min (24, 21.6, 21) = 21
RAM-per-container = max (2, (48-6)/21) = max (2, 2) = 2
Table 1.9. Example Value Calculations Without HBase
Configuration | Value Calculation |
---|---|
yarn.nodemanager.resource.memory-mb | = 21 * 2 = 42*1024 MB |
yarn.scheduler.minimum-allocation-mb | = 2*1024 MB |
yarn.scheduler.maximum-allocation-mb | = 21 * 2 = 42*1024 MB |
mapreduce.map.memory.mb | = 2*1024 MB |
mapreduce.reduce.memory.mb | = 2 * 2 = 4*1024 MB |
mapreduce.map.java.opts | = 0.8 * 2 = 1.6*1024 MB |
mapreduce.reduce.java.opts | = 0.8 * 2 * 2 = 3.2*1024 MB |
yarn.app.mapreduce.am.resource.mb | = 2 * 2 = 4*1024 MB |
yarn.app.mapreduce.am.command-opts | = 0.8 * 2 * 2 = 3.2*1024 MB |
If HBase is included:
Number of containers = min (2*12, 1.8*12, (48-6-8)/2) = min (24, 21.6, 17) = 17
RAM-per-container = max (2, (48-6-8)/17) = max (2, 2) = 2
Table 1.10. Example Value Calculations with HBase
Configuration | Value Calculation |
---|---|
yarn.nodemanager.resource.memory-mb | = 17 * 2 = 34*1024 MB |
yarn.scheduler.minimum-allocation-mb | = 2*1024 MB |
yarn.scheduler.maximum-allocation-mb | = 17 * 2 = 34*1024 MB |
mapreduce.map.memory.mb | = 2*1024 MB |
mapreduce.reduce.memory.mb | = 2 * 2 = 4*1024 MB |
mapreduce.map.java.opts | = 0.8 * 2 = 1.6*1024 MB |
mapreduce.reduce.java.opts | = 0.8 * 2 * 2 = 3.2*1024 MB |
yarn.app.mapreduce.am.resource.mb | = 2 * 2 = 4*1024 MB |
yarn.app.mapreduce.am.command-opts | = 0.8 * 2 * 2 = 3.2*1024 MB |
Notes:
- Updating yarn.scheduler.minimum-allocation-mb without also changing yarn.nodemanager.resource.memory-mb (or vice versa) changes the number of containers per node.
- If your installation has a large amount of RAM but not many disks or cores, you can free RAM for other tasks by lowering both yarn.scheduler.minimum-allocation-mb and yarn.nodemanager.resource.memory-mb.
- With MapReduce on YARN, there are no longer preconfigured static slots for Map and Reduce tasks.
The entire cluster is available for dynamic resource allocation of Map and Reduce tasks as needed by each job. In the previous example cluster, with the previous configurations, YARN is able to allocate up to 10 Mappers (40/4) or 5 Reducers (40/8) on each node (or some other combination of Mappers and Reducers within the 40 GB per node limit).
NameNode heap size depends on many factors, such as the number of files, the number of blocks, and the load on the system. The following table provides recommendations for NameNode heap size configuration. These settings should work for typical Hadoop clusters in which the number of blocks is very close to the number of files (generally, the average ratio of number of blocks per file in a system is 1.1 to 1.2).
Some clusters might require further tweaking of the following settings. Also, it is generally better to set the total Java heap to a higher value.
Table 1.11. Recommended NameNode Heap Size Settings
Number of Files, in Millions | Total Java Heap (Xmx and Xms) | Young Generation Size (-XX:NewSize and -XX:MaxNewSize) |
---|---|---|
< 1 million files | 1126m | 128m |
1-5 million files | 3379m | 512m |
5-10 | 5913m | 768m |
10-20 | 10982m | 1280m |
20-30 | 16332m | 2048m |
30-40 | 21401m | 2560m |
40-50 | 26752m | 3072m |
50-70 | 36889m | 4352m |
70-100 | 52659m | 6144m |
100-125 | 65612m | 7680m |
125-150 | 78566m | 8960m |
150-200 | 104473m | 8960m |
Note
Acceldata recommends a maximum of 300 million files on the NameNode. You should also set -XX:PermSize to 128m and -XX:MaxPermSize to 256m.
Following are the recommended settings for HADOOP_NAMENODE_OPTS in the hadoop-env.sh file (replace the ##### placeholders for -XX:NewSize, -XX:MaxNewSize, -Xms, and -Xmx with the recommended values from the table):

-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=##### -XX:MaxNewSize=##### -Xms##### -Xmx##### -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_NAMENODE_OPTS}

If the cluster uses a secondary NameNode, you should also set HADOOP_SECONDARYNAMENODE_OPTS to HADOOP_NAMENODE_OPTS in the hadoop-env.sh file:
HADOOP_SECONDARYNAMENODE_OPTS=$HADOOP_NAMENODE_OPTS
Another useful HADOOP_NAMENODE_OPTS setting is -XX:+HeapDumpOnOutOfMemoryError.
This option specifies that a heap dump should be executed when an out-of-memory error occurs. You should also use -XX:HeapDumpPath to specify the location for the heap dump file:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./etc/heapdump.hprof
Logs are an important part of managing and operating your ODP cluster. The directories and disks that you assign for logging in ODP must have enough space to maintain logs during ODP operations. Allocate at least 10 GB of free space for any disk you want to use for ODP logging.
The Acceldata Release Engineering team hosts all the released ODP maven artifacts at http://repo.acceldata.com/content/repositories/releases/
Other than the release artifacts, some non-Acceldata artifacts are necessary for building the ODP stack. These third-party artifacts are hosted in the Acceldata nexus repository:
http://repo.acceldata.com/content/repositories/jetty-hadoop/
and
http://repo.acceldata.com/content/repositories/re-hosted/
If developers want to develop an application against the ODP stack, and they also have a maven repository manager in-house, then they can proxy these three repositories and continue referring to the internal maven groups repo.
If developers do not have access to their in-house maven repos, they can directly use the Acceldata public groups repo http://repo.acceldata.com/content/groups/public/ and continue to develop applications.
This section describes installing and testing Apache ZooKeeper, a centralized tool for providing services to highly distributed systems.
Note
HDFS and YARN depend on ZooKeeper, so install ZooKeeper first.
- Install the ZooKeeper Package
- Securing ZooKeeper with Kerberos (optional)
- Securing ZooKeeper Access
- Set Directories and Permissions
- Set Up the Configuration Files
- Start ZooKeeper
Note
In a production environment, Acceldata recommends installing ZooKeeper server on three (or a higher odd number) nodes to ensure that ZooKeeper service is available.
On all nodes of the cluster that you have identified as ZooKeeper servers, type:
- For RHEL/CentOS 7
yum install zookeeper-server
- For Ubuntu 18/20:
apt-get install zookeeper
Note
Grant the zookeeper user shell access on Ubuntu 18/20.
usermod -s /bin/bash zookeeper
Note
Before starting the following steps, refer to Setting up Security for Manual Installs.
(Optional) To secure ZooKeeper with Kerberos, perform the following steps on the host that runs KDC (Kerberos Key Distribution Center):
- Start the kadmin.local utility:
/usr/sbin/kadmin.local
- Create a principal for ZooKeeper:
sudo kadmin.local -q 'addprinc zookeeper/<ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM'
- Create a keytab for ZooKeeper:
sudo kadmin.local -q "ktadd -k /tmp/zk.keytab zookeeper/ <ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM"
- Copy the keytab to all ZooKeeper nodes in the cluster.
Note
Verify that only the ZooKeeper and Storm operating system users can access the ZooKeeper keytab.
- Administrators must add the following properties to the zoo.cfg configuration file located at /etc/zookeeper/conf:
authProvider.1 = org.apache.zookeeper.server.auth.SASLAuthenticationProvider
kerberos.removeHostFromPrincipal = true
kerberos.removeRealmFromPrincipal = true
The default value of yarn.resourcemanager.zk-acl allows anyone to have full access to the znode. Acceldata recommends that you modify this permission to restrict access by performing the steps in the following sections.
- ZooKeeper Configuration
- YARN Configuration
- HDFS Configuration
Note
The steps in this section only need to be performed once for the ODP cluster. If this task has been done to secure HBase for example, then there is no need to repeat these ZooKeeper steps if the YARN cluster uses the same ZooKeeper server.
- Create a keytab for ZooKeeper called zookeeper.service.keytab and save it to /etc/security/keytabs:

  sudo kadmin.local -q "ktadd -k /tmp/zk.keytab zookeeper/<ZOOKEEPER_HOSTNAME>@STORM.EXAMPLE.COM"
- Add the following to the zoo.cfg file:
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
jaasLoginRenew=3600000
kerberos.removeHostFromPrincipal=true
kerberos.removeRealmFromPrincipal=true
- Create the zookeeper_client_jaas.conf file.
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=false
useTicketCache=true;
};
- Create the zookeeper_jaas.conf file.
Server {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=false
keyTab="$PATH_TO_ZOOKEEPER_KEYTAB"
(such as"/etc/security/keytabs/zookeeper.service.keytab")
principal="zookeeper/$HOST";
(such as "zookeeper/[email protected]";) };
- Add the following information to zookeeper-env.sh:
export CLIENT_JVMFLAGS="-Djava.security.auth.login.config=/etc/zookeeper/conf/zookeeper_client_jaas.conf"
export SERVER_JVMFLAGS="-Xmx1024m -Djava.security.auth.login.config=/etc/zookeeper/conf/zookeeper_jaas.conf"
Note
The following steps must be performed on all nodes that launch the ResourceManager.
- Create a new configuration file called
yarn_jaas.conf
in the directory that contains the Hadoop Core configurations (typically,/etc/hadoop/conf
).
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
storeKey=true
useTicketCache=false
keyTab="$PATH_TO_RM_KEYTAB"
(such as "/etc/security/keytabs/rm.service.keytab")
principal="rm/$HOST";
(such as "rm/[email protected]";)
};
- Add a new property to the
yarn-site.xml
file.
<property>
<name>yarn.resourcemanager.zk-acl</name>
<value>sasl:rm:rwcda</value>
</property>
Note
Because yarn.resourcemanager.zk-acl is set to sasl, you do not need to set any value for yarn.resourcemanager.zk-auth.

Setting the value to sasl also means that you cannot run the command addauth <scheme> <auth> in the zkclient CLI.
- Add a new YARN_OPTS to the yarn-env.sh file and make sure this YARN_OPTS is picked up when you start your ResourceManagers.
YARN_OPTS="$YARN_OPTS -Dzookeeper.sasl.client=true
-Dzookeeper.sasl.client.username=zookeeper
-Djava.security.auth.login.config=/etc/hadoop/conf/yarn_jaas.conf
-Dzookeeper.sasl.clientconfig=Client"
- In the hdfs-site.xml file, set the following property to secure the ZooKeeper-based failover controller, used when NameNode HA is enabled:
<property>
<name>ha.zookeeper.acl</name>
<value>sasl:nn:rwcda</value>
</property>
Create directories and configure ownership and permissions on the appropriate hosts as described below. If any of these directories already exist, we recommend deleting and recreating them.
Acceldata provides a set of configuration files that represent a working ZooKeeper configuration. (See Download Companion Files.) You can use these files as a reference point, however, you need to modify them to match your own cluster environment.
If you choose to use the provided configuration files to set up your ZooKeeper environment, complete the following steps to create the appropriate directories.
- Execute the following commands on all ZooKeeper nodes:
mkdir -p $ZOOKEEPER_LOG_DIR;
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_LOG_DIR;
chmod -R 755 $ZOOKEEPER_LOG_DIR;
mkdir -p $ZOOKEEPER_PID_DIR;
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_PID_DIR;
chmod -R 755 $ZOOKEEPER_PID_DIR;
mkdir -p $ZOOKEEPER_DATA_DIR;
chmod -R 755 $ZOOKEEPER_DATA_DIR;
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_DATA_DIR
where:
• $ZOOKEEPER_USER is the user owning the ZooKeeper services. For example, zookeeper.
• $ZOOKEEPER_LOG_DIR is the directory to store the ZooKeeper logs. For example, /var/log/zookeeper.
• $ZOOKEEPER_PID_DIR is the directory to store the ZooKeeper process ID. For example, /var/run/zookeeper.
• $ZOOKEEPER_DATA_DIR is the directory where ZooKeeper stores data. For example, /grid/hadoop/zookeeper/data.
- Initialize the ZooKeeper data directories with the 'myid' file. Create one file per ZooKeeper server, and put the number of that server in each file:
vi $ZOOKEEPER_DATA_DIR/myid
- In the myid file on the first server, enter the corresponding number: 1
- In the myid file on the second server, enter the corresponding number: 2
- In the myid file on the third server, enter the corresponding number: 3
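Instead of editing each myid file interactively with vi, you can write the values directly; a minimal sketch (run the matching line on the corresponding server, and keep the numbers consistent with the server.N entries in zoo.cfg):

```bash
echo 1 > $ZOOKEEPER_DATA_DIR/myid   # on the first ZooKeeper server
echo 2 > $ZOOKEEPER_DATA_DIR/myid   # on the second ZooKeeper server
echo 3 > $ZOOKEEPER_DATA_DIR/myid   # on the third ZooKeeper server
```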
You must set up several configuration files for ZooKeeper. Acceldata provides a set of configuration files that represent a working ZooKeeper configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.
If you choose to use the provided configuration files to set up your ZooKeeper environment, complete the following steps:
- Extract the ZooKeeper configuration files to a temporary directory. The files are located in the configuration_files/zookeeper directories where you decompressed the companion files.
- Modify the configuration files. In the respective temporary directories, locate the zookeeper-env.sh file and modify the properties based on your environment, including the JDK version you downloaded.
- Edit the zookeeper-env.sh file to match the Java home directory, ZooKeeper log directory, and ZooKeeper PID directory in your cluster environment and the directories you set up above.
See below for an example configuration:
export JAVA_HOME=/usr/jdk64/jdk1.8.0_202
export ZOOKEEPER_HOME=/usr/odp/current/zookeeper-server
export ZOOKEEPER_LOG_DIR=/var/log/zookeeper
export ZOOKEEPER_PID_DIR=/var/run/zookeeper/zookeeper_server.pid
export SERVER_JVMFLAGS=-Xmx1024m
export JAVA=$JAVA_HOME/bin/java
CLASSPATH=$CLASSPATH:$ZOOKEEPER_HOME/*
- Edit the zoo.cfg file to match your cluster environment. Below is an example of a typical zoo.cfg file:
dataDir=$zk.data.directory.path
server.1=$zk.server1.full.hostname:2888:3888
server.2=$zk.server2.full.hostname:2888:3888
server.3=$zk.server3.full.hostname:2888:3888
- Copy the configuration files.
- On all hosts create the config directory:
rm -r $ZOOKEEPER_CONF_DIR ;
mkdir -p $ZOOKEEPER_CONF_DIR ;
- Copy all the ZooKeeper configuration files to the $ZOOKEEPER_CONF_DIR directory.
- Set appropriate permissions:
chmod a+x $ZOOKEEPER_CONF_DIR/;
chown -R $ZOOKEEPER_USER:$HADOOP_GROUP $ZOOKEEPER_CONF_DIR/../;
chmod -R 755 $ZOOKEEPER_CONF_DIR/../
Note:
- $ZOOKEEPER_CONF_DIR is the directory to store the ZooKeeper configuration files. For example, /etc/zookeeper/conf.
- $ZOOKEEPER_USER is the user owning the ZooKeeper services. For example, zookeeper.
To install and configure HBase and other Hadoop ecosystem components, you must start the ZooKeeper service and the ZKFC:
sudo -E -u zookeeper bash -c "export ZOOCFGDIR=$ZOOKEEPER_CONF_DIR ; export ZOOCFG=zoo.cfg; source $ZOOKEEPER_CONF_DIR/zookeeper-env.sh ; $ZOOKEEPER_HOME/bin/zkServer.sh start"

For example:

su - zookeeper -c "export ZOOCFGDIR=/usr/odp/current/zookeeper-server/conf ; export ZOOCFG=zoo.cfg; source /usr/odp/current/zookeeper-server/conf/zookeeper-env.sh ; /usr/odp/current/zookeeper-server/bin/zkServer.sh start"

su -l hdfs -c "/usr/odp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start zkfc"
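To confirm that ZooKeeper started correctly, one option (a sketch, assuming the same ZOOCFGDIR/ZOOKEEPER_HOME values used above and that four-letter-word commands are enabled for your ZooKeeper version) is:

```bash
# Reports the server mode (standalone, leader, or follower).
sudo -E -u zookeeper bash -c "export ZOOCFGDIR=$ZOOKEEPER_CONF_DIR ; export ZOOCFG=zoo.cfg; \
  source $ZOOKEEPER_CONF_DIR/zookeeper-env.sh ; $ZOOKEEPER_HOME/bin/zkServer.sh status"

# A healthy server answers 'imok' to the ruok command.
echo ruok | nc localhost 2181
```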
This section describes how to install the Hadoop Core components, HDFS, YARN, and MapReduce.
Complete the following instructions to install Hadoop Core components:
- Set Default File and Directory Permissions
- Install the Hadoop Packages
- Install Compression Libraries
- Create Directories
Set the default operating system file and directory permissions to 0022 (022).
Use the umask command to confirm that the permissions are set as necessary. For example, to see what the current umask setting is, enter:
umask
If you want to set a default umask for all users of the OS, edit the /etc/profile file, or another appropriate file for system-wide shell configuration (a sketch follows the next paragraph).
Ensure that the umask is set for all terminal sessions that you use during installation.
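A minimal sketch of the system-wide change (the target file is the conventional one; your distribution may use a different profile file):

```bash
# Append a default umask for login shells; it takes effect in new login sessions.
echo "umask 022" >> /etc/profile
```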
Execute the following command on all cluster nodes.
- For RHEL/CentOS 7:

  yum install hadoop hadoop-hdfs hadoop-libhdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl
- For Ubuntu 18/20:

  apt-get install hadoop hadoop-hdfs libhdfs0 hadoop-yarn hadoop-mapreduce hadoop-client openssl
Make the following compression libraries available on all the cluster nodes.
Install Snappy on all the nodes in your cluster. At each node:
- For RHEL/CentOS 7
yum install snappy snappy-devel
- For Ubuntu 18/20:
apt-get install libsnappy1 libsnappy-dev
Execute the following command at all the nodes in your cluster:
- RHEL/CentOS 7
yum install lzo lzo-devel hadooplzo hadooplzo-native
- For Ubuntu 18/20:
apt-get install liblzo2-2 liblzo2-dev hadooplzo
Create directories and configure ownership and permissions on the appropriate hosts as described below.
Before you begin:
- If any of these directories already exist, we recommend deleting and recreating them.
- Acceldata provides a set of configuration files that represent a working configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.
Use the following instructions to create appropriate directories:
- Create the NameNode Directories
- Create the SecondaryNameNode Directories
- Create DataNode and YARN NodeManager Local Directories
- Create the Log and PID Directories
- Symlink Directories with odp-select
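If you are not sourcing the companion-file scripts, you can export the environment variables used in the commands that follow yourself before running them. The values below simply reuse this section's examples and are illustrative only:
export HDFS_USER=hdfs
export YARN_USER=yarn
export MAPRED_USER=mapred
export HADOOP_GROUP=hadoop
export DFS_NAME_DIR="/grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn"
export DFS_DATA_DIR="/grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn"
export YARN_LOCAL_DIR="/grid/hadoop/yarn/local /grid1/hadoop/yarn/local"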
On the node that hosts the NameNode service, execute the following commands:
mkdir -p $DFS_NAME_DIR;
chown -R $HDFS_USER:$HADOOP_GROUP $DFS_NAME_DIR;
chmod -R 755 $DFS_NAME_DIR;
Where:
- $DFS_NAME_DIR is the space-separated list of directories where the NameNode stores the file system image. For example, /grid/hadoop/hdfs/nn /grid1/hadoop/hdfs/nn.
- $HDFS_USER is the user owning the HDFS services. For example, hdfs.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
On all the nodes that can potentially run the SecondaryNameNode service, execute the following commands:
mkdir -p $FS_CHECKPOINT_DIR;
chown -R $HDFS_USER:$HADOOP_GROUP $FS_CHECKPOINT_DIR;
chmod -R 755 $FS_CHECKPOINT_DIR;
where:
- $FS_CHECKPOINT_DIR is the space-separated list of directories where the SecondaryNameNode should store the checkpoint image. For example, /grid/hadoop/hdfs/snn /grid1/hadoop/hdfs/snn /grid2/hadoop/hdfs/snn.
- $HDFS_USER is the user owning the HDFS services. For example, hdfs.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
At each DataNode, execute the following commands:
mkdir -p $DFS_DATA_DIR;
chown -R $HDFS_USER:$HADOOP_GROUP $DFS_DATA_DIR;
chmod -R 750 $DFS_DATA_DIR;
where:
- $DFS_DATA_DIR is the space-separated list of directories where DataNodes should store the blocks. For example, /grid/hadoop/hdfs/dn /grid1/hadoop/hdfs/dn /grid2/hadoop/hdfs/dn.
- $HDFS_USER is the user owning the HDFS services. For example, hdfs.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
At each ResourceManager and all DataNodes, execute the following commands:
mkdir -p $YARN_LOCAL_DIR;
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_DIR;
chmod -R 755 $YARN_LOCAL_DIR;
where:
- $YARN_LOCAL_DIR is the space-separated list of directories where YARN should store temporary data. For example, /grid/hadoop/yarn/local /grid1/hadoop/yarn/local /grid2/hadoop/yarn/local.
- $YARN_USER is the user owning the YARN services. For example, yarn.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
At each ResourceManager and all DataNodes, execute the following commands:
mkdir -p $YARN_LOCAL_LOG_DIR;
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOCAL_LOG_DIR;
chmod -R 755 $YARN_LOCAL_LOG_DIR;
where:
- $YARN_LOCAL_LOG_DIR is the space-separated list of directories where YARN should store container log data. For example, /grid/hadoop/yarn/logs /grid1/hadoop/yarn/logs /grid2/hadoop/yarn/logs.
- $YARN_USER is the user owning the YARN services. For example, yarn.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
Each Hadoop service requires a log and PID directory. In this section, you create directories for each service. If you choose to use the companion file scripts, these environment variables are already defined and you can copy and paste the examples into your terminal window.
At all nodes, execute the following commands:
mkdir -p $HDFS_LOG_DIR;
chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_LOG_DIR;
chmod -R 755 $HDFS_LOG_DIR;
where:
- $HDFS_LOG_DIR is the directory for storing the HDFS logs. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/log/hadoop/hdfs, where hdfs is the $HDFS_USER.
- $HDFS_USER is the user owning the HDFS services. For example, hdfs.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
At all nodes, execute the following commands:
mkdir -p $YARN_LOG_DIR;
chown -R $YARN_USER:$HADOOP_GROUP $YARN_LOG_DIR;
chmod -R 755 $YARN_LOG_DIR;
where:
- $YARN_LOG_DIR is the directory for storing the YARN logs. This directory name is a combination of a directory and the $YARN_USER. For example, /var/log/hadoop/yarn, where yarn is the $YARN_USER.
- $YARN_USER is the user owning the YARN services. For example, yarn.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
At all nodes, execute the following commands:
mkdir -p $HDFS_PID_DIR;
chown -R $HDFS_USER:$HADOOP_GROUP $HDFS_PID_DIR;
chmod -R 755 $HDFS_PID_DIR;
where:
- $HDFS_PID_DIR is the directory for storing the HDFS process ID. This directory name is a combination of a directory and the $HDFS_USER. For example, /var/run/hadoop/hdfs, where hdfs is the $HDFS_USER.
- $HDFS_USER is the user owning the HDFS services. For example, hdfs.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
At all nodes, execute the following commands:
mkdir -p $YARN_PID_DIR;
chown -R $YARN_USER:$HADOOP_GROUP $YARN_PID_DIR;
chmod -R 755 $YARN_PID_DIR;
where:
- $YARN_PID_DIR is the directory for storing the YARN process ID. This directory name is a combination of a directory and the $YARN_USER. For example, /var/run/hadoop/yarn, where yarn is the $YARN_USER.
- $YARN_USER is the user owning the YARN services. For example, yarn.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
At all nodes, execute the following commands:
mkdir -p $MAPRED_LOG_DIR;
chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_LOG_DIR;
chmod -R 755 $MAPRED_LOG_DIR;
where:
- $MAPRED_LOG_DIR is the directory for storing the JobHistory Server logs. This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/log/hadoop/mapred, where mapred is the $MAPRED_USER.
- $MAPRED_USER is the user owning the MAPRED services. For example, mapred.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
At all nodes, execute the following commands:
mkdir -p $MAPRED_PID_DIR;
chown -R $MAPRED_USER:$HADOOP_GROUP $MAPRED_PID_DIR;
chmod -R 755 $MAPRED_PID_DIR;
where:
- $MAPRED_PID_DIR is the directory for storing the JobHistory Server process ID. This directory name is a combination of a directory and the $MAPRED_USER. For example, /var/run/hadoop/mapred, where mapred is the $MAPRED_USER.
- $MAPRED_USER is the user owning the MAPRED services. For example, mapred.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
Important
ODP 3.2.2.0 installs odp-select automatically with the installation or upgrade of the first ODP component.
To prevent version-specific directory issues for your scripts and updates, Acceldata provides odp-select, a script that symlinks directories to odp-current and modifies paths for configuration directories.
Determine the version number of the odp-select installed package:
yum list | grep odp (on CentOS 7)
rpm -q -a | grep odp (on CentOS 7)
dpkg -l | grep odp (on Ubuntu)
For example:
/usr/bin/odp-select set all 3.2.2.0-<$BUILD>
Run odp-select set all on the NameNode and on all DataNodes. If YARN is deployed separately, also run odp-select on the Resource Manager and all Node Managers.
odp-select set all 3.2.2.0-<$BUILD>
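To confirm that the symlinks now point at the version you set, you can simply list the odp-current links; this is an illustrative check, not an ODP requirement:
ls -l /usr/odp/current/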
This section describes how to set up and edit the deployment configuration files for HDFS and MapReduce.
You must set up several configuration files for HDFS and MapReduce. Acceldata provides a set of configuration files that represent a working HDFS and MapReduce configuration. (See Download Companion Files.) You can use these files as a reference point; however, you need to modify them to match your own cluster environment.
If you choose to use the provided configuration files to set up your HDFS and MapReduce environment, complete the following steps:
- Extract the core Hadoop configuration files to a temporary directory.
The files are located in the configuration_files/core_hadoop
directory where you decompressed the companion files.
- Modify the configuration files.
In the temporary directory, locate the following files and modify the properties based on your environment. Search for TODO in the files for the properties to replace. For further information, see "Define Environment Parameters" in this guide.
- Edit core-site.xml and modify the following properties:
<property>
<name>fs.defaultFS</name>
<value>hdfs://$namenode.full.hostname:8020</value>
<description>Enter your NameNode hostname</description>
</property>
<property>
<name>odp.version</name>
<value>${odp.version}</value>
<description>Replace with the actual ODP version</description>
</property>
- Edit hdfs-site.xml and modify the following properties:
<property>
<name>dfs.namenode.name.dir</name>
<value>/grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn</value>
<description>Comma-separated list of paths. Use the list of directories from $DFS_NAME_DIR. For example, /grid/hadoop/hdfs/nn,/grid1/hadoop/hdfs/nn.</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn</value>
<description>Comma-separated list of paths. Use the list of directories from $DFS_DATA_DIR. For example, file:///grid/hadoop/hdfs/dn,file:///grid1/hadoop/hdfs/dn.</description>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>$namenode.full.hostname:50070</value>
<description>Enter your NameNode hostname for http access.</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>$secondary.namenode.full.hostname:50090</value>
<description>Enter your Secondary NameNode hostname.</description>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
<description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
</property>
<property>
<name>dfs.namenode.checkpoint.edits.dir</name>
<value>/grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn</value>
<description>A comma-separated list of paths. Use the list of directories from $FS_CHECKPOINT_DIR. For example, /grid/hadoop/hdfs/snn,/grid1/hadoop/hdfs/snn,/grid2/hadoop/hdfs/snn.</description>
</property>
<property>
<name>dfs.namenode.rpc-address</name>
<value>namenode_host_name:8020</value>
<description>The RPC address that handles all client requests.</description>
</property>
<property>
<name>dfs.namenode.https-address</name>
<value>namenode_host_name:50470</value>
<description>The NameNode secure HTTP server address and port.</description>
</property>
Note
The maximum value of the NameNode new generation size (-XX:MaxNewSize) should be 1/8 of the maximum heap size (-Xmx). Ensure that you check the default setting for your environment.
- Edit yarn-site.xml and modify the following properties:
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>$resourcemanager.full.hostname:8025</value>
<description>Enter your ResourceManager hostname.</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>$resourcemanager.full.hostname:8030</value>
<description>Enter your ResourceManager hostname.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>$resourcemanager.full.hostname:8050</value>
<description>Enter your ResourceManager hostname.</description>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>$resourcemanager.full.hostname:8141</value>
<description>Enter your ResourceManager hostname.</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/grid/hadoop/yarn/local,/grid1/hadoop/yarn/local</value>
<description>Comma-separated list of paths. Use the list of directories from $YARN_LOCAL_DIR. For example, /grid/hadoop/yarn/local,/grid1/hadoop/yarn/local.</description>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/grid/hadoop/yarn/log</value>
<description>Use the list of directories from $YARN_LOCAL_LOG_DIR. For example, /grid/hadoop/yarn/log,/grid1/hadoop/yarn/log,/grid2/hadoop/yarn/log.</description>
</property>
<property>
<name>yarn.nodemanager.recovery.dir</name>
<value>${hadoop.tmp.dir}/yarn-nm-recovery</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://$jobhistoryserver.full.hostname:19888/jobhistory/logs/</value>
<description>URL for job history server</description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>$resourcemanager.full.hostname:8088</value>
<description>URL for the ResourceManager web UI.</description>
</property>
<property>
<name>yarn.timeline-service.webapp.address</name>
<value><Resource_Manager_full_hostname>:8188</value>
</property>
- Edit mapred-site.xml and modify the following properties:
<property>
<name>mapreduce.jobhistory.address</name>
<value>$jobhistoryserver.full.hostname:10020</value>
<description>Enter your JobHistoryServer hostname.</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>$jobhistoryserver.full.hostname:19888</value>
<description>Enter your JobHistoryServer hostname.</description>
</property>
- On each node of the cluster, create an empty file named dfs.exclude inside $HADOOP_CONF_DIR. Append the following to /etc/profile:
touch $HADOOP_CONF_DIR/dfs.exclude
JAVA_HOME=<java_home_path>
export JAVA_HOME
HADOOP_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR
export PATH=$PATH:$JAVA_HOME:$HADOOP_CONF_DIR
- Optional: Configure MapReduce to use Snappy Compression.
To enable Snappy compression for MapReduce jobs, edit core-site.xml and mapred-site.xml.
- Add the following properties to mapred-site.xml:
<property>
<name>mapreduce.admin.map.child.java.opts</name>
<value>-server -XX:NewRatio=8 -Djava.library.path=/usr/odp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
<final>true</final>
</property>
<property>
<name>mapreduce.admin.reduce.child.java.opts</name>
<value>-server -XX:NewRatio=8 -Djava.library.path=/usr/odp/current/hadoop/lib/native/ -Djava.net.preferIPv4Stack=true</value>
<final>true</final>
</property>
- Add the SnappyCodec to the codecs list in core-site.xml:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
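After the libraries are installed and the codec is listed, you can verify that Hadoop loads the native Snappy library with the standard checknative command; run it as any user that has the Hadoop client on its PATH:
hadoop checknative -a
The snappy line in the output should read true, followed by the library path.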
- Optional: If you are using the LinuxContainerExecutor, you must set up container-executor.cfg in the config directory. The file must be owned by root:root. The settings are in the form of key=value with one key per line. There must be entries for all keys. If you do not want to assign a value for a key, you can leave it unset in the form of key=#.
The keys are defined as follows:
- yarn.nodemanager.linux-container-executor.group - the configured value of yarn.nodemanager.linux-container-executor.group. This must match the value of yarn.nodemanager.linux-container-executor.group in yarn-site.xml.
- banned.users - a comma-separated list of users who cannot run container-executor.
- min.user.id - the minimum value of user id; this is to prevent system users from running container-executor.
- allowed.system.users - a comma-separated list of allowed system users.
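A minimal container-executor.cfg that follows these rules might look like the following; the values are illustrative, and allowed.system.users is deliberately shown unset using the key=# convention described above:
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred
min.user.id=1000
allowed.system.users=#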
- Replace the default memory configuration settings in yarn-site.xml and mapred-site.xml with the YARN and MapReduce memory configuration settings you calculated previously. Fill in the memory/CPU values that match what the documentation or helper scripts suggest for your environment.
- Copy the configuration files.
- On all hosts in your cluster, create the Hadoop configuration directory:
rm -rf $HADOOP_CONF_DIR
mkdir -p $HADOOP_CONF_DIR
where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
- Copy all the configuration files to $HADOOP_CONF_DIR.
- Set the appropriate permissions:
chown -R $HDFS_USER:$HADOOP_GROUP $HADOOP_CONF_DIR/../
chmod -R 755 $HADOOP_CONF_DIR/../
where:
- $HDFS_USER is the user owning the HDFS services. For example, hdfs.
- $HADOOP_GROUP is a common group shared by services. For example, hadoop.
- Set the Concurrent Mark-Sweep (CMS) Garbage Collector (GC) parameters.
On the NameNode host, open the /etc/hadoop/conf/hadoop-env.sh file. Locate export HADOOP_NAMENODE_OPTS=<parameters> and add the following parameters:
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70
By default, CMS GC uses a set of heuristic rules to trigger garbage collection. This makes garbage collection less predictable and tends to delay collection until the old generation is almost fully occupied. Initiating it in advance allows garbage collection to complete before the old generation is full, and thus avoids a Full GC (that is, a stop-the-world pause).
- -XX:+UseCMSInitiatingOccupancyOnly prevents the use of GC heuristics.
- -XX:CMSInitiatingOccupancyFraction=<percent> tells the Java VM when CMS should be triggered. Basically, it allows the creation of a buffer in heap, which can be filled with data while CMS is running. This percent should be back-calculated from the speed with which memory is consumed in the old generation during production load. If this percent is set too low, the CMS runs too often; if it is set too high, the CMS is triggered too late and concurrent mode failure may occur. The recommended setting for -XX:CMSInitiatingOccupancyFraction is 70, which means that the application should utilize less than 70% of the old generation.
Use the following instructions to start core Hadoop and perform the smoke tests.
- Format and Start HDFS
- Smoke Test HDFS
- Configure YARN and MapReduce
- Start YARN
- Start MapReduce JobHistory Server
- Smoke Test MapReduce
- Modify the JAVA_HOME value in the hadoop-env.sh file:
export JAVA_HOME=/usr/java/default
- Execute the following commands on the NameNode host machine:
su - $HDFS_USER
/usr/odp/current/hadoop-hdfs-namenode/../hadoop/bin/hdfs namenode -format
/usr/odp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode
- Execute the following commands on the SecondaryNameNode:
su - $HDFS_USER
/usr/odp/current/hadoop-hdfs-secondarynamenode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start secondarynamenode
- Execute the following commands on all DataNodes:
su - $HDFS_USER
/usr/odp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode
Where $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
Where $HDFS_USER is the HDFS user, for example, hdfs.
- Determine if you can reach the NameNode server with your browser:
http://$namenode.full.hostname:50070
- Create the hdfs user directory in HDFS:
su - $HDFS_USER
hdfs dfs -mkdir -p /user/hdfs
- Try copying a file into HDFS and listing that file:
su - $HDFS_USER
hdfs dfs -copyFromLocal /etc/passwd passwd
hdfs dfs -ls
- Use the NameNode web UI and the Utilities menu to browse the file system.
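To round out the HDFS smoke test, you can also print an overall cluster report from the command line; dfsadmin -report is a standard HDFS command and should list every live DataNode:
su - $HDFS_USER
hdfs dfsadmin -report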
After you install Hadoop, configure YARN and MapReduce as follows.
- As the HDFS user, for example 'hdfs', upload the MapReduce tarball to HDFS.
su - $HDFS_USER
hdfs dfs -mkdir -p /odp/apps/<odp_version>/mapreduce/
hdfs dfs -put /usr/odp/current/hadoop-client/mapreduce.tar.gz /odp/apps/<odp_version>/mapreduce/
hdfs dfs -chown -R hdfs:hadoop /odp
hdfs dfs -chmod -R 555 /odp/apps/<odp_version>/mapreduce
hdfs dfs -chmod 444 /odp/apps/<odp_version>/mapreduce/mapreduce.tar.gz
Where $HDFS_USER is the HDFS user, for example hdfs, and <odp_version> is the current ODP version, for example 3.2.2.0.
- Copy mapred-site.xml from the companion files and make the following changes:
- Add the following property:
<property>
<name>mapreduce.admin.map.child.java.opts</name>
<value>-server -Djava.net.preferIPv4Stack=true -Dodp.version=${odp.version}</value>
<final>true</final>
</property>
Note
You do not need to modify ${odp.version}.
- Modify the following existing properties to include ${odp.version}:
<property>
<name>mapreduce.admin.user.env</name>
<value>LD_LIBRARY_PATH=/usr/odp/${odp.version}/hadoop/lib/native:/usr/odp/${odp.version}/hadoop/lib/native/Linux-amd64-64</value>
</property>
<property>
<name>mapreduce.application.framework.path</name>
<value>/odp/apps/${odp.version}/mapreduce/mapreduce.tar.gz#mr-framework</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/odp/${odp.version}/hadoop/lib/hadoop-lzo-0.6.0.${odp.version}.jar:/etc/hadoop/conf/secure</value>
</property>
Note
You do not need to modify ${odp.version}.
- Copy yarn-site.xml from the companion files and modify:
<property>
<name>yarn.application.classpath</name>
<value>$HADOOP_CONF_DIR,/usr/odp/${odp.version}/hadoop-client/*, /usr/odp/${odp.version}/hadoop-client/lib/*,
/usr/odp/${odp.version}/hadoop-hdfs-client/*,
/usr/odp/${odp.version}/hadoop-hdfs-client/lib/*,
/usr/odp/${odp.version}/hadoop-yarn-client/*,
/usr/odp/${odp.version}/hadoop-yarn-client/lib/*</value>
</property>
- For secure clusters, you must create and configure the container-executor.cfg configuration file:
- Create the container-executor.cfg file in /etc/hadoop/conf/.
- Insert the following properties:
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred
min.user.id=1000
- Set the /etc/hadoop/conf/container-executor.cfg file permissions to be readable only by root:
chown root:hadoop /etc/hadoop/conf/container-executor.cfg
chmod 400 /etc/hadoop/conf/container-executor.cfg
- Set the container-executor program so that only root or hadoop group users can execute it:
chown root:hadoop /usr/odp/${odp.version}/hadoop-yarn/bin/container-executor
chmod 6050 /usr/odp/${odp.version}/hadoop-yarn/bin/container-executor
Note
To install and configure the Timeline Server, see Configuring the Timeline Server.
- As $YARN_USER, run the following command from the ResourceManager server:
su -l yarn -c "/usr/odp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager"
- As $YARN_USER, run the following command from all NodeManager nodes:
su -l yarn -c "/usr/odp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager"
where: $HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
- Change permissions on the container-executor file.
chown -R root:hadoop /usr/odp/current/hadoop-yarn*/bin/container-executor
chmod -R 6050 /usr/odp/current/hadoop-yarn*/bin/container-executor
Note
If these permissions are not set, the healthcheck script returns an error stating that the DataNode is UNHEALTHY.
- Execute these commands from the JobHistory server to set up directories on HDFS:
su $HDFS_USER
hdfs dfs -mkdir -p /mr-history/tmp
hdfs dfs -mkdir -p /mr-history/done
hdfs dfs -chmod 1777 /mr-history
hdfs dfs -chmod 1777 /mr-history/tmp
hdfs dfs -chmod 1770 /mr-history/done
hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history
hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history/tmp
hdfs dfs -chown $MAPRED_USER:$MAPRED_USER_GROUP /mr-history/done
Where
$MAPRED_USER : mapred
$MAPRED_USER_GROUP: mapred or hadoop
- Execute the following commands to create the YARN application log directory on HDFS:
hdfs dfs -mkdir -p /app-logs
hdfs dfs -chmod 1777 /app-logs
hdfs dfs -chown $YARN_USER:$HADOOP_GROUP /app-logs
Where
$YARN_USER : yarn
$HADOOP_GROUP: hadoop
- Run the following command from the JobHistory server:
su -l $YARN_USER -c
"/usr/odp/current/hadoop-mapreduce-historyserver/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver"
$HADOOP_CONF_DIR is the directory for storing the Hadoop configuration files. For example, /etc/hadoop/conf.
- Browse to the ResourceManager:
http://$resourcemanager.full.hostname:8088/
- Create a $CLIENT_USER in all of the nodes and add it to the users group.
useradd client
usermod -a -G users client
- As the HDFS user, create a /user/$CLIENT_USER.
sudo su - $HDFS_USER
hdfs dfs -mkdir /user/$CLIENT_USER
hdfs dfs -chown $CLIENT_USER:$CLIENT_USER /user/$CLIENT_USER
hdfs dfs -chmod -R 755 /user/$CLIENT_USER
- Run the smoke test as the $CLIENT_USER, using teragen and terasort. The commands below generate and sort a small sample (teragen writes 100-byte rows, so 10000 rows is roughly 1 MB); increase the row count, for example to 100000000 rows, to sort about 10 GB of data.
su - $CLIENT_USER
/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teragen 10000 tmp/teragenout
/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar terasort tmp/teragenout tmp/terasortout
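Optionally, you can check the sorted output with the teravalidate example that ships in the same jar; the paths below simply reuse the directories from the commands above:
/usr/odp/current/hadoop-client/bin/hadoop jar /usr/odp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-*.jar teravalidate tmp/terasortout tmp/teravalidateout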
A typical Open source Data Platform (ODP) install requires access to the Internet in order to fetch software packages from a remote repository. Because corporate networks typically have various levels of firewalls, these firewalls may limit or restrict Internet access, making it impossible for your cluster nodes to access the ODP repository during the install process.
The solution for this is to either:
- Create a local mirror repository hosted on a local mirror server inside your firewall; or
- Provide a trusted proxy server inside your firewall that can access the hosted repositories.
Note
Many of the descriptions in this section assume you are using RHEL/CentOS 7.
This document covers these two options in detail, discusses the trade-offs, provides configuration guidelines, and offers recommendations for your deployment strategy.
In general, before installing Open source Data Platform in a production data center, it is best to ensure that both the Data Center Security team and the Data Center Networking team are informed and engaged to assist with these aspects of the deployment.
The table below lists the various terms used throughout this section.
Table 6.1. Terminology
Item | Description |
---|---|
Yum Package Manager (yum) | A package management tool that fetches and installs software packages and performs automatic dependency resolution. |
Local Mirror Repository | The yum repository hosted on your Local Mirror Server that will serve the ODP software. |
Local Mirror Server | The server in your network that will host the Local Mirror Repository. This server must be accessible from all hosts in your cluster where you will install ODP. |
ODP Repositories | A set of repositories hosted by Acceldata that contains the ODP software packages. ODP software packages include the ODP Repository and the ODP-UTILS Repository. |
ODP Repository Tarball | A tarball image that contains the complete contents of the ODP Repositories. |
ODP uses yum or zypper to install software, and this software is obtained from the ODP Repositories. If your firewall prevents Internet access, you must mirror or proxy the ODP Repositories in your Data Center.
Mirroring a repository involves copying the entire repository and all its contents onto a local server and enabling an HTTPD service on that server to serve the repository locally. Once the local mirror server setup is complete, the *.repo configuration files on every cluster node must be updated, so that the given package names are associated with the local mirror server instead of the remote repository server.
There are two options for creating a local mirror server. Each of these options is explained in detail in a later section.
- Mirror server has no access to the Internet at all: Use a web browser on your workstation to download the ODP Repository Tarball, move the tarball to the selected mirror server using scp or a USB drive, and extract it to create the repository on the local mirror server.
- Mirror server has temporary access to the Internet: Temporarily configure a server to have Internet access, download a copy of the ODP Repository to this server using the reposync command, then reconfigure the server so that it is back behind the firewall.
Note
The first option (no Internet access) is probably the least effort and, in some respects, the most secure deployment option.
The second option (temporary Internet access) is better suited if you want to be able to update your Hadoop installation periodically from the Acceldata repositories.
Trusted proxy server: Proxying a repository involves setting up a standard HTTP proxy on a local server to forward repository access requests to the remote repository server and route responses back to the original requestor. Effectively, the proxy server makes the repository server accessible to all clients, by acting as an intermediary.
Once the proxy is configured, change the /etc/yum.conf file on every cluster node, so that when the client attempts to access the repository during installation, the request goes through the local proxy server instead of going directly to the remote repository server.
The following table lists some benefits provided by these alternative deployment strategies:
Advantages of Repository Mirroring | Advantages of creating a proxy |
---|---|
Minimizes network access to the remote repository and is therefore faster, more reliable, and more cost effective (reduced WAN bandwidth minimizes data center costs). Allows security-conscious data centers to qualify a fixed set of repository files; it also ensures that the remote server will not change these repository files. Large data centers may already have existing repository mirror servers for the purpose of OS upgrades and software maintenance, and you can easily add the ODP Repositories to these existing servers. | Avoids the need to manage a local copy of the repository files (including updates, upgrades, new versions, and bug fixes). Almost all data centers already have a set of well-known proxies; in such cases, you can simply add the local proxy server to the existing proxy configuration. This approach is easier than creating local mirror servers in data centers with no mirror server setup. The network access is the same as that required when using a mirror repository, but the source repository handles file management. |
However, each of the above approaches also has the following disadvantages:
- Mirrors have to be managed for updates, upgrades, new versions, and bug fixes.
- Proxy servers rely on the repository provider to not change the underlying files without notice.
- Caching proxies are necessary, because non-caching proxies do not decrease WAN traffic and do not speed up the install process.
In many datacenters, using a mirror for the ODP Repositories can be the best deployment strategy. The ODP Repositories are small and easily mirrored, allowing you secure control over the contents of the Hadoop packages accepted for use in your data center.
Note
The installer pulls many packages from the base OS repositories (repos). If you do not have a complete base OS available to all your machines at the time of installation, you may run into issues. If you encounter problems with base OS repos being unavailable, please contact your system administrator to arrange for these additional repos to be proxied or mirrored.
Complete the following instructions to set up a mirror server that has no access to the Internet:
- Check Your Prerequisites.
Select a mirror server host with the following characteristics:
- The server OS is CentOS (7), RHEL (7), or Ubuntu (18,20), and has several GB of storage available.
- This server and the cluster nodes are all running the same OS.
Note
To support repository mirroring for heterogeneous clusters requires a more complex procedure than the one documented here.
- The firewall lets all cluster nodes (the servers on which you want to install ODP) access this server.
- Install the Repos.
a. Use a workstation with access to the Internet and download the tarball image of the appropriate Acceldata yum repository.
Table 6.2. Acceldata Yum Repositories
Cluster OS | ODP Repository Tarballs |
---|---|
RHEL/CentOS 7 | wget [INSERT_URL] |
RHEL/CentOS 7 | wget [INSERT_URL] |
Ubuntu 18 | wget [INSERT_URL] wget [INSERT_URL] |
Ubuntu 20 | wget [INSERT_URL] wget [INSERT_URL] |
b. Create an HTTP server.
- On the mirror server, install an HTTP server (such as Apache httpd) using the instructions provided here.
- Activate this web server.
- Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server.
Note
If you are using EC2, make sure that SELinux is disabled.
c. On your mirror server, create a directory for your web server.
For example, from a shell window, type:
- For RHEL/CentOS 7:
mkdir -p /var/www/html/odp/
- For Ubuntu 18/20:
mkdir -p /var/www/html/odp/
If you are using a symlink, enable the FollowSymLinks option on your web server.
d. Copy the ODP Repository Tarball to the directory created in the previous step, and untar it.
e. Verify the configuration.
- The configuration is successful if you can access the above directory through your web browser.
To test this out, browse to the following location: http://$yourwebserver/odp/$os/ODP-3.2.2.0-1/.
You should see a directory listing for all the ODP components along with the RPMs at: $os/ODP-3.2.2.0-1.
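You can run the same check from a shell on any cluster node; the example below assumes centos7 as the $os value and only verifies that the directory listing is reachable:
curl -s http://$yourwebserver/odp/centos7/ODP-3.2.2.0-1/ | head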
Note
$os can be centos7, ubuntu18, or ubuntu20. Use the following options table for the $os parameter:
Table 6.3. ODP Component Options
Operating System | Value |
---|---|
RHEL 7 | centos7 |
CentOs 7 | centos7 |
Ubuntu 18 | ubuntu18 |
Ubuntu 20 | ubuntu20 |
f. Configure the yum clients on all the nodes in your cluster.
- Fetch the yum configuration file from your mirror server.
- Store the odp.repo file in a temporary location.
- Edit the odp.repo file, changing the value of the baseurl property to point to your local repositories based on your cluster OS.
where:
- $yourwebserver is the FQDN of your local mirror server.
- $os can be centos7, ubuntu18, or ubuntu20. Use the following options table for the $os parameter:
Table 6.4. Yum Client Options
Operating System | Value |
---|---|
RHEL 7 | centos7 |
CentOs 7 | centos7 |
Ubuntu 18 | ubuntu18 |
Ubuntu 20 | ubuntu20 |
- Use scp or pdsh to copy the client yum configuration file to the /etc/yum.repos.d/ directory on every node in the cluster (see the example below).
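For example, with hypothetical hostnames node1 through node3, and assuming you stored the file as /tmp/odp.repo, the copy might look like this (plain scp shown; pdsh users would supply their own host list):
for host in node1 node2 node3; do
  scp /tmp/odp.repo root@$host:/etc/yum.repos.d/odp.repo
done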
- [Conditional]: If you have multiple repositories configured in your environment, deploy the following plugin on all the nodes in your cluster.
- Install the plugin. For RHEL and CentOS:
yum install yum-plugin-priorities
- Edit the /etc/yum/pluginconf.d/priorities.conf file to add the following:
[main]
enabled=1
gpgcheck=0
Complete the following instructions to set up a mirror server that has temporary access to the Internet:
- Check Your Prerequisites.
Select a local mirror server host with the following characteristics:
- The server OS is CentOS (7), RHEL (7), or Ubuntu (18,20), and has several GB of storage available.
- The local mirror server and the cluster nodes must have the same OS. If they are not running CentOS or RHEL, the mirror server must not be a member of the Hadoop cluster.
Note
To support repository mirroring for heterogeneous clusters requires a more complex procedure than the one documented here.
- The firewall allows all cluster nodes (the servers on which you want to install ODP) to access this server.
- Ensure that the mirror server has yum installed.
- Add the yum-utils and createrepo packages on the mirror server:
yum install yum-utils createrepo
- Install the Repos.
- Temporarily reconfigure your firewall to allow Internet access from your mirror server host.
- Execute the following command to download the appropriate Acceldata yum client configuration file and save it in /etc/yum.repos.d/ directory on the mirror server host.
Table 6.5. Yum Client Configuration Commands
Cluster OS | Yum Client Configuration Command |
---|---|
RHEL/CentOS 7 | wget [INSERT_URL] |
Ubuntu 18 | wget [INSERT_URL] |
Ubuntu 20 | wget [INSERT_URL] |
- Create an HTTP server.
- On the mirror server, install an HTTP server (such as Apache httpd) using the instructions provided here.
- Activate this web server.
- Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server.
Note
If you are using EC2, make sure that SELinux is disabled.
Optional - If your mirror server uses SLES, modify the default-server.conf
file to enable the docs root folder listing.
sed -e 's/Options None/Options Indexes MultiViews/ig' /etc/apache2/default-server.conf > /tmp/tempfile.tmp
mv /tmp/tempfile.tmp /etc/apache2/default-server.conf
On your mirror server, create a directory for your web server.
- For example, from a shell window, type:
- For RHEL/CentOS 7:
mkdir -p /var/www/html/odp/
- For Ubuntu 18/20:
mkdir -p /var/www/html/odp/
- If you are using a symlink, enable the FollowSymLinks option on your web server.
- Copy the contents of the entire ODP repository for your desired OS from the remote repository to your local mirror server.
- Continuing the previous example, from a shell window, type:
- For RHEL/CentOS 7/Ubuntu 18/20:
cd /var/www/html/odp
Then for all hosts, type:
- ODP Repository
reposync -r ODP
reposync -r ODP-3.2.2.0-1
reposync -r ODP-UTILS-1.1.0.21
You should see both an ODP-3.2.2.0-1 directory and an ODP-UTILS-1.1.0.21 directory, each with several subdirectories.
- Generate appropriate metadata.
This step defines each directory as a yum repository. From a shell window, type:
- For RHEL/CentOS 7:
- ODP Repository:
createrepo /var/www/html/odp/ODP-3.2.2.0-1
createrepo /var/www/html/odp/ODP-UTILS-1.1.0.21
You should see a new folder called repodata inside both ODP directories.
- Verify the configuration.
- The configuration is successful if you can access the above directory through your web browser.
To test this out, browse to the following location:
- ODP: http://$yourwebserver/odp/ODP-3.2.2.0-1/
- You should now see a directory listing for all the ODP components.
- At this point, you can disable external Internet access for the mirror server, so that the mirror server is again entirely within your data center firewall.
- Depending on your cluster OS, configure the yum clients on all the nodes in your cluster.
- Edit the repo files, changing the value of the baseurl property to the local mirror URL.
- Edit the /etc/yum.repos.d/odp.repo file, changing the value of the baseurl property to point to your local repositories based on your cluster OS.
[ODP-3.x]
name=Open source Data Platform Version - ODP-3.x
baseurl=http://$yourwebserver/ODP/$os/3.x/GA
gpgcheck=1
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1
[ODP-UTILS-1.1.0.21]
name=Open source Data Platform Utils Version - ODP-UTILS-1.1.0.21
baseurl=http://$yourwebserver/ODP-UTILS-1.1.0.21/repos/$os
gpgcheck=1
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1
[ODP-3.2.2.0-1]
name=Open source Data Platform ODP-3.2.2.0-1
baseurl=http://$yourwebserver/ODP/$os/3.x/updates/3.2.2.0-1
gpgcheck=1
gpgkey=http://public-repo-1.acceldata.com/ODP/$os/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1
where:
- $yourwebserver is the FQDN of your local mirror server.
- $os can be centos7, ubuntu18, or ubuntu20. Use the following options table for the $os parameter:
Table 6.6. $OS Parameter Values
Operating System | Value |
---|---|
RHEL 7 | centos7 |
CentOs 7 | centos7 |
Ubuntu 18 | ubuntu18 |
Ubuntu 20 | ubuntu20 |
- Copy the yum/zypper client configuration file to all nodes in your cluster.
- RHEL/CentOS 7: Use scp or pdsh to copy the client yum configuration file to the /etc/yum.repos.d/ directory on every node in the cluster.
- For Ubuntu 18/20: On every node, invoke the following commands:
- ODP Repository:
sudo add-apt-repository deb [INSERT_URL]
- Optional - Ambari Repository:
sudo add-apt-repository deb [INSERT_URL]
- If using Ambari, verify the configuration by deploying an Ambari server on one of the cluster nodes:
yum install ambari-server
- If your cluster runs CentOS 7 or RHEL, and you have multiple repositories configured in your environment, deploy the following plugin on all the nodes in your cluster.
- Install the plugin. For RHEL and CentOS 7.x:
yum install yum-plugin-priorities
- Edit the /etc/yum/pluginconf.d/priorities.conf file to add the following:
[main]
enabled=1
gpgcheck=0
Complete the following instructions to set up a trusted proxy server:
- Check Your Prerequisites.
Select a mirror server host with the following characteristics:
- This server runs on either CentOS 7/RHEL or Ubuntu 18/20, and has several GB of storage available.
- The firewall allows all cluster nodes (the servers on which you want to install ODP) to access this server, and allows this server to access the Internet (at least those Internet servers for the repositories to be proxied).
- Install the Repos.
- Create a caching HTTP Proxy server on the selected host.
- It is beyond the scope of this document to show how to set up an HTTP proxy server, given the many variations that may be required, depending on your data center's network security policy. If you choose to use the Apache HTTPD server, start by installing httpd using the instructions provided here, and then add the mod_proxy and mod_cache modules, as stated here. Please engage your network security specialists to correctly set up the proxy server.
- Activate this proxy server and configure its cache storage location.
- Ensure that the firewall settings (if any) allow inbound HTTP access from your cluster nodes to your mirror server, and outbound access to the desired repo sites, including public-repo-1.acceldata.com.
If you are using EC2, make sure that SELinux is disabled.
- Depending on your cluster OS, configure the yum clients on all the nodes in your cluster.
The following description is taken from the CentOS documentation. On each cluster node, add the following lines to the /etc/yum.conf file. (As an example, the settings below enable yum to use the proxy server mycache.mydomain.com, connecting to port 3128, with the credentials yum-user/qwerty.)
# proxy server:port number
proxy=http://mycache.mydomain.com:3128
# account details for secure yum proxy connections
proxy_username=yum-user
proxy_password=qwerty
- Once all nodes have their /etc/yum.conf file updated with appropriate configuration info, you can proceed with the ODP installation just as though the nodes had direct access to the Internet repositories.
- If this proxy configuration does not seem to work, try adding a / at the end of the proxy URL. For example:
proxy=http://mycache.mydomain.com:3128/