Create new orka image (examples include Hadoop 2.7.1, Hue 3.9.0, Ecosystem on Hue 3.9.0 and latest Cloudera dist)

Instructions for creating new Hadoop distributions and making them available through orka.

The following instructions cover the creation of ~okeanos images from Apache distributions (Base Hadoop+Flume, Base Hadoop+Flume+Hue, Enriched Hadoop Ecosystem) and also the latest Cloudera distribution.

Hadoop Base image (General instructions and example for Hadoop-2.7.1)

Every instruction regarding image creation is executed as root.

Set up Debian 8.x repositories and update/upgrade

Create a VM in ~okeanos with a Debian 8.x (currently 8.3) image. If needed, change the mirrors in /etc/apt/sources.list, for example:

deb http://ftp.gr.debian.org/debian/ jessie main
deb-src http://ftp.gr.debian.org/debian/ jessie main

deb http://security.debian.org/ jessie/updates main
deb-src http://security.debian.org/ jessie/updates main

# jessie-updates, previously known as 'volatile'
deb http://ftp.gr.debian.org/debian/ jessie-updates main
deb-src http://ftp.gr.debian.org/debian/ jessie-updates main

and then

apt-get update
apt-get upgrade
apt-get install sudo

Install snf-image creator

Edit /etc/apt/sources.list and add the line:

deb http://apt.dev.grnet.gr jessie/

apt-get install curl
curl https://dev.grnet.gr/files/apt-grnetdev.pub | apt-key add -
apt-get update
apt-get install snf-image-creator

If asked for the “supermin appliance”, choose “Yes”.

apt-get install python-pip
pip install kamaki==0.13.5
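
For the kamaki commands used later (e.g. kamaki image list) to work, kamaki needs to know the cloud's authentication URL and token. A minimal configuration, using the same placeholders as in the image-creation command below:

kamaki config set cloud.default.url {{authentication url}}
kamaki config set cloud.default.token {{token}}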

Install Oracle Java

echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee /etc/apt/sources.list.d/webupd8team-java.list

echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list.d/webupd8team-java.list

apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
apt-get update
apt-get install oracle-java8-installer
apt-get install oracle-java8-set-default
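
A quick sanity check that the Oracle JDK is now the default:

java -version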

Disable ipv6

Edit /etc/sysctl.conf and add the following lines:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
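
To apply and verify the change without rebooting (the check should print 1):

sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6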

Configure ssh parameters

Edit /etc/ssh/ssh_config and uncomment or add the following lines:

StrictHostKeyChecking no
UserKnownHostsFile=/dev/null

Remove potentially problematic line from /etc/hosts

Edit /etc/hosts and remove the second line (e.g. 127.0.1.1 snf-123456). This is needed because the VM's hostname must resolve to the private IP that ~okeanos assigns to it for the Hadoop cluster.
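
Equivalently, the line can be removed non-interactively; a sketch assuming the offending entry starts with 127.0.1.1:

sed -i '/^127\.0\.1\.1/d' /etc/hosts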

Download and unzip Hadoop installation files

cd /usr/local
wget http://apache.forthnet.gr/hadoop/common/stable/hadoop-2.7.1.tar.gz
tar xvzf /usr/local/hadoop-2.7.1.tar.gz
rm /usr/local/hadoop-2.7.1.tar.gz

Install Flume

wget https://www.apache.org/dist/flume/stable/apache-flume-1.6.0-bin.tar.gz
tar xvzf apache-flume-1.6.0-bin.tar.gz
mv apache-flume-1.6.0-bin /usr/local/flume
rm apache-flume-1.6.0-bin.tar.gz

Setup Flume environment

export FLUME_CONF_DIR=/usr/local/flume/conf
cp $FLUME_CONF_DIR/flume-env.sh.template $FLUME_CONF_DIR/flume-env.sh
echo export JAVA_HOME=$JAVA_HOME >> $FLUME_CONF_DIR/flume-env.sh
echo export JAVA_OPTS=\"-Xms500m -Xmx2000m\" >> $FLUME_CONF_DIR/flume-env.sh
export FLUME_HOME=/usr/local/flume
mkdir -p $FLUME_HOME/plugins.d
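
To verify the installation, Flume can print its version banner (assuming JAVA_HOME is set in the current shell):

/usr/local/flume/bin/flume-ng version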

Prepare flume-agent daemon

mkdir -p /var/log/flume
mkdir -p /var/run/flume

Image creation

For ~okeanos image creation, the following command must be executed:

snf-mkimage --public --print-syspreps -f -u {{image_name}} -t {{token}} -a {{authentication url}} -r {{image_name}} /
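
For example, for the Hadoop-2.7.1 image and assuming the standard ~okeanos authentication URL, the invocation would look like:

snf-mkimage --public --print-syspreps -f -u Hadoop-2.7.1 -t {{token}} -a https://accounts.okeanos.grnet.gr/identity/v2.0 -r Hadoop-2.7.1 /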

Get the pithos uuid of a newly created image

kamaki image list | grep <new image name>

e.g. for Hadoop-2.7.1

kamaki image list | grep Hadoop-2.7.1

will return

<some_pithos_uuid> Hadoop-2.7.1
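
For scripting, the uuid can be captured directly; a small sketch using awk:

PITHOS_UUID=$(kamaki image list | grep 'Hadoop-2.7.1' | awk '{print $1}')
echo $PITHOS_UUID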

Additions in database (Personal Orka server)

After the image is created, uploaded to Pithos and registered with kamaki, one additional action is required before the image becomes usable.

Update database (Personal Orka server)

Insert the newly created image in the database. This SQL script file can be checked for examples of how a new image (Orka or VRE) is added. The mandatory database fields are image_name, image_pithos_uuid and image_category_id.

For the Hadoop-2.7.1 image we did the following:

sudo -u postgres psql
\c escience; 
INSERT INTO backend_orkaimage (id,image_name, image_pithos_uuid, image_components, image_category_id) VALUES (6,'Hadoop-2.7.1','<hadoop271_pithos_uuid>', '{"Debian":{"version":"8.0","help":"https://www.debian.org/"},"Hadoop":{"version":"2.7.1","help":"https://hadoop.apache.org/"},"Flume":{"version":"1.6","help":"https://flume.apache.org/"}}',2);
\q
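
To confirm the row was inserted correctly, a quick check from the shell:

sudo -u postgres psql -d escience -c "SELECT id, image_name, image_pithos_uuid FROM backend_orkaimage ORDER BY id;"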

Alternatively, from {{personal_orka_server_IP}}/admin, an administrator can log in and add the Hadoop image to the Orka Images table.

Hue 3.9.0 image creation on top of Hadoop-2.7.1

Create a ~okeanos VM with the Hadoop-2.7.1 image. (The Hue build steps below are based on http://gethue.com/how-to-build-hue-on-ubuntu-14-04-trusty/.)

Install Hue Dependencies

apt-get update
apt-get install ant gcc g++ libkrb5-dev libffi-dev libmysqlclient-dev libssl-dev libsasl2-dev libsasl2-modules-gssapi-mit libsqlite3-dev libtidy-0.99-0 libxml2-dev libxslt-dev make libldap2-dev maven python-dev python-setuptools libgmp3-dev
pip install --upgrade cffi
pip install cryptography

Download and install Hue

wget https://dl.dropboxusercontent.com/u/730827/hue/releases/3.9.0/hue-3.9.0.tgz
tar -xvzf hue-3.9.0.tgz
rm hue-3.9.0.tgz
cd hue-3.9.0
make install
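
If desired, the build can be smoke-tested before imaging by starting Hue's development server (stop it with Ctrl-C afterwards; the address and port below are an illustrative assumption):

/usr/local/hue/build/env/bin/hue runserver 0.0.0.0:8000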

Hue-3.9.0 is now installed in /usr/local/hue. The image can now be created in the same way as described above:

snf-mkimage --public --print-syspreps -f -u Hue-3.9.0 -t {{token}} -a {{authentication url}} -r Hue-3.9.0 /

Update database (Personal Orka server)

Insert the newly created image in the database. This SQL script file can be checked for examples of how a new image (Orka or VRE) is added. The mandatory database fields are image_name, image_pithos_uuid and image_category_id.

For the Hue-3.9.0 image:

sudo -u postgres psql
\c escience;
INSERT INTO backend_orkaimage (id, image_name, image_pithos_uuid, image_components, image_category_id) VALUES (7, 'Hue-3.9.0', '<hue390_pithos_uuid>', '{"Debian":{"version":"8.2","help":"https://www.debian.org/"},"Hadoop":{"version":"2.7.1","help":"https://hadoop.apache.org/"},"Flume":{"version":"1.6","help":"https://flume.apache.org/"},"Hue":{"version":"3.9.0","help":"http://gethue.com/"}}',3);
\q

Alternatively, from {{personal_orka_server_IP}}/admin, an administrator can log in and add the Hue image to the Orka Images table.

Ecosystem image creation on top of Hue-3.9.0

Create ~okeanos VM with Hue-3.9.0 image.

Install Pig

wget http://mirrors.myaegean.gr/apache/pig/latest/pig-0.15.0.tar.gz
tar -zxvf pig-0.15.0.tar.gz
mv pig-0.15.0/ /usr/local/pig
rm pig-0.15.0.tar.gz

Install Oozie

apt-get install zip
wget http://mirrors.myaegean.gr/apache/oozie/4.1.0/oozie-4.1.0.tar.gz
tar -xvzf oozie-4.1.0.tar.gz
cd oozie-4.1.0
mvn clean package assembly:single -P hadoop-2 -DskipTests

mkdir Oozie
cp -R distro/target/oozie-4.1.0-distro/oozie-4.1.0/ Oozie/
cd Oozie/oozie-4.1.0
mkdir libext
cp -R ../../hadooplibs/hadoop-2/target/hadooplibs/hadooplib-2.3.0.oozie-4.1.0/* libext/
cd libext/
wget http://dev.sencha.com/deploy/ext-2.2.zip
cd ../../
mv oozie-4.1.0/ /usr/local/oozie
cd /usr/local/oozie/bin
./oozie-setup.sh prepare-war
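
Depending on the deployment, Oozie's database may also be initialised at this point with Oozie's bundled tool (orka's cluster-setup scripts may instead handle this at cluster creation); a sketch, run from the same directory:

./ooziedb.sh create -sqlfile oozie.sql -run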

Install HBase

wget http://apache.tsl.gr/hbase/stable/hbase-1.1.2-bin.tar.gz
tar -xvzf hbase-1.1.2-bin.tar.gz
mv hbase-1.1.2/ /usr/local/hbase
rm hbase-1.1.2-bin.tar.gz

Install Spark

wget http://apache.forthnet.gr/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
tar xvzf spark-1.5.0-bin-hadoop2.6.tgz
mv spark-1.5.0-bin-hadoop2.6/ /usr/local/spark
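
A quick check that Spark unpacked correctly:

/usr/local/spark/bin/spark-submit --version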

Install Sbt

echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
apt-get install apt-transport-https
apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
apt-get update
apt-get install sbt 

Install Spark Job server

cd /usr/local/spark
git clone https://github.com/spark-jobserver/spark-jobserver.git

Install Hive

apt-get install subversion
cd /usr/local/
svn co http://svn.apache.org/repos/asf/hive/trunk hive
cd hive
mvn clean package install -DskipTests -Phadoop-2,dist
cd conf/
cp hive-env.sh.template hive-env.sh

Edit hive-env.sh and add the following lines:

export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib
export HADOOP_HOME=/usr/local/hadoop

and then continue from /usr/local/hive:

cd packaging/target/apache-hive-1.2.0-SNAPSHOT-bin/apache-hive-1.2.0-SNAPSHOT-bin/lib
cp -r * /usr/local/hive/lib/
apt-get install libpostgresql-jdbc-java
ln -s /usr/share/java/postgresql-jdbc4.jar /usr/local/hive/lib/postgresql-jdbc4.jar
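
To verify the Hive build, the CLI can print its version; hive needs HADOOP_HOME, so exporting it in the current shell (matching hive-env.sh above) is assumed for this check:

export HADOOP_HOME=/usr/local/hadoop
/usr/local/hive/bin/hive --version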

Create the Ecosystem-on-Hue-3.9.0 image

snf-mkimage --public --print-syspreps -f -u Ecosystem-on-Hue-3.9.0 -t {{token}} -a {{authentication url}} -r Ecosystem-on-Hue-3.9.0 /

Update database (Personal Orka server)

Insert the newly created image in the database. This SQL script file can be checked for examples of how a new image (Orka or VRE) is added. The mandatory database fields are image_name, image_pithos_uuid and image_category_id.

For the Ecosystem-on-Hue-3.9.0 image:

sudo -u postgres psql
\c escience;
INSERT INTO backend_orkaimage (id,image_name, image_pithos_uuid, image_components, image_category_id) VALUES (8, 'Ecosystem-on-Hue-3.9.0', '<ecosystem390_pithos_uuid>', '{"Debian":{"version":"8.2","help":"https://www.debian.org/"},"Hadoop":{"version":"2.7.1","help":"https://hadoop.apache.org/"},"Flume":{"version":"1.6","help":"https://flume.apache.org/"},"Hue":{"version":"3.9.0","help":"http://gethue.com/"},"Pig":{"version":"0.15.0","help":"http://pig.apache.org/"},"Hive":{"version":"1.2.0","help":"http://hive.apache.org/"},"Hbase":{"version":"1.1.2","help":"http://hbase.apache.org/"},"Oozie":{"version":"4.1.0","help":"http://oozie.apache.org/"},"Spark":{"version":"1.5.0","help":"http://spark.apache.org/"}}',4);
\q

Alternatively, from {{personal_orka_server_IP}}/admin, an administrator can log in and add the Ecosystem image to the Orka Images table.

Cloudera image (General instructions and example for Cloudera-CDH-5.4.7)

Every instruction regarding image creation is executed as root.

Set up Debian 7.8 repositories and update/upgrade

Create a VM in ~okeanos with a Debian 7.8 image. If needed, change the mirrors in /etc/apt/sources.list, for example:

deb http://ftp.gr.debian.org/debian wheezy main
deb-src http://ftp.gr.debian.org/debian wheezy main

deb http://security.debian.org/ wheezy/updates main
deb-src http://security.debian.org/ wheezy/updates main

# wheezy-updates, previously known as 'volatile'
deb http://ftp.debian.org/debian/ wheezy-updates main
deb-src http://ftp.debian.org/debian/ wheezy-updates main

and then

apt-get update
apt-get upgrade
apt-get install sudo

Install snf-image creator

apt-get install curl
curl https://dev.grnet.gr/files/apt-grnetdev.pub | apt-key add -

Edit /etc/apt/sources.list and add the line:

    deb http://apt.dev.grnet.gr wheezy/

then:

apt-get update
apt-get install snf-image-creator

if asked for “supermin appliance”, choose “Yes”

apt-get install python-pip
pip install kamaki==0.13.5

Install Oracle Java and set JAVA environment parameters

echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee /etc/apt/sources.list.d/webupd8team-java.list
echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu precise main" | tee -a /etc/apt/sources.list.d/webupd8team-java.list
apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
apt-get update
apt-get install oracle-java8-installer
apt-get install oracle-java8-set-default
update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/java-8-oracle/bin/java" 1
update-alternatives --config java

Install Postgres

apt-get install postgresql postgresql-client

Download CDH 5 "1-click Install" package

cd ~
wget http://archive.cloudera.com/cdh5/one-click-install/wheezy/amd64/cdh5-repository_1.0_all.deb

Install package

dpkg -i cdh5-repository_1.0_all.deb

Optionally Add a Repository Key

wget http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/archive.key -O archive.key
apt-key add archive.key

Install Cloudera latest distribution with YARN

apt-get update; apt-get install hadoop-yarn-resourcemanager
apt-get install hadoop-hdfs-namenode
apt-get install hadoop-hdfs-secondarynamenode
apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
apt-get install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
apt-get install hadoop-client

Install other components

Pig installation

apt-get install pig

Hue installation

apt-get install hue

Configuring the Hue Server to Store Data in PostgreSQL

Follow steps 4-17 of the Cloudera guide "Configuring and using PostgreSQL for Hue Server".

Oozie installation

(according to http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_hue_database1.html#concept_id1_wkj_zj_unique_4__section_o5w_qlj_zj_unique_3)

apt-get install oozie
apt-get install oozie-client

Database (PostgreSQL) configuration for Oozie:

su - postgres
psql
CREATE ROLE oozie LOGIN ENCRYPTED PASSWORD 'some_password' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
CREATE DATABASE "oozie" WITH OWNER = oozie ENCODING = 'UTF8' TABLESPACE = pg_default CONNECTION LIMIT = -1;
\q

Edit "/etc/postgresql/9.1/main/postgresql.conf"

listen_addresses property to '*'
standard_conforming_strings property is set to off  

Edit "/etc/postgresql/9.1/main/pg_hba.conf"

# IPv4 local connections:
host    oozie             oozie             0.0.0.0/0            md5    

Restart to reload the PostgreSQL configuration

/etc/init.d/postgresql restart

Enable the Oozie Web Console

wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip 
apt-get install unzip
unzip ext-2.2.zip -d /var/lib/oozie

Install Spark

apt-get install spark-core spark-master spark-worker spark-history-server spark-python

Install Hive

apt-get install hive hive-metastore hive-server2 hive-hbase

Install the PostgreSQL JDBC Driver on a Debian/Ubuntu system

(according to http://www.cloudera.com/documentation/enterprise/5-2-x/topics/cdh_ig_hive_metastore_configure.html)

apt-get install libpostgresql-jdbc-java
ln -s /usr/share/java/postgresql-jdbc4.jar /usr/lib/hive/lib/postgresql-jdbc4.jar 

Create the metastore database and user account

sudo -u postgres psql
postgres=# CREATE USER hiveuser WITH PASSWORD 'some_password';
postgres=# CREATE DATABASE metastore;
postgres=# \c metastore;
You are now connected to database 'metastore'.
metastore=# \i /usr/lib/hive/scripts/metastore/upgrade/postgres/hive-schema-0.12.0.postgres.sql
SET
SET
...

Grant permission for all metastore tables to user hiveuser

sudo -u postgres psql
\c metastore
metastore=# \pset tuples_only on
metastore=# \o /tmp/grant-privs
metastore=#   SELECT 'GRANT SELECT,INSERT,UPDATE,DELETE ON "'  || schemaname || '". "' ||tablename ||'" TO hiveuser ;'
metastore-#   FROM pg_tables
metastore-#   WHERE tableowner = CURRENT_USER and schemaname = 'public';
metastore=# \o
metastore=# \pset tuples_only off
metastore=# \i /tmp/grant-privs
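
After exiting psql (\q), the grants can be spot-checked from the shell via the standard information_schema views:

sudo -u postgres psql -d metastore -c "SELECT grantee, privilege_type, table_name FROM information_schema.role_table_grants WHERE grantee = 'hiveuser' LIMIT 5;"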

Install HBase, HBase Master, HBase Thrift Server, HBase REST

apt-get install hbase
apt-get install hbase-master
apt-get install hbase-thrift
apt-get install hbase-rest
apt-get install hbase-regionserver

Install Flume

apt-get install flume-ng flume-ng-agent flume-ng-doc

Create a temp directory to host data that will be streamed to HDFS

mkdir /usr/lib/flume-ng/tmp

Remove services from start-up

update-rc.d -f <service> remove

where <service> is one of:

flume-ng-agent, hive-metastore, hadoop-hdfs-datanode, hive-server2, hadoop-hdfs-namenode, hue, hadoop-hdfs-secondarynamenode, hadoop-mapreduce-historyserver, oozie, hadoop-yarn-nodemanager, hadoop-yarn-proxyserver, spark-history-server, hadoop-yarn-resourcemanager, spark-master, hbase-master, spark-worker, hbase-regionserver, hbase-rest, hbase-thrift

This is required because, if the Cloudera services are left to start at boot during cluster creation, the SSH connection from the staging/production server to the cluster's master VM will fail and the cluster will be destroyed.
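
Since the same command must be run once per service, a small loop over the list above saves typing:

for svc in flume-ng-agent hive-metastore hadoop-hdfs-datanode hive-server2 \
    hadoop-hdfs-namenode hue hadoop-hdfs-secondarynamenode \
    hadoop-mapreduce-historyserver oozie hadoop-yarn-nodemanager \
    hadoop-yarn-proxyserver spark-history-server hadoop-yarn-resourcemanager \
    spark-master hbase-master spark-worker hbase-regionserver hbase-rest hbase-thrift; do
    update-rc.d -f "$svc" remove
done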

Image creation

For image creation (Cloudera-CDH-5.4.7), the following command must be executed:

snf-mkimage --public --print-syspreps -f -u {{image_name}} -t {{token}} -a {{authentication url}} -r {{image_name}} /

Update database (Personal Orka server)

Insert the newly created image in the database. This SQL script file can be checked for examples of how a new image (Orka or VRE) is added. The mandatory database fields are image_name, image_pithos_uuid and image_category_id.

For the Cloudera-CDH-5.4.7 image:

sudo -u postgres psql
\c escience;
INSERT INTO backend_orkaimage (id,image_name, image_pithos_uuid, image_components, image_category_id) VALUES (9, 'Cloudera-CDH-5.4.7', '<cloudera547_pithos_uuid>','{"Debian":{"version":"7.8","help":"https://www.debian.org/"},"Hadoop":{"version":"2.6.0-cdh5.4.7","help":"https://hadoop.apache.org/"},"Flume":{"version":"1.5.0-cdh5.4.7","help":"https://flume.apache.org/"},"Hue":{"version":"3.7.0","help":"http://gethue.com/"},"Pig":{"version":"0.12.0-cdh5.4.7","help":"http://pig.apache.org/"},"Hive":{"version":"1.1.0+cdh5.4.7","help":"http://hive.apache.org/"},"Hbase":{"version":"1.0.0-cdh5.4.7","help":"http://hbase.apache.org/"},"Oozie":{"version":"4.1.0-cdh5.4.7","help":"http://oozie.apache.org/"},"Spark":{"version":"1.3.0","help":"http://spark.apache.org/"},"Cloudera":{"version":"5.4.7","help":"http://www.cloudera.com/content/cloudera/en/home.html"}}',5);
\q

Alternatively, from {{personal_orka_server_IP}}/admin, an administrator can log in and add the Cloudera image to the Orka Images table.
