# hg5k
## General Overview and Basic Usage
`hg5k` is a script that provides a command-line interface to manage a Hadoop cluster. You can list all the available options by using its help command:

$ hg5k -h

In the following we show an example of a basic usage workflow of `hg5k` to execute the `randomtextwriter` job in a cluster of 4 nodes. More details are given in the following section.
First, we need to reserve a set of nodes with the `oarsub` command.

$ oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2

Now we are ready to create and use a Hadoop cluster. In the simplest case, a Hadoop cluster is created by just providing the set of machines it will comprise. We can use the environment variable provided by `oarsub`. The option `--version 2` indicates that we are using Hadoop 2.x.

$ hg5k --create $OAR_FILE_NODES --version 2
Next, Hadoop needs to be installed on the nodes of the cluster. We need to provide the Hadoop binaries as downloaded from the Apache Hadoop project webpage. Here we use a version already stored in several Grid5000 sites.

$ hg5k --bootstrap /home/mliroz/public/sw/hadoop/hadoop-2.6.0.tar.gz

Note that this is just an example and /home/mliroz/public/sw/hadoop/hadoop-2.6.0.tar.gz may not be present in all sites. It is recommended to download the desired version of Hadoop to your user space and reference it from there.
Once Hadoop is installed, we just need to initialize it (which configures Hadoop according both to the parameters we specify and to the characteristics of the cluster's machines) and start the servers. We can do both steps with a single command.

$ hg5k --initialize --start

Now we are ready to execute jobs. We are going to run the `randomtextwriter` job, which is included in the specified jar file.

$ hg5k --jarjob /home/mliroz/public/sw/hadoop/hadoop-mapreduce-examples-2.6.0.jar randomtextwriter /output
We can check that the job executed correctly by printing the contents of HDFS. The following command is a shortcut for this:

$ hg5k --state files

If everything went OK, there should be as many files in the /output directory as map tasks in our MapReduce job (randomtextwriter is a map-only job).
When we are done, we should delete the cluster in order to remove all temporary files created during execution.
$ hg5k --delete
## Advanced Usage
`hg5k` allows several clusters to be used at the same time, provided their sets of nodes are disjoint. In order to identify each of the clusters, it internally uses a different id and serializes the corresponding Python class in a different directory under /tmp. If nothing is indicated, the last used cluster is the one that is loaded. However, the user can instruct the script to use the cluster corresponding to a specific identifier with the option `--id`.
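For instance, to query a specific cluster rather than the most recently used one, the identifier can be given explicitly (the value 1 below is just illustrative; the actual id is printed when each cluster is created):

$ hg5k --id 1 --state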
### Creation and Destruction
In the basic usage, just the set of nodes composing the cluster is passed to the script. Additional options can be used when creating a cluster:
- `--properties path_to_conf`: Specifies a properties file (in INI format) that can be used to indicate the cluster's general properties, such as local and remote directory locations, ports, etc. The following example file contains the default values used by `hg5k`:
[cluster]
hadoop_base_dir = /tmp/hadoop
hadoop_conf_dir = /tmp/hadoop/conf
hadoop_logs_dir = /tmp/hadoop/logs
hadoop_temp_dir = /tmp/hadoop/temp
hdfs_port = 54310
mapred_port = 54311
[local]
local_base_conf_dir = ./conf
Note: /tmp is used by default because it is accessible by every user and is usually located in the biggest partition. In any case, NFS directories should be avoided, as the different nodes of the cluster may overwrite the same files.
- `--version hadoop_version`: Specifies the version of Hadoop to be used. Depending on the major version number, a different Python class adapted to the corresponding distribution is used. By default, the class compliant with the old versions of Hadoop (0.x and 1.x) is used.
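Putting the creation options together, a sketch of a creation command (the file name my_cluster.ini is a placeholder for a user-provided properties file):

$ hg5k --create $OAR_FILE_NODES --version 2 --properties my_cluster.ini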
### Cluster State
After creation, the state of the cluster can be obtained at any moment by using:
hg5k --state [state_param]
Detailed state information can be obtained by passing an additional parameter to the `--state` option:
- `general`: Shows general information about the cluster. This is the default if no state parameter is indicated.
- `files`: Shows the hierarchy of the files stored in the filesystem.
- `dfs`: Shows the state of the file system.
- `dfsblocks`: Shows information about the blocks in the filesystem.
- `mrjobs`: Shows the state of the MapReduce jobs.
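For instance, using one of the parameters listed above to print a summary of the distributed file system:

$ hg5k --state dfs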
### Bootstrap
`hg5k` can be passed a Hadoop binary distribution to be installed on all nodes in the directories specified in the properties file. Alternatively, it is also possible to use a previously created environment where a specific version of Hadoop is already installed. In that case, the user should modify the properties file so that the installation directories point to the correct locations.
### Initialization
To start working, a Hadoop cluster needs to be configured (master and slaves, topology, etc.) and the distributed file system (HDFS) formatted. To do so we use the option `--initialize`. In more detail, this option performs the following actions:
- It copies the user-defined configuration specified in the `local_base_conf_dir` property of the configuration file to all nodes in the cluster. If the directory does not exist, nothing is done. Note that some options may be overwritten, in particular those dependent on the machines reserved within the job, e.g., `fs.defaultFS`.
- The slaves are taken from the nodes given in the `--create` option. The master is chosen among them. The topology is discovered automatically and the corresponding script is created on the fly.
- Some hardware-dependent properties are configured individually for each node, following the best practices described in *Hadoop: The Definitive Guide*. If you add the option `feeling_lucky` to the `--initialize` command (see the example after this list), some additional configuration is performed, including the configuration of the map and reduce containers and the memory allocated to each process depending on the characteristics of the machine. In any case, keep in mind that no configuration is universally optimal and these parameters may not be the best ones for your case.
- The distributed file system is formatted.
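A sketch of the latter, assuming `feeling_lucky` is passed as an argument to `--initialize` and combined with `--start` as in the basic workflow:

$ hg5k --initialize feeling_lucky --start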
The cluster can also be returned to its original state with the option `--clean`. This will remove all the files created during cluster initialization and usage, including logs, dfs files, etc.
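For instance, a direct invocation of the option just described:

$ hg5k --clean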
### Start and Stop
Once the cluster is configured, the servers need to be started before doing anything else. `hg5k` provides the general option `--start` to launch all services (which depend on your version of Hadoop). Note that you can also manage which services to start by specifying them individually:
- `--start_hdfs` starts the `NameNode` in the master machine and the `DataNodes` in the slaves.
- `--start_yarn` starts the `ResourceManager` in the master machine and the `NodeManagers` in the slaves. This option only applies to Hadoop version 2.x.
- `--start_mr` starts the `JobTracker` in the master machine and the `TaskTrackers` in the slaves. This option only applies to Hadoop versions 0.x and 1.x.
In the same way, the cluster can be stopped by using the `--stop` option.
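As an illustrative sketch for a Hadoop 2.x cluster (assuming, as with the other options, that several flags can be combined in a single invocation), the services could be started individually and the whole cluster stopped afterwards:

$ hg5k --start_hdfs --start_yarn
$ hg5k --stop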
### Hadoop Properties
Hadoop properties are stored in a set of `.xml` files in the configuration directory. These properties can be changed with `hg5k` by using the following command:
hg5k --changeconf [conf_file] prop1=value1 prop2=value2
A list of property names and the corresponding new values should be passed. `hg5k` will look for the specified properties in the configuration files and replace their values. If a property is not found in any of the configuration files, the script will choose `mapred-site.xml`. Note that this is not always correct. If you want to force the properties to be set in a specific file, you should indicate it explicitly. For instance, the following command configures the property `yarn.scheduler.maximum-allocation-mb` in `yarn-site.xml`:
$ hg5k --changeconf yarn-site.xml yarn.scheduler.maximum-allocation-mb=2000
Properties can also be read with the following command:
hg5k --getconf prop1 prop2
Note that if a property is not explicitly set, nothing will be returned; in that case, it should be assumed that the default value is used.
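For instance, to read back the property changed above together with a second one (`dfs.replication` is just a standard Hadoop property used here for illustration):

$ hg5k --getconf yarn.scheduler.maximum-allocation-mb dfs.replication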
### Execution
The basic usage of `hg5k` to execute jobs was described in the previous section. Additional options can also be provided. The general form of the option is the following:

hg5k --jarjob program.jar job_options --lib_jars lib1.jar lib2.jar --node cluster_node

`hg5k` will copy the job jar and the listed libraries to the indicated node (the master if nothing is specified) and execute the job with the given parameters.
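As a sketch under assumptions (the target node below is the master shown in the example output later on this page, and could be any node of the cluster), the randomtextwriter job could be submitted from a specific node as follows:

$ hg5k --jarjob /home/mliroz/public/sw/hadoop/hadoop-mapreduce-examples-2.6.0.jar randomtextwriter /output --node parapluie-9.rennes.grid5000.fr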
With `hg5k` you can also execute arbitrary Hadoop commands with the option `--execute`. For instance, if we want to remove the first of the partitions previously created with `randomtextwriter`, we can use the following command:

$ hg5k --execute "fs -rm /output/part-m-00000"

Note that the whole command should be enclosed in quotes.
### File Transfers
`hg5k` makes it easy to efficiently copy data to and from the distributed filesystem of the cluster. The options to be used are, respectively, `--putindfs` and `--getfromdfs`, which take as parameters the source and destination directories (one of them in the dfs and the other in the local filesystem).
For instance, following the previous `randomtextwriter` example, if you want to copy the generated directory back into your /tmp directory, you can use the following command:
$ hg5k --getfromdfs /output /tmp
Transfers of data to the cluster are performed efficiently by starting one thread on each of the nodes of the cluster and copying a subset of the files to that node.
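In the other direction, a hedged sketch (the local directory /tmp/mydata and the HDFS destination /input are placeholders chosen for illustration):

$ hg5k --putindfs /tmp/mydata /input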
Finally, the statistics of a job can then be retrieved with the following command:

hg5k --copyhistory local_path [job_id1 job_id2 ...]

It will find the execution statistics of the jobs with the ids specified in the option (all executed jobs if none is indicated) and copy them to a local directory.
For instance, let's copy the statistics of the previous example. The job id was printed when the job was executed, but we can also retrieve it by listing the already executed jobs.
$ hg5k --state mrjobs
2015-12-01 19:50:28,785 INFO: hadoop_id = 1 -> HadoopV2Cluster([parapluie:4], running)
15/12/01 18:50:31 INFO client.RMProxy: Connecting to ResourceManager at parapluie-9.rennes.grid5000.fr/172.16.99.9:8032
Total jobs:1
JobId State StartTime UserName Queue Priority UsedContainers RsvdContainers UsedMem RsvdMem NeededMem AM info
job_1448992678044_0001 SUCCEEDED 1448992911941 mliroz default NORMAL N/A N/A N/A N/A N/A http://parapluie-9.rennes.grid5000.fr:8088/proxy/application_1448992678044_0001/jobhistory/job/job_1448992678044_0001
The id of our job is `job_1448992678044_0001`. We can use it to retrieve its stats:

$ hg5k --copyhistory . job_1448992678044_0001

This will create the directory `hadoop_hist` in our current path with the job stats.