hg5k - mliroz/hadoop_g5k GitHub Wiki

General Overview and Basic Usage

hg5k is a script that provides a command-line interface to manage a Hadoop cluster. You can list all the possible options by using its help command:

$ hg5k -h

In the following we show an example of a basic usage workflow of hg5k to execute the randomtextwriter job on a cluster of 4 nodes. More details are given in the following sections.

First, we need to reserve a set of nodes with the oarsub command.

$ oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2

Now, we are ready to create and use a Hadoop cluster. In the simplest case, a Hadoop cluster is created by just providing the set of machines that will comprise it. We can use the environment variable provided by oarsub. The option --version 2 indicates that we are using Hadoop 2.x.

$ hg5k --create $OAR_FILE_NODES --version 2

Next, Hadoop needs to be installed on the nodes of the cluster. We need to provide the Hadoop binaries as downloaded from the Apache Hadoop project webpage. Here we use a version already stored in several sites of Grid5000.

$ hg5k --bootstrap /home/mliroz/public/sw/hadoop/hadoop-2.6.0.tar.gz

Note that this is just an example and /home/mliroz/public/sw/hadoop/hadoop-2.6.0.tar.gz may not be present in all sites. It is recommended to download the desired version of Hadoop to the user space and reference it.

Once we have installed Hadoop, we just need to initialize it (which will configure Hadoop depending both on the parameters we specify and the characteristics of the computers of the cluster) and start the servers. We can do both steps in just one command.

$ hg5k --initialize --start

Now we are ready to execute jobs. We are going to execute the randomtextwriter job, which is included in the specified jar file.

$ hg5k --jarjob /home/mliroz/public/sw/hadoop/hadoop-mapreduce-examples-2.6.0.jar randomtextwriter /output

We may check that the job executed correctly by printing the contents of HDFS. The following command is a shortcut for this:

$ hg5k --state files

If everything went OK, there should be as many files in the /output directory as map tasks in our job (randomtextwriter is map-only, so its output files are named part-m-*).

When we are done, we should delete the cluster in order to remove all temporary files created during execution.

$ hg5k --delete

Advanced Usage

hg5k allows several clusters to be used at the same time, provided their sets of nodes are disjoint. In order to identify each cluster, it internally uses a different id and serializes the corresponding Python class in a different directory under /tmp. If nothing is indicated, the last used cluster is loaded. However, the user can instruct the script to use the cluster corresponding to a specific identifier with the option --id.
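
For instance, assuming a second cluster has previously been created and was assigned id 2 (the id here is just an illustration), we could query its state explicitly:

$ hg5k --id 2 --state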

Creation and Destruction

In the basic usage, just the set of nodes composing the cluster is passed to the script. Additional options can be used when creating a cluster:

  • --properties path_to_conf: It specifies a properties file (in INI format) that can be used to indicate the cluster's general properties, such as local and remote directory locations, ports, etc. The following example file contains the default values used by hg5k:
 [cluster]
 hadoop_base_dir = /tmp/hadoop
 hadoop_conf_dir = /tmp/hadoop/conf
 hadoop_logs_dir = /tmp/hadoop/logs
 hadoop_temp_dir = /tmp/hadoop/temp
 hdfs_port = 54310
 mapred_port = 54311
 
 [local]
 local_base_conf_dir = ./conf

Note: /tmp is used by default because it is accessible to every user and is usually stored in the biggest partition. In any case, the usage of NFS directories should be avoided, as the different nodes of the cluster may overwrite the same files.

  • --version hadoop_version: It specifies the version of Hadoop to be used. Depending on the major version number, a different Python class adapted to the corresponding distribution is used. By default it uses the class compliant with the old versions of Hadoop (0.* and 1.*).
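
As an illustration, both creation options can be combined in a single command; the properties file name below is just an assumption:

$ hg5k --create $OAR_FILE_NODES --version 2 --properties my_cluster.ini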

Cluster State

After creation, the state of the cluster can be obtained at any moment by using:

hg5k --state [state_param]

Detailed state information can be obtained by passing an additional parameter to the --state option:

  • general: Shows general information about the cluster. If no state parameter is indicated, this option is used.
  • files: Shows the hierarchy of the files stored in the filesystem.
  • dfs: Shows the file system state.
  • dfsblocks: Shows the information about the blocks in the filesystem.
  • mrjobs: Shows the state of the MapReduce jobs.

Bootstrap

hg5k can be passed a Hadoop binary distribution to be installed on all nodes in the directories specified in the properties file. Alternatively, it is also possible to use a previously created environment where a specific version of Hadoop is installed. In that case, the user should modify the properties file so that the installation directories point to the correct locations.

Initialization

To start working, a Hadoop cluster needs to be configured (master and slaves, topology, etc.) and the distributed file system (HDFS) formatted. To do so we use the option --initialize. More in detail, this option performs the following actions:

  • It copies the user-defined configuration specified in the local_base_conf_dir property of the configuration file to all nodes in the cluster. If the directory does not exist, nothing is copied. Note that some options may be overwritten, in particular those dependent on the machines reserved within the job, e.g., fs.defaultFS.
  • The slaves are taken from the nodes given in the --create option. The master is chosen among them. The topology is discovered automatically and the corresponding script created on the fly.
  • Some hardware-dependent properties are configured individually for each node following the best practices described in Hadoop: The Definitive Guide. If you add the option feeling_lucky to the --initialize command some additional configuration is performed, including the configuration of the map and reduce containers and the memory allocated to each process depending on the characteristics of the machine. In any case, keep in mind that no configuration is universally optimal and that in your case these parameters may not be the best ones.
  • The distributed file system is formatted.

The cluster can also be returned to the original state by executing the option --clean. This will remove all the files created during cluster initialization and usage, including logs, dfs files, etc.
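
For instance, to reset the cluster to its pre-initialization state:

$ hg5k --clean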

Start and Stop

Once the cluster is configured, the servers need to be started before doing anything else. hg5k provides the general option --start to launch all services (which depend on your version of Hadoop). Alternatively, you may manage which services to start by specifying them individually:

  • --start_hdfs starts the NameNode in the master machine and the DataNodes in the slaves.
  • --start_yarn starts the ResourceManager in the master machine and the NodeManagers in the slaves. This option only applies to Hadoop version 2.x.
  • --start_mr starts the JobTracker in the master machine and the TaskTrackers in the slaves. This option only applies to Hadoop versions 0.x and 1.x.

In the same way the cluster can be stopped by using the --stop option.
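
For example, on a Hadoop 2.x cluster the storage and resource management layers can be started individually, and the whole cluster stopped afterwards:

$ hg5k --start_hdfs
$ hg5k --start_yarn
$ hg5k --stop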

Hadoop Properties

Hadoop properties are stored in a set of .xml files in the configuration directory. These properties can be changed by hg5k by using the following command:

hg5k --changeconf [conf_file] prop1=value1 prop2=value2

A list of property names and the corresponding new values should be passed. hg5k will look for the specified properties in the configuration files and replace their values. If a property is not found in any of the configuration files, the script will choose mapred-site.xml, which is not always correct. If you want to force a property to be set in a specific file, you should indicate it explicitly. For instance, this command configures the property yarn.scheduler.maximum-allocation-mb in yarn-site.xml:

$ hg5k --changeconf yarn-site.xml yarn.scheduler.maximum-allocation-mb=2000

Properties can also be read with the following command:

hg5k --getconf prop1 prop2

Note that if the property is not specifically set, nothing will be returned. In this case, it should be assumed that the default value is used.
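
For instance, to read back the property set in the previous example:

$ hg5k --getconf yarn.scheduler.maximum-allocation-mb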

Execution

The basic usage of hg5k to execute jobs was described in the previous section. Additional properties can also be provided. The general form of the option is the following:

hg5k --jarjob program.jar job_options --lib_jars lib1.jar lib2.jar --node cluster_node

hg5k will copy the job jar and the list of libraries to the indicated node (the master if nothing is specified) and execute the job with the given parameters.
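
For instance, a job depending on two extra libraries could be launched as follows (the jar names, job options, and node name here are hypothetical):

$ hg5k --jarjob myjob.jar /input /output --lib_jars lib1.jar lib2.jar --node parapluie-9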

With hg5k you can also execute arbitrary Hadoop commands with the option --execute. For instance, if we want to remove the first of the partitions previously created with randomtextwriter, we can use the following command:

$ hg5k --execute "fs -rm /output/part-m-00000"

Note that the whole command should be written between quotes.

File Transfers

hg5k allows data to be easily and efficiently copied to and from the distributed filesystem of the cluster. The options to be used are respectively --putindfs and --getfromdfs, which take as parameters the source and destination directories (one of them in the DFS and the other in the local filesystem).

For instance, following the previous example of randomtextwriter, if you want to copy the generated directory back to your /tmp directory, use the following command:

$ hg5k --getfromdfs /output /tmp

Transfers of data to the cluster are performed efficiently by starting one thread on each of the nodes of the cluster and copying a subset of the files to that node.
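
For instance, to copy a local directory into the DFS as /input (the local path is hypothetical):

$ hg5k --putindfs /tmp/mydata /input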

Finally, the statistics of the jobs can then be retrieved with the following command:

hg5k --copyhistory local_path [job_id1 job_id2 ...]

It will find the statistics of the execution of the jobs with the ids specified in the option (all executed jobs if nothing is indicated) and copy them to a local directory.

For instance, let's copy the statistics of the previous example. The job id was printed when the job was executed; we can also retrieve it by listing the already executed jobs:

$ hg5k --state mrjobs
2015-12-01 19:50:28,785 INFO: hadoop_id = 1 -> HadoopV2Cluster([parapluie:4], running)

15/12/01 18:50:31 INFO client.RMProxy: Connecting to ResourceManager at parapluie-9.rennes.grid5000.fr/172.16.99.9:8032
Total jobs:1
                  JobId	     State	     StartTime	    UserName	       Queue	  Priority	 UsedContainers	 RsvdContainers	 UsedMem	 RsvdMem	 NeededMem	   AM info
 job_1448992678044_0001	 SUCCEEDED	 1448992911941	      mliroz	     default	    NORMAL	            N/A	            N/A	     N/A	     N/A	       N/A	http://parapluie-9.rennes.grid5000.fr:8088/proxy/application_1448992678044_0001/jobhistory/job/job_1448992678044_0001

The id of our job is job_1448992678044_0001. We can use it to retrieve its statistics:

$ hg5k --copyhistory . job_1448992678044_0001

This will create the directory hadoop_hist in our current path with the job stats.