1.x vs 2.x (sources: acadgild.com blog "10 big differences between hadoop1 and hadoop2"; https://wiki.apache.org/hadoop/TaskTracker) - prabhu914/Hadoop-Interview-Question GitHub Wiki

In 1.x, each TaskTracker has a fixed number of slots for Map and Reduce tasks. In 2.x, a container can run a Map task, a Reduce task, or any other application.

Hadoop 1

1.x supports only the MapReduce (MR) processing model. It does not support non-MR tools.

1.x scales to about 4000 nodes per cluster because of the load on the JobTracker.

Works on the concept of slots: a slot can run either a Map task or a Reduce task, and nothing else.
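To make the slot idea concrete, here is a minimal toy sketch in Python. It is not Hadoop code; the class `TaskTracker` and the method `try_launch` are hypothetical names used only for illustration. It shows why fixed, typed slots can waste capacity: a Map task is refused even while Reduce slots sit idle.

```python
from dataclasses import dataclass


@dataclass
class TaskTracker:
    """Toy model (hypothetical) of a 1.x TaskTracker with fixed, typed slots."""
    map_slots: int
    reduce_slots: int
    running_maps: int = 0
    running_reduces: int = 0

    def try_launch(self, task_type: str) -> bool:
        """Launch a task only if a free slot of the matching type exists."""
        if task_type == "map" and self.running_maps < self.map_slots:
            self.running_maps += 1
            return True
        if task_type == "reduce" and self.running_reduces < self.reduce_slots:
            self.running_reduces += 1
            return True
        return False


tracker = TaskTracker(map_slots=2, reduce_slots=2)
# Ask for three Map tasks on a tracker that has only two Map slots.
results = [tracker.try_launch("map") for _ in range(3)]
print(results)  # the third Map task is refused although both Reduce slots are idle
```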

A single NameNode manages the entire namespace.

1.x has a Single Point of Failure (SPOF) because of the single NameNode. If the NameNode fails, manual intervention is needed to recover.

Not great for graph processing applications. Iterative applications implemented using MR can be about 10x slower.

Hadoop 2

2.x supports MR as well as other distributed computing models such as Spark, Hama, Giraph, Message Passing Interface (MPI) and HBase coprocessors.

YARN (Yet Another Resource Negotiator) handles cluster resource management, while the actual processing is done by the different processing models running on top of it.

2.x has better scalability. *Scalable up to 10000 nodes per cluster.*

Works on the concept of containers. A container can run any generic task.
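As a rough illustration of how containers differ from slots, the sketch below models a node that accepts any kind of process as long as enough memory and CPU remain. The classes `Container` and `Node`, their fields, and the resource figures are assumptions made for this example, not YARN's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical container request: resources plus whatever should run in it."""
    memory_mb: int
    vcores: int
    command: str  # a map task, a reduce task, a Spark executor, ...


@dataclass
class Node:
    """Hypothetical worker node that tracks capacity by resources, not by task type."""
    memory_mb: int
    vcores: int
    containers: list = field(default_factory=list)

    def free(self):
        """Return (free memory in MB, free vcores)."""
        used_mem = sum(c.memory_mb for c in self.containers)
        used_cores = sum(c.vcores for c in self.containers)
        return self.memory_mb - used_mem, self.vcores - used_cores

    def try_launch(self, container: Container) -> bool:
        """Accept the container if it fits into the remaining resources."""
        free_mem, free_cores = self.free()
        if container.memory_mb <= free_mem and container.vcores <= free_cores:
            self.containers.append(container)
            return True
        return False


node = Node(memory_mb=8192, vcores=4)
launched = [
    node.try_launch(Container(2048, 1, "map task")),
    node.try_launch(Container(2048, 1, "reduce task")),
    node.try_launch(Container(2048, 1, "spark executor")),
    node.try_launch(Container(4096, 2, "spark executor")),  # does not fit any more
]
print(launched, node.free())
```

The only admission test is whether the resources fit, so the same node can host Map tasks, Reduce tasks and non-MR processes side by side.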

Multiple NameNode servers manage multiple namespaces.

2.x overcomes the SPOF with a standby NameNode, and it can be configured for automatic recovery in case of NameNode failure.

A program written against the Hadoop 1.x MR API requires additional files to execute on Hadoop 2.x.

A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.

Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, **it first looks for an empty slot on the same server** that hosts the DataNode containing the data, and if there is none, it looks for an empty slot on a machine in the same rack.
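The locality preference described above can be sketched as a small Python function. This is a simplified illustration, not the JobTracker's real scheduler: `pick_tracker`, `free_slots` and `rack_of` are hypothetical names, and the final "any other machine" fallback is an assumption added so the function always has an answer when slots exist.

```python
def pick_tracker(block_hosts, free_slots, rack_of):
    """Pick a host for a task, preferring data locality.

    block_hosts: hosts whose DataNode stores the task's input data
    free_slots:  dict of host -> number of empty slots
    rack_of:     dict of host -> rack name
    """
    # 1. Same server as the data.
    for host in block_hosts:
        if free_slots.get(host, 0) > 0:
            return host, "node-local"
    # 2. Another machine in the same rack as the data.
    data_racks = {rack_of[h] for h in block_hosts}
    for host, free in free_slots.items():
        if free > 0 and rack_of[host] in data_racks:
            return host, "rack-local"
    # 3. Assumed fallback: any machine with an empty slot.
    for host, free in free_slots.items():
        if free > 0:
            return host, "off-rack"
    return None, "no free slot"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}

local = pick_tracker(["n3"], {"n1": 0, "n2": 1, "n3": 2}, rack_of)
same_rack = pick_tracker(["n1"], {"n1": 0, "n2": 1, "n3": 2}, rack_of)
other_rack = pick_tracker(["n1"], {"n1": 0, "n2": 0, "n3": 2}, rack_of)
print(local, same_rack, other_rack)
```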

The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
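A minimal sketch of the bookkeeping that heartbeats make possible is shown below. The class `JobTrackerView`, its methods, and the expiry value are illustrative assumptions rather than Hadoop's implementation; timestamps are passed in explicitly so the example is deterministic.

```python
class JobTrackerView:
    """Hypothetical record of what a JobTracker learns from heartbeats."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds  # illustrative timeout, not a Hadoop default
        self.last_seen = {}   # tracker name -> time of last heartbeat
        self.free_slots = {}  # tracker name -> slots reported as available

    def on_heartbeat(self, tracker, free_slots, now):
        """Record that the tracker is alive and how many slots it has free."""
        self.last_seen[tracker] = now
        self.free_slots[tracker] = free_slots

    def live_trackers(self, now):
        """Trackers whose last heartbeat is recent enough."""
        return sorted(
            t for t, seen in self.last_seen.items()
            if now - seen <= self.expiry_seconds
        )

    def schedulable(self, now):
        """Live trackers that reported at least one free slot."""
        return [t for t in self.live_trackers(now) if self.free_slots[t] > 0]


jt = JobTrackerView(expiry_seconds=600)
jt.on_heartbeat("tt1", free_slots=2, now=0)
jt.on_heartbeat("tt2", free_slots=1, now=0)
jt.on_heartbeat("tt1", free_slots=0, now=500)  # tt1 is alive but now full

print(jt.live_trackers(now=700))  # tt2 has been silent too long
print(jt.schedulable(now=400))    # before tt1 reported being full
```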

YARN Components:

The JobTracker is replaced by the ResourceManager, which has 2 components.

ResourceManager: 1) Scheduler 2) ApplicationsManager

Scheduler: Deals with the scheduling of jobs. It is not involved in monitoring or tracking application status. The Scheduler performs its scheduling function based on the resource requirements of the applications.

ApplicationsManager: Monitors the jobs and tracks their status.

NodeManager: Present on slave nodes. It is responsible for launching and managing containers, monitoring memory, CPU and network usage, and reporting to the RM.

ApplicationMaster: Carries out the execution of the job associated with it. It coordinates the running tasks, monitors them, and aggregates and sends status to the client. It runs under a NodeManager on instructions from the ResourceManager. **One is spawned for every job and fired after completion.**

Think of it like an officer hired by the ResourceManager to execute a job and fired once the job completes.

YARN Child: Runs the Map and Reduce tasks and is responsible for sending updates and progress to the ApplicationMaster.

Job Initialization in 2.x:

As soon as the job Scheduler picks up a job, the ResourceManager contacts a NodeManager, which starts a new container and launches an ApplicationMaster for that job. The ApplicationMaster creates an object for task management.
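The initialization flow can be sketched as a toy Python model. None of this is YARN's real API: the class and method names, the "least loaded node" choice, and the `task_states` dictionary standing in for the task-management object are all assumptions made for illustration.

```python
class ApplicationMaster:
    """Hypothetical per-job master holding a task-management object."""

    def __init__(self, job_id, num_tasks):
        self.job_id = job_id
        # Stand-in for the object created for task management.
        self.task_states = {f"{job_id}_task_{i}": "PENDING" for i in range(num_tasks)}


class NodeManager:
    """Hypothetical NodeManager that starts containers on its node."""

    def __init__(self, name):
        self.name = name
        self.containers = []

    def start_container(self, process):
        self.containers.append(process)
        return process


class ResourceManager:
    """Hypothetical ResourceManager that launches one ApplicationMaster per job."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.app_masters = {}  # job_id -> (NodeManager, ApplicationMaster)

    def submit(self, job_id, num_tasks):
        # Assumed policy: use the NodeManager with the fewest containers.
        nm = min(self.node_managers, key=lambda n: len(n.containers))
        am = nm.start_container(ApplicationMaster(job_id, num_tasks))
        self.app_masters[job_id] = (nm, am)
        return nm, am

    def finish(self, job_id):
        # The ApplicationMaster is released once its job completes.
        nm, am = self.app_masters.pop(job_id)
        nm.containers.remove(am)


nms = [NodeManager("nm1"), NodeManager("nm2")]
rm = ResourceManager(nms)

nm_a, am_a = rm.submit("job_1", num_tasks=3)
nm_b, am_b = rm.submit("job_2", num_tasks=1)
print(nm_a.name, nm_b.name, sorted(am_a.task_states))

rm.finish("job_1")
print(len(nm_a.containers), list(rm.app_masters))
```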