Hadoop
What is Hadoop?
- Apache Hadoop is an open-source framework for storing and processing large datasets across distributed clusters. It can scale from a single computer to thousands of clustered machines, with each machine offering local computation and storage, and it supports datasets from gigabytes to petabytes. [In short, Hadoop is a big-data platform with both storage and computation capabilities.]
What are the components/modules of Hadoop?
- Hadoop Distributed File System (HDFS): HDFS is a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage. It is one of the primary components of Hadoop; by keeping computation close to the data, it reduces network traffic and provides high-throughput access to application data.
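The block-and-replica idea behind HDFS can be sketched in a few lines of Python. This is a toy model, not real HDFS code: the block size, node names, and round-robin placement policy are simplified assumptions (real HDFS defaults to 128 MB blocks and uses rack-aware placement).

```python
BLOCK_SIZE = 4     # real HDFS defaults to 128 MB; tiny here for illustration
REPLICATION = 3    # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (toy round-robin policy)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world!")
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), placement[0])  # 5 blocks; block 0 lives on dn1, dn2, dn3
```

Losing any single DataNode still leaves two copies of every block, which is the point of replication.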
- Yet Another Resource Negotiator (YARN): YARN is the resource-management platform responsible for managing compute resources in the cluster and using them to schedule users' applications. It performs scheduling and resource allocation across the Hadoop system.
- MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model, subsets of a larger dataset, together with instructions for processing them, are dispatched to multiple nodes, where each subset is processed in parallel with the others. Afterwards, the results from the individual subsets are combined into a smaller, more manageable result set.
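The map/shuffle/reduce flow can be illustrated with a word count in plain Python. This is a single-process sketch of the programming model, not Hadoop's Java API; in a real cluster each chunk would live on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit (word, 1) for every word in this node's subset of the data.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group values by key so each reducer sees all values for one key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result.
    return {word: sum(counts) for word, counts in groups.items()}

chunks = ["big data big", "data big cluster"]   # subsets held on two nodes
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```

Because the map calls are independent, they can run on different machines at once; only the shuffle requires moving data between nodes.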
- Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by the other Hadoop modules.
- Beyond HDFS, YARN, and MapReduce, there are other tools and applications that help collect, store, process, analyze, and manage big data. These include Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
How does Hadoop work?
- Hadoop distributes datasets across a cluster, and processing is performed in parallel on multiple servers simultaneously.
- Client applications input data into Hadoop. HDFS handles metadata and the distributed file system, YARN divides the jobs across the computing cluster, and MapReduce then processes and converts the data.
Hadoop Cluster Architecture
- A Hadoop cluster follows a master/slave architecture: the master node controls the other nodes.
- Requests are served by the primary (active) master node; a secondary master node is set up for high availability.
- ZooKeeper promotes the standby master to active if the current active node fails.
- Master Node: Handles metadata management, resource management, and parallel processing for MapReduce. It consists of the NameNode (HDFS) and the ResourceManager (YARN).
- NameNode is specifically responsible for managing the file system metadata within HDFS.
- Metadata consists of the file name, permissions, block IDs, block locations, number of replicas, etc.
- ResourceManager: The YARN component that allocates computing resources across the entire cluster, allowing different applications to run on the same nodes without conflicting with each other.
- "Master node" can refer to either the NameNode (for HDFS operations) or the YARN ResourceManager, depending on the context.
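The kind of per-file metadata the NameNode tracks can be sketched as a simple mapping. The field and node names below are illustrative assumptions; real HDFS metadata structures (the fsimage and edit log) are more involved.

```python
# Toy picture of NameNode metadata: path -> permissions, replication, blocks.
namenode_metadata = {
    "/logs/app.log": {
        "permission": "rw-r--r--",
        "replication": 3,
        "blocks": [
            {"block_id": "blk_1001", "locations": ["dn1", "dn2", "dn3"]},
            {"block_id": "blk_1002", "locations": ["dn2", "dn3", "dn4"]},
        ],
    }
}

def block_locations(path):
    """What a client asks the NameNode: where do this file's blocks live?"""
    return [b["locations"] for b in namenode_metadata[path]["blocks"]]

print(block_locations("/logs/app.log"))
```

Note that the NameNode holds only this metadata; the actual block contents stay on the DataNodes, and clients read them directly from there.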
- Slave Nodes: A slave node is a worker node/machine in the cluster that stores data and/or runs computation tasks. It receives instructions from the master components: the NameNode (HDFS) and the ResourceManager (YARN).
- DataNode and NodeManager are the key components of a slave node.
- DataNode:
- Stores the actual HDFS data blocks.
- Communicates with the NameNode: sends heartbeats and block reports, and handles replication and read/write requests.
- NodeManager: A YARN component that manages compute resources and containers on its node.
- NodeManager communicates with the ResourceManager and the ApplicationMaster.
- It launches and monitors containers for MapReduce or other applications.
- It reports resource usage and node health to the ResourceManager.
YARN and its Architecture
- YARN (Yet Another Resource Negotiator) serves as the resource-management layer of the Hadoop ecosystem. It handles resource management and job scheduling.
- YARN provides centralized resource management and job scheduling, which improves cluster scalability and efficiency.
- Supports running different types of applications (e.g., MapReduce, Spark, Hive) simultaneously on the same cluster.
- Monitoring & Coordination: Keeps track of running jobs, application progress, and node health.
- Components of YARN: It has two main components, the ResourceManager and the NodeManager.
- ResourceManager: The ResourceManager (RM) is the central authority. It includes two key internal components:
- Application Manager: Works with the client, the Scheduler, NodeManagers, and ApplicationMasters.
- Receives job submissions from clients.
- Maintains a registry of running applications.
- Works with the Scheduler to allocate the first container for each new job; this container runs the ApplicationMaster.
- Coordinates the launch of the ApplicationMaster on a NodeManager.
- Handles application life cycle tracking and recovery in case of failure.
- Scheduler: Allocates containers and resources, working with the Application Manager, NodeManagers, and ApplicationMasters.
- Allocates resources (CPU, memory) as containers across applications.
- Makes decisions based on scheduling policies (e.g., Capacity Scheduler, Fair Scheduler, FIFO).
- The Scheduler decides which application gets how much resource (like CPU and memory) on which node, based on the current availability and policies.
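The Scheduler's decision can be sketched as a capacity check over the nodes. This is a deliberately naive first-fit policy, not the real Capacity or Fair Scheduler; the node names and sizes are made up for illustration.

```python
# Toy first-fit allocation: grant a container request on the first node with
# enough free memory and vcores. Real YARN schedulers also consider queues,
# fairness, and data locality.
nodes = {
    "node1": {"free_mem_mb": 2048, "free_vcores": 2},
    "node2": {"free_mem_mb": 8192, "free_vcores": 8},
}

def allocate(request_mem_mb, request_vcores):
    for name, free in nodes.items():
        if free["free_mem_mb"] >= request_mem_mb and free["free_vcores"] >= request_vcores:
            free["free_mem_mb"] -= request_mem_mb
            free["free_vcores"] -= request_vcores
            return {"node": name, "mem_mb": request_mem_mb, "vcores": request_vcores}
    return None  # no capacity anywhere: the request waits in the queue

grant = allocate(4096, 4)
print(grant)  # lands on node2 (node1 is too small), and node2's free capacity shrinks
```

The key idea is that a grant both names a node and reserves capacity on it; the NodeManager on that node later enforces the limits.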
- NodeManager: A worker agent that runs on every node, responsible for managing containers and monitoring resources.
- Two other key YARN entities run on the worker nodes: the ApplicationMaster and containers.
- The NodeManager communicates with:
- the ResourceManager, sending heartbeats and receiving container commands, and
- the ApplicationMaster, which asks it to launch tasks inside containers.
- Application Master:
- Acts like a project manager for a specific application, managing the lifecycle of a single job (a MapReduce job, Spark application, Hive query, etc.).
- Negotiates resources for its tasks with the ResourceManager.
- Monitors tasks by communicating with NodeManagers to launch and monitor containers.
- Handles task re-execution on failure.
- Container:
- YARN containers are a little different from Docker containers. A YARN container is a logical bundle of physical resources (CPU, memory, etc.) assigned by YARN to run a task (a Map task, Spark executor, etc.). YARN can be configured to launch its containers as Docker containers.
- A container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host.
-
How does the ResourceManager on the master know about resources that are spread across the worker nodes?
- The NodeManager on each worker node sends heartbeats to the ResourceManager, reporting:
- Total memory & vCores available
- Containers running
- Node health
- The ResourceManager maintains a global view of all available resources across the cluster based on this information.
- When an ApplicationMaster requests containers, the ResourceManager allocates them based on what’s available on the worker nodes.
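The heartbeat flow can be sketched as NodeManagers pushing reports into a shared view that the ResourceManager aggregates. The field names and numbers are illustrative assumptions, not YARN's actual heartbeat protocol.

```python
# Toy sketch: NodeManagers report capacity via heartbeats; the ResourceManager
# aggregates the reports into a cluster-wide view it can schedule against.
cluster_view = {}   # the ResourceManager's picture of the cluster

def heartbeat(node, total_mem_mb, total_vcores, running_containers, healthy=True):
    """What a NodeManager reports on each heartbeat (simplified fields)."""
    cluster_view[node] = {
        "mem_mb": total_mem_mb,
        "vcores": total_vcores,
        "containers": running_containers,
        "healthy": healthy,
    }

heartbeat("node1", 8192, 8, 2)
heartbeat("node2", 16384, 16, 0)

# The RM's schedulable capacity is the sum over healthy nodes.
total_mem = sum(n["mem_mb"] for n in cluster_view.values() if n["healthy"])
print(total_mem)
```

If a node's heartbeats stop arriving, the ResourceManager eventually marks it unhealthy and drops it from this view, which is exactly how node failures surface.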
How are failures handled if a node or a container crashes?
- If a container crashes, the ApplicationMaster retries the task in another container.
- If the ApplicationMaster fails: YARN supports automatic ApplicationMaster recovery. If configured, the ResourceManager restarts the ApplicationMaster on a different healthy node, and the new AM can recover the job's previous state and continue managing the application without restarting its tasks.
- If a node fails: the ResourceManager removes it from the list of healthy nodes, and all containers that were running on that node are marked as failed. The ApplicationMaster sees which containers were lost and reschedules those tasks on other nodes.
How does the ApplicationMaster run containers on slaves other than its own?
- The ApplicationMaster doesn't launch containers itself. It requests containers from the ResourceManager (Scheduler), and the NodeManagers on the assigned nodes actually launch them. The ApplicationMaster contacts the ResourceManager asking for containers to run tasks, along with resource requirements and locality preferences; the Scheduler then allocates containers on appropriate nodes.
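That three-party handshake can be sketched as follows. The function names and the pairing logic are illustrative assumptions, not YARN's real RPC interfaces; the point is only the division of labor between RM, AM, and NM.

```python
# Toy sketch of the container-launch flow: the ApplicationMaster asks the
# ResourceManager for containers, and the NodeManager on each granted node
# performs the actual launch locally.

def resource_manager_allocate(requests, free_nodes):
    # RM/Scheduler side: pair each pending request with an available node.
    return [{"node": node, "task": req} for req, node in zip(requests, free_nodes)]

def node_manager_launch(grant):
    # NM side: the NodeManager on the granted node starts the container.
    return f"{grant['node']}: launched container for {grant['task']}"

# AM side: submit requests, then hand each grant to the right NodeManager.
grants = resource_manager_allocate(["map-0", "map-1"], ["node3", "node7"])
for g in grants:
    print(node_manager_launch(g))
```

Notice the ApplicationMaster never touches a remote machine directly; it only talks to the ResourceManager for grants and to NodeManagers for launches.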
Does Hadoop Support Auto-Scaling?
- No. Out of the box, Hadoop (including YARN) does not support native auto-scaling the way cloud-native platforms such as Kubernetes do. However, auto-scaling can be implemented on cloud deployments.