Hadoop, Hive, HBase, Sqoop, Oozie - kdwivedi1985/system-design GitHub Wiki
What is Hadoop?
- Apache Hadoop is an open-source framework for storing and processing large datasets across distributed clusters. It can scale from a single computer to thousands of clustered machines, with each machine offering local computation and storage. It supports datasets ranging from gigabytes to petabytes. [We can say Hadoop is a big data platform with both storage and computation capabilities.]
What are the components/modules of Hadoop?
- Hadoop Distributed File System (HDFS): HDFS is a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage. It is one of the primary components of Hadoop; by keeping computation close to the data it reduces network latency and provides high-throughput access to application data.
- Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications. It performs scheduling and resource allocation across the Hadoop system.
- MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model, subsets of a larger dataset, along with instructions for processing them, are dispatched to multiple nodes, and each subset is processed by a node in parallel with the other processing jobs. After processing, the individual results are combined into a smaller, more manageable dataset (see the WordCount sketch after this list).
- Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by the other Hadoop modules.
- Beyond HDFS, YARN, and MapReduce, there are other tools and applications that help collect, store, process, analyze, and manage big data, including Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
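- To make the MapReduce model concrete, here is a minimal WordCount sketch using the standard Hadoop MapReduce Java API (essentially the classic example from the Hadoop documentation); the input and output paths are placeholders.

```java
// Minimal WordCount: mappers run in parallel on the nodes holding input splits,
// the reducer combines their partial counts into the final dataset.
// Run with: hadoop jar wordcount.jar WordCount /input /output
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each node processes its local split of the input, emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: partial results are combined into a smaller dataset of (word, totalCount).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```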
How does Hadoop work?
- Hadoop allows distribution of datasets across a cluster. Processing is performed in parallel on multiple servers simultaneously.
- Client applications input data into Hadoop. HDFS handles metadata and the distributed file system. MapReduce then processes and converts the data. Finally, YARN divides the jobs across the computing cluster.
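- As a small client-side illustration, here is a hedged sketch of a Java program writing a file into HDFS and reading it back through the FileSystem API; the NameNode URI and file path are placeholder values.

```java
// Sketch: a client streams data into HDFS; HDFS splits it into blocks and
// replicates them across DataNodes while the NameNode tracks the metadata.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/data/example.txt");             // placeholder path

      // Write a small file into HDFS.
      try (FSDataOutputStream out = fs.create(file, true)) {
        out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back and print it.
      try (FSDataInputStream in = fs.open(file)) {
        IOUtils.copyBytes(in, System.out, 4096, false);
      }
    }
  }
}
```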
Hadoop Cluster Architecture
- A Hadoop cluster follows a master-slave architecture: the master node controls the other nodes.
- Requests are served by the primary (active) master node; a secondary/standby master node is set up for high availability.
- ZooKeeper promotes the standby to active if the current active node fails.
- Master Node: Handles metadata management, resource management, and parallel processing for MapReduce. It consists of the NameNode (HDFS) and the ResourceManager (YARN).
- NameNode is specifically responsible for managing the file system metadata within the HDFS.
- The metadata consists of file names, permissions, block IDs, block locations, the number of replicas, etc. (see the sketch after this list).
- ResourceManager: YARN is a separate resource management system that allocates computing resources across the entire cluster, allowing different applications to run on the same nodes without conflicting with each other.
- Master Node could refer to either the NameNode (for HDFS operations) or the YARN ResourceManager depending on the context.
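- A minimal sketch of what this metadata looks like from a client's point of view: asking the NameNode (through the FileSystem API) for a file's replication factor and block locations. The NameNode URI and file path are placeholders.

```java
// Sketch: the NameNode answers with per-block metadata -- offsets, lengths,
// and the DataNodes holding each block's replicas.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");   // placeholder NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));  // placeholder path
      System.out.println("Replication factor: " + status.getReplication());

      // Each BlockLocation describes one block and the hosts storing its replicas.
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        System.out.println("offset=" + block.getOffset()
            + " length=" + block.getLength()
            + " hosts=" + String.join(",", block.getHosts()));
      }
    }
  }
}
```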
- Slave Nodes: A slave node is a worker node/machine in the cluster that stores data and/or runs computation tasks. It receives instructions from master components such as the NameNode (HDFS) or the ResourceManager (YARN).
- The DataNode and the NodeManager are the key components of a slave node.
- DataNode:
- DataNode stores the actual HDFS data blocks.
- DataNode communicates with the NameNode: it sends heartbeats and handles replication and read/write requests.
- NodeManager: A YARN component that manages compute resources and containers on that node.
- NodeManager communicates with the ResourceManager and the ApplicationMaster.
- It launches and monitors containers for Map/Reduce or other applications.
- Reports resource usage and node health to ResourceManager.
YARN and its Architecture
- YARN (Yet Another Resource Negotiator) serves as the resource management layer of the Hadoop ecosystem. It handles resource management and job scheduling.
- YARN performs centralized resource management and job scheduling, and handles scalability and efficiency.
- Supports running different types of applications (e.g., MapReduce, Spark, Hive) simultaneously on the same cluster.
- Monitoring & Coordination: Keeps track of running jobs, application progress, and node health.
- Components of YARN: It has two main components, the ResourceManager and the NodeManager.
- Resource Manager: The ResourceManager (RM) is the central authority; it includes two key internal components:
- Application Manager: Works with the client, the Scheduler, the NodeManager, and the ApplicationMaster.
- Receives job submissions from clients.
- Maintains a registry of running applications.
- Works with the scheduler to allocate the first container for each new job — this container runs the ApplicationMaster.
- Coordinates the launch of the ApplicationMaster on a NodeManager.
- Handles application life cycle tracking and recovery in case of failure.
- Scheduler: Allocates containers and resources and works with Application Manager, Node Manager and Application Master.
- Allocates resources (CPU, memory) as containers across applications.
- Makes decisions based on scheduling policies (e.g., Capacity Scheduler, Fair Scheduler, FIFO).
- The Scheduler decides which application gets how much resource (like CPU and memory) on which node, based on the current availability and policies.
- Node Manager: The NodeManager is a worker agent that runs on every node; it is responsible for managing containers and monitoring resources.
- Two related components run on it: the ApplicationMaster and the Container.
- The NodeManager communicates with:
- the ResourceManager, to send heartbeats and receive container commands, and
- the ApplicationMaster, to launch tasks inside containers.
- Application Master:
- It is like a project manager for a specific application; it manages the lifecycle of a single job (a MapReduce job, Spark job, Hive query, etc.).
- It negotiates resources for its tasks with the ResourceManager.
- It monitors tasks by communicating with the NodeManagers to launch and monitor containers.
- Handles task re-execution on failure.
- Container:
- YARN containers are a little different from Docker containers. A YARN container is a logical bundle of physical resources (CPU, memory, etc.) assigned by YARN to run a task (a Map task, a Spark executor, etc.). YARN can be configured to launch its containers as Docker containers.
- It grants rights to an application to use a specific amount of resources (memory, CPU etc.) on a specific host.
How does the ResourceManager on the master know about resources when those resources live on the worker nodes?
- NodeManager on a worker node sends heartbeats to the ResourceManager and reports:
- Total memory & vCores available
- Containers running
- Node health
- The ResourceManager maintains a global view of all available resources across the cluster based on the above information.
- When an ApplicationMaster requests containers, the ResourceManager allocates them based on what’s available on the worker nodes.
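- A small sketch of reading that global view with the YarnClient API: it reports what each NodeManager has heartbeated to the ResourceManager. Cluster settings are assumed to come from the local yarn-site.xml.

```java
// Sketch: listing per-node capability, usage, container count, and health,
// i.e. the information the ResourceManager aggregates from NodeManager heartbeats.
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResourceView {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());   // picks up yarn-site.xml from the classpath
    yarnClient.start();

    // One report per live NodeManager.
    for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
      System.out.println(node.getNodeId()
          + " capability=" + node.getCapability()   // e.g. <memory:8192, vCores:8>
          + " used=" + node.getUsed()
          + " containers=" + node.getNumContainers()
          + " health=" + node.getHealthReport());
    }

    yarnClient.stop();
  }
}
```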
How are failures handled if a distributed node or a container crashes?
- If a container crashes, the ApplicationMaster retries the task in another container.
- If the ApplicationMaster fails: YARN supports automatic ApplicationMaster recovery. If configured, the ResourceManager restarts the ApplicationMaster on a different healthy node; if recovery is enabled, the new AM can recover the job's previous state and continue managing the application without restarting its tasks.
- If a node fails: the ResourceManager removes it from the list of healthy nodes, and all containers that were running on that node are marked as failed. The ApplicationMaster sees which containers were lost and reschedules those tasks on other nodes; if the AM itself was running on that node, the ResourceManager restarts it on a different healthy node.
How does the ApplicationMaster run containers on slaves other than its own?
- The ApplicationMaster doesn't launch containers itself. It requests containers from the ResourceManager (Scheduler), and the NodeManagers on the assigned nodes actually launch them. The ApplicationMaster contacts the ResourceManager, asking for containers with its resource requirements and placement preferences; the Scheduler then allocates containers on appropriate nodes.
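- A heavily simplified sketch of that flow using the AMRMClient/NMClient APIs. It only makes sense inside a running ApplicationMaster process, and the container size and launch command are illustrative placeholders.

```java
// Sketch: an ApplicationMaster registers with the RM, asks the Scheduler for a
// container, and then asks the NodeManager that owns the granted container to launch it.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class ApplicationMasterSketch {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // 1. Register this ApplicationMaster with the ResourceManager.
    AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, "");

    NMClient nmClient = NMClient.createNMClient();
    nmClient.init(conf);
    nmClient.start();

    // 2. Ask the Scheduler for one container (1 GB, 1 vCore) with no placement constraints.
    Resource capability = Resource.newInstance(1024, 1);
    rmClient.addContainerRequest(new ContainerRequest(capability, null, null, Priority.newInstance(0)));

    // 3. Poll allocate(); when a container is granted on some node, ask that
    //    node's NodeManager to launch the task command inside it.
    while (true) {
      for (Container container : rmClient.allocate(0.1f).getAllocatedContainers()) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList("echo hello-from-container"));  // placeholder command
        nmClient.startContainer(container, ctx);
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        return;
      }
      Thread.sleep(1000);
    }
  }
}
```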
Does Hadoop Support Auto-Scaling?
- No. Out of the box, Hadoop (including YARN) does not support native auto-scaling the way cloud-native platforms such as Kubernetes do. However, auto-scaling can be implemented in the cloud (for example, Amazon EMR provides built-in auto-scaling).
What is Hive?
- Hive is a data warehouse system/infrastructure built on top of Hadoop. It allows us to manage and query large datasets stored in distributed storage using SQL-like queries called HiveQL.
- In other words, we can say that Hive acts as an abstraction layer on top of HDFS, allowing users to interactively query and retrieve data through familiar SQL syntax.
- It translates SQL-like queries into MapReduce or Spark jobs, etc.
- Hive can be used for data summarization, batch processing, ETL and ad-hoc queries.
- Hive can store metadata pointing to data stored in HDFS (external tables), or it can manage the actual data in Hive tables as well (managed tables).
- Tables in Hive can point to data stored in various formats in HDFS: Text, Parquet, ORC (Optimized Row Columnar; the default for managed tables in newer Hive versions), etc.
- ORC is a column-oriented format and supports ACID transactions.
- Hive is used for OLAP.
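- A hedged sketch of using Hive through its JDBC/HiveServer2 interface from Java: it creates an external table over data already sitting in HDFS and runs a HiveQL aggregation. The HiveServer2 host, credentials, table name, and HDFS location are placeholders.

```java
// Sketch: HiveQL over JDBC. The external table stores only metadata in Hive;
// the underlying ORC files stay in HDFS, and the query is compiled into
// MapReduce/Spark jobs behind the scenes.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver2-host:10000/default", "hiveuser", "");  // placeholders
         Statement stmt = conn.createStatement()) {

      // External table over existing HDFS data.
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS page_views ("
          + " user_id STRING, url STRING, view_time TIMESTAMP)"
          + " STORED AS ORC"
          + " LOCATION '/data/page_views'");

      // Ad-hoc aggregation in HiveQL.
      try (ResultSet rs = stmt.executeQuery(
               "SELECT url, COUNT(*) AS views FROM page_views GROUP BY url")) {
        while (rs.next()) {
          System.out.println(rs.getString("url") + " -> " + rs.getLong("views"));
        }
      }
    }
  }
}
```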
What is HBase?
- HBase is an open-source, key-value based, column-oriented, NoSQL distributed database built on top of HDFS.
- Supports real-time reads/writes on large datasets.
- HBase can store billions of rows and millions of columns.
- Cassandra and HBase are similar but differ in architecture and consistency. HBase follows a master-slave (worker) architecture and provides strong consistency, while Cassandra uses a masterless, peer-to-peer architecture and is eventually consistent.
- HBase is good for read-heavy workloads.
- HBase is used for OLTP-style real-time workloads, while Hive is used for reporting (batch processing) and ad-hoc (on-the-fly) queries.
- HMaster (Master Node): Manages the cluster and maintains meta-data.
- RegionServers (Worker Nodes): These are the worker nodes that handle the actual read/write operations from clients. Each serves one or more regions, which are horizontal partitions of tables.
- Clients can talk directly to RegionServers.
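- A minimal sketch of real-time writes and reads with the HBase Java client. The ZooKeeper quorum, table name ('users'), and column family ('info') are placeholders, and the table is assumed to already exist.

```java
// Sketch: the client locates the RegionServer owning the row's region (via
// ZooKeeper/meta) and issues the Put/Get directly against it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zk-host");   // placeholder quorum

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write: row keyed by "user1001" with one column in family "info".
      Put put = new Put(Bytes.toBytes("user1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("user1001@example.com"));
      table.put(put);

      // Read the same row back.
      Result result = table.get(new Get(Bytes.toBytes("user1001")));
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println("email = " + Bytes.toString(email));
    }
  }
}
```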
What is Pig?
- Pig is a data flow language and execution framework developed by Yahoo that sits on top of Hadoop. It simplifies data processing and analysis.
- It uses a simple SQL-like scripting language called Pig Latin, which is translated into MapReduce jobs, so we don't have to write low-level Java code.
- Hive is suited for data warehousing and batch processing of structured data, while Pig is designed for complex data transformations and ETL (Extract, Transform, Load) processes on both structured and unstructured data. Both can be used together as well.
- Pig can serve as an alternative to Hive and Spark.
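- A hedged sketch of embedding a small Pig Latin script in Java via PigServer, running in local mode; the input file and field layout are illustrative placeholders.

```java
// Sketch: a Pig Latin data flow (load, group, aggregate) registered through
// PigServer instead of hand-written MapReduce code.
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigExample {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);   // use ExecType.MAPREDUCE on a cluster

    // Pig Latin statements; on a cluster these are translated into MapReduce jobs.
    pig.registerQuery("views = LOAD 'page_views.csv' USING PigStorage(',') "
        + "AS (user_id:chararray, url:chararray);");                     // placeholder input
    pig.registerQuery("by_url = GROUP views BY url;");
    pig.registerQuery("counts = FOREACH by_url GENERATE group AS url, COUNT(views) AS n;");

    // Pull the results back to the client and print them.
    Iterator<Tuple> it = pig.openIterator("counts");
    while (it.hasNext()) {
      System.out.println(it.next());
    }
    pig.shutdown();
  }
}
```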
What is SQOOP?
- Apache Sqoop (pronounced "scoop") stands for SQL-to-Hadoop.
- It is a data transfer tool used in the Hadoop ecosystem to efficiently import and export data between Hadoop and relational databases (RDBMS) such as MySQL, Oracle, PostgreSQL, and SQL Server.
- It automates the process of importing data from a database into HDFS, Hive, or HBase, and exporting data from Hadoop back to an RDBMS; it can't export to a NoSQL database.
- It is a batch-oriented, command-line tool with parallel transfer support. Sqoop doesn't run continuously on its own, but it can be set up as a recurring job (e.g., via cron, Oozie, or Airflow) to simulate continuous data transfer.
- It is used when we are building a data warehouse pipeline with Hive or HDFS.
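- A hedged sketch of a Sqoop import invoked programmatically through Sqoop 1's runTool entry point, using the same arguments as the `sqoop import` command line; the JDBC URL, credentials, table, and target directory are placeholders.

```java
// Sketch: import a relational table into HDFS with 4 parallel map tasks.
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://db-host:3306/sales",   // placeholder RDBMS
        "--username", "etl_user",
        "--password", "secret",                           // prefer --password-file in practice
        "--table", "orders",
        "--target-dir", "/data/sales/orders",             // placeholder HDFS directory
        "--num-mappers", "4"                              // parallel transfer with 4 map tasks
    };
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}
```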
What is Oozie?
- Apache Oozie is a workflow scheduler system specifically designed for managing Hadoop jobs and complex data pipelines.
- It coordinates and automates the execution of various Hadoop jobs: MapReduce, Hive queries, Pig scripts, Sqoop imports/exports, Spark jobs, etc.
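- A hedged sketch of submitting a workflow with the Oozie Java client; the Oozie URL, application path, and cluster addresses are placeholders, and a workflow.xml is assumed to already exist at the application path in HDFS.

```java
// Sketch: submit and start a workflow job, then poll its status.
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");   // placeholder URL

    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode-host:8020/apps/etl-workflow");
    conf.setProperty("nameNode", "hdfs://namenode-host:8020");              // placeholder
    conf.setProperty("jobTracker", "resourcemanager-host:8032");            // placeholder

    String jobId = oozie.run(conf);             // submit and start the workflow
    System.out.println("Submitted workflow " + jobId);

    WorkflowJob job = oozie.getJobInfo(jobId);  // check current status
    System.out.println("Status: " + job.getStatus());
  }
}
```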
What is Airflow?
- Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. It’s widely used for managing complex data pipelines across many environments, not just Hadoop.
EMR
- Amazon EMR (Elastic MapReduce) is a big data processing service that handles the compute infrastructure to run Apache Spark, Trino, Apache Flink, and Apache Hive in the AWS Cloud. It provides built-in auto-scaling, intelligent monitoring, and managed infrastructure.
- EMR can store data in S3 or Hive, or send it to Kafka, Kinesis, RDS, Redshift, etc.
- EMR can be deployed as serverless, on EC2, or on EKS.
- When running Hive on EMR, EC2 instances with EBS volumes can form an HDFS cluster, but S3 is typically used for data storage instead of HDFS.