hadoop 4 components - unix1998/technical_notes GitHub Wiki

Hadoop is a framework for storing and processing big data, and it consists of several key components, including:

  • HDFS (Hadoop Distributed File System): This is the storage layer of Hadoop. It's a distributed file system designed to store large datasets reliably and efficiently across multiple machines. HDFS splits large files into fixed-size blocks (128 MB by default in recent versions) and replicates each block across several machines in the cluster for redundancy and fault tolerance.

  • MapReduce: This is the processing engine of Hadoop. It's a programming model and framework for processing large datasets in parallel across a cluster of computers. MapReduce jobs involve two phases:

    • Map: Processes data in parallel on individual nodes, transforming it into key-value pairs.
    • Reduce: Aggregates the results from the map phase based on the keys, producing the final output.

  • YARN (Yet Another Resource Negotiator): Introduced in Hadoop 2.x, YARN provides a dedicated resource management layer, separating cluster resource allocation from the processing engine itself. It allows multiple processing frameworks (like MapReduce and Spark) to share resources within the Hadoop cluster.
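The block-splitting and replication idea behind HDFS can be sketched with a toy in-memory model. The block size, replication factor, and node names below are illustrative stand-ins, scaled down from HDFS's real defaults (128 MB blocks, 3-way replication):

```python
# Toy model of how HDFS splits a file into fixed-size blocks and
# replicates each block across data nodes. Values are scaled down so
# the example runs on a small in-memory "file"; real HDFS defaults are
# 128 MB blocks and 3 replicas.

BLOCK_SIZE = 8                                  # bytes per block (toy value)
REPLICATION = 3                                 # copies of each block
DATANODES = ["node1", "node2", "node3", "node4"]  # hypothetical node names

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin;
    real HDFS placement is rack-aware)."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(len(blocks))
    }

data = b"hello hadoop distributed file system"
blocks = split_into_blocks(data)
placement = place_replicas(blocks, DATANODES)
print(len(blocks), placement[0])
```

Losing one node leaves every block recoverable from its other replicas, which is the fault-tolerance property the bullet above describes.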

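The two MapReduce phases can be sketched in plain Python using the classic word-count example. This is a single-process stand-in for illustration: real Hadoop runs the map tasks on many nodes and the framework shuffles the key-value pairs between the phases.

```python
from collections import defaultdict

def map_phase(line: str):
    """Map: emit a (word, 1) key-value pair for each word in the input."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values for a key into the final output."""
    return (key, sum(values))

lines = ["hadoop stores big data", "hadoop processes big data"]
pairs = [kv for line in lines for kv in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)
```

Each input line could be mapped on a different node; only the grouped values for each key need to meet at a single reducer.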
Hive: While not technically a core component of Hadoop, Hive is closely associated with it and often considered part of the Hadoop ecosystem. Here's how it fits in:

  • Hive: It's a data warehouse software layer built on top of Hadoop. Hive provides a familiar SQL-like interface for querying and analyzing data stored in HDFS. Users can write HiveQL queries to interact with the data without needing to write complex MapReduce jobs directly.

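The convenience Hive provides can be sketched by expressing one aggregation both ways: once by hand and once as a declarative query. Here sqlite3 merely stands in for Hive's SQL engine, and the table name and rows are made up for illustration; in Hive the table would be backed by files in HDFS and the query compiled into distributed jobs.

```python
import sqlite3
from collections import Counter

rows = [("alice", "click"), ("bob", "click"), ("alice", "view")]

# Hand-written aggregation (the kind of logic Hive saves you from
# writing as a custom MapReduce job):
manual = Counter(user for user, _ in rows)

# Declarative version, close to the HiveQL one would write:
#   SELECT user, COUNT(*) FROM events GROUP BY user;
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, action TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)", rows)
declarative = dict(con.execute(
    "SELECT user, COUNT(*) FROM events GROUP BY user"))

print(manual == Counter(declarative))
```

Both paths produce the same counts; the declarative one states *what* to compute and leaves the *how* to the engine.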
Here's a breakdown of their roles:

  • HDFS: Stores the data.
  • MapReduce: Processes the data in parallel across the cluster.
  • YARN: Manages resources for various processing frameworks (including MapReduce) running on the cluster.
  • Hive: Provides a SQL-like interface for querying data stored in HDFS.

In Conclusion:

These components work together to provide a powerful platform for managing and analyzing big data. HDFS stores the data, YARN allocates cluster resources, MapReduce handles the distributed processing, and Hive allows for convenient querying using SQL-like syntax. While Hive isn't a core component, it's a valuable tool within the broader Hadoop ecosystem.