JPS - zhamri/Hadoop GitHub Wiki

When you start a Hadoop cluster using commands like start-dfs.sh and start-yarn.sh, several Java processes are launched to run the different Hadoop components. You can list these processes, along with their PIDs, using the jps command (a tool that ships with the JDK). Here are the key Java processes you should typically see in a fully operational Hadoop environment:
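As a quick sketch of that startup sequence (assuming $HADOOP_HOME/sbin and $HADOOP_HOME/bin are on your PATH, HDFS has already been formatted, and you are on Hadoop 3.x for the mapred command):

```console
$ start-dfs.sh                         # NameNode, DataNode(s), SecondaryNameNode
$ start-yarn.sh                        # ResourceManager, NodeManager(s)
$ mapred --daemon start historyserver  # optional: MapReduce JobHistoryServer
$ jps                                  # list local Java processes by PID and class name
```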

  1. NameNode: This is the master daemon of the Hadoop Distributed File System (HDFS). It manages the filesystem metadata, such as the directory tree and the mapping of blocks to DataNodes. In a fully distributed setup, it runs on a dedicated master machine.

  2. DataNode: These processes store the actual data blocks in HDFS. In a fully distributed setup, each worker machine acting as a DataNode runs this process.

  3. SecondaryNameNode: This process performs housekeeping for the NameNode, such as periodically merging the namespace image (fsimage) with the edit log, a step known as checkpointing. Despite its name, it is not a backup or standby NameNode; that is a common misconception.

  4. ResourceManager (part of YARN): This process schedules and allocates compute resources (memory and CPU) across the cluster. In a full-fledged cluster, it typically runs on a separate machine.

  5. NodeManager (part of YARN): Each NodeManager launches containers and tracks resource usage on its machine. In a YARN cluster, each worker machine typically runs a NodeManager alongside its DataNode.

  6. JobHistoryServer (optional): This process serves information about completed MapReduce jobs. Although it is often grouped with the YARN daemons, it belongs to the MapReduce framework and is started separately (for example, with mapred --daemon start historyserver on Hadoop 3.x).
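On a pseudo-distributed node running all of the daemons above, jps output looks something like this (the PIDs shown are illustrative and will differ on your machine):

```console
$ jps
2129 NameNode
2263 DataNode
2440 SecondaryNameNode
2612 ResourceManager
2731 NodeManager
2899 JobHistoryServer
3015 Jps
```

Note that jps lists itself (Jps) as well; the left column is each JVM's process ID.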

Depending on the configuration and the version of Hadoop, you may see more or fewer services. For example, in a pseudo-distributed setup (where Hadoop is configured to run all daemons on a single machine as separate JVMs), all of these processes run on your local machine.
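A small script can check that the expected daemons are all present. This is a minimal sketch that runs against a canned sample of jps output (the jps_out variable and its PIDs are hypothetical); on a live node you would capture real output with jps_out=$(jps) instead:

```shell
#!/bin/sh
# Hypothetical jps output from a pseudo-distributed node.
# On a real cluster, capture it instead with: jps_out=$(jps)
jps_out='2129 NameNode
2263 DataNode
2440 SecondaryNameNode
2612 ResourceManager
2731 NodeManager
3015 Jps'

# Daemons required for a working HDFS + YARN stack
for proc in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  if printf '%s\n' "$jps_out" | grep -qw "$proc"; then
    echo "OK: $proc"
  else
    echo "MISSING: $proc"
  fi
done
```

A missing daemon usually points at its log file under $HADOOP_HOME/logs for the reason it failed to start.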

In a more complex, fully distributed setup, these processes are spread across multiple machines. Additionally, if you're running other Hadoop ecosystem components (like HBase, Hive, etc.), jps will show additional processes for those as well, for example HMaster and HRegionServer for HBase.