Hive Vs Impala - ignacio-alorre/Hive GitHub Wiki

Why Impala queries run faster than Hive queries:

  • Impala does not use MapReduce; it runs queries with its own long-running daemon processes. It sits directly on top of the Hadoop Distributed File System (HDFS) and uses HDFS only to store the data. For that reason it is often described simply as “SQL on HDFS”.

  • Hive runs on top of Hadoop, which includes both HDFS and MapReduce. Executing a Hive query on the MapReduce engine therefore launches a series of MapReduce jobs until the result is produced.

  • Because Impala does not have to translate a SQL query into another processing framework such as map/shuffle/reduce, it avoids the latency those stages impose, which makes Impala considerably faster than Hive on performance benchmarks.

  • Impala has daemons running on all the nodes, and they cache some of the data that is in HDFS, so they can return data quickly without going through a whole MapReduce job.

  • Running a MapReduce job carries a fixed overhead, so by bypassing MapReduce Impala avoids that latency and can deliver a substantial reduction in runtime.
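
The overhead argument above can be illustrated with a toy latency model. This is only a sketch: the constants `JOB_STARTUP_S` and `SCAN_S` and both function names are hypothetical, chosen for illustration, and are not measurements of Hive or Impala.

```python
# Toy latency model. Every number below is an illustrative assumption,
# not a measurement of Hive or Impala.

JOB_STARTUP_S = 15.0  # assumed fixed cost to launch one MapReduce job
SCAN_S = 4.0          # assumed time to read and process the data in one stage


def mapreduce_latency(num_jobs: int) -> float:
    """Each MapReduce job in the chain pays the startup cost before doing work."""
    return num_jobs * (JOB_STARTUP_S + SCAN_S)


def daemon_latency(num_stages: int) -> float:
    """Long-running daemons are already up, so only the work itself is paid for."""
    return num_stages * SCAN_S


if __name__ == "__main__":
    for stages in (1, 3):
        print(stages, mapreduce_latency(stages), daemon_latency(stages))
```

Under these assumed numbers the fixed startup cost dominates, and it grows with every additional job in the chain, which is the effect the bullets describe.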

Query Execution Procedure:

  • Whenever a user submits a query through any of the provided interfaces, it is accepted by one of the Impalads (Impala daemons) in the cluster. That Impalad acts as the coordinator for that particular query.

  • After receiving the query, the coordinator verifies that the query is valid, using the table schema from the Hive metastore. It then collects the locations of the data required to execute the query from the HDFS NameNode and sends this information to the other Impalads so that they can execute the query.
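
The coordinator steps above can be sketched as a small simulation. It does not use any Impala API: the dictionaries `METASTORE` and `NAMENODE`, the table `sales`, the host names, and the function `coordinate` are all hypothetical stand-ins for the Hive metastore, the HDFS NameNode, and the coordinating Impalad.

```python
from collections import defaultdict

# Hypothetical stand-ins for the Hive metastore and the HDFS NameNode.
METASTORE = {"sales": {"id", "region", "amount"}}
NAMENODE = {
    "sales": [("block-0", "node1"), ("block-1", "node2"), ("block-2", "node1")],
}


def coordinate(table, columns):
    """Validate a query against the schema, then assign blocks to the hosts holding them."""
    # Step 1: check the query against the table schema from the metastore.
    schema = METASTORE.get(table)
    if schema is None:
        raise ValueError(f"unknown table: {table}")
    missing = set(columns) - schema
    if missing:
        raise ValueError(f"unknown columns: {sorted(missing)}")

    # Step 2: look up where the data lives and group the blocks by host,
    # so each daemon is handed the blocks stored on its own node.
    plan = defaultdict(list)
    for block_id, host in NAMENODE[table]:
        plan[host].append(block_id)
    return dict(plan)


if __name__ == "__main__":
    print(coordinate("sales", ["id", "amount"]))
```

Grouping blocks by host reflects the idea that each daemon works on the data local to its node; a real coordinator distributes far richer plan information than this block list.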

Sources