ICP4 - gracesyl/big-data-hadoop GitHub Wiki

Hadoop Hive - ICP4 lesson module 4

In this lesson we learned why Hive is used for analysis problems, how to query data with its query language, and how to split a dataset and join the parts. We also learned about data cleansing.

Hive is a data warehousing system that stores structured data on the Hadoop file system and provides an easy way to query that data by executing Hadoop MapReduce plans. In this exercise we learn the basics of HiveQL.

1. Create Hive tables and perform queries for the use case based on petrol data. See the slides for details:

By using the hive prompt we have loaded the dataset and performed the following queries:

The description of the created table petrol1.

The constraint for this query is that a difference between volumeIN and volumeOUT greater than 500 is illegal in real life. As we see, all distributors receive petrol in each subsequent cycle. List all distributors who have this difference, along with the year and the difference they have in that year. Hint: (vol_IN-vol_OUT)>500

This query returns no rows because no difference in the dataset is greater than 500.
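The filter from the hint can be sketched as follows. This is only an illustration of the query logic: SQLite (through Python's `sqlite3`) stands in for the Hive prompt, the column names `distributer_name` and `year` are assumptions (only `vol_in` and `vol_out` come from the hint), and the three rows are made-up sample data, so one row matches here even though the real petrol1 data produced none.

```python
import sqlite3

# SQLite is a stand-in for Hive; the schema below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE petrol1 (distributer_name TEXT, vol_in INTEGER, "
    "vol_out INTEGER, year INTEGER)"
)

# Made-up sample rows, not taken from the lesson dataset.
conn.executemany(
    "INSERT INTO petrol1 VALUES (?, ?, ?, ?)",
    [
        ("dist_a", 900, 600, 2001),   # difference 300
        ("dist_b", 1200, 500, 2002),  # difference 700
        ("dist_c", 800, 790, 2003),   # difference 10
    ],
)

# Distributor, year and difference for every row that breaks the constraint.
query = """
    SELECT distributer_name, year, (vol_in - vol_out) AS difference
    FROM petrol1
    WHERE (vol_in - vol_out) > 500
"""
illegal_rows = conn.execute(query).fetchall()
print(illegal_rows)
```

With the sample rows above, only `dist_b` exceeds the limit. An empty result, as in the lesson, simply means that no row satisfies the `WHERE` condition.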

2. Create Hive tables and perform queries for the use case based on Olympics data. See the slides for details, as follows:

5) Try one yourself: Which country got medals for Shooting, with a year-wise classification?
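One possible shape for this query is a filter on the sport followed by a grouping on country and year. The sketch below assumes a hypothetical `olympics` table with columns `athlete`, `country`, `year`, `sport` and `total` (medals per athlete); neither the schema nor the rows come from the lesson, and SQLite is used in place of Hive only so the logic can be run locally.

```python
import sqlite3

# SQLite is a stand-in for Hive; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE olympics (athlete TEXT, country TEXT, year INTEGER, "
    "sport TEXT, total INTEGER)"
)

# Made-up sample rows, not taken from the lesson dataset.
conn.executemany(
    "INSERT INTO olympics VALUES (?, ?, ?, ?, ?)",
    [
        ("athlete_1", "country_x", 2008, "Shooting", 1),
        ("athlete_2", "country_x", 2008, "Shooting", 2),
        ("athlete_3", "country_y", 2012, "Shooting", 1),
        ("athlete_4", "country_y", 2012, "Swimming", 3),
    ],
)

# Keep only Shooting rows, then total the medals per country and year.
query = """
    SELECT country, year, SUM(total) AS medals
    FROM olympics
    WHERE sport = 'Shooting'
    GROUP BY country, year
    ORDER BY year, country
"""
shooting_medals = conn.execute(query).fetchall()
print(shooting_medals)
```

Grouping by both `country` and `year` gives the year-wise classification; the Swimming row is dropped by the `WHERE` clause before the medals are summed.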

Bonus: We are asked to split the dataset and use a join with a where condition.

Selecting the distinct records in the petrol1 dataset returns 401 records, which equals count(*) in petrol1.

Therefore each record is unique and there are no duplicate records to delete.
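The duplicate check described above compares the total row count with the count of distinct rows. A minimal sketch of that comparison, again using SQLite as a stand-in for Hive, with an assumed schema, made-up rows and a hypothetical helper `row_counts`:

```python
import sqlite3

# SQLite is a stand-in for Hive; the schema below is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE petrol1 (distributer_name TEXT, vol_in INTEGER, "
    "vol_out INTEGER, year INTEGER)"
)

# Made-up sample rows, all different from each other.
conn.executemany(
    "INSERT INTO petrol1 VALUES (?, ?, ?, ?)",
    [
        ("dist_a", 900, 600, 2001),
        ("dist_b", 1200, 500, 2002),
        ("dist_c", 800, 790, 2003),
    ],
)


def row_counts(connection):
    """Return (total rows, distinct rows) for petrol1."""
    total = connection.execute("SELECT COUNT(*) FROM petrol1").fetchone()[0]
    distinct = connection.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM petrol1) t"
    ).fetchone()[0]
    return total, distinct


counts_unique = row_counts(conn)

# Add an exact copy of the first row to show what a duplicate looks like.
conn.execute("INSERT INTO petrol1 VALUES ('dist_a', 900, 600, 2001)")
counts_with_duplicate = row_counts(conn)

print(counts_unique, counts_with_duplicate)
```

When both numbers match, as with the 401 records in the lesson, there is nothing to clean. A smaller distinct count signals duplicate rows.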

https://github.com/gracesyl/big-data-hadoop/blob/master/icp4/documentation/Capture4.PNG

We split the records of the petrol1 dataset into two separate tables and joined them using the where clause.
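The split-and-join step might look like the sketch below. How the lesson actually split petrol1 is not recorded here, so this assumes a column-wise split on a hypothetical key `distributer_id` into two hypothetical tables, `petrol_a` and `petrol_b`. The rows are made up, and SQLite replaces Hive for the sake of a runnable example.

```python
import sqlite3

# SQLite is a stand-in for Hive; schema, key and table names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE petrol1 (distributer_id TEXT, distributer_name TEXT, "
    "vol_in INTEGER, vol_out INTEGER, year INTEGER)"
)

# Made-up sample rows with a unique distributer_id each.
conn.executemany(
    "INSERT INTO petrol1 VALUES (?, ?, ?, ?, ?)",
    [
        ("D1", "dist_a", 900, 600, 2001),
        ("D2", "dist_b", 1200, 500, 2002),
        ("D3", "dist_c", 800, 790, 2003),
    ],
)

# Split petrol1 into two tables that share the key column.
conn.execute(
    "CREATE TABLE petrol_a AS "
    "SELECT distributer_id, distributer_name, year FROM petrol1"
)
conn.execute(
    "CREATE TABLE petrol_b AS "
    "SELECT distributer_id, vol_in, vol_out FROM petrol1"
)

# Join the two halves in the WHERE clause and narrow to one distributor.
query = """
    SELECT a.distributer_id, a.distributer_name, a.year, b.vol_in, b.vol_out
    FROM petrol_a a, petrol_b b
    WHERE a.distributer_id = b.distributer_id
      AND a.distributer_id = 'D2'
"""
joined = conn.execute(query).fetchall()
print(joined)
```

The first `WHERE` condition plays the role of the join condition, and the second restricts the result to a single record.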

The output for the bonus is as follows; the join query fetches a single record: