ICP 04 : Hive - acikgozmehmet/BigDataProgramming GitHub Wiki

ICP 04: Hadoop Dependent Query Based NoSQL Database Hive

Objectives

Hive is a data warehousing system to store structured data on Hadoop file system and provides an easy query these data by execution Hadoop MapReduce plans. Please find below the in class exercises;

  1. Create Hive Tables and Perform Queries for Use Case based on Petrol Data.

    Dataset: https://umkc.box.com/s/7umnwj8u6rhtmegrqg13vgf1qs4qdtw0

  2. Create Hive Tables and Perform Queries for Use Case based on Olympics Data.

    Dataset: https://umkc.box.com/s/f918eea7k6mw6h7qiwj4b8im97c6hy84

  3. Create Hive Tables and Perform Queries for Use Case based on Movielens dataset which has 3 datasets as movies, users and ratings.

    Dataset: https://umkc.box.com/s/m3i7oabkj00boxuiskv5d4aoklh85w3x

  4. Bonus: List all the movies with its genre where the movie genre is Action or Drama and the average movie rating is in between 4.4 - 4.9 and only the male users rate the movie.

1. Create Hive Tables and Perform Queries for Use Case based on Petrol Data.

  • Start hive from command shell

    hive

  • Create table

    create table petrol (distributer_id STRING, distributer_name STRING, amt_IN STRING, amy_OUT STRING, vol_IN INT, vol_OUT INT, year INT) row format delimited fields terminated by ',' stored as textfile;

  • Load petrol data into the created table

    load data local inpath '/home/cloudera/petrol.txt' into table petrol;

  • To delete the first line in the table (for the header)

    alter table petrol set tblproperties ("skip.header.line.count"="1")

  • Executing queries:

1. In real life what is the total amount of petrol in volume sold by every distributor?

select distributer_name, SUM(vol_OUT) FROM petrol GROUP BY distributer_name;

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_18_44_56_P1.png

2. Which are the top 10 distributors ID’s for selling petrol and also display the amount of petrol sold in volume by them individually?

`select distributer_id,vol_OUT from petrol order by vol_OUT desc limit 10;`

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_18_44_01_P2.png

3. Find real life 10 distributor name who sold petrol in the least amount.

`select distributer_id,vol_OUT from petrol order by vol_OUT  limit 10;`

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_18_42_29_P3.png

4. The constraint to this query is the difference between vol_IN and vol_OUT is illegal in real life if greater than 500. As we see all distributors are receiving patrols on every next cycle.List all distributors who have this difference, along with the year and the difference which they have in that year.

 `select distributer_name, year from petrol where (vol_IN-vol_OUT) > 500;`

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_18_47_38_P4.png

2. Create Hive Tables and Perform Queries for Use Case based on Olympics Data.

  • Create table

    create table olympic (athlete STRING, age INT, country STRING, year STRING, closing STRING, sport STRING, gold INT, silver INT, bronze INT, total INT) row format delimited fields terminated by '\t' stored as textfile;

  • Load petrol data into the created table

    load data local inpath '/home/cloudera/ICP-4/olympic_data.csv' into table olympic;

  • Executing queries: 1. Using the dataset list the total number of medals won by each country in swimming.

select country, SUM(total) from olympic where sport = "Swimming" group by country;

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_18_58_44_O1.png

2. Display real life number of medals India won year wise.

`select year, SUM(total) from olympic where country = "India" group by year;`

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_19_02_03_O2.png

3. Find the total number of medals each country won display the name along with total medals.

`select country, SUM(total) from olympic group by country;`

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_19_05_30_O3.png

4. Find the real life number of gold medals each country won.

`select country, SUM(gold) from olympic group by country;`

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_19_08_59_04.png

5. Which country got medals for Shooting, year wise classification?

`select distinct(country), year from olympic where sport="Shooting" order by year, country;`

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_19_34_25_O5.png

3. Create Hive Tables and Perform Queries for Use Case based on Movielens dataset which has 3 datasets as movies, users and ratings.

Perform following tasks:

  1. Create 3 tables called movies, ratings and users. Load the data into tables.

create table movies (movieId STRING, title STRING, genres ARRAY<STRING>) row format delimited fields terminated by ',' collection items terminated by '|' stored as textfile;

load data local inpath '/home/cloudera/ICP-4/movies.csv' into table movies;

alter table movies set tblproperties ("skip.header.line.count"="1");

create table ratings (userId STRING, movieId INT, rating DECIMAL(2,1), timestamp STRING) row format delimited fields terminated by ',' stored as textfile;

load data local inpath '/home/cloudera/ICP-4/ratings.csv' into table ratings;

alter table ratings set tblproperties ("skip.header.line.count"="1");

create table users (userId INT, gender STRING, occupation INT,zipcode INT) row format delimited fileds terminated by ',' stored as textfile;

load data local inpath '/home/cloudera/ICP-4/users.txt' into table users;

alter table users set tblproperties ("skip.header.line.count"="1");

  1. For movies table: List all movies with genre of movie is “Action” and “Drama” select title, genres from movies where array_contains(genres, 'Action') and array_contains(genres,'Drama');

    https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_19_49_22_M2.png

  2. For Ratings table: List movie ids of all movies with rating equal to 5.

    select movieId, rating from ratings where rating = 5;

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_19_58_16_R3.png

  1. Find top 11 average rated "Action" movies with descending order of rating.

    select title, rating, genres from movies JOIN ratings ON movies.movieId = ratings.movieId where array_contains(genres, 'Action') order by rating desc limit 11;

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_20_07_03_M4.png

4. Bonus: List all the movies with its genre where the movie genre is Action or Drama and the average movie rating is in between 4.4 - 4.9 and only the male users rate the movie.

select title, rating, gender, genres from movies JOIN ratings ON movies.movieId = ratings.movieId JOIN users ON ratings.userId = users.userId WHERE (array_contains(genres,'Action') OR array_contains(genres, 'Drama')) AND rating >=4.4 AND rating <= 4.9 AND gender ='M';

https://github.com/acikgozmehmet/BigDataProgramming/blob/master/ICP-04/Documentation/VirtualBox_cloudera-quickstart-vm-5.13.0-0-virtualbox_17_02_2020_20_18_19_Bonus.png

References

  1. Hadoop 2009: http://www.slideshare.net/cloudera/hw09-hadoop-development-at-facebook-hive-and-hdfs

  2. Hive Examples: https://umkc.box.com/s/1dcugk08caqzitgqvrthiqe5n6sgznd5

⚠️ **GitHub.com Fallback** ⚠️