ICP Assignment 4 - MadhuriSarode/BDP GitHub Wiki
Student ID : 24 : Madhuri Sarode
Student ID : 4 : Bhargavi
Student ID : 16 : Bhavana
Hadoop Dependent Query Based NoSQL Database Hive
In this assignment, hive query language is explored by creating database tables in hive using HiveQL. HIVE supports queries expressed in SQL-like language called HiveQL which are compiled into MapReduce jobs that are executed on Hadoop. It is a data warehousing system to store structured data on Hadoop file system.
Since the data set in each table will be huge, querying it using SQL is a slower and daunting task , map reduce is better way of handling such a problem. Each query execution creates a MapReduce job to process the data request in a faster manner. Multiple mappers and reducers are created to distribute the task. The HQL is a high level query which makes it easier for the end user to manipulate and handle data without getting into low level map reduce jobs written for each query.
A working database can be created under hive and it can be used for further operations. If no database is created, then all the created tables and dataset will be saved in default
database.
create database icp4;
-- creates new dabase
use icp4;
-- Tells the hive system to use icp4 database further for all operations.
Part 1 - DataSet PetrolData
- Creation of Petrol table and loading of the data from the file Petrol.txt onto the table.
The query for the same is as follows
Once the data is loaded, the table can be viewed in HUE.
The table can be queried for any information, such as following
A)The total amount of petrol in volume sold by every distributor
B)The top 10 distributors ID’s for selling petrol and also the amount of petrol sold in volume by them individually
C)10 distributor's names who sold petrol in the least amount. (Since the data extracted by the query by default is in ascending order, the top 10 rows gives the distributor's names who have sold least amount of petrol.
D)All distributors who have the difference of petrol volumeIn < volumeOut, along with the year and the difference which they have in that year.
Part 2 - DataSet Olympic
- Creation of Olympic table and loading of the data from the file olympic_data.csv onto the table.
Once the data is loaded, the table can be viewed in HUE.
The table can be queried for any information, such as following
A)The total number of medals won by each country in swimming.
B)Number of medals India won year wise
C)Display of total number of medals each country won with name along with total medals
D)Number of gold medals each country won
D)Countries which got medal for shooting , year wise classification
Part 3 - Dataset MovieLens
- Creation of Movies, ratings and users table and loading of the data from the respective files movies.csv, ratings.csv, and users.csv onto the respective table.
The data can be viewed in the tables
A)List of all movies with genre of movie is “Action” and “Drama”
B)List of movie ids of all movies with rating equal to 5.
C)Top 11 average rated "Action" movies with descending order of rating.
Bonus Question
List all the movies with its genre where the movie genre is Action or Drama and the average movie rating is in between 4.4 -4.9 and only the male users rate the movie.
select m.title, m.genre from movies m join ratings r on m.movie_id = r.movie_id join users u on u.user_id = r.user_id and u.gender like '%M%' where ( m.genre like '%Action%' or m.genre like '%Drama%') group by m.title, m.genre having avg(r.rating)> 4.4 and avg(r.rating)<4.9;
The solution requires 3 joins to be performed on movies,ratings and users table. a) Movies table and ratings table is joined using the key movie_id
b) Ratings and users table is joined using key user_id
c) The ratings of male users are alone selected using like %M% condition on user.gender field
d) The movie genre of only action and drama are selected using like %Action% , %Drama%
e) The results selected are grouped using movie titles and movie genres.
f) The rows selected satisfying all the conditions above are again filtered on the ratings field, using
avg function, only those rows are selected which has an average rating >4.4 and <4.9