ICP 4 - awais546/Big-Data-Programming-Hadoop-Pyspark GitHub Wiki
Big Data Programming Hadoop/Spark (Hive)
Introduction
General introduction about the Hive was delivered. Hive is SQL based, which lies on top of Hadoop or mapper reducer function. Hive is comparatively easy to use and faster.
Lab Task
Following are the lab tasks performed.
- Run the SQL queries on Petrol dataset.
- Run the SQL queries on Olympic dataset.
- Run the SQL queries on Movies dataset.
Question 1
1. Creation Of Table
For running the hive commands we first have to make a table with all the required columns and their respective data types. After that we have to separate it out on the basis of a delimiter based on the type of the file. After that we have to specify the format of the file. Screen of creating a table is given below.
2. In real life what is the total amount of petrol in volume sold by every distributor?
All the queries related to select are done in the mapper and all the queries like group by, order by are done by the reducer. Below is the query and the screenshot of the result.
SELECT distributer_name,SUM(vol_OUT) FROM petrol GROUP BY distributer_name;
3. Which are the top 10 distributors ID’s for selling petrol and also display the amount of petrol sold in volume by them individually?
SELECT distributer_id,vol_OUT FROM petrol order by vol_OUT desclimit 10;
4. Find real life 10 distributor name who sold petrol in the least amount.
SELECT distributer_id,vol_OUT FROM petrol order by vol_OUT limit 10;
5. List all distributors who have this difference, along with the year and the difference which they have in that year.
SELECT distributer_name,Year, (vol_IN-vol_OUT) FROM petrol WHERE (vol_IN-vol_OUT)>400;
Question 2
1. Using the dataset list the total number of medals won by each country in swimming.
select country,SUM(total) from olympic where sport = “Swimming” GROUP BY country;
2. Display real life number of medals India won year wise.
select year,SUM(total) from olympicwhere country = “India” GROUP BY year
3. Find the total number of medals each country won display the name along with total medals.
select country,SUM(total) from olympic GROUP BY country;
4. Find the real life number of gold medals each country won.
select country,SUM(gold) from olympic GROUP BY country;
5. Which country got medals for Shooting, year wise classification?
select country,Year, SUM(total) from olympic where sport = "Shootring" group by Year, country;
Question 3
1. For movies table:
select title,Year from movies where genre = "Action" and genre = "Drama"; For this filter query you can also use "Like" and "%" sign where the number of filters are more.
2. For Ratings table:
select c.title,o.rating from movies c join ratings c on (c.movieId = o.movieId) where o.rating = 5 order by c.tittle desc limit 11;
3. Find top 11 average rated "Action" movies with descending order of rating.
select c.title, o.rating from movies c join ratings - on (c.movieId = o.movieId) where c.genre = "Action" order by c.title desc limit 11;