ICP 4 - manaswinivedula/Big-Data-Programming GitHub Wiki

Hive commands.

Task 1: Using the petrol data set

1. created a table named petrol using the following command

create table petrol (distributer_id STRING,distributer_name STRING,amt_IN STRING,amy_OUT STRING,vol_IN INT,vol_OUT INT,year INT) row format delimited fields terminated by ‘,’ stored as textfile;

2. Loaded the data petrol.txt into the table using the below command

load data local inpath ‘/home/cloudera/Downloads/petrol.txt’ into table petrol;

Queries:

1. To know the total amount of petrol in volume sold by every distributor in real life.

SELECT distributer_name,SUM(vol_OUT) FROM petrol GROUP BY distributer_name;

2. To know the top 10 distributors ID’s for selling petrol and also to display the amount of petrol sold in volume by them individually.

SELECT distributer_id,vol_OUT FROM petrol order by vol_OUT desc limit 10;

3. To know the 10 distributor names who sold petrol in the least amount in real life.

SELECT distributer_id,vol_OUT FROM petrol order by vol_OUT limit 10;

4. The constraint to this query is the difference between volumeIN and volumeOuT is illegal in real life if the difference is greater than 500. As we see all distributors are receiving patrols on every next cycle. So, Listing all distributors who have the difference>500, along with the year and the difference which they have in that year.

select distributer_name, year from petrol where (vol_IN-vol_OUT) > 500;

5. Sort by clause performs on column names of Hive tables to sort the output.

select distributer_name,year from petrol sort by year limit 10;

6.Cluster BY clause used on tables present in Hive. Hive uses the columns in Cluster by to distribute the rows among reducers. Cluster BY columns will go to the multiple reducers.

select distributer_name,year from petrol cluster by distributer_name sort by year limit 10;

7. Distribute BY clause used on tables present in Hive. Hive uses the columns in Distribute by to distribute the rows among reducers. All Distribute BY columns will go to the same reducer.

select distributer_name,year from petrol distribute by distributer_name limit 10;

Task 2: using Olympics Dataset

1. created a table named Olympics using the following command

create table olympic(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by ‘\t’ stored as textfile;

2. Loaded the data Olympics.csv into the table using the below command

load data local inpath‘/home/cloudera/Downloads/olympic_data.csv’ into table olympic;

Queries:

1.listing out the total number of medals won by each country in the sport swimming.

select country,SUM(total) from olympic where sport = "Swimming" GROUP BY country;

2. To know the real-life numbers of medals, India won year wise.

select year,SUM(total) from olympic where country = "India" GROUP BY year;

3.To find the total number of medals each country won in the Olympics.

select country,SUM(total) from olympic GROUP BY country;

4.To find the real-life number of gold medals won by each country.

select country,SUM(gold) from olympic GROUP BY country;

5. To find Which country got medals for Shooting, classifying according to year.

select distinct(country), year from olympic where sport="Shooting" order by year, country;

Task 3: using Movies, ratings, users Data set

1. created a table named movies_new using the following command

create table movies_new (movieId STRING, title STRING, genres ARRAY<STRING>) row format delimited fields terminated by ',' collection items terminated by '|' stored as textfile;

2. Loaded the data petrol.txt into the table using the below command

load data local inpath '/home/cloudera/Downloads/MovieLens/movies.csv' into table movies;

3. altering the table removing 1st row.

alter table movies set tblproperties ("skip.header.line.count"="1");

4. created a table named ratings_new using the following command

create table ratings_new (userId STRING, movieId INT, rating DECIMAL(2,1), timestamp STRING) row format delimited fields terminated by ',' stored as textfile;

5. Loaded the data petrol.txt into the table using the below command

load data local inpath '/home/cloudera/Downloads/MovieLens/ratings.csv' into table ratings;

6. altering the table removing 1st row.

alter table movies set tblproperties ("skip.header.line.count"="1");

7. created a table named users using the following command.

create table users (userId INT, gender STRING, occupation INT,zipcode INT) row format delimited fileds terminated by ',' stored as textfile;

7. Loaded the data users.txt into the table using the below command.

load data local inpath '/home/cloudera/Downloads/MovieLens/users.txt' into table users;

8. altering the table removing 1st row.

alter table movies set tblproperties ("skip.header.line.count"="1");

Queries:

1. To List, all movies with the genre of movie are “Action” and “Drama”

select title,genres from movies_new where array_contains(genres, 'Action') and array_contains(genres,'Drama');

2. List movie ids of all movies with a rating equal to 5.

select movieId, rating from ratings_new where rating = 5;

3. To Find top 11 average rated "Action" movies with descending order of rating.

select title, rating, genres from movies_new JOIN ratings_new ON movies_new.movieId = ratings_new.movieId where array_contains(genres, 'Action') order by rating desc limit 11;

References:

1.https://umkc.box.com/s/1dcugk08caqzitgqvrthiqe5n6sgznd5 2.https://www.tutorialspoint.com/hive/hive_introduction.htm

ICP 4 - manaswinivedula/Big-Data-Programming GitHub Wiki

Hive commands.

Task 1: Using the petrol data set

Queries:

Task 2: using Olympics Dataset

Queries:

Task 3: using Movies, ratings, users Data set

Queries:

References:

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️