Hive Queries - ayushmathur94/Spark GitHub Wiki
Create Database:
hive> create database retail;
hive> show databases;
hive> use retail;
Create Hive Managed Table :
hive> create table customer (id INT, age INT, name STRING) row format delimited fields terminated by ',' stored as textfile ;
hive> describe customer ;
Load data into table from an external source :
First copy the data file from the local file system to an HDFS path, eg:
hadoop fs -copyFromLocal /user/localdir/input.txt /user/userapp/myproj/
hive> load data INPATH '/user/userapp/myproj/input.txt' overwrite into table customer ;
- if the keyword local is specified, then give a local file-system path in the load command; if not, then give an hdfs path.
eg:
hive> load data local INPATH '/user/localdir/customerinput.txt' into table customer ;
Count total no. of records :
hive> select count(*) from customer ;
Create External Table :
- create external table with specified location.
hive> create external table customer_ex (id INT, name STRING , age INT) row format delimited fields terminated by ',' stored as textfile location '/user/hive/external/' ;
Drop table :
hive> drop table customer;
Insert command :
- insert overwrite is used to overwrite the existing data in table or partition.
- insert into is used to append data into existing data in table.
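A minimal sketch contrasting the two, assuming the customer table (id, age, name) created above; the row values are illustrative:

```sql
-- insert into appends; existing rows are kept
insert into table customer values (4, 35, 'Ravi');

-- insert overwrite replaces the entire contents of the table with the query result
insert overwrite table customer select * from customer where age > 30;
```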
Copy data from managed table to external table :
hive> from customer cus insert overwrite table customer_ex select cus.id, cus.name, cus.age ;
hive> insert into table customer_ex select cus.id, cus.name, cus.age from customer cus where cus.id = 1 ;
Create table with partitioning and bucketing [IMP]
hive> create table customer (customerId INT , name STRING , age INT, city STRING) partitioned by (state STRING) clustered by (city) into 8 buckets row format delimited fields terminated by ',' stored as textfile ;
Note: the partition column appears only in the partitioned by clause; it must not be repeated in the regular column list.
Now when we insert data, Hive throws an error because dynamic partition mode is strict and dynamic partitioning is not enabled. Therefore we need to set:
- set hive.exec.dynamic.partition=true;
- set hive.exec.dynamic.partition.mode=nonstrict;
create table customer (id INT, name STRING , city STRING) partitioned by (state STRING) clustered by (city) into 8 buckets row format delimited fields terminated by ',' stored as ORC tblproperties("orc.compress" = "NONE" , "transactional" = "true");
The TBLPROPERTIES clause is used to set the properties for the table.
The "orc.compress" property is set to "NONE" which indicates that the ORC files should not be compressed.
The "transactional" property is set to "true", which indicates that the table is transactional and supports ACID operations. When you set "transactional" to true, it enables the table to support the following operations:
- Multi-version Concurrency Control (MVCC)
- Insert, Update and Delete operations
- Atomicity and Durability
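On a transactional ORC table like the one above, row-level DML becomes available; a sketch (column names follow the DDL above, the values are illustrative):

```sql
-- UPDATE and DELETE require "transactional" = "true" and ORC storage
update customer set name = 'Ravi' where id = 1;
delete from customer where id = 2;
-- note: the bucketing column (city) and partition column (state) cannot be updated
```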
Load data with partition :
hive> load data local inpath '/user/localdir/input.txt' into table student partition(course='java');
Insert command with partition :
Specify the partition name/value so that the row is inserted into that particular partition.
hive> insert into table table_name partition(course='java') values ( );
You can also use dynamic partitioning in Hive, where you don't have to specify the partition values explicitly. Instead, you list only the partition column names in the PARTITION clause, and Hive automatically determines the partition values from the data you are inserting.
hive> INSERT INTO TABLE sales PARTITION (year, month) SELECT product_id, product_name, sales, year, month FROM temp_sales;
In this example, the year and month columns are added at the end of the SELECT statement, and each row is automatically inserted into the respective partition based on the values of those columns.
Change column name :
hive> alter table table_name change firstname name STRING ;
- firstname is the old column name, name is the new column name
-
In Hive, the SORTED BY clause is used to specify the sort order of the data within each bucket of a table. It is used in conjunction with the CLUSTERED BY clause in the table DDL to define the bucketing and sorting order for the data. In queries, the corresponding clauses are DISTRIBUTE BY and SORT BY:
hive> select * from employees distribute by department sort by age desc;
In this example, the query selects all columns and rows from the employees table, distributes the rows across reducers by the department column, and sorts the rows within each reducer by the age column in descending order.
Using bucketing and sorting together can help to improve the performance of queries on the table, as it allows Hive to read the data in a more efficient order (for example, for joins and sampling).
Please note that the sorting is applied only to rows within the same bucket/reducer; in other words, in the example above the data is sorted only within the same department.
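In table DDL, the corresponding bucketing and sort clauses can be sketched as follows (the employees table and its columns are illustrative):

```sql
create table employees (id INT, name STRING, age INT, department STRING)
clustered by (department) sorted by (age desc) into 4 buckets
row format delimited fields terminated by ','
stored as orc;
```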
Q. How to delete duplicate records from hive tables ?
insert overwrite table customer select distinct * from customer;
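If duplicates should instead be collapsed per key rather than per full row, a window-function sketch (assuming id should be unique, and the (id, age, name) schema from the managed table above):

```sql
insert overwrite table customer
select id, age, name
from (
  select id, age, name,
         row_number() over (partition by id order by age desc) as rn
  from customer
) t
where rn = 1;  -- keeps one row per id (the one with the highest age)
```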
ORC (Optimized Row Columnar) -: a file format that provides a highly efficient way to store Hive data.
Row Format Delimited :
We are specifying how a row is delimited, with the following clauses:
- fields terminated by '\t' or ','
- lines terminated by '\n'
If we don't specify a delimiter, fields are assumed to be delimited by the default separator, ASCII 001 (Ctrl-A).
DELIMITER : specifies the serializer and deserializer format, escape characters, null characters and so on.
General meaning : a mark or symbol used to show the beginning or end of separate elements.
fields terminated by -: used to define the column separator
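Putting the clauses together, a sketch of a table with explicit field and line delimiters (the logs table and its columns are illustrative):

```sql
create table logs (ts STRING, level STRING, msg STRING)
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;
```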
Q. To delete external table metadata and data completely :
hive> drop table tablename;
spark.sql("Drop table ");
For an external table, drop removes only the metadata, so also remove the hdfs location : hadoop fs -rm -r hdfspath
- truncate deletes data only, not metadata (and works only on managed tables)
- drop table deletes data as well as metadata for a managed table; for an external table it deletes only the metadata
How to truncate a partitioned external table in hive?
I have an external table abc.
hive> truncate table abc;
But it throws an error stating : Cannot truncate non-managed table abc.
Make your table MANAGED first:
hive> ALTER TABLE abc SET TBLPROPERTIES('EXTERNAL'='FALSE');
Then truncate
hive> truncate table abc;
And finally you can make it external again:
hive> ALTER TABLE abc SET TBLPROPERTIES('EXTERNAL'='TRUE');