Hive
(source: http://tecadmin.net/install-apache-hive-on-centos-rhel/)
What is Apache Hive? The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. Read more on the official site. This guide assumes you have Java installed and Hadoop already configured on your Linux system.
Download Hive Archive
Download the Hive binary release and extract the archive using the following commands. (https://www-us.apache.org/dist/hive/)
cd /home/hadoop
wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
In this guide we use version 2.3.9:
wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
then
tar xzf apache-hive-2.3.9-bin.tar.gz
mv apache-hive-2.3.9-bin hive
Setup Environment Variables (https://my-bigdata-blog.blogspot.com/2017/07/install-and-configure-apache-hive-on.html) After extracting the Hive archive, switch to the hadoop user and set the following environment variables by adding them at the end of .bashrc with your editor. Then save the file and source it.
export HADOOP_USER_CLASSPATH_FIRST=true
export HIVE_HOME=/home/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH
Then
source .bashrc
cd hive
Start Hive Before running Hive we need to create the warehouse directory in HDFS and make it and /tmp group-writable (g+w) before creating any table in Hive. Use the following commands. You can omit the $HADOOP_HOME/bin prefix if hadoop is on your search path.
cd /home/hadoop/hive
$HADOOP_HOME/bin/hadoop fs -mkdir /user1
$HADOOP_HOME/bin/hadoop fs -mkdir /user1/hive
$HADOOP_HOME/bin/hadoop fs -mkdir /user1/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user1/hive/warehouse
**To fix the conflicting guava.jar versions (for Hive 2.3.9), copy the newer jar from Hadoop and remove the old one:
cp /home/hadoop/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /home/hadoop/hive/lib
rm /home/hadoop/hive/lib/guava-14.0.1.jar
Now start Hive with the following command.
hive
You should see output similar to the following:
Logging initialized using configuration in jar:file:/home/hadoop/hive/lib/hive-common-1.1.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive/lib/hive-jdbc-1.1.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
hive>
Hive should be working now. Exit the prompt first. Next: the metastore. Hive requires a metastore backend to store its metadata (table definitions, column types, data locations). Here we will use MySQL.
Reading: https://data-flair.training/blogs/apache-hive-metastore/
for what the metastore is and the variety of metastore deployments Hive supports.
Preliminary: Create the metastore and start the MySQL service
(https://support.rackspace.com/how-to/install-mysql-server-on-the-ubuntu-operating-system/)
You have to install MySQL as the backend.
sudo apt-get update
sudo apt-get install mysql-server
sudo ufw allow mysql
and start the service
sudo systemctl start mysql
Download the MySQL JDBC connector as the hadoop user:
wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-5.1.48.tar.gz
tar xvf mysql-connector-java-5.1.48.tar.gz
cp mysql-connector-java-5.1.48/mysql-connector-java-5.1.48.jar /home/hadoop/hive/lib/
Set the Java classpath so Hive can find the connector.
Add this line to ~/.bashrc
export CLASSPATH=$CLASSPATH:/home/hadoop/hive/lib
and source it.
source ~/.bashrc
Preliminary: Create hive-site.xml to set the backend to MySQL
Create the file /home/hadoop/hive/conf/hive-site.xml and add the following lines. Change the MySQL username and password to match your installation; here I use username 'root' and password 'password' in the javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword properties of hive-site.xml.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/hadoop/hive/tmp</value>
</property>
<property>
<name>hive.server2.active.passive.ha.enable</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true&amp;verifyServerCertificate=false&amp;requireSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://localhost:9000/user1/hive/warehouse</value>
</property>
</configuration>
Then save it.
Create /home/hadoop/hive/conf/core-site.xml with the following content:
<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- OOZIE proxy user setting -->
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
<!-- HTTPFS proxy user setting -->
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
<!-- HS2 proxy user setting -->
<property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>*</value>
</property>
<!-- metastore proxy user setting -->
<property>
<name>hadoop.proxyuser.superuser.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.superuser.groups</name>
<value>*</value>
</property>
</configuration>
Then save it.
If an old local metastore exists, remove it; it lives in the metastore_db/ folder of the directory where you previously ran Hive:
rm -rf metastore_db/
Commands to set up the MySQL root password
(https://www.digitalocean.com/community/tutorials/how-to-install-mysql-on-ubuntu-18-04)
sudo mysql
Then type the following MySQL commands (in my case I use the password 'password'):
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'password';
FLUSH PRIVILEGES;
Check the authentication methods employed by each of your users again to confirm that root no longer authenticates using the auth_socket plugin:
SELECT user,authentication_string,plugin,host FROM mysql.user;
Output
+------------------+-------------------------------------------+-----------------------+-----------+
| user | authentication_string | plugin | host |
+------------------+-------------------------------------------+-----------------------+-----------+
| root | *3636DACC8616D997782ADD0839F92C1571D6D78F | mysql_native_password | localhost |
| mysql.session | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE | mysql_native_password | localhost |
| mysql.sys | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE | mysql_native_password | localhost |
| debian-sys-maint | *CC744277A401A7D25BE1CA89AFF17BF607F876FF | mysql_native_password | localhost |
+------------------+-------------------------------------------+-----------------------+-----------+
4 rows in set (0.00 sec)
You can see in this example output that the root MySQL user now authenticates using a password. Once you confirm this on your own server, you can exit the MySQL shell:
exit
Now try to enter mysql again.
mysql -u root -p
You should get the MySQL prompt. In case you want to delete an existing metastore database, do the following.
mysql -u root -p
mysql> drop database metastore;
At the shell prompt, in the hive folder, create the metastore schema in MySQL:
$HIVE_HOME/bin/schematool -initSchema -dbType mysql
When it succeeds, you see
....
Initialization script completed
Mon Jan 18 01:39:25 ICT 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
schemaTool completed
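Optionally, you can verify from the MySQL shell that the schema was created. This is just a sanity check, not a required step; the table names come from the Hive schema scripts.
mysql -u root -p
mysql> USE metastore;
mysql> SHOW TABLES;
You should see the Hive metastore tables, for example DBS, TBLS and COLUMNS_V2.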
Then start the metastore service (run it in the background or in a separate terminal):
hive --service metastore
Now you can create tables and run queries. Create Demo Table and Test
At this stage you have successfully installed Hive. Let's create a sample table using the following commands.
First start the hive shell.
hive
You should get the prompt.
hive>
Type the query command.
hive> CREATE TABLE demo1 (id int, name string);
OK
Time taken: 6.565 seconds
Show the created tables with the command below.
hive> SHOW TABLES;
OK
demo1
Time taken: 0.231 seconds, Fetched: 1 row(s)
hive> insert into demo1 values (1,"joy");
Query ID = hadoop_20150425182727_43454be8-779d-4d6b-9310-3da7f32dcb95
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1429946598372_0003, Tracking URL = http://localhost:8088/proxy/application_1429946598372_0003/
Kill Command = /home/hadoop/hadoop/bin/hadoop job -kill job_1429946598372_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-04-25 18:28:16,172 Stage-1 map = 0%, reduce = 0%
2015-04-25 18:28:45,708 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1429946598372_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/user/hive/warehouse/demo1/.hive-staging_hive_2015-04-25_18-27-52_772_2752692263082577145-1/-ext-10000
Loading data to table default.demo1
Table default.demo1 stats: [numFiles=1, numRows=1, totalSize=6, rawDataSize=5]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.66 sec HDFS Read: 3377 HDFS Write: 75 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 660 msec
OK
Time taken: 57.777 seconds
Or you can load existing data. Assume demo.txt contains:
chantana
ploy
arnon
putchong
mudang
bundit
wandee
First, copy it to the Hadoop file system.
hdfs dfs -copyFromLocal /home/hadoop/hive/demo.txt /demo.txt
Then create the table in HiveQL.
hive> create table demo2 (name string);
OK
Time taken: 2.516 seconds
hive> load data inpath 'hdfs://localhost:9000/demo.txt' into table demo2;
Loading data to table default.demo2
Table default.demo2 stats: [numFiles=1, totalSize=43]
OK
Time taken: 1.657 seconds
hive> select * from demo2;
OK
chantana
ploy
arnon
putchong
mudang
bundit
Time taken: 0.616 seconds, Fetched: 6 row(s)
Note: if you add the LOCAL keyword as in the following command, the file is loaded
from your local directory, not from HDFS.
hive> load data local inpath '/home/hadoop/hive/demo.txt' into table demo2;
Drop the table using the command below.
hive> DROP TABLE demo1;
OK
Time taken: 2.393 seconds
hive> exit;
Read more about SQL operations in Hive on the Apache Hive official site. (Sources: http://tecadmin.net/install-apache-hive-on-centos-rhel/, https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started, http://stackoverflow.com/questions/17425492/hive-insert-query-like-sql)
A more comprehensive tutorial about Hive:
(http://www.tutorialspoint.com/hive/)
Monitoring hive
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.0/bk_reference/content/hive-ports.html
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
The data are in https://github.com/cchantra/bigdata.github.io/raw/refs/heads/master/hive/hive_data_sample.tar.gz
- Download the data: the Master.csv and Batting.csv files from the dataset. **You can get the data from the archive below or on the server.
wget https://github.com/cchantra/bigdata.github.io/raw/refs/heads/master/hive/hive_data_sample.tar.gz
tar xvf hive_data_sample.tar.gz
- copy to HDFS
hdfs dfs -mkdir /data_hive
hdfs dfs -copyFromLocal Batting.csv /data_hive/Batting.csv
hdfs dfs -copyFromLocal Master.csv /data_hive/Master.csv
- Create a table and load the data. Start hive and then enter the commands:
hive> create table temp_batting (col_value STRING);
- Load the CSV data into the table:
hive> LOAD DATA INPATH 'hdfs://localhost:9000/data_hive/Batting.csv' OVERWRITE INTO TABLE temp_batting;
- Extract the data. First we type in a query to create a new table called batting to hold the data. That table will have three columns: player_id, year, and the number of runs. Create the table with:
hive> create table batting (player_id STRING, year INT, runs INT);
- Extract the data we want from temp_batting and copy it into batting. We do this with a regexp pattern: the three regexp_extract calls pull out the player_id (field 1), the year (field 2), and the runs (field 9) from each CSV line. (A standalone illustration of regexp_extract appears after the job output below.) Put the following command:
hive> insert overwrite table batting
select
regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,
regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,
regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) run
from temp_batting;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20230804170128_c4f2502d-576f-44be-9b77-4fda34e12f8d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1689960917796_0019, Tracking URL = http://localhost:8088/proxy/application_1689960917796_0019/
Kill Command = /home/hadoop/hadoop/bin/hadoop job -kill job_1689960917796_0019
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2023-08-04 17:01:38,400 Stage-1 map = 0%, reduce = 0%
2023-08-04 17:01:44,741 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.48 sec
MapReduce Total cumulative CPU time: 4 seconds 480 msec
Ended Job = job_1689960917796_0019
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/batting/.hive-staging_hive_2023-08-04_17-01-28_633_1686320822013581688-1/-ext-10000
Loading data to table default.batting
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.48 sec HDFS Read: 6398990 HDFS Write: 1653403 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 480 msec
OK
Time taken: 17.563 seconds
hive>
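To see what each regexp_extract call returns, you can first test the pattern on a single string. This is only an illustration: the quoted CSV line below is made up, not a real row from Batting.csv. The pattern repeats the group "field followed by an optional comma" N times, and capture group 1 ends up holding the N-th field.
hive> SELECT
        regexp_extract('p001,1999,x,x,x,x,x,x,42', '^(?:([^,]*),?){1}', 1),
        regexp_extract('p001,1999,x,x,x,x,x,x,42', '^(?:([^,]*),?){2}', 1),
        regexp_extract('p001,1999,x,x,x,x,x,x,42', '^(?:([^,]*),?){9}', 1);
This should return p001, 1999 and 42, i.e. fields 1, 2 and 9 of the made-up line.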
**Note: queries 6 and 7 may not work on a two-node setup.
- Create a query to filter the data (year, runs): for each year, what is the highest number of runs? This query groups all the records by year and takes the maximum runs in each group.
SELECT year, max(runs) FROM batting GROUP BY year;
- Create a query to filter the data (year, player, runs): which player had the highest runs in each year? We need to go back and get the player_id(s) as well. For a given year we can use the maximum runs to find the player(s) for that year, so we take the previous query and join it with the batting records to get the final table.
SELECT a.year, a.player_id, a.runs from batting a
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b
ON (a.year = b.year AND a.runs = b.runs);
Query ID = chantana_20160918123836_82c5343a-b18e-4a47-ad90-b94bfe747a5d
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1473575783808_0012, Tracking URL = http://XTK40c:8088/proxy/application_1473575783808_0012/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1473575783808_0012
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-18 12:38:40,868 Stage-1 map = 0%, reduce = 0%
2016-09-18 12:38:46,048 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.43 sec
2016-09-18 12:38:51,220 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.2 sec
MapReduce Total cumulative CPU time: 6 seconds 200 msec
Ended Job = job_1473575783808_0012
2016-09-18 12:38:54,522 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/chantana/chantana_20160918123836_82c5343a-b18e-4a47-ad90-b94bfe747a5d.log
2016-09-18 12:38:54 Starting to launch local task to process map join; maximum memory = 514850816
2016-09-18 12:38:55 Dump the side-table for tag: 0 with group count: 13583 into file: file:/tmp/chantana/824e0a2e-892f-4ffa-8719-f833f0c8acf5/hive_2016-09-18_12-38-36_274_4959386345773063113-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile00--.hashtable
2016-09-18 12:38:56 Uploaded 1 File to: file:/tmp/chantana/824e0a2e-892f-4ffa-8719-f833f0c8acf5/hive_2016-09-18_12-38-36_274_4959386345773063113-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile00--.hashtable (1562453 bytes)
2016-09-18 12:38:56 End of local task; Time Taken: 1.068 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1473575783808_0013, Tracking URL = http://XTK40c:8088/proxy/application_1473575783808_0013/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1473575783808_0013
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2016-09-18 12:39:02,074 Stage-4 map = 0%, reduce = 0%
2016-09-18 12:39:06,314 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 2.88 sec
MapReduce Total cumulative CPU time: 2 seconds 880 msec
Ended Job = job_1473575783808_0013
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.2 sec HDFS Read: 1660057 HDFS Write: 3159 SUCCESS
Stage-Stage-4: Map: 1 Cumulative CPU: 2.88 sec HDFS Read: 8755 HDFS Write: 2833 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 80 msec
OK
1871 barnero01 66
1872 eggleda01 94
1873 barnero01 125
1874 mcveyca01 91
1875 barnero01 115
1876 barnero01 126
1877 orourji01 68
1878 highadi01 60
1879 jonesch01 85
1880 dalryab01 91
1881 gorege01 86
1882 gorege01 99
1883 stoveha01 110
1884 dunlafr01 160
1885 stoveha01 130
1886 kellyki01 155
1887 oneilti01 167
1888 pinknge01 134
1889 griffmi01 152
1889 stoveha01 152
1890 duffyhu01 161
1891 brownto01 177
1892 childcu01 136
1893 longhe01 149
1894 hamilbi01 192
...
2009 pujolal01 124
2010 pujolal01 115
2011 grandcu01 136
Time taken: 31.126 seconds, Fetched: 151 row(s)
- Find the player names with the top 10 runs.
- Find the top 10 states with the best runs.
- Who are the managers for the players with the runs found in 1)?
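A possible starting point for the first item, using only the batting table created above (a sketch; turning player_id values into names and finding states or managers requires loading Master.csv into its own table first, which is left as part of the exercise):
SELECT player_id, year, runs
FROM batting
ORDER BY runs DESC
LIMIT 10;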
(https://hortonworks.com/hadoop-tutorial/using-hive-data-analysis/)
- Download http://hortonassets.s3.amazonaws.com/tutorial/hive/Twitterdata.txt (e.g. with wget).
- Copy it to HDFS as before (hdfs dfs -copyFromLocal Twitterdata.txt /data_hive/Twitterdata.txt).
- Start the Hive command line and create the table:
DROP TABLE IF EXISTS TwitterExampletextexample;
CREATE TABLE TwitterExampletextexample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs://localhost:9000/data_hive/Twitterdata.txt' OVERWRITE INTO TABLE TwitterExampletextexample;
Test:
hive> select * from TwitterExampletextexample limit 10;
OK
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
Time taken: 1.317 seconds, Fetched: 10 row(s)
-Create Table using RCFile format
Record Columnar (RC) format determines how relational tables are stored on a distributed cluster: within each block, values are grouped by column, so you get the advantages of a columnar layout over a purely row-oriented record format. (A short illustration of this follows the select example below.)
DROP TABLE IF EXISTS TwitterExampleRCtable;
CREATE TABLE TwitterExampleRCtable(
tweetId BIGINT,
username STRING,
txt STRING,
CreatedAt STRING,
profileLocation STRING COMMENT 'Location of user',
favc BIGINT,
retweet STRING,
retcount BIGINT,
followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS RCFILE;
Insert to RC Table:
hive> INSERT OVERWRITE TABLE TwitterExampleRCtable select * from TwitterExampletextexample;
Query ID = chantana_20160923142221_d8d39f6f-de11-4a41-b4cb-af2deec1760e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1473575783808_0014, Tracking URL = http://XTK40c:8088/proxy/application_1473575783808_0014/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1473575783808_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-23 14:22:27,560 Stage-1 map = 0%, reduce = 0%
2016-09-23 14:22:32,787 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec
MapReduce Total cumulative CPU time: 1 seconds 800 msec
Ended Job = job_1473575783808_0014
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/user/hive/warehouse/twitterexamplerctable/.hive-staging_hive_2016-09-23_14-22-21_650_3647274628905325844-1/-ext-10000
Loading data to table default.twitterexamplerctable
Table default.twitterexamplerctable stats: [numFiles=1, numRows=33, totalSize=2757, rawDataSize=2585]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.8 sec HDFS Read: 11256 HDFS Write: 2844 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
Time taken: 12.478 seconds
-Select data :
hive> select * from TwitterExampleRCtable limit 10;
OK
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917312909647872 cloudfeednet PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909 Sat Apr 26 04:50:06 UTC 2014 Bay Area, CA 0 retweet 0 326
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
459917350704513024 anamarialeon2 you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169 Sat Apr 26 04:50:15 UTC 2014 Chicago 0 retweet 0 114
Time taken: 0.155 seconds, Fetched: 10 row(s)
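As noted above, a columnar format such as RCFile pays off most when a query reads only a few of the columns, because Hive can skip the other column groups inside each block. A quick illustration (any narrow projection works; this query is just an example):
hive> select username, followerscount from TwitterExampleRCtable order by followerscount desc limit 10;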
-Find the top 10 profile locations that tweeted most in TwitterExampleRCtable:
SELECT profileLocation, COUNT(txt) as count1
FROM TwitterExampleRCtable
GROUP BY profileLocation
ORDER BY count1 desc limit 10;
...
OK
Chicago, IL 11
Chicago, IL. 8
Bay Area, CA 5
Chicago 5
NULL 0
Time taken: 40.631 seconds, Fetched: 5 row(s)
-Managed Table vs External Table Managed tables are what CREATE TABLE produces by default, whereas external tables are used when you want the table to point to data files that stay in place.
-- Managed Table
DROP TABLE IF EXISTS ManagedExample;
CREATE TABLE ManagedExample(
tweetId BIGINT,
username STRING,
txt STRING,
CreatedAt STRING,
profileLocation STRING,
favc BIGINT,
retweet STRING,
retcount BIGINT,
followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- External Table
DROP TABLE IF EXISTS ExternalExample;
CREATE EXTERNAL TABLE ExternalExample(
tweetId BIGINT,
username STRING,
txt STRING,
CreatedAt STRING,
profileLocation STRING,
favc BIGINT,
retweet STRING,
retcount BIGINT,
followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/data/admin';
By creating a managed table, the file you load is moved into the Hive warehouse directory (hdfs://localhost:9000/user1/hive/warehouse in this setup), which means the data is controlled by Hive.
The external table, by contrast, points to the /data/admin directory in which we put Twitterdata.txt. (See the DROP TABLE note after the describe output below.)
describe formatted ManagedExample;
output
# col_name data_type comment
tweetid bigint
username string
txt string
createdat string
profilelocation string
favc bigint
retweet string
retcount bigint
followerscount bigint
# Detailed Table Information
Database: default
Owner: chantana
CreateTime: Fri Sep 23 14:27:00 ICT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
...
describe formatted ExternalExample;
OK
# col_name data_type comment
tweetid bigint
username string
txt string
createdat string
profilelocation string
favc bigint
retweet string
retcount bigint
followerscount bigint
# Detailed Table Information
Database: default
Owner: chantana
CreateTime: Fri Sep 23 14:27:47 ICT 2016
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://localhost:9000/data/admin
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
comment This is the Twitter streaming data
transient_lastDdlTime 1474615667
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim t
serialization.format t
Time taken: 0.103 seconds, Fetched: 37 row(s)
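The practical difference shows up when the tables are dropped (a sketch; only run this when you no longer need the tables):
-- Dropping a managed table deletes its metadata and its files under the
-- warehouse directory (hdfs://localhost:9000/user1/hive/warehouse in this setup).
DROP TABLE ManagedExample;
-- Dropping an external table deletes only the metadata; the files under
-- /data/admin stay in HDFS and can be reused by recreating the table.
DROP TABLE ExternalExample;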
-Partitioned Table
Partitions are horizontal slices of data which allow large sets of data to be segmented into more manageable blocks. Here is the sample syntax to create a partitioned table and load data into partitions.
-- Create Partitioned Table
DROP TABLE IF EXISTS PARTITIONEDExample;
CREATE TABLE PARTITIONEDExample(
tweetId BIGINT,
username STRING,
txt STRING,
favc BIGINT,
retweet STRING,
retcount BIGINT,
followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY(CreatedAt STRING, profileLocation STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- Set dynamic partition mode
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Insert data with partition
INSERT OVERWRITE TABLE PARTITIONEDExample
PARTITION (CreatedAt='Sat Apr 26 04:50:15 UTC 2014', profileLocation='Chicago')
SELECT
tweetId,
username,
txt,
favc,
retweet,
retcount,
followerscount
FROM twitterexampletextexample
WHERE profileLocation='Chicago'
LIMIT 100;
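To verify that the partition was created, list the table's partitions (the exact partition values depend on what was inserted above):
-- Shows one row per partition, e.g. createdat=.../profilelocation=Chicago
SHOW PARTITIONS PARTITIONEDExample;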
-Tables with buckets
Bucketing is a technique that clusters or segments large sets of data into a fixed number of buckets to optimize query performance, for example for sampling and bucketed joins (see the sampling sketch after the INSERT below).
-- Create Bucketed Table
DROP TABLE IF EXISTS BucketingExample;
CREATE TABLE BucketingExample(
tweetId BIGINT,
username STRING,
txt STRING,
CreatedAt STRING,
favc BIGINT,
retweet STRING,
retcount BIGINT,
followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY(profileLocation STRING)
CLUSTERED BY(tweetId) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- Set required properties
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Insert data
INSERT OVERWRITE TABLE BucketingExample
PARTITION(profileLocation='Chicago')
SELECT
tweetId,
username,
txt,
CreatedAt,
favc,
retweet,
retcount,
followerscount
FROM twitterexampletextexample
WHERE profileLocation='Chicago'
LIMIT 100;
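As mentioned above, one benefit of bucketing is cheap sampling: Hive can read a single bucket instead of scanning the whole table. A sketch using the table just populated:
-- Read only bucket 1 of the 2 buckets that tweetId was hashed into.
SELECT * FROM BucketingExample TABLESAMPLE(BUCKET 1 OUT OF 2 ON tweetId) s LIMIT 10;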
# Clean up HDFS directories
hdfs dfs -rm -r /user/hive/warehouse/*
hdfs dfs -rm -r /data/admin/*
hdfs dfs -rm -r /data_hive/*
hdfs dfs -rm -r /user1/hive/warehouse/*
hdfs dfs -rm -r /demo.txt
hdfs dfs -rm -r /tmp/*
# Clean up local temporary files
rm -rf /home/hadoop/hive/tmp/*
rm -rf /tmp/hive/
# Stop Hive metastore
ps aux | grep hive
kill -9 <metastore_process_id> # Replace with actual process ID
# Clean up metastore (optional - be careful with this!)
mysql -u root -p
mysql> DROP DATABASE IF EXISTS metastore;
mysql> exit;
Reference: Hive command cheat sheets
https://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
https://www.oreilly.com/library/view/programming-hive/9781449326944/ch04.html