Hive - cchantra/bigdata.github.io GitHub Wiki

Hive setup

(source: http://tecadmin.net/install-apache-hive-on-centos-rhel/)

What is Apache Hive? The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. Read more on the official site. This guide assumes you have Java installed and Hadoop already configured on your Linux system.

Download the Hive Archive

Download a Hive release and extract the archive using the following commands. (https://www-us.apache.org/dist/hive/)

cd /home/hadoop

wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

We use version 2.3.9 in this guide.

wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz

then

tar xzf apache-hive-2.3.9-bin.tar.gz
mv apache-hive-2.3.9-bin hive

Setup Environment Variables (https://my-bigdata-blog.blogspot.com/2017/07/install-and-configure-apache-hive-on.html) After extracting the Hive archive, switch to the hadoop user and set the following environment variables by adding them at the end of .bashrc with your editor. Then save the file and source it.

export HADOOP_USER_CLASSPATH_FIRST=true
export HIVE_HOME=/home/hadoop/hive
export PATH=$HIVE_HOME/bin:$PATH

Then

source .bashrc
cd hive
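To confirm the variables are picked up, run a quick optional check (assumes the paths set above):

echo $HIVE_HOME
which hive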

Start Hive

Before running Hive we need to create the /tmp and /user1/hive/warehouse directories in HDFS and set them to chmod g+w before creating a table in Hive. Use the following commands. You can omit the $HADOOP_HOME/bin prefix if hadoop is already in your search path.

cd /home/hadoop/hive

$HADOOP_HOME/bin/hadoop fs -mkdir /user1
$HADOOP_HOME/bin/hadoop fs -mkdir /user1/hive
$HADOOP_HOME/bin/hadoop fs -mkdir /user1/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user1/hive/warehouse
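The commands above assume /tmp already exists in HDFS; if it does not, create it first before setting its permissions (a small optional extra step, same style as above):

$HADOOP_HOME/bin/hadoop fs -mkdir -p /tmp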

**For a conflicting version of guava.jar (for Hive 2.3.9), copy the jar from Hadoop and remove the one bundled with Hive:**

cp /home/hadoop/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /home/hadoop/hive/lib
rm /home/hadoop/hive/lib/guava-14.0.1.jar

Now, let's start Hive using the following command.

hive

The following is the output:

Logging initialized using configuration in jar:file:/home/hadoop/hive/lib/hive-common-1.1.0.jar!/hive-log4j.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive/lib/hive-jdbc-1.1.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

hive> 

Your Hive should be working now. Exit the prompt first. Next: the metastore. Hive requires a metastore backend to store its metadata. Here we will use MySQL.

For what the metastore is in Hive and the variety of metastore configurations it supports, read:

https://data-flair.training/blogs/apache-hive-metastore/

Preliminary: Install MySQL for the metastore and start the service

https://support.rackspace.com/how-to/install-mysql-server-on-the-ubuntu-operating-system/

You have to install mysql as a backend.

sudo apt-get update
sudo apt-get install mysql-server
 
sudo ufw allow mysql

and start service

sudo systemctl start mysql

Download the MySQL connector as the hadoop user.


wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-5.1.48.tar.gz

tar xvf mysql-connector-java-5.1.48.tar.gz

cp mysql-connector-java-5.1.48/mysql-connector-java-5.1.48.jar /home/hadoop/hive/lib/

Set the Java classpath so the connector can be found. Add this line to ~/.bashrc

export CLASSPATH=$CLASSPATH:/home/hadoop/hive/lib

and source it.


source ~/.bashrc

Preliminary: Create hive-site.xml to set the backend to MySQL

Set up the configuration: create the file /home/hadoop/hive/conf/hive-site.xml and add the following lines. Change the MySQL username and password as appropriate. Here I use username 'root' and password 'password'; they appear under <name>javax.jdo.option.ConnectionUserName</name> and <name>javax.jdo.option.ConnectionPassword</name> in the hive-site.xml file and must match your MySQL credentials.
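If you prefer not to use the MySQL root account, you can create a dedicated metastore user instead (optional; 'hiveuser' and 'hivepassword' below are illustrative names, and the two connection properties above must then be changed to match):

mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hiveuser'@'localhost';
mysql> FLUSH PRIVILEGES;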

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/home/hadoop/hive/tmp</value>
</property>

<property>
    <name>hive.server2.active.passive.ha.enable</name>
    <value>true</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true&amp;useSSL=false&amp;allowPublicKeyRetrieval=true&amp;verifyServerCertificate=false&amp;requireSSL=false</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value> 
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>true</value>
</property>

<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

<property>
 <name>datanucleus.autoCreateTables</name>
 <value>true</value>
</property>

<property>
 <name>hive.metastore.warehouse.dir</name>
 <value>hdfs://localhost:9000/user1/hive/warehouse</value>
</property>

</configuration>

Then save it.
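The hive.exec.local.scratchdir set above points at a local directory; if it does not already exist, you can create it yourself (optional, assuming the path used in the config):

mkdir -p /home/hadoop/hive/tmp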

Create /home/hadoop/hive/conf/core-site.xml with the following content.

<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at
      http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <!-- OOZIE proxy user setting -->
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>

  <!-- HTTPFS proxy user setting -->
  <property>
    <name>hadoop.proxyuser.httpfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.httpfs.groups</name>
    <value>*</value>
  </property>

  <!-- HS2 proxy user setting -->
  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>

  <!-- metastore proxy user setting -->
  <property>
    <name>hadoop.proxyuser.superuser.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.superuser.groups</name>
    <value>*</value>
  </property>

</configuration>

Then save it.

Create the Hive metastore

Remove the old local metastore if one exists; it is in the metastore_db/ folder:

rm -rf metastore_db/

Set up the MySQL root password (see https://www.digitalocean.com/community/tutorials/how-to-install-mysql-on-ubuntu-18-04):

sudo mysql

Then type the following MySQL command.

ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'password';

(In my case I use password 'password')

Check the authentication methods employed by each of your users again to confirm that root no longer authenticates using the auth_socket plugin:

SELECT user,authentication_string,plugin,host FROM mysql.user;

Output

+------------------+-------------------------------------------+-----------------------+-----------+
| user             | authentication_string                     | plugin                | host      |
+------------------+-------------------------------------------+-----------------------+-----------+
| root             | *3636DACC8616D997782ADD0839F92C1571D6D78F | mysql_native_password | localhost |
| mysql.session    | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE | mysql_native_password | localhost |
| mysql.sys        | *THISISNOTAVALIDPASSWORDTHATCANBEUSEDHERE | mysql_native_password | localhost |
| debian-sys-maint | *CC744277A401A7D25BE1CA89AFF17BF607F876FF | mysql_native_password | localhost |
+------------------+-------------------------------------------+-----------------------+-----------+
4 rows in set (0.00 sec)

You can see in this example output that the root MySQL user now authenticates using a password. Once you confirm this on your own server, you can exit the MySQL shell:

exit

Now try to enter mysql again.

mysql -u root -p

You should get the MySQL prompt. If you want to delete an existing metastore database, do the following.

mysql -u root -p
mysql> drop database metastore;

At the shell prompt, in the hive folder, create the metastore schema in MySQL:

$HIVE_HOME/bin/schematool -initSchema -dbType mysql

When it succeeds, you see

....
Initialization script completed
Mon Jan 18 01:39:25 ICT 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
schemaTool completed
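To verify that the schema was actually created, you can list the metastore tables in MySQL (an optional check; uses the root credentials configured above):

mysql -u root -p -e "USE metastore; SHOW TABLES;"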

Start the metastore service (as a background process).

hive --service metastore
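To actually keep it running in the background after you log out, a common pattern is nohup (a sketch; the log file location is just a choice):

nohup hive --service metastore > /home/hadoop/hive/metastore.log 2>&1 &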

Now you can create tables and run queries.

Create a Demo Table and Test

At this stage you have successfully installed Hive. Let's create a sample table using the following commands.

Try Hive QL.

First call hive.

hive

You should get prompt.

hive> 

Type the query command.

hive> CREATE TABLE demo1 (id int, name string);


OK
Time taken: 6.565 seconds

Show the created tables with below command.

hive> SHOW TABLES;
OK
demo1
Time taken: 0.231 seconds, Fetched: 1 row(s)
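You can also inspect the table definition with DESCRIBE (optional); it should list the two columns id and name:

hive> DESCRIBE demo1;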

hive> insert into demo1 values (1,"joy");
Query ID = hadoop_20150425182727_43454be8-779d-4d6b-9310-3da7f32dcb95
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1429946598372_0003, Tracking URL = http://localhost:8088/proxy/application_1429946598372_0003/
Kill Command = /home/hadoop/hadoop/bin/hadoop job  -kill job_1429946598372_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-04-25 18:28:16,172 Stage-1 map = 0%,  reduce = 0%
2015-04-25 18:28:45,708 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1429946598372_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/user/hive/warehouse/demo1/.hive-staging_hive_2015-04-25_18-27-52_772_2752692263082577145-1/-ext-10000
Loading data to table default.demo1
Table default.demo1 stats: [numFiles=1, numRows=1, totalSize=6, rawDataSize=5]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.66 sec   HDFS Read: 3377 HDFS Write: 75 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 660 msec
OK
Time taken: 57.777 seconds

Or you can load existing data. Assume demo.txt contains:

chantana
ploy
arnon
putchong
mudang
bundit
wandee

First, copy to hadoop file system.

 hdfs dfs -copyFromLocal /home/hadoop/hive/demo.txt /demo.txt

Then create the table in HiveQL.

hive> create table demo2 (name string);
OK
Time taken: 2.516 seconds
hive> load data inpath 'hdfs://localhost:9000/demo.txt' into table demo2;
Loading data to table default.demo2
Table default.demo2 stats: [numFiles=1, totalSize=43]
OK
Time taken: 1.657 seconds
hive> select * from demo2;
OK
chantana
ploy
arnon
putchong
mudang
bundit
Time taken: 0.616 seconds, Fetched: 6 row(s)

Note: if you use the following command with the LOCAL keyword, it will load demo.txt from your local directory, not from HDFS.

hive> load data local inpath '/home/hadoop/hive/demo.txt' into table demo2;

Drop the table using the command below.
hive> DROP TABLE demo1;
OK
Time taken: 2.393 seconds

hive> exit;

Read more about SQL operations in Hive on the official Apache Hive site. (Source: http://tecadmin.net/install-apache-hive-on-centos-rhel/) (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) (http://stackoverflow.com/questions/17425492/hive-insert-query-like-sql)

More comprehensive tutorial about hive.

(http://www.tutorialspoint.com/hive/)

Monitoring hive

https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.0/bk_reference/content/hive-ports.html

Try more examples from Hortonworks:

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

The data are in https://github.com/cchantra/bigdata.github.io/raw/refs/heads/master/hive/hive_data_sample.tar.gz

  1. Download the data: the Master.csv and Batting.csv files from the dataset. You can also get the data from the link below or on the server.
wget https://github.com/cchantra/bigdata.github.io/raw/refs/heads/master/hive/hive_data_sample.tar.gz
tar xvf hive_data_sample.tar.gz
  2. Copy to HDFS
 hdfs dfs -mkdir /data_hive
 hdfs dfs -copyFromLocal Batting.csv /data_hive/Batting.csv
 hdfs dfs -copyFromLocal Master.csv /data_hive/Master.csv

  3. Create a table and load the data. Start hive and then enter the commands:
 hive> create table temp_batting (col_value STRING);
  • Load the CSV data into the table:
hive>  LOAD DATA INPATH 'hdfs://localhost:9000/data_hive/Batting.csv' OVERWRITE INTO TABLE temp_batting;
  4. Extract the data. First we create a new table called batting to hold the data, with three columns: player_id, year, and the number of runs. Create the table with:
hive> create table batting (player_id STRING, year INT, runs INT);
  5. Extract the data we want from temp_batting and copy it into batting. We do this with a regexp pattern: the three regexp_extract calls extract the player_id, year, and runs. Enter the following command (a sanity check follows the output below):
hive> insert overwrite table  batting  
select  
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) player_id,  
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) year,  
  regexp_extract(col_value, '^(?:([^,]*),?){9}', 1) run  
from temp_batting;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20230804170128_c4f2502d-576f-44be-9b77-4fda34e12f8d
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1689960917796_0019, Tracking URL = http://localhost:8088/proxy/application_1689960917796_0019/
Kill Command = /home/hadoop/hadoop/bin/hadoop job  -kill job_1689960917796_0019
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2023-08-04 17:01:38,400 Stage-1 map = 0%,  reduce = 0%
2023-08-04 17:01:44,741 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.48 sec
MapReduce Total cumulative CPU time: 4 seconds 480 msec
Ended Job = job_1689960917796_0019
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/batting/.hive-staging_hive_2023-08-04_17-01-28_633_1686320822013581688-1/-ext-10000
Loading data to table default.batting
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 4.48 sec   HDFS Read: 6398990 HDFS Write: 1653403 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 480 msec
OK
Time taken: 17.563 seconds
hive> 
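A quick sanity check that the extraction worked (optional); the columns should contain player IDs, years, and run counts:

hive> SELECT * FROM batting LIMIT 5;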

**Queries 6 and 7 may not work on a two-node setup.

  6. Create a query to filter the data (year, runs): for each year, what is the highest number of runs? This query groups all the records by year and then selects the maximum runs for each year.
 SELECT year, max(runs) FROM batting GROUP BY year;
  7. Create a query to filter the data (year, player, runs): which player had the top runs in each year? We need to go back to the batting records to get the player_id(s) and hence the player name(s). For a given year, we can use the maximum runs to find the player(s) for that year: take the previous query and join it with the batting records to get the final table.
 SELECT a.year, a.player_id, a.runs from batting a 
 JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b 
 ON (a.year = b.year AND a.runs = b.runs);

 
Query ID = chantana_20160918123836_82c5343a-b18e-4a47-ad90-b94bfe747a5d
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1473575783808_0012, Tracking URL = http://XTK40c:8088/proxy/application_1473575783808_0012/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1473575783808_0012
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-09-18 12:38:40,868 Stage-1 map = 0%,  reduce = 0%
2016-09-18 12:38:46,048 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.43 sec
2016-09-18 12:38:51,220 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.2 sec
MapReduce Total cumulative CPU time: 6 seconds 200 msec
Ended Job = job_1473575783808_0012
2016-09-18 12:38:54,522 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/chantana/chantana_20160918123836_82c5343a-b18e-4a47-ad90-b94bfe747a5d.log
2016-09-18 12:38:54 Starting to launch local task to process map join; maximum memory = 514850816
2016-09-18 12:38:55 Dump the side-table for tag: 0 with group count: 13583 into file: file:/tmp/chantana/824e0a2e-892f-4ffa-8719-f833f0c8acf5/hive_2016-09-18_12-38-36_274_4959386345773063113-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile00--.hashtable
2016-09-18 12:38:56 Uploaded 1 File to: file:/tmp/chantana/824e0a2e-892f-4ffa-8719-f833f0c8acf5/hive_2016-09-18_12-38-36_274_4959386345773063113-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile00--.hashtable (1562453 bytes)
2016-09-18 12:38:56 End of local task; Time Taken: 1.068 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1473575783808_0013, Tracking URL = http://XTK40c:8088/proxy/application_1473575783808_0013/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1473575783808_0013
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2016-09-18 12:39:02,074 Stage-4 map = 0%,  reduce = 0%
2016-09-18 12:39:06,314 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 2.88 sec
MapReduce Total cumulative CPU time: 2 seconds 880 msec
Ended Job = job_1473575783808_0013
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.2 sec   HDFS Read: 1660057 HDFS Write: 3159 SUCCESS
Stage-Stage-4: Map: 1   Cumulative CPU: 2.88 sec   HDFS Read: 8755 HDFS Write: 2833 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 80 msec
OK
1871 barnero01 66
1872 eggleda01 94
1873 barnero01 125
1874 mcveyca01 91
1875 barnero01 115
1876 barnero01 126
1877 orourji01 68
1878 highadi01 60
1879 jonesch01 85
1880 dalryab01 91
1881 gorege01 86
1882 gorege01 99
1883 stoveha01 110
1884 dunlafr01 160
1885 stoveha01 130
1886 kellyki01 155
1887 oneilti01 167
1888 pinknge01 134
1889 griffmi01 152
1889 stoveha01 152
1890 duffyhu01 161
1891 brownto01 177
1892 childcu01 136
1893 longhe01 149
1894 hamilbi01 192
...
2009 pujolal01 124
2010 pujolal01 115
2011 grandcu01 136
Time taken: 31.126 seconds, Fetched: 151 row(s)

Exercise:

  1. Find the player names with the top 10 runs.
  2. Find the top 10 states with the best runs.
  3. Who are the managers for the players found in 1)?

Exploring Twitter data:

(https://hortonworks.com/hadoop-tutorial/using-hive-data-analysis/)

  1. Download http://hortonassets.s3.amazonaws.com/tutorial/hive/Twitterdata.txt

  2. Copy it to HDFS (the LOAD command below expects it at /data_hive/Twitterdata.txt).

  3. Start the Hive command line and run:

DROP TABLE IF EXISTS TwitterExampletextexample;

CREATE TABLE TwitterExampletextexample(
        tweetId BIGINT, username STRING,
        txt STRING, CreatedAt STRING,
        profileLocation STRING,
        favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
    COMMENT 'This is the Twitter streaming data'
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
LOAD  DATA  INPATH  'hdfs://localhost:9000/data_hive/Twitterdata.txt' OVERWRITE INTO TABLE TwitterExampletextexample;

Test:

hive> select * from TwitterExampletextexample limit 10;
OK
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
Time taken: 1.317 seconds, Fetched: 10 row(s)
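As an optional check of the load, you can count the rows (this launches a MapReduce job; the exact count depends on your copy of Twitterdata.txt):

hive> SELECT COUNT(*) FROM TwitterExampletextexample;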

-Create a table using the RCFile format

The Record Columnar (RC) file format determines how relational tables are stored on a distributed cluster. With this format you get the advantages of columnar storage over a row-oriented record format.

DROP TABLE IF EXISTS TwitterExampleRCtable;

CREATE TABLE TwitterExampleRCtable(
    tweetId BIGINT, 
    username STRING,
    txt STRING, 
    CreatedAt STRING,
    profileLocation STRING COMMENT 'Location of user',
    favc BIGINT,
    retweet STRING,
    retcount BIGINT,
    followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS RCFILE;

Insert to RC Table:

hive> INSERT OVERWRITE TABLE TwitterExampleRCtable select * from  TwitterExampletextexample;

Query ID = chantana_20160923142221_d8d39f6f-de11-4a41-b4cb-af2deec1760e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1473575783808_0014, Tracking URL = http://XTK40c:8088/proxy/application_1473575783808_0014/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1473575783808_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-23 14:22:27,560 Stage-1 map = 0%,  reduce = 0%
2016-09-23 14:22:32,787 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.8 sec
MapReduce Total cumulative CPU time: 1 seconds 800 msec
Ended Job = job_1473575783808_0014
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/user/hive/warehouse/twitterexamplerctable/.hive-staging_hive_2016-09-23_14-22-21_650_3647274628905325844-1/-ext-10000
Loading data to table default.twitterexamplerctable
Table default.twitterexamplerctable stats: [numFiles=1, numRows=33, totalSize=2757, rawDataSize=2585]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.8 sec   HDFS Read: 11256 HDFS Write: 2844 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 800 msec
OK
Time taken: 12.478 seconds

-Select data :

hive> select * from TwitterExampleRCtable limit 10;
OK
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917312909647872      cloudfeednet    PLURALSIGHT BIG DATA AND REPORTING WITH MONGODB TUTORIAL-kEISO - - Arganza kEISO has released PluralSight’s... http://t.co/2nD7FxL909   Sat Apr 26 04:50:06 UTC 2014    Bay Area, CA    0       retweet     0       326
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
459917350704513024      anamarialeon2   you following me in my other twitter and me the hacked , please follow me I would be happy again @AustinMahone 169      Sat Apr 26 04:50:15 UTC 2014    Chicago 0       retweet 0       114
Time taken: 0.155 seconds, Fetched: 10 row(s)

-Find the top 10 profile locations that tweeted the most, from TwitterExampleRCtable:

SELECT profileLocation, COUNT(txt) as count1 
    FROM TwitterExampleRCtable 
    GROUP BY profileLocation 
    ORDER BY count1 desc limit 10;
...
OK
Chicago, IL     11
Chicago, IL.    8
Bay Area, CA    5
Chicago 5
NULL    0
Time taken: 40.631 seconds, Fetched: 5 row(s)

-Managed Table vs. External Table

Managed tables are created by default with CREATE TABLE statements, whereas external tables are used when you want your tables to point to data files in place.

-- Managed Table
DROP TABLE IF EXISTS ManagedExample;

CREATE TABLE ManagedExample(
    tweetId BIGINT, 
    username STRING,
    txt STRING, 
    CreatedAt STRING,
    profileLocation STRING,
    favc BIGINT,
    retweet STRING,
    retcount BIGINT,
    followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- External Table
DROP TABLE IF EXISTS ExternalExample;

CREATE EXTERNAL TABLE ExternalExample(
    tweetId BIGINT, 
    username STRING,
    txt STRING, 
    CreatedAt STRING,
    profileLocation STRING,
    favc BIGINT,
    retweet STRING,
    retcount BIGINT,
    followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/data/admin';

When you create a managed table, the file you load is moved into the Hive warehouse directory (hive.metastore.warehouse.dir), which means the data is controlled by Hive. The external table, by contrast, points to the /data/admin directory in which we put Twitterdata.txt. See the note after the DESCRIBE output below for what happens when each table is dropped.

describe formatted ManagedExample;

output

# col_name              data_type               comment

tweetid                 bigint
username                string
txt                     string
createdat               string
profilelocation         string
favc                    bigint
retweet                 string
retcount                bigint
followerscount          bigint

# Detailed Table Information
Database:               default
Owner:                  chantana
CreateTime:             Fri Sep 23 14:27:00 ICT 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
...
describe formatted ExternalExample;
OK
# col_name              data_type               comment

tweetid                 bigint
username                string
txt                     string
createdat               string
profilelocation         string
favc                    bigint
retweet                 string
retcount                bigint
followerscount          bigint

# Detailed Table Information
Database:               default
Owner:                  chantana
CreateTime:             Fri Sep 23 14:27:47 ICT 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://localhost:9000/data/admin
Table Type:             EXTERNAL_TABLE
Table Parameters:
        EXTERNAL                TRUE
        comment                 This is the Twitter streaming data
        transient_lastDdlTime   1474615667

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             t
        serialization.format    t
Time taken: 0.103 seconds, Fetched: 37 row(s)
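The practical difference shows up when the tables are dropped (a sketch; do not run this yet if you still need the tables):

-- managed: metadata and the warehouse data files are both removed
DROP TABLE ManagedExample;
-- external: only the metadata is removed; the files under /data/admin remain
DROP TABLE ExternalExample;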

-Partitioned Table

Partitions are horizontal slices of data that allow large datasets to be segmented into more manageable blocks. Here is sample syntax to create a partitioned table and load data into its partitions.

-- Create Partitioned Table
DROP TABLE IF EXISTS PARTITIONEDExample;

CREATE TABLE PARTITIONEDExample(
    tweetId BIGINT, 
    username STRING, 
    txt STRING,
    favc BIGINT,
    retweet STRING,
    retcount BIGINT,
    followerscount BIGINT
) 
COMMENT 'This is the Twitter streaming data' 
PARTITIONED BY(CreatedAt STRING, profileLocation STRING) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE;

-- Set dynamic partition mode
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Insert data with partition
INSERT OVERWRITE TABLE PARTITIONEDExample 
PARTITION (CreatedAt='Sat Apr 26 04:50:15 UTC 2014', profileLocation='Chicago') 
SELECT 
    tweetId,
    username,
    txt,
    favc,
    retweet,
    retcount,
    followerscount 
FROM twitterexampletextexample
WHERE profileLocation='Chicago' 
LIMIT 100;
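Once the insert finishes, you can list the partitions and query a single one; filtering on the partition columns lets Hive read only that slice (a sketch using the partition values above):

SHOW PARTITIONS PARTITIONEDExample;

SELECT username, txt
FROM PARTITIONEDExample
WHERE CreatedAt='Sat Apr 26 04:50:15 UTC 2014' AND profileLocation='Chicago'
LIMIT 5;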

-Tables with buckets

Bucketing is a technique that lets you cluster or segment large datasets to optimize query performance.

-- Create Bucketed Table
DROP TABLE IF EXISTS BucketingExample;

CREATE TABLE BucketingExample(
    tweetId BIGINT, 
    username STRING,
    txt STRING,
    CreatedAt STRING,
    favc BIGINT,
    retweet STRING,
    retcount BIGINT, 
    followerscount BIGINT
)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY(profileLocation STRING)
CLUSTERED BY(tweetId) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Set required properties
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Insert data
INSERT OVERWRITE TABLE BucketingExample 
PARTITION(profileLocation='Chicago')    
SELECT 
    tweetId,
    username,
    txt,
    CreatedAt,
    favc,
    retweet,
    retcount,
    followerscount       
FROM twitterexampletextexample
WHERE profileLocation='Chicago' 
LIMIT 100;
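With the data bucketed by tweetId, you can sample a single bucket instead of scanning the whole table (a sketch; TABLESAMPLE selects 1 of the 2 buckets defined above):

SELECT *
FROM BucketingExample TABLESAMPLE(BUCKET 1 OUT OF 2 ON tweetId)
WHERE profileLocation='Chicago'
LIMIT 5;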

Cleanup

# Clean up HDFS directories
hdfs dfs -rm -r /user/hive/warehouse/*
hdfs dfs -rm -r /data/admin/*
hdfs dfs -rm -r /data_hive/*
hdfs dfs -rm -r /user1/hive/warehouse/*
hdfs dfs -rm -r /demo.txt
hdfs dfs -rm -r /tmp/*

# Clean up local temporary files
rm -rf /home/hadoop/hive/tmp/*
rm -rf /tmp/hive/

# Stop Hive metastore
ps aux | grep hive
kill -9 <metastore_process_id>  # Replace with actual process ID

# Clean up metastore (optional - be careful with this!)
mysql -u root -p
mysql> DROP DATABASE IF EXISTS metastore;
mysql> exit;

Reference: hive command

https://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/

https://www.oreilly.com/library/view/programming-hive/9781449326944/ch04.html
