HIVE DML
- `show functions;`: list all available functions
- `[desc|describe] function substr;`: show the description of a given function
- `[desc|describe] formatted <TABLE>`: tells you whether the table is a managed_table or an external_table
The LOAD command does not check the data against the table schema; LOCAL and OVERWRITE are optional.
-- without LOCAL, the input path is resolved on HDFS
-- without OVERWRITE, the data files are appended to the existing data
LOAD DATA [LOCAL] INPATH './examples/files/kv2.txt' [OVERWRITE] INTO TABLE table_name PARTITION (ds='2008-08-15');
-- the INSERT keyword can be omitted
-- without LOCAL, the output directory is on HDFS
INSERT OVERWRITE [LOCAL] DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
- Enable dynamic partitioning in Hive and insert into a dynamically partitioned table:
set hive.exec.dynamic.partition=true;
insert overwrite table table_name partition(part_desc) select column from srv_table_name;
LanguageManual Joins: the manual on how to write joins and how to optimize them.
- Using regexes in WHERE: `rlike` and `regexp` use Java regular expressions. To check whether a field consists only of digits: `where field rlike '^\\d+$'`
- Extract a run of consecutive digits with a regex, e.g. `regexp_extract(field, '[0-9]*', 0)`
- Replace via a regex: `regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)`
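A small hedged example putting the three together (table and column names are assumptions):

```sql
-- Keep only rows whose code contains digits, pull out the first run of digits,
-- and strip everything that is not a digit ('abc123x' -> '123', '123').
select regexp_extract(code, '[0-9]+', 0)  as first_digits,
       regexp_replace(code, '[^0-9]', '') as digits_only
from some_table
where code rlike '[0-9]';
```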
Hive supports three types of conditional functions. These functions are listed below:
- `IF( Test Condition, True Value, False Value )`: evaluates the "Test Condition"; if it is true, the "True Value" is returned, otherwise the "False Value" is returned. Example: `IF(1=1, 'working', 'not working')` returns 'working'.
- `COALESCE( value1, value2, ... )`: returns the first non-NULL value from the list of values; if all values in the list are NULL, it returns NULL. Example: `COALESCE(NULL, NULL, 5, NULL, 4)` returns 5. `COALESCE(column, CAST(0 AS BIGINT))` and `COALESCE(column, 0L)` are two ways to default a BIGINT column to 0.
- `CASE` statement: the syntax is `CASE [ expression ] WHEN condition1 THEN result1 WHEN condition2 THEN result2 ... WHEN conditionN THEN resultN ELSE result END`.
  Here expression is optional; it is the value you compare against the list of conditions (i.e. condition1, condition2, ..., conditionN).
  All the conditions must be of the same datatype. Conditions are evaluated in the order listed; once a condition is found to be true, the CASE statement returns the result and does not evaluate the remaining conditions.
  All the results must be of the same datatype; the matching result is what is returned once a condition is found to be true.
  If no condition is true, the CASE statement returns the value of the ELSE clause; if the ELSE clause is omitted and no condition is true, it returns NULL.
  Example:
  CASE Fruit WHEN 'APPLE' THEN 'The owner is APPLE' WHEN 'ORANGE' THEN 'The owner is ORANGE' ELSE 'It is another Fruit' END
  The other form of CASE is:
  CASE WHEN Fruit = 'APPLE' THEN 'The owner is APPLE' WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE' ELSE 'It is another Fruit' END
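A short hedged query using the three functions together (all table and column names are assumptions):

```sql
-- Flag missing values, default a nullable BIGINT amount to 0, and bucket each row with CASE.
select id,
       if(amount is null, 'missing', 'present') as amount_flag,
       coalesce(amount, 0L)                     as amount_or_zero,
       case
         when amount >= 100 then 'large'
         when amount >= 1   then 'small'
         else 'zero or missing'
       end                                      as amount_bucket
from some_table;
```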
- Converting a string to a date in Hive:
  - Simplest: `cast('2017-01-01' as date)`, which only accepts the yyyy-MM-dd format
  - Any format, via a unix timestamp: `from_unixtime(unix_timestamp('20170101', 'yyyyMMdd'), 'yyyy-MM-dd')`
- Computing time differences:
  - `datediff('2017-01-01', '2017-01-02')`: first date minus second date; by default the arguments are expected in the 'yyyy-MM-dd' format
  - By combining the `from_unixtime` and `unix_timestamp` functions you can compute time differences for timestamps in any format
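A hedged example combining the two points above (column names are assumptions): convert 'yyyyMMdd' strings to dates and take their difference in days.

```sql
-- start_dt / end_dt are stored as 'yyyyMMdd' strings.
select from_unixtime(unix_timestamp(start_dt, 'yyyyMMdd'), 'yyyy-MM-dd') as start_day,
       datediff(from_unixtime(unix_timestamp(end_dt, 'yyyyMMdd'), 'yyyy-MM-dd'),
                from_unixtime(unix_timestamp(start_dt, 'yyyyMMdd'), 'yyyy-MM-dd')) as day_diff
from some_table;
```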
- When joining two large tables in Hive and the join key is skewed, bucketing the tables on the join key can improve performance by roughly 3x (see the sketch after this list).
- With HiveServer2 started via Spark's `start-thriftserver.sh` on a Spark version around 1.5.1, `insert into table ... values` may fail; you can use `insert into table <tablename> select * from (select 'col1', 'col2') t;` instead to insert the data.
- After writing a Hive partitioned table's data directly to HDFS, the Hive client cannot find the corresponding partitions or data, because the Hive metastore has no matching partition metadata. Run `msck repair table <tablename>` in the Hive client to let Hive discover the partitions automatically, or update the metadata directly with HQL: `alter table <tablename> add partition (partition_column) location '/hadoop-file-system-location';`
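A hedged sketch of the bucketing idea from the first tip above (the table names, columns, and bucket count are assumptions, not from the source):

```sql
-- Bucket both large tables on the join key so the join can proceed bucket by bucket.
create table orders_bucketed (order_id bigint, user_id bigint, amount double)
clustered by (user_id) into 64 buckets
stored as orc;

create table users_bucketed (user_id bigint, name string)
clustered by (user_id) into 64 buckets
stored as orc;

set hive.optimize.bucketmapjoin=true;

select o.order_id, u.name
from orders_bucketed o
join users_bucketed u on o.user_id = u.user_id;
```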
- Drop a specific partition: `alter table <TABLE_NAME> drop partition(ds='20180101')`
- Drop all partitions: `alter table <TABLE_NAME> drop partition(ds<>'')`
- Drop a range of partitions: `alter table <TABLE_NAME> drop partition(ds>'20180414', ds<'20180416')`
insert overwrite table table_name partition(part_desc)
select *
from src_table

An insert like the one above may cause a GC Overhead or Heap space error. The reason is that records inserted into the same partition require the fields in the SELECT statement to be the same; if they are inconsistent, Hive has to do a lot of checking during the insert. To avoid this, make sure that records belonging to the same partition end up in the same reducer. The SQL can be rewritten as below; the number of shuffle partitions should preferably be greater than or equal to the number of dynamic partitions. After adding DISTRIBUTE BY, the data is split by the partition column, and the last column in the SELECT must be the partition column.
set hive.exec.dynamic.partition.mode=nonstrict;
set spark.sql.shuffle.partitions=2000;
insert overwrite table partition(part_desc)
select *
from src_table
DISTRIBUTE BY part_desc
`distinct` puts all the data into a single reducer for deduplication, which causes data skew, while `group by` produces multiple reducers and sends rows with the same grouping key to the same reducer, avoiding the skew; in addition you can set the number of reducers with `set mapred.reduce.tasks=100`. Hive can compute over very large datasets, but not over heavily skewed ones.
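A hedged example of that rewrite (table and column names are assumptions): count distinct users with a GROUP BY subquery instead of COUNT(DISTINCT ...).

```sql
set mapred.reduce.tasks=100;

-- The inner GROUP BY spreads deduplication across many reducers;
-- a plain COUNT(DISTINCT user_id) would funnel all rows through one reducer.
select count(1) as distinct_users
from (
  select user_id
  from events
  group by user_id
) t;
```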
- `order by`: performs a global sort of the input, so there is only one reducer (multiple reducers cannot guarantee a global order); with a single reducer, the computation takes a long time when the input is large. A partial `order by` can often be achieved with `distribute by` plus `sort by`.
- `sort by`: not a global sort; it sorts the data before it enters each reducer. If you sort with `sort by` and set `mapred.reduce.tasks > 1`, it only guarantees that each reducer's output is sorted, not that the output is globally sorted. Unlike `order by`, it is not affected by the `hive.mapred.mode` property; the data is only guaranteed to be sorted by the specified columns within each reducer. With `sort by` you can choose the number of reducers (via `set mapred.reduce.tasks=n`) and then merge-sort the outputs to obtain the complete result.
- `distribute by`: controls how the map side splits data across the reducers. Hive distributes rows to the reducers based on the `distribute by` columns, using a hash by default. `sort by` produces one sorted file per reducer; in some cases you need to control which reducer a particular row goes to, usually for a subsequent aggregation, and `distribute by` does exactly that, which is why it is often combined with `sort by`.
- `cluster by`: combines the functionality of `distribute by` with that of `sort by`, but the sort order is fixed; you cannot specify ASC or DESC.

Typical scenarios for `distribute by` and `sort by`:
- Map output files of uneven size
- Reduce output files of uneven size
- Too many small files
- Oversized files
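A hedged example of the distribute by + sort by pattern (names are assumptions): route each user's rows to one reducer and sort them there, instead of a single-reducer global order by.

```sql
set mapred.reduce.tasks=10;

select user_id, event_time
from events
distribute by user_id
sort by user_id, event_time;
```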
-- number of whole months between two dates
SELECT CAST(MONTHS_BETWEEN(date1, date2) AS INT) as numberOfMonthsBetweenDates
FROM table
-- cust_1, month_1, f1, f2, f3
-- becomes
-- cust_1, month_1, f1
-- cust_1, month_1, f2
-- cust_1, month_1, f3
select cust, month, value
from table lateral view explode(array(col3,col4,col5)) orig_table_alias as value;
-- merge the values of a sparse table together by its unique key
select resource_id
, case when collect_set(quarter_1)[0] is null then 0 else collect_set(quarter_1)[0] end as quarter_1_spends
, case when collect_set(quarter_2)[0] is null then 0 else collect_set(quarter_2)[0] end as quarter_2_spends
, case when collect_set(quarter_3)[0] is null then 0 else collect_set(quarter_3)[0] end as quarter_3_spends
, case when collect_set(quarter_4)[0] is null then 0 else collect_set(quarter_4)[0] end as quarter_4_spends
from (
-- build a sparse table containing many NULLs
select resource_id
, case when quarter = 'q1' then amount end as quarter_1
, case when quarter = 'q2' then amount end as quarter_2
, case when quarter = 'q3' then amount end as quarter_3
, case when quarter = 'q4' then amount end as quarter_4
from billing_info
) tbl1
group by resource_id;
-- expand each (StartDate, EndDate) pair into one row per day
select t.AccountNumber
, date_add(t.StartDate, pe.i) as Day
from Table1 t
lateral view
posexplode(split(space(datediff(t.EndDate, t.StartDate)), ' ')) pe as i, x
-- select every column except exclude_field
set hive.support.quoted.identifiers=none;
SELECT `(exclude_field)?+.+` FROM <TABLE_NAME>;
Because a semicolon ends a statement in beeline, if `concat` or `split` uses a semicolon, it has to be escaped as the octal code `\073`:
select split(col, ';') from table;    -- the ';' terminates the statement prematurely
select split(col, '\073') from table; -- works
Appending `distribute by rand()` to the end of the SQL statement puts the rows into the configured partitions at random; besides avoiding the creation of too many small files, it also makes the data more evenly distributed.
insert overwrite table table_name
select *
from table_name_1
distribute by rand();
LEFT SEMI JOIN keeps only the records of the left (main) table; you cannot select columns from the right table, and the left table's records are not duplicated when the right table has multiple matching rows. See the example below.
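A hedged example (table and column names are assumptions):

```sql
-- Keep only orders whose customer_id exists in customers; no columns from
-- customers can be selected, and multiple matches do not duplicate order rows.
select o.order_id, o.customer_id
from orders o
left semi join customers c on (o.customer_id = c.customer_id);
```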