HIVE DML (Data Manipulation Language)

Showing system information

  • show functions;: list all available functions
  • [desc|describe] function substr;: show the description of a function
  • [desc|describe] formatted <TABLE>: tells you whether the table is a managed_table or an external_table

Importing data with the load command

The load command does not check the data against the table schema; LOCAL and OVERWRITE are optional keywords.

-- if LOCAL is omitted, the input path is looked up in HDFS
-- if OVERWRITE is omitted, the data files are appended to the existing data set
LOAD DATA [LOCAL] INPATH './examples/files/kv2.txt' [OVERWRITE] INTO TABLE table_name PARTITION (ds='2008-08-15');

Writing data into files or tables with the insert command

-- the INSERT keyword can be omitted
-- if LOCAL is omitted, the output is written to HDFS
INSERT OVERWRITE [LOCAL] DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

Multi-table INSERT

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

Settings

  • Enable dynamic partitioning in Hive and insert into a dynamically partitioned table

    set hive.exec.dynamic.partition=true;
    insert overwrite table table_name partition(part_desc) select column from srv_table_name;

Join

LanguageManual Joins: the manual on how to write joins and how to optimize them

Common functionality

Regular expressions

  • Using regexes in a where clause: rlike / regexp use Java regular expressions (see the sketch after this list)
  • Check whether a field consists only of digits: where field rlike '^\\d+$'
  • Extract a run of digits with a regex, e.g. regexp_extract(field, '[0-9]*', 0)
  • Replace matches of a regex: regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
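
A minimal sketch of these functions, assuming a hypothetical table t with a string column field:

-- keep only the rows whose field is made up entirely of digits
select field
from t
where field rlike '^\\d+$';

-- group 0 of regexp_extract is the whole match: returns '20230105'
-- regexp_replace rewrites every match: returns 'a,b,c'
select regexp_extract('order-20230105', '[0-9]+', 0)
     , regexp_replace('a;b;c', ';', ',')
from t
limit 1;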

Conditionals

Differences between IF, COALESCE, and CASE WHEN

Hive supports three types of conditional functions, listed below (a combined example follows the CASE forms):

  • IF( Test Condition, True Value, False Value ): The IF condition evaluates the "Test Condition" and if the "Test Condition" is true, then it returns the "True Value". Otherwise, it returns the False Value. Example: IF(1=1, 'working', 'not working') returns 'working'

  • COALESCE( value1,value2,... ): The COALESCE function returns the first non-NULL value from the list of values. If all the values in the list are NULL, it returns NULL. Example: COALESCE(NULL,NULL,5,NULL,4) returns 5. COALESCE(column, CAST(0 AS BIGINT)) and COALESCE(column, 0L) are equivalent ways to default a BIGINT column to 0.

  • CASE Statement: The syntax for the case statement is:

    CASE  [ expression ]
      WHEN condition1 THEN result1
      WHEN condition2 THEN result2
      ...
      WHEN conditionn THEN resultn
      ELSE result
    END

    Here, expression is optional. It is the value that you compare against the list of conditions (i.e. condition1, condition2, ..., conditionN).

    All the conditions must be of the same datatype. Conditions are evaluated in the order listed. Once a condition is found to be true, the CASE statement returns the corresponding result and does not evaluate the remaining conditions.

    All the results must be of the same datatype. The result is the value returned once a condition is found to be true.

    If no condition is found to be true, the CASE statement returns the value in the ELSE clause. If the ELSE clause is omitted and no condition is true, the CASE statement returns NULL.

    Example:

    CASE Fruit
      WHEN 'APPLE' THEN 'The owner is APPLE'
      WHEN 'ORANGE' THEN 'The owner is ORANGE'
      ELSE 'It is another Fruit'
    END

    The other form of CASE is:

    CASE
      WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
      WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
      ELSE 'It is another Fruit'
    END
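
A combined sketch of the three conditional functions, assuming a hypothetical table orders with a nullable BIGINT column amount:

-- IF gives a simple two-way branch, COALESCE defaults NULL to 0,
-- and CASE handles the multi-way branch
select if(amount > 100, 'large', 'small') as size_flag
     , coalesce(amount, 0L)               as amount_or_zero
     , case
         when amount is null then 'missing'
         when amount > 100   then 'large'
         else 'small'
       end                                as size_case
from orders;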

Dates and times

  • Converting a string to a date in Hive

    • Simplest: cast('2017-01-01' as date), which only works for the yyyy-MM-dd format
    • Any format can be converted via a unix timestamp: from_unixtime(unix_timestamp('20170101', 'yyyyMMdd'), 'yyyy-MM-dd')
  • Computing time differences

    • datediff('2017-01-01', '2017-01-02') returns the first date minus the second; the arguments expect the default 'yyyy-MM-dd' format
    • Combining the from_unixtime and unix_timestamp functions, time differences can be computed between timestamps in any format (see the sketch after this list)
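
A small sketch combining the two functions to get the day difference between dates stored in the non-default yyyyMMdd format (the table t is hypothetical; the result here is 4):

select datediff(from_unixtime(unix_timestamp('20170105', 'yyyyMMdd'), 'yyyy-MM-dd'),
                from_unixtime(unix_timestamp('20170101', 'yyyyMMdd'), 'yyyy-MM-dd')) as day_diff
from t
limit 1;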

Higher-order functions

FAQ

  • In Hive, when two large tables are joined and the join key is skewed, bucketing the tables on the join key improves performance by roughly 3x (see the sketch after this list)
  • In hiveserver2, if it is started directly with Spark's start-thriftserver.sh and the Spark version is around 1.5.1, and insert into table <tablename> values fails, the insert can be completed with insert into table <tablename> select * from (select 'col1', 'col2') t; instead
  • After writing data for a Hive partitioned table directly into HDFS, the corresponding partitions and data cannot be found from the Hive client, because the Hive metastore holds no matching partition metadata. Run msck repair table <tablename> in the Hive client so that Hive discovers the partitions automatically, or update the metadata directly with HQL: alter table <tablename> add partition (partition_column) location '/hadoop-file-system-location';
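
A sketch of the bucketing recipe from the first item above; the table names, column names, and bucket count are hypothetical:

-- bucket both large tables on the skewed join key
create table orders_bucketed (user_id bigint, amount double)
clustered by (user_id) into 64 buckets;

create table users_bucketed (user_id bigint, name string)
clustered by (user_id) into 64 buckets;

-- let Hive take advantage of the buckets when joining on that key
set hive.optimize.bucketmapjoin=true;
select o.user_id, o.amount, u.name
from orders_bucketed o
join users_bucketed u on o.user_id = u.user_id;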

Dropping partitions

  • Drop a specific partition: alter table <TABLE_NAME> drop partition(ds='20180101')
  • Drop all partitions: alter table <TABLE_NAME> drop partition(ds<>'')
  • Drop a range of partitions: alter table <TABLE_NAME> drop partition(ds>'20180414',ds<'20180416')

GC errors when inserting into dynamic partitions

insert overwrite table table_name partition(part_desc)
select *
from src_table;

This can cause a GC overhead or heap space error. The reason is that records inserted into the same partition expect the same partition field values from the select; when such records are scattered across reducers, Hive has to do a great deal of checking and keep many partition writers open during the insert. To avoid this, make sure records belonging to the same partition land in the same reducer by rewriting the SQL as below: the number of shuffle partitions should preferably be greater than or equal to the number of dynamic partitions, adding DISTRIBUTE BY shards the rows by the partition column, and the last column of the select must be the partition column.

set hive.exec.dynamic.partition.mode=nonstrict;
set spark.sql.shuffle.partitions=2000;
insert overwrite table table_name partition(part_desc)
select *
from src_table
DISTRIBUTE BY part_desc;

Use group by instead of distinct

distinct sends all the data to a single reducer for deduplication, which causes data skew, whereas group by spawns multiple reducers and routes rows with the same grouping value to the same reducer, avoiding the skew; the number of reducers can also be set with set mapred.reduce.tasks=100, as in the sketch below. Hive can process very large datasets, but not heavily skewed ones.
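
A sketch of the rewrite, assuming a hypothetical table logs with a column user_id:

-- count(distinct ...) funnels every row through a single reducer
select count(distinct user_id) from logs;

-- the group by form spreads the deduplication across many reducers
set mapred.reduce.tasks=100;
select count(*)
from (select user_id from logs group by user_id) t;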

order by, distribute by, sort by, cluster by

  • order by: performs a global sort of the input, so there is only one reducer (multiple reducers cannot guarantee a global order); with a single reducer, large inputs take a long time to sort. Many partial ordering needs can be met with distribute by plus sort by instead
  • sort by: not a global sort; the data is sorted before it enters each reducer, so with mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is ordered, not that the overall output is. Unlike order by, it is not affected by the hive.mapred.mode property, and it only guarantees ordering by the specified columns within a single reducer. With sort by you can choose the number of reducers (set mapred.reduce.tasks=n) and then merge-sort the per-reducer outputs to obtain the complete result
  • distribute by: controls how the map side splits data among reducers. Hive distributes rows to the configured reducers based on the distribute by columns, using a hash by default. sort by produces one sorted file per reducer, and in some cases you need to control which reducer a particular row goes to, usually for a subsequent aggregation; distribute by does exactly that, so it is often combined with sort by (a combined sketch follows this list)
  • cluster by: has the functionality of distribute by plus that of sort by, but the sort order is fixed and cannot be specified as ASC or DESC
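
A combined sketch of distribute by and sort by, assuming a hypothetical table logs with columns user_id and ts:

-- distribute by: rows with the same user_id go to the same reducer
-- sort by: each reducer's output is ordered by user_id, ts
set mapred.reduce.tasks=10;
select user_id, ts
from logs
distribute by user_id
sort by user_id, ts;

cluster by user_id would be shorthand for distribute by user_id sort by user_id, with no control over the sort direction.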

Use cases for distribute by and sort by

  • Map output files are uneven in size
  • Reduce output files are uneven in size
  • Too many small files
  • Oversized files

Getting the number of months between two dates

SELECT CAST(MONTHS_BETWEEN(date1, date2) AS INT) as numberOfMonthsBetweenDates
FROM table

Converting between rows and columns

Columns to rows

-- cust_1, month_1, f1, f2, f3
-- becomes
-- cust_1, month_1, f1
-- cust_1, month_1, f2
-- cust_1, month_1, f3

select cust, month, value
from table lateral view explode(array(col3,col4,col5)) orig_table_alias as value;

Rows to columns

Rows-to-columns reference link


-- merge the sparse table's values together via the unique key
select resource_id
     , case when collect_set(quarter_1)[0] is null then 0 else collect_set(quarter_1)[0] end as quarter_1_spends
     , case when collect_set(quarter_2)[0] is null then 0 else collect_set(quarter_2)[0] end as quarter_2_spends
     , case when collect_set(quarter_3)[0] is null then 0 else collect_set(quarter_3)[0] end as quarter_3_spends
     , case when collect_set(quarter_4)[0] is null then 0 else collect_set(quarter_4)[0] end as quarter_4_spends
from (
  -- build a sparse table with many nulls
  select resource_id
       , case when quarter = 'q1' then amount end as quarter_1
       , case when quarter = 'q2' then amount end as quarter_2
       , case when quarter = 'q3' then amount end as quarter_3
       , case when quarter = 'q4' then amount end as quarter_4
  from billing_info
) tbl1
group by resource_id;

How to generate a date sequence

-- space(n) yields a string of n blanks; splitting it on ' ' gives n+1 elements,
-- so posexplode produces offsets 0..datediff, and date_add turns them into consecutive days
select t.AccountNumber
     , date_add(t.StartDate, pe.i) as Day
from Table1 t
     lateral view
     posexplode(split(space(datediff(t.EndDate, t.StartDate)), ' ')) pe as i, x;

Selecting all columns except a few

set hive.support.quoted.identifiers=none;
SELECT `(exclude_field)?+.+` FROM <TABLE_NAME>;

Handling semicolons in concat, split, etc. in beeline

Because a semicolon terminates a statement in beeline, if concat or split needs a literal semicolon it must be written in its octal form \073

-- fails in beeline: the ';' inside the string literal is taken as the end of the statement
select split(col, ';') from table;
-- works: \073 is the octal escape for ';'
select split(col, '\073') from table;

Avoiding too many small files

Be explicit about the number of output files via spark.sql.shuffle.partitions

Appending distribute by rand() to the end of the SQL statement assigns rows randomly to the configured partitions; besides avoiding too many small files, it also spreads the data more evenly.

insert overwrite table table_name
select *
from table_name_1
distribute by rand();

Differences between left semi join, inner join, and left outer join

A left semi join only keeps records from the left (main) table: columns from the right table cannot be selected, and a left-table record is not duplicated when the right table has multiple matching rows, unlike an inner join or a left outer join; see the sketch below.
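
A sketch of the difference, assuming hypothetical tables users(id, name) and orders(user_id, amount):

-- left semi join: at most one output row per users row, and orders columns cannot be selected
select u.id, u.name
from users u
left semi join orders o on (u.id = o.user_id);

-- inner join: repeats a users row once per matching order and exposes orders columns
select u.id, u.name, o.amount
from users u
join orders o on (u.id = o.user_id);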

