Hive Usage Summary - yingziaiai/SetupEnv GitHub Wiki

After installing Hive, configure hive-site.xml so that the metadata (the metastore) is stored in a MySQL database. Day-to-day usage then falls into the following categories:

1. Common command-line commands such as quit, exit, and set: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Commands
2. Common file formats such as ORC, text file, and Parquet: https://cwiki.apache.org/confluence/display/Hive/FileFormats
3. Common data types: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
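As a rough illustration of these three categories, the sketch below runs a set command and creates tables in different file formats with a mix of data types. The table and column names are hypothetical, and it assumes a reasonably recent Hive release in which DATE, DECIMAL(p,s), and STORED AS PARQUET are available.

```sql
-- Session command: show column names in query output
SET hive.cli.print.header=true;
-- Print the current value of the same property
SET hive.cli.print.header;

-- Hypothetical table mixing primitive and complex types, stored as ORC
CREATE TABLE IF NOT EXISTS demo_types (
  id      INT,
  name    STRING,
  salary  DECIMAL(10,2),
  hired   DATE,
  updated TIMESTAMP,
  tags    ARRAY<STRING>,
  attrs   MAP<STRING,STRING>
)
STORED AS ORC;

-- The same idea with the other two formats mentioned above
CREATE TABLE IF NOT EXISTS demo_text (id INT, name STRING) STORED AS TEXTFILE;
CREATE TABLE IF NOT EXISTS demo_parquet (id INT, name STRING) STORED AS PARQUET;

-- Inspect the storage format and column types Hive recorded
DESCRIBE FORMATTED demo_types;
```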

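Next come the actual operations, which divide into DDL and DML: http://blog.csdn.net/young2simple/article/details/49721899

DDL (data definition language): a Hive database is essentially just a directory on HDFS, and each table in the database is stored as a subdirectory of that database directory. DDL statements fall into these groups: create, drop, alter, truncate, show, and describe. It is also worth getting familiar with the commonly used analytic functions and window functions.

Tables come in several kinds: internal (managed) tables, external tables, partitioned tables, and bucketed tables (http://blog.csdn.net/andrewgb/article/details/47359673). Partitioned tables can in turn be internal or external. There are three ways to create a table: a plain CREATE TABLE, CREATE TABLE ... AS SELECT, and CREATE TABLE ... LIKE.

There are 6 ways to load data into a table. The exact syntax depends on the table type, and each way has scenarios it suits best.

There are 3 ways to export data from a table: insert (INSERT OVERWRITE [LOCAL] DIRECTORY), hive -e|-f, and hdfs dfs -get. One point to watch is whether a delimiter can be specified when exporting to HDFS versus to the local file system. Depending on the version there are also newer options such as export and import, but these only move data between HDFS locations.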
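The table kinds, the three creation methods, and one load and one export path can be sketched as follows. All table names, column names, and paths (such as /tmp/demo/emp.txt) are hypothetical; the local input file is assumed to exist and to be tab-delimited.

```sql
-- Managed (internal) table: dropping it also deletes the data files
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: dropping it removes only the metadata
CREATE EXTERNAL TABLE IF NOT EXISTS demo_emp_ext (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/tmp/demo/emp_ext';

-- Partitioned table: each dt value becomes a subdirectory of the table
CREATE TABLE IF NOT EXISTS demo_emp_part (
  id INT, name STRING, salary DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Bucketed table: rows are hashed on id into 4 files
CREATE TABLE IF NOT EXISTS demo_emp_bucket (
  id INT, name STRING, salary DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Load from a local file, then from a query into one partition
LOAD DATA LOCAL INPATH '/tmp/demo/emp.txt' INTO TABLE demo_emp;
INSERT OVERWRITE TABLE demo_emp_part PARTITION (dt='2017-01-01')
SELECT id, name, salary FROM demo_emp;

-- The other two creation methods: AS SELECT copies data, LIKE copies only the schema
CREATE TABLE IF NOT EXISTS demo_emp_copy AS SELECT id, name FROM demo_emp;
CREATE TABLE IF NOT EXISTS demo_emp_empty LIKE demo_emp;

-- Export to a local directory with an explicit delimiter
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/demo/emp_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT id, name, salary FROM demo_emp;
```

Whether the ROW FORMAT clause is accepted on a directory export depends on the Hive version, which is the delimiter caveat noted above.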

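Queries use 5 groups of operators and clauses: limit, where, and distinct; comparisons such as between ... and, <=, is null, and like for strings; the aggregate functions count, sum, avg, max, and min, including the difference between count(1) and count(*); group by and having; and join (equi-join, left join, right join, full join).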
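A minimal sketch of the filtering and aggregation clauses listed above, run against a hypothetical demo_emp table (the table and its columns are assumptions for illustration only):

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- where with between ... and, is not null and like, capped by limit
SELECT id, name, salary
FROM demo_emp
WHERE salary BETWEEN 3000 AND 8000
  AND dept_id IS NOT NULL
  AND name LIKE 'A%'
LIMIT 10;

-- distinct values of one column
SELECT DISTINCT dept_id FROM demo_emp;

-- count(*) and count(1) count every row; count(salary) skips rows where salary is NULL
SELECT COUNT(*), COUNT(1), COUNT(salary),
       SUM(salary), AVG(salary), MAX(salary), MIN(salary)
FROM demo_emp;

-- group by with having: keep only departments whose average salary exceeds 5000
SELECT dept_id, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM demo_emp
GROUP BY dept_id
HAVING AVG(salary) > 5000;
```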

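http://www.cnblogs.com/CareySon/p/DifferenceBetweenCountStarAndCount1.html A further point is knowing when a query runs MapReduce and when it does not. Other topics to understand: group by and sort by; partitioning versus grouping; Hive optimization settings; distinct versus group by; and the several kinds of sorting in Hive: order by (always a global sort through one reducer, so configuring more reducers has no effect on it), sort by, distribute by, and cluster by.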

With a single reducer, sort by gives the same result as order by: http://www.tuicool.com/articles/vEVZRz2
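The four sorting clauses can be compared side by side as in the sketch below. The demo_emp table is hypothetical, and mapreduce.job.reduces is the Hadoop 2 property name for the reducer count (an assumption about the cluster; older setups use mapred.reduce.tasks).

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Ask for three reducers so the difference between the clauses is visible
SET mapreduce.job.reduces=3;

-- order by: one globally sorted result, produced by a single reducer
SELECT id, name, salary FROM demo_emp ORDER BY salary DESC LIMIT 100;

-- sort by: each reducer sorts only its own share of the rows
SELECT id, name, salary FROM demo_emp SORT BY salary DESC;

-- distribute by: rows with the same dept_id go to the same reducer,
-- and sort by then orders the rows inside each reducer
SELECT id, name, dept_id, salary
FROM demo_emp
DISTRIBUTE BY dept_id
SORT BY dept_id, salary DESC;

-- cluster by dept_id is shorthand for distribute by dept_id sort by dept_id
SELECT id, name, dept_id, salary FROM demo_emp CLUSTER BY dept_id;
```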

http://blog.sina.com.cn/s/blog_6676d74d0102vm2c.html

http://blog.csdn.net/szstephenzhou/article/details/8446481

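collection items terminated by: the row-format clause that sets the delimiter between the elements of an array, map, or struct column.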

http://www.cnblogs.com/justff/p/3453678.html
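A small sketch of collection items terminated by in context. The table name, columns, and delimiters are hypothetical choices for illustration.

```sql
CREATE TABLE IF NOT EXISTS demo_people (
  name    STRING,
  hobbies ARRAY<STRING>,
  scores  MAP<STRING,INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

-- A matching input line, with a tab between the three fields:
-- alice<TAB>reading,chess<TAB>math:90,art:85

-- Index into the array and look up a map key
SELECT name, hobbies[0], scores['math'] FROM demo_people;
```

Here the comma separates both the array elements and the map entries, while the colon separates each map key from its value.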

join:

http://blog.csdn.net/shadowyelling/article/details/7684714
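The four join types mentioned earlier look like this on two hypothetical tables, demo_emp and demo_dept (names and columns are assumptions for illustration):

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE IF NOT EXISTS demo_dept (
  dept_id INT, dept_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Equi-join: only employees whose dept_id matches a department
SELECT e.name, d.dept_name
FROM demo_emp e
JOIN demo_dept d ON e.dept_id = d.dept_id;

-- Left join: every employee, with NULL dept_name when there is no match
SELECT e.name, d.dept_name
FROM demo_emp e
LEFT OUTER JOIN demo_dept d ON e.dept_id = d.dept_id;

-- Right join: every department, with NULL name when it has no employees
SELECT e.name, d.dept_name
FROM demo_emp e
RIGHT OUTER JOIN demo_dept d ON e.dept_id = d.dept_id;

-- Full join: all rows from both sides
SELECT e.name, d.dept_name
FROM demo_emp e
FULL OUTER JOIN demo_dept d ON e.dept_id = d.dept_id;
```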

Next come user-defined functions (UDFs).
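On the HiveQL side, using a UDF usually means registering a compiled class and then calling it like a built-in function. In the sketch below the jar path, the class name com.example.hive.ToUpperUDF, and the function name to_upper_demo are all hypothetical; the jar is assumed to contain a UDF class written and built separately.

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Make the (hypothetical) jar available to the session
ADD JAR /tmp/demo/my-udfs.jar;

-- Bind a function name to the (hypothetical) UDF class for this session only
CREATE TEMPORARY FUNCTION to_upper_demo AS 'com.example.hive.ToUpperUDF';

-- Call it like any built-in function
SELECT to_upper_demo(name) FROM demo_emp LIMIT 5;

-- Remove the binding when done
DROP TEMPORARY FUNCTION IF EXISTS to_upper_demo;
```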

Another way to use Hive is HiveServer2. Run it in the foreground, or append & to run it in the background: bin/hive --service hiveserver2 & or bin/hiveserver2 &

Then connect to the server with beeline: !connect jdbc:hive2://***:10000 username password org.apache.hive.jdbc.HiveDriver, where username and password are the login credentials for the host.

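Why do some Hive queries run as a fetch task while others run as MapReduce? This is governed by hive.fetch.task.conversion in hive-site.xml. When it is set to more, simple queries (select, filter, limit) are served by a fetch task and no longer launch MapReduce.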
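The effect of this property can be observed per session, as in the sketch below. The demo_emp table is hypothetical, and the comments describe the typical behavior for small inputs; other settings (for example answering count(*) from table statistics) can change what actually runs.

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SET hive.fetch.task.conversion=more;

-- Projection, filter and limit only: typically served by a fetch task
SELECT id, name FROM demo_emp WHERE salary > 5000 LIMIT 10;

-- Aggregation cannot be converted and normally launches a job
SELECT COUNT(*) FROM demo_emp;

-- With none, even the simple query is no longer converted to a fetch task
SET hive.fetch.task.conversion=none;
SELECT id, name FROM demo_emp LIMIT 10;
```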

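Hive has 3 virtual columns (note that the names use double underscores): INPUT__FILE__NAME, the file the row came from; BLOCK__OFFSET__INSIDE__FILE, the current offset within the file; and ROW__OFFSET__INSIDE__BLOCK, the row offset within the block, which is only available when hive.exec.rowoffset is set to true.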
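A short sketch of selecting the virtual columns next to ordinary ones, using a hypothetical demo_emp table:

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Needed for ROW__OFFSET__INSIDE__BLOCK to be available
SET hive.exec.rowoffset=true;

SELECT INPUT__FILE__NAME,
       BLOCK__OFFSET__INSIDE__FILE,
       ROW__OFFSET__INSIDE__BLOCK,
       id,
       name
FROM demo_emp
LIMIT 10;
```

This is handy for tracing a suspicious row back to the file it was read from.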

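Hive strict mode is configured through hive.mapred.mode in hive-site.xml. In strict mode Hive rejects certain risky queries, such as scanning a partitioned table without a partition filter, order by without limit, and Cartesian-product joins.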
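The sketch below shows query shapes that strict mode is meant to block, next to forms it accepts. The tables are hypothetical, the rejected statements are left as comments so the script still runs, and it assumes a Hive version in which hive.mapred.mode is still honored.

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE IF NOT EXISTS demo_dept (
  dept_id INT, dept_name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE IF NOT EXISTS demo_emp_part (
  id INT, name STRING, salary DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SET hive.mapred.mode=strict;

-- Rejected in strict mode: no partition filter on a partitioned table
-- SELECT id, name FROM demo_emp_part;
SELECT id, name FROM demo_emp_part WHERE dt = '2017-01-01';

-- Rejected in strict mode: order by without limit
-- SELECT id, salary FROM demo_emp ORDER BY salary DESC;
SELECT id, salary FROM demo_emp ORDER BY salary DESC LIMIT 10;

-- Rejected in strict mode: join without a join condition (Cartesian product)
-- SELECT e.name, d.dept_name FROM demo_emp e JOIN demo_dept d;
SELECT e.name, d.dept_name
FROM demo_emp e
JOIN demo_dept d ON e.dept_id = d.dept_id;

-- Back to the permissive setting
SET hive.mapred.mode=nonstrict;
```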

explain: shows the execution plan, i.e. how Hive parses a query into parameters that are then loaded into MapReduce templates.
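For example, prefixing a query with explain prints its stage dependencies and the operator tree of each stage instead of running it. The demo_emp table here is hypothetical.

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Show the plan for an aggregation: expect map-side and reduce-side operators
EXPLAIN
SELECT dept_id, COUNT(*) FROM demo_emp GROUP BY dept_id;

-- EXTENDED adds more detail, such as file paths and table properties
EXPLAIN EXTENDED
SELECT dept_id, COUNT(*) FROM demo_emp GROUP BY dept_id;
```

Comparing the plan of a plain select with that of an aggregation is also a quick way to see which queries need a MapReduce stage.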

Sequence Files https://my.oschina.net/xiangtao/blog/406553?p={{totalPage}}

http://blog.csdn.net/lucien_zong/article/details/10569073

http://itindex.net/detail/47472-%E5%AD%A6%E4%B9%A0-programing-hive

Tests combining Hive's various file formats with compression methods: http://blog.csdn.net/chenyi8888/article/details/14281939
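One possible setup for such a comparison is sketched below: the same rows are written once as a compressed SequenceFile table and once as an ORC table with Snappy. Table names are hypothetical, the mapreduce.* properties are the Hadoop 2 names (an assumption about the cluster), and the codecs are assumed to be installed.

```sql
CREATE TABLE IF NOT EXISTS demo_emp (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- SequenceFile output compressed block by block with Gzip
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

CREATE TABLE IF NOT EXISTS demo_emp_seq (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE demo_emp_seq
SELECT id, name, dept_id, salary FROM demo_emp;

-- ORC carries its own compression setting as a table property
CREATE TABLE IF NOT EXISTS demo_emp_orc_snappy (
  id INT, name STRING, dept_id INT, salary DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

INSERT OVERWRITE TABLE demo_emp_orc_snappy
SELECT id, name, dept_id, salary FROM demo_emp;

-- Compare the recorded sizes in the table parameters
DESCRIBE FORMATTED demo_emp_seq;
DESCRIBE FORMATTED demo_emp_orc_snappy;
```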

UDF, UDAF: http://computerdragon.blog.51cto.com/6235984/1288567/