Hive Notes
Note: The CREATE TABLE clauses FIELDS TERMINATED BY, ESCAPED BY, and LINES TERMINATED BY have special rules for the string literal used for their argument, because they all require a single character. You can use a regular character surrounded by single or double quotation marks, an octal sequence such as '\054' (representing a comma), or an integer in the range '-127'..'128' (with quotation marks but no backslash), which is interpreted as a single-byte ASCII character. Negative values are subtracted from 256; for example, FIELDS TERMINATED BY '-2' sets the field delimiter to ASCII code 254, the "Icelandic Thorn" character used as a delimiter by some data formats.
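For instance, a minimal sketch (hypothetical table name) of the octal form in a Hive DDL; the integer form such as '-2' for ASCII 254 is the behaviour described in the note above:

```bash
# Hypothetical table; '\054' is the octal escape for a comma, per the note above.
hive -e "
CREATE TABLE IF NOT EXISTS tmp.demo_comma_delim (id INT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\054'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;"
```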
avro-tools getschema hdfs://**101:8020/tmp/dataset/iottest/part_dt=2019-12-12/sdc-7beee5c9-d805-11e8-9a10-11ff78c635e0_167951a2-e1fc-47a0-b0f7-0c904ce1ab49 | tee ~/xx.avsc
xx.avsc
{
"type" : "record",
"name" : "ods_smart_speaker_reportdata_state",
"namespace" : "smart_ods",
"fields" : [ {
"name" : "action",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "action_vod",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "bluetooth_status",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "mic_status",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "network_status",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "speaker_device_id",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "timestamp",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "volume",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "volume_status",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "timestamp_dt",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "day",
"type" : [ "null", "string" ],
"default" : null
}, {
"name" : "w_insert_dt",
"type" : [ "null", "string" ],
"default" : null
} ]
}
hadoop fs -put ~/speaker.avsc hdfs://101:8020/tmp/dataset/iottest/
hadoop fs -ls hdfs://:8020/tmp/dataset/iottest/
CREATE EXTERNAL TABLE ods_smart_speaker_reportdata_state
STORED AS avro
LOCATION 'hdfs://**101:8020/tmp/dataset/iottest/'
TBLPROPERTIES ('avro.schema.url'='hdfs://**1010:8020/tmp/dataset/iottest/speaker.avsc');
alter table ods_smart_speaker_reportdata_state add partition (part_dt='2019-12-12') location 'hdfs://**101:8020/tmp/dataset/iottest/part_dt=2019-12-12' ;
select * from ods_smart_speaker_reportdata_state limit 5;
alter table tmp.ods_smart_speaker_reportdata_state add partition (part_dt='2020-08-01') location 'hdfs://**:8020/user/hive/speaker/ods_smart_speaker_reportdata_state/part_dt=2020-08-01';
alter table tmp.ods_smart_speaker_reportdata_state add partition (part_dt='2020-08-02') location 'hdfs://**:8020/user/hive/speaker/ods_smart_speaker_reportdata_state/part_dt=2020-08-02';
alter table aipdc.device_domain_stage drop partition(year_month=202008)
ALTER TABLE dwd_vision_inferimage_d PARTITION(year = 2020) SET LOCATION 'hdfs://user/user1/dwd_vision_inferimage_d/2020';
truncate table tmp.speaker_orionstar partition(part_dt='2073-01-22');
show table extended like ods_smart_speaker_reportdata_state partition (part_dt='2020-09-01');
set -e
hadoop distcp -D ipc.client.fallback-to-simple-auth-allowed=true -p -delete -overwrite ${hive_dw_src}/d.db/tt ${hive_dw_dest}/d.db/tt && hive -e "msck repair table d.tt sync partitions" && impala-shell -i ${IMPALA_SHELL_HOST}:${IMPALA_SHELL_PORT} -q "refresh d.tt"
ALTER TABLE is used to modify an existing table in Hive. The statement takes one of the following forms, depending on which attribute of the table we want to change:
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
RENAME TO... statement
ALTER TABLE table_name RENAME TO new_table_name;
The statement above renames the table: the HDFS location of the data changes, but the partition names stay the same.
The following query renames the table employee to emp.
ALTER TABLE employee RENAME TO emp;
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name];
This statement changes a column's name, data type, comment, and position: FIRST moves the column to the first position, and AFTER col_name places it immediately after col_name.
Example 1:
ALTER TABLE test_table CHANGE col1 col2 STRING COMMENT 'The datatype of col2 is STRING' AFTER col3;
The statement above renames column col1 to col2, sets its data type to STRING, adds a comment, and moves the column to the position after col3.
Example 2: the table below lists the fields of the employee table and shows which fields are to be changed.

Field name | From data type | New field name | To data type |
---|---|---|---|
eid | int | eid | int |
name | String | ename | String |
salary | Float | salary | Double |
designation | String | designation | String |
The following queries rename the column and change the column data type as shown above:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;
ADD COLUMNS lets you add new columns at the end of the existing columns, before the partition columns. REPLACE COLUMNS removes all existing columns and adds the new set of columns. This is only possible for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe); refer to Hive SerDe for more information. REPLACE COLUMNS can also be used to drop columns, for example "ALTER TABLE test_change REPLACE COLUMNS (a int, b int);".
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...);
ADD COLUMNS statement: the following query adds a column named dept to the employee table.
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
REPLACE statement
hive> ALTER TABLE employee REPLACE COLUMNS (id int, name string);
ALTER TABLE table_name SET FILEFORMAT file_format;
ALTER TABLE table_name SET TBLPROPERTIES ('comment' = new_comment);
ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS;
These statements change the table's physical properties.
alter table table_name add columns(location_id string) ;
alter table table_name change column complete_status complete_status string;
When you change a table's structure with ALTER TABLE, by default only the table-level metadata is updated; the metadata of existing historical partitions is untouched, so when you query data in the old partitions the newly added column comes back as NULL:
 | bottom_tab_name | top_tab_name |
---|---|---|
1 | NULL | 热门 |
2 | NULL | recommendation |
3 | NULL | hot |
Re-running the update against the historical partitions makes no difference either; it has no effect.
Before Hive 1.1.0, the workaround is:
alter table table_name partition(p='20190101') add columns(location_id string) ;
alter table table_name partition(p='20190101') change column complete_status complete_status string;
i.e. fix the partitions one by one by specifying each partition explicitly.
Since Hive 1.1.0 you can use CASCADE: append the CASCADE keyword to the ALTER TABLE statement and the change is applied once and for all.
Note: ALTER TABLE ... CHANGE COLUMN with CASCADE changes the column in the table metadata and cascades the same change to the metadata of all partitions.
alter table table_name add columns(location_id string) cascade|restrict;
alter table table_name change column complete_status complete_status string cascade|restrict;
Official explanation:
The CASCADE|RESTRICT clause is available in Hive 1.1.0. ALTER TABLE CHANGE COLUMN with CASCADE command changes the columns of a table’s metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column change only to table metadata.
ALTER TABLE CHANGE COLUMN CASCADE clause will override the table partition’s column metadata regardless of the table or partition’s protection mode. Use with discretion.
The column change command will only modify Hive’s metadata, and will not modify data. Users should make sure the actual data layout of the table/partition conforms with the metadata definition.
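As a minimal end-to-end sketch (table, column and partition names are made up), adding a column with CASCADE so that existing partitions expose it immediately:

```bash
# Hypothetical partitioned table; CASCADE pushes the new column into every partition's metadata.
hive -e "
ALTER TABLE dw.some_partitioned_table ADD COLUMNS (location_id STRING) CASCADE;
-- old partitions now have the column; rows written before the change return NULL for it
SELECT location_id, count(1)
FROM dw.some_partitioned_table
WHERE part_dt = '2019-01-01'
GROUP BY location_id;"
```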
Common causes of too many small files:
- When inserting into a partitioned table (possibly with dynamic partitioning enabled), a poorly chosen configuration (e.g. hive.exec.reducers.bytes.per.reducer=67108864, i.e. 64 MB) produces lots of small files; the more reducers, the more small files, since the number of output files equals the number of reducers;
- The data source itself already consists of many small files;
- Incremental loads (e.g. Sqoop imports, incremental inserts, etc.);
- Bucketed tables: they also commonly end up with small files, which is essentially the incremental-load problem again;
- In some scenarios the worst offenders are ingestion tables and intermediate tables: a single intermediate table can produce 100K+ small files per hour, close to several million per day;
- In some partitions the number of small files equals the number of mappers, which shows the per-partition file count is tied to the map count;
- With many maps and many partitions a huge number of small files is produced; the small-file count is proportional to (number of maps) * (number of partitions);
- Tables made updatable via stored as orc TBLPROPERTIES ("transactional"="true") get a snapshot every day, so roughly 1 GB of data can eventually occupy tens of GB of table files, growing the longer the table lives.
Ways to mitigate small files:
- Reduce the number of partitions of the table
- Reduce the number of map tasks of the job
- Reduce the number of reduce tasks of the job
- Merge files at the job output side (see the sketch after the merge-settings table below)
- Use sort by / distribute by to merge small files
Not every approach fits every scenario.
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
hive.input.format
Default Value: org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
Added In: Hive 0.5.0
The default input format. Set this to HiveInputFormat if you encounter problems with CombineHiveInputFormat.
-- merge small files on the Hive input side
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
set mapreduce.input.fileinputformat.split.minsize=
-- maximum input size per map; raising it reduces the number of mappers (the single most important knob)
set mapreduce.input.fileinputformat.split.maxsize=536870912 -- 512M
-- minimum split size per rack; splits smaller than this are merged
set mapreduce.input.fileinputformat.split.minsize.per.rack=536870912
-- minimum split size per node; splits smaller than this are merged
set mapreduce.input.fileinputformat.split.minsize.per.node=536870912
set mapreduce.job.maps=
-- The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. No limits if set to -1.
set mapreduce.job.split.metainfo.maxsize=10000000;
set dfs.blocksize=128MB
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
hive cli output:
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
...
-- Size per reducer. If the input size is 10GiB and this is set to 1GiB, Hive will use 10 reducers.
hive.exec.reducers.bytes.per.reducer
With set hive.exec.reducers.bytes.per.reducer=67108864 (64 MB), the log printed:
Number of reduce tasks not specified. Estimated from input data size: 1099
After set hive.exec.reducers.bytes.per.reducer=1073741824 (1 GB), the log printed:
Number of reduce tasks not specified. Estimated from input data size: 221
Hive estimates the number of reducers it needs from the input data size.
If you set mapreduce.job.reduces=16, the reducer count is exactly 16; this setting takes precedence over hive.exec.reducers.bytes.per.reducer.
The reducer-related settings take precedence over the hive.merge.* settings.
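A small sketch of that precedence (table names are hypothetical): once mapreduce.job.reduces is set explicitly, the bytes-per-reducer estimate is no longer used.

```bash
# Hypothetical query; the explicit reducer count wins over the size-based estimate.
hive -e "
set hive.exec.reducers.bytes.per.reducer=1073741824;  -- would otherwise drive the estimate
set mapreduce.job.reduces=16;                          -- forces exactly 16 reducers
INSERT OVERWRITE TABLE tmp.target_table
SELECT key, count(1) FROM tmp.source_table GROUP BY key;"
```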
[root@retail1 mrpl100.bak]# hdfs dfs -ls -h /user/hive/warehouse/***.db/speech_dialog/year_month=202202
Found 16 items
-rwxrwxrwt 3 hive hive 685.8 M 2022-05-27 16:19 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000000_0
-rwxrwxrwt 3 hive hive 687.7 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000001_0
-rwxrwxrwt 3 hive hive 686.5 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000002_0
-rwxrwxrwt 3 hive hive 687.9 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000003_0
-rwxrwxrwt 3 hive hive 688.7 M 2022-05-27 16:19 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000004_0
-rwxrwxrwt 3 hive hive 685.4 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000005_0
-rwxrwxrwt 3 hive hive 683.2 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000006_0
-rwxrwxrwt 3 hive hive 682.6 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000007_0
-rwxrwxrwt 3 hive hive 686.5 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000008_0
-rwxrwxrwt 3 hive hive 682.1 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000009_0
-rwxrwxrwt 3 hive hive 688.4 M 2022-05-27 16:21 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000010_0
-rwxrwxrwt 3 hive hive 683.9 M 2022-05-27 16:19 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000011_0
-rwxrwxrwt 3 hive hive 682.4 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000012_0
-rwxrwxrwt 3 hive hive 686.9 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000013_0
-rwxrwxrwt 3 hive hive 682.7 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000014_0
-rwxrwxrwt 3 hive hive 682.4 M 2022-05-27 16:20 /user/hive/warehouse/***.db/speech_dialog/year_month=202202/000015_0
The main configuration properties are:
Property | Description | Default |
---|---|---|
hive.default.fileformat | Default file format for CREATE TABLE statement. Options are TextFile, SequenceFile, RCFile, and Orc. | TextFile |
hive.merge.mapfiles | Merge small files at the end of a map-only job. | true |
hive.merge.mapredfiles | Merge small files at the end of a map-reduce job. | false |
hive.merge.size.per.task | Size of merged files at the end of the job. | 256000000 |
hive.merge.smallfiles.avgsize | When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true. | 16000000 |
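A hedged sketch of enabling the output-side merge for one insert, using the properties from the table above (table names and thresholds are illustrative, not recommendations):

```bash
# Illustrative values; triggers an extra merge job when the average output file is small.
hive -e "
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;       -- target size of merged files
set hive.merge.smallfiles.avgsize=134217728;  -- merge when avg output file < 128 MB
INSERT OVERWRITE TABLE tmp.target_table PARTITION (part_dt='2020-08-01')
SELECT col1, col2 FROM tmp.source_table WHERE part_dt='2020-08-01';"
```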
sort by effectively turns a map-only job into a map-reduce job: the output file count is then driven by the number of reduce tasks, and the sort merges the shuffle-read data inside each reducer, so using sort by cuts the small-file count sharply.
select * from uu where id>=100 sort by id
distribute by id shuffles all rows with the same id to the same reducer (sort by alone does not), then each reducer aggregates and merges internally and writes to the partition for that id. The target table's small-file count dropped from about 200K to about 1K, the same order of magnitude as the number of partitions (file count <= partition count).
select * from uu where id>=100 distribute by id sort by id
If distribute by hits data skew, the job runs longer (trading time for space). In that case split the data: handle the exceptionally large ids with dedicated map-only insert jobs and a tuned map count, and handle the remaining data with insert...select... distribute by key to shrink the small files. That keeps the whole table's small-file count from growing too large (see the sketch below).
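A rough sketch of that split strategy (all table names and the heavy id are hypothetical): the few very large ids get dedicated map-only inserts, and the long tail goes through distribute by.

```bash
# Hypothetical tables; heavy keys bypass the shuffle, the rest are spread across reducers by id.
hive -e "
set hive.exec.dynamic.partition.mode=nonstrict;
-- the handful of very large ids: dedicated map-only insert with a tuned map count
INSERT OVERWRITE TABLE tmp.uu_target PARTITION (id=12345)
SELECT time, payload FROM tmp.uu WHERE id = 12345;
-- everything else: insert ... select ... distribute by key
INSERT OVERWRITE TABLE tmp.uu_target PARTITION (id)
SELECT time, payload, id FROM tmp.uu WHERE id >= 100 AND id <> 12345
DISTRIBUTE BY id;"
```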
Hadoop Archive (HAR) is a file-archiving tool that packs small files into HDFS blocks efficiently: it bundles many small files into a single HAR file, reducing NameNode memory usage while still allowing transparent access to the files.
# enable archiving
set hive.archive.enabled=true;
# allow Hive to set the parent directory when creating the archive
set hive.archive.har.parentdir.settable=true;
# control the size of the archive part files
set har.partfile.size=1099511627776;
# archive a partition with the following command
ALTER TABLE A ARCHIVE PARTITION(year_month=202101);
# restore an archived partition to its original files
ALTER TABLE A UNARCHIVE PARTITION(year_month=202101);
Note: an archived partition can be queried but cannot be written with insert overwrite; you must unarchive it first.
Finally, on a new cluster with no legacy baggage, it is recommended to use the ORC file format for Hive and enable LZO compression; when small files do pile up, Hive's built-in concatenate command can then merge them quickly.
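A minimal sketch of that merge (hypothetical ORC table and partition):

```bash
# Hypothetical ORC table; CONCATENATE merges the partition's small files in place.
hive -e "ALTER TABLE dw.some_orc_table PARTITION (year_month=202101) CONCATENATE;"
```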
ref: common Sqoop commands and options
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
Here’s what each individual Sqoop command option means:
--connect – Provides jdbc string
--username – Database username
-P – Will ask for the password on the console. Alternatively you can use --password, but that is not good practice since the password is visible in your job execution logs and is asking for trouble. One way to deal with this is to store database passwords in a file in HDFS and provide it at runtime (see the sketch after this option list).
--table – Tells the computer which table you want to import from MySQL. Here, it's customer.
------------ common import options
--enclosed-by <char> wrap field values with the specified character
--escaped-by <char> escape special characters inside field values
--split-by – Specifies your splitting column. I am specifying id here.
--fields-terminated-by – field separator, default comma. I have specified comma (as by default it will import data into HDFS with comma-separated values)
--lines-terminated-by line (record) terminator, default \n
--mysql-delimiters use MySQL's default delimiter set: fields separated by commas, lines by \n, escaped by \, field values optionally enclosed in single quotes
--optionally-enclosed-by <char> wrap field values that contain double or single quotes with the specified character
------------ common Hive options
--hive-delims-replacement <arg> replace \r\n, \013, \010 and similar characters in the data with a custom string
--hive-drop-import-delims when importing into Hive, drop \r\n, \013, \010 and similar characters from the data
--map-column-hive <arg> override the generated Hive column mapping, allowing you to change a column's data type
--target-dir – HDFS destination directory.
--hive-overwrite – Overwrite existing data in the Hive table.
--hive-import – Import table into Hive (Uses Hive’s default delimiters if none are set.)
--create-hive-table – If set, the job will fail if the target Hive table already exists. It works in this case.
--hive-table – Specifies <db_name>.<table_name>. Here it's sqoop_workspace.customers, where sqoop_workspace is my database and customers is the table name.
-m or --num-mappers use N map tasks to import in parallel; default 4
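A hedged sketch of the HDFS password-file approach mentioned for -P above (paths and credentials are made up): store the password in a file only you can read and point --password-file at it.

```bash
# Hypothetical paths/credentials; --password-file keeps the password off the command line and out of logs.
echo -n 'yourpw' | hadoop fs -put - /user/youruser/.mysql.pwd
hadoop fs -chmod 400 /user/youruser/.mysql.pwd
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root \
  --password-file /user/youruser/.mysql.pwd \
  --table customer --target-dir /user/youruser/ingest/raw/customers -m 1
```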
sqoop import --connect jdbc:mysql://10.133.229.104:3306/acai --username root -P --table ac_ae_20182019 --hive-import --create-hive-table --hive-table ac.ac_ae_20182019 -m 1
sqoop import --connect jdbc:mysql://rr-bp19di1mgn34la9apeo.mysql.rds.aliyuncs.com/ai --username device_cache_readonly -P --table iot_device_cache --hive-import --create-hive-table --hive-table tmp.iot_device_cache -m 1
sqoop import --connect jdbc:mysql://rr-bp19di1mgn34la9apeo.mysql.rds.aliyuncs.com/ai --username device_cache_readonly -P --table iot_device_cache --hive-import --hive-table tmp.iot_device_cache -m 1 --hive-overwrite
sqoop import --connect jdbc:mysql://10.133.229.104:3306/acai --username root -P --table ac_af_20182019 --hive-import --create-hive-table --hive-table ac.ac_af_20182019 -m 1
on HDP 2.3.2 Sandbox, because of the SQOOP-1400 bug, use --driver com.mysql.jdbc.Driver to work around the problem
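For that workaround, roughly (connection details reused from the examples above):

```bash
# Passes the JDBC driver class explicitly to work around the SQOOP-1400 issue noted above.
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P \
  --table customer --driver com.mysql.jdbc.Driver \
  --target-dir /user/cloudera/ingest/raw/customers -m 1
```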
--incremental one of: append or lastmodified
--last-value <value> use the specified value (inclusive) as the marker for the incremental import
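A hedged sketch of an incremental append import (table and check column are hypothetical; --check-column is the standard companion to --incremental and --last-value):

```bash
# Hypothetical table; imports only rows whose id is greater than the previous --last-value.
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P \
  --table customer --target-dir /user/cloudera/ingest/raw/customers \
  --incremental append --check-column id --last-value 10000 -m 1
```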
sqoop export --connect jdbc:mysql://ip/smartac --username u_ac --password yourpw --table features_ux --direct --export-dir /apps/hive/warehouse/export_table --driver com.mysql.jdbc.Driver
sqoop export --connect jdbc:mysql://ip/smartac --username u_ac --password yourpw --table features_ux --hcatalog-database ac_ods --hcatalog-table features --num-mappers 1 --update-key appliance_id --update-mode updateonly --fields-terminated-by '\t'
------------ common export options
--input-enclosed-by <char> field values in the input are wrapped with the specified character
--input-escaped-by <char> special characters in input field values are escaped with this character
--input-fields-terminated-by field separator of the input files, default comma
--input-lines-terminated-by line (record) terminator of the input files, default \n
--input-optionally-enclosed-by <char> field values containing double or single quotes are wrapped with the specified character
--table: the table name in the MySQL database
--hcatalog-database: the database name in Hive # The --export-dir option conflicts with the --hcatalog-table option.
--hcatalog-table: the table in the Hive database to export from
--num-mappers or -m: number of map tasks for the job
sqoop export --connect jdbc:mysql://ip/smartac --username u_ac --password yourpw --table features --hcatalog-database ac_ods --hcatalog-table features --num-mappers 1 --update-key appliance_id --update-mode allowinsert
sqoop eval --connect 'jdbc:sqlserver://1.1.1.1;database=SomeDatabase;username=someUser;password=somePassword' --query "TRUNCATE TABLE some_table"
sqoop eval --connect jdbc:mysql://ip/smartac --username u_ac --password yourpw --query "TRUNCATE TABLE smartac.features_ux;"
sqoop list-databases --connect jdbc:mysql://10.133.229.104:3306/acai --username root -P
The following refers to the use-case where no schema file is involved
The schema is stored in 2 places
- The metastore
- As part of the data files
All the information for the DESC/SHOW commands is taken from the metastore. Every DDL change impacts only the metastore.
When you query the data, the matching between the 2 schemas is done by the column names. If there is a mismatch in the column types you'll get an error.
Demo
create table mytable stored as avro as select 1 as myint, 'Hello' as mystring, current_date as mydate;
select * from mytable;
+-------+----------+------------+
| myint | mystring | mydate |
+-------+----------+------------+
| 1 | Hello | 2017-05-30 |
+-------+----------+------------+
Metastore
select c.column_name, c.integer_idx, c.type_name
from metastore.DBS as d
join metastore.TBLS as t on t.db_id = d.db_id
join metastore.SDS as s on s.sd_id = t.sd_id
join metastore.COLUMNS_V2 as c on c.cd_id = s.cd_id
where d.name = 'local_db' and t.tbl_name = 'mytable'
order by integer_idx
+-------------+-------------+-----------+
| column_name | integer_idx | type_name |
+-------------+-------------+-----------+
| myint | 0 | int |
| mystring | 1 | string |
| mydate | 2 | date |
+-------------+-------------+-----------+
avro-tools
bash-4.1$ avro-tools getschema 000000_0
{ "type" : "record", "name" : "mytable", "namespace" : "local_db", "fields" : [ { "name" : "myint", "type" : [ "null", "int" ], "default" : null }, { "name" : "mystring", "type" : [ "null", "string" ], "default" : null }, { "name" : "mydate", "type" : [ "null", { "type" : "int", "logicalType" : "date" } ], "default" : null } ] } alter table mytable change myint dummy1 int; select * from mytable; +--------+----------+------------+ | dummy1 | mystring | mydate | +--------+----------+------------+ | (null) | Hello | 2017-05-30 | +--------+----------+------------+ alter table mytable add columns (myint int); select * from mytable; +--------+----------+------------+-------+ | dummy1 | mystring | mydate | myint | +--------+----------+------------+-------+ | (null) | Hello | 2017-05-30 | 1 | +--------+----------+------------+-------+ Metastore
+-------------+-------------+-----------+
| column_name | integer_idx | type_name |
+-------------+-------------+-----------+
| dummy1 | 0 | int |
| mystring | 1 | string |
| mydate | 2 | date |
| myint | 3 | int |
+-------------+-------------+-----------+
avro-tools (same schema as the original one)
bash-4.1$ avro-tools getschema 000000_0
{ "type" : "record", "name" : "mytable", "namespace" : "local_db", "fields" : [ { "name" : "myint", "type" : [ "null", "int" ], "default" : null }, { "name" : "mystring", "type" : [ "null", "string" ], "default" : null }, { "name" : "mydate", "type" : [ "null", { "type" : "int", "logicalType" : "date" } ], "default" : null } ] } Any work against that table is done based on the metadata stored in the Metastore. When the table is being queried, additional metadata is being used which is the metadata stored in data file. The query result structure is constructed from the Metastore (See in my example that 4 columns are being returned after the table was altered). The data returned depends on both schemes - a field with a specific name in the file schema will be mapped to the column with the same name in the Metastore schema. If the names match but the datatypes don't, an error will arise. A fields from the data file that does not have a corresponding column name in the Metastore would not be presented. A column in the Metastore without corresponding field in the data file schema will hold NULL values.
Since the partition name is in your case dt, not date, you have to use PARTITIONED BY (dt string), and then you still need to add the partitions to the metastore :
ALTER TABLE demo ADD PARTITION (dt='2016-02-01')
If you have several partitions, you can repair the table, and it will add them automatically:
msck repair table demo
If you don't want to do this, you can use ADD PARTITION, like:
ALTER TABLE table_name ADD PARTITION (y='01',m='02',d='03',h='03',hh='12') location '/yourpath/y=2016/m=02/d=03/h=03/hh=12/'; (or any other folder)
import redis  # import the redis module; you can operate redis from python, or work directly in the redis server's CLI
r = redis.StrictRedis(host='ip', port=20140, db=0, password='**', decode_responses=True)
pool = redis.ConnectionPool(host='localhost', port=6379, decode_responses=True)  # host is the redis server; both the server and the client must be running; the default redis port is 6379
r = redis.Redis(connection_pool=pool)
r.set('gender', 'male')  # store the key "gender" with value "male" in the redis cache
print(r.get('gender'))  # fetch the value stored for key "gender"
r.llen("intent_202005")
Hive makes no guarantees about the order of records if you don’t sort them, but in practice, they come back in the same order in which they’re in the file, so this is far from truly random. As a next attempt:
select * from my_table order by rand() limit 10000;
This does actually give you truly random data, but performance is not so good. In order to impose a total ordering, Hive must force all data to one single reducer. That reducer will sort the entire dataset. Not good. Luckily, Hive has a non-standard-SQL "sort by" clause that only sorts within a single reducer, and gives no guarantees that data will be sorted across reducers:
select * from my_table sort by rand() limit 10000;
This is much better, but I'm not convinced it's truly random. The issue is that Hive's method of splitting data into multiple reducers is undefined. It might be truly random, it might be based on file order, it might be based on some value in the data. How Hive implements the limit clause across reducers is also undefined. Maybe it takes data from the reducers in order- i.e., all data from reducer 0, then all from reducer 1, etc. Maybe it round-robins through them and mixes everything together.
In the worst case, let’s say the reduce key is based on a column of the data, and the limit clause is in order of reducers. Then the sample will be extremely skewed.
The solution is another non-standard Hive feature: “distribute by”. For queries where the reduce key is not determined by the structure of the query (no “group by”, no joins), you can specify exactly what you want the reduce key to be. If we distribute randomly, and sort within each reducer randomly, then it doesn’t matter how “limit” functions.
select * from my_table distribute by rand() sort by rand() limit 10000;
Finally, as one last optimization, we can do some filtering map-side. If the total size of the table is known, then you can easily randomly select some proportion of the data on a record-by-record basis, like this:
select * from my_table where rand() <= 0.0001 distribute by rand() sort by rand() limit 10000;
In this case, since the total size is ten billion, and the sample size is ten thousand, I can easily calculate that's exactly 0.000001 of the total data. However, if the where clause was "rand() < 0.000001", it would be possible to end up with fewer than 10000 rows output. "rand() < 0.000002" would probably work, but that's really relying on a very good implementation of rand(). Safer to fudge it by a bit more. In the end it doesn't matter much since the bottleneck quickly becomes the simple full table scan to start the job, and not anything based on the volume sent to reducers.
select
( select * from
    (select time, asr, reply_text, row_number() over() as nlg
     from speech_dialog
     where year_month=202005
       and from_unixtime(unix_timestamp(time), 'yyyyMMdd') = '20200501'
       and asr is not null ) t
  where rand() <= 0.1
  distribute by rand() sort by rand() )
union all
( select * from
    (select time, asr, nlg from
       (select time, asr, reply_text nlg, row_number() over(partition by asr, reply_text order by time) as rn
        from speech_dialog
        where year_month=202005
          and from_unixtime(unix_timestamp(time), 'yyyyMMdd') = '20200501'
          and asr is not null ) t
     where rn = 1) t1
  where rand() <= 0.0001
  distribute by rand() sort by rand() )
union all
( select * from
    (select time, asr, nlg from
       (select time, asr, reply_text nlg, row_number() over(partition by asr, reply_text order by time) as rn
        from speech_dialog
        where year_month=202005
          and from_unixtime(unix_timestamp(time), 'yyyyMMdd') = '20200502'
          and asr is not null ) t
     where rn = 1) t1
  where rand() <= 0.0001
  distribute by rand() sort by rand() )
select mid ,time,speech_source FROM speech_dialog WHERE speech_source IS NOT NULL GROUP BY mid,time,speech_source limit 100;
select action, action_vod, bluetooth_status, mic_status, network_status, speaker_device_id, dtm_str, volume, volume_status, part_dt
select from_unixtime(cast(dtm_str/1000 as bigint), 'yyyyMMdd hh:mm:ss') as dt, action, bluetooth_status, mic_status, network_status, speaker_device_id, dtm_str, volume, volume_status, part_dt from tmp.speaker_orionstar
select from_unixtime(cast(timestamp/1000 as bigint), 'yyyyMMdd hh:mm:ss') as dt, action, bluetooth_status, mic_status, network_status, speaker_device_id, timestamp, volume, volume_status, part_dt from tmp.ods_smart_speaker_reportdata_state where part_dt = '2020-07-01' limit 200;
52504-02-03
select part_dt, count(1) from tmp.speaker_orionstar where part_dt between '2020-07-10' and '2020-08-01' group by part_dt;
select part_dt, count(1) from tmp.ods_smart_speaker_reportdata_state where part_dt between '2020-07-10' and '2020-08-01' group by part_dt order by part_dt;
1594701806590973 525040203 04:03:10 1594701591300 20200714 12:39:51
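The 52504-02-03 date comes from a microsecond epoch that was only divided by 1000 (so still milliseconds) being fed to from_unixtime, which expects seconds: 1594701806590973 / 1000000 ≈ 1594701806 falls on 2020-07-14, while 1594701591300 / 1000 gives the correct 20200714 12:39:51. A hedged sketch of normalizing both units by magnitude (the thresholds are an assumption about this data):

```bash
# Assumed thresholds: >= 1e15 means microseconds, >= 1e12 means milliseconds, else seconds.
hive -e "
SELECT from_unixtime(
         CASE
           WHEN ts >= 1000000000000000 THEN cast(ts / 1000000 AS BIGINT)  -- microseconds
           WHEN ts >= 1000000000000    THEN cast(ts / 1000    AS BIGINT)  -- milliseconds
           ELSE ts                                                        -- already seconds
         END,
         'yyyyMMdd hh:mm:ss') AS dt
FROM (SELECT cast(1594701806590973 AS BIGINT) AS ts
      UNION ALL
      SELECT cast(1594701591300 AS BIGINT)) t;"
```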
2020-07-30 12:20:53,213 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2507.08 sec
MapReduce Total cumulative CPU time: 41 minutes 47 seconds 80 msec
Ended Job = job_1525335504172_33295
MapReduce Jobs Launched:
Stage-Stage-1: Map: 31 Reduce: 144 Cumulative CPU: 2507.08 sec HDFS Read: 9649908091 HDFS Write: 323 SUCCESS
Total MapReduce CPU Time Spent: 41 minutes 47 seconds 80 msec
OK
2020-07-14 1035756
2020-07-15 6929052
2020-07-16 8662057
2020-07-17 8689663
2020-07-18 8730276
2020-07-19 8716845
2020-07-20 8591511
2020-07-21 8731247
2020-07-22 8698059
2020-07-23 8182502
2020-07-24 8676884
2020-07-25 8773794
2020-07-26 8768648
2020-07-27 8704724
2020-07-28 8796582
2020-07-29 8933490
2020-07-30 3870592
Time taken: 77.206 seconds, Fetched: 17 row(s)
hive>
2020-07-30 17:45:07,024 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 7090.86 sec
2020-07-30 17:45:08,060 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7096.63 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 58 minutes 16 seconds 630 msec
Ended Job = job_1525335504172_33318
MapReduce Jobs Launched:
Stage-Stage-1: Map: 62 Reduce: 292 Cumulative CPU: 7096.63 sec HDFS Read: 19636512279 HDFS Write: 339 SUCCESS
Total MapReduce CPU Time Spent: 0 days 1 hours 58 minutes 16 seconds 630 msec
OK
2020-07-14 2071512
2020-07-15 13858104
2020-07-16 17324114
2020-07-17 17379326
2020-07-18 17460552
2020-07-19 17433690
2020-07-20 17183022
2020-07-21 17462494
2020-07-22 17396118
2020-07-23 16365004
2020-07-24 17353768
2020-07-25 17547588
2020-07-26 17537296
2020-07-27 17409448
2020-07-28 17593164
2020-07-29 17866980
2020-07-30 10628628
select to_date(time) dd, nlu_classifier, count(1) from speech_dialog where year_month=202007 and to_date(time) = '2020-07-01' and speech_source = '音箱' group by to_date(time), nlu_classifier with rollup;
+-------------+-----------------+-----------+--+
| dd | nlu_classifier | count(1) |
+-------------+-----------------+-----------+--+
| 2020-07-01 | **Domain | 7891 |
| 2020-07-01 | | 115965 |
| 2020-07-01 | Total | 123856 |
+-------------+-----------------+-----------+--+