HBase

HBase Snapshot

ExportSnapshot

In Kerberos

[root@**2 ~]# hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot TestTable-snapshot1 -copy-to hdfs://**4:8020/tmp/bak/

Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
21/02/05 14:22:40 INFO snapshot.ExportSnapshot: Copy Snapshot Manifest
21/02/05 14:22:40 INFO hdfs.DFSClient: Created token for hdfs: HDFS_DELEGATION_TOKEN owner=hdfs/**[email protected], renewer=yarn, realUser=, issueDate=1612506160711, maxDate=1613110960711, sequenceNumber=285, masterKeyId=112 on 10.18.60.113:8020
21/02/05 14:22:40 INFO security.TokenCache: Got dt for hdfs://**4:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.18.60.113:8020, Ident: (token for hdfs: HDFS_DELEGATION_TOKEN owner=hdfs/**[email protected], renewer=yarn, realUser=, issueDate=1612506160711, maxDate=1613110960711, sequenceNumber=285, masterKeyId=112)
21/02/05 14:22:40 INFO client.RMProxy: Connecting to ResourceManager at **4/10.18.60.113:8032
21/02/05 14:22:41 INFO snapshot.ExportSnapshot: Loading Snapshot 'TestTable-snapshot1' hfile list
21/02/05 14:22:41 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
21/02/05 14:22:41 INFO mapreduce.JobSubmitter: number of splits:1
21/02/05 14:22:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1612494344882_0001
21/02/05 14:22:41 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: 10.18.60.113:8020, Ident: (token for hdfs: HDFS_DELEGATION_TOKEN owner=hdfs/**[email protected], renewer=yarn, realUser=, issueDate=1612506160711, maxDate=1613110960711, sequenceNumber=285, masterKeyId=112)
21/02/05 14:22:42 INFO impl.YarnClientImpl: Submitted application application_1612494344882_0001
21/02/05 14:22:42 INFO mapreduce.Job: The url to track the job: http://**4:8088/proxy/application_1612494344882_0001/
21/02/05 14:22:42 INFO mapreduce.Job: Running job: job_1612494344882_0001
21/02/05 14:22:45 INFO mapreduce.Job: Job job_1612494344882_0001 running in uber mode : false
21/02/05 14:22:45 INFO mapreduce.Job:  map 0% reduce 0%
21/02/05 14:22:46 INFO mapreduce.Job: Job job_1612494344882_0001 failed with state FAILED due to: Application application_1612494344882_0001 failed 2 times due to AM Container for appattempt_1612494344882_0001_000002 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://**4:8088/proxy/application_1612494344882_0001/Then, click on links to logs of each attempt.
Diagnostics: Application application_1612494344882_0001 initialization failed (exitCode=255) with output: main : command provided 0
main : run as user is hdfs
main : requested yarn user is hdfs
Requested user hdfs is banned

Failing this attempt. Failing the application.
21/02/05 14:22:46 INFO mapreduce.Job: Counters: 0
21/02/05 14:22:46 ERROR snapshot.ExportSnapshot: Snapshot export failed
org.apache.hadoop.hbase.snapshot.ExportSnapshotException: Copy Files Map-Reduce Job failed
        at org.apache.hadoop.hbase.snapshot.ExportSnapshot.runCopyJob(ExportSnapshot.java:825)
        at org.apache.hadoop.hbase.snapshot.ExportSnapshot.run(ExportSnapshot.java:1020)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.snapshot.ExportSnapshot.innerMain(ExportSnapshot.java:1094)
        at org.apache.hadoop.hbase.snapshot.ExportSnapshot.main(ExportSnapshot.java:1098)

'Requested user hdfs is banned': in YARN's configuration, banned.users includes hdfs, yarn, mapred and bin, so a MapReduce job submitted as hdfs cannot launch its containers.
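
One way around this is to rerun the export as a user that is not banned, e.g. the hbase service user. A minimal sketch; the keytab path is an assumption, the principal matches the one visible in klist later in this session:

# obtain a TGT for the hbase principal (keytab path is an assumption)
kinit -kt /etc/security/keytabs/hbase.keytab hbase/**[email protected]
# rerun the export with the same arguments as above
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot TestTable-snapshot1 -copy-to hdfs://**4:8020/tmp/bak/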

[root@**2 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/

Found 6 items
drwxrwxrwx   - hdfs   supergroup          0 2021-02-05 14:26 hdfs://**4:8020/tmp/.cloudera_health_monitoring_canary_files
drwxrwxr-x   - hbase  supergroup          0 2021-02-04 14:45 hdfs://**4:8020/tmp/bak
drwxr-xr-x   - yarn   supergroup          0 2020-07-09 22:47 hdfs://**4:8020/tmp/hadoop-yarn
drwx--x--x   - hbase  supergroup          0 2020-11-11 16:50 hdfs://**4:8020/tmp/hbase-staging
drwx-wx-wx   - hive   supergroup          0 2020-10-30 15:15 hdfs://**4:8020/tmp/hive
drwxrwxrwt   - mapred hadoop              0 2020-09-16 17:14 hdfs://**4:8020/tmp/logs

[root@**2 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/

Found 1 items
drwxr-xr-x   - hbase supergroup          0 2021-02-04 14:45 hdfs://**4:8020/tmp/bak/.hbase-snapshot

[root@**2 ~]# hdfs dfs -chmod -R g+w hdfs://**4:8020/tmp/bak/

[root@**2 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/

Found 1 items
drwxrwxr-x   - hbase supergroup          0 2021-02-04 14:45 hdfs://**4:8020/tmp/bak/.hbase-snapshot

[root@**2 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/.hbase-snapshot

Found 1 items
drwxrwxr-x   - hbase supergroup          0 2021-02-05 14:22 hdfs://**4:8020/tmp/bak/.hbase-snapshot/.tmp

[root@**1 ~]# pssh -h /cdhdata/bak/list_krb_clients -P -l root usermod -a -G supergroup hive

[1] 15:06:31 [SUCCESS] **1
[2] 15:06:31 [SUCCESS] **3
[3] 15:06:31 [SUCCESS] **2
[4] 15:06:31 [SUCCESS] **4

[root@**1 ~]# pssh -h /cdhdata/bak/list_krb_clients -P -l root usermod -a -G supergroup hbase

[1] 15:06:38 [SUCCESS] **1
[2] 15:06:38 [SUCCESS] **3
[3] 15:06:38 [SUCCESS] **2
[4] 15:06:38 [SUCCESS] **4

[root@**1 ~]# pssh -h /cdhdata/bak/list_krb_clients -P -l root tail -2 /etc/group

**2: yy:x:1006:
supergroup:x:1101:hive,hbase
[1] 15:06:49 [SUCCESS] **2
**1: yy:x:1007:
supergroup:x:1101:hbase,hive
[2] 15:06:49 [SUCCESS] **1
**4: yy:x:1009:
supergroup:x:1103:hive,hbase
[3] 15:06:49 [SUCCESS] **4
**3: yy:x:1009:
supergroup:x:1103:hive,hbase
[4] 15:06:49 [SUCCESS] **3
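
To verify that HDFS actually picks up the new membership (group mapping is resolved on the NameNode by default, and may be cached), a quick check, sketched here:

# ask HDFS which groups it resolves for each user
hdfs groups hbase hive
# if the NameNode still shows stale groups, refresh the mapping cache
hdfs dfsadmin -refreshUserToGroupsMappings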

[root@**1 ~]# su - hive

[hive@**1 ~]$ hdfs dfs -rm -f -skipTrash hdfs://**4:8020/tmp/bak/info.txt

Deleted hdfs://**4:8020/tmp/bak/info.txt

[hive@**1 ~]$ hdfs dfs -copyFromLocal /cdhdata/bak/list_krb_clients hdfs://**4:8020/tmp/bak/

[hive@**1 ~]$ hdfs dfs -ls hdfs://**4:8020/tmp/bak

Found 1 items
-rw-r--r--   3 hive supergroup         32 2021-02-05 15:11 hdfs://**4:8020/tmp/bak/list_krb_clients

[root@**1 ~]# sudo -u hbase hdfs dfs -cp hdfs://**4:8020/tmp/bak/.hbase-snapshot/TestTable-snapshot1 hdfs://**4:8020/hbase/.hbase-snapshot/

21/02/05 16:27:02 WARN security.UserGroupInformation: PriviledgedActionException as:hbase (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
21/02/05 16:27:02 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
21/02/05 16:27:02 WARN security.UserGroupInformation: PriviledgedActionException as:hbase (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
cp: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "**1/10.18.60.114"; destination host is: "**4":8020;

sudo -u hbase does not inherit root's Kerberos ticket cache, which is why the copy fails with 'GSS initiate failed'. In this session root's ticket cache already holds an hbase principal, so the copy can simply be run directly as root:

[root@**1 ~]# klist

Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hbase/**[email protected]

Valid starting       Expires              Service principal
02/05/2021 14:36:36  02/06/2021 14:36:36  krbtgt/[email protected]
        renew until 02/09/2021 18:16:03

[root@**1 ~]# hdfs dfs -cp hdfs://**4:8020/tmp/bak/.hbase-snapshot/TestTable-snapshot1 hdfs://**4:8020/hbase/.hbase-snapshot/

[root@**1 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/

[root@**1 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/archive/

[root@**1 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/archive/data/

[root@**1 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/archive/data/default/

[root@**1 ~]# hbase hbck -details

[root@**1 ~]# hdfs dfs -cp -f hdfs://**4:8020/tmp/bak/archive/data/default/TestTable hdfs://**4:8020/hbase/archive/data/default/
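
With the snapshot manifest under /hbase/.hbase-snapshot and the HFiles under /hbase/archive, the snapshot should now be visible to HBase and restorable from the shell. A sketch (snapshot and table names follow the example above; TestTable_restored is a hypothetical target name):

hbase shell
hbase(main):001:0> list_snapshots
hbase(main):002:0> disable 'TestTable'            # the table must be disabled before a restore
hbase(main):003:0> restore_snapshot 'TestTable-snapshot1'
hbase(main):004:0> enable 'TestTable'
# or clone into a new table without touching the original:
hbase(main):005:0> clone_snapshot 'TestTable-snapshot1', 'TestTable_restored'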

Troubleshooting

RpcRetryingCaller - Call exception

While connecting to HBase through the Java API, a RpcRetryingCaller Call exception appeared; the log shows the connection failing and being retried many times. A sample exception log follows.

It looks like an error when requesting the ZooKeeper server ... to be analyzed.

[INFO ] 2021-07-20 20:24:41.154 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t1] RpcRetryingCaller - Call exception, tries=10, retries=35, started=38390 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:24:51.234 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t1] RpcRetryingCaller - Call exception, tries=11, retries=35, started=48472 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:25:29.671 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t2] RpcRetryingCaller - Call exception, tries=10, retries=35, started=38330 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:25:39.687 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t2] RpcRetryingCaller - Call exception, tries=11, retries=35, started=48346 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:26:18.125 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t3] RpcRetryingCaller - Call exception, tries=10, retries=35, started=38231 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:26:28.178 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t3] RpcRetryingCaller - Call exception, tries=11, retries=35, started=48285 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:27:06.826 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t4] RpcRetryingCaller - Call exception, tries=10, retries=35, started=38342 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:27:16.858 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t4] RpcRetryingCaller - Call exception, tries=11, retries=35, started=48374 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0

Solution: add the hostname of every HBase cluster node to the client's hosts file, for example:

**.**.**.114 ***
**.**.**.1*5 ***
**.**.**.1*2 ***
**.**.**.1*3 ***
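
The underlying cause: the client first asks ZooKeeper for the hbase:meta location, then talks to RegionServers by the hostnames they registered with (here **3), so every RegionServer hostname must resolve on the client machine. For reference, a minimal Java client sketch; the ZooKeeper quorum value is an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class HBaseConnectDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // ZooKeeper hands back RegionServer *hostnames*, not IPs; the client
        // must be able to resolve them (via DNS or /etc/hosts).
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // assumption
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("speechdialog"))) {
            // reaching this point requires a successful hbase:meta lookup,
            // i.e. the RegionServer hostname resolved correctly
            System.out.println("Connected to table " + table.getName());
        }
    }
}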

HDFS Read Tuning

For HBase running on HDFS storage, there are two main read-tuning options:

- an option that bypasses RPC, called short-circuit reads
- an option that lets HDFS speculatively read the same data from multiple DataNodes, called hedged reads

Short-Circuit Reads

Typically an HBase RegionServer is colocated with an HDFS DataNode, so data locality is usually very good. In the early Hadoop 1.0.0 release, however, a RegionServer talking to a DataNode went through the full RPC path just like any other client. After Hadoop 1.0.0, a short-circuit read option was added that bypasses the RPC stack entirely and lets local clients read data directly from the underlying filesystem.

Hadoop 2.x refined this implementation further. The DataNode and HDFS clients (HBase among them) can now use a feature called file descriptor passing, so the data exchange happens entirely at the OS kernel level. This is faster and more efficient than the earlier implementation and lets multiple processes on the same node interact efficiently.

To enable short-circuit reads in Hadoop, refer to the official documentation:

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html

Below is a reference configuration. It must be added to both hbase-site.xml and hdfs-site.xml, and the processes must be restarted for it to take effect:

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
  <description>
    This configuration parameter turns on short-circuit local reads.
  </description>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
  <description>
    Optional. This is a path to a UNIX domain socket that will be used for
    communication between the DataNode and local HDFS clients.
    If the string "_PORT" is present in this path, it will be replaced by the
    TCP port of the DataNode.
  </description>
</property>

Note: the file specified by dfs.domain.socket.path (it does not have to exist beforehand) must be owned by the OS root user or by the user running the DataNode service.
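
One way to satisfy this, sketched here under the assumption that the DataNode runs as user hdfs and uses the socket path configured above:

# create the parent directory; the DataNode creates the socket file itself
mkdir -p /var/lib/hadoop-hdfs
# root ownership satisfies the requirement (hdfs ownership would also work
# when the DataNode runs as hdfs)
chown root:root /var/lib/hadoop-hdfs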

Finally, the default size of the short-circuit read buffers is set by dfs.client.read.shortcircuit.buffer.size, and for a very busy HBase cluster the stock default can be too high. HBase therefore lowers it: if the value is not explicitly set, it drops from the usual 1 MB down to 128 KB (via the hbase.dfs.client.read.shortcircuit.buffer.size property, which defaults to 128 KB).

The HDFS client inside HBase allocates one direct byte buffer of the size given by hbase.dfs.client.read.shortcircuit.buffer.size for each open data block. Because this feature lets HBase keep its HDFS files open permanently, the memory consumed by these buffers can grow quickly.
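
To pin the buffer size explicitly in hbase-site.xml, a sketch (the 128 KB value simply restates the HBase default):

<property>
  <name>hbase.dfs.client.read.shortcircuit.buffer.size</name>
  <value>131072</value> <!-- 128 KB of direct memory per open block -->
</property>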

Hedged Reads

Hedged reads are an HDFS feature introduced in Hadoop 2.4.0. Normally each read request is handled by a single spawned thread. With hedged reads enabled, if a read has not returned within a preconfigured time, the client spawns a second read request against a different block replica of the same data. Whichever read returns first is used, and the other request is discarded.

Hedged reads are meant to mitigate the occasional slow read, which may be caused by transient problems such as a failing disk or network jitter.

The HBase RegionServer is an HDFS client, so hedged reads can be enabled in HBase by adding the following parameters to the RegionServer's hbase-site.xml and tuning them for the actual environment:

- dfs.client.hedged.read.threadpool.size: default 0. The number of threads dedicated to serving hedged reads; if set to 0 (the default), hedged reads are disabled.
- dfs.client.hedged.read.threshold.millis: default 500 (0.5 seconds). How long to wait before spawning the second read thread.

Below is a sample configuration with a 10 ms threshold and 20 threads:

<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>20</value>
</property>
 
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>10</value>
</property>

Note: hedged reads in HDFS are analogous to speculative execution in MapReduce: they consume extra resources. Depending on cluster load and configuration, they may trigger many additional read operations, most of them against remote block replicas. The extra I/O and network traffic can have a noticeable impact on cluster performance, so test against a production-like workload before deciding whether to enable this feature.
