HBase

HBase Snapshot

ExportSnapshot

In Kerberos

[root@**2 ~]# hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot TestTable-snapshot1 -copy-to hdfs://**4:8020/tmp/bak/

Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
21/02/05 14:22:40 INFO snapshot.ExportSnapshot: Copy Snapshot Manifest
21/02/05 14:22:40 INFO hdfs.DFSClient: Created token for hdfs: HDFS_DELEGATION_TOKEN owner=hdfs/**[email protected], renewer=yarn, realUser=, issueDate=1612506160711, maxDate=1613110960711, sequenceNumber=285, masterKeyId=112 on 10.18.60.113:8020
21/02/05 14:22:40 INFO security.TokenCache: Got dt for hdfs://**4:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.18.60.113:8020, Ident: (token for hdfs: HDFS_DELEGATION_TOKEN owner=hdfs/**[email protected], renewer=yarn, realUser=, issueDate=1612506160711, maxDate=1613110960711, sequenceNumber=285, masterKeyId=112)
21/02/05 14:22:40 INFO client.RMProxy: Connecting to ResourceManager at **4/10.18.60.113:8032
21/02/05 14:22:41 INFO snapshot.ExportSnapshot: Loading Snapshot 'TestTable-snapshot1' hfile list
21/02/05 14:22:41 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
21/02/05 14:22:41 INFO mapreduce.JobSubmitter: number of splits:1
21/02/05 14:22:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1612494344882_0001
21/02/05 14:22:41 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: 10.18.60.113:8020, Ident: (token for hdfs: HDFS_DELEGATION_TOKEN owner=hdfs/**[email protected], renewer=yarn, realUser=, issueDate=1612506160711, maxDate=1613110960711, sequenceNumber=285, masterKeyId=112)
21/02/05 14:22:42 INFO impl.YarnClientImpl: Submitted application application_1612494344882_0001
21/02/05 14:22:42 INFO mapreduce.Job: The url to track the job: http://**4:8088/proxy/application_1612494344882_0001/
21/02/05 14:22:42 INFO mapreduce.Job: Running job: job_1612494344882_0001
21/02/05 14:22:45 INFO mapreduce.Job: Job job_1612494344882_0001 running in uber mode : false
21/02/05 14:22:45 INFO mapreduce.Job:  map 0% reduce 0%
21/02/05 14:22:46 INFO mapreduce.Job: Job job_1612494344882_0001 failed with state FAILED due to: Application application_1612494344882_0001 failed 2 times due to AM Container for appattempt_1612494344882_0001_000002 exited with  exitCode: -1000
For more detailed output, check application tracking page:http://**4:8088/proxy/application_1612494344882_0001/Then, click on links to logs of each attempt.
Diagnostics: Application application_1612494344882_0001 initialization failed (exitCode=255) with output: main : command provided 0
main : run as user is hdfs
main : requested yarn user is hdfs
Requested user hdfs is banned

Failing this attempt. Failing the application.
21/02/05 14:22:46 INFO mapreduce.Job: Counters: 0
21/02/05 14:22:46 ERROR snapshot.ExportSnapshot: Snapshot export failed
org.apache.hadoop.hbase.snapshot.ExportSnapshotException: Copy Files Map-Reduce Job failed
        at org.apache.hadoop.hbase.snapshot.ExportSnapshot.runCopyJob(ExportSnapshot.java:825)
        at org.apache.hadoop.hbase.snapshot.ExportSnapshot.run(ExportSnapshot.java:1020)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.snapshot.ExportSnapshot.innerMain(ExportSnapshot.java:1094)
        at org.apache.hadoop.hbase.snapshot.ExportSnapshot.main(ExportSnapshot.java:1098)

'Requested user hdfs is banned': in YARN's configuration, banned.users includes hdfs, yarn, mapred and bin, so a MapReduce job submitted as hdfs cannot launch its containers.
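
One way around this is to rerun the export as a user that is not banned, e.g. the hbase service user. A minimal sketch; the keytab path is an assumption, the principal matches the one visible in klist later in this session:

# obtain a TGT for the hbase principal (keytab path is an assumption)
kinit -kt /etc/security/keytabs/hbase.keytab hbase/**[email protected]
# rerun the export with the same arguments as above
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot TestTable-snapshot1 -copy-to hdfs://**4:8020/tmp/bak/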

[root@**2 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/

Found 6 items
drwxrwxrwx   - hdfs   supergroup          0 2021-02-05 14:26 hdfs://**4:8020/tmp/.cloudera_health_monitoring_canary_files
drwxrwxr-x   - hbase  supergroup          0 2021-02-04 14:45 hdfs://**4:8020/tmp/bak
drwxr-xr-x   - yarn   supergroup          0 2020-07-09 22:47 hdfs://**4:8020/tmp/hadoop-yarn
drwx--x--x   - hbase  supergroup          0 2020-11-11 16:50 hdfs://**4:8020/tmp/hbase-staging
drwx-wx-wx   - hive   supergroup          0 2020-10-30 15:15 hdfs://**4:8020/tmp/hive
drwxrwxrwt   - mapred hadoop              0 2020-09-16 17:14 hdfs://**4:8020/tmp/logs

[root@**2 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/

Found 1 items
drwxr-xr-x   - hbase supergroup          0 2021-02-04 14:45 hdfs://**4:8020/tmp/bak/.hbase-snapshot

[root@**2 ~]# hdfs dfs -chmod -R g+w hdfs://**4:8020/tmp/bak/

[root@**2 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/

Found 1 items
drwxrwxr-x   - hbase supergroup          0 2021-02-04 14:45 hdfs://**4:8020/tmp/bak/.hbase-snapshot

[root@**2 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/.hbase-snapshot

Found 1 items
drwxrwxr-x   - hbase supergroup          0 2021-02-05 14:22 hdfs://**4:8020/tmp/bak/.hbase-snapshot/.tmp

[root@**1 ~]# pssh -h /cdhdata/bak/list_krb_clients -P -l root usermod -a -G supergroup hive

[1] 15:06:31 [SUCCESS] **1
[2] 15:06:31 [SUCCESS] **3
[3] 15:06:31 [SUCCESS] **2
[4] 15:06:31 [SUCCESS] **4

[root@**1 ~]# pssh -h /cdhdata/bak/list_krb_clients -P -l root usermod -a -G supergroup hbase

[1] 15:06:38 [SUCCESS] **1
[2] 15:06:38 [SUCCESS] **3
[3] 15:06:38 [SUCCESS] **2
[4] 15:06:38 [SUCCESS] **4

[root@**1 ~]# pssh -h /cdhdata/bak/list_krb_clients -P -l root tail -2 /etc/group

**2: yy:x:1006:
supergroup:x:1101:hive,hbase
[1] 15:06:49 [SUCCESS] **2
**1: yy:x:1007:
supergroup:x:1101:hbase,hive
[2] 15:06:49 [SUCCESS] **1
**4: yy:x:1009:
supergroup:x:1103:hive,hbase
[3] 15:06:49 [SUCCESS] **4
**3: yy:x:1009:
supergroup:x:1103:hive,hbase
[4] 15:06:49 [SUCCESS] **3
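
To verify that HDFS actually picks up the new membership (group mapping is resolved on the NameNode by default, and may be cached), a quick check, sketched here:

# ask HDFS which groups it resolves for each user
hdfs groups hbase hive
# if the NameNode still shows stale groups, refresh the mapping cache
hdfs dfsadmin -refreshUserToGroupsMappings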

[root@**1 ~]# su - hive

[hive@**1 ~]$ hdfs dfs -rm -f -skipTrash hdfs://**4:8020/tmp/bak/info.txt

Deleted hdfs://**4:8020/tmp/bak/info.txt

[hive@**1 ~]$ hdfs dfs -copyFromLocal /cdhdata/bak/list_krb_clients hdfs://**4:8020/tmp/bak/

[hive@**1 ~]$ hdfs dfs -ls hdfs://**4:8020/tmp/bak

Found 1 items
-rw-r--r--   3 hive supergroup         32 2021-02-05 15:11 hdfs://**4:8020/tmp/bak/list_krb_clients

[root@**1 ~]# sudo -u hbase hdfs dfs -cp hdfs://**4:8020/tmp/bak/.hbase-snapshot/TestTable-snapshot1 hdfs://**4:8020/hbase/.hbase-snapshot/

21/02/05 16:27:02 WARN security.UserGroupInformation: PriviledgedActionException as:hbase (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
21/02/05 16:27:02 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
21/02/05 16:27:02 WARN security.UserGroupInformation: PriviledgedActionException as:hbase (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
cp: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "**1/10.18.60.114"; destination host is: "**4":8020;

sudo -u hbase does not inherit root's Kerberos ticket cache, which is why the copy fails with 'GSS initiate failed'. In this session root's ticket cache already holds an hbase principal, so the copy can simply be run directly as root:

[root@**1 ~]# klist

Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hbase/**[email protected]

Valid starting       Expires              Service principal
02/05/2021 14:36:36  02/06/2021 14:36:36  krbtgt/[email protected]
        renew until 02/09/2021 18:16:03

[root@**1 ~]# hdfs dfs -cp hdfs://**4:8020/tmp/bak/.hbase-snapshot/TestTable-snapshot1 hdfs://**4:8020/hbase/.hbase-snapshot/

[root@**1 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/

[root@**1 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/archive/

[root@**1 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/archive/data/

[root@**1 ~]# hdfs dfs -ls hdfs://**4:8020/tmp/bak/archive/data/default/

[root@**1 ~]# hbase hbck -details

[root@**1 ~]# hdfs dfs -cp -f hdfs://**4:8020/tmp/bak/archive/data/default/TestTable hdfs://**4:8020/hbase/archive/data/default/
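
With the snapshot manifest under /hbase/.hbase-snapshot and the HFiles under /hbase/archive, the snapshot should now be visible to HBase and restorable from the shell. A sketch (snapshot and table names follow the example above; TestTable_restored is a hypothetical target name):

hbase shell
hbase(main):001:0> list_snapshots
hbase(main):002:0> disable 'TestTable'            # the table must be disabled before a restore
hbase(main):003:0> restore_snapshot 'TestTable-snapshot1'
hbase(main):004:0> enable 'TestTable'
# or clone into a new table without touching the original:
hbase(main):005:0> clone_snapshot 'TestTable-snapshot1', 'TestTable_restored'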

Troubleshooting

RpcRetryingCaller - Call exception

While connecting to HBase through the Java API, a RpcRetryingCaller Call exception appeared; the log shows the connection failing and being retried many times. A sample exception log follows.

It looks like an error when requesting the ZooKeeper server ... to be analyzed.

[INFO ] 2021-07-20 20:24:41.154 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t1] RpcRetryingCaller - Call exception, tries=10, retries=35, started=38390 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:24:51.234 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t1] RpcRetryingCaller - Call exception, tries=11, retries=35, started=48472 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:25:29.671 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t2] RpcRetryingCaller - Call exception, tries=10, retries=35, started=38330 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:25:39.687 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t2] RpcRetryingCaller - Call exception, tries=11, retries=35, started=48346 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:26:18.125 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t3] RpcRetryingCaller - Call exception, tries=10, retries=35, started=38231 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:26:28.178 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t3] RpcRetryingCaller - Call exception, tries=11, retries=35, started=48285 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:27:06.826 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t4] RpcRetryingCaller - Call exception, tries=10, retries=35, started=38342 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0
[INFO ] 2021-07-20 20:27:16.858 [hconnection-0x71cf1b07-metaLookup-shared--pool2-t4] RpcRetryingCaller - Call exception, tries=11, retries=35, started=48374 ms ago, cancelled=false, msg=**3 row 'speechdialog,,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=**3,60020,1626745338656, seqNum=0

Solution: add the hostname of every HBase cluster node to the client's hosts file, for example:

**.**.**.114 ***
**.**.**.1*5 ***
**.**.**.1*2 ***
**.**.**.1*3 ***
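
The underlying cause: the client first asks ZooKeeper for the hbase:meta location, then talks to RegionServers by the hostnames they registered with (here **3), so every RegionServer hostname must resolve on the client machine. For reference, a minimal Java client sketch; the ZooKeeper quorum value is an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class HBaseConnectDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // ZooKeeper hands back RegionServer *hostnames*, not IPs; the client
        // must be able to resolve them (via DNS or /etc/hosts).
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // assumption
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("speechdialog"))) {
            // reaching this point requires a successful hbase:meta lookup,
            // i.e. the RegionServer hostname resolved correctly
            System.out.println("Connected to table " + table.getName());
        }
    }
}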

HDFS Read Tuning

For HBase running on HDFS storage, there are two main read-tuning options:

- an option that bypasses RPC, called short-circuit reads
- an option that lets HDFS speculatively read the same data from multiple DataNodes, called hedged reads

Short-Circuit Reads

Typically an HBase RegionServer is colocated with an HDFS DataNode, so data locality is usually very good. In the early Hadoop 1.0.0 release, however, a RegionServer talking to a DataNode went through the full RPC path just like any other client. After Hadoop 1.0.0, a short-circuit read option was added that bypasses the RPC stack entirely and lets local clients read data directly from the underlying filesystem.

Hadoop 2.x refined this implementation further. The DataNode and HDFS clients (HBase among them) can now use a feature called file descriptor passing, so the data exchange happens entirely at the OS kernel level. This is faster and more efficient than the earlier implementation and lets multiple processes on the same node interact efficiently.

To enable short-circuit reads in Hadoop, refer to the official documentation:

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html

Below is a reference configuration. It must be added to both hbase-site.xml and hdfs-site.xml, and the processes must be restarted for it to take effect:

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
  <description>
    This configuration parameter turns on short-circuit local reads.
  </description>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
  <description>
    Optional. This is a path to a UNIX domain socket that will be used for
    communication between the DataNode and local HDFS clients.
    If the string "_PORT" is present in this path, it will be replaced by the
    TCP port of the DataNode.
  </description>
</property>

Note: the file specified by dfs.domain.socket.path (it does not have to exist beforehand) must be owned by the OS root user or by the user running the DataNode service.
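
One way to satisfy this, sketched here under the assumption that the DataNode runs as user hdfs and uses the socket path configured above:

# create the parent directory; the DataNode creates the socket file itself
mkdir -p /var/lib/hadoop-hdfs
# root ownership satisfies the requirement (hdfs ownership would also work
# when the DataNode runs as hdfs)
chown root:root /var/lib/hadoop-hdfs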

Finally, the default size of the short-circuit read buffers is set by dfs.client.read.shortcircuit.buffer.size, and for a very busy HBase cluster the stock default can be too high. HBase therefore lowers it: if the value is not explicitly set, it drops from the usual 1 MB down to 128 KB (via the hbase.dfs.client.read.shortcircuit.buffer.size property, which defaults to 128 KB).

The HDFS client inside HBase allocates one direct byte buffer of the size given by hbase.dfs.client.read.shortcircuit.buffer.size for each open data block. Because this feature lets HBase keep its HDFS files open permanently, the memory consumed by these buffers can grow quickly.
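
To pin the buffer size explicitly in hbase-site.xml, a sketch (the 128 KB value simply restates the HBase default):

<property>
  <name>hbase.dfs.client.read.shortcircuit.buffer.size</name>
  <value>131072</value> <!-- 128 KB of direct memory per open block -->
</property>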

Hedged Reads

Hedged reads are an HDFS feature introduced in Hadoop 2.4.0. Normally each read request is handled by a single spawned thread. With hedged reads enabled, if a read has not returned within a preconfigured time, the client spawns a second read request against a different block replica of the same data. Whichever read returns first is used, and the other request is discarded.

Hedged reads are meant to mitigate the occasional slow read, which may be caused by transient problems such as a failing disk or network jitter.

The HBase RegionServer is an HDFS client, so hedged reads can be enabled in HBase by adding the following parameters to the RegionServer's hbase-site.xml and tuning them for the actual environment:

- dfs.client.hedged.read.threadpool.size: default 0. The number of threads dedicated to serving hedged reads; if set to 0 (the default), hedged reads are disabled.
- dfs.client.hedged.read.threshold.millis: default 500 (0.5 seconds). How long to wait before spawning the second read thread.

Below is a sample configuration with a 10 ms threshold and 20 threads:

<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>20</value>
</property>
 
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>10</value>
</property>

Note: hedged reads in HDFS are analogous to speculative execution in MapReduce: they consume extra resources. Depending on cluster load and configuration, they may trigger many additional read operations, most of them against remote block replicas. The extra I/O and network traffic can have a noticeable impact on cluster performance, so test against a production-like workload before deciding whether to enable this feature.
