HBase - kaushikdas/TechnicalWritings GitHub Wiki

1. Introduction

2. HBase Architecture

At its core, HBase is split into Region Servers that implement horizontal sharding (range partitioning) and automatically adapt to data growth by splitting and repartitioning regions. To achieve this auto-sharding, HBase uses a mechanism built on write-ahead commit logs and in-memory stores that are flushed and merged (compacted) asynchronously over time. These Region Servers sit on top of HDFS.
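As a rough sketch of the range-partitioning idea, the toy function below routes a row key to the region whose key range contains it. The split points are made up purely for illustration; this is not HBase's actual lookup code:

```python
import bisect

# Hypothetical split points: region i serves keys in [previous split, split_i).
# Three split points produce four regions.
SPLIT_POINTS = ["g", "n", "t"]

def region_for(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(SPLIT_POINTS, row_key)

# Keys below "g" land in region 0, keys in ["g", "n") in region 1, and so on.
```

When a region grows too large, HBase splits it, which in this toy model corresponds to adding a new split point.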

3. Logging into HBase

  1. ssh to the VM
$ ssh maria_dev@localhost -p 2222
maria_dev@localhost's password:
Last login: Fri Mar 26 07:09:40 2021 from 10.0.2.2
[maria_dev@sandbox ~]$ 
  2. Switch to root user and then switch to user hbase
[maria_dev@sandbox ~]$ su
Password:
[root@sandbox maria_dev]# su hbase
[hbase@sandbox maria_dev]$
  3. Run the hbase shell command to launch the shell
[hbase@sandbox maria_dev]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.5.0.0-1245, r53538b8ab6749cbb6fdc0fe448b89aa82495fb3f, Fri Aug 26 01:32:27 UTC 2016
hbase(main):001:0>
  4. Now fire the list command, which will list the existing HBase tables:
hbase(main):003:0> list
TABLE
ATLAS_ENTITY_AUDIT_EVENTS
atlas_titan
iemployee
3 row(s) in 0.4590 seconds

=> ["ATLAS_ENTITY_AUDIT_EVENTS", "atlas_titan", "iemployee"]
hbase(main):004:0>

Note: You may get the error below if the HBase service is not running:

hbase(main):001:0> list
TABLE

ERROR: Can't get master address from ZooKeeper; znode data == null

Here is some help for this command:
List all tables in hbase. Optional regular expression parameter could
be used to filter the output. Examples:

 hbase> list
 hbase> list 'abc.*'
 hbase> list 'ns:abc.*'
 hbase> list 'ns:.*'


hbase(main):002:0>

To fix this run the HBase service:

  • Login to Ambari as admin
  • Start HBase service
    • Right panel HBase -> Service Actions drop down -> Start

4. Basic Interactions with HBase shell

  • Creating a namespace: create_namespace '<namespace_name>'

    hbase(main):005:0* create_namespace 'kaushik'
    0 row(s) in 0.2850 seconds
    • Use list_namespace to list all available namespaces:
    hbase(main):006:0> list_namespace
    NAMESPACE
    default
    hbase
    kaushik
    3 row(s) in 0.0950 seconds

    default and hbase are the two built-in namespaces

    • Use describe_namespace '<namespace_name>' to describe a given namespace
    hbase(main):001:0> describe_namespace 'kaushik'
    DESCRIPTION
    {NAME => 'kaushik'}
    1 row(s) in 0.6230 seconds

    A namespace is the equivalent of a database in RDBMS terms

  • Create a table under a namespace: create '[<namespace_name>:]<table_name>', '<column_family>' [, '<column_family>', ...]

    # cf_1 is the (only) column family, 
    # ...no need to mention columns inside it
    hbase(main):002:0> create 'kaushik:test', 'cf_1'
    0 row(s) in 2.7700 seconds
    
    => Hbase::Table - kaushik:test

    If the namespace is not specified, the table will be created under the default namespace

    • list will list all available tables
    hbase(main):001:0> list
    TABLE
    ATLAS_ENTITY_AUDIT_EVENTS
    atlas_titan
    iemployee
    kaushik:test
    4 row(s) in 0.2570 seconds
    
    => ["ATLAS_ENTITY_AUDIT_EVENTS", "atlas_titan", "iemployee", "kaushik:test"]
    • Use describe to describe any table
    hbase(main):003:0> describe 'kaushik:test'
    Table kaushik:test is ENABLED
    kaushik:test
    COLUMN FAMILIES DESCRIPTION
    {NAME => 'cf_1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER'
    , COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    1 row(s) in 0.1810 seconds
    • The drop command is used to delete a table; before dropping, the table needs to be disabled using the disable command
    hbase(main):004:0> disable 'kaushik:test'
    0 row(s) in 3.2250 seconds
    
    hbase(main):005:0> drop 'kaushik:test'
  • alter_namespace command is used to alter a created namespace

    hbase(main):005:0> alter_namespace 'kaushik', {METHOD => 'set', \
    hbase(main):006:1*     'VERSION' => 'DRAFT' }
    0 row(s) in 0.0570 seconds

    After the above command, a new property VERSION is added to the namespace kaushik

    hbase(main):007:0> describe_namespace 'kaushik'
    DESCRIPTION
    {NAME => 'kaushik', VERSION => 'DRAFT'}
    1 row(s) in 0.0130 seconds
  • A namespace can be dropped using drop_namespace

    hbase(main):008:0> drop_namespace 'kaushik'
    0 row(s) in 0.0680 seconds
    
    hbase(main):009:0> list_namespace
    NAMESPACE
    default
    hbase
    2 row(s) in 0.6190 seconds
    
    hbase(main):010:0>

4.1. Table-specific interactions

  • DDL and DML commands (CRUD operations)

  • Create a table customer under namespace sales with versioning enabled:

    # enable versioning (4) for first column family ctInfo
    hbase(main):001:0> create 'sales:customer', \
    hbase(main):002:0*   {NAME => 'ctInfo', VERSIONS => 4}, \
    hbase(main):003:0*   'demo'
    0 row(s) in 2.7460 seconds
    
    => Hbase::Table - sales:customer
    • Versioning needs to be enabled for each column family separately; if not specified, it defaults to 1

      • Here we have enabled 4 versions for the column family ctInfo. Therefore, for each column in this column family a maximum of 4 timestamped versions will be maintained for each value. We can verify this by describing the created 'sales:customer' table: notice that column family ctInfo has VERSIONS => '4', whereas column family demo has VERSIONS => '1'
      hbase(main):004:0> describe 'sales:customer'
      Table sales:customer is ENABLED
      sales:customer
      COLUMN FAMILIES DESCRIPTION
      {NAME => 'ctInfo', BLOOMFILTER => 'ROW', VERSIONS => '4', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVE
      R', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
      {NAME => 'demo', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER'
      , COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
      2 row(s) in 0.3020 seconds
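The VERSIONS setting can be illustrated with a toy Python model that keeps at most N timestamped values per cell, newest first. This is a sketch of the semantics only, not HBase's implementation:

```python
class VersionedCell:
    """Toy model of an HBase cell that retains at most max_versions values."""

    def __init__(self, max_versions: int = 1):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp: int, value: str) -> None:
        self.versions.append((timestamp, value))
        self.versions.sort(reverse=True)        # newest first
        del self.versions[self.max_versions:]   # discard excess old versions

    def get(self, versions: int = 1):
        """Return up to `versions` newest (timestamp, value) pairs."""
        return self.versions[:versions]

# A cell in a VERSIONS => 4 family keeps up to four values (values illustrative):
cell = VersionedCell(max_versions=4)
cell.put(100, "old-value")
cell.put(200, "new-value")
```

A plain get returns only the newest value, mirroring what scan and get do in the shell unless VERSIONS is passed explicitly.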
  • scan — equivalent of select *

    hbase(main):005:0> scan 'sales:customer'
    ROW                                      COLUMN+CELL
    0 row(s) in 0.0720 seconds
    • The created table has no rows because we have not added any data yet
  • put — adding data to table

    • Add data for ONLY 1 column at a time
    hbase(main):006:0> put 'sales:customer', \           # table
    hbase(main):007:0*   'C00001', \                     # row key
    hbase(main):008:0*   'ctInfo:name', 'Sudhir Mishra'  # col name, value
    0 row(s) in 0.0890 seconds
    
    hbase(main):009:0> put 'sales:customer', \
    hbase(main):010:0*   'C00001', \
    hbase(main):011:0*   'ctInfo:mobile', '9988771122'
    0 row(s) in 0.0110 seconds
    
    hbase(main):012:0> put 'sales:customer', \
    hbase(main):013:0*   'C00001', \
    hbase(main):014:0*   'ctInfo:email', '[email protected]'
    0 row(s) in 0.0130 seconds
    
    hbase(main):015:0> put 'sales:customer', \
    hbase(main):016:0*   'C00001', \
    hbase(main):017:0*   'demo:age', '34'
    0 row(s) in 0.0240 seconds
    
    hbase(main):018:0> put 'sales:customer', \
    hbase(main):019:0*   'C00001', \
    hbase(main):020:0*   'demo:occupation', 'advocate'
    0 row(s) in 0.0140 seconds
    
    hbase(main):021:0> put 'sales:customer', \
    hbase(main):022:0*   'C00002', \
    hbase(main):023:0*   'ctInfo:name', 'Kunal K Bajaj'
    0 row(s) in 0.0140 seconds
    
    hbase(main):024:0> put 'sales:customer', \
    hbase(main):025:0*   'C00002', \
    hbase(main):026:0*   'ctInfo:tel', '02211223344'
    0 row(s) in 0.0130 seconds
    
    hbase(main):027:0> put 'sales:customer', \
    hbase(main):028:0*   'C00002', \
    hbase(main):029:0*   'demo:age', '42'
    0 row(s) in 0.0110 seconds
    
    hbase(main):030:0> put 'sales:customer', \
    hbase(main):031:0*   'C00002', \
    hbase(main):032:0*   'demo:education', 'graduate'
    0 row(s) in 0.0110 seconds
  • See the added data using scan:

hbase(main):033:0> scan 'sales:customer'
ROW                                      COLUMN+CELL
 C00001                                  column=ctInfo:email, timestamp=1616771164898, [email protected]
 C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
 C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
 C00001                                  column=demo:age, timestamp=1616771187409, value=34
 C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
 C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
 C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
 C00002                                  column=demo:age, timestamp=1616771379529, value=42
 C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
2 row(s) in 0.0530 seconds
  • Logical view of the table will be:
{
  "C00001" : {
    "ctInfo" : {
      1616771022652 : { "name" : "Sudhir Mishra" },
      1616771062503 : { "mobile" : "9988771122" },
      1616771164898 : { "email" : "[email protected]" }
    },
    "demo" : {
      1616771187409 : { "age" : "34" },
      1616771214484 : { "occupation" : "advocate" }
    }
  },
  "C00002" : {
    "ctInfo" : {
      1616771311930 : { "name" : "Kunal K Bajaj" },
      1616771354236 : { "tel" : "02211223344" }
    },
    "demo" : {
      1616771379529 : { "age" : "42" },
      1616771405214 : { "education" : "graduate" }
    }
  }
}
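The "latest value wins" reading of this logical view can be sketched in Python. The nested dict below is a trimmed copy of the structure above (the email cells are omitted here), and the lookup function is illustrative only:

```python
# Nested dict mirroring the logical view: row -> family -> timestamp -> {col: value}
table = {
    "C00001": {
        "ctInfo": {
            1616771022652: {"name": "Sudhir Mishra"},
            1616771062503: {"mobile": "9988771122"},
        },
        "demo": {
            1616771187409: {"age": "34"},
        },
    },
}

def latest(table, row, family, column):
    """Return the value with the highest timestamp for one column, or None."""
    cells = table.get(row, {}).get(family, {})
    best = None
    for ts in sorted(cells):   # ascending: later timestamps overwrite earlier
        if column in cells[ts]:
            best = cells[ts][column]
    return best
```

This is exactly what a plain scan or get shows: one value per column, the one with the newest timestamp.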

This is a good example of sparse table:

| Row Key | ctInfo:name | ctInfo:mobile | ctInfo:email | ctInfo:tel | demo:age | demo:occupation | demo:education |
|---------|-------------|---------------|--------------|------------|----------|-----------------|----------------|
| C00001 | 1616771022652: Sudhir Mishra | 1616771062503: 9988771122 | 1616771164898: [email protected] | | 1616771187409: 34 | 1616771214484: advocate | |
| C00002 | 1616771311930: Kunal K Bajaj | | | 1616771354236: 02211223344 | 1616771379529: 42 | | 1616771405214: graduate |
  • Now let us add a new email id for "C00001" and do a scan — we see that we get only the latest email

    hbase(main):034:0> put 'sales:customer', \
    hbase(main):035:0*   'C00001', \
    hbase(main):036:0*   'ctInfo:email', '[email protected]'
    0 row(s) in 0.3130 seconds
    
    hbase(main):037:0> scan 'sales:customer'
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
    C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
    C00002                                  column=demo:age, timestamp=1616771379529, value=42
    C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
    2 row(s) in 0.0550 seconds
    
  • To get all the versions of the email column values we need to specify VERSIONS for scan:

    hbase(main):039:0> scan 'sales:customer', {VERSIONS => 4}
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:email, timestamp=1616771164898, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
    C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
    C00002                                  column=demo:age, timestamp=1616771379529, value=42
    C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
    2 row(s) in 0.0570 seconds
    • Now we see both email ids
  • get — fetch a specific row by key

    hbase(main):041:0> get 'sales:customer', 'C00001'
    COLUMN                                   CELL
    ctInfo:email                            timestamp=1616771484147, [email protected]
    ctInfo:mobile                           timestamp=1616771062503, value=9988771122
    ctInfo:name                             timestamp=1616771022652, value=Sudhir Mishra
    demo:age                                timestamp=1616771187409, value=34
    demo:occupation                         timestamp=1616771214484, value=advocate
    5 row(s) in 0.0220 seconds
    • Again, to get all versions we need to specify VERSIONS
    hbase(main):043:0> get 'sales:customer', 'C00001', {COLUMN => 'ctInfo', VERSIONS => 2}
    COLUMN                                   CELL
    ctInfo:email                            timestamp=1616771484147, [email protected]
    ctInfo:email                            timestamp=1616771164898, [email protected]
    ctInfo:mobile                           timestamp=1616771062503, value=9988771122
    ctInfo:name                             timestamp=1616771022652, value=Sudhir Mishra
    4 row(s) in 0.0410 seconds
    • Now we see both email ids
  • delete and deleteall — delete a specific column value or a full row

    • To delete a specific version of a column value we need to give the version (timestamp):
    hbase(main):045:0> delete 'sales:customer', 'C00001', 'ctInfo:email', 1616771164898
    0 row(s) in 0.0250 seconds
    
    hbase(main):046:0> get 'sales:customer', 'C00001', {COLUMN => 'ctInfo', VERSIONS => 2}
    COLUMN                                   CELL
    ctInfo:email                            timestamp=1616771484147, [email protected]
    ctInfo:mobile                           timestamp=1616771062503, value=9988771122
    ctInfo:name                             timestamp=1616771022652, value=Sudhir Mishra
    3 row(s) in 0.0280 seconds
    
    hbase(main):047:0> scan 'sales:customer'
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
    C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
    C00002                                  column=demo:age, timestamp=1616771379529, value=42
    C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
    2 row(s) in 0.0530 seconds
    
    hbase(main):048:0> scan 'sales:customer', {VERSIONS => 4}
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
    C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
    C00002                                  column=demo:age, timestamp=1616771379529, value=42
    C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
    2 row(s) in 0.1720 seconds
    • Delete a full row with deleteall
    hbase(main):050:0> deleteall 'sales:customer', 'C00002'
    0 row(s) in 0.0060 seconds
    
    hbase(main):051:0> scan 'sales:customer', {VERSIONS => 4}
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    1 row(s) in 0.0300 seconds
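The difference between delete (one cell version) and deleteall (a whole row) can be sketched with the same kind of toy structure; the rows, columns, and timestamps below are illustrative only:

```python
# Toy store: row -> {(family:qualifier, timestamp): value}
store = {
    "C00001": {
        ("ctInfo:email", 100): "older-email",
        ("ctInfo:email", 200): "newer-email",
    },
    "C00002": {
        ("ctInfo:name", 300): "Kunal K Bajaj",
    },
}

def delete(store, row, column, timestamp):
    """Remove one specific version of one column, like the shell's delete."""
    store.get(row, {}).pop((column, timestamp), None)

def deleteall(store, row):
    """Remove an entire row, like the shell's deleteall."""
    store.pop(row, None)

delete(store, "C00001", "ctInfo:email", 100)   # older email version is gone
deleteall(store, "C00002")                     # the whole row is gone
```

Note that in real HBase deletes are written as tombstone markers and the data is physically removed later during compaction; the toy model skips that detail.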

5. Other Ways to load data to HBase

5.1. Using mapreduce.ImportTsv

  • We will create a sales table and populate it with sales order information

    • From Ambari start HBase service
[maria_dev@sandbox ~]$ su
Password:
[root@sandbox maria_dev]# su hbase
[hbase@sandbox maria_dev]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.5.0.0-1245, r53538b8ab6749cbb6fdc0fe448b89aa82495fb3f, Fri Aug 26 01:32:27 UTC 2016

hbase(main):001:0> create table "kaushik:sales", "cf_order"
NoMethodError: undefined method `table' for #<Object:0x3d3c886f>

# note: the shell verb is just 'create' (there is no 'create table'):
hbase(main):007:0> create "kaushik:sales", "cf_order"
0 row(s) in 3.3440 seconds

=> Hbase::Table - kaushik:sales

hbase(main):008:0> describe "kaushik:sales"
Table kaushik:sales is ENABLED
kaushik:sales
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf_order', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FORE
VER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.3510 seconds

hbase(main):009:0>

  1. We will use salesOrders.csv from the Internet, saving it locally as sales.csv
$ curl -o ./sales.csv https://raw.githubusercontent.com/bsullins/data/master/salesOrders.csv
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  453k  100  453k    0     0   226k      0  0:00:02  0:00:02 --:--:--  200k

$ ls sales.csv
sales.csv

$ head sales.csv
RowKey,Order ID,Order Date,Ship Date,Ship Mode,Profit,Quantity,Sales
1,CA-2011-100006,2013-09-07 00:00:00,2013-09-13 00:00:00,Standard Class,109.61,3.00,377.97
2,CA-2011-100090,2013-07-08 00:00:00,2013-07-12 00:00:00,Standard Class,-19.09,9.00,699.19
3,CA-2011-100293,2013-03-14 00:00:00,2013-03-18 00:00:00,Standard Class,31.87,6.00,91.06
4,CA-2011-100328,2013-01-28 00:00:00,2013-02-03 00:00:00,Standard Class,1.33,1.00,3.93
5,CA-2011-100363,2013-04-08 00:00:00,2013-04-15 00:00:00,Standard Class,7.72,5.00,21.38
6,CA-2011-100391,2013-05-25 00:00:00,2013-05-29 00:00:00,Standard Class,6.73,2.00,14.62
7,CA-2011-100678,2013-04-18 00:00:00,2013-04-22 00:00:00,Standard Class,61.79,11.00,697.07
8,CA-2011-100706,2013-12-16 00:00:00,2013-12-18 00:00:00,Second Class,17.72,8.00,129.44
9,CA-2011-100762,2013-11-24 00:00:00,2013-11-29 00:00:00,Standard Class,219.08,11.00,508.62
  2. We will remove the header (1st) row:
$ sed -i '1d' sales.csv

$ head sales.csv
1,CA-2011-100006,2013-09-07 00:00:00,2013-09-13 00:00:00,Standard Class,109.61,3.00,377.97
2,CA-2011-100090,2013-07-08 00:00:00,2013-07-12 00:00:00,Standard Class,-19.09,9.00,699.19
...
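The per-line mapping that the import will perform — first field as the row key, remaining fields into cf_order columns — can be sketched in Python. The qualifier names below are the ones this walkthrough uses for the cf_order family, and the sample line is from the file above:

```python
import csv
import io

# cf_order column qualifiers, in the order the fields appear in sales.csv
COLUMNS = ["orderId", "orderDate", "shipDate", "shipMode",
           "profit", "quantity", "sales"]

def to_hbase_rows(csv_text: str):
    """Map each CSV line to (row_key, {'cf_order:<qualifier>': value})."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        row_key, *values = fields
        rows.append((row_key,
                     {f"cf_order:{c}": v for c, v in zip(COLUMNS, values)}))
    return rows

sample = "1,CA-2011-100006,2013-09-07 00:00:00,2013-09-13 00:00:00,Standard Class,109.61,3.00,377.97"
```

Each CSV line becomes one HBase row keyed by its first field, with every other field stored as a separate cell in the cf_order family.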
  3. We will transfer the downloaded file to the VM
$ scp -P 2222 sales.csv maria_dev@localhost:/home/maria_dev/.
 
  • Now we will copy the file sales.csv to HDFS
$ ssh -p 2222 maria_dev@localhost
maria_dev@localhost's password:
Last login: Fri Mar 26 10:22:44 2021 from 10.0.2.2

[maria_dev@sandbox ~]$ head sales.csv
1,CA-2011-100006,2013-09-07 00:00:00,2013-09-13 00:00:00,Standard Class,109.61,3.00,377.97
2,CA-2011-100090,2013-07-08 00:00:00,2013-07-12 00:00:00,Standard Class,-19.09,9.00,699.19
...

[maria_dev@sandbox ~]$ hadoop fs -copyFromLocal sales.csv /tmp

[maria_dev@sandbox ~]$ hadoop fs -ls /tmp
Found 7 items
drwxrwxrwx   - maria_dev hdfs          0 2020-03-05 17:09 /tmp/.pigjobs
drwxrwxrwx   - maria_dev hdfs          0 2020-03-05 17:09 /tmp/.pigscripts
drwxrwxrwx   - maria_dev hdfs          0 2020-03-05 17:09 /tmp/.pigstore
drwxr-xr-x   - hdfs      hdfs          0 2016-10-25 07:48 /tmp/entity-file-history
drwx-wx-wx   - ambari-qa hdfs          0 2016-10-25 07:51 /tmp/hive
drwx------   - maria_dev hdfs          0 2020-03-05 17:14 /tmp/maria_dev
-rw-r--r--   1 maria_dev hdfs     459327 2021-03-31 16:02 /tmp/sales.csv
[maria_dev@sandbox ~]$
  • We will now load the kaushik:sales HBase table with data from sales.csv:
[maria_dev@sandbox ~]$ su

# -Dimporttsv.separator gives the field separator in the input file;
# -Dimporttsv.columns maps the input fields in order: the first is the
#   row key (HBASE_ROW_KEY), the rest are family:qualifier column names;
# the last two arguments are the target table name and the HDFS path of
#   the input csv file
[root@sandbox maria_dev]# hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns="HBASE_ROW_KEY,cf_order:orderId,cf_order:orderDate,cf_order:shipDate,cf_order:shipMode,cf_order:profit,cf_order:quantity,cf_order:sales" \
  kaushik:sales \
  hdfs://sandbox.hortonworks.com:/tmp/sales.csv
  • This will fire a map-reduce job:
2021-03-31 16:39:02,868 INFO  [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x62e7f11d connecting to ZooKeeper ensemble=sandbox.hortonworks.com:2181
2021-03-31 16:39:02,879 INFO  [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1245--1, built on 08/26/2016 00:47 GMT
2021-03-31 16:39:02,879 INFO  [main] zookeeper.ZooKeeper: Client environment:host.name=sandbox.hortonworks.com
2021-03-31 16:39:02,879 INFO  [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_111
...
  • and the table will be filled with rows:
hbase(main):002:0> scan "kaushik:sales"
...
 999                                     column=cf_order:shipDate, timestamp=1617208741407, value=2014-08-23 00:00:00
 999                                     column=cf_order:shipMode, timestamp=1617208741407, value=Same Day
5009 row(s) in 11.8840 seconds

5.2. Using HBase REST API

  • HBase exposes a REST API that can be accessed directly from the client system

    • We will use a Python client to create a sparse table using the movie rating data:
$ head u.data
0       50      5       881250949
0       172     5       881250949
0       133     1       881250949
196     242     3       881250949
186     302     3       891717742
22      377     1       878887116
244     51      2       880606923
166     346     1       886397596
298     474     4       884182806
115     265     2       881171488
- The first column is the user id, the second is the movie id, the third is the user's rating of the movie, and the last is the rating timestamp

- In our sparse table we will keep only the movie ratings by each user, so logically the table will be something like:
  +---------+       +--------------+     +---------------+     +---------------+
  | user_id | ----> | movie_id: 50 | --> | movie_id: 172 | --> | movie_id: 133 |
  +---------+       | rating: 5    |     | rating: 5     |     | rating: 1     |
                    +--------------+     +---------------+     +---------------+
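Building that sparse per-user structure from u.data lines can be sketched in plain Python before involving HBase at all; the sample lines are taken from the head output above:

```python
def build_ratings(lines):
    """Build user_id -> {movie_id: rating}; only rated movies appear per user."""
    table = {}
    for line in lines:
        user, movie, rating, _timestamp = line.split()
        table.setdefault(user, {})[movie] = int(rating)
    return table

# Whitespace-separated lines, as in u.data
sample = [
    "0\t50\t5\t881250949",
    "0\t172\t5\t881250949",
    "0\t133\t1\t881250949",
]
```

Each user row only carries columns for the movies that user actually rated, which is exactly the sparseness HBase column families allow.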
  • We will use the starbase Python package, which provides a nice programming interface over the HBase REST API. It can be installed with pip install starbase

  • To use the HBase REST interface we need to first start the HBase REST server

# start the HBase REST service on port 8000, with debugging
# information served on port 8001
[root@sandbox maria_dev]# /usr/hdp/current/hbase-master/bin/hbase-daemon.sh \
>   start rest -p 8000 --infoport 8001
starting rest, logging to /var/log/hbase/hbase-maria_dev-rest-sandbox.hortonworks.com.out
[root@sandbox maria_dev]#
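The REST interface exchanges cells as JSON with base64-encoded row keys, column names, and values. Below is a sketch of building such a put payload by hand; the table and row used in the comment are hypothetical, and in practice starbase performs this encoding for you:

```python
import base64
import json

def b64(s: str) -> str:
    """Base64-encode a string, returning str (the REST API expects base64)."""
    return base64.b64encode(s.encode()).decode()

def put_payload(row_key: str, cells: dict) -> str:
    """JSON body for a row PUT against the HBase REST server."""
    return json.dumps({
        "Row": [{
            "key": b64(row_key),
            "Cell": [{"column": b64(col), "$": b64(val)}
                     for col, val in cells.items()],
        }]
    })

# Hypothetical example: user "0" rated movie 50 with a 5; the body would be
# PUT to http://localhost:8000/<table>/0 with Content-Type: application/json
body = put_payload("0", {"rating:50": "5"})
```

The "$" field is the REST schema's name for a cell's value; everything binary-ish travels base64-encoded because HBase stores plain bytes.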

  • Also we need to enable the corresponding port forwarding for the VM:

    1. Run VM
    2. Running VM -> Right Click -> Settings -> Network -> Advanced -> Port Forwarding -> add 2 New port forwarding rule by clicking + icon:
Name            Protocol Host IP   Host Port Guest IP Guest Port
HBase REST      TCP      127.0.0.1 8000               8000
HBase REST Info TCP      127.0.0.1 8001               8001