HBase - kaushikdas/TechnicalWritings GitHub Wiki

1. Introduction

2. HBase Architecture

At its core, HBase is split into Region Servers that implement horizontal sharding (range partitioning) and automatically adapt to data growth by splitting and repartitioning regions. To achieve this auto-sharding, HBase uses a mechanism built on write-ahead commit logs and in-memory stores that are flushed and merged (compacted) asynchronously over time. These Region Servers sit on top of HDFS.
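As a rough sketch of the range-partitioning idea, the toy function below routes a row key to the region whose key range contains it. The split points are made up purely for illustration; this is not HBase's actual lookup code:

```python
import bisect

# Hypothetical split points: region i serves keys in [previous split, split_i).
# Three split points produce four regions.
SPLIT_POINTS = ["g", "n", "t"]

def region_for(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(SPLIT_POINTS, row_key)

# Keys below "g" land in region 0, keys in ["g", "n") in region 1, and so on.
```

When a region grows too large, HBase splits it, which in this toy model corresponds to adding a new split point.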

3. Logging into HBase

  1. ssh to the VM
$ ssh maria_dev@localhost -p 2222
maria_dev@localhost's password:
Last login: Fri Mar 26 07:09:40 2021 from 10.0.2.2
[maria_dev@sandbox ~]$ 
  2. Switch to root user and then switch to user hbase
[maria_dev@sandbox ~]$ su
Password:
[root@sandbox maria_dev]# su hbase
[hbase@sandbox maria_dev]$
  3. Run the hbase shell command to launch the shell
[hbase@sandbox maria_dev]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.5.0.0-1245, r53538b8ab6749cbb6fdc0fe448b89aa82495fb3f, Fri Aug 26 01:32:27 UTC 2016
hbase(main):001:0>
  4. Now fire the list command, which will list the existing HBase tables:
hbase(main):003:0> list
TABLE
ATLAS_ENTITY_AUDIT_EVENTS
atlas_titan
iemployee
3 row(s) in 0.4590 seconds

=> ["ATLAS_ENTITY_AUDIT_EVENTS", "atlas_titan", "iemployee"]
hbase(main):004:0>

Note: You may get the error below if the HBase service is not running:

hbase(main):001:0> list
TABLE

ERROR: Can't get master address from ZooKeeper; znode data == null

Here is some help for this command:
List all tables in hbase. Optional regular expression parameter could
be used to filter the output. Examples:

 hbase> list
 hbase> list 'abc.*'
 hbase> list 'ns:abc.*'
 hbase> list 'ns:.*'


hbase(main):002:0>

To fix this run the HBase service:

  • Login to Ambari as admin
  • Start HBase service
    • Right panel HBase -> Service Actions drop down -> Start

4. Basic Interactions with HBase shell

  • Creating a namespace: create_namespace '<namespace_name>'

    hbase(main):005:0* create_namespace 'kaushik'
    0 row(s) in 0.2850 seconds
    • Use list_namespace to list all available namespaces:
    hbase(main):006:0> list_namespace
    NAMESPACE
    default
    hbase
    kaushik
    3 row(s) in 0.0950 seconds

    default and hbase are the two built-in namespaces

    • Use describe_namespace '<namespace_name>' to describe a given namespace
    hbase(main):001:0> describe_namespace 'kaushik'
    DESCRIPTION
    {NAME => 'kaushik'}
    1 row(s) in 0.6230 seconds

    A namespace is the equivalent of a database in RDBMS terms

  • Create a table under a namespace: create '[<namespace_name>:]<table_name>', '<column_family>' [, '<column_family>', ...]

    # cf_1 is the (only) column family, 
    # ...no need to mention columns inside it
    hbase(main):002:0> create 'kaushik:test', 'cf_1'
    0 row(s) in 2.7700 seconds
    
    => Hbase::Table - kaushik:test

    If the namespace is not specified, the table will be created under the default namespace

    • list will list all available tables
    hbase(main):001:0> list
    TABLE
    ATLAS_ENTITY_AUDIT_EVENTS
    atlas_titan
    iemployee
    kaushik:test
    4 row(s) in 0.2570 seconds
    
    => ["ATLAS_ENTITY_AUDIT_EVENTS", "atlas_titan", "iemployee", "kaushik:test"]
    • Use describe to describe any table
    hbase(main):003:0> describe 'kaushik:test'
    Table kaushik:test is ENABLED
    kaushik:test
    COLUMN FAMILIES DESCRIPTION
    {NAME => 'cf_1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER'
    , COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    1 row(s) in 0.1810 seconds
    • The drop command is used to delete a table; before dropping, the table needs to be disabled using the disable command
    hbase(main):004:0> disable 'kaushik:test'
    0 row(s) in 3.2250 seconds
    
    hbase(main):005:0> drop 'kaushik:test'
  • alter_namespace command is used to alter a created namespace

    hbase(main):005:0> alter_namespace 'kaushik', {METHOD => 'set', \
    hbase(main):006:1*     'VERSION' => 'DRAFT' }
    0 row(s) in 0.0570 seconds

    After the above command, a new property VERSION is added to the namespace kaushik

    hbase(main):007:0> describe_namespace 'kaushik'
    DESCRIPTION
    {NAME => 'kaushik', VERSION => 'DRAFT'}
    1 row(s) in 0.0130 seconds
  • A namespace can be dropped using drop_namespace

    hbase(main):008:0> drop_namespace 'kaushik'
    0 row(s) in 0.0680 seconds
    
    hbase(main):009:0> list_namespace
    NAMESPACE
    default
    hbase
    2 row(s) in 0.6190 seconds
    
    hbase(main):010:0>

4.1. Table-specific interactions

  • DDL and DML commands (CRUD operations)

  • Create a table customer under namespace sales with versioning enabled:

    # enable versioning (4) for first column family ctInfo
    hbase(main):001:0> create 'sales:customer', \
    hbase(main):002:0*   {NAME => 'ctInfo', VERSIONS => 4}, \
    hbase(main):003:0*   'demo'
    0 row(s) in 2.7460 seconds
    
    => Hbase::Table - sales:customer
    • Versioning needs to be enabled for each column family separately; if not specified, it defaults to 1

      • Here we have enabled 4 versions for the column family ctInfo. Therefore, for each column in this column family a maximum of 4 timestamped versions will be maintained for each value. We can verify this by describing the created 'sales:customer' table: notice that column family ctInfo has VERSIONS => '4', whereas column family demo has VERSIONS => '1'
      hbase(main):004:0> describe 'sales:customer'
      Table sales:customer is ENABLED
      sales:customer
      COLUMN FAMILIES DESCRIPTION
      {NAME => 'ctInfo', BLOOMFILTER => 'ROW', VERSIONS => '4', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVE
      R', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
      {NAME => 'demo', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER'
      , COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
      2 row(s) in 0.3020 seconds
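The VERSIONS setting can be illustrated with a toy Python model that keeps at most N timestamped values per cell, newest first. This is a sketch of the semantics only, not HBase's implementation:

```python
class VersionedCell:
    """Toy model of an HBase cell that retains at most max_versions values."""

    def __init__(self, max_versions: int = 1):
        self.max_versions = max_versions
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp: int, value: str) -> None:
        self.versions.append((timestamp, value))
        self.versions.sort(reverse=True)        # newest first
        del self.versions[self.max_versions:]   # discard excess old versions

    def get(self, versions: int = 1):
        """Return up to `versions` newest (timestamp, value) pairs."""
        return self.versions[:versions]

# A cell in a VERSIONS => 4 family keeps up to four values (values illustrative):
cell = VersionedCell(max_versions=4)
cell.put(100, "old-value")
cell.put(200, "new-value")
```

A plain get returns only the newest value, mirroring what scan and get do in the shell unless VERSIONS is passed explicitly.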
  • scan — equivalent of select *

    hbase(main):005:0> scan 'sales:customer'
    ROW                                      COLUMN+CELL
    0 row(s) in 0.0720 seconds
    • The created table has no rows because we have not added any data yet
  • put — adding data to table

    • Add data for ONLY 1 column at a time
    hbase(main):006:0> put 'sales:customer', \           # table
    hbase(main):007:0*   'C00001', \                     # row key
    hbase(main):008:0*   'ctInfo:name', 'Sudhir Mishra'  # col name, value
    0 row(s) in 0.0890 seconds
    
    hbase(main):009:0> put 'sales:customer', \
    hbase(main):010:0*   'C00001', \
    hbase(main):011:0*   'ctInfo:mobile', '9988771122'
    0 row(s) in 0.0110 seconds
    
    hbase(main):012:0> put 'sales:customer', \
    hbase(main):013:0*   'C00001', \
    hbase(main):014:0*   'ctInfo:email', '[email protected]'
    0 row(s) in 0.0130 seconds
    
    hbase(main):015:0> put 'sales:customer', \
    hbase(main):016:0*   'C00001', \
    hbase(main):017:0*   'demo:age', '34'
    0 row(s) in 0.0240 seconds
    
    hbase(main):018:0> put 'sales:customer', \
    hbase(main):019:0*   'C00001', \
    hbase(main):020:0*   'demo:occupation', 'advocate'
    0 row(s) in 0.0140 seconds
    
    hbase(main):021:0> put 'sales:customer', \
    hbase(main):022:0*   'C00002', \
    hbase(main):023:0*   'ctInfo:name', 'Kunal K Bajaj'
    0 row(s) in 0.0140 seconds
    
    hbase(main):024:0> put 'sales:customer', \
    hbase(main):025:0*   'C00002', \
    hbase(main):026:0*   'ctInfo:tel', '02211223344'
    0 row(s) in 0.0130 seconds
    
    hbase(main):027:0> put 'sales:customer', \
    hbase(main):028:0*   'C00002', \
    hbase(main):029:0*   'demo:age', '42'
    0 row(s) in 0.0110 seconds
    
    hbase(main):030:0> put 'sales:customer', \
    hbase(main):031:0*   'C00002', \
    hbase(main):032:0*   'demo:education', 'graduate'
    0 row(s) in 0.0110 seconds
  • See the added data using scan:

hbase(main):033:0> scan 'sales:customer'
ROW                                      COLUMN+CELL
 C00001                                  column=ctInfo:email, timestamp=1616771164898, [email protected]
 C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
 C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
 C00001                                  column=demo:age, timestamp=1616771187409, value=34
 C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
 C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
 C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
 C00002                                  column=demo:age, timestamp=1616771379529, value=42
 C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
2 row(s) in 0.0530 seconds
  • Logical view of the table will be:
{
  "C00001" : {
    "ctInfo" : {
      1616771022652 : { "name" : "Sudhir Mishra" },
      1616771062503 : { "mobile" : "9988771122" },
      1616771164898 : { "email" : "[email protected]" }
    },
    "demo" : {
      1616771187409 : { "age" : "34" },
      1616771214484 : { "occupation" : "advocate" }
    }
  },
  "C00002" : {
    "ctInfo" : {
      1616771311930 : { "name" : "Kunal K Bajaj" },
      1616771354236 : { "tel" : "02211223344" }
    },
    "demo" : {
      1616771379529 : { "age" : "42" },
      1616771405214 : { "education" : "graduate" }
    }
  }
}
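The "latest value wins" reading of this logical view can be sketched in Python. The nested dict below is a trimmed copy of the structure above (the email cells are omitted here), and the lookup function is illustrative only:

```python
# Nested dict mirroring the logical view: row -> family -> timestamp -> {col: value}
table = {
    "C00001": {
        "ctInfo": {
            1616771022652: {"name": "Sudhir Mishra"},
            1616771062503: {"mobile": "9988771122"},
        },
        "demo": {
            1616771187409: {"age": "34"},
        },
    },
}

def latest(table, row, family, column):
    """Return the value with the highest timestamp for one column, or None."""
    cells = table.get(row, {}).get(family, {})
    best = None
    for ts in sorted(cells):   # ascending: later timestamps overwrite earlier
        if column in cells[ts]:
            best = cells[ts][column]
    return best
```

This is exactly what a plain scan or get shows: one value per column, the one with the newest timestamp.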

This is a good example of sparse table:

| Row Key | ctInfo:name | ctInfo:mobile | ctInfo:email | ctInfo:tel | demo:age | demo:occupation | demo:education |
|---------|-------------|---------------|--------------|------------|----------|-----------------|----------------|
| C00001 | 1616771022652: Sudhir Mishra | 1616771062503: 9988771122 | 1616771164898: [email protected] | | 1616771187409: 34 | 1616771214484: advocate | |
| C00002 | 1616771311930: Kunal K Bajaj | | | 1616771354236: 02211223344 | 1616771379529: 42 | | 1616771405214: graduate |
  • Now let us add a new email id for "C00001" and do a scan — we see that we get only the latest email

    hbase(main):034:0> put 'sales:customer', \
    hbase(main):035:0*   'C00001', \
    hbase(main):036:0*   'ctInfo:email', '[email protected]'
    0 row(s) in 0.3130 seconds
    
    hbase(main):037:0> scan 'sales:customer'
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
    C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
    C00002                                  column=demo:age, timestamp=1616771379529, value=42
    C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
    2 row(s) in 0.0550 seconds
    
  • To get all the versions of the email column values we need to specify VERSIONS for scan:

    hbase(main):039:0> scan 'sales:customer', {VERSIONS => 4}
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:email, timestamp=1616771164898, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
    C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
    C00002                                  column=demo:age, timestamp=1616771379529, value=42
    C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
    2 row(s) in 0.0570 seconds
    • Now we see both email ids
  • get — fetch a specific row by key

    hbase(main):041:0> get 'sales:customer', 'C00001'
    COLUMN                                   CELL
    ctInfo:email                            timestamp=1616771484147, [email protected]
    ctInfo:mobile                           timestamp=1616771062503, value=9988771122
    ctInfo:name                             timestamp=1616771022652, value=Sudhir Mishra
    demo:age                                timestamp=1616771187409, value=34
    demo:occupation                         timestamp=1616771214484, value=advocate
    5 row(s) in 0.0220 seconds
    • Again, to get all versions we need to specify VERSIONS
    hbase(main):043:0> get 'sales:customer', 'C00001', {COLUMN => 'ctInfo', VERSIONS => 2}
    COLUMN                                   CELL
    ctInfo:email                            timestamp=1616771484147, [email protected]
    ctInfo:email                            timestamp=1616771164898, [email protected]
    ctInfo:mobile                           timestamp=1616771062503, value=9988771122
    ctInfo:name                             timestamp=1616771022652, value=Sudhir Mishra
    4 row(s) in 0.0410 seconds
    • Now we see both email ids
  • delete and deleteall — delete a specific column value or a full row

    • To delete a specific version of a column value we need to give the version (timestamp):
    hbase(main):045:0> delete 'sales:customer', 'C00001', 'ctInfo:email', 1616771164898
    0 row(s) in 0.0250 seconds
    
    hbase(main):046:0> get 'sales:customer', 'C00001', {COLUMN => 'ctInfo', VERSIONS => 2}
    COLUMN                                   CELL
    ctInfo:email                            timestamp=1616771484147, [email protected]
    ctInfo:mobile                           timestamp=1616771062503, value=9988771122
    ctInfo:name                             timestamp=1616771022652, value=Sudhir Mishra
    3 row(s) in 0.0280 seconds
    
    hbase(main):047:0> scan 'sales:customer'
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
    C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
    C00002                                  column=demo:age, timestamp=1616771379529, value=42
    C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
    2 row(s) in 0.0530 seconds
    
    hbase(main):048:0> scan 'sales:customer', {VERSIONS => 4}
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    C00002                                  column=ctInfo:name, timestamp=1616771311930, value=Kunal K Bajaj
    C00002                                  column=ctInfo:tel, timestamp=1616771354236, value=02211223344
    C00002                                  column=demo:age, timestamp=1616771379529, value=42
    C00002                                  column=demo:education, timestamp=1616771405214, value=graduate
    2 row(s) in 0.1720 seconds
    • Delete a full row with deleteall
    hbase(main):050:0> deleteall 'sales:customer', 'C00002'
    0 row(s) in 0.0060 seconds
    
    hbase(main):051:0> scan 'sales:customer', {VERSIONS => 4}
    ROW                                      COLUMN+CELL
    C00001                                  column=ctInfo:email, timestamp=1616771484147, [email protected]
    C00001                                  column=ctInfo:mobile, timestamp=1616771062503, value=9988771122
    C00001                                  column=ctInfo:name, timestamp=1616771022652, value=Sudhir Mishra
    C00001                                  column=demo:age, timestamp=1616771187409, value=34
    C00001                                  column=demo:occupation, timestamp=1616771214484, value=advocate
    1 row(s) in 0.0300 seconds
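The difference between delete (one cell version) and deleteall (a whole row) can be sketched with the same kind of toy structure; the rows, columns, and timestamps below are illustrative only:

```python
# Toy store: row -> {(family:qualifier, timestamp): value}
store = {
    "C00001": {
        ("ctInfo:email", 100): "older-email",
        ("ctInfo:email", 200): "newer-email",
    },
    "C00002": {
        ("ctInfo:name", 300): "Kunal K Bajaj",
    },
}

def delete(store, row, column, timestamp):
    """Remove one specific version of one column, like the shell's delete."""
    store.get(row, {}).pop((column, timestamp), None)

def deleteall(store, row):
    """Remove an entire row, like the shell's deleteall."""
    store.pop(row, None)

delete(store, "C00001", "ctInfo:email", 100)   # older email version is gone
deleteall(store, "C00002")                     # the whole row is gone
```

Note that in real HBase deletes are written as tombstone markers and the data is physically removed later during compaction; the toy model skips that detail.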

5. Other Ways to load data to HBase

5.1. Using mapreduce.ImportTsv

  • We will create a sales table and populate it with sales order information

    • From Ambari start HBase service
[maria_dev@sandbox ~]$ su
Password:
[root@sandbox maria_dev]# su hbase
[hbase@sandbox maria_dev]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.2.2.5.0.0-1245, r53538b8ab6749cbb6fdc0fe448b89aa82495fb3f, Fri Aug 26 01:32:27 UTC 2016

hbase(main):001:0> create table "kaushik:sales", "cf_order"
NoMethodError: undefined method `table' for #<Object:0x3d3c886f>

# note: the shell verb is just 'create' (there is no 'create table'):
hbase(main):007:0> create "kaushik:sales", "cf_order"
0 row(s) in 3.3440 seconds

=> Hbase::Table - kaushik:sales

hbase(main):008:0> describe "kaushik:sales"
Table kaushik:sales is ENABLED
kaushik:sales
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf_order', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FORE
VER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.3510 seconds

hbase(main):009:0>

  1. We will use salesOrders.csv from the Internet, saving it locally as sales.csv
$ curl -o ./sales.csv https://raw.githubusercontent.com/bsullins/data/master/salesOrders.csv
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  453k  100  453k    0     0   226k      0  0:00:02  0:00:02 --:--:--  200k

$ ls sales.csv
sales.csv

$ head sales.csv
RowKey,Order ID,Order Date,Ship Date,Ship Mode,Profit,Quantity,Sales
1,CA-2011-100006,2013-09-07 00:00:00,2013-09-13 00:00:00,Standard Class,109.61,3.00,377.97
2,CA-2011-100090,2013-07-08 00:00:00,2013-07-12 00:00:00,Standard Class,-19.09,9.00,699.19
3,CA-2011-100293,2013-03-14 00:00:00,2013-03-18 00:00:00,Standard Class,31.87,6.00,91.06
4,CA-2011-100328,2013-01-28 00:00:00,2013-02-03 00:00:00,Standard Class,1.33,1.00,3.93
5,CA-2011-100363,2013-04-08 00:00:00,2013-04-15 00:00:00,Standard Class,7.72,5.00,21.38
6,CA-2011-100391,2013-05-25 00:00:00,2013-05-29 00:00:00,Standard Class,6.73,2.00,14.62
7,CA-2011-100678,2013-04-18 00:00:00,2013-04-22 00:00:00,Standard Class,61.79,11.00,697.07
8,CA-2011-100706,2013-12-16 00:00:00,2013-12-18 00:00:00,Second Class,17.72,8.00,129.44
9,CA-2011-100762,2013-11-24 00:00:00,2013-11-29 00:00:00,Standard Class,219.08,11.00,508.62
  2. We will remove the header (1st) row:
$ sed -i '1d' sales.csv

$ head sales.csv
1,CA-2011-100006,2013-09-07 00:00:00,2013-09-13 00:00:00,Standard Class,109.61,3.00,377.97
2,CA-2011-100090,2013-07-08 00:00:00,2013-07-12 00:00:00,Standard Class,-19.09,9.00,699.19
...
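The per-line mapping that the import will perform — first field as the row key, remaining fields into cf_order columns — can be sketched in Python. The qualifier names below are the ones this walkthrough uses for the cf_order family, and the sample line is from the file above:

```python
import csv
import io

# cf_order column qualifiers, in the order the fields appear in sales.csv
COLUMNS = ["orderId", "orderDate", "shipDate", "shipMode",
           "profit", "quantity", "sales"]

def to_hbase_rows(csv_text: str):
    """Map each CSV line to (row_key, {'cf_order:<qualifier>': value})."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        row_key, *values = fields
        rows.append((row_key,
                     {f"cf_order:{c}": v for c, v in zip(COLUMNS, values)}))
    return rows

sample = "1,CA-2011-100006,2013-09-07 00:00:00,2013-09-13 00:00:00,Standard Class,109.61,3.00,377.97"
```

Each CSV line becomes one HBase row keyed by its first field, with every other field stored as a separate cell in the cf_order family.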
  3. We will transfer the downloaded file to the VM
$ scp -P 2222 sales.csv maria_dev@localhost:/home/maria_dev/.
 
  • Now we will copy the file sales.csv to HDFS
$ ssh -p 2222 maria_dev@localhost
maria_dev@localhost's password:
Last login: Fri Mar 26 10:22:44 2021 from 10.0.2.2

[maria_dev@sandbox ~]$ head sales.csv
1,CA-2011-100006,2013-09-07 00:00:00,2013-09-13 00:00:00,Standard Class,109.61,3.00,377.97
2,CA-2011-100090,2013-07-08 00:00:00,2013-07-12 00:00:00,Standard Class,-19.09,9.00,699.19
...

[maria_dev@sandbox ~]$ hadoop fs -copyFromLocal sales.csv /tmp

[maria_dev@sandbox ~]$ hadoop fs -ls /tmp
Found 7 items
drwxrwxrwx   - maria_dev hdfs          0 2020-03-05 17:09 /tmp/.pigjobs
drwxrwxrwx   - maria_dev hdfs          0 2020-03-05 17:09 /tmp/.pigscripts
drwxrwxrwx   - maria_dev hdfs          0 2020-03-05 17:09 /tmp/.pigstore
drwxr-xr-x   - hdfs      hdfs          0 2016-10-25 07:48 /tmp/entity-file-history
drwx-wx-wx   - ambari-qa hdfs          0 2016-10-25 07:51 /tmp/hive
drwx------   - maria_dev hdfs          0 2020-03-05 17:14 /tmp/maria_dev
-rw-r--r--   1 maria_dev hdfs     459327 2021-03-31 16:02 /tmp/sales.csv
[maria_dev@sandbox ~]$
  • We will now load the kaushik:sales HBase table with data from sales.csv:
[maria_dev@sandbox ~]$ su

# -Dimporttsv.separator gives the field separator in the input file;
# -Dimporttsv.columns maps the input fields in order: the first is the
#   row key (HBASE_ROW_KEY), the rest are family:qualifier column names;
# the last two arguments are the target table name and the HDFS path of
#   the input csv file
[root@sandbox maria_dev]# hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns="HBASE_ROW_KEY,cf_order:orderId,cf_order:orderDate,cf_order:shipDate,cf_order:shipMode,cf_order:profit,cf_order:quantity,cf_order:sales" \
  kaushik:sales \
  hdfs://sandbox.hortonworks.com:/tmp/sales.csv
  • This will fire a map-reduce job:
2021-03-31 16:39:02,868 INFO  [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x62e7f11d connecting to ZooKeeper ensemble=sandbox.hortonworks.com:2181
2021-03-31 16:39:02,879 INFO  [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1245--1, built on 08/26/2016 00:47 GMT
2021-03-31 16:39:02,879 INFO  [main] zookeeper.ZooKeeper: Client environment:host.name=sandbox.hortonworks.com
2021-03-31 16:39:02,879 INFO  [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_111
...
  • and the table will be filled with rows:
hbase(main):002:0> scan "kaushik:sales"
...
 999                                     column=cf_order:shipDate, timestamp=1617208741407, value=2014-08-23 00:00:00
 999                                     column=cf_order:shipMode, timestamp=1617208741407, value=Same Day
5009 row(s) in 11.8840 seconds

5.2. Using HBase REST API

  • HBase exposes a REST API that can be accessed directly from the client system

    • We will use a Python client to create a sparse table using the movie rating data:
$ head u.data
0       50      5       881250949
0       172     5       881250949
0       133     1       881250949
196     242     3       881250949
186     302     3       891717742
22      377     1       878887116
244     51      2       880606923
166     346     1       886397596
298     474     4       884182806
115     265     2       881171488
- The first column is the user id, the second is the movie id, the third is the user's rating of the movie, and the last is the rating timestamp

- In our sparse table we will keep only the movie ratings by each user, so logically the table will be something like:
  +---------+       +--------------+     +---------------+     +---------------+
  | user_id | ----> | movie_id: 50 | --> | movie_id: 172 | --> | movie_id: 133 |
  +---------+       | rating: 5    |     | rating: 5     |     | rating: 1     |
                    +--------------+     +---------------+     +---------------+
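Building that sparse per-user structure from u.data lines can be sketched in plain Python before involving HBase at all; the sample lines are taken from the head output above:

```python
def build_ratings(lines):
    """Build user_id -> {movie_id: rating}; only rated movies appear per user."""
    table = {}
    for line in lines:
        user, movie, rating, _timestamp = line.split()
        table.setdefault(user, {})[movie] = int(rating)
    return table

# Whitespace-separated lines, as in u.data
sample = [
    "0\t50\t5\t881250949",
    "0\t172\t5\t881250949",
    "0\t133\t1\t881250949",
]
```

Each user row only carries columns for the movies that user actually rated, which is exactly the sparseness HBase column families allow.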
  • We will use the starbase Python package, which provides a nice programming interface over the HBase REST API. It can be installed with pip install starbase

  • To use the HBase REST interface we need to first start the HBase REST server

# start the HBase REST service on port 8000, with debugging
# information served on port 8001
[root@sandbox maria_dev]# /usr/hdp/current/hbase-master/bin/hbase-daemon.sh \
>   start rest -p 8000 --infoport 8001
starting rest, logging to /var/log/hbase/hbase-maria_dev-rest-sandbox.hortonworks.com.out
[root@sandbox maria_dev]#
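The REST interface exchanges cells as JSON with base64-encoded row keys, column names, and values. Below is a sketch of building such a put payload by hand; the table and row used in the comment are hypothetical, and in practice starbase performs this encoding for you:

```python
import base64
import json

def b64(s: str) -> str:
    """Base64-encode a string, returning str (the REST API expects base64)."""
    return base64.b64encode(s.encode()).decode()

def put_payload(row_key: str, cells: dict) -> str:
    """JSON body for a row PUT against the HBase REST server."""
    return json.dumps({
        "Row": [{
            "key": b64(row_key),
            "Cell": [{"column": b64(col), "$": b64(val)}
                     for col, val in cells.items()],
        }]
    })

# Hypothetical example: user "0" rated movie 50 with a 5; the body would be
# PUT to http://localhost:8000/<table>/0 with Content-Type: application/json
body = put_payload("0", {"rating:50": "5"})
```

The "$" field is the REST schema's name for a cell's value; everything binary-ish travels base64-encoded because HBase stores plain bytes.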

  • Also we need to enable the corresponding port forwarding for the VM:

    1. Run VM
    2. Running VM -> Right Click -> Settings -> Network -> Advanced -> Port Forwarding -> add 2 New port forwarding rule by clicking + icon:
Name            Protocol Host IP   Host Port Guest IP Guest Port
HBase REST      TCP      127.0.0.1 8000               8000
HBase REST Info TCP      127.0.0.1 8001               8001