
WebHDFS - Knox

This page describes how to enable and verify WebHDFS service availability through Knox. In a production environment it is not recommended to expose the WebHDFS service directly to the outside world because of the potential security risk; Knox serves as a proxy between external consumers and the internal HDP cluster.

Test WebHDFS directly

The first thing to do is to verify that WebHDFS is active and responding when accessed directly.
Identify the hostname of the NameNode and make sure that WebHDFS is enabled; assume that the hostname is a1.fyre.ibm.com. This test should usually be conducted from one of the hosts participating in the cluster or from an edge node.
Make sure that a valid Kerberos ticket is available, or obtain one with the kinit command. The response body of the LISTSTATUS request below should contain the content of the HDFS root directory in JSON format.
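If no ticket is cached yet, obtain one first (user1 is a placeholder principal; klist only confirms that a ticket is present):

kinit user1
klist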

curl -i -k --negotiate -u : -X GET http://<NNhostname>:50070/webhdfs/v1/?op=LISTSTATUS

...............
{"FileStatuses":{"FileStatus":[
{"accessTime":0,"blockSize":0,"childrenNum":1,"fileId":16887,"group":"hdfs","length":0,"modificationTime":1549312341745,"owner":"hdfs","pathSuffix":"amshbase","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":2,"fileId":16392,"group":"hadoop","length":0,"modificationTime":1549369210165,"owner":"yarn","pathSuffix":"app-logs","permission":"777","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":4,"fileId":16569,"group":"hdfs","length":0,"modificationTime":1549825627288,"owner":"hdfs","pathSuffix":"apps","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":2,"fileId":16389,"group":"hadoop","length":0,"modificationTime":1549221557966,"owner":"yarn","pathSuffix":"ats","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":2,"fileId":30044,"group":"hdfs","length":0,"modificationTime":1549368398674,"owner":"hdfs","pathSuffix":"biginsights","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
{"accessTime":0,"blockSize":0,"childrenNum":1,"fileId":16399,"group":"hdfs","length":0,"modificationTime":1549221563793,"owner":"hdfs","pathSuffix":"hdp","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
..............

(Cloudera CDP) On Cloudera CDP, the WebHDFS port is the one defined by the dfs.namenode.http-address property.

vi /etc/hadoop/conf/hdfs-site.xml

 <property>
    <name>dfs.namenode.http-address</name>
    <value>inimical1.fyre.ibm.com:9870</value>
  </property>
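The same value can also be read without opening the file, using the standard hdfs getconf utility:

hdfs getconf -confKey dfs.namenode.http-address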

The URL for direct WebHDFS access then becomes:

curl -i -k --negotiate -u : -X GET http://inimical1:9870/webhdfs/v1/?op=LISTSTATUS

Modify HDFS property

Verify the value of the hadoop.proxyuser.knox.groups property in the HDFS panel: HDFS -> Advanced -> Custom core-site. The default value is users. Change it to * (all groups) or provide a comma-separated list of groups allowed to access HDFS through the Knox service. If this is not done, or the user requesting access does not belong to one of the listed groups, an error like the following is thrown:

{"RemoteException":{"exception":"SecurityException","javaClassName":"java.lang.SecurityException","message":"Failed to obtain user group information: org.apache.hadoop.security.authorize.AuthorizationException: User: knox is not allowed to impersonate user1"}}

Restart all impacted services.

Configure Knox-WebHDFS

If the Ranger plugin for Knox is not enabled, the Knox-WebHDFS service works out of the box. The only available configuration is to restrict access to the service; by default, all users and hosts can access Knox-WebHDFS.
If the Ranger plugin for Knox is enabled, the Knox services are blocked. An appropriate Ranger policy for Knox should be created beforehand.
More details: https://github.com/stanislawbartkowski/hdpactivedirectory/wiki/Knox#secure-access-to-knox-services

NameNode HA

More details: https://community.hortonworks.com/articles/86188/how-to-configure-a-knox-topology-for-namenode-ha.html
Add an additional provider for WebHDFS HA to the Knox topology:

<provider>
   <role>ha</role>
   <name>HaProvider</name>
   <enabled>true</enabled>
   <param>
      <name>WEBHDFS</name>
      <value>maxFailoverAttempts=3;failoverSleep=1000;maxRetryAttempts=300;retrySleep=1000;enabled=true</value>
   </param>
</provider>
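The WEBHDFS service entry in the same topology should then list the URLs of both NameNodes. A sketch, assuming placeholder hostnames nn1 and nn2 (use port 9870 on CDP):

<service>
   <role>WEBHDFS</role>
   <url>http://nn1.fyre.ibm.com:50070/webhdfs</url>
   <url>http://nn2.fyre.ibm.com:50070/webhdfs</url>
</service>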

Knox-WebHDFS test

Basic test

If Knox is not enabled for SPNEGO, only basic user/password authentication is possible. Run the command:

curl -i -k -vvv -u user1 -X GET "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS"

............
{"FileStatuses":{"FileStatus":[{"accessTime":0,"blockSize":0,"childrenNum":1,"fileId":16887,"group":"hdfs","length":0,"modificationTime":1549312341745,"owner":"hdfs","pathSuffix":"amshbase","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},{"accessTime":0,"blockSize":0,"childrenNum":2,"fileId":16392,"group":"hadoop","length":0,"modificationTime":1549369210165,"owner":"yarn","pathSuffix":"app-logs","permission":"777","replication":0,"storagePolicy":0,"type":"DIRECTORY"},{"accessTime":0,"blockSize":0,"childrenNum":4,"fileId":16569,"group":"hdfs","length":0,"modificationTime":1549825627288,"owner":"hdfs","pathSuffix":"apps","permission":"755","replication":0,"storagePolicy":0,"type":"DIRECTORY"},
...........
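The JSON is returned as a single long line; for readability it can be piped through a formatter (drop curl's -i flag so that only the body is parsed):

curl -k -u user1 -X GET "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS" | python -m json.tool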

Assuming KnoxSSO (SPNEGO) is enabled, run the equivalent command; the response body should be the same.

kinit user1
curl -ik --negotiate -u : -X GET "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS"

Authorization test

Test environment

Prepare a simple test environment https://github.com/stanislawbartkowski/hdpactivedirectory#ad-users-and-groups-used-for-testing
Assume we have the HDFS directory /apps/datalake.

hdfs dfs -ls /apps
Found 4 items
drwxr-x---   - user2    datascience          0 2019-02-10 20:19 /apps/datalake
...
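If the directory does not exist yet, it can be created as the hdfs superuser; a minimal sketch (adjust the owner, group and permissions to your environment):

hdfs dfs -mkdir -p /apps/datalake
hdfs dfs -chown user2:datascience /apps/datalake
hdfs dfs -chmod 750 /apps/datalake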

The recommended option is also to make hdfs the sole owner of the directory, assign 700 permissions, and grant privileges through the Ranger authorization tool.

Expected result:

  • user2 is the data owner and can upload and delete data in the /apps/datalake directory
  • user3 belongs to the datascience group and can read data, but is not allowed to make any modifications in /apps/datalake
  • user1 does not belong to the datascience group and is forbidden from accessing the directory

user2, upload data to /apps/datalake

echo "Hello, confidential data in datalake" >data.txt
curl -ik -u user2 -X PUT "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=CREATE&overwrite=true"

Enter host password for user 'user2':
HTTP/1.1 307 Temporary Redirect
Date: Mon, 11 Feb 2019 12:44:32 GMT
...............
Location: https://a1.fyre.ibm.com:8443/gateway/default/webhdfs/data/v1/webhdfs/v1/apps/datalake/data.txt
Content-Type: application/octet-stream
..........
..........
Server: Jetty(6.1.26.hwx)
Content-Length: 0

curl -i -k -u user2 -T data.txt -X PUT {copy and paste output after 'Location:' from above command}

Enter host password for user 'user2':
HTTP/1.1 100 Continue

HTTP/1.1 201 Created
Date: Mon, 11 Feb 2019 12:47:38 GMT
Set-Cookie: JSESSIONID=13nsvnduhmht0s6ahjlhj1e6e;Path=/gateway/default;Secure;HttpOnly
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: rememberMe=deleteMe; Path=/gateway/default; Max-Age=0; Expires=Sun, 10-Feb-2019 12:47:38 GMT
Location: https://a1.fyre.ibm.com:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt
Connection: close
Server: Jetty(9.2.15.v20160210)

Assuming Knox SPNEGO is enabled:

kinit user2
curl -ik --negotiate -u : -X PUT "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=CREATE&overwrite=true"
curl -i -k --negotiate -u : -T data.txt -X PUT {copy and paste URL}
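To confirm that the upload succeeded, the file status can be queried through Knox (GETFILESTATUS is a standard WebHDFS operation):

curl -ik -u user2 -X GET "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=GETFILESTATUS"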

user3, can read data but cannot delete or upload data

curl -ik -u user3 -X GET "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=OPEN"

Enter host password for user 'user3':
HTTP/1.1 307 Temporary Redirect
..........
Location: https://a1.fyre.ibm.com:8443/gateway/default/webhdfs/data/v1/webhdfs/v1/apps/datalake/data.txt?_=AAAACAAAABAAAAEAlM-dTOK8s_PjSijJik_GT0KCyDt5aJCC744pFEZ0ruFWcda0BAsIpDZhAoxyyZa2btb9DX3c1TNbmlyMLHmWxIddmubgyCh2MjjxZ2gWvRcPSvXzRsEJtXiAb
..............

curl -ik -u user3 -X GET { copy and paste URL content after Location field }

Enter host password for user 'user3':
HTTP/1.1 200 OK
...........
Server: Jetty(9.2.15.v20160210)

Hello, confidential data in datalake

Try to delete data

curl -ik -u user3 -X DELETE "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=DELETE"

Enter host password for user 'user3':
HTTP/1.1 403 Forbidden
..............
{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException","message":"Permission denied: user=user3, access=WRITE, inode=\"/apps/datalake/data.txt\":user2:datascience:drwxr-x---"}}

Assuming Knox SPNEGO is enabled

kinit user3
curl -ik --negotiate -u : -X GET "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=OPEN"
curl -ik --negotiate -u : -X GET {copy and paste URL}
curl -ik --negotiate -u : -X DELETE "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=DELETE"

user1, access to /apps/datalake forbidden

curl -ik -u user1 -X GET "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=OPEN"

Enter host password for user 'user1':
HTTP/1.1 403 Forbidden
..............
{"RemoteException":{"exception":"AccessControlException","javaClassName":"org.apache.hadoop.security.AccessControlException","message":"Permission denied: user=user1, access=EXECUTE, inode=\"/apps/datalake/data.txt\":user2:datascience:drwxr-x---"}}

Assuming Knox SPNEGO is enabled

kinit user1
curl -ik --negotiate -u : -X GET "https://<KNOX-hostname>:8443/gateway/default/webhdfs/v1/apps/datalake/data.txt?op=OPEN"
