Hadoop setup - clhedrick/kerberos GitHub Wiki
Known web interfaces into hadoop
- ambari
- hbase overview
- hdfs; browser requires Kerberos authentication. 50470 is https
- map reduce job history
- yarn job scheduler
- yarn timeline services, requires Kerberos authentication
- yarn (or mapreduce2?) job history
- livy, Kerberos browser
- zeppelin notebook. login menu, ldaps
- hive dashboard
- namenode information UI
- namenode JMX
- namenode logs
- namenode thread dump
- yarn resourcemanager JMX
- yarn resourcemanager logs
- yarn thread dump
- mapreduce2 jobhistory logs
- mapreduce2 jobhistory JMX
- mapreduce2 thread dump
- Do "virsh list" on ilab1, 2, 3, to see what's already running. Make sure you don't start the same VM on two different hosts.
- Use "virsh start NAME" to start the vm's that aren't running.
- ilab1: data1, dataservices2
- ilab2: data2, jupyter
- ilab3: data3, dataservices1, dataservices3, dataservices4
- browser to https://data-services1.cs.rutgers.edu
- login as admin, with ambari UI password stored in 1password
- In the left margin you'll see a list of services. At the bottom there is an action menu. Do "start all"
- Similarly you can stop with "stop all."
- Starting takes a very long time, like 900 sec.
- if you need to stop the VMs, use "virsh shutdown NAME"
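The VM placement and start commands above can be sketched as a script that just prints the commands for review (it assumes the placement listed above; it deliberately does not execute anything, so you can check "virsh list" on each host first and avoid starting the same VM twice):

```shell
#!/bin/sh
# Print the ssh/virsh commands implied by the VM placement above,
# rather than running them. Review against "virsh list" output first.
emit_starts() {
  host=$1; shift
  for vm in "$@"; do
    echo "ssh $host virsh start $vm"
  done
}
emit_starts ilab1 data1 dataservices2
emit_starts ilab2 data2 jupyter
emit_starts ilab3 data3 dataservices1 dataservices3 dataservices4
```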
Restore netapp snapshots for all 6 images. E.g.
- on ilab1, cd /var/lib/hadoop-shared
- mv images images.hold
- mkdir images
- cp -a .snapshot/hourly.2018-06-06_0405/images/* images/ (obviously your date will be different)
- start all 6 images
- ssh to data-services2
- cd /hadoop/
- rm -rf hadoop/hdfs
- cp -a hdfs hadoop/
This is a log of what I did. However, if you're going to install a new version, you should use the Ambari documentation at hortonworks.com; it has step-by-step instructions. What I did is based on them. There are a few matters of interpretation, so this will show you how I interpreted them.
In every vm that is going to be used (items marked with + are done by ansible hadoop.yml):
- NOTE: For HDP 3 Kerberos to work, the ambari node must be part of the cluster. It needs all the users. Of course no services are installed on it other than the mandatory.
- make sure java is 1.8
- make sure python is 2.7.x
- +in bash, ulimit -Sn and ulimit -Hn should both be at least 10,000. I created /etc/security/limits.d/30-nofile.conf, which sets nofile to 10000 for everyone. It's not clear to me whether this is needed: ambari creates files for each of its users with their own specification. It may be that where the document talks about this it only means to check whether limits *can* be set to 10,000, not that we should actually do it, since ambari will do it itself. But for the moment I've set it to 10,000 for everyone.
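A quick sanity check of the limits described above (a sketch; it treats "unlimited" as passing the floor):

```shell
#!/bin/bash
# Check that the soft and hard open-file limits meet the 10,000 floor.
soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
ok=yes
for v in "$soft" "$hard"; do
  # "unlimited" trivially satisfies the floor
  if [ "$v" != unlimited ] && [ "$v" -lt 10000 ]; then ok=no; fi
done
echo "nofile soft=$soft hard=$hard ok=$ok"
```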
- +allow root ssh from the main ambari server to each of the vms
- make sure nokrb5conf is set in ansible hosts file. kerberizing will replace krb5.conf, and we need to use theirs. actually we use a mix, starting with theirs but adding the services
- +install the JCE, get http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html
- +install the JCE: unzip -o -j -q jce_policy-8.zip -d /usr/jdk64/jdk1.8.0_60/jre/lib/security/ using the appropriate location. Finding the location can be a challenge because of the levels of indirection. In my case it's /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64/jre/lib/security/
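Because the java on PATH is usually several symlinks away from the real JDK, a helper like this can locate the right lib/security directory (the function name is ours, not from any tool; readlink -f follows the whole symlink chain):

```shell
#!/bin/sh
# resolve_security_dir: given a java binary path (possibly a symlink
# chain), print the lib/security directory next to the real binary.
resolve_security_dir() {
  real=$(readlink -f "$1")
  echo "$(dirname "$real")/../lib/security"
}
# usage sketch:
#   unzip -o -j -q jce_policy-8.zip -d "$(resolve_security_dir "$(command -v java)")"
```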
- +Create the users and the hadoop group. Some of the users have gotten into LDAP. The install scripts try to do changes to /etc/passwd, so this confuses them.
- Make sure the places exist for the data. Normally stuff goes on /hadoop. It should be mounted in a unique place
- In the following data-services1 and data-services3 will change for different installations. Make sure to use the right names.
- install databases on the ambari node (data-services1) and the utility node (data-services3): the ambari database, hive database, and oozie database
- We will use the default mysql port. Users and database names are ambari-db, hive-db, oozie-db; the password will be the cluster service password. However the actual database names drop the -, as - isn't legal in them.
- db root password is the same.
- Ambari currently supports MySQL 5.6 and MariaDB 10 (10.2 with HDP 3). Yum has mariadb 5.5, so I downloaded the latest 10.0, the version for use with systemd.
- For HDP 3 I installed the mariadb repo per instructions at mariadb.org and installed from there. There's already a service available called mariadb. To get yum install to work, I had to erase mariadb-bench-5.5.60-1.el7_5.x86_64. In that case, skip the following.
- untar the distribution in /usr/local/, symlink the versioned directory to mysql (This is the default location.)
- mysql user and group already exist
- following instructions in INSTALL-BINARY
- start with bin/mysqld_safe -u mysql
- ln -s /var/lib/mysql/mysql.sock /tmp
- bin/mysql_secure_installation --basedir=/usr/local/mysql, default answers, set password using hadoop service password in 1password
- mysql -u root should now require a password. Then "use mysql; select * from user;" and verify that there are only entries for root, and that they have passwords
- create /etc/systemd/system/mariadb.service.d/start.conf
```
[Service]
ExecStart=
ExecStart=/usr/local/mysql/bin/mysqld_safe -u mysql
ExecStartPre=
```
This overrides parameters in the system mariadb.service, which we assume exists.
- kill the mysql process
- verify that systemctl start mariadb will start it properly
- when you go into production, don't forget "systemctl enable mariadb" on data-services1 and data-services3 so mariadb starts automatically
- verify that mysql-connector-java.noarch is installed. Should be in /usr/share/java
- setup ambari database on data-services1
mysql -u root -p
```
CREATE USER 'ambari-db'@'%' IDENTIFIED BY 'xx';
GRANT ALL PRIVILEGES ON *.* TO 'ambari-db'@'%';
CREATE USER 'ambari-db'@'localhost' IDENTIFIED BY 'xx';
GRANT ALL PRIVILEGES ON *.* TO 'ambari-db'@'localhost';
CREATE USER 'ambari-db'@'data-services5.cs.rutgers.edu' IDENTIFIED BY 'xx';
GRANT ALL PRIVILEGES ON *.* TO 'ambari-db'@'data-services5.cs.rutgers.edu';
FLUSH PRIVILEGES;
```
where xx is replaced with the real password
- on data-services3
HDP 2:
```
CREATE USER 'hive-db'@'localhost' IDENTIFIED BY 'xx';
GRANT ALL PRIVILEGES ON *.* TO 'hive-db'@'localhost';
CREATE USER 'hive-db'@'%' IDENTIFIED BY 'xx';
GRANT ALL PRIVILEGES ON *.* TO 'hive-db'@'%';
CREATE USER 'hive-db'@'data-services3.cs.rutgers.edu' IDENTIFIED BY 'xx';
GRANT ALL PRIVILEGES ON *.* TO 'hive-db'@'data-services3.cs.rutgers.edu';
CREATE DATABASE hivedb;
CREATE USER 'oozie-db'@'%' IDENTIFIED BY 'xx';
GRANT ALL PRIVILEGES ON *.* TO 'oozie-db'@'%';
CREATE DATABASE ooziedb;
FLUSH PRIVILEGES;
```
HDP 3:
```
-- by default we get root at localhost but not the actual hostname
grant all privileges on *.* to 'root'@'data-services7.cs.rutgers.edu' identified by 'xx';
create database hive;
grant all privileges on hive.* to 'hive'@'localhost' identified by 'xx';
grant all privileges on hive.* to 'hive'@'%.cs.rutgers.edu' identified by 'xx';
create database ranger;
grant all privileges on ranger.* to 'ranger'@'localhost' identified by 'xx';
grant all privileges on ranger.* to 'ranger'@'%.cs.rutgers.edu' identified by 'xx';
create database rangerkms;
grant all privileges on rangerkms.* to rangerkms@'localhost' identified by 'xx';
grant all privileges on rangerkms.* to rangerkms@'%.cs.rutgers.edu' identified by 'xx';
create database oozie;
grant all privileges on oozie.* to 'oozie'@'localhost' identified by 'xx';
grant all privileges on oozie.* to 'oozie'@'%.cs.rutgers.edu' identified by 'xx';
create database superset DEFAULT CHARACTER SET utf8;
grant all privileges on superset.* to 'superset'@'localhost' identified by 'xx';
grant all privileges on superset.* to 'superset'@'%.cs.rutgers.edu' identified by 'xx';
create database druid DEFAULT CHARACTER SET utf8;
grant all privileges on druid.* to 'druid'@'localhost' identified by 'xx';
grant all privileges on druid.* to 'druid'@'%.cs.rutgers.edu' identified by 'xx';
exit;
```
- on data-services1: wget -nv http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.6.0.0/ambari.repo -O /etc/yum.repos.d/ambari.repo
- for hdp3: http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.3.0/ambari.repo
- yum repolist, verify that ambari is there
- yum install ambari-server
- -- saved data-services1 and 3
- mysql -u ambari-db -p
```
create database ambaridb;
use ambaridb;
source /var/lib/ambari-server/resources/Ambari-DDL-MySQL-CREATE.sql;
```
- ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
- the command above doesn't do a real ambari setup. It just distributes the connector
- ambari-server setup, take defaults until advanced database setup; say y
- mysql
- localhost
- 3306
- ambaridb
- ambari-db
- services password
- ambari-server start
- go to data-services1:8080, login as admin/admin
- configure
- [not] We need one non-default configuration. There's a problem between yarn and Java 8 that causes yarn to kill jobs because it thinks they are out of memory. To fix it, go to yarn configuration and in yarn-site custom configuration, add yarn.nodemanager.pmem-check-enabled=false
- get ssl cert. need cert and key, and the key may temporarily need to be readable to import it
- ambari-server setup-security
- give it cert and key. I used port 443, since this is the main service on that server
- note that you have to fix the base for tez to be https: port 443
- ambari-server restart
- use openssl s_client -connect krb1.cs.rutgers.edu:636 and also for krb2 and capture the certs in krb1.crt and krb2.crt
- /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file /etc/ssl/cert.crt -alias ambari-server -keystore ambari-server-truststore
- /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file /etc/ssl/krb1.crt -alias krb1 -keystore ambari-server-truststore
- /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file /etc/ssl/krb2.crt -alias krb2 -keystore ambari-server-truststore
- /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file /etc/ssl/krb4.crt -alias krb4 -keystore ambari-server-truststore
- put the trust store in /etc/ssl
- ambari-server stop
- ambari-server setup-security; use option 4, setup truststore to point ambari-server-truststore just created
- ambari-server start
- I had a disaster doing this the first time. I've tried things to make it work that may not actually be needed.
- setenv RANDFILE .rnd [if]
- openssl pkcs12 -export -in ilab_cs_rutgers_edu.crt -inkey ilab_cs_rutgers_edu.key -name zeppelin -out ilab.pk12 [password]
- keytool -importkeystore -deststorepass hadoop -destkeystore zeppelin-keystore.jks -srckeystore ilab.pk12 -srcstoretype PKCS12
- get the first-level CA cert in a pem file. One way to do that is to go to a system with the same cert using firefox, look at the certs, and export the first one up from the host.
- keytool -import -file ca.crt -keystore zeppelin-truststore.jks
- put the files in /etc/ssl/zeppelin on [data-services1] and data-services2. The keystore should be 440, group hadoop, but for the moment they are public. Probably the copy on data-services1 isn't needed.
- I tried to do this on port 443, but the system won't start when I do. I have no idea why not. So we have SSL on port 9995
```
zeppelin.ssl = true
zeppelin.ssl.client.auth = false
zeppelin.ssl.key.manager.password = hadoop
zeppelin.ssl.keystore.password = hadoop
zeppelin.ssl.keystore.path = /etc/ssl/zeppelin/zeppelin-keystore.jks
zeppelin.ssl.keystore.type = JKS
zeppelin.ssl.truststore.password = hadoop
zeppelin.ssl.truststore.path = /etc/ssl/zeppelin/zeppelin-truststore.jks
zeppelin.ssl.truststore.type = JKS
```
To access LDAP for passwords. Note that you still need to create the users, though there's a sync feature
- ambari-server setup-ldap
- obvious answers, but url is krb1.cs.rutgers.edu:636, not really a url [HDP]. Used krb4 as secondary.
- HDP 3: for ldap specify member attribute memberUid, base cn=compat,dc=cs,dc=rutgers,dc=edu. The default base doesn't give a real member list when there are nested groups.
- ambari-server restart
- [not] edit /usr/lib/python2.6/site-packages/ambari_server/serverUtils.py. replace SERVER_API_HOST with data-services1.cs.rutgers.edu. otherwise it uses 127.0.0.1, which will fail for ssl
- [not] fix /usr/lib/ambari-server/ambari-server-2.6.0.0.267.jar. This is complex. I'll give instructions below.
- put users you want to sync in a file users.txt. ended up not doing this. Just used a group
- ambari-server sync-ldap --groups=groups.txt
- to sync nightly, put group name (e.g. ambari-users) in /etc/ambari.group, and in /etc/cron.d/sync-ldap, add
3 3 * * * root ambari-server sync-ldap --groups /etc/ambari.group --ldap-sync-admin-name=admin --ldap-sync-admin-password=XXXX
[so] hbase zookeeper sessions timed out. Did three things:
- in hbase, increased heap to 3G from 1
- in hbase, increased zookeeper session timeout to 3 min
- in zookeeper, increased the length of a single tick to 9000 ms. tick * 20 gives the timeout, so this is 3 min. Both timeouts have to be adjusted.
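The timeout arithmetic above, spelled out (ZooKeeper's maxSessionTimeout defaults to tickTime * 20):

```shell
#!/bin/sh
# tickTime 9000 ms; ZooKeeper caps session timeouts at tickTime * 20.
tick_ms=9000
max_ms=$((tick_ms * 20))
echo "max session timeout: $max_ms ms = $((max_ms / 60000)) min"
# prints: max session timeout: 180000 ms = 3 min
```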
- hdfs dfs -chmod 777 /warehouse/tablespace/managed/hive
- yarn.scheduler.capacity.root.default.acl_submit_applications=*
- yarn.timeline-service.version set to 1.5f
- yarn.timeline-service.versions set to 1.5f
- In YARN, Node memory 150 GB, max memory of job 20 GB
- In Mapreduce, Map and Appmaster 4 GB, Reduce 8 GB. This is the same as Spark settings in Zeppelin
- In hive, reduce heap size to 20g
- In tez, reduce am memory to 20g
[not] The following section is for fixing sync of users. But if you use a group, it doesn't matter.
There's a bug in the code for retrieving data from ldap. See AMBARI-24029 To fix it,
- retrieve source. This is ambari version 2.6.0.0, so got a tar file and untared it
- make sure mvn and npm are installed
- make sure your ~/.m2/settings.xml includes
```
<settings>
  <mirrors>
    <mirror>
      <id>public</id>
      <mirrorOf>*</mirrorOf>
      <url>http://nexus-private.hortonworks.com/nexus/content/groups/public</url>
    </mirror>
  </mirrors>
</settings>
```
- edit ./ambari-server/src/main/java/org/apache/ambari/server/security/ldap/AmbariLdapDataPopulator.java
- in line 675 replace the while with the following. This adds one null test:
} while (configuration.getLdapServerProperties().isPaginationEnabled() && processor.getCookie() != null && processor.getCookie().getCookie() != null);
- in the main directory, do "mvn package -DskipTests=true". You only need to go until ambari-server is built.
- copy ./ambari-server/target/classes/org/apache/ambari/server/security/ldap/AmbariLdapDataPopulator* to org/apache/ambari/server/security/ldap/ in your work directory. There should be 4 .class files
- copy /usr/lib/ambari-server/ambari-server-2.6.0.0.267.jar to your working directory.
- save a copy of the original
- jar uf ambari-server-2.6.0.0.267.jar org/apache/ambari/server/security/ldap/*
- use jar tf to make sure you replaced 4 class files
- put the new version of the jar file in /usr/lib/ambari-server
Initially I started with a reduced set of services. I added some and ended up with all except accumulo, atlas, knox, logsearch, ranger, druid
failed: accumulo, druid; both known bugs. I believe the others are unnecessary.
oozie apparently installed and started, but the test failed. I believe that's because "su hdfs" isn't enough in a kerberized environment, but the test used it.
Sqoop and mahout apparently installed, but the test failed. Don't know why, but it hung and I'm pretty sure nothing was actually present on the system. I suggest deploying sqoop separately because of this. I did sqoop -version and mahout -version on all hosts to make sure they were present.
Falcon requires a hack. After installing it, you'll get an alert because the web UI doesn't respond. There's a jar that Hortonworks can't ship. It has to be installed manually.
```
wget http://search.maven.org/remotecontent?filepath=com/sleepycat/je/5.0.73/je-5.0.73.jar -O /usr/hdp/current/falcon-server/server/webapp/falcon/WEB-INF/lib/je-5.0.73.jar
chown falcon:hadoop /usr/hdp/current/falcon-server/server/webapp/falcon/WEB-INF/lib/je-5.0.73.jar
```
I had trouble starting falcon. It complained about locking. I relocated /hadoop/falcon to /var/falcon, leaving a symlink; I was suspicious that NFS might not be handling locking correctly. That seemed to fix it.
- ambari will replace krb5.conf. Go to the kerberos section, configs, advanced krb5.conf. You'll find a template for the generated krb5.conf. Change the renew time to 365d and add the appdefaults section for kgetcred, pam_kmkhomedir, and register-cc. You need to restart the services to get it to regenerate krb5.conf.
```
[libdefaults]
noaddresses = true
```
I think this can be done within ambari by specifying a setting, but this seemed safest.
- make sure krb5.conf has default_ccache_name = /tmp/krb5cc_%{uid}. There's an ansible setting for this
- install the JCE, unzip -o -j -q jce_policy-8.zip -d /usr/jdk64/jdk1.8.0_60/jre/lib/security/. (version number will be different) This was done above, but to the default java. ambari installs its own. It also installs a jce, so it's not clear to me whether this is needed
- services must be running for this
- On the kerberos server do "ipa config-mod --defaultgroup=allusers". Without this, attempts to create users will fail. Our default group doesn't have a GID, because there are performance issues if all users are in a Posix group, so our default is normally a non-posix group; that means the GID has to be specified manually when creating users. (There ought to be a better way.)
- [not] in ambari go to host/#/experimental, select enableipa, fairly near the bottom, save
- in ambari, under admin, kerberos, enable kerberos
- choose ipa
- enter various data; I used the actual kerberos admin with the admin password, not saving it. This may not be needed, but I had odd errors before, so this seemed safest.
- the rest is automatic
- when it's done, on the kerberos server do "ipa config-mod --defaultgroup=ipausers" to put the group back.
To see what was created in ipa, "ipa user-find ilab" should find the users. To find services you'll need to look for services on all the hosts in the cluster. Or go to /etc/security/keytabs on each host and do klist -k -t on all the keytabs. That's the safest way to find all the principals.
added oozie as a user to ipa, to avoid complaints from the netapp
Once the system is kerberized, the hdfs user is hard to get to. To allow privileged operations, in IPA create a group hdfs and add administrators to it.
By default data is kept in /hadoop. All HDFS data is stored in 6 copies: 3 nodes, as documented in all descriptions of HDFS, but what isn't documented is that ambari sets up each node to keep copies of the data in both /hadoop/hdfs and /hadoop/hadoop/hdfs. (This is not true for HDP 3.) It makes no sense to have 6 copies on the same NFS file system, or even the same NFS file server.
First, change the replication factor from 3 to 1. This is done in ambari, hdfs, config. In the search box type replication.
[not] Next, put /hadoop on local storage on all servers, except mount /hadoop/hdfs on NFS. For the name node this will replicate the files on local and NFS, since by default they set up redundant directories; I think that's fine. However on data nodes the two directories are combined, i.e. it's like striping. So for the data nodes both /hadoop/hdfs and /hadoop/hadoop/hdfs should be on NFS. If you change the mount points you need to update /var/lib/ambari-agent/data/datanode/dfs_data_dir_mount.hist.
Moved /hadoop/yarn and /hadoop/hadoop/yarn to NFS also, mounting in the expected place.
[HDP3] Initially I set up the system with /hadoop mounted on edinburgh-10g, a separate directory for each system. However after everything was set up, I made /hadoop local, copying files in most directories from the NFS directory to local. I mounted just the following from NFS:
- hdfs
- storm, data-services only
- yarn, data nodes only
Spark defaults temporary files to /tmp. To create a space for them, create /hadoop/tmp on all systems, mounted in the same place as /hadoop/yarn, etc.
Then in spark and spark2 config, Custom spark-defaults,
```
spark.local.dir=/hadoop/tmp
spark.executor.extraJavaOptions=-Djava.io.tmpdir=/hadoop/tmp
spark.driver.extraJavaOptions=-Djava.io.tmpdir=/hadoop/tmp
```
Add /usr/hdp/current/zookeeper-server/bin/zkCleanup.sh -n 4 to crontab on both services and data nodes.
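For example, as an /etc/cron.d entry (the schedule and user field here are an illustration, not necessarily what we used):

```
# /etc/cron.d/zkcleanup (example)
15 4 * * * zookeeper /usr/hdp/current/zookeeper-server/bin/zkCleanup.sh -n 4
```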
Another exception: on data-services3, /hadoop/var is on NFS. It's for statistics. It's already 1.3 GB; I'm worried it will grow too much.
Hbase ACLs don't do quite what we want. You can't create a table without create access, but global create access is too dangerous. (Without Kerberos, things are even worse: anyone can do anything to any table.)
So to give a user access, create a namespace for them and give them full control of that namespace.
On one of the data nodes:
as root:
```
kinit -k -t /etc/security/keytabs/hbase.headless.keytab [email protected]   # maybe hbase-ilab2, etc.
hbase shell
create_namespace 'netid'
grant 'netid', 'RWCA', '@netid'
^D
```
Tables in a namespace look like ns:table, i.e. the table name is prefixed with the namespace and a colon. Otherwise it appears that they work the same.
If a class wants to do this, create a script or webapp that lets any user create a namespace.
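A minimal sketch of such a helper (hypothetical; it only emits the hbase shell commands so they can be reviewed or piped into a non-interactive shell):

```shell
#!/bin/sh
# mk_namespace: print the hbase shell commands that give a netid its
# own namespace with full rights. Name and flow are ours, not HBase's.
mk_namespace() {
  netid=$1
  printf "create_namespace '%s'\n" "$netid"
  printf "grant '%s', 'RWCA', '@%s'\n" "$netid" "$netid"
}
# usage sketch: mk_namespace jsmith | hbase shell -n
mk_namespace demo_user
```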
(It's possible that if we disable ACLs it will do the right thing, but I haven't found documentation for that yet.)
FYI: Zeppelin stores notebooks in HDFS /user/zeppelin/notebook. (In HDP 2, it started there but moved to a local directory, /usr/hdp/current/zeppelin-server/notebook/, after Kerberization.) Because notebooks can be shared, Zeppelin owns them all, and there's a file showing who is authorized to access what in HDFS /user/zeppelin/conf/notebook-authorization.json.
The interpreter configuration is stored in HDFS, /user/zeppelin/conf/interpreter.json. I strongly suggest keeping a copy of this file, since Zeppelin now and then will mess it up, and you'll want to be able to restore it.
In ambari, zeppelin, in the shiro.ini section, edit the template.
- In HDP 3, comment out the static password for admin and the 2 lines later for the password matcher. In this version you can't mix static passwords with LDAP.
- uncomment the two lines for PAM authentication. Make the authentication domain zeppelin, not sshd. Make sure others such as ldap are commented.
```
account [default=ignore success=3] pam_localuser.so
account [default=ignore success=2] pam_succeed_if.so user ingroup slide
account optional pam_echo.so ****** This system is not intended for login except through Zeppelin; please use data1, data2 or data3 for ssh if you need the hadoop cluster, otherwise use ilab.cs.rutgers.edu
account required pam_deny.so
```
In /etc/pam.d/zeppelin the auth section should be
```
auth required pam_sepermit.so
auth required pam_env.so
auth sufficient pam_unix.so nullok try_first_pass
auth required pam_sss.so forward_pass
auth include postlogin
auth optional pam_exec.so /usr/libexec/hdfsmkdir
# Used with polkit to reauthorize users in remote sessions
-auth optional pam_reauthorize.so prepare
```
Pam for zeppelin runs as zeppelin. This causes trouble with the normal pam stack, so we use an abbreviated one. pam_unix will always fail unless you're in the local passwd file, but we need it to read the password. pam_sss will try to get two factors separately, which won't work in a web context; pam_unix will get just one and pass it on to pam_sss.
Also, for the scripts to run, zeppelin needs to be able to sudo. Create a file in /etc/sudoers.d:
```
##over ride defaults
Defaults>root shell_noargs , \
        preserve_groups , \
        ! env_reset , \
        ignore_dot , \
        ! requiretty

##LCSR Specific Group
#zeppelin ALL = (root) NOPASSWD: ALL
zeppelin ALL = (ALL) NOPASSWD: ALL
```
Note that credentials won't be registered for renewal. Zeppelin doesn't call session, and we can't tell who is logged in anyway. Renewal is only done once the user starts an interpreter. See below on session management.
In the users section of shiro.ini remove all users except admin. Admin needs both the admin and coresysadmins roles, which isn't the default.
back to shiro.ini:
- in roles section remove all roles except admin
- at the end, in URL security, use
```
/api/version = anon
/api/interpreter/setting/restart/** = authc
/api/interpreter/** = authc, roles[coresysadmins]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
#/** = anon
/** = authc
```
R by default requires knitr, but Centos doesn't include it with R. To load it and some other useful packages, do
```
sudo R -e "install.packages('devtools', repos = 'http://cran.us.r-project.org')"
sudo R -e "install.packages('knitr', repos = 'http://cran.us.r-project.org')"
sudo R -e "install.packages('ggplot2', repos = 'http://cran.us.r-project.org')"
sudo R -e "install.packages(c('devtools','mplot', 'googleVis'), repos = 'http://cran.us.r-project.org'); require(devtools); install_github('ramnathv/rCharts')"
sudo R -e "install.packages('data.table', repos = 'http://cran.us.r-project.org')"
```
Zeppelin runs as zeppelin. That means that the pam modules run as zeppelin. The pam module creates hdfs directories for users. For that to work:
- zeppelin has to be set so it can setuid to anyone. That has to be true anyway for user impersonation to work
- /etc/security/keytabs/hdfs.headless.keytab has to be readable by zeppelin. There are lots of ways to do this. I used "setfacl -m u:zeppelin:r /etc/security/keytabs/hdfs.headless.keytab"
```
#!/bin/bash
curl -u admin:xxx -i -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo": {"context" :"Stop Zeppelin via REST"}, "Body": {"ServiceInfo": {"state": "INSTALLED"}}}' \
  https://data-services1.cs.rutgers.edu/api/v1/clusters/ilab/services/ZEPPELIN
# actually took 16 sec
sleep 60
# Killing zeppelin doesn't kill the interpreters, but leaves them as orphans.
# Even though the processes are owned by users, they're part of a single
# zeppelin session.
ssh data-services2.cs.rutgers.edu loginctl terminate-user zeppelin
sleep 60
curl -u admin:xxx -i -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo": {"context" :"Start Zeppelin via REST"}, "Body": {"ServiceInfo": {"state": "STARTED"}}}' \
  https://data-services1.cs.rutgers.edu/api/v1/clusters/ilab/services/ZEPPELIN
# took 48 sec
```
[HDP] In Ambari, zeppelin custom zeppelin-site, set zeppelin.interpreter.lifecyclemanager.class to org.apache.zeppelin.interpreter.lifecycle.TimeoutLifecycleManager. That will cause idle interpreters to be killed after an hour.
[not] By default, new notebooks use Livy2. We want them to use Spark2. I can't tell how the ordering is done. Edit /usr/hdp/2.6.3.0-235/zeppelin/webapps/webapp/app.05bbdae750681c30f521.js. Find the last occurrence of note.defaultinterpreter. It is
```
e.note.defaultInterpreter=n.interpreterSettings[0]
```
Change the 0 to 3.
To make it permanent, replace the file in /usr/hdp/2.6.3.0-235/zeppelin/lib/zeppelin-web-0.7.3.2.6.3.0-235.war.
Somewhere that you have a valid Maven pom file, do "mvn dependency:get -DgroupId=org.apache.shiro.tools -DartifactId=shiro-tools-hasher -Dclassifier=cli -Dversion=1.4.0"
Find shiro-tools-hasher-1.4.0-cli.jar in your ~/.m2 directory
In that directory, type "java -jar shiro-tools-hasher-1.4.0-cli.jar -gs -p ". Give it the ambari UI admin password. You'll get back something like "$shiro1$SHA-256$500 .... "
Try that java command without arguments to see what options you have. You might want to use sha512.
Edit shiro.ini. In the main section, add
```
passwordMatcher = org.apache.shiro.authc.credential.PasswordMatcher
iniRealm.credentialsMatcher = $passwordMatcher
```
In the user section, change admin = to "admin=$shiro1...., admin,coresysadmins". I.e. insert the encrypted password. Surprisingly, it doesn't seem necessary to quote it. The admin,coresysadmins are the groups it's in. We need coresysadmins because that's what is authorized to do some of the configuration.
In zeppelin, zeppelin-env, edit the script. At the end add
```
export ZEPPELIN_IMPERSONATE_CMD='sudo -H -u ${ZEPPELIN_IMPERSONATE_USER} bash -c '
export ZEPPELIN_IMPERSONATE_SPARK_PROXY_USER="false"
```
restart it.
Of course that requires an entry in /etc/sudoers.d allowing zeppelin to become any user.
Now in zeppelin, interpreters, under sh, choose per-user isolated. That will expose the "impersonate user" option; click it. Now you'll get sh running as the user, though without kerberos authentication. Do the same thing for all the interpreters, except maybe md, which I assume doesn't need a process.
If you're going to use impersonation for anything, you have to use it for everything. To make impersonation work, the user zeppelin has to be able to become any user. That makes it effectively root. We can't let users run as that user. So I've set it for everything except md, which I assume doesn't have an issue
I added an interpreter %python. See below. Also, I recreated %spark and %spark2 with default settings, except the python interpreter. The initial one didn't work; I think that's because user impersonation didn't work with the default yarn mode. When I recreated it, it defaulted to mode local[*], where the normal Kerberos authentication works. For %spark and %spark2, make sure SPARK_HOME is set to /usr/hdp/current/spark-client/ or spark2-client.
Because users have to be able to create log files,
chmod 777 /var/log/zeppelin
Important: change zeppelin.interpreter.config.upgrade to false. Otherwise interpreter configs are reset at every restart.
We want each user's interpreters to run in a separate logind session, with its own cgroup. That lets us set memory limits, etc. We also do Kerberos ticket management on a per-interpreter basis, because, at least in HDP 3, we can't tell who is logged in until they start an interpreter.
- They get a kerberos ticket on login
- When they start an interpreter, try to renew it. If it works, register it with renewd, so tickets won't expire during a long job. Renewd will work since the interpreter process is owned by the user.
- If we can't renew the Kerberos credential, it's probably expired. The user needs to log out and log in again. We won't start the interpreter until they do.
- Thus we can guarantee that interpreters always have credentials available; but if all interpreters time out, eventually the user won't be able to start a new one, because the Kerberos credentials will have expired. We can't do better with this version of Zeppelin. (The next one will let us see who is logged in.)
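The renewal decision above can be sketched like this (an assumption about the flow; the actual pam_exec script isn't reproduced here):

```shell
#!/bin/sh
# Try to renew the caller's TGT. Success means the interpreter can
# start and renewd can keep the ticket fresh; failure means the user
# must log out and back in to get new credentials.
check_renewable() {
  kinit -R 2>/dev/null
}
if check_renewable; then
  echo "credentials renewed; interpreter may start"
else
  echo "credentials not renewable; user must log in again"
fi
```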
At the end of /etc/pam.d/sudo
```
session [default=2 success=ignore] pam_exec.so quiet /usr/libexec/sudozeppelin
-session optional pam_systemd.so
session optional pam_exec.so /usr/libexec/setlimits.sh
```
This tests whether sudo is being called from zeppelin, and if so does the two key pam calls.
Here is sudozeppelin:
```
#!/bin/sh
#printenv >> /tmp/sudozep
#echo pid $$ >> /tmp/sudozep
# This is intended to be called from pam.d/sudo session.
# It checks whether this is a sudo being done by zeppelin to start a
# user process. If so, it succeeds. Otherwise it fails.
# That lets pam call pam_systemd just for zeppelin user jobs.
# Normally sudo doesn't want to create a new session, but for
# Zeppelin user jobs we do.
# Put the sudo process's PID into the root cgroups. That removes
# them from the session they're currently in. pam_systemd won't
# start a new session if a process is already in a session, so
# this is needed for pam_systemd to do anything.
if test "$PAM_RUSER" = "zeppelin" -a "$PAM_USER" \!= "root" -a "$PAM_TYPE" = "open_session"; then
  MYPPID=`awk '/^PPid:/{print $2}' /proc/$$/status`
  CMD=`cat /proc/$MYPPID/cmdline`
  if echo "$CMD" | grep -q "sudo.*-H.*-u.*source /usr/hdp/current/zeppelin-server" ; then
    echo $MYPPID >/sys/fs/cgroup/systemd/cgroup.procs
    echo $MYPPID >/sys/fs/cgroup/memory/cgroup.procs
    echo $MYPPID >/sys/fs/cgroup/cpu,cpuacct/cgroup.procs
    exit 0
  fi
fi
exit 1
```
/etc/pam.d/sudo, session section. With the new zeppelin we have idle timeout, so we don't need to restart every day. But we also can't tell what users are logged in, so there's no way to know what tickets to renew. Thus we call a script that tries to do kinit -R when you start an interpreter. If it works you have another 24 hours, and we register with renewd. If it fails, the user needs to login again to get Kerberos credentials.
Unfortunately we have no easy way to generate errors that the user will see. It turns out that if a pam script fails, Zeppelin will print the name of the failing script. So we name the script Your-session-has-expired.-Logout-and-login-again.
The first sudozeppelin prepares for system-auth by removing the process from any cgroups. Without that, pam_systemd doesn't create a login session, but assumes they're already in one.
sudozeppelin2 checks whether sudo is being invoked by zeppelin, since we want to check for Kerberos credentials and set memory limits if so.
```
# if zeppelin, remove from current cgroup so system-auth will work correctly
session optional pam_exec.so quiet /usr/libexec/sudozeppelin
session optional pam_keyinit.so revoke
session required pam_limits.so
session optional pam_exec.so /usr/libexec/setlimits.sh
session include system-auth
# is this zeppelin?
session [default=2 success=ignore] pam_exec.so quiet /usr/libexec/sudozeppelin2
# if so, do setlimits
session required pam_exec.so /usr/libexec/Your-session-has-expired.-Logout-and-login-again.
session optional pam_exec.so /usr/libexec/setlimits.sh
```
/usr/libexec/sudozeppelin:
```shell
#!/bin/sh
#printenv >> /tmp/sudozep
#echo pid $$ >> /tmp/sudozep
# This is intended to be called from pam.d/sudo session.
# It checks whether this is a sudo being done by zeppelin to start a
# user process. If so, it succeeds. Otherwise it fails.
# That lets pam call pam_systemd just for zeppelin user jobs.
# Normally sudo doesn't want to create a new session, but for
# Zeppelin user jobs we do.
# Put the sudo process's PID into the root cgroups. That removes
# them from the session they're currently in. pam_systemd won't
# start a new session if a process is already in a session, so
# this is needed for pam_systemd to do anything.
if test "$PAM_RUSER" = "zeppelin" -a "$PAM_USER" != "root" -a "$PAM_TYPE" = "open_session"; then
  MYPPID=`awk '/^PPid:/{print $2}' /proc/$$/status`
  CMD=`cat /proc/$MYPPID/cmdline`
  if echo "$CMD" | grep -q "sudo.*-H.*-u.*source /usr/hdp/current/zeppelin-server" ; then
    echo $MYPPID >/sys/fs/cgroup/systemd/cgroup.procs
    echo $MYPPID >/sys/fs/cgroup/memory/cgroup.procs
    echo $MYPPID >/sys/fs/cgroup/cpu,cpuacct/cgroup.procs
    exit 0
  fi
fi
exit 0
```
/usr/libexec/sudozeppelin2:
```shell
#!/bin/sh
#printenv >> /tmp/sudozep
#echo pid $$ >> /tmp/sudozep
# test to see if this is a sudo from zeppelin for a user interpreter
if test "$PAM_RUSER" = "zeppelin" -a "$PAM_USER" != "root" -a "$PAM_TYPE" = "open_session"; then
  MYPPID=`awk '/^PPid:/{print $2}' /proc/$$/status`
  CMD=`cat /proc/$MYPPID/cmdline`
  if echo "$CMD" | grep -q "sudo.*-H.*-u.*source /usr/hdp/current/zeppelin-server" ; then
    exit 0
  fi
fi
exit 1
```
/usr/libexec/Your-session-has-expired.-Logout-and-login-again.
```shell
LOGIN=`getent passwd "$PAM_USER" | cut -d: -f3`
if sudo -u "$PAM_USER" kinit -R -c /tmp/krb5cc_"$LOGIN"; then
  # register for renewd
  touch /run/renewdccs/FILE:\\tmp\\krb5cc_"$LOGIN"
  exit 0
else
  exit 1
fi
```
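Note that LOGIN above is actually the numeric UID (field 3 of the getent passwd output), which matches the /tmp/krb5cc_&lt;uid&gt; ccache naming. A minimal demonstration with a made-up passwd entry:

```shell
# field 3 of a passwd entry is the numeric uid (sample entry, hypothetical user)
pwline="alice:x:12345:100:Alice Example:/home/alice:/bin/bash"
uid=$(printf '%s' "$pwline" | cut -d: -f3)
echo "/tmp/krb5cc_$uid"   # the ccache path the script renews
```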
You can set any spark property in the interpreter setup. For livy and livy2, prefix the property name with livy.
I've had to set spark.port.maxRetries to 1000. It defaults to 16, which limits the number of spark sessions to 16; we can't live with that. I ended up setting this in the spark defaults for spark and spark2 cluster-wide, so the setting here isn't really needed anymore.
Livy and livy2:
```
livy.spark.executor.cores 4
livy.spark.executor.instances 3
livy.spark.executor.memory 4g
livy.spark.driver.cores 4
livy.spark.driver.memory 4g
livy.spark.port.maxRetries 10000
```
Spark and Spark2:

```
SPARK_HOME /usr/hdp/current/spark2-client/   [omit 2 for spark]
master local[4]
spark.cores.max 4
spark.driver.memory 4g
spark.executor.memory 4g
spark.port.maxRetries 1000
```
Remove the spark.yarn Kerberos keytab and principal properties. Despite their names, they get used for local mode as well, which causes obvious problems. (The user can't read the keytab.)
data-services2, and data123 must be set up like Jupyter:
- /usr/lib/anaconda3 needs to be there
- /etc/profile.d/pythonuser.sh and .csh need to be there. Note that I'm using a slightly different pythonuser on data123, because I don't want to change any more than I have to mid-semester.
- automounts for /common/users and /common/clusterdata need to be set up
pythonuser.sh:

```shell
export PYTHONUSERBASE="/common/clusterdata/$USER/local"
export PYSPARK_PYTHON=/usr/lib/anaconda3/bin/python3
# this is probably OK all the time, but to avoid breaking other things I'm
# only setting it for interactive shells. This is a bash-ism, but /bin/sh
# is bash for us.
export HDP_VERSION=2.6.3.0-235
if ! test `expr match "$PATH" "/usr/lib/anaconda3/bin"` -gt 0 ; then
  PATH="/usr/lib/anaconda3/bin:${PATH}"
fi
```

pythonuser.csh:

```shell
setenv PYTHONUSERBASE "/common/clusterdata/$USER/local"
setenv PYSPARK_PYTHON "/usr/lib/anaconda3/bin/python3"
# this is probably OK all the time, but to avoid breaking other things I'm
# only setting it for interactive shells.
setenv HDP_VERSION 2.6.3.0-235
if (`expr match "$PATH" "/usr/lib/anaconda3/bin"` == 0) set path = ("/usr/lib/anaconda3/bin" $path)
```

The copies on jupyter and data-services2 don't have HDP_VERSION, but that's just because I wanted to minimize retesting. I recommend putting it there.
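Both scripts guard the PATH change with expr match, which prints how many leading characters of its first argument the pattern matches, so the result is nonzero only when PATH already starts with the anaconda directory. A small demonstration (the PATH values here are made up):

```shell
# expr match prints the number of leading characters matched (0 if no match)
PATH1="/usr/lib/anaconda3/bin:/usr/bin"
PATH2="/usr/bin:/usr/local/bin"
m1=$(expr match "$PATH1" "/usr/lib/anaconda3/bin" || true)
m2=$(expr match "$PATH2" "/usr/lib/anaconda3/bin" || true)
echo "$m1 $m2"   # 22 0
```

The `|| true` is only needed here because expr exits nonzero when the result is 0.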
At some point this broke spark-submit. Needed to convert /etc/hadoop/conf/topology_script.py to be compatible with python3. At the beginning replace the whole import section with
```python
from __future__ import print_function
import sys, os
try:
    from string import join
except ImportError:
    join = lambda s: " ".join(s)
try:
    import ConfigParser
except ModuleNotFoundError:
    import configparser as ConfigParser
```

Then find the `print rack` statement and change it to `print(rack)`.
However ambari will overwrite this.
- copy it to topology_script3.py
- change net.topology.script.file.name in Ambari's HDFS configuration to point to topology_script3.py
- /usr/hdp/2.6.3.0-235/zeppelin/bin/install-interpreter.sh -n python
- In the interpreter settings, create it and set the python to /usr/lib/anaconda3/bin/python
- set to per-user impersonate
- in the notes, enable it
/usr/local/bin/zsparkpy:

```shell
#!/bin/bash
# for %spark2.pyspark and %python
# uses special matlib backend and normal spark 2 python stuff
export PYTHONPATH="/usr/hdp/2.6.3.0-235/zeppelin/interpreter/lib/python:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH"
export PYSPARK_PYTHON="/usr/lib/anaconda3/bin/python3"
export PYTHONUSERBASE="/common/clusterdata/$USER/local"
exec /usr/lib/anaconda3/bin/python3 "$@"
```
%spark.pyspark uses /usr/local/bin/zsparkpy1
```shell
#!/bin/bash
# for %spark.pyspark
# uses special matlib backend and normal spark python stuff
export SPARK_HOME="/usr/hdp/current/spark-client"
export PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"
export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"
#export PYTHONPATH="/usr/hdp/2.6.3.0-235/zeppelin/interpreter/lib/python:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH"
export PYSPARK_PYTHON="/usr/bin/python"
export PYTHONUSERBASE="/common/clusterdata/$USER/local"
exec /usr/bin/python "$@"
```
The configurations for %spark and %spark2 must be edited to change the python from "python" to /usr/local/bin/zsparkpy (zsparkpy1 for %spark). Also, for %spark, you need to add SPARK_HOME as /usr/hdp/current/spark-client/. Without that, python and R don't work.
The newest version of Zeppelin will run ipython if possible. For that to work, the following modules must be installed in the version of python involved. I did it for anaconda's python and the system python2, just in case someone chooses python 2 explicitly.
- the command is "python -m pip install PACKAGE", but you may need to type the full path to python to get the right one
- python -m pip install ipykernel
- python -m pip install jupyter-client
- python -m pip install ipython
- python -m pip install grpcio
- python -m pip install protobuf
HOWEVER, for the anaconda environment you really want to use conda to do the installation. Several of those things are there already. What you actually need to do is
```shell
conda install py4j
conda install grpcio
conda install protobuf
```
Documentation for Zeppelin doesn't mention all of this, particularly protobuf. Zeppelin checks for most of it and will warn you, but if protobuf is missing, it will mysteriously fail. (Fixed in a future release.)
To get python2 to work, I ended up installing Anaconda python2 in /usr/lib/anaconda2. It needs the same "conda install" commands. User documentation says how to activate it. Rather than just running the executable, I use a python2 version of /usr/local/bin/zsparkpy, this one called /usr/local/bin/zsparkpy2:
```shell
#!/bin/bash
# for %spark2.pyspark and %python
# uses special matlib backend and normal spark 2 python stuff
export PYTHONPATH="/usr/hdp/3.1.0.0-78/zeppelin/interpreter/lib/python/:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"
export PYSPARK_PYTHON="python2"
export PYTHONUSERBASE="/common/clusterdata/$USER/local"
export PATH="/usr/lib/anaconda2/bin:$PATH"
exec /usr/lib/anaconda2/bin/python2 "$@"
```
Livy is part of spark, so there's no option to add it in the normal services menu. Instead, go to the host that you want to be the server (data-services2 in this case), and click add. You'll see an option for the livy server. Similarly add the spark2 client and then livy for spark2 to data-services2.
The following says how to set up Livy: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_zeppelin-component-guide/content/config-livy-interp.html. But it looks to me like it's really about using livy for Zeppelin to access Spark. I'm not clear whether that's actually needed.
[this] Files view:
I had terrible problems getting the files view to work. Most of it was that krb5.conf on data-services1 isn't generated by ambari, and thus was the default. noaddresses must be set true, or it fails.
Other than that, the published instructions work. It is at least possible that none of this was needed, and the only problem was noaddresses in krb5.conf.
- In hdfs configuration, in custom core-site [note]
```
hadoop.proxyuser.ambari-server-ilab.hosts *
hadoop.proxyuser.root.groups *
hadoop.proxyuser.root.hosts *
```
- in the view itself,
```
WebHDFS Authorization: auth=KERBEROS;proxyuser=ambari-server-ilab
cluster configuration: custom
WebHDFS FileSystem URI*: webhdfs://data-services2.cs.rutgers.edu:50070
```
- grant permission to some group all users are in
HDFS has a namenode, a backup name node, and data nodes. In our case the three data* systems are the data nodes.
When you start, the data nodes start first, then the namenode. The name node waits for all of the data nodes to check in. Each one reports the number of blocks it has. The namenode knows how many there are supposed to be. It waits until it sees them all.
If everything works, you have a 2 min wait. (It waits for 2 min before checking.) If it never sees all the blocks, it doesn't come up.
There are a couple of reasons it might not see all the blocks:
- One of the data nodes isn't up, or there's a network or Kerberos issue. You'll need to fix those.
- There's a problem with the file system. Try fsck as explained in the next section. You'll need to fix it until fsck is clean before the namenode will come online. Fsck can be run with the file system still in safe mode, but if it shows a problem, you'll have to exit safe mode to fix it. If you do that while ambari is still trying to start, the system will come up, which you don't want. So if fsck shows a failure, abort the startup, fix hdfs, then stop all and start all.
- Fsck says everything is OK, but a few blocks are missing. On the name node, /var/log/hadoop/hdfs/hadoop-hdfs-namenode-data-servicesX.cs.rutgers.edu.log should show messages saying that not all blocks are there yet.
hdfs dfsadmin -safemode leave
Note that hdfs depends upon Zookeeper. If it didn't start, fix that first.
hdfs is a distributed file system. The data is stored on data1, 2, and 3; servers run on those nodes. However they are coordinated by the name server (namenode), which is on data-services2. If there are issues, they show up as the namenode not coming online. In Ambari, the startup will simply hang at that point. It's not obvious when it's hung, since it takes several minutes to come online normally, but you can tell by looking at the log. Do "ls -lt" in data-services2:/var/log/hadoop/hdfs. There will be two new files with the same name, ending in .log and .out. You want .log.
I have seen it work if I try again. I.e. use ambari to shut down all services and then start all services again. Let's assume that doesn't work.
If there's an issue, you will want to abort the ambari startup. That will leave hdfs running, but in readonly mode. It changes to read/write when it comes on line. The HDFS terminology is "safe mode."
The following commands require you to be the user "hdfs"
```shell
su - hdfs
klist -k -t /etc/security/keytabs/hdfs.headless.keytab
# note the principal; it should be something like [email protected]
kinit -k -t /etc/security/keytabs/hdfs.headless.keytab PRINCIPAL
```
This command will do an fsck of hdfs:
hdfs fsck / > file
The two kinds of errors you're most likely to see are corrupt blocks / files, and under replicated files. Corrupt files will keep the system from coming up. Look through the file to see if you care about any of them. Most likely they'll be temporary files. If there are ones you care about, you can find them from a backup on data-services4. It has daily snapshots, if you need to go back.
The following command will remove all corrupt files. Note that you have to leave safe mode, to allow writes:
```shell
hdfs dfsadmin -safemode leave
hdfs fsck / -delete
```
The fsck will report errors. With "-delete" it actually fixes them, but that's not obvious from the output. You can do it again to verify that they're fixed.
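To double-check that the second run is clean, you can pull the overall status line out of the fsck output. A sketch using illustrative sample text rather than a live run (a real check would pipe `hdfs fsck /` in directly):

```shell
# illustrative fsck summary text; a live run would use: hdfs fsck /
report="Status: HEALTHY
 Total size: 123456 B
 Corrupt blocks: 0"
# extract the value of the Status line
status=$(printf '%s\n' "$report" | awk -F': ' '/^Status/{print $2}')
echo "$status"
```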
Under-replicated files mean that somehow something set a replication factor over 3. Since we only have 3 data nodes, you can't replicate a file more than 3 times. We normally use 1; since all 3 replicas are on the same NFS file server, replication doesn't seem to buy much. The following will find all under-replicated files and fix them by setting their replication count to 1. Note that this assumes you're using bash.
```shell
hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
for hdfsfile in `cat /tmp/under_replicated_files`; do
  echo "Fixing $hdfsfile :"
  sudo -u hdfs hadoop fs -setrep 1 $hdfsfile
done
```
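The grep/awk stage of that pipeline just pulls the path off the front of each report line. Here it is applied to a single sample line in the shape fsck prints (the path and replica counts are made up):

```shell
# sample under-replication report line (illustrative text, not a live run)
line="/user/foo/part-0000:  Under replicated BP-1:blk_1. Target Replicas is 5 but found 3 replica(s)."
# same extraction as the pipeline above
path=$(printf '%s\n' "$line" | grep 'Under replicated' | awk -F':' '{print $1}')
echo "$path"
```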
Try "hdfs fsck / " again, and make sure everything is OK.
At this point I would do an ambari shutdown followed by an ambari startup. Some of the systems may get into an odd state if HDFS isn't working.
Note that Zeppelin depends upon HDFS. If there's an issue with HDFS, fix it first.
Zeppelin is restarted nightly at 4am. It doesn't recover idle sessions, so this is the only way to prevent an unbounded number of sessions from building up. (This is fixed in the next release.) The cron job is on data-services1 (the Ambari system), because it uses Ambari commands to do the restart. See "crontab -l" on that host. You'll see a check command that runs a few minutes after the restart. If the restart failed, it sends me email.
To make sure it's up, go to https://data-services2.cs.rutgers.edu:9995 and make sure you see a login screen.
If the restart fails, the first thing I would do is login to ambari as admin and try to restart Zeppelin again. That might actually work, if there was a transient failure in the environment.
Let's assume the restart fails, or results in a server that gives an error.
To see what is going on, on data-services2, look at /var/log/zeppelin/zeppelin-zeppelin-data-services2.cs.rutgers.edu.log.
There's one failure mode we've seen that is likely to happen again: During startup, Zeppelin loads all the user notebooks. Unfortunately if there's a problem with one of them, it can abort the startup. (This has supposedly been fixed in a newer version.)
You can tell which notebook caused the trouble by going to /usr/hdp/current/zeppelin-server/notebook/. Do
ls -f | cat > /tmp/foo
"ls -f" means to do a listing without sorting. Normally "ls" sorts in alphabetical order. However Zeppelin loads files in the order that the directory entries occur in the directory. That's what "ls -f" shows. The problem will be the one after the latest that the log shows was loaded.
Remove it or move it, and restart zeppelin. I would look at the JSON file in the directory, and either send it to the user or pull out just the code and send it. Otherwise the user could lose substantial work.
We had a crash. After we came up, hbase wouldn't stay online. On data-services2, look in /var/log/hbase/hbase-hbase-master-data-services2.cs.rutgers.edu.log.
This entry has more information than you'll probably need, but some of it is magic that's hard to find, so I'm recording it in case you do.
There are two major things to look at:
- the hbase table itself
- the zookeeper state information
If the table data is bad, you can try to fix it with hbase hbck.
Now for the narrative, some of which you won't need:
- The first problem was the message "Waiting for namespace table to be online. Time waited". Eventually it times out and hbase stops. Google came up with the most drastic solution: delete all of /apps/hbase/data/WALs/. That worked. But the WAL is the write-ahead log, so this will lose data in progress.
- Now it came up, but the web page said 2 servers were in transition, and they wouldn't go online. I tried hbase hbck, and that didn't help. A real fix looked complex, so I removed hbase from the two servers involved, and the one left worked fine. I then reinstalled hbase on those servers. I assume this also would lose some data. At this point it looks OK.
- After more experiments it seems likely that just stopping the node in trouble is enough. It causes another node to take over the region.
On data-services2:
- su - hbase
- kinit -k -t /etc/security/keytabs/hbase.headless.keytab [email protected]
- export HBASE_SERVER_JAAS_OPTS=-Djava.security.auth.login.config=/usr/hdp/current/hbase-client/conf/hbase_master_jaas.conf
- hbase hbck
Running zkCli is interesting. Create a file /usr/local/zkcli.conf:
```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab=/etc/security/keytabs/hdfs.headless.keytab
  storeKey=true
  useTicketCache=false
  principal="[email protected]";
};
```
Now do
- export JVMFLAGS="-Djava.security.auth.login.config=/usr/local/zkcli.conf"
- /usr/hdp/current/zookeeper-server/bin/zkCli.sh
The problem is that rmr /hbase-secure runs into security problems. To fix it, ignore the two things above (which give you only read-only access). On data-services2:
- cd /usr/hdp/current/zookeeper-server
- java -cp "./zookeeper.jar:lib/slf4j-api-1.6.1.jar" org.apache.zookeeper.server.auth.DigestAuthenticationProvider super:password
- it outputs super:XXX; copy that
- in ambari's zookeeper config, find zookeeper-env template config, add to the end
- export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.DigestAuthenticationProvider.superDigest=super:XXX" where XXX is the encrypted password printed above
- save, and restart zookeeper.
- /usr/hdp/current/zookeeper-client/bin/zkCli.sh -server data-services2.cs.rutgers.edu
- addauth digest super:password
Once it works, I recommend removing the entry from Ambari's zookeeper config and restarting. The problem with this approach is that you're exposing the encrypted form of the password to everyone. Remember that everyone can see our configurations. Of course it's only the encrypted password, but it's still not a good idea.
We ran into a situation where components were failing to start, and the logs showed missing info in zookeeper. What's worse, which components failed differed when you started again.
Zookeeper runs on data-services2, data-services3, and data1. Look at the directories under /hadoop/zookeeper. On data1, the current database file was a few bytes, while the others were megabytes. I suspect the disk had filled. This explains the inconsistent results, since they depend upon which component talked to which copy.
If you have just one bad database, fixing it is easy. Shut down all the hadoop components using ambari. Start the two good zookeepers (just zookeeper, not the rest of hadoop). On the bad node, remove all files from /hadoop/zookeeper/version-2 and start its zookeeper; it will notice that its files are missing and fetch a good copy. Now you can start the rest of hadoop.
During a period when lots of users are working, you may want to see who is doing what, so you can deal with the possibility that yarn might run out of resources. There are several tools
- resource manager ui - this shows all jobs, current and past, and total resources used. However finding resource usage for each job is a drag.
- "yarn application -list" - will list all jobs currently running. Doesn't show resource usage though.
- "yarn application -list -appStates ALL" - will show the whole history.
- "yarn application -status application_1542400534304_0694" - shows detailed info on one job
- "yarn top" - this is the best summary. In order to see specifics of all jobs you need to su to "yarn" with the right Kerberos credentials. Those should always be set up on data-services3. If you had to set it up you would kinit from /etc/security/keytabs/rm.service.keytab . It doesn't appear that credentials exist other than on data-services3.