Install Kafka - dryshliak/hadoop GitHub Wiki

Prerequisites

  • Three nodes (min 1 GB RAM per node)
  • Disk space (min 30GB per node)
  • Ubuntu 16.04
  • Kafka 2.6.2 (Scala 2.12 build)
  • Java 8
  • SSH access
  1. Install VirtualBox https://www.virtualbox.org/wiki/Downloads

  2. Prepare three instances from the appropriate Ubuntu server image, which you can find at the URL below:
    http://releases.ubuntu.com/16.04/ubuntu-16.04.6-server-amd64.iso

  3. While preparing each instance, add a second adapter of type “Host-only Adapter” in the Network settings. If the second adapter is not recognized, please read this article for a resolution.

  4. Choose “OpenSSH server” during installation to have SSH access to the instance.

  5. On all instances, set up the hosts file with FQDNs so that each node can resolve the others' names locally (as explained here), and also remove the line that maps 127.0.1.1 to the node name.
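For example, the hosts file on every node might look like the following (the host-only addresses and hostnames are illustrative; substitute your own):

```
# /etc/hosts, identical on all three nodes
127.0.0.1     localhost
192.168.56.3  node1.cluster  node1
192.168.56.4  node2.cluster  node2
192.168.56.5  node3.cluster  node3
```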

  6. Disable the firewall

sudo service ufw stop && sudo ufw disable
  7. Before installing any applications or software, make sure your list of packages from all repositories and PPAs is up to date, or update it using this command:
sudo apt-get update && sudo apt-get dist-upgrade -y
  8. Install Oracle Java
cd /opt
wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
mkdir /usr/lib/jvm
tar -xf /opt/jdk-8u131-linux-x64.tar.gz -C /usr/lib/jvm
ln -s /usr/lib/jvm/jdk1.8.0_131 /usr/lib/jvm/default-java
update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_131/bin/java 100
update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_131/bin/javac 100
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131/

Add the Java path (in our case JAVA_HOME="/usr/lib/jvm/jdk1.8.0_131") to the system-wide environment file:

sudo vi /etc/environment
  9. Check the Java configuration
update-alternatives --display java
update-alternatives --display javac
java -version

Installing Zookeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Kafka uses ZooKeeper to track the heartbeats of its nodes, maintain configuration, and, most importantly, elect leaders.

  1. Download and unpack Zookeeper package
wget https://downloads.apache.org/zookeeper/stable/apache-zookeeper-3.6.3-bin.tar.gz -P /opt
tar -xf /opt/apache-zookeeper-3.6.3-bin.tar.gz -C /opt
ln -s /opt/apache-zookeeper-3.6.3-bin /opt/zookeeper
  2. Create a new zookeeper user and group using the command:
adduser --disabled-password --gecos "" zookeeper
  3. Create a zookeeper directory under /var/lib for storing the state associated with the ZooKeeper server, and another zookeeper directory under /var/log for ZooKeeper logs. The ownership of both directories must be changed to zookeeper:
mkdir /var/{lib,log}/zookeeper
chown -R zookeeper:zookeeper /var/{lib,log}/zookeeper
  4. Create the server id for the ensemble. Each ZooKeeper server must have a unique number between 1 and 255 in its myid file. The command below derives the id from this node's host-only IP address (note that it keeps only the final character of the address, so it assumes single-digit last octets):
ip a | grep '192.168.56.' | grep -Po 'inet \K[\d.]+' | grep -o '.$' | sudo tee /var/lib/zookeeper/myid
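The id extraction can be checked against a sample line of `ip a` output; assuming the host-only network is 192.168.56.0/24, an address ending in .4 yields the id 4:

```shell
# Simulate one "inet" line of `ip a` output for the host-only interface
# and extract the last digit of the address, which becomes the myid value.
echo "    inet 192.168.56.4/24 brd 192.168.56.255 scope global enp0s8" \
  | grep '192.168.56.' \
  | grep -Po 'inet \K[\d.]+' \
  | grep -o '.$'
# prints: 4
```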
  5. Go to the conf folder under the ZooKeeper home directory (the location where the archive was extracted). By default, a sample configuration file named zoo_sample.cfg is present in the conf directory. Make a copy of it named zoo.cfg as shown below, and edit the new zoo.cfg as described, on all three Ubuntu machines.
cd /opt/zookeeper/conf
cp zoo_sample.cfg zoo.cfg

and change zoo.cfg like below

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
server.<id>=<node1 ip or dns name>:2888:3888
server.<id>=<node2 ip or dns name>:2888:3888
server.<id>=<node3 ip or dns name>:2888:3888
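For instance, with the three host-only addresses used in the test commands later in this guide (illustrative values), the server lines would be:

```
server.3=192.168.56.3:2888:3888
server.4=192.168.56.4:2888:3888
server.5=192.168.56.5:2888:3888
```

The id in each server.<id> entry must match the contents of /var/lib/zookeeper/myid on that node (3, 4, and 5 here, following the last-octet scheme above).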
  6. Set up logging in log4j.properties.
vi /opt/zookeeper/conf/log4j.properties
zookeeper.log.dir=/var/log/zookeeper
zookeeper.tracelog.dir=/var/log/zookeeper
log4j.rootLogger=INFO, CONSOLE, ROLLINGFILE
  7. Add the following environment variables to the environment file.
sudo vi /etc/environment
ZOO_LOG_DIR="/var/log/zookeeper"
SERVER_JVMFLAGS="-Xms256m -Xmx256m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/var/lib/zookeeper/zookeeper_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=7 -XX:GCLogFileSize=10M"
  8. Start ZooKeeper on all three nodes, one by one, using the following commands:
chown -R zookeeper:zookeeper /var/{lib,log}/zookeeper #just to be sure
/opt/zookeeper/bin/zkServer.sh start

  9. Verify the ZooKeeper cluster and ensemble
    In a three-server ensemble, one server will be in leader mode and the other two in follower mode. You can check the status by running the following command:
/opt/zookeeper/bin/zkServer.sh status
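On a healthy ensemble, the status command ends by reporting the node's role; exactly one of the three nodes should show leader, and the others a final line like:

```
Mode: follower
```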

Installing Kafka

  1. Download and unpack Kafka package
wget https://downloads.apache.org/kafka/2.6.2/kafka_2.12-2.6.2.tgz -P /opt
tar -xf /opt/kafka_2.12-2.6.2.tgz -C /opt
ln -s /opt/kafka_2.12-2.6.2 /opt/kafka
  2. Create the kafka user and directories
useradd kafka
mkdir /var/{lib,log}/kafka
chown -R kafka:kafka /var/{lib,log}/kafka
  3. Launch Kafka as a service on startup. For this, create a unit file in the /etc/systemd/system directory with the following content:
sudo vi /etc/systemd/system/kafka.service
[Unit]
Description=Apache Kafka
Requires=network.target
After=network.target

[Service]
Type=simple
EnvironmentFile=/opt/kafka/config/kafka
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-failure
User=kafka
Group=kafka
SuccessExitStatus=143

[Install]
WantedBy=multi-user.target
  4. Set up the memory settings in the environment file referenced by the service unit, adding the following line:
sudo vi /opt/kafka/config/kafka
KAFKA_HEAP_OPTS="-Xms512m -Xmx512m"
  5. Create the server.properties file
#back up the existing properties file
mv /opt/kafka/config/server.properties /opt/kafka/config/server.properties.orig
#then create a new server.properties with the content below
#broker.id must be a unique number on each broker
broker.id=1
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/var/lib/kafka
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
#zookeeper.connect must list the addresses of all three ZooKeeper servers
zookeeper.connect=<ip address>:2181,<ip address>:2181,<ip address>:2181/kafka
#the IP address of the server where Kafka is installed
listeners=PLAINTEXT://<ip address>:9092
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=0
delete.topic.enable=true
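As a concrete illustration, on the node at 192.168.56.3 in the three-VM layout used by the test commands later in this guide (the addresses are illustrative), the placeholder lines would become:

```
broker.id=3
zookeeper.connect=192.168.56.3:2181,192.168.56.4:2181,192.168.56.5:2181/kafka
listeners=PLAINTEXT://192.168.56.3:9092
```

Using the last octet of the host-only address as broker.id mirrors the myid scheme used for ZooKeeper and keeps the ids unique across brokers.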
  6. Add the following environment variable to the environment file.
sudo vi /etc/environment
LOG_DIR="/var/log/kafka"
  7. Finish setting up the Kafka service
systemctl daemon-reload
systemctl enable kafka
  8. Ensure correct permissions on the directories
chown -R kafka:kafka /var/{lib,log}/kafka
  9. Start the Kafka service on each instance
systemctl start kafka
systemctl status kafka
  10. Test the installation
#Create topics
/opt/kafka/bin/kafka-topics.sh --create --zookeeper 192.168.56.3:2181,192.168.56.4:2181,192.168.56.5:2181/kafka --replication-factor 3 --partitions 3 --topic test

#Describe topics
/opt/kafka/bin/kafka-topics.sh --describe --zookeeper 192.168.56.3:2181,192.168.56.4:2181,192.168.56.5:2181/kafka --topic test

#Let’s start publishing messages on test topic on one Kafka instance
/opt/kafka/bin/kafka-console-producer.sh --broker-list 192.168.56.3:9092,192.168.56.4:9092,192.168.56.5:9092 --topic test

#We will now create a subscriber on test topic and listen from the beginning of the topic on another Kafka instance
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server 192.168.56.3:9092,192.168.56.4:9092,192.168.56.5:9092 --topic test --from-beginning
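Assuming the same ZooKeeper addresses as above, you can also list the topics the cluster knows about and, because delete.topic.enable=true is set in server.properties, remove the test topic when you are done:

```
#List all topics registered in the cluster
/opt/kafka/bin/kafka-topics.sh --list --zookeeper 192.168.56.3:2181,192.168.56.4:2181,192.168.56.5:2181/kafka

#Delete the test topic (only allowed while delete.topic.enable=true)
/opt/kafka/bin/kafka-topics.sh --delete --zookeeper 192.168.56.3:2181,192.168.56.4:2181,192.168.56.5:2181/kafka --topic test
```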
