Apache Kafka Series - Kafka Streams for Data Processing
Resources for this course:
- https://docs.confluent.io/current/streams/kafka-streams-examples/docs/index.html
- https://www.tunnelbear.com/

Source code on GitHub: https://github.com/simplesteph/kafka-streams-course.git
Source code for different Kafka versions can be found here: https://courses.datacumulus.com/downloads/kafka-streams-sn2/
Java 8 lambda expressions: https://www.tutorialspoint.com/java8/java8_lambda_expressions.htm
- Kafka Streams == KS (abbreviation used in these notes)

KS is an easy-to-use data processing and transformation library that ships with Kafka, used for data transformation, data enrichment, fraud detection, monitoring and alerting. There is no need to create a separate cluster for KS; it is highly scalable, elastic and fault tolerant, and offers:

- one-record-at-a-time processing
- exactly-once processing semantics (see the config sketch below)
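A minimal sketch of how exactly-once semantics is switched on in a Streams app; the application id and bootstrap server here are placeholder assumptions, not values from the course:

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfig {
    public static void main(String[] args) {
        Properties config = new Properties();
        // Placeholder values; adjust for your environment
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Turns on exactly-once processing (available since Kafka 0.11);
        // on Kafka 3.0+ prefer StreamsConfig.EXACTLY_ONCE_V2
        config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        System.out.println(config);
    }
}
```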
Kafka Streams architecture

Source --> Connect cluster --> Kafka cluster --> process it with KS

-- A Connect cluster is a set of worker nodes. Workers take a connector and a configuration; as a first step they pull data from the sources and push it into the Kafka cluster. Kafka Streams is then used to transform, aggregate and join that data.
Contenders to Kafka Streams include NiFi, Spark Streaming and Flink.
Running our first Kafka Streams app

We focus on using the raw Kafka binaries, as opposed to running Kafka via Docker Compose as in other courses.
- download the Kafka binaries
- start ZooKeeper and Kafka
- create the input and output topics using 'kafka-topics'
- publish data into the input topic
- run the WordCount example
- stream the output topic using 'kafka-console-consumer' (see the command sketch after this list)
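A sketch of these steps using the CLI tools shipped with Kafka, assuming the binaries are unpacked locally, ZooKeeper runs on localhost:2181 and a single broker on localhost:9092. The topic names are the ones the WordCountDemo class that ships with Kafka expects; run each long-lived command in its own terminal:

```bash
# 1. Start ZooKeeper, then Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# 2. Create the input and output topics (newer Kafka versions use
#    --bootstrap-server localhost:9092 instead of --zookeeper)
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic streams-plaintext-input --partitions 1 --replication-factor 1
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic streams-wordcount-output --partitions 1 --replication-factor 1

# 3. Publish some lines of text into the input topic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-plaintext-input

# 4. Run the WordCount demo that ships with Kafka
bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

# 5. Stream the output topic (counts are Long-serialized, hence the deserializer flags)
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-wordcount-output --from-beginning \
    --property print.key=true \
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
```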
Download and install Kafka

I installed Kafka on EC2. Download the source code for this course and open "course_intro_mac_linux.sh" in D:\Kafka-Streams-Udemy\code_v2\code\1-course-intro, then follow the instructions to start ZooKeeper and the Kafka server, create the topics, create a consumer and a producer, and run the WordCount program.
A stream is a fully ordered sequence of immutable records that can be replayed (think of a Kafka topic as the parallel). A stream processor is a node in the streams topology; it transforms records one at a time and may create a new stream from them. In KS we can use a high-level DSL to write programs.
KS application terminology
Source processor - a special processor that reads directly from a Kafka topic; it does not transform data.
Sink processor - sends stream data directly to a Kafka topic.
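To make the terminology concrete, here is a minimal WordCount sketch using the high-level DSL. The topic names, application id and bootstrap server are assumptions mirroring the classic WordCount example, not necessarily the course's exact code:

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Source processor: reads the input topic as a stream of text lines
        KStream<String, String> textLines = builder.stream("word-count-input");

        // Stream processors: normalize, split into words, re-key by word, count
        KTable<String, Long> wordCounts = textLines
                .mapValues(line -> line.toLowerCase())
                .flatMapValues(line -> Arrays.asList(line.split("\\W+")))
                .selectKey((key, word) -> word)
                .groupByKey()
                .count();

        // Sink processor: writes the counts back to a Kafka topic
        wordCounts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.start();

        // Close the Streams app cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```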
Starter project setup

We need the Kafka Streams client and slf4j/log4j logging libraries on the classpath; a Maven sketch follows.
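A minimal Maven dependency sketch, assuming a Maven-based starter project; the version numbers are assumptions, so pick the ones matching your Kafka/course version:

```xml
<dependencies>
    <!-- Kafka Streams client (pulls in kafka-clients transitively) -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>2.8.2</version>
    </dependency>
    <!-- slf4j API plus a log4j binding for logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.36</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.36</version>
    </dependency>
</dependencies>
```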