Apache Kafka Series - Kafka Streams for Data Processing
Resources for this course:
- https://docs.confluent.io/current/streams/kafka-streams-examples/docs/index.html
- https://www.tunnelbear.com/

Source code on GitHub: https://github.com/simplesteph/kafka-streams-course.git
Source code for different Kafka versions can be found here: https://courses.datacumulus.com/downloads/kafka-streams-sn2/
Java 8 lambda expressions: https://www.tutorialspoint.com/java8/java8_lambda_expressions.htm
- Kafka Streams == KS (abbreviation used in these notes)

KS is an easy-to-use data processing and transformation library that ships with Kafka, used for data transformation, data enrichment, fraud detection, monitoring and alerting. There is no need to create a separate cluster for KS; it is highly scalable, elastic and fault tolerant, and offers:

- one-record-at-a-time processing
- exactly-once processing semantics (see the config sketch below)
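A minimal sketch of how exactly-once semantics is switched on in a Streams app; the application id and bootstrap server here are placeholder assumptions, not values from the course:

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfig {
    public static void main(String[] args) {
        Properties config = new Properties();
        // Placeholder values; adjust for your environment
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Turns on exactly-once processing (available since Kafka 0.11);
        // on Kafka 3.0+ prefer StreamsConfig.EXACTLY_ONCE_V2
        config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        System.out.println(config);
    }
}
```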
Kafka Streams architecture

Source --> Connect cluster --> Kafka cluster --> process it with KS

-- A Connect cluster is a set of worker nodes. Workers take a connector and a configuration; as a first step they pull data from the sources and push it into the Kafka cluster. Kafka Streams is then used to transform, aggregate and join that data.
Contenders to Kafka Streams include NiFi, Spark Streaming and Flink.
Running our first Kafka Streams app

We focus on using the raw Kafka binaries, as opposed to running Kafka via Docker Compose as in other courses.
- download the Kafka binaries
- start ZooKeeper and Kafka
- create the input and output topics using 'kafka-topics'
- publish data into the input topic
- run the WordCount example
- stream the output topic using 'kafka-console-consumer' (see the command sketch after this list)
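A sketch of these steps using the CLI tools shipped with Kafka, assuming the binaries are unpacked locally, ZooKeeper runs on localhost:2181 and a single broker on localhost:9092. The topic names are the ones the WordCountDemo class that ships with Kafka expects; run each long-lived command in its own terminal:

```bash
# 1. Start ZooKeeper, then Kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# 2. Create the input and output topics (newer Kafka versions use
#    --bootstrap-server localhost:9092 instead of --zookeeper)
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic streams-plaintext-input --partitions 1 --replication-factor 1
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic streams-wordcount-output --partitions 1 --replication-factor 1

# 3. Publish some lines of text into the input topic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-plaintext-input

# 4. Run the WordCount demo that ships with Kafka
bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

# 5. Stream the output topic (counts are Long-serialized, hence the deserializer flags)
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic streams-wordcount-output --from-beginning \
    --property print.key=true \
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
```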
Download and install Kafka

I installed Kafka on EC2. Download the source code for this course and open "course_intro_mac_linux.sh" in D:\Kafka-Streams-Udemy\code_v2\code\1-course-intro, then follow the instructions to start ZooKeeper and the Kafka server, create the topics, create a consumer and a producer, and run the WordCount program.
A stream is a fully ordered sequence of immutable records that can be replayed (think of a Kafka topic as the parallel). A stream processor is a node in the streams topology; it transforms records one at a time and may create a new stream from them. In KS we can use a high-level DSL to write programs.
KS application terminology
Source processor - a special processor that reads directly from a Kafka topic; it does not transform data.
Sink processor - sends stream data directly to a Kafka topic.
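To make the terminology concrete, here is a minimal WordCount sketch using the high-level DSL. The topic names, application id and bootstrap server are assumptions mirroring the classic WordCount example, not necessarily the course's exact code:

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties config = new Properties();
        config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application");
        config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Source processor: reads the input topic as a stream of text lines
        KStream<String, String> textLines = builder.stream("word-count-input");

        // Stream processors: normalize, split into words, re-key by word, count
        KTable<String, Long> wordCounts = textLines
                .mapValues(line -> line.toLowerCase())
                .flatMapValues(line -> Arrays.asList(line.split("\\W+")))
                .selectKey((key, word) -> word)
                .groupByKey()
                .count();

        // Sink processor: writes the counts back to a Kafka topic
        wordCounts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), config);
        streams.start();

        // Close the Streams app cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```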
Starter project setup

We need the Kafka Streams client and slf4j/log4j logging libraries on the classpath; a Maven sketch follows.
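A minimal Maven dependency sketch, assuming a Maven-based starter project; the version numbers are assumptions, so pick the ones matching your Kafka/course version:

```xml
<dependencies>
    <!-- Kafka Streams client (pulls in kafka-clients transitively) -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>2.8.2</version>
    </dependency>
    <!-- slf4j API plus a log4j binding for logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.36</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.36</version>
    </dependency>
</dependencies>
```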