Apache Kafka - sgml/signature GitHub Wiki

Overengineering

As a… I want… So that… Pain With Kafka/CDC Benefit With GitHub Actions + Batch
Data engineer to replace always-on Kafka clusters with scheduled batch jobs I reduce infrastructure cost and eliminate 24/7 compute Kafka brokers run continuously even when idle GitHub Actions only runs when triggered, costing near-zero when idle
Platform engineer to remove CDC connectors and schema registries I reduce maintenance burden and operational noise CDC connectors break on schema drift and require constant babysitting Batch jobs read data directly on schedule with no streaming dependencies
Developer to simplify data movement between services I avoid learning Kafka internals Kafka requires topics, partitions, consumer groups, offsets Batch jobs are simple scripts triggered on cron
SRE to eliminate high-severity incidents caused by streaming lag I reduce pager fatigue Kafka lag spikes cause alerts and require tuning Batch jobs have predictable runtime and no lag concept
Product owner to reduce cloud spend I can reallocate budget to features Kafka clusters, Connect, and monitoring tools are expensive GitHub Actions minutes are cheap and predictable
Analytics engineer to run hourly or daily ETL without streaming infra I avoid overengineering for low-frequency workloads CDC is overkill for once-per-hour data Batch jobs match natural cadence of analytics workloads
Security engineer to reduce attack surface I simplify compliance and audits Kafka requires brokers, Zookeeper/Kraft, ACLs, network rules GitHub Actions has minimal infra and built-in security controls
Engineering manager to reduce onboarding time new hires can contribute faster Kafka requires specialized knowledge Batch jobs use familiar tools (Python, SQL, shell)
Architect to remove unnecessary distributed systems I keep the system maintainable long-term Kafka introduces operational complexity not justified by workload Batch jobs are easy to reason about and evolve
Finance stakeholder to eliminate unpredictable streaming costs I get stable, forecastable billing Kafka cost scales with throughput and retention GitHub Actions cost scales linearly with runs

Examples

Data Contracts

Schema Registry

CLI

Deduplication

Bindings

Glossary

Flink

References

JSONPointer Integration

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.15.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>1.15.0</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.13.0</version>
</dependency>
<dependency>
    <groupId>com.github.fge</groupId>
    <artifactId>jackson-coreutils</artifactId>
    <version>1.9</version>
</dependency>
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import com.fasterxml.jackson.core.JsonPointer;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonPathFinder {
    public static void main(String[] args) throws Exception {
        // Set up the streaming execution environment
        var env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Create a sample JSON input stream
        var jsonInputStream = env.fromElements(
                """
                {"foo": "bar"}
                """,
                """
                {"baz": "qux"}
                """
        );

        // Define a FlatMapFunction to find the JSON Pointer path "foo"
        var results = jsonInputStream.flatMap((FlatMapFunction<String, Tuple2<String, Boolean>>) (value, out) -> {
            var mapper = new ObjectMapper();
            var rootNode = mapper.readTree(value);
            var pointer = JsonPointer.compile("/foo");

            var pathExists = !rootNode.at(pointer).isMissingNode();
            out.collect(new Tuple2<>(value, pathExists));
        });

        // Print the results
        results.print();

        // Execute the Flink job
        env.execute("Json Path Finder");
    }
}

Migration

Kafka/Flink for SpringMVC WebFlux Developers

Similar Methods in Apache Kafka

  • Flux.fromIterable() - Similar to Kafka's KafkaConsumer.poll() which retrieves records from a Kafka topic.
  • Flux.subscribe() - Similar to Kafka's KafkaConsumer.subscribe() which subscribes the consumer to one or more topics.
  • Flux.map() - Similar to Kafka's KafkaStreams.map() which transforms records in a stream.
  • Flux.flatMap() - Similar to Kafka's KafkaStreams.flatMap() which transforms records into multiple records.
  • Flux.delayElements() - Similar to Kafka's KafkaProducer.send() which can be used with a delay.

Similar Methods in Apache Flink

  • Flux.fromIterable() - Similar to Flink's DataStream.fromCollection() which creates a DataStream from a collection.
  • Flux.subscribe() - Similar to Flink's DataStream.addSource() which adds a source to the DataStream.
  • Flux.map() - Similar to Flink's DataStream.map() which applies a function to each element in the stream.
  • Flux.flatMap() - Similar to Flink's DataStream.flatMap() which transforms each element into zero or more elements.
  • Flux.delayElements() - Similar to Flink's DataStream.timeWindow() which introduces a delay or windowing in the stream.

Kafka Protocol

Flink Protocol

Kinesis Protocol

Storm Protocol

Integration

Use Cases

Concepts

Internals

Refactoring

Security

Videos

VS

One of Kafka's core features is the partitioning of data by means of a partition key, which can be used to select data for which the order must be maintained and data which can be processed in parallel.

A Kafka cluster consists of brokers that coordinate the writing (and reading) of data to permanent storage. With Kafka, every message is stored. Communicating via permanent storage decouples the send and receive operations from each other

The key benefits of Kafka are its scalability, its ordering guarantees, its wide-scale adoption, and wealth of commercial service offerings.

All message types are brokered. This means that messages can be delivered even if the recipient was disconnected for a moment.

The communication is also decoupled in terms of time so that direct feedback from the recipient to the sender of a message is no longer possible.

References

Kafka Protocol

Kafka Security Vulnerabilities

Kafka Presentations

Cluster Linking Failover

Confluent Client Reference

Data Contracts

Schema Management

Data Contracts

Stream Governance

Consumer Configs

Confluent Security Advisories

⚠️ **GitHub.com Fallback** ⚠️