# Apache Kafka
| As a… | I want… | So that… | Pain With Kafka/CDC | Benefit With GitHub Actions + Batch |
|---|---|---|---|---|
| Data engineer | to replace always-on Kafka clusters with scheduled batch jobs | I reduce infrastructure cost and eliminate 24/7 compute | Kafka brokers run continuously even when idle | GitHub Actions only runs when triggered, costing near-zero when idle |
| Platform engineer | to remove CDC connectors and schema registries | I reduce maintenance burden and operational noise | CDC connectors break on schema drift and require constant babysitting | Batch jobs read data directly on schedule with no streaming dependencies |
| Developer | to simplify data movement between services | I avoid learning Kafka internals | Kafka requires topics, partitions, consumer groups, offsets | Batch jobs are simple scripts triggered on cron |
| SRE | to eliminate high-severity incidents caused by streaming lag | I reduce pager fatigue | Kafka lag spikes cause alerts and require tuning | Batch jobs have predictable runtime and no lag concept |
| Product owner | to reduce cloud spend | I can reallocate budget to features | Kafka clusters, Connect, and monitoring tools are expensive | GitHub Actions minutes are cheap and predictable |
| Analytics engineer | to run hourly or daily ETL without streaming infra | I avoid overengineering for low-frequency workloads | CDC is overkill for once-per-hour data | Batch jobs match natural cadence of analytics workloads |
| Security engineer | to reduce attack surface | I simplify compliance and audits | Kafka requires brokers, Zookeeper/Kraft, ACLs, network rules | GitHub Actions has minimal infra and built-in security controls |
| Engineering manager | to reduce onboarding time | new hires can contribute faster | Kafka requires specialized knowledge | Batch jobs use familiar tools (Python, SQL, shell) |
| Architect | to remove unnecessary distributed systems | I keep the system maintainable long-term | Kafka introduces operational complexity not justified by workload | Batch jobs are easy to reason about and evolve |
| Finance stakeholder | to eliminate unpredictable streaming costs | I get stable, forecastable billing | Kafka cost scales with throughput and retention | GitHub Actions cost scales linearly with runs |
- https://github.com/scholzj/kafka-test-apps/blob/main/kafka-producer.yaml
- https://www.stardog.com/labs/blog/stream-reasoning-with-stardog/
- https://towardsdatascience.com/kafka-python-explained-in-10-lines-of-code-800e3e07dad1
- https://github.com/confluentinc/librdkafka/tree/master/examples
- https://medium.com/@ali.mrd318/simplifying-kafka-testing-in-python-a-mockafka-py-tutorial-3a0dbbfe9866
- https://docs.confluent.io/platform/current/schema-registry/fundamentals/data-contracts.html
- https://www.confluent.io/blog/error-handling-patterns-in-kafka/
- https://docs.confluent.io/platform/current/schema-registry/connect.html
- https://developer.confluent.io/courses/schema-registry/evolve-schemas-hands-on/
- https://developer.confluent.io/learn-more/kafka-on-the-go/schemas/
- https://developer.confluent.io/courses/schema-registry/key-concepts/
- https://developer.confluent.io/courses/schema-registry/schema-subjects/
- https://docs.confluent.io/platform/current/schema-registry/index.html
- https://docs.confluent.io/platform/current/schema-registry/fundamentals/index.html
- https://docs.confluent.io/platform/current/schema-registry/fundamentals/schema-evolution.html
- https://docs.confluent.io/platform/current/schema-registry/develop/api.html
- https://docs.confluent.io/platform/current/schema-registry/installation/migrate.html
- https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/index.html
- https://docs.confluent.io/operator/current/co-manage-schemas.html
- https://docs.confluent.io/operator/2.2/co-manage-schemas.html
- https://www.confluent.io/blog/best-practices-for-confluent-schema-registry/
- https://docs.confluent.io/platform/current/schema-registry/develop/using.html
- https://www.confluent.io/blog/how-schema-registry-clients-work/
- https://developer.confluent.io/patterns/event/schema-on-read/
- https://www.confluent.io/blog/schema-registry-for-beginners/
- https://www.confluent.io/blog/using-apache-kafka-command-line-tools-confluent-cloud/
- https://greenplum.docs.pivotal.io/streaming-server/1-3-6/kafka/load-from-kafka-example.html
- https://play.vidyard.com/e869cfd0-76d8-4859-a90f-2471c52a7e22
- https://www.slideshare.net/slideshow/stream-data-deduplication-powered-by-kafka-streams-philipp-schirmer-bakdata/249203406
- https://www.slideshare.net/slideshow/embed_code/key/ayg8N2YEG0jw4A
- https://docs.confluent.io/cloud/current/flink/reference/functions/datetime-functions.html
- https://docs.confluent.io/cloud/current/flink/reference/timezone.html
- https://cwiki.apache.org/confluence/display/Flink/FLIP-188%3A+Introduce+Built-in+Dynamic+Table+Storage#FLIP188:IntroduceBuiltinDynamicTableStorage-Retention
- https://www.alibabacloud.com/blog/introduction-to-unified-batch-and-stream-processing-of-apache-flink_601407
```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-java</artifactId>
  <version>1.15.0</version>
</dependency>
<dependency>
  <!-- As of Flink 1.15 the streaming artifact no longer carries a Scala suffix -->
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java</artifactId>
  <version>1.15.0</version>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-core</artifactId>
  <version>2.13.0</version>
</dependency>
<dependency>
  <!-- jackson-databind provides the ObjectMapper and JsonNode used in the code below -->
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.13.0</version>
</dependency>
```
```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import com.fasterxml.jackson.core.JsonPointer;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonPathFinder {
    public static void main(String[] args) throws Exception {
        // Set up the streaming execution environment
        var env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Create a sample JSON input stream
        var jsonInputStream = env.fromElements(
                """
                {"foo": "bar"}
                """,
                """
                {"baz": "qux"}
                """
        );

        // For each record, test whether the JSON Pointer path "/foo" exists.
        // The explicit .returns(...) hint is required because type erasure
        // prevents Flink from inferring the Tuple2 output type of a lambda.
        DataStream<Tuple2<String, Boolean>> results = jsonInputStream
                .flatMap((FlatMapFunction<String, Tuple2<String, Boolean>>) (value, out) -> {
                    var mapper = new ObjectMapper();
                    var rootNode = mapper.readTree(value);
                    var pointer = JsonPointer.compile("/foo");
                    var pathExists = !rootNode.at(pointer).isMissingNode();
                    out.collect(new Tuple2<>(value, pathExists));
                })
                .returns(Types.TUPLE(Types.STRING, Types.BOOLEAN));

        // Print the results and execute the Flink job
        results.print();
        env.execute("Json Path Finder");
    }
}
```
- Flux.fromIterable() - Similar to Kafka's KafkaConsumer.poll(), which retrieves records from a Kafka topic.
- Flux.subscribe() - Similar to Kafka's KafkaConsumer.subscribe(), which subscribes the consumer to one or more topics (though in Reactor it is subscribing that actually starts the flow of data).
- Flux.map() - Similar to Kafka Streams' KStream.map(), which transforms each record in a stream.
- Flux.flatMap() - Similar to Kafka Streams' KStream.flatMap(), which transforms each record into zero or more records.
- Flux.delayElements() - Has no direct Kafka equivalent; the closest analogy is producer-side batching via linger.ms, which delays sends by a configured interval.
- Flux.fromIterable() - Similar to Flink's StreamExecutionEnvironment.fromCollection(), which creates a DataStream from a collection.
- Flux.subscribe() - Similar to attaching a sink with DataStream.addSink() and calling env.execute(), which starts consumption of the stream.
- Flux.map() - Similar to Flink's DataStream.map(), which applies a function to each element in the stream.
- Flux.flatMap() - Similar to Flink's DataStream.flatMap(), which transforms each element into zero or more elements.
- Flux.delayElements() - Loosely comparable to time-based windowing in Flink (DataStream.window() with a time-based assigner; the older timeWindow() shortcut has been removed), in that both shift elements along the time axis. A short Reactor sketch after these lists illustrates the Flux side.
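A minimal Reactor sketch of the operators named above (a hedged illustration: it assumes only the io.projectreactor reactor-core dependency, and the sample inputs and sleep duration are arbitrary):

```java
import java.time.Duration;
import java.util.List;
import reactor.core.publisher.Flux;

public class FluxOperatorsDemo {
    public static void main(String[] args) throws InterruptedException {
        Flux.fromIterable(List.of("{\"foo\": \"bar\"}", "{\"baz\": \"qux\"}")) // cf. poll() / fromCollection()
            .map(String::trim)                                       // cf. KStream.map() / DataStream.map()
            .flatMap(s -> Flux.just(s, String.valueOf(s.length()))) // cf. KStream.flatMap() / DataStream.flatMap()
            .delayElements(Duration.ofMillis(100))                   // shifts emissions along the time axis
            .subscribe(System.out::println);                         // subscribing starts the flow

        // delayElements moves work onto a parallel scheduler, so keep the
        // main thread alive long enough for the demo to print.
        Thread.sleep(1000);
    }
}
```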
- https://www.confluent.io/blog/how-to-share-kafka-connectors-on-confluent-hub/
- https://docs.confluent.io/kafka-connectors/github/current/configuration_options.html
- https://docs.confluent.io/kafka-connectors/aws-lambda/current/lambda_sink_connector_config.html
- https://medium.com/geekculture/heroku-integration-capabilities-the-mini-guide-b8ce745faad1
- https://www.confluent.io/hub/castorm/kafka-connect-http
- https://docs.confluent.io/kafka-connect-aws-cloudwatch-logs/current/overview.html
- https://docs.confluent.io/kafka-connect-sftp/current/source-connector/csv_source_connector.html
- https://rmoff.net/2021/01/11/running-a-self-managed-kafka-connect-worker-for-confluent-cloud/
- https://developer.salesforce.com/blogs/2016/05/streaming-salesforce-events-heroku-kafka
- https://dzone.com/articles/kafka-for-xml-message-integration-and-processing
- https://mozilla-version-control-tools.readthedocs.io/en/latest/hgmo/replication.html
- http://www.liferaysavvy.com/2021/07/liferay-tomcat-access-logs-to-kafka.html
- https://www.oreilly.com/library/view/mastering-kafka-streams/9781492062486/ch01.html
- https://www.confluent.io/kafka-summit-sf18/kafka-as-an-eventing-system-to-replatform-a-monolith-into-microservices/
- https://towardsdatascience.com/getting-started-with-apache-kafka-in-python-604b3250aa05
- https://blog.bosch-si.com/developer/eclipse-hono-supporting-apache-kafka-for-messaging/
- https://github.com/eclipse/hono/issues/8
- https://www.confluent.io/de-de/blog/enabling-exactly-once-kafka-streams/
- https://dev.to/heroku/what-is-a-commit-log-and-why-should-you-care-pib
- https://preparingforcodinginterview.wordpress.com/2019/10/04/kafka-3-why-is-kafka-so-fast/
- https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
- https://www.oreilly.com/library/view/streaming-architecture/9781491953914/ch04.html
- https://docs.datastax.com/en/kafka/doc/kafka/kafkaHowMessages.html
- https://kafka.apache.org/cve-list
- https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-tools-DumpLogSegments.html
- https://logging.apache.org/log4j/2.x/log4j-users-guide.pdf
- http://events17.linuxfoundation.org/sites/events/files/slides/developing.realtime.data_.pipelines.with_.apache.kafka_.pdf
- https://www.moengage.com/blog/kafka-at-moengage/
- https://www.confluent.io/es-es/blog/kafka-without-zookeeper-a-sneak-peek/
- https://www.confluent.io/blog/apache-flink-apache-kafka-streams-comparison-guideline-users/
- https://stackoverflow.com/questions/60625612/how-does-one-use-kafka-with-openid-connect
- https://developer.ibm.com/tutorials/kafka-authn-authz/
One of Kafka's core features is partitioning data by a partition key: records that share a key are written to the same partition, where their order is preserved, while records in different partitions can be processed in parallel.
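A short sketch of keyed producing (assumptions: the org.apache.kafka:kafka-clients library, a local broker at localhost:9092, and the hypothetical topic "orders" and key "customer-42"):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerDemo {
    public static void main(String[] args) {
        var props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (var producer = new KafkaProducer<String, String>(props)) {
            // Records sharing the key "customer-42" hash to the same partition,
            // so their relative order is preserved; records with other keys
            // can be handled in parallel on other partitions.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-paid"));
        }
    }
}
```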
A Kafka cluster consists of brokers that coordinate the writing (and reading) of data to durable storage. With Kafka, every message is stored, and communicating via durable storage decouples the send and receive operations from each other.
The key benefits of Kafka are its scalability, its ordering guarantees, its wide-scale adoption, and the wealth of commercial service offerings around it.
All message types are brokered, which means messages can still be delivered even if the recipient was temporarily disconnected.
Because the communication is also decoupled in time, direct feedback from the recipient to the sender of a message is no longer possible.
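A sketch of that time decoupling from the consumer side (same assumptions as above, plus a hypothetical consumer group): because the broker persists messages and tracks the group's committed offset, a consumer that reconnects after an outage resumes where it left off rather than losing messages.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DecoupledConsumerDemo {
    public static void main(String[] args) {
        var props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "orders-reader");           // hypothetical consumer group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (var consumer = new KafkaConsumer<String, String>(props)) {
            consumer.subscribe(List.of("orders"));
            // The poll returns whatever accumulated on the broker while this
            // consumer was offline, starting from the group's committed offset.
            var records = consumer.poll(Duration.ofSeconds(5));
            records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```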