AWS Kinesis Data Streams

Overview

  • Ingest, durably store and process large amounts of streaming data in real time
  • Kinesis streaming data platform
    • Kinesis Video Streams
    • Kinesis Data Streams
    • Kinesis Data Firehose (delivers records from a data stream to S3, Redshift, Elasticsearch, or Splunk)
    • Kinesis Data Analytics
  • Use cases
    • Accelerated log and data feed intake and processing
    • Real-time metrics and reporting
    • Real-time data analytics
    • Complex stream processing

Terms

Data Record

Unit of data stored by Kinesis Data Streams. Each data record has a sequence number that is unique per partition-key within its shard.

Data Stream

A collection of data records, distributed across one or more shards.

Shard

  • A shard stores a subset of the data records in a data stream.
  • Total capacity and throughput of a data stream are proportional to the number of shards.
  • The number of shards associated with a data stream can be increased or decreased (resharding).
  • The shard is the unit of the charging model for data streams.

Producer

A producer publishes data records to a data stream for ingestion. For each record, the producer must specify the data stream name, a partition key, and the data blob. The partition key determines the shard in the stream to which the data record is added (see the sketch below).
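
A minimal producer sketch using boto3; the stream name, region, and payload below are placeholders, not from this page:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish(event: dict, partition_key: str) -> None:
    """Publish one record; Kinesis maps the partition key to a shard."""
    response = kinesis.put_record(
        StreamName="example-stream",      # hypothetical stream name
        Data=json.dumps(event).encode(),  # the data blob (bytes)
        PartitionKey=partition_key,       # drives shard selection
    )
    print(response["ShardId"], response["SequenceNumber"])

publish({"user": "u-42", "action": "click"}, partition_key="u-42")
```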

Consumer

A consumer reads and processes data records from a data stream, either by polling shards with GetRecords (classic) or by receiving pushed records through enhanced fan-out.

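A minimal sketch of a classic polling consumer with boto3 (stream name and region are placeholders; production consumers would typically use the Kinesis Client Library instead of raw GetRecords):

```python
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "example-stream"  # hypothetical stream name

# Read from the first shard only, for brevity.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start at the oldest available record
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = batch.get("NextShardIterator")
    time.sleep(0.2)  # stay within 5 GetRecords calls/sec/shard
```
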
Partition Key

A partition key is a Unicode string with a maximum length of 256 characters. An MD5 hash function maps each partition key to a 128-bit integer value, which determines the shard that the associated data record is assigned to.
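
A sketch of this mapping: MD5 hashes the key to a 128-bit integer, and the record lands on the shard whose hash key range contains that value. The two-shard ranges below are illustrative, not from this page:

```python
import hashlib

def hash_key(partition_key: str) -> int:
    """MD5-hash a partition key into a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

# Illustrative 2-shard stream: each shard owns half the 128-bit hash space.
MAX_HASH = 2**128 - 1
SHARDS = [
    ("shardId-000000000000", 0, MAX_HASH // 2),
    ("shardId-000000000001", MAX_HASH // 2 + 1, MAX_HASH),
]

def shard_for(partition_key: str) -> str:
    h = hash_key(partition_key)
    return next(name for name, lo, hi in SHARDS if lo <= h <= hi)

print(shard_for("user-42"))
```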

Kinesis Data Stream Quotas

  • Write (per shard) (the per-shard quotas below drive shard sizing; see the sketch after this list)
    • 1 MB/sec
    • 1,000 records/sec
  • Read (per shard) (Classic Consumer)
    • 2 MB/sec
    • 5 read transactions per sec
    • Each read transaction with GetRecords can return up to 10 MB / 10,000 records
    • If the first read transaction hits the 2 MB limit, subsequent read transactions within the same second fail
    • 5 read transactions/sec/shard implies polling each shard at most once every 200 ms
  • Read (per shard) (Enhanced Fan-Out Consumer)
    • Data is pushed by Kinesis to consumers
    • Average latency of ~70 ms for data push
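
A back-of-the-envelope shard-sizing calculation from the per-shard quotas above; the workload numbers are made up for illustration:

```python
import math

# Hypothetical workload figures.
write_mb_per_sec = 6.0    # aggregate producer throughput
records_per_sec = 4_500   # aggregate record rate
read_mb_per_sec = 10.0    # aggregate classic-consumer read throughput

shards_needed = max(
    math.ceil(write_mb_per_sec / 1.0),  # 1 MB/sec write per shard
    math.ceil(records_per_sec / 1000),  # 1,000 records/sec per shard
    math.ceil(read_mb_per_sec / 2.0),   # 2 MB/sec classic read per shard
)
print(shards_needed)  # -> 6
```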

Kinesis Data Firehose

Source

  • Kinesis Agent
  • AWS SDK
  • CloudWatch Logs
  • CloudWatch Events
  • Kinesis Data Stream

Process

  • Transform records - attach a Lambda function that KDF invokes for each buffered batch of source records (see the sketch after this list)
  • Convert record format - specify an AWS Glue Data Catalog table that describes the target schema
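
A minimal sketch of a transformation Lambda following the Firehose data-transformation contract (each input record is echoed back with its recordId, a result of Ok, Dropped, or ProcessingFailed, and base64-encoded data); the uppercasing logic is just a stand-in:

```python
import base64

def handler(event, context):
    """Firehose invokes this with a batch of records; each input record
    must be returned with the same recordId it arrived with."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # stand-in transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```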

Destination

  • Amazon S3 (see the configuration sketch after this list)
    • S3 Bucket
    • Prefix - optional prefix for Amazon S3 objects; KDF applies a default prefix of "YYYY/MM/dd/HH" in UTC
    • Error prefix - optional prefix for records that fail delivery
  • Amazon Redshift
    • Cluster details
    • Intermediate S3 bucket
    • Prefix - for Amazon S3 objects (optional)
    • COPY command and options
    • Retry duration - how long KDF retries if the COPY to the Amazon Redshift cluster fails
  • Amazon ES
    • ES Domain
    • Index
    • Index rotation
    • Retry duration
    • Destination VPC
    • Backup mode - back up only failed records or all records
    • Backup S3 bucket
    • Backup S3 bucket prefix (optional)
  • Splunk
  • HTTP endpoint
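
A sketch of creating a delivery stream with an S3 destination via boto3, tying together the bucket, prefix, error prefix, buffering, and compression settings described above; all ARNs and names are placeholders:

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="example-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-bucket",
        "Prefix": "events/",             # optional; default is "YYYY/MM/dd/HH" in UTC
        "ErrorOutputPrefix": "errors/",  # optional error prefix
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```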

Settings

  • Buffer size and interval for destination
  • Buffer size and interval for S3 backup
  • S3 Compression
  • S3 Encryption - KDF supports SSE-KMS for data delivered to S3
  • Error logging - if record transformation is enabled, failed Lambda invocations can be logged to CloudWatch Logs
  • IAM role

Security

Data protection

  • Server-side encryption with a Kinesis Data Stream as source - when the source of KDF is a Kinesis Data Stream, KDF does not store data at rest. The Kinesis Data Stream encrypts data published to it with SSE-KMS; KDF decrypts the data when reading from the stream and then publishes it to the destination (see the sketch below).
  • Server-side encryption with DIRECT PUT or other sources - server-side encryption in KDF is configurable and can be enabled or disabled.
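
A sketch of enabling SSE-KMS on the source data stream with boto3 (stream name and key alias are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.start_stream_encryption(
    StreamName="example-stream",  # hypothetical stream name
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",    # AWS-managed key; a customer CMK ARN also works
)
```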

Kinesis Data Analytics

  • Parallelism of an Apache Flink application = number of KPUs allocated * ParallelismPerKPU (see the scaling sketch below)
  • The limit on the number of KPUs allocated is 32
  • 1 KPU = 1 vCPU + 4 GB RAM (1 GB native + 3 GB heap) + 50 GB disk space
  • Auto scaling is enabled by default
  • Auto scaling results in application downtime
  • The application can also be scaled using API calls
  • Associated resources - CloudWatch Log Streams, IAM policies, IAM roles
  • Scaling does not apply to a Studio notebook unless it is deployed as an application
  • A Snapshot is the Kinesis Data Analytics implementation of an Apache Flink savepoint: a user- or service-triggered, managed backup of Apache Flink application state
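
A sketch of scaling via the kinesisanalyticsv2 UpdateApplication API, assuming the configuration field names below; the application name is a placeholder. Requesting parallelism 8 with ParallelismPerKPU 2 allocates 8 / 2 = 4 KPUs:

```python
import boto3

kda = boto3.client("kinesisanalyticsv2", region_name="us-east-1")

# UpdateApplication requires the current application version id.
detail = kda.describe_application(ApplicationName="example-flink-app")["ApplicationDetail"]

# Request parallelism 8 at 2 per KPU -> 4 KPUs allocated (limit: 32 KPUs).
kda.update_application(
    ApplicationName="example-flink-app",
    CurrentApplicationVersionId=detail["ApplicationVersionId"],
    ApplicationConfigurationUpdate={
        "FlinkApplicationConfigurationUpdate": {
            "ParallelismConfigurationUpdate": {
                "ConfigurationTypeUpdate": "CUSTOM",
                "ParallelismUpdate": 8,
                "ParallelismPerKPUUpdate": 2,
                "AutoScalingEnabledUpdate": False,
            }
        }
    },
)
```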