AWS Kinesis Data Streams
Overview
- Ingest, durably store and process large amounts of streaming data in real time
- Kinesis streaming data platform
- Kinesis Video Streams
- Kinesis Data Streams
- Kinesis Data Firehose (delivers records from a data stream to S3, Redshift, Elasticsearch, or Splunk)
- Kinesis Data Analytics
- Use cases
- Accelerated log and data feed intake and processing
- Real-time metrics and reporting
- Real-time data analytics
- Complex stream processing
Terms
Data Record
Unit of data stored by Kinesis Data Streams. Each data record has a sequence number that is unique per partition key within its shard.
Data Stream
A collection of data records
Shard
- A shard stores a subset of the data records in a data stream.
- The total capacity and throughput of a data stream are proportional to its number of shards.
- The number of shards associated with a data stream can be increased or decreased (resharding); see the sketch below.
- The shard is the unit of the charging model for data streams.
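Resharding is done with the UpdateShardCount API. A minimal boto3 sketch (the stream name is a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Scale an example stream to 4 shards; UNIFORM_SCALING is currently
# the only scaling type UpdateShardCount supports.
kinesis.update_shard_count(
    StreamName="example-stream",  # placeholder name
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```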
Producer
A producer publishes data records to a data stream for ingestion. A producer must specify the data stream name, a partition key, and the data blob. The partition key determines the shard in the stream to which the data record is added.
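A minimal producer sketch with boto3 (the stream name and payload are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Publish one record; the partition key selects the target shard.
response = kinesis.put_record(
    StreamName="example-stream",  # placeholder name
    Data=json.dumps({"event": "signup"}).encode("utf-8"),
    PartitionKey="user-1234",
)
print(response["ShardId"], response["SequenceNumber"])
```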
Consumer
A consumer reads and processes data records from a data stream, either by polling shards with GetRecords (classic) or by having records pushed to it via enhanced fan-out (see the quotas below).
Partition Key
A partition key is a Unicode string with a maximum length of 256 characters. An MD5 hash function maps each partition key to a 128-bit integer value, which in turn determines the shard to which the associated data records are mapped.
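The mapping can be illustrated in Python. Each shard owns a contiguous range of the 128-bit hash key space, and a record lands in the shard whose range contains the MD5 hash of its partition key (a sketch of the idea, not the service's implementation):

```python
import hashlib

def hash_key(partition_key: str) -> int:
    # MD5 digest of the UTF-8 partition key, read as a 128-bit integer
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

# With 2 shards splitting the key space evenly, the top bit of the
# hash decides the shard: 0 -> shard 0, 1 -> shard 1.
print(hash_key("user-1234") >> 127)
```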
Kinesis Data Stream Quotas
- Write (per shard)
  - 1 MB/sec
  - 1,000 records/sec
- Read (per shard, classic consumer)
  - 2 MB/sec
  - 5 read transactions/sec
  - Each GetRecords call can return up to 10 MB / 10,000 records
  - If reads hit the 2 MB/sec limit, subsequent read requests within the same second are throttled (ProvisionedThroughputExceededException)
  - 5 read transactions/sec/shard implies a minimum polling interval, and hence read latency, of about 200 ms (see the polling sketch after this list)
- Read (per shard, enhanced fan-out consumer)
  - Data is pushed by Kinesis to consumers over HTTP/2; each registered consumer gets a dedicated 2 MB/sec per shard
  - Average push latency of about 70 ms (see the subscription sketch after this list)
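A classic (polling) consumer sketch with boto3, reading one shard from the oldest available record (stream name and shard ID are placeholders):

```python
import time
import boto3

kinesis = boto3.client("kinesis")

# Start at the oldest record in the shard (TRIM_HORIZON).
iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",      # placeholder
    ShardId="shardId-000000000000",   # placeholder
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in batch["Records"]:
        print(record["PartitionKey"], record["Data"])
    iterator = batch.get("NextShardIterator")
    time.sleep(0.2)  # stay under 5 GetRecords calls/sec/shard
```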
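For enhanced fan-out, a consumer is registered against the stream once and then subscribes per shard; Kinesis pushes records over a long-lived HTTP/2 connection. A boto3 sketch (the ARN and IDs are placeholders; registration must reach ACTIVE status before subscribing):

```python
import boto3

kinesis = boto3.client("kinesis")

# One-time registration of an enhanced fan-out consumer.
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:111122223333:stream/example-stream",  # placeholder
    ConsumerName="example-consumer",
)["Consumer"]

# Subscribe to a shard; the response is an event stream that Kinesis
# pushes records into for up to 5 minutes, after which the consumer
# must call SubscribeToShard again.
response = kinesis.subscribe_to_shard(
    ConsumerARN=consumer["ConsumerARN"],
    ShardId="shardId-000000000000",  # placeholder
    StartingPosition={"Type": "LATEST"},
)
for event in response["EventStream"]:
    for record in event["SubscribeToShardEvent"]["Records"]:
        print(record["PartitionKey"], record["Data"])
```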
Kinesis Data Firehose
Source
- Amazon Kinesis Agent
- AWS SDK (Direct PUT; see the sketch after this list)
- CloudWatch Logs
- CloudWatch Events
- Kinesis Data Stream
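Writing to KDF directly from the SDK (Direct PUT) is a single call. A boto3 sketch with a placeholder delivery stream name:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Direct PUT of one record; KDF buffers it and delivers it to the
# configured destination. The trailing newline keeps records
# delimited in the delivered S3 objects.
firehose.put_record(
    DeliveryStreamName="example-delivery-stream",  # placeholder
    Record={"Data": json.dumps({"event": "signup"}).encode("utf-8") + b"\n"},
)
```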
Process
- Transform records - attach a Lambda function that is invoked with batches of records and returns each record transformed (see the sketch after this list)
- Convert record format - specify an AWS Glue Data Catalog table that describes the target schema
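The transformation Lambda receives batches of base64-encoded records and must return each record with its recordId, a result of Ok, Dropped, or ProcessingFailed, and the re-encoded data. A minimal Python sketch that upper-cases each payload (the transformation itself is a placeholder):

```python
import base64

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        transformed = payload.upper()  # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```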
Destination
- Amazon S3
- S3 Bucket
- Prefix - Optional prefix for Amazon S3 objects. KDF has default prefix of "YYYY/MM/dd/HH" in UTC.
- Error prefix - Optional
- Amazon Redshift
- Cluster details
- Intermediate S3 bucket
- Prefix - for Amazon S3 objects (optional)
- COPY command and options
- Retry duration - how long KDF retries if the data COPY into the Amazon Redshift cluster fails
- Amazon ES
- ES Domain
- Index
- Index rotation
- Retry duration
- Destination VPC
- Backup mode - back up only failed records or all records
- Backup S3 bucket
- Backup S3 bucket prefix (optional)
- Splunk
- HTTP Event Collector (HEC) endpoint
Settings
- Buffer size and interval for the destination (see the buffering sketch after this list)
- Buffer size and interval for S3 backup
- S3 Compression
- S3 Encryption - KDF supports SSE-KMS for data delivered to S3
- Error logging - if record transformation is enabled, failed Lambda invocations can be logged to CloudWatch Logs
- IAM role
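Buffer size and interval map to BufferingHints in the delivery stream configuration. A boto3 sketch for an S3 destination (the names and ARNs are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

# Deliver to S3 once 5 MB have buffered or 300 seconds have elapsed,
# whichever comes first, and GZIP-compress the delivered objects.
firehose.create_delivery_stream(
    DeliveryStreamName="example-delivery-stream",  # placeholder
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/example-firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::example-bucket",  # placeholder
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```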
Security
Data protection
- Server-side encryption with a Kinesis data stream as source - when the source of KDF is a Kinesis data stream, KDF does not store data at rest. The data stream encrypts any data published to it with SSE-KMS; KDF decrypts the data when it reads from the stream and then publishes it to the destination.
- Server-side encryption with Direct PUT or other sources - server-side encryption in KDF is configurable and can be enabled or disabled (see the sketch below).
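For Direct PUT sources, SSE can be toggled with the StartDeliveryStreamEncryption and StopDeliveryStreamEncryption APIs. A boto3 sketch (the stream name is a placeholder):

```python
import boto3

firehose = boto3.client("firehose")

# Enable SSE with an AWS-owned CMK; a customer-managed key can be
# supplied instead via KeyType="CUSTOMER_MANAGED_CMK" plus KeyARN.
firehose.start_delivery_stream_encryption(
    DeliveryStreamName="example-delivery-stream",  # placeholder
    DeliveryStreamEncryptionConfigurationInput={"KeyType": "AWS_OWNED_CMK"},
)

# Disable SSE again.
firehose.stop_delivery_stream_encryption(
    DeliveryStreamName="example-delivery-stream",
)
```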
Kinesis Data Analytics
- Parallelism of an Apache Flink application = number of KPUs allocated × ParallelismPerKPU
- The default limit on allocated KPUs is 32 (a soft limit that can be raised by request)
- 1 KPU = 1 vCPU + 4 GB memory (1 GB native + 3 GB heap) + 50 GB disk space
- Auto scaling is enabled by default
- Auto scaling results in application downtime
- An application can also be scaled explicitly via API calls (see the sketch after this list)
- Associated resources: CloudWatch log streams, IAM policies, IAM roles
- Scaling does not apply to a Studio notebook unless it is deployed as an application
- A snapshot is the Kinesis Data Analytics implementation of an Apache Flink savepoint: a user- or service-triggered, managed backup of the application's state
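The parallelism formula and the scaling API can be sketched with the kinesisanalyticsv2 client: requesting parallelism 16 with ParallelismPerKPU = 2 allocates 8 KPUs. The application name and version below are placeholders:

```python
import boto3

kda = boto3.client("kinesisanalyticsv2")

# Parallelism = KPUs allocated * ParallelismPerKPU, here 8 * 2 = 16.
kda.update_application(
    ApplicationName="example-flink-app",  # placeholder
    CurrentApplicationVersionId=1,        # must match the live version
    ApplicationConfigurationUpdate={
        "FlinkApplicationConfigurationUpdate": {
            "ParallelismConfigurationUpdate": {
                "ConfigurationTypeUpdate": "CUSTOM",
                "ParallelismUpdate": 16,
                "ParallelismPerKPUUpdate": 2,
                "AutoScalingEnabledUpdate": False,
            }
        }
    },
)
```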