AWS Kinesis Data Streams

Overview

  • Ingest, durably store and process large amounts of streaming data in real time
  • Kinesis streaming data platform
    • Kinesis Video Streams
    • Kinesis Data Streams
    • Kinesis Data Firehose (delivers records from a data stream to S3, Redshift, Elasticsearch, or Splunk)
    • Kinesis Data Analytics
  • Use cases
    • Accelerated log and data feed intake and processing
    • Real-time metrics and reporting
    • Real-time data analytics
    • Complex stream processing

Terms

Data Record

Unit of data stored by Kinesis Data Streams. Each data record has a sequence number that is unique per partition-key within its shard.

Data Stream

A collection of data records, distributed across one or more shards.

Shard

  • A shard stores a subset of the data records in a data stream.
  • Total capacity and throughput of a data stream are proportional to the number of shards.
  • The number of shards associated with a data stream can be increased or decreased (resharding).
  • The shard is the unit of the charging model for data streams.

Producer

A producer publishes data records to a data stream for ingestion. For each record, the producer must specify the data stream name, a partition key, and the data blob. The partition key determines the shard in the stream to which the data record is added (see the sketch below).
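
A minimal producer sketch using boto3; the stream name, region, and payload below are placeholders, not from this page:

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish(event: dict, partition_key: str) -> None:
    """Publish one record; Kinesis maps the partition key to a shard."""
    response = kinesis.put_record(
        StreamName="example-stream",      # hypothetical stream name
        Data=json.dumps(event).encode(),  # the data blob (bytes)
        PartitionKey=partition_key,       # drives shard selection
    )
    print(response["ShardId"], response["SequenceNumber"])

publish({"user": "u-42", "action": "click"}, partition_key="u-42")
```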

Consumer

A consumer reads and processes data records from a data stream, either by polling shards with GetRecords (classic) or by receiving pushed records through enhanced fan-out.

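A minimal sketch of a classic polling consumer with boto3 (stream name and region are placeholders; production consumers would typically use the Kinesis Client Library instead of raw GetRecords):

```python
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "example-stream"  # hypothetical stream name

# Read from the first shard only, for brevity.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start at the oldest available record
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = batch.get("NextShardIterator")
    time.sleep(0.2)  # stay within 5 GetRecords calls/sec/shard
```
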
Partition Key

A partition key is a Unicode string with a maximum length of 256 characters. An MD5 hash function maps each partition key to a 128-bit integer value, which determines the shard that the associated data record is assigned to.
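
A sketch of this mapping: MD5 hashes the key to a 128-bit integer, and the record lands on the shard whose hash key range contains that value. The two-shard ranges below are illustrative, not from this page:

```python
import hashlib

def hash_key(partition_key: str) -> int:
    """MD5-hash a partition key into a 128-bit integer."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

# Illustrative 2-shard stream: each shard owns half the 128-bit hash space.
MAX_HASH = 2**128 - 1
SHARDS = [
    ("shardId-000000000000", 0, MAX_HASH // 2),
    ("shardId-000000000001", MAX_HASH // 2 + 1, MAX_HASH),
]

def shard_for(partition_key: str) -> str:
    h = hash_key(partition_key)
    return next(name for name, lo, hi in SHARDS if lo <= h <= hi)

print(shard_for("user-42"))
```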

Kinesis Data Stream Quotas

  • Write (per shard) (the per-shard quotas below drive shard sizing; see the sketch after this list)
    • 1 MB/sec
    • 1,000 records/sec
  • Read (per shard) (Classic Consumer)
    • 2 MB/sec
    • 5 read transactions per sec
    • Each read transaction with GetRecords can return up to 10 MB / 10,000 records
    • If the first read transaction hits the 2 MB limit, subsequent read transactions within the same second fail
    • 5 read transactions/sec/shard implies polling each shard at most once every 200 ms
  • Read (per shard) (Enhanced Fan-Out Consumer)
    • Data is pushed by Kinesis to consumers
    • Average latency of ~70 ms for data push
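
A back-of-the-envelope shard-sizing calculation from the per-shard quotas above; the workload numbers are made up for illustration:

```python
import math

# Hypothetical workload figures.
write_mb_per_sec = 6.0    # aggregate producer throughput
records_per_sec = 4_500   # aggregate record rate
read_mb_per_sec = 10.0    # aggregate classic-consumer read throughput

shards_needed = max(
    math.ceil(write_mb_per_sec / 1.0),  # 1 MB/sec write per shard
    math.ceil(records_per_sec / 1000),  # 1,000 records/sec per shard
    math.ceil(read_mb_per_sec / 2.0),   # 2 MB/sec classic read per shard
)
print(shards_needed)  # -> 6
```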

Kinesis Data Firehose

Source

  • Kinesis Agent
  • AWS SDK
  • CloudWatch Logs
  • CloudWatch Events
  • Kinesis Data Stream

Process

  • Transform records - attach a Lambda function that KDF invokes for each buffered batch of source records (see the sketch after this list)
  • Convert record format - specify an AWS Glue Data Catalog table that describes the target schema
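
A minimal sketch of a transformation Lambda following the Firehose data-transformation contract (each input record is echoed back with its recordId, a result of Ok, Dropped, or ProcessingFailed, and base64-encoded data); the uppercasing logic is just a stand-in:

```python
import base64

def handler(event, context):
    """Firehose invokes this with a batch of records; each input record
    must be returned with the same recordId it arrived with."""
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # stand-in transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # Ok | Dropped | ProcessingFailed
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```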

Destination

  • Amazon S3 (see the configuration sketch after this list)
    • S3 Bucket
    • Prefix - optional prefix for Amazon S3 objects; KDF applies a default prefix of "YYYY/MM/dd/HH" in UTC
    • Error prefix - optional prefix for records that fail delivery
  • Amazon Redshift
    • Cluster details
    • Intermediate S3 bucket
    • Prefix - for Amazon S3 objects (optional)
    • COPY command and options
    • Retry duration - how long KDF retries if the COPY to the Amazon Redshift cluster fails
  • Amazon ES
    • ES Domain
    • Index
    • Index rotation
    • Retry duration
    • Destination VPC
    • Backup mode - back up only failed records or all records
    • Backup S3 bucket
    • Backup S3 bucket prefix (optional)
  • Splunk
  • HTTP endpoint
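
A sketch of creating a delivery stream with an S3 destination via boto3, tying together the bucket, prefix, error prefix, buffering, and compression settings described above; all ARNs and names are placeholders:

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="example-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-bucket",
        "Prefix": "events/",             # optional; default is "YYYY/MM/dd/HH" in UTC
        "ErrorOutputPrefix": "errors/",  # optional error prefix
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    },
)
```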

Settings

  • Buffer size and interval for destination
  • Buffer size and interval for S3 backup
  • S3 Compression
  • S3 Encryption - KDF supports SSE-KMS for data delivered to S3
  • Error logging - if record transformation is enabled, failed Lambda invocations can be logged to CloudWatch Logs
  • IAM role

Security

Data protection

  • Server-side encryption with a Kinesis Data Stream as source - when the source of KDF is a Kinesis Data Stream, KDF does not store data at rest. The Kinesis Data Stream encrypts data published to it with SSE-KMS; KDF decrypts the data when reading from the stream and then publishes it to the destination (see the sketch below).
  • Server-side encryption with DIRECT PUT or other sources - server-side encryption in KDF is configurable and can be enabled or disabled.
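
A sketch of enabling SSE-KMS on the source data stream with boto3 (stream name and key alias are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.start_stream_encryption(
    StreamName="example-stream",  # hypothetical stream name
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",    # AWS-managed key; a customer CMK ARN also works
)
```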

Kinesis Data Analytics

  • Parallelism of an Apache Flink application = number of KPUs allocated * ParallelismPerKPU (see the scaling sketch below)
  • The limit on the number of KPUs allocated is 32
  • 1 KPU = 1 vCPU + 4 GB RAM (1 GB native + 3 GB heap) + 50 GB disk space
  • Auto scaling is enabled by default
  • Auto scaling results in application downtime
  • The application can also be scaled using API calls
  • Associated resources - CloudWatch Log Streams, IAM policies, IAM roles
  • Scaling does not apply to a Studio notebook unless it is deployed as an application
  • A Snapshot is the Kinesis Data Analytics implementation of an Apache Flink savepoint: a user- or service-triggered, managed backup of Apache Flink application state
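
A sketch of scaling via the kinesisanalyticsv2 UpdateApplication API, assuming the configuration field names below; the application name is a placeholder. Requesting parallelism 8 with ParallelismPerKPU 2 allocates 8 / 2 = 4 KPUs:

```python
import boto3

kda = boto3.client("kinesisanalyticsv2", region_name="us-east-1")

# UpdateApplication requires the current application version id.
detail = kda.describe_application(ApplicationName="example-flink-app")["ApplicationDetail"]

# Request parallelism 8 at 2 per KPU -> 4 KPUs allocated (limit: 32 KPUs).
kda.update_application(
    ApplicationName="example-flink-app",
    CurrentApplicationVersionId=detail["ApplicationVersionId"],
    ApplicationConfigurationUpdate={
        "FlinkApplicationConfigurationUpdate": {
            "ParallelismConfigurationUpdate": {
                "ConfigurationTypeUpdate": "CUSTOM",
                "ParallelismUpdate": 8,
                "ParallelismPerKPUUpdate": 2,
                "AutoScalingEnabledUpdate": False,
            }
        }
    },
)
```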