Amazon kinesis - vedratna/aws-learning GitHub Wiki

  • Amazon Kinesis Data Streams has simple pay-as-you-go pricing, with no upfront costs or minimum fees; you pay only for the resources you consume. An Amazon Kinesis stream is made up of one or more shards. Each shard gives you a capacity of five read transactions per second, up to a maximum total of 2 MB of data read per second. Each shard can support up to 1,000 write transactions per second, up to a maximum total of 1 MB of data written per second.
  • The data capacity of your stream is a function of the number of shards that you specify for the stream. The total capacity of the stream is the sum of the capacity of each shard. There are two components to pricing:
    • Primary pricing includes an hourly charge per shard and a charge for each one million PUT transactions.
    • Pricing for optional components for extended retention and enhanced fan-out.
  • Amazon Kinesis Data Streams has the following anti-patterns:
    • Small scale consistent throughput – Even though Kinesis Data Streams works for streaming data at 200 KB per second or less, it is designed and optimized for larger data throughputs.
    • Long-term data storage and analytics – Kinesis Data Streams is not suited for long-term data storage. By default, data is retained for 24 hours, and you can extend the retention period by up to 365 days.
  • Data is automatically and synchronously replicated across three Availability Zones.
  • A classic (shared-throughput) consumer has a default limit of 2 MB/sec and 5 API calls per second per shard, shared across all consumers. With enhanced fan-out, each registered consumer gets its own 2 MB/sec of read throughput per shard, and because data is pushed to the consumer, no polling API calls are needed.
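Since total stream capacity is just the per-shard limits multiplied by the shard count, the math above can be sketched as a small helper (my own illustration, not an AWS API):

```python
# Stream capacity as a function of shard count, using the per-shard
# limits quoted above: 1 MB/sec and 1,000 records/sec write, 2 MB/sec
# and 5 read transactions/sec read (shared across classic consumers).
def stream_capacity(shards: int) -> dict:
    return {
        "write_mb_per_sec": shards * 1,
        "write_records_per_sec": shards * 1000,
        "read_mb_per_sec": shards * 2,
        "read_tx_per_sec": shards * 5,
    }

print(stream_capacity(4))
```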

Kinesis Producers

There are three ways to produce data to Kinesis: the AWS SDK (Kinesis API), the Kinesis Producer Library (KPL), and the Kinesis Agent.

Kinesis SDK

The SDK exposes the PutRecord API to produce a single record and PutRecords to produce multiple records in a single API call. Kinesis throws a ProvisionedThroughputExceededException when the API or data limits are exceeded; in this case, error handling and a retry mechanism must be implemented by the SDK user. Likewise, for the PutRecords API it is the SDK user's responsibility to implement a batching mechanism that collects records and sends them in a single call. A few managed services use the PutRecord(s) APIs internally when integrated with Kinesis: CloudWatch Logs, AWS IoT, and Kinesis Data Analytics.
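A minimal sketch of the retry logic the SDK user must add around a throttled call. Here `put_records` is a hypothetical callable standing in for the real SDK call (for example boto3's `kinesis.put_records`); only the backoff wrapper is illustrated:

```python
import random
import time

# Stand-in for the service-side throttling error (the real SDK raises
# ProvisionedThroughputExceededException).
class ProvisionedThroughputExceeded(Exception):
    pass

def put_with_retry(put_records, records, max_attempts=5):
    """Call put_records, retrying on throttling with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return put_records(records)
        except ProvisionedThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Exponential backoff with jitter before retrying.
            time.sleep(min(2 ** attempt * 0.1, 2.0) * random.random())
```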

Kinesis Producer Library (KPL)

The KPL offers synchronous and asynchronous API calls with a built-in retry mechanism. It uses batching (collection and aggregation) to send more data to multiple shards in a single API call. The KPL supports two types of batching:

  • Aggregation – storing multiple records within a single Kinesis Data Streams record.
  • Collection – using the PutRecords API operation to send multiple Kinesis Data Streams records to one or more shards in your Kinesis data stream.

The two types of KPL batching are designed to coexist and can be turned on or off independently of one another. By default, both are turned on.

Aggregation

Aggregation refers to the storage of multiple records in a Kinesis Data Streams record. Aggregation allows customers to increase the number of records sent per API call, which effectively increases producer throughput.

Kinesis Data Streams shards support up to 1,000 records per second or 1 MB/sec of throughput, whichever is reached first. With records smaller than 1 KB, the records-per-second limit is hit before the throughput limit, so customers with small records are capped well below 1 MB/sec. Record aggregation lets customers combine multiple records into a single Kinesis Data Streams record, improving per-shard throughput.

Consider the case of one shard in region us-east-1 that is currently running at a constant rate of 1,000 records per second, with records that are 512 bytes each. With KPL aggregation, you can pack 100 user records into each of 10 Kinesis Data Streams records (about 50 KB each), reducing the request rate from 1,000 to 10 records per second.
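The arithmetic in this example can be checked with a short helper (my own sketch; the 50 KB aggregate size is taken from the example above, not a fixed KPL constant):

```python
# How many Kinesis Data Streams records are needed when user records
# are aggregated up to a target payload size.
def aggregated_record_count(user_records: int, record_bytes: int,
                            max_payload_bytes: int = 50 * 1024) -> int:
    per_aggregate = max_payload_bytes // record_bytes  # user records per aggregate
    return -(-user_records // per_aggregate)           # ceiling division

# 1,000 records of 512 bytes packed into ~50 KB aggregates -> 10 records.
print(aggregated_record_count(1000, 512))  # 10
```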

Collection

Collection refers to batching multiple Kinesis Data Streams records and sending them in a single HTTP request with a call to the API operation PutRecords, instead of sending each Kinesis Data Streams record in its own HTTP request.

This increases throughput compared to using no collection because it reduces the overhead of making many separate HTTP requests. In fact, PutRecords itself was specifically designed for this purpose.

Collection differs from aggregation in that it works with groups of Kinesis Data Streams records; each collected Kinesis Data Streams record can itself still contain multiple user records if aggregation is also enabled.
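Collection can be sketched as a chunking step in front of PutRecords. The per-call limits used here (500 records or 5 MB per PutRecords request) are the service's documented API limits; the chunking helper itself is my own illustration:

```python
# Group records into PutRecords-sized batches, capped by both record
# count and total payload size.
def collect_batches(records, max_count=500, max_bytes=5 * 1024 * 1024):
    batch, size = [], 0
    for rec in records:
        if batch and (len(batch) >= max_count or size + len(rec) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(rec)
        size += len(rec)
    if batch:
        yield batch  # flush the final partial batch

batches = list(collect_batches([b"x" * 100] * 1200))
print([len(b) for b in batches])  # [500, 500, 200]
```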

In KPL batching, RecordMaxBufferTime (default 100 ms) controls how long records may be buffered before a producer API call is made; a larger value improves batching efficiency at the cost of added latency.
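The buffer-time idea can be illustrated with a toy buffer (my own sketch, not the KPL implementation): records are held until either the time deadline or a size threshold is reached.

```python
import time

class TimedBuffer:
    """Hold records until a max buffer time or max record count is hit."""

    def __init__(self, max_buffer_ms=100, max_records=500):
        self.max_buffer_ms = max_buffer_ms
        self.max_records = max_records
        self.records = []
        self.first_at = None  # timestamp of the oldest buffered record

    def add(self, record):
        if self.first_at is None:
            self.first_at = time.monotonic()
        self.records.append(record)

    def should_flush(self, now=None):
        if not self.records:
            return False
        now = time.monotonic() if now is None else now
        age_ms = (now - self.first_at) * 1000
        return age_ms >= self.max_buffer_ms or len(self.records) >= self.max_records
```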

Kinesis Agent

Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams. The agent continuously monitors a set of files and sends new data to your stream. The agent handles file rotation, checkpointing, and retry upon failures. It delivers all of your data in a reliable, timely, and simple manner. It also emits Amazon CloudWatch metrics to help you better monitor and troubleshoot the streaming process.

By default, records are parsed from each file based on the newline ('\n') character. However, the agent can also be configured to parse multi-line records (see Agent Configuration Settings).

You can install the agent on Linux-based server environments such as web servers, log servers, and database servers. After installing the agent, configure it by specifying the files to monitor and the stream for the data. After the agent is configured, it durably collects data from the files and reliably sends it to the stream.
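A minimal agent configuration (`agent.json`) might look like the following; the file pattern and stream name are placeholders for your own values:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/app/*.log",
      "kinesisStream": "my-stream",
      "partitionKeyOption": "RANDOM"
    }
  ]
}
```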

Kinesis Consumers

Kinesis records can be consumed using the SDK (GetRecords API), the KCL (Kinesis Client Library), Kinesis Data Firehose, Kinesis Data Analytics, or AWS Lambda. Consumers poll for records at a certain interval, except enhanced fan-out consumers, which work on a push mechanism.

  • The consumer SDK uses the GetRecords API, which can return up to 10 MB of data or 10,000 records per call. Each shard supports at most 5 GetRecords calls per second and 2 MB/sec of read throughput, shared across all classic consumers.
  • The KCL (Kinesis Client Library) and a Lambda consumer can internally de-aggregate records published with KPL aggregation.
  • KCL 2.0 and Lambda (since November 2018) support enhanced fan-out consumers. A standard consumer pulls data and shares the overall 2 MB/sec per-shard limit: if 5 consumers are subscribed to the same shard, each can pull no more than ~400 KB/sec. With enhanced fan-out, the shard pushes data at 2 MB/sec to each consumer, so every consumer subscribed to the same shard receives data at 2 MB/sec. Enhanced fan-out costs more than standard, so use it only if you need low latency (~70 ms). There is also a default soft limit of 5 registered enhanced fan-out consumers.
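The standard-versus-enhanced throughput split above can be expressed directly (my own helper; the 2 MB/sec figure is the per-shard limit described above):

```python
# Per-consumer read throughput on one shard, in KB/sec.
# Standard consumers share the shard's 2 MB/sec; each enhanced
# fan-out consumer gets a dedicated 2 MB/sec pipe.
def per_consumer_kb_per_sec(consumers: int, enhanced: bool) -> float:
    shard_read_kb = 2 * 1024  # 2 MB/sec per shard
    return shard_read_kb if enhanced else shard_read_kb / consumers

print(per_consumer_kb_per_sec(5, enhanced=False))  # 409.6 (~400 KB/sec shared)
print(per_consumer_kb_per_sec(5, enhanced=True))   # 2048 (each gets 2 MB/sec)
```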

Kinesis Data Firehose

Kinesis Data Firehose is used to deliver large volumes of data to a storage destination in near real time, with a minimum latency of about one minute.

  • The KPL, Kinesis Agent, Kinesis Data Streams, CloudWatch Logs events, and AWS IoT can produce data to KDF.
  • KDF can deliver data to S3, Redshift, Splunk, or Amazon Elasticsearch Service.
  • KDF can convert data from JSON to Parquet or ORC while loading it to S3.
  • Other data transformations, such as CSV to JSON, are possible through a Lambda function before loading data to the destination.
  • KDF flushes its buffer to the destination when either the configured buffer size (a few MB) or the buffer interval (minimum one minute) is reached, whichever comes first.
  • It supports GZIP, ZIP, and Snappy compression when the target is S3, and only GZIP when the target is Redshift.
  • You pay only for the amount of data ingested into KDF.
  • A separate S3 bucket can be configured either to back up all records that pass through KDF or to capture records that fail delivery or transformation. In short, you can have a separate S3 bucket to store data that did not get loaded due to any kind of failure, or to keep a backup of all data processed through KDF.
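The size-or-time buffering rule can be sketched as a simple predicate (my own illustration; the 5 MB / 60 s values are typical S3-destination settings used as defaults here):

```python
# Deliver the buffer when either the size hint or the time interval is
# reached, whichever comes first.
def should_deliver(buffered_bytes, seconds_since_first,
                   buffer_size_mb=5, buffer_interval_s=60):
    return (buffered_bytes >= buffer_size_mb * 1024 * 1024
            or seconds_since_first >= buffer_interval_s)

print(should_deliver(6 * 1024 * 1024, 10))  # True: size threshold hit first
print(should_deliver(1024, 60))             # True: interval elapsed
print(should_deliver(1024, 10))             # False: neither reached
```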

Kinesis vs SQS

https://medium.com/nerd-for-tech/system-design-choosing-between-aws-kinesis-and-aws-sqs-2586c814be8d