Kinesis - seanremenyi/Notes_aws_developer GitHub Wiki
Amazon Kinesis is a family of services that lets you collect, process and analyze streaming data in real time.
- Allows you to build custom applications for your own business needs
- Kinesis is a Greek word meaning motion or movement
- Amazon Kinesis deals with data that is in motion, i.e. streaming data
Streaming Data?
Data generated continuously by thousands of data sources, which typically send their data records simultaneously and in small sizes (kilobytes). Examples:
- Financial transactions
- Stock prices
- Game data (as the gamer plays)
- Social media feeds
- Location tracking data (Uber)
- IoT sensors
- Clickstream data
- Log files

Kinesis Streams:
Stream data and video so you can build custom applications that process data in real time. There are two kinds: data streams and video streams.

Shards (Kinesis data streams only):
- Kinesis data streams are made up of shards
- Each shard is a sequence of one or more data records, each with its own unique sequence number, and provides a fixed unit of capacity
- Reads: up to 5 read transactions per second per shard, with a maximum total read rate of 2 MB per second
- Writes: up to 1,000 records per second per shard, with a maximum total write rate of 1 MB per second
- The data capacity of the stream is the sum of the capacities of its shards
- If the data rate increases, you increase the capacity of the stream by increasing the number of shards (resharding)
- Data is retained for 24 hours by default; retention can be extended to 7 days (and, in current versions of the service, up to 365 days)
- Consumers read the records from the shards and process them

What about consumers?
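Before turning to consumers, the per-shard limits listed above can be turned into a quick sizing calculation. This is only a rough sketch: the function name `shards_needed` and the example workload are made up for illustration, and it only accounts for the three limits quoted in these notes (1 MB/s and 1,000 records/s for writes, 2 MB/s for reads).

```python
import math

# Per-shard limits as quoted in the notes above.
WRITE_MB_PER_SEC = 1.0        # max write throughput per shard
WRITE_RECORDS_PER_SEC = 1000  # max records written per shard per second
READ_MB_PER_SEC = 2.0         # max read throughput per shard


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# Hypothetical workload: 3.5 MB/s written, 2,500 records/s, 7 MB/s read.
print(shards_needed(3.5, 2500, 7.0))  # 4
```

In this example the write throughput and the read throughput both call for 4 shards, while the record count alone would only need 3, so the stream is sized by the tightest limit.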
The Kinesis Client Library (KCL) runs on the consumer instances.
- Tracks the number of shards in your stream
- Discovers new shards when you reshard

Kinesis Client Library:
- The KCL ensures that for every shard there is a record processor
- Manages the number of record processors relative to the number of shards and consumers
- If you have only one consumer, the KCL creates all the record processors on that single consumer
- If you have two consumers, it load balances and creates half the processors on one instance and half on the other
- Example: with 4 shards and 1 consumer, that consumer runs 4 record processors; with 4 shards and 2 consumers, each consumer runs 2 record processors

What about scaling out the consumers?
- With the KCL, you should generally ensure that the number of instances does not exceed the number of shards (except for failure or standby purposes)
- You never need multiple instances to handle the processing load of one shard
- However, one worker can process multiple shards, so it is fine if the number of shards exceeds the number of instances
- Resharding does not by itself mean that you need to add more instances
- CPU utilisation is what should drive the number of consumer instances you have, NOT the number of shards in your Kinesis stream
- Use an Auto Scaling group, and base scaling decisions on the CPU load of your consumers

Kinesis Shards:
- The Kinesis Client Library running on your consumers creates a record processor for each shard that is being consumed by your instance
- If you increase the number of shards, the KCL adds more record processors on your consumers
- As noted above, scale the consumer instances on CPU utilisation (with an Auto Scaling group), not on the number of shards

Kinesis Video Streams:
Securely stream video from connected devices to AWS. The video can be used for analytics and machine learning.
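Returning to the KCL notes above, the way record processors are spread across consumers can be illustrated with a toy model. This is not the KCL's actual balancing mechanism; it is a simplified round-robin sketch, and the function name `assign_record_processors` and the shard and consumer names are invented for the example.

```python
def assign_record_processors(shard_ids, consumer_ids):
    """Give every shard one record processor, spread round-robin over consumers."""
    assignment = {consumer: [] for consumer in consumer_ids}
    for index, shard in enumerate(shard_ids):
        consumer = consumer_ids[index % len(consumer_ids)]
        assignment[consumer].append(shard)
    return assignment


shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

# One consumer: all four record processors run on it.
print(assign_record_processors(shards, ["consumer-a"]))

# Two consumers: two record processors each.
print(assign_record_processors(shards, ["consumer-a", "consumer-b"]))
```

Adding a fifth consumer to a four-shard stream would leave one consumer with no shard at all in this model, which is the reason the notes say the instance count should not exceed the shard count outside of standby use.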
Kinesis Data Firehose:
Capture, transform and load data streams into AWS data stores (or other service providers such as Splunk or Datadog) to enable near real-time analytics with BI tools.
- No data retention
- An optional Lambda function can process the data as it comes in, before it is saved to storage
- No shards and no consumers
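The optional Lambda step mentioned above could look roughly like the following. It is a minimal sketch that assumes the incoming records are JSON objects. Firehose passes records to the function base64-encoded and expects each one back with the same `recordId`, a `result` status and the re-encoded `data`; the `processed` field added here is a made-up enrichment for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and hand it back for delivery."""
    output = []
    for record in event["records"]:
        # Firehose delivers the payload base64-encoded.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # hypothetical enrichment step

        # Re-encode, adding a newline so records are separated in storage.
        line = json.dumps(payload) + "\n"
        data = base64.b64encode(line.encode("utf-8")).decode("utf-8")

        output.append({
            "recordId": record["recordId"],  # must match the incoming id
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": data,
        })
    return {"records": output}


# Local usage example with a hand-built event (no AWS call involved).
sample = {"records": [{
    "recordId": "1",
    "data": base64.b64encode(b'{"user": "abc"}').decode("utf-8"),
}]}
print(lambda_handler(sample, None)["records"][0]["result"])  # Ok
```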
Kinesis Data Analytics:
Analyze, query and transform streamed data in real time using standard SQL.
- Store the results in an AWS data store (like S3 or Redshift)
- Sits after Kinesis Streams or Kinesis Data Firehose