AWS Analytics Services

Athena

  • Athena is an interactive query service that lets you query data stored in S3 using standard SQL, making ad hoc and programmatic querying accessible without standing up a database. It is serverless, requires no provisioning, and you pay only for the queries you run, billed per TB of data scanned. In effect, Athena turns S3 into a SQL-queryable data store (a minimal query sketch follows this list).
  • Example use cases
    • Query logs that are dumped into S3 buckets as an alternative or supplement to the ELK stack
    • Scheduling queries to generate regular business reports from the data flowing into S3
    • Running queries on clickstream data to gain further insight into customer behavior
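As a rough sketch of what this looks like in practice, the snippet below submits an Athena query with boto3. The database name (logs_db), table name (app_logs), and results bucket are illustrative assumptions, not part of these notes.

```python
import boto3

# All names below (database, table, results bucket) are illustrative placeholders.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT status_code, COUNT(*) AS hits
        FROM app_logs
        WHERE year = '2024'
        GROUP BY status_code
        ORDER BY hits DESC
    """,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)

# Athena runs asynchronously; poll get_query_execution() for completion, then
# read results with get_query_results() or directly from the S3 output location.
print("Query execution ID:", response["QueryExecutionId"])
```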

AWS Glue

  • A fully managed service to extract, transform, and load (ETL) your data for analytics.
  • Discover and search across different AWS data sets without moving your data.
  • AWS Glue consists of:
    • Central metadata repository
    • ETL engine
    • Flexible scheduler
  • Use Cases
    • Run queries against an Amazon S3 data lake: you can use AWS Glue to make your data available for analytics without moving it.
    • Analyze the log data in your data warehouse: create ETL scripts that transform, flatten, and enrich the data from source to target.
    • Create event-driven ETL pipelines: as soon as new data becomes available in Amazon S3, an AWS Lambda function can invoke an AWS Glue ETL job to process it (see the sketch after this list).
    • A unified view of your data across multiple data stores: with the AWS Glue Data Catalog, you can quickly search and discover all your datasets and maintain the relevant metadata in one central repository.
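A minimal sketch of the event-driven pattern above, assuming an S3 ObjectCreated trigger on the Lambda function and a pre-existing Glue job. The job name and argument keys are hypothetical.

```python
import boto3

# Hypothetical job name; the Glue job itself would be defined separately
# (for example, a script that flattens and enriches newly arrived JSON).
GLUE_JOB_NAME = "flatten-clickstream-job"

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event; starts the Glue ETL job."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    # Pass the new object's location to the job as job arguments.
    response = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={
            "--source_bucket": bucket,
            "--source_key": key,
        },
    )
    return {"JobRunId": response["JobRunId"]}
```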

AWS Data Pipeline

  • A web service for scheduling regular data movement and data processing activities in the AWS cloud. Data Pipeline integrates with both on-premises and cloud-based storage systems.
  • A managed ETL (Extract-Transform-Load) service (a minimal creation sketch follows this list).
  • Native integration with S3, DynamoDB, RDS, EMR, EC2 and Redshift.
  • AWS Data Pipeline vs Amazon Simple Workflow Service (SWF)
    • Both services provide execution tracking, handling retries and exceptions, and running arbitrary actions.
    • AWS Data Pipeline is specifically designed to facilitate the steps that are common across the majority of data-driven workflows.
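A rough sketch of creating and activating a pipeline with boto3. The pipeline name, log bucket, and the single shell-command activity are illustrative, and the definition assumes the default Data Pipeline IAM roles already exist in the account.

```python
import boto3

datapipeline = boto3.client("datapipeline")

# Create the pipeline shell; uniqueId makes the create call idempotent.
pipeline_id = datapipeline.create_pipeline(
    name="nightly-export",            # hypothetical pipeline name
    uniqueId="nightly-export-v1",
)["pipelineId"]

# A minimal on-demand definition: one shell command on a transient EC2 resource.
datapipeline.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            {"key": "pipelineLogUri", "stringValue": "s3://my-pipeline-logs/"},
        ]},
        {"id": "MyActivity", "name": "MyActivity", "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo 'pipeline run'"},
            {"key": "runsOn", "refValue": "MyEc2Resource"},
        ]},
        {"id": "MyEc2Resource", "name": "MyEc2Resource", "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "instanceType", "stringValue": "t2.micro"},
            {"key": "terminateAfter", "stringValue": "30 Minutes"},
        ]},
    ],
)

# Activation runs the workflow; on-demand pipelines run once per activation.
datapipeline.activate_pipeline(pipelineId=pipeline_id)
```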

Amazon Elastic MapReduce (EMR)

  • A managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
  • You can process data for analytics purposes and business intelligence workloads using EMR together with Apache Hive and Apache Pig.
  • You can use EMR to transform and move large amounts of data into and out of other AWS data stores and databases.
  • EMR enables you to quickly and easily provision as much capacity as you need, and automatically or manually add and remove capacity (see the cluster sketch after this list).
  • You can leverage multiple data stores, including S3, the Hadoop Distributed File System (HDFS), and DynamoDB.
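A minimal sketch of provisioning an EMR cluster with boto3. The cluster name, release label, instance sizing, and log bucket are illustrative assumptions, and the default EMR service roles are assumed to exist.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Names, roles, versions, and sizing below are illustrative assumptions.
cluster = emr.run_job_flow(
    Name="spark-analytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Keep the cluster running after steps finish so more work can be added.
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/",
)
print("Cluster ID:", cluster["JobFlowId"])
```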

Amazon Elasticsearch Service (now Amazon OpenSearch Service)

  • Amazon ES lets you search, analyze, and visualize your data in real time. The service manages the capacity, scaling, patching, and administration of your Elasticsearch clusters for you, while still giving you direct access to the Elasticsearch APIs.
  • The service offers the open-source Elasticsearch APIs, managed Kibana, and integrations with Logstash and other AWS services. This combination is commonly referred to as the ELK stack (Elasticsearch, Logstash, Kibana). A domain-creation sketch follows.
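A minimal sketch of creating a managed search domain with boto3 (using the newer OpenSearch API). The domain name, engine version, and instance sizing are illustrative assumptions.

```python
import boto3

opensearch = boto3.client("opensearch", region_name="us-east-1")

# Domain name, instance sizing, and engine version are illustrative assumptions.
domain = opensearch.create_domain(
    DomainName="app-logs-search",
    EngineVersion="OpenSearch_2.11",
    ClusterConfig={
        "InstanceType": "t3.small.search",
        "InstanceCount": 2,
    },
    EBSOptions={
        "EBSEnabled": True,
        "VolumeType": "gp3",
        "VolumeSize": 20,
    },
)
print("Domain ARN:", domain["DomainStatus"]["ARN"])
```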