Diagrams - SaraCmque/TET-Project3-EMR GitHub Wiki

This document provides an overview of the architecture for our AWS project, highlighting the various components and their interactions.

1. Architectural Diagram Image

[Image: TELEMATICA - PROJECT #3 architecture diagram]

The architectural diagram is available here (look for the Project #3 frame).

2. Components Overview

Amazon S3 Bucket
- Stores raw, trusted, and refined data. Traffic data is stored as CSV and climate data as JSON. Files are organized hierarchically in folders by processing stage (raw/, trusted/, and refined/) and by date.
- S3 also stores Python and shell scripts used by Lambda functions for processing and cluster creation.
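
The stage-and-date folder layout can be illustrated with a small key builder. This is only a sketch: the page does not give the exact key pattern, so the `stage/dataset/date/filename` order, the dataset names, and the file names below are assumptions.

```python
from datetime import date

# Processing stages named in this page; each maps to a top-level folder.
STAGES = ("raw", "trusted", "refined")


def build_key(stage: str, dataset: str, day: date, filename: str) -> str:
    """Build an S3 object key of the form stage/dataset/YYYY-MM-DD/filename.

    The ordering of the path segments is an assumption made for illustration.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return f"{stage}/{dataset}/{day.isoformat()}/{filename}"


# Hypothetical dataset and file names.
traffic_key = build_key("raw", "traffic", date(2024, 5, 1), "traffic.csv")
climate_key = build_key("raw", "climate", date(2024, 5, 1), "climate.json")
```

Keeping the date as its own path segment makes it straightforward to list or reprocess a single day.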
AWS Lambda Functions
- Lambda Function #1:
  - Triggered daily by Amazon EventBridge. It collects traffic data from a MySQL RDS database and weather data from the Open-Meteo API. Both datasets are then stored in the S3 raw folder.
- Lambda Function #2:
  - Triggered automatically when new files are inserted into the raw folder in S3.
  - It uses DynamoDB to check if an EMR cluster is already running to avoid duplicate cluster creation. If not, it launches an EMR cluster to process the data.
  - The pipeline processes the raw data into:
    - Trusted Data: Validated and cleaned data.
    - Descriptive Stats: Summarized metrics from the data.
    - Model Data: Input features and labels for training ML models.
    - Visualization Data: Aggregated and formatted for front-end charts or dashboards.
- Lambda Function #3:
  - Triggered by REST API requests via API Gateway.
  - Queries AWS Athena to fetch refined data and returns JSON responses for client consumption.
Amazon Aurora and RDS (MySQL)
- Handles relational data storage for traffic information, accessible via Lambda #1.
DynamoDB
- Used for concurrency control. It stores flags indicating whether an EMR cluster is already active to prevent multiple clusters from being launched when multiple files are uploaded simultaneously.
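
One common way to implement such a flag is a conditional write, where only the first caller succeeds in creating the item. The sketch below assumes a boto3 DynamoDB client is passed in; the table name, the `lock_id` key, and the `cluster_state` attribute are hypothetical and not taken from the project.

```python
def try_acquire_cluster_lock(dynamodb_client, table_name: str,
                             lock_id: str = "emr-cluster") -> bool:
    """Return True if this caller created the flag, False if it already existed."""
    try:
        dynamodb_client.put_item(
            TableName=table_name,
            Item={
                "lock_id": {"S": lock_id},          # hypothetical partition key
                "cluster_state": {"S": "STARTING"},  # hypothetical attribute
            },
            # The write succeeds only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(lock_id)",
        )
        return True
    except dynamodb_client.exceptions.ConditionalCheckFailedException:
        # Another invocation already set the flag: do not launch a second cluster.
        return False


def release_cluster_lock(dynamodb_client, table_name: str,
                         lock_id: str = "emr-cluster") -> None:
    """Remove the flag once the cluster has finished its work."""
    dynamodb_client.delete_item(
        TableName=table_name,
        Key={"lock_id": {"S": lock_id}},
    )
```

Because the check and the write happen in a single request, two invocations triggered by simultaneous uploads cannot both acquire the flag.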
Amazon EventBridge
- Provides scheduling for Lambda #1 to execute daily ingestions.
API Gateway
- Acts as the REST interface for end users or front-end applications.
- Routes requests to Lambda #3, which responds with JSON data.
Amazon EMR
- Runs processing steps (using PySpark) to generate the trusted dataset and the refined datasets: descriptive statistics, visualization data, and ML-ready data.
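
As an illustration of how PySpark steps are typically described to EMR, the sketch below builds step definitions in the shape expected by the `Steps` parameter of boto3's `run_job_flow`. The script names and the `scripts/` folder are hypothetical; only the four outputs come from this page.

```python
def spark_step(name: str, script_uri: str, *script_args: str) -> dict:
    """Describe one spark-submit step for an EMR cluster."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets EMR run spark-submit as a step.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     script_uri, *script_args],
        },
    }


def pipeline_steps(bucket: str) -> list[dict]:
    """One step per output described above (script names are assumptions)."""
    scripts = f"s3://{bucket}/scripts"
    return [
        spark_step("Trusted data", f"{scripts}/build_trusted.py"),
        spark_step("Descriptive stats", f"{scripts}/descriptive_stats.py"),
        spark_step("Model data", f"{scripts}/model_data.py"),
        spark_step("Visualization data", f"{scripts}/visualization_data.py"),
    ]
```

The trusted step is listed first because the refined outputs are derived from the trusted data.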
AWS Athena
- Performs ad-hoc queries over data stored in the refined folder in S3.
- Used by Lambda #3 to fulfill API data requests.
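
Athena returns query results as rows of cells rather than ready-made JSON objects, so Lambda #3 needs some conversion step. A minimal sketch of that conversion is shown below, assuming the input is the first page of a `get_query_results` response for a SELECT query, where the first row holds the column names; the function name is hypothetical.

```python
def rows_to_dicts(result: dict) -> list[dict]:
    """Convert an Athena get_query_results page into a list of dicts.

    Assumes the first row is the header row, which holds for the first
    page of a SELECT query. Cells without a value become None.
    """
    rows = result["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        {name: cell.get("VarCharValue") for name, cell in zip(header, row["Data"])}
        for row in rows[1:]
    ]
```

All values come back as strings, so numeric columns would still need to be cast before being returned to clients that expect numbers.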

3. Workflow Overview

3.1. Data Ingestion

EventBridge triggers Lambda #1 once a day. The function collects traffic data from the MySQL RDS database and weather data from the Open-Meteo API, then stores both in the S3 raw folder.
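
For the weather half of the ingestion, the request URL could be assembled as follows. This is a sketch under assumptions: the page does not say which Open-Meteo endpoint or variables the project uses, so the forecast endpoint, the hourly variables, and the coordinates in the usage line are illustrative only.

```python
from urllib.parse import urlencode

# Open-Meteo forecast endpoint; the project may use a different one.
OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def open_meteo_url(latitude: float, longitude: float,
                   hourly=("temperature_2m", "precipitation")) -> str:
    """Build a request URL for hourly weather variables at one location."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": ",".join(hourly),  # comma-separated list of variables
    }
    return f"{OPEN_METEO_URL}?{urlencode(params)}"


# Hypothetical coordinates.
url = open_meteo_url(6.25, -75.56)
```

The JSON body returned for this URL is what would be written, unchanged, to the raw folder.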

3.2. Data Processing

When new data appears in the raw folder:
- Lambda #2 checks with DynamoDB if EMR is already running.
- If not, it starts EMR and processes data into trusted and refined datasets.
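
Before checking DynamoDB, Lambda #2 has to work out which uploaded objects actually belong to the raw folder. A sketch of that first step is shown below; it relies on the standard S3 event notification structure, while the function name and the `raw/` prefix default are assumptions.

```python
from urllib.parse import unquote_plus


def raw_keys_from_event(event: dict, prefix: str = "raw/") -> list[str]:
    """Return the object keys in an S3 event that sit under the raw folder."""
    keys = []
    for record in event.get("Records", []):
        # Keys in S3 notifications are URL-encoded (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix):
            keys.append(key)
    return keys
```

If the returned list is empty, the function can exit without touching DynamoDB or EMR.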

3.3. Refined Data Subsets

The trusted folder holds the joined data, which relates the API data to the relational database data.
Refined data is further split into:
- Descriptive stats
- Machine learning model data
- Visualization-ready data

3.4. User Interaction

Users or front-end applications make GET requests via the API Gateway using the following endpoints:
- /congestion-stats
- /global-stats
- /correlations
- /visualizations
- /test-data
- /model-evaluation
- /predictions-gbt
- /predictions-rf
The request is routed to Lambda #3, which queries Athena and returns the results as a JSON response.
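
The routing inside Lambda #3 could be sketched as a lookup from request path to Athena table. The sketch assumes a Lambda proxy integration (so the event carries `path` and `httpMethod`), and the table names, the `LIMIT`, and the injected `run_query` callable are all hypothetical.

```python
import json

# Endpoint paths come from the list above; table names are assumptions.
ENDPOINT_TABLES = {
    "/congestion-stats": "congestion_stats",
    "/global-stats": "global_stats",
    "/correlations": "correlations",
    "/visualizations": "visualizations",
    "/test-data": "test_data",
    "/model-evaluation": "model_evaluation",
    "/predictions-gbt": "predictions_gbt",
    "/predictions-rf": "predictions_rf",
}


def _response(status: int, payload) -> dict:
    """Format a response in the shape expected by a Lambda proxy integration."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handle_request(event: dict, run_query) -> dict:
    """Route a GET request to the matching table and return its rows as JSON.

    run_query is any callable that takes a SQL string and returns a list of
    dicts, for example a wrapper around Athena.
    """
    if event.get("httpMethod") != "GET":
        return _response(405, {"error": "method not allowed"})
    table = ENDPOINT_TABLES.get(event.get("path", ""))
    if table is None:
        return _response(404, {"error": "unknown endpoint"})
    rows = run_query(f'SELECT * FROM "{table}" LIMIT 100')
    return _response(200, rows)
```

Passing `run_query` in as a parameter keeps the routing logic testable without an AWS connection.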

Together, these components form an AWS-based pipeline for ingesting, processing, and querying multi-source data.