GCP ML Pipeline Architecture - stanlypoc/AIRA GitHub Wiki


# 📦 GCP ML Pipeline Architecture

## 📊 Architecture Overview

```mermaid
graph TD
    A[Data Sources] --> B[Cloud Storage]
    B --> C[Vertex AI Data Labeling]
    C --> D[BigQuery]
    D --> E[Vertex AI Feature Store]
    E --> F[Vertex AI Training]
    F --> G[Vertex AI Model Registry]
    G --> H[Vertex AI Endpoints]
    H --> I[Cloud Functions/Cloud Run]
    I --> J[End Users]
    K[CI/CD Pipeline] -->|Triggers| F
    K -->|Deploys| H
    L[IAM & Security] --> M[All Components]
```

## 🧩 Component Details

### 1. Data Ingestion & Storage

#### Cloud Storage (GCS)

- **Purpose:** Raw data lake for unstructured and semi-structured data
- **Scalability:** Auto-scales with multi-regional storage classes
- **Security:** IAM roles (`storage.objectAdmin`, `storage.objectViewer`)
- **Example:** `gs://my-ml-data-bucket/raw-images/`
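To keep a bucket layout like `gs://my-ml-data-bucket/raw-images/` consistent across jobs, a small helper can build object URIs. This is a minimal sketch; the bucket and object names are just the placeholders from the example above, and actual uploads would go through the `google-cloud-storage` client:

```python
from urllib.parse import quote

def gcs_uri(bucket: str, *parts: str) -> str:
    """Build a gs:// object URI under the given bucket.

    A real upload would pass a path like this to the google-cloud-storage
    client, e.g. Client().bucket(bucket).blob(path).upload_from_filename(...).
    """
    path = "/".join(quote(p.strip("/")) for p in parts)
    return f"gs://{bucket}/{path}"

print(gcs_uri("my-ml-data-bucket", "raw-images", "img_0001.jpg"))
# gs://my-ml-data-bucket/raw-images/img_0001.jpg
```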

#### BigQuery

- **Purpose:** Structured data warehouse for tabular data
- **Performance:** Columnar storage + BI Engine
- **IAM:** `bigquery.dataEditor`
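A typical pattern is to render the training query as a string and hand it to the BigQuery client (`google-cloud-bigquery`'s `Client().query(sql)`). A minimal sketch, with hypothetical project, dataset, and column names:

```python
def training_query(project: str, dataset: str, table: str, label_col: str) -> str:
    """Render a Standard SQL query that pulls labeled rows for training.

    EXCEPT(...) drops the raw label column so it only appears once,
    aliased as `label`.
    """
    return (
        f"SELECT * EXCEPT ({label_col}), {label_col} AS label "
        f"FROM `{project}.{dataset}.{table}` "
        f"WHERE {label_col} IS NOT NULL"
    )

print(training_query("my-project", "ml", "transactions", "is_fraud"))
```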

### 2. Data Preparation

#### Vertex AI Data Labeling

- **Purpose:** Human-in-the-loop labeling with AI-assisted tools
- **Availability:** Global workforce, 99.9% SLA
- **IAM:** `aiplatform.dataLabeler`

#### Vertex AI Feature Store

- **Purpose:** Centralized feature repository
- **Scalability:** 10M+ QPS
- **IAM:** `aiplatform.featureStoreAdmin`
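Teams usually declare the feature catalog once and derive read requests from it. A sketch with a hypothetical `customer` entity type; the value types mirror those accepted by Vertex AI Feature Store:

```python
# Hypothetical feature definitions for a "customer" entity type.
CUSTOMER_FEATURES = {
    "entity_type_id": "customer",
    "features": {
        "lifetime_value": {"value_type": "DOUBLE"},
        "days_since_last_order": {"value_type": "INT64"},
        "preferred_category": {"value_type": "STRING"},
    },
}

def feature_ids(spec: dict) -> list:
    """Feature IDs a training or online-serving job would read from the store."""
    return sorted(spec["features"])

print(feature_ids(CUSTOMER_FEATURES))
```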

### 3. Model Training

#### Vertex AI Training

- **Purpose:** Managed training with AutoML or custom containers
- **Performance:** Supports GPUs (A100/V100) and TPU v4 pods
- **IAM:**
  - `aiplatform.user`
  - `aiplatform.customCodeServiceAgent`
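For custom-container training, the job is described by a worker-pool spec. A sketch of that payload (field names follow the Vertex AI CustomJob API; the image URI, machine type, and display name are placeholders):

```python
def custom_job_spec(image_uri: str,
                    machine_type: str = "n1-standard-8",
                    accelerator: str = "NVIDIA_TESLA_V100",
                    gpus: int = 1) -> dict:
    """Assemble a CustomJob payload: worker_pool_specs wraps the
    machine_spec (hardware) and container_spec (training image)."""
    return {
        "display_name": "demo-training-job",
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": machine_type,
                        "accelerator_type": accelerator,
                        "accelerator_count": gpus,
                    },
                    "replica_count": 1,
                    "container_spec": {"image_uri": image_uri},
                }
            ]
        },
    }

spec = custom_job_spec("us-docker.pkg.dev/my-project/train/trainer:latest")
print(spec["job_spec"]["worker_pool_specs"][0]["machine_spec"])
```

In practice this dict (or its client-library equivalent) is submitted via the Vertex AI SDK or REST API; scaling out distributed training is a matter of adding worker pools or raising `replica_count`.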

### 4. Model Management

#### Vertex AI Model Registry

- **Purpose:** Versioned model storage with lineage
- **Security:** CMEK encryption
- **IAM:** `aiplatform.modelAdmin`
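Registry entries are addressed by a fully qualified resource name, with a specific version selected via the `models/MODEL_ID@VERSION` suffix. A small helper sketch (project, region, and model ID are hypothetical):

```python
def model_resource_name(project: str, region: str, model_id: str,
                        version: str = "") -> str:
    """Fully qualified Model Registry resource name; an optional
    @VERSION suffix pins a specific model version."""
    name = f"projects/{project}/locations/{region}/models/{model_id}"
    return f"{name}@{version}" if version else name

print(model_resource_name("my-project", "us-central1", "fraud-detector", "3"))
# projects/my-project/locations/us-central1/models/fraud-detector@3
```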

### 5. Serving

#### Vertex AI Endpoints

- **Purpose:** Serverless model serving
- **Scalability:** Scale-to-zero, handles 100K+ RPS
- **IAM:** `aiplatform.endpointAdmin`
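Online prediction requests to an endpoint use a JSON envelope: the caller POSTs `{"instances": [...]}` to the endpoint's `:predict` URL and gets back `{"predictions": [...]}`. A minimal sketch of both sides (the feature fields are hypothetical):

```python
import json

def predict_body(instances: list) -> str:
    """Serialize instances into the {"instances": [...]} request body."""
    return json.dumps({"instances": instances})

def parse_predictions(response_body: str) -> list:
    """Pull the predictions list out of a :predict response body."""
    return json.loads(response_body)["predictions"]

print(predict_body([{"amount": 42.0, "country": "DE"}]))
```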

### 6. Orchestration

#### Cloud Composer (Airflow)

- **Purpose:** Workflow orchestration
- **Availability:** Regional HA
- **IAM:** `composer.worker`

## ⚙️ Architecture Attributes

### 1. Scalability

- **Horizontal:** Distributed training (Vertex AI), auto-scaling BigQuery slots
- **Vertical:** GPU/TPU acceleration for endpoints

### 2. Performance

- **Low latency:** <10 ms feature reads from Feature Store, Triton-optimized predictions
- **High throughput:** TB-scale BigQuery query handling

### 3. Availability

- **Multi-region:** GCS + Vertex AI global footprint
- **SLA:** Vertex AI (99.9%), BigQuery (99.95%)

### 4. Security

- **Encryption:** Google-managed keys or CMEK at rest, TLS 1.3 in transit
- **Network:** VPC-SC, Private IP, IAM Conditions

### 5. IAM Setup

```hcl
# Terraform example for IAM bindings
resource "google_project_iam_binding" "ml_engineer" {
  project = "my-gcp-project" # target project (falls back to the provider default if omitted)
  role    = "roles/aiplatform.user"
  members = ["user:[email protected]"]
}

resource "google_storage_bucket_iam_binding" "data_read" {
  bucket  = "my-ml-data-bucket"
  role    = "roles/storage.objectViewer"
  members = ["serviceAccount:[email protected]"]
}
```

## 🔁 CI/CD Pipeline

```mermaid
graph LR
    A[GitHub/Cloud Source Repos] --> B[Cloud Build]
    B -->|Triggers| C[Vertex AI Pipeline]
    C --> D[Model Registry]
    D --> E[Vertex AI Endpoint]
    E --> F[Cloud Deploy]
    F --> G[Promotion to Prod]
```

### Tools

- **Cloud Build:** Build and deploy with `cloudbuild.yaml`
- **Vertex AI Pipelines:** Kubeflow-based orchestration
- **Cloud Deploy:** Canary/promotional rollouts
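The `cloudbuild.yaml` mentioned above could look roughly like the sketch below: build the training image, then launch a Vertex AI custom job via `gcloud`. Image paths, the display name, and `job_spec.yaml` are placeholders, not a definitive pipeline definition:

```yaml
# Hypothetical cloudbuild.yaml: build the trainer image, then start training.
steps:
  - name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "us-docker.pkg.dev/$PROJECT_ID/train/trainer:$SHORT_SHA", "."]
  - name: gcr.io/google.com/cloudsdktool/cloud-sdk
    entrypoint: gcloud
    args:
      - ai
      - custom-jobs
      - create
      - --region=us-central1
      - --display-name=train-$SHORT_SHA
      - --config=job_spec.yaml
images:
  - us-docker.pkg.dev/$PROJECT_ID/train/trainer:$SHORT_SHA
```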

### IAM

- `cloudbuild.builds.editor`
- `aiplatform.pipelineRunner`

## ⭐ Key Advantages

1. **Fully Managed:** No infrastructure overhead
2. **MLOps Native:** Integrated model monitoring and rollback
3. **Cost Efficiency:** Preemptible training VMs, scale-to-zero inference

> 💡 **Tip:** Add Cloud Armor for DDoS protection and Dataflow for streaming ETL in production.