GCP ML Pipeline Architecture - stanlypoc/AIRA GitHub Wiki
# 📦 GCP ML Pipeline Architecture

## 📊 Architecture Overview
```mermaid
graph TD
    A[Data Sources] --> B[Cloud Storage]
    B --> C[Vertex AI Data Labeling]
    C --> D[BigQuery]
    D --> E[Vertex AI Feature Store]
    E --> F[Vertex AI Training]
    F --> G[Vertex AI Model Registry]
    G --> H[Vertex AI Endpoints]
    H --> I[Cloud Functions/Cloud Run]
    I --> J[End Users]
    K[CI/CD Pipeline] -->|Triggers| F
    K -->|Deploys| H
    L[IAM & Security] --> AllComponents[All Components]
```
## 🧩 Component Details

### 1. Data Ingestion & Storage

**Cloud Storage (GCS)**
- Purpose: Raw data lake for unstructured/semi-structured data
- Scalability: Auto-scales with multi-regional storage classes
- Security: IAM roles (`storage.objectAdmin`, `storage.objectViewer`)
- Example: `gs://my-ml-data-bucket/raw-images/`
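As a sketch, the bucket above could be provisioned with Terraform (the same tool used in the IAM setup later on this page); the bucket name and location are placeholders, not fixed project values:

```hcl
# Hypothetical Terraform sketch: raw-data bucket for the ML data lake.
resource "google_storage_bucket" "ml_raw_data" {
  name                        = "my-ml-data-bucket" # placeholder name
  location                    = "US"                # multi-region storage
  storage_class               = "STANDARD"
  uniform_bucket_level_access = true                # IAM-only access control
}
```

Uniform bucket-level access keeps permissions manageable through the `storage.objectAdmin`/`storage.objectViewer` roles rather than per-object ACLs.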
**BigQuery**
- Purpose: Structured data warehouse for tabular data
- Performance: Columnar storage + BI Engine
- IAM: `bigquery.dataEditor`
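A minimal Terraform sketch for a curated dataset, with the `bigquery.dataEditor` role granted at dataset scope; the dataset ID and service-account email are placeholders:

```hcl
# Hypothetical dataset for curated tabular training data.
resource "google_bigquery_dataset" "ml_features" {
  dataset_id = "ml_features" # placeholder ID
  location   = "US"
}

# Grant dataEditor on the dataset only, not project-wide.
resource "google_bigquery_dataset_iam_member" "editor" {
  dataset_id = google_bigquery_dataset.ml_features.dataset_id
  role       = "roles/bigquery.dataEditor"
  member     = "serviceAccount:ml-sa@my-project.iam.gserviceaccount.com" # placeholder
}
```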
### 2. Data Preparation

**Vertex AI Data Labeling**
- Purpose: Human-in-the-loop labeling with AI tools
- Availability: Global workforce, 99.9% SLA
- IAM: `aiplatform.dataLabeler`
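Labeling jobs operate on a Vertex AI managed dataset; a hedged Terraform sketch for an image dataset follows (the display name is a placeholder; the schema URI is Google's published image-dataset schema):

```hcl
# Hypothetical Vertex AI dataset that labeling jobs would target.
resource "google_vertex_ai_dataset" "raw_images" {
  display_name        = "raw-images" # placeholder
  region              = "us-central1"
  metadata_schema_uri = "gs://google-cloud-aiplatform/schema/dataset/metadata/image_1.0.0.yaml"
}
```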
**Vertex AI Feature Store**
- Purpose: Centralized feature repository
- Scalability: 10M+ QPS
- IAM: `aiplatform.featureStoreAdmin`
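A featurestore with fixed online-serving capacity can be declared in Terraform as a sketch; the name and node count are assumptions to be sized against your actual read QPS:

```hcl
# Hypothetical featurestore for low-latency online feature serving.
resource "google_vertex_ai_featurestore" "ml_features" {
  name   = "ml_features" # placeholder
  region = "us-central1"
  online_serving_config {
    fixed_node_count = 2 # size for expected online read QPS
  }
}
```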
### 3. Model Training

**Vertex AI Training**
- Purpose: Managed training with AutoML or custom containers
- Performance: Supports GPUs (A100/V100) and TPU v4 pods
- IAM: `aiplatform.user`, `aiplatform.customCodeServiceAgent`
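Training jobs themselves are launched at runtime, but the service account they run as can be managed in Terraform. A sketch, with placeholder account ID and project:

```hcl
# Hypothetical dedicated service account for training jobs.
resource "google_service_account" "training" {
  account_id   = "vertex-training" # placeholder
  display_name = "Vertex AI training jobs"
}

# Grant the aiplatform.user role listed above to that account.
resource "google_project_iam_member" "training_user" {
  project = "my-project" # placeholder project ID
  role    = "roles/aiplatform.user"
  member  = "serviceAccount:${google_service_account.training.email}"
}
```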
### 4. Model Management

**Vertex AI Model Registry**
- Purpose: Versioned model storage with lineage
- Security: CMEK encryption
- IAM: `aiplatform.modelAdmin`
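The CMEK mentioned above needs a Cloud KMS key in the same region as the model resources. A hedged sketch (key-ring and key names are placeholders):

```hcl
# Hypothetical KMS key for customer-managed encryption of model artifacts.
resource "google_kms_key_ring" "ml" {
  name     = "ml-keyring" # placeholder
  location = "us-central1"
}

resource "google_kms_crypto_key" "model_cmek" {
  name            = "model-cmek"
  key_ring        = google_kms_key_ring.ml.id
  rotation_period = "7776000s" # rotate every 90 days
}
```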
### 5. Serving

**Vertex AI Endpoints**
- Purpose: Fully managed online model serving
- Scalability: Auto-scaling prediction nodes, handles 100K+ RPS
- IAM: `aiplatform.endpointAdmin`
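The endpoint resource itself is lightweight in Terraform; models are deployed to it afterwards (via the console, `gcloud`, or the SDK). A sketch with placeholder names:

```hcl
# Hypothetical serving endpoint; models are deployed to it separately.
resource "google_vertex_ai_endpoint" "serving" {
  name         = "serving-endpoint" # placeholder
  display_name = "serving-endpoint"
  location     = "us-central1"
}
```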
### 6. Orchestration

**Cloud Composer (Airflow)**
- Purpose: Workflow orchestration
- Availability: Regional HA
- IAM: `composer.worker`
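A minimal Composer 2 environment sketch in Terraform; the environment name is a placeholder and the `image_version` alias should be checked against currently supported versions:

```hcl
# Hypothetical Composer 2 environment for pipeline orchestration.
resource "google_composer_environment" "ml_orchestration" {
  name   = "ml-orchestration" # placeholder
  region = "us-central1"
  config {
    software_config {
      image_version = "composer-2-airflow-2" # alias; verify supported versions
    }
  }
}
```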
## ⚙️ Architecture Attributes

### 1. Scalability
- Horizontal: Distributed training (Vertex AI), auto-scaling BigQuery slots
- Vertical: GPU/TPU acceleration for endpoints

### 2. Performance
- Low latency: <10 ms feature reads from Feature Store, Triton-optimized predictions
- High throughput: TB-scale BigQuery query handling

### 3. Availability
- Multi-region: GCS + Vertex AI global footprint
- SLA: Vertex AI (99.9%), BigQuery (99.95%)

### 4. Security
- Encryption: Google-managed or CMEK at rest, TLS 1.3 in transit
- Network: VPC Service Controls (VPC-SC), Private IP, IAM Conditions

### 5. IAM Setup
```hcl
# Terraform example for IAM bindings
resource "google_project_iam_binding" "ml_engineer" {
  project = "my-project" # placeholder project ID
  role    = "roles/aiplatform.user"
  members = ["user:[email protected]"]
}

resource "google_storage_bucket_iam_binding" "data_read" {
  bucket  = "my-ml-data-bucket"
  role    = "roles/storage.objectViewer"
  members = ["serviceAccount:[email protected]"]
}
```
## 🔁 CI/CD Pipeline

```mermaid
graph LR
    A[GitHub/Cloud Source Repos] --> B[Cloud Build]
    B -->|Triggers| C[Vertex AI Pipeline]
    C --> D[Model Registry]
    D --> E[Vertex AI Endpoint]
    E --> F[Cloud Deploy]
    F --> G[Promotion to Prod]
```
**Tools**
- Cloud Build: Build and deploy with `cloudbuild.yaml`
- Vertex AI Pipelines: Kubeflow-based orchestration
- Cloud Deploy: Canary and promotion-based rollouts
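The GitHub-to-Cloud-Build trigger in the diagram can also be declared in Terraform; a sketch, with repository owner/name as placeholders:

```hcl
# Hypothetical trigger: run cloudbuild.yaml on pushes to main.
resource "google_cloudbuild_trigger" "ml_pipeline" {
  name     = "ml-pipeline-trigger" # placeholder
  filename = "cloudbuild.yaml"
  github {
    owner = "my-org"    # placeholder
    name  = "my-ml-repo" # placeholder
    push {
      branch = "^main$"
    }
  }
}
```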
**IAM**
- `cloudbuild.builds.editor`
- `aiplatform.pipelineRunner`
## ⭐ Key Advantages

- Fully Managed: No infrastructure overhead
- MLOps Native: Integrated model monitoring and rollback
- Cost Efficiency: Preemptible/Spot training VMs, auto-scaling inference
💡 Tip: Add Cloud Armor for DDoS protection and Dataflow for streaming ETL in production.
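The Cloud Armor suggestion above can be sketched as a Terraform security policy attached in front of the serving layer; the policy name and IP range are illustrative placeholders:

```hcl
# Hypothetical Cloud Armor policy: block one range, allow everything else.
resource "google_compute_security_policy" "ml_edge" {
  name = "ml-edge-policy" # placeholder

  rule {
    action   = "deny(403)"
    priority = 1000
    match {
      versioned_expr = "SRC_IPS_V1"
      config {
        src_ip_ranges = ["9.9.9.0/24"] # example blocked range
      }
    }
  }

  rule {
    action   = "allow"
    priority = 2147483647 # default catch-all rule
    match {
      versioned_expr = "SRC_IPS_V1"
      config {
        src_ip_ranges = ["*"]
      }
    }
  }
}
```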