# Technology Deep Dive

## 💾 Ingestion & ETL

Structured Streaming + Auto Loader for scalable event ingestion:

```python
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://telecom-cdrs/raw/")
)
```
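To complete the ingestion hop, the stream is typically written out to a Bronze Delta table. A minimal sketch; the checkpoint path and table name are illustrative:

```python
(
    df.writeStream
    # Checkpoint location gives the stream exactly-once recovery semantics
    .option("checkpointLocation", "s3://telecom-cdrs/checkpoints/bronze")
    .toTable("bronze_cdr_raw")  # illustrative Bronze table name
)
```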

Delta Live Tables (DLT) to declaratively define ETL pipelines with SLOs and lineage:

```python
import dlt
from pyspark.sql.functions import col, to_date

@dlt.table(comment="Bronze: raw telecom events")
def bronze_cdr():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://telecom-cdrs/raw/")
    )

@dlt.table(comment="Silver: enriched CDRs")
def silver_cdr():
    # Streaming read, since bronze_cdr is a streaming table
    df = dlt.read_stream("bronze_cdr")
    # The filter and enrichment are illustrative; the original elides the exact logic
    return (
        df.filter(col("call_duration") >= 0)
        .withColumn("call_date", to_date(col("event_timestamp")))
    )
```

Expectations for data quality:

```python
# Attached as a decorator on a table definition such as silver_cdr()
@dlt.expect("valid_duration", "call_duration >= 0")
```

`@dlt.expect` records violations in pipeline metrics while keeping the rows; `@dlt.expect_or_drop` would drop offending rows instead.

## ML Lifecycle with MLflow

Feature Store: persist ML-ready features (e.g., user-level or tower-level aggregations).
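A minimal sketch of persisting such aggregations, assuming the Databricks Feature Store client; the table name, keys, and source DataFrame are illustrative:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# user_features_df holds per-user aggregations computed upstream (illustrative)
fs.create_table(
    name="telecom.user_call_features",  # illustrative feature table name
    primary_keys=["user_id"],
    df=user_features_df,
    description="Per-user call aggregations (drops, duration, latency)",
)
```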

MLflow: Log models, track experiments, compare metrics:

```python
import mlflow
from sklearn.metrics import precision_score

with mlflow.start_run():
    mlflow.log_metric("precision", precision_score(y, y_pred))
    mlflow.sklearn.log_model(model, "model")
```

Model Registry: register models and automate CI/CD promotion with stages such as Staging and Production.
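A minimal sketch of registering a logged model and promoting it, assuming the classic stage-based MLflow Model Registry workflow; the model name and `run_id` are illustrative:

```python
import mlflow
from mlflow.tracking import MlflowClient

# run_id refers to the tracking run that logged the model (illustrative)
version = mlflow.register_model(f"runs:/{run_id}/model", "telecom_churn_model")

client = MlflowClient()
client.transition_model_version_stage(
    name="telecom_churn_model",
    version=version.version,
    stage="Staging",  # later promoted to Production by CI/CD
)
```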

## 🚀 Model Deployment

- Batch inference via scheduled jobs over Delta tables (see the batch-scoring sketch after this list).
- Real-time scoring using:
  - Databricks Model Serving
  - REST endpoints (for call centers, apps; see the client sketch below)
  - Feature lookups joined at inference time
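For the batch path, a registered model can be applied to a Delta table as a Spark UDF. A minimal sketch; the model, table, and column names are illustrative:

```python
import mlflow.pyfunc

# Load the Production model version as a Spark UDF (model name illustrative)
score_udf = mlflow.pyfunc.spark_udf(spark, "models:/telecom_churn_model/Production")

scored = (
    spark.read.table("silver_cdr")
    .withColumn("churn_score", score_udf("call_duration", "dropped_calls"))
)

# Persist predictions to a Gold table for downstream use
scored.write.format("delta").mode("append").saveAsTable("gold_churn_scores")
```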
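For real-time scoring, Databricks Model Serving exposes registered models behind a REST endpoint. A minimal client-side sketch, assuming an illustrative workspace URL, endpoint name, and token:

```python
import requests

response = requests.post(
    # Workspace host and endpoint name are placeholders
    "https://<workspace>.cloud.databricks.com/serving-endpoints/telecom-churn/invocations",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},  # illustrative token variable
    # dataframe_records is the standard MLflow scoring payload format
    json={"dataframe_records": [{"call_duration": 185, "dropped_calls": 2}]},
)
print(response.json())
```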

## 🔄 Closing the Loop (Flywheel Effect)

- Predictions stored back to Gold tables
- Customer actions logged → become new training data (see the sketch after this list)
- Streaming features (e.g., dropped calls, latency) allow feedback within minutes
- Re-train daily/weekly using AutoML or custom jobs
- All changes tracked with Unity Catalog & MLflow for governance
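A minimal sketch of the feedback join that turns logged actions into new training rows; all table and column names are illustrative:

```python
# Join stored predictions with subsequently logged customer actions
predictions = spark.read.table("gold_churn_scores")
outcomes = spark.read.table("gold_customer_actions")  # labeled outcomes (illustrative)

training_rows = (
    predictions.join(outcomes, on="user_id")
    .select("user_id", "call_duration", "dropped_calls", "churn_score", "churned")
)

# Append to the training set consumed by the daily/weekly retraining job
training_rows.write.format("delta").mode("append").saveAsTable("gold_churn_training")
```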