# Technology Deep Dive
## 💾 Ingestion & ETL

Structured Streaming + Auto Loader for scalable event ingestion:
```python
# Incrementally ingest raw JSON CDR files from S3 with Auto Loader
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://telecom-cdrs/raw/")
)
```
Use Delta Live Tables (DLT) to define ETL pipelines declaratively, with SLOs and lineage:
@dlt.table(comment="Bronze: raw telecom events") def bronze_cdr(): return spark.readStream.format("cloudFiles")...
@dlt.table(comment="Silver: enriched CDRs") def silver_cdr(): df = dlt.read("bronze_cdr") return df.filter(...).withColumn(...)
Expectations for data quality:
```python
# Expectations are applied as decorators on a DLT table definition;
# expect_or_drop / expect_or_fail variants enforce stricter policies
@dlt.expect("valid_duration", "call_duration >= 0")
```
## 🧠 ML Lifecycle with MLflow

Feature Store: Persist ML-ready features (e.g., user-level or tower-level aggregations).
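A minimal sketch of publishing those aggregations with the Databricks Feature Store client; the table name `telecom.user_features` and the `user_features_df` DataFrame are illustrative assumptions:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Register a feature table keyed by user_id (name and DataFrame are assumptions)
fs.create_table(
    name="telecom.user_features",
    primary_keys=["user_id"],
    df=user_features_df,  # e.g., per-user call/tower aggregations from silver CDRs
    description="User-level aggregations derived from silver CDRs",
)
```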
MLflow: Log models, track experiments, compare metrics:
```python
import mlflow
from sklearn.metrics import precision_score

with mlflow.start_run():
    # Track evaluation metrics and persist the trained model artifact
    mlflow.log_metric("precision", precision_score(y, y_pred))
    mlflow.sklearn.log_model(model, "model")
```
Model Registry: Register models and automate CI/CD into production using stage tags such as Staging and Production.
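A sketch of the classic registry workflow, assuming a registered model name `churn_model` (the `<run_id>` placeholder comes from the run logged above):

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact logged in the run above (model name is an assumption)
result = mlflow.register_model("runs:/<run_id>/model", "churn_model")

# Promote the new version to Staging; CI/CD checks can later move it to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="churn_model",
    version=result.version,
    stage="Staging",
)
```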
## 🚀 Model Deployment

Batch inference via scheduled jobs over Delta tables:
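A minimal batch-scoring sketch: load the registered model as a Spark UDF and score a Delta table on a schedule. The table and feature column names (`telecom.silver_cdr`, `call_duration`, `dropped_calls`) are illustrative assumptions:

```python
import mlflow.pyfunc

# Load the registered model as a Spark UDF for distributed scoring
predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/churn_model/Production")

# Score the silver Delta table; a scheduled job would run this periodically
scored = (
    spark.read.table("telecom.silver_cdr")
    .withColumn("churn_score", predict_udf("call_duration", "dropped_calls"))
)
```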
Real-time scoring using:

- Databricks Model Serving
- REST endpoints (for call centers, apps) — see the request sketch after this list
- Feature lookups joined at inference time
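A hedged example of invoking a Databricks Model Serving endpoint over REST; the endpoint name `churn-model`, the host, the token, and the feature columns are placeholders for your workspace values:

```python
import requests

# Endpoint URL and token are placeholders
url = "https://<workspace-host>/serving-endpoints/churn-model/invocations"
headers = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

# dataframe_split is the serving API's standard tabular input shape
payload = {
    "dataframe_split": {
        "columns": ["call_duration", "dropped_calls"],
        "data": [[120, 3]],
    }
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())  # e.g., {"predictions": [...]}
```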
## 🔄 Closing the Loop (Flywheel Effect)

- Predictions stored back to Gold tables (write-back sketch after this list)
- Customer actions logged → become new training data
- Streaming features (e.g., dropped calls, latency) allow feedback within minutes
- Re-train daily/weekly using AutoML or custom jobs (AutoML sketch after this list)
- All changes tracked with Unity Catalog & MLflow for governance
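A minimal write-back sketch, assuming the `scored` DataFrame from the batch job above and a hypothetical `telecom.gold_predictions` table with a `user_id` column:

```python
from pyspark.sql.functions import current_timestamp

# Append predictions to a Gold Delta table so downstream jobs
# (and the next training run) can consume them
(
    scored.select("user_id", "churn_score")
    .withColumn("scored_at", current_timestamp())
    .write.format("delta")
    .mode("append")
    .saveAsTable("telecom.gold_predictions")
)
```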
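And a retraining sketch using Databricks AutoML; the training table `telecom.gold_training_data`, target column `churned`, and timeout are assumptions:

```python
from databricks import automl

# Kick off an AutoML classification run over the refreshed training data
train_df = spark.read.table("telecom.gold_training_data")
summary = automl.classify(
    dataset=train_df,
    target_col="churned",
    timeout_minutes=60,
)

# Best trial's model URI can feed back into the registry workflow above
print(summary.best_trial.model_path)
```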