Technology and data processing - grishasen/proof_of_value GitHub Wiki

Key technical stack components


  • Streamlit:
    • Description: Web framework for interactive applications
    • Role: Interactive web application and dashboard interface
  • Polars:
    • Description: High-performance DataFrame library
    • Role: Efficient raw and pre-aggregated data manipulation and processing
  • Plotly:
    • Description: Graphing and plotting library
    • Role: Creating dynamic and interactive visualizations and charts
  • DuckDB:
    • Description: In-process SQL OLAP RDBMS
    • Role: Persistent cache/storage of aggregated data frames
  • PandasAI:
    • Description: Platform for asking questions of your data in natural language
    • Role: Conversational analysis of pre-aggregated interaction history data

Data processing flow

Key steps in processing the input data

  1. Read IH Files/Product Data
    • Discovers new raw files and groups them by date
  2. Core Pre-processing Logic
    • Rename columns for consistency
    • Inject any missing columns, setting default values from config literals or filling nulls
    • Perform initial pre-processing
    • Bundle everything into a single, feature-rich lazy DataFrame per batch
  3. Pre-aggregations & Basic Stats
    • Counts, Sums, Means
    • Variance, StdDev
    • T-Digests
    • Save core aggregations to DuckDB
  4. Calculate Business Metrics
    • Generate reports
    • Calculate business metrics (CTR, CLV)
    • Calculate ML metrics from t-digests (AUC, Precision)
    • Calculate percentiles, variances for EDA
    • Z-Score, Odds ratio, etc.

Core Pre-processing Logic

This phase ensures that every metric calculation starts from a uniform and performant Polars dataset.

  • Apply global filter

filter = """(pl.col("Outcome").is_in(["Pending", "Impression", "Clicked", "Conversion", "NoConversion"]) & pl.col("ModelCategory").is_in(["Bayesian", "Gradient boosting"]) & pl.col("Channel").is_in(["Web", "Mobile", "Email"]))"""

  • Parse Timestamps & Engineer Date Features
    • Converts raw timestamp strings into Datetime columns.
    • Derives Day/Month/Year/Quarter and computes ResponseTime = Outcome – Decision in seconds.
  • Deduplication & Cleanup
    • Ensures each rank-outcome combination in an interaction appears only once (.unique(subset=[INTERACTION_ID, NAME, RANK, OUTCOME])).
    • Drops raw or meta columns not needed downstream ("FactID", "StreamPartition", "Organization", "Unit", "Division", "Component", etc.)
  • Concatenate & Finalize
    • Merges all file‐level frames into one big lazy frame.
    • Optionally applies any further column expressions.
    • Materializes the scan with a streaming or standard collect, then returns to lazy mode so that metric computations can chain onto it.

Application implementation details and lessons learned

Pre-aggregations & Basic Stats

  • Grouping at the lowest level and compacting data using coroutines
  • Each metric performs its own config-driven grouping
    • E.g. for engagement: .group_by(mand_props_grp_by) where mand_props_grp_by = config['group_by'] + [MODELCONTROLGROUP]
  • Calculate metric-specific basic stats
  • For ML metrics:
    • Group-level Personalization and Novelty scores (weighted averages are used in the plots)
    • **T-digests** if enabled in the config file; otherwise calculate group-level ROC AUC and Average Precision
  • For engagement, experiment and conversion
    • Positive and Negative response counts
    • Sums for the Revenue property
  • For descriptive metrics
    • Count, Mean, Variance, Sum
    • T-digests for Percentiles (p25, p50, p75, p90) and Bowley’s skewness
  • For CLV: counts and sums

Report Generation & Visualization

  • Declarative Reports for business and technical KPIs
  • Supported report types
    • line/bar, bar_polar, treemap, heatmap for all metrics
    • scatter, histogram for specific metrics
  • Global filters across all reports and per-report filters using Streamlit

Business and Technical KPIs

  • CTR, Conversion Rate, Lift vs Random Action, Lift vs Control, Revenue, Action volumes
  • ROC AUC, average precision, personalization
  • Z-Score, G-stat, Odds ratio, Confidence intervals
  • Percentiles per property (e.g. Propensity, FinalPropensity, Priority)
  • Recency, Frequency, Average monetary value, Lifetime Value
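A few of these KPIs follow directly from the pre-aggregated response counts; the formulas below are the standard two-proportion versions, with illustrative numbers:

```python
import math

# Test (personalized) vs control group counts (illustrative)
test_clicks, test_impressions = 260, 5000
ctrl_clicks, ctrl_impressions = 180, 5000

ctr_test = test_clicks / test_impressions   # CTR
ctr_ctrl = ctrl_clicks / ctrl_impressions
lift_vs_control = ctr_test / ctr_ctrl - 1   # Lift vs Control

# Two-proportion z-score with pooled proportion
p = (test_clicks + ctrl_clicks) / (test_impressions + ctrl_impressions)
se = math.sqrt(p * (1 - p) * (1 / test_impressions + 1 / ctrl_impressions))
z_score = (ctr_test - ctr_ctrl) / se

# Odds ratio of clicking in test vs control
odds_ratio = (test_clicks / (test_impressions - test_clicks)) / (
    ctrl_clicks / (ctrl_impressions - ctrl_clicks)
)
print(round(lift_vs_control, 3), round(z_score, 2), round(odds_ratio, 2))
```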