Technology and data processing - grishasen/proof_of_value GitHub Wiki

Key technical stack components


  • Streamlit:
    • Description: Web framework for interactive applications
    • Role: Interactive web application and dashboard interface
  • Polars:
    • Description: High-performance DataFrame library
    • Role: Efficient raw and pre-aggregated data manipulation and processing
  • Plotly:
    • Description: Graphing and plotting library
    • Role: Creating dynamic and interactive visualizations and charts
  • DuckDB:
    • Description: In-process SQL OLAP RDBMS
    • Role: Persistent cache/storage of aggregated data frames
  • PandasAI:
    • Description: Platform for asking questions of your data in natural language
    • Role: Conversational analysis of pre-aggregated interaction history data

Data processing flow

Key steps in processing the input data

  1. Read IH Files/Product Data
    • Discovers new raw files and groups them by date
  2. Core Pre-processing Logic
    • Rename columns for consistency
    • Inject any missing columns, setting default values from config literals or filling nulls
    • Perform initial pre-processing
    • Bundle everything into a single, feature-rich lazy DataFrame per batch
  3. Pre-aggregations & Basic Stats
    • Counts, Sums, Means
    • Variance, StdDev
    • T-Digests
    • Save core aggregations to DuckDB
  4. Calculate Business Metrics
    • Generate reports
    • Calculate business metrics (CTR, CLV)
    • Calculate ML metrics from t-digests (AUC, Precision)
    • Calculate percentiles, variances for EDA
    • Z-Score, Odds ratio, etc.

Core Pre-processing Logic

This phase ensures that every metric calculation starts from a uniform and performant Polars dataset.

  • Apply global filter

filter = """(pl.col("Outcome").is_in(["Pending", "Impression", "Clicked", "Conversion", "NoConversion"]) & pl.col("ModelCategory").is_in(["Bayesian", "Gradient boosting"]) & pl.col("Channel").is_in(["Web", "Mobile", "Email"]))"""

  • Parse Timestamps & Engineer Date Features
    • Converts raw timestamp strings into Datetime columns.
    • Derives Day/Month/Year/Quarter and computes ResponseTime = Outcome – Decision in seconds.
  • Deduplication & Cleanup
    • Ensures each rank-outcome combination in an interaction appears only once (.unique(subset=[INTERACTION_ID, NAME, RANK, OUTCOME])).
    • Drops raw or meta columns not needed downstream ("FactID", "StreamPartition", "Organization", "Unit", "Division", "Component", etc.)
  • Concatenate & Finalize
    • Merges all file‐level frames into one big lazy frame.
    • Optionally applies any further column expressions.
    • Materializes the scan with a streaming or standard collect, then returns to lazy mode so that metric computations can chain onto it.

Application implementation details and lessons learned

Pre-aggregations & Basic Stats

  • Grouping at the lowest level and compacting data using coroutines
  • Each metric performs its own config-driven grouping
    • E.g. for engagement: .group_by(mand_props_grp_by) where mand_props_grp_by = config['group_by'] + [MODELCONTROLGROUP]
  • Calculate metric-specific basic stats
  • For ML metrics:
    • Group-level Personalization and Novelty scores (weighted averages are used in the plots)
    • **T-digests** if enabled in the config file; otherwise calculate group-level ROC AUC and Average Precision
  • For engagement, experiment and conversion
    • Positive and Negative response counts
    • Sums for the Revenue property
  • For descriptive metrics
    • Count, Mean, Variance, Sum
    • T-digests for Percentiles (p25, p50, p75, p90) and Bowley’s skewness
  • For CLV: counts and sums

Report Generation & Visualization

  • Declarative Reports for business and technical KPIs
  • Supported report types
    • line/bar, bar_polar, treemap, heatmap for all metrics
    • scatter, histogram for specific metrics
  • Global filters across all reports and per-report filters using Streamlit

Business and Technical KPIs

  • CTR, Conversion Rate, Lift vs Random Action, Lift vs Control, Revenue, Action volumes
  • ROC AUC, average precision, personalization
  • Z-Score, G-stat, Odds ratio, Confidence intervals
  • Percentiles per property (e.g. Propensity, FinalPropensity, Priority)
  • Recency, Frequency, Average monetary value, Lifetime Value
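A few of these KPIs follow directly from the pre-aggregated response counts; the formulas below are the standard two-proportion versions, with illustrative numbers:

```python
import math

# Test (personalized) vs control group counts (illustrative)
test_clicks, test_impressions = 260, 5000
ctrl_clicks, ctrl_impressions = 180, 5000

ctr_test = test_clicks / test_impressions   # CTR
ctr_ctrl = ctrl_clicks / ctrl_impressions
lift_vs_control = ctr_test / ctr_ctrl - 1   # Lift vs Control

# Two-proportion z-score with pooled proportion
p = (test_clicks + ctrl_clicks) / (test_impressions + ctrl_impressions)
se = math.sqrt(p * (1 - p) * (1 / test_impressions + 1 / ctrl_impressions))
z_score = (ctr_test - ctr_ctrl) / se

# Odds ratio of clicking in test vs control
odds_ratio = (test_clicks / (test_impressions - test_clicks)) / (
    ctrl_clicks / (ctrl_impressions - ctrl_clicks)
)
print(round(lift_vs_control, 3), round(z_score, 2), round(odds_ratio, 2))
```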