Technology and data processing - grishasen/proof_of_value GitHub Wiki
Key technical stack components
- Streamlit:
- Description: Web framework for interactive applications
- Role: Interactive web application and dashboard interface
- Polars:
- Description: High-performance DataFrame library
- Role: Efficient raw and pre-aggregated data manipulation and processing
- Plotly:
- Description: Graphing and plotting library
- Role: Creating dynamic and interactive visualizations and charts
- DuckDB:
- Description: In-process SQL OLAP RDBMS
- Role: Persistent cache/storage of aggregated data frames
- PandasAI:
- Description: Platform for asking questions about your data in natural language
- Role: Conversational analysis of pre-aggregated interaction history data
Data processing flow
Key steps while processing input data
- Read IH Files/Product Data
- Discovers new raw files by date-grouping them
- Core Pre-processing Logic
- Rename columns for consistency
- Inject any missing columns and set default column values from the config, either as literals or by filling nulls
- Perform initial pre-processing
- Bundles everything into a single, feature-rich lazy DataFrame per batch.
- Pre-aggregations & Basic Stats
- Counts, Sums, Means
- Variance, StdDev
- T-Digests
- Save core aggregations to DuckDB
- Calculate Business Metrics
- Generate reports
- Calculate business metrics (CTR, CLV)
- Calculate ML metrics from t-digests (AUC, Precision)
- Calculate percentiles, variances for EDA
- Z-Score, Odds ratio etc
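Storing counts, means and variances per batch is what allows later steps to recompute overall statistics without rescanning raw files. The following is a minimal sketch of that idea using the standard pairwise merge formula for mean and variance. The `PartialStats` structure and the `summarize` and `merge` helpers are hypothetical names introduced for illustration; the application's actual storage layout may differ.

```python
from dataclasses import dataclass
import statistics


@dataclass
class PartialStats:
    """Pre-aggregated statistics for one batch (hypothetical structure)."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the batch mean

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")


def summarize(values: list[float]) -> PartialStats:
    """Reduce one batch of raw values to count, mean and m2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return PartialStats(n, mean, m2)


def merge(a: PartialStats, b: PartialStats) -> PartialStats:
    """Combine two partial aggregates without access to the raw values."""
    n = a.count + b.count
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta ** 2 * a.count * b.count / n
    return PartialStats(n, mean, m2)


# Two batches of made-up propensity values.
batch1 = [0.10, 0.40, 0.35]
batch2 = [0.80, 0.20]
merged = merge(summarize(batch1), summarize(batch2))
reference_variance = statistics.variance(batch1 + batch2)
```

The merged variance matches the variance computed over the concatenated raw values, which is why only the compact aggregates need to be persisted.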
Core Pre-processing Logic
This phase ensures that every metric calculation starts from a uniform and performant Polars dataset.
- Apply global filter
filter = """(pl.col("Outcome").is_in(["Pending", "Impression", "Clicked", "Conversion", "NoConversion"]) &
pl.col("ModelCategory").is_in(["Bayesian", "Gradient boosting"]) &
pl.col("Channel").is_in(["Web", "Mobile", "Email"]))"""
- Parse Timestamps & Engineer Date Features
- Converts raw timestamp strings into Datetime columns.
- Derives Day/Month/Year/Quarter and computes ResponseTime = Outcome - Decision, in seconds.
- Deduplication & Cleanup
- Ensures each rank-outcome combination in an interaction appears only once (.unique(subset=[INTERACTION\_ID, NAME, RANK, OUTCOME])).
- Drops raw or meta columns not needed downstream ("FactID", "StreamPartition", "Organization", "Unit", "Division", "Component", etc.)
- Concatenate & Finalize
- Merges all file‐level frames into one big lazy frame.
- Optionally applies any further column expressions.
- Materializes the scan with a streaming or standard collect, then returns to lazy mode so that metric computations can chain onto it.
Application implementation details and lessons learned
Pre-aggregations & Basic Stats
- Grouping at lowest level and compacting data using coroutines
- Each metric performs its own config-driven grouping
- E.g. for engagement: .group\_by(mand\_props\_grp\_by), where mand\_props\_grp\_by = config['group\_by'] + [MODELCONTROLGROUP]
- Calculate metric-specific basic stats
- For ML metrics:
- Group level Personalization, Novelty score (use weighted average in the plots)
- **T-digests** if allowed in the config file; otherwise calculate group-level ROC AUC and Average Precision
- For engagement, experiment and conversion
- Positive and Negative response counts.
- Sums for Revenue property
- For descriptive metrics
- Count, Mean, Variance, Sum
- T-digests for Percentiles (p25, p50, p75, p90) and Bowley’s skewness
- For CLV: counts and sums
Report Generation & Visualization
- Declarative Reports for business and technical KPIs
- Supported report types
- line/bar, bar\_polar, treemap, heatmap for all metrics
- scatter, histogram for specific metrics
- Global filters across all reports and per-report filters using Streamlit
Business and Technical KPIs
- CTR, Conversion Rate, Lift vs Random Action, Lift vs Control, Revenue, Action volumes
- ROC AUC, average precision, personalization
- Z-Score, G-stat, Odds ratio, Confidence intervals
- Percentiles per property (e.g. Propensity, FinalPropensity, Priority)
- Recency, Frequency, Average monetary value, Lifetime Value
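Several of these KPIs can be derived directly from the pre-aggregated positive and negative counts. The sketch below shows one common way to compute CTR, lift versus control, a two-proportion Z-score, and an odds ratio with a 95% confidence interval. The function names and the sample counts are hypothetical, and the lift definition used here (relative uplift over the control rate) is an assumption; the application may define these metrics differently.

```python
import math


def ctr(positives: int, negatives: int) -> float:
    """Click-through rate from positive and negative response counts."""
    return positives / (positives + negatives)


def lift(test_rate: float, control_rate: float) -> float:
    """Relative uplift of the test rate over the control rate (assumed definition)."""
    return (test_rate - control_rate) / control_rate


def two_proportion_z(pos_t: int, neg_t: int, pos_c: int, neg_c: int) -> float:
    """Z-score for the difference between two proportions, using a pooled rate."""
    n_t, n_c = pos_t + neg_t, pos_c + neg_c
    pooled = (pos_t + pos_c) / (n_t + n_c)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    return (pos_t / n_t - pos_c / n_c) / std_err


def odds_ratio_ci(pos_t: int, neg_t: int, pos_c: int, neg_c: int, z: float = 1.96):
    """Odds ratio with a normal-approximation interval on the log scale."""
    odds_ratio = (pos_t * neg_c) / (neg_t * pos_c)
    se_log = math.sqrt(1 / pos_t + 1 / neg_t + 1 / pos_c + 1 / neg_c)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Made-up counts: test group 120 clicks / 880 non-clicks, control 100 / 900.
test_ctr = ctr(120, 880)
control_ctr = ctr(100, 900)
uplift = lift(test_ctr, control_ctr)
z_score = two_proportion_z(120, 880, 100, 900)
odds_ratio, ci_low, ci_high = odds_ratio_ci(120, 880, 100, 900)
```

With these sample counts the Z-score stays below 1.96 and the odds-ratio interval contains 1, so the two views agree that the difference is not significant at the 5% level.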