Analytic Query on OLTP (Bad Idea) vs Same Query in a Data Warehouse - RogerThattt/What-s-Driving-the-Explosion-in-OLTP-Systems- GitHub Wiki

Great point — this is one of the core reasons data warehouses exist. Let’s walk through an example from a telecom company to make this difference very clear.

🔍 Key Idea:

OLTP systems are optimized for quick, small transactions.
Data warehouses (OLAP) are optimized for large, complex queries.

🏢 Use Case: Telecom Call Detail Records (CDRs)

Let’s say you work for a telecom provider. You want to analyze network usage trends.

🔹 OLTP System (Operational DB)

Stores real-time CDRs: Each row = 1 phone call
Schema: cdr(id, customer_id, call_start, call_end, duration, tower_id, charge)
Millions of new rows every day
Indexed by id, maybe customer_id for quick lookups

❌ Analytic Query on OLTP (Bad Idea)

Now imagine you run this query directly on the OLTP system:

SELECT
  tower_id,
  AVG(duration) AS avg_duration,
  COUNT(*) AS total_calls
FROM cdr
WHERE call_start BETWEEN '2025-05-01' AND '2025-05-31'
GROUP BY tower_id;

This is an analytical query:

Scans millions of rows
Groups them
Calculates aggregates

🔻 Problem:
OLTP indexes like B-tree on id or customer_id don’t help with this.
The system is designed to read/write small numbers of rows, not do full-table scans or heavy group-bys.

✅ Same Query in a Data Warehouse

In the data warehouse (e.g., Databricks, Snowflake, BigQuery), you’ve:

Ingested and partitioned CDRs by date/tower
Possibly materialized aggregates or used columnar formats like Parquet
Optimized for scans, group-bys, joins, and sorting

Same query in a warehouse:

SELECT
  tower_id,
  AVG(duration) AS avg_duration,
  COUNT(*) AS total_calls
FROM cdr_warehouse
WHERE call_start BETWEEN '2025-05-01' AND '2025-05-31'
GROUP BY tower_id;

🚀 Performance is drastically better:

Uses columnar storage to skip irrelevant columns

Scans only May partitions

Uses query caching, vectorized execution, and possibly pre-aggregated data

⚙️ Why Indexing Matters

Operation	OLTP Indexing (e.g., B-Tree)	OLAP Optimizations
Fetch 1 customer call log	✅ Very fast with customer_id index	❌ Not optimized for single-row lookups
Count total calls per tower	❌ Needs full scan	✅ Partition pruning + vectorization
Average call duration by date	❌ Not optimized	✅ Columnar scan + pre-aggregation

Would you like a visual or diagram showing OLTP vs OLAP execution paths for this telecom example?