Analytic Query on OLTP (Bad Idea) vs Same Query in a Data Warehouse - RogerThattt/What-s-Driving-the-Explosion-in-OLTP-Systems- GitHub Wiki
Great point β this is one of the core reasons data warehouses exist. Letβs walk through an example from a telecom company to make this difference very clear.
OLTP systems are optimized for quick, small transactions.
Data warehouses (OLAP) are optimized for large, complex queries.
Letβs say you work for a telecom provider. You want to analyze network usage trends.
-
Stores real-time CDRs: Each row = 1 phone call
-
Schema:
cdr(id, customer_id, call_start, call_end, duration, tower_id, charge)
-
Millions of new rows every day
-
Indexed by
id
, maybecustomer_id
for quick lookups
Now imagine you run this query directly on the OLTP system:
SELECT
tower_id,
AVG(duration) AS avg_duration,
COUNT(*) AS total_calls
FROM cdr
WHERE call_start BETWEEN '2025-05-01' AND '2025-05-31'
GROUP BY tower_id;
This is an analytical query:
-
Scans millions of rows
-
Groups them
-
Calculates aggregates
π» Problem:
OLTP indexes like B-tree onid
orcustomer_id
donβt help with this.
The system is designed to read/write small numbers of rows, not do full-table scans or heavy group-bys.
In the data warehouse (e.g., Databricks, Snowflake, BigQuery), youβve:
-
Ingested and partitioned CDRs by date/tower
-
Possibly materialized aggregates or used columnar formats like Parquet
-
Optimized for scans, group-bys, joins, and sorting
Same query in a warehouse:
SELECT
tower_id,
AVG(duration) AS avg_duration,
COUNT(*) AS total_calls
FROM cdr_warehouse
WHERE call_start BETWEEN '2025-05-01' AND '2025-05-31'
GROUP BY tower_id;
π Performance is drastically better:
Uses columnar storage to skip irrelevant columns
Scans only May partitions
Uses query caching, vectorized execution, and possibly pre-aggregated data
Operation | OLTP Indexing (e.g., B-Tree) | OLAP Optimizations |
---|---|---|
Fetch 1 customer call log | β Very fast with customer_id index | β Not optimized for single-row lookups |
Count total calls per tower | β Needs full scan | β Partition pruning + vectorization |
Average call duration by date | β Not optimized | β Columnar scan + pre-aggregation |
Would you like a visual or diagram showing OLTP vs OLAP execution paths for this telecom example?