Analytic Query on OLTP (Bad Idea) vs Same Query in a Data Warehouse - RogerThattt/What-s-Driving-the-Explosion-in-OLTP-Systems- GitHub Wiki

Great point β€” this is one of the core reasons data warehouses exist. Let’s walk through an example from a telecom company to make this difference very clear.


πŸ” Key Idea:

OLTP systems are optimized for quick, small transactions.
Data warehouses (OLAP) are optimized for large, complex queries.


🏒 Use Case: Telecom Call Detail Records (CDRs)

Let’s say you work for a telecom provider. You want to analyze network usage trends.

πŸ”Ή OLTP System (Operational DB)

  • Stores real-time CDRs: Each row = 1 phone call

  • Schema: cdr(id, customer_id, call_start, call_end, duration, tower_id, charge)

  • Millions of new rows every day

  • Indexed by id, maybe customer_id for quick lookups


❌ Analytic Query on OLTP (Bad Idea)

Now imagine you run this query directly on the OLTP system:

SELECT
  tower_id,
  AVG(duration) AS avg_duration,
  COUNT(*) AS total_calls
FROM cdr
WHERE call_start BETWEEN '2025-05-01' AND '2025-05-31'
GROUP BY tower_id;

This is an analytical query:

  • Scans millions of rows

  • Groups them

  • Calculates aggregates

πŸ”» Problem:
OLTP indexes like B-tree on id or customer_id don’t help with this.
The system is designed to read/write small numbers of rows, not do full-table scans or heavy group-bys.


βœ… Same Query in a Data Warehouse

In the data warehouse (e.g., Databricks, Snowflake, BigQuery), you’ve:

  • Ingested and partitioned CDRs by date/tower

  • Possibly materialized aggregates or used columnar formats like Parquet

  • Optimized for scans, group-bys, joins, and sorting

Same query in a warehouse:

SELECT
  tower_id,
  AVG(duration) AS avg_duration,
  COUNT(*) AS total_calls
FROM cdr_warehouse
WHERE call_start BETWEEN '2025-05-01' AND '2025-05-31'
GROUP BY tower_id;

πŸš€ Performance is drastically better:

  • Uses columnar storage to skip irrelevant columns

  • Scans only May partitions

  • Uses query caching, vectorized execution, and possibly pre-aggregated data


βš™οΈ Why Indexing Matters

Operation OLTP Indexing (e.g., B-Tree) OLAP Optimizations
Fetch 1 customer call log βœ… Very fast with customer_id index ❌ Not optimized for single-row lookups
Count total calls per tower ❌ Needs full scan βœ… Partition pruning + vectorization
Average call duration by date ❌ Not optimized βœ… Columnar scan + pre-aggregation

Would you like a visual or diagram showing OLTP vs OLAP execution paths for this telecom example?

⚠️ **GitHub.com Fallback** ⚠️