Data Warehouse concepts - sachit914/datawarehouse GitHub Wiki

topics

section 1

what is data warehouse


"What is a datawarehouse?" "How does it look like?" "How is it different from databases?" "How is it different from spark?" "Rules of data warehouse?"

datawarehouse vs database

datawarehuse

  • store large volume of data
  • designed to read heavy operations
  • high latency (High latency indicates longer delays in data retrieval or processing, )
  • de-normalized (data Redundancy)
  • columnar storage (parquet orc)
  • parallel processing of request
  • Olap (Online Analytical Processing) (Snowflake redshift)

database

  • stores small volume of data
  • designed for write heavy operation
  • low latency (Low latency means quick access to data with minimal delay.)
  • highly normalizzed (low redundancy)
  • row based
  • not optimized for parallel processing
  • oltp (Online Transaction Processing) (mysql,oracle,postgress)

dataWarehouse is built upon database itself but both have diffrent use case

how datawarehouse is it diffeerent from spark

  • in spark we can do data streaming,AI Ml, can perform on semi-structured , structure,un-structures data (in datawarehouse all this are not available)
  • spark is cheap

rules of data warehouse

  • integrated (data is generated from diffirent data source ) , all this data are dumped into one place
  • time variant (can store historical data)
  • Separation of Transactional and Analytical Processing: A data warehouse is optimized for read-intensive operations, unlike transactional systems which are optimized for write-intensive operations.

This separation ensures that the performance of the operational systems is not affected by the analytical queries.

why do we need data warehouse


Reliance Mart

Working at headquarters in Mumbai.

We have total 1000 stores across 20 states.

Products (SKV)

→ Grocery

→ Dairy

→ Beauty

→ Bakery

→ Frozen food

each store has 1000 individual products. we want to maximize the profit

general solution

  • keep cost and inventory optimized
  • vendor costs as low as possible
  • sale as much as you can

promotional activity

  • 50% discount on lifeboy soap

problem statement

  • as a business user i want to analyze which products are selling in which store under what promotional activity

brute force approach

image

  • we have stores in different zone
  • in each zone they have their own pos (point of billing)
  • all the billing data is stores in db ( this are called oltp database which is used for immediate transaction)
  • datawarehouse is llap service

Data Lake

problem : for data warehouse we send structured data , so for that we peroform some etl jobs (extract transform load) ,transform the data to structured format and store those data in data warehouse

so what if we generate lot of data with high volume , high velocity and and varity (3v's) that time we cant transform data before moving data to warehouse , because transforming will increase the load and take lot of time for processing

so in this case we use data lake

  • here we dump data , later when we want we will perform etl process or transform data as requirement
  • we can dump any data ,structures, semi structured , un-structured data
  • data lake can be done horizontal scaling

old architecture

image

all datas from differnt sources are dumped into data lake based on buisness analyst requirement we perform etl (transform data based on requirement) and save the structured data to datawarehouse . now business analysts can pull those structured data

modern architecture

image

injection layer
  • all data from source are dumped into injection layer
  • the data in injection layer will be in raw format (there wont be any transformation or data cleaning) ,how data is generated like that only we will dump
processing layer
  • in this layer we can processing capacity
  • we will add clusters for processing (clusters will have its own ram storage etc)
processed layer
  • this layer will have output of processing layer
  • in this layer will give permission of data analayst to acess data
consumption layer
  • where data can be utilized

the processed data we can store in dataWareHouse or DataMart for further utilization

  • BI useres can fetch data from datawarehouse or datamart for analyisis purpose