Data Warehouse concepts - sachit914/datawarehouse GitHub Wiki
topics
section 1
what is data warehouse
"What is a datawarehouse?" "How does it look like?" "How is it different from databases?" "How is it different from spark?" "Rules of data warehouse?"
datawarehouse vs database
datawarehuse
- store large volume of data
- designed to read heavy operations
- high latency (High latency indicates longer delays in data retrieval or processing, )
- de-normalized (data Redundancy)
- columnar storage (parquet orc)
- parallel processing of request
- Olap (Online Analytical Processing) (Snowflake redshift)
database
- stores small volume of data
- designed for write heavy operation
- low latency (Low latency means quick access to data with minimal delay.)
- highly normalizzed (low redundancy)
- row based
- not optimized for parallel processing
- oltp (Online Transaction Processing) (mysql,oracle,postgress)
dataWarehouse is built upon database itself but both have diffrent use case
how datawarehouse is it diffeerent from spark
- in spark we can do data streaming,AI Ml, can perform on semi-structured , structure,un-structures data (in datawarehouse all this are not available)
- spark is cheap
rules of data warehouse
- integrated (data is generated from diffirent data source ) , all this data are dumped into one place
- time variant (can store historical data)
- Separation of Transactional and Analytical Processing: A data warehouse is optimized for read-intensive operations, unlike transactional systems which are optimized for write-intensive operations.
This separation ensures that the performance of the operational systems is not affected by the analytical queries.
why do we need data warehouse
Reliance Mart
Working at headquarters in Mumbai.
We have total 1000 stores across 20 states.
Products (SKV)
→ Grocery
→ Dairy
→ Beauty
→ Bakery
→ Frozen food
each store has 1000 individual products. we want to maximize the profit
general solution
- keep cost and inventory optimized
- vendor costs as low as possible
- sale as much as you can
promotional activity
- 50% discount on lifeboy soap
problem statement
- as a business user i want to analyze which products are selling in which store under what promotional activity
brute force approach
- we have stores in different zone
- in each zone they have their own pos (point of billing)
- all the billing data is stores in db ( this are called oltp database which is used for immediate transaction)
- datawarehouse is llap service
Data Lake
problem : for data warehouse we send structured data , so for that we peroform some etl jobs (extract transform load) ,transform the data to structured format and store those data in data warehouse
so what if we generate lot of data with high volume , high velocity and and varity (3v's) that time we cant transform data before moving data to warehouse , because transforming will increase the load and take lot of time for processing
so in this case we use data lake
- here we dump data , later when we want we will perform etl process or transform data as requirement
- we can dump any data ,structures, semi structured , un-structured data
- data lake can be done horizontal scaling
old architecture
all datas from differnt sources are dumped into data lake based on buisness analyst requirement we perform etl (transform data based on requirement) and save the structured data to datawarehouse . now business analysts can pull those structured data
modern architecture
injection layer
- all data from source are dumped into injection layer
- the data in injection layer will be in raw format (there wont be any transformation or data cleaning) ,how data is generated like that only we will dump
processing layer
- in this layer we can processing capacity
- we will add clusters for processing (clusters will have its own ram storage etc)
processed layer
- this layer will have output of processing layer
- in this layer will give permission of data analayst to acess data
consumption layer
- where data can be utilized
the processed data we can store in dataWareHouse or DataMart for further utilization
- BI useres can fetch data from datawarehouse or datamart for analyisis purpose