Data warehousing

Introduction

Intro to data warehousing, Oracle

A data warehouse (DW) is a database designed to enable business intelligence activities. The goal is to consolidate data from all sources and become the single source of truth. A DW is much more read-oriented (query and analysis) as opposed to write/update-oriented (transactions). In addition to a relational database, it includes extraction, transportation, transformation, loading (ETL), statistical analysis, reporting, data mining and other activities. Key characteristics of a DW are: subject-oriented, integrated (resolves naming conflicts and inconsistencies across sources), nonvolatile (once entered, data should not change), and time-variant (focus on change over time). ETL is a main focus of modern DW design; most or all transformation is performed on the host database (an approach often referred to as ELT).
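The extract/transform/load flow can be sketched in a few lines of Python. This is a minimal, illustrative sketch only; the table name sales_fact, the field names, and the cleanup rules are assumptions, not part of any particular product.

```python
# Minimal ETL sketch (illustrative only): extract rows from an operational
# source, apply a small transformation, and load them into a warehouse table.
import sqlite3

def extract():
    # Extract: in a real pipeline this would query OLTP databases or read
    # flat-file exports; a couple of inline rows stand in for that here.
    return [
        {"region": " east ", "sale_date": "2024-01-03", "amount": "120.50"},
        {"region": "West",   "sale_date": "2024-01-03", "amount": ""},       # incomplete record
        {"region": "west",   "sale_date": "2024-01-04", "amount": "80.00"},
    ]

def transform(rows):
    # Transform: normalize values and drop records unusable for analysis.
    clean = []
    for r in rows:
        if not r["amount"]:
            continue
        clean.append({
            "region": r["region"].strip().upper(),
            "sale_date": r["sale_date"],
            "amount": float(r["amount"]),
        })
    return clean

def load(rows, conn):
    # Load: bulk insert into the read-oriented warehouse table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(region TEXT, sale_date TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales_fact VALUES (:region, :sale_date, :amount)",
        rows,
    )
    conn.commit()

warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse database
load(transform(extract()), warehouse)
```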

A data mart is a limited-scope DW that may serve one department or line of business (much faster to query than the full warehouse). There are two types (a small sketch follows the list below):

  • independent (directly from source data) - might be inconsistent across several marts
  • dependent (fed from warehouse) - no inconsistencies, but requires pre-existing warehouse
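A minimal sketch of a dependent data mart, with a SQLite database standing in for the warehouse; the table and mart names are illustrative assumptions.

```python
# Sketch of a dependent data mart: a department-scoped slice fed from the
# central warehouse (an independent mart would load from source systems instead).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    ("EAST", "2024-01-03", 120.5),
    ("WEST", "2024-01-04", 80.0),
])

# Dependent mart: derived entirely from the warehouse, so definitions stay consistent.
conn.execute("CREATE TABLE sales_mart_east AS "
             "SELECT sale_date, amount FROM sales_fact WHERE region = 'EAST'")

print(conn.execute("SELECT * FROM sales_mart_east").fetchall())
```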

An operational data store (ODS) holds clean and validated data, but is not historically deep (e.g. one day). An ODS supports daily operations and can be used as a source to load the DW.

A dimension represents a category, such as region, time, or product type. A fact contains measurements (mostly numeric data such as grades or wages) and is related to dimensions. A cube is a multi-dimensional matrix of data that has multiple dimensions (independent variables) and measures (dependent variables), created by an OLAP system.
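A small sketch of the fact/dimension/cube idea, assuming pandas is available; the column names and values are made up for illustration.

```python
# Sketch of facts, dimensions, and a cube using pandas (assumed installed).
# Each row is a fact carrying the measure "amount"; region and month are dimensions.
import pandas as pd

facts = pd.DataFrame({
    "region": ["EAST", "EAST", "WEST", "WEST"],
    "month":  ["2024-01", "2024-02", "2024-01", "2024-02"],
    "amount": [120.5, 95.0, 80.0, 110.0],
})

# Pivoting over two dimensions yields a (two-dimensional) cube of the measure.
cube = facts.pivot_table(values="amount", index="region",
                         columns="month", aggfunc="sum")
print(cube)
```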

Online transaction processing (OLTP) vs data warehouse (online analytical processing, OLAP)

| Subject | OLTP | Data warehouse |
|---|---|---|
| Source of data | operational data | consolidated data coming from various OLTP databases |
| Workload | supports only predefined operations; application might be tuned or configured to support only these operations | designed to accommodate ad hoc queries, because the workload might not be known in advance |
| Schema design | often fully normalized schemas to optimize update/insert/delete performance and guarantee data consistency | often partially denormalized schemas to optimize query and analytical performance |
| Data modification | end users routinely make modifications; data is always up-to-date and reflects the current state of the business | updated regularly, e.g. nightly or weekly, using bulk data modification techniques; end users do not update data except when using analytical tools |
| Typical operations | a handful of records; short and fast inserts and updates initiated by end users | scans of thousands or millions of rows; periodic long-running batch jobs refresh the data |
| Historical data | usually stores data for weeks or months; historical data only as needed to meet the requirements of the current transaction | usually stores data for months or years |
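The "typical operations" contrast can be illustrated with two queries against a toy SQLite database; the orders table and its data are assumptions made up for this sketch.

```python
# Contrast of typical operations (illustrative): an OLTP-style point update
# touches one row, while an OLAP-style query scans and aggregates many rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                 [("EAST", 10.0 * i) for i in range(1000)])

# OLTP: short, targeted write initiated by an end user or application.
conn.execute("UPDATE orders SET amount = 42.0 WHERE id = ?", (17,))

# OLAP: long scan aggregating many rows for analysis.
rows = conn.execute(
    "SELECT region, COUNT(*) AS order_count, SUM(amount) AS total "
    "FROM orders GROUP BY region"
).fetchall()
print(rows)
```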

Architectures

An Enterprise Data Warehouse (EDW) is the consolidated storage of raw data at the center of a data warehousing architecture.

| Type | Description |
|---|---|
| basic | simple architecture keeps all data (metadata, raw data, summary data) in the middle between data sources and end users; summaries are pre-computed results of common, expensive, long-running operations |
| with a staging area | data from sources is cleaned and processed before it is put into the warehouse; this can be done programmatically, although many DWs use a staging area instead |
| with a staging area and data marts | further customization using data marts to separate data between end users based on line of business or other criteria |
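A hedged sketch of pre-computed summary data, again using SQLite as a stand-in for the warehouse; the fact and summary table names are illustrative.

```python
# Sketch of pre-computed summary data: an expensive aggregate is materialized
# into its own table so repeated analytical queries read the summary instead.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    ("EAST", "2024-01-03", 120.5),
    ("EAST", "2024-01-04", 95.0),
    ("WEST", "2024-01-03", 80.0),
])

# Pre-compute the common, long-running aggregation once (e.g. during the nightly load).
conn.execute("CREATE TABLE sales_by_region AS "
             "SELECT region, SUM(amount) AS total FROM sales_fact GROUP BY region")

# End-user queries hit the small summary table instead of scanning the raw facts.
print(conn.execute("SELECT * FROM sales_by_region").fetchall())
```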

Single-layer

Also called a virtual DWH; it is not frequently used in practice. The goal is to minimize the amount of data stored, which is achieved by removing data redundancies. Its main weakness is the absence of separation between analytical and transactional processing (queries affect regular transactional activities). A reasonable approach only if analytical needs are restricted and data volume is huge.

Two-layer

The source layer is separate from the DWH and represents heterogeneous sources of data (transactional databases, flat files, etc.).

The data warehouse layer contains all information.

Additional layers:

  • data staging layer - mainly represents ETL tools; prepares data (removes inconsistencies, gaps, redundancies) to be merged into one common schema
  • data marts - contain smaller subsets of the data designed for specific needs
  • analysis layer - represents end users and tools making requests to data marts
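A rough sketch of what the data staging layer does, in plain Python; the source field names and the common schema are assumptions made up for illustration.

```python
# Sketch of a data staging step: rows from two heterogeneous sources are
# mapped onto one common schema, with duplicates and gaps removed.

crm_rows = [
    {"cust": "Acme", "rev": "1200", "period": "2024-01"},
    {"cust": "Acme", "rev": "1200", "period": "2024-01"},   # duplicate export
]
erp_rows = [
    {"customer_name": "Globex", "revenue": None, "month": "2024-01"},  # gap
    {"customer_name": "Initech", "revenue": 900, "month": "2024-01"},
]

def to_common_schema(rows, mapping):
    # Rename source-specific fields to the common schema column names.
    return [{target: r[source] for target, source in mapping.items()} for r in rows]

staged = (to_common_schema(crm_rows, {"customer": "cust", "revenue": "rev", "month": "period"}) +
          to_common_schema(erp_rows, {"customer": "customer_name", "revenue": "revenue", "month": "month"}))

seen, reconciled = set(), []
for r in staged:
    key = (r["customer"], r["month"])
    if r["revenue"] in (None, "") or key in seen:
        continue                      # drop gaps and redundancies
    seen.add(key)
    r["revenue"] = float(r["revenue"])
    reconciled.append(r)

print(reconciled)   # ready to be loaded into the warehouse layer
```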

Three-layer

Same as two-layer, with a reconciled data layer (operational data store) between the source and DWH layers. It contains integrated, consistent, correct, current and detailed data. The DWH is then loaded from the reconciled data.

Independent data marts

Multiple data marts are separately designed and built in a nonintegrated fashion; a DWH as a central unit does not exist. A good approach as a preparation step before moving towards an enterprise-wide warehouse solution, or when organizational divisions are loosely coupled.

Hub-and-spoke

No DWH as a central unit. Data marts are fed directly from the reconciled layer. Users mainly interact with the marts, but occasionally query the reconciled layer.

Books

  • "Building the Data Warehouse" Bill Inmon, updated in 2002, top-down approach; enterprise has one DWH and data marts source info from it, in DWH data is stored in 3NF
  • "The Data Warehouse Toolkit" Ralph Kimball, later updated with data marts, bottom-up approach; DWH is conglomerate of all data marts, info is always stored in the dimentional model