etl (transform,load) - sachit914/datawarehouse GitHub Wiki
-
data unification
-
in different zone we have different tables with same table structure
-
so in data warehouse we can put data from both zone in same structure
-
dropping duplication
-
data type convertion
-
dropping empty rows
Load in ETL
Initial Load:
- This is the first load of data from source systems into the data warehouse.
- It typically involves transferring all historical data from the source to the data warehouse.
- Initial loads are usually more resource-intensive and time-consuming compared to incremental loads because they involve large volumes of data.
Incremental Load:
- After the initial load, subsequent loads are typically incremental.
- Incremental loads involve transferring only the changed or newly added data since the last load.
- This approach is more efficient in terms of time and resources, as it reduces the amount of data transferred and processed compared to the initial load.
types of incremental update
Append (Insert-Only):
- Adds new data from the source to the existing data in the data warehouse.
- Does not update or delete existing records; only inserts new records.
- Simple and maintains historical data but can lead to data duplication.
In-Place Update:
- Updates existing records in the data warehouse based on changes in the source.
- Efficient for handling updates but does not retain historical versions of data.
Complete Replacement (Overwrite):
- Replaces the entire dataset in the data warehouse with a new version from the source.
- Ensures the data warehouse reflects the exact state of the source system but may lead to data loss if not managed carefully.
Rolling Update (Merge or Upsert):
- Combines insert and update operations.
- Inserts new records and updates existing records that have changed in the source.
- Balances performance and data currency, ensuring the data warehouse reflects the latest data effectively.