# Data Modeling Guide: Treating Data as a Product
- What is a Data Lake?
- What is a Data Warehouse and Data Marts, and why use them?
- How to Model Data Marts?
- Video Webinar on Practical Dimensional Modeling 101
- Data Mart Loading Strategies
- Wrap-up
Versatile Data Kit provides data engineers with an all-inclusive toolkit for managing data. It covers capabilities from ingestion to processing and warehousing.
This guide aims to help users convert the raw data they acquire into useful, polished, and functional data products that can deliver key insights.
A basic understanding of SQL, data pipelines, and Versatile Data Kit is assumed for this guide.
## What is a Data Lake?

Data Lakes serve as the initial landing zone for ingested data. Data stored here is structured and can be accessed through SQL, yet remains 'raw.' It is stored in a model closely mirroring the source system and at the highest level of granularity.
It lacks any cleansing, business logic, or aggregations. To transform this raw data into cleaned and business-oriented data products, we utilize the data warehouse and data marts.
## What is a Data Warehouse and Data Marts, and why use them?

A Data Warehouse serves as a repository for refined datasets. These datasets are processed and ready to be easily utilized by businesses for specific use-cases.
Data Marts are essentially subsets of the data warehouse. Each data mart is owned by different teams and oriented towards different business lines, allowing each team to focus on their specific data needs.
The fundamental difference between a data lake and a data mart lies in their structure and purpose.
- Data Lake: This is where vast amounts of raw data are stored. The data in a data lake, such as telemetry data for products or services, is unrefined and does not have a predefined structure.
- Data Mart: In contrast, a data mart contains a smaller amount of data, which is built from the raw source data stored in the lake. Here, transformations and aggregations are applied to the raw data with a specific use-case in mind. Consequently, data marts provide a single, easily accessible source that can be used later for various purposes including reporting, analytics, data science, and driving data-powered applications.
In the context of a data mart, datasets are treated as products. Like any product, these datasets should be tested, documented, made available, monitored, and operated upon. This is an essential, yet resource-intensive activity. Hence, making strategic decisions regarding the data mart model is of paramount importance.
## How to Model Data Marts?

The short answer is that it all depends on the team's use case. Below, we outline some common approaches.
You can save smaller (filtered or aggregated) versions of your data lake tables in a data mart. This way your reports will be based on smaller tables, thus improving efficiency and speed. This is a straightforward choice if your tables are unrelated to each other and are frequently used for reporting. In this case we also parameterize `mart_model_schema_name` to allow us to switch between dev and production easily.
```sql
CREATE TABLE IF NOT EXISTS {mart_model_schema_name}.fact_sales (
    order_id       INT       COMMENT 'Primary key; one row per order',
    product_id     INT       COMMENT 'Foreign key to mart_model_example.dim_product',
    customer_id    STRING    COMMENT 'Foreign key to mart_model_example.dim_customer',
    employee_id    INT       COMMENT 'Foreign key to mart_model_example.dim_employee',
    total_sales    BIGINT    COMMENT 'Total sales per order',
    total_quantity BIGINT    COMMENT 'Total quantity per order',
    time_of_sale   TIMESTAMP COMMENT 'The time at which the sale was executed'
)
COMMENT 'UK: order_id, provides sales information per order id';
```
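For completeness, here is a minimal sketch of how such a table could be populated from the raw lake table. Some of the source columns used below (`product_id`, `employee_id`, `time_of_sale`) are assumptions for illustration; in practice you would typically let a VDK processing template (discussed later) perform the load.

```sql
-- A sketch of loading fact_sales from the raw lake table; the source columns
-- product_id, employee_id and time_of_sale are assumed to exist here.
INSERT INTO {mart_model_schema_name}.fact_sales
SELECT CAST(order_id AS INT)                AS order_id,
       CAST(product_id AS INT)              AS product_id,
       TRIM(LOWER(customer_id))             AS customer_id,
       CAST(employee_id AS INT)             AS employee_id,
       SUM(CAST(total_sale AS BIGINT))      AS total_sales,
       SUM(CAST(quantity AS BIGINT))        AS total_quantity,
       MAX(CAST(time_of_sale AS TIMESTAMP)) AS time_of_sale
FROM {raw_data_schema_name}.store_sales_example
GROUP BY CAST(order_id AS INT),
         CAST(product_id AS INT),
         TRIM(LOWER(customer_id)),
         CAST(employee_id AS INT);
```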
If you are frequently writing the same script, it may be useful to save it as a view. A view is essentially a virtual table defined by the result of a pre-defined (and stored) SQL script. This SQL script can join multiple tables, filter, aggregate, or perform other logic on one or more tables. The results of a view are recomputed each time you query it and are not physically stored in the database. Views are typically used when working with smaller tables (fewer than 1 million records/rows). Otherwise, you might consider materializing the view by storing the result as a table, as shown after the view definition below.
```sql
CREATE VIEW IF NOT EXISTS {mart_model_schema_name}.vw_fact_sales
AS
SELECT CAST(order_id AS INT)        AS order_id,
       TRIM(LOWER(customer_id))     AS customer_id,
       SUM(CAST(total_sale AS INT)) AS total_sales,
       SUM(CAST(quantity AS INT))   AS total_quantity
FROM {raw_data_schema_name}.store_sales_example
GROUP BY
    CAST(order_id AS INT),
    TRIM(LOWER(customer_id));
```
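If the source tables grow large, the same logic can be materialized rather than recomputed on every query. A minimal sketch, assuming the view above already exists (the exact CTAS syntax and refresh strategy depend on your SQL engine):

```sql
-- Materialize the view's result as a physical table in the mart;
-- refresh it on the data job's schedule instead of recomputing it per query.
CREATE TABLE IF NOT EXISTS {mart_model_schema_name}.fact_sales_materialized AS
SELECT order_id,
       customer_id,
       total_sales,
       total_quantity
FROM {mart_model_schema_name}.vw_fact_sales;
```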
DataCamp defines normalisation as a database design technique which divides tables into smaller tables and connects them via relationships. The goal is to reduce redundancy and increase data integrity. There are different levels of normalisation, depending on the desired balance between storage optimisation and data integrity. However, with intense normalisation come query complexity and computational burdens: queries become much more complex to write (multiple joins) and require more compute. For a more thorough dive into normalisation, please refer to here and here.
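As a small, hypothetical illustration of normalisation (all table and column names here are made up), repeated customer attributes can be moved out of a wide orders table into their own table, leaving only a key behind:

```sql
-- After normalisation: customer attributes are stored once per customer
-- and referenced from orders via customer_id, instead of being repeated per order.
CREATE TABLE IF NOT EXISTS {mart_model_schema_name}.customers (
    customer_id   STRING COMMENT 'Primary key',
    customer_name STRING COMMENT 'Customer display name',
    customer_city STRING COMMENT 'City of the customer'
);

CREATE TABLE IF NOT EXISTS {mart_model_schema_name}.orders (
    order_id    INT    COMMENT 'Primary key',
    customer_id STRING COMMENT 'Foreign key to customers',
    total_sale  BIGINT COMMENT 'Order amount'
);
```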
Oftentimes the raw data or the business use-case requires the use of multiple tables, which are related in nature. In addition, we would like our data to be useful to other teams, so it should be modeled in a straightforward and logical manner. To accomplish this, we can utilize the dimensional model and the star schema.
A fact table contains measures of a certain metric. For example, a store's sales would be a fact table, where each record/row would be a sale made by that store. This type of table changes regularly and is designed to connect to one or more dimension tables.
Dimension tables connect to the fact table to provide more descriptive information about a record. This descriptive information does not change very often or at all. For example, a dimension table can add descriptive information about a product.
In a Star schema, the dimension tables extend only one level out from the fact table: each dimension table relates directly to the fact table and is not further normalised into other dimension tables. Below is a simple example of such a Star schema.
Using the store sales example, a possible Star schema could look something like the following. In this example, you have a fact table that contains the store's sales, and IDs for each attribute (product, order, customer, and employee) that are used to link to the dimension tables to acquire more information, if needed. Everything is one left join away.
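To make the "one left join away" point concrete, here is a sketch of a reporting query against such a schema. The `dim_product` table and its columns are assumptions for illustration:

```sql
-- Total sales per product category: the fact table joins directly to the
-- product dimension, with no further hops required.
SELECT d.product_category,
       SUM(f.total_sales) AS total_sales
FROM {mart_model_schema_name}.fact_sales AS f
LEFT JOIN {mart_model_schema_name}.dim_product AS d
       ON f.product_id = d.product_id
GROUP BY d.product_category;
```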
The easiest way to think about the Snowflake schema is as an extension of the Star schema: the dimension tables are further normalized to save storage by reducing redundancy (and to optimize data integrity).
However, the Snowflake model can also greatly increase the complexity of modeling and querying. In addition, when working with an object storage DB, compute is more expensive than storage. Thus, the Snowflake schema is more computationally burdensome than the Star schema. As such, it is often avoided. Below is a simple example of a Snowflake schema.
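For contrast, here is the same hypothetical report against a snowflaked version of the product dimension, where the category attributes have been normalised into their own table (again, the table and column names are assumptions):

```sql
-- The product dimension is normalised further, so the report needs an
-- extra join to reach the category attributes.
SELECT c.category_name,
       SUM(f.total_sales) AS total_sales
FROM {mart_model_schema_name}.fact_sales AS f
LEFT JOIN {mart_model_schema_name}.dim_product AS d
       ON f.product_id = d.product_id
LEFT JOIN {mart_model_schema_name}.dim_product_category AS c
       ON d.category_id = c.category_id
GROUP BY c.category_name;
```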
## Video Webinar on Practical Dimensional Modeling 101

Check out the video from one of our community meetings here.
## Data Mart Loading Strategies

With the theory taken care of, it's important to talk about the logistics of building the tables in a data mart. The Versatile Data Kit (VDK) provides users with templates that easily load data into a data mart, based on a table's characteristics.
Typically, users apply the append or overwrite strategies to fact tables and the overwrite or versioned strategies to dimension tables, but that need not be the case for everyone; it depends on the intricacies of your data and use-case.
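As a rough, engine-agnostic sketch of what the append and overwrite strategies boil down to (the VDK templates encapsulate this logic for you; `vw_fact_sales_staging` is a hypothetical staging view with the same columns as the target table):

```sql
-- Append strategy: newly arrived records are added on top of existing ones.
INSERT INTO {mart_model_schema_name}.fact_sales
SELECT * FROM {mart_model_schema_name}.vw_fact_sales_staging;

-- Overwrite strategy: the target is fully replaced with the latest snapshot.
INSERT OVERWRITE TABLE {mart_model_schema_name}.fact_sales
SELECT * FROM {mart_model_schema_name}.vw_fact_sales_staging;
```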
To get the full list of available Versatile Data Kit templates, visit the SQL Data Processing Templates Examples.
## Wrap-up

Congratulations on making it to the end of this tutorial! We've traversed a wide range of topics, from understanding the difference between Data Lakes, Data Warehouses, and Data Marts, to learning about various data modeling techniques such as the Star and Snowflake schemas.
It is important to ensure that data is ingested into a Data Lake with little to no modification. Data Marts are used to create well-modeled datasets that can be treated as products. Furthermore, the decision to use a specific data modeling technique or a particular data mart loading strategy depends heavily on the specific requirements of your use case. The key is to find a balance between computational efficiency, ease of use, and maintenance that best suits your unique situation.
If you want to go a step further, you can use the Versatile Data Kit to ingest this data into a data lake and then transform and load it into a data mart using the different loading strategies discussed. You can refer to the Ingestion Examples and SQL Data Processing Templates Examples for assistance.