# Data Warehouse and Data Store
## Data Warehouse

### Snowflake vs Databricks
The decision between Snowflake and Databricks for your data warehouse is not a straightforward one. While both platforms may seem similar on the surface, they differ significantly, making an apples-to-apples comparison challenging. I have been immersed in the Databricks platform since the start of my big-data journey, so it holds a special place for me. I use it extensively at work, and I appreciate its role as not just a data warehouse but a full analytics platform with diverse use cases and applications.
On the other hand, Snowflake centers on its core product, the data warehouse, and has been expanding its offerings over the years. Deciding which platform to choose is no easy feat. Ideally, an organization would have both: their overlapping features and competition for market share result in a win-win situation for customers. Key metrics for comparison, such as cost, feasibility, user experience, and implementation effort, are intricately tied to organizational characteristics such as headcount, technical expertise, and the trade-offs the organization is willing to make between cost, performance, and implementation time.
Both Snowflake and Databricks support SQL and PySpark, with Snowflake's Snowpark providing a DataFrame API that closely mirrors PySpark's. Both also support User Defined Functions (UDFs) and User Defined Table Functions (UDTFs) for distributing Python workloads across a cluster. In terms of functionality, then, the two are very similar, and cost and performance are comparable as well.
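As a rough illustration of how close the two APIs are, here is a minimal sketch of the same scalar Python UDF defined in PySpark and in Snowpark. The table and column names are hypothetical, and the Snowpark session setup is left commented out since it needs account credentials.

```python
import math

# --- PySpark (Databricks) ---
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=DoubleType())
def log_return(prev_close: float, close: float) -> float:
    return math.log(close / prev_close)

# Hypothetical usage against a "prices" table:
# prices = spark.table("prices")
# prices = prices.withColumn("log_ret", log_return("prev_close", "close"))

# --- Snowpark (Snowflake) ---
from snowflake.snowpark.functions import udf as snow_udf
from snowflake.snowpark.types import FloatType

# Requires an authenticated session, so only sketched here:
# session = Session.builder.configs(connection_parameters).create()
# log_return_sf = snow_udf(
#     lambda prev_close, close: math.log(close / prev_close),
#     return_type=FloatType(),
#     input_types=[FloatType(), FloatType()],
#     session=session,
# )
```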
For instance, Snowflake might suit a small team without a dedicated cloud administrator that relies on just a SQL warehouse for BI reporting. Its simplicity and SaaS structure make it easy to implement and comprehend. On the other end of the spectrum, a large organization with multiple teams spanning BI, ML, Data Engineering, and business operations might benefit more from a centralized platform that accommodates diverse requirements.
Crucially, assessing the datasets available in each platform's marketplace can influence the choice, since it helps avoid unnecessary ETL effort. In a startup scenario, I encountered datasets available on Snowflake's marketplace but not on Databricks', leading to a preference for Snowflake due to its simplicity and dataset availability. This matters especially for datasets with hundreds of tables (for example, the S&P CIQ dataset): building an ETL pipeline for that kind of complexity introduces plenty of potential pitfalls, so marketplace data sharing is a valuable expediter for trials and for onboarding new datasets.
Nevertheless, certain aspects favoring Databricks should be noted: its data governance, advanced ETL tooling, and ML capabilities make it a complete analytics platform. On the other hand, Snowflake offers a superior warehousing experience. The complexity of Databricks' pricing structure, coupled with the underlying AWS costs, poses challenges for its users, especially when breaking down costs for various resources across different users and teams.
While having multiple options is advantageous for developers, the simplicity and user-friendliness of Snowflake make it a valuable addition to any toolkit. For this reason I have some of my personal workloads running in Databricks, while the one I built for my client is based on Snowflake, because a few of the datasets the client was interested in were already available there.
### Interoperability between Snowflake and Databricks
Databricks is built on the Delta Lake architecture, and Snowflake has recently taken its first steps in this direction by embracing the Iceberg format. As a consumer this is great, as we can use either platform and quickly consume the data from the other.
I recently migrated a large amount of data from Delta Lake to Snowflake, and the migration process was seamless.
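As a rough sketch of what such a migration can look like in Python (not necessarily the exact process I used), one can read a Delta table with the delta-rs `deltalake` package and push it into Snowflake with the connector's `write_pandas` helper. The bucket path, credentials, and table name below are all placeholders.

```python
# Hypothetical Delta Lake -> Snowflake copy for a single table.
import snowflake.connector
from deltalake import DeltaTable
from snowflake.connector.pandas_tools import write_pandas

# Read the Delta table (delta-rs, no Spark required) into pandas.
df = DeltaTable("s3://my-bucket/delta/prices").to_pandas()

# Connect to Snowflake; identifiers here are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="...",
    warehouse="COMPUTE_WH",
    database="RESEARCH",
    schema="PUBLIC",
)

# Bulk-load the dataframe; auto_create_table infers the schema.
write_pandas(conn, df, table_name="PRICES", auto_create_table=True)
conn.close()
```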
## Datastore
I am using DuckDB as the local datastore for this project. Here are its main use cases:
- Persistent storage for pandas dataframes while doing research. DuckDB's pyarrow support makes moving data to and from pandas very fast.
- Temporary local storage for warehouse data while doing research. Because both the warehouse and the datastore speak SQL, it is very easy to develop features and analytics locally on DuckDB and then migrate them to the warehouse (see the sketch after this list). This saves a substantial amount of warehouse cost.
- Data warehouses like Databricks and Snowflake don't come cheap, and if your datasets are small you can use DuckDB as your local analytical engine and replace the warehouse entirely. You won't be able to run horizontally scalable distributed workloads, but given that your datasets are small, you probably won't need that feature.
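Here is a minimal sketch of that local-first workflow, assuming a local `research.duckdb` file and a hypothetical `prices` table; the SQL developed here can later be ported to the warehouse almost verbatim.

```python
import duckdb
import pandas as pd

# A small research dataframe; contents are illustrative.
df = pd.DataFrame({
    "symbol": ["AAPL", "AAPL", "MSFT"],
    "close": [189.7, 190.2, 405.3],
})

# Persistent on-disk database file.
con = duckdb.connect("research.duckdb")

# DuckDB can reference the in-scope pandas object directly in SQL
# (via replacement scans, backed by pyarrow for fast transfer).
con.execute("CREATE OR REPLACE TABLE prices AS SELECT * FROM df")

# Develop the analytics locally in SQL, then port the same query
# to Snowflake/Databricks once it looks right.
result = con.execute("""
    SELECT symbol, AVG(close) AS avg_close
    FROM prices
    GROUP BY symbol
""").fetchdf()  # back to a pandas dataframe

con.close()
```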
If you need a datastore with advanced collaboration and time-series support, ArcticDB from Man AHL is a great option.
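A minimal sketch of ArcticDB usage, assuming a local LMDB backend; the library and symbol names are hypothetical.

```python
import pandas as pd
from arcticdb import Arctic

# Local LMDB-backed store; S3 URIs work the same way for shared setups.
ac = Arctic("lmdb:///tmp/arcticdb")
lib = ac.get_library("research", create_if_missing=True)

idx = pd.date_range("2024-01-02", periods=3, freq="D")
prices = pd.DataFrame({"close": [189.7, 190.2, 188.9]}, index=idx)

lib.write("AAPL", prices)       # each write creates a new immutable version
latest = lib.read("AAPL").data  # read back the latest version as a dataframe
```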
There is also an upcoming cloud-based product built on DuckDB, known as MotherDuck. It takes a hybrid approach to processing data, splitting work between the local machine and the cloud, and seems quite promising, but it is too early to tell.