motherduck - w4111/w4111.github.io GitHub Wiki

nn2622 Natal Negusu

did the tutorial

vo2195 Victor Odom

Wrote the problem, alternatives, and relationship with 4111

MotherDuck

The Problem and Solution

The problem: Having multiple databases/information in different locations. The solution: When data is stored on different machines or remote locations, MotherDuck operates in the cloud in order to provide the end user with a source that can handle queries requesting data that are in the different locations.

How does the technology solve the problem?

The MotherDuck cloud service provides an integrated cloud management solution that manages data and then allows querying on that data, in an all in one cloud model.

What are the alternatives, and what are the pros and cons of the technology compared with alternatives? (what makes it unique?)

Tableau

Tableau is a data analytics platform providing browser-based analytics to its end-users. Tableau’s overall solution is moreso based in providing automated insights to the end-users from data stored in the cloud as well as from memory. MotherDuck provides a more hands-on approach in allowing users to produce query requests. MD’s automation is used in finding the most efficient access path to seek its data through once a request is made.

Looker

Looker is also a business analytics platform that makes data insight more accessible. It uses the same core idea of providing a centralized service from which it can pull and provide from multiple cloud sources. Looker has a proprietary modeling language, LookML, that provides the user with an easy, beginner friendly language to model data with; allowing the user to have stable data structures and construct data requests where they do not have to focus on SQL complexities. MotherDuck is much more simple in its services, with its own notebook UI, while also providing other presentations and interfaces to work with. For example, in addition to the notebook UI, users can work with the standard duckDB CLI or the standard duckDB python package. Looker’s con may be that it holds users in a proprietary system that is robust but also comes at a higher cost than the free service that MotherDuck currently offers.

Mediafly

Mediafly considers itself a revenue enablement platform providing sales teams with up-to-date data/content from across a business’ platform, accounts, pipelines, deals etc. It also deploys revenue solutions to the end user from the data in order to drive sales. MotherDuck provides an array of queries and data points that is generated by the user, allowing for the user to integrate more specialized tools as well as providing a query results panel for the user to filter through data. Mediafly is more targeted towards sales teams and their information stack in order to drive revenue but this may be their inhibitor too.

How it relates to concpts from 4111.

MotherDuck mostly deals with query execution and optimization. When running alongside DuckDB, it allows querying for data wherever it is as well as finding the most efficient access path; it may do this by providing storage in the cloud or by querying locally in the machine. They call their dual execution model Dual Execution, routing various stages of query execution to the most efficient location. MotherDuck also provides its cloud service so that users do not need additional instances, clusters, or warehouses.

you can use MotherDuck to join data from multiple locations: In MotherDuck, on S3 or other cloud object stores (Azure, GCS, R2, etc), or on your local machine.

Centralized data management, where instead of one end user querying and managing the data set, it is divided into two users. The Provider can manage the database and the Consumer can download the database and use search queries with the local download; though, the Consumer can only read the data and is not able to change the data.

MotherDuck utilizes the cloud data warehouse to alleviate workload on smaller machines as well as a caching layer to process data faster.

It also utilizes DuckDB Wasm in order to have the client-side JavaScript run faster locally, moreso if the data is moved locally, requiring minimum interactions with the cloud server.

Tutorial

Getting Started

First sign up for a free account on MotherDuck. With this account you will have access to MotherDuck UI and their services. After creating an account, you need to go to your settings, and get your API token. This is required when connecting to MotherDuck

Methods

You can utilize MotherDuck's services through methods:

The DuckDB CLI using SQL and DuckDB directly.
MotherDuck UI
Python

We will focus on Python but we will quickly go over MotherDuck UI

MotherDuck UI

MotherDuck UI is an online interface for you to quickly view attached databases and make use of MotherDuck's quick analysis of queries.

The interface is split into three sections:

The left hand side shows you the databases you have attached and the tables within them.
In the central section you can write and execute SQL code (akin to Google Collab). The interface will also suggest corrections if your code fails
The right hand side allows you to quickly parse through you query results.
- For example, querying a table will give you the columns, the number of occurrences of values in those columns
- Performing queries like average on table will show the distribution of data across the relevant column. The results will also include relevant calculations. In this instance, it will return the max, min, median, the quartiles and standard deviation.

Loading Data

We can use the same methods as in DuckDB to load data into MotherDuck. For example, we can load "iowa.csv" by running the command in a cell in MotherDuck UI.

CREATE DATABASE IF NOT EXISTS tutorial_db;
USE tutorial_db;

CREATE OR REPLACE TABLE iowa AS SELECT * FROM read_csv_auto(['iowa.csv']);

Pasted image 20241208181140

Python

Install DuckDB

If you do not already have DuckDB on your system please use pip install it

Connecting to MotherDuck

Before we are able to make use of MotherDuck's services we need to connect and authenticate with MotherDuck using our API token.

import duckdb

api_token = "token_here"

# init MotherDuck connection
con = duckdb.connect(f"md:?motherduck_token={api_token}")

con.close()

Executing SQL

We can use either:

con.execute(""" SQL HERE """)

con.sql(" SQL HERE ")

The second option has useful attributes such as the show() function which display query results in the terminal

Performing Analytical Queries

Loading Data

We will use sales.csv and items.csv for our database. We create new database called production_db and create two tables: sales and items.

import duckdb

api_token = "token_here"

con = duckdb.connect(f"md:?motherduck_token={api_token}")
  
# create new database
con.sql("CREATE DATABASE IF NOT EXISTS production_db")

# switch to production_db
con.sql("Use production_db")

# filepaths
sales_fp = "../Databases/sales.csv"
items_fp = "../Databases/items.csv"

# create tables
con.sql(f"CREATE OR REPLACE TABLE sales AS SELECT * FROM read_csv_auto(['{sales_fp}']);")

con.sql(f"CREATE OR REPLACE TABLE items AS SELECT * FROM read_csv_auto(['{items_fp}']);")

con.sql("SHOW TABLES").show()

con.close()

Querying

We will use an example from hw3. We will consider the family attribute in items. We compute the mean of unit_sales for $\sigma_{family=value}$ and $\sigma_{family\ne value}$ for each family.

import duckdb

api_token = "token_here"

# init MotherDuck connection
con = duckdb.connect(f"md:?motherduck_token={api_token}")

# create new database
con.sql("CREATE DATABASE IF NOT EXISTS production_db")

# switch to production_db
con.sql("Use production_db")

# filepaths
sales_fp = "../Databases/sales.csv"
items_fp = "../Databases/items.csv"

# create tables
con.sql(f"CREATE OR REPLACE TABLE sales AS SELECT * FROM read_csv_auto(['{sales_fp}']);")
con.sql(f"CREATE OR REPLACE TABLE items AS SELECT * FROM read_csv_auto(['{items_fp}']);")

con.sql("SHOW TABLES").show()

# query
q = """ create or replace table mat_suff_stat as
with total as (
    select 
        count(unit_sales) as TC,
        sum(unit_sales) as TS,
        sum(unit_sales * unit_sales) as TQ
    from sales join items on sales.item_nbr = items.item_nbr
),
family_stats as (
    select family, sum(unit_sales) as s, count(unit_sales) as c, sum(unit_sales * unit_sales) as Q
    from sales join items on sales.item_nbr = items.item_nbr
    group by family
)
select family, s, c, Q, TC, TS, TQ
from total, family_stats;

select family, 
s/c as mean_equal, 
(TS - s)/(TC - c) as mean_nequal
from mat_suff_stat """

# execute and show result of query
con.sql(q).show()

con.close()

Note that MotherDuck in the background decides whether to perform the task on their end or locally on your machine. In this instance, Motherduck will decide to perform all task remotely.

Working with multiple databases

Say you have on database in stored in the cloud using of the cloud storage services or MotherDuck and you have another database that is local to your machine. Let's say you need to perform queries on both these databases at the same time. MotherDuck helps us out with this task by allowing us to either attach our local databases to our MotherDuck connection or vice versa

Let us create a new database called remote_db on MotherDuck and another local databases called local_db. We are going to attach the local database to MotherDuck

import duckdb

api_token = "token_here"

# init MotherDuck connection
con = duckdb.connect(f"md:?motherduck_token={api_token}")

# filepaths
sales_fp = "../Databases/sales.csv"
items_fp = "../Databases/items.csv"

# create new database on MotherDuck
con.sql("CREATE DATABASE IF NOT EXISTS remote_db")

# switch to remote_db
con.sql("Use remote_db")

# create sales table in remote_db
con.sql(f"CREATE OR REPLACE TABLE sales AS SELECT * FROM read_csv_auto(['{sales_fp}']);")

# create local database 
con.sql("ATTACH 'local_db.duckdb'")

# switch to local_db
con.sql("Use local_db")

# create table in local_db
con.sql(f"CREATE OR REPLACE TABLE items AS SELECT * FROM read_csv_auto(['{items_fp}']);")

Now we perform the same query as we did earlier using both databases. However, we do need to identify the database we are refering to. Therefore, we write "remote_db.sales" instead of just "sales" in query. When can keep "items" as is because we are currently switched on to the local database where the items table lives.

# query
q = """ create or replace table mat_suff_stat as
with total as (
    select 
        count(unit_sales) as TC,
        sum(unit_sales) as TS,
        sum(unit_sales * unit_sales) as TQ
    from remote_db.sales join items on remote_db.sales.item_nbr = items.item_nbr
),
family_stats as (
    select family, sum(unit_sales) as s, count(unit_sales) as c, sum(unit_sales * unit_sales) as Q
    from remote_db.sales join items on remote_db.sales.item_nbr = items.item_nbr
    group by family
)
select family, s, c, Q, TC, TS, TQ
from total, family_stats;

select family, 
s/c as mean_equal, 
(TS - s)/(TC - c) as mean_nequal
from mat_suff_stat """

# execute and show result of query
con.sql(q).show()


con.sql("SHOW DATABASES").show()

con.close()

We are working with multiple databases and delegating where the processing should be done to MotherDuck.

It is not just between databases on MotherDuck and local databases where we can do this. MotherDuck also supports connecting to AWS. Thus, you can perform queries on table stored in AWS, another on MotherDuck and another locally all at the same time.