Team members and contributions:
- jmy2134 Jessica Ye
- Overview
- Alternatives (Pros & Cons)
- Relation to W4111
- jj3390 Joseph Jojoe
- Problem & Solution
- Example/Tutorial
In-DB Machine Learning (Apache MADlib)
Apache MADlib is an open-source library for scalable in-database analytics. It provides parallel implementations of mathematical, statistical, graph, and machine learning methods and functions, supporting both structured and unstructured data. MADlib runs on Greenplum Database (GPDB) and PostgreSQL.
Figure 1. The MADlib architecture has three primary components: (1) Python driver functions, (2) C++ implementation functions, and (3) a C++ database abstraction layer.
Features
MADlib provides methods in each of the following areas:
- Classification
- Regression
- Deep Learning
- Clustering
- Topic Modeling
- Association Rule Mining
- Descriptive Statistics
- Validation
The Problem and Solution
Problem
Datasets keep growing (from GB to TB to PB) and becoming more complex, with an increasing number of input and output features. Traditional machine learning workflows struggle to cope, because the paradigm of loading all data into memory (RAM) to train and evaluate models breaks down once there is more data than RAM can hold. Exporting datasets from the database to external tools for preprocessing adds significant latency and creates a tangled mesh of tooling dependencies (e.g. doing some preprocessing with one tool or library, and other data cleaning with another) that is difficult to keep track of.
Solution
Apache MADlib lets you run machine learning and data preprocessing steps directly within the database, without having to export the data to more specialized tools. This saves bandwidth (processing is often done over a remote connection, since many ML datasets are stored in the cloud) and enables efficient processing of large datasets. It is also scalable: it runs on PostgreSQL and on Greenplum Database, a PostgreSQL-based system that distributes data and computation across multiple nodes (machines) for reliability. MADlib ships with a library of built-in machine learning algorithms such as regression and classification, and it also supports arbitrarily complex neural networks via TensorFlow/Keras, for both training and inference, using just Python and base SQL.
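To make this concrete, below is a minimal sketch of training a model entirely inside the database. The houses table and its columns are invented for illustration; madlib.linregr_train is MADlib's documented linear regression training function.
-- Hypothetical data: a toy table of house sales.
DROP TABLE IF EXISTS houses, houses_linregr, houses_linregr_summary;
CREATE TABLE houses (price FLOAT8, size_sqft FLOAT8, bedrooms INT);
INSERT INTO houses VALUES
    (300000, 1500, 3),
    (450000, 2200, 4),
    (380000, 1800, 3),
    (210000, 1100, 2);
-- Fit a linear regression of price on size and bedrooms, in SQL.
SELECT madlib.linregr_train(
    'houses',                        -- source table
    'houses_linregr',                -- output (model) table
    'price',                         -- dependent variable
    'ARRAY[1, size_sqft, bedrooms]'  -- independent variables (1 = intercept)
);
-- Inspect the fitted coefficients and goodness of fit.
SELECT coef, r2 FROM houses_linregr;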
Alternatives (Pros & Cons)
MADlib stands out from other alternatives—typically offered by cloud data storage providers—by allowing deployment on user-configured, locally scalable distributed systems (on-premises servers). This allows greater flexibility with server and model hosting and querying.
General Pros:
- Performance - When run on Greenplum Database (GPDB), MADlib is fully parallelized, which gives significantly better runtime and performance on large datasets than single-node R or Python libraries.
- Scalability - Since MADlib operates on a distributed system like GPDB, adding more nodes increases the system's capacity to handle larger datasets and more complex computations. In contrast, R and Python libraries are limited by the amount of data you can load onto a single node, and model accuracy generally improves with more data.
General Cons:
- Installation and Version - Using MADlib requires configuring either a PostgreSQL or Greenplum database server. Furthermore, to use MADlib with PostgreSQL you must be on Python 2.7; Python 3.x is currently not supported. This made it difficult for us to test and build different models with MADlib on Google Colab.
Alternatives
- Google BigQuery ML - Google's service lets users create and run machine learning models using GoogleSQL queries. BigQuery ML provides access to Vertex AI models and Cloud AI APIs for AI tasks like text generation and machine translation. Users can also use models trained natively within BigQuery ML, or import custom models trained outside of BigQuery (and perform prediction within BigQuery afterward).
- Pros: As with MADlib, users can apply existing SQL skills to build and evaluate models, and can generate results from LLMs and Cloud AI APIs. It "brings ML to the data," reducing complexity since fewer tools are required (everything can be done inside Google BigQuery); see the sketch after this list.
- Cons: It does not run locally, and training and querying models is more expensive. The architecture is serverless: Google manages all aspects of the warehouse, and resources are allocated automatically rather than managed by the user.
- Amazon Redshift ML - Similar to Google's BigQuery ML, users can create, train, and deploy machine learning models with SQL on Amazon's Redshift data warehouses. There is no need to learn new tools or languages: the service lets users create and train Amazon SageMaker machine learning models on their own data (and then query the trained model). Users can also bring their own model for local or remote inference.
- Pros and cons are similar to Google's BigQuery ML.
- However, one key advantage of choosing Amazon Redshift ML over BigQuery is Redshift's more traditional architecture based on a cluster of nodes, which distributes queries across the cluster for parallel processing and thus faster query speeds.
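For illustration, here is a hedged sketch of the BigQuery ML workflow referenced above. The dataset, table, and column names are hypothetical, while CREATE MODEL, the logistic_reg model type, and ML.PREDICT come from BigQuery ML's documented SQL surface.
-- Train a logistic regression model directly in BigQuery;
-- `mydataset.training_data` and its columns are made up for this sketch.
CREATE OR REPLACE MODEL `mydataset.saved_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['label']) AS
SELECT label, feature_a, feature_b
FROM `mydataset.training_data`;
-- Run predictions with the trained model, still entirely in SQL.
SELECT *
FROM ML.PREDICT(MODEL `mydataset.saved_model`,
                TABLE `mydataset.training_data`);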
In contrast with these alternatives, MADlib runs locally (rather than on a cloud service provider). Furthermore, using MADlib on a distributed system such as Greenplum Database (GPDB) enables faster and more scalable query processing, the management of which can be controlled by the user (in contrast with Google BigQuery ML).
Relation to W4111
MADlib lets the user perform a wide variety of machine learning and data analysis tasks. In our Project 1, students may have run machine learning tasks (such as a recommendation engine) on the data stored in their SQL database. With MADlib, there is no longer a need to transfer large datasets from the SQL database to an external environment like Python, which can add a large delay to the application and be expensive. Query execution is also faster, since running machine learning models and data analysis with MADlib inside the database means the data doesn't have to be extracted, transferred, and reloaded into an external Python (or R) script.
Furthermore, MADlib can streamline graph algorithms and other graph-related measures. A good example is PageRank, which we explored in Project 2! As a review, PageRank outputs a probability distribution describing the importance of a specific webpage (or, more generally, a vertex) by calculating the likelihood that a person randomly navigating the graph lands on that vertex; it was originally proposed by Larry Page and Sergey Brin. Following the tutorial in the MADlib documentation, we can create a vertex table containing all the vertex ids and an edge table containing all the connected edges, and then use the madlib.pagerank function to calculate the PageRank of the input vertices, as sketched below.
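A minimal sketch, adapted from the PageRank example in the MADlib graph documentation; the table and column names (vertex, edge, src, dest) are illustrative, and madlib.pagerank is the library's documented function.
DROP TABLE IF EXISTS vertex, edge;
CREATE TABLE vertex (id INTEGER);
CREATE TABLE edge (src INTEGER, dest INTEGER);
INSERT INTO vertex VALUES (0), (1), (2), (3);
INSERT INTO edge VALUES (0, 1), (0, 2), (1, 2), (2, 3), (3, 0);
DROP TABLE IF EXISTS pagerank_out, pagerank_out_summary;
SELECT madlib.pagerank(
    'vertex',              -- vertex table
    'id',                  -- column holding the vertex ids
    'edge',                -- edge table
    'src=src, dest=dest',  -- columns giving each edge's endpoints
    'pagerank_out'         -- output table
);
-- Higher PageRank means the vertex is more "important" in the graph.
SELECT * FROM pagerank_out ORDER BY pagerank DESC;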
Tutorial
Example
In this example, we're building on our Project 1, where we created a platform for users to curate their own "Collections" - webpages of "Exhibits" (images, videos, text, and links). We wanted to build a recommendation engine so that users can find exhibits that may interest them, based on the exhibits they've saved.
There is an Exhibits table that holds the media items for the multimedia curation app. Each record in the table has a name, a content URL, some metadata attributes, and a textual summary of the media. There is also a Saved table, which references the media items that the end user has saved because they like the content. The task is to build a basic recommendation engine (estimating the likelihood of the user liking an exhibit based on what they have previously saved) using logistic regression.
Tutorial
First, create tables to store data in.
CREATE TABLE Exhibits (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
content_url TEXT NOT NULL,
metadata JSONB,
summary TEXT NOT NULL
);
CREATE TABLE Saved (
id SERIAL PRIMARY KEY,
exhibit_id INT NOT NULL REFERENCES Exhibits(id),
saving_user TEXT NOT NULL
);
Then, insert some dummy data into the Exhibits table (assuming it holds all the exhibits) and arbitrarily pick a couple of candidate records/exhibits to insert into the Saved table.
INSERT INTO Exhibits (name, content_url, metadata, summary)
VALUES
('Exhibit 1', 'http://example.com/1', '{"type": "image", "theme": "art"}',
'This artwork is a stunning depiction of 19th-century impressionism, showcasing vivid colors and expressive brushwork that conveys a sense of movement and emotion.'),
('Exhibit 2', 'http://example.com/2', '{"type": "video", "theme": "history"}',
'This documentary explores the pivotal events of World War II, combining archival footage and expert interviews to provide a comprehensive overview.'),
('Exhibit 3', 'http://example.com/3', '{"type": "audio", "theme": "music"}',
'A classical piano piece performed by a world-renowned musician, blending technical mastery with emotive depth to captivate the listener.'),
('Exhibit 4', 'http://example.com/4', '{"type": "image", "theme": "nature"}',
'A high-resolution photograph capturing the serene beauty of a forest at dawn, with sunlight streaming through the mist-laden trees.'),
('Exhibit 5', 'http://example.com/5', '{"type": "text", "theme": "literature"}',
'An excerpt from a famous novel that delves into the complexities of human relationships and the struggle for identity in a modern world.'),
('Exhibit 6', 'http://example.com/6', '{"type": "image", "theme": "science"}',
'A detailed diagram illustrating the structure of a DNA molecule, highlighting the double-helix configuration and its components.'),
('Exhibit 7', 'http://example.com/7', '{"type": "video", "theme": "technology"}',
'A presentation on the latest advancements in artificial intelligence, featuring real-world applications and ethical considerations.'),
('Exhibit 8', 'http://example.com/8', '{"type": "audio", "theme": "podcast"}',
'An engaging podcast episode discussing the impact of climate change, featuring interviews with environmental scientists and activists.'),
('Exhibit 9', 'http://example.com/9', '{"type": "image", "theme": "sports"}',
'An action shot of a soccer player scoring a winning goal during a championship match, capturing the excitement and intensity of the moment.'),
('Exhibit 10', 'http://example.com/10', '{"type": "text", "theme": "philosophy"}',
'A philosophical essay exploring the nature of free will, presenting arguments from both deterministic and libertarian perspectives.');
INSERT INTO Saved (exhibit_id, saving_user)
VALUES
(1, 'user1'),
(3, 'user1'),
(5, 'user1'),
(7, 'user1'),
(9, 'user1');
Now we will create a table that puts the data into a format we can train a regression model on. Specifically, we model the ‘input’ data (the Exhibit tuples) as the input variables of the regression model. The output is based on whether the specific media item is saved or not: a saved exhibit is a positive signal, and ideally we want to show users more items like the ones they have saved. So we label saved items with the output ‘1’ and unsaved items (the rest of the Exhibits) with the output ‘0’. This is called feature extraction.
DROP TABLE IF EXISTS exhibits_training;
CREATE TABLE exhibits_training AS
SELECT
e.id AS exhibit_id,
CASE WHEN s.id IS NOT NULL THEN 1 ELSE 0 END AS saved,
e.metadata->>'type' AS feature_type,
e.metadata->>'theme' AS feature_theme
FROM
exhibits e
LEFT JOIN
saved s ON e.id = s.exhibit_id;
SELECT * FROM exhibits_training ORDER BY exhibit_id;
We will also explicitly save the features in a separate table, encoding each categorical value as its own 0/1 indicator column (one-hot encoding).
DROP TABLE IF EXISTS exhibits_features;
CREATE TABLE exhibits_features AS
SELECT
exhibit_id,
saved,
CASE WHEN feature_type = 'image' THEN 1 ELSE 0 END AS is_image,
CASE WHEN feature_type = 'video' THEN 1 ELSE 0 END AS is_video,
CASE WHEN feature_type = 'audio' THEN 1 ELSE 0 END AS is_audio,
CASE WHEN feature_type = 'text' THEN 1 ELSE 0 END AS is_text,
CASE WHEN feature_theme = 'art' THEN 1 ELSE 0 END AS theme_art,
CASE WHEN feature_theme = 'history' THEN 1 ELSE 0 END AS theme_history,
CASE WHEN feature_theme = 'science' THEN 1 ELSE 0 END AS theme_science
FROM
exhibits_training;
Now we train the logistic regression model using the MADlib logregr_train function. Its arguments, in order, are: the input table of variables; the output table to store the model in; the training output column (which we computed as ‘1’ or ‘0’ and stored in the ‘saved’ column); the list of input features; an optional set of grouping columns (NULL here); the number of training iterations (20); and the optimizer to use (effectively, how the coefficients of the logistic regression model are computed and updated during training).
DROP TABLE IF EXISTS exhibits_logregr;
SELECT madlib.logregr_train(
    'exhibits_features',    -- source table
    'exhibits_logregr',     -- output table holding the trained model
    'saved',                -- dependent variable (the 0/1 label)
    'ARRAY[is_image, is_video, is_audio, is_text, theme_art, theme_history, theme_science]',  -- independent variables
    NULL,                   -- grouping columns (none)
    20,                     -- maximum number of iterations
    'irls'                  -- optimizer: iteratively reweighted least squares
);
Finally, we can output predictions from the logistic regression model for whether each exhibit is likely to be saved or not. This can also be performed as evaluation on a separate (held-out) dataset for more accurate results and to prevent information leakage. The first query outputs a binary saved/not-saved prediction for each exhibit (using a default threshold of 0.5). The second query outputs the underlying probability without thresholding, which can be helpful if the end user needs to adjust the threshold to account for skew or bias in the underlying data.
SELECT
f.exhibit_id,
madlib.logregr_predict(coef, ARRAY[is_image, is_video, is_audio, is_text, theme_art, theme_history, theme_science]) AS predicted_saved,
f.saved AS actual_saved
FROM
exhibits_features f, exhibits_logregr m
ORDER BY
f.exhibit_id;
SELECT
f.exhibit_id,
madlib.logregr_predict_prob(coef, ARRAY[is_image, is_video, is_audio, is_text, theme_art, theme_history, theme_science]) AS probability_saved,
f.saved AS actual_saved
FROM
exhibits_features f, exhibits_logregr m
ORDER BY
f.exhibit_id;
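As a quick sanity check, we can compute the overall training accuracy, i.e. the fraction of exhibits whose thresholded prediction matches the actual label. This query is a sketch we added beyond the original steps; it uses only madlib.logregr_predict and standard PostgreSQL casts. Note that accuracy measured on the training data is optimistic; a held-out set gives a fairer estimate.
-- Fraction of exhibits where the 0/1 prediction equals the actual label.
SELECT
    AVG((madlib.logregr_predict(
             m.coef,
             ARRAY[is_image, is_video, is_audio, is_text, theme_art, theme_history, theme_science]
         )::INT = f.saved)::INT) AS training_accuracy
FROM
    exhibits_features f, exhibits_logregr m;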