Feldera - w4111/w4111.github.io GitHub Wiki

ab5920 Anushka Bhatnagar
- wrote the project

Feldera

What is Feldera?

Feldera is a specialized streaming query engine which has been developed for handling fast, complex, and dynamic datasets. It allows users to create SQL programs that specify data sources (input tables) and results after computation (views). Its unique feature is the Incremental View Maintenance, which updates the computed outcomes solely based on the modifications in the input (instead of re-computing the entire data again), making it efficient as compared to traditional database systems or streaming engines. This way, Feldera is able to handle millions of records each second and provide query results in milliseconds even after input modifications.

Feldera basically queries at lightning speed:

The problem Feldera solves and how

When we have to deal with dynamic and complex data frequently, we may encounter difficulties with traditional systems as they are unable to deliver query results for streaming data in almost real-time. It's challenging to maintain steady performance when updating large amounts of data, and the operational burden is increased when there are numerous data sources (in different formats) and queries. It is computationally costly to recalculate data for each query or change every time an update is made.

Feldera solves these problems by utilising Incremental View Maintenance (IVM), which, instead of recalculating the complete dataset, dynamically updates query results in response to changes in the input data. This design guarantees that calculated views are updated almost instantly, the use of resources is efficient by steering clear of already information that has already been computed, and also provides scalability in high-throughput settings that process millions of records every second.

How Feldera relates to concepts we learnt in COMS 4111

In class, we learnt concepts which apply to traditional database systems. Here I've listed how similar and different Feldera's features are to what we've learnt this semester:

Like we learnt about relational models in class, Feldera also follows that to maintain structured data, but it also works with streaming data by allowing integration of streams and batch data, which may be tougher to do with traditional SQL. In Feldera the schema can adapt to different streaming contexts without manual intervention, so it also supports semi-structured data.
In standard SQL we have operations like joins, aggregations, nested queries, window functions, etc. Feldera has these as well, but it further supports these features for real-time stream queries like windowing and time-based operations. It has continuous query execution, so results update as new data comes in.

For example, we can do the following in Feldera in real-time:

Standard SQL

SELECT CustomerID, COUNT(*) AS TotalOrders
FROM Orders
GROUP BY CustomerID;

Feldera

SELECT CustomerID, COUNT(*) AS TotalOrders
FROM Orders
GROUP BY CustomerID
TUMBLE(INTERVAL '10 minutes');

Standard SQL has predefined query plans and indexes, as we learnt in class. The static optimization is based on query analysis before execution, and it relies heavily on disk I/O and indexing for performance. Feldera uses incremental view maintenance to update only the affected portions of a query, minimizing overhead. So it can dynamically adapt query plans during execution, which is particularly beneficial for streaming data with fluctuating patterns. It often operates in-memory for low-latency execution.
We learnt about primary keys, foreign keys, unique, not null, and check constraints in standard SQL, which enforce constraints at the point of data insertion or update. Feldera also supports these constraints for static datasets, but it has real-time validation during when the data streams. So it implements constraints dynamically for continuous data streams, such as window-based constraints, like ensuring values within a time window meet specific conditions.

For example, in standard SQL we apply constraints statically:

ALTER TABLE Orders
ADD CONSTRAINT chk_order_date CHECK (OrderDate >= '2020-01-01');

In Feldera we can have window-based constraints for streaming:

CREATE VIEW RecentOrders AS
SELECT * 
FROM Orders
WHERE OrderTimestamp > CURRENT_TIMESTAMP - INTERVAL '1 day';

For recovery, we learnt that traditional systems follow the ACID properties and guarantee strong durability, meaning it "promises" that no data is lost after a crash. Feldera does the same, but it implements stream checkpoints and distributed logs to recover state from failures. It can resume stream processing without requiring a full restart, and it offers configurable recovery options (it differs from case to case, but is mostly based on requirements of throughput and latency).

So we can say that traditional models (and standard SQL) work efficiently with well-defined (mostly static) datasets, and it may be difficult to use them for real-time analytics and stream processing without external integrations. Feldera, on the other hand, is ideal for real-time data pipelines, stream analytics, and scenarios which require quick insights from data streaming dynamically. It is good for combining transactional and streaming workloads.

The alternatives, and the pros and cons of Feldera compared with alternatives

We can compare Feldera with its alternatives in different domains, like traditional relational databases like PostgreSQL, streaming systems like Apache Kafka, and distributed SQL databases like Google Spanner. I have drawn a comparison amongst these (through sources online- mentioned in the references section at the end).

Feature	PostgreSQL	Kafka	Google Spanner	Feldera
Schema	Rigid, predefined schema	Schema-less	Fixed schema, distributed	Flexible schema with dynamic views
Real-Time Data Processing	Limited to batch updates	Excellent as it is event-driven)	Good as there is strong consistency	Excellent as it is stream-based on top of SQL
Transactions	ACID-compliant	Eventual Consistency	Global ACID transactions	ACID-compliant with real-time streaming
Best For	Structured, static data	Event-driven architectures	Cloud-native, globally distributed systems	Hybrid workloads requiring real-time processing and transactions

From the comparison we can see that Feldera works best when both transactional consistency and real-time streaming are required unlike its alternatives, which either excel in transactional or streaming domains separately.

Pros of Feldera over others:

It combines the best features of traditional relational databases and streaming platforms like Kafka, so it eliminates the need for separate systems for transactional and analytical workloads. For example it can handle transactional updates while simultaneously running real-time analytics.
It supports views that adapt dynamically to incoming streams, as compared to traditional databases where views are static.
It uses stream checkpointing to ensure minimal data loss during failures. Compared to Kafka, it has better integration with SQL-based querying and schema enforcement.

Cons of Feldera over others:

Feldera may not outperform these systems in their specific domains, like Kafka for pure streaming or Google Spanner for globally consistent transactions.
Balancing both transactional and streaming workloads could lead to suboptimal performance in highly demanding use cases focused exclusively on one type of workload.
It has fewer third-party tools, plugins, and libraries compared to PostgreSQL, Kafka, and Google Spanner.

What makes Feldera unique?

Its queries are event-aware which enables real-time processing and decision-making without requiring external tools like Kafka.
It combines relational and streaming data models, allowing users to work with both static datasets and continuous data streams using a unified SQL interface.
It supports dynamic, real-time materialized views that automatically update as underlying data changes, optimizing for hybrid workloads.
It has stream-specific indexing strategies to improve the performance of streaming queries compared to traditional indexing.
It extends SQL with intuitive constructs for working with streams which reduces the learning curve compared to other stream processing platforms.
It uses resilient checkpointing mechanisms which ensures high availability and fast recovery for stream-based workloads.

Example and Tutorial

Here I've described a problem scenario and outlined the challenges the corresponding system might face without incorporating Feldera, and how it will become efficient upon utilizing it. This way, we can understand what Feldera is by putting its usability into context.

Problem Scenario

Let's imagine that there is an emergency response agency which has to coordinate aid during natural disasters like hurricanes or wildfires. Their system must do the following:

Track incoming reports, such as affected locations and resource needs, from diverse sources such as social media and field reports
Analyze and update response plans in real-time as new information is reported
Notify teams of urgent changes, for instance if there's a spike in medical supply requests or blocked evacuation routes
Dynamically scale to handle surges in data during peak crises

If the agency incorporates a traditional DBMS, they may face challenges like:

Low latency as processing millions of data records in real-time is challenging
High complexity as a lot of customization may be required to integrate data received from various sources (diverse formats)
Complex resource allocation as determining their priorities could be computationally intensive

So, Feldera may be more efficient here as:

It can connect to real-time data sources like APIs and Kafka streams, and process millions of updates per second using its streaming SQL engine. Its Incremental View Maintenance will ensure only updated portions of the data are processed, which reduces computational overhead
It can reflect changes immediately as data flows in by dynamically using SQL views
Its computational efficiency would minimizes costs as compared to systems which recompute the data entirely

Tutorial

Data may come from various sources in this scenario, so we can first create a table to collate it:

CREATE TABLE emergency_reports (
    report_id INT,
    location VARCHAR,
    resource_type VARCHAR,
    resource_need INT,
    timestamp TIMESTAMP
);

Views in Feldera can handle and process real-time incoming data much more efficiently as compared to traditional DB systems because it only processes and updates the parts of the view which are affected by data changes, and this reduces the computational cost and latency. So here we can use views in the following ways:

We can compute resource demand per region:

CREATE VIEW resource_demand AS
SELECT location, resource_type, SUM(resource_need) AS total_need
FROM emergency_reports
GROUP BY location, resource_type;

We can detect critical resource shortages:

CREATE VIEW critical_shortages AS
SELECT location, resource_type
FROM resource_demand
WHERE total_need > 1000;

We can monitor report trends:

CREATE VIEW report_trends AS
SELECT location, COUNT(*) AS reports_per_minute
FROM emergency_reports
WHERE timestamp > NOW() - INTERVAL '1 MINUTE'
GROUP BY location;

We can create a table for routes then identify blocked routes:

CREATE TABLE route_status (
    route_id INT,
    location VARCHAR,
    status VARCHAR, -- "open", "blocked", "delayed"
    timestamp TIMESTAMP
);

CREATE VIEW blocked_routes AS
SELECT route_id, location, status
FROM route_status
WHERE status = 'blocked';

We can identify routes with delays over 15 minutes:

CREATE VIEW delayed_routes AS
SELECT route_id, location, status
FROM route_status
WHERE status = 'delayed' AND timestamp > NOW() - INTERVAL '15 MINUTES';

This way the agency can deploy the pipeline and connect Feldera to their real-time dashboard and alert system. It may be difficult for alternatives like traditional databases or streaming platforms to perform as efficiently in this scenario as databases fail to handle real-time updates efficiently and streaming platforms require substantial engineering effort and lack SQL’s simplicity. So Feldera's features make it uniquely suited for such an example where real-time data processing is necessary.

References

Used for drawing comparison: