Steampipe

er2788 Emanuela Romanó - project owner

The Problem and Solution

1 - WHAT PROBLEM DOES STEAMPIPE SOLVE?

Steampipe facilitates the retrieval of data across cloud services, SaaS applications and other platforms, using SQL-based queries.

The problems it solves include:

Data Fragmentation Across Platforms: Resources are spread across multiple platforms, each with its own APIs, CLIs, or dashboards, making it more difficult and time consuming to retrieve data.
Inefficiency in Data Access: Retrieving resource information from cloud or SaaS platforms involves developing REST API calls, deciphering responses, or managing different CLIs, which can require different syntaxes.
Limited Cross-Platform Insights: Existing tools are often limited to specific ecosystems. For example, Clusterpedia only provides solutions for Kubernetes.

2 - HOW DOES STEAMPIPE SOLVE THE PROBLEM?

Steampipe solves these issues by:

SQL-based Querying: Provides a SQL interface to query resources across multiple platforms, enabling a consistent and efficient way to fetch data. It allows developers to access the necessary data, as long as they know SQL.
Wide Scope Through Plugins: Steampipe has over 140 plugins. Therefore, it allow integration with many platforms, from cloud services (e.g. GCP, AWS) to SaaS tools like GitHub and Slack. Data sources that Steampipe can query include:
- Cloud providers like AWS, Azure, Oracle Cloud
- Structured files like CSV, YML, and Terraform
- Network services like DNS and HTTP
- Cloud-based services like GitHub, Salesforce, and ServiceNow
Relational Representation of Resources: Resources are represented as tables in a PostgreSQL or SQLite database, making them easy to query/join/aggregate.
Flexibility of Data Outputs: Outputs from Steampipe queries are available in many formats (e.g., JSON, CSV, relational views), which enables compatibility with different automation tools and Business Intelligence dashboards.
Caching and Performance Enhancements: With Steampipe, retrieved data are stored in ephemeral tables. This reduces repetitive API calls, and improves response times.
Cross-Platform Data Aggregation: Since the data from different sources are stored in a relational representation, they can be combined and aggregated in the same way as it is done for standard SQL tables.

3 - WHAT ARE ALTERNATIVES, AND WHAT ARE THE PROS AND CONS OF STEAMPIPE VS ALTERNATIVES?

Alternative technologies include:

Clusterpedia: Designed specifically for managing Kubernetes clusters.
Platform-Specific Tools: AWS CLI, Azure CLI, Google Cloud SDK.
Custom Scripts and SDKs: Write scripts or use SDKs to interact with specific APIs.

Pros and Cons of Alternatives & Steampipe:

Feature	Steampipe	Clusterpedia	Custom Scripts/SDKs	Platform-Specific Tools
Scope	Steampipe is multi-platform	Limited to Kubernetes	Customizable, but it is still platform-specific	Limited to one platform
Ease of Use	SQL-based, easy for most users	Relies on Kubernetes API, specific for K8s users	Requires programming skills	Requires access to CLI/UI and has limited SQL support
Performance	Slower for large setups and not optimized for real-time data pulling	Faster due to proactive data collection: it stores different cluster data in a central repository ahead of time	Dependent on implementation	High for specific use cases, but not very scalable
Insight of Historical Data	Limited, mainly focused on current state	Good for Kubernetes events	Requires a manual setup	Rarely supports historical data
Extensibility	Custom plugins available	Limited	Fully flexible but requires specific coding for different platforms (high effort)	Limited to the specific platform

4 - HOW DOES IT RELATE TO THE CONCEPT OF SQL FEATURES FROM 4111?

We look at how Steampipe relates to SQL Features and focus on 6 different aspects:

Functionality
Data Interaction
Performance
Extensibility
Historical and Real-Time Data
Use Cases

FUNCTIONALITY
Aspect	SQL Query	Steampipe
Purpose	Query Language used to interact with relational databases.	Provides an interface to query different platforms (e.g. Cloud, SaaS) via SQL.
Data Source	Queries data from relational database tables.	Queries live data from APIs of various platforms, and treats them as SQL tables.
Data Type Management	Retrieves relational data.	Converts API responses into relational formats, such as PostgreSQL tables.

DATA INTERACTION
Aspect	SQL Query	Steampipe
Data Retrieval	Queries directly stored data in a database.	Pulls data on-demand from APIs, and chaches results for efficiency purposes.
Query Syntax	Standard SQL syntax.	Extends SQL to allow querying schemas derived from API calls .
Schema Definition	Pre-defined schema with fixed columns and data types.	Dynamically defines schemas based on API responses. It relies on JSONB for nested data.
Cross-Source Joins	Operates within a single database system.	Allows joins across multiple platforms (e.g., AWS + Azure resources).

PERFORMANCE
Aspect	SQL Query	Steampipe
Speed	Optimized for high-speed data retrieval from (indexed) tables.	Speed depends on the response time of API calls; performance is improved through caching but it may lead to stale data.
Data Volume	Efficiently handles large datasets via indexing and query optimization.	Limited by API and real-time retrieval constraints. It is not designed for datasets with high-throughput.
Scalability	Scales well with large databases and distributed database systems.	Performance can decrease significantly with large-scale, multi-platform queries.

EXTENSIBILITY
Aspect	SQL Query	Steampipe
Custom Queries	Supports complex query construction.	Supports SQL features, but JSONB fields complicate querying deeply nested API responses.
Plugin/Integration	Relies on database extensions to add custom functionalities.	Relies on plugins to connect to new platforms or add new data sources.
Data Transformation	Transformation via SQL functions (e.g., CAST)	Includes SQL transformations, but handling JSON data can make transformations more difficult.

HISTORICAL AND REAL-TIME DATA
Aspect	SQL Query	Steampipe
Historical Data	Maintains historical data through transactional logging.	Does not support history by default. It requires external caching or snapshots.
Real-Time Queries	Not inherently real-time unless paired with streaming technologies, such as Kafka.	Retrieves real-time data directly from APIs on-demand. However, due to caching, it can output obsolete information if the same API is called repeatedly.

USE CASES
Aspect	SQL Query	Steampipe
Primary Use	Data retrieval, analysis, and manipulation in a structured database.	Monitoring, reporting, and aggregating metadata across different platforms.
Automation	Frequently integrated into ETL pipelines or dashboards.	Ideal for on-demand reporting and auditing, but it is less suited for large-scale ETL workflows.
User Base	Database administrators, analysts, developers.	Cloud operators, platform engineers, DevOps teams.

Tutorial - Using Steampipe for Doctor Application

Example

For project 1, my teammate and I built an application, which allows patients to book appointments with doctors, and which stores medical information of patients. The flask application frontend was deployed in GCP.

A possible use for Steampipe would be to check if the GCP instance is running, to ensure that the application can be accessed.

Tutorial

Prerequisites: gcloud set up on local machine and connected to the GCP instance.

In the terminal:

Install Steampipe: brew install turbot/tap/steampipe
Install Google Cloud plugin: steampipe plugin install gcp
Start the query interface: steampipe query
To get the status of the VM, run:

SELECT name, status
FROM gcp_compute_instance
WHERE name = 'name-of-vm-instance';

You should see the below in the terminal, where "name-of-vm-instance" is the name of the VM instance that is being queried:

Steampipe - w4111/w4111.github.io GitHub Wiki

Steampipe

The Problem and Solution

Tutorial - Using Steampipe for Doctor Application

Example

Tutorial

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️