Steampipe - w4111/w4111.github.io GitHub Wiki
- er2788 Emanuela Romanó - project owner
1 - WHAT PROBLEM DOES STEAMPIPE SOLVE?
Steampipe facilitates the retrieval of data across cloud services, SaaS applications and other platforms, using SQL-based queries.
The problems it solves include:
-
Data Fragmentation Across Platforms: Resources are spread across multiple platforms, each with its own APIs, CLIs, or dashboards, making it more difficult and time consuming to retrieve data.
-
Inefficiency in Data Access: Retrieving resource information from cloud or SaaS platforms involves developing REST API calls, deciphering responses, or managing different CLIs, which can require different syntaxes.
-
Limited Cross-Platform Insights: Existing tools are often limited to specific ecosystems. For example, Clusterpedia only provides solutions for Kubernetes.
2 - HOW DOES STEAMPIPE SOLVE THE PROBLEM?
Steampipe solves these issues by:
- SQL-based Querying: Provides a SQL interface to query resources across multiple platforms, enabling a consistent and efficient way to fetch data. It allows developers to access the necessary data, as long as they know SQL.
-
Wide Scope Through Plugins:
Steampipe has over 140 plugins. Therefore, it allow integration with many platforms, from cloud services (e.g. GCP, AWS) to SaaS tools like GitHub and Slack. Data sources that Steampipe can query include:
- Cloud providers like AWS, Azure, Oracle Cloud
- Structured files like CSV, YML, and Terraform
- Network services like DNS and HTTP
- Cloud-based services like GitHub, Salesforce, and ServiceNow
- Relational Representation of Resources: Resources are represented as tables in a PostgreSQL or SQLite database, making them easy to query/join/aggregate.
- Flexibility of Data Outputs: Outputs from Steampipe queries are available in many formats (e.g., JSON, CSV, relational views), which enables compatibility with different automation tools and Business Intelligence dashboards.
- Caching and Performance Enhancements: With Steampipe, retrieved data are stored in ephemeral tables. This reduces repetitive API calls, and improves response times.
- Cross-Platform Data Aggregation: Since the data from different sources are stored in a relational representation, they can be combined and aggregated in the same way as it is done for standard SQL tables.
3 - WHAT ARE ALTERNATIVES, AND WHAT ARE THE PROS AND CONS OF STEAMPIPE VS ALTERNATIVES?
Alternative technologies include:
- Clusterpedia: Designed specifically for managing Kubernetes clusters.
- Platform-Specific Tools: AWS CLI, Azure CLI, Google Cloud SDK.
- Custom Scripts and SDKs: Write scripts or use SDKs to interact with specific APIs.
Pros and Cons of Alternatives & Steampipe:
Feature | Steampipe | Clusterpedia | Custom Scripts/SDKs | Platform-Specific Tools |
---|---|---|---|---|
Scope | Steampipe is multi-platform | Limited to Kubernetes | Customizable, but it is still platform-specific | Limited to one platform |
Ease of Use | SQL-based, easy for most users | Relies on Kubernetes API, specific for K8s users | Requires programming skills | Requires access to CLI/UI and has limited SQL support |
Performance | Slower for large setups and not optimized for real-time data pulling | Faster due to proactive data collection: it stores different cluster data in a central repository ahead of time | Dependent on implementation | High for specific use cases, but not very scalable |
Insight of Historical Data | Limited, mainly focused on current state | Good for Kubernetes events | Requires a manual setup | Rarely supports historical data |
Extensibility | Custom plugins available | Limited | Fully flexible but requires specific coding for different platforms (high effort) | Limited to the specific platform |
4 - HOW DOES IT RELATE TO THE CONCEPT OF SQL FEATURES FROM 4111?
We look at how Steampipe relates to SQL Features and focus on 6 different aspects:
- Functionality
- Data Interaction
- Performance
- Extensibility
- Historical and Real-Time Data
- Use Cases
FUNCTIONALITY | ||
---|---|---|
Aspect | SQL Query | Steampipe |
Purpose | Query Language used to interact with relational databases. | Provides an interface to query different platforms (e.g. Cloud, SaaS) via SQL. |
Data Source | Queries data from relational database tables. | Queries live data from APIs of various platforms, and treats them as SQL tables. |
Data Type Management | Retrieves relational data. | Converts API responses into relational formats, such as PostgreSQL tables. |
DATA INTERACTION | ||
---|---|---|
Aspect | SQL Query | Steampipe |
Data Retrieval | Queries directly stored data in a database. | Pulls data on-demand from APIs, and chaches results for efficiency purposes. |
Query Syntax | Standard SQL syntax. | Extends SQL to allow querying schemas derived from API calls . |
Schema Definition | Pre-defined schema with fixed columns and data types. | Dynamically defines schemas based on API responses. It relies on JSONB for nested data. |
Cross-Source Joins | Operates within a single database system. | Allows joins across multiple platforms (e.g., AWS + Azure resources). |
PERFORMANCE | ||
---|---|---|
Aspect | SQL Query | Steampipe |
Speed | Optimized for high-speed data retrieval from (indexed) tables. | Speed depends on the response time of API calls; performance is improved through caching but it may lead to stale data. |
Data Volume | Efficiently handles large datasets via indexing and query optimization. | Limited by API and real-time retrieval constraints. It is not designed for datasets with high-throughput. |
Scalability | Scales well with large databases and distributed database systems. | Performance can decrease significantly with large-scale, multi-platform queries. |
EXTENSIBILITY | ||
---|---|---|
Aspect | SQL Query | Steampipe |
Custom Queries | Supports complex query construction. | Supports SQL features, but JSONB fields complicate querying deeply nested API responses. |
Plugin/Integration | Relies on database extensions to add custom functionalities. | Relies on plugins to connect to new platforms or add new data sources. |
Data Transformation | Transformation via SQL functions (e.g., CAST) | Includes SQL transformations, but handling JSON data can make transformations more difficult. |
HISTORICAL AND REAL-TIME DATA | ||
---|---|---|
Aspect | SQL Query | Steampipe |
Historical Data | Maintains historical data through transactional logging. | Does not support history by default. It requires external caching or snapshots. |
Real-Time Queries | Not inherently real-time unless paired with streaming technologies, such as Kafka. | Retrieves real-time data directly from APIs on-demand. However, due to caching, it can output obsolete information if the same API is called repeatedly. |
USE CASES | ||
---|---|---|
Aspect | SQL Query | Steampipe |
Primary Use | Data retrieval, analysis, and manipulation in a structured database. | Monitoring, reporting, and aggregating metadata across different platforms. |
Automation | Frequently integrated into ETL pipelines or dashboards. | Ideal for on-demand reporting and auditing, but it is less suited for large-scale ETL workflows. |
User Base | Database administrators, analysts, developers. | Cloud operators, platform engineers, DevOps teams. |
For project 1, my teammate and I built an application, which allows patients to book appointments with doctors, and which stores medical information of patients. The flask application frontend was deployed in GCP.
A possible use for Steampipe would be to check if the GCP instance is running, to ensure that the application can be accessed.
Prerequisites: gcloud set up on local machine and connected to the GCP instance.
In the terminal:
- Install Steampipe:
brew install turbot/tap/steampipe
- Install Google Cloud plugin:
steampipe plugin install gcp
- Start the query interface:
steampipe query
- To get the status of the VM, run:
SELECT name, status
FROM gcp_compute_instance
WHERE name = 'name-of-vm-instance';
You should see the below in the terminal, where "name-of-vm-instance" is the name of the VM instance that is being queried:
