How to monitor performance of the IBM Storage Scale bridge for Grafana
The IBM Storage Scale bridge for Grafana (version 9.0.3 and later) includes a built-in performance monitoring feature that helps you understand how well the bridge is performing. This feature tracks execution times, identifies bottlenecks, and provides insights into the bridge's internal operations.
What can you monitor?
- How long it takes the bridge to process requests from Grafana dashboards or export data to Prometheus
- Which exporter jobs or queries are slow
- Performance trends over time
- Response times from the IBM Storage Scale Performance Monitoring backend
Why monitor bridge performance?
- Identify slow dashboards - Find out which dashboards take too long to load and why
- Compare before/after upgrades - Verify that software upgrades don't degrade performance
- Troubleshoot issues - Diagnose performance problems with detailed metrics
- Optimize resources - Understand resource usage and plan capacity
- Compare clusters - Analyze performance differences across multiple IBM Storage Scale clusters
- Open the analytics.py file, which holds the special analytics settings:
  vi source/analytics.py
- Change the setting http_metrics_enabled to True:
  http_metrics_enabled = True
- Restart the bridge:
  sudo systemctl restart grafana-bridge
That's it! The bridge is now collecting performance metrics.
You can access metrics through simple web URLs. The /http_metrics endpoint is available on both ports: OpenTSDB (4242) and Prometheus (9250).
View metrics in your browser:
# For the OpenTSDB API:
http://bridge-host-ip:4242/http_metrics
# For the PrometheusExporter API:
http://bridge-host-ip:9250/http_metrics
Or use curl with authentication:
curl -u scale_admin:password http://bridge-host-ip:4242/http_metrics
curl -u scale_admin:password http://bridge-host-ip:9250/http_metrics
The http_metrics_enabled setting is located in source/analytics.py:
- True (Recommended): Metrics are collected and available via REST API
- False: Metrics are only written to log files
You can enable these for more detailed troubleshooting:
inspect_special = 3 # Values: 0-5
What it does: Adds detailed debug information to log files
- 0 = Off (default)
- 1-2 = Basic debugging
- 3 = Detailed debugging (recommended for troubleshooting)
- 4-5 = Very detailed (generates lots of logs)
When to use: When you need to understand exactly what the bridge is doing internally
For more details: See the Deep Troubleshooting section in HTTP_METRICS_COLLECTION.md
The /http_metrics endpoint is your primary endpoint for accessing performance data. It is available on both the OpenTSDB (port 4242) and Prometheus (port 9250) plugins.
Supported Formats:
- prometheus - Best for: Integration with Prometheus monitoring system
  curl -u scale_admin:password http://bridge-host-ip:4242/http_metrics
  # Or for Prometheus plugin:
  curl -u scale_admin:password http://bridge-host-ip:9250/http_metrics
- json - Best for: Quick viewing, scripting, or custom analysis
  curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=json&limit=100"
- csv - Best for: Excel analysis, reports, or data science
  curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > metrics.csv
- opentsdb - Best for: If you're using OpenTSDB datasource in Grafana
  curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=opentsdb"
Parameters:
- format: Choose output format (prometheus, json, csv, opentsdb)
- limit: Number of recent metrics to return (for JSON and CSV only)
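The format and limit parameters can also be assembled in a script rather than typed by hand. The following is a minimal Python sketch, assuming the host name, ports, and basic authentication shown in the curl examples; `build_metrics_url` and `fetch_metrics` are hypothetical helper names, not part of the bridge, and rejecting `limit` for other formats is a choice made here, not documented bridge behavior.

```python
import base64
import urllib.request
from urllib.parse import urlencode

# Output formats listed on this page.
VALID_FORMATS = {"prometheus", "json", "csv", "opentsdb"}


def build_metrics_url(host, port=4242, fmt=None, limit=None):
    """Build an /http_metrics URL (hypothetical helper, not part of the bridge)."""
    params = {}
    if fmt is not None:
        if fmt not in VALID_FORMATS:
            raise ValueError(f"unsupported format: {fmt}")
        params["format"] = fmt
    if limit is not None:
        # This page documents limit for JSON and CSV output only.
        if fmt not in ("json", "csv"):
            raise ValueError("limit applies to json and csv output only")
        params["limit"] = int(limit)
    query = f"?{urlencode(params)}" if params else ""
    return f"http://{host}:{port}/http_metrics{query}"


def fetch_metrics(url, user, password, timeout=10):
    """Fetch the URL with HTTP basic authentication and return the body as text."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `fetch_metrics(build_metrics_url("bridge-host-ip", fmt="json", limit=100), "scale_admin", "password")` would mirror the JSON curl call above.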
The /internal_stats endpoint returns summary statistics about the metrics collection system itself, including:
- Total number of requests processed
- Number of metrics currently in memory cache
- Memory size used by metrics cache
- Number of bundle IDs registered (for OpenTSDB and Prometheus)
- Memory size used by bundle registries
Example Request:
curl -u scale_admin:password http://bridge-host-ip:4242/internal_stats
Example Response:
{
  "total_requests": 1523,
  "metrics_in_memory_count": 847,
  "metrics_in_memory_size": 245632,
  "opentsdb_bundle_ids_count": 45,
  "prometheus_bundle_ids_count": 23,
  "bundle_ids_count_total": 68,
  "opentsdb_bundle_registry_memory_size": 12480,
  "prometheus_bundle_registry_memory_size": 8960,
  "bundle_registry_memory_size_total": 21440
}
Use this endpoint when you want to:
- Check how many metrics are currently cached
- Monitor memory usage of the metrics system
- Verify the metrics collection is working
- Get a quick overview without retrieving all metrics data
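A script can read the same response and run a quick sanity check. The sketch below parses the example response shown on this page and verifies that the per-plugin values add up to the reported totals. That the totals are plain sums is an assumption drawn from the example numbers, and `check_internal_stats` is a hypothetical helper name.

```python
import json

# Example response copied from this page; a live bridge will report other numbers.
SAMPLE_RESPONSE = """
{
  "total_requests": 1523,
  "metrics_in_memory_count": 847,
  "metrics_in_memory_size": 245632,
  "opentsdb_bundle_ids_count": 45,
  "prometheus_bundle_ids_count": 23,
  "bundle_ids_count_total": 68,
  "opentsdb_bundle_registry_memory_size": 12480,
  "prometheus_bundle_registry_memory_size": 8960,
  "bundle_registry_memory_size_total": 21440
}
"""


def check_internal_stats(stats):
    """Return a list of inconsistencies; an empty list means the totals add up."""
    problems = []
    ids_sum = stats["opentsdb_bundle_ids_count"] + stats["prometheus_bundle_ids_count"]
    if ids_sum != stats["bundle_ids_count_total"]:
        problems.append(f"bundle id counts sum to {ids_sum}, "
                        f"total reports {stats['bundle_ids_count_total']}")
    mem_sum = (stats["opentsdb_bundle_registry_memory_size"]
               + stats["prometheus_bundle_registry_memory_size"])
    if mem_sum != stats["bundle_registry_memory_size_total"]:
        problems.append(f"registry sizes sum to {mem_sum}, "
                        f"total reports {stats['bundle_registry_memory_size_total']}")
    return problems


stats = json.loads(SAMPLE_RESPONSE)
print(check_internal_stats(stats))  # prints [] for the example response
```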
The bundle_ids endpoint lists all tracked query groups (useful for debugging specific dashboard issues).
# For OpenTSDB plugin
curl -u scale_admin:password http://bridge-host-ip:4242/api/bundle_ids
# For Prometheus plugin
curl -u scale_admin:password http://bridge-host-ip:9250/bundle_ids
Use Case 1: Compare Performance Before and After an Upgrade
Scenario: You upgraded the bridge and want to verify performance didn't degrade.
Steps:
- Before upgrade: Export baseline metrics
  curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > before_upgrade.csv
- Perform upgrade
- After upgrade: Export new metrics
  curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > after_upgrade.csv
- Compare in Excel:
  - Open both CSV files
  - Create pivot tables to compare average execution times
  - Look for significant differences in response times
Example Dashboard: Use the provided dashboard "Prometheus scrape jobs execution comparison over time" to visualize the comparison.
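The before/after comparison can also be scripted instead of done in Excel. The sketch below is one possible approach using only the Python standard library. It assumes the CSV export has `metric_name` and `value` columns, as the pandas example elsewhere on this page suggests; `mean_by_metric` and `compare_exports` are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_metric(csv_text):
    """Return {metric_name: mean value} for one CSV export."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["metric_name"]].append(float(row["value"]))
    return {name: mean(values) for name, values in groups.items()}


def compare_exports(before_text, after_text):
    """Return {metric_name: (before_mean, after_mean, relative_change)}.

    Only metrics present in both exports are compared. A relative change of
    0.25 means the mean grew by 25 percent after the upgrade.
    """
    before = mean_by_metric(before_text)
    after = mean_by_metric(after_text)
    result = {}
    for name in sorted(before.keys() & after.keys()):
        if before[name]:
            change = (after[name] - before[name]) / before[name]
        else:
            change = float("nan")  # no meaningful ratio against a zero baseline
        result[name] = (before[name], after[name], change)
    return result
```

With the files from the steps above, the call would be `compare_exports(open("before_upgrade.csv").read(), open("after_upgrade.csv").read())`.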
Use Case 2: Troubleshoot Slow Dashboards
Scenario: Users complain that certain dashboards load slowly.
Steps:
- Enable metrics collection (if not already enabled)
- Reproduce the slow dashboard loading
- Check the statistics:
  curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv"
- Look for high execution times in the output
- Identify which collectors or queries are slow
What to look for:
- OpenTsdbApi_query - Time to process all queries in a single HTTP request received from Grafana for OpenTSDB Datasource (may contain multiple sub-queries)
- perfmon_response_duration - Time waiting for IBM Storage Scale backend
- High values (> 1 second) indicate bottlenecks
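Scanning a large CSV export by eye is error-prone, so the 1-second rule of thumb can be applied with a small filter. This is a sketch only, assuming the export carries `metric_name` and `value` columns (the names used in the pandas example on this page); `slow_backend_rows` is a hypothetical helper name.

```python
import csv
import io


def slow_backend_rows(csv_text, threshold_seconds=1.0):
    """Return perfmon_response_duration rows above the threshold, slowest first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        is_backend_wait = row["metric_name"] == "perfmon_response_duration"
        if is_backend_wait and float(row["value"]) > threshold_seconds:
            rows.append(row)
    return sorted(rows, key=lambda r: float(r["value"]), reverse=True)
```

Rows returned by this filter point at time spent waiting for the IBM Storage Scale backend rather than inside the bridge itself.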
Use Case 3: Compare Performance of PrometheusExporter Scrape Jobs Across Multiple IBM Storage Scale Clusters
Scenario: You have multiple IBM Storage Scale clusters, each with its own bridge configured as a PrometheusExporter, and you want to compare their performance.
Prerequisites:
- IBM Storage Scale bridge for Grafana must be configured as PrometheusExporter on each cluster (not OpenTSDB)
- The /http_metrics endpoint must be configured as a separate Prometheus scrape job for each bridge
- The provided example dashboards are designed specifically for Prometheus datasources
Note: This use case and the example dashboards are not applicable if you're using the bridge only as an OpenTSDB datasource.
Steps:
- Enable Prometheus exporter on each bridge:
  - Edit config.ini on each bridge
  - Ensure Prometheus port 9250 is enabled:
      [prometheus]
      port = 9250
  - Restart each bridge
- Configure Prometheus to scrape the /http_metrics endpoint from multiple bridges as separate scrape jobs:
    # prometheus.yml
    scrape_configs:
      - job_name: 'cluster-production'
        static_configs:
          - targets: ['bridge-prod:9250']
        metrics_path: '/http_metrics'
        params:
          format: ['prometheus']
      - job_name: 'cluster-development'
        static_configs:
          - targets: ['bridge-dev:9250']
        metrics_path: '/http_metrics'
        params:
          format: ['prometheus']
- Import the comparison dashboard into Grafana
- Use the dashboard to compare performance:
  - Use the job_name variable to filter by cluster (e.g., "cluster-production", "cluster-development")
  - Compare execution times per scrape job across different clusters
  - Identify which cluster has slower performance
  - Use the time-shift feature to compare current vs. historical performance
What you can analyze:
- Which cluster's bridge is performing better
- Execution time differences between clusters
- Performance trends over time for each cluster
- Impact of configuration differences between clusters
Example Dashboard: Prometheus scrape jobs execution comparison over time
Use Case 4: Monitor Bridge Performance Continuously
Scenario: You want continuous monitoring of bridge performance.
Prerequisites:
- IBM Storage Scale bridge for Grafana must be configured as PrometheusExporter
- The provided example dashboards are designed specifically for Prometheus datasources
Steps:
- Enable Prometheus exporter on the bridge:
  - Edit config.ini
  - Ensure Prometheus port 9250 is enabled:
      [prometheus]
      port = 9250
  - Restart the bridge
- Enable HTTP metrics collection:
  - Set http_metrics_enabled = True in source/analytics.py
  - Restart the bridge
- Configure Prometheus to scrape the /http_metrics endpoint:
    # prometheus.yml
    scrape_configs:
      - job_name: 'grafana-bridge-internal'
        static_configs:
          - targets: ['bridge-host-ip:9250']  # Prometheus exporter port
        metrics_path: '/http_metrics'
        params:
          format: ['prometheus']
        basic_auth:
          username: 'scale_admin'
          password: 'your_password'
        scrape_interval: 30s
- Import the provided Grafana dashboards
- Set up alerts in Grafana for high response times (optional)
Use Case 5: Export Metrics for Reports and Analysis
Scenario: You need to create a performance report or analyze trends in Excel/Python.
Steps:
- Export metrics to CSV:
  curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > bridge_metrics.csv
- For Excel:
  - Open the CSV file in Excel
  - Create pivot tables to analyze:
    - Average execution time by collector
    - Peak execution times
    - Trends over time
- For Python/Pandas:
    import pandas as pd
    # Load the data
    df = pd.read_csv('bridge_metrics.csv')
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    # Calculate average by collector
    avg_by_collector = df.groupby('collector_name')['value'].mean()
    print("Average execution time by collector:")
    print(avg_by_collector)
    # Find slowest operations
    slowest = df.nlargest(10, 'value')
    print("\nTop 10 slowest operations:")
    print(slowest[['timestamp', 'metric_name', 'collector_name', 'value']])
The bridge includes two pre-built dashboards for visualizing HTTP metrics:
File: Prometheus scrape jobs execution comparison over time-1779302096160.json
What it shows:
- Compare execution times across multiple Prometheus scrape jobs
- Compare current performance vs. historical (time-shift feature)
- Aggregated view by collector and job
Best for:
- Comparing multiple IBM Storage Scale clusters
- Comparing different time periods (today vs. yesterday)
- Identifying performance trends across your infrastructure
Key Features:
- Time-shift comparison: Compare current metrics with 1 hour ago, 1 day ago, 1 week ago, etc.
- Multi-job view: See all your clusters/datasources in one dashboard
- Table format: Easy to spot differences at a glance
File: Single Prometheus scrape job execution comparison over time-1779302278182.json
What it shows:
- Detailed analysis of a single Prometheus scrape job
- Perfmon query response times
- Time series graphs for trend analysis
- Historical comparison
Best for:
- Deep-diving into specific collector performance
- Troubleshooting slow queries
- Understanding performance patterns over time
Key Features:
- Detailed metrics: See individual collector execution times
- Time series graphs: Visualize trends and spikes
- Historical comparison: Compare current vs. baseline performance
- Open Grafana web interface
- Navigate to Dashboards → Import
- Click Upload JSON file and select the dashboard file
- Select your Prometheus datasource
- Click Import
Both dashboards support these variables for filtering:
- Datasource: Select your Prometheus datasource
- collector_name: Filter by specific collector (supports regex, e.g., GPFS.* for all GPFS collectors)
- job_name: Filter by Prometheus job name (supports regex)
- time_shift: Time offset for comparison (e.g., 1d, 1w, 1h)
Time-shift examples:
- 1h - Compare with 1 hour ago
- 1d - Compare with yesterday
- 1w - Compare with last week
| Metric Name | What It Measures | Good Value | Investigate If |
|---|---|---|---|
| OpenTsdbApi_query | Time to process all queries in a single HTTP request (may contain multiple sub-queries) | < 1s | > 3s |
| PrometheusExporter_metrics | Time to process Prometheus scrape request (typically one sensor per scrape job) | < 0.5s | > 2s |
| perfmon_response_duration | Time waiting for IBM Storage Scale backend | < 0.5s | > 1s |
| perfmon_response_amount | Size of data from backend (bytes) | Varies | Unusually large |
- Fast queries: 0.1 - 0.5 seconds
- Moderate queries: 0.5 - 2 seconds
- Slow queries: > 2 seconds (investigate)
- Very slow queries: > 5 seconds (needs attention)
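These bands translate directly into a small classifier, which can be handy when labeling rows in a report. The sketch below is an illustration, not bridge code. It treats anything at or below 0.5 seconds (including values under 0.1) as fast, which is an assumption about how to handle the band edges; `classify_query_time` is a hypothetical name.

```python
def classify_query_time(seconds):
    """Map an execution time in seconds to the bands listed above."""
    if seconds < 0:
        raise ValueError("execution time cannot be negative")
    if seconds > 5:
        return "very slow"  # needs attention
    if seconds > 2:
        return "slow"       # investigate
    if seconds > 0.5:
        return "moderate"
    return "fast"


for sample in (0.2, 1.0, 3.0, 7.5):
    print(sample, classify_query_time(sample))
```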
Problem: Dashboards or scrape jobs respond slowly
Solution:
- Enable detailed metrics:
  # In source/analytics.py
  http_metrics_enabled = True
  requests_elapsed_time = True
- Restart the bridge
- Check where the time is spent:
  curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv"
- Look for:
  - High perfmon_response_duration → Backend is slow
  - High PrometheusExporter_metrics or OpenTsdbApi_query → Request processing is slow
Problem: The /http_metrics endpoint returns no data or cannot be reached
Possible causes:
- Metrics not enabled:
  - Check http_metrics_enabled = True in source/analytics.py
  - Restart the bridge
- Authentication required:
  - Use -u username:password with curl
  - Check your bridge authentication settings
- Wrong port:
  - OpenTSDB plugin: port 4242
  - Prometheus plugin: port 9250
  - Both ports support the /http_metrics endpoint
Problem: The export returns too much data
Solution:
Use the limit parameter to reduce the dataset:
# Get only last 100 metrics
curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv&limit=100" > metrics.csv
- Metrics are stored in memory (default: last 1000 requests)
- Typical memory usage: 1-5 MB
- Automatic cleanup prevents memory leaks
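To make the retention behavior concrete, here is a minimal sketch of a size-bounded, age-bounded in-memory buffer. It is an illustration of the general technique only and is not the bridge's actual implementation. The class name `MetricsBuffer` and its parameters are hypothetical; the defaults use the 1000-request limit stated above and the 1-hour retention given in the FAQ.

```python
import time
from collections import deque


class MetricsBuffer:
    """Illustrative bounded buffer; not the bridge's actual implementation."""

    def __init__(self, max_entries=1000, max_age_seconds=3600, clock=time.time):
        # A deque with maxlen discards the oldest entry when a new one overflows it.
        self._entries = deque(maxlen=max_entries)
        self._max_age = max_age_seconds
        self._clock = clock

    def add(self, name, value):
        """Record one measurement with the current timestamp."""
        self._entries.append((self._clock(), name, value))

    def snapshot(self):
        """Drop entries older than max_age_seconds and return the rest, oldest first."""
        cutoff = self._clock() - self._max_age
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()
        return list(self._entries)
```

Because both limits are enforced on every write or read, memory use stays bounded no matter how long the process runs.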
- Minimal: ~1-5% additional CPU usage
- With inspect_special=3: ~10-15% additional CPU usage
- With inspect_special=5: ~20-30% additional CPU usage
| Environment | Recommended Settings |
|---|---|
| Production | http_metrics_enabled = True, inspect_special = 0 |
| Development | http_metrics_enabled = True, inspect_special = 1-2 |
| Troubleshooting | http_metrics_enabled = True, inspect_special = 3, requests_elapsed_time = True |
| Deep Debug | All enabled, inspect_special = 4-5 (Disable after troubleshooting!) |
Add this to your prometheus.yml:
scrape_configs:
  - job_name: 'grafana-bridge-internal'
    static_configs:
      - targets: ['bridge-host-ip:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'your_password'
    scrape_interval: 30s
To scrape several bridges, define one job per bridge:
scrape_configs:
  - job_name: 'bridge-cluster1'
    static_configs:
      - targets: ['bridge1.example.com:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'password1'
  - job_name: 'bridge-cluster2'
    static_configs:
      - targets: ['bridge2.example.com:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'password2'
Example PromQL queries:
# Average response time
avg(PrometheusExporter_metrics)
# Response time by collector
avg by (collector_name)(PrometheusExporter_metrics)
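# 95th percentile response time over the last 5 minutes
# (assumes the metric is exposed as a gauge; histogram_quantile needs *_bucket series)
quantile_over_time(0.95, PrometheusExporter_metrics[5m])
# Request rate (requests per second); meaningful only if the metric is exposed as a counter
rate(PrometheusExporter_metrics[5m])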
# Compare current vs. 1 day ago
avg(PrometheusExporter_metrics) - avg(PrometheusExporter_metrics offset 1d)
Q: Does metrics collection slow down the bridge?
A: No, the overhead is minimal (1-5%). The bridge is designed to collect metrics efficiently.
Q: How long are metrics kept?
A: By default, metrics are kept for 1 hour (3600 seconds) or up to 1000 requests, whichever comes first.
Q: Can I export metrics automatically?
A: Yes, you can set up a cron job:
# Add to crontab (runs every hour)
0 * * * * curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" >> /var/log/bridge_metrics_$(date +\%Y\%m\%d).csv
Q: What is the difference between /http_metrics and /internal_stats?
A: /http_metrics returns the actual performance metrics data (execution times, response durations, etc.) in various formats, while /internal_stats returns summary statistics about the metrics collection system itself (number of requests, cache size, memory usage, bundle IDs count). Use /http_metrics for performance analysis and /internal_stats for monitoring the health of the metrics collection system.
Q: Can I use the metrics with an OpenTSDB datasource?
A: Yes! Use format=opentsdb:
curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=opentsdb"
Q: How do I disable metrics collection?
A: Set http_metrics_enabled = False in source/analytics.py and restart the bridge.
Q: Which port should I use?
A: Both ports support /http_metrics. Use port 4242 for OpenTSDB plugin and port 9250 for Prometheus plugin. Choose based on which plugin you're using.