How to monitor performance of the IBM Storage Scale bridge for Grafana - IBM/ibm-spectrum-scale-bridge-for-grafana GitHub Wiki

The IBM Storage Scale bridge for Grafana (version 9.0.3 and later) includes a built-in performance monitoring feature that helps you understand how well the bridge is performing. This feature tracks execution times, identifies bottlenecks, and provides insights into the bridge's internal operations.

What can you monitor?

How long it takes the bridge to process requests from Grafana dashboards or export data to Prometheus
Which exporter jobs or queries are slow
Performance trends over time
Response times from IBM Storage Scale Performance Monitoring backend

Why monitor bridge performance?

Identify slow dashboards - Find out which dashboards take too long to load and why
Compare before/after upgrades - Verify that software upgrades don't degrade performance
Troubleshoot issues - Diagnose performance problems with detailed metrics
Optimize resources - Understand resource usage and plan capacity
Compare clusters - Analyze performance differences across multiple IBM Storage Scale clusters

Quick Start

Step 1: Enable HTTP Metrics

Open the analytics.py file, which controls the bridge's analytics settings:
```
vi source/analytics.py
```
Change the setting http_metrics_enabled to True:
```
http_metrics_enabled = True
```
Restart the bridge:
```
sudo systemctl restart grafana-bridge
```

That's it! The bridge is now collecting performance metrics.

Step 2: View Your Metrics

You can access metrics through simple web URLs. The /http_metrics endpoint is available on both ports: the OpenTSDB port (4242) and the Prometheus port (9250). View metrics in your browser:

# For OpenTSDB api:
http://bridge-host-ip:4242/http_metrics
# For PrometheusExporter api:
http://bridge-host-ip:9250/http_metrics

Or use curl with authentication:

curl -u scale_admin:password http://bridge-host-ip:4242/http_metrics
curl -u scale_admin:password http://bridge-host-ip:9250/http_metrics

Understanding the Settings

Main Setting: http_metrics_enabled

Located in source/analytics.py

True (Recommended): Metrics are collected and available via REST API
False: Metrics are only written to log files

Optional Setting: inspect_special

You can enable this setting for more detailed troubleshooting:

inspect_special = 3  # Values: 0-5

What it does: Adds detailed debug information to log files

0 = Off (default)
1-2 = Basic debugging
3 = Detailed debugging (recommended for troubleshooting)
4-5 = Very detailed (generates lots of logs)

When to use: When you need to understand exactly what the bridge is doing internally

⚠️ Warning: Levels 4-5 create large log files. Use only when actively troubleshooting, then set back to 0.

For more details: See the Deep Troubleshooting section in HTTP_METRICS_COLLECTION.md

Available Endpoints

1. /http_metrics - Main Metrics Endpoint

This is your primary endpoint for accessing performance data. Available on both OpenTSDB (port 4242) and Prometheus (port 9250) plugins.

Supported Formats:

Prometheus Format (Default)

Best for: Integration with Prometheus monitoring system

curl -u scale_admin:password http://bridge-host-ip:4242/http_metrics
# Or for Prometheus plugin:
curl -u scale_admin:password http://bridge-host-ip:9250/http_metrics

JSON Format

Best for: Quick viewing, scripting, or custom analysis

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=json&limit=100"

CSV Format

Best for: Excel analysis, reports, or data science

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > metrics.csv

OpenTSDB Format

Best for: If you're using OpenTSDB datasource in Grafana

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=opentsdb"

Parameters:

format: Choose output format (prometheus, json, csv, opentsdb)
limit: Number of recent metrics to return (for JSON and CSV only)
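
The same parameters can be set from a script. The following is a minimal sketch using only the Python standard library. The helper names `build_metrics_url` and `fetch_metrics` are hypothetical (they are not part of the bridge), the host and credentials are the placeholders from the curl examples, and HTTP basic authentication is assumed because the curl examples use `-u`.

```python
import base64
import urllib.parse
import urllib.request

# Output formats listed in the Parameters section above.
FORMATS = ("prometheus", "json", "csv", "opentsdb")


def build_metrics_url(host, port=4242, fmt="prometheus", limit=None):
    """Build an /http_metrics URL (hypothetical helper, not part of the bridge)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"format": fmt}
    if limit is not None:
        # The wiki documents limit for JSON and CSV output only.
        if fmt not in ("json", "csv"):
            raise ValueError("limit applies to json and csv only")
        params["limit"] = int(limit)
    query = urllib.parse.urlencode(params)
    return f"http://{host}:{port}/http_metrics?{query}"


def fetch_metrics(url, user, password, timeout=10):
    """Fetch the URL with HTTP basic auth and return the body as text."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Prints the URL only; call fetch_metrics(url, "scale_admin", "password") to query a real bridge.
    print(build_metrics_url("bridge-host-ip", fmt="json", limit=100))
```

Rejecting `limit` for the Prometheus and OpenTSDB formats is a choice made in this sketch to mirror the documentation; how the bridge itself treats that combination is not stated here.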

2. /internal_stats - Statistics Summary

Returns summary statistics about the metrics collection system itself, including:

Total number of requests processed
Number of metrics currently in memory cache
Memory size used by metrics cache
Number of bundle IDs registered (for OpenTSDB and Prometheus)
Memory size used by bundle registries

Example Request:

curl -u scale_admin:password http://bridge-host-ip:4242/internal_stats

Example Response:

{
  "total_requests": 1523,
  "metrics_in_memory_count": 847,
  "metrics_in_memory_size": 245632,
  "opentsdb_bundle_ids_count": 45,
  "prometheus_bundle_ids_count": 23,
  "bundle_ids_count_total": 68,
  "opentsdb_bundle_registry_memory_size": 12480,
  "prometheus_bundle_registry_memory_size": 8960,
  "bundle_registry_memory_size_total": 21440
}

Use this endpoint when you want to:

Check how many metrics are currently cached
Monitor memory usage of the metrics system
Verify the metrics collection is working
Get a quick overview without retrieving all metrics data
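
As a quick health check, the response can be parsed and sanity-checked in a script. The sketch below uses the example response shown above. It assumes that the `*_total` fields are the sum of the OpenTSDB and Prometheus parts (which holds for the example) and that the size fields are in bytes, which this page does not state. The function names are hypothetical.

```python
import json

# Example response copied from the section above.
EXAMPLE_RESPONSE = """
{
  "total_requests": 1523,
  "metrics_in_memory_count": 847,
  "metrics_in_memory_size": 245632,
  "opentsdb_bundle_ids_count": 45,
  "prometheus_bundle_ids_count": 23,
  "bundle_ids_count_total": 68,
  "opentsdb_bundle_registry_memory_size": 12480,
  "prometheus_bundle_registry_memory_size": 8960,
  "bundle_registry_memory_size_total": 21440
}
"""


def totals_are_consistent(stats):
    """Return True if each *_total field equals the sum of its two parts."""
    ids_ok = stats["bundle_ids_count_total"] == (
        stats["opentsdb_bundle_ids_count"] + stats["prometheus_bundle_ids_count"]
    )
    memory_ok = stats["bundle_registry_memory_size_total"] == (
        stats["opentsdb_bundle_registry_memory_size"]
        + stats["prometheus_bundle_registry_memory_size"]
    )
    return ids_ok and memory_ok


def average_size_per_metric(stats):
    """Average cache size per stored metric (unit assumed to be bytes)."""
    count = stats["metrics_in_memory_count"]
    return stats["metrics_in_memory_size"] / count if count else 0.0


if __name__ == "__main__":
    stats = json.loads(EXAMPLE_RESPONSE)
    print("totals consistent:", totals_are_consistent(stats))
    print("average size per metric:", round(average_size_per_metric(stats)))
```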

3. /bundle_ids - Query Tracking

Lists all tracked query groups (useful for debugging specific dashboard issues).

# For OpenTSDB plugin
curl -u scale_admin:password http://bridge-host-ip:4242/api/bundle_ids

# For Prometheus plugin
curl -u scale_admin:password http://bridge-host-ip:9250/bundle_ids

Common Use Cases

Use Case 1: Compare Performance Before and After Upgrade

Scenario: You upgraded the bridge and want to verify performance didn't degrade.

Steps:

Before upgrade: Export baseline metrics

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > before_upgrade.csv

Perform upgrade

After upgrade: Export new metrics

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > after_upgrade.csv

Compare in Excel:
- Open both CSV files
- Create pivot tables to compare average execution times
- Look for significant differences in response times
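
If you prefer a script to a spreadsheet, the comparison can be sketched in Python with the standard library only. This assumes the CSV export has a header row with `metric_name` and `value` columns, as used by the pandas example in Use Case 5; the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_metric(csv_text):
    """Return {metric_name: mean value} for one CSV export."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        values[row["metric_name"]].append(float(row["value"]))
    return {name: mean(vals) for name, vals in values.items()}


def compare_exports(before_text, after_text):
    """Return {metric_name: percent change} for metrics present in both exports."""
    before = mean_by_metric(before_text)
    after = mean_by_metric(after_text)
    changes = {}
    for name in sorted(before.keys() & after.keys()):
        if before[name] > 0:  # avoid dividing by zero for all-zero baselines
            changes[name] = (after[name] - before[name]) / before[name] * 100.0
    return changes
```

To use it, read before_upgrade.csv and after_upgrade.csv into strings and pass them to `compare_exports`; a positive percentage means the average value increased after the upgrade.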

Example Dashboard: Use the provided dashboard "Prometheus scrape jobs execution comparison over time" to visualize the comparison.

Use Case 2: Identify Slow Dashboards

Scenario: Users complain that certain dashboards load slowly.

Steps:

Enable metrics collection (if not already enabled)
Reproduce the slow dashboard loading

Check the statistics:

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv"

Look for high execution times in the output
Identify which collectors or queries are slow

What to look for:

OpenTsdbApi_query - Time to process all queries in a single HTTP request received from Grafana for OpenTSDB Datasource (may contain multiple sub-queries)
perfmon_response_duration - Time waiting for IBM Storage Scale backend
High values (> 1 second) indicate bottlenecks
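
The check above can be sketched as a small filter over the CSV output. It assumes the `metric_name`, `collector_name`, and `value` columns used in the pandas example in Use Case 5, and that duration values are in seconds; `slow_entries` is a hypothetical helper name.

```python
import csv
import io


def slow_entries(csv_text, threshold=1.0):
    """Return (value, metric_name, collector_name) tuples above the threshold, slowest first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # perfmon_response_amount is a data size, not a duration, so skip it.
        if row["metric_name"] == "perfmon_response_amount":
            continue
        value = float(row["value"])
        if value > threshold:
            rows.append((value, row["metric_name"], row.get("collector_name", "")))
    return sorted(rows, reverse=True)
```

Entries dominated by `perfmon_response_duration` point at the IBM Storage Scale backend, while high `OpenTsdbApi_query` values with low backend times point at request processing in the bridge.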

Use Case 3: Compare Performance of PrometheusExporter Scrape Jobs Across Multiple IBM Storage Scale Clusters

Scenario: You have multiple IBM Storage Scale clusters, each with its own bridge configured as a PrometheusExporter, and you want to compare their performance.

Prerequisites:

IBM Storage Scale bridge for Grafana must be configured as PrometheusExporter on each cluster (not OpenTSDB)
The /http_metrics endpoint must be configured as a separate Prometheus scrape job for each bridge
The provided example dashboards are designed specifically for Prometheus datasources

Note: This use case and the example dashboards are not applicable if you're using the bridge only as an OpenTSDB datasource.

Steps:

Enable Prometheus exporter on each bridge:
- Edit config.ini on each bridge
- Ensure Prometheus port 9250 is enabled:
```
[prometheus]
port = 9250
```
- Restart each bridge

Configure Prometheus to scrape the /http_metrics endpoint from multiple bridges as separate scrape jobs:

# prometheus.yml
scrape_configs:
  - job_name: 'cluster-production'
    static_configs:
      - targets: ['bridge-prod:9250']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
  
  - job_name: 'cluster-development'
    static_configs:
      - targets: ['bridge-dev:9250']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']

Import the comparison dashboard into Grafana:
- Prometheus scrape jobs execution comparison over time
Use the dashboard to compare performance:
- Use the job_name variable to filter by cluster (e.g., "cluster-production", "cluster-development")
- Compare execution times per scrape job across different clusters
- Identify which cluster has slower performance
- Use the time-shift feature to compare current vs. historical performance

What you can analyze:

Which cluster's bridge is performing better
Execution time differences between clusters
Performance trends over time for each cluster
Impact of configuration differences between clusters

Example Dashboard: Prometheus scrape jobs execution comparison over time

Use Case 4: Monitor Performance of PrometheusExporter Scrape Jobs Over Time

Scenario: You want continuous monitoring of bridge performance.

Prerequisites:

IBM Storage Scale bridge for Grafana must be configured as PrometheusExporter
The provided example dashboards are designed specifically for Prometheus datasources

Steps:

Enable Prometheus exporter on the bridge:
- Edit config.ini
- Ensure Prometheus port 9250 is enabled:
```
[prometheus]
port = 9250
```
- Restart the bridge
Enable HTTP metrics collection:
- Set http_metrics_enabled = True in source/analytics.py
- Restart the bridge

Configure Prometheus to scrape the /http_metrics endpoint:

# prometheus.yml
scrape_configs:
  - job_name: 'grafana-bridge-internal'
    static_configs:
      - targets: ['bridge-host-ip:9250']  # Prometheus exporter port
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'your_password'
    scrape_interval: 30s

Import the provided Grafana dashboards:
- Prometheus scrape jobs execution comparison over time
- Single Prometheus scrape job execution comparison over time
Set up alerts in Grafana for high response times (optional)

Use Case 5: Export Data for Analysis

Scenario: You need to create a performance report or analyze trends in Excel/Python.

Steps:

Export metrics to CSV:

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > bridge_metrics.csv

For Excel:
- Open the CSV file in Excel
- Create pivot tables to analyze:
  - Average execution time by collector
  - Peak execution times
  - Trends over time

For Python/Pandas:

import pandas as pd

# Load the data
df = pd.read_csv('bridge_metrics.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Calculate average by collector
avg_by_collector = df.groupby('collector_name')['value'].mean()
print("Average execution time by collector:")
print(avg_by_collector)

# Find slowest operations
slowest = df.nlargest(10, 'value')
print("\nTop 10 slowest operations:")
print(slowest[['timestamp', 'metric_name', 'collector_name', 'value']])

Example Grafana Dashboards

The bridge includes two pre-built dashboards for visualizing HTTP metrics:

1. Multi-Job Comparison Dashboard

File: Prometheus scrape jobs execution comparison over time-1779302096160.json

What it shows:

Compare execution times across multiple Prometheus scrape jobs
Compare current performance vs. historical (time-shift feature)
Aggregated view by collector and job

Best for:

Comparing multiple IBM Storage Scale clusters
Comparing different time periods (today vs. yesterday)
Identifying performance trends across your infrastructure

Key Features:

Time-shift comparison: Compare current metrics with 1 hour ago, 1 day ago, 1 week ago, etc.
Multi-job view: See all your clusters/datasources in one dashboard
Table format: Easy to spot differences at a glance

2. Single Job Deep-Dive Dashboard

File: Single Prometheus scrape job execution comparison over time-1779302278182.json

What it shows:

Detailed analysis of a single Prometheus scrape job
Perfmon query response times
Time series graphs for trend analysis
Historical comparison

Best for:

Deep-diving into specific collector performance
Troubleshooting slow queries
Understanding performance patterns over time

Key Features:

Detailed metrics: See individual collector execution times
Time series graphs: Visualize trends and spikes
Historical comparison: Compare current vs. baseline performance

How to Import Dashboards

Open Grafana web interface
Navigate to Dashboards → Import
Click Upload JSON file and select the dashboard file
Select your Prometheus datasource
Click Import

Dashboard Variables

Both dashboards support these variables for filtering:

Datasource: Select your Prometheus datasource
collector_name: Filter by specific collector (supports regex, e.g., GPFS.* for all GPFS collectors)
job_name: Filter by Prometheus job name (supports regex)
time_shift: Time offset for comparison (e.g., 1d, 1w, 1h)

Time-shift examples:

1h - Compare with 1 hour ago
1d - Compare with yesterday
1w - Compare with last week

Understanding the Metrics

Key Metrics Explained

| Metric Name | What It Measures | Good Value | Investigate If |
|---|---|---|---|
| `OpenTsdbApi_query` | Time to process all queries in a single HTTP request (may contain multiple sub-queries) | < 1s | > 3s |
| `PrometheusExporter_metrics` | Time to process Prometheus scrape request (typically one sensor per scrape job) | < 0.5s | > 2s |
| `perfmon_response_duration` | Time waiting for IBM Storage Scale backend | < 0.5s | > 1s |
| `perfmon_response_amount` | Size of data from backend (bytes) | Varies | Unusually large |

What's Normal?

Fast queries: 0.1 - 0.5 seconds
Moderate queries: 0.5 - 2 seconds
Slow queries: > 2 seconds (investigate)
Very slow queries: > 5 seconds (needs attention)
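
These bands can be applied in a script when post-processing exported metrics. The sketch below is one possible encoding, not something the bridge provides. Two assumptions are made where the list above is ambiguous: values below 0.1 seconds count as fast, and a value exactly on a boundary (0.5, 2, or 5 seconds) falls into the lower band.

```python
def classify_duration(seconds):
    """Map an execution time in seconds to the bands listed above."""
    if seconds > 5:
        return "very slow"  # needs attention
    if seconds > 2:
        return "slow"  # investigate
    if seconds > 0.5:
        return "moderate"
    return "fast"


if __name__ == "__main__":
    for sample in (0.2, 1.0, 3.0, 6.0):
        print(sample, classify_duration(sample))
```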

Troubleshooting

Problem: Dashboards are slow to load

Solution:

Enable detailed metrics:

# In source/analytics.py
http_metrics_enabled = True
requests_elapsed_time = True

Restart the bridge

Check where the time is spent:

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv"

Look for:
- High perfmon_response_duration → Backend is slow
- High PrometheusExporter_metrics or OpenTsdbApi_query → Request processing is slow

Problem: Can't access `/http_metrics` endpoint

Possible causes:

Metrics not enabled:
- Check http_metrics_enabled = True in source/analytics.py
- Restart the bridge
Authentication required:
- Use -u username:password with curl
- Check your bridge authentication settings
Wrong port:
- OpenTSDB plugin: port 4242
- Prometheus plugin: port 9250
- Both ports support /http_metrics endpoint

Problem: Too much data in CSV export

Solution:

Use the limit parameter to reduce the dataset:

# Get only last 100 metrics
curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv&limit=100" > metrics.csv

Performance Impact

Memory Usage

Metrics are stored in memory (default: last 1000 requests)
Typical memory usage: 1-5 MB
Automatic cleanup prevents memory leaks
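
To illustrate why memory use stays bounded, here is a minimal sketch of a store that keeps at most 1000 entries and drops anything older than one hour (the retention values given in the FAQ). It is an illustration of the idea only and makes no claim about how the bridge implements its cache; the `MetricsBuffer` class is hypothetical.

```python
import time
from collections import deque


class MetricsBuffer:
    """Illustrative bounded store; NOT the bridge's actual implementation."""

    def __init__(self, max_entries=1000, max_age_seconds=3600):
        self.max_age_seconds = max_age_seconds
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self._entries = deque(maxlen=max_entries)

    def add(self, value, now=None):
        """Store a value with its timestamp, then drop expired entries."""
        now = time.time() if now is None else now
        self._entries.append((now, value))
        self._evict(now)

    def _evict(self, now):
        # Entries are appended in time order, so expired ones sit on the left.
        while self._entries and now - self._entries[0][0] > self.max_age_seconds:
            self._entries.popleft()

    def values(self, now=None):
        """Return the values that are still within the retention window."""
        now = time.time() if now is None else now
        self._evict(now)
        return [value for _, value in self._entries]
```

Whichever limit is reached first wins: a busy bridge hits the entry cap, while an idle one loses entries by age.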

CPU Overhead

Minimal: ~1-5% additional CPU usage
With inspect_special=3: ~10-15% additional CPU usage
With inspect_special=5: ~20-30% additional CPU usage

Recommendations

| Environment | Recommended Settings |
|---|---|
| Production | `http_metrics_enabled = True`, `inspect_special = 0` |
| Development | `http_metrics_enabled = True`, `inspect_special = 1-2` |
| Troubleshooting | `http_metrics_enabled = True`, `inspect_special = 3`, `requests_elapsed_time = True` |
| Deep Debug | All enabled, `inspect_special = 4-5` (Disable after troubleshooting!) |

Integration with Prometheus

Basic Configuration

Add this to your prometheus.yml:

scrape_configs:
  - job_name: 'grafana-bridge-internal'
    static_configs:
      - targets: ['bridge-host-ip:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'your_password'
    scrape_interval: 30s

Multiple Clusters Configuration

scrape_configs:
  - job_name: 'bridge-cluster1'
    static_configs:
      - targets: ['bridge1.example.com:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'password1'
  
  - job_name: 'bridge-cluster2'
    static_configs:
      - targets: ['bridge2.example.com:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'password2'
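
With many clusters, writing these jobs by hand becomes repetitive. The following sketch generates the same structure from a list of bridges using plain string formatting. The function names and the input dictionaries are hypothetical, and the generated text should be reviewed before it is merged into a real prometheus.yml.

```python
def _quote(value):
    """Single-quote a YAML scalar, doubling any embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def scrape_job(job_name, target, username, password, interval=None):
    """Render one scrape job for a bridge's /http_metrics endpoint."""
    lines = [
        f"  - job_name: {_quote(job_name)}",
        "    static_configs:",
        f"      - targets: [{_quote(target)}]",
        "    metrics_path: '/http_metrics'",
        "    params:",
        "      format: ['prometheus']",
        "    basic_auth:",
        f"      username: {_quote(username)}",
        f"      password: {_quote(password)}",
    ]
    if interval:
        lines.append(f"    scrape_interval: {interval}")
    return "\n".join(lines)


def scrape_configs(bridges):
    """Render a scrape_configs section from a list of keyword dictionaries."""
    jobs = "\n\n".join(scrape_job(**bridge) for bridge in bridges)
    return "scrape_configs:\n" + jobs + "\n"


if __name__ == "__main__":
    print(scrape_configs([
        {"job_name": "bridge-cluster1", "target": "bridge1.example.com:4242",
         "username": "scale_admin", "password": "password1"},
        {"job_name": "bridge-cluster2", "target": "bridge2.example.com:4242",
         "username": "scale_admin", "password": "password2"},
    ]))
```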

Useful Prometheus Queries

# Average response time
avg(PrometheusExporter_metrics)

# Response time by collector
avg by (collector_name)(PrometheusExporter_metrics)

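# 95th percentile response time over the last hour
# (assumes the metric is exposed as a gauge; histogram_quantile() requires histogram bucket series)
quantile_over_time(0.95, PrometheusExporter_metrics[1h])

# Per-second rate of change over 5 minutes (only meaningful if the metric is exposed as a counter)
rate(PrometheusExporter_metrics[5m])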

# Compare current vs. 1 day ago
avg(PrometheusExporter_metrics) - avg(PrometheusExporter_metrics offset 1d)

FAQ

Q: Will enabling metrics slow down my bridge?

A: Only marginally. With the default settings, collecting metrics adds roughly 1-5% CPU overhead. Higher inspect_special levels add more (see Performance Impact).

Q: How long are metrics stored?

A: By default, metrics are kept for 1 hour (3600 seconds) or up to 1000 requests, whichever comes first.

Q: Can I export metrics to a file automatically?

A: Yes, you can set up a cron job:

# Add to crontab (runs every hour)
0 * * * * curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" >> /var/log/bridge_metrics_$(date +\%Y\%m\%d).csv
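
Appending with `>>` repeats the CSV header on every run if the export starts with one. If that matters for later analysis, a small script can handle the append instead. This sketch assumes each export begins with a header row (the pandas example in Use Case 5 reads columns by name); the function names are hypothetical, and fetching the CSV text is left to whatever HTTP client you use.

```python
from datetime import datetime
from pathlib import Path


def dated_export_path(directory, when):
    """Return the per-day file name used by the cron example, e.g. bridge_metrics_20250102.csv."""
    return Path(directory) / f"bridge_metrics_{when:%Y%m%d}.csv"


def append_export(csv_text, directory, when=None):
    """Append one export to the day's file, keeping the header only once."""
    when = when or datetime.now()
    path = dated_export_path(directory, when)
    lines = csv_text.splitlines()
    if path.exists() and lines:
        lines = lines[1:]  # assumes the first line of every export is the header
    with path.open("a", encoding="utf-8") as handle:
        handle.writelines(line + "\n" for line in lines)
    return path
```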

Q: What's the difference between `/http_metrics` and `/internal_stats`?

A: /http_metrics returns the actual performance metrics data (execution times, response durations, etc.) in various formats, while /internal_stats returns summary statistics about the metrics collection system itself (number of requests, cache size, memory usage, bundle IDs count). Use /http_metrics for performance analysis and /internal_stats for monitoring the health of the metrics collection system.

Q: Can I use this with OpenTSDB datasource in Grafana?

A: Yes! Use format=opentsdb:

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=opentsdb"

Q: How do I disable metrics collection?

A: Set http_metrics_enabled = False in source/analytics.py and restart the bridge.

Q: Which port should I use - 4242 or 9250?

A: Both ports support /http_metrics. Use port 4242 for OpenTSDB plugin and port 9250 for Prometheus plugin. Choose based on which plugin you're using.
