How to monitor performance of the IBM Storage Scale bridge for Grafana - IBM/ibm-spectrum-scale-bridge-for-grafana GitHub Wiki

The IBM Storage Scale bridge for Grafana (version 9.0.3 and later) includes a built-in performance monitoring feature that helps you understand how well the bridge is performing. This feature tracks execution times, identifies bottlenecks, and provides insights into the bridge's internal operations.

What can you monitor?

  • How long it takes the bridge to process requests from Grafana dashboards or export data to Prometheus
  • Which exporter jobs or queries are slow
  • Performance trends over time
  • Response times from IBM Storage Scale Performance Monitoring backend

Why monitor bridge performance?

  • Identify slow dashboards - Find out which dashboards take too long to load and why
  • Compare before/after upgrades - Verify that software upgrades don't degrade performance
  • Troubleshoot issues - Diagnose performance problems with detailed metrics
  • Optimize resources - Understand resource usage and plan capacity
  • Compare clusters - Analyze performance differences across multiple IBM Storage Scale clusters

Quick Start

Step 1: Enable HTTP Metrics

  1. Open the analytics.py file, which controls configuration for special analytics settings:

    vi source/analytics.py
  2. Change the setting http_metrics_enabled to True:

    http_metrics_enabled = True
  3. Restart the bridge:

    sudo systemctl restart grafana-bridge

That's it! The bridge is now collecting performance metrics.

Step 2: View Your Metrics

You can access metrics through simple web URLs. The /http_metrics endpoint is available on both ports: OpenTSDB (4242) and Prometheus (9250). View metrics in your browser:

# For OpenTSDB api:
http://bridge-host-ip:4242/http_metrics
# For PrometheusExporter api:
http://bridge-host-ip:9250/http_metrics

Or use curl with authentication:

curl -u scale_admin:password http://bridge-host-ip:4242/http_metrics
curl -u scale_admin:password http://bridge-host-ip:9250/http_metrics

Understanding the Settings

Main Setting: http_metrics_enabled

Located in source/analytics.py

  • True (Recommended): Metrics are collected and available via REST API
  • False: Metrics are only written to log files

Optional Setting: inspect_special

You can enable this setting for more detailed troubleshooting:

inspect_special = 3  # Values: 0-5

What it does: Adds detailed debug information to log files

  • 0 = Off (default)
  • 1-2 = Basic debugging
  • 3 = Detailed debugging (recommended for troubleshooting)
  • 4-5 = Very detailed (generates lots of logs)

When to use: When you need to understand exactly what the bridge is doing internally

⚠️ Warning: Levels 4-5 create large log files. Use only when actively troubleshooting, then set back to 0.

For more details: See the Deep Troubleshooting section in HTTP_METRICS_COLLECTION.md

Available Endpoints

1. /http_metrics - Main Metrics Endpoint

This is your primary endpoint for accessing performance data. Available on both OpenTSDB (port 4242) and Prometheus (port 9250) plugins.

Supported Formats:

Prometheus Format (Default)

Best for: Integration with Prometheus monitoring system

curl -u scale_admin:password http://bridge-host-ip:4242/http_metrics
# Or for Prometheus plugin:
curl -u scale_admin:password http://bridge-host-ip:9250/http_metrics

JSON Format

Best for: Quick viewing, scripting, or custom analysis

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=json&limit=100"

CSV Format

Best for: Excel analysis, reports, or data science

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > metrics.csv

OpenTSDB Format

Best for: If you're using OpenTSDB datasource in Grafana

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=opentsdb"

Parameters:

  • format: Choose output format (prometheus, json, csv, opentsdb)
  • limit: Number of recent metrics to return (for JSON and CSV only)
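
For scripted access, the same request can be assembled in Python. This is a minimal sketch using only the standard library. The helper name build_metrics_request, the host bridge-host-ip, and the credentials are placeholders, and it assumes the bridge accepts HTTP basic authentication in the same way the curl -u examples do. Rejecting limit for formats other than JSON and CSV is a choice made in this sketch, not documented bridge behavior.

```python
import base64
import urllib.parse
import urllib.request

# Formats listed in this guide; `limit` is documented for JSON and CSV only.
FORMATS = ("prometheus", "json", "csv", "opentsdb")
FORMATS_WITH_LIMIT = ("json", "csv")


def build_metrics_request(host, port=4242, fmt="prometheus", limit=None,
                          user="scale_admin", password="password"):
    """Return a urllib Request for /http_metrics (hypothetical helper)."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params = {"format": fmt}
    if limit is not None:
        if fmt not in FORMATS_WITH_LIMIT:
            raise ValueError("limit applies to json and csv only")
        params["limit"] = str(int(limit))
    query = urllib.parse.urlencode(params)
    url = f"http://{host}:{port}/http_metrics?{query}"
    # Basic authentication header, equivalent to `curl -u user:password`.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


if __name__ == "__main__":
    req = build_metrics_request("bridge-host-ip", fmt="json", limit=100)
    print(req.full_url)
    # To actually fetch the data (requires a reachable bridge):
    # with urllib.request.urlopen(req, timeout=10) as resp:
    #     print(resp.read().decode())
```

Pass port=9250 to target the Prometheus plugin instead of the OpenTSDB plugin.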

2. /internal_stats - Statistics Summary

Returns summary statistics about the metrics collection system itself, including:

  • Total number of requests processed
  • Number of metrics currently in memory cache
  • Memory size used by metrics cache
  • Number of bundle IDs registered (for OpenTSDB and Prometheus)
  • Memory size used by bundle registries

Example Request:

curl -u scale_admin:password http://bridge-host-ip:4242/internal_stats

Example Response:

{
  "total_requests": 1523,
  "metrics_in_memory_count": 847,
  "metrics_in_memory_size": 245632,
  "opentsdb_bundle_ids_count": 45,
  "prometheus_bundle_ids_count": 23,
  "bundle_ids_count_total": 68,
  "opentsdb_bundle_registry_memory_size": 12480,
  "prometheus_bundle_registry_memory_size": 8960,
  "bundle_registry_memory_size_total": 21440
}

Use this endpoint when you want to:

  • Check how many metrics are currently cached
  • Monitor memory usage of the metrics system
  • Verify the metrics collection is working
  • Get a quick overview without retrieving all metrics data
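
As a quick sanity check, the response can be parsed and summarized in a few lines of Python. The sketch below uses the example response shown above. The helper name summarize_internal_stats is hypothetical, and treating the size fields as bytes is an assumption, since the unit is not stated here.

```python
import json

# Example response copied from this guide.
EXAMPLE_RESPONSE = """
{
  "total_requests": 1523,
  "metrics_in_memory_count": 847,
  "metrics_in_memory_size": 245632,
  "opentsdb_bundle_ids_count": 45,
  "prometheus_bundle_ids_count": 23,
  "bundle_ids_count_total": 68,
  "opentsdb_bundle_registry_memory_size": 12480,
  "prometheus_bundle_registry_memory_size": 8960,
  "bundle_registry_memory_size_total": 21440
}
"""


def summarize_internal_stats(stats):
    """Derive a few health indicators from an /internal_stats response."""
    count = stats["metrics_in_memory_count"]
    size = stats["metrics_in_memory_size"]  # assumed to be bytes
    return {
        "avg_size_per_metric": size / count if count else 0.0,
        # The totals should equal the sum of the per-plugin values.
        "bundle_count_consistent": (
            stats["opentsdb_bundle_ids_count"]
            + stats["prometheus_bundle_ids_count"]
            == stats["bundle_ids_count_total"]
        ),
        "registry_size_consistent": (
            stats["opentsdb_bundle_registry_memory_size"]
            + stats["prometheus_bundle_registry_memory_size"]
            == stats["bundle_registry_memory_size_total"]
        ),
    }


if __name__ == "__main__":
    summary = summarize_internal_stats(json.loads(EXAMPLE_RESPONSE))
    for key, value in summary.items():
        print(f"{key}: {value}")
```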

3. /bundle_ids - Query Tracking

Lists all tracked query groups (useful for debugging specific dashboard issues).

# For OpenTSDB plugin
curl -u scale_admin:password http://bridge-host-ip:4242/api/bundle_ids

# For Prometheus plugin
curl -u scale_admin:password http://bridge-host-ip:9250/bundle_ids

Common Use Cases

Use Case 1: Compare Performance Before and After Upgrade

Scenario: You upgraded the bridge and want to verify performance didn't degrade.

Steps:

  1. Before upgrade: Export baseline metrics

    curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > before_upgrade.csv
  2. Perform upgrade

  3. After upgrade: Export new metrics

    curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > after_upgrade.csv
  4. Compare in Excel:

    • Open both CSV files
    • Create pivot tables to compare average execution times
    • Look for significant differences in response times
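
If you prefer a script over pivot tables, the comparison can be sketched in Python with the standard library. The column names collector_name and value are assumptions taken from the pandas example later in this guide, so check them against the header of your own export. The helper names and the inline sample rows are illustrative only.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_collector(csv_file):
    """Average the `value` column per `collector_name` (assumed column names)."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["collector_name"]].append(float(row["value"]))
    return {name: mean(values) for name, values in groups.items()}


def compare(before, after):
    """Return {collector: (before_avg, after_avg, change_percent)}.

    Only collectors present in both exports are compared.
    """
    result = {}
    for name in sorted(before.keys() & after.keys()):
        b, a = before[name], after[name]
        change = (a - b) / b * 100 if b else float("inf")
        result[name] = (b, a, change)
    return result


if __name__ == "__main__":
    # Tiny inline samples stand in for before_upgrade.csv and after_upgrade.csv.
    # With real files, use: open("before_upgrade.csv", newline="")
    before_csv = io.StringIO("collector_name,value\nGPFSFilesystem,0.20\nGPFSFilesystem,0.40\n")
    after_csv = io.StringIO("collector_name,value\nGPFSFilesystem,0.45\n")
    report = compare(mean_by_collector(before_csv), mean_by_collector(after_csv))
    for name, (b, a, pct) in report.items():
        print(f"{name}: {b:.3f}s -> {a:.3f}s ({pct:+.1f}%)")
```

A positive percentage means the collector became slower after the upgrade.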

Example Dashboard: Use the provided dashboard "Prometheus scrape jobs execution comparison over time" to visualize the comparison.

Use Case 2: Identify Slow Dashboards

Scenario: Users complain that certain dashboards load slowly.

Steps:

  1. Enable metrics collection (if not already enabled)

  2. Reproduce the slow dashboard loading

  3. Check the statistics:

    curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv"
  4. Look for high execution times in the output

  5. Identify which collectors or queries are slow

What to look for:

  • OpenTsdbApi_query - Time to process all queries in a single HTTP request received from Grafana for OpenTSDB Datasource (may contain multiple sub-queries)
  • perfmon_response_duration - Time waiting for IBM Storage Scale backend
  • High values (> 1 second) indicate bottlenecks

Use Case 3: Compare Performance of PrometheusExporter Scrape Jobs Across Multiple IBM Storage Scale Clusters

Scenario: You have multiple IBM Storage Scale clusters, each with its own bridge configured as a PrometheusExporter, and you want to compare their performance.

Prerequisites:

  • IBM Storage Scale bridge for Grafana must be configured as PrometheusExporter on each cluster (not OpenTSDB)
  • The /http_metrics endpoint must be configured as a separate Prometheus scrape job for each bridge
  • The provided example dashboards are designed specifically for Prometheus datasources

Note: This use case and the example dashboards are not applicable if you're using the bridge only as an OpenTSDB datasource.

Steps:

  1. Enable Prometheus exporter on each bridge:

    • Edit config.ini on each bridge
    • Ensure Prometheus port 9250 is enabled:
      [prometheus]
      port = 9250
    • Restart each bridge
  2. Configure Prometheus to scrape the /http_metrics endpoint from multiple bridges as separate scrape jobs:

    # prometheus.yml
    scrape_configs:
      - job_name: 'cluster-production'
        static_configs:
          - targets: ['bridge-prod:9250']
        metrics_path: '/http_metrics'
        params:
          format: ['prometheus']
      
      - job_name: 'cluster-development'
        static_configs:
          - targets: ['bridge-dev:9250']
        metrics_path: '/http_metrics'
        params:
          format: ['prometheus']
  3. Import the comparison dashboard into Grafana (see "How to Import Dashboards" in this guide)

  4. Use the dashboard to compare performance:

    • Use the job_name variable to filter by cluster (e.g., "cluster-production", "cluster-development")
    • Compare execution times per scrape job across different clusters
    • Identify which cluster has slower performance
    • Use the time-shift feature to compare current vs. historical performance

What you can analyze:

  • Which cluster's bridge is performing better
  • Execution time differences between clusters
  • Performance trends over time for each cluster
  • Impact of configuration differences between clusters

Example Dashboard: Prometheus scrape jobs execution comparison over time
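
With more than a handful of clusters, writing the scrape jobs by hand gets repetitive. The following sketch renders the same job layout used in the prometheus.yml example above from a simple mapping. The function names and the bridge host names are hypothetical, and the output is plain text that you would still review and merge into your own prometheus.yml.

```python
def scrape_job(job_name, target, path="/http_metrics", fmt="prometheus"):
    """Render one scrape job as YAML text, mirroring the layout in this guide."""
    return "\n".join([
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        f"      - targets: ['{target}']",
        f"    metrics_path: '{path}'",
        "    params:",
        f"      format: ['{fmt}']",
    ])


def scrape_configs(bridges):
    """bridges: mapping of job name -> 'host:port' of a bridge."""
    jobs = [scrape_job(name, target) for name, target in bridges.items()]
    return "scrape_configs:\n" + "\n\n".join(jobs) + "\n"


if __name__ == "__main__":
    # Hypothetical bridge hosts, one per cluster.
    print(scrape_configs({
        "cluster-production": "bridge-prod:9250",
        "cluster-development": "bridge-dev:9250",
    }))
```

If your bridges require authentication, add a basic_auth section to each job as shown in the single-bridge example in Use Case 4.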

Use Case 4: Monitor Performance of PrometheusExporter Scrape Jobs Over Time

Scenario: You want continuous monitoring of bridge performance.

Prerequisites:

  • IBM Storage Scale bridge for Grafana must be configured as PrometheusExporter
  • The provided example dashboards are designed specifically for Prometheus datasources

Steps:

  1. Enable Prometheus exporter on the bridge:

    • Edit config.ini
    • Ensure Prometheus port 9250 is enabled:
      [prometheus]
      port = 9250
    • Restart the bridge
  2. Enable HTTP metrics collection:

    • Set http_metrics_enabled = True in source/analytics.py
    • Restart the bridge
  3. Configure Prometheus to scrape the /http_metrics endpoint:

    # prometheus.yml
    scrape_configs:
      - job_name: 'grafana-bridge-internal'
        static_configs:
          - targets: ['bridge-host-ip:9250']  # Prometheus exporter port
        metrics_path: '/http_metrics'
        params:
          format: ['prometheus']
        basic_auth:
          username: 'scale_admin'
          password: 'your_password'
        scrape_interval: 30s
  4. Import the provided Grafana dashboards (see "How to Import Dashboards" in this guide)

  5. Set up alerts in Grafana for high response times (optional)

Use Case 5: Export Data for Analysis

Scenario: You need to create a performance report or analyze trends in Excel/Python.

Steps:

  1. Export metrics to CSV:

    curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" > bridge_metrics.csv
  2. For Excel:

    • Open the CSV file in Excel
    • Create pivot tables to analyze:
      • Average execution time by collector
      • Peak execution times
      • Trends over time
  3. For Python/Pandas:

    import pandas as pd
    
    # Load the data
    df = pd.read_csv('bridge_metrics.csv')
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    
    # Calculate average by collector
    avg_by_collector = df.groupby('collector_name')['value'].mean()
    print("Average execution time by collector:")
    print(avg_by_collector)
    
    # Find slowest operations
    slowest = df.nlargest(10, 'value')
    print("\nTop 10 slowest operations:")
    print(slowest[['timestamp', 'metric_name', 'collector_name', 'value']])

Example Grafana Dashboards

The bridge includes two pre-built dashboards for visualizing HTTP metrics:

1. Multi-Job Comparison Dashboard

File: Prometheus scrape jobs execution comparison over time-1779302096160.json

What it shows:

  • Compare execution times across multiple Prometheus scrape jobs
  • Compare current performance vs. historical (time-shift feature)
  • Aggregated view by collector and job

Best for:

  • Comparing multiple IBM Storage Scale clusters
  • Comparing different time periods (today vs. yesterday)
  • Identifying performance trends across your infrastructure

Key Features:

  • Time-shift comparison: Compare current metrics with 1 hour ago, 1 day ago, 1 week ago, etc.
  • Multi-job view: See all your clusters/datasources in one dashboard
  • Table format: Easy to spot differences at a glance

2. Single Job Deep-Dive Dashboard

File: Single Prometheus scrape job execution comparison over time-1779302278182.json

What it shows:

  • Detailed analysis of a single Prometheus scrape job
  • Perfmon query response times
  • Time series graphs for trend analysis
  • Historical comparison

Best for:

  • Deep-diving into specific collector performance
  • Troubleshooting slow queries
  • Understanding performance patterns over time

Key Features:

  • Detailed metrics: See individual collector execution times
  • Time series graphs: Visualize trends and spikes
  • Historical comparison: Compare current vs. baseline performance

How to Import Dashboards

  1. Open Grafana web interface
  2. Navigate to Dashboards → Import
  3. Click Upload JSON file and select the dashboard file
  4. Select your Prometheus datasource
  5. Click Import

Dashboard Variables

Both dashboards support these variables for filtering:

  • Datasource: Select your Prometheus datasource
  • collector_name: Filter by specific collector (supports regex, e.g., GPFS.* for all GPFS collectors)
  • job_name: Filter by Prometheus job name (supports regex)
  • time_shift: Time offset for comparison (e.g., 1d, 1w, 1h)

Time-shift examples:

  • 1h - Compare with 1 hour ago
  • 1d - Compare with yesterday
  • 1w - Compare with last week
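
To make the meaning of these offsets concrete, here is a small Python sketch that converts a time-shift string into the earlier time window it refers to. Grafana does this for you in the dashboards. The function names are hypothetical, and only the h, d, and w units from the examples above are handled.

```python
import re
from datetime import datetime, timedelta

# Units used in the time-shift examples of this guide.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_time_shift(shift):
    """Convert a string such as '1h', '1d' or '1w' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hdw])", shift.strip())
    if not match:
        raise ValueError(f"unsupported time shift: {shift!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def shifted_window(start, end, shift):
    """Return the comparison window that lies `shift` before (start, end)."""
    delta = parse_time_shift(shift)
    return start - delta, end - delta


if __name__ == "__main__":
    start = datetime(2024, 1, 8, 12, 0)
    end = datetime(2024, 1, 8, 13, 0)
    for shift in ("1h", "1d", "1w"):
        print(shift, shifted_window(start, end, shift))
```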

Understanding the Metrics

Key Metrics Explained

| Metric Name | What It Measures | Good Value | Investigate If |
|---|---|---|---|
| OpenTsdbApi_query | Time to process all queries in a single HTTP request (may contain multiple sub-queries) | < 1s | > 3s |
| PrometheusExporter_metrics | Time to process Prometheus scrape request (typically one sensor per scrape job) | < 0.5s | > 2s |
| perfmon_response_duration | Time waiting for IBM Storage Scale backend | < 0.5s | > 1s |
| perfmon_response_amount | Size of data from backend (bytes) | Varies | Unusually large |
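
The thresholds in the table above can be turned into a small check for scripts or reports. This is only a sketch: the function name assess and the labels "good", "watch", and "investigate" are made up for illustration, with "watch" covering the range between the two documented thresholds. perfmon_response_amount is left out because the table gives no numeric limit for it.

```python
# (good_below, investigate_above) in seconds, copied from the table above.
THRESHOLDS = {
    "OpenTsdbApi_query": (1.0, 3.0),
    "PrometheusExporter_metrics": (0.5, 2.0),
    "perfmon_response_duration": (0.5, 1.0),
}


def assess(metric_name, seconds):
    """Classify a duration against the documented thresholds."""
    if metric_name not in THRESHOLDS:
        return "unknown"
    good_below, investigate_above = THRESHOLDS[metric_name]
    if seconds < good_below:
        return "good"
    if seconds > investigate_above:
        return "investigate"
    return "watch"  # between the two thresholds


if __name__ == "__main__":
    samples = [
        ("OpenTsdbApi_query", 0.4),
        ("PrometheusExporter_metrics", 1.2),
        ("perfmon_response_duration", 1.5),
    ]
    for name, value in samples:
        print(f"{name} = {value}s: {assess(name, value)}")
```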

What's Normal?

  • Fast queries: 0.1 - 0.5 seconds
  • Moderate queries: 0.5 - 2 seconds
  • Slow queries: > 2 seconds (investigate)
  • Very slow queries: > 5 seconds (needs attention)

Troubleshooting

Problem: Dashboards are slow to load

Solution:

  1. Enable detailed metrics:

    # In source/analytics.py
    http_metrics_enabled = True
    requests_elapsed_time = True
  2. Restart the bridge

  3. Check where the time is spent:

    curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv"
  4. Look for:

    • High perfmon_response_duration → Backend is slow
    • High PrometheusExporter_metrics or OpenTsdbApi_query → Request processing is slow

Problem: Can't access /http_metrics endpoint

Possible causes:

  1. Metrics not enabled:

    • Check http_metrics_enabled = True in source/analytics.py
    • Restart the bridge
  2. Authentication required:

    • Use -u username:password with curl
    • Check your bridge authentication settings
  3. Wrong port:

    • OpenTSDB plugin: port 4242
    • Prometheus plugin: port 9250
    • Both ports support /http_metrics endpoint

Problem: Too much data in CSV export

Solution:

Use the limit parameter to reduce the dataset:

# Get only last 100 metrics
curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv&limit=100" > metrics.csv

Performance Impact

Memory Usage

  • Metrics are stored in memory (default: last 1000 requests)
  • Typical memory usage: 1-5 MB
  • Automatic cleanup prevents memory leaks
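
To illustrate why a capped buffer keeps memory usage bounded, here is a minimal Python sketch of a store limited by both entry count and age. It is not the bridge's actual implementation. The class name MetricsBuffer is hypothetical, and the defaults of 1000 entries and 3600 seconds simply mirror the limits mentioned in this guide and its FAQ.

```python
import time
from collections import deque


class MetricsBuffer:
    """Illustrative buffer: at most max_items entries, none older than max_age seconds."""

    def __init__(self, max_items=1000, max_age=3600.0):
        self.max_age = max_age
        # A deque with maxlen drops the oldest entry automatically when full.
        self._items = deque(maxlen=max_items)

    def add(self, value, now=None):
        now = time.time() if now is None else now
        self._items.append((now, value))
        self._expire(now)

    def _expire(self, now):
        # Entries are stored in arrival order, so expired ones are at the left.
        while self._items and now - self._items[0][0] > self.max_age:
            self._items.popleft()

    def values(self, now=None):
        now = time.time() if now is None else now
        self._expire(now)
        return [value for _, value in self._items]


if __name__ == "__main__":
    buf = MetricsBuffer(max_items=3, max_age=10)
    for t, v in enumerate([1, 2, 3, 4]):
        buf.add(v, now=t)
    print(buf.values(now=3))
```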

CPU Overhead

  • Minimal: ~1-5% additional CPU usage
  • With inspect_special=3: ~10-15% additional CPU usage
  • With inspect_special=5: ~20-30% additional CPU usage

Recommendations

| Environment | Recommended Settings |
|---|---|
| Production | http_metrics_enabled = True; inspect_special = 0 |
| Development | http_metrics_enabled = True; inspect_special = 1-2 |
| Troubleshooting | http_metrics_enabled = True; inspect_special = 3; requests_elapsed_time = True |
| Deep Debug | All enabled, inspect_special = 4-5 (disable after troubleshooting!) |

Integration with Prometheus

Basic Configuration

Add this to your prometheus.yml:

scrape_configs:
  - job_name: 'grafana-bridge-internal'
    static_configs:
      - targets: ['bridge-host-ip:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'your_password'
    scrape_interval: 30s

Multiple Clusters Configuration

scrape_configs:
  - job_name: 'bridge-cluster1'
    static_configs:
      - targets: ['bridge1.example.com:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'password1'
  
  - job_name: 'bridge-cluster2'
    static_configs:
      - targets: ['bridge2.example.com:4242']
    metrics_path: '/http_metrics'
    params:
      format: ['prometheus']
    basic_auth:
      username: 'scale_admin'
      password: 'password2'

Useful Prometheus Queries

# Average response time
avg(PrometheusExporter_metrics)

# Response time by collector
avg by (collector_name)(PrometheusExporter_metrics)

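# 95th percentile response time over the last hour
# (histogram_quantile() needs _bucket series with an le label; for plain samples use quantile_over_time())
quantile_over_time(0.95, PrometheusExporter_metrics[1h])

# Per-second rate of increase
# (only meaningful if the metric is exposed as a counter; for a gauge, delta() or deriv() is the better fit)
rate(PrometheusExporter_metrics[5m])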

# Compare current vs. 1 day ago
avg(PrometheusExporter_metrics) - avg(PrometheusExporter_metrics offset 1d)
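
These queries can also be run from a script through the Prometheus HTTP API, which serves instant queries at /api/v1/query. The sketch below builds the request URL and extracts the values from a response. The host name prometheus-host, port 9090, and the helper names are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(prometheus_base, promql):
    """Build the URL for a Prometheus instant query."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{query}"


def extract_values(response):
    """Return [(labels, value)] from a decoded instant-vector response."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1]))
            for item in response["data"]["result"]]


if __name__ == "__main__":
    url = instant_query_url(
        "http://prometheus-host:9090",
        "avg by (collector_name)(PrometheusExporter_metrics)",
    )
    print(url)
    # To run the query (requires a reachable Prometheus server):
    # with urllib.request.urlopen(url, timeout=10) as resp:
    #     for labels, value in extract_values(json.load(resp)):
    #         print(labels, value)
```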

FAQ

Q: Will enabling metrics slow down my bridge?

A: Only slightly. The overhead is small (about 1-5% additional CPU usage), because the bridge is designed to collect metrics efficiently.

Q: How long are metrics stored?

A: By default, metrics are kept for 1 hour (3600 seconds) or up to 1000 requests, whichever comes first.

Q: Can I export metrics to a file automatically?

A: Yes, you can set up a cron job:

# Add to crontab (runs every hour)
0 * * * * curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=csv" >> /var/log/bridge_metrics_$(date +\%Y\%m\%d).csv
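
One thing to watch with the append approach: if every CSV export starts with a header row (an assumption here, based on the named columns used in the pandas example), the daily file ends up with repeated header lines. A small Python sketch that appends an export while dropping the repeated header might look like this. The function names are hypothetical, and fetching the CSV text from the bridge is left out.

```python
import os
import tempfile
from datetime import date


def daily_path(directory, day=None):
    """Return the per-day file name, matching the cron example's pattern."""
    day = day or date.today()
    return os.path.join(directory, f"bridge_metrics_{day:%Y%m%d}.csv")


def append_csv(path, csv_text):
    """Append csv_text to path; skip its header line if the file has content.

    Returns the number of lines written.
    """
    lines = csv_text.splitlines()
    if not lines:
        return 0
    has_content = os.path.exists(path) and os.path.getsize(path) > 0
    rows = lines[1:] if has_content else lines
    with open(path, "a", encoding="utf-8") as handle:
        for line in rows:
            handle.write(line + "\n")
    return len(rows)


if __name__ == "__main__":
    # Demo in a temporary directory with made-up rows.
    with tempfile.TemporaryDirectory() as directory:
        path = daily_path(directory)
        append_csv(path, "timestamp,value\n1,0.2\n")
        append_csv(path, "timestamp,value\n2,0.3\n")
        with open(path, encoding="utf-8") as handle:
            print(handle.read())
```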

Q: What's the difference between /http_metrics and /internal_stats?

A: /http_metrics returns the actual performance metrics data (execution times, response durations, etc.) in various formats, while /internal_stats returns summary statistics about the metrics collection system itself (number of requests, cache size, memory usage, bundle IDs count). Use /http_metrics for performance analysis and /internal_stats for monitoring the health of the metrics collection system.

Q: Can I use this with OpenTSDB datasource in Grafana?

A: Yes! Use format=opentsdb:

curl -u scale_admin:password "http://bridge-host-ip:4242/http_metrics?format=opentsdb"

Q: How do I disable metrics collection?

A: Set http_metrics_enabled = False in source/analytics.py and restart the bridge.

Q: Which port should I use - 4242 or 9250?

A: Both ports support /http_metrics. Use port 4242 for OpenTSDB plugin and port 9250 for Prometheus plugin. Choose based on which plugin you're using.
