How to set up a Grafana instance to monitor multiple IBM Storage Scale clusters running in a cloud or mixed environment

It's possible to manage the connection configuration for multiple IBM Storage Scale clusters (data sources) in Grafana by adding a YAML config file to the provisioning/datasources directory. Continue reading Provision an IBM Storage Scale data source to learn how to manage the connection configuration of multiple IBM Storage Scale clusters running on bare metal in Grafana.
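For example, a minimal provisioning file registering two clusters might look like the following sketch. The file name, data source names, and URLs are placeholders, and TLS/authentication settings are omitted here; see the linked page for the full connection settings.

# provisioning/datasources/storage-scale-clusters.yaml (example file name)
apiVersion: 1
datasources:
  - name: scale-cluster-a
    type: opentsdb
    access: proxy
    url: https://< grafana-bridge host of cluster A >:< port >
    jsonData:
      tsdbVersion: '2.3'
  - name: scale-cluster-b
    type: opentsdb
    access: proxy
    url: https://< grafana-bridge host of cluster B >:< port >
    jsonData:
      tsdbVersion: '2.3'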

In a cloud environment, data source provisioning can be performed by deploying a GrafanaDatasource CR maintained by the Red Hat Grafana Operator. Starting with version 5, the Red Hat Grafana Operator supports the management of cross-namespace data source instances. This feature makes it possible to monitor multiple systems running in a cloud environment with a single Grafana instance. An example use case for this feature is a CNSA AFM regional DR setup.

/images/Openshift/multiple_ocps_management.png

The same configuration options can be used for a mixed environment.

/images/Openshift/mixed_environment_management.png

The placement and number of managed Grafana instances in each particular environment depend on the business strategy:

  1. Centralized or distributed monitoring
  2. Grafana instance running in a container environment or running externally

Connecting a grafana-bridge running in a remote OpenShift cluster to a Grafana instance running outside this cluster

Let's assume that the OpenShift cluster where the Grafana instance will be deployed and running is called "local-ocp". The other OpenShift cluster, where the remote grafana-bridge instance is already deployed and running, is called "remote-ocp".

We start with the grafana-bridge running in "remote-ocp". A Route to the grafana-bridge service must be deployed to enable external access to the grafana-bridge.

kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: grafanabridge
  namespace: ibm-spectrum-scale
  labels:
    app.kubernetes.io/instance: ibm-spectrum-scale
    app.kubernetes.io/name: grafanabridge
  annotations:
    openshift.io/balance: source
spec:
  to:
    kind: Service
    name: ibm-spectrum-scale-grafana-bridge
    weight: 100
  port:
    targetPort: https
  tls:
    termination: passthrough

In "local-ocp", deploy a Grafana kind resource if not already done.
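A minimal Grafana kind resource might look like the following sketch. The name and the admin credentials are placeholder assumptions; the dashboards: my-dashboards label is what the instanceSelector of the GrafanaDatasource shown below matches on.

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: cnsa-grafana
  labels:
    dashboards: my-dashboards
spec:
  config:
    security:
      # placeholder credentials, replace with your own
      admin_user: admin
      admin_password: change_me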

The Grafana instance requires TLS connection data to communicate with a grafana-bridge. With Grafana Operator v5 it is possible to apply the TLS connection data to the data source connection dynamically from a TLS secret.

In "local-ocp", create a Secret of type kubernetes.io/tls for storing the TLS certificate and key of the grafana-bridge running in the "remote-ocp" cluster. Create it in the same namespace as the GrafanaDatasource that will reference it (grafana-for-cnsa in the examples below).

apiVersion: v1
data:
  tls.crt: ''
  tls.key: ''
kind: Secret
metadata:
  name: grafana-bridge-tls-cert-remote
type: kubernetes.io/tls

From "remote-ocp" copy and temporarily store the grafana-bridge SSL connection data available in the ibm-spectrum-scale-grafana-bridge-service-cert secret.

TLS_CERT=`oc get secret ibm-spectrum-scale-grafana-bridge-service-cert -n ibm-spectrum-scale -o json |jq '.data["tls.crt"]' | tr -d \"`

TLS_KEY=`oc get secret ibm-spectrum-scale-grafana-bridge-service-cert -n ibm-spectrum-scale -o json |jq '.data["tls.key"]' | tr -d \"`

Update the grafana-bridge-tls-cert-remote secret with the content of the TLS_CERT and TLS_KEY variables. The values extracted above are already base64 encoded, so they can be written into the secret's data fields directly.

# oc get secrets grafana-bridge-tls-cert-remote -n $NAMESPACE -o json  | jq ".data[\"tls.key\"] |= \"$TLS_KEY\"" | jq ".data[\"tls.crt\"] |= \"$TLS_CERT\""| oc apply -f -
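As a quick sanity check, the copied certificate can be decoded from the updated secret (assuming the openssl client is available on the workstation):

# oc get secret grafana-bridge-tls-cert-remote -n $NAMESPACE -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -enddate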

In "local-ocp", create a GrafanaDatasource kind object that references the grafana-bridge-tls-cert-remote secret and the grafana-bridge route URL.

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: bridge-grafanadatasource-remote
spec:
  valuesFrom:
  - targetPath: "secureJsonData.tlsClientCert"
    valueFrom:
      secretKeyRef:
        key: "tls.crt"
        name: "grafana-bridge-tls-cert-remote"
  - targetPath: "secureJsonData.tlsClientKey"
    valueFrom:
      secretKeyRef:
        key: "tls.key"
        name: "grafana-bridge-tls-cert-remote"
  datasource:
    access: proxy
    editable: true
    isDefault: true
    jsonData:
      httpHeaderName1: Authorization
      timeInterval: 5s
      tlsAuth: true
      tlsSkipVerify: true
      tsdbVersion: '2.3'
    name: grafana-bridge-remote
    secureJsonData:
      tlsClientCert: ${tls.crt}
      tlsClientKey: ${tls.key}
    type: opentsdb
    url: < grafana-bridge service route url >
  instanceSelector:
    matchLabels:
      dashboards: my-dashboards
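The "< grafana-bridge service route url >" placeholder refers to the route created on "remote-ocp" earlier. Its host name can be looked up on "remote-ocp" and, since the route uses TLS passthrough, prefixed with https://, for example:

# oc get route grafanabridge -n ibm-spectrum-scale -o jsonpath='{.spec.host}'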

Verify the status of the freshly deployed GrafanaDatasource instance.

# oc get GrafanaDataSource bridge-grafanadatasource-remote -n grafana-for-cnsa -o json | jq '.status'
{
  "hash": "a209926ee6db83847d16e0ac08fcf71542c298064a5575ca0d25a519d7d2900d",
  "lastResync": "2023-11-20T19:18:53Z",
  "uid": "504a1eed-a6c0-4a7a-b4ec-fc650d4fd4a4"
}

Monitoring Performance Data from multiple CNSA clusters

To monitor the performance data from multiple CNSA clusters, deploy a Grafana instance and a GrafanaDatasource kind resource on the OpenShift cluster running CNSA ("local-ocp").

On the other OpenShift cluster running CNSA ("remote-ocp"), deploy the grafana-bridge route. Once that is done, switch back to "local-ocp" and create a TLS data secret and a GrafanaDatasource kind instance pointing to the grafana-bridge route from "remote-ocp". For more details, see the section above.

Verify both GrafanaDatasource instances are up and running.

# oc get GrafanaDatasource -n grafana-for-cnsa
NAME                              NO MATCHING INSTANCES   LAST RESYNC   AGE
bridge-grafanadatasource                                  62s           52d
bridge-grafanadatasource-remote                           62s           50d

Now, you should be able to see both data sources in the Grafana web explorer.

/images/Openshift/openshift_multiple_bridge_conn.png

Next, you can provision a GrafanaDashboard kind instance from the example dashboards or create your own.

Performance Monitoring of an AFM RegionalDR setup configured over 2 CNSA clusters (primary and secondary side)

RegionalDR provides asynchronous replication of PVs between two CNSA clusters:

  • A primary (cache) cluster that hosts the application
  • A secondary (home) cluster that is a passive standby for the application

Data is transferred between primary and secondary via NFSv4.1 protocol. The secondary site runs one or more NFS servers that export the independent filesets representing the replication targets. The gateway nodes on the primary site mount those exports and write the data to them.

The End-to-End AFM RegionalDR Performance Monitoring requires metrics observation on both sides: primary and secondary. The health of the AFM filesets is mapped to the AsyncReplication CR Healthy Condition on the primary side. The health of the NFS servers is mapped to the RegionalDR CR Healthy Condition on the secondary side.
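These conditions can also be checked from the command line; a sketch, assuming the resource and instance names of your AsyncReplication and RegionalDR CRs.

On the primary side:

# oc get asyncreplication < name > -o jsonpath='{.status.conditions[?(@.type=="Healthy")].status}'

On the secondary side:

# oc get regionaldr < name > -o jsonpath='{.status.conditions[?(@.type=="Healthy")].status}'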

Provision the "cnsa-afm-over-nfs-dashboard" GrafanaDashboard:

# oc apply -f https://raw.githubusercontent.com/IBM/ibm-spectrum-scale-bridge-for-grafana/master/examples/openshift_deployment_scripts/examples_for_grafana-operator_v5/provision_dashboard/cnsa-afm-over-nfs-dashboard-v5.yml -n grafana-for-cnsa
grafanadashboard.grafana.integreatly.org/cnsa-afm-nfs-dashboard created

# oc get GrafanaDashboard -n grafana-for-cnsa
NAME                                   NO MATCHING INSTANCES   LAST RESYNC   AGE
cnsa-afm-nfs-dashboard                                         23s           23s

Open the "cnsa-afm-over-nfs-dashboard" from dashboards view in Grafana web explorer.

/images/Openshift/afm_in_dashboards_view.png

You should be able to select your primary-side CNSA cluster as the cacheCluster source and your secondary-side CNSA cluster as the homeCluster source.

/images/Openshift/cnsa_afm_over_nfs_collapsed.png

In the CACHE CLUSTER section you can check the total number of bytes written to the remote system as a result of cache updates, the number of messages that are currently enqueued, or the memory in bytes used by the enqueued messages.

In parallel, you can check NFS throughput/IO Rate in the HOME CLUSTER section.

/images/Openshift/cnsa_afm_over_nfs.png