Troubleshooting Kubeturbo Startup and Connectivity
This document reviews how to troubleshoot common issues where the kubeturbo pod cannot connect to the Turbo Server, fails to validate or collect metrics, or will not run.
Prerequisites
Make sure you have reviewed the prerequisites on how to set up kubeturbo, either by yaml, Helm or operator.
Let's get started:
- For version 8.3.6 and higher, review the KubeTurbo Health Notification information below
- Have a copy of the yamls, custom resource or helm output (helm get manifest {release_name}) ready to review
- Be familiar with vi/vim and with kubectl/oc cli commands to get, describe, and edit resources and to collect a log from kubeturbo. Alternatively, have access to the kubernetes / OpenShift / Rancher etc. dashboard with permissions to deploy and edit resources.
- Be able to get and configure log settings for the Kubeturbo and Kubeturbo Operator pods.
Changing Cluster Roles
If you are trying to change the Role you initially deployed Kubeturbo with and see an error in the Operator, Events, or Logs similar to failed upgrade (cannot patch "turbo-all-binding-kubeturbo-release-turbo" with kind ClusterRoleBinding), follow the steps in the article here to resolve the issue.
KubeTurbo Health Notification: Product Version
KubeTurbo as a remote probe works best when the probe version matches the Turbo Server version. This ensures the user can properly leverage new features and use cases, as well as run the supported and tested combination of probe and server components.
A version mismatch can happen because of upgrades (where the probe is temporarily behind the server version), because Turbo Support has instructed the customer to run a different version, or simply because of an oversight. Health Check Notifications were created to inform the customer of version differences and to advise them what to do.
Version is defined as the Product Version, which applies to both the Turbo Server and all probes. Container images are provided with tag values that are by design equal to the product version.
The following table is a guide to the notifications you can find by going to Settings -> Target Configurations and focusing on Cloud Native target types. The Turbo user must have an administrator or site-administrator role to view this page.
Notification | Risk Color | Definition |
---|---|---|
Good (no notification) | Green | Kubeturbo product version matches the Turbo Server version. The probe has validated and is sending data. |
Custom version | Yellow | Kubeturbo is using a custom version; confirm this is correct. |
Ahead / Newer | Yellow | Kubeturbo product version is newer than the Turbo Server version; confirm this is correct. |
Behind / Older | Orange | Kubeturbo product version is older than / behind the Turbo Server version. Recommendation is to update the KubeTurbo image. |
No version detected | Orange | Kubeturbo product version is older than the Turbo Server version and does not have product version support implemented (the server does). Recommendation is to update the KubeTurbo image. |
Validation Failed | Red | The Kubeturbo credentials used by the service account do not have sufficient cluster level access. |
Discovery Failed: Communication | Red | Kubeturbo is no longer communicating with the Turbo Server. |
Discovery Failed: Duplicate Target | Red | More than one Kubeturbo is reporting in from the same k8s/OCP cluster -> remove one kubeturbo instance. Or more than one Kubeturbo is using the same targetName parameter -> reconfigure the targetName on one cluster. |
Troubleshoot based on the notification seen on the Target Page
FYI:
- KubeTurbo will start reporting product version as of 8.3.6, which is the property used to determine a mismatch. The Target details view for a single Kubernetes cluster shows kubeturbo image information, which may be customized and may not represent the actual product version. The product version property will be added to target details in an upcoming release. A quick command to check the running kubeturbo image is shown after this list.
- In Turbo 8.3.6, a version mismatch will show a Gray band next to the Kubernetes-Cluster target in the Target Status page, instead of the correct risk color of Yellow or Orange. This will be addressed in the next version.
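A quick way to check which kubeturbo image and tag are running is to read it from the deployment (a hedged example; substitute your own deployment name and namespace):
kubectl get deployment {kubeturbo_deployment_name} -n {namespace} -o jsonpath='{.spec.template.spec.containers[0].image}'
Compare the image tag to the Turbo Server version and update the image if it is behind.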
General: Modifying your deployment
The steps to update or modify an existing kubeturbo deployment differ depending on whether you deployed kubeturbo via the operator or via a helm chart. Modifications to an operator-based deployment should start with editing the custom resource (cr yaml). Helm chart changes should be done via helm upgrade. Examples are provided below. Contact support for any questions.
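For example, a minimal sketch of each approach is shown below; the custom resource name, release name, chart reference, and values file are placeholders and depend on how you installed kubeturbo:
# Operator-based install: edit the Kubeturbo custom resource and let the operator reconcile the change
# (the resource kind and name may differ depending on your operator version)
kubectl edit kubeturbo {kubeturbo_cr_name} -n {namespace}
# Helm-based install: change values and apply them with an upgrade
helm upgrade {release_name} {chart_reference} -n {namespace} -f {values_file}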
General: Collecting and Configuring kubeturbo logs
All logging-related information is detailed here.
Troubleshooting Kubeturbo Startup or Validation Issues
Websocket handshake errors
KubeTurbo communicates with the Turbonomic Server, and for some network connectivity issues you will see an error in the KubeTurbo log showing a wss handshake connectivity error. In addition to the descriptions below, check for the following (a sample handshake check follows this list):
- The firewall is not configured for wss.
- The Turbo Server URL is invalid. Check the URL reported in the log. Check the URL configured and remove any trailing "/" at the end of the URL. The format should be https://my.turboserver.io
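If you want to confirm that the endpoint accepts a websocket upgrade, a curl-based handshake check along these lines can help (a sketch only; the hostname is an example and the websocket path may differ in your Turbo Server version):
# Expect "101 Switching Protocols" in the response if the websocket upgrade is allowed end to end
curl -vk -H "Connection: Upgrade" -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" -H "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
  https://my.turboserver.io/vmturbo/remoteMediation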
Frequent reconnections, missing target
KubeTurbo needs to maintain a connection to the Turbo server and has a 60 second heartbeat. Check the load balancer timeout; if one is set, it needs to be greater than 60 seconds.
Kubeturbo crashed or did not start
If kubeturbo does not have valid information in the configMap, or if there are access issues communicating with the Turbo Server or the nodes in the cluster, kubeturbo will crash. Review these scenarios and tips below if you are manually modifying yamls:
- Double check yaml syntax. The most common issue is that the configMap json portion is not formatted correctly. If you see a message close to this:
E0418 14:59:04.000267 1 kubeturbo_builder.go:226] Failed to generate correct TAP config: Unmarshall error :invalid character '"' after object key:value pair
Check the error and then look at your configMap to see what is incorrectly formatted. Common examples include not properly closing (missing a }) or missing a "," at the end of the line after the Turbo version, or after the Turbo user id (a sample configMap json sketch is shown after this list). Edit the configMap resource and restart the kubeturbo pod.
- Image pull issue. Look at the events for the pod. You can get these by running
kubectl describe pod {pod_name} -n {namespace}
At the bottom of the output you will see events. Look to see if the kubeturbo image was pulled. If not, the node may not be connected to the internet, or there is a network policy preventing access. Consider staging the kubeturbo image in a private registry that the cluster has access to, and update the deployment image location accordingly. If it is a network policy, work with the DevOps team to edit the k8s policy. Can't figure it out? Open a ticket.
- Pod pending? You can see the state kubeturbo is in by using the describe command above (or use kubectl get pods -n {namespace}). A pending state can mean there are not enough resources in the cluster, there is a quota in the namespace kubeturbo is deployed in that would be exceeded, or someone has modified the deployment to target a specific node using node labels and that node is unable to run the pod. We suggest you deploy kubeturbo without these constraints. Can't figure it out? Open a ticket.
Turbo Server and/or kubelet connection issues. (Can cause crash too)
Kubeturbo requires connectivity to both the Turbo Server, passing the correct credentials, and the kubelet on every node in the cluster. Review the following scenarios and ways to fix:
- Cannot communicate with the Turbo Server: configMap values or format errors.
- Check the configMap to make sure the Turbo version is specified and the URL is correct.
- The Turbo Server user id in the configMap must have an administrator role in the Turbo Server.
- Make sure there are no empty values, and that the json portion is formatted correctly (watch where your "," and close brackets are placed!).
- Cannot communicate with the Turbo Server: network issues. You will notice that the Turbo server does not have a k8s target registered. The kubeturbo documentation shows what is required for kubeturbo to be able to communicate with the Turbo Server: https/443. Check for:
- firewall issues that are blocking the port,
- pod network policies that do not allow access out of the cluster (rare),
- if there is a proxy server, it must allow websocket communication. See "How to Collect Data & Validate Connections" below for sample commands to test connectivity from a running kubeturbo pod.
- use a valid Turbo Server endpoint, which will be the URL used to access the Turbo UI. If you are running the Turbo Server on a k8s/OCP cluster, make sure you have properly set up the ingress.
- Incorrect Turbo Server version parameters. Kubeturbo starts up but is unable to connect to the Turbo Server. Check for:
- Correct Turbo Server version. You may see an error indicating that protocol version negotiation failed:
E0428 17:02:15.451052 1 sdk_client_protocol.go:110] Protocol version negotiation failed REJECTED) :Protocol version "7.21.4" is not allowed to communicate with server
E0428 17:02:15.451746 1 sdk_client_protocol.go:36] Failure during Protocol Negotiation, Registration message will not be sent
Check the Turbo Server version provided to the deployment via the configMap. Refer to your deployment method option for details on how to update that parameter. NOTE: for an x.y.z release, any change to "x" or "y" (but not "z") requires you to update the Turbo Server version. Also, as of 8.x you do not need to increment the minor version.
- Correct Turbo user id and credentials. You will see an error indicating that authentication with the Turbo Server failed. Correct the user id/ password in the configMap, or k8s secret. Refer to your deployment method option for details on how to update that parameter.
Kubeturbo validation errors: Communicating with K8S control plane
Kubeturbo has connected to the Turbo Server but shows a "Validation Error". The main cause: kubeturbo cannot communicate with all the nodes in the cluster due to network access or credential issues. The kubeturbo documentation describes that the pod needs to start with a service account that has a cluster-admin role binding in order to communicate with the kubelet on every node. This role can be view only, but must have access to all nodes in the cluster.
Kubeturbo also needs access to the kubelet: check whether an https client is required and which port is needed. Kubeturbo will throw errors in the pod log showing unauthorized access errors for a node IP (unable to perform the API call http/s:{nodeIP}/spec - see Scenario: Kubeturbo Communication / Connections Issues). Sample error:
I1021 22:37:06.948210 1 cluster_processor.go:120] There are 3 nodes.
E1021 22:37:06.954338 1 kubelet_client.go:179] failed to get machine[0.0.0.0] machine info: request failed - "401 Unauthorized", response: "Unauthorized".
Check the following:
- Is the kubeturbo pod's service account resource created in the right namespace? "kubectl get sa -n {namespace}"
- If not, create the resource, or delete the existing one and recreate it.
- Is the cluster role binding resource created "kubectl get clusterrolebinding" (look for the one you defined), and is the pod's service account assigned to it "kubectl describe clusterrolebinding {name}"?
- If not, delete the clusterrolebinding (kubectl delete clusterrolebinding {name}) and recreate the resource.
- Note the API versions and correlation to older k8s versions! See the sample yaml for more details.
- Check whether the issue is the kubelet properties defined in the kubeturbo deployment yaml under "args". Do you need http access to port 10255, or https access to port 10250? These values are controlled by the kubeturbo deployment arguments: adding the kubelet-https=true and kubelet-port=10250 args enables secure access to the kubelet, while kubelet-https=false and kubelet-port=10255 uses the read-only http port (an example args excerpt is shown after this list).
- If running k8s 1.10 or older, or AKS or EKS, start with the kubelet-https=false and kubelet-port=10255 args.
- If running any version of OpenShift 3.x or higher, you must specify - --kubelet-https=true and - --kubelet-port=10250
- If running kubernetes 1.11 or higher, start with specifying - --kubelet-https=true and - --kubelet-port=10250
- If you cannot connect one way, try changing the parameters.
- Check for a duplicate cluster id. Each k8s cluster will have 1 kubeturbo pod running, communicating with only 1 Turbo Server, and a Turbo Server can have only 1 kubeturbo pod from each cluster reporting in. When you are managing more than 1 cluster, each kubeturbo pod will register with a unique id, as specified in the configMap resource under "targetConfig". If your cluster target does not validate, look in the kubeturbo logs to see if the last line is:
tap_service.go:62] Error while adding Cloud Native Kubernetes-3569606604 target: Target &{Cloud Native [] [0xc0002a7ee0 0xc000536000 0xc0005360d0] [] Kubernetes-3569606604 } exists
Edit the configMap and make sure targetConfig -> targetName is unique. Delete the running kubeturbo pod.
You can edit a running deployment by using "kubectl edit deployment {deployment name} -n {namespace}", which should restart the pod. If not, just delete the running pod (kubectl delete pod {podName} -n {namespace}).
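An example excerpt of the kubelet-related args in a kubeturbo deployment spec is shown below; the image tag and other values are illustrative only and should match your own deployment:
# Excerpt of a kubeturbo Deployment container spec (example values)
containers:
- name: kubeturbo
  image: icr.io/cpopen/turbonomic/kubeturbo:8.9.0
  args:
  - --turboconfig=/etc/kubeturbo/turbo.config
  - --v=2
  # secure kubelet access (kubernetes 1.11+ and OpenShift)
  - --kubelet-https=true
  - --kubelet-port=10250
  # older clusters exposing the read-only kubelet port would instead use:
  # - --kubelet-https=false
  # - --kubelet-port=10255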
Confirming Kubeturbo Communication
Collecting data to troubleshoot Kubeturbo communication / connection issues. For these issues you will want to test connectivity from node to kubeturbo to Turbo Server.
- Kubeturbo <-> Turbo Server: Validate that the pod can communicate with the Turbonomic server. You will want to exec a curl command to reach the Turbo Server from within a running kubeturbo pod, using the "kubectl exec" command:
kubectl exec -it {podName} -n {namespace} -- sh
Then run a simple curl command:
curl -k https://{TurboIPaddress}/vmturbo/rest/admin/versions
Or run the API call to get the Turbo Server Version, passing the admin user and password to confirm access and credentials:
curl -k https://{TurboAdminId}:{TurboAdminPwd}@{TurboIPaddress}/vmturbo/rest/admin/systemstatus
- Kubeturbo <-> Node kubelet: Kubeturbo collects data by communicating with the kubelet on every node in the cluster. We can test kubeturbo-to-kubelet connectivity through the kubelet /spec or /stats API. This includes testing for nodes/workloads missing from the supply chain, or for a Node (VM in the supply chain) reported with an UNKNOWN state in Turbo: this state is set when kubeturbo receives a NotReady status from the k8s apiServer.
Validate whether kubeturbo can collect data from the node. You will need access to the kubeturbo pod to perform the next task:
- Open a shell into the kubeturbo pod:
kubectl exec -it {kubeturbo_podName} -n {kubeturbo_namespace} -- /bin/bash
- From this shell, run an API call to access the internal IP address of the node that is in Unknown state:
curl -k http(s)://{nodeInternalIPAddress}:{kubeletPort}/spec
and
curl -sSk -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://IP:10250/stats/summary
- If there is connectivity, you will get a response. If the Node is still not showing up in the UI, increase Kubeturbo logging to v=4 and open a ticket.
- If there is no connectivity, you will get an error, and the customer then needs to find out whether a firewall or some network issue is preventing access.
Troubleshooting Stitching Issues
k8s cluster not stitching to IaaS
Turbonomic's unified view of k8s + underlying infrastructure is based on stitching together data from kubernetes, the IaaS layer, and, if added, Application Performance Management targets (Prometurbo, Instana, etc). If your k8s cluster topology stops at the bottom with the Node (VM), and you think it should include either public cloud details or hosts/regions and related storage, review these steps:
- Is the IaaS layer targeted with valid credentials that have access to the infrastructure resources of the k8s cluster?
- Is the IaaS layer supported by Turbo?
- Does your IaaS target have any discovery errors?
- Is your k8s cluster configured with a cloud identity provider? This is based on how your k8s cluster was set up, and provides a unique identifier that Turbo will use to stitch to a UUID from the IaaS layer. This identifier is provided by default with hosted k8s services (AKS, EKS, GKE) and OpenShift. If it's not there, you can change the kubeturbo deployment to use a container arg of "- --stitch-uuid=false", which will switch to IP based stitching. See the yaml and helm set up for more details.
Troubleshooting Data Collection Issues
KubeTurbo will query 4 kubelet endpoints for each node to get the following data:
- /stats/summary for the traditional cpu/memory usage metrics
- /configz for node eviction thresholds
- /metrics/cadvisor for throttling metrics
- /spec for the machine info
Kubeturbo also runs a job to collect node CPU frequency information. If you have errors in the logs, validate the following:
- The turbo-user service account is a subject of a Cluster Role Binding that has the minimum cluster privileges (a sample cluster role sketch follows this list).
- There are no network or security policies preventing node to node communication. Kubeturbo will attempt to proxy through the API server. Contact Turbonomic Support if you still see warnings and incomplete data.
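For reference, a sketch of the kind of read access the kubeturbo service account needs is shown below; the names are illustrative, and the turbo-cluster-reader role shipped with your deployment method is the source of truth:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: turbo-cluster-reader   # example name
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "namespaces", "services", "endpoints", "limitranges", "resourcequotas", "persistentvolumes", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["nodes/stats", "nodes/metrics", "nodes/spec", "nodes/proxy"]
  verbs: ["get"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
  verbs: ["get", "list", "watch"]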
Collect data for investigating Kubernetes deployment issue
To investigate an issue with the Kubernetes deployment of the Turbo Server, collect the following data for further investigation:
- the complete rsyslog log
- the CR of the Turbo Server
- the operator log
- the configMap for the Turbo Server
rsyslog
Get the complete rsyslog log. Depending on how long the deployment has been running, kubectl might not be able to print the complete log through the logs command.
If you just want to glimpse the latest log from rsyslog, you can use the following command to look at the log.
kubectl -n turbonomic logs deployment/rsyslog
# OR
kubectl -n turbonomic logs $(kubectl -n turbonomic get pods | grep rsyslog | awk '{print $1}')
The -p flag makes the logs command print the log of the previous container instance for the selected deployment, which is useful when the pod has restarted and the current log only goes back to the last restart. To get that previous log, you can use the following command.
kubectl -n turbonomic logs -p deployment/rsyslog
# OR
kubectl -n turbonomic logs -p $(kubectl -n turbonomic get pods | grep rsyslog | awk '{print $1}')
The complete log for rsyslog is stored on the host machine via a persistent volume claimed as turbonomic/rsyslog-syslogdata. So the way to find the complete log is to find the host path of the persistent volume and then find the log file on the host machine.
PERSISTENT_VOLUME_HOST_DIR=$(kubectl -n turbonomic get persistentVolume $(kubectl -n turbonomic get persistentVolume | grep rsyslog-syslogdata | awk '{print $1}') -o yaml | grep -oP "path: \K.*")
# You can cp/cat/grep the following log file path for your own purposes
LOG_FILE="${PERSISTENT_VOLUME_HOST_DIR}/rsyslog/log.txt"
CR of the Turbo Server
To get the CR info of your Turbo Server, you can either look directly at the definition file or query the XL custom resource.
The file is located at: ./kubernetes/operator/deploy/crds/charts_v1alpha1_xl_cr.yaml
In case you cannot locate the file you can try the following command:
kubectl -n turbonomic get XL xl-release -o yaml | head -n 150
t8c-operator log
The operator doesn't store the complete log (since deployment creation) in any local file, so the most complete log you can get is from the operator container itself. You can use the following command to query the previous operator container's log (useful after a restart).
kubectl -n turbonomic logs -p deployment/t8c-operator
# OR
kubectl -n turbonomic logs -p $(kubectl -n turbonomic get pods | grep t8c-operator | awk '{print $1}')
In general, you can query the current log buffer via the following command.
kubectl -n turbonomic logs deployment/t8c-operator
#OR to follow the current log
kubectl -n turbonomic logs deployment/t8c-operator -f
configMap for the Turbo server.
You can query the configMap for the Turbo server via the following command.
kubectl -n turbonomic get cm $(kubectl -n turbonomic get deploy kubeturbo -o jsonpath='{.spec.template.spec.volumes[*].configMap.name}') -o yaml
Changes to Cluster Role Names and Cluster Role Binding Names
When using an Operator or OpenShift OperatorHub based kubeturbo deployment: kubeturbo-operator 8.9.5 introduced a new unique Cluster Role Binding name, and 8.9.6 introduced a new unique Cluster Role name. This was done in order to support deploying multiple Operator based kubeturbo instances in a single cluster. As such, some manual changes are needed to ensure your Operator based deployment of kubeturbo will continue to work and use these new unique Cluster Role and Cluster Role Binding names. Since the kubeturbo Operator is helm based, these manual steps are required to delete the old Cluster Role and Cluster Role Binding.
There are a few different scenarios where you will be required to perform the manual steps below to use the new Cluster Role and Cluster Role Binding names.
- Your Operator or OpenShift OperatorHub based deployment of kubeturbo uses a custom Cluster Role such as turbo-cluster-reader and you now want to change it to turbo-cluster-admin to allow kubeturbo to execute actions
- Your Operator or OpenShift OperatorHub based deployment of kubeturbo is using a custom Cluster Role such as turbo-cluster-reader or turbo-cluster-admin, and you originally had a version prior to 8.9.5 deployed and have upgraded to a later version.
Manual Steps to follow
- Delete all Cluster Roles that start with the names turbo-cluster-reader and turbo-cluster-admin (there could be a total of 4 of them)
- Delete the Cluster Role Binding that starts with the name turbo-all-binding-kubeturbo (example commands for these deletions follow this list)
- After a few minutes the kubeturbo-operator will automatically re-create the required non-default Cluster Role and Cluster Role Binding using the new names.
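A hedged set of commands for the first two steps is shown below; review the names returned by the get commands before deleting anything:
# list then delete the old Cluster Roles
kubectl get clusterroles | grep -E 'turbo-cluster-(reader|admin)'
kubectl delete clusterrole {cluster_role_name}
# list then delete the old Cluster Role Binding
kubectl get clusterrolebindings | grep turbo-all-binding-kubeturbo
kubectl delete clusterrolebinding {cluster_role_binding_name}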