Troubleshooting - severalnines/ccx-admin-guide GitHub Wiki
Troubleshooting
This page describes troubleshooting scenarios.
CCX Error Report
The script that generates an error report is available at https://github.com/severalnines/helm-ccx/tree/main/scripts. Attach the error report (.tar.gz file) to every ticket created for CCX support; it makes diagnosing the issue much easier.
For the CMON instance that manages CMONDB and CCXDB, see the section CMON (Controller of CMONDB and CCXDB) below.
CMON (Controller of Datastores)
Use kubectl and open a bash terminal on the cmon-master container.
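A minimal sketch of opening that terminal, assuming you already know which pod runs the cmon-master container. The function name cmon_shell and its NAMESPACE and POD arguments are illustrative placeholders, not part of CCX.

```shell
# Open an interactive bash session in the cmon-master container.
# Usage: cmon_shell NAMESPACE POD   (both values are placeholders)
cmon_shell() {
    if [ "$#" -ne 2 ]; then
        echo "usage: cmon_shell NAMESPACE POD" >&2
        return 1
    fi
    kubectl exec -it --namespace "$1" "$2" --container cmon-master -- bash
}
```

The namespace and pod name can be looked up with kubectl get pods --namespace NAMESPACE before calling the function.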
S9S CLI commands
The s9s commands below, in particular 's9s job', can be used to debug why a datastore failed to create. Examples:
# Check cluster status
s9s cluster --list --long
# Check node status
s9s node --list --long
# Check node status of particular cluster
s9s node --list --long --cluster-id=CLUSTERID
# Check status of replication links
s9s replication --list --long
# List jobs and view logs
s9s job --list
s9s job --job-id=NNN --log
Creating an Error Report
The error report contains detailed information about the system and helps support understand the configuration, topology, and state. From the ClusterControl Admin UI, create an error report by going to the problematic datastore/cluster -> "Report" and then "Create Error Report". An error report in .tar.gz format is generated and can be downloaded from the UI.
Getting information of a failed job
If you know the CLUSTER_UUID (it can be obtained from the logs or from the UI, for example), get details about the cluster:
s9s cluster --list | grep CLUSTER_UUID
List all failed jobs of cluster NNN. You get NNN from the s9s cluster --list output above.
s9s job --list --cluster-id=NNN |grep FAILED
If you don't know the cluster ID, run:
s9s job --list |grep FAILED
Locate the failed job you are interested in, obtain its job ID (first column in the output; MMM below), and view its log:
s9s job --job-id=MMM --log
Additionally, an error report is needed in many cases because it contains detailed information about the datastores. See Creating an Error Report.
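The steps above can be combined into a small helper. This is a sketch, not part of CCX: the function names failed_job_ids and show_failed_job_logs are hypothetical, and the only assumption about the s9s job --list output is the one stated above, that the job ID is the first column.

```shell
# Print the job ID (first column) of every line that contains FAILED.
# Reads `s9s job --list` output on stdin.
failed_job_ids() {
    awk '/FAILED/ { print $1 }'
}

# Print the log of every failed job, optionally limited to one cluster.
# Usage: show_failed_job_logs [CLUSTER_ID]
show_failed_job_logs() {
    if [ -n "${1:-}" ]; then
        s9s job --list --cluster-id="$1"
    else
        s9s job --list
    fi | failed_job_ids | while read -r job_id; do
        echo "=== job $job_id ==="
        # Keep s9s from consuming the list of job IDs on stdin.
        s9s job --job-id="$job_id" --log < /dev/null
    done
}
```

Run show_failed_job_logs NNN on the controller to limit the output to cluster NNN, or show_failed_job_logs without arguments to cover all clusters.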
Common Issues
Monitoring is being setup/Charts/Dashboards are not loading up.
Launch the install agents job again on the controller:
s9s cluster --list --long | grep CLUSTER_UUID
# Take the cluster ID, NNN
s9s cluster --deploy-agents --cluster-id=NNN --log
Then check if it solved the issue. If not, contact the CCX Team.
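The two commands above can be chained so that the cluster ID does not have to be copied by hand. This sketch assumes that the cluster ID is the first column of the s9s cluster --list --long output and that the line for the datastore contains its CLUSTER_UUID; verify both against your own output before relying on it. The function names cluster_id_for_uuid and redeploy_agents are hypothetical.

```shell
# Print the first column (assumed to be the cluster ID) of the first
# line on stdin that contains the given CLUSTER_UUID.
cluster_id_for_uuid() {
    awk -v uuid="$1" 'index($0, uuid) { print $1; exit }'
}

# Re-run the install agents job for the cluster matching CLUSTER_UUID.
# Usage: redeploy_agents CLUSTER_UUID
redeploy_agents() {
    if [ -z "${1:-}" ]; then
        echo "usage: redeploy_agents CLUSTER_UUID" >&2
        return 1
    fi
    cluster_id=$(s9s cluster --list --long | cluster_id_for_uuid "$1")
    if [ -z "$cluster_id" ]; then
        echo "no cluster found for $1" >&2
        return 1
    fi
    s9s cluster --deploy-agents --cluster-id="$cluster_id" --log
}
```

If no line matches the UUID, the function stops with an error instead of starting a job.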
How do I determine what SSH credentials to use for connecting to the datastore's VM?
vault kv get kv/ccx-private-key/DATASTORE_UUID
Recreating 'ccxadmin' user
Obtain the cluster ID of the problematic cluster (--cluster-id=NNN below):
MySQL/Percona
s9s account --cluster-id=NNN --create --account='ccxadmin:PASSWORD@%' --privileges='*.*:SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, PROCESS, REFERENCES, INDEX, ALTER, SHOW DATABASES, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW, REPLICATION_SLAVE_ADMIN, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, CREATE USER, EVENT, TRIGGER, GRANT'
s9s account --cluster-id=NNN --revoke --account='ccxadmin:PASSWORD@%' --privileges='mysql.*:INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER;sys.*:INSERT, UPDATE, DELETE, CREATE, DROP, REFERENCES, INDEX, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, EVENT, TRIGGER'
MariaDB
s9s account --cluster-id=NNN --create --account='ccxadmin:PASSWORD@%' --privileges='ccxdb.*:ALL, GRANT;*.*:CREATE USER, REPLICATION SLAVE, REPLICATION SLAVE ADMIN, SLAVE MONITOR'
PostgreSQL
s9s account --cluster-id=NNN --create --account='ccxadmin:PASSWORD@%' --privileges='NOSUPERUSER, CREATEROLE, LOGIN, CREATEDB'
Certificates
Services are bouncing
Failed to deploy datastore
- Check logs
- Quota or resource issues, e.g. cannot allocate RAM, disk, or CPU
Upgrading Control Plane
Issues related to upgrading / migrations
Auxiliary Infrastructure (Databases etc, external to Kubernetes)
CCX Databases
All production databases are added to their respective ClusterControl installations. You can manage the most common problems from there.
CMON (Controller of CMONDB and CCXDB)
If there are issues with the controller managing CMONDB and CCXDB, then please open a support request and attach an error report to the support issue.
Create an error report
Create an error report using the CCUIv2:
- Select the Cluster (CCXDB or CMONDB).
- Select Reports, and then Create Error Report.
- The error report will now be created and presented in the web UI.
Networking issues
If there are network issues (for example, no route to host from the CCX control plane to the database nodes or the OpenStack infrastructure), it may help to restart the pods, because the underlying network may have changed. You can make a change to the environment (in the Helm chart, for example) and then run helm upgrade, which triggers a restart of the pods.