Debug k8s issues
Unable to connect to the API server: x509 certificate signed by unknown authority

ubuntu@ip-172-31-32-211:~$ kubectl get nodes -o wide
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
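This error usually means the kubeconfig in use embeds a stale CA certificate. A quick check before resetting anything, assuming the default kubeadm paths:

```sh
# Decode the CA embedded in the kubeconfig and compare it with the cluster CA.
grep certificate-authority-data $HOME/.kube/config | awk '{print $2}' | base64 -d > /tmp/kubeconfig-ca.crt
sudo diff /tmp/kubeconfig-ca.crt /etc/kubernetes/pki/ca.crt && echo "CA certificates match"
```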
Reset the node, then regenerate the kubeconfig from the cluster's admin credentials:

kubeadm reset
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
sudo chmod 777 $HOME/.kube/config
export KUBECONFIG=/etc/kubernetes/kubelet.conf   # node-level credentials, or:
export KUBECONFIG=/home/ubuntu/.kube/config      # the admin config copied above
kubectl get nodes
PersistentVolume and PersistentVolumeClaim stuck in Terminating state
ubuntu@ip-172-31-32-211:~$ kubectl get pv
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS        CLAIM             STORAGECLASS   REASON   AGE
10gpv01   10Gi       RWO            Retain           Terminating   default/myclaim                           5h55m
pvvol-1   1Gi        RWX            Retain           Available                                               65m
ubuntu@ip-172-31-32-211:~$ kubectl get pvc
NAME      STATUS        VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
myclaim   Terminating   10gpv01   10Gi       RWO                           5h39m
Edit the PV and the PVC resources and delete the finalizers stanza. Finalizers are arbitrary string values that, while present, prevent a hard delete of the resource.
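To see which finalizers are blocking the deletion (resource names match the output above; on PVs and PVCs this is typically kubernetes.io/pv-protection and kubernetes.io/pvc-protection):

```sh
kubectl get pv 10gpv01 -o jsonpath='{.metadata.finalizers}{"\n"}'
kubectl get pvc myclaim -o jsonpath='{.metadata.finalizers}{"\n"}'
```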
ubuntu@ip-172-31-32-211:~$ kubectl edit pv 10gpv01
persistentvolume/10gpv01 edited
ubuntu@ip-172-31-32-211:~$ kubectl edit pvc myclaim
persistentvolumeclaim/myclaim edited
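Equivalently, the finalizers can be cleared non-interactively with kubectl patch, a sketch using the same resource names:

```sh
# A null merge patch removes the finalizers list, allowing the delete to complete.
kubectl patch pv 10gpv01 -p '{"metadata":{"finalizers":null}}'
kubectl patch pvc myclaim -p '{"metadata":{"finalizers":null}}'
```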
Unable to curl the master k8s node when using an Ingress. Only the worker node is listed in the endpoint info.
ubuntu@ip-172-31-32-211:~$ curl  -H "Host: www.giri.com" http://<master_node_public_IP> 
curl: (7) Failed to connect to 13.235.214.225 port 80: Connection refused
ubuntu@ip-172-31-32-211:~$ kubectl describe ep -n kube-system traefik-ingress-service
Name:         traefik-ingress-service
Namespace:    kube-system
Labels:       <none>
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2020-08-31T07:18:48Z
Subsets:
  Addresses:          192.168.89.13
  NotReadyAddresses:  <none>
  Ports:
    Name   Port  Protocol
    ----   ----  --------
    admin  8080  TCP
    web    80    TCP
Events:  <none>
The master node had a taint, so the ingress controller was not scheduled on it and only one pod was launched, on the worker node. A DaemonSet launches one pod on every schedulable node, unlike a Deployment, where the number of replicas is set by the user.
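The difference is visible on the DaemonSet itself, assuming it is named traefik-ingress-controller like its pods:

```sh
# For a DaemonSet, DESIRED tracks the number of schedulable nodes, not a replica count.
kubectl -n kube-system get daemonset traefik-ingress-controller
```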
ubuntu@ip-172-31-32-211:~$ kubectl get po -n kube-system -o wide | grep traefi
traefik-ingress-controller-lv9nv           1/1     Running   0  
ubuntu@ip-172-31-32-211:~$     kubectl describe nodes | grep -i taint
Taints:             node-role.kubernetes.io/master:NoSchedule
Removing the taint on the master node let the DaemonSet schedule a pod on both nodes.
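A sketch of the taint removal, assuming the master is ip-172-31-32-211 (the node the commands are run from); the trailing - deletes the taint:

```sh
kubectl taint nodes ip-172-31-32-211 node-role.kubernetes.io/master:NoSchedule-
```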
ubuntu@ip-172-31-32-211:~$ kubectl get po -n kube-system -o wide | grep traefi
traefik-ingress-controller-d8rfb           1/1     Running   0          25m   172.31.38.235   ip-172-31-38-235   <none>           <none>
traefik-ingress-controller-ggdmj           1/1     Running   0          25m   172.31.32.211   ip-172-31-32-211   <none>           <none>
ubuntu@ip-172-31-32-211:~$ kubectl describe ep -n kube-system traefik-ingress-service
Name:         traefik-ingress-service
Namespace:    kube-system
Labels:       <none>
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2020-08-31T08:57:25Z
Subsets:
  Addresses:          172.31.32.211,172.31.38.235
  NotReadyAddresses:  <none>
  Ports:
    Name   Port  Protocol
    ----   ----  --------
    admin  8080  TCP
    web    80    TCP
Events:  <none>
ubuntu@ip-172-31-32-211:~$
ubuntu@ip-172-31-32-211:~$ curl  -H "Host: www.shourya.com" http://k8smaster
<!DOCTYPE html>
<html>
<head>
<title>Shourya</title>
coredns pods are stuck in ContainerCreating status
Delete the pods; the Deployment recreates them, and the new pods should come up in a Running state.
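A sketch of the restart, assuming the standard k8s-app=kube-dns label on the coredns pods:

```sh
kubectl -n kube-system delete pod -l k8s-app=kube-dns
kubectl -n kube-system get pods -l k8s-app=kube-dns
```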
ubuntu@ip-172-31-32-211:~$ kubectl get nodes
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
- On both the master and worker nodes, do a `kubeadm reset`.
- On both the master and worker nodes, clean the CNI configuration, the pki files and the etcd directory, and flush the iptables rules:

  rm -rf /etc/cni/net.d
  rm -rf /etc/kubernetes/pki/*
  rm -rf /var/lib/etcd/*
  iptables --flush

- On the master node, clean up the old kubeconfig:

  rm -rf /home/ubuntu/.kube/config

- Do a `kubeadm init` on the master.
- Do a `kubeadm join` on the worker (see the sketch after this list if the join command is no longer at hand).
- On the master node, do the following:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

- Enable the CNI by doing a `kubectl apply -f calico.yaml` on the master node.
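`kubeadm init` prints the exact `kubeadm join` command for the worker; if it has scrolled away, a fresh one (new token plus CA hash) can be generated on the master:

```sh
kubeadm token create --print-join-command
```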