Blog 2024 11
2024-11-19
Playing around with Argo Workflows (documentation)
- Installed the 3.6.0 release from GitHub in the `argo` namespace - binary and manifests
- The release manifests didn't properly set up permissions for the argo serviceaccount. Need to patch the argo-role Role with permissions defined here
- When submitting jobs, need to specify both the namespace and the serviceaccount - e.g. `argo submit hello-world.yaml -n argo --serviceaccount argo`
- I should be able to automate this by setting defaults in the default workflow spec
- The fields to specify should be `workflow.spec.serviceAccountName` and `workflow.metadata.namespace`
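A minimal sketch of what those defaults might look like, assuming the `workflowDefaults` entry in the workflow-controller-configmap is the right place for them (untested; the values just mirror the namespace/serviceaccount above):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # The workflow controller reads its configuration from this ConfigMap.
  name: workflow-controller-configmap
  namespace: argo
data:
  # Defaults merged into every submitted Workflow unless the Workflow sets its own values.
  workflowDefaults: |
    metadata:
      namespace: argo
    spec:
      serviceAccountName: argo
```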
2024-11-18
- Modified AWS EKS config to use EFS as a provider for PersistentVolumeClaims. This was surprisingly tricky, but kind of fun to figure out
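For future reference, the moving pieces were roughly a StorageClass backed by the EFS CSI driver plus PVCs that reference it. A sketch, assuming the aws-efs-csi-driver is installed; the file system ID and claim name are placeholders:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap              # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0    # placeholder - the real EFS file system ID
  directoryPerms: "700"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim                   # placeholder name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi                      # EFS is elastic, but the field is still required
```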
2024-11-14
Time to do another brain dump on what is needed to finish things up before tackling Continuous Deployment
- Install
  - Write up documentation for Helm chart (install.md) in chiller-doc
  - Modify Helm chart to use namespace-per-install, including CRDs for Prometheus Operator and ConfigMap for Grafana dashboard
- Load simulation
  - Containerize the Locust load simulator
  - Helm chart for installing the load simulator and running it against a specific namespace (to correspond to an application install)
  - Write load.md in chiller-doc
- General cleanup
  - Merge release-0.1.1 into main, and create a new branch for testing (so much, including the GitHub Actions and Helm stuff, is in that release)
  - Move the Helm and load directories from chiller into different repositories
- Documentation/demo
  - Update README.md in the chiller repo to make it more engaging
  - Finish other documentation in chiller-doc
    - api_definition.md
    - development_platform.md
    - developer_testing.md
  - Second pass at the PPT to have at least one slide with a diagram for each component
  - Re-do the mermaid diagrams to be more readable/useful
  - Pass through all documentation and ensure there are links/references to where the code is for each component being documented
Forward-thinking stuff
- Install and experiment with Argo Rollouts
- Prototype GitHub Action getting AWS Access Key ID and Secret Access Key from GitHub Environment secret
- Go through old blog posts to find things I said I wanted to experiment with but apparently forgot about
- Figure out how hard it would be to use AWS Athena instead of a Postgres pod
Stuff collected from past blog posts to look at again
- Istio and Kiali
- ArgoCD and Kargo
- Terminal GUIs to manage k8s: k9s, kdash
- pew (improved Python venvs)
- Jaeger and jaeger-operator, OpenTelemetry
2024-11-11
- Created initial draft of the IaC documentation, covering using Terraform to install the entire environment (VPC, EKS, Load Balancer Controller, Prometheus and Grafana)
- Created initial draft of the Observability - Metrics documentation, including a design for a namespace-specific version (right now it only works with one install of the application per cluster, in the default namespace).
Fix race condition in Terraform install of Load Balancer Controller
- The problem with the Terraform config is with the load balancer controller. The LBC Helm chart is installed (and completes installation) before the EKS node group has finished creating, and even before the VPC-CNI addon is done installing (how do you do networking without that?). The LBC installs a mutating webhook for services. Terraform then tries to install the kube-prometheus helm chart (which installs services). The webhook is called, but it cannot execute because the pods for the service backing the webhook have not started up yet (because the k8s data plane is still being created). I am a little confused as to why the helm install starts (and finishes) before the nodes are even running. By default the Helm provider is supposed to wait until pods and services are up before saying the helm install is complete. Looks like this is an active issue with the Helm provider.
- This is the raw error:
│ * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
- One suggestion I found was to create a null resource in Terraform that just waits "a while", make it depend on the EKS cluster, and then make the helm chart depend on it. This is a hack that is hard to stomach. Unfortunately it looks like the `wait_for_jobs` flag, which would be a good way to handle this, has a similar bug to the `wait` flag in the helm provider - it doesn't actually wait. This could still be done with a null resource, but that is terrible (though other modules seem to do this a lot, including early versions of the EKS module).
- Another hacky way to handle this is to make sure the monitoring stack is deployed before the LBC. Then the webhook won't cause an error while its backing pods are starting up. However, it means that anything I install with Helm after the LBC will always face this issue.
- Turns out that making a dependency on the monitoring stack for the LBC worked. It appears that the monitoring stack cannot complete until after the EKS node group is running. Not sure if this is because of CRDs or what. Still confused, but it works now at least.
- The "right" way to do this is to have a Terraform null resource that depends on the load balancer controller Helm install and runs
kubectl wait
on theaws-load-balancer-webhook
service and the associatedaws-load-balancer-controller
pods. The Null resource then blocks anything dependent on it until the service/pods backing the webhook are up and running. Then make the monitoring Helm stack resource (and any other Helm chart installs done through Terraform) be dependent on that null resource. This is what I will do if I encounter this again. I have a feeling this is not fixed, just masked.
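A rough sketch of that pattern, assuming kubectl on the machine running Terraform is pointed at the cluster; the resource names, pod label selector, variables, and monitoring namespace are placeholders rather than the actual config:

```hcl
# Sketch only - resource names, labels, and namespaces are assumptions.
resource "helm_release" "aws_load_balancer_controller" {
  name       = "aws-load-balancer-controller"
  repository = "https://aws.github.io/eks-charts"
  chart      = "aws-load-balancer-controller"
  namespace  = "kube-system"

  set {
    name  = "clusterName"
    value = var.cluster_name # assumed variable
  }
}

resource "null_resource" "wait_for_lbc_webhook" {
  # Re-run the wait whenever the LBC release changes.
  triggers = {
    lbc_release = helm_release.aws_load_balancer_controller.id
  }

  # Block until the controller pods backing the webhook service are Ready.
  provisioner "local-exec" {
    command = "kubectl wait --namespace kube-system --for=condition=Ready pod --selector=app.kubernetes.io/name=aws-load-balancer-controller --timeout=300s"
  }

  depends_on = [helm_release.aws_load_balancer_controller]
}

resource "helm_release" "kube_prometheus_stack" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true

  # Anything that creates Services waits until the LBC webhook can answer.
  depends_on = [null_resource.wait_for_lbc_webhook]
}
```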
2024-11-10
- Found the problem with my AWS EKS Terraform code. The node security group did not allow port 80 from other nodes. Fixed that, now it works great.
- Load Balancer Controller also works. Updated Helm chart for Chiller to install an Ingress for the chiller-frontend, and it works.
- Turns out the default config for grafana in the kube-prometheus stack is to enable custom dashboards via configmap from any namespace. So I don't have to change the config for that. Just need to create a configmap to get the dashboard loaded (see the sketch at the end of this entry).
- Got everything working piecemeal. There is a race condition between the EKS terraform creation and terraform installation of the monitoring stack. Too late to debug tonight. Doing Terraform apply twice worked. Grafana dashboard automatically created, ALB created based on ingress in application Helm chart.
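A sketch of that configmap, assuming the grafana sidecar's default `grafana_dashboard` label; the name, namespace, and dashboard JSON are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: chiller-dashboard        # placeholder name
  namespace: default             # placeholder - wherever the sidecar is watching
  labels:
    grafana_dashboard: "1"       # default label the dashboard sidecar selects on
data:
  # Stand-in for the real dashboard JSON exported from Grafana.
  chiller-dashboard.json: |
    { "title": "Chiller", "panels": [] }
```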
2024-11-08
I've been working most of this week on IaC for AWS and the Helm chart for the chiller application
- I have a Terraform configuration for AWS that creates a VPC with the appropriate subnets, an EKS cluster, and an autoscaling group for the worker nodes, and installs the Load Balancer Controller that AWS recommends for handling ingress
- I switched the monitoring stack install from the kube-prometheus manifest-style method to the kube-prometheus-stack community helm chart. Of course there were a bunch of minor differences that I had to track down, and I learned a lot more about how the prometheus operator works
- I modified my application helm chart to remove kustomize as a post-processor when installing images created using the CI/CD flow. I can now set the tag for the application containers as a `--set appTag=whatever` argument to helm when installing the chart (a sketch of the wiring is at the end of this entry). It probably still works with the old ttl image method, but I haven't tested it. I updated the Makefile in chiller/helm appropriately
- I plan to continue working on the Terraform configuration to add installation of these two helm charts automatically. One thing I'm going to defer for now is the automated install of the chiller grafana dashboard. That can be done reasonably easily by changing the installation parameters of the grafana subchart (setting `sidecar.dashboards.enabled` to true and creating a ConfigMap that has the dashboard definition)
- For some reason I'm having database connectivity issues within my application. It is difficult to track down the cause. I need to add some better logging to the application itself to make this easier to fix.
- Terraform now installs the monitoring helm chart. Still need to diagnose the brand new database connectivity issue in the application. Might be related to the load balancer controller?
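For context on how that appTag flag is typically wired up, here is a hypothetical deployment template excerpt; the value names and labels are assumptions, not the actual chiller chart:

```yaml
# templates/deployment.yaml (hypothetical excerpt - not the actual chart)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chiller-frontend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chiller-frontend
  template:
    metadata:
      labels:
        app: chiller-frontend
    spec:
      containers:
        - name: chiller-frontend
          # Tag supplied at install time, e.g. helm install ... --set appTag=<ci-build-tag>
          image: "{{ .Values.image.repository }}:{{ .Values.appTag }}"
```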