Secrets Vault - department-of-veterans-affairs/abd-vro GitHub Wiki

Secrets are stored in LHDI's HashiCorp Vault, which resides in the VA network. Secrets include credentials, tokens, and certificates for all deployment environments. Scripts and Helm configurations have been created to formalize and reproduce the deployment of secrets to all LHDI environments.

HashiCorp Vault

Secrets for all LHDI deployment environments are stored in a single vault at https://ldx-mapi.lighthouse.va.gov/vault, which requires VA network access. Following the security principle of least privilege, only members of the VRO Admins GitHub Team can log in, using their GitHub credentials. Log in to the web UI by following these instructions, entering vro-admins (which corresponds to the VRO Admins GitHub Team) as the "Role".

(Context: A separate VRO Admins GitHub Team was created to limit access to secrets. By default, LHDI allows all members of VA-ABD-RRD GitHub Team to have access to a vault store, which is contrary to the principle of least privilege. There's a vault store for va-abd-rrd but it is unused.)

In the Vault, secrets are organized under the deploy/ folder. Subfolders for each environment are used as follows:

  • default: provides default secrets for all environments; used for the LHDI dev environment
  • qa, sandbox, prod-test, prod: used for the respective LHDI environment and overrides any default secrets
    • Only differences from default secrets need to be present. As a result, there are few secrets in the qa environment and there is no dev subfolder.

Within each environment subfolder are further subfolders, referred to here as "groups". Each group contains key-value pairs. Typically the key is an environment variable name that is mapped verbatim for use by containers. Occasionally, Helm configurations map the secret to a different environment variable name expected by a different container -- for an example, search for DB_CLIENTUSER_NAME. In summary, the full Vault path to a group is $TEAM_NAME/deploy/$ENV/$GROUP. There are no subfolders deeper than the group level.
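As a minimal sketch, assuming a hypothetical team name of va-abd-rrd, the path to a group can be assembled and read as follows (the vault CLI call requires VA network access and a valid token, so it is shown only as a comment):

```shell
# Hypothetical values -- the actual team name and group names live in Vault.
TEAM_NAME="va-abd-rrd"
ENV="qa"
GROUP="db"

# Full Vault path to a group: $TEAM_NAME/deploy/$ENV/$GROUP
VAULT_PATH="$TEAM_NAME/deploy/$ENV/$GROUP"
echo "$VAULT_PATH"   # va-abd-rrd/deploy/qa/db

# With VA network access and a Vault token, the group's key-value pairs
# could then be read with the Vault CLI:
#   vault kv get "$VAULT_PATH"
```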

The groups are as follows:

  • db: secrets for VRO's database; maps to Kubernetes secret named vro-db
  • mq: secrets for the message queue; maps to Kubernetes secret named vro-mq
  • redis: secrets for the Redis cache; maps to Kubernetes secret named vro-redis
  • VRO_SECRETS_API, VRO_SECRETS_LH, VRO_SECRETS_MAS, ...: secrets used by VRO components; these VRO_SECRETS_* groups map to Kubernetes secrets named vro-secrets-....
    • These VRO_SECRETS_* groups are treated differently than the above groups to allow new secrets to be added without having to update Helm configurations, thereby minimizing maintenance. Most new secrets will be added in these groups.
    • Unlike the other groups, each VRO_SECRETS_* group is passed as a single aggregate environment variable to the VRO containers that use it, as specified in Helm configurations. For example, the VRO_SECRETS_API group maps to the VRO_SECRETS_API environment variable for the app container. The aggregate environment variable contains multiple export commands like export APIAUTH_KEY01=.... Upon startup, the container runs set-env-secrets.src to execute the export commands in the aggregate environment variable, resulting in exported environment variables (such as APIAUTH_KEY01) being available to the application.
    • To handle multiline strings and special characters, secret values can be base64-encoded. These secrets use a key name ending in _BASE64 so that the set-k8s-secrets.sh script decodes the value properly and sets an environment variable without the _BASE64 suffix.
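The aggregate-variable and _BASE64 mechanisms can be illustrated with a small sketch. The values below are made up; in reality Helm injects the aggregate variable from Vault, and the decoding is done by set-env-secrets.src and set-k8s-secrets.sh:

```shell
# Simulated aggregate secret: one environment variable holding export commands.
VRO_SECRETS_API='export APIAUTH_KEY01=abc123
export APIAUTH_KEY02=def456'

# set-env-secrets.src effectively evaluates the export commands at startup,
# making each key available as its own environment variable:
eval "$VRO_SECRETS_API"
echo "$APIAUTH_KEY01"   # abc123

# The _BASE64 convention: a multiline value is stored base64-encoded ...
MY_CERT_BASE64=$(printf 'line1\nline2\n' | base64)
# ... and decoded into a variable without the _BASE64 suffix:
MY_CERT=$(printf '%s' "$MY_CERT_BASE64" | base64 --decode)
echo "$MY_CERT"
```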

Unique key names to avoid collisions

While key-value pairs are organized in separate subfolders, the key names (which are typically used as environment variable names) should be unique within each LHDI environment to avoid collisions when they are mapped to environment variables for Docker containers. For example, if a MY_SECRET key existed in both the redis and VRO_SECRETS_API groups AND a container used both groups, the container would end up with only one environment variable rather than the desired two. Note that this collision can also occur between MY_SECRET and MY_SECRET_BASE64 key names because the _BASE64 suffix is elided from the container's environment variable name.
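A minimal sketch of the collision (the group and key names here are hypothetical):

```shell
# Suppose both the redis group and the VRO_SECRETS_API group define MY_SECRET.
# When both groups are mapped into the same container, the exports collide
# and the last one evaluated silently wins:
export MY_SECRET="from-redis-group"
export MY_SECRET="from-api-group"
echo "$MY_SECRET"   # from-api-group -- the redis value is lost
```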

Adding/Modifying a secret

Ask a VRO Admin to add, remove, or update the secret in Vault. Securely provide the secret for each LHDI environment -- minimally, one secret value for dev and another for prod.

  • If the secret is added to an existing VRO_SECRETS_* group, no Helm configuration changes are needed.
  • If the secret is added to another group, Helm configurations should be updated to use the new secret.

Run Deploy secrets from Vault for each LHDI environment to update the Kubernetes secrets. The Docker containers will not use the new secrets until they are redeployed. Note: this action is currently broken.

Updates to secrets not being propagated?

There are circumstances where Kubernetes logs "Error: couldn't find key VRO_SECRETS_... in Secret va-abd-rrd-.../vro-secrets" -- see the Slack thread for screenshots. This occurred because a single aggregate vro-secrets secret was used for all VRO_SECRETS_* groups, which introduced issues with propagating secret updates because containers still referenced the old aggregate secret:

  • Symptom: Sometimes redeploying the pod works and sometimes it fails with this error.
  • Current hypothesis for this inconsistent error: if other running pods reference the vro-secrets secret, old versions of it may remain available and be used by new pods. This article prompted the hypothesis.
  • Workaround: Restart all old pods that reference the vro-secrets secret, then start everything back up. If a restart isn't sufficient, a complete shutdown of all pods may be necessary to remove all references to the old secret.
  • Additionally, marking the secret immutable may contribute to the use of old secrets: immutable secrets aren't expected to change, so any changes (including a destroy and re-create) are not propagated. As a result, the vro-secrets secret is marked mutable in set-k8s-secrets.sh.

Now the vro-secrets-* secrets are individual secrets, each used by one or a very small number of containers. This reduces the number of containers that need to be shut down simultaneously to release all references to an old secret, and should reduce the probability of this problem.

Adding a non-secret environment variable

To set a non-secret environment variable for a container in an LHDI environment, add it to the relevant Helm chart(s) under helm/. If the variable value is different for each environment, also add it to helm/values-for-*.yaml files.

With that said, before adding an environment variable, please read the next section.

Configuration setting vs. Environment variable

It is preferred to use a configuration file scoped to only the application/microservice/container (e.g., Configuration Settings#springs-applicationyml).

An environment variable is needed when any of the following are true:

  • it is a secret (username, password, token, private certificate, ...) -- use Hashicorp Vault (as described on this page)
  • used by multiple containers -- set it in helm/values*.yaml files and reference it in Helm charts (under helm/)
  • needs to be manually changed in deployment environments -- let's discuss

We should minimize the number of unnecessary Helm configurations, which will reduce DevOps maintenance and overhead, and reduce the number of factors that can cause VRO deployments to fail.

Setting the Vault-token secret

A Vault token is needed to access the Vault. The automation (a self-hosted GitHub Runner) expects the Vault token to be stored in a Kubernetes secret named vro-vault in the LHDI dev environment. The token expires monthly. Run scripts/set-secret-vault-token.sh "$VAULT_TOKEN" to set the token, where $VAULT_TOKEN is the string copied from the Vault web UI (click "Copy token" in the upper-right drop-down menu).

Setting Kubernetes access tokens

Kubernetes access tokens for each cluster (i.e., non-prod and prod) are needed to deploy the secrets to the LHDI environments. The access tokens expire after 90 days. Run scripts/set-secret-kube-config.sh to set the devops-kubeconfig secret.

Setting GHCR secret

A GHCR secret named devops-ghcr needs to be set in Kubernetes for LHDI to pull images. Run scripts/set-secret-ghcr.sh "$ENV" "$PAT" for each LHDI environment, where $ENV is dev, qa, etc., and $PAT is a GitHub personal access token generated using the abd-vro-machine account. This only needs to be run once (or whenever the PAT expires).

FAQ

Why use Vault?

It is a centralized, secure location in the VA's network designed to hold secrets. From the Vault, secrets can be quickly and consistently redeployed to various LHDI environments in case they need to be reset or rotated.

Why use a self-hosted GitHub Runner?

Our GitHub Action workflow starts a self-hosted runner within our LHDI dev environment to pull Vault secrets and set Kubernetes secrets, all within the VA's network. This is more secure than using HashiCorp's GitHub Action, which would pull Vault secrets outside the VA network into the GitHub Action workflow environment. The runner is a container in the vro-set-secrets-... Kubernetes pod and can deploy secrets to any LHDI environment when initiated by the Deploy secrets from Vault GitHub Action workflow. The pod is deployed to the dev LHDI environment (because that environment doesn't require SecRel-signed images) and can deploy secrets to other environments.

Why are some Kubernetes secrets immutable?

There have been unexplained occurrences where Kubernetes secrets changed and caused problems. Making them immutable aims to reduce (but does not entirely prevent) this problem.