# Troubleshooting
Common issues and solutions for Eagle deployments.
Symptom: Deploy to Dev fails at Security Scan step with vulnerabilities found.
Cause: Container image has CRITICAL or HIGH severity vulnerabilities.
Solution:
- Review the scan output to identify vulnerable packages
- Update the base image or add security patches:

```dockerfile
FROM nginx:1.27-alpine
# Upgrade OS packages to pick up the latest security patches
RUN apk upgrade --no-cache
```

- Commit and push to trigger a new build
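To reproduce the scan locally before pushing, a minimal sketch (assuming the pipeline uses Trivy; the image name and tag are illustrative):

```bash
# Build the image locally and fail on HIGH/CRITICAL findings, mirroring the pipeline gate
docker build -t eagle-public:scan-test .
trivy image --severity HIGH,CRITICAL --exit-code 1 eagle-public:scan-test
```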
Symptom:
```
Error from server (NotFound): deployments.apps "eagle-admin" not found
```
Cause: Environment has legacy DeploymentConfig but workflow expects Deployment.
Solution:
- Install Helm chart to create Deployment (see Helm Charts)
- Adopt existing Service/Route with Helm labels
- Delete legacy DeploymentConfig after verifying new deployment works
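A minimal sketch of the verify-then-delete steps (resource names and namespace are illustrative):

```bash
# Confirm the Helm-managed Deployment is rolled out and serving traffic
oc rollout status deployment/eagle-admin -n 6cdc9e-dev
# Only then remove the legacy DeploymentConfig
oc delete deploymentconfig eagle-admin -n 6cdc9e-dev
```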
Symptom:
Service "eagle-api-mongodb" exists and cannot be imported into the current release:
invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "eagle-api"
Cause: Existing Kubernetes Service lacks Helm ownership labels.
Solution:
```bash
# Add Helm ownership metadata to the existing service
oc patch svc eagle-api-mongodb -n 6cdc9e-dev --type=merge -p '{
  "metadata": {
    "labels": {
      "app.kubernetes.io/managed-by": "Helm"
    },
    "annotations": {
      "meta.helm.sh/release-name": "eagle-api",
      "meta.helm.sh/release-namespace": "6cdc9e-dev"
    }
  }
}'

# Retry the Helm deployment
helm upgrade --install eagle-api ./helm/eagle-api \
  -n 6cdc9e-dev -f ./helm/eagle-api/values-dev.yaml
```

Symptom: Multiple defunct/zombie mongo processes accumulating in the MongoDB pod.
```bash
oc exec <mongodb-pod> -- ps aux | grep defunct
# Shows: 13+ defunct mongo processes
```

Cause: exec-based health probes spawn mongo shell processes that become zombies:
```yaml
readinessProbe:
  exec:
    command: ["mongo", "--eval", "quit()"]  # ❌ WRONG
```

Solution: Use tcpSocket probes instead:
```yaml
readinessProbe:
  tcpSocket:
    port: 27017  # ✅ CORRECT
  initialDelaySeconds: 5
```

Verify fix:

```bash
# Should return no results
oc exec <mongodb-pod> -- ps aux | grep defunct
```

Applied to: All environments (dev/test/prod) as of Feb 2026.
Note: Migrations are disabled by default for eagle-api since all existing migrations (2019-2020) have been applied. Only enable when adding new migrations.
Symptom: Helm upgrade hangs or fails with migration job error (when migrations.enabled=true).
Cause: The pre-upgrade migration job failed or timed out.
Diagnosis:
```bash
# Check if the migration job ran
oc get jobs -n 6cdc9e-dev -l app.kubernetes.io/component=migration
# View migration logs
oc logs -n 6cdc9e-dev -l app.kubernetes.io/component=migration
# Describe the job for events
oc describe job -n 6cdc9e-dev -l app.kubernetes.io/component=migration
```

Common causes:
- MongoDB connection failed: Check MONGODB_SERVICE_HOST is correct and the MongoDB pod is running (a quick probe is sketched below)
- Migration script error: Review logs for JavaScript errors in the new migration file
- Duplicate migration: Migration with same timestamp already exists
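For the connection-failure case, a quick TCP probe from the API pod (assuming its image ships with bash; adjust if it does not):

```bash
# Uses bash's /dev/tcp to test the MongoDB port from inside the API pod
oc exec -n 6cdc9e-dev deployment/eagle-api -- bash -c \
  'timeout 3 bash -c "</dev/tcp/$MONGODB_SERVICE_HOST/27017" && echo reachable || echo blocked'
```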
Solution:
```bash
# Delete the failed job to allow a retry
oc delete job -n 6cdc9e-dev -l app.kubernetes.io/component=migration
# Re-run the deployment with migrations enabled
helm upgrade eagle-api ./helm/eagle-api \
  --namespace 6cdc9e-dev \
  --values ./helm/eagle-api/values-dev.yaml \
  --set migrations.enabled=true
```

Disable migrations (return to normal deployments):

```bash
# After the migration succeeds, disable it for future deployments
helm upgrade eagle-api ./helm/eagle-api \
  --namespace 6cdc9e-dev \
  --values ./helm/eagle-api/values-dev.yaml \
  --set migrations.enabled=false
```

Symptom:
```
Error: Unable to continue with install: Service "eagle-admin" exists and
cannot be imported into the current release: invalid ownership metadata
```
Cause: Existing Service/Route not labeled for Helm management.
Solution: Add Helm annotations and labels to existing resources:
```bash
# Service
oc annotate service eagle-admin -n 6cdc9e-dev \
  meta.helm.sh/release-name=eagle-admin \
  meta.helm.sh/release-namespace=6cdc9e-dev --overwrite
oc label service eagle-admin -n 6cdc9e-dev \
  app.kubernetes.io/managed-by=Helm \
  app.kubernetes.io/name=eagle-admin \
  app.kubernetes.io/instance=eagle-admin --overwrite

# Route
oc annotate route eagle-admin -n 6cdc9e-dev \
  meta.helm.sh/release-name=eagle-admin \
  meta.helm.sh/release-namespace=6cdc9e-dev --overwrite
oc label route eagle-admin -n 6cdc9e-dev \
  app.kubernetes.io/managed-by=Helm \
  app.kubernetes.io/name=eagle-admin \
  app.kubernetes.io/instance=eagle-admin --overwrite
```

Symptom: Pods stuck in ImagePullBackOff or ErrImagePull.
Causes & Solutions:

- Image tag doesn't exist:

```bash
# Verify the tag exists
oc get imagestreamtag eagle-public:v1.2.3 -n 6cdc9e-tools
```

- Wrong image repository:

```bash
# Check the deployed image
oc get deployment eagle-public -n 6cdc9e-dev -o jsonpath='{.spec.template.spec.containers[0].image}'
```

- Registry authentication:

```bash
# Check the pull secret exists
oc get secret -n 6cdc9e-dev | grep pull
```
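If none of those pinpoint it, the pod events usually name the exact pull error:

```bash
# Recent pull failures, newest last
oc get events -n 6cdc9e-dev --field-selector reason=Failed --sort-by='.lastTimestamp'
```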
Symptom: Browser console shows POST /api/analytics 404 (Not Found) in production.
Cause: Stale cached env.js from pre-v2.4.1 deployments. Before v2.4.1, env.js had ANALYTICS_API_URL = '/api/analytics' (wrong path). The file was served with 1-year cache + immutable flag, so browsers kept using the old value even after it was corrected to /analytics.
Fix applied (April 2026):

- rproxy v1.0.5: added a `location = /api/analytics` block that proxies to penguin-analytics (immediate fix for cached clients)
- eagle-public/admin Dockerfiles: added a `location = /env.js` block with `no-cache` headers (prevents future caching issues; see the header check below)
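To confirm the new headers are live, a quick check (assuming env.js is served at the site root):

```bash
# Expect a no-cache Cache-Control header, not the old 1-year immutable one
curl -sI https://projects.eao.gov.bc.ca/env.js | grep -i cache-control
```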
Verification:
```bash
# Both paths should return 201
curl -sI -X POST -H "Content-Type: application/json" \
  -d '{"timestamp":"2026-01-01T00:00:00Z","eventType":"test","sessionId":"test","sourceApp":"test"}' \
  https://projects.eao.gov.bc.ca/analytics
curl -sI -X POST -H "Content-Type: application/json" \
  -d '{"timestamp":"2026-01-01T00:00:00Z","eventType":"test","sessionId":"test","sourceApp":"test"}' \
  https://projects.eao.gov.bc.ca/api/analytics
```

Symptom: Backup CronJob pods fail with `dial tcp <mongodb-ip>:27017: i/o timeout`.
Cause: The deny-by-default network policy blocks all pod-to-pod traffic unless explicitly allowed. The network policy eagle-epic-allow-services-to-database only allows pods with role: backup to connect to MongoDB on port 27017. If the backup pod is missing this label, the connection times out.
Solution: Ensure the CronJob pod template includes the role: backup label:
```yaml
# In mongodb-backup-cronjob.yaml
spec:
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            role: backup  # Required by NetworkPolicy
```

Verification:
```bash
# Check pod labels include role=backup
oc get cronjob eagle-api-mongodb-backup -n 6cdc9e-prod \
  -o jsonpath='{.spec.jobTemplate.spec.template.metadata.labels}' | jq .
# Check the network policy allows role=backup
oc get networkpolicy eagle-epic-allow-services-to-database -n 6cdc9e-prod \
  -o jsonpath='{.spec.ingress[0].from}' | jq .
```

Symptom: Pods repeatedly crashing and restarting.
Diagnosis:
```bash
# Check pod logs
oc logs -n 6cdc9e-dev deployment/eagle-public --previous
# Check pod events
oc describe pod -n 6cdc9e-dev -l app.kubernetes.io/name=eagle-public
```

Common causes:
- Missing environment variables: Check ConfigMap/Secret references
- Health check failures: Adjust probe timing or paths
- Resource limits: Increase memory if OOMKilled (a quick check is sketched below)
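For the resource-limit case, a sketch that prints the last termination reason (label selector illustrative):

```bash
# Prints OOMKilled if a container was killed for exceeding its memory limit
oc get pod -n 6cdc9e-dev -l app.kubernetes.io/name=eagle-public \
  -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
```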
Symptom:
- Root path (/) works correctly (200 OK)
- API endpoints work (/api/config returns 200)
- Admin path (/admin/) returns 504 Gateway Timeout
Cause: NetworkPolicy blocking traffic between rproxy and eagle-admin due to incorrect pod labels.
Diagnosis:
```bash
# Check pod labels - must have role=rproxy-eagle-epic
oc get pods -l app.kubernetes.io/name=rproxy -n 6cdc9e-test --show-labels | grep "role="
# Test connectivity from rproxy to eagle-admin
oc exec deployment/rproxy -n 6cdc9e-test -- wget -q -O- --timeout=5 http://eagle-admin:8080/admin/ | head -5
```

Solution: Ensure Helm values include the correct pod labels:
```yaml
podLabels:
  role: "rproxy-eagle-epic"  # Required by NetworkPolicy
  app: "eagle-epic"
```

Then redeploy: `helm upgrade rproxy ./helm/rproxy -f values-test.yaml -n 6cdc9e-test`
Symptom: Accessing /admin (without trailing slash) redirects to localhost/admin/ or shows :8080 in URL.
Cause: nginx's default trailing-slash redirect exposes the internal port because no explicit redirect block is present.
Solution: Add an explicit redirect in server.conf.tmpl:

```nginx
location = /admin {
    return 301 $scheme://$host/admin/;
}
```

Important: Use `$host`, not `$http_host`. The envsubst command doubles the `http_` prefix, so `$http_host` becomes `$http_http_host`, which is invalid.
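A quick external check that the redirect points at the public host (hostname illustrative):

```bash
# Expect Location: https://<public-host>/admin/ - no :8080 and no localhost
curl -sI https://projects.eao.gov.bc.ca/admin | grep -i '^location'
```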
Applied to: All environments as of Mar 2026 (rproxy v1.0.3).
Symptom:
```
Error: Version must be in format v1.2.3
```
Cause: The version was entered without the `v` prefix or in the wrong format.
Solution: Use semantic versioning with the `v` prefix: `v1.2.3`, `v2.0.0`, etc.
Symptom: Release job skipped, shows "Tag already exists".
Explanation: This is expected behavior for rollbacks. The workflow detects an existing version and enters rollback mode instead of creating a new release.
Symptom: Git tag created but GitHub Release failed.
Cause: Usually permissions or API rate limiting.
Solution: Manually create release:
```bash
gh release create v1.2.3 --title "Release v1.2.3" --generate-notes
```

Symptom:
```
Warning: The `set-output` command is deprecated
```
Cause: Using outdated GitHub Action versions.
Solution: Update action versions in workflow YAML, use gh CLI instead of deprecated actions.
Symptom: eagle-admin document upload returns "File failed virus check" error even for clean files.
Diagnosis:
```bash
# Check eagle-api logs for the actual error
oc logs deployment/eagle-api --tail=100 | grep -i "clam\|virus\|scan"
```

Causes & Solutions:
- Empty/malformed PDF: ClamAV may flag empty or corrupted files as suspicious. Try uploading a valid file with actual content.
- ClamAV connectivity: Verify the environment variables (a direct clamd probe is sketched after this list):

```bash
oc get configmap eagle-api -o yaml | grep -E "CLAMAV|VIRUS"
# Should show: CLAMAV_SERVICE_HOST=clamav, CLAMAV_SERVICE_PORT=3310
```

- Library incompatibility ("Malformed Response[UNKNOWN COMMAND]"): the old `clamav.js` library uses the deprecated STREAM protocol, which ClamAV 1.0+ removed. Fix: eagle-api v2.1.5+ uses the `clamscan` library, which supports the modern INSTREAM protocol.
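To probe clamd directly, a minimal sketch (assuming `nc` is available in the API image; the host/port should match the ConfigMap values above):

```bash
# clamd answers a newline-terminated PING command with PONG
oc exec deployment/eagle-api -- sh -c 'printf "nPING\n" | nc clamav 3310'
```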
Symptom: Every staff user — regardless of their IDIR roles — sees the "You need to provide your email address for access" page immediately after Keycloak login. URL shows /not-authorized.
Root cause: BC Government CSS removed roles from the Keycloak realm's default client scopes. JWTs are issued without realm_access.roles, so isAuthorized() in keycloak.service.ts always returns false.
Fix (eagle-admin v2.4.3+): The scope: 'openid roles' option is now explicitly passed to keycloakAuth.init() so roles are always included regardless of CSS default scope configuration. See Keycloak Configuration for details.
If it recurs: Check that `scope: 'openid roles'` is present in the `keycloak.service.ts` init() call.
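To inspect the roles actually present in a token, a decoding sketch (paste a real access token; assumes base64 and jq are available locally):

```bash
TOKEN="<paste access token here>"
# JWT payloads are base64url; translate the alphabet and restore padding before decoding
P=$(printf '%s' "$TOKEN" | cut -d. -f2 | tr '_-' '/+')
PAD=$(printf '%*s' $(( (4 - ${#P} % 4) % 4 )) '' | tr ' ' '=')
printf '%s%s' "$P" "$PAD" | base64 -d | jq .realm_access
```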
Symptom: eagle-admin shows "eao-epic - Client not found" error on login.
Cause: eagle-admin is using eagle-api-console instead of eagle-admin-console. ConfigService may not be preserving the client ID from env.js.
Diagnosis:
```bash
# Check the env.js ConfigMap has the correct client ID
oc get configmap eagle-admin-env-js -n 6cdc9e-prod -o jsonpath='{.data.env\.js}'
# Expected: KEYCLOAK_CLIENT_ID = 'eagle-admin-console'
```

Fix: Verify the eagle-admin-env-js ConfigMap is minimal (only configEndpoint and KEYCLOAK_CLIENT_ID). If it is missing, recreate it and restart the pod. See Keycloak Configuration for the correct ConfigMap format.
Symptom: Browser loads but shows blank page.
Diagnosis:
- Check the browser console for errors (F12)
- Check if `/api/public/config` returns data
- Verify nginx is serving files correctly
Common causes:
- Config endpoint failing: API not running or misconfigured
- Base href mismatch: Angular base href doesn't match the route path (see the check below)
- JavaScript errors: Check browser console
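For the base-href case, a quick check of what the built index.html declares (URL illustrative):

```bash
# The href should match the path the app is served under
curl -s https://eagle-dev.apps.silver.devops.gov.bc.ca/ | grep -io '<base href="[^"]*"'
```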
Symptom:
```
GET /api/public/config net::ERR_CONNECTION_REFUSED
```
Causes:
- eagle-api not running: Check API pods
- Service misconfigured: Verify service endpoints
- rproxy not routing: Check nginx configuration
Diagnosis:
```bash
# Check API pods
oc get pods -n 6cdc9e-dev | grep eagle-api
# Test the API directly
oc exec -n 6cdc9e-dev deployment/eagle-public -- curl -s http://eagle-api:3000/api/
```

Symptom:
```
GET /api/config 503 (Service Unavailable)
```
All routes return "Application is not available" page. API pods are running but rproxy shows default nginx config.
Cause: The rproxy pod has lost its templated nginx configuration. The eao-nginx image uses environment variable substitution (${NGINX__EPIC__PROXY__API}) that should create location blocks for /api, /admin, /analytics, etc.
Diagnosis:
```bash
# Check if nginx has the EPIC locations configured
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- grep -c "location /api" /etc/nginx/conf.d/*.conf
# If it returns 0, the config is missing. Check for template markers:
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- cat /etc/nginx/conf.d/default.conf | head -20
# Verify environment variables are set
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- env | grep NGINX__EPIC
# Test the API directly (bypassing nginx)
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- curl -s http://eagle-api:3000/api/config | head -c 200
```

Solution: Rebuild and redeploy the rproxy image:
```bash
# Trigger a rebuild of the rproxy image
oc start-build rproxy -n 6cdc9e-tools --follow
# Wait for the build to complete, then roll out a new deployment
oc rollout latest dc/rproxy -n 6cdc9e-test
oc rollout status dc/rproxy -n 6cdc9e-test
# Verify the config is now present
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- grep "location /api" /etc/nginx/conf.d/*.conf
```

Alternative - Manual ConfigMap (temporary fix): If a rebuild isn't possible, create a ConfigMap with the nginx config and mount it. See API Architecture for the nginx template.
Symptom:
```
GET /api/config 503 (Service Unavailable)
```
Returns "Application is not available" page, but rproxy config is correct and works via localhost/cluster access.
Cause: The OpenShift route targets the wrong port. The eagle-api route should target 8080-tcp (rproxy's port), not 3000-tcp (eagle-api's direct port).
Diagnosis:
```bash
# Check the route target port
oc get route eagle-api -n 6cdc9e-test -o jsonpath='{.spec.port.targetPort}'
# Should show: 8080-tcp
# Check rproxy service ports
oc get svc rproxy -n 6cdc9e-test -o jsonpath='{.spec.ports[*].name}'
# Shows: 8080-tcp
# Confirm rproxy itself responds locally
oc exec -n 6cdc9e-test $(oc get pod -l deploymentconfig=rproxy -o name | head -1) \
  -- curl -s http://localhost:8080/api/config | head -100
```

Solution: Patch the route to target the correct port:
```bash
# Fix the route to use rproxy's port
oc patch route eagle-api -n 6cdc9e-test \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/port/targetPort", "value": "8080-tcp"}]'
# Verify the fix
curl -s https://eagle-test.apps.silver.devops.gov.bc.ca/api/config | head -100
```

Symptom:
```
GET /api/config 502 (Bad Gateway)
```
Frontend loads but API requests fail with 502 errors.
Cause: rproxy has cached an old Service IP for eagle-api. This happens when:
- eagle-api Service was deleted/recreated (e.g., after Helm uninstall)
- New Helm deployment created Service with different ClusterIP
Diagnosis:
```bash
# Check the current eagle-api Service IP
oc get svc eagle-api -n 6cdc9e-test -o jsonpath='{.spec.clusterIP}'
# Check which IP rproxy is using
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) -- env | grep EAGLE_API_SERVICE_HOST
```

If the IPs don't match, rproxy needs to be restarted.
Solution:
```bash
# Restart rproxy to pick up the new Service IP
oc rollout latest dc/rproxy -n 6cdc9e-test
oc rollout status dc/rproxy -n 6cdc9e-test
# Verify the API is accessible again
curl -k https://test.projects.eao.gov.bc.ca/api/docs
```

Prevention: After deleting/recreating eagle-api Helm releases, always restart rproxy.
Symptom: penguin-analytics-api pods fail with `Init:CrashLoopBackOff`; the `download-geoip` init container logs show `tar: invalid magic`.
Cause: MaxMind credentials missing in OpenShift Secret. Credentials are stored in GitHub Secrets and must be synced via workflow.
Diagnosis:
```bash
# Check if credentials are empty (0 = missing)
oc get secret penguin-analytics-geoip-secrets -n 6cdc9e-test \
  -o jsonpath='{.data.MAXMIND_ACCOUNT_ID}' | base64 -d | wc -c
```

Solution: Run the GeoIP update workflow:

```bash
gh workflow run "Update GeoIP Databases" \
  --repo digitalspace/penguin-analytics -f environment=test
```
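After the workflow finishes, a verification sketch (the label selector is an assumption and may differ; the init-container name comes from the symptom above):

```bash
# The secret should now decode to a non-zero length
oc get secret penguin-analytics-geoip-secrets -n 6cdc9e-test \
  -o jsonpath='{.data.MAXMIND_ACCOUNT_ID}' | base64 -d | wc -c
# The init container should download and unpack the databases cleanly
oc logs -n 6cdc9e-test -l app.kubernetes.io/name=penguin-analytics-api \
  -c download-geoip --tail=20
```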
Symptom: API pods show `password authentication failed for user "analytics_user"`.
Cause: A Helm upgrade changed the Secret password, but TimescaleDB still has the old password.
Solution:
```bash
# Get the actual password from the database
DB_PASS=$(oc exec -n 6cdc9e-test deploy/penguin-analytics-database \
  -- sh -c 'echo $POSTGRES_PASSWORD')
# Apply it to the secret
oc patch secret penguin-analytics-secrets -n 6cdc9e-test \
  --type='json' \
  -p="[{\"op\": \"replace\", \"path\": \"/data/POSTGRES_PASSWORD\", \"value\":\"$(echo -n $DB_PASS | base64)\"}]"
# Restart the API
oc rollout restart deployment/penguin-analytics-api -n 6cdc9e-test
```

Symptom: App starts but configuration is wrong/missing.
Check:
- Verify `configEndpoint = true` in the built image:

```bash
oc exec -n 6cdc9e-dev deployment/eagle-public -- cat /tmp/app/dist/eagle-public/browser/env.js | head -5
```

- Test the config endpoint:

```bash
curl -s https://eagle-dev.apps.silver.devops.gov.bc.ca/api/public/config | jq
```
Symptom: Loading spinner shows indefinitely despite data loading successfully. Console logs show loading state is false but UI doesn't update.
Cause: The signal isn't called with `()` in the template. Signals are functions - checking `@if (loading)` tests whether the function reference exists (always truthy), not its value.
Wrong:

```ts
@if (loading) {
  <spinner />
}
```

Correct:

```ts
@if (loading()) {
  <spinner />
}
```

Diagnosis:
```js
// In the browser console, get the component
const el = document.querySelector('app-my-component');
const component = window.ng?.getComponent(el);
// Check the signal value - should match what the template shows
console.log(component.loading()); // false if data loaded
// Check the DOM state
document.querySelector('.loading-spinner'); // Should be null if loading() is false
```

If `loading()` returns false but the spinner is visible, the template isn't calling the signal.
Prevention:
- Always use `()` when reading signals in templates: `{{ count() }}`, `@if (isVisible())`
- This applies to `signal()`, `computed()`, and `toSignal()`
- See Configuration Management
Symptom: Activity or news update entries on eagle-public show garbled formatting — no paragraph breaks, all text on one line, or massive invisible whitespace. Content was pasted from Microsoft Word (Desktop or Online) into the TinyMCE editor in eagle-admin.
Root cause: Word injects proprietary HTML artifacts that break plain rendering:
- Word Desktop: `class="MsoNormal"`, `mso-*` CSS, inline `font-family`/`font-size`/`color` styles
- Word Online: deeply nested `<div class="OutlineElement SCXW*">` wrappers with large inline `style=` attributes
Three-layer fix (applied Apr 2026):
- Layer 1 — TinyMCE paste cleanup (eagle-admin v2.4.4): a `paste_preprocess` hook in `add-edit-activity.component.ts` strips Word attributes on paste, so new content is clean before saving.
- Layer 2 — Render-time sanitizer (eagle-public v2.4.1): `sanitizeWordHtml()` in `src/app/shared/utils/word-html-sanitizer.ts` strips any remaining Word artifacts from `getSafeHtml()` before `[innerHTML]` binding. Acts as defence-in-depth for any content that slipped through.
- Layer 3 — Data migration: the one-off migration `eagle-api/one_off_migrations/sanitizeActivityWordHtml.js` cleaned all existing `RecentActivity` records (506 updated out of 2,467, 0 errors, Apr 2026).
If new formatting issues appear: Add the offending attribute/class pattern to the regex chain in word-html-sanitizer.ts and add a corresponding test in word-html-sanitizer.spec.ts. Re-run the data migration if needed (it is idempotent).
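A hedged spot-check for lingering artifacts, assuming the mongo shell is available in the MongoDB pod; the database/collection names and the `content` field below are illustrative and should be checked against the actual schema:

```bash
# Count RecentActivity documents that still contain Word markup (names illustrative)
oc exec deployment/eagle-api-mongodb -n 6cdc9e-prod -- mongo epic --quiet \
  --eval 'db.epic.count({_schemaName: "RecentActivity", content: /MsoNormal|OutlineElement/})'
```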
Handy commands for day-to-day debugging:

```bash
# Get all resources for an app
oc get all -n 6cdc9e-dev -l app.kubernetes.io/name=eagle-public
# Watch pod status
oc get pods -n 6cdc9e-dev -w
# Stream logs
oc logs -f -n 6cdc9e-dev deployment/eagle-public
# Execute a command in a pod
oc exec -n 6cdc9e-dev deployment/eagle-public -- ls -la /tmp/app/
# Port forward for local testing
oc port-forward -n 6cdc9e-dev svc/eagle-public 8080:8080
# Check resource usage
oc adm top pods -n 6cdc9e-dev
# Get events (sorted by time)
oc get events -n 6cdc9e-dev --sort-by='.lastTimestamp'
```

```bash
# List all tags for an image
oc get imagestreamtag -n 6cdc9e-tools | grep eagle-public
# Get details of a specific tag
oc describe imagestreamtag eagle-public:v1.2.3 -n 6cdc9e-tools
```

If you can't resolve an issue:
- Check logs thoroughly - Pod logs, workflow logs, events
- Search existing issues - Someone may have encountered it before
- Ask in Rocket.Chat - `#falcon-general` channel
- Open a GitHub issue - Include logs, steps to reproduce, and expected vs actual behavior
- Deployment Pipeline - How deployments work
- Helm Charts - Helm configuration
- Rollback Procedures - How to rollback
- Architecture Overview - System components