Troubleshooting - bcgov/eagle-dev-guides GitHub Wiki

Troubleshooting

Common issues and solutions for Eagle deployments.

Deployment Issues

Trivy Scan Blocking Deployment

Symptom: Deploy to Dev fails at Security Scan step with vulnerabilities found.

Cause: Container image has CRITICAL or HIGH severity vulnerabilities.

Solution:

  1. Review the scan output to identify vulnerable packages
  2. Update base image or add security patches:
    FROM nginx:1.27-alpine
    
    # Add this line to get latest security patches
    RUN apk upgrade --no-cache
  3. Commit and push to trigger new build

Deployment Not Found

Symptom:

Error from server (NotFound): deployments.apps "eagle-admin" not found

Cause: Environment has legacy DeploymentConfig but workflow expects Deployment.

Solution:

  1. Install Helm chart to create Deployment (see Helm Charts)
  2. Adopt existing Service/Route with Helm labels
  3. Delete legacy DeploymentConfig after verifying new deployment works

Helm Service Adoption Failure

Symptom:

Service "eagle-api-mongodb" exists and cannot be imported into the current release:
invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "eagle-api"

Cause: The existing Kubernetes Service lacks the Helm ownership labels and annotations (app.kubernetes.io/managed-by, meta.helm.sh/release-name, meta.helm.sh/release-namespace).

Solution:

# Add Helm ownership metadata to existing service
oc patch svc eagle-api-mongodb -n 6cdc9e-dev --type=merge -p '{
  "metadata": {
    "labels": {
      "app.kubernetes.io/managed-by": "Helm"
    },
    "annotations": {
      "meta.helm.sh/release-name": "eagle-api",
      "meta.helm.sh/release-namespace": "6cdc9e-dev"
    }
  }
}'

# Retry Helm deployment
helm upgrade --install eagle-api ./helm/eagle-api \
  -n 6cdc9e-dev -f ./helm/eagle-api/values-dev.yaml

MongoDB Zombie Processes

Symptom: Multiple defunct/zombie mongo processes accumulating in MongoDB pod.

oc exec <mongodb-pod> -- ps aux | grep defunct
# Shows: 13+ defunct mongo processes

Cause: exec-based health probes spawn mongo shell processes that become zombies:

readinessProbe:
  exec:
    command: ["mongo", "--eval", "quit()"]  # ❌ WRONG

Solution: Use tcpSocket probes instead:

readinessProbe:
  tcpSocket:
    port: 27017  # ✅ CORRECT
  initialDelaySeconds: 5

Verify fix:

# Should return no results
oc exec <mongodb-pod> -- ps aux | grep defunct

Applied to: All environments (dev/test/prod) as of Feb 2026.

Database Migration Job Failures (eagle-api)

Note: Migrations are disabled by default for eagle-api since all existing migrations (2019-2020) have been applied. Only enable when adding new migrations.

Symptom: Helm upgrade hangs or fails with migration job error (when migrations.enabled=true).

Cause: The pre-upgrade migration job failed or timed out.

Diagnosis:

# Check if migration job ran
oc get jobs -n 6cdc9e-dev -l app.kubernetes.io/component=migration

# View migration logs
oc logs -n 6cdc9e-dev -l app.kubernetes.io/component=migration

# Describe job for events
oc describe job -n 6cdc9e-dev -l app.kubernetes.io/component=migration

Common causes:

  • MongoDB connection failed: Check MONGODB_SERVICE_HOST is correct, MongoDB pod is running
  • Migration script error: Review logs for JavaScript errors in the new migration file
  • Duplicate migration: Migration with same timestamp already exists

Solution:

# Delete failed job to allow retry
oc delete job -n 6cdc9e-dev -l app.kubernetes.io/component=migration

# Re-run deployment with migrations enabled
helm upgrade eagle-api ./helm/eagle-api \
  --namespace 6cdc9e-dev \
  --values ./helm/eagle-api/values-dev.yaml \
  --set migrations.enabled=true

Disable migrations (go back to normal deployments):

# After migration succeeds, disable for future deployments
helm upgrade eagle-api ./helm/eagle-api \
  --namespace 6cdc9e-dev \
  --values ./helm/eagle-api/values-dev.yaml \
  --set migrations.enabled=false

Helm Cannot Import Existing Resources

Symptom:

Error: Unable to continue with install: Service "eagle-admin" exists and 
cannot be imported into the current release: invalid ownership metadata

Cause: Existing Service/Route not labeled for Helm management.

Solution: Add Helm annotations and labels to existing resources:

# Service
oc annotate service eagle-admin -n 6cdc9e-dev \
  meta.helm.sh/release-name=eagle-admin \
  meta.helm.sh/release-namespace=6cdc9e-dev --overwrite
oc label service eagle-admin -n 6cdc9e-dev \
  app.kubernetes.io/managed-by=Helm \
  app.kubernetes.io/name=eagle-admin \
  app.kubernetes.io/instance=eagle-admin --overwrite

# Route
oc annotate route eagle-admin -n 6cdc9e-dev \
  meta.helm.sh/release-name=eagle-admin \
  meta.helm.sh/release-namespace=6cdc9e-dev --overwrite
oc label route eagle-admin -n 6cdc9e-dev \
  app.kubernetes.io/managed-by=Helm \
  app.kubernetes.io/name=eagle-admin \
  app.kubernetes.io/instance=eagle-admin --overwrite

Image Pull Error

Symptom: Pods stuck in ImagePullBackOff or ErrImagePull.

Causes & Solutions:

  1. Image tag doesn't exist:

    # Verify tag exists
    oc get imagestreamtag eagle-public:v1.2.3 -n 6cdc9e-tools
  2. Wrong image repository:

    # Check deployed image
    oc get deployment eagle-public -n 6cdc9e-dev -o jsonpath='{.spec.template.spec.containers[0].image}'
  3. Registry authentication:

    # Check pull secret exists
    oc get secret -n 6cdc9e-dev | grep pull

Analytics POST Returns 404

Symptom: Browser console shows POST /api/analytics 404 (Not Found) in production.

Cause: Stale cached env.js from pre-v2.4.1 deployments. Before v2.4.1, env.js had ANALYTICS_API_URL = '/api/analytics' (wrong path). The file was served with 1-year cache + immutable flag, so browsers kept using the old value even after it was corrected to /analytics.

Fix applied (April 2026):

  1. rproxy v1.0.5: Added location = /api/analytics that proxies to penguin-analytics (immediate fix for cached clients)
  2. eagle-public/admin Dockerfiles: Added location = /env.js with no-cache headers (prevents future caching issues)
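The no-cache block described in step 2 takes roughly this form (a sketch; the exact directives in the real Dockerfiles may differ):

```nginx
# Illustrative location block for env.js; verify against the actual Dockerfile
location = /env.js {
    add_header Cache-Control "no-store, no-cache, must-revalidate" always;
    expires off;
}
```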

Verification:

# Both paths should return 201
curl -sI -X POST -H "Content-Type: application/json" \
  -d '{"timestamp":"2026-01-01T00:00:00Z","eventType":"test","sessionId":"test","sourceApp":"test"}' \
  https://projects.eao.gov.bc.ca/analytics

curl -sI -X POST -H "Content-Type: application/json" \
  -d '{"timestamp":"2026-01-01T00:00:00Z","eventType":"test","sessionId":"test","sourceApp":"test"}' \
  https://projects.eao.gov.bc.ca/api/analytics

MongoDB Backup Job Network Policy

Symptom: Backup CronJob pods fail with dial tcp <mongodb-ip>:27017: i/o timeout.

Cause: The deny-by-default network policy blocks all pod-to-pod traffic unless explicitly allowed. The network policy eagle-epic-allow-services-to-database only allows pods with role: backup to connect to MongoDB on port 27017. If the backup pod is missing this label, the connection times out.
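The ingress rule on the policy side is assumed to look something like this (selectors are illustrative; inspect the live NetworkPolicy to confirm):

```yaml
# Hypothetical shape of eagle-epic-allow-services-to-database (verify against the cluster)
spec:
  podSelector:
    matchLabels:
      app: eagle-api-mongodb        # assumption: selector targets the MongoDB pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: backup          # only pods carrying this label may connect
      ports:
        - protocol: TCP
          port: 27017
```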

Solution: Ensure the CronJob pod template includes the role: backup label:

# In mongodb-backup-cronjob.yaml
spec:
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            role: backup  # Required by NetworkPolicy

Verification:

# Check pod labels include role=backup
oc get cronjob eagle-api-mongodb-backup -n 6cdc9e-prod \
  -o jsonpath='{.spec.jobTemplate.spec.template.metadata.labels}' | jq .

# Check network policy allows role=backup
oc get networkpolicy eagle-epic-allow-services-to-database -n 6cdc9e-prod \
  -o jsonpath='{.spec.ingress[0].from}' | jq .
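As a sanity check, the label verification boils down to a string match on the jsonpath output (the JSON below is a hypothetical sample, not real cluster output):

```shell
# Simulated output of the jsonpath query above (values are illustrative)
LABELS='{"app.kubernetes.io/name":"eagle-api-mongodb-backup","role":"backup"}'
echo "$LABELS" | grep -q '"role":"backup"' && echo "role label present"
```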

Pods CrashLoopBackOff

Symptom: Pods repeatedly crashing and restarting.

Diagnosis:

# Check pod logs
oc logs -n 6cdc9e-dev deployment/eagle-public --previous

# Check pod events
oc describe pod -n 6cdc9e-dev -l app.kubernetes.io/name=eagle-public

Common causes:

  • Missing environment variables: Check ConfigMap/Secret references
  • Health check failures: Adjust probe timing or paths
  • Resource limits: Increase memory if OOMKilled

Reverse Proxy (eao-nginx) Issues

504 Gateway Timeout on /admin/ Route

Symptom:

  • Root path (/) works correctly (200 OK)
  • API endpoints work (/api/config returns 200)
  • Admin path (/admin/) returns 504 Gateway Timeout

Cause: NetworkPolicy blocking traffic between rproxy and eagle-admin due to incorrect pod labels.

Diagnosis:

# Check pod labels - must have role=rproxy-eagle-epic
oc get pods -l app.kubernetes.io/name=rproxy -n 6cdc9e-test --show-labels | grep "role="

# Test connectivity from rproxy to eagle-admin
oc exec deployment/rproxy -n 6cdc9e-test -- wget -q -O- --timeout=5 http://eagle-admin:8080/admin/ | head -5

Solution: Ensure Helm values include correct pod labels:

podLabels:
  role: "rproxy-eagle-epic"  # Required by NetworkPolicy
  app: "eagle-epic"

Then redeploy: helm upgrade rproxy ./helm/rproxy -f values-test.yaml -n 6cdc9e-test

/admin Redirects to Invalid URL or Port 8080

Symptom: Accessing /admin (without trailing slash) redirects to localhost/admin/ or shows :8080 in URL.

Cause: nginx's default trailing-slash redirect exposes the internal port because the config lacks an explicit redirect block.

Solution: Add explicit redirect in server.conf.tmpl:

location = /admin {
    return 301 $scheme://$host/admin/;
}

Important: Use $host, not $http_host. The envsubst command doubles the http_ prefix, turning $http_host into the invalid $http_http_host.

Applied to: All environments as of Mar 2026 (rproxy v1.0.3).

Workflow Issues

Version Format Error

Symptom:

Error: Version must be in format v1.2.3

Cause: The version was entered without the v prefix or in the wrong format.

Solution: Use semantic versioning with v prefix: v1.2.3, v2.0.0, etc.
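The workflow's format check is equivalent to a regex like the following (a sketch of the validation, not the actual workflow code):

```shell
# Validate a version string against the expected v<major>.<minor>.<patch> form
is_valid_version() {
  echo "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

is_valid_version "v1.2.3" && echo "v1.2.3: ok"
is_valid_version "1.2.3"  || echo "1.2.3: missing v prefix"
```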

Git Tag Already Exists (Not an Error)

Symptom: Release job skipped, shows "Tag already exists".

Explanation: This is expected behavior for rollbacks. The workflow detects an existing version and enters rollback mode instead of creating a new release.
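The detection amounts to checking whether the tag ref already exists. A simplified sketch (run here in a throwaway repo; this is not the actual workflow step):

```shell
# Simplified sketch: detect an existing tag and switch to rollback mode
repo=$(mktemp -d) && cd "$repo"
git init -q .
git -c user.name=ci -c user.email=ci@example.com commit -q --allow-empty -m init
git tag v1.2.3

if git rev-parse -q --verify "refs/tags/v1.2.3" >/dev/null; then
  echo "tag exists: rollback mode"
else
  echo "new tag: release mode"
fi
```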

GitHub Release Creation Failed

Symptom: Git tag created but GitHub Release failed.

Cause: Usually permissions or API rate limiting.

Solution: Manually create release:

gh release create v1.2.3 --title "Release v1.2.3" --generate-notes

Deprecated Action Warnings

Symptom:

Warning: The `set-output` command is deprecated

Cause: Using outdated GitHub Action versions.

Solution: Update the action versions in the workflow YAML, or use the gh CLI instead of the deprecated actions.

Application Issues

Document Upload Fails: "File failed virus check"

Symptom: eagle-admin document upload returns "File failed virus check" error even for clean files.

Diagnosis:

# Check eagle-api logs for actual error
oc logs deployment/eagle-api --tail=100 | grep -i "clam\|virus\|scan"

Causes & Solutions:

  1. Empty/malformed PDF: ClamAV may flag empty or corrupted files as suspicious. Try uploading a valid file with actual content.

  2. ClamAV connectivity: Verify environment variables:

    oc get configmap eagle-api -o yaml | grep -E "CLAMAV|VIRUS"
    # Should show: CLAMAV_SERVICE_HOST=clamav, CLAMAV_SERVICE_PORT=3310
  3. Library incompatibility ("Malformed Response[UNKNOWN COMMAND]"):

    • Old clamav.js library uses deprecated STREAM protocol
    • ClamAV 1.0+ removed STREAM support
    • Fix: eagle-api v2.1.5+ uses clamscan library which supports modern INSTREAM protocol

All Admin Users Redirected to "Email for access" (CSS Removed roles Scope)

Symptom: Every staff user — regardless of their IDIR roles — sees the "You need to provide your email address for access" page immediately after Keycloak login. URL shows /not-authorized.

Root cause: BC Government CSS removed roles from the Keycloak realm's default client scopes. JWTs are issued without realm_access.roles, so isAuthorized() in keycloak.service.ts always returns false.

Fix (eagle-admin v2.4.3+): The scope: 'openid roles' option is now explicitly passed to keycloakAuth.init() so roles are always included regardless of CSS default scope configuration. See Keycloak Configuration for details.

If recurs: Check that scope: 'openid roles' is present in keycloak.service.ts init() call.

Keycloak "Client not found" (eagle-admin)

Symptom: eagle-admin shows "eao-epic - Client not found" error on login.

Cause: eagle-admin is using eagle-api-console instead of eagle-admin-console. ConfigService may not be preserving the client ID from env.js.

Diagnosis:

# Check env.js ConfigMap has correct client ID
oc get configmap eagle-admin-env-js -n 6cdc9e-prod -o jsonpath='{.data.env\.js}'
# Expected: KEYCLOAK_CLIENT_ID = 'eagle-admin-console'

Fix: Verify the eagle-admin-env-js ConfigMap is minimal (only configEndpoint and KEYCLOAK_CLIENT_ID). If missing, recreate it and restart the pod. See Keycloak Configuration for the correct ConfigMap format.

App Shows Blank Page

Symptom: Browser loads but shows blank page.

Diagnosis:

  1. Check browser console for errors (F12)
  2. Check if /api/public/config returns data
  3. Verify nginx is serving files correctly

Common causes:

  • Config endpoint failing: API not running or misconfigured
  • Base href mismatch: Angular base href doesn't match route path
  • JavaScript errors: Check browser console

API Connection Refused

Symptom:

GET /api/public/config net::ERR_CONNECTION_REFUSED

Causes:

  1. eagle-api not running: Check API pods
  2. Service misconfigured: Verify service endpoints
  3. rproxy not routing: Check nginx configuration

Diagnosis:

# Check API pods
oc get pods -n 6cdc9e-dev | grep eagle-api

# Test API directly
oc exec -n 6cdc9e-dev deployment/eagle-public -- curl -s http://eagle-api:3000/api/

API Returns 503 Service Unavailable (rproxy Config Missing)

Symptom:

GET /api/config 503 (Service Unavailable)

All routes return the "Application is not available" page. API pods are running, but rproxy is serving the default nginx config.

Cause: The rproxy pod has lost its templated nginx configuration. The eao-nginx image uses environment variable substitution (${NGINX__EPIC__PROXY__API}) that should create location blocks for /api, /admin, /analytics, etc.

Diagnosis:

# Check if nginx has EPIC locations configured
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- grep -c "location /api" /etc/nginx/conf.d/*.conf

# If returns 0, config is missing. Check for template markers:
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- cat /etc/nginx/conf.d/default.conf | head -20

# Verify environment variables are set
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- env | grep NGINX__EPIC

# Test API directly (bypassing nginx)
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- curl -s http://eagle-api:3000/api/config | head -c 200

Solution: Rebuild and redeploy the rproxy image:

# Trigger rebuild of rproxy image
oc start-build rproxy -n 6cdc9e-tools --follow

# Wait for build to complete, then rollout new deployment
oc rollout latest dc/rproxy -n 6cdc9e-test
oc rollout status dc/rproxy -n 6cdc9e-test

# Verify config is now present
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) \
  -- grep "location /api" /etc/nginx/conf.d/*.conf

Alternative - Manual ConfigMap (temporary fix): If rebuild isn't possible, create a ConfigMap with the nginx config and mount it. See API Architecture for nginx template.


API Returns 503 Service Unavailable (Route Port Mismatch)

Symptom:

GET /api/config 503 (Service Unavailable)

Returns "Application is not available" page, but rproxy config is correct and works via localhost/cluster access.

Cause: The OpenShift route targets the wrong port. The eagle-api route should target 8080-tcp (rproxy's port), not 3000-tcp (eagle-api's direct port).

Diagnosis:

# Check route target port
oc get route eagle-api -n 6cdc9e-test -o jsonpath='{.spec.port.targetPort}'
# Should show: 8080-tcp

# Check rproxy service ports
oc get svc rproxy -n 6cdc9e-test -o jsonpath='{.spec.ports[*].name}'
# Shows: 8080-tcp

# Test rproxy directly works
oc exec -n 6cdc9e-test $(oc get pod -l deploymentconfig=rproxy -o name | head -1) \
  -- curl -s http://localhost:8080/api/config | head -100

Solution: Patch the route to target the correct port:

# Fix route to use rproxy's port
oc patch route eagle-api -n 6cdc9e-test \
  --type='json' \
  -p='[{"op": "replace", "path": "/spec/port/targetPort", "value": "8080-tcp"}]'

# Verify fix
curl -s https://eagle-test.apps.silver.devops.gov.bc.ca/api/config | head -100

API Returns 502 Bad Gateway

Symptom:

GET /api/config 502 (Bad Gateway)

Frontend loads but API requests fail with 502 errors.

Cause: rproxy has cached an old Service IP for eagle-api. This happens when:

  • eagle-api Service was deleted/recreated (e.g., after Helm uninstall)
  • New Helm deployment created Service with different ClusterIP

Diagnosis:

# Check current eagle-api Service IP
oc get svc eagle-api -n 6cdc9e-test -o jsonpath='{.spec.clusterIP}'

# Check what IP rproxy is using
oc exec -n 6cdc9e-test $(oc get pod -n 6cdc9e-test -l deploymentconfig=rproxy -o name | head -1) -- env | grep EAGLE_API_SERVICE_HOST

If the IPs don't match, rproxy needs to be restarted.
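The comparison can be scripted; a minimal sketch with hypothetical IPs (in practice, populate the variables from the two oc queries above):

```shell
# Hypothetical values; in practice these come from the oc queries above
SVC_IP="172.30.118.42"
RPROXY_IP="172.30.97.15"

if [ "$SVC_IP" != "$RPROXY_IP" ]; then
  echo "IP mismatch: restart rproxy"
fi
```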

Solution:

# Restart rproxy to pick up new Service IP
oc rollout latest dc/rproxy -n 6cdc9e-test
oc rollout status dc/rproxy -n 6cdc9e-test

# Verify API is now accessible
curl -k https://test.projects.eao.gov.bc.ca/api/docs

Prevention: After deleting/recreating eagle-api Helm releases, always restart rproxy.

Penguin Analytics Issues

GeoIP InitContainer CrashLoopBackOff

Symptom: penguin-analytics-api pods fail with Init:CrashLoopBackOff, download-geoip logs show tar: invalid magic.

Cause: MaxMind credentials missing in OpenShift Secret. Credentials are stored in GitHub Secrets and must be synced via workflow.

Diagnosis:

# Check if credentials are empty (0 = missing)
oc get secret penguin-analytics-geoip-secrets -n 6cdc9e-test \
  -o jsonpath='{.data.MAXMIND_ACCOUNT_ID}' | base64 -d | wc -c

Solution: Run the GeoIP update workflow:

gh workflow run "Update GeoIP Databases" \
  --repo digitalspace/penguin-analytics -f environment=test

Database Password Mismatch After Helm Upgrade

Symptom: API pods show password authentication failed for user "analytics_user".

Cause: Helm upgrade changed Secret password but TimescaleDB has old password.

Solution:

# Get actual password from database
DB_PASS=$(oc exec -n 6cdc9e-test deploy/penguin-analytics-database \
  -- sh -c 'echo $POSTGRES_PASSWORD')

# Apply to secret
oc patch secret penguin-analytics-secrets -n 6cdc9e-test \
  --type='json' \
  -p="[{\"op\": \"replace\", \"path\": \"/data/POSTGRES_PASSWORD\", \"value\":\"$(echo -n $DB_PASS | base64)\"}]"

# Restart API
oc rollout restart deployment/penguin-analytics-api -n 6cdc9e-test
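Note the echo -n in the patch above: without -n, the trailing newline gets encoded into the Secret value and the passwords will still mismatch. A quick demonstration:

```shell
# The trailing newline changes the encoded value
echo "secret" | base64      # c2VjcmV0Cg==  (newline included)
echo -n "secret" | base64   # c2VjcmV0
```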

Configuration Not Loading

Symptom: App starts but configuration is wrong/missing.

Check:

  1. Verify configEndpoint = true in built image:

    oc exec -n 6cdc9e-dev deployment/eagle-public -- cat /tmp/app/dist/eagle-public/browser/env.js | head -5
  2. Test config endpoint:

    curl -s https://eagle-dev.apps.silver.devops.gov.bc.ca/api/public/config | jq

Angular Frontend Issues

Loading Spinner Stuck (Signal Not Called in Template)

Symptom: Loading spinner shows indefinitely despite data loading successfully. Console logs show loading state is false but UI doesn't update.

Cause: Signal not called with () in the template. Signals are functions: @if (loading) tests the function reference itself (always truthy) instead of its value.

Wrong:

@if (loading) {
  <spinner />
}

Correct:

@if (loading()) {
  <spinner />
}

Diagnosis:

// In browser console, get the component
const el = document.querySelector('app-my-component');
const component = window.ng?.getComponent(el);

// Check signal value - should match what template shows
console.log(component.loading());  // false if data loaded

// Check DOM state
document.querySelector('.loading-spinner');  // Should be null if loading() is false

If loading() returns false but spinner is visible, the template isn't calling the signal.

Prevention:

  • Always use () when reading signals in templates: {{ count() }}, @if (isVisible())
  • This applies to signal(), computed(), and toSignal()
  • See Configuration Management

Activity/Updates Page Text Formatting Lost (Word HTML)

Symptom: Activity or news update entries on eagle-public show garbled formatting — no paragraph breaks, all text on one line, or massive invisible whitespace. Content was pasted from Microsoft Word (Desktop or Online) into the TinyMCE editor in eagle-admin.

Root cause: Word injects proprietary HTML artifacts that break plain rendering:

  • Word Desktop: class="MsoNormal", mso-* CSS, inline font-family/font-size/color styles
  • Word Online: deeply nested <div class="OutlineElement SCXW*"> wrappers with large inline style= attributes

Three-layer fix (applied Apr 2026):

  1. Layer 1 — TinyMCE paste cleanup (eagle-admin v2.4.4): paste_preprocess hook in add-edit-activity.component.ts strips Word attributes on paste, so new content is clean before saving.

  2. Layer 2 — Render-time sanitizer (eagle-public v2.4.1): sanitizeWordHtml() in src/app/shared/utils/word-html-sanitizer.ts strips any remaining Word artifacts from getSafeHtml() before [innerHTML] binding. Acts as defence-in-depth for any content that slipped through.

  3. Layer 3 — Data migration: One-off migration eagle-api/one_off_migrations/sanitizeActivityWordHtml.js cleaned all existing RecentActivity records (506 updated out of 2,467, 0 errors, Apr 2026).

If new formatting issues appear: Add the offending attribute/class pattern to the regex chain in word-html-sanitizer.ts and add a corresponding test in word-html-sanitizer.spec.ts. Re-run the data migration if needed (it is idempotent).
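To get a quick feel for what the sanitizer does, the core transformation can be approximated in shell (the real implementation is the TypeScript regex chain in word-html-sanitizer.ts, not sed):

```shell
# Approximation only: strip Word's class/style attributes from pasted HTML
echo '<p class="MsoNormal" style="font-family:Calibri;mso-fareast-language:EN-CA">Hello</p>' \
  | sed -E 's/ (class|style)="[^"]*"//g'
# -> <p>Hello</p>
```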

OpenShift CLI Reference

Useful Commands

# Get all resources for an app
oc get all -n 6cdc9e-dev -l app.kubernetes.io/name=eagle-public

# Watch pod status
oc get pods -n 6cdc9e-dev -w

# Stream logs
oc logs -f -n 6cdc9e-dev deployment/eagle-public

# Execute command in pod
oc exec -n 6cdc9e-dev deployment/eagle-public -- ls -la /tmp/app/

# Port forward for local testing
oc port-forward -n 6cdc9e-dev svc/eagle-public 8080:8080

# Check resource usage
oc adm top pods -n 6cdc9e-dev

# Get events (sorted by time)
oc get events -n 6cdc9e-dev --sort-by='.lastTimestamp'

Check Image Stream Tags

# List all tags for an image
oc get imagestreamtag -n 6cdc9e-tools | grep eagle-public

# Get details of specific tag
oc describe imagestreamtag eagle-public:v1.2.3 -n 6cdc9e-tools

Getting Help

If you can't resolve an issue:

  1. Check logs thoroughly - Pod logs, workflow logs, events
  2. Search existing issues - Someone may have encountered it before
  3. Ask in Rocket.Chat - #falcon-general channel
  4. Open a GitHub issue - Include logs, steps to reproduce, and expected vs actual behavior
