Thermos Auth Refresh Workflow Debugging Guide - ComposioHQ/helm-charts GitHub Wiki
This guide provides comprehensive debugging steps for authentication refresh workflow failures in the Thermos service. The auth refresh workflow is responsible for periodically refreshing authentication tokens for connected accounts across various toolkits and custom applications.
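# Note: with a label selector (-l), kubectl logs returns only the last 10 lines
# per pod by default; add --tail=-1 or --since=<duration> to search further back.
# Check if auth refresh worker is running
kubectl logs -l app=thermos -c thermos | grep "AuthRefreshWorkflow"
# Check for worker startup errors
kubectl logs -l app=thermos -c thermos | grep "auth-refresh"

# Verify environment variables
# (kubectl exec takes a pod or resource name, not a label selector)
kubectl exec deploy/thermos -c thermos -- env | grep -E "(COMPOSIO_ENV|TEMPORAL_|APOLLO_)"

# List recent auth refresh workflows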
temporal workflow list --query "WorkflowType='AuthRefreshWorkflow'" --limit 10
# Get specific workflow execution details
temporal workflow describe --workflow-id <workflow-id>
# Check workflow history
temporal workflow show --workflow-id <workflow-id>

# Check OpenTelemetry metrics for auth refresh
curl -s "http://localhost:8080/metrics" | grep "auth_refresh"
# Check StatsD metrics
# Look for metrics: auth_refresh.workflow.success, auth_refresh.workflow.failed, auth_refresh.workflow.skipped

# Workflow startup and configuration
kubectl logs -l app=thermos -c thermos | grep -E "(Starting Auth Refresh Workflow|AuthRefreshWorkflow called)"
# Connection retrieval issues
kubectl logs -l app=thermos -c thermos | grep -E "(Failed to get connections|GetConnections|GetConnectionsForProject)"
# Token refresh failures
kubectl logs -l app=thermos -c thermos | grep -E "(Failed to refresh tokens|Error refreshing tokens|RefreshAuth)"
# Batch processing issues
kubectl logs -l app=thermos -c thermos | grep -E "(Processing batch|Failed to get future response|batchFailedCount)"
# Apollo API communication
kubectl logs -l app=thermos -c thermos | grep -E "(Failed to execute HTTP request|unexpected status code|Failed to decode response)"

# Check for ERROR level logs
kubectl logs -l app=thermos -c thermos | grep "ERROR" | grep -E "(auth|refresh|token)"
# Check for WARN level logs (skipped connections)
kubectl logs -l app=thermos -c thermos | grep "WARN" | grep -E "(Skipped connections|auth|refresh)"
# Check workflow ID correlation
kubectl logs -l app=thermos -c thermos | grep "workflow_id" | grep -E "(auth|refresh)"

Symptoms:
- No auth refresh logs in the system
- Missing workflow executions in Temporal
Debug Steps:
# Verify worker registration
kubectl logs -l app=thermos -c thermos | grep "CreateAuthRefreshWorker"
# Check environment mode
kubectl logs -l app=thermos -c thermos | grep "COMPOSIO_ENV"
# Check for worker startup errors
kubectl logs -l app=thermos -c thermos | grep "auth-refresh"

Common Causes:
- Running in local environment (auth refresh skipped)
- Temporal client connection issues
- Worker registration failures
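
Several of the causes above come down to missing configuration. As a minimal sketch, the helper below reads `env` output on stdin and prints each of the three variable prefixes from the environment check (COMPOSIO_ENV, TEMPORAL_, APOLLO_) that has no matching variable. The function name `missing_env_prefixes` is hypothetical, and the check only confirms that variables exist, not that their values are correct.

```shell
# Hypothetical helper: report expected variable prefixes that are absent
# from `env` output supplied on stdin.
missing_env_prefixes() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"                       # buffer stdin so it can be searched repeatedly
  for prefix in COMPOSIO_ENV TEMPORAL_ APOLLO_; do
    # Print the prefix only when no line starts with it
    grep -q "^${prefix}" "$tmp" || printf '%s\n' "$prefix"
  done
  rm -f "$tmp"
}

# Example usage (assumes the deployment is named thermos):
#   kubectl exec deploy/thermos -c thermos -- env | missing_env_prefixes
```

Empty output means all three prefixes are present; any printed prefix points at a configuration group to inspect first.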
Symptoms:
- Log message: Failed to get connections for <toolkit>
- Empty connection lists
- Database connection errors
Debug Steps:
# Check database connectivity
kubectl logs -l app=thermos -c thermos | grep -E "(database|connection|ent)"
# Verify Apollo API connectivity
kubectl logs -l app=thermos -c thermos | grep -E "(Apollo|apollo|APOLLO)"
# Check for specific toolkit issues
kubectl logs -l app=thermos -c thermos | grep -E "(toolkit|app).*error"

Common Causes:
- Database connection issues
- Apollo API unavailability
- Invalid toolkit slug or project ID
- Network connectivity problems
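
To separate database connection issues from the other causes, the exit status of `pg_isready` can be interpreted directly. The sketch below maps the four exit codes documented by PostgreSQL to a short description; the function name `describe_pg_isready_status` is hypothetical, and the usage comment assumes `pg_isready` is available inside the Thermos container and that `kubectl exec` passes the exit status through.

```shell
# Hypothetical helper: translate a pg_isready exit status into a description.
describe_pg_isready_status() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, during startup)" ;;
    2) echo "no response to the connection attempt" ;;
    3) echo "no attempt made (for example, invalid parameters)" ;;
    *) echo "unknown status" ;;
  esac
}

# Example usage:
#   kubectl exec deploy/thermos -c thermos -- pg_isready -h <db-host> -p <db-port>
#   describe_pg_isready_status "$?"
```

A status of 2 usually points at networking or a stopped server, while 3 suggests the host or port passed to `pg_isready` is wrong.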
Symptoms:
- Failed to execute HTTP request
- unexpected status code: <code>
- Failed to decode response
Debug Steps:
# Check Apollo API status
curl -I "https://<apollo-endpoint>/api/v3/admin/auth-refresh"
# Verify admin token
kubectl logs -l app=thermos -c thermos | grep "x-composio-admin-token"
# Check request headers and payload
kubectl logs -l app=thermos -c thermos | grep -E "(connectionIds|request body)"

Common Causes:
- Apollo API service down
- Invalid admin token
- Network connectivity issues
- Rate limiting from Apollo API
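
The status code returned by the Apollo endpoint usually narrows down which of these causes applies. The following sketch maps a status code to the most likely cause from the list above. The function name `likely_cause_for_status` is hypothetical, and the mapping (for example, treating 401 and 403 as a token problem) is an assumption based on general HTTP semantics, not on Apollo's documented behavior.

```shell
# Hypothetical helper: suggest a likely cause for an HTTP status code.
likely_cause_for_status() {
  case "$1" in
    000)     echo "no HTTP response: Apollo API down or network connectivity issue" ;;
    401|403) echo "credentials rejected: check the admin token" ;;
    429)     echo "rate limited by Apollo API" ;;
    5??)     echo "Apollo API error or service down" ;;
    2??)     echo "endpoint reachable" ;;
    *)       echo "unexpected status: inspect the response body" ;;
  esac
}

# Example usage (curl prints 000 when no HTTP response was received):
#   code=$(curl -s -o /dev/null -w '%{http_code}' "https://<apollo-endpoint>/api/v3/admin/auth-refresh")
#   likely_cause_for_status "$code"
```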
Symptoms:
- Failed to get future response
- batchFailedCount > 0
- Workflow fails due to batch failures
Debug Steps:
# Check batch processing logs
kubectl logs -l app=thermos -c thermos | grep -E "(Processing batch|batch.*failed)"
# Check batch processing settings
kubectl logs -l app=thermos -c thermos | grep "parallelism"
# Check semaphore and rate limiting
kubectl logs -l app=thermos -c thermos | grep -E "(semaphore|rate.*limit)"

Common Causes:
- High parallelism causing resource exhaustion
- Rate limiting from external services
- Memory or CPU constraints
- Network timeouts
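
When a batch fails, a per-message count is often easier to read than raw grep output. As a rough sketch, the helper below reads logs on stdin and prints how many lines contain each fixed string passed as an argument. The function name `count_log_markers` is hypothetical; the marker strings in the usage comment are the ones used in the debug steps above.

```shell
# Hypothetical helper: count log lines containing each fixed-string marker.
count_log_markers() {
  tmp=$(mktemp) || return 1
  cat > "$tmp"                       # buffer stdin so each marker can scan it
  for marker in "$@"; do
    # grep -c prints 0 when nothing matches; -F treats the marker literally
    printf '%s: %s\n' "$marker" "$(grep -c -F -- "$marker" "$tmp")"
  done
  rm -f "$tmp"
}

# Example usage:
#   kubectl logs -l app=thermos -c thermos --tail=-1 \
#     | count_log_markers "Processing batch" "Failed to get future response" "batchFailedCount"
```

Comparing the "Processing batch" count with the failure counts gives a quick sense of whether failures are isolated or affect most batches.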
Symptoms:
- failureCount > int(float64(len(connections))*0.1)
- Workflow fails due to >10% failure rate
- Many individual connection failures
Debug Steps:
# Check individual connection failures
kubectl logs -l app=thermos -c thermos | grep -E "(Failed to refresh connections|failures.*connectionId)"
# Check skipped connections
kubectl logs -l app=thermos -c thermos | grep -E "(Skipped connections|skipped.*connectionId)"
# Analyze failure patterns
kubectl logs -l app=thermos -c thermos | grep -E "(error.*token|expired|invalid)"

Common Causes:
- Expired refresh tokens
- Revoked OAuth tokens
- Invalid connection configurations
- External service rate limiting
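
The threshold condition in the symptoms truncates 10% of the connection count to an integer, so small batches tolerate fewer failures than the percentage suggests. The sketch below reproduces that arithmetic with integer division, which should give the same threshold for realistic connection counts. The function name `exceeds_failure_threshold` is hypothetical.

```shell
# Hypothetical helper: succeed when failures exceed the truncated 10% threshold.
# Mirrors: failureCount > int(float64(len(connections))*0.1)
exceeds_failure_threshold() {
  total=$1
  failures=$2
  threshold=$(( total / 10 ))        # integer division truncates, like int(...)
  [ "$failures" -gt "$threshold" ]
}

# Example usage:
#   exceeds_failure_threshold 200 21 && echo "workflow would fail"
```

With 200 connections the threshold is 20, so 21 failures fail the workflow and 20 do not. With fewer than 10 connections the threshold is 0, so a single failure is enough.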
Key Metrics to Monitor:
# Auth refresh workflow metrics
auth_refresh.workflow.success
auth_refresh.workflow.failed
auth_refresh.workflow.skipped
# Polling trigger auth refresh errors
polling.trigger.auth_refresh_error

Query Examples:
# Check success rate over time
curl -s "http://localhost:8080/metrics" | grep "auth_refresh_workflow_success_total"
# Check failure patterns
curl -s "http://localhost:8080/metrics" | grep "auth_refresh_workflow_failed_total"
# Check error types
curl -s "http://localhost:8080/metrics" | grep "polling_trigger_auth_refresh_error_total"

Datadog Dashboard Queries:
# Success rate
sum:auth_refresh.workflow.success{*} by {app,group}
# Failure rate
sum:auth_refresh.workflow.failed{*} by {app,group}
# Skipped connections
sum:auth_refresh.workflow.skipped{*} by {app,group}
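
The success and failure counters can also be combined into a single failure percentage from the metrics endpoint. The sketch below sums the two counters from the query examples across all label sets. It assumes Prometheus text exposition format with the sample value as the last field on each line (no trailing timestamp); the function name `auth_refresh_failure_rate` is hypothetical.

```shell
# Hypothetical helper: compute the auth refresh failure percentage from
# Prometheus-style metrics text on stdin.
auth_refresh_failure_rate() {
  awk '
    # Match the metric name exactly, with or without a label set
    $1 ~ /^auth_refresh_workflow_success_total([{]|$)/ { ok  += $NF }
    $1 ~ /^auth_refresh_workflow_failed_total([{]|$)/  { bad += $NF }
    END {
      total = ok + bad
      pct = (total > 0) ? 100 * bad / total : 0   # avoid division by zero
      printf "success=%d failed=%d failure_pct=%.2f\n", ok, bad, pct
    }
  '
}

# Example usage:
#   curl -s "http://localhost:8080/metrics" | auth_refresh_failure_rate
```

Because these are cumulative counters, the result reflects the lifetime of the process; comparing two snapshots taken some time apart gives a rate for that interval.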

# Filter debug-level log lines (this only filters output; it does not change the log level)
kubectl logs -l app=thermos -c thermos | grep -E "(DEBUG|debug)" | grep -E "(auth|refresh)"
# Check activity heartbeats
kubectl logs -l app=thermos -c thermos | grep -E "(heartbeat|Heartbeat)"
# Check retry policies
kubectl logs -l app=thermos -c thermos | grep -E "(retry|RetryPolicy)"

# Check connection counts
# (kubectl exec takes a pod or resource name, not a label selector)
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE';"
# Check specific toolkit connections
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT toolkit_slug, COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE' GROUP BY toolkit_slug;"

# Test Apollo API connectivity
kubectl exec deploy/thermos -c thermos -- curl -I "https://<apollo-endpoint>/health"
# Test database connectivity
kubectl exec deploy/thermos -c thermos -- pg_isready -h <db-host> -p <db-port>
# Test external service connectivity
kubectl exec deploy/thermos -c thermos -- nslookup <external-service>

- 1600: BadRequest - Invalid Thermos request
- 1601: InternalServerError - Thermos service error
- 1602: ServiceUnavailable - Thermos service unavailable
- 1603: NotFound - Thermos resource not found
- 1604: Conflict - Thermos resource conflict
- 400: Bad Request - Invalid request format
- 401: Unauthorized - Authentication required
- 403: Forbidden - Access denied
- 404: Not Found - Resource not found
- 429: Too Many Requests - Rate limit exceeded
- 500: Internal Server Error - Server error
- 503: Service Unavailable - Service temporarily unavailable
# Restart Thermos service
kubectl rollout restart deployment/thermos
# Check service health
kubectl get pods -l app=thermos
kubectl describe pod <thermos-pod-name>

# Terminate failed workflows
temporal workflow terminate --workflow-id <failed-workflow-id>
# Reset workflow schedules
temporal schedule list
temporal schedule delete --schedule-id <schedule-id>

# Check connection status
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT id, status, toolkit_slug FROM connected_accounts WHERE status != 'ACTIVE';"
# Reset failed connections
kubectl exec deploy/thermos -c thermos -- psql -c "UPDATE connected_accounts SET status = 'ACTIVE' WHERE status = 'FAILED';"

# Set up alerts for:
# - High failure rates (>10%)
# - Workflow execution failures
# - Apollo API unavailability
# - Database connection issues
# - High memory/CPU usage

# Daily workflow execution check
temporal workflow list --query "WorkflowType='AuthRefreshWorkflow'" --limit 1
# Connection health check
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE';"
# Metrics health check
curl -s "http://localhost:8080/metrics" | grep "auth_refresh" | wc -l

- Verify Thermos service is running
- Check LaunchDarkly feature flags
- Confirm environment configuration
- Validate Temporal client connection
- Check Apollo API connectivity
- Verify database connections
- Review recent log entries
- Check OpenTelemetry metrics
- Analyze failure patterns
- Test network connectivity
- Verify authentication tokens
- Check resource utilization
- Review error codes
- Test recovery procedures
For persistent issues:
- Collect Debug Information:
  - Workflow execution IDs
  - Error logs with timestamps
  - Metrics snapshots
  - Configuration details
- Create Support Ticket:
  - Include error codes and messages
  - Provide workflow execution details
  - Attach relevant log excerpts
  - Describe reproduction steps
- Emergency Contacts:
  - On-call engineer for critical failures
  - Platform team for infrastructure issues
  - Database team for connection problems