Thermos Auth Refresh Workflow Debugging Guide - ComposioHQ/helm-charts GitHub Wiki

Overview

This guide walks through debugging failures of the authentication refresh workflow in the Thermos service. The workflow periodically refreshes authentication tokens for connected accounts across toolkits and custom applications.

Debugging Steps

1. Initial Assessment

Check Workflow Status

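# Check if auth refresh worker is running
# (--tail=-1 returns the full log; with a label selector kubectl otherwise shows only the last 10 lines per pod)
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "AuthRefreshWorkflow"

# Check for worker startup errors
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "auth-refresh"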

Verify Environment Configuration

# Verify environment variables (kubectl exec takes a pod or resource name, not a label selector)
kubectl exec deploy/thermos -c thermos -- env | grep -E "(COMPOSIO_ENV|TEMPORAL_|APOLLO_)"
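
The environment check above can be turned into a pass/fail test. The sketch below is only an illustration: `check_env_prefixes` is a hypothetical helper, and it assumes that at least one variable per prefix (`COMPOSIO_ENV`, `TEMPORAL_`, `APOLLO_`) should be present, which the guide implies but does not state.

```shell
# check_env_prefixes: reads `env` output on stdin and reports, per prefix,
# whether at least one variable with that prefix is set.
# Returns non-zero if any prefix is missing.
check_env_prefixes() {
  local env_dump missing prefix
  env_dump=$(cat)
  missing=0
  for prefix in COMPOSIO_ENV TEMPORAL_ APOLLO_; do
    if printf '%s\n' "$env_dump" | grep -q "^${prefix}"; then
      echo "ok: ${prefix}"
    else
      echo "missing: ${prefix}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (assumes the deployment is named "thermos"):
#   kubectl exec deploy/thermos -c thermos -- env | check_env_prefixes
```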

2. Temporal Workflow Debugging

Check Workflow Execution

# List recent auth refresh workflows
temporal workflow list --query "WorkflowType='AuthRefreshWorkflow'" --limit 10

# Get specific workflow execution details
temporal workflow describe --workflow-id <workflow-id>

# Check workflow history
temporal workflow show --workflow-id <workflow-id>

Monitor Workflow Metrics

# Check OpenTelemetry metrics for auth refresh
curl -s "http://localhost:8080/metrics" | grep "auth_refresh"

# Check StatsD metrics
# Look for metrics: auth_refresh.workflow.success, auth_refresh.workflow.failed, auth_refresh.workflow.skipped

3. Log Analysis

Key Log Patterns to Search For

# Workflow startup and configuration
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Starting Auth Refresh Workflow|AuthRefreshWorkflow called)"

# Connection retrieval issues
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Failed to get connections|GetConnections|GetConnectionsForProject)"

# Token refresh failures
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Failed to refresh tokens|Error refreshing tokens|RefreshAuth)"

# Batch processing issues
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Processing batch|Failed to get future response|batchFailedCount)"

# Apollo API communication
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Failed to execute HTTP request|unexpected status code|Failed to decode response)"

Log Levels and Context

# Check for ERROR level logs
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "ERROR" | grep -E "(auth|refresh|token)"

# Check for WARN level logs (skipped connections)
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "WARN" | grep -E "(Skipped connections|auth|refresh)"

# Check workflow ID correlation
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "workflow_id" | grep -E "(auth|refresh)"
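
Running each grep separately makes it hard to see which failure class dominates. As a rough sketch, the hypothetical helper below reads a log stream once and counts lines per category, using only the failure messages listed in this section. The category labels are made up for illustration and are not Thermos terminology.

```shell
# count_matches <label> <extended-regex> <log-text>
# Prints "<label>: <number of matching lines>".
count_matches() {
  printf '%s: %s\n' "$1" "$(printf '%s\n' "$3" | grep -cE -- "$2")"
}

# summarize_auth_refresh_logs: reads logs on stdin and prints one count
# per failure category (labels are illustrative).
summarize_auth_refresh_logs() {
  local logs
  logs=$(cat)
  count_matches "connection_retrieval" 'Failed to get connections' "$logs"
  count_matches "token_refresh" 'Failed to refresh tokens|Error refreshing tokens' "$logs"
  count_matches "batch_processing" 'Failed to get future response' "$logs"
  count_matches "apollo_http" 'Failed to execute HTTP request|unexpected status code|Failed to decode response' "$logs"
}

# Usage:
#   kubectl logs -l app=thermos -c thermos --tail=-1 | summarize_auth_refresh_logs
```

A category with a much higher count than the others is a reasonable place to start among the failure scenarios in the next section.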

4. Common Failure Scenarios

Scenario 1: Workflow Not Starting

Symptoms:

  • No auth refresh logs in the system
  • Missing workflow executions in Temporal

Debug Steps:

# Verify worker registration
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "CreateAuthRefreshWorker"

# Check environment mode
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "COMPOSIO_ENV"

# Check for worker startup errors
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "auth-refresh"

Common Causes:

  • Running in local environment (auth refresh skipped)
  • Temporal client connection issues
  • Worker registration failures

Scenario 2: Connection Retrieval Failures

Symptoms:

  • Failed to get connections for <toolkit>
  • Empty connection lists
  • Database connection errors

Debug Steps:

# Check database connectivity
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(database|connection|ent)"

# Verify Apollo API connectivity
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Apollo|apollo|APOLLO)"

# Check for specific toolkit issues
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(toolkit|app).*error"

Common Causes:

  • Database connection issues
  • Apollo API unavailability
  • Invalid toolkit slug or project ID
  • Network connectivity problems

Scenario 3: Token Refresh API Failures

Symptoms:

  • Failed to execute HTTP request
  • unexpected status code: <code>
  • Failed to decode response

Debug Steps:

# Check Apollo API status
curl -I "https://<apollo-endpoint>/api/v3/admin/auth-refresh"

# Look for log lines that mention the admin token header
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "x-composio-admin-token"

# Check request headers and payload
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(connectionIds|request body)"

Common Causes:

  • Apollo API service down
  • Invalid admin token
  • Network connectivity issues
  • Rate limiting from Apollo API

Scenario 4: Batch Processing Failures

Symptoms:

  • Failed to get future response
  • batchFailedCount > 0
  • Workflow fails due to batch failures

Debug Steps:

# Check batch processing logs
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Processing batch|batch.*failed)"

# Check batch processing settings
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "parallelism"

# Check semaphore and rate limiting
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(semaphore|rate.*limit)"

Common Causes:

  • High parallelism causing resource exhaustion
  • Rate limiting from external services
  • Memory or CPU constraints
  • Network timeouts

Scenario 5: High Failure Rate

Symptoms:

  • failureCount > int(float64(len(connections))*0.1)
  • Workflow fails due to >10% failure rate
  • Many individual connection failures
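
The threshold in the first symptom truncates toward zero, which matters for small connection counts. The sketch below reproduces the check in shell, assuming integer division by 10 is an acceptable stand-in for `int(float64(n)*0.1)` on non-negative counts. `exceeds_failure_threshold` is a hypothetical helper, not part of Thermos.

```shell
# exceeds_failure_threshold <failure-count> <connection-count>
# Exits 0 when failures exceed the truncated 10% threshold, mirroring
#   failureCount > int(float64(len(connections))*0.1)
exceeds_failure_threshold() {
  local failures total threshold
  failures=$1
  total=$2
  threshold=$(( total / 10 ))   # 25 connections -> threshold 2
  [ "$failures" -gt "$threshold" ]
}

# Usage:
#   exceeds_failure_threshold 3 25 && echo "workflow would fail"
```

With fewer than 10 connections the threshold truncates to 0, so under this formula a single failed connection is enough to fail the workflow.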

Debug Steps:

# Check individual connection failures
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Failed to refresh connections|failures.*connectionId)"

# Check skipped connections
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Skipped connections|skipped.*connectionId)"

# Analyze failure patterns
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(error.*token|expired|invalid)"

Common Causes:

  • Expired refresh tokens
  • Revoked OAuth tokens
  • Invalid connection configurations
  • External service rate limiting

5. Metrics and Monitoring

OpenTelemetry Metrics

Key Metrics to Monitor:

# Auth refresh workflow metrics
auth_refresh.workflow.success
auth_refresh.workflow.failed
auth_refresh.workflow.skipped

# Polling trigger auth refresh errors
polling.trigger.auth_refresh_error

Query Examples:

# Check success rate over time
curl -s "http://localhost:8080/metrics" | grep "auth_refresh_workflow_success_total"

# Check failure patterns
curl -s "http://localhost:8080/metrics" | grep "auth_refresh_workflow_failed_total"

# Check error types
curl -s "http://localhost:8080/metrics" | grep "polling_trigger_auth_refresh_error_total"
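
The success and failure counters are easier to read as a single rate. The sketch below sums both counters across all label sets and prints a failure percentage. It assumes the endpoint serves the Prometheus text format with the sample value as the last field on each line (no trailing timestamps). `auth_refresh_failure_rate` is a hypothetical helper.

```shell
# auth_refresh_failure_rate: reads Prometheus text-format metrics on stdin,
# sums the success and failed counters over all label sets, and prints the
# totals with the failure rate. Exits non-zero if no samples are found.
auth_refresh_failure_rate() {
  awk '
    /^#/ { next }   # skip HELP/TYPE comment lines
    $1 ~ /^auth_refresh_workflow_success_total/ { ok  += $NF }
    $1 ~ /^auth_refresh_workflow_failed_total/  { bad += $NF }
    END {
      total = ok + bad
      if (total == 0) { print "no auth refresh samples found"; exit 1 }
      printf "success=%d failed=%d failure_rate=%.1f%%\n", ok, bad, 100 * bad / total
    }
  '
}

# Usage:
#   curl -s "http://localhost:8080/metrics" | auth_refresh_failure_rate
```

These counters are cumulative since process start, so the printed rate is a lifetime average rather than a recent-window figure.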

StatsD Metrics

Datadog Dashboard Queries:

# Success rate
sum:auth_refresh.workflow.success{*} by {app,group}

# Failure rate
sum:auth_refresh.workflow.failed{*} by {app,group}

# Skipped connections
sum:auth_refresh.workflow.skipped{*} by {app,group}

6. Advanced Debugging

Temporal Workflow Debugging

# Search debug-level log lines (debug logging must already be enabled)
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(DEBUG|debug)" | grep -E "(auth|refresh)"

# Check activity heartbeats
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(heartbeat|Heartbeat)"

# Check retry policies
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(retry|RetryPolicy)"

Database Query Debugging

# Check connection counts (kubectl exec takes a pod or resource name, not a label selector)
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE';"

# Check specific toolkit connections
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT toolkit_slug, COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE' GROUP BY toolkit_slug;"

Network Connectivity Testing

# Test Apollo API connectivity (kubectl exec takes a pod or resource name, not a label selector)
kubectl exec deploy/thermos -c thermos -- curl -I "https://<apollo-endpoint>/health"

# Test database connectivity
kubectl exec deploy/thermos -c thermos -- pg_isready -h <db-host> -p <db-port>

# Test external service connectivity
kubectl exec deploy/thermos -c thermos -- nslookup <external-service>

7. Error Code Reference

Thermos Error Codes (1600-1699)

  • 1600: BadRequest - Invalid Thermos request
  • 1601: InternalServerError - Thermos service error
  • 1602: ServiceUnavailable - Thermos service unavailable
  • 1603: NotFound - Thermos resource not found
  • 1604: Conflict - Thermos resource conflict

HTTP Status Codes

  • 400: Bad Request - Invalid request format
  • 401: Unauthorized - Authentication required
  • 403: Forbidden - Access denied
  • 404: Not Found - Resource not found
  • 429: Too Many Requests - Rate limit exceeded
  • 500: Internal Server Error - Server error
  • 503: Service Unavailable - Service temporarily unavailable
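
When a refresh request fails with `unexpected status code: <code>`, the status code narrows down which cause to investigate first. The mapping below is an illustrative sketch that pairs the codes above with the common causes listed under Scenario 3. The category names and the hypothetical `classify_refresh_status` helper are assumptions, not Thermos behavior.

```shell
# classify_refresh_status <http-status>
# Prints a coarse category for a status code returned by the refresh API.
classify_refresh_status() {
  case "$1" in
    2??)     echo "ok" ;;
    401|403) echo "auth" ;;        # check the admin token
    429)     echo "rate_limit" ;;  # back off, reduce parallelism
    400|404) echo "request" ;;     # check endpoint, toolkit slug, project ID
    5??)     echo "server" ;;      # Apollo API down or unavailable
    *)       echo "other" ;;
  esac
}

# Usage (curl's -w '%{http_code}' prints only the status code):
#   classify_refresh_status "$(curl -s -o /dev/null -w '%{http_code}' "https://<apollo-endpoint>/api/v3/admin/auth-refresh")"
```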

8. Recovery Procedures

Immediate Recovery

# Restart Thermos service
kubectl rollout restart deployment/thermos

# Check service health
kubectl get pods -l app=thermos
kubectl describe pod <thermos-pod-name>

Workflow Recovery

# Terminate failed workflows
temporal workflow terminate --workflow-id <failed-workflow-id>

# Reset workflow schedules: list them, then delete the one to reset (deleting removes the schedule)
temporal schedule list
temporal schedule delete --schedule-id <schedule-id>

Data Recovery

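# Check connection status (kubectl exec takes a pod or resource name, not a label selector)
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT id, status, toolkit_slug FROM connected_accounts WHERE status != 'ACTIVE';"

# Reset failed connections
# Caution: this reactivates every FAILED connection; narrow the WHERE clause to limit the scope
kubectl exec deploy/thermos -c thermos -- psql -c "UPDATE connected_accounts SET status = 'ACTIVE' WHERE status = 'FAILED';"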

9. Prevention and Monitoring

Proactive Monitoring

# Set up alerts for:
# - High failure rates (>10%)
# - Workflow execution failures
# - Apollo API unavailability
# - Database connection issues
# - High memory/CPU usage

Regular Health Checks

# Daily workflow execution check
temporal workflow list --query "WorkflowType='AuthRefreshWorkflow'" --limit 1

# Connection health check (kubectl exec takes a pod or resource name, not a label selector)
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE';"

# Metrics health check
curl -s "http://localhost:8080/metrics" | grep "auth_refresh" | wc -l

10. Troubleshooting Checklist

  • Verify Thermos service is running
  • Check LaunchDarkly feature flags
  • Confirm environment configuration
  • Validate Temporal client connection
  • Check Apollo API connectivity
  • Verify database connections
  • Review recent log entries
  • Check OpenTelemetry metrics
  • Analyze failure patterns
  • Test network connectivity
  • Verify authentication tokens
  • Check resource utilization
  • Review error codes
  • Test recovery procedures

Support and Escalation

For persistent issues:

  1. Collect Debug Information:

    • Workflow execution IDs
    • Error logs with timestamps
    • Metrics snapshots
    • Configuration details
  2. Create Support Ticket:

    • Include error codes and messages
    • Provide workflow execution details
    • Attach relevant log excerpts
    • Describe reproduction steps
  3. Emergency Contacts:

    • On-call engineer for critical failures
    • Platform team for infrastructure issues
    • Database team for connection problems
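
The items under "Collect Debug Information" can be gathered into one directory before opening a ticket. The sketch below is a minimal example: `collect_debug_bundle` and the file names it writes are hypothetical, and it only covers the workflow ID and logs. Metrics snapshots and configuration details would need to be added separately.

```shell
# collect_debug_bundle <output-dir> <workflow-id>
# Reads logs on stdin and writes them, the ERROR/WARN subset, the workflow
# ID, and a UTC collection timestamp into <output-dir>.
collect_debug_bundle() {
  local out_dir workflow_id
  out_dir=$1
  workflow_id=$2
  mkdir -p "$out_dir" || return 1
  printf '%s\n' "$workflow_id" > "$out_dir/workflow_id.txt"
  date -u +%Y-%m-%dT%H:%M:%SZ > "$out_dir/collected_at.txt"
  cat > "$out_dir/logs.txt"
  # grep exits non-zero when nothing matches; an empty errors.txt is fine
  grep -E 'ERROR|WARN' "$out_dir/logs.txt" > "$out_dir/errors.txt" || true
  echo "$out_dir"
}

# Usage (--timestamps prefixes each log line with its timestamp):
#   kubectl logs -l app=thermos -c thermos --tail=-1 --timestamps \
#     | collect_debug_bundle ./auth-refresh-debug <workflow-id>
```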