Thermos Auth Refresh Workflow Debugging Guide - ComposioHQ/helm-charts GitHub Wiki

Overview

This guide describes how to debug failures of the authentication refresh workflow in the Thermos service. The workflow periodically refreshes authentication tokens for connected accounts across toolkits and custom applications.

Debugging Steps

1. Initial Assessment

Check Workflow Status

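# Check if auth refresh worker is running
# (--tail=-1 is required: with a label selector, kubectl logs returns only the last 10 lines per pod by default)
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "AuthRefreshWorkflow"

# Check for worker startup errors
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "auth-refresh"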

Verify Environment Configuration

# Verify environment variables
# (kubectl exec does not accept a label selector, so target the deployment instead)
kubectl exec deploy/thermos -c thermos -- env | grep -E "(COMPOSIO_ENV|TEMPORAL_|APOLLO_)"
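
A grep that matches nothing prints nothing, which is easy to misread as success. As a minimal sketch, the helper below reads `env` output on stdin and prints each of the three prefixes from the command above that has no matching variable. The function name `missing_env_prefixes` is hypothetical and not part of Thermos.

```shell
# Hypothetical helper: reads `env` output on stdin and prints every expected
# variable prefix that has no matching variable. Prints nothing if all are set.
# Usage: kubectl exec deploy/thermos -c thermos -- env | missing_env_prefixes
missing_env_prefixes() {
  env_dump=$(cat)
  for prefix in COMPOSIO_ENV TEMPORAL_ APOLLO_; do
    # Anchor at line start so a value that merely contains the prefix does not count.
    if ! printf '%s\n' "$env_dump" | grep "^${prefix}" >/dev/null; then
      echo "$prefix"
    fi
  done
}
```

Empty output means every prefix matched at least one variable. It does not validate the values themselves.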

2. Temporal Workflow Debugging

Check Workflow Execution

# List recent auth refresh workflows
temporal workflow list --query "WorkflowType='AuthRefreshWorkflow'" --limit 10

# Get specific workflow execution details
temporal workflow describe --workflow-id <workflow-id>

# Check workflow history
temporal workflow show --workflow-id <workflow-id>

Monitor Workflow Metrics

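# Check OpenTelemetry metrics for auth refresh
# (assumes the metrics port is reachable on localhost:8080, for example through kubectl port-forward)
curl -s "http://localhost:8080/metrics" | grep "auth_refresh"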

# Check StatsD metrics
# Look for metrics: auth_refresh.workflow.success, auth_refresh.workflow.failed, auth_refresh.workflow.skipped

3. Log Analysis

Key Log Patterns to Search For

# Workflow startup and configuration
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Starting Auth Refresh Workflow|AuthRefreshWorkflow called)"

# Connection retrieval issues
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Failed to get connections|GetConnections|GetConnectionsForProject)"

# Token refresh failures
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Failed to refresh tokens|Error refreshing tokens|RefreshAuth)"

# Batch processing issues
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Processing batch|Failed to get future response|batchFailedCount)"

# Apollo API communication
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Failed to execute HTTP request|unexpected status code|Failed to decode response)"
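
Running five greps one after another makes it hard to see which failure category dominates. The sketch below tallies the error messages listed above in a single pass over the logs. The function name `auth_refresh_log_summary` and the category labels are hypothetical. Only the failure messages are counted, not informational strings such as `Processing batch`.

```shell
# Hypothetical helper: reads Thermos logs on stdin and prints one line per
# failure category with the number of matching log lines.
# Usage: kubectl logs -l app=thermos -c thermos --tail=-1 | auth_refresh_log_summary
auth_refresh_log_summary() {
  awk '
    /Failed to get connections/                        { n["connection_retrieval"]++ }
    /Failed to refresh tokens|Error refreshing tokens/ { n["token_refresh"]++ }
    /Failed to get future response/                    { n["batch_processing"]++ }
    /Failed to execute HTTP request|unexpected status code|Failed to decode response/ { n["apollo_api"]++ }
    END {
      # Fixed output order so results are easy to compare between runs.
      split("connection_retrieval token_refresh batch_processing apollo_api", order, " ")
      for (i = 1; i <= 4; i++) printf "%s=%d\n", order[i], n[order[i]]
    }
  '
}
```

The category with the highest count suggests which scenario in section 4 to start with.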

Log Levels and Context

# Check for ERROR level logs
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "ERROR" | grep -E "(auth|refresh|token)"

# Check for WARN level logs (skipped connections)
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "WARN" | grep -E "(Skipped connections|auth|refresh)"

# Check workflow ID correlation
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "workflow_id" | grep -E "(auth|refresh)"

4. Common Failure Scenarios

Scenario 1: Workflow Not Starting

Symptoms:

- No auth refresh logs in the system
- Missing workflow executions in Temporal

Debug Steps:

# Verify worker registration
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "CreateAuthRefreshWorker"

# Check environment mode
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "COMPOSIO_ENV"

# Check for worker startup errors
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "auth-refresh"

Common Causes:

- Running in local environment (auth refresh skipped)
- Temporal client connection issues
- Worker registration failures

Scenario 2: Connection Retrieval Failures

Symptoms:

- Failed to get connections for <toolkit>
- Empty connection lists
- Database connection errors

Debug Steps:

# Check database connectivity
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(database|connection|ent)"

# Verify Apollo API connectivity
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Apollo|apollo|APOLLO)"

# Check for specific toolkit issues
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(toolkit|app).*error"

Common Causes:

- Database connection issues
- Apollo API unavailability
- Invalid toolkit slug or project ID
- Network connectivity problems

Scenario 3: Token Refresh API Failures

Symptoms:

- Failed to execute HTTP request
- unexpected status code: <code>
- Failed to decode response

Debug Steps:

# Check Apollo API status (any HTTP response, even an error status, shows the endpoint is reachable)
curl -I "https://<apollo-endpoint>/api/v3/admin/auth-refresh"

# Verify admin token
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "x-composio-admin-token"

# Check request headers and payload
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(connectionIds|request body)"

Common Causes:

- Apollo API service down
- Invalid admin token
- Network connectivity issues
- Rate limiting from Apollo API
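
Which status code appears most often usually narrows down the cause. The sketch below assumes the log message contains the literal text `unexpected status code: ` followed by a numeric code, as in the symptom above. The function name `status_code_tally` is hypothetical.

```shell
# Hypothetical helper: reads logs on stdin and prints "<count> <status code>"
# for every "unexpected status code: <code>" message, most frequent first.
# Usage: kubectl logs -l app=thermos -c thermos --tail=-1 | status_code_tally
status_code_tally() {
  # Keep only the numeric code from matching lines, then count duplicates.
  sed -n 's/.*unexpected status code: \([0-9][0-9]*\).*/\1/p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

Read against the causes above: mostly 401 or 403 points at the admin token, 429 at rate limiting from the Apollo API, and 500 or 503 at the Apollo service itself.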

Scenario 4: Batch Processing Failures

Symptoms:

- Failed to get future response
- batchFailedCount > 0
- Workflow fails due to batch failures

Debug Steps:

# Check batch processing logs
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Processing batch|batch.*failed)"

# Check batch processing settings
kubectl logs -l app=thermos -c thermos --tail=-1 | grep "parallelism"

# Check semaphore and rate limiting
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(semaphore|rate.*limit)"

Common Causes:

- High parallelism causing resource exhaustion
- Rate limiting from external services
- Memory or CPU constraints
- Network timeouts

Scenario 5: High Failure Rate

Symptoms:

- failureCount > int(float64(len(connections))*0.1)
- Workflow fails due to >10% failure rate
- Many individual connection failures

Debug Steps:

# Check individual connection failures
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Failed to refresh connections|failures.*connectionId)"

# Check skipped connections
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(Skipped connections|skipped.*connectionId)"

# Analyze failure patterns
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(error.*token|expired|invalid)"

Common Causes:

- Expired refresh tokens
- Revoked OAuth tokens
- Invalid connection configurations
- External service rate limiting
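
The threshold expression in the symptoms truncates toward zero, so the effective limit is lower than 10% for small batches. The sketch below reproduces that arithmetic so you can check a given run by hand. It takes the expression at face value and is not the Thermos implementation. The function name `exceeds_failure_threshold` is hypothetical.

```shell
# Mirrors the symptom expression: failureCount > int(float64(len(connections))*0.1)
# Usage: exceeds_failure_threshold <failure_count> <connection_count>
# Prints the computed threshold; exit status 0 means the threshold is exceeded.
exceeds_failure_threshold() {
  awk -v failed="$1" -v total="$2" 'BEGIN {
    threshold = int(total * 0.1)   # int() truncates toward zero, like the int conversion
    printf "threshold=%d\n", threshold
    exit (failed > threshold ? 0 : 1)
  }'
}
```

With 25 connections the threshold is 2, so 3 failures exceed it. With fewer than 10 connections the threshold is 0, so under this expression a single failed connection is enough to fail the workflow.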

5. Metrics and Monitoring

OpenTelemetry Metrics

Key Metrics to Monitor:

# Auth refresh workflow metrics
auth_refresh.workflow.success
auth_refresh.workflow.failed
auth_refresh.workflow.skipped

# Polling trigger auth refresh errors
polling.trigger.auth_refresh_error

Query Examples:

# Check the cumulative success counter (compare two readings to see the rate over time)
curl -s "http://localhost:8080/metrics" | grep "auth_refresh_workflow_success_total"

# Check the cumulative failure counter
curl -s "http://localhost:8080/metrics" | grep "auth_refresh_workflow_failed_total"

# Check polling trigger auth refresh errors
curl -s "http://localhost:8080/metrics" | grep "polling_trigger_auth_refresh_error_total"
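
The counters above are cumulative, so a ratio is often more useful than the raw numbers. The sketch below sums every sample of the success and failed counters across label sets and prints the failure percentage. It assumes the endpoint returns Prometheus text format without timestamps. The function names `sum_counter` and `failure_rate_percent` are hypothetical.

```shell
# Hypothetical helper: sums every sample of one counter in Prometheus text
# format read from stdin. Usage: sum_counter <metric_name>
sum_counter() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)           # drop the label set, if any
      if (metric == name) total += $NF   # value is the last field (no timestamp assumed)
    }
    END { printf "%d\n", total }
  '
}

# Prints failed / (success + failed) as a percentage with one decimal place.
# Usage: curl -s "http://localhost:8080/metrics" | failure_rate_percent
failure_rate_percent() {
  metrics=$(cat)
  ok=$(printf '%s\n' "$metrics" | sum_counter auth_refresh_workflow_success_total)
  failed=$(printf '%s\n' "$metrics" | sum_counter auth_refresh_workflow_failed_total)
  awk -v ok="$ok" -v failed="$failed" 'BEGIN {
    total = ok + failed
    if (total == 0) { print "no samples"; exit }
    printf "%.1f\n", 100 * failed / total
  }'
}
```

Because the counters accumulate from process start, the result is a lifetime ratio for the scraped pod, not a rate for the most recent run.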

StatsD Metrics

Datadog Dashboard Queries:

# Success count
sum:auth_refresh.workflow.success{*} by {app,group}

# Failure count
sum:auth_refresh.workflow.failed{*} by {app,group}

# Skipped connections
sum:auth_refresh.workflow.skipped{*} by {app,group}

6. Advanced Debugging

Temporal Workflow Debugging

# Search debug-level logs (debug logging must already be enabled for these to appear)
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(DEBUG|debug)" | grep -E "(auth|refresh)"

# Check activity heartbeats
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(heartbeat|Heartbeat)"

# Check retry policies
kubectl logs -l app=thermos -c thermos --tail=-1 | grep -E "(retry|RetryPolicy)"

Database Query Debugging

# Check connection counts
# (kubectl exec does not accept a label selector, so target the deployment; psql must be available in the container)
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE';"

# Check specific toolkit connections
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT toolkit_slug, COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE' GROUP BY toolkit_slug;"

Network Connectivity Testing

# Test Apollo API connectivity
# (kubectl exec does not accept a label selector; curl, pg_isready, and nslookup must be available in the container)
kubectl exec deploy/thermos -c thermos -- curl -I "https://<apollo-endpoint>/health"

# Test database connectivity
kubectl exec deploy/thermos -c thermos -- pg_isready -h <db-host> -p <db-port>

# Test external service connectivity
kubectl exec deploy/thermos -c thermos -- nslookup <external-service>

7. Error Code Reference

Thermos Error Codes (1600-1699)

1600: BadRequest - Invalid Thermos request
1601: InternalServerError - Thermos service error
1602: ServiceUnavailable - Thermos service unavailable
1603: NotFound - Thermos resource not found
1604: Conflict - Thermos resource conflict
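
When triaging a batch of log lines, a small lookup can save flipping back to this table. The sketch below encodes the five codes listed above. The function name `thermos_error_name` is hypothetical, and codes in the 1600-1699 range that are not listed here are reported as unknown, not guessed.

```shell
# Hypothetical helper: maps a Thermos error code to its name from the table above.
# Usage: thermos_error_name 1602
thermos_error_name() {
  case "$1" in
    1600) echo "BadRequest" ;;
    1601) echo "InternalServerError" ;;
    1602) echo "ServiceUnavailable" ;;
    1603) echo "NotFound" ;;
    1604) echo "Conflict" ;;
    # In the Thermos range but not documented in this guide.
    16[0-9][0-9]) echo "Unknown Thermos error (1600-1699 range)" ;;
    *) echo "Not a Thermos error code" ;;
  esac
}
```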

HTTP Status Codes

400: Bad Request - Invalid request format
401: Unauthorized - Authentication required
403: Forbidden - Access denied
404: Not Found - Resource not found
429: Too Many Requests - Rate limit exceeded
500: Internal Server Error - Server error
503: Service Unavailable - Service temporarily unavailable

8. Recovery Procedures

Immediate Recovery

# Restart Thermos service
kubectl rollout restart deployment/thermos

# Check service health
kubectl get pods -l app=thermos
kubectl describe pod <thermos-pod-name>

Workflow Recovery

# Terminate failed workflows
temporal workflow terminate --workflow-id <failed-workflow-id>

# Reset workflow schedules (deleting a schedule stops future runs until the schedule is created again)
temporal schedule list
temporal schedule delete --schedule-id <schedule-id>

Data Recovery

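# Check connection status
# (kubectl exec does not accept a label selector, so target the deployment)
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT id, status, toolkit_slug FROM connected_accounts WHERE status != 'ACTIVE';"

# Reset failed connections
# (this updates every FAILED row; review the SELECT output first, since connections with expired or revoked tokens will fail again on the next refresh)
kubectl exec deploy/thermos -c thermos -- psql -c "UPDATE connected_accounts SET status = 'ACTIVE' WHERE status = 'FAILED';"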

9. Prevention and Monitoring

Proactive Monitoring

# Set up alerts for:
# - High failure rates (>10%)
# - Workflow execution failures
# - Apollo API unavailability
# - Database connection issues
# - High memory/CPU usage

Regular Health Checks

# Daily workflow execution check
temporal workflow list --query "WorkflowType='AuthRefreshWorkflow'" --limit 1

# Connection health check
kubectl exec deploy/thermos -c thermos -- psql -c "SELECT COUNT(*) FROM connected_accounts WHERE status = 'ACTIVE';"

# Metrics health check
curl -s "http://localhost:8080/metrics" | grep "auth_refresh" | wc -l

10. Troubleshooting Checklist

Support and Escalation

For persistent issues:

Collect Debug Information:
- Workflow execution IDs
- Error logs with timestamps
- Metrics snapshots
- Configuration details

Create Support Ticket:
- Include error codes and messages
- Provide workflow execution details
- Attach relevant log excerpts
- Describe reproduction steps

Emergency Contacts:
- On-call engineer for critical failures
- Platform team for infrastructure issues
- Database team for connection problems
