Debugging ECS
Serverless compute such as AWS Lambda (with API Gateway) and Fargate is relatively easy to set up for production-level workloads. ECS on EC2 infrastructure, however, is more involved and calls for debugging tools when unexpected issues come up.
Let's walk through the debugging steps. I was trying to host a containerized Golang REST API backend across multiple availability zones with an Auto Scaling Group. I created an ECS cluster with one container instance (one EC2 instance) in each of two availability zones. EC2 infrastructure selected:
- ASG (Auto Scaling Group)
- On-demand instances
- Amazon Linux 2 (kernel 5.10)
- t2.micro (1 vCPU, 1 GiB)
- ecsInstanceRole
- Desired capacity: min 2, max 4
- VPC with 2 public subnets; security group with inbound ports 22, 80, and 8080, and outbound ports 80 and 443 (outbound is required for EC2-initiated connections, for example to install packages)
- Auto-assign public IP: "On"
A service was created with 2 desired tasks, using my Golang app container image pushed to ECR (Elastic Container Registry).
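For reference, the equivalent CLI call would look roughly like the sketch below; the cluster, service, and task definition names are illustrative, taken from the console links and container names that appear later in this page.

```sh
# Sketch of the equivalent CLI call (cluster/service/task-definition names are illustrative)
aws ecs create-service \
  --cluster dev \
  --service-name test \
  --task-definition goapp-task-definition \
  --desired-count 2 \
  --launch-type EC2
```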
The ECS console reported:
There was an error deploying gotask
Resource handler returned message: "Error occurred during operation 'ECS Deployment Circuit Breaker was triggered'." (RequestToken:
0664244b-24a8-24dd-09f9-d2db796b6d16, HandlerErrorCode: GeneralServiceException)
Logs under ECS -> Services -> Service Name -> Deployments showed:
service [test](https://us-east-2.console.aws.amazon.com/ecs/v2/clusters/dev/services/test?region=us-east-2)
deployment ecs-svc/5582666653287178022 deployment failed: tasks failed to start.
service [test](https://us-east-2.console.aws.amazon.com/ecs/v2/clusters/dev/services/test?region=us-east-2) was
unable to place a task because no container instance met all of its requirements. The closest matching container-instance
[5efb64c4bba34de8840c4ae1131aadf2](https://us-east-2.console.aws.amazon.com/ecs/v2/clusters/dev/infrastructure/container-instances/5efb64c4bba34de8840c4ae1131aadf2?region=us-east-2)
has insufficient memory available. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.
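The same deployment events, and the registered vs. remaining capacity on each container instance, can also be pulled from the CLI. A sketch, assuming the cluster and service names shown in the links above:

```sh
# Recent service events (deployment failures, placement errors)
aws ecs describe-services --cluster dev --services test \
  --query 'services[].events[:5]'

# Registered vs. remaining CPU/memory on a container instance
aws ecs list-container-instances --cluster dev
aws ecs describe-container-instances --cluster dev \
  --container-instances <container-instance-id> \
  --query 'containerInstances[].{registered:registeredResources,remaining:remainingResources}'
```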
Fortunately, this error log was clear enough to point toward the likely issues. Here are a few:
- The container instances' (EC2) vCPU and memory were small.
- The task's vCPU and memory size was the same as the EC2 instance's.
- The container's CPU/memory (soft limit) was also the same as the task's.
This is a bad configuration. Ideally:
- Task CPU/memory capacity should be less than the EC2 instance's capacity; the ECS agent and OS reserve part of the instance's resources, so a task sized to the full instance can never be placed.
- Container CPU/memory capacity within the task should be less than the task's CPU/memory capacity. A corrected sizing is sketched after this list.
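A minimal sketch of a corrected task definition, sized below the t2.micro's 1 vCPU / 1 GiB; the family, container name, image tag, and exact numbers are illustrative, not the values used here.

```sh
# Size the task below the instance and the container below the task (illustrative values)
cat > taskdef.json <<'EOF'
{
  "family": "goapp-task-definition",
  "requiresCompatibilities": ["EC2"],
  "cpu": "512",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "goapp-container",
      "image": "<account>.dkr.ecr.us-east-2.amazonaws.com/rajeshamdev:latest",
      "cpu": 256,
      "memory": 384,
      "memoryReservation": 256,
      "essential": true
    }
  ]
}
EOF
aws ecs register-task-definition --cli-input-json file://taskdef.json
```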
After fixing this, the containers started OK.
But the Golang service was not reachable on port 8080. I connected to the EC2 instances (so creating an SSH key pair during the infrastructure creation was important). docker ps, lsof, netstat, and the ecs-agent logs are good tools to root-cause the issue.
mars ~ $ ssh -i ecs.pem ec2-user@<ec2-public-ip>
__| __| __|
_| ( \__ \ Amazon Linux 2 (ECS Optimized)
____|\___|____/
For documentation, visit http://aws.amazon.com/documentation/ecs
[ec2-user@ip-10-0-7-48 ~]$
[ec2-user@ip-10-0-7-48 ~]$
[ec2-user@ip-10-0-7-48 ~]$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5a2c7bfeb3e8 <account>.dkr.ecr.us-east-2.amazonaws.com/rajeshamdev "./server-linux" About a minute ago Up About a minute ago (healthy) ecs-goapp-task-definition-22-goapp-container-dcf3c18d878dbbde5700
db595f7da062 amazon/amazon-ecs-pause:0.1.0 "/pause" About a minute ago Up About a minute ecs-goapp-task-definition-22-internalecspause-e8bc85eaa9e9e8846200
fed22edaf00d amazon/amazon-ecs-agent:latest "/agent" 26 minutes ago Up 26 minutes (healthy) ecs-agent
[ec2-user@ip-10-0-7-48 ~]$
[ec2-user@ip-10-0-7-48 ~]$ netstat -na | grep LISTEN
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:51679 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:43313 0.0.0.0:* LISTEN
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN
tcp6 0 0 :::51678 :::* LISTEN
tcp6 0 0 :::111 :::* LISTEN
tcp6 0 0 :::22 :::* LISTEN
[ec2-user@ip-10-0-7-48 ~]$
[ec2-user@ip-10-0-7-48 ~]$ sudo lsof -i :8080
[ec2-user@ip-10-0-7-48 ~]$
[ec2-user@ip-10-0-7-48 ~]$
[ec2-user@ip-10-0-7-48 ~]$ docker logs 5a2c7bfeb3e8
You must set ALLOW_CORS_ORIGINS in production environment
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env: export GIN_MODE=release
- using code: gin.SetMode(gin.ReleaseMode)
[GIN-debug] GET /v1/api/channel/:id/insights --> github.com/rajeshamdev/analytics/youtube/utube.GetChannelInsights (4 handlers)
[GIN-debug] GET /v1/api/channel/:id/videos --> github.com/rajeshamdev/analytics/youtube/utube.GetChannelVideos (4 handlers)
[GIN-debug] GET /v1/api/video/:id/insights --> github.com/rajeshamdev/analytics/youtube/utube.GetVideoInsights (4 handlers)
[GIN-debug] GET /v1/api/video/:id/sentiments --> github.com/rajeshamdev/analytics/youtube/utube.VideoSentiment (4 handlers)
[GIN-debug] GET /v1/api/health --> github.com/rajeshamdev/analytics/youtube/utube.HealthCheck (4 handlers)
serverStart starting
healthcheck OK
[GIN] 2024/08/02 - 16:39:30 | 200 | 9.282µs | 127.0.0.1 | GET "/v1/api/health"
[GIN] 2024/08/02 - 17:07:34 | 200 | 117.989µs | 127.0.0.1 | GET "/v1/api/health"
[ec2-user@ip-10-0-7-48 ~]$
[ec2-user@ip-10-0-7-48 ~]$ cat /var/log/ecs/ecs-agent.log
[ec2-user@ip-10-0-7-48 ~]$
As seen from the output, no port mappings are visible in the PORTS column for the app container, and nothing on the host is listening on port 8080. This is why container port 8080 was not reachable. In the task definition, the network mode was changed from "awsvpc" to "bridge", which fixed the issue.
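A sketch of the networking-related portion of the updated task definition; with bridge mode and a static host port, container port 8080 is mapped to port 8080 on the EC2 host. Names and the memory value are illustrative.

```sh
# Networking-related fields after switching to bridge mode (illustrative names/values)
cat > taskdef-bridge.json <<'EOF'
{
  "family": "goapp-task-definition",
  "networkMode": "bridge",
  "containerDefinitions": [
    {
      "name": "goapp-container",
      "image": "<account>.dkr.ecr.us-east-2.amazonaws.com/rajeshamdev:latest",
      "memoryReservation": 256,
      "essential": true,
      "portMappings": [
        { "containerPort": 8080, "hostPort": 8080, "protocol": "tcp" }
      ]
    }
  ]
}
EOF
aws ecs register-task-definition --cli-input-json file://taskdef-bridge.json
```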
Here are the details from the AWS console:

The network mode specifies what type of networking the containers in the task use. The following are available:

- The awsvpc network mode, which provides the task with an elastic network interface (ENI). When creating a service or running a task with this network mode, you must specify a network configuration consisting of one or more subnets, security groups, and whether to assign the task a public IP address. The awsvpc network mode is required for tasks hosted on Fargate.
- The bridge network mode uses Docker's built-in virtual network, which runs inside each Amazon EC2 instance hosting the task. The bridge is an internal network namespace that allows each container connected to the same bridge network to communicate with each other. It provides an isolation boundary from containers that aren't connected to the same bridge network. You use static or dynamic port mappings to map ports in the container with ports on the Amazon EC2 host. If you choose bridge for the network mode, under Port mappings, for Host port, specify the port number on the container instance to reserve for your container.
- The default mode uses Docker's built-in virtual network mode on Windows, which runs inside each Amazon EC2 instance that hosts the task. This is the default network mode on Windows if a network mode isn't specified in the task definition.
- The host network mode has the task bypass Docker's built-in virtual network and maps container ports directly to the ENI of the Amazon EC2 instance hosting the task. As a result, you can't run multiple instantiations of the same task on a single Amazon EC2 instance when port mappings are used.
- The none network mode provides a task with no external network connectivity.

For tasks hosted on Amazon EC2 instances, the available network modes are awsvpc, bridge, host, and none. If no network mode is specified, the bridge network mode is used by default.
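Once the service is redeployed with bridge mode and a static host port, the mapping shows up on the instance and the API answers on the host port. A quick check, with a hypothetical container ID and public IP:

```sh
# Port mapping is now visible and the health endpoint answers on the EC2 host port
docker port <container-id>                      # e.g. 8080/tcp -> 0.0.0.0:8080
curl http://<ec2-public-ip>:8080/v1/api/health
```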