Setting up Amazon ECS using CloudFormation
Please Note: These instructions are now obsolete, replaced by aws-explorer provisioning.
In the following instructions we set up an ECS environment using AWS CloudFormation. This tool uses scripts to create servers, security groups, databases, load balancers, etc. in a fast and repeatable way.
We'll create a cluster named myCluster and a task named myTask. As you follow these instructions, replace these names with a real cluster name and the project name used in the configs repo.
There are three main phases in these instructions:
- In steps 1 to 4 we create a cluster and provision it with one or more EC2 instances.
- In steps 5 to 9 we add a task (application) to the cluster. These steps can be repeated to add multiple tasks to the cluster.
- Step 10 is very important, as this is where we close off security risks to the cluster.
Important: Keep the cluster name short - three or four characters is good.
- Go to the CloudFormation page and create a new stack using template
https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-1-create-cluster
For the stack name, use the name of your cluster. Skip over the 2nd page (Options).
- Once the stack has been created, you can use the links in the Outputs section to check that the cluster and S3 bucket have been created.
- Copy the text from the Outputs tab, which you will save in a later step.
- Create /Development/Cluster/<clustername> using a skeleton config:
```
$ mkdir -p /Development/Cluster/<clustername>
$ cd /Development/Cluster/<clustername>
$ curl -# https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/skeleton-cluster.tar.gz | tar xvz
```
- Edit SETENV and set the parameters carefully. Make sure you update the S3 bucket's name. You can get the DOCKER_AUTH and DOCKER_EMAIL values by using `docker login` to set your local credentials, then running `cat ~/.docker/config.json`. More details can be found here, but don't use the format with username and password in the file.
- Load the ecs.config file into the S3 bucket:
```
./sync-ecs.config
```
Sometimes this gives a long error message ("An unexpected error has occurred..."). This is usually a network-related error that resolves if you try again a few times.
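For reference, the synced ecs.config holds the cluster name and Docker registry credentials taken from SETENV. A minimal sketch, assuming the dockercfg auth format (all values are placeholders):
```
# ecs.config - read by the ECS agent on each EC2 instance in the cluster
ECS_CLUSTER=myCluster
ECS_ENGINE_AUTH_TYPE=dockercfg
ECS_ENGINE_AUTH_DATA={"https://index.docker.io/v1/":{"auth":"<DOCKER_AUTH>","email":"<DOCKER_EMAIL>"}}
```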
- Save the Outputs copied in the previous section into a file named stack-outputs.tab:
```
$ cat >> stack-outputs.tab
PAGES                            1.0.
S3Bucket    ttcf-xytz-configs    2.3 - S3 bucket
...
(press <Ctrl-D> to finish)
```
- Go to https://drive.google.com/drive/u/0/folders/0ByzEB7u5S7PbNVBjODh6MkFMdDg?ths=true and create a new Google Sheet named Cluster <your-cluster-name>.
Select File->Import and upload stack-outputs.tab with the "Replace current sheet" option, and "Convert text to numbers and dates" set to No.
Sort the spreadsheet by column C.
Share this spreadsheet as required so developers can use the ECS Configuration for their task.
- Run CloudFormation script
https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-3-add-instance-to-cluster
Stack name: <clustername>-instance-1.
EcsSecurityGroup: <clustername>-ecsSecurityGroup
InstanceType: usually t2.small (unless you have plans to run multiple projects on the cluster)
KeyName: choose an SSH key pair you have on your local machine (e.g. phil-singapore).
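If you prefer the CLI to the console, a hedged sketch of the same stack creation, assuming the template accepts the parameter keys named above:
```
aws cloudformation create-stack \
    --stack-name <clustername>-instance-1 \
    --template-url https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-3-add-instance-to-cluster \
    --parameters ParameterKey=EcsSecurityGroup,ParameterValue=<clustername>-ecsSecurityGroup \
                 ParameterKey=InstanceType,ParameterValue=t2.small \
                 ParameterKey=KeyName,ParameterValue=<your-key-pair>
```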
- Save the outputs for this stack to stack-outputs-instance-1.tab:
```
$ cat >> stack-outputs-instance-1.tab
INSTANCE
healthCheck2      http://52.221.251.14:PORT/api/healthcheck    Healthcheck example 2
temporaryLogin    ssh -i ~/.ssh/phil-singapore.pem ec2-user@...
...
(press <Ctrl-D> to finish)
```
Load this file into the previous Google sheet, on a new tab named Instance 1.
IMPORTANT
The ability to log directly into the server is only a temporary measure, while you confirm the setup. Despite the security provided by RSA encryption and authentication, we must consider SSH to be a weak point for hackers. Please follow these rules, under punishment of death!
- Only ever have one IP address whitelisted.
- If an IP address is already whitelisted, overwrite it.
- Remove this entry when you are finished, within the same day.
To add temporary access:
- Click on the ecsSecurityGroupPage link in your spreadsheet. On the Inbound tab, press Edit. Add a temporary rule for SSH with the source as "My IP".
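If you prefer the command line, a rough equivalent with the AWS CLI (the security group ID is a placeholder; take it from your spreadsheet):
```
# Whitelist your current IP for SSH...
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx \
    --protocol tcp --port 22 --cidr ${MY_IP}/32

# ...and revoke it as soon as you are finished
aws ec2 revoke-security-group-ingress --group-id sg-xxxxxxxx \
    --protocol tcp --port 22 --cidr ${MY_IP}/32
```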
- You should now be able to log into the EC2 instance from your command line, using the temporaryLogin command.
- Check that Docker and the ECS agent are running, using the dockerPs command in the spreadsheet:
```
$ ssh -i ~/.ssh/phil-singapore.pem ec2-user@<instance-ip> docker ps
CONTAINER ID   IMAGE                            COMMAND   CREATED          STATUS          PORTS   NAMES
b733865597db   amazon/amazon-ecs-agent:latest   "/agent"  22 minutes ago   Up 22 minutes           ecs-agent
```
- Verify that the S3 bucket has been mounted correctly by using the viewConfigs command:
```
$ ssh -i ~/.ssh/phil-singapore.pem ec2-user@<instance-ip> ls -l /CONFIGS_S3_BUCKET /Scripts /Volumes
/CONFIGS_S3_BUCKET:
total 1
---------- 1 root root 182 Nov  4 03:33 ecs.config

/Scripts/:
total 0

/Volumes/:
total 0
```
Important: the length of the cluster name and the task name should not be more than 5 characters.
- Run CloudFormation script
https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/ttcf-5-add-task-to-cluster
Stack name: <clustername>-<taskname>
dbAccessLocation: ignore this
ecsSecurityGroup: <clustername>-ecsSecurityGroup-xxxx
NeedCache, NeedDB: up to you...
TaskName: the name of the project, as stored in configs.tooltwist.com (e.g. drinkcircle)
Subnets and Vpc: I use the entries with an IP address 172.*
- This will create the database and REDIS cache as required. It will also create the Application Load Balancer (ALB), and log files from the task's Docker containers will be forwarded to CloudWatch.
- Save the Outputs section for this stack to stack-outputs-<taskname>.tab, typing in "TASK" by hand as a separator line:
```
$ cat >> stack-outputs-<taskname>.tab
TASK
dbPort       3306                                                                  Database port
appDomain    ttcf-xxx-alb-crowdhound-1063185692.ap-southeast-1.elb.amazonaws.com   Application endpoint
cacheHost    ttcf-xxx-crowdhound.i07nfr.0001.apse1.cache.amazonaws.com             Cache host
logGroup     ttcf-xxx-ecs-crowdhound                                               Log group
dbHost       ttcf-xxx-crowdhound.clcuugpfmr3s.ap-southeast-1.rds.amazonaws.com     Database host
cachePort    6379                                                                  Cache port
(press <Ctrl-D> to finish)
```
Add a new tab to your Google sheet and load these details.
If your application uses SOLR for searching, you will need to use CloudSearch in the ECS environment. Using a Docker container for SOLR won't work, because when the application starts the SOLR core will not be initialised, the healthcheck will fail, and the container will be killed and restarted, fail again, restart again, etc.
CloudSearch provides a fully managed and backed-up service, and can be set up manually from the CloudSearch Dashboard. Use the domain name ecs-<clustername>-<taskname>. Creating the domain takes a while.
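The domain can also be created from the command line; a one-line sketch, assuming the naming convention above:
```
# CloudSearch domain names must be lowercase letters, numbers and hyphens
aws cloudsearch create-domain --domain-name ecs-<clustername>-<taskname>
```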
The best way to define the fields to be indexed is to create a small JSON file containing an example document record. Copy example-search-document.json to a new file and modify it to represent the data of your application.
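As a sketch, such an example document might look like the following - the field names here are invented for illustration, so substitute fields that match your application's data:
```
{
  "id": "doc-0001",
  "title": "Example record",
  "description": "Free text that should be full-text searchable",
  "price": 12.5,
  "created": "2016-11-04T03:33:00Z"
}
```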
Note that the source code required to use CloudSearch is similar to, but different from, the SOLR API. The crowdhound project can provide an example of how documents can be loaded, updated and searched. For details of the API see here.
After creating the CloudSearch domain, manually update the task tab on your spreadsheet with the search and doc endpoints.
- Create a directory to contain the task's definition, config files and scripts. The config files will be loaded up to the S3 bucket, from where they will be installed onto the EC2 instance(s) into /Volumes so they can be accessed by your application's Docker containers. The scripts will similarly get installed to /Scripts on the EC2 instance(s). The task definition will be uploaded to ECS, allowing your application to be run as an ECS task or an ECS service.
We create the initial directory from a skeleton:
```
$ mkdir -p /Development/Cluster/<clustername>/<taskname>
$ cd /Development/Cluster/<clustername>/<taskname>
$ curl -# https://s3-ap-southeast-1.amazonaws.com/ttcf-templates/skeleton-task-crowdhound.tar.gz | tar xvz
```
(I'll add a skeleton for ToolTwist applications soon)
- Follow the numbered scripts to set the configuration and upload the files to the S3 bucket and ECS. Check the values in SETENV carefully, setting them from your spreadsheet.
- The config and script changes take a few minutes to propagate through to the EC2 instance(s). The viewConfigs command from your spreadsheet can be used to check.
Database initialisation is performed from the EC2 instance, using scripts you edit in the cluster/task config directory (i.e. in /Development/Cluster/<clustername>/<taskname>).
This directory can contain various useful scripts, for example to:
- run a healthcheck from the command line
- initialise the database schema
- load the database
- access the REDIS cache
- load the search engine
The exact scripts required will depend upon your application - you will need to create and update them to suit your needs. Each time you make changes, sync them to the S3 bucket and wait a few minutes for them to propagate through to the EC2 instance(s).
Bear in mind that the EC2 instance is unlikely to have client software installed to communicate with the database or REDIS. The easiest approach is to use a temporary Docker container to configure and load these back ends - the official mysql and redis images from Docker Hub work just fine. See the default scripts for an example.
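As a sketch, one-off client containers look like this (the hosts are placeholders; take the real values from your spreadsheet):
```
# mysql client from the official image - no local install needed
docker run -it --rm mysql mysql -h <dbHost> -P 3306 -u root -p

# redis-cli from the official image, pointed at the REDIS cache
docker run -it --rm redis redis-cli -h <cacheHost> -p 6379
```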
The following are example commands used to initialise the database and search engine.
```
$ ssh -i ~/.ssh/phil-singapore.pem ec2-user@<instance-ip>
Last login: Mon Nov  7 23:32:26 2016 from ppp121-44-67-215.xxxxx.xxxx.internode.on.net
...
[ec2-user@ip-172-31-1-118 ~]$ sudo su
[root@ip-172-31-1-118 ec2-user]# cd /Scripts/crowdhound/
[root@ip-172-31-1-118 crowdhound]# bash db-init
CMD=docker run -i --rm mysql mysql -h ttcf-xxx-crowdhound.clcuugpfmr3s.ap-southeast-1.rds.amazonaws.com -u root -pM0use123
...
[root@ip-172-31-1-118 crowdhound]# bash db-load
CMD=docker run -i --rm mysql mysql -h ttcf-xxx-crowdhound.clcuugpfmr3s.ap-southeast-1.rds.amazonaws.com -u < Dump20161102.sql
...
```
The first time, it is probably best to start the application progressively, as there are a lot of moving parts. If you try to start everything straight away you will probably get a load balancer health check that fails, with little information about what is actually wrong. Using a series of health checks, you can check that each piece is working correctly before you add more complexity to the picture.
Initially start the application as an ECS Task using the ECS Dashboard, and run health checks hc1 and hc2. Once these pass, shut down the task and start the application as an ECS service.
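If you prefer the command line to the dashboard, a hedged sketch of the equivalent calls, using the example names from these instructions (attaching the load balancer described below needs additional parameters):
```
# Run the application once as a plain ECS task
aws ecs run-task --cluster myCluster --task-definition myTask

# Later, run it as a long-lived ECS service
aws ecs create-service --cluster myCluster --service-name myTask \
    --task-definition myTask --desired-count 1
```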
hc1: This checks that the application is working correctly within its Docker container. Here are the typical steps:
```
$ ssh -i ~/.ssh/phil-singapore.pem ec2-user@<instance-ip>
$ sudo su
# cd /Scripts/crowdhound
# bash hc1
```
If the output contains an error, go to the CloudWatch application logs and solve the problem before proceeding.
hc2: This check can be run from your local machine's command line, or from your browser. First, however, you will need to modify the ecsSecurityGroup for your cluster to allow outside access to your Docker container's port.
You should then be able to access the healthcheck from your browser using the internal Docker port number and the EC2 instance's IP address, with a URL similar to http://1.2.3.4:12345/api/healthcheck. You can similarly call it from the command line:
```
$ curl -v http://1.2.3.4:12345/api/healthcheck
```
Important: remove the open port from the security group as soon as you are finished.
hc3: This is the healthcheck performed by the Application Load Balancer. For this health check and hc4, the application must be running as a service.
- Press Add Service on the Services tab on the Cluster's dashboard page.
- As you start the service press the Configure ELB button.
- Set ELB Name to ttcf-<clustername>-alb-<taskname>.
- Select the container for your application (i.e. not REDIS), then press Save to start the service.
- Go to your ECS security group, and add an inbound rule with Port Range as 0 - 65000 and Source as the ALB for your task. It's not obvious, but if you type the name of your task into the Source field it will provide a list of security groups.
Go to the Target Group page for your cluster/task and check the status.
If the hc1 health check passed but the target group cannot call the same health check, then either it is checking the wrong endpoint (check the Healthcheck tab), or else the security group configuration is preventing the ALB from communicating with the EC2 instance. Check that the ecsSecurityGroup for your cluster allows access from the albSecurityGroup for the cluster/task. For example:
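(a hedged CLI sketch of such a rule; take the two group IDs from your spreadsheet)
```
# Allow the ALB's security group to reach the instance's container ports
aws ec2 authorize-security-group-ingress \
    --group-id <ecsSecurityGroup-id> \
    --protocol tcp --port 0-65000 \
    --source-group <albSecurityGroup-id>
```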
hc4: This involves running the healthcheck through the standard application endpoint, through the load balancer, on its normal port (i.e. 80 or 443). For example, ttcf-xxx-alb-crowdhound-1063185692.ap-southeast-1.elb.amazonaws.com/api/healthcheck. If this healthcheck passes, then your application is available to the outside world.
*** This is Super Important ***
- Remove all inbound rules from the ECS Security Group that allow access from the outside.
- Check that the database does not have unrestricted access.
- If the healthcheck requires a container to be initialised in order to return 200, it can run into a restart loop. When the healthcheck fails a few times, the service is shut down and restarted. If you can't get the container initialised fast enough, it will be repeatedly shut down.
For example, when SOLR is run as a Docker container a "core" usually needs to be defined or the healthcheck will fail.
If this is happening, you will see the status of the instances on the Target Groups toggling between initial and draining. The short-term solution is to increase the interval and number of retries for the healthcheck.
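For example, a hedged sketch with the AWS CLI (the target group ARN is a placeholder):
```
# Give the container more time before the ALB gives up on it
aws elbv2 modify-target-group --target-group-arn <target-group-arn> \
    --health-check-interval-seconds 60 \
    --unhealthy-threshold-count 5
```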
In general, the solution must be to use containers that do not need to be loaded after restart, as Amazon may restart a container or service at any time.
- When debugging, work from the inside out, using the healthcheck step described above.
A cluster and its tasks are deleted by deleting the CloudFormation stacks in the reverse order to which they were created. Before doing this, however, several manual steps are required.
- Stop any services running on the cluster.
- Go to the ecsSecurityGroup and remove any Inbound rule from an albSecurityGroup.
- Empty the S3 bucket for the cluster.
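For reference, a hedged sketch of the tear-down from the CLI, assuming the stack names used in these instructions (the bucket name comes from your spreadsheet):
```
# Empty the cluster's S3 bucket, then delete the stacks newest-first
aws s3 rm s3://<cluster-bucket> --recursive
aws cloudformation delete-stack --stack-name <clustername>-<taskname>
aws cloudformation delete-stack --stack-name <clustername>-instance-1
aws cloudformation delete-stack --stack-name <clustername>
```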