How to: set up a fully decoupled AWS Stack - GovWizely/webservices GitHub Wiki

[If you're setting up a single-instance stack, follow these instructions also.]

0. Prereqs

Create a new Stack:

Create a VPC that the Stack will live in. Use this guide.
- Note: if you want to use m1.small instance types (for example, if you're building a staging environment), don't put your subnet in us-east-1e, as apparently you cannot create these previous-generation instances there.
Create an IAM Role that Elasticsearch will use for "node discovery & custom CloudWatch metrics". Give this Role the AmazonEC2FullAccess and CloudWatchFullAccess policies. Choose a name that indicates that the Role is to be used by the Stack you're setting up.
Create a new Stack in OpsWorks. Use the VPC we just set up, Ubuntu 14.04, no ssh key, Default IAM instance profile: the IAM Role we just created.
Open Advanced...
Chef 11.10
Custom cookbook URL: git@github.com:GovWizely/webservices-cookbook.git
Ensure the Stack can pull from the repo by adding a Repo ssh key (Read/Write access).
Berkshelf? yes. Version 3.1.5.
Custom JSON: I'll revisit this in the sections that follow.
Use OpsWorks security groups: no.
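
The console settings above map roughly onto the OpsWorks CreateStack API. The following is a minimal sketch, assuming boto3 is installed and that the VPC, subnet and IAM roles already exist. The function name, the Stack name, the IDs and the ARNs are hypothetical placeholders, and the region is assumed to be us-east-1. The sketch only builds the request arguments; the actual call is left commented out.

```python
import json


def build_stack_kwargs(name, vpc_id, subnet_id, service_role_arn,
                       instance_profile_arn, cookbook_url, cookbook_ssh_key,
                       custom_json):
    """Return keyword arguments for boto3's OpsWorks create_stack call."""
    return {
        "Name": name,
        "Region": "us-east-1",  # assumption: same region as the VPC
        "VpcId": vpc_id,
        "DefaultSubnetId": subnet_id,
        "ServiceRoleArn": service_role_arn,
        "DefaultInstanceProfileArn": instance_profile_arn,
        "DefaultOs": "Ubuntu 14.04 LTS",
        "ConfigurationManager": {"Name": "Chef", "Version": "11.10"},
        "ChefConfiguration": {"ManageBerkshelf": True,
                              "BerkshelfVersion": "3.1.5"},
        "UseCustomCookbooks": True,
        "CustomCookbooksSource": {"Type": "git",
                                  "Url": cookbook_url,
                                  "SshKey": cookbook_ssh_key},
        "UseOpsworksSecurityGroups": False,
        # The API takes Custom JSON as a string, not as a dict.
        "CustomJson": json.dumps(custom_json),
    }


# Every value below is a hypothetical placeholder.
example_kwargs = build_stack_kwargs(
    name="webservices-staging",
    vpc_id="vpc-00000000",
    subnet_id="subnet-00000000",
    service_role_arn="arn:aws:iam::123456789012:role/example-service-role",
    instance_profile_arn="arn:aws:iam::123456789012:instance-profile/example-es-role",
    cookbook_url="git@github.com:GovWizely/webservices-cookbook.git",
    cookbook_ssh_key="-----BEGIN RSA PRIVATE KEY-----...",
    custom_json={},
)

# import boto3
# boto3.client("opsworks", region_name="us-east-1").create_stack(**example_kwargs)
```

No default ssh key name is passed, matching the "no ssh key" choice above.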

Stack Users

Before we start creating instances, make sure you configure which Users you'd like to have ssh/sudo access to instances. You can do this via the Users button in the top right nav.

Security Groups

Before we start creating Layers, we need to create the Security Groups that each Layer will assign to its instances. It's easiest to do this all at once. Create five Security Groups:

Webservices-ELB-Rails
- HTTP TCP 80 0.0.0.0/0
- HTTPS TCP 443 0.0.0.0/0
Webservices-Rails
- SSH TCP 22 0.0.0.0/0
- HTTP TCP 80 Webservices-ELB-Rails
Webservices-Redis
- Custom TCP 6379 Webservices-Rails
Webservices-ELB-Elasticsearch
- Custom TCP 9200 Webservices-Rails
Webservices-Elasticsearch
- SSH TCP 22 0.0.0.0/0
- Custom TCP 9200 Webservices-ELB-Elasticsearch
- Custom TCP 9300 Webservices-Elasticsearch

[ Single instance staging Stack: just create one SG called "Webservices-SingleInstance", with SSH, HTTP and HTTPS open to the world.]
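
Because several of the rules above use another Security Group as their source, it can help to write the five groups down as data and check the cross-references before clicking through the console. This is a sketch only; the dictionary layout and the helper names are hypothetical and not part of any AWS API.

```python
WORLD = "0.0.0.0/0"

# group name -> list of (TCP port, source); a source is a CIDR or a group name
SECURITY_GROUPS = {
    "Webservices-ELB-Rails": [(80, WORLD), (443, WORLD)],
    "Webservices-Rails": [(22, WORLD), (80, "Webservices-ELB-Rails")],
    "Webservices-Redis": [(6379, "Webservices-Rails")],
    "Webservices-ELB-Elasticsearch": [(9200, "Webservices-Rails")],
    "Webservices-Elasticsearch": [
        (22, WORLD),
        (9200, "Webservices-ELB-Elasticsearch"),
        (9300, "Webservices-Elasticsearch"),
    ],
}


def undefined_sources(groups):
    """Return (group, source) pairs whose source is neither a CIDR nor a defined group."""
    missing = []
    for name, rules in groups.items():
        for _port, source in rules:
            if "/" not in source and source not in groups:
                missing.append((name, source))
    return missing


def creation_order(groups):
    """Order groups so that each comes after the groups its rules reference."""
    ordered = []
    remaining = dict(groups)
    while remaining:
        ready = [
            name for name, rules in remaining.items()
            if all("/" in src or src == name or src in ordered
                   for _port, src in rules)
        ]
        if not ready:
            raise ValueError("circular or undefined group reference")
        ordered.extend(sorted(ready))
        for name in ready:
            del remaining[name]
    return ordered


order = creation_order(SECURITY_GROUPS)
```

With the rules above, the Rails ELB group has no dependencies and comes first, and the Elasticsearch group comes after its ELB group.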

1. Elasticsearch Layer

Create a Custom Layer called "Elasticsearch".
- Use "Webservices-Elasticsearch" as its Security Group.
Add these recipes to the Setup list:
- java
- elasticsearch
- elasticsearch::aws
- layer-custom::allocation-awareness
Note: There are two versions of the ES cookbook. This setup is using the old version (called the 0.3.x branch). The new version became the default midway through the development of this Stack, and is still going through active development. Since it's very new, I've left switching to it as a future improvement.
In the Networking section, set the "Public IP addresses" option to yes.
Add the following to the Stack's Custom JSON:

{
  "java": {
    "jdk_version": 8,
    "oracle": {
      "accept_oracle_download_terms": "true"
    },
    "accept_license_agreement": "true",
    "install_flavor": "oracle"
  },
  "elasticsearch": {
    "version": "1.7.1",
    "plugins": {
      "elasticsearch/elasticsearch-cloud-aws": {
        "version": "2.7.1"
      }
    },
    "cluster": {
      "name": "[CHOOSE A SUITABLE NAME]"
    },
    "gateway": {
      "expected_nodes": 3
    },
    "discovery": {
      "type": "ec2",
      "zen": {
        "minimum_master_nodes": 2,
        "ping": {
          "multicast": {
            "enabled": false
          }
        }
      },
      "ec2": {
        "tag": {
          "opsworks:stack": "[THE NAME OF YOUR STACK]"
        },
        "groups": "Webservices-Elasticsearch"
      }
    },
    "cloud": {
      "aws": {
        "region": "us-east-1"
      }
    },
    "path": {
      "data": "/mnt/elasticsearch-data"
    },
    "custom_config": {
      "cluster.routing.allocation.awareness.attributes": "rack_id"
    }
  }
}

Notes on this config:

I explicitly set the ES version to 1.7.1.
The elasticsearch-cloud-aws plugin version has to correspond with the ES version. Mappings between plugin and ES versions can be found in the plugin's documentation. Here we use 2.7.1.
discovery.zen.ping.multicast must be disabled in AWS.
The discovery.ec2.tag.opsworks:stack must (repeat must) be set to the name of the Stack (i.e. the Name field in Stack Settings). Discovery of other nodes won't work unless this is the case.
The discovery.ec2.groups setting should be set to the name of the Security Group given to Elasticsearch Layer instances.
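
The values of expected_nodes (3) and minimum_master_nodes (2) in the config are related: the usual guidance for zen discovery is to set minimum_master_nodes to a strict majority of the master-eligible nodes, which avoids split-brain. A small sketch of that arithmetic, assuming every node in the cluster is master-eligible (the function name is hypothetical):

```python
def minimum_master_nodes(master_eligible_nodes):
    """Return the quorum size: a strict majority of master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


three_node_quorum = minimum_master_nodes(3)   # value used in the config above
single_node_quorum = minimum_master_nodes(1)  # single instance staging Stack
```

If the cluster is later grown to five nodes, the same rule gives 3, and the Custom JSON should be updated to match.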

You can now add instances to the Layer. Once you have three instances online, log into one of them and run curl localhost:9200. If you get a 200 response, the node has joined the cluster and the cluster is healthy.
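
To also confirm that all three nodes have joined, the _cluster/health endpoint can be queried from one of the instances. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and treating only "green" as ready is an assumption that may be too strict for some setups.

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url="http://localhost:9200"):
    """Fetch and decode the cluster health document from a node."""
    with urlopen(base_url + "/_cluster/health", timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def cluster_ready(health, expected_nodes=3):
    """True when the status is green and enough nodes have joined."""
    return (health.get("status") == "green"
            and health.get("number_of_nodes", 0) >= expected_nodes)


# Offline illustration with a hand-written health document.
sample_health = {"cluster_name": "example", "status": "green",
                 "number_of_nodes": 3}
sample_ready = cluster_ready(sample_health)
```

On an instance, cluster_ready(fetch_cluster_health()) performs the live check.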

[ Single instance staging Stack: expected_nodes, minimum_master_nodes should be 1. ec2.groups should be Webservices-SingleInstance.]

Load Balancer

[ Single instance staging Stack: skip all this.]

In order to balance incoming query requests across all ES nodes, we need a load balancer.

Go to EC2's Load Balancers page.
Create a Load Balancer inside the VPC we set up for the Stack. It cannot be an internal load balancer, since our ES instances live in a public facing subnet.
Add one Listener Configuration: HTTP:9200 -> HTTP:9200.
Select the one subnet we created when setting up the VPC.
Use the Webservices-ELB-Elasticsearch Security Group we created earlier.
- Note: for some reason, the Custom TCP rule on port 9200 that uses the Webservices-Rails SG as its source doesn't seem to take effect. I had to add each Rails instance's public IP address to the SG before I could get the Rails instances to communicate with the ELB (e.g. for each instance, I added a rule like "Custom TCP 9200 IP_ADDRESS/32"). This cannot be correct, and needs to be sorted out.
Set the health check endpoint to HTTP:9200/. This will respond with 200 if the node has successfully joined the cluster. If the cluster is generally unhealthy, this endpoint will return 503 from all nodes, meaning the LB won't contain any instances, which IMO is the correct behavior if the cluster is unhealthy.
Manually add all instances from the Elasticsearch Layer to be part of the LB.
Create the LB.
Take note of the LB's DNS Name; we'll need it when setting up the Rails App Server Layer.
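
The health check described above can also be expressed as arguments to the classic ELB ConfigureHealthCheck API. A sketch, assuming boto3 and an existing load balancer; the function name and the load balancer name are hypothetical, and the interval, timeout and threshold values are assumptions rather than settings taken from this setup.

```python
def build_health_check_kwargs(load_balancer_name, port=9200, path="/"):
    """Return keyword arguments for boto3's elb configure_health_check call."""
    return {
        "LoadBalancerName": load_balancer_name,
        "HealthCheck": {
            "Target": "HTTP:{}{}".format(port, path),
            "Interval": 30,           # seconds between checks (assumed)
            "Timeout": 5,             # seconds before a check fails (assumed)
            "UnhealthyThreshold": 2,  # failures before removal (assumed)
            "HealthyThreshold": 2,    # successes before re-adding (assumed)
        },
    }


es_health_check = build_health_check_kwargs("webservices-es-elb")

# import boto3
# boto3.client("elb", region_name="us-east-1").configure_health_check(**es_health_check)
```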

2. ElastiCache Instance

[ Single instance staging Stack: skip all this.]

Create an ElastiCache Cluster, backed by Redis.
Create a Cache Security Group for the Cluster. Authorize access for the "aws-opsworks-rails-app-server" EC2 Security Group. Add this as your Cluster's Security Group.
The Cluster needs only one Node. Note the Node's Endpoint and Port, as we'll need them when setting up the Rails Layer.
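
The Endpoint and Port are later combined into a single Redis URL for the Rails config. A small sketch of that step, assuming the conventional redis://host:port/db form; the function name, the endpoint shown and the database number 0 are hypothetical.

```python
def redis_url(endpoint, port=6379, db=0):
    """Build a redis:// URL from an ElastiCache node endpoint and port."""
    return "redis://{}:{}/{}".format(endpoint, port, db)


# Hypothetical endpoint; use the one shown for your own Node.
example_redis_url = redis_url("example.abc123.0001.use1.cache.amazonaws.com")
```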

3. Rails Layer and App

Rails App Server Layer

Create a Rails App Server Layer. The settings I used were as follows:

Ruby version: 2.1
Rails stack: Apache2 and Passenger
Passenger version: 4.0.46
RubyGems version: 2.2.2
Install and manage Bundler: Yes
Bundler version: 1.5.3

Add the following recipes to the deploy list:
- gw_webservices::rails_config
- gw_webservices::cors
- gw_webservices::enforce_https
In the Networking section, set the "Public IP addresses" option to yes.
Add this JSON to the Stack's Custom JSON:

{
  "deploy": {
    "webservices": {
      "config": {
        "secret_token": "???",
        "devise_secrets": {
          "secret_key": "???",
          "pepper": "???"
        },
        "sidekiq": {
          "redis_url": "URL OF YOUR ELASTICACHE NODE"
        },
        "environmental_solution": {
          "username": "???",
          "password": "???",
          "web_auth": "???"
        },
        "elasticsearch": {
          "url": "http://DNS-NAME-OF-ES-LB:9200"
        },
        "sharepoint_trade_article": {
          "aws": {
            "region":            "",
            "access_key_id":     "",
            "secret_access_key": ""
          }
        },
        "tariff_rate": {
          "aws": {
            "region":            "",
            "access_key_id":     "",
            "secret_access_key": ""
          }
        },
        "developerportal_url": "",
        "mailer_sender": "",
        "action_mailer": {
          "default_url_options": {
            "host": "https://api.your.site"
          },
          "smtp_settings": {
            "address": "",
            "user_name": "",
            "password": ""
          }
        },
        "parature_api_access_token": ""
      },
      "symlink_before_migrate": {
        "config/elasticsearch.yml": "config/elasticsearch.yml",
        "config/initializers/secret_token.rb": "config/initializers/secret_token.rb",
        "config/initializers/devise_secrets.rb": "config/initializers/devise_secrets.rb",
        "config/initializers/sidekiq.rb": "config/initializers/sidekiq.rb",
        "config/initializers/environmental_solution.rb": "config/initializers/environmental_solution.rb",
        "config/environments/YOUR-ENV-NAME.rb": "config/environments/YOUR-ENV-NAME.rb"
      }
    }
  }
}

[ Single instance staging Stack:

Remove deploy.webservices.config.sidekiq, deploy.webservices.config.elasticsearch, deploy.webservices.symlink_before_migrate."config/elasticsearch.yml" and deploy.webservices.symlink_before_migrate."config/initializers/sidekiq.rb" from the Stack Custom JSON. ]
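
The Rails Custom JSON contains many "???" and empty-string placeholders, and it is easy to leave one unfilled. A short sketch that lists the dotted paths still holding a placeholder; the function name is hypothetical, and treating every empty string as unfilled is an assumption, since some values may legitimately stay empty.

```python
def unfilled_paths(node, prefix=""):
    """Yield dotted paths whose value is still "???" or an empty string."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = prefix + "." + key if prefix else key
            yield from unfilled_paths(value, path)
    elif node in ("???", ""):
        yield prefix


# Cut-down illustration of the Custom JSON structure.
sample_custom_json = {
    "deploy": {
        "webservices": {
            "config": {
                "secret_token": "???",
                "mailer_sender": "",
                "elasticsearch": {"url": "http://DNS-NAME-OF-ES-LB:9200"},
            }
        }
    }
}

sample_unfilled = list(unfilled_paths(sample_custom_json))
```

Running the same function over the full document loaded with json.load gives a checklist of values to fill in before deploying.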

Rails App

Go to "Apps" and "Add an App".
Call it "webservices".
Git URL is https://github.com/GovWizely/webservices. Note that this is the HTTPS URL, since webservices is open-source. Using the public URL allows us to avoid adding a deploy key etc.
- In the short-term, use the calum/fully-decoupled-staging branch.

Most other settings are relatively straightforward. Add some instances to the Layer and you're good to go.

Note that we'll revisit the Rails App Server Layer later when setting up a public facing load balancer.

4. Sidekiq Layer

You may have noticed in the last section some config related to Sidekiq. At this stage, the Rails app knows how to queue Sidekiq jobs, as it knows how to talk to Redis (i.e. the ElastiCache instance). Now we need to set up the instance which will execute Sidekiq jobs.

Our approach is as follows: we create a Custom Layer called "Sidekiq", add some recipes that will start sidekiq on deploy etc., then rather than add a new instance to the layer, we just add one of the existing Rails instances (which handily already has the Rails app code necessary for the sidekiq runner).

Create a Custom Layer called "Sidekiq".
Add the following recipes to the deploy list:
- gw_webservices::crontabs
- sidekiq::deploy
- gw_webservices::notify_third_parties
- gw_webservices::adjust_elasticsearch_indices
Note that in addition to installing sidekiq, these recipes install our cronjobs which import documents into our indices. They also do post deploy stuff, like inform Airbrake and New Relic that a deployment happened and adjust our indices based on new code that went out (create new indices, import data into empty indices, etc.).
[ Single instance staging Stack: add redis-server as an OS Package.]
Add an instance, but rather than create a new one, choose one of the existing Rails instances.
Add the following to the Stack Custom JSON (as per the instructions in the opsworks-sidekiq cookbook documentation):

{
  "deploy": {
    "webservices": {
      "sidekiq": {
        "start_command": "bundle exec sidekiq -e YOUR-ENV-NAME-HERE >> log/sidekiq.log 2>&1"
      }
    }
  }
}

You'll have to redeploy to the instance in order to get the additional Sidekiq stuff on there.
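
The Stack has a single Custom JSON field, so the fragments from the Elasticsearch, Rails and Sidekiq sections end up in one document, and the Rails and Sidekiq fragments both live under deploy.webservices. A minimal sketch of combining such fragments with a recursive merge, so that a later fragment adds to nested keys rather than replacing them. The function name and the cut-down fragments are hypothetical.

```python
import copy


def deep_merge(base, extra):
    """Return a new dict with extra merged into base, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Cut-down versions of the three fragments from this page.
es_fragment = {"elasticsearch": {"version": "1.7.1"}}
rails_fragment = {"deploy": {"webservices": {"config": {"secret_token": "???"}}}}
sidekiq_fragment = {"deploy": {"webservices": {"sidekiq": {
    "start_command": "bundle exec sidekiq -e YOUR-ENV-NAME-HERE"}}}}

combined = {}
for fragment in (es_fragment, rails_fragment, sidekiq_fragment):
    combined = deep_merge(combined, fragment)
```

A plain dict.update would drop deploy.webservices.config when the Sidekiq fragment is applied, which is why the merge recurses.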

Rails Layer Revisited: ELB

Set up an ELB in front of the Rails instances in much the same way as the Elasticsearch ELB was set up. Notable differences:

Port configuration:
- 80 (HTTP) forwarding to 80 (HTTP)
- 443 (HTTPS) forwarding to 80 (HTTP)
Add the necessary SSL cert to the ELB for the domain you wish to use.
Use the Webservices-ELB-Rails SG.
Add all Rails instances. If you want to dedicate the Sidekiq instance to only running imports, exclude it from the ELB.
For the health check, HTTP:80/ is usable.
If necessary, add a CNAME to your domain's DNS which points to the ELB.
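
The port configuration above, with HTTPS terminated at the ELB and plain HTTP forwarded to the instances, corresponds to two listeners in the classic ELB CreateLoadBalancer API. A sketch that builds the request arguments, assuming boto3; the function name, the load balancer name, the subnet and Security Group IDs and the certificate ARN are all hypothetical placeholders.

```python
def build_rails_elb_kwargs(name, subnet_id, security_group_id, certificate_arn):
    """Return keyword arguments for boto3's elb create_load_balancer call."""
    return {
        "LoadBalancerName": name,
        "Listeners": [
            # 80 (HTTP) forwarding to 80 (HTTP)
            {"Protocol": "HTTP", "LoadBalancerPort": 80,
             "InstanceProtocol": "HTTP", "InstancePort": 80},
            # 443 (HTTPS) forwarding to 80 (HTTP); the cert lives on the ELB
            {"Protocol": "HTTPS", "LoadBalancerPort": 443,
             "InstanceProtocol": "HTTP", "InstancePort": 80,
             "SSLCertificateId": certificate_arn},
        ],
        "Subnets": [subnet_id],
        "SecurityGroups": [security_group_id],  # the Webservices-ELB-Rails SG
    }


rails_elb = build_rails_elb_kwargs(
    name="webservices-rails-elb",
    subnet_id="subnet-00000000",
    security_group_id="sg-00000000",
    certificate_arn="arn:aws:iam::123456789012:server-certificate/example-cert",
)

# import boto3
# boto3.client("elb", region_name="us-east-1").create_load_balancer(**rails_elb)
```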

Gotchas / Concerns

ES service doesn’t start when instance is stopped then started. Why? this. Solution? Switch to the new cookbook.
If all ES instances are stopped, when brought back online all indices have been deleted. Why? [I think] due to the way that the ES recipe sets up the machine on start. It does a fresh install, including wiping out the ES data dir. Solution: don’t ever stop all instances in the cluster. Make sure user data is backed up regularly.