
Team Aurora - Project Milestone 3: Fault tolerance, Load Balancing, Continuous upgrades without downtime

Team Members:

Kushal Sheth (kmsheth)

Pratik Sanghvi (psanghvi)

Srikanth Srinivas Holavanahalli (sriksrin)

Vikrant Kaushal (vkaushal)

Instructions Manual:

This page describes Milestone 3, explains how to run the micro-services on EC2, and lists the tools and methods we used for this milestone.

Prerequisites:

Multiple EC2 Instances

Valid Google account

Tools used:

Java: Maven, Jersey for REST APIs

Python: Flask for REST APIs

CI/CD: Travis CI, AWS S3, AWS CodeDeploy, AWS EC2

Database: PostgreSQL

Message Passing: RabbitMQ

Aim:

In continuation of the project, this milestone focuses on making the services fault tolerant: if a service breaks, its requests are passed to a replica of that service on a different instance, chosen on the basis of instance load, and upgrading a service causes no downtime. Load balancing routes each request to the instance with the least load. This makes the services more robust to the number of requests arriving at any particular time.

RabbitMQ:

RabbitMQ is an open-source message broker that implements the Advanced Message Queuing Protocol (AMQP) [2]. In this milestone we have created Work Queues, which are used to distribute time-consuming tasks among multiple workers.

“The main idea behind Work Queues (aka: Task Queues) is to avoid doing a resource-intensive task immediately and having to wait for it to complete.”[1]

Once a message is delivered to a consumer, we do not immediately remove it from the queue. If a worker dies, we'd like the task to be delivered to another worker. To make sure a message is never lost, RabbitMQ supports message acknowledgements, and we use them to achieve fault tolerance.
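To make this concrete, here is a minimal worker sketch with manual acknowledgements, in the spirit of the cited tutorial [1]. It assumes Python with pika 1.x and a broker on localhost; our actual services implement the same pattern in their own languages.

    import time

    import pika

    def handle(body):
        # Placeholder for the real work a Data-Ingestor worker performs.
        time.sleep(1)

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='API_DI_QUEUE')

    def on_message(ch, method, properties, body):
        handle(body)
        # Acknowledge only after the work is done. If the worker dies before
        # acking, RabbitMQ re-queues the message for another worker.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    # auto_ack defaults to False in pika 1.x, so acknowledgements are manual.
    channel.basic_consume(queue='API_DI_QUEUE', on_message_callback=on_message)
    channel.start_consuming()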

(Figure: milestone 3 RabbitMQ setup)

In our implementation, a message from the API Gateway is passed to the appropriate queue. For example, when the API Gateway wants to communicate with the Data-Ingestor service, it posts a message to API_DI_QUEUE. This message is consumed by one of the Data-Ingestor services (a consumer). After finishing its work, the Data-Ingestor produces an output message and posts it to API_ALL_QUEUE. The API Gateway picks the message up from that queue and generates a message for the next service.
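The gateway side of this exchange can be sketched the same way (the queue names come from the description above; the payload format and host are assumptions for illustration):

    import json

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='API_DI_QUEUE')
    channel.queue_declare(queue='API_ALL_QUEUE')

    # Post a task for the Data-Ingestor service.
    channel.basic_publish(exchange='', routing_key='API_DI_QUEUE',
                          body=json.dumps({'task': 'ingest'}))

    # Later, pick up whatever result a worker posted to API_ALL_QUEUE.
    method, properties, body = channel.basic_get(queue='API_ALL_QUEUE', auto_ack=True)
    if method is not None:
        print('result:', body)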

How do we handle Load Balancing

For load balancing, we run a container for each service on three different EC2 instances. On the queue we set the "prefetchCount" to 1, so when multiple consumers listen on the same queue, RabbitMQ does not dispatch a new message to a worker until that worker has acknowledged its previous one; each message therefore goes to a consumer that is free rather than one that is busy. For example, when a message arrives in API_DI_QUEUE, three Data-Ingestor services on three different instances are listening on this queue, and RabbitMQ assigns the message to the one with the least load.
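In pika terms this is a single call, added to the worker sketch above just before basic_consume (our services set the equivalent option in their own client libraries):

    # Do not send this worker a new message until it has acknowledged the
    # previous one; RabbitMQ then dispatches each message to a free worker.
    channel.basic_qos(prefetch_count=1)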

How do we handle Fault Tolerance

Each client/consumer is registered with the queue. If one of a service's containers goes down, messages for that service stay in the queue until another container is available to consume them. For example, if one of the Data-Ingestor containers goes down, the message is retained in API_DI_QUEUE until some Data-Ingestor container consumes it.
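No extra code is needed for this retention: an unconsumed message simply waits in the queue. If we also wanted messages to survive a broker restart (this page does not state whether our queues are durable, so treat the following as an assumption), the queue and messages would additionally be declared durable/persistent:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    # Note: durability is fixed when a queue is first declared; re-declaring
    # an existing queue with different settings is an error.
    channel.queue_declare(queue='API_DI_QUEUE', durable=True)
    channel.basic_publish(exchange='', routing_key='API_DI_QUEUE',
                          body='task payload',
                          properties=pika.BasicProperties(delivery_mode=2))  # persistent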

How do we handle Continuous Upgrades without downtime

Our services are deployed on the EC2 instances one at a time. Hence, while a service is being upgraded on one instance, the same service keeps running on the other instances. As a result, there is no downtime while upgrading any of the services.

## Setting up the initial environment

For this milestone, we have 4 EC2 instances running.

One instance hosts only the API Gateway. It acts as the server/load balancer, since RabbitMQ is installed on it.

To create it, launch a new instance on AWS and tag it with key “SGG-Group” and value “TeamAurora_HA”. This instance is deployed independently of the other services.

Once the instance is launched, install and configure the CodeDeploy agent on it.

Install RabbitMQ on this instance:

1]. Create the directory structure rabbitmq/docker_data inside /home/ec2-user/.

2]. Execute the following command (it publishes the RabbitMQ management UI on port 8080 and the AMQP port on 5672):

   #docker run -d --hostname rabbitmqhost --name rabbitmqserver1 \
       -e RABBITMQ_ERLANG_COOKIE='rabbitmqerlangcookie' \
       -p 8080:15672 -p 5672:5672 \
       -v /home/ec2-user/rabbitmq/docker_data:/var/lib/rabbitmq \
       rabbitmq:3.6.5-management
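To quickly verify the broker is reachable, a minimal check with Python/pika (an assumption; any AMQP client would do) can be run on the instance:

    import pika

    # Connect to the AMQP port (5672) published by the container above.
    conn = pika.BlockingConnection(pika.ConnectionParameters(host='localhost', port=5672))
    print('RabbitMQ is reachable')
    conn.close()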

After this, create 3 more instances on AWS and tag them with key “SGG-Group” and value “TeamAurora”. These instances carry all the other micro-services: each instance hosts every micro-service except the API Gateway.

Once the instances are launched, install and configure the CodeDeploy agent on each of them.

Now, on the AWS CodeDeploy dashboard, create an application “TeamAurora”. Within this application, create two deployment groups.

1]. TeamAurora_Services: give this deployment group the tag key/value “SGG-Group TeamAurora”. Anything deployed through this group is therefore deployed on all three instances tagged TeamAurora. This group contains 3 instances.

Select the following options:

Deployment Config: CodeDeployDefault.OneAtATime

Rollbacks: Rollback when deployment fails

Service Role ARN: AuroraCodeDeployRole 

2]. TeamAurora_HA: give this deployment group the tag key/value “SGG-Group TeamAurora_HA”. Anything deployed through this group is therefore deployed on the single instance tagged TeamAurora_HA. This group contains 1 instance.

Select the following options:

Deployment Config: CodeDeployDefault.OneAtATime

Rollbacks: Rollback when deployment fails

Service Role ARN: AuroraCodeDeployRole 

The .travis.yml file for the APIGateway service uses “TeamAurora_HA” as its deployment group, while all the other services use “TeamAurora_Services”.
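The console steps above can also be expressed in code. The boto3 sketch below mirrors them; the region and the account ID in the role ARN are placeholders, not values from this project:

    import boto3

    codedeploy = boto3.client('codedeploy', region_name='us-west-2')

    for group, tag_value in [('TeamAurora_Services', 'TeamAurora'),
                             ('TeamAurora_HA', 'TeamAurora_HA')]:
        codedeploy.create_deployment_group(
            applicationName='TeamAurora',
            deploymentGroupName=group,
            deploymentConfigName='CodeDeployDefault.OneAtATime',
            # Target the instances tagged SGG-Group=<tag_value>.
            ec2TagFilters=[{'Key': 'SGG-Group', 'Value': tag_value,
                            'Type': 'KEY_AND_VALUE'}],
            # Placeholder ARN for the AuroraCodeDeployRole service role.
            serviceRoleArn='arn:aws:iam::123456789012:role/AuroraCodeDeployRole',
            autoRollbackConfiguration={'enabled': True,
                                       'events': ['DEPLOYMENT_FAILURE']},
        )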

## How to RUN

Step 1:

Go to https://github.com/airavata-courses/TeamAurora. In the branches menu, open the feature-apigateway branch. Make an edit to the ‘Edit Me.txt’ file inside the branch and commit it. This triggers Travis CI, which starts deploying the service: it uploads the zip folder to S3, and AWS CodeDeploy, configured alongside it, deploys the service on the EC2 instance.

Once Travis CI has successfully built the service and AWS CodeDeploy has finished deploying/starting the Docker containers on the EC2 instance, go to the dataingestorworker branch, make a change to ‘Edit Me.txt’, and commit it. This triggers Travis CI.

After a successful build, once AWS CodeDeploy has finished deploying/starting the Docker containers on all 3 EC2 instances, go to the Feature-StormDetector branch, make a change to ‘Edit Me.txt’, and commit it. This triggers Travis CI.

After a successful build, once AWS CodeDeploy has finished deploying/starting the Docker containers on all 3 EC2 instances, go to the Feature-StormClustering branch, make a change to ‘Edit Me.txt’, and commit it. This triggers Travis CI.

After a successful build, once AWS CodeDeploy has finished deploying/starting the Docker containers on all 3 EC2 instances, go to the forecasttriggerworker branch, make a change to ‘Edit Me.txt’, and commit it. This triggers Travis CI.

The Docker container for each service is deployed on its assigned EC2 instances according to its deployment group: the APIGateway is deployed on one instance, while all the other services are deployed on the other instances.

NOTE: DO NOT MAKE A CHANGE TO ‘Edit Me.txt’ UNTIL THE PREVIOUS SERVICE HAS BEEN SUCCESSFULLY BUILT AND DEPLOYED BY TRAVIS AND CODEDEPLOY.

Step 2:

At the end of this cycle, the entire system is up and running on the AWS EC2 instances. You can visit the service at: http://ec2-35-161-35-175.us-west-2.compute.amazonaws.com:8081/apigateway/jsp/login.jsp

The user credentials for authentication are:

Username : Teamaurora

Password : Teamaurora

Alternatively, the user can choose “Sign in with Google”.

Requests are routed to services on the basis of the load on each instance.

Testing:

1]. Go to the console of any of the instances and stop the Docker container for any service:

   #docker stop <container id>

Now submit a job from the browser; it should complete successfully.

2]. Go to the AWS EC2 dashboard and stop any one of the instances.

Now submit a job from the browser; it should complete successfully.

3]. Upgrade any of the services from GitHub. This deploys the service on the EC2 instances one at a time, so there is no downtime even while upgrading. You can verify this by submitting a new job from the browser while the micro-service is still deploying.

References:

[1] http://www.rabbitmq.com/tutorials/tutorial-two-python.html

[2] https://en.wikipedia.org/wiki/RabbitMQ