Final Report

PDF:

PDF version with images is available at https://goo.gl/o9wFXh

Load Balancing Django Portal

What is Load Balancing?

The project involved setting up a load balancer for Apache Airavata’s Django Portal.

A load balancer improves the distribution of workload across multiple systems, with the aim of maximizing the resource usage and throughput of the cluster. Load balancing can be broadly classified into two types: software and hardware load balancing. Take google.com as an example: Google receives millions of concurrent connections, which no single server can manage, so the IP of google.com is mapped to thousands of web servers that all serve the same page. A load balancer thus lets a website be served by multiple web servers, which keeps response times reasonable when the site has a high amount of traffic.

Hardware load balancing requires a dedicated machine that controls the traffic directed to a website. Software load balancing, on the other hand, uses third-party software to route the traffic. This project deals with setting up a software load balancer. There are also different algorithms for distributing the load; some of the major ones are round robin, weighted round robin, least connections, and least response time. In the round-robin approach, traffic is rotated among the servers in a fixed order. Weighted round robin is similar, except that some servers are given a higher priority. With least connections, traffic is routed to the server with the fewest active connections, and with least response time, traffic is sent to the server that is currently responding fastest.
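As an illustration of how two of these strategies pick a server, here is a minimal Python sketch; the server names and connection counts are made up for the example.

```python
from itertools import cycle

# Three hypothetical portal servers behind the load balancer.
servers = ["portal-1:8000", "portal-2:8000", "portal-3:8000"]

# Round robin: rotate through the servers in a fixed order.
round_robin = cycle(servers)
print(next(round_robin))  # portal-1:8000
print(next(round_robin))  # portal-2:8000

# Least connections: pick the server with the fewest active connections.
active_connections = {"portal-1:8000": 12, "portal-2:8000": 3, "portal-3:8000": 7}
print(min(active_connections, key=active_connections.get))  # portal-2:8000
```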

What were the solutions considered?

Nginx:

Nginx was initially released as a web server and proxy server and is one of the most popular web servers in the market. It has many major capabilities, including load balancing, and comes in an open source version and a commercial edition. The open source version of Nginx does not provide monitoring. If load balancing is the only concern, Nginx might not be the best choice.

HAProxy

HAProxy is an open source load-balancing solution. One major advantage is that its only purpose is load balancing, and it is fairly easy to set up and configure. There are also a large number of articles and tutorials for setting up HAProxy quickly. It supports both layer 4 and layer 7 load balancing, and it provides a stats page so that you can monitor the servers. However, it is not a web server.

Ribbon

Ribbon came out of Netflix’s labs and is one of the best load balancers for a microservice architecture. It uses Eureka, a service discovery tool. It also provides the advanced feature of zone affinity, that is, it favors the zone in which the calling service is hosted, reducing latency and saving costs. As Ribbon is the newest of the options considered, less documentation is available and information for troubleshooting issues is hard to find.

Service Discovery

Service discovery is the automatic detection of devices and services on the network. In our implementation, a system can register itself to the cluster once it is up; however, if a server goes down, there is no provision to bring up a replacement server automatically. That is, the current cluster is not system aware.

What are the solutions considered?

Eureka

Eureka is a service discovery solution that came out of Netflix’s labs. It provides a weakly consistent view of services, using best-effort replication: when a client registers with a server, that server attempts to replicate the registration to the other servers but provides no guarantee.

Zookeeper

ZooKeeper has server nodes that require a quorum to operate. It is highly flexible for building service discovery systems, but because the discovery layer has to be built on top of it, it can be hard for new learners to pick up. In addition, even the health checking has to be implemented by hand, and it may not be as effective as in other systems.

Consul

Consul is provided by HashiCorp. It has server nodes and requires a quorum of nodes to operate. In addition, it offers very effective health checking, a key-value store, and a great user interface. Unlike Eureka, it provides strong consistency, as the servers replicate state using the Raft protocol.

After careful consideration, and due to its ease of implementation, Consul was chosen as the solution.

Major Components in Architecture

Portal Server

The portal server runs an instance of the Django portal on port 8000 as well as a consul agent on the same machine. The consul agent registers itself with the consul servers by reading the configuration file at “/etc/consul.d/client/data.json”. The agent keeps track of the state of the Django portal with a TCP check on port 8000.
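For reference, a client configuration along these lines registers the portal service and attaches the TCP check. This is only a sketch: the service name, check interval, and other fields are assumptions rather than the exact contents of the project’s data.json.

```json
{
  "service": {
    "name": "django-portal",
    "port": 8000,
    "check": {
      "tcp": "localhost:8000",
      "interval": "10s"
    }
  }
}
```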

Consul Server

In order to support leader election, there is always an odd number of consul servers; in our implementation we have 3. The configuration for a consul server is located at “/etc/consul.d/server/data.json”. Every consul server exposes a web interface through which you can see the status of the nodes and services.
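A consul server configuration of roughly the following shape gives the behaviour described above (three servers electing a leader, with the web UI enabled); the datacenter name and data directory are assumptions, not the project’s exact settings.

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "datacenter": "dc1",
  "data_dir": "/var/consul",
  "ui": true,
  "client_addr": "0.0.0.0"
}
```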

Load Balancer Server

The load balancer server has HAProxy installed, along with a consul agent and consul-template running on the same machine. The consul agent registers itself with the consul cluster. Consul-template renders HAProxy’s configuration file “/etc/haproxy/haproxy.cfg”, updating it whenever a change in the portal servers occurs. Consul-template talks to the consul agent running locally, and its template file is located at “/home/consul/haproxy.ctmpl”.
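A consul-template file for HAProxy typically looks something like the sketch below, where the range block expands to one server line per healthy portal instance. The frontend/backend names and the “django-portal” service name are assumptions, not the exact contents of haproxy.ctmpl.

```
frontend http_front
    bind *:80
    default_backend django_portal

backend django_portal
    balance roundrobin{{range service "django-portal"}}
    server {{.Node}} {{.Address}}:{{.Port}} check{{end}}
```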

Implementation Example

In the following example, the Django portal runs on 3 portal servers. The generated HAProxy configuration file and the status of the nodes as seen from one of the consul servers are shown in the PDF version linked above.
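For illustration, the backend section rendered from such a template for three portal servers would look roughly like this (the server names and IP addresses are placeholders):

```
backend django_portal
    balance roundrobin
    server portal-1 10.0.0.11:8000 check
    server portal-2 10.0.0.12:8000 check
    server portal-3 10.0.0.13:8000 check
```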

Automating Cluster Implementation

Now imagine a case where the entire cluster goes down, or a new node comes up: how would you configure it as, say, a load balancer server, a consul server, or a portal server? Setting up a cluster once is already a challenging task, but we also need to consider failures. Setting it all up again by hand is time consuming, and we could even lose users accessing the cluster in the meantime. Hence, we felt the need to automate the cluster setup.

What are the current automation solutions available?

Puppet:

Puppet is built on Ruby and provides a custom domain-specific language. It is a particularly useful, stable, and mature solution for enterprises. The initial setup is smooth and a variety of operating systems are supported. However, it is not a great solution if you need to scale deployments.

Chef:

Chef uses a client-server architecture, similar to that of Puppet, and offers configuration through a DSL. It is designed for programmers and has very strong documentation available. It is highly stable and reliable. However, the initial setup is complicated and comes with a steep learning curve.

Ansible:

Ansible is the latest addition to configuration management, designed mainly to simplify orchestration and configuration management. It is also considered secure, as it works over SSH, and it is easy to set up and easier than the others to pick up. However, because it uses SSH, communication can be slower.

SaltStack:

SaltStack was designed to enable low-latency, high-speed communication for data collection and remote execution. It scales well and has a great user community. However, the installation process can be difficult and the documentation is not that great either.

We chose Ansible as our solution because it has a fairly easy learning curve, is secure, and is gaining popularity.

Ansible Architecture

Ansible can be run from any machine, even your local machine, and works with the help of a file named hosts. The hosts file contains the IP addresses of the machines along with the tags associated with them. Our current setup also requires a serverlist.txt file, used when generating the consul configuration files; it contains the IP addresses of the consul servers.
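For illustration, the inventory might be laid out roughly as follows; the group names and addresses are placeholders, and the real tags are the ones defined in ansible/hosts.

```ini
[consul-servers]
203.0.113.11
203.0.113.12
203.0.113.13

[django-servers]
203.0.113.21
203.0.113.22
203.0.113.23

[haproxy]
203.0.113.31
```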

The consul configuration files are generated by Python scripts. There are three generator scripts: one for the consul server configuration (consul_server.py), one for the consul agent on the portal servers (consul_agent.py), and one for the consul agent running alongside HAProxy (consul_ha_agent.py). There are also three Ansible roles to install and configure the consul servers, the Django servers, and the HAProxy server. Running “ansible-playbook playbook.yml” creates the entire cluster. However, before you start, you may have to add settings_local.py to the consul-cluster role, with the IP of the HAProxy server added to the allowed hosts.
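As a rough idea of what one of these generators does, the sketch below reads serverlist.txt and writes a consul client configuration. It is a hypothetical reconstruction, not the actual consul_agent.py, and the service name and check interval are assumptions.

```python
# Hypothetical sketch of a consul agent config generator; the real
# consul_agent.py in the repository may differ.
import json

# serverlist.txt holds one consul server IP per line.
with open("serverlist.txt") as f:
    consul_servers = [line.strip() for line in f if line.strip()]

config = {
    "retry_join": consul_servers,  # consul servers to join on startup
    "service": {
        "name": "django-portal",   # assumed service name
        "port": 8000,
        "check": {"tcp": "localhost:8000", "interval": "10s"},
    },
}

# Write the configuration where the consul agent expects it.
with open("/etc/consul.d/client/data.json", "w") as f:
    json.dump(config, f, indent=2)
```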

Step by Step guide to build the Cluster

  1. Before you begin, install Ansible on your local machine or on the machine from which you will run the playbooks.
  2. Clone the Django portal or get the ansible directory from the load-balancing branch.
  3. Launch 6 or more empty CentOS 7 servers on TACC (1 HAProxy instance, 3 consul server instances, and any number of Django instances).
  4. Ensure that the login user on TACC is centos, that the key of your local machine (or the machine from which you are running Ansible) is added to the hosts, and that centos can sudo without a password. (Tip: choose the JS-API-Featured-Centos7-Oct-21-2017 8 GB image.)
  5. Add the Apache Airavata Django portal's settings_local.py to ansible/roles/consul-client/files/settings_local.py.
  6. Ensure that the load balancer server's public IP is in the allowed hosts of settings_local.py (see the sketch after this list).
  7. Add the private IPs of the consul servers to ansible/roles/basic/files/serverlist.txt.
  8. Add all the public IPs of the servers under the appropriate tags in ansible/hosts.
  9. Run ansible-playbook playbook.yml.
  10. If there are no firewall issues, the Django portal should show up when you access the IP of the HAProxy server.
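For step 6, the relevant part of settings_local.py is Django's ALLOWED_HOSTS setting; a minimal sketch, with a placeholder IP standing in for the load balancer's public IP:

```python
# settings_local.py (excerpt); replace the placeholder with the
# load balancer server's actual public IP.
ALLOWED_HOSTS = ['203.0.113.31', 'localhost']
```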

Troubleshooting

Ports to whitelist: 22, 53, 80, 1936, 5000, 8000, 8300, 8301, 8400, 8500, 8600 (it is better to open both TCP and UDP for these).

Check if the Django portal server is running: http://django-portal-IP:8000

Check if the consul servers are running: http://consul-server-IP:8500
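If firewalld is blocking one of these ports on a CentOS 7 node, opening it looks roughly like this (shown for 8500; repeat for the other ports as needed). These are standard firewall-cmd commands, not project-specific scripts.

```sh
sudo firewall-cmd --permanent --add-port=8500/tcp
sudo firewall-cmd --permanent --add-port=8500/udp
sudo firewall-cmd --reload
```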

Jira Issue

https://issues.apache.org/jira/browse/AIRAVATA-2533

Thread Associated with Issue:

Working on Load Balancers

Commits Associated with Issue

4e1692a: Removing Garbage Data

346186c: Removing Unwanted File

0c94ac0: Production Release of Ansible

69ac1e8: BugFix: Removed Public IP's

9745983: Feature: switching to TCP check

68081ba: BugFix: Configuring to listen on all ports

7820aff: BugFix: Consul Template

b2fc927: BugFix Hostname

e168338: BugFix ansible configurations

8cc8b08: BugFix

ec10d5a: ansible conf w/o haproxy

331f6c9: Bugfix Conf Generator for HaProxy

8dd98cd: Sample test app

8b9617c: Adding Server list to Consul Agent

c1a4404: Consul Client

499b645: Adding Server from Server List

2dc494f: Bugfix Server Command

7e7542c: Bugfix for ui

c0979cd: Ansible roles

68e3cda: Merge branch 'load-balancing' of https://github.com/jerrin92/airavata-django-portal into load-balancing

6a7d6db: Ansible Configurations

93f0766: Get current consul active servers

b8556a7: Installation scripts

a625576: HA Proxy generator

e249ecd: Consul Startup Script

4960b10: Automated Python Scripts

da3bbdb: Load Balancer Configs

7f44f97: Load Balancer config

Future Implementation and Further Works

As can be seen from the report, the current system is not system aware. If, for example, one of the servers goes down, we have to manually spin up a new node, run the playbook associated with that server, and delete the old instance that failed. This is one of the major areas to improve upon. In addition, if we are able to containerize the Django portal, Kubernetes could replace the setup described above quite effectively. Furthermore, the current setup uses only one load balancer, which thereby becomes a single point of failure; for the system to be production ready, we will have to set up a floating IP.

Though these are concerns and future work that remain, the current setup works without much hassle.