Redundancy and High Availability - barialim/architecture GitHub Wiki

Table of Contents

What is DR, FT and HA?

Overview

We often think about what can happen if we lose a server, a router, an application, or one of its components. So we have to think about Redundancy and Fault Tolerance.

Redundancy and Fault Tolerance

  • Maintain Uptime

Redundancy and FT are very similar ideas: both keep things up and running and maintain uptime, so the organization continues to function.

  • No hardware failure

We need to make sure that a hardware failure doesn't take us down. We may want to have a redundant server, or, within a single server, a redundant power supply.

  • Redundant hardware components

    • Multiple devices, load balancing power
  • Redundant Array of Independent Disks (RAID)

    • The RAID methodology means that if we lose one disk, we have options to keep the system up and running without anybody knowing there was a problem with that piece of hardware.
  • Uninterruptible Power Supplies (UPS)

    • Prepare for power disconnections
    • If your system happens to be somewhere with unreliable power, a UPS can help: if we ever lose power, these UPS systems have batteries (or some other energy source) inside to keep the system running.
  • Clustering

    • A logical collection of servers working together to avoid downtime
    • If you want to be sure that the resources running on a server stay available, you may want to consider clustering a number of servers together. That way, if you lose a motherboard, or a piece of software on a system fails, the extra systems in your cluster keep everything up and running. And since all of the clustered machines talk to each other, they know if there's an outage, and they'll be able to take over those resources and make sure everybody can run all of the systems they need to run.
    • Different types of Clustering:
      • Active/Active clustering - all nodes in the cluster constantly communicate with each other, and behind them they may share storage. If one node goes down, traffic is seamlessly served by the next available node in the cluster.
      • Active/Passive clustering - a little bit different: one node is always active and one node is always passive. The passive node sits idle, exchanging heartbeats with the active node and waiting for a problem to occur. If node 2 notices that node 1 has disappeared (the active node is no longer there), node 2 automatically makes itself available to the world and starts serving the traffic.
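The active/passive behaviour above can be sketched as a tiny health-check script (illustrative only; a real deployment would use a cluster manager such as Pacemaker or keepalived with proper heartbeats over the network):

```shell
# Hypothetical health check for node 1; here we simulate the active
# node being down (a real check would ping or probe a service port).
node1_alive() {
  return 1
}

# Node 2 starts out passive and promotes itself if node 1 disappears.
role="passive"
if ! node1_alive; then
  role="active"
  echo "node2: active node lost, taking over traffic"
fi
```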
  • Cold spare

    • An additional spare server that you bought and keep in storage. You might have 10 servers sitting in the rack; if any of those 10 servers fails, you can go to the store room, pull out the cold spare, put it in the rack, and configure it for use.
  • Warm spare

    • You may want a warm spare, which might already be in the rack. You occasionally turn it on and update it with the latest software and your configurations. That way, if you do have a problem, you simply flip the switch and turn it on; the warm spare is ready to go, and you don't have to load additional configuration or software for it to function.
  • Hot spare

    • It's always on, always running, and always updated. In many cases it is designed to take over automatically should there be a problem. So if you do have a problem with a system, you can immediately move to the hot spare: an exact, up-to-date duplicate that everybody can start using.
  • Load Balancing

    • Shared load across components
    • We'll often see systems load balancing. It's very important, if you have multiple systems in place, to have all of them running all the time so that you're balancing the load between them; if you lose one, everybody fails over to the others.
  • No Software failure

We also need to think about the software we're running on these systems. We may want software that is able to notify us whenever there is a problem, or that works in conjunction with other pieces of software, perhaps in a cluster: if one software node fails, other instances running on the same network can pick up the slack.

  • No system failure

We should also think about major system problems. Maybe we would like to have a redundant router, a redundant firewall, or a redundant WAN link to the internet.

High Availability

Now, just because you've got multiple servers or system redundancy doesn't always mean your environment is Highly Available.

HA means that your system will always be available regardless of what happens.

With Redundancy, you may have to flip a switch from one server to the other, or power up a new system, to make that system available.

HA is generally considered to be Always on, always available.

If you've multiple HA systems and you lose one, it doesn't matter: everybody continues to run, because you've got an extra system to take up the extra load associated with that resource.

There may be many different components working together to make this happen (one component depends on another for the system to work end to end), so watch out for Single Points of Failure (SPOFs).

What is Disaster Recovery

Disaster Recovery comes into the picture when there is a calamity, e.g. a flood, an earthquake, or generally some form of natural disaster. When a disaster happens, the chances are that the whole region gets affected.

So we need to think about how we recover the infrastructure and the people.

Generally, when a disaster happens, the cost is very high and the recovery time is long too.

What is Recovery Time

Recovery Time, in the context of DR, is the time it takes to recover the environment in a different data center/region.

With just a DR configuration, the recovery time is typically high.

Redundancy

  • Geo-Redundancy

Fault Tolerance

Backups and Restore

When we're talking about a Disaster Recovery Plan and backups, it's very important for us to consider things such as…

  • RPO: How much data can I afford to lose?
  • RTO: How long can I afford to be down?


Recovery Point Objective (RPO)

RPO takes into consideration that we're taking some sort of nightly backup, or maybe a backup every 2 hours, or perhaps using replication technology.

How much data, expressed as time, are you willing to lose in the event of a disaster?

Say your RPO = 1 hour's worth of data. There is a disaster and your primary server is destroyed. However, your standby DB was updated within the last 5 minutes. This is within the 1-hour RPO, as potentially only 5 minutes' worth of data is lost.

Recovery Time Objective (RTO)

RTO takes into consideration the following:

  • How long can we afford to be down?
  • How quickly can I bring the systems back up, running and functional for users?
  • Do I have a redundant Server handy?
  • Do I have a redundant Data Centre?
  • Am I taking advantage of VM fault tolerance?

How much time you’ve before the:

"System is UP -> & Running -> & Functional for Users"

Say your RTO = 5 hours, and disaster strikes. If the system is back up and running in 3 hours, then you have met your RTO.
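The two checks above can be sketched as a small script, using the hypothetical numbers from the examples (RPO target of 1 hour, RTO target of 5 hours):

```shell
# Targets agreed with the business (hypothetical values).
rpo_target_min=60   # willing to lose at most 60 minutes of data
rto_target_hr=5     # willing to be down at most 5 hours

# Measured values at the time of the disaster.
data_loss_min=5     # standby DB was 5 minutes behind
downtime_hr=3       # system restored in 3 hours

[ "$data_loss_min" -le "$rpo_target_min" ] && rpo_status="met" || rpo_status="missed"
[ "$downtime_hr"   -le "$rto_target_hr"  ] && rto_status="met" || rto_status="missed"

echo "RPO $rpo_status, RTO $rto_status"
```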

Tools for backup and replication

There are various tools around that can be used for copying or synchronizing files from one machine to another in order to enable high availability and redundancy.

Some filesystem sync/backup tools: Rsync, Lsyncd, Synchronize, FreeFileSync, GoodSync, SyncBack, Unison, SyncToy, Toucon, etc.

Here are some of these tools in more detail, of which Rsync is the default synchronization method:

Rsync

Rsync (remote sync) is a cross-platform utility for efficiently transferring and synchronizing files between a computer and an external drive, or across networked computers, by comparing the modification times and sizes of files.

  • Rsync is suited to one-way file synchronization, replicating one or more directories on a single server to other servers.
  • Rsync uses a delta-transfer algorithm which provides a very fast method for bringing remote files into sync.
  • Rsync normally uses ssh or rsh for communication with remote systems.

Finally, we use cron to schedule our backups. Why? Because backups are SERIOUS BUSINESS.
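For example, a hypothetical crontab entry (installed with `crontab -e`; the host and paths are placeholders) that runs a nightly backup at 02:30:

```
30 2 * * * rsync -az --delete /srv/data/ backup@dr-host:/srv/data/ >> /var/log/rsync-backup.log 2>&1
```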

Use cases

  • Automated backups to disks or remote servers.
  • Synchronization of files and remote directories.
  • General transfer of files.

Advantages

  • It is cross-platform, runs on any OS
  • It is faster than scp (secure copy) because rsync uses a delta-transfer protocol that sends only the differences between files, much like Git: the first time it transfers the whole file, and on subsequent runs only the changes.
  • Files are transferred through an SSH channel, which means data is encrypted while being backed up. Another benefit is that credentials are protected even when we do not use public keys.
  • It is designed for remote backups.
  • It supports copying all file attributes, for example owner, group, permissions, and soft/hard links.
  • It uses less bandwidth when using compression while sending files.

Disadvantages

  • It can consume more system resources, and the transfer may not be as fast when transferring a large number of small files.
  • Files in the backup can be edited; the backup is not immutable.
  • User and group IDs may not match on the destination server.

Reference: R1, R2, R3

Lsyncd

Lsyncd (Live Syncing Daemon) is a synchronization daemon based primarily on rsync. It runs on the "master" server and can sync/mirror any file or directory changes within seconds to your "slave" servers; you can have as many slave servers as you want. Lsyncd constantly watches a local directory tree through an event-monitor interface (inotify or fsevents). It aggregates and combines events for a few seconds and then spawns one or more processes to synchronize the changes.

By default lsyncd uses rsync to send the data over to the slave machines, though there are other ways to do it. It does not require you to build new filesystems or block devices, and it does not harm your server's I/O performance.

  • Lsyncd 2.2.1 requires rsync >= 3.1 on all source and target machines.
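A minimal lsyncd configuration might look like the following (hypothetical paths and hostname; consult the lsyncd manual for the options available in your version):

```lua
settings {
    logfile    = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd.status",
}

sync {
    default.rsyncssh,              -- use rsync over ssh to the slave
    source    = "/srv/data",
    host      = "slave.example.com",
    targetdir = "/srv/data",
    delay     = 5,                 -- aggregate events for up to 5 seconds
}
```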

Use cases

  • Lsyncd is designed to synchronize a slowly changing local directory tree to a remote mirror.
  • Lsyncd is especially useful for syncing data from a secure area to a not-so-secure area.
  • Lsyncd also allows monitoring the files in a directory.
  • With extra configuration, lsyncd can approximate automatic two-way file synchronization between two hosts (say drive A as the primary and drive B as the secondary), though it is not primarily designed for that.

Advantages

  • Lsyncd monitors a local filesystem for changes and mirrors those changes to a filesystem on a remote server. The advantage of lsyncd over rsync is the former’s ability to detect filesystem changes without having to re-scan the entire source and target directory structures. This makes lsyncd faster and more efficient. And you can still use rsync with lsyncd to handle the actual data transfer.
  • It is a quite robust solution against network disturbances, though not completely crash-proof.
  • On the one hand, lsyncd generates much less load (I/O and CPU) on the system; on the other hand, the possible update intervals are significantly shorter than with periodic rsync runs.

Disadvantages

Snapmirror

Disaster Recovery is driven by the Recovery Point Objective (RPO): you maintain a secondary copy to which you can fail over to continue business operations. Failover is the process of transferring mission-critical workload from the production data center to the disaster recovery data center. This process can be manual or scripted.

Snapmirror is an easy-to-use, cost-effective replication solution for high availability of your mission-critical application data. You can replicate data over the LAN or WAN to on-premises systems or directly to the cloud, or even archive to tape.

Benefits

  • reduce bandwidth utilization and storage footprint by using built-in storage efficiencies like deduplication, compression, and compaction to accelerate data transfer and reduce network bandwidth utilization by up to 70%.
  • reduce management overhead through simplified data protection with easy setup and configuration.

Features

  • Snapmirror is configured as read-write (RW) on the source and read-only (RO) on the destination.
  • Snapmirror is not a backup solution; it is a disaster recovery solution. See Scenario for more info.
  • A Snapmirror synchronous relationship is very expensive in terms of performance.

Scenario

NetApp1 is our Primary site, used for business-critical applications. NetApp2 is our DR site, in a different physical location; it is not actively serving data, but it is kept online and as up to date as possible in case of disaster.

  • Snapmirror is not a backup solution, instead, it's a disaster recovery solution.
    • why & how?
      1. removing a file from /source means the file is also removed in /destination, because, according to your schedule, all info from /source is replicated to /destination. The old files you had in /destination will be replaced or removed.
      2. another reason is that once we lose access to /source, for example because of a disaster at our Primary production site (a power failure, or some other issue), we will not be able to access /source. However, we will still have our Snapmirror /destination, which is read-only, so we can still access our share on /destination. We can then break the Snapmirror relationship between /destination and /source, which makes the destination writable. This allows us to point our business-critical application at the /destination volume as read-write until we bring the Primary production site back up; that essential part of our DR and Business Continuity Plan is now complete. At this point we have completely recovered from the disaster, even though our Primary production site is still down.

Let's now assume our Primary site is fixed.

  1. Bring the volume on the Primary site back online.
  2. To switch the business-critical application back over to the Primary share (/source), perhaps because the Primary site is faster or more reliable, do the following:
  3. Move all the changes from the DR share (which is our source now) over to the Primary share (which is the destination now); this is done from the Primary site.
  4. Run snapmirror resync -S secondary primary from the Primary site to replicate all the info from the DR site. One thing to note: this will overwrite data on the Primary site from the beginning up to when the disaster occurred, so back that data up somewhere else first.
  5. Stop the application and unmount the share from the secondary site.
  6. Update the relationship on the Primary site from the secondary site.
  7. Now that the Primary site is up to date, break the Snapmirror on the Primary, which makes the Primary writable.
  8. Recreate the Snapmirror relationship from the secondary site using snapmirror resync to reactivate the secondary as the DR site.
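The failback steps above can be sketched as the following command sequence (illustrative 7-Mode-style syntax with placeholder system and volume names; check the Snapmirror documentation for the exact syntax on your ONTAP release):

```
# On the Primary site: pull the changes made on the DR site during the outage.
# Note this overwrites Primary data from before the disaster, so back it up first.
snapmirror resync -S netapp2:vol_dr netapp1:vol_prod

# Stop the application, unmount the share on the secondary site,
# then perform a final update so the Primary has the latest data.
snapmirror update netapp1:vol_prod

# Make the Primary writable again and repoint the application at it.
snapmirror break netapp1:vol_prod

# On the secondary site: re-establish the original direction,
# making the secondary the read-only DR destination again.
snapmirror resync -S netapp1:vol_prod netapp2:vol_dr
```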

This demonstrates going from normal status to disaster, and from disaster back to recovery.

Reference: R1, R2, R3

Points to consider prior to setup

  • Make sure that your file system structure and your project structure remain consistent for every instance/snapshot/release of your project. If you change something on one side, it will affect the other as a consequence.
  • Apart from access to the folders of your project, your system account will need to be a sudoer (sudo: superuser do, or saying pleeeaseee in Linux lingo :) ). You will also need access to a folder that resides outside of the project that will be synced. The latter is needed to make lsyncd behave as close to a 2-way sync as possible and avoid an endless sync loop.
  • Both servers should be accessible via ssh, meaning that service sshd should be running.

⚠️ NOTE: Please remember that lsyncd is not primarily designed for a 2-way sync and this is as close as it can get to being one.

Reference: R1, R2, R3

Root Cause Analysis

RCA is a process for identifying the "root causes" of problems or events, such as an application or data-storage outage.

RCA helps with identifying the problem but, more importantly, with finding ways/countermeasures to prevent it from happening in the future.

Primary Goals of RCA

The primary goal of using RCA is to analyze problems or events to identify:

  • What happened
  • How it happened
  • Why it happened…so that
  • Actions for preventing recurrence are developed

Benefits

Implementing RCA will help an organization:

  • Identify barriers and the causes of problems, so that permanent solutions can be found.
  • Develop a logical approach to problem-solving, using data that already exists in the organization.
  • Identify current and future needs for organizational improvement.
  • Establish repeatable, step-by-step processes, in which one process can confirm the results of another.

How do you get RCA to work effectively


Basic method to use

  • Define the problem.
  • Gather information, data and evidence.
  • Identify all issues and events that contributed to the problem.
  • Determine root causes.
  • Identify recommendations for eliminating or mitigating the recurrence of problems or events.
  • Implement the identified solutions.

How to Run a 5 Whys Meeting

  • Step 1: Organize your 5 whys meeting
  • Step 2: Define the Problem Statement
  • Step 3: Ask the first "Why?"
  • Step 4: Ask "Why" four more times
  • Step 5: Determine your countermeasures
  • Step 6: Assign responsibilities
  • Step 7: Monitor progress
  • Step 8: Close the meeting

Terminology

  • Business Continuity Management (BCM):
  • Business Continuity Plan (BCP):