Azure cache for Redis - barialim/architecture GitHub Wiki

Overview

Azure Redis HLD

Azure Cache for Redis is a high-performance, fully managed, in-memory data caching service with built-in high availability (HA). It delivers low latency and high throughput by reducing the need to perform slow I/O operations.

The goal of HA architecture is to ensure that your managed Redis instance is functioning even when its VMs are impacted by planned or unplanned outages.

Azure Cache for Redis implements HA by using multiple VMs, called "nodes", for a cache. It configures these nodes so that data replication and failover happen in a coordinated manner. It also orchestrates maintenance operations such as Redis software patching.

Available HA options in various tiers

HA options

  • Standard Replication

    • Azure Cache for Redis in the Standard/Premium tier runs on a pair of Redis servers by default. The two servers are hosted on dedicated VMs.

    • Open-source Redis allows only one server to handle data write requests. This server is the primary node, while the other is the replica.

    • After it provisions the server nodes, Azure Cache for Redis assigns primary and replica roles to them.

    • Primary Node: is responsible for servicing write as well as read requests from Redis clients. On a write operation, it commits a new key or a key update to its internal memory and replies immediately to the client. It then forwards the operation to the replica asynchronously.

    • Replica Node: holds a second copy of the cache data, which it receives from the primary through this asynchronous replication.

    • Redundancy and Resiliency: If the primary node in a Redis cache is unavailable, the replica promotes itself to become the new primary automatically. ⭐ This process is called a failover. The replica waits for a sufficiently long time before taking over, in case the primary node recovers quickly. When a failover happens, Azure Cache for Redis provisions a new VM and joins it to the cache as the replica node. The replica performs a full data synchronization with the primary so that it has another copy of the cache data. For more, see Failover and patching for Azure Cache for Redis.

    • Additional/Multi-Replica Nodes: Azure Cache for Redis allows additional replica nodes in the Premium tier. A multi-replica cache can be configured with up to three replica nodes. Having more replicas generally improves resiliency because of the additional nodes backing up the primary. Even with more replicas, an Azure Cache for Redis instance still can be severely impacted by a datacenter- or AZ-wide outage. You can increase cache availability by using multiple replicas in conjunction with zone redundancy.

      standard replica

      ⚠️Note: Normally, a Redis client communicates with the primary node in a Redis cache for all read and write requests. Certain Redis clients can be configured to read from the replica node.
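The replication and failover behaviour described above can be sketched as a toy, in-memory model. This is only an illustration under stated assumptions: the `Node` and `ReplicatedCache` classes and the node names are hypothetical and are not part of Azure Cache for Redis or any Redis client library. The sketch shows why a write that was acknowledged by the primary but not yet forwarded to the replica can be lost on failover.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One cache node holding its own in-memory copy of the data."""
    name: str
    data: dict = field(default_factory=dict)
    available: bool = True


class ReplicatedCache:
    """Toy model of a two-node cache: one primary and one replica."""

    def __init__(self) -> None:
        self.primary = Node("node-0")
        self.replica = Node("node-1")
        self._pending = []  # writes acknowledged but not yet replicated

    def write(self, key: str, value: str) -> str:
        # The primary commits to its own memory and replies immediately.
        self.primary.data[key] = value
        self._pending.append((key, value))
        return "OK"

    def replicate(self) -> None:
        # Asynchronous step: forward the queued writes to the replica.
        for key, value in self._pending:
            self.replica.data[key] = value
        self._pending.clear()

    def fail_primary(self) -> None:
        # The replica promotes itself; writes that were never forwarded are lost.
        self.primary.available = False
        self._pending.clear()
        self.primary = self.replica
        # A new node joins as the replica and performs a full synchronization.
        self.replica = Node("node-2", data=dict(self.primary.data))


cache = ReplicatedCache()
cache.write("a", "1")
cache.replicate()
cache.write("b", "2")  # acknowledged, but not yet replicated
cache.fail_primary()
print(sorted(cache.primary.data))  # ['a']
```

In this model the key `b` disappears after the failover, which is consistent with the statement elsewhere on this page that the SLA does not cover data loss.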

  • Zone Redundancy

    • Azure Cache for Redis supports zone redundant configurations in the Premium and Enterprise tiers. A zone redundant cache can place its nodes across different Azure Availability Zones (AZs), that is, separate datacenters in the same region. It eliminates datacenter or AZ outage as a single point of failure and increases the overall availability of your cache.

    • Premium tier: The following diagram illustrates the zone redundant configuration for the Premium tier: zone redundant configuration

    • Distributed nodes in a Zone redundant cache: Azure Cache for Redis distributes nodes in a zone redundant cache in a round-robin manner over the AZs you've selected. It also determines which node will serve as the primary initially.

    • Redundancy & Resiliency: A zone redundant cache provides automatic failover. When the current primary node is unavailable, one of the replicas will take over. Your application may experience higher cache response time if the new primary node is located in a different AZ.

      • Azure Availability Zone (AZs): AZs are geographically separated. Switching from one AZ to another alters the physical distance between where your application and cache are hosted. This change impacts "round-trip" network latencies from your application to the cache. The extra latency is expected to fall within an acceptable range for most applications. We recommend that you test your application to ensure that it can perform well with a zone-redundant cache. When a failover happens, Azure Cache for Redis provisions a new VM and joins it to the cache as the replica node. The replica performs a full data synchronization with the primary so that it has another copy of the cache data.
    • Benefits

      • Azure Cache for Redis in the Premium tier gives you higher performance and throughput, lower latency, better availability, and more features.
      • Premium caches are deployed on more powerful VMs compared to those under Basic/Standard tier.
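As a rough illustration of the round-robin node distribution mentioned above, the following sketch assigns nodes to the selected zones in turn. The function name, node names, and zone labels are hypothetical; the actual placement logic inside Azure Cache for Redis is not documented here.

```python
from itertools import cycle


def place_nodes(node_count: int, zones: list[str]) -> dict[str, str]:
    """Assign node names to zones in round-robin order."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return {f"node-{i}": next(zone_cycle) for i in range(node_count)}


placement = place_nodes(4, ["AZ1", "AZ2", "AZ3"])
print(placement)
# {'node-0': 'AZ1', 'node-1': 'AZ2', 'node-2': 'AZ3', 'node-3': 'AZ1'}
```

With four nodes and three zones, one zone necessarily hosts two nodes, so losing that zone removes two copies of the data at once.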
  • Geo-replication

    • Geo-replication is a mechanism for linking two or more Azure cache for Redis instances, typically spanning two Azure regions.

    ⚠️ Note: Geo-replication in the Premium tier is designed mainly for DR.

    • Geo-replication is supported in Premium pricing tier.
    • Both the primary and secondary instances need to be in the same pricing tier.
    • If the primary node becomes unavailable for any reason (most likely due to planned maintenance), the replica promotes itself to become the new primary automatically. This process is called failover. The replica waits for a sufficiently long time before taking over, in case the primary node recovers quickly.
    • It also can stop working because of unplanned events such as failures in underlying hardware, software, or network.
    • An Azure Cache for Redis will go through many failovers during its lifetime. The high availability architecture is designed to make these changes inside a cache as transparent to its clients as possible.⭐
    • Data transfer between the two cache instances is secured by TLS.
    • "Multi-replica" Option

      • In addition, Azure Cache for Redis allows additional replica nodes in the Premium tier. A multi-replica cache can be configured with up to three replica nodes. Having more replicas generally improves resiliency because of the additional nodes backing up the primary. Even with more replicas, an Azure Cache for Redis instance still can be severely impacted by a datacenter- or AZ-level outage. You can increase cache availability by using multiple replicas in conjunction with zone redundancy (see above). ⭐
    • Limitations

      • Charges egress costs when data is replicated
    • Caveats

      • Both caches need to be in the same subscription.
      • Persistence isn't supported with geo-replication.
      • Scaling down your cache from a Premium to Basic/Standard pricing tier is NOT supported.
        • Scaling up from Basic to Premium is not supported. First scale up to Standard, then premium.
      • When a Basic cache is scaled to a new size, all data is lost and the cache is unavailable during the scaling operation.
        • You can't scale from a larger size down to the C0 (250 MB) size.
      • When a Basic cache is scaled to a Standard cache, the data in the cache is typically preserved.
      • While Standard and Premium caches have a 99.9% SLA for availability, there is no SLA for data loss.

      After geo-replication is configured, the following restrictions apply to your linked cache pair:

Pricing Tiers

Azure Cache for Redis is offered in several pricing tiers that differ in features, performance, and cost.

  • Standard: this tier offers an SLA and provides a replicated cache. The data is automatically replicated between the two nodes, which makes it suitable for production-level apps.
  • Premium: provides better performance, support for bigger workloads, enhanced security, and DR. Backups and snapshots can be created and restored in case of failures. It also offers cache persistence, which persists the data held in the in-memory cache. It also provides clustering, which automatically shards data across multiple nodes. This allows workloads with larger memory sizes and better performance.
  • Enterprise:

Supported Data Types

Azure Cache for Redis supports various data structures, such as Strings, Lists, Sets, and Hashes. For more, see Redis data types.

Features

Azure Cache for Redis has many features for management, performance, and high availability. Here are a few of the most important:

  • Fully Managed: ACFR is a fully managed version of the open-source Redis server, i.e., Azure monitors, manages, hosts, and secures the service by default.
  • High Performance: ACFR enables an application to stay responsive even as the user load increases. It does so by leveraging the low-latency, high-throughput capabilities of the Redis engine.
  • Geo-replication: ACFR allows replicating the cache across multiple regions. One cache is the primary, and the other caches act as secondaries. The primary cache has read and write capabilities, but the secondary caches are read-only. If the primary goes down, a secondary cache can be promoted to primary; in the Premium tier this failover is started manually by unlinking the secondary. The significant advantage of this is higher availability and reliability.
  • Cache Cluster: The cluster automatically shards the data in the cache across multiple Azure Cache for Redis nodes. A cluster increases performance and availability. Each shard node is made of two instances. When one instance goes down, the application still works because other instances in the cluster are running.
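To make the sharding idea concrete, here is a minimal sketch of how open-source Redis Cluster maps a key to one of 16384 hash slots using CRC16 (XMODEM variant), including the hash-tag rule. How Azure Cache for Redis assigns slot ranges to its shards is an assumption here: `shard_for` is a hypothetical helper that simply splits the slot range evenly.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in open-source Redis Cluster


def hash_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key (CRC16/XMODEM mod 16384)."""
    raw = key.encode("utf-8")
    # Hash tags: if the key has a non-empty {...} section, only that part is hashed.
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16/XMODEM checksum.
    return binascii.crc_hqx(raw, 0) % SLOT_COUNT


def shard_for(key: str, shard_count: int) -> int:
    """Hypothetical mapping: split the slot range evenly across the shards."""
    return hash_slot(key) * shard_count // SLOT_COUNT


print(hash_slot("123456789"))  # 12739
print(hash_slot("{user1}:cart") == hash_slot("{user1}:profile"))  # True
```

Keys that share a hash tag such as `{user1}` land in the same slot, which is what allows multi-key operations on related keys in a clustered cache.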

Geo-replication DR or HA

https://www.ais.com/azure-redis-cache-geo-replication-dr-or-ha/

Cache Refresh

In most cases, data held in a cache is a copy of data that's held in the original data store.

The data in the original data store might change after it was cached, causing the cached data to become stale (no longer fresh/up-to-date).

Many caching systems enable you to configure the cache to expire data, which reduces the period for which data may be out of date. Azure Cache for Redis offers this by setting an expiration on the data or through various eviction policies (the default is volatile-lru), so that stale entries are removed and reloaded from the original data store on the next read. See Memory management for more information.

⚠️ Note: Consider the expiration period for the cache and the objects that it contains carefully. If you make it too short, objects will expire too quickly and you will reduce the benefits of using the cache. If you make the period too long, you risk the data becoming stale.
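The trade-off between stale data and cache benefit can be illustrated with a small cache-aside sketch. It uses an in-memory stand-in rather than a real Redis connection, so `TTLCache` and `get_or_load` are hypothetical names chosen for this example, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for a cache with per-key expiration."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: dict[str, tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: treat it as a miss
            return None
        return value


def get_or_load(cache: TTLCache, key: str, loader: Callable[[str], str],
                ttl_seconds: float) -> str:
    """Cache-aside read: serve from the cache, else reload from the data store."""
    value = cache.get(key)
    if value is None:
        value = loader(key)
        cache.set(key, value, ttl_seconds)
    return value


now = [0.0]  # fake clock so the example does not need to sleep
cache = TTLCache(clock=lambda: now[0])
store = {"price": "10"}

first = get_or_load(cache, "price", store.__getitem__, ttl_seconds=60)
store["price"] = "12"  # the original data store changes
stale = get_or_load(cache, "price", store.__getitem__, ttl_seconds=60)
now[0] = 61.0  # the cached entry expires
fresh = get_or_load(cache, "price", store.__getitem__, ttl_seconds=60)
print(first, stale, fresh)  # 10 10 12
```

Against a real Redis server the same effect is achieved by setting an expiration when writing the key, for example with the `SET key value EX seconds` command.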

Security

Connect privately to cache

  • Does service support private link? 
    • Azure cache for Redis does support Private Link (connect privately from VNet to your cache instance offered as PaaS service) through Private Endpoint.
    • By using Private Link, you can connect to an Azure Cache for Redis instance from your virtual network via a private endpoint, which is assigned a private IP address in a subnet within the virtual network. Once a private endpoint is created, all access to your cache instance will be restricted to only connections using the private IP addresses when public network access is disabled.
  • Are there any scenarios where private link cannot be used? 
    • To use private endpoints, your Azure Cache for Redis instance needs to have been created after July 28th, 2020.
    • Currently, the following features are NOT Supported with Private Endpoint
      • geo-replication
      • firewall rules
      • portal console support
      • multiple endpoints per clustered cache
      • persistence to firewall and VNet injected caches
    • Network Security Group (NSG) is NOT enabled for Private Endpoint. see NSG enabled for Private endpoint

ℹ️ Source: can be found on Azure Private Link for Azure Cache for Redis in general availability

Encryption

  • Security and privacy of data in Azure cache for Redis is an important area of concern for organizations thinking of leveraging this service.
  • Azure in general offers several ways to encrypt data, depending on the services used.
  • We will discuss below some of those options in Azure Cache for Redis service.

In Transit

  • Does the service support encryption in transit?

    • Azure cache for Redis supports encryption in transit through SSL/TLS encrypted communication by default.
    • TLS versions 1.0, 1.1 and 1.2 are currently supported.
  • What are the encryption options?

    • Azure cache for Redis offers encryption in transit through TLS >= v1.0 by default, and if your client library or tool doesn't support TLS, then enabling unencrypted connections can be done through the Azure Portal or management APIs. In such cases where encrypted connections aren't possible, placing your cache and client application into a virtual network would be recommended. See secure your cache with a virtual network.
  • What are the limitations?

    • TLS 1.0 and 1.1 are on a path to deprecation industry-wide, so use TLS 1.2 if at all possible.
  • Microsoft gives customers the ability to use Transport Layer Security (TLS) protocol to protect data in transit, when it’s traveling between the cloud services and client applications. Microsoft datacenters negotiate a TLS connection with client systems that connect to Azure services. see Benefits of TLS
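A client can enforce the TLS 1.2 recommendation on its own side. The sketch below builds a Python `ssl` context that refuses older protocol versions. The cache host name is a hypothetical placeholder, and the port constant assumes the usual Azure Cache for Redis convention of 6380 for TLS and 6379 for unencrypted connections.

```python
import ssl


def build_tls_context() -> ssl.SSLContext:
    """Create a client TLS context that refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()  # verifies certificates and host names
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


# Hypothetical cache name; replace it with your own instance.
CACHE_HOST = "example-cache.redis.cache.windows.net"
TLS_PORT = 6380

context = build_tls_context()
print(context.minimum_version.name)  # TLSv1_2
```

To open a connection, the context would typically wrap a socket created with `socket.create_connection((CACHE_HOST, TLS_PORT))` via `context.wrap_socket(sock, server_hostname=CACHE_HOST)`, or be passed to a Redis client library that accepts TLS settings.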

At Rest

  • Does the service support encryption at rest?
    • On Azure Redis, all data stays in the Virtual Machine memory all the time.
    • Azure Cache for Redis is an in-memory data store, and data is not persisted by default. All data therefore stays in VM memory, and it is not possible to gain physical access to this fully managed VM to compromise the data. Encryption at rest becomes relevant only if you decide to persist your data outside the VM, which is available only in the "Premium" pricing tier and is provided through Azure Storage.
  • What are the encryption options?
    • Encryption at rest comes into picture if data is persisted, and not when data is stored in-memory on fully managed VM. Hence, Azure cache for Redis "only" supports encryption at rest when data is persisted under "Premium" pricing tier.
    • More on configuring data persistence in Azure cache for Redis under "Premium" pricing tier.
  • Why is it that data stored in-memory on VM cannot be compromised?
    • Any attempt to encrypt Redis data on the server side would still hold the keys and the decrypted data in the same VM memory, with the same exposure. For that reason, Redis encryption at rest for in-memory data is not implemented and not supported.
    • In any case, only the Redis process that owns a memory segment can access it, which keeps the data private with no possibility of external access. The operating system guarantees this isolation.
    • Each process on Windows has a virtual address space and all threads of a process can access its virtual address space. However, threads cannot access memory that belongs to another process, which protects a process from being corrupted or data read by another process.
  • What are the limitations (e.g. cannot encrypt metadata in a control plane, etc.)?
    • N/A

⚠️ Note: Azure Storage automatically encrypts data when it is persisted. You can use your own keys for the encryption. For more information, see Customer-managed keys with Azure Key Vault.

  • Other tiers
    • On Standard C1 and above tiers (Premium tier included) each Redis node runs on a dedicated Virtual Machine.
    • On Standard C0 and below (Basic tier included), the Redis instances remain in a shared environment, and the same Virtual Machine is used by more than one Redis instance.

Customer Managed Key

  • Does the service support encryption with customer managed key?
    • Background: Data in a new storage account is encrypted with MSFT-managed keys by default. You can continue to rely on MSFT managed-keys for encryption at rest if you decided to "Persist" your data through Azure cache for Redis with their "Premium" pricing tier, or you can manage encryption with your own keys.

    • Hence, if you decided to choose to manage encryption with your own keys, you've two options.

      • You can specify a customer-managed key to use for encrypting and decrypting data in Blob storage and in Azure Files. Customer-managed keys must be stored in Azure Key Vault or Azure Key Vault Managed Hardware Security Module (HSM) (preview). For more information about customer-managed keys, see Use customer-managed keys for Azure Storage encryption.
      • You can specify a customer-provided key on Blob storage operations. A client making a read or write request against Blob storage can include an encryption key on the request for granular control over how blob data is encrypted and decrypted. For more information about customer-provided keys, see Provide an encryption key on a request to Blob storage.
  • What are the limitations of using customer managed key?
    • The customer takes responsibility for key storage (Azure Key Vault or Key Vault HSM), rotation, and access control.
    • Both options support only a limited set of Azure Storage types: customer-managed keys cover Blob storage and Azure Files, and customer-provided keys cover Blob storage only. MSFT-managed keys support all types (Blob, File, Queue, and Table). See comparison table for more information.

Persistence

Use the Redis data persistence feature in the Premium tier to increase resiliency against data loss. Azure Redis Cache offers Redis Database (RDB) and Append Only File (AOF) options in Redis persistence. For more information, see How to configure persistence for a Premium Azure Redis Cache.

  • Learn more about the advantages and disadvantages of RDB & AOF persistence.

  • Azure Cache for Redis offers Redis persistence using either RDB (Redis Database) persistence or AOF (Append Only File) persistence models. For more, see Redis persistence options.

  • Azure cache for Redis writes the data into Azure Storage account that you own and manage, and Azure Storage automatically encrypts data when it is persisted. ⭐ More on Azure Storage encryption for data at rest

  • Redis Persistence function "is only supported in Premium tier" in order to persist data stored in Redis.

  • You can also take snapshots and back up the data, which you can load in case of a hardware failure.

⚠️ Note: Because this data is saved externally, it needs special attention with regard to data security and encryption. ⭐

Conclusion

  • Although Azure offers several ways to encrypt and secure data, for Azure Cache for Redis the recommended approach is encryption in transit using TLS 1.2.
  • Encryption at rest is not needed for in-memory data, because the Virtual Machine that hosts the Redis node already isolates it, and persisted data is protected by Azure Storage encryption.
  • Client-side encryption is CPU intensive and adds processing time, which erodes the advantage of a fast, low-latency cache service. ⭐

High availability

  • Is the service highly available?

    Yes, by default, caches in the standard or premium tier have built-in replication with a two-node configuration—a primary and a replica hosting two identical copies of your data.

  • How does the high availability work?

    A high availability architecture ensures that your managed Redis instance keeps functioning even when planned or unplanned outages affect the underlying virtual machines (VMs).

    Azure Cache for Redis implements high availability by using multiple VMs, called nodes, for a cache. The nodes are configured such that data replication and failover happen in coordinated manners. In a Basic cache, the single node is always a primary. In a Standard or Premium cache, there are two nodes: one is chosen as the primary and the other is the replica.

    High availability also aids in maintenance operations such as Redis software patching. See official documentation for various HA options.

    ⚠️ NOTE: A Basic cache doesn't have multiple nodes and doesn't offer a service-level agreement (SLA) for its availability. Use a Standard or Premium cache for a multi-node deployment, to increase availability.

    New in preview, Azure Cache for Redis can now support up to four nodes in a cache distributed across multiple availability zones. This update can significantly enhance the availability of your Azure Cache for Redis instance, giving you greater peace of mind and hardening your data architecture against unexpected disruption.

  • Is it active-active? Active-passive? Multiple followers?

    Azure Cache for Redis generally works in a primary/replica relationship. This means you have a primary cache instance that allows both read and write operations and a replica instance that allows only read operations. The service keeps the primary and replica instances in sync. This configuration is offered in both the Standard tier (in a single region) and the Premium tier (across regions through geo-replication, typically spanning two Azure regions).

    ⚠️ NOTE: Geo-replication is not available in standard or Basic tiers.

    Azure also offers a more advanced form of geo-replication, called active geo-replication (multi-master writes with strong eventual consistency), through the Enterprise and Enterprise Flash tiers.

    🏆 Active geo-replication groups two Enterprise Azure Cache for Redis instances into a single cache that spans across Azure regions. Both instances act as the local primaries. An application decides which instance(s) to use for read and write requests.

    ⚠️ NOTE: Enabling active geo-replication in the Enterprise and Enterprise Flash tiers increases availability to up to 99.999%. In addition to paying the standard charges for the primary and the replica cache instances, you will also pay some charges for the data transfer between regions.

  • Does it work within Region only? Cross-regions?

    Azure Redis cache can be configured both regionally and cross-region.

  • How service high availability impacts TCO?

  • Are there any implications of having/not having Availability Zones in Region?

    Azure Cache for Redis supports zone redundant configurations in the Premium and Enterprise tiers. A zone redundant cache can place its nodes across different Azure Availability Zones (data centers) in the same region.

    Therefore, enabling Zone-redundancy HA option, it eliminates datacenter or Availability Zone outage as a single point of failure and increases the overall availability of your cache.

    On the other hand, if you don't opt for the zone-redundancy HA option, a datacenter outage will impact your service, which makes the datacenter a single point of failure (SPOF).

    How much does it cost to replicate my data across Azure Availability Zones?

    ⚠️ NOTE: The data transfer charge is the network egress cost of data moving across the selected Availability Zones.

  • Does functionality of the service decrease when faults occur?

    A failover occurs when a replica node promotes itself to become a primary node and the old primary node closes existing connections. See Explanation of a failover or How does patching occur.

    ⚠️ NOTE: Nodes are patched one at a time to prevent data loss. Basic caches will have data loss. Clustered caches are patched one shard at a time.

    Multiple caches in the same resource group and region are also patched one at a time. Caches that are in different resource groups or different regions might be patched simultaneously.

    Because full data synchronization happens before the process repeats, data loss is unlikely to occur when you use a Standard or Premium cache. You can further guard against data loss by exporting data and enabling persistence.

    🥇 Whenever a failover occurs, the Standard and Premium caches need to replicate data from one node to the other. This replication causes some load increase in both server memory and CPU. If the cache instance is already heavily loaded, client applications might experience increased latency. In extreme cases, client applications might receive time-out exceptions. To help mitigate the effect of more load, configure the cache's maxmemory-reserved setting. See effect of a failover on client applications.

    A zone redundant cache provides automatic failover. When the current primary node is unavailable, one of the replicas will take over. Your application may experience higher cache response time if the new primary node is located in a different AZ. AZs are geographically separated. Switching from one AZ to another alters the physical distance between where your application and cache are hosted. This change impacts round-trip network latencies from your application to the cache. The extra latency is expected to fall within an acceptable range for most applications. We recommend you test your application to ensure it does well with a zone-redundant cache.
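Because a failover closes existing connections and can cause time-outs, client code usually needs some form of retry. The following is a minimal sketch of retrying with exponential backoff. The helper name, attempt count, and delays are illustrative assumptions rather than values recommended by Azure, and the `sleep` parameter is injectable only so that the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry an operation on connection or time-out errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}
delays = []  # records the waits instead of actually sleeping


def flaky_get() -> str:
    """Simulated cache read that fails twice, as it might during a failover."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection closed during failover")
    return "value"


result = call_with_retry(flaky_get, sleep=delays.append)
print(result, delays)  # value [0.1, 0.2]
```

Many Redis client libraries ship their own reconnect and retry policies, so a wrapper like this is mainly useful where the library's behaviour needs to be supplemented.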

Disaster recovery

  • How service can be recovered in case of region failure?

    Azure cache for Redis provides built-in HA.

    A zone redundant cache provides automatic failover. When the current primary node is unavailable, one of the replicas will take over.

    If you run cache instances across the globe, Azure offers linked cache instances through geo-replication with user-controlled failover. Once two instances are linked, one is named the primary linked cache and the other the secondary linked cache.

    ⚠️ NOTE: Geo-replication doesn't provide automatic failover because of concern over added network roundtrip time between regions if the rest of your application remains in the primary region. 🥇 You'll need to manage and start the failover by unlinking the secondary cache. Unlinking promotes it to be the new primary instance. See Enterprise tiers for more advanced form of geo-replication.

  • Can service be recovered if both paired regions are unavailable?

    Yes. If both regions are down, Azure prioritizes recovering at least one region from the pair, so that apps and data become available again sooner. If applications are deployed across regions that are not paired, recovery might be delayed; in the worst case, the chosen regions may be the last two to be recovered. For more information, see Azure resiliency technical guidance.

  • How is service's data/configuration being backed up?

    Microsoft recommends using the Redis data persistence feature in the Premium tier to increase the resiliency against data loss. The Premium tier allows you to persist the cache data in an Azure Storage account.

    In a Basic/Standard cache all the data is stored only in memory. 🚩 In case of underlying infrastructure issues there can be potential data loss. We recommend using the Redis data persistence feature in the Premium tier to increase resiliency against data loss. Azure Redis Cache offers Redis Database (RDB) and Append Only File (AOF) options in Redis persistence. For more information, see How to configure persistence for a Premium Azure Redis Cache.

    ⚠️ NOTE: The SLA does not cover protection from data loss. The SLA only covers connectivity to the Cache endpoints. See SLA section below for more info.

  • What are the ranges of RTO/RPO that can be configured?

  • How different DR setups impact service TCO?

    Azure Cache for Redis is a fully managed service, so operational overhead is likely to contribute less to TCO.

  • Any other limitations to service's DR capabilities?

    • Automatic failover across Azure regions isn't supported for geo-replicated caches, because of concern over added network roundtrip time between regions if the rest of your application remains in the primary region. 🥇 You'll need to manage and start the failover by unlinking the secondary cache. Unlinking promotes it to be the new primary instance. See Enterprise tiers for more advanced form of geo-replication. For more info, see How does failing over to the secondary linked cache work?

    • Although Geo-replication is designed as a disaster-recovery solution, some features aren't supported with geo-replication:

      • Zone Redundancy isn't supported with geo-replication.
      • Persistence isn't supported with geo-replication.
      • Both caches must be in the same subscription.
      • The secondary linked cache is either the same cache size or a larger cache size than the primary linked cache.
      • Both caches are created and in a running state.
      • Clustering is supported if both caches have clustering enabled and have the same number of shards.
      • Caches in the same VNET are supported.
      • Caches in different VNETs are supported with caveats. See Can I use geo-replication with my caches in a VNET? for more information.
    • Data transfer between Azure regions will be charged at standard bandwidth rates.
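The geo-replication prerequisites listed above lend themselves to a pre-flight check before linking two caches. The sketch below encodes them as a plain validation function. `CacheConfig` is a hypothetical description of a cache for this example and not an Azure SDK type, and the rules cover only the restrictions stated on this page.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    """Hypothetical description of a cache; not an Azure SDK type."""
    name: str
    tier: str
    subscription_id: str
    size_gb: int
    shard_count: int = 0  # 0 means clustering is disabled
    persistence_enabled: bool = False
    running: bool = True


def geo_link_problems(primary: CacheConfig, secondary: CacheConfig) -> list[str]:
    """Return the reasons why two caches cannot be linked (empty list if none)."""
    problems = []
    for cache in (primary, secondary):
        if cache.tier != "Premium":
            problems.append(f"{cache.name}: must be in the Premium tier")
        if cache.persistence_enabled:
            problems.append(f"{cache.name}: persistence is not supported")
        if not cache.running:
            problems.append(f"{cache.name}: must be in a running state")
    if primary.subscription_id != secondary.subscription_id:
        problems.append("caches must be in the same subscription")
    if secondary.size_gb < primary.size_gb:
        problems.append("secondary must be the same size as the primary or larger")
    if primary.shard_count != secondary.shard_count:
        problems.append("clustering must match, with the same number of shards")
    return problems


primary = CacheConfig("eu-cache", "Premium", "sub-1", size_gb=6)
secondary = CacheConfig("us-cache", "Premium", "sub-1", size_gb=6)
print(geo_link_problems(primary, secondary))  # []
```

A check like this can run in a deployment pipeline so that an unsupported pairing is reported before any linking is attempted.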

Services SLA

  • What's the service's SLA?

    Azure Cache for Redis already offers an industry-standard 99.9 percent service level agreement (SLA). With the addition of zone redundancy, the availability increases to a 99.95 percent level, allowing you to meet your availability needs while keeping your application nimble and scalable.

    For any Enterprise and Enterprise Flash tier Cache deployed to three or more Availability Zones in the same Azure region, Azure guarantees that you will have connectivity to the Cache Endpoint at least 99.99% of the time.

    For any Enterprise and Enterprise Flash tier Cache deployed (1) to at least three Azure regions and three or more Availability Zones in each region and (2) with active geo-replication enabled for all Cache instances, we guarantee that you will have connectivity to one regional Cache Endpoint at least 99.999% of the time once the active geo-replication feature becomes generally available (i.e., is not a preview feature).

  • Are there any limitations to this SLA (e.g. only in regions with Availability Zones, etc.)

    The Basic tier of the Azure Cache for Redis Service is not covered by this SLA.

  • Are there any other considerations that have to be taken into account when assessing overall SLA?

    The SLA only covers connectivity to the Cache endpoints. The SLA does not cover protection from data loss. MSFT recommends using the Redis data persistence feature in the Premium tier to increase resiliency against data loss. 🚩
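To put the SLA percentages above in perspective, the following sketch converts them into the maximum downtime they allow. It assumes a 30-day month for simplicity; the way the SLA itself measures a billing month may differ.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Maximum downtime per 30-day month that still meets the given SLA."""
    return round((100 - sla_percent) / 100 * MINUTES_PER_30_DAY_MONTH, 2)


for sla in (99.9, 99.95, 99.99, 99.999):
    print(f"{sla}% -> {allowed_downtime_minutes(sla)} minutes per 30-day month")
# 99.9%   -> 43.2 minutes
# 99.95%  -> 21.6 minutes
# 99.99%  -> 4.32 minutes
# 99.999% -> 0.43 minutes
```

Moving from 99.9% to 99.95% with zone redundancy therefore halves the permitted downtime, from about 43 minutes to about 22 minutes per month.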

Multi-cloud

  • Can service be used in multi-cloud deployments?

    According to Azure documentation, we can confirm that Azure cache for Redis service supports multi-cloud.

    But generally speaking, open source Redis can run in many compute environments. Common examples include:

    • On-premises - Redis caches running in private datacenters.
    • Cloud-based VMs - Redis caches running on Azure VMs, AWS EC2, and so on.
    • Hosting services - Managed Redis services such as AWS ElastiCache.
    • Different regions - Redis caches located in another Azure region.

    If you have such a cache, you may be able to move it from Azure Cache for Redis with minimal interruption or downtime.

  • Does this/similar service exist on other key cloud provider platforms (AWS, GCP)?

    Yes, both AWS (AWS ElastiCache for Redis) and GCP (Memorystore) have their own fully managed implementation of open source Redis in-memory data store.

  • What would it take to use similar service from other provider if needed?

    At a very high-level in no particular order, you would need to do the following:

    • On-board and register your business with the cloud provider
    • Create your subscription
    • Create the resource i.e. AWS ElastiCache for Redis/GCP Memorystore etc. For more, see Migration options.
  • Can service be integrated with services hosted on different cloud platforms?

    Yes absolutely, see Migration options above for more details.

General FAQ

For more info, see Azure Redis Cache FAQ

Terminology

  • Redis cluster: A cluster is a collection of one or more cache nodes, all of which run an instance of the Redis cache engine software.