Redis: applicability to Giraffa - GiraffaFS/giraffa GitHub Wiki

Applicability of Redis to Giraffa

Milan Desai

Introduction

Redis is a key-value cache and store that serves as a data structure server. It is query-based, in-memory, and lightweight, and it stores data structures as values. Redis was created by Salvatore Sanfilippo in 2009 and has been actively developed ever since. Development is currently sponsored by Pivotal, and version 3.0.0 was released on April 1, 2015. It is available under the three-clause BSD license. The following sections list the advantages and disadvantages of using Redis in the context of Giraffa.

Advantages

The biggest advantage of Redis is that it is a data structure server. In Giraffa we can therefore use file paths as keys and maps as values without having to worry about serialization. Because Redis is entirely in-memory, it is fast. At the same time, Redis supports the basic requirements of a database for Giraffa, though some features are currently in beta:
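The path-as-key, map-as-value layout can be sketched as follows. This is a minimal illustration, not Giraffa code: the `FakeRedis` dictionary is an in-memory stand-in for a server, the helper names `put_inode` and `get_inode` are hypothetical, and the attribute names are assumptions. With a real client the two helpers would correspond roughly to the `HSET` and `HGETALL` commands. Redis hash values are strings, so each attribute is still rendered as text, but no custom record format is needed.

```python
from typing import Dict

# In-memory stand-in for a Redis server: key -> hash (field -> value).
# Assumption for illustration only; a real deployment would use a client library.
FakeRedis = Dict[str, Dict[str, str]]


def put_inode(store: FakeRedis, path: str, attrs: Dict[str, object]) -> None:
    """Store file attributes under the full path, one hash field per attribute."""
    fields = store.setdefault(path, {})
    for name, value in attrs.items():
        # Redis hash values are strings, so each attribute is rendered as text.
        fields[name] = str(value)


def get_inode(store: FakeRedis, path: str) -> Dict[str, str]:
    """Return all fields stored for a path (empty dict if the key is absent)."""
    return dict(store.get(path, {}))


store: FakeRedis = {}
# Hypothetical attribute names; the real Giraffa row schema may differ.
put_inode(store, "/user/alice/data.txt",
          {"length": 1024, "replication": 3, "owner": "alice"})
# Updating a single attribute touches only that field, not the whole record.
put_inode(store, "/user/alice/data.txt", {"length": 2048})
```

Because each attribute is a separate field, a change to the file length does not require reading and rewriting the rest of the record.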

Partitioning: Redis Cluster supports sharding the database into multiple instances on different nodes, allows the addition and removal of instances, and implements a rebalancing operation that can be called at regular intervals. However, Redis Cluster is currently only in beta stage.

Replication: Redis implements a master-slave replication architecture. Master nodes can be configured to replicate their data asynchronously to slave nodes. When a master node fails, one of its slaves can be promoted to master and clients can fail over to it.

High Availability: In Redis Cluster, as long as the majority of master nodes are alive, the loss of a Redis instance does not affect the availability of the keys on other instances. When a master node fails, clients can fail over to its slave nodes.

Persistence: Redis persists data through database snapshots taken at regular intervals and through append-only files, which are similar to HDFS edit logs. Fsync on the append-only file can be scheduled to occur after every operation, but that slows down the system considerably. When it is scheduled to occur every second instead, performance is largely unaffected, at the cost of possibly losing about the last second of writes in a crash.

Disadvantages

Key Locality: A major disadvantage is that we lose locality when storing keys. The database can be partitioned into multiple Redis instances, with each key assigned to a partition based on its hash. When the entire key is hashed, locality is lost. Redis does support hash tags, in which only a portion of the key is hashed; thus, with full-path row keys, the contents of certain directories can be guaranteed to be assigned to the same partition. But it is not possible to partition based on the unhashed ordering of the keys. Even within a partition, we have no control over iteration order. This will make operations on directories, such as listing, renaming, and deleting, exceptionally slow compared to HBase.
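The effect of hash tags on placement can be sketched in a few lines. The slot function below follows the rule described in the Redis Cluster specification (CRC16 of the key, or of the first non-empty `{...}` section, modulo 16384). The `tagged_key` helper and its key layout are hypothetical, chosen only to illustrate tagging a file by its parent directory.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16 (polynomial 0x1021, initial value 0), as used for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to one of 16384 slots, honoring a hash tag if one is present."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


def tagged_key(path: str) -> str:
    """Hypothetical layout: wrap the parent directory in a hash tag.

    Assumes the path itself contains no '{' or '}' characters.
    """
    parent, _, name = path.rpartition("/")
    return "{" + (parent or "/") + "}/" + name


# Siblings share the tag "/user/alice" and therefore share a slot.
slot_a = key_slot(tagged_key("/user/alice/a.txt"))
slot_b = key_slot(tagged_key("/user/alice/b.txt"))
```

Tagging by parent directory places a directory's children together, but a slot is still a hash value: neighboring directories land in unrelated slots, and nothing here restores ordered iteration over keys.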

Server-side Computations: Another disadvantage is that server-side computations are done through Lua scripts that clients send over as part of a command. The use of client-provided scripts, as opposed to HBase coprocessors, severely limits what can be done server-side (see: addBlock). Much of the work may have to be done by the client, which could also raise security concerns.

File Versioning: Versioning is not supported in Redis, so snapshots will not be straightforward to implement.

Replication: The database is persisted to disk rather than through a filesystem API, so it is not possible to directly persist to a distributed file system like HDFS. Replication must occur through manual backups of the persisted data or by reliance on the slave nodes. This means that if a node dies, there will be a suboptimal number of replicas for a partition until a new node is added and configured to replicate the master.

Node Management: With thousands of nodes, some of which are being added or removed at any given time, management of the cluster can become extremely complicated. There is no management process that oversees the nodes in the cluster, checks in on their status, and accordingly sets up replication or load balancing; in other words, there is no HMaster. All of this must be overseen by a human administrator, possibly with the help of automated shell scripts. At that scale, this process can become extremely tedious.

Conclusions

Altogether, Redis would not be an improvement over HBase as a database for Giraffa. While it has many features that on the surface support the requirements of Giraffa, they are not optimal for Giraffa’s specific use cases. Redis seems to be more valuable when used as a cache rather than as a store.