DocuSign Dev Blog

Brought to you by the development teams at DocuSign

Building a Redis Sentinel Client for Node.js

We use Redis for sessions and for a short-lived data cache in our node.js application. Like any component in the system, there’s a potential risk of failure, and graceful failover to a “slave” instance is a way to mitigate the impact. We use Redis Sentinel to help manage this failover process.

As the docs describe,

Redis Sentinel is a distributed system, this means that usually you want to run multiple Sentinel processes across your infrastructure, and these processes will use agreement protocols in order to understand if a master is down and to perform the failover.

Essentially, each node server has its own sentinel corresponding to each Redis cluster [master and slave(s)] that it connects to. We have one Redis cluster, so for N node servers, there are N sentinels. (This isn’t the only way to do it - there could be only one sentinel, or any other configuration really, but the 1:1 ratio seems to be the simplest.) Each sentinel is connected to the master and slaves to monitor their availability, as well as to the other sentinels. If the master goes down, the sentinels establish a “quorum” and agree on which slave to promote to master. They communicate this through their own pub/sub channels.
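The quorum is set per master in each sentinel's configuration. A minimal sentinel.conf fragment looks something like this (the master name, addresses, and timeouts here are illustrative, not our production values):

```
# Monitor a master named "mymaster" at 127.0.0.1:6379.
# The trailing 2 is the quorum: at least 2 sentinels must agree
# the master is down before a failover is triggered.
sentinel monitor mymaster 127.0.0.1 6379 2

# How long a master must be unreachable before a sentinel
# considers it down (in milliseconds).
sentinel down-after-milliseconds mymaster 30000

# How long to wait before retrying a failed failover.
sentinel failover-timeout mymaster 180000

# How many slaves may resync with the new master at once.
sentinel parallel-syncs mymaster 1
```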

The sentinel is not a proxy - the connection to the sentinel doesn’t replace the connection to the master - it’s a separate instance with the sole purpose of managing master/slave availability. So the app connects to the sentinel in parallel with the master connection, and listens to the chatter on the sentinel channels to know when a failover occurred. It then has to manage the reconnection to the new master on its own.

Redis Sentinel Client flow diagram

We’re using the standard node_redis library, which is robust, easy to use, and works “out of the box” for things like sessions. But a year ago, when Sentinel started to gain adoption, the best approach for adding Sentinel awareness to node_redis clients wasn’t clear, so a thread started on Github to figure it out.

One simple approach was for the application to hold two connections - one to the sentinel and one to the master - and to reconnect to the new master whenever the sentinel reports a failover. But the way node_redis works, any data in transit during the failover is lost. This approach also leaves the code that listens to the Sentinel’s pub/sub chatter in the application itself, which isn’t as encapsulated as we thought it should be.
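The data-loss fix amounts to buffering: commands issued while the master is unreachable are queued rather than dropped, then flushed once the new master connection is up. A minimal sketch of the idea (all names here are illustrative, and `sent` stands in for the real socket write):

```javascript
class BufferingClient {
  constructor() {
    this.connected = true;
    this.queue = [];   // commands held during a failover
    this.sent = [];    // stand-in for writes to the live connection
  }

  send(command, args) {
    if (!this.connected) {
      this.queue.push([command, args]);  // buffer instead of dropping
      return;
    }
    this.sent.push([command, args]);
  }

  // Called once the sentinel reports the failover is complete.
  onReconnect() {
    this.connected = true;
    while (this.queue.length) {
      const [command, args] = this.queue.shift();
      this.sent.push([command, args]);   // replay buffered commands
    }
  }
}

const client = new BufferingClient();
client.send('set', ['a', '1']);
client.connected = false;        // master goes down
client.send('set', ['b', '2']);  // buffered, not lost
client.onReconnect();            // promoted slave is ready
console.log(client.sent.length); // 2
```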

So we decided to create a middle tier, a redis sentinel client, that would handle all this automatically. The goals were:

  1. Transparent, drop-in replacement for a node_redis client, handling connections to master, slave(s), and sentinel in the background.
  2. Handles all RedisClient commands (including pub/sub).
  3. No data loss during failover.

The result - still a work in progress - is the node-redis-sentinel-client module. Initially we added it into a fork of node_redis itself, then we split it into its own module, but still dependent on our fork to export shared components and fix the data loss problem.

The RedisSentinelClient object holds three sub-clients (each a normal RedisClient object): an activeMasterClient which connects to the current master, a sentinelTalker to read from the Sentinel, and a sentinelListener to listen for failovers (because in node_redis’ pub/sub mode, a client can only publish or subscribe, not both). All commands get proxied to the activeMasterClient, and that client is reconnected to the new master after a failover.
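The layout above can be sketched with plain objects standing in for RedisClient instances. Because every command goes through one proxy method, a failover only has to swap the `activeMasterClient` reference (the stub factory and method names below are illustrative, not the module's actual internals):

```javascript
// A stub "RedisClient" that just records the commands it receives.
function makeStubClient(name) {
  return {
    name,
    log: [],
    send_command(cmd, args) { this.log.push([cmd, ...args]); },
  };
}

const sentinelClient = {
  activeMasterClient: makeStubClient('master-6379'),
  sentinelTalker: makeStubClient('sentinel-talker'),     // queries sentinel state
  sentinelListener: makeStubClient('sentinel-listener'), // subscribed to failover events

  // Every RedisClient command is proxied to the current master.
  send_command(cmd, args) {
    this.activeMasterClient.send_command(cmd, args);
  },

  // After a "+switch-master" event, point at the promoted master.
  onFailover(newName) {
    this.activeMasterClient = makeStubClient(newName);
  },
};

sentinelClient.send_command('set', ['k', 'v']); // goes to master-6379
sentinelClient.onFailover('master-6380');       // failover completes
sentinelClient.send_command('get', ['k']);      // goes to master-6380
console.log(sentinelClient.activeMasterClient.name); // master-6380
```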

This has worked pretty well so far, including in production. We’ve never actually had a Redis failover in production, fortunately, but in all our tests, the client behaves well: the node processes temporarily lose connectivity, but once the failover completes, they resume gracefully with no data loss.

There are still a few questions and problems with our solution, however:

First, when the RedisSentinelClient is first instantiated, it doesn’t recover well if it can’t connect immediately. This is because of the way the activeMasterClient is first set up, and a simple fix has been elusive. (The client becomes “stable” only after this initial connection succeeds.)

Second, this middle-tier solution might ultimately be too heavy. Our Redis data is considered volatile: since it’s only for sessions and temporary caching, data loss is at worst a nuisance. So all the effort put into buffering data during a failover might be unnecessary. (On the other hand, Redis supports disk backup, and not every implementation is for volatile data, so a general-purpose solution could err on the side of robustness.)

Third, the changes in our fork to node_redis (submitted as two pull requests) haven’t been accepted, probably because there still isn’t consensus on the right approach. It’s also possible (and a little surprising) that Sentinel itself hasn’t fully caught on. (Surprising because it solves a real problem very nicely, and lacks strong alternatives.)

Do you use Redis Sentinel with node? How do you do it? We’d love to hear about your experience or ideas in the comments.
