Migrate Billions of Redis Keys with Zero Downtime

Sonu Kumar
5 min read · Feb 18, 2023


Redis is one of the most popular caching systems, alongside Memcached, and is widely used for caching due to its sub-millisecond query times.

There are many occasions when we need to change the application Redis backend.

Horizontal Scaling

Typically, we begin running an application with non-clustered Redis, which allows only vertical scaling (i.e., increasing hardware capacity). However, as traffic grows, non-clustered Redis cannot scale horizontally, and a single-threaded Redis server can only sustain a limited QPS (queries per second). A standard Redis Cluster comprises P primary nodes and S secondary nodes, where S ≥ P, with P representing the number of shards in the cluster.

A common configuration involves Primary and Secondary nodes, as shown in this diagram. Primary nodes are highlighted in green, while Secondary nodes are depicted in grey. While we have illustrated only one Secondary node per Primary node for simplicity, in a highly scalable system, multiple Secondary nodes per Primary node are possible.

This diagram shows that the application communicates with all shards. Typically, the client computes a CRC16 checksum of the key (modulo 16384 hash slots) and uses the cluster's slot map to locate the node that owns the key. If the client receives a MOVED error, it redirects the command to the indicated node. This behavior is particularly important for pipelines, as a pipeline connects to only one node, but the keys could be located on another node.
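The slot computation above can be sketched in a few lines. This is a minimal, dependency-free version of the hash-slot function described in the Redis Cluster specification (CRC16-CCITT/XModem modulo 16384); real client libraries such as redis-py implement this internally, along with hash-tag handling for keys like `{user:1}:profile`:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    # Redis Cluster maps every key to one of 16384 hash slots;
    # each primary node owns a contiguous range of those slots.
    return crc16_xmodem(key) % 16384
```

Keys that land in slots owned by different primaries cannot be touched by a single multi-key command or pipeline hop, which is why the MOVED redirect exists.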

Shared Redis Server

There are various reasons why a Redis Server might be shared between multiple applications, such as reducing infrastructure costs, resolving setup issues, or meeting release deadlines. However, if a Redis Server is shared across multiple applications, it will eventually need to be migrated to its own Redis Server.

Version Upgrade or Hardware Upgrade

In this case, a new Redis cluster with a new cluster/hardware configuration can be set up to reduce downtime instead of upgrading the same Redis Server.

High Availability

If you initially deployed a single node or non-clustered Redis setup, but later realized that you need a highly available Redis solution to meet your application’s requirements, there are a couple of options available.

You could implement Redis Sentinel or Redis Cluster to achieve high availability. Another approach is to stand up a new Redis deployment designed to be highly available from the start and migrate to it.

Let's say your application is horizontally scalable, i.e., you're running many instances of the same application code.

There are two techniques to migrate billions of keys to another Redis cluster:

  • Dual Write
  • Partial Write

Dual Write

To implement a dual-write setup, our application needs to read and write data from both the old and new Redis configurations. Once the new setup is in place, data will be backfilled over a period of time (roughly a week). We can confirm that the migration is on track by monitoring memory usage or issuing read commands against the new Redis setup, with the cache miss/hit ratio being the primary metric for determining whether all necessary keys have been migrated.

Once we are satisfied with the cache miss/hit ratio, we can redirect traffic to the new setup, which will then serve both read and write requests. However, if there are many infrequently accessed keys, achieving 100% data migration may take significantly longer. Dual reads/writes also increase overall API response times; to mitigate this, we can perform the duplicate writes asynchronously.

For traffic control, we can configure a property or feature flag that controls whether the application writes data to both Redis setups or just the new one.
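The dual-write idea with a feature flag can be sketched as a thin wrapper around the two backends. This is a hypothetical illustration: the class name is invented, and plain dicts stand in for the two Redis clients (in production these would be redis-py clients with the same `get`/`set` surface):

```python
class DualWriteCache:
    """Hypothetical sketch of a dual-write cache wrapper.

    Writes go to the new setup, and also to the old setup while the
    feature flag is on. Reads prefer the new setup and fall back to
    the old one, backfilling the new setup on a miss.
    """

    def __init__(self, old_backend, new_backend, dual_write_enabled=True):
        self.old = old_backend
        self.new = new_backend
        self.dual_write_enabled = dual_write_enabled  # feature flag

    def set(self, key, value):
        self.new[key] = value
        if self.dual_write_enabled:
            # Keep the old setup consistent until cutover.
            self.old[key] = value

    def get(self, key):
        value = self.new.get(key)
        if value is None:
            value = self.old.get(key)
            if value is not None:
                # Backfill so the new setup warms up over time.
                self.new[key] = value
        return value
```

Flipping `dual_write_enabled` to `False` at cutover stops writes to the old setup without a redeploy.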

Dual Write Setup

After a week or so, your application setup will look like this, as we have cut traffic to the old setup.

Partial Write

To implement this configuration, the application fleet interacts with both the old and new Redis setups, but each instance talks to only one of them: some instances interact exclusively with the new Redis cluster, while the rest interact exclusively with the old one. To avoid a high cache miss rate, gradually increase the number of instances pointing at the new Redis server over time. This ensures the new cluster can absorb the growing load and lets us monitor the cache miss rate and adjust the rollout as needed.
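One way to decide which instances point at the new cluster without central coordination is a stable hash of the instance identifier compared against a rollout percentage. This is a hypothetical helper; `instance_id` could be a hostname or ECS task ID, and `rollout_percent` would come from configuration:

```python
import zlib

def use_new_redis(instance_id: str, rollout_percent: int) -> bool:
    # Hash the instance id into a stable bucket 0-99; an instance's
    # decision never flips unless the rollout percentage changes.
    bucket = zlib.crc32(instance_id.encode()) % 100
    return bucket < rollout_percent
```

Raising `rollout_percent` from 5 to 25 to 100 over the week moves instances onto the new cluster deterministically, so restarts don't reshuffle who talks to which backend.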

Partial Write

After a week or so, your application will interact only with the new setup.

If you are using AWS ECS to manage this scenario, it can be straightforward to handle by utilizing task count and reserved instances. With this approach, you can run a specific number of tasks with new task definitions, while the older tasks continue to run on the older task definitions. This can help ensure that your new Redis cluster is gradually handling more load over time.

Additionally, you can set up partial traffic using a load balancer. For example, you can configure the load balancer to send only a small percentage of traffic (e.g., 5%) to the new instances, while the majority of traffic continues to be directed to the older instances. This way, only a small fraction of the traffic will be impacted, and it will not significantly impact the user experience. By monitoring the performance and cache miss rate of the new instances, you can gradually increase the percentage of traffic directed to the new instances until they are handling the full load.
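The percentage-based routing described above can also be done per request rather than per instance. A minimal sketch, assuming a single tunable `new_traffic_percent` setting (a stand-in for the load balancer's weight configuration):

```python
import random

def pick_backend(new_traffic_percent: float) -> str:
    """Route this request to 'new' with the given probability (0-100)."""
    return "new" if random.random() * 100 < new_traffic_percent else "old"
```

Starting at 5 and ramping toward 100 mirrors the weighted-target-group approach on an AWS load balancer.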

If you found this post helpful, please share and give a thumbs up.
