In most large-scale applications, distributed cache servers are employed to cache data that is computationally expensive to generate at runtime for every request. By serving from the cache, expensive calls to databases, external clients, disks, and other resources can be avoided, improving application throughput. Under high load, however, if a frequently accessed item in the cache expires, multiple requests can simultaneously attempt to regenerate it, a situation known as a cache stampede. This causes a sudden spike in resource usage and can degrade the overall performance of the application.
In this article, we will explore a range of approaches to prevent cache stampedes, starting from the most basic to the most optimal solutions.
The most basic approach transfers the responsibility of cache regeneration for the most accessed data from the request threads to a dedicated daemon thread. This thread monitors the hot cache keys and their respective time-to-live (TTL) values, ensuring they are regenerated before they expire.
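As a rough illustration, a minimal sketch of such a refresher might look like the following. Everything here, including the class name HotKeyRefresher, the refresh interval, and the one-minute head start, is an assumption made for the example rather than a prescribed implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HotKeyRefresher {
    // Hot cache keys along with their current expiry timestamps (in milliseconds)
    private final Map<String, Long> hotKeyExpiries = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(runnable -> {
        Thread thread = new Thread(runnable, "cache-refresher");
        thread.setDaemon(true); // keep the refresher off the request path
        return thread;
    });

    public void start() {
        // Periodically check the tracked hot keys and refresh any that are close to expiring
        scheduler.scheduleAtFixedRate(this::refreshExpiringKeys, 0, 30, TimeUnit.SECONDS);
    }

    private void refreshExpiringKeys() {
        long now = System.currentTimeMillis();
        hotKeyExpiries.forEach((key, expiry) -> {
            if (expiry - now <= TimeUnit.SECONDS.toMillis(60)) { // refresh about a minute before expiry
                Object value = recomputeValue(key); // the expensive DB/service call
                writeCache(key, value);             // rewrite the key with a fresh TTL
                hotKeyExpiries.put(key, now + TimeUnit.MINUTES.toMillis(5));
            }
        });
    }

    private Object recomputeValue(String key) { /* business logic, external DB calls etc. */ return key; }

    private void writeCache(String key, Object value) { /* cache write with a fresh TTL */ }
}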
While this technique offers advantages, it also comes with limitations. To address them, many applications adopt alternative strategies to mitigate potential issues and enhance performance.
In the next approach, when a cache miss occurs, all request threads attempt to acquire a central lock on the cache key to be generated. The thread that successfully acquires the lock proceeds with the cache recomputation, while the remaining threads handle the miss gracefully, for example by briefly waiting and re-reading the cache, or by serving a stale copy of the data.
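Leaving aside which fallback the losing threads choose, the locking flow itself looks roughly like the sketch below. It is only an illustration: an in-process ConcurrentHashMap stands in for the central lock, and helpers such as readStaleCache are assumed for the example:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class LockingCacheDriver {
    // Stand-in for a central lock store; putIfAbsent acts as an atomic "try to acquire"
    private final ConcurrentMap<String, Boolean> locks = new ConcurrentHashMap<>();

    public Object fetch(String key) {
        Object value = readCache(key);
        if (value != null) {
            return value; // cache hit, no coordination needed
        }
        // Cache miss: only the thread that wins the lock recomputes the value
        if (locks.putIfAbsent(key, Boolean.TRUE) == null) {
            try {
                value = recomputeValue(key);
                writeCache(key, value);
                return value;
            } finally {
                locks.remove(key); // always release the lock
            }
        }
        // Losing threads fall back to a stale copy (they could also wait and re-read the cache)
        return readStaleCache(key);
    }

    private Object readCache(String key) { /* normal cache read */ return null; }
    private Object readStaleCache(String key) { /* read a stale/backup copy, if one is kept */ return null; }
    private Object recomputeValue(String key) { /* expensive recomputation */ return key; }
    private void writeCache(String key, Object value) { /* cache write */ }
}

In a distributed setup, the locks map would be replaced by an atomic operation against the cache itself or a dedicated lock service, since an in-process map only coordinates threads within a single instance.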
It might look like an easy and foolproof way to handle cache stampedes, but it too has limitations.
While this approach improves upon the previous one, it still presents challenges that many applications cannot afford to face. So, what is the optimal solution? Let's explore it further.
In this final approach, each request thread independently decides whether to recompute the cache value before its expiration, based on a probabilistic formula. Several factors influence the probability of triggering an early recomputation: how long the value took to recompute during the last write (delta), a user-defined constant (beta), how close the key is to its expiry, and a random draw made on every read.
Now, let’s take a quick glance at a simplified code snippet that demonstrates the implementation of this formula. Don’t worry if you find it a bit overwhelming at first glance. In the next section, we will delve into each part of the code in detail and gain a comprehensive understanding of how it works. So, feel free to just skim through it.
import java.util.Random;

public class CacheDriver {

    private final int BETA = 1;   // User defined constant; larger values trigger earlier recomputation
    private final long TTL = 300; // This is in seconds, applied when a key is written

    // More methods

    public Object fetch(String key) {
        CacheData cacheData = readCache(key);
        if (cacheData.getValue() == null || (System.currentTimeMillis() - cacheData.getDelta() * BETA * Math.log(new Random().nextDouble())) >= cacheData.getExpiry()) {
            long startTime = System.currentTimeMillis();
            Object value = recomputeValue(key);
            long delta = System.currentTimeMillis() - startTime; // time taken to recompute, in milliseconds
            writeCache(key, value, delta);
            return value; // return the freshly computed value
        }
        return cacheData.getValue();
    }

    private CacheData readCache(String key) {
        // Some cache calls to fetch data and metadata
    }

    private void writeCache(String key, Object value, long delta) {
        // Some cache calls to store data and metadata
    }

    private Object recomputeValue(String key) {
        // Some business logic, external DB calls etc.
    }

    // More methods
}

public class CacheData {
    private Object value;
    private long delta;
    private long expiry;
    // Getters and setters
}
In the provided code snippets, CacheDriver is responsible for handling all cache interactions, recomputation of the data, and executing the probabilistic cache recompute algorithm.
The entry point in this class is the fetch(String key) method. It takes a key from the client and tries to fetch the corresponding data from the cache using the class's internal readCache(String key) method. If the data is not present, the value is always recomputed; if it is present, we decide probabilistically whether to recompute it early. This decision is based on the following line:
if (cacheData.getValue() == null || (System.currentTimeMillis() - cacheData.getDelta() * BETA * Math.log(new Random().nextDouble())) >= cacheData.getExpiry()) {
// Some logic
}
In this code snippet, we also use an object called cacheData, an instance of the CacheData class, which stores both the data and the associated metadata for a specific cache key. It has three attributes: value, the cached data for the key; delta, the time taken to compute the value during the last write; and expiry, the timestamp at which the key expires.
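The readCache and writeCache methods are left as stubs above. Purely as an illustration, and assuming a simple in-memory map (with the relevant java.util.concurrent imports at the top of the file) instead of a real distributed cache, they could be filled in inside CacheDriver along these lines; the store field is an assumption introduced for this sketch:

    // Illustrative in-memory backing store; a real application would call a distributed cache here
    private final Map<String, CacheData> store = new ConcurrentHashMap<>();

    private CacheData readCache(String key) {
        // Return an empty CacheData on a miss so that fetch() can rely on getValue() == null
        return store.getOrDefault(key, new CacheData());
    }

    private void writeCache(String key, Object value, long delta) {
        CacheData cacheData = new CacheData();
        cacheData.setValue(value);
        cacheData.setDelta(delta); // time taken to recompute, in milliseconds
        cacheData.setExpiry(System.currentTimeMillis() + TimeUnit.SECONDS.toMillis(TTL)); // expiry = now + TTL
        store.put(key, cacheData);
    }

Storing expiry as an absolute timestamp in milliseconds is what allows the direct comparison against System.currentTimeMillis() in the fetch method.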
The line of code above checks whether the current time, minus a calculated gap (based on the delta, the constant, and a random value), is greater than or equal to the expiry of the cache key. If it is, a cache recompute is triggered. This can be expressed as the following formula:
currentTime() - (∆ * ß * log(rand(0,1))) >= expiryTime
Where:
- ∆ is the time taken to recompute the key (delta)
- ß is a user-defined constant
- rand(0,1) is a random number between 0 and 1
It’s worth noting that this gap is always less than or equal to 0, so subtracting it effectively gives the current time a non-negative head start in the comparison above. This follows from its three factors: the constant BETA is an integer greater than zero; the delta, being the time taken to recompute the cache key, is always positive; and Math.log(new Random().nextDouble()) is the logarithm of a number between 0 and 1, which is always negative. Multiplied together, these make the overall gap less than or equal to 0.
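To make this concrete, suppose ∆ = 200 ms and ß = 1 (both values are purely illustrative). A thread that draws rand = 0.9 computes a gap of 200 * 1 * log(0.9) ≈ -21 ms, so it treats the key as expired only about 21 ms before its real expiry. A thread that draws rand = 0.05 computes 200 * 1 * log(0.05) ≈ -599 ms and may start recomputing roughly 600 ms early. The slower a key is to recompute (larger ∆), or the larger ß is, the earlier these refreshes tend to kick in.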
Now, you might be curious about how the hotness of a cache key influences the frequency of early recomputations. The explanation is straightforward. Consider a cache key that is infrequently accessed: only a small number of threads evaluate the formula for it, so only a few random draws are made, and the chance that any of them triggers an early recomputation stays low.
Conversely, if a cache key has higher throughput, more threads evaluate the formula as its expiry approaches, which raises the probability that at least one of them triggers the recomputation early.
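If you want to see this effect numerically, the small, self-contained simulation below estimates the probability that a single read triggers an early recomputation at a given distance from expiry. All numbers in it (the delta, beta, and the 100 ms distance) are illustrative assumptions:

import java.util.Random;

public class EarlyRecomputeSimulation {
    public static void main(String[] args) {
        long delta = 200;      // time taken to recompute the key, in milliseconds (illustrative)
        int beta = 1;          // user defined constant
        long msToExpiry = 100; // how far the current read is from the key's expiry
        Random random = new Random();

        int trials = 1_000_000;
        int earlyRecomputes = 0;
        for (int i = 0; i < trials; i++) {
            double gap = delta * beta * Math.log(random.nextDouble()); // always <= 0
            if (-gap >= msToExpiry) { // same check as currentTime - gap >= expiryTime
                earlyRecomputes++;
            }
        }
        System.out.printf("P(early recompute per read, %d ms before expiry) ≈ %.3f%n",
                msToExpiry, (double) earlyRecomputes / trials);
    }
}

A single read close to expiry triggers a refresh only some of the time, but a hot key that is read hundreds or thousands of times per second gets that many independent chances, so it is almost certain to be refreshed before it actually expires; a cold key, with few reads, will usually just expire.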
The probabilistic cache algorithm presents a powerful solution to mitigate cache stampedes and optimize application performance.
For a deeper dive into the same topic, here is an excellent research paper:
https://cseweb.ucsd.edu/~avattani/papers/cache_stampede.pdf
Hope you learned something new today.
Thank you for reading. Happy coding!