We have been using Redis as a caching layer, and we have grown fond of it: its speed, simplicity, tooling and cloud support are all strong arguments in its favor. But precisely because it is so easy to use, it is also easy to overlook performance bottlenecks. In this post, I will tell the story of how we reduced Redis usage during cache invalidation, with a small deep dive into one of Redis' more advanced features: sorted sets.
Cache invalidation: problem introduction
One of our projects serves highly personalized data through its API. We cache the responses (with a time-to-live, TTL), so that each subsequent request for the same content is answered as fast as possible and without increasing load on the main SQL database.
Our cache design is fairly standard: all keys live in one Redis database, read and written with the usual GET/SET commands. Keys for user-specific cache entries are namespaced by user, so every key belonging to a given user shares a common prefix (e.g. `user-1:` for user 1).
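A minimal key-building helper might look like the following sketch. The exact key scheme isn't spelled out here; the `user-<id>:` prefix and the `endpoint` part are assumptions consistent with the SCAN example later in the post:

```python
def cache_key(user_id: int, endpoint: str) -> str:
    """Build a per-user cache key: a user-scoped prefix plus a
    resource identifier (names are illustrative)."""
    return f"user-{user_id}:{endpoint}"

print(cache_key(1, "feed"))  # user-1:feed
```

Keeping the user id as the leading segment is what makes prefix matching (e.g. `user-1:*`) possible later.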
If a user's preferences change (e.g. they update their settings), we have to invalidate the cache of multiple responses at once. That was the main issue we faced.
Naive implementation with Redis
The initial implementation was fairly simple: it ran the `SCAN` command with the `MATCH` option for a particular user (e.g. `SCAN 0 MATCH user-1:*`). This command iterates through the whole keyspace, and it is the recommended way to find a subset of keys (the alternative, the `KEYS` command, is discouraged by Redis for production use). This works fine and is easy to understand. But there is an implementation detail of the `SCAN` command with the `MATCH` option that could trip somebody up. From the documentation:
> ... the MATCH filter is applied after elements are retrieved from the collection, just before returning data to the client.
This essentially means that the whole keyspace gets scanned regardless of how many keys actually match and are returned to the client. That is fine for a small keyspace, but it does not scale.
For example, if we have 20 cached keys per user, 1,000 users, and we scan with a COUNT of 1000, we have to perform 20 SCAN requests to get through the keyspace ((20 keys/user × 1,000 users) / 1,000 per scan). With new users and/or more keys per user, Redis has to do more work. We ran into exactly this: with around 300k keys in Redis, each invalidation required 300 SCAN requests just to remove a handful of keys (20 at most).
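The arithmetic above can be sketched in a few lines. Since SCAN walks every slot of the keyspace no matter how few keys match, the number of round trips depends only on the total keyspace size and the COUNT hint:

```python
import math

def scan_requests(keys_per_user: int, users: int, scan_count: int) -> int:
    """Number of sequential SCAN calls needed to walk the whole keyspace.

    SCAN's MATCH filter is applied after retrieval, so the cost is driven
    by total keyspace size, not by how many keys actually match.
    """
    total_keys = keys_per_user * users
    return math.ceil(total_keys / scan_count)

# 20 keys/user, 1,000 users, COUNT 1000 -> 20 SCAN calls per invalidation
print(scan_requests(20, 1000, 1000))   # 20
# ~300k keys in the keyspace -> 300 SCAN calls to delete at most 20 keys
print(scan_requests(20, 15000, 1000))  # 300
```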
From our tracing we could see that a substantial amount of time was spent on the scanning operations. To illustrate: if each SCAN operation took 1 ms (a value pulled out of a hat), invalidation would add 300 ms to the API response, because SCAN uses cursor-based pagination and the requests must therefore run sequentially. (Invalidation happens on the request path: when a request finishes, we guarantee to clients that old data has been flushed from the cache.) And that is before considering the size of the values stored in Redis, which adds its own overhead if left unchecked. Thankfully, our DevOps team had already set up monitoring of large Redis keys.
We had a couple of options for dealing with the invalidation issue:
- (dynamically) increase the COUNT option of SCAN to reduce the number of round trips
- rework how we invalidate cache
- split the large keyspace - e.g. across multiple Redis databases and/or different Redis data structures
We chose to rework the cache invalidation logic to something that would scale with an increased number of users. The requirements were as follows:
- reduce the number of Redis operations during invalidation
- decrease response time of endpoints where the invalidation was done
Reducing the scan operations
The most obvious way to solve this problem is to keep track, per user, of all the API endpoints the user has visited, and then invalidate only the relevant keys. Redis offers plenty of native data structures, but which one should we use? The set is a natural fit: we only care about unique items, and set operations are fast and easy to reason about. With invalidation implemented on top of a Redis set, the workflow looks like this:
- check if the user's set exists
- if it does not, there is nothing to invalidate
- if it does, scan through the set:
  - find the matching keys in the main keyspace and delete them
  - remove the processed elements from the set (all the stale data just invalidated)
With this implementation, we managed to lower the number of requests per invalidation to 4:
- checking if user's set exists,
- scanning the set (max 20 elements),
- removing the relevant keys from the main keyspace,
- and finally removing the elements from the set.
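The workflow above can be sketched as an in-memory simulation. Here a plain dict stands in for the main Redis keyspace and another dict of sets stands in for the per-user tracking sets; all names and keys are illustrative:

```python
def invalidate_user(user_id: str, keyspace: dict, tracking_sets: dict) -> int:
    """Invalidate all cached responses for a user via a tracking set.

    keyspace      -- stands in for the main Redis keyspace (key -> value)
    tracking_sets -- stands in for per-user Redis sets (user id -> set of keys)
    Returns the number of cache keys deleted.
    """
    tracked = tracking_sets.get(user_id)          # EXISTS + SSCAN/SMEMBERS
    if not tracked:                               # nothing to invalidate
        return 0
    deleted = 0
    for key in tracked:                           # at most ~20 elements
        if keyspace.pop(key, None) is not None:   # DEL in the main keyspace
            deleted += 1
    tracking_sets[user_id] = set()                # clear the tracking set
    return deleted

cache = {"user-1:feed": "...", "user-1:profile": "...", "user-2:feed": "..."}
tracking = {"user-1": {"user-1:feed", "user-1:profile"}}
print(invalidate_user("user-1", cache, tracking))  # 2
print("user-2:feed" in cache)                      # True - other users untouched
```

The important property is that the work is proportional to the size of the user's small tracking set, not to the size of the whole keyspace.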
But with this implementation we introduced a subtle issue. We still set a TTL on each key in the main keyspace, so that we never serve stale data even if no request comes along to invalidate it. Redis, however, offers no way to set a TTL on individual members of a set. This meant we had no way to remove "stale" elements (those referring to already-expired keys) from the set, and it would keep accumulating members that no longer exist in the main keyspace. Left alone, the set would grow indefinitely until the next invalidation for that user.
Sorted sets to the rescue
Thankfully, Redis includes a sorted set (`ZSET`) data structure. A sorted set lets us attach a score to each member. What do we need the score for, you ask? Good question!

We emulate a per-member TTL with the score: we set it to the current timestamp plus the TTL of the key the member refers to. During each cache invalidation request, we then run one additional command, `ZREMRANGEBYSCORE`, which removes all members whose scores fall within a given min–max range (here: from `-inf` up to the current timestamp). This way, we purge the stale members from the set, keeping the user's `ZSET` in sync with the actual cached data.
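The score trick can be sketched with an in-memory stand-in for the sorted set (a dict of member → score); the real commands are `ZADD` and `ZREMRANGEBYSCORE`, noted in the comments, and the key names and TTLs are illustrative:

```python
import time

def track_key(zset: dict, key: str, ttl_seconds: int, now: float = None) -> None:
    """ZADD equivalent: the score is the expiry timestamp of the cached key."""
    now = time.time() if now is None else now
    zset[key] = now + ttl_seconds

def prune_expired(zset: dict, now: float = None) -> int:
    """ZREMRANGEBYSCORE <zset> -inf <now> equivalent:
    drop members whose referenced keys have already expired."""
    now = time.time() if now is None else now
    stale = [k for k, score in zset.items() if score <= now]
    for k in stale:
        del zset[k]
    return len(stale)

zset = {}
track_key(zset, "user-1:feed", ttl_seconds=60, now=1000.0)     # expires at 1060
track_key(zset, "user-1:profile", ttl_seconds=300, now=1000.0) # expires at 1300
print(prune_expired(zset, now=1100.0))  # 1 - "user-1:feed" has expired
print(sorted(zset))                     # ['user-1:profile']
```

Because expired members are pruned on every invalidation, the tracking structure can no longer grow unboundedly between invalidations.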
Cache invalidation with Redis
Iterating quickly on a product and adding a caching layer with Redis is fairly standard. But cache invalidation, being one of the 2 most difficult problems in computer science (you know, after naming things and off-by-1 errors), causes issues. Our naive cache invalidation mechanism served us well in the beginning, but caused performance problems as the number of keys stored in Redis grew.
Thankfully, Redis is not just a simple key-value store; it has plenty of native data structures that can be put to work. We chose sorted sets, and with them we managed to cut the number of Redis requests per cache invalidation by a factor of 75 (from 300 SCANs down to 4 requests, at the time it was implemented). The current solution meets our requirements and performs well. Let's see how it holds up to the test of time.
Feel free to reach out to discuss anything mentioned in this post!