Proposed cacheops invalidation change

tl;dr

Invalidating Rover's caches has become disastrously slow. With some fairly simple changes to cacheops, we can make invalidation almost instantaneous: if Rover checks the cached invalidation metadata before reading a cached query resultset, then on saves we can invalidate all the affected resultsets at once just by renaming the metadata keys.

Background

As described here, cacheops works by storing each serialized Django ORM resultset under a key derived from a hash of all its query parameters. Then, to enable invalidation of every resultset that might have contained a new or changed record, cacheops adds that resultset key to another cached set, whose own key is built from a normalized, simplified subset of the query's parameters and their values.
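Roughly, in Python, the key scheme looks like this (a sketch for orientation only; cacheops's real key hashing covers more query state than this, so these helpers are illustrative, not its actual internals):

import hashlib

def query_key(model_label, params):
    # e.g. 'q:' + md5('brands.models.Collection:slug=tech-news&site=81bbdae2-...')
    raw = model_label + ':' + '&'.join('%s=%s' % item for item in sorted(params.items()))
    return 'q:' + hashlib.md5(raw.encode()).hexdigest()

def conj_key(table, filters):
    # e.g. 'conj:collections:site_id=81bbdae2-...&slug=tech-news'
    return 'conj:%s:%s' % (table, '&'.join('%s=%s' % item for item in sorted(filters.items())))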

When a record is saved, cacheops gathers up and deletes every resultset key in every one of those sets whose field/value combination the record matches. Let's work an example and see what commands cacheops sends to Redis!

Detailed example

Cache read and update

An ordinary Rover request for a list of collections might look like /v2/collections?slug=tech-news&site=81bbdae2-81fb-4d95-b643-a0679795a2a4&count=no. Cacheops is hooked into the ORM and when that request is made it'll hash a string like brands.models.Collection:slug=tech-news&site=81bbdae2-81fb-4d95-b643-a0679795a2a4 and look for that key in the cache:

GET "q:c5b99b2514fcc772b8b864271b1eb2bb"

If it were there, Redis would return it and we'd be done. But it's not, so cacheops will let the ORM get it from the DB, then cache it by calling a procedure it has stored in Redis. The procedure is named cache_thing in the cacheops code but it's called by the SHA Redis stored it under:

EVALSHA "c54d7c8aeb17b974055d5eb1f1a08f0dafcaaf81" "1" "q:c5b99b2514fcc772b8b864271b1eb2bb" "\x80\x02]q\x01c...[serialized and encoded resultset data]"

That stored procedure first caches the resultset:

SETEX "q:c5b99b2514fcc772b8b864271b1eb2bb" "300" "\x80\x02]q\x01c..."

then updates the invalidation metadata, so it can delete this cached query's results later when any record matching it is saved:

SADD "schemes:collections" "site_id,slug"
SADD "conj:collections:site_id=81bbdae2-81fb-4d95-b643-a0679795a2a4&slug=tech-news" "q:c5b99b2514fcc772b8b864271b1eb2bb"

Cache invalidation

Now let's PATCH the collection that should be in the resultset for that query.

PATCH /v2/collections/f5418929-46d5-4bf9-8224-ad425e64a115

{"title": "I updated this collection"}

A record has been changed and cacheops needs to delete all the cached resultsets that it might have been in – and all those it should be in now. So it's going to call the stored invalidate procedure, first with the old record:

EVALSHA "fb9d58a7cc37037cb301336d2b9939a07620979d" "0" "collections" "{\"body\": \"From reviewing the latest gadgets to sharing breaking news, here's everything you need to know about the latest technology.\", \"status\": 3, \"title\": \"Latest Tech News and Best Gadgets\", \"created_at\": \"2016-05-16 22:31:31.278016+00:00\", \"site_id\": \"81bbdae2-81fb-4d95-b643-a0679795a2a4\", \"updated_at\": \"2016-05-26 19:06:09.767154+00:00\", \"slug\": \"tech-news\", \"parent_id\": null, \"display_type_id\": \"ee745167-ebdb-42ce-b404-f001c13b1cc0\", \"publish_from\": null, \"publish_to\": null, \"sponsor_id\": null, \"ad_category_id\": null, \"id\": \"f5418929-46d5-4bf9-8224-ad425e64a115\", \"metadata\": \"'{\\\"seo_meta_description\\\": \\\"From reviewing the latest gadgets to sharing breaking news, here''s everything you need to know about the latest technology.\\\", \\\"seo_meta_title\\\": \\\"Tech News - Latest Technology News, Product Reviews, Must-Have Gadgets, and More \\\", \\\"seo_meta_keywords\\\": \\\"tech news, latest tech news, technology news, gadgets, tech reviews, product reviews\\\", \\\"sub_heading\\\": \\\"\\\", \\\"nofollow\\\": \\\"0\\\", \\\"exclude_from_homepage\\\": \\\"0\\\", \\\"collection_ramsname\\\": \\\"Tech News\\\", \\\"last_updated\\\": \\\"2016-05-06 19:38:54\\\", \\\"hide_external\\\": \\\"0\\\", \\\"collection_visibility\\\": \\\"1\\\"}'\"}"

which will look up the various combinations of fields on which cached queries have filtered collections:

SMEMBERS "schemes:collections"
1) "id"
2) "site_id,slug"

and combine those with the values from the saved record to get all the keys for the cached resultsets that the record might have been in:

SUNION "conj:collections:id=f5418929-46d5-4bf9-8224-ad425e64a115" "conj:collections:site_id=81bbdae2-81fb-4d95-b643-a0679795a2a4&slug=tech-news"
1) "as:04b0159c26d75d9d88cedb661fdcacb4"
2) "as:75b6433d739e756c7e61ca14607ccb82"
3) "q:0ada9d021078d8b656bc5b89ee559272"
4) "q:c5b99b2514fcc772b8b864271b1eb2bb"

and then delete that metadata and those resultsets:

DEL "conj:collections:id=f5418929-46d5-4bf9-8224-ad425e64a115" "conj:collections:site_id=81bbdae2-81fb-4d95-b643-a0679795a2a4&slug=tech-news"
DEL "as:04b0159c26d75d9d88cedb661fdcacb4" "as:75b6433d739e756c7e61ca14607ccb82" "q:0ada9d021078d8b656bc5b89ee559272" "q:c5b99b2514fcc772b8b864271b1eb2bb"

And then it'll do it all again, with the new record, to invalidate the cached resultsets it should now be in!

EVALSHA "fb9d58a7cc37037cb301336d2b9939a07620979d" "0" "collections" "{\"body\": \"From reviewing the latest gadgets to sharing breaking news, here's everything you need to know about the latest technology.\", \"status\": 3, \"title\": \"I updated this collection\", \"created_at\": \"2016-05-16 22:31:31.278016+00:00\", \"site_id\": \"81bbdae2-81fb-4d95-b643-a0679795a2a4\", \"updated_at\": \"2017-09-29 17:12:15.079759+00:00\", \"slug\": \"tech-news\", \"parent_id\": null, \"display_type_id\": \"70f12355-28c3-4b65-afc4-fb047c862e46\", \"publish_from\": null, \"publish_to\": null, \"sponsor_id\": null, \"ad_category_id\": null, \"id\": \"f5418929-46d5-4bf9-8224-ad425e64a115\", \"metadata\": \"'{\\\"seo_meta_description\\\": \\\"From reviewing the latest gadgets to sharing breaking news, here''s everything you need to know about the latest technology.\\\", \\\"seo_meta_title\\\": \\\"Tech News - Latest Technology News, Product Reviews, Must-Have Gadgets, and More \\\", \\\"seo_meta_keywords\\\": \\\"tech news, latest tech news, technology news, gadgets, tech reviews, product reviews\\\", \\\"sub_heading\\\": \\\"\\\", \\\"nofollow\\\": \\\"0\\\", \\\"collection_ramsname\\\": \\\"Tech News\\\", \\\"last_updated\\\": \\\"2016-05-06 19:38:54\\\", \\\"collection_visibility\\\": \\\"1\\\", \\\"hide_external\\\": \\\"0\\\", \\\"exclude_from_homepage\\\": \\\"0\\\"}'\"}"

OK, so what's the problem?

We are caching half a million resultsets each minute. There are, as I write this, 764,586 keys in the conj:collections: set (the conj set for unfiltered collection queries, which every save matches). The next time a collection is saved, cacheops is going to tell Redis to delete all of them, in batches of 1000, which can add up to a couple of seconds – and then delete the set itself, which takes a few seconds too. Redis is single-threaded, and this is all being done in one EVALSHA call, so Rover is totally blocked while all this happens.

Do we need invalidation? Can't we just let the cached resultsets expire?

Interactive Rover clients that save a record, load the next page, and expect the changes they saved to be reflected in the data on that page, require forced invalidation so they don't get stale data. Right now this is just Edit UI, but I can't think of a way around the requirement.

What can we do?

We can change the way we get cached resultsets, to make invalidating them much faster.

Each conj set stores the keys of all the resultsets that should be invalidated when a record is saved that its query would match. Another way to put that is: the conj sets for a query contain the keys of all the resultsets that are currently valid for that query. So we can check the validity of a query's cache key in two ways: try to get the key, or see if it's in the query's conj sets.

On a cache get, we could check the conj sets first. If the key is missing from any of them, we can skip the GET entirely. And that means we can invalidate resultsets by getting rid of the conj set instead of the resultset keys themselves!
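A minimal sketch of that read path, using redis-py (the function and its arguments are hypothetical, not the cacheops API; a query can have several conj sets, one per conjunction in its WHERE clause):

import redis

def get_cached_resultset(r, query_key, conj_keys):
    # A resultset is only valid while its key is still listed in every one
    # of its conj sets; a deleted (or renamed-away) conj set makes this miss.
    for conj_key in conj_keys:
        if not r.sismember(conj_key, query_key):
            return None              # invalidated: treat as a cache miss
    return r.get(query_key)          # may still be None if the key has expired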

Problems and solutions

Set lookups are slow! – Actually, they aren't.

We assumed that to check whether a value is in a set, Redis would have to scan its members, which would be slow given the size of the sets we're storing, and we planned to switch the conj caches to the hash type, which offers O(1) field lookups. It turns out SISMEMBER is documented O(1) as well, so the conj sets can stay as they are.

Deletion of a large set is slow! – Don't delete it! Just rename it.

Renaming a key (RENAME) is also O(1) in Redis, and once a conj set has been moved aside, the membership check above will miss, which is exactly the invalidation we want.
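Invalidation then becomes one rename per matched conj set. A sketch, again with redis-py (the zombie: prefix and uuid suffix are inventions here, just to keep renamed keys unique and out of the live namespace):

import uuid
import redis

def invalidate_conj(r, conj_key):
    # O(1): move the set aside so the membership checks above start missing.
    zombie_key = 'zombie:%s:%s' % (conj_key, uuid.uuid4().hex)
    try:
        r.rename(conj_key, zombie_key)
    except redis.exceptions.ResponseError:
        return None   # "no such key": nothing is cached for this conjunction
    return zombie_key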

Renamed conj sets will fill the cache. – Queue an asynchronous job to delete them.

Once we've renamed a conj key to invalidate its cached resultsets, we can hand the renamed key to an RQ worker to be deleted without holding up Rover.
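Handing it off might look like this (a sketch; 'cache-reaper' is a made-up queue name, and reap_zombie_conj is the hypothetical worker sketched in the next section):

from redis import Redis
from rq import Queue

reaper_queue = Queue('cache-reaper', connection=Redis())

def invalidate_conj_async(r, conj_key):
    zombie_key = invalidate_conj(r, conj_key)  # the O(1) rename from above
    if zombie_key:
        reaper_queue.enqueue(reap_zombie_conj, zombie_key)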

That delete will still be slow and block Redis. – Delete it even more slowly!

Since the RQ worker isn't blocking Rover, it can take its time deleting the conj key and its cached resultsets. By deleting the set entries and cache keys in small batches, we can avoid blocking Redis.
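A sketch of that worker: pop a small batch of resultset keys from the renamed set, delete them, pause, repeat (the batch size and sleep are guesses to tune, and SPOP with a count needs Redis >= 3.2):

import time
from redis import Redis

def reap_zombie_conj(zombie_key, batch_size=100):
    r = Redis()
    while True:
        # SPOP with a count removes and returns up to batch_size members,
        # so no single command holds Redis for long.
        batch = r.spop(zombie_key, batch_size)
        if not batch:
            break                 # Redis auto-deletes a set once it's empty
        r.delete(*batch)          # drop those cached resultset keys
        time.sleep(0.01)          # breathe between batches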

Now we're doing two lookups instead of one on each query! – Only for clients that need them.

The cost of immediate invalidation is an extra Redis set-membership check per conj set – but only clients that need to immediately read back their writes have to pay it. Clients that only update Rover data, or only read it, can just read the cache normally and rely on normal expiration and the asynchronous invalidation process to bound how stale their data can get.

That sounds like a big change to Rover. – It's mostly already done.

We added a read_from_primary flag to identify Rover clients that need to read from the primary PostgreSQL database rather than from the replicas – and those are the same clients that need immediate invalidation (it's just Edit UI so far). We can just pass that flag into the cacheops library (which we've already forked).
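So the gate could be as simple as this sketch (reusing get_cached_resultset from above; exactly how the flag gets plumbed into the fork is an open detail):

def cached_resultset_for_client(r, query_key, conj_keys, read_from_primary):
    if read_from_primary:
        # Read-your-writes clients (Edit UI) pay the extra SISMEMBER checks.
        return get_cached_resultset(r, query_key, conj_keys)
    # Everyone else: one lookup; TTL expiry plus the reaper bound staleness.
    return r.get(query_key)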