Lovell previously mentioned improved cache time relevancy in his post about how we scaled the NET-A-PORTER website. My team is responsible for the product API used during the sale and currently being adopted by other applications across the organisation. I thought I’d reveal a few techniques we’ve used to maximise our cache hit rate.
There’s no magic involved, and it is all really rather simple. There are three parts:
- require query parameters to be sorted alphabetically
- require multi-value query parameters to be sorted
- reject unrecognised query parameters
These restrictions serve to guard against accidental or unnecessary differences in the URLs for functionally identical queries.
Require query parameters to be sorted alphabetically
Let’s take a concrete example by comparing two URLs:
The first of those will be rejected (with a helpful error message), while the second will return a list of PIDs (product ids) for a particular brand. In some other APIs you may have come across, both URLs would probably have worked.
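A minimal sketch of such an ordering check, in Python, might look like the following. It is not the API's actual implementation; the host, path and parameter names in the usage example are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlsplit


def parameter_order_error(url: str) -> Optional[str]:
    """Return an error message if the query parameter names are not in
    alphabetical order, or None if the URL is acceptable."""
    # keep_blank_values=True so that parameters without a value are still seen
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    names = [name for name, _ in pairs]
    expected = sorted(names)
    if names != expected:
        return "Query parameters must be sorted alphabetically: " + ", ".join(expected)
    return None


# Hypothetical URLs, for illustration only
print(parameter_order_error("https://api.example.com/products?offset=0&brandIds=1&limit=10"))
print(parameter_order_error("https://api.example.com/products?brandIds=1&limit=10&offset=0"))
```

The first call returns a message listing the expected order; the second returns None because the names are already sorted.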
Three parameters can be ordered in 3! = 6 different ways; we say that there are six permutations. (3! is pronounced “3 factorial”, and is shorthand for 3 * 2 * 1.) That’s five out of every six requests that potentially won’t be cached when they could have been. That’s not really too bad, except we support a lot more than three parameters, and factorials grow very quickly.
Say I wanted to find red shoes by Gianvito Rossi on NAP (our main site) and display the price as a person in the UK would want to see it. I’d use this URL:
That’s five parameters and 5! = 120. Simply by ordering those five query parameters differently you can create 120 different URLs. Add a size filter? Now there are 720 permutations. Specify a non-default sort order? Boom! 5,040 permutations. We currently recognise eleven query parameters. Two of them can’t be specified together, but that still leaves 10! = 3,628,800 different URLs all for a single query.
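The figures above are easy to reproduce. This short Python sketch just evaluates the factorial for each parameter count mentioned; the helper name orderings is made up for the example.

```python
import math


def orderings(parameter_count: int) -> int:
    """Number of distinct ways to order this many query parameters."""
    return math.factorial(parameter_count)


# Parameter counts mentioned in the text: 3, 5, 6 (size filter added),
# 7 (sort order added) and 10 (the most that can be combined)
for count in (3, 5, 6, 7, 10):
    print(f"{count} parameters: {orderings(count):,} possible orderings")
```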
Require multi-value query parameters to be sorted
Several of our query parameters can take multiple values, using comma-separated values for brevity. As an example, you can ask for products by two or more different brands at the same time. When doing so, however, you must sort the values specified for that query parameter: what’s good for the order of parameters within a URL is just as good for the order of values within a parameter.
Thus the first of these is accepted, but the second will give you a very helpful error saying your brandIds must be ordered numerically:
Several of our multi-valued query parameters take up to (or in some cases more than) ten values each. We don’t expect clients to intentionally hit us with every permutation possible, but we can avoid accidental permutations: an example would be a client that builds URLs by appending filter components in the order the user selects them, thus getting lower cache performance. Requiring a specific order is not particularly effective against deliberate abuse as there are 889,446,337,783,744,949,208 ways to pick ten brands from the 568 available at the time of writing.
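As an illustration, a check for a multi-value parameter could be sketched like this in Python. The function name and error wording are assumptions, not the API's real behaviour. Note that the comparison is numeric, so 9 sorts before 10, which a plain string sort would get wrong. The last line reproduces the number of ways to pick ten brands out of 568.

```python
import math
from typing import Optional


def brand_ids_error(raw_value: str) -> Optional[str]:
    """Return an error message if a comma-separated brandIds value is not
    in ascending numeric order, or None if it is acceptable."""
    try:
        ids = [int(part) for part in raw_value.split(",")]
    except ValueError:
        return "brandIds must be a comma-separated list of integers"
    expected = sorted(ids)
    if ids != expected:
        return "brandIds must be ordered numerically: " + ",".join(map(str, expected))
    return None


print(brand_ids_error("9,10,200"))  # accepted
print(brand_ids_error("200,9,10"))  # rejected with a hint

# Ways to choose ten brands from 568 when order does not matter
print(f"{math.comb(568, 10):,}")
```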
Reject unrecognised query parameters
Have you ever seen attempts at “busting the cache” by appending something like &r=23456234 to the end of a URL? If you try that against our API you will get a very nice error message to the effect that r is an invalid parameter and would you mind awfully choosing from one of these instead? (Read that in your best British accent.)
Although we employ this primarily to curb cargo-cult cache busting, it has the really nice side effect of catching typos in query parameter names. It also makes the API more discoverable; you can deliberately append &foo and get an error message telling you all the legal query parameters you can experiment with.
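A rough Python sketch of this kind of allow-list check follows. The list of legal parameters is hypothetical (only brandIds appears in this post), as are the URL and the wording of the message.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlsplit

# Hypothetical allow-list; the real API recognises a different set of names
ALLOWED_PARAMETERS = ("brandIds", "limit", "offset", "sort")


def unknown_parameter_error(url: str) -> Optional[str]:
    """Return an error naming any unrecognised query parameters along with
    the legal ones, or None if every parameter is recognised."""
    # keep_blank_values=True so a bare "&foo" (no "=value") is still caught
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    unknown = sorted({name for name, _ in pairs} - set(ALLOWED_PARAMETERS))
    if unknown:
        return (
            f"Invalid parameter(s): {', '.join(unknown)}. "
            f"Valid parameters are: {', '.join(ALLOWED_PARAMETERS)}"
        )
    return None


print(unknown_parameter_error("https://api.example.com/products?brandIds=1&r=23456234"))
print(unknown_parameter_error("https://api.example.com/products?brandIds=1&foo"))
```

Listing the valid names in the error is what gives the discoverability described above.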
So there you have it. Three simple ways you can increase the cache hit ratio of your REST API. We were “lucky” in that we thought of these before we built this API. Retrofitting them to an existing API would require careful thought, especially if you don’t have control of all the clients. (You might have to resort to redirecting users to the URL you would have liked them to use rather than simply rejecting the request, for example.)
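If you do go down the redirect route, the server needs a way to compute the URL it would have liked. One possible sketch in Python is shown below. The set of multi-value parameters is an assumption, values are assumed to be integer ids, and the host and path are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: which parameters hold comma-separated integer ids
MULTI_VALUE_PARAMETERS = {"brandIds"}


def canonical_url(url: str) -> str:
    """Rebuild a URL with parameters sorted by name and multi-value
    parameters sorted numerically."""
    parts = urlsplit(url)
    normalised = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in MULTI_VALUE_PARAMETERS:
            # Raises ValueError on non-integer ids; a real service would
            # turn that into an error response instead
            value = ",".join(sorted(value.split(","), key=int))
        normalised.append((name, value))
    normalised.sort()
    # safe="," keeps the commas readable rather than encoding them as %2C
    return urlunsplit(parts._replace(query=urlencode(normalised, safe=",")))


requested = "https://api.example.com/products?limit=10&brandIds=200,9"
preferred = canonical_url(requested)
if preferred != requested:
    print("Redirect to", preferred)
```

A request that is already in canonical form comes back unchanged, so it can be served directly without a redirect.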
There’s nothing in the URI RFC (RFC 3986) about URIs being equivalent despite different ordering of query parameters. Thus you can’t expect off-the-shelf WWW caches to take care of this problem for you; you have to handle it yourself.
Increasing our cache hit ratio not only means less load on our servers, it also gives our users a better experience because it ensures that their requests are more likely to be served by a CDN near them, even if they are using different clients to create those requests.