Recently Ariel and I presented a case study of how we have been using Scala, Akka and Spray on the Wishlist section of the NET-A-PORTER site. In this blog post, we share the part of our talk that focused on how and why we changed our design from using the `ask` pattern to using the `tell` pattern.
First, a quick overview of the application.
Wishlist aggregation API
The wishlist aggregation API has the job of aggregating several RESTful APIs:
- The wishlist domain API exposes mappings from customer IDs to lists of product IDs
- The product domain API exposes information about products, such as the product name and price
- The alert domain API contains notifications for products that customers are interested in, such as when they become low in stock, or go on sale
By aggregating these domains, we can provide a richer API that is more useful for the majority of clients.
The wishlist aggregation API consists of three modules:
- A routing module that uses `spray-routing` to handle incoming HTTP requests
- The `application-core` module, which has the logic for how to aggregate the wishlist, product and alert domains
- The `rest-client` module, which uses `spray-client` to define clients for the domain APIs we need to aggregate. The `application-core` only has a runtime dependency on the `rest-client` module to prevent Spray modules leaking into the `application-core`
First Design with `ask`
Our first design used `ask` to bridge the gap between our routing layer and the `Actor`s in the `application-core`. This is a common pattern in the Spray docs that completes the `RequestContext` asynchronously so our routing `Actor` can quickly move on to handling the next request.
The `application-core` module contained `Actor`s that handled different types of request. For example, one `Actor` would have the task of retrieving a list of items for a given wishlist. These `Actor`s would usually make several requests to different REST clients in parallel and aggregate the results. We couldn’t think of a nice way to use `tell` here, so instead used the `ask` pattern again. The `ask` pattern allowed us to immediately get a handle on each API response and aggregate them via `Future` composition.
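As a rough illustration of this kind of aggregation, here is a minimal sketch using plain Scala `Future`s. The domain types and the `fetchProducts` / `fetchAlerts` functions are hypothetical stand-ins for the `ask`s to our REST client `Actor`s, not our production code.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object AggregationSketch {
  // Hypothetical, simplified domain types (assumptions for illustration)
  final case class Product(id: String, name: String)
  final case class Alert(productId: String, message: String)
  final case class WishlistItem(product: Product, alerts: Seq[Alert])

  // Stand-ins for the asks sent to the REST client actors
  def fetchProducts(ids: Seq[String]): Future[Seq[Product]] =
    Future(ids.map(id => Product(id, s"Product $id")))

  def fetchAlerts(ids: Seq[String]): Future[Seq[Alert]] =
    Future(ids.take(1).map(id => Alert(id, "Low in stock")))

  def wishlistItems(ids: Seq[String]): Future[Seq[WishlistItem]] = {
    // Create both Futures before composing them, so the requests run in parallel
    val productsF = fetchProducts(ids)
    val alertsF = fetchAlerts(ids)
    for {
      products <- productsF
      alerts   <- alertsF
    } yield products.map(p => WishlistItem(p, alerts.filter(_.productId == p.id)))
  }
}
```

Note that the two `Future`s are created outside the `for` comprehension. Creating them inside it would run the requests one after the other.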
We weren’t very happy with this design. It caused us a couple of problems…
Problem 1 – `ask` timeouts are hard to debug
Our design used a lot of `ask`s. Each `ask` requires a mandatory timeout for the `Future` it returns. It is important to get the handling of these timeouts right, because even if you don’t see them in development, they will happen at some point in production, when servers become slow or networks become congested. When an `ask` timeout is reached, you will see the following `Exception`:
akka.pattern.AskTimeoutException: Timed out
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:312)
    at akka.actor.DefaultScheduler$$anon$8.run(Scheduler.scala:191)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:137)
    at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(…)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:262)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(…)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(…)
This `Exception` is useless for debugging purposes:
- The message, “Timed out”, doesn’t tell us anything about which `ask` in our system timed out. Unlike `Actor`s, `Future`s do not have names, so Akka cannot give us any more information here
- The stack trace doesn’t contain our own `com.netaporter` packages, only those for the Scala and Akka internals. This may get better in the future; see Iulian Dragos’s recent presentation on replacing stack traces in Akka applications with something more useful
When we first started seeing these `AskTimeoutException`s, we dropped in a few debugger breakpoints and eventually worked out that they were due to a RESTful API performing slowly. We started using `recoverWith` on the `Future`s we got back from `ask` to give better failure messages.
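A minimal sketch of what such a `recoverWith` might look like is shown below. The `DescribedTimeoutException` type and the `askWithDescription` helper are hypothetical names chosen for illustration; only `ask`, `AskTimeoutException` and `recoverWith` come from Akka and the Scala standard library.

```scala
import akka.actor.ActorRef
import akka.pattern.{ask, AskTimeoutException}
import akka.util.Timeout
import scala.concurrent.{ExecutionContext, Future}

object AskWithContext {
  // Hypothetical exception type that records which ask timed out
  final case class DescribedTimeoutException(description: String, cause: Throwable)
    extends RuntimeException(s"Timed out while $description", cause)

  // Wraps an ask so that a timeout carries a human-readable description
  def askWithDescription(target: ActorRef, message: Any, description: String)
                        (implicit ec: ExecutionContext, timeout: Timeout): Future[Any] =
    (target ? message).recoverWith {
      case e: AskTimeoutException =>
        Future.failed(DescribedTimeoutException(description, e))
    }
}
```

A call site would then read something like `askWithDescription(productClient, request, "fetching products")`, where `productClient` and `request` are again placeholders.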
This made debugging easier, and allowed us to return more useful error messages from our RESTful API, but our first design had a lot of `ask`s and we weren’t particularly happy with all the clutter this added to our code base.
Shortly after making these changes, we realised that `spray-client` actually gives really good error messages in the event of request timeouts. Why weren’t we seeing them?
Problem 2 – `ask` timeouts can hide other failures
It turns out we weren’t seeing the `spray-client` timeout failures because we had failed to configure our `ask` timeouts sensibly. We had configured `spray.can.client.request-timeout` to 5 seconds and had also configured our `ask` timeouts to 5 seconds.
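A configuration matching this description might look roughly like the following sketch. The object and value names are hypothetical; the only setting taken from our application is the `spray.can.client.request-timeout` key.

```scala
import akka.util.Timeout
import scala.concurrent.duration._

object FirstDesignTimeouts {
  // In application.conf (HOCON), the spray-client setting described above:
  //   spray.can.client.request-timeout = 5 s

  // The implicit timeout picked up by every ask in scope
  implicit val askTimeout: Timeout = Timeout(5.seconds)
}
```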
Can you see the problem here? Because the `ask` timeouts are earlier in the stack than the `spray-client` timeouts, they will always be reached first. The useful `RequestTimeoutException`s are not propagated up the chain of `Future`s, because the `Future`s further up the chain have already been completed with the less helpful `AskTimeoutException`. This, combined with all our error logging for `Future` failures being done in the routing layer, meant that the `RequestTimeoutException`s were being hidden from logs. This also prevented us from providing more useful error messages to the clients of our RESTful API.
The cause of this problem was subtle, but making the situation better in the short-term was fairly easy:
- Sort out the timeouts – We made sure we had larger timeouts at our routing layer and then progressively made them smaller as we got deeper into the application. Although this helped, it became a nightmare to manage, as we were using the `ask` pattern a lot.
- Log failures at the call-site – For belt and braces, in case we accidentally misconfigure the timeouts in the future, we also added logging into our
This further reduced the signal-to-noise ratio in our code, making it harder to read.
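To make the timeout ordering from the first fix concrete, here is a small sketch. The values and names are hypothetical, and the start-up check is an extra safeguard added for illustration rather than something our application did.

```scala
import scala.concurrent.duration._

object LayeredTimeouts {
  // Hypothetical values: each layer allows more time than the layer beneath it
  val routingAsk: FiniteDuration = 8.seconds
  val coreAsk: FiniteDuration = 6.seconds
  val sprayClientRequest: FiniteDuration = 5.seconds // spray.can.client.request-timeout

  // Ordered from the outermost layer to the innermost
  val outerToInner: Seq[FiniteDuration] = Seq(routingAsk, coreAsk, sprayClientRequest)

  // True when every timeout is larger than the one nested inside it
  def strictlyDecreasing(timeouts: Seq[FiniteDuration]): Boolean =
    timeouts.zip(timeouts.drop(1)).forall { case (outer, inner) => outer > inner }

  // Fails when this object is first initialised if a layer is misconfigured
  require(strictlyDecreasing(outerToInner), "outer timeouts must exceed inner ones")
}
```

With this ordering the innermost timeout fires first, so its more descriptive failure has time to propagate before an outer `ask` gives up.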
Improved Design with `tell`
To better deal with these problems, we decided to remove `ask`s from our design and use `tell` instead. Our inspiration for how to solve this came from a post on the Spray mailing list which described the `Actor` per-request model.
In the new design we start with no `Actor`s in the `application-core`.
When a request comes in to our routing layer, we now immediately spin up a new per-request `Actor` and pass it the Spray `RequestContext`. This frees up the routing `Actor` to deal with another request. The job of the per-request `Actor` is simply to hold the `RequestContext`, spin up another `Actor` in the `application-core`, send it a `tell`, then wait on a reply message or a failure, completing the `RequestContext` as appropriate.
Since our “Get Wishlist Items” `Actor` is now also scoped to a single request, it can store state related to the request as member fields. This allows us to build up our aggregation of products and wishlists using `tell`. Each time we receive a `tell` response from one of the REST clients, we can store it in a field and keep doing so until we have all the data we need in fields, at which point we can aggregate the data in those fields.
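The sketch below shows the general shape of such a request-scoped `Actor`, assuming Akka's classic actor API. The message protocol (`GetItems`, `Products`, `Alerts`, `Items`) and the two client references are hypothetical simplifications, not our real types.

```scala
import akka.actor.{Actor, ActorRef}

object GetWishlistItems {
  // Hypothetical message protocol, simplified for illustration
  final case class GetItems(productIds: Seq[String])
  final case class Products(namesById: Map[String, String])
  final case class Alerts(messagesById: Map[String, String])
  final case class Item(id: String, name: String, alert: Option[String])
  final case class Items(items: Seq[Item])
}

class GetWishlistItems(productClient: ActorRef, alertClient: ActorRef) extends Actor {
  import GetWishlistItems._

  // Request-scoped state: safe because this actor serves exactly one request
  private var replyTo: Option[ActorRef] = None
  private var products: Option[Products] = None
  private var alerts: Option[Alerts] = None

  def receive: Receive = {
    case request: GetItems =>
      replyTo = Some(sender())
      // Both tells are sent straight away, so the clients work in parallel
      productClient ! request
      alertClient ! request
    case p: Products =>
      products = Some(p)
      replyIfComplete()
    case a: Alerts =>
      alerts = Some(a)
      replyIfComplete()
  }

  // Aggregates and replies only once every field has been filled in
  private def replyIfComplete(): Unit =
    for (to <- replyTo; p <- products; a <- alerts) {
      val items = p.namesById.toSeq.sortBy(_._1).map { case (id, name) =>
        Item(id, name, a.messagesById.get(id))
      }
      to ! Items(items)
    }
}
```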
When the per-request `Actor` receives a response from the `application-core`, it completes the `RequestContext` and then kills itself. This has the nice property of also killing any request-scoped `Actor`s in the `application-core` via the supervision hierarchy.
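A stripped-down sketch of this lifecycle follows, again assuming Akka's classic actor API. To keep it free of Spray dependencies, the `complete` function is a hypothetical stand-in for completing the `RequestContext`, and `PerRequest` is an illustrative name rather than our actual class.

```scala
import akka.actor.{Actor, Props}

// `complete` is a hypothetical stand-in for completing Spray's RequestContext
class PerRequest(complete: String => Unit, workerProps: Props, message: Any) extends Actor {
  // The worker is created as a child, so stopping this actor also stops the worker
  private val worker = context.actorOf(workerProps, "worker")
  worker ! message

  def receive: Receive = {
    case result =>
      complete(result.toString) // write the response
      context.stop(self)        // then tear down this actor and its children
  }
}

object PerRequest {
  def props(complete: String => Unit, workerProps: Props, message: Any): Props =
    Props(new PerRequest(complete, workerProps, message))
}
```

The routing layer would create one of these per incoming request, passing in a function that completes that request's `RequestContext`.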
So how does the new design help with the problems we were facing before?
Solution 1 – `tell` is easier to debug
We now never see the useless `AskTimeoutException` in our logs. Huzzah!
Also, we are now using `Actor`s where we were previously using `Future`s. Since `Actor`s have names, traceability of messages is much better using `LoggingReceive`. This, in turn, makes debugging easier.
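For readers who haven't used it, `LoggingReceive` wraps an `Actor`'s `receive` block. The sketch below uses a hypothetical `ProductClient` actor and actor name; the messages only show up in the logs when `akka.actor.debug.receive = on` is set and the log level is `DEBUG`.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.event.LoggingReceive

// Hypothetical actor, used only to demonstrate LoggingReceive
class ProductClient extends Actor {
  // Each received message is logged at DEBUG level when
  // akka.actor.debug.receive = on is configured
  def receive: Receive = LoggingReceive {
    case productId: String => sender() ! s"product-$productId"
  }
}

object LoggingSketch {
  // An explicit name makes the actor's path easy to pick out in the logs
  def start(system: ActorSystem): ActorRef =
    system.actorOf(Props(new ProductClient), "product-client")
}
```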
Solution 2 – `tell` doesn’t hide failures
We no longer have to configure loads of timeouts in the `application-core`. Instead, we now have a single timeout set in the per-request `Actor` via Akka’s `setReceiveTimeout`. This timeout defines how long we are willing to take building a response, before we instead send a timeout error to our clients. If this timeout is reached, then the per-request `Actor` will kill itself and any request-scoped `Actor`s in the `application-core`.
Until the per-request `Actor` timeout is reached, any non-recoverable failures in the `application-core` are free to be escalated up the supervision hierarchy to the per-request `Actor` and completed on the `RequestContext` as a useful error message. This means we no longer run the risk of shadowing `spray-client` timeout errors as we did in the first design.
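The sketch below combines the single receive timeout with failure handling through supervision, assuming Akka's classic actor API. As before, `complete` is a hypothetical stand-in for completing the `RequestContext`, and the class name and error strings are made up for illustration.

```scala
import akka.actor.{Actor, OneForOneStrategy, Props, ReceiveTimeout, SupervisorStrategy}
import scala.concurrent.duration.FiniteDuration

// `complete` is a hypothetical stand-in for completing Spray's RequestContext
class TimedPerRequest(complete: String => Unit, workerProps: Props, message: Any,
                      timeout: FiniteDuration) extends Actor {

  // A failing child is stopped and its failure is reported to the client
  override val supervisorStrategy: SupervisorStrategy = OneForOneStrategy() {
    case e: Exception =>
      finish(s"Error: ${e.getMessage}")
      SupervisorStrategy.Stop
  }

  // The single timeout for building the whole response
  context.setReceiveTimeout(timeout)
  context.actorOf(workerProps) ! message

  def receive: Receive = {
    case ReceiveTimeout => finish("Error: request timed out")
    case result         => finish(result.toString)
  }

  private def finish(response: String): Unit = {
    complete(response)
    context.stop(self) // also stops any request-scoped children
  }
}
```

Because the failure reaches the per-request `Actor` as soon as the child throws, the client sees the underlying error rather than a generic timeout.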
Downsides of the `Actor` per-request model
Before you switch to using the `Actor` per-request model, there are a couple of points to consider:
`Actor` per-request may not help you
The `Actor` per-request model is a good fit when you need to manage many request-scoped actors in your `application-core`. There are other Akka applications we have built here at NET-A-PORTER that do not have this requirement, and the `Actor` per-request model has not been so appropriate in these cases.
We considered whether using the `Actor` per-request model was going to be a performance bottleneck. There is an overhead for spinning up several `Actor`s per request. However, for us it is not big enough to be an issue. An Akka system is designed to work well with millions of `Actor`s. `Actor`s are cheap; they are fast to create and only cost around 300 bytes of memory. This, combined with the ability to scale this service across many machines, means the performance penalty is outweighed by the benefits of a cleaner design.
We’d advise you to use the `tell` pattern whenever you can and only `ask` when required. If you find yourself using the `ask` pattern a lot you could find yourself struggling to manage all the timeouts.
If you are building a web service similar to ours that involves aggregating data from various sources, then consider using the `Actor` per-request model to promote the “