Ask, Tell and Per-request Actors

Recently Ariel and I presented a case study of how we have been using Scala, Akka and Spray on the Wishlist section of the NET-A-PORTER site. In this blog post, we share the part of our talk that focused on how and why we changed our design from using the Actor ask pattern to using the Actor tell pattern.

First, a quick overview of the application.

Wishlist aggregation API

The wishlist aggregation API has the job of aggregating several RESTful APIs:

  • The wishlist domain API exposes mappings from customer IDs to lists of product IDs
  • The product domain API exposes information about products, such as the product name and price
  • The alert domain API contains notifications for products that customers are interested in, such as when they become low in stock, or go on sale

By aggregating these domains, we can provide a richer API that is more useful for the majority of clients.

The wishlist aggregation API consists of three modules:

[Figure: the three modules of the wishlist aggregation API]

  • The rest-routing module uses spray-routing to handle incoming HTTP requests
  • The application-core module has the logic for how to aggregate the wishlist, product and alert domains
  • The rest-client module uses spray-client to define clients for the domain APIs we need to aggregate. The application-core only has a runtime dependency on the rest-client module to prevent Spray modules leaking into the application-core.

First Design with ask

Our first design used ask to bridge the gap between our routing layer and the Actors in the application-core. This is a common pattern in the spray docs that completes the RequestContext asynchronously so our routing Actor can quickly move on to handling the next request.

[Figure: ask between the routing layer and the application-core]

The application-core module contained Actors that handled different types of request. For example, one Actor would have the task of retrieving a list of items for a given wishlist. These Actors would usually make several requests to different REST clients in parallel and aggregate the results. We couldn’t think of a nice way to use tell here, so instead used the ask pattern again. The ask pattern allowed us to immediately get a handle on each API response and aggregate them via Future composition.

[Figure: ask between the application-core Actors and the REST clients]

We weren’t very happy with this design. It caused us a couple of problems…

Problem 1 – ask timeouts are hard to debug

Our design used a lot of asks. Unlike tell, ask requires a timeout for the Future it returns. It is important to get the handling of these timeouts right, because even if you don’t see them in development, they will happen at some point in production, when servers become slow or networks become congested. When an ask timeout is reached, you will see the following Exception logged:

akka.pattern.AskTimeoutException: Timed out
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:312)
at akka.actor.DefaultScheduler$$anon$8.run(Scheduler.scala:191)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:137)
at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(…)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:262)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(…)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1478)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(…)

This Exception is useless for debugging purposes:

  • The message, “Timed out”, doesn’t tell us anything about which ask in our system timed out. Unlike Actors, Futures do not have names, so Akka cannot give us any more information here
  • The stack trace doesn’t contain our own com.netaporter packages, only those for the Scala and Akka internals. This may get better in the future; see Iulian Dragos’s recent presentation on replacing stack traces in Akka applications with something more useful

When we first started seeing these AskTimeoutExceptions, we dropped a few debugger breakpoints and eventually worked out it was due to a RESTful API performing slowly. We started using recoverWith on the Futures we got back from ask to give better failure messages.
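A minimal sketch of the recoverWith idea, using only the Scala standard library. Here `askProducts` is a stand-in for an ask against a REST client Actor, and `UpstreamTimeout` is a hypothetical exception type; neither name comes from the original code.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object RecoverWithSketch {
  // Hypothetical exception type that says which call timed out.
  final class UpstreamTimeout(what: String, cause: Throwable)
    extends RuntimeException(s"Timed out waiting for $what", cause)

  // Stand-in for an ask: a Future that fails with a context-free timeout.
  def askProducts(): Future[String] =
    Future.failed(new TimeoutException("Timed out"))

  // Replace the anonymous timeout with a failure that names the slow dependency.
  def productsWithContext(): Future[String] =
    askProducts().recoverWith {
      case e: TimeoutException =>
        Future.failed(new UpstreamTimeout("the product API", e))
    }

  def main(args: Array[String]): Unit = {
    val failure = Try(Await.result(productsWithContext(), 1.second)).failed.get
    println(failure.getMessage) // Timed out waiting for the product API
  }
}
```

Every ask needs its own block like this, which is where the clutter comes from.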

This made debugging easier, and allowed us to return more useful error messages from our RESTful API, but our first design had a lot of asks and we weren’t particularly happy with all the clutter this added to our code base.

Shortly after making these changes, we realised that spray-client actually gives really good error messages in the event of request timeouts. Why weren’t we seeing them?

Problem 2 – ask timeouts can hide other failures

It turns out we weren’t seeing the spray-client timeout failures because we had failed to configure our ask timeouts sensibly. We had configured spray.can.client.request-timeout to 5 seconds and had also configured our ask timeouts to 5 seconds, like so:

[Figure: ask timeouts and the spray-client request timeout, each set to 5 seconds]

Can you see the problem here?

Because the ask timeouts are earlier in the stack than the spray-client timeouts, they will always be reached first. The useful spray-client RequestTimeoutExceptions are not propagated up the chain of Futures, as Futures further up the chain have already been completed with the less helpful AskTimeoutException. This, combined with all our error logging for ask Future failures being done in the routing layer, meant that the RequestTimeoutExceptions were being hidden from logs. This also prevented us from providing more useful error messages to the clients of our RESTful API.
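The shadowing effect can be reproduced with plain Scala Futures. The sketch below is an illustration only: `failAfter` and `withOuterTimeout` are hypothetical helpers that imitate a slow client call and an ask-style timeout wrapped around it, and the durations are shortened so it runs quickly.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit, TimeoutException}
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object TimeoutShadowing {
  // Daemon timer thread, so the JVM can exit once main returns.
  private val timer = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "timer")
      t.setDaemon(true)
      t
    }
  })

  // A Future that fails with `error` once `delay` has elapsed.
  def failAfter[T](delay: FiniteDuration, error: Throwable): Future[T] = {
    val promise = Promise[T]()
    timer.schedule(new Runnable { def run(): Unit = promise.tryFailure(error) },
      delay.toMillis, TimeUnit.MILLISECONDS)
    promise.future
  }

  // Imitates an ask: whichever finishes first, the inner call or the generic timeout, wins.
  def withOuterTimeout[T](inner: Future[T], timeout: FiniteDuration): Future[T] =
    Future.firstCompletedOf(Seq(inner, failAfter[T](timeout, new TimeoutException("Timed out"))))

  // Returns the message of the failure that the caller at the top of the chain sees.
  def failureSeen(outer: FiniteDuration, inner: FiniteDuration): String = {
    val clientCall = failAfter[String](inner, new RuntimeException("Request to product API timed out"))
    Try(Await.result(withOuterTimeout(clientCall, outer), 5.seconds)).failed.get.getMessage
  }

  def main(args: Array[String]): Unit = {
    // Outer timeout fires first: the descriptive client error is never seen.
    println(failureSeen(outer = 100.millis, inner = 500.millis)) // Timed out
    // Outer timeout is longer: the client error propagates.
    println(failureSeen(outer = 500.millis, inner = 100.millis)) // Request to product API timed out
  }
}
```

The second call shows why the outer timeouts have to be longer than the inner ones for the useful error to reach the caller.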

The cause of this problem was subtle, but making the situation better in the short-term was fairly easy:

  • Sort out the timeouts – We made sure we had larger timeouts at our routing layer and then progressively made them smaller as we got deeper into the application. Although this helped, it became a nightmare to manage, as we were using the ask pattern a lot.
  • Log failures at the call-site – For belt and braces, in case we accidentally misconfigure the timeouts in the future, we also added logging into our recoverWith blocks.

This further reduced the signal-to-noise ratio in our code, making it harder to read.

Improved Design with tell

To better deal with these problems, we decided to remove asks from our design and use tell instead. Our inspiration for how to solve this came from a post on the Spray mailing list which described the Actor per-request model.

In the new design we start with no Actors in the application-core:

[Figure: the new design at rest, with no Actors in the application-core]

When a request comes in to our routing layer, we now immediately spin up a new per-request Actor and pass it the Spray RequestContext. This frees up the routing Actor to deal with another request. The job of the per-request Actor is simply to hold the RequestContext, spin up another Actor in the application-core, send it a tell, then wait on a reply message or a failure, completing the RequestContext as appropriate.

[Figure: the routing layer creating a per-request Actor]

Since our “Get Wishlist Items” Actor is now also scoped to a single request, it can store state related to the request as member fields. This allows us to build up our aggregation of products and wishlists using tell. Each time we receive a tell response from one of the REST clients, we can store it in a field and keep doing so until we have all the data we need in fields, at which point we can aggregate the data in those fields.
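A rough sketch of such a request-scoped Actor, written against the classic Akka Actor API. The message types in `WishlistProtocol` and the three client ActorRefs are assumptions made up for this example, not the real protocol.

```scala
import akka.actor.{Actor, ActorRef}

// Hypothetical message protocol, invented for this sketch.
object WishlistProtocol {
  final case class GetItems(customerId: String)
  final case class GetWishlist(customerId: String)
  final case class Wishlist(productIds: List[String])
  final case class GetProducts(productIds: List[String])
  final case class Products(nameById: Map[String, String])
  final case class GetAlerts(customerId: String)
  final case class Alerts(alertByProductId: Map[String, String])
  final case class Item(productId: String, name: String, alert: Option[String])
  final case class Items(items: List[Item])
}

class GetWishlistItemsActor(wishlistClient: ActorRef,
                            productClient: ActorRef,
                            alertClient: ActorRef) extends Actor {
  import WishlistProtocol._

  // Request-scoped state: safe as fields because this Actor serves one request only.
  private var requester: Option[ActorRef] = None
  private var productIds: Option[List[String]] = None
  private var products: Option[Map[String, String]] = None
  private var alerts: Option[Map[String, String]] = None

  def receive: Receive = {
    case GetItems(customerId) =>
      requester = Some(sender())
      // Both requests are sent with tell and proceed in parallel.
      wishlistClient ! GetWishlist(customerId)
      alertClient ! GetAlerts(customerId)
    case Wishlist(ids) =>
      productIds = Some(ids)
      productClient ! GetProducts(ids)
    case Products(nameById) =>
      products = Some(nameById)
      replyIfComplete()
    case Alerts(byId) =>
      alerts = Some(byId)
      replyIfComplete()
  }

  // Aggregates and replies only once every field has been filled in.
  private def replyIfComplete(): Unit =
    for (to <- requester; ids <- productIds; names <- products; found <- alerts) {
      val items = ids.map(id => Item(id, names.getOrElse(id, "unknown"), found.get(id)))
      to ! Items(items)
    }
}
```

No Future composition or per-call timeout appears anywhere; each reply simply fills in one more field.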

[Figure: the request-scoped Actor aggregating REST client responses via tell]

When the per-request Actor receives a response from the application-core, it completes the RequestContext and then kills itself. This has the nice property of also killing any request-scoped Actors in the application-core via the supervision hierarchy.

So how does the new design help with the problems we were facing before?

Solution 1 – tell is easier to debug

We now never see the useless AskTimeoutException in our logs. Huzzah!

Also, we are now using Actors where we were previously using Futures. Since Actors have names, traceability of messages is much better using LoggingReceive. This, in turn, makes debugging easier.
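As a small illustration, wrapping a receive block in LoggingReceive looks roughly like this. `ProductLookup` and its String protocol are hypothetical. Messages are only logged when `akka.actor.debug.receive` is switched on in the configuration and the log level is DEBUG.

```scala
import akka.actor.Actor
import akka.event.LoggingReceive

// Hypothetical Actor used only to show where LoggingReceive goes.
class ProductLookup extends Actor {
  // With receive logging enabled, each incoming message is logged
  // at DEBUG level by this Actor before the handler below runs.
  def receive: Receive = LoggingReceive {
    case productId: String => sender() ! s"product:$productId"
  }
}
```

Giving the Actor a meaningful name when it is created, for example `context.actorOf(Props(new ProductLookup), "product-lookup")`, is what makes those log lines traceable.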

Solution 2 – tell doesn’t hide failures

We no longer have to configure loads of timeouts in the application-core. Instead, we now have a single timeout set in the per-request Actor via Akka’s setReceiveTimeout. This timeout defines how long we are willing to take building a response, before we instead send a timeout error to our clients. If this timeout is reached, then the per-request Actor will kill itself and any request-scoped Actors in the application-core.

Until the per-request Actor timeout is reached, any non-recoverable failures in the application-core are free to be escalated up the supervision hierarchy to the per-request Actor and completed on the RequestContext as a useful error message. This means we no longer run the risk of shadowing spray-client timeout errors as we did in the first design.
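The shape of such a per-request Actor can be sketched as follows, using the classic Akka Actor API. This is an illustration under assumptions rather than the production code: `complete` is a hypothetical callback standing in for completing the Spray RequestContext, responses are plain Strings, and the 2 second timeout is arbitrary.

```scala
import akka.actor.{Actor, OneForOneStrategy, Props, ReceiveTimeout, SupervisorStrategy}
import scala.concurrent.duration._

// `complete` is a hypothetical stand-in for completing a Spray RequestContext.
class PerRequestActor(complete: String => Unit,
                      workerProps: Props,
                      message: Any) extends Actor {

  // The single timeout for the whole request.
  context.setReceiveTimeout(2.seconds)

  // The request-scoped worker is a child, so stopping this Actor stops it too.
  context.actorOf(workerProps, "worker") ! message

  def receive: Receive = {
    case ReceiveTimeout => finish("504 Request timed out")
    case reply: String  => finish(s"200 $reply")
  }

  // Failures in the worker arrive here, so their message can reach the client.
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy() {
      case e: Exception =>
        finish(s"500 ${e.getMessage}")
        SupervisorStrategy.Stop
    }

  // Complete the request exactly once, then stop this Actor and its children.
  private def finish(response: String): Unit = {
    complete(response)
    context.stop(self)
  }
}
```

The routing layer would create one of these per incoming request, for example with `context.actorOf(Props(new PerRequestActor(callback, workerProps, message)))`, where the three arguments are whatever the route has to hand.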

Downsides of the Actor per-request model

Before you switch to using the Actor per-request model, there are a couple of points to consider:

Actor per-request may not help you

The Actor per-request model is a good fit when you need to manage many request-scoped actors in your application-core. There are other Akka applications we have built here at NET-A-PORTER that do not have this requirement, and the Actor per-request model has not been so appropriate in these cases.

Performance

We considered whether using the Actor per-request model was going to be a performance bottleneck. There is an overhead for spinning up several Actors per request. However, for us it is not big enough to be an issue. An Akka system is designed to work well with millions of Actors. Actors are cheap; they are fast to create and only cost around 300 bytes of memory. This, combined with the ability to scale this service across many machines, means the performance penalty is outweighed by the benefits of a cleaner design.

Conclusion

We’d advise you to use the tell pattern whenever you can and only ask when required. If you find yourself using the ask pattern a lot, you could find yourself struggling to manage all the timeouts.

If you are building a web service similar to ours that involves aggregating data from various sources, then consider using the Actor per-request model to promote the “tell, don’t ask” pattern.

Our code for the per-request Actor can be found on GitHub and also as an Activator template.

This entry was posted in Akka, Scala by Ian Forsey.

About Ian Forsey

Ian Forsey is a server-side developer working on the product and search systems at NET-A-PORTER. He also blogs about programming and technology on his personal blog at http://theon.github.com

5 thoughts on “Ask, Tell and Per-request Actors”

  1. “Actors are cheap”; and threads are expensive. Threads are hard to control; Actors shouldn’t be controlled. Threads are opaque by accident; Actors are opaque by design.

    That sums up a lot of the beauty of this model.

    Actor per request sounds like a pretty solid way to avoid monkeying about with Futures, a big plus for Java shops venturing into this space. I am pushing our engineers to use ask pattern/akka Futures exclusively in non-actor code interacting with actors (such as JMS listener threads and service wrappers).

    A question though: have you found the need to govern the creation rate of new Request actors in the router? E.g. under a burst of load, have you encountered request latency you might attribute to many small actors with small mailboxes operating concurrently with no shared priority guarantees?

    • “Monkeying about with Futures” – good phrase! That’s exactly what it feels like.

      We have not found the need to govern the creation rate of per-request Actors, but it would be a good safety measure to put in place. It would also be fairly straightforward to do, just by having the routing actor only spawn new children if `context.children.size` is below a certain value.

      During our load testing, reaching the limits of network IO and disk IO were the biggest factors that increased request latency. Up until then, any performance impact from using the actor per-request model was negligible in comparison.

      • It’s enterprise, no unbounded queues. No unbounded anything. Okay, maybe S3 :D

        I might choose to implement this with a bounded pool of actors managed by the router in two states (working and available). This could give both a size limit and avoid gc. Incidentally, a pool of actors whose responsibility is to act as a temporary dedicated servant for shuttling about work for one party but who are managed by a third might be appropriately named the “Porter pattern.”

  2. Hi Ian,

    You have proposed a great solution, but I have some notes about Akka actor performance.

    Default configuration of Akka actors doesn’t use fastest mailbox type “akka.dispatch.SingleConsumerOnlyUnboundedMailbox” and actor creation time is much greater than in other actor libraries (from 4μs to 30μs while in other actor libraries it takes less than 100ns):
    https://github.com/plokhotnyuk/actors/blob/5cf8a073608a62624e38bd07da2f2e21348dcbe2/out2.txt#L368
    https://travis-ci.org/plokhotnyuk/actors/jobs/15581267#L1982

    Also Akka actors are not cheapest in memory usage which is about 430 bytes for minimal actors (current version of Scalaz actors – 184 bytes, old and deprecated Scala actors – 144 bytes, Lift actors – 72 bytes).

    There are also additional costs of memory usage when messages are enqueued in internal actor queues/mailboxes and waiting to be handled…

    • Hi Andriy,

      Thanks for those numbers; it is interesting to see how the different actor libraries compare. I would say that choosing an actor library is a decision that can be made separately from whether to use the actor per-request model or not.

      I think it’s important for people to load test their applications and make their own decision about whether their design combined with their preferred actor library provides satisfactory performance in the context of their use case.
