When Latency Strikes

As a developer-by-nature, I’ve been fortunate enough not to have had to
understand global networking, and the effect of the speed of light, in any
great detail. I’ve been lucky enough to be able to write code that
“does what it’s supposed to” without much thought to global server
positioning.

This changed quickly and suddenly for me towards the end of 2012.

The Setup

We’re a large company now, and have offices in various locations – as
illustrated by the diagram below.

[Diagram: across_the_world_001 – our offices around the world]

We have to shuttle data between locations for our day-to-day operations. If these operations take too long, our users become frustrated.

The Problem

For a long time, we didn’t have any problem. Well, not one that was critical
enough to cause disruption to the business.

For a few years, we’ve been shuttling messages back-and-forth from “somewhere
in Europe” to “somewhere far away but not too far” (as illustrated by the
snowy office in the diagram). Things were “a tiny bit slow” – latency was
around the 70ms mark, and within the bounds of people’s acceptance.

Recently, we had to start sending data much further away (so far that it
feels like we’re sending to Australia!).

The same day we started sending data to this new location, we discovered that
our messages to all locations were horrifically slow. As well as being
unexpected, this was definitely unacceptable behaviour in our infrastructure.

The Cause

After a little investigation, we realised that the cause was how, exactly, we dispatched messages to remote locations – a decision we’d taken about five years earlier.

The pseudo-code for the behaviour would look something like:

distribute_remote_message {
  foreach location in remote_locations
     prepare_remote_message;
     insert_task_in_remote_datastore;
  end
  task_complete;
}
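
To make that more concrete, here is roughly how that loop could look as real
Perl. The function bodies are illustrative stand-ins rather than our actual
implementation; the point is the single loop, which can’t finish until every
destination has been written to in turn:

#!/usr/bin/perl
use strict;
use warnings;

my @remote_locations = ('near europe', 'moderately far away', 'quite far away');

# Stand-in for preparing the per-location copy of a message.
sub prepare_remote_message {
    my ($message, $location) = @_;
    return { destination => $location, payload => $message };
}

# Stand-in for the write into the remote datastore; in the real system this
# blocks the caller for that location's full round-trip latency.
sub insert_task_in_remote_datastore {
    my ($remote_message) = @_;
    print "sent to $remote_message->{destination}\n";
}

sub distribute_remote_message {
    my ($message) = @_;
    for my $location (@remote_locations) {
        my $remote_message = prepare_remote_message($message, $location);
        insert_task_in_remote_datastore($remote_message);
    }
}

distribute_remote_message('hello, world');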

To be cautious, we had a single FIFO queue to transport the messages to their
remote locations.

[Diagram: across_the_world_002 – a single queue feeding every remote location]

Until you add a really slow (high latency) communications link, it’s easy to overlook why this
implementation is less than ideal.

Let’s put some pseudo-random numbers into the mix and see how things pan
out; we’ll use latency as our measure of ‘speed’, because that turned out to be
where our problems came from.

  • Europe → “near Europe” – 10ms
  • Europe → “moderately far away” – 80ms
  • Europe → “quite far away” – 300ms

Before we had somewhere “quite far away”, our implementation took 90ms to
complete a task and move on to the next one.

Once we added “quite far away”, our time to process and move on rocketed to
390ms. That’s more than four times as long!

This implementation effectively penalised every remote location by coupling
each location’s rate of consumption to that of the slowest destination we
were sending to.
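
A quick back-of-the-envelope check (in Perl, using the illustrative latencies
above rather than real measurements) shows where the damage comes from: under
the single-queue design, the cost of every task is the sum of all the
destinations’ round trips:

use strict;
use warnings;
use List::Util qw(sum);

my %latency_ms = (
    'near europe'         => 10,
    'moderately far away' => 80,
    'quite far away'      => 300,
);

# Every task pays every destination's round trip, one after the other.
my $before = $latency_ms{'near europe'} + $latency_ms{'moderately far away'};  # 90ms
my $after  = sum values %latency_ms;                                           # 390ms

printf "before: %dms per task, after: %dms per task (%.1fx as long)\n",
    $before, $after, $after / $before;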

Now imagine how upset people “near Europe” became when things were visibly slow
getting to them, affecting their time-sensitive processes.

Unpicking the tangled web

Once the issue affected us, it became a case of: “Argh! How do we resolve this
quickly?” There wasn’t time to rewrite existing code to Do Something
Different, nor was Turn Off The Slowest Destination an option.

The solution we conceived was to decouple the transmissions to the remote
destinations from one another: each remote location should receive messages
as quickly as we could transport them there, independent of any other
location in the mix.

We modified the process so that distributing a remote message no longer
transfers the message straight to its intended destination. Instead, it
wraps the original message in one new message per remote location, each
marked with the single destination it should be sent to.

Rough pseudo-code for that would be:

distribute_remote_message {
  foreach location in remote_locations
     wrap_up_remote_message;
     drop_proxy_message_in_message_queue;
  end
  task_complete;
}
transfer_message_to_remote (location) {
   insert_task_in_remote_datastore (location);
}

This allows us to have one process per location, transporting messages to
their destination as fast as possible. The initial distribution task is
freed up much sooner, so it can split incoming tasks into their per-location
proxy messages much more quickly.

[Diagram: across_the_world_003 – one queue per remote location]
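
As a rough sketch of that shape, here is a self-contained Perl example using
one in-process queue and one worker thread per location. The real system uses
persistent queues and separate worker processes rather than threads, so treat
the names and mechanics here as stand-ins:

#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my @remote_locations = ('near europe', 'moderately far away', 'quite far away');

# Stand-in for the write into the remote datastore.
sub insert_task_in_remote_datastore {
    my ($location, $message) = @_;
    print "delivered '$message' to $location\n";
}

# One queue per location: a slow link now only delays its own queue.
my %queue_for;
$queue_for{$_} = Thread::Queue->new for @remote_locations;

# One worker thread per location, draining its own queue independently.
my @workers = map {
    my $location = $_;
    threads->create(sub {
        while (defined(my $message = $queue_for{$location}->dequeue)) {
            insert_task_in_remote_datastore($location, $message);
        }
    });
} @remote_locations;

# The producer just wraps and enqueues, so it is free again almost
# immediately, however slow any individual link happens to be.
sub distribute_remote_message {
    my ($message) = @_;
    $queue_for{$_}->enqueue($message) for @remote_locations;
}

distribute_remote_message("message $_") for 1 .. 3;

# Signal the workers that nothing more is coming, then wait for them.
$queue_for{$_}->enqueue(undef) for @remote_locations;
$_->join for @workers;

The important property is that nothing in distribute_remote_message waits on
a remote round trip; each location’s worker absorbs its own latency.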

“Near Europe” and “moderately far away” receive their messages much faster,
and are able to resume operating at full speed again. In both cases, they are
actually receiving messages faster than in the original implementation.

“Quite far away” still receives its messages more slowly, and there’s often
a backlog of tasks crawling their way over there, but now only that location
is a victim of its own remoteness, instead of all of them.

Why not use ActiveMQ?

I’m sure some people will wonder why we brewed our own version of a message
broker network.

We did the work to distribute these tasks in our pre-ActiveMQ era, so we
were mostly inventing our own Poor Man’s Message Broker Network. We are
moving communications onto more sensible messaging solutions as priorities
and time allow.

But why is latency bad?

There are people on the internet who’ve been able to explain this far better
than I could hope to, so I’m going to end the article with some recommended
reading:
