ActiveMQ, STOMP, and network problems that aren’t

We have processes that communicate via ActiveMQ. They mostly work, but sometimes we get very strange problems. This is the story of one of them.

The consumer I’m talking about is written in Perl, using the Net::Stomp module.
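
For context, the core of such a consumer is roughly this (a minimal sketch, with a made-up broker host, queue name and credentials; the real code does rather more):

    use Net::Stomp;

    # connect to the broker and subscribe in client-ACK mode
    my $stomp = Net::Stomp->new({
        hostname => 'broker.example.com',    # made-up host
        port     => 61613,
    });
    $stomp->connect({ login => 'user', passcode => 'password' });

    $stomp->subscribe({
        destination => '/queue/some.work.queue',    # made-up queue name
        ack         => 'client',                    # we ACK every frame ourselves
    });

    # consume one frame at a time, sending back one ACK for each
    while (1) {
        my $frame = $stomp->receive_frame;
        # ... do the actual work with $frame->body here ...
        $stomp->ack({ frame => $frame });
    }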

What was happening

The main symptom was that the consumer process died, after an apparently random time, logging only “wrote 4 characters out of the 94 character frame” (the STOMP frame in question being an ACK), followed by various other log messages indicating a failed reconnection to the broker.

From the consumer’s perspective, the write system call had failed, and that only happens when the socket is broken; so it reconnected (reasonably enough), but the broker did not accept the new connection, at which point the consumer could only stop completely.

From the broker’s perspective, nothing was wrong with the ‘broken’ connection, so the new connection (which had the same client id) looked like a mistake and was promptly refused.

What was going on, to give the processes at both ends of the socket such different views on the state of the world?

Beneath the surface

Digging a little deeper, we noticed that the TCP send and receive queues on the STOMP socket were very large, on both machines. This would have explained some of the behaviour: with the consumer’s FIN stuck behind all that queued data, the consumer would have reconnected before the broker could even receive it. At first, this pointed to a low-level network problem, maybe with the underlying hardware. Rebuilding the VMs on different hardware did not help, though.

After too long (we should have thought of it earlier), we ran a simple experiment: shuttle some random data between the two machines, and see if the TCP queues get large on that connection, as well. They did not.
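
The experiment needed nothing more sophisticated than a pair of throwaway scripts along these lines (host name and port are made up), while watching the Send-Q and Recv-Q columns of netstat on both machines:

    # receiver, on one machine: accept a connection, read and discard everything
    use IO::Socket::INET;
    my $server = IO::Socket::INET->new(
        LocalPort => 12345,    # made-up port
        Listen    => 1,
        ReuseAddr => 1,
        Proto     => 'tcp',
    ) or die "listen: $!";
    my $conn = $server->accept or die "accept: $!";
    my $buf;
    1 while $conn->sysread($buf, 65536);

    # sender, on the other machine: write random data as fast as possible
    use IO::Socket::INET;
    my $sock = IO::Socket::INET->new(
        PeerAddr => 'other-machine.example.com',    # made-up host
        PeerPort => 12345,
        Proto    => 'tcp',
    ) or die "connect: $!";
    my $chunk = pack 'C*', map { int rand 256 } 1 .. 65536;
    while (1) {
        $sock->syswrite($chunk) or die "write: $!";
    }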

So it was not a network problem; it was an application problem. I, being of a naturally suspicious and untrusting nature, decided that the problem had to be in the broker. The fact that I had read the entire network-related source code of the consumer, and had written most of it myself, also pointed me in that direction.

Of course, saying “the problem must be in that complex piece of software I don’t understand” does not help much…

As luck would have it, I was in the process of porting that consumer from our original Perl-ActiveMQ libraries to newer and cleaner ones. One of the goals was to get the same behaviour as the current libraries, with added flexibility to change it later. Studying the current libraries in depth, I noticed that their default configuration set the subscription header activemq.exclusive: 1.
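
In Net::Stomp terms, that default boils down to a subscription like this (a sketch, with $stomp being a connected Net::Stomp object as in the earlier example, and the same made-up queue name):

    # $stomp is a connected Net::Stomp object, as in the sketch above
    $stomp->subscribe({
        destination          => '/queue/some.work.queue',
        ack                  => 'client',
        'activemq.exclusive' => 1,    # set behind the scenes by the old defaults
    });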

Looking at the (rather sparse) documentation about ActiveMQ and STOMP, I discovered two relevant things:

  1. exclusive is not what I want in the future (it interferes with having multiple consumers load-balancing on the same queues)
  2. the default prefetch is not one frame as I thought, but “whatever the broker can get away with”

The cause of the problem

It finally clicked: I knew the cause!

  • the consumer connects and subscribes
  • the broker writes all the frames it can to the socket, then blocks
  • the consumer consumes one frame at a time, sending back one ACK for each
  • the broker listens for ACKs, and for each one received, it marks the corresponding message as ‘dispatched’ (or removes it entirely; which one does not matter here)

This works fine until you have very many messages in a queue, a slow consumer, KahaDB as the persistence store, and slow KahaDB access on the broker as well.

When the broker received an ACK, it took more than half a second to find the corresponding message and mark it as dispatched. That is apparently enough to fill up the TCP queues and block all communication on the socket, with the observed outcome.

The fix

Avoid filling the TCP queue, which translates to “prefetch a single frame”. This avoids filling the receive queue on the consumer’s side, and also helps the broker to process the ACKs faster, since it only has a single half-dispatched message to check instead of hundreds.
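
On an ActiveMQ STOMP subscription the prefetch is controlled by the activemq.prefetchSize header, so the fix amounts to something like this (again a sketch, reusing the $stomp object and made-up queue name from the first example):

    # ask the broker to dispatch at most one un-ACKed message at a time
    $stomp->subscribe({
        destination             => '/queue/some.work.queue',
        ack                     => 'client',
        'activemq.prefetchSize' => 1,    # prefetch a single frame
        # no 'activemq.exclusive' header, so several consumers can share the queue
    });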

(In production, where we use MySQL instead of KahaDB, the problem is nearly non-existent, since MySQL is much faster in retrieving and updating records.)
