We have processes that communicate via ActiveMQ. They mostly work, but sometimes we get very strange problems. This is the story of one of them.
The consumer I’m talking about is written in Perl, using the Net::Stomp module.
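For context, the consumer is essentially a subscribe-and-loop program. A minimal sketch of such a consumer — broker address, credentials, queue name, and the `process` handler are all made up for illustration:

```perl
use strict;
use warnings;
use Net::Stomp;

# Hypothetical broker details, for illustration only
my $stomp = Net::Stomp->new({ hostname => 'broker.example.com', port => 61613 });
$stomp->connect({ login => 'consumer', passcode => 'secret' });

$stomp->subscribe({
    destination => '/queue/some.queue',
    ack         => 'client',            # we ACK each frame explicitly
});

while (1) {
    my $frame = $stomp->receive_frame;
    process($frame->body);              # hypothetical message handler
    $stomp->ack({ frame => $frame });   # one ACK per consumed frame
}
```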
What was happening
The main symptom was that the consumer process died after an apparently random time, logging only “wrote 4 characters out of the 94 character frame” (the STOMP frame in question being an ACK), followed by various other log messages indicating a failed reconnection to the broker.
From the consumer’s perspective, the write system call had failed, which only happens when the socket is broken, so it reconnected (reasonably enough); but the broker did not accept the new connection, at which point the consumer could only stop completely.
From the perspective of the broker, nothing was wrong with the ‘broken’ connection, so the new connection (having the same client id) was a mistake, and was promptly refused.
What was going on, to give the processes at both ends of the socket such different views on the state of the world?
Beneath the surface
Digging a little deeper, we noticed that the TCP send and receive queues on the STOMP socket were very large, on both machines. This would have explained some of the behaviour, since the consumer would have reconnected before the broker could receive the FIN packet. At first, this pointed to a low-level network problem, maybe with the underlying hardware. Rebuilding the VMs on different hardware did not help, though.
After too long (we should have thought of it earlier), we ran a simple experiment: shuttle some random data between the two machines, and see if the TCP queues get large on that connection, as well. They did not.
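The sender half of that experiment takes only a few lines of Perl; while it runs, the Send-Q and Recv-Q columns of `netstat` (or `ss`) on both machines show whether the queues grow. The peer address and port here are placeholders, and the other machine is assumed to run a deliberately slow reader:

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Placeholder peer, for illustration; a slow reader on the other machine
# drains the socket while we watch the TCP queue sizes on both ends.
my $sock = IO::Socket::INET->new(
    PeerAddr => 'other-machine.example.com',
    PeerPort => 12345,
    Proto    => 'tcp',
) or die "connect: $!";

my $chunk = join '', map { chr int rand 256 } 1 .. 4096;
while (1) {
    print {$sock} $chunk or die "write: $!";
}
```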
So it was not a network problem; it was an application problem. I, being of a naturally suspicious and untrusting nature, decided that the problem had to be in the broker. The fact that I had read the entire network-related source code of the consumer, and written most of it, also pointed me in that direction.
Of course, saying “the problem must be in that complex piece of software I don’t understand” does not help much…
As luck would have it, I was in the process of porting that consumer from our original Perl-ActiveMQ libraries to newer and cleaner ones. One of the goals was to get the same behaviour as our current libraries, with added flexibility to change it later. Studying the current libraries in depth, I noticed that the default configuration set the subscription header exclusive.
Looking at the (rather sparse) documentation about ActiveMQ and STOMP, I discovered two relevant things:
- exclusive is not what I want in the future (it interferes with having multiple consumers load-balancing on the same queues)
- the default prefetch is not one frame as I thought, but “whatever the broker can get away with”
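With Net::Stomp, both of these are just extra headers on the SUBSCRIBE frame, so the old libraries’ default amounted to something like the following (the queue name is made up, and I am paraphrasing their configuration, not quoting it):

```perl
$stomp->subscribe({
    destination          => '/queue/some.queue',
    ack                  => 'client',
    'activemq.exclusive' => 'true',   # lock the queue to this one consumer
    # no explicit prefetch: the broker pushes as many frames as it can
});
```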
The cause of the problem
It finally clicked: I knew the cause!
- the consumer connects and subscribes
- the broker writes all the frames it can to the socket, then blocks
- the consumer consumes one frame at a time, sending back one ACK for each
- the broker listens for ACKs, and for each one received, it marks the corresponding message as ‘dispatched’ (or removes it entirely, it does not matter)
This works fine until you have very many messages in a queue, the consumer is slow, you’re using KahaDB, and the KahaDB access on the broker is slow as well.
When the broker received an ACK, it took more than half a second to find the corresponding message and mark it as dispatched. This is apparently enough to fill up the TCP queues and block all communication on the socket, with the observed outcome.
The solution
Avoid filling the TCP queue, which translates to “prefetch a single frame”. This avoids filling the receive queue on the consumer’s side, and also helps the broker process the ACKs faster, since it only has a single half-dispatched message to check instead of hundreds.
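With ActiveMQ’s STOMP extensions, this comes down to a single subscription header. A sketch of the fix (queue name made up, as before):

```perl
$stomp->subscribe({
    destination             => '/queue/some.queue',
    ack                     => 'client',
    'activemq.prefetchSize' => 1,   # broker sends one frame, then waits for our ACK
});
```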
(In production, where we use MySQL instead of KahaDB, the problem is nearly non-existent, since MySQL is much faster at retrieving and updating records.)