All Aboard The Message Bus

message bus

A few months ago we decided it was time to refactor one of the major components that makes up the Capriza cloud platform – the Relay. It’s responsible for routing and relaying all messages passed between the Capriza runtime agent and a user’s mobile device.

Out of this major refactor emerged node-busmq, a high performance, highly available and scalable, message bus for node.js. Today, Capriza is open sourcing node-busmq; you can find it over at github: Feel free to use, fork and contribute to node-busmq. We hope you find it as useful as we do.

This post is about how node-busmq came into being.

Why Refactor?

The relay is a nifty little node.js process capable of serving thousands of connections simultaneously. We have several relay processes running on several machines, in multiple zones and regions around the world.

For a mobile device to communicate with the correct runtime agent, it needs to connect to the exact same relay that the runtime agent is connected to. This architecture served us quite well; it was scalable and performed well, but it also had several problems:

  • The mobile devices needs to know the exact address of the relay to connect to – not very cloud-like
  • The mobile device cannot connect to the correct relay until it receives the exact address from our API – performance penalty
  • Network disconnections are problematic to overcome – weak on robustness
  • Relay restarts are painful – again, weak on robustness
  • Mobile devices often have to connect to a relay that is geographically far away – performance penalty

We wanted a better solution that will scale infinitely, be resilient to restarts and disconnections, and improve performance.

The Clear Solution

Once we decided to tackle these problems, it became apparent that the correct solution was to use a full fledged, globally available (as in geographically) message bus. Here’s why:

  • Messages are stored in named queues, which is the only thing clients need to know
  • If a client disconnects, messages aren’t lost
  • Restarts are transparent
  • It performs well regardless of clients geographical locations

That was the easy part. Next we had to decide on the implementation. There are a plethora of message bus implementations out there, ranging from naive and all the way to enterprise grade commercial products. We were looking for something that had the following features:

  • Node.js binding
  • Named message queues / channels
  • Guaranteed delivery
  • High availability
  • Scalability
  • Federation capabilities
  • Fast
  • Open source

After some research, we narrowed it down to a choice between RabbitMQ and Redis. I know it may seem odd to compare the two; after all, they’re designed for very different purposes. Bare with me, I’ll get to it.

RabbitMQ is a very robust, mature messaging system with all of the features we were looking for, and then some. Redis is a very robust and mature key-value cache, acting as a data store server for sharing data between processes and machines.

The key difference here is that Redis would require additional work to meet our needs, whereas RabbitMQ would work pretty much out-of-the-box.

So, RabbitMQ or Redis?

We started out with RabbitMQ, the obvious choice. We had our concerns around the fact that we had zero experience with erlang, the language on which RabbitMQ is built on, and around hidden unknowns lurking in the shadows of unfamiliar technology.

We put together a simple benchmark that sends and consumes messages as fast as possible to test how RabbitMQ fares. Of course, the benchmark was written in node.js as that was the whole purpose.

First, some (possibly boring) background: the benchmark was performed using two c3.xlarge AWS machines running Debian 7. Each machine has 4 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz and 7.5GB of RAM. One machine was setup to run the RabbitMQ server. The second machine was setup to run 4 node processes executing the benchmarking code.

At first, we were blown away by the sheer speed at which messages were delivered. Around 60K message per second(!). But then, just after a few seconds, the benchmark started to choke; messages were being consumed in a very sporadic manner, until they stopped arriving altogether. We noticed that the all four CPUs of the machine running RabbitMQ were at a constant 100% utilization.

We tuned down the benchmark to produce less messages per second. We did that several times, each time reducing the number of messages the benchmark produced. The only effect it had was to postpone the time at which RabbitMQ entered 100% CPU utilization and choked.

When we reached 10K messages per second, RabbitMQ choked at around 35 seconds from the start of the run.

 All Aboard The Message Bus

The x-axis is time, y-axis is number of messages. The blue line is the number of produced messages, red is the number of consumed.

I have to say that we were pretty disappointed. A feature rich and mature message bus that isn’t stable under a constant flow of messages? Very surprising.

Granted, we ran RabbitMQ with default settings, and it’s entirely possible that with the right tuning, results would have been better. But like I said, there are bound to be “hidden unknowns lurking in the shadows.”

So we turned to Redis. Redis is a high performance in-memory key-value store, with built in queue capabilities. The down side is that we needed more that just queueing capabilities; we needed infinite scalability, high-availability and guaranteed delivery to name a few.

We have very good experience with Redis so we wanted to give it a shot and see if we can create a node.js module that provides the additional features we were looking for from a full fledged message bus, on top of those provided by Redis.

Node-busmq Is Born

We ended up with node-busmq, a node.js module with the following features:

  • Event based message queues
  • Event based bi-directional channels for peer-to-peer communication (backed by message queues)
  • Reliable delivery of messages (AKA Guaranteed Delivery)
  • Persistent Publish/Subscribe
  • Federation capabilities over distributed data centers
  • Auto expiration of queues after a pre-defined idle time
  • Scalability through the use of multiple redis instances and node processes
  • High availability through redis master-slave setup and stateless node processes
  • Tolerance to dynamic addition of redis instances during scale out
  • Fast

To make the comparison with RabbitMQ as fair as possible, we setup a machine running 4 redis servers, one redis per core. Unlike RabbitMQ, redis is single threaded and therefore can only utilize one CPU. We connected the 4 node.js processes to all of the 4 redis servers.

Running the same benchmark at maximum throughput with node-busmq yielded 10K messages per second. Not the whopping 60K RabbitMQ demonstrated, but much more importantly, messages flowed at a stable rate.

All Aboard The Message Bus

Oh, and node-busmq scales linearly, so doubling the number of node.js processes doubles the throughput.

Final Words

We’ve been using the new relay (with node-busmq) in production for a few weeks now and know that the investment we made in this module is paying off. We’re in the midst of deploying the new relay to additional data centers around the world; we’re using geo-based DNS routing to connect users to the closest physical relay, taking advantage of node-busmq’s built-in federation capabilities to reach the runtime agent. As the new relay is rolled out to more and more customers, we expect to see improvement in both performance and reliability.

About the Author

Nadav Fischer

Nadav Fischer is senior software developer at Capriza. He began his career at Mercury Interactive and later HP. His areas of expertise include backend architecture, security and performance. He's a strong believer in open source, and actively contributes to the open source community.