Reliable Pub/Sub Messaging on AWS with QBot
At Raisebook, we’ve built a tool called “QBot” to help ensure that event-based messages are reliably delivered to where they are supposed to go.
Our applications are a fairly typical arrangement of loosely-coupled microservices, which react to events delivered by messages transmitted over the Simple Queue Service (SQS) and Simple Notification Service (SNS) on Amazon Web Services’ cloud platform.
The problem is that, using these services, there is a good chance that sent messages never arrive at their intended destination!
Wait… messaging on AWS isn’t reliable?
The answer to that is, it sort of depends.
It depends on which mechanisms you are using to send the messages, and how you’ve configured them.
SQS is fine…
SQS is a one-to-one mapping of a single queue to a single consumer endpoint. This is the easier case. If you configure your queue with what’s known as a dead-letter queue (DLQ), then for as long as the receiving end (the consumer) does not acknowledge a message as processed, SQS will redeliver it a few times. If it still has not been acknowledged after that, the message ends up on the DLQ, where it can be manually corrected and re-queued.
In this sense it is reliable: no messages are lost, although they may be delayed until they can be handled.
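As a rough illustration of the DLQ configuration described above, here is a minimal Python sketch that builds the queue attributes linking a main queue to its dead-letter queue. The function name, the example ARN, and the retry count are assumptions made for this example; only the `RedrivePolicy` attribute and its `deadLetterTargetArn` and `maxReceiveCount` fields come from SQS itself.

```python
import json


def dlq_queue_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the SQS queue attributes that attach a dead-letter queue.

    max_receive_count is how many times a message may be received without
    being deleted before SQS moves it to the dead-letter queue.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy as a JSON string under the "RedrivePolicy" key.
    return {"RedrivePolicy": json.dumps(redrive_policy)}


# Example ARN and retry count are placeholders, not real resources.
attributes = dlq_queue_attributes(
    "arn:aws:sqs:us-east-1:123456789012:example-dlq", max_receive_count=3
)
```

With boto3, a dictionary like this would typically be passed as the `Attributes` argument when creating the main queue or updating its attributes.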
SNS, not so much
SNS is a Publisher/Subscriber model of messaging. This means that any number of consumers can subscribe to a topic, so that when a publisher sends a message, all of the subscribers will receive a copy of the same message.
Unfortunately, on AWS there is no guarantee that each subscriber will receive and process the message. Each consumer endpoint must be available and operating correctly at the time the message is sent. If it is not, then after a couple of retries AWS will drop the message destined for that subscriber, never to be seen again.
… except for SNS to Lambda functions
In December 2016, Amazon added the ability for Lambda functions to drop failed messages into Dead Letter Queues, which partially solves the problem, but only if all your consumer endpoints are Lambda functions.
HTTP or other types of consumers
If, like us, you have SNS consumers which are not Lambda functions, you still have a problem. If you have HTTP(S) endpoints (we do), then you have an additional problem — AWS make it very, very hard to programmatically register them with an SNS topic.
Introducing QBot
What if we had a tool that still let us use the Pub/Sub style of messaging, but combined it with the convenience of SQS and Dead Letter Queues?
QBot is a service that acts as the topic subscriber, and then uses SQS to reliably send those messages out to the originally intended endpoints.
Using QBot requires a bit more resource setup: each subscription now needs two SQS queues, a main queue and a DLQ. The main queue is then set as the subscriber for the topic. QBot polls these queues and, when a message comes in, attempts to deliver it to the final destination endpoint. This can be either a Lambda function or another HTTP(S) microservice.
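To make the flow concrete, here is a small in-memory Python simulation of the poll-and-deliver idea. It is not QBot’s actual implementation (QBot is written in Elixir and talks to real SQS queues); the `Message` class, the `drain_queue` function, and the `flaky_endpoint` stand-in are all hypothetical names invented for this sketch. A message that is delivered successfully is removed, a failed one goes back on the queue, and after `max_receive_count` failed attempts it lands on the dead-letter list.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Message:
    body: str
    receive_count: int = 0


def drain_queue(
    main_queue: Deque[Message],
    dead_letters: List[Message],
    deliver: Callable[[str], None],
    max_receive_count: int = 3,
) -> List[str]:
    """Attempt delivery of every queued message; return the delivered bodies."""
    delivered = []
    while main_queue:
        message = main_queue.popleft()
        message.receive_count += 1
        try:
            deliver(message.body)
        except Exception:
            # Delivery failed: retry later, or give up and dead-letter it.
            if message.receive_count >= max_receive_count:
                dead_letters.append(message)
            else:
                main_queue.append(message)
            continue
        # Success is the equivalent of acknowledging (deleting) the message.
        delivered.append(message.body)
    return delivered


def flaky_endpoint(body: str) -> None:
    """Stand-in for an HTTP(S) or Lambda consumer that rejects one message."""
    if body == "bad":
        raise RuntimeError("endpoint unavailable")


main = deque([Message("order-created"), Message("bad"), Message("order-paid")])
dlq: List[Message] = []
delivered = drain_queue(main, dlq, flaky_endpoint)
```

In this run the two healthy messages are delivered, while the failing one is retried until it reaches the limit and is parked on the dead-letter list instead of being lost.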
Auto(ish) configuration
So how does QBot know where to send messages on to? We attach this information to the queues themselves via the Metadata property, using CloudFormation to define our environment.
On startup, QBot will query the AWS account for any queues with this metadata set on them, and start up a worker process for each queue it will monitor. As it is written in Elixir, this is very simple and lightweight to manage.
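The startup step can be sketched roughly as follows. This is an illustration in Python rather than QBot’s Elixir code, and everything in it is an assumption: the `DeliveryTarget` metadata key, the shape of the queue descriptions, and the function names are hypothetical, and threads stand in for Elixir’s lightweight processes.

```python
import threading
from typing import Callable, Dict, List

# Hypothetical metadata key; the key names QBot really looks for may differ.
TARGET_KEY = "DeliveryTarget"


def queues_to_monitor(queues: List[dict]) -> Dict[str, str]:
    """Map queue name to delivery target for queues carrying the metadata."""
    return {
        queue["name"]: queue["metadata"][TARGET_KEY]
        for queue in queues
        if TARGET_KEY in queue.get("metadata", {})
    }


def start_workers(
    targets: Dict[str, str], work: Callable[[str, str], None]
) -> List[threading.Thread]:
    """Start one worker thread per monitored queue and return the threads."""
    threads = [
        threading.Thread(target=work, args=(name, target), name=f"worker-{name}")
        for name, target in targets.items()
    ]
    for thread in threads:
        thread.start()
    return threads


# Made-up queue descriptions standing in for what an account query returns.
queues = [
    {"name": "orders-main", "metadata": {TARGET_KEY: "https://example.com/orders"}},
    {"name": "unrelated-queue", "metadata": {}},
    {"name": "emails-main", "metadata": {TARGET_KEY: "https://example.com/emails"}},
]

started: Dict[str, str] = {}
lock = threading.Lock()


def record_worker(name: str, target: str) -> None:
    """Placeholder worker body: a real worker would poll the queue in a loop."""
    with lock:
        started[name] = target


threads = start_workers(queues_to_monitor(queues), record_worker)
for thread in threads:
    thread.join()
```

Queues without the metadata are simply ignored, so adding a new subscription is a matter of defining its queues with the right metadata and restarting the service.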
In a future episode…
Soon, we will dive into the specifics of how it works, what message formats it prefers, and more of why Elixir is such a good choice for this type of application.
In the meantime, why don’t you check out the code on GitHub and let us know what you think!