If you’ve ever designed a large-scale distributed system, you’re familiar with the whack-a-mole style issues that occur when introducing messaging, where solving one set of problems only gives rise to new ones.
In the upcoming webinar, MVPs Sean Feldman and Saravana Kumar will share the most common pains they’ve experienced while implementing such systems, using Azure Service Bus, and the solutions they came up with to cure them. They will cover the following topics:
- Understanding message flow in a large system
- Resolving failure conditions
- Understanding the urgency, importance, and impact of these failures
Let’s take a closer look at what each of these topics will entail.
Understanding message flow in a large system
Software systems that require messaging are typically quite large in terms of endpoints and subsystems. As a result, the communication paths between these endpoints become complex, and it becomes hard to understand how the system actually supports the business processes. The reason is that Service Bus provides only the infrastructure to pass messages from one endpoint to another; it is not aware of any business-level meaning of the messages it carries. It doesn’t know whether the content of a message represents a command to be executed by a specific endpoint or an event representing something that has already happened. Yet routing still depends on this information to ensure that messages flow to the right endpoint. How will you manage this dichotomy? And what if a message has the wrong meaning, or no meaning at all? Imagine that the communication paths were easy to understand from a business perspective.
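To make the command/event distinction concrete, here is a minimal sketch in Python of how a message could carry its business intent alongside its payload, so that routing can depend on it. The `Intent`, `Envelope`, and `Router` names are hypothetical and exist only for illustration; this is an in-memory model, not an Azure Service Bus API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Intent(Enum):
    COMMAND = "command"  # a request for one specific endpoint to do something
    EVENT = "event"      # a notification that something has already happened


@dataclass
class Envelope:
    intent: Intent
    name: str                                 # business-level name, e.g. "PlaceOrder"
    body: dict = field(default_factory=dict)  # payload, opaque to the router


class Router:
    """Hypothetical in-memory routing table (illustration only)."""

    def __init__(self):
        self.command_owners = {}     # command name -> the single owning endpoint
        self.event_subscribers = {}  # event name -> list of subscribed endpoints

    def register_command(self, name, endpoint):
        self.command_owners[name] = endpoint

    def subscribe(self, name, endpoint):
        self.event_subscribers.setdefault(name, []).append(endpoint)

    def destinations(self, envelope):
        # A command must have exactly one owner; an unknown command is an error.
        if envelope.intent is Intent.COMMAND:
            owner = self.command_owners.get(envelope.name)
            if owner is None:
                raise LookupError(f"no endpoint owns command {envelope.name!r}")
            return [owner]
        # An event may have zero or more subscribers.
        return list(self.event_subscribers.get(envelope.name, []))
```

In this sketch a command without an owner is treated as a routing error, while an event without subscribers is simply delivered to nobody. That asymmetry is a design choice made for the example, not something prescribed by Service Bus.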
Resolving failure conditions
Failure handling is further complicated with messaging. In a complex application architecture, which a combination of different Azure services typically is, there are many reasons why a message might not be processed, and it becomes very difficult to identify why a given message has failed.
- For example, the message body may have been corrupted by one of the endpoints, or the message may exceed limits such as TimeToLive or MaxDeliveryCount. In each of these cases, Service Bus considers the message poisonous and puts it aside in a dead-letter queue. On the other hand, a message could also have content that is meaningless to the system, or plainly erroneous from a business perspective because it is missing required information. In these scenarios, the message can be delivered to the application successfully, but it cannot be acted upon until someone on the team fixes the code or the data. Instead of dead-lettering the message, the application needs to put it aside in an error queue for further investigation. In both cases, the failure condition can be identified, and once the issue has been resolved, the message needs to be resubmitted so that the action the user requested is eventually executed. There is, however, no good out-of-the-box toolset that allows you to do this, either at the infrastructure level or at the application level.
- Furthermore, application code can also suffer from temporary problems in the processing logic, for example when a database connection cannot be established. This kind of problem can often be resolved by the application itself, for instance by retrying, without human intervention.
- Or maybe the processing logic is not operational at all?
Imagine there was a simple toolset that allowed you to both identify these issues and resolve them in no time.
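The failure categories described in this section can be sketched as a small processing policy: retry transient failures, park business failures in an error queue, and resubmit parked messages once the cause is fixed. This is a simplified Python illustration under stated assumptions. The exception types, the list-based `error_queue`, and the function names are all hypothetical, and a real implementation would normally wait between retries.

```python
class TransientError(Exception):
    """Temporary problem, e.g. a database connection could not be established."""


class BusinessError(Exception):
    """The message was delivered but cannot be acted upon as it stands."""


def process_with_policy(message, handler, error_queue, max_attempts=3):
    """Run handler(message) and classify the outcome (hypothetical policy)."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return "completed"
        except TransientError:
            # Retry without human intervention until the attempts run out.
            # A real system would typically back off between attempts.
            if attempt == max_attempts:
                break
        except BusinessError as exc:
            # Retrying will not help; park the message for investigation.
            error_queue.append({"message": message, "reason": str(exc)})
            return "moved_to_error_queue"
    return "retries_exhausted"


def resubmit_all(error_queue, handler, max_attempts=3):
    """Resubmit every parked message after the underlying issue was fixed."""
    pending = list(error_queue)
    error_queue.clear()
    return [
        process_with_policy(item["message"], handler, error_queue, max_attempts)
        for item in pending
    ]
```

A message that still fails during resubmission lands back in `error_queue`, so nothing is lost between attempts. What should happen on `retries_exhausted` is left open here; with Service Bus, a message that keeps failing would eventually exceed MaxDeliveryCount and be dead-lettered.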
Understanding the urgency, importance, and impact of these failures
Not every failure has the same urgency or business impact. Manually detecting these failures, triaging them, and responding to them becomes very cumbersome in a large system. A good monitoring and alerting infrastructure is required to assist operational staff in managing it. Again, this problem presents itself at both the infrastructure and the business level. Is a specific queue filling up? Has the processing logic stopped, or is it slow? Are we violating the SLA we promised to our customers?
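As a rough illustration of how such questions could be turned into alerts with different severities, the Python sketch below evaluates a snapshot of queue metrics against thresholds. The `QueueSnapshot` fields, the threshold parameters, and the severity assignments are assumptions made for this example. How the metrics are collected is out of scope.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical point-in-time metrics for one queue."""
    name: str
    active_messages: int
    dead_lettered_messages: int
    oldest_message_age_seconds: float


def evaluate(snapshot, max_active, max_age_seconds):
    """Return a list of (severity, description) alerts for one snapshot."""
    alerts = []
    # Dead-lettered messages mean requested actions were never executed.
    if snapshot.dead_lettered_messages > 0:
        alerts.append((
            "critical",
            f"{snapshot.name}: {snapshot.dead_lettered_messages} dead-lettered message(s)",
        ))
    # A message waiting longer than the agreed limit suggests an SLA violation.
    if snapshot.oldest_message_age_seconds > max_age_seconds:
        alerts.append((
            "critical",
            f"{snapshot.name}: oldest message exceeds {max_age_seconds}s",
        ))
    # A growing backlog suggests processing has stopped or is slow.
    if snapshot.active_messages > max_active:
        alerts.append((
            "warning",
            f"{snapshot.name}: backlog of {snapshot.active_messages} messages",
        ))
    return alerts
```

Which conditions count as critical and which as a warning depends on the business impact of each queue, so in practice the thresholds and severities would be set per queue rather than fixed globally.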