If you’re just joining this series, it is one response to the gap between how development and operations view technology and measure their success, a gap that makes it entirely possible for development and operations to be individually successful while the organization fails. So what can we do to better align development and operations so that they speak the same language and work towards the success of the organization as a whole? This article series addresses a portion of this problem by giving operations teams insight into how specific architecture and development decisions affect the day-to-day operational requirements of an application.
The last two articles presented an overview of Service-Oriented Architecture (SOA) and specifically two web service implementations: SOAP and REST. This service layer provides a standards-based mechanism with which to interact with systems, but the next question is: how do you perform the integration? Are you going to make the integration a manual process that is performed on-demand? Are you going to make the integration a batch process that runs overnight? Or is there a way to propagate changes to one system to other systems at the time that those changes happen?
This article presents a solution to this problem: Event-Driven Architecture (EDA). EDA is an enterprise integration pattern that enables system integration through asynchronous events. EDA has its own specific operational requirements that this article explores.
Event-Driven Architecture (EDA)
For the better part of the past 4 years I worked on a $1 billion project for an entertainment company that involved the creation of new systems and the integration of existing systems. One question that we had to answer was how to propagate changes from one system to the other systems in the ecosystem. For example, if a guest purchases a ticket, how do we know that the ticket exists and is owned by that specific guest so that we can show it in her profile?
The approach that we took was to integrate these systems using an event-driven architecture. At a high level, this means that when something interesting happens in one of the systems of record (SOR), we require that system to raise an event that identifies the resource that changed. In the context of the ticket example, when a guest purchases a ticket we would require the ticketing system to raise a “creation” event, with the identifier of the new ticket, which informs the other systems that the new ticket has been created. This is shown in figure 1.
Figure 1. Raising an event
There are different implementations for EDA applications, but the most common is to leverage an Enterprise Service Bus (ESB). An ESB provides standard mechanisms for systems to publish messages to the ESB and for systems to subscribe to messages in the ESB. Messaging comes in two flavors:
- Queues: queues facilitate point-to-point messaging, meaning that a message producer can send a message to exactly one message consumer
- Topics: topics facilitate publish-subscribe messaging, meaning that a message producer can publish a message to zero or more subscribers
Queues and topics serve their own specific set of business scenarios. For example, if you want a message processed only once then use a queue. If you want to send a message to a specific destination then use a queue. If you want messages to be processed by an arbitrary number of other systems then use a topic.
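To make the difference concrete, here is a minimal in-memory sketch. The `PointToPointQueue` and `Topic` classes are hypothetical names of my own for illustration, not the API of any real broker; they only model who gets to see a message.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Handler = Callable[[str], None]


class PointToPointQueue:
    """Hypothetical queue: a message is held until exactly one consumer takes it."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[str]:
        # Removing the message means no other consumer will ever see it.
        return self._messages.popleft() if self._messages else None


class Topic:
    """Hypothetical topic: a message is copied to every current subscriber."""

    def __init__(self) -> None:
        self._subscribers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._subscribers.append(handler)

    def publish(self, message: str) -> int:
        for handler in self._subscribers:
            handler(message)
        return len(self._subscribers)  # zero subscribers is a valid outcome


queue = PointToPointQueue()
queue.send("ticket-123 created")
first_consumer = queue.receive()   # receives the message
second_consumer = queue.receive()  # nothing left to receive

topic = Topic()
profile_inbox: List[str] = []
analytics_inbox: List[str] = []
topic.subscribe(profile_inbox.append)
topic.subscribe(analytics_inbox.append)
delivered = topic.publish("ticket-123 created")  # both inboxes get a copy
```

The queue hands the message to one consumer only, while the topic delivers a copy to every subscriber without the publisher knowing who they are.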
In EDA we tend to favor topics because we want new systems to be able to subscribe to changes in an existing system without having to change the existing system. When the message producer does not need to know about its subscribers, we say that the relationship between the two systems is loosely coupled. In other words, if we add a new system, say an analytics system that tracks different information about guests, we do not need to update the ticketing system to support the analytics system: the analytics system subscribes to events raised on the ticketing system’s topic and processes them. Figure 2 shows this graphically.
Figure 2. Using Topics for Loose Coupling
Figure 2 shows that when the ticketing system raises an event, both the guest profile service and the analytics engine receive the event. New systems can subscribe to the ticketing system’s topic and integrate with it without needing to change the ticketing system.
Queues, however, have guaranteed delivery: when a message is published to a queue, the ESB holds that message until a consumer removes it. Topics, on the other hand, publish the message to all current subscribers and then the message goes away. This leads to another problem: what if your topic listener goes down? Will you miss the event? Under normal circumstances the answer is yes, you will miss the event, which is a problem! Fortunately, most message brokers (the part of the ESB that manages messages) have the concept of durable subscribers. If a subscriber registers itself as a durable subscriber, then the ESB will hold on to messages published while that listener is down until it is able to receive them. From an operations perspective, you need to ensure that all topic listeners that care about receiving every event are registered as durable subscribers.
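The behavior of durable subscribers can be sketched in a few lines. This is a toy in-memory model under my own assumptions (the class name, the `connect`/`disconnect` methods, and the subscriber names are all hypothetical), not how any particular broker implements it.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Handler = Callable[[str], None]


class TopicWithDurableSubscribers:
    """Hypothetical topic that holds messages for offline durable subscribers."""

    def __init__(self) -> None:
        self._durable: Dict[str, bool] = {}               # name -> durable?
        self._online: Dict[str, Optional[Handler]] = {}   # None while disconnected
        self._backlog: Dict[str, List[str]] = defaultdict(list)

    def subscribe(self, name: str, handler: Handler, durable: bool = False) -> None:
        self._durable[name] = durable
        self.connect(name, handler)

    def connect(self, name: str, handler: Handler) -> None:
        self._online[name] = handler
        # Replay whatever was held while a durable subscriber was away.
        for message in self._backlog.pop(name, []):
            handler(message)

    def disconnect(self, name: str) -> None:
        self._online[name] = None

    def publish(self, message: str) -> None:
        for name, handler in self._online.items():
            if handler is not None:
                handler(message)
            elif self._durable[name]:
                self._backlog[name].append(message)  # held for later delivery
            # Offline and not durable: the message is simply missed.


topic = TopicWithDurableSubscribers()
profile_events: List[str] = []
analytics_events: List[str] = []
topic.subscribe("guest-profile", profile_events.append, durable=True)
topic.subscribe("analytics", analytics_events.append, durable=False)

topic.disconnect("guest-profile")
topic.disconnect("analytics")
topic.publish("ticket-123 created")  # both listeners are down

topic.connect("guest-profile", profile_events.append)  # backlog is replayed
topic.connect("analytics", analytics_events.append)    # nothing to replay
```

In this sketch the durable guest profile listener receives the event once it reconnects, while the non-durable analytics listener never sees it.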
Light-weight versus Heavy-weight Events
Looking at the event itself, events can come in one of two flavors:
Light-weight events contain the identifier of the resource that changed and then require listeners to call the SOR back to get the details of the resource. This is shown graphically in figure 3.
Figure 3. Light-weight Event Processing
Heavy-weight events, on the other hand, include the details of the changed resource in their payload, so no callback is required. This is shown in figure 4.
Figure 4. Heavy-weight Event Processing
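The two payload shapes might look something like the following sketch. The field names, the event type string, and the `fetch_ticket` function (which stands in for a service call back to the SOR) are assumptions for illustration only.

```python
import json
from typing import Any, Dict

# Hypothetical stand-in for the ticketing system of record (SOR).
TICKETING_SOR: Dict[str, Dict[str, Any]] = {
    "ticket-123": {"id": "ticket-123", "owner": "guest-42", "tier": "standard"},
}


def fetch_ticket(ticket_id: str) -> Dict[str, Any]:
    """Stands in for the callback to the SOR (in practice, a service call)."""
    return dict(TICKETING_SOR[ticket_id])


# Light-weight: only identifies what changed.
light_event = json.dumps({"type": "ticket.created", "id": "ticket-123"})

# Heavy-weight: carries the resource details in the payload.
heavy_event = json.dumps({
    "type": "ticket.created",
    "id": "ticket-123",
    "ticket": {"id": "ticket-123", "owner": "guest-42", "tier": "standard"},
})


def handle(raw_event: str) -> Dict[str, Any]:
    """Return the ticket details a listener would store for this event."""
    event = json.loads(raw_event)
    if "ticket" in event:
        return event["ticket"]        # heavy-weight: no callback required
    return fetch_ticket(event["id"])  # light-weight: call the SOR back


from_light = handle(light_event)
from_heavy = handle(heavy_event)
```

Both paths end with the same ticket details; the difference is whether the listener had to go back to the ticketing system to get them.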
So should you use light-weight or heavy-weight events? Like everything, there is a tradeoff. Because eventing is asynchronous, events are not guaranteed to arrive in the order in which they were published. Suppose the ticketing system publishes the following events:
- Create ticket
- Upgrade ticket
- Add an additional feature to the ticket
Now suppose you receive those events in the following order:
- Create ticket
- Add an additional feature to the ticket
- Upgrade ticket
If the event payload has the entirety of the ticket (heavy-weight event) then you will create the ticket, add a feature to the ticket, and then overwrite the ticket with its upgraded state, which does not include the newly added feature. The result is that you will not have the current state of the ticket. How do you solve this problem? One way is to require a timestamp on the event and then do some local processing to ensure that you have the latest event and discard any earlier events that might be received later.
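Here is a rough sketch of that timestamp check, using the same three events. The `TicketEvent` and `TicketView` names are hypothetical, and the sketch assumes the timestamps are assigned by the ticketing system itself so that they can be compared directly (a monotonically increasing version number would serve the same purpose).

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class TicketEvent:
    ticket_id: str
    timestamp: int         # assumed to be assigned by the ticketing system
    state: Dict[str, Any]  # full ticket state (heavy-weight payload)


class TicketView:
    """Local copy of tickets that ignores events older than what it holds."""

    def __init__(self) -> None:
        self._latest: Dict[str, TicketEvent] = {}

    def apply(self, event: TicketEvent) -> bool:
        current = self._latest.get(event.ticket_id)
        if current is not None and event.timestamp <= current.timestamp:
            return False  # stale (or duplicate) event: discard it
        self._latest[event.ticket_id] = event
        return True

    def state(self, ticket_id: str) -> Dict[str, Any]:
        return self._latest[ticket_id].state


created = TicketEvent("ticket-123", 1, {"tier": "standard", "features": []})
upgraded = TicketEvent("ticket-123", 2, {"tier": "premium", "features": []})
feature_added = TicketEvent(
    "ticket-123", 3, {"tier": "premium", "features": ["parking"]}
)

view = TicketView()
# Received out of order: create, add feature, upgrade.
applied = [view.apply(event) for event in (created, feature_added, upgraded)]
final_state = view.state("ticket-123")
```

The late-arriving upgrade event is discarded, so the local view keeps the state that includes the added feature.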
This is a lot of work and it is error-prone, so the preferable solution is to use a light-weight event and make a callback. In this example, the ticketing system is the “source of truth” for the ticket, so when you make the callback, you are assured of getting the current state of the ticket. But this comes at a cost: your ticketing system needs to be able to support the additional load generated by the callbacks.
In the project I mentioned earlier, we had both scenarios: new systems that could support the additional load raised light-weight events, while legacy systems that could not support it raised heavy-weight events. As an operations or DevOps engineer, you are in the best position to make this recommendation to your developers because you know the capabilities of the systems and whether or not they can support additional load. But also, if the system can support the additional load and you do opt for light-weight events, you need to plan for that additional load and beef up the infrastructure to support it. And it is not trivial: in our project, the ticketing system generated over 500K events in a day, which is a substantial amount of additional load!
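As a back-of-the-envelope illustration of what that means for capacity planning, the following sketch converts a daily event count into callbacks per second. Only the 500K figure comes from our project; the subscriber count and the peak factor are assumptions that you would replace with your own measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def callback_load(events_per_day: int, subscribers: int,
                  peak_factor: float = 1.0) -> float:
    """Estimated callbacks per second hitting the SOR.

    Assumes every subscriber makes one callback per light-weight event.
    peak_factor scales the daily average up to an assumed busy period.
    """
    return events_per_day / SECONDS_PER_DAY * subscribers * peak_factor


# One subscriber, load spread evenly over the day.
average_rate = callback_load(500_000, subscribers=1)

# Assumed scenario: 4 subscribers and a busy period at 10x the daily average.
peak_rate = callback_load(500_000, subscribers=4, peak_factor=10.0)
```

Spread evenly, 500K events a day is a little under 6 callbacks per second per subscriber, but events rarely arrive evenly, and every additional light-weight subscriber multiplies the load on the SOR.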
Workflows versus Manual Listeners
Most ESBs provide the concept of workflows: a workflow can process an event and perform configurable actions when specific types of events are received. This is shown graphically in figure 5.
Figure 5. ESB Workflows
This is an example of where combining a service-oriented architecture with an ESB workflow can be powerful: the workflow can process the event, perform transformations on the event, and call services exposed by the various systems. It makes the event processing configuration rather than code.
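To illustrate the idea of configuration rather than code, here is a toy sketch in which the workflow is just a table that maps event types to named actions. It is not modeled on any specific ESB product; the action names, event types, and the `run_workflow` function are all hypothetical.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]
calls: List[str] = []  # records the service calls the workflow would make


def update_guest_profile(event: Event) -> None:
    calls.append(f"guest-profile:{event['id']}")


def record_analytics(event: Event) -> None:
    calls.append(f"analytics:{event['id']}")


# Hypothetical registry of service calls the workflow engine can make.
ACTIONS: Dict[str, Callable[[Event], None]] = {
    "update_guest_profile": update_guest_profile,
    "record_analytics": record_analytics,
}

# The workflow itself is data: changing behavior means editing this table.
WORKFLOW: Dict[str, List[str]] = {
    "ticket.created": ["update_guest_profile", "record_analytics"],
    "ticket.upgraded": ["update_guest_profile"],
}


def run_workflow(event: Event) -> int:
    """Run every configured action for the event's type; return how many ran."""
    steps = WORKFLOW.get(event["type"], [])
    for step in steps:
        ACTIONS[step](event)
    return len(steps)


ran = run_workflow({"type": "ticket.created", "id": "ticket-123"})
```

Routing a new event type, or adding a step to an existing one, is a change to the `WORKFLOW` table rather than to the services it calls.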
The benefit to using ESB workflows is that the event is handled inside the ESB itself, so it typically performs better than writing your own listeners. The tradeoff is that you lose portability: business logic for your application is contained within the ESB so if you move your application to another environment, the ESB needs to go with you. If another site is using a different ESB and they do not want to adopt your ESB then all of that business logic will need to be rewritten.
The alternative is to manually write listeners in code. These listeners subscribe to the appropriate topics and perform their business logic independent of the ESB. In this type of scenario, you may not need a full ESB, but only a message broker, which is typically less expensive. But this portability comes at the expense of performance.
As an operations or DevOps engineer, you need to know how your developers have opted to perform this function because if they are using workflows then you will need to add additional processing capabilities to the ESB. And if they are going to write manual listeners then you need to account for additional network bandwidth between the ESBs and the listeners (you might want to locate them close to one another, such as in the same subnet) and you need to add adequate processing capabilities to the listener machines.
Depending on how many systems you are integrating, the performance of your ESB might benefit from segmentation. Segmentation means that you allocate resources to individual application components so that the load from one system does not degrade the performance of another. In our large-scale project we started with a single segment, but the load was so high that we ultimately created 8 segments and separated the eventing between the different systems. Giving applications their own messaging resources can help isolate the load from each individual system.
Alternatives to ESBs
While ESBs are the primary mechanism for implementing EDA applications, they are not the only game in town. As you might imagine, ESBs are complicated to configure and challenging to manage, primarily because they have to do the heavy lifting of transporting hundreds of thousands or even millions of messages each day. One approach that we implemented in our project, in combination with the ESB, was an Atom feed. The Atom Syndication Format, which was developed as an alternative to RSS (Really Simple Syndication), is a standard format for publishing web feeds and follows a very HTTP-centric approach. At a high level, the SOR publishes events via an Atom feed and consumers poll that feed to receive the events.
The model is quite simple and negates the need for an ESB, but like all technologies, there is a tradeoff: latency. EDA applications have an implicit latency between the time that an SOR changes a resource and the time that change is made available to other systems, but typically that latency is small. With any type of polling mechanism, the configured polling interval adds to that latency: in the worst case a change waits a full interval before a consumer sees it. If your application can tolerate higher latency then you might explore Atom feeds.
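A consumer of such a feed can be sketched with nothing more than an XML parser. In the sketch below, the sample feed, its `urn:example:` identifiers, and the `FeedPoller` class are made up for illustration, and the feed is supplied by a local function instead of a real HTTP request so that the example stays self-contained.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Set

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace

# Made-up sample feed; a real SOR would serve this over HTTP.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Ticketing events</title>
  <id>urn:example:ticketing-events</id>
  <updated>2015-01-01T00:00:02Z</updated>
  <entry>
    <id>urn:example:event:2</id>
    <title>ticket.upgraded ticket-123</title>
    <updated>2015-01-01T00:00:02Z</updated>
  </entry>
  <entry>
    <id>urn:example:event:1</id>
    <title>ticket.created ticket-123</title>
    <updated>2015-01-01T00:00:01Z</updated>
  </entry>
</feed>"""


class FeedPoller:
    """Hypothetical consumer that remembers which entries it has processed."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        self._fetch = fetch
        self._seen: Set[str] = set()

    def poll(self) -> List[str]:
        """Fetch the feed once; return titles of unseen entries, oldest first."""
        root = ET.fromstring(self._fetch())
        fresh = []
        for entry in root.findall(f"{ATOM}entry"):
            entry_id = entry.findtext(f"{ATOM}id")
            if entry_id not in self._seen:
                self._seen.add(entry_id)
                fresh.append((entry.findtext(f"{ATOM}updated"),
                              entry.findtext(f"{ATOM}title")))
        # ISO 8601 UTC timestamps sort correctly as plain strings.
        return [title for _, title in sorted(fresh)]


poller = FeedPoller(lambda: SAMPLE_FEED)
first_poll = poller.poll()   # both entries are new
second_poll = poller.poll()  # nothing new since the last poll
```

A real consumer would call `poll()` on a timer, and the length of that timer is the polling interval that bounds how stale the consumer’s view can be.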
Finally, there are a host of other technologies that can be employed to process events, including Akka and Apache Storm, both of which have different operational challenges, which I’ll save for a future article!
Event-Driven Architecture (EDA) is an enterprise integration pattern for integrating disparate systems via asynchronous eventing. When a change occurs in one system, it raises an event so that other systems can detect that change and update their view of the changed resource. This article reviewed EDA and presented several observations relevant to the operation of an EDA application. Specifically, it reviewed topics versus queues, the need for durable topic subscribers, light-weight versus heavy-weight events, ESB workflows versus manual event listeners, and segmentation for performance. Finally, it reviewed a couple of alternatives to using an ESB to enable an EDA.
If you are charged with managing an EDA application, my hope is that this article has empowered you with the questions to ask development as well as an understanding of the implications of specific EDA-implementation decisions on the environment in which it is running.