I’ve been working on a series of articles showing how to build microservices using an event-driven approach (which IMHO is the only real way to build microservices :) or… any complex distributed architecture). I’ll explore DDD, CQRS, event sourcing, event streaming, complex-event processing and more. I’m using a reference monolith application based on Java EE that uses all the typical Java EE technology, and I’ll dive deep into what makes it tick, what drawbacks it has, and how to evolve it to a microservices architecture. I’ll show implementation details all the way from containers (Docker, Kubernetes) to the JVM layer (Spring Boot and WildFly Swarm) to the application architecture (events, commands, streaming, raw events, aggregates, aggregate roots, transactions, CQRS, etc). Hopefully it will be ready for my Red Hat Summit talk in San Francisco in June! Follow me on Twitter @christianposta for updates on this project.
I want to quickly paint the picture in my head about distributed systems (maybe it’s a sloppy picture, but nevertheless). When we talk about microservices we talk about using microservices as a vehicle for building business-agile IT systems: systems that allow a business to more quickly change, build new functionality, experiment, and stay ahead of its disruptors and competition (startups, etc). Since these are autonomous systems that interact with each other to provide business agility, we also need to consider what happens when parts of these systems fail and how a system reacts to overcome failure. A central prerequisite to being able to build agile, failure-tolerant systems is autonomy. Autonomous systems can evolve independently from each other because they tend to shed dependencies on other systems, teams, and processes. Changes to service A shouldn’t force service B to change or cause any other ripple effects. If service A, on which service B depends, goes down, service B should not just blow up.
Where do we have examples of this autonomy in other systems outside of microservices? Well, if you follow the real reasons why microservices are a success then you know it’s not the technology per se that enables the Netflixes and Amazons of the world to be successful with microservices: it’s the organizational structure.
Some examples of these same types of agile systems include open-source communities, cities, stock markets, ant colonies, flocks of birds, and countless others. They can evolve, react, and even continue on in the face of massive failure. In fact, they’re a well-studied bunch in the field of Complex Adaptive Systems theory. The underlying common themes between these systems? Purpose, autonomy, and reaction to their environments. These autonomous agents “react” to “events”.
When something happens, an autonomous agent (ant, person, service) can “do something” or “do nothing”, but it’s these events that drive the behavior in complex adaptive systems. Think about how you (as an autonomous person) do things throughout the day. You wake up, you dress based on the temperature (an event, or fact), you get in your car and drive to work (stopping at stop lights (event), avoiding the people driving erratically (event), etc). These are all responses to events. You get email in your inbox, you respond. You get a text from your wife to pick up dinner on the way home, etc. We live our entire lives responding to events. IT systems built on events can be made to be equally autonomous, scalable, and resilient to failures.
Going from authority to autonomy and embracing eventual consistency
In most distributed systems implementations I’ve seen, we tend to extend the notion of building systems within a single address space to building across an unreliable network. This is a bad idea for many reasons, but many times it appears to be the simpler approach. We tend to invoke remote objects to prod them to do something. Or we call a remote service to “look up” data. Maybe the “tax” service is the canonical location for anything to do with tax calculations. If we’re a shopping cart service, we need to calculate the final price for the items in a shopping cart during checkout. So the shopping cart service calls the pricing service. The pricing service may also call the tax service to do some other adjustments to the price based on shipping location (country, state, city, etc). The tax service may call the catalog service (taxes may be different depending on product). The shipping service may also call the inventory service, etc. We may end up with these long strings of calls (which may be okay in a monolith application where all these objects live in the same address space, etc). We’re following the “authority” pattern of accessing data: we call the service that has authority over the data. To me this feels a bit like shared global state and tons of mutexes and synchronization points. It also has nasty implications in terms of “transactionality” or ACIDity of a series of calls to authority.
This can lead to bottlenecks. It can also lead to hung services and cascading failures if some of the services in the chain are unavailable. It can also lead to weird dependencies where something like the inventory service now has to expose data in a certain way for the tax service and in a different way for the shipping service to consume. Or it exposes the data in one single format with a lot of additional details that neither service really cares about.
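To make the cascading-failure point concrete, here is a rough sketch of the “authority” pattern as a chain of synchronous calls. Everything in it is made up for illustration: the class names, the `ServiceUnavailable` exception, and the tax rates are assumptions, and plain method calls stand in for remote calls over the network.

```python
class ServiceUnavailable(Exception):
    """Raised when a (simulated) remote call cannot reach its target."""


class TaxService:
    def __init__(self, available=True):
        self.available = available

    def rate_for(self, destination):
        if not self.available:
            raise ServiceUnavailable("tax service is down")
        # Placeholder rates, not real tax data.
        return {"US": 0.07, "CU": 0.10}.get(destination, 0.0)


class PricingService:
    def __init__(self, tax_service):
        self.tax_service = tax_service

    def final_price(self, subtotal, destination):
        # Synchronous call to the authority for tax data.
        rate = self.tax_service.rate_for(destination)
        return round(subtotal * (1 + rate), 2)


class ShoppingCartService:
    def __init__(self, pricing_service):
        self.pricing_service = pricing_service

    def checkout_total(self, subtotal, destination):
        # The cart depends on pricing, which depends on tax.
        return self.pricing_service.final_price(subtotal, destination)


# The tax service at the end of the chain is down...
cart = ShoppingCartService(PricingService(TaxService(available=False)))
try:
    cart.checkout_total(100.0, "US")
    outcome = "ok"
except ServiceUnavailable as exc:
    # ...and the failure surfaces all the way up at checkout.
    outcome = f"checkout failed: {exc}"
```

The cart never talks to the tax service directly, yet it still can’t check out while the tax service is unavailable.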
What if we looked at this model differently? What if we invert the model? Instead of relying on and invoking services for their authority on certain matters, we rely on time and events (like we do in the real world!) to understand context about our environment before our service even gets invoked. What if we were able to listen to our environment and find that shipping from the USA to Cuba now carries a lower tax than it once did? This is a fact that we can observe and react to. Or we could just ignore it and do nothing. What if we could know that the tax on shipping to Cuba is now lower and capture that data so we could know it for future queries about shipping to Cuba when we display the shopping cart page? Then we may have a little more autonomy over our data and our service. We could store that information, or derivatives of that information, in our own databases, which would be optimized for the types of service we provide. If we have to make a version change to our service, we can just focus on what it means to version our own schemas and data and not have to worry about what happens when other dependent services change.
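Here is a minimal sketch of that inversion, under some loud assumptions: the `TaxRateChanged` event, the `EventDrivenCart` class, and the rates are hypothetical names and numbers, and a plain dictionary stands in for the service’s own database. Falling back to a rate of 0.0 when no event has been seen yet is also just a choice made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxRateChanged:
    """A fact published by whoever owns tax data (hypothetical event)."""
    origin: str
    destination: str
    rate: float


class EventDrivenCart:
    def __init__(self):
        # Local copy of tax facts, shaped for this service's own queries.
        self._tax_rates = {}

    def on_event(self, event):
        # React to the facts we care about...
        if isinstance(event, TaxRateChanged):
            self._tax_rates[(event.origin, event.destination)] = event.rate
        # ...and do nothing for everything else.

    def total(self, subtotal, origin, destination):
        # No remote call here: we answer from what we have already observed.
        rate = self._tax_rates.get((origin, destination), 0.0)
        return round(subtotal * (1 + rate), 2)


cart = EventDrivenCart()
cart.on_event(TaxRateChanged("US", "CU", 0.10))
cart.on_event(TaxRateChanged("US", "CU", 0.05))  # the tax was lowered
total = cart.total(200.0, "US", "CU")            # uses the latest observed rate
```

When the shopping cart page is displayed, the service already knows the rate, so the tax service does not need to be up (or even reachable) at that moment.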
What about eventual consistency?
Responding to events instead of “just-in-time” querying for authority allows our service to be more autonomous, fault tolerant, and resilient. But one thing that affects autonomous complex adaptive systems in reality, and that also affects autonomous event-driven systems, is “delays”.
If you are notified of an event immediately, you can react immediately. For example, if a car is swerving into your lane and you see this, you can quickly hit the brakes or adjust your driving to not collide. However, if there is some kind of delay in observing this event then your reaction may be slow (maybe you’re driving impaired? or playing on your cell phone? or yelling at your kids for doing something, etc… okay, please don’t send me mail about how to be a parent :)). This can also happen in IT systems. Let’s say I order something on Amazon. This publishes an event, or fact, to other autonomous services (like order processing, billing, inventory, etc). These systems can observe this event, but what if the inventory system is disconnected from the network for a few minutes/hours/whatever? When it comes back, it will eventually see the event, proceed to check inventory, etc, and publish any events it deems necessary (i.e., react), like an “InventoryReserved” event or an “InadequateInventory” event. This is a simple example of a set of autonomous systems “eventually” becoming consistent.
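The inventory example can be sketched roughly like this. The `EventLog` and `InventoryService` classes, the `OrderPlaced` event type, and the dictionary-shaped events are assumptions made for illustration; the in-memory list is only a stand-in for whatever durable log or broker actually holds the events while a consumer is away.

```python
class EventLog:
    """Append-only, in-memory log; a stand-in for a durable event store."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        # Returns a copy, so appending while iterating the result is safe.
        return self._events[offset:]


class InventoryService:
    def __init__(self, log, stock):
        self.log = log
        self.stock = dict(stock)
        self.offset = 0  # position of the next event we have not seen yet

    def catch_up(self):
        """Process everything published since we last looked."""
        for event in self.log.read_from(self.offset):
            self.offset += 1
            if event["type"] != "OrderPlaced":
                continue  # not a fact we react to
            sku, qty = event["sku"], event["qty"]
            if self.stock.get(sku, 0) >= qty:
                self.stock[sku] -= qty
                reaction = "InventoryReserved"
            else:
                reaction = "InadequateInventory"
            self.log.append({"type": reaction, "sku": sku, "qty": qty})


log = EventLog()
inventory = InventoryService(log, {"book": 1})

# Two orders arrive while the inventory service is disconnected.
log.append({"type": "OrderPlaced", "sku": "book", "qty": 1})
log.append({"type": "OrderPlaced", "sku": "book", "qty": 1})

# Minutes or hours later it reconnects and eventually sees both facts.
inventory.catch_up()
event_types = [e["type"] for e in log.read_from(0)]
```

Nothing was lost while the service was away; it simply reacted later, and the rest of the system learns about the outcome from the events it published in turn.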
What technologies are at play here?
One last thing to say about events, delays, and autonomy here. Events are only useful if we can capture them and observe them in the order they occurred. That is, total ordering over a set of events must be preserved for our systems to have any confidence in how to react to them. If you start to squint, you can see how “ordering” also plays a role in how we construct “transactionality” across systems (more on that later). If we start seeing events out of order then we can never claim to get to eventual consistency without some kind of manual intervention. Martin Kleppmann calls this “perpetual inconsistency”. In the next post I’ll take a look at some of the technologies I’m using for my Summit presentation/demo that help with delays, ordering, and microservices. Stay tuned @christianposta!
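To make the ordering point above a bit more concrete, here is a small sketch of what goes wrong when events are applied in arrival order, and one way a consumer could cope. It assumes each event carries a sequence number assigned by its publisher, which is an assumption of this sketch and not something the technologies in the next post necessarily do this way; `OrderedConsumer` and `apply_as_received` are hypothetical names.

```python
def apply_as_received(deliveries):
    """Naive consumer: the last event to arrive wins."""
    rate = None
    for _seq, new_rate in deliveries:
        rate = new_rate
    return rate


class OrderedConsumer:
    """Applies events strictly in sequence order, buffering early arrivals."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}  # seq -> rate for events that arrived too early
        self.rate = None

    def receive(self, seq, rate):
        if seq < self.next_seq:
            return  # already applied; ignore the duplicate
        self.pending[seq] = rate
        # Apply every buffered event that is now next in line.
        while self.next_seq in self.pending:
            self.rate = self.pending.pop(self.next_seq)
            self.next_seq += 1


# Event 1 set the rate to 0.10, event 2 lowered it to 0.05,
# but the network delivered them in the opposite order.
deliveries = [(2, 0.05), (1, 0.10)]

naive_rate = apply_as_received(deliveries)  # ends up with the stale 0.10

consumer = OrderedConsumer()
for seq, rate in deliveries:
    consumer.receive(seq, rate)
ordered_rate = consumer.rate                # ends up with the current 0.05
```

The naive consumer stays wrong until somebody notices and fixes it by hand, which is the kind of inconsistency that never heals on its own.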