My notes on the Atomikos microservices course, along with an analysis. I analyze the two protocols that exist today for handling distributed transactions in a microservices architecture: two-phase commit (2PC) and Saga.

This is a translation of the original document in Spanish; if you find many syntax errors, I would appreciate it if you told me, so I can improve it. The document is extensive and, unlike other entries in this blog, it has no code, since I plan to write the different components of a microservices architecture as I think it should be done. I already anticipate that I think it should be based on something that combines the concepts of the Sagas with the concept of 2PC. In this document I analyze 2PC in depth, I think, and the Sagas somewhat less deeply. Thank you very much.

It is a work in progress: the more I learn and understand about this concept, the more notes I will add.


If it is already difficult to design and manage a monolith, or a microlith (what I call a monolith embryo), it is much more difficult once we have instances of microservices each dedicated to a single task.

If we have four different tasks, each handled by a microservice running in its own container managed by Kubernetes or OpenShift, and four instances per microservice, we have 16 instances running in a public or private cloud, plus dedicated nodes for data access. We have gone from a monolith with one or several service layers running in a Java virtual machine, plus a few nodes to manage the clustered data, to probably having a cluster of machines running an instance of Kubernetes, which manages numerous pods, that is, groups of one or more Docker containers with shared storage and network. All this because our monolith needs to serve many more users. The solution is to divide the different functionalities of the monolith into self-contained applications, each with its own database, optionally running on an application server and packaged in a Docker container.

One way, currently the most fashionable one, to access the methods of each instance is to wrap the service in a lightweight HTTP REST application server. This is synchronous by nature: you invoke it, the user waits, and the server responds with a status code, which can be 200, 400, 500, and so on.

At the end of the day, adding a @Controller class with methods that handle REST requests is easy; many people know the web, and we now also have libraries like Feign to create clients around those classes marked as controllers and access a web resource by name.


I would argue that this can be problematic, because if the instances are down, the service becomes inaccessible, unusable. In real services we will have several instances of those application servers, backed by infrastructure such as discovery services, load balancers to distribute the load evenly and check the health of the different services, and circuit breakers to quickly cut the invocation cycle in case of a clear and obvious error, such as incorrect input data or an incorrect invocation of the microservice. And we have not even started talking about the problem of distributed transactionality.

Even in these times of high availability, such highly scaled services, with multiple instances packaged in Docker containers and managed through pods in Kubernetes or OpenShift clusters, may not be available, so our software must be prepared to know whether we can invoke a business logic distributed among container instances.

It must be taken into account that, at some point, one of the services will be unwell: the instances of that service will be down, or the databases of some instance will be down, or those of all instances of a microservice A will crash, or data will be lost. That would probably mean trying to reconstruct the lost data and indexes from some backup, but even after restoring that backup we would face the problem of eventual inconsistency: the data of the service restored from the backup does not completely match the data of the other services that did not suffer the problem. The data in A no longer references the data in B, at least as far as primary and foreign keys are concerned, and the data committed just before the disaster, having missed the backup, may be lost forever. A double problem.

This problem will appear in both cases, whether you use 2PC or the Saga pattern. Given this, we can do two things.

One is to accept it and prepare for it. I mean that we will have to think about contingency plans to rebuild the databases in a consistent way, making backups of each database at regular, coordinated intervals so that there is a temporal relationship between them. Not a perfect solution, but it would minimize the time needed to return to service.

Another way would be to host the databases of all microservices on the same machine, so that every backup covers everything at the same time; but this would no longer be a true distributed system, because the databases would be strongly coupled to the services. Yet another option is to combine the above: use backups and try to rebuild the database indexes from the log data.

There has to be some better way to solve the problem of eventual inconsistency, which will undoubtedly appear, because subsystems fail sooner or later: we run out of disk quota, there is a fire, someone walks into the office with an Uzi, or worse.


I love this definition, “Microservices architecture is an extremely efficient way to transform business problems into distributed transactional problems.”

What are the potential problems of a distributed microservices architecture with asynchronous distributed messaging systems?

From my point of view, they come in two kinds: mild and serious.

A mild one is dealing with eventual consistency, that is, dealing with distributed data that is not consistent across services.
In the worst case, you will always return a reading to the client, which may be a little stale.

Depending on the type of business this can be bearable; for example, it would not matter much if a client on LinkedIn, Facebook or Twitter has not yet seen the latest update in their news feed. At the end of the day, this translates into knowing: which is the good data, that of instance A or that of instance B?

In time, the client will get the last agreed, up-to-date reading.

Another mild one is dealing with distributed transactionality. For this, in the literature and in practice, we can read that there are two great philosophies to deal with the problems that appear in distributed data architectures. Maybe three; see GRIT.

Namely, two-phase commit (2PC) and the Saga pattern.

A serious one is eventual inconsistency, which I talked about a bit before: losing data across multiple instances of a service, leaving the data of the surviving services potentially inconsistent with the restored data. It can make our lives hell if we are not prepared from the start to deal with this problem, which will inevitably appear due to the finite and fallible nature of the physical systems that support everything.

Idempotent consumers.
The event store.
Domain-driven design.

I still have to expand on these last three, here or in future links.


A pattern that tries to solve the problems of using distributed systems by performing distributed compensating operations when necessary, that is, operations to restore the previous transactional state: a distributed rollback. I consider that, by itself, it does not work very well, because it can cause consistency errors when the problem of eventual inconsistency appears, and in principle it does not take into account the problem of committing on both systems, the message queue and the DB.
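The core of the pattern can be sketched in plain Java (the step names, the in-memory log and the Step record are invented for illustration): a saga runs its steps in order and, on failure, executes the compensations of the already completed steps in reverse order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Minimal saga sketch: each step has an action and a compensation. */
public class SagaSketch {

    record Step(String name, Runnable action, Runnable compensation) {}

    static final List<String> log = new ArrayList<>();

    /** Runs steps in order; on failure, compensates completed steps in reverse. */
    static boolean run(List<Step> steps) {
        Deque<Step> done = new ArrayDeque<>();
        for (Step s : steps) {
            try {
                s.action().run();
                done.push(s);                       // remember for possible compensation
            } catch (RuntimeException e) {
                while (!done.isEmpty()) {
                    done.pop().compensation().run(); // distributed rollback, backwards
                }
                return false;                        // saga aborted, previous state restored
            }
        }
        return true;                                 // all steps committed
    }

    public static void main(String[] args) {
        List<Step> steps = List.of(
            new Step("reserveStock", () -> log.add("stock reserved"),
                                     () -> log.add("stock released")),
            new Step("chargeCard",   () -> { throw new RuntimeException("payment down"); },
                                     () -> log.add("charge refunded")));
        System.out.println(run(steps) + " " + log);
    }
}
```

Note that the compensation of the failed step itself never runs, only those of the steps that had already succeeded; that is precisely the "redo the previous transactional state" idea.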


We are going to deal with the asynchronous way of doing microservices, that is, with messaging brokers, either through a message queue or with producers and consumers subscribed to a topic, a mailbox. In Java we have implementations based on the Java Message Service (JMS), and implementations outside that standard, like Apache Kafka.

With this messaging service between the different microservices, we can now think about having high availability in the microservices ecosystem and also in the messaging service itself, since it has a distributed, highly scalable nature.

Naturally, as I said before, we can now consider having a set of instances of the same microservice for high availability, since each of these microservices will have a producer to talk to the messaging service and a consumer to listen for messages.

One might ask whether a load balancer would not also be necessary in the messaging service to select one and only one microservice instance. Well, if we implement the Competing Consumers pattern, we can skip the load balancer in the messaging service: by attaching all the instances as consumers of the same queue, each message is delivered to exactly one of them, so the load is distributed without any balancer.
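A minimal sketch of the Competing Consumers idea, simulating the broker with an in-memory BlockingQueue (the consumer names and message payloads are made up): several consumers take from the same queue and each message is processed by exactly one of them.

```java
import java.util.Map;
import java.util.concurrent.*;

/** Competing Consumers sketch: consumers share one queue; each message
 *  is delivered to exactly one of them, so no load balancer is needed. */
public class CompetingConsumers {

    public static Map<String, Integer> run(int messages, int consumers)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Map<String, Integer> handledBy = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        CountDownLatch done = new CountDownLatch(messages);

        for (int c = 0; c < consumers; c++) {
            String name = "consumer-" + c;
            pool.submit(() -> {
                try {
                    while (true) {
                        queue.take();                     // one message, one consumer
                        handledBy.merge(name, 1, Integer::sum);
                        done.countDown();
                    }
                } catch (InterruptedException e) {
                    // pool shut down: stop consuming
                }
            });
        }
        for (int m = 0; m < messages; m++) {
            queue.put("order-" + m);                      // the producer side
        }
        done.await();
        pool.shutdownNow();
        return handledBy;                                 // how the load was spread
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(100, 4));
    }
}
```

Every message is consumed exactly once; how evenly the 100 messages spread across the 4 consumers depends on scheduling, just as it depends on the broker in a real deployment.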

Based on the Java Message Service:

IBM MQ series
Sonic MQ
ActiveMQ
Fiorano MQ
Swift MQ
Tibco MQ

Queues and topics are supported.

Queue-based JMS.

It implies that each message is delivered to one consumer, and only one. It is naturally scalable thanks to the Competing Consumers pattern:
if we need greater scalability, we only have to attach more consumers to the queue, and the messaging load will be distributed among them.

JMS based on topics.

It is not like Kafka's topics. A topic in this context means publish/subscribe messaging: the messages go to all consumers, and this is the main difference with the queues described above. It means that a message is potentially acted upon by as many consumers as have subscribed to the topic; there is no Competing Consumers pattern at work. In my view, this type of implementation should be avoided in a microservices architecture, since delivering the same message to several consumers implies that several microservices act on the same message, and we are interested, in principle, in one and only one acting on it.

There are two types of subscribers:

Durable subscription: the subscriber will receive the messages published while it was disconnected, once it reconnects.

Non-durable subscription: the messages published while the subscriber is disconnected are lost.

Not based on JMS:


There are several important concepts in the asynchronous way of doing microservices, especially the situations that must be avoided:

  1. Lost messages. We will have lost messages if we first commit in the DB and then are unable to commit in the messaging queue, in the broker.
  2. Ghost messages. Conversely: we could not commit in the DB, but we did commit in the message queue.
  3. Duplicate messages. These will potentially occur as a result of the above: if a previous commit was made in the queue and was not undone, the system could try to commit the same message in the queue again.


Basically, when we deal with asynchronous microservices, each service interacts with two systems: the messaging technology and the database in which we store the information.

When dealing with them, the message or the transaction literally does not happen until we commit on them: either the message has finally been placed in the queue or topic so that the consumer can consume it, or the microservice's DB has committed.

Depending on the order of transactional execution in our code, we will potentially have one error or another when a problem occurs.

Depending on whether we fail to commit in the broker but commit in the database, or commit in the broker but not in the database, we will have one problem or another, but problems in the end.

Those problems can be, for example, that the container holding the microservice dies, whether because k8s had a problem, because the database's disk quota was exhausted, or because someone set fire to the data cluster; it may also be that the messaging nodes are down too, making it impossible to commit to the messaging queue.

Note: I am talking about the phase in which we have already invoked the microservice via REST, that is, we are ready to invoke the database to commit and to enqueue a message telling whoever is listening that the transaction in the database went well.

Here we have a great challenge: to send a message if and only if we first commit in the DB. That is, we are going to do an insert, update or delete, operations that necessarily need a commit in the DB. In that case we must be able to commit on both systems, because if we only commit on one of them, depending on the order in which we executed the commit, we will have one problem or the other.

The problem, especially, is that you have to commit on both systems: for the message to be visible in the queue for its consumers, the messaging system must commit, and in the same way, for the database, the insert, delete or update operation must be committed.

Basically, invoking both systems in the same transactional method will not work, because at some point one of the two systems, or both, will fail. We must try to maximize the probability that the invocations of the two systems succeed when we invoke a microservice.

Furthermore, we must try to maximize the chances of a successful commit on all the calls to the different instances of the different microservices needed to execute a distributed business logic, before even making them, so that distributed rollback operations, in other words reverse compensating transactions, occur as few times as possible.

I will talk about this later, once I have described a little the two philosophies for facing this problem of distributed transactionality. These philosophies are two-phase commit and Sagas.

Both use databases to store the data, its state, and also use messaging brokers to transport that state to the invoker of the microservice.

The first will try to ensure consensus among the individual systems, to obtain a distributed commit across all the different microservices before even invoking them; the second will try to compensate, at least, the transactions made in the database of each instance of each service prior to the one where the failure occurred. As of today, it will not attempt to compensate the commit in the message broker.

It is not a good idea to leave inconsistent or pending data in the databases, but neither is it a good idea to leave commits in the messaging brokers when no commit was made in the database.

Efforts should be made to commit on both systems at the same time, or on neither.

And if a partial commit happens, it should happen as few times as possible. Next I will describe what happens when the broker has committed but the database has not, and vice versa.

We will have lost messages if we first commit in the database and then are unable to commit in the messaging queue, in the broker.
The problems with the broker can be varied: running out of memory (the typical OutOfMemoryError), a bug in the broker library, someone killing the microservice or broker container; it can even happen while the system is restarting that part after detecting one of those errors.

public void save() {
    // this runs fine; commit on the DB.
    jdbcTemplate.execute("INSERT INTO Order VALUES ()");
    // here an exception is thrown; we do not commit in the queue,
    // so the consumer never learns about the commit in the DB.
    // We will have a lost message.
    jmsTemplate.convertAndSend("Order created …");
}

We will have ghost messages if we first commit in the messaging broker, saying that we have committed in the DB, and then crash when invoking the commit logic in the database. That is, if we have something like this:
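Mirroring the save() example above (jdbcTemplate and jmsTemplate are the same assumed Spring-style helpers, so this is a sketch, not runnable code), the inverted ordering would look like this:

```java
public void save() {
    // this runs fine; the message is committed in the broker.
    jmsTemplate.convertAndSend("Order created …");
    // here an exception is thrown; the DB never commits,
    // so the queue announces a transaction that never happened.
    // We will have a ghost message.
    jdbcTemplate.execute("INSERT INTO Order VALUES ()");
}
```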


It is probably a bad idea to invoke the messaging broker inside transactional database-access methods. At the very least, it is a naive posture to put the two calls to both systems together without being 100% sure that both are going to commit.


Whether we commit to the queue first and to the DB later, or in the reverse order, if any problem appears in between, we have problems. It is a big problem, and it has to be avoided as much as possible.


When we talk about this topic, we talk about the problem of sending the message to the queue, committing to the message queue, if and only if we can commit to the microservice's database.

For this we have two possibilities: one is to follow the Saga protocol, which provides us with the ability to make compensating transactions; the other is to follow the two-phase commit protocol, which requires the figure of a service that manages transactionality.

In other words, a service that asks the different components whether they are willing to commit to their different databases and message queues.

It is like saying to them: "hey, are you prepared to do everything necessary to save the data and tell the others that we can do our job?".
Said with my Extremaduran tone of voice, it sounds funnier…

The point of these techniques is to overcome the fact that it is an error to put the two calls, to the DBMS and to the message queue, in the same transactional method. It does not matter in which order we put them: potentially speaking, there may be a failure in either of the two systems.

The real problem when dealing with these distributed systems is:

Sending a commit to a database does not guarantee that the system will apply it, because:

1) if an intermediate timeout occurs in the database, an automatic rollback will happen;

2) if there is a physical failure in the DB before the commit arrives, we will have to force the rollback ourselves, or the DBMS itself will roll back automatically. The latter is the most likely.

The same occurs at the message-queue level: these physical failures, when they occur, and they will occur, lead to inconsistencies.

As a consequence of these potential problems, asking several distributed systems to always commit is a risky situation: one system may be able to and the other not, or neither may be able to. This inevitably leads to inconsistency situations, in the best case, when one of the systems cannot commit, or to blocking situations, when neither system can commit. That is how it is. You have to accept it.

In the first case, the 2PC and Saga protocols try to compensate, as well as possible, those distributed transactions that were left hanging.
In the second, we can only watch the system very closely, to try never to get into that situation and to get ahead of it before it happens.

Now, the challenge is to treat these distributed transactions as a global transaction, that is, to combine both commits, the DB's and the queue's, into one, and to do so for all the calls you need to make to execute your business logic.

We must ensure that both commits can be made in each of the phases.

Mind you, I understand the distributed transaction to be the combination of the database transaction and the event-queue transaction within a single microservice.

I am not talking about all the transactions needed for a complex operation, such as a user placing a purchase order. Yet.

Such a request would normally be something like: check that the user is in the system, the user searches for a product to buy, the user asks to buy the product, the system checks whether the user has everything in order to buy, the system asks whether the product is in stock, the system charges the product to the user's account, and the system returns the status of the initial operation. Each of these operations involves a global transaction. Intimidated?


A pattern that, like the previous one, tries to use distributed systems to execute a business logic, only this one will try to ensure, or at least maximize, the odds of committing on both systems needed to invoke an instance of a microservice.

It can also present problems, because it assumes that it will never be necessary to do a backwards distributed rollback, as Saga does.

From my point of view, that is an error: it must be borne in mind that a backwards distributed rollback will be necessary at times, and that we must also be prepared for the problem of eventual inconsistency.

It is called 2PC because it establishes that the distributed transaction must be controlled, on each call to an external system, by means of two actions: one of preparation, the other the real commit. If, to execute a business logic, we have three systems, to put a number, 2PC will try to alert all three with an initial prepare operation and, once it has the OK of every subsystem, that is, the message broker and the database of each one, it will proceed to execute, in order, the final commit request on each of the subsystems.

I will not tire of insisting on this: if we want no ghost messages, duplicate messages or lost messages, we need to try to ensure that all systems are ready to do their job. This is, I think, the main reason why the protocol does not take into account the need to handle reverse distributed transactionality when an intermediate or final call fails.

But problems can still occur.

A good transaction manager, one that takes into account the interactions with a service, will try to carry out those two phases with each of the components needed to invoke each of the external services, such as the messaging broker and the database.
In reality, the manager should try to negotiate with all the services following this strategy at the beginning of the operation. Following it, we minimize as much as possible the need for distributed ROLLBACKs.

The key points to this strategy are:

  1. A backend that is warned, prepared, can roll back; we have made sure of it. But it should not do so on its own initiative, because in principle we want neither rollbacks when the system restarts, nor rollbacks caused by internal timeouts in the database.
  2. A backend that is ready can commit, even if there has been a big problem with the DB.
  3. We can see the "ready" state as a checkpoint from which we are sure we can commit or roll back both systems at once, or only one of them. The point is, we need precise control over these two systems.
  4. This checkpoint is managed and controlled by a transaction manager.

For such a system, we need JMS technology with XA support, database technology with XA support, and our own transaction manager to coordinate those two systems.
Virtually all queuing and DB systems support XA, eXtended Architecture, so the main part of the problem is implementing a good distributed transaction manager.

Basically, what that manager should do is:

Prepare the DBMS in all the microservices involved in the transaction.

If it responds KO, roll back on both systems, the DBMS and the broker.
If it responds OK, we move on to the next phase.

Prepare the broker.
If it responds KO, roll back on both systems, the DBMS and the broker.
If it responds OK, it implies that both have said OK.

If both respond OK, a record is written to the log saying that everything went well, and commit is issued to both. The order here is important.
That is, now we can execute what we put in the method marked as transactional, then write in the log that everything went well, commit to the broker with a success message, and commit to the DB.
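The prepare/commit decision above can be sketched with a toy coordinator (the Participant interface and the log strings are invented for illustration; a real manager would talk XA to the DBMS and the broker): no participant is asked to commit until every participant has voted OK in the prepare phase.

```java
import java.util.List;

/** Toy two-phase commit sketch: both participants must vote OK in the
 *  prepare phase before any of them is asked to commit. */
public class TwoPhaseCommitSketch {

    interface Participant {
        boolean prepare();   // phase 1: "are you able to commit?"
        void commit();       // phase 2a: definitive commit
        void rollback();     // phase 2b: undo the prepared work
    }

    static final StringBuilder log = new StringBuilder();

    static boolean runGlobalTransaction(List<Participant> participants) {
        // Phase 1: prepare everyone; a single KO aborts the whole transaction.
        for (Participant p : participants) {
            if (!p.prepare()) {
                participants.forEach(Participant::rollback);
                log.append("ABORT;");
                return false;
            }
        }
        // Phase 2: everyone voted OK, so record the decision and commit all.
        log.append("COMMIT-DECIDED;");
        participants.forEach(Participant::commit);
        return true;
    }

    public static void main(String[] args) {
        Participant db = new Participant() {
            public boolean prepare() { return true; }              // DBMS says OK
            public void commit()     { log.append("db-commit;"); }
            public void rollback()   { log.append("db-rollback;"); }
        };
        Participant broker = new Participant() {
            public boolean prepare() { return false; }             // broker says KO
            public void commit()     { log.append("broker-commit;"); }
            public void rollback()   { log.append("broker-rollback;"); }
        };
        // The broker's KO makes the coordinator roll back BOTH systems.
        System.out.println(runGlobalTransaction(List.of(db, broker)) + " " + log);
    }
}
```

Writing the "COMMIT-DECIDED" record before issuing the commits is what later allows a recovery routine to know whether to retry or roll back after a crash.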


  1. The microservice makes a JTA (Java Transaction API) request to the embedded JTA transaction manager.
  2. The manager first asks the DBMS, checking the state of the factory object that controls the connections to the DB; specifically, we launch a quick query or another operation that returns an identifier for the process. If the answer is correct, we keep the PID of this process, or whatever indicates that everything went well, or badly, with the DB.
  3. Then we ask the message queue. How? By sending a message. In the same way, we collect the PID of that process, or other information indicating that everything went well, or badly.
  4. We return both responses to the transaction manager.
  5. If the answers are OK, we send the commit logic to each system, the DB and the message queue.
    It is important to note that here we are telling the transaction manager to do everything possible to commit in both subsystems, because even though it initially obtained an OK from both systems, connectivity or other problems may appear and the commit may turn out to be impossible.
  6. Finally, we record in the system log whether both subsystems have committed, or not.
  7. Optional.

The orchestrator is informed of whether the invocation of this set of subsystems went well or not; in the latter case it must act accordingly, either invoking the rest of the microservices to execute their normal business logic or, if necessary, starting the reverse distributed compensation phase. Bear in mind that these compensating operations may not be executable immediately, since one of the subsystems may have a serious problem that will surely require someone's intervention. It is probably a good idea to record in the log the operations that could not be performed and need compensation.

A decent transaction manager should be able to roll back both its message queue and its DB, whether a problem occurred that involved a restart or a catastrophic failure of one of the subsystems. All of this to try to ensure the commit in both subsystems on every call.


  1. If there are failures before launching the 2PC transactions, rollback must be performed in both subsystems, both in the queue and in the DB.

Remember that the request has been made to both subsystems and we do not want ghost messages, lost messages or possible duplicate messages: they lead to errors. We must try to avoid them and minimize them as much as possible.

  2. If there are failures after the 2PC transactions, then depending on whether some other 2PC transaction preceded the current one and we are unable to commit the current one, we must launch some compensating operation.
    It is not a good idea to leave data pending commit, in either system.
    Some authors believe nothing should be done, but I consider that transactional states cannot be left pending.
  3. If there are failures during the 2PC transactions, it will depend on which records the transaction manager left in the log for the message queue and for the database, and we act based on them. If a commit record appears in the log, launch a RETRY on each backend, to be sure. If no record appears in the log, it is better to ROLLBACK everything, both the queue and the DBMS.
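The rule in the last point can be sketched as a small recovery routine (the log record strings and transaction ids are hypothetical): after a crash, the transaction log decides between retrying the commit and rolling everything back.

```java
import java.util.List;

/** Recovery sketch: after a crash, the transaction log decides between
 *  retrying the commit and rolling both subsystems back. */
public class RecoverySketch {

    enum Action { RETRY_COMMIT, ROLLBACK_BOTH }

    /** txLog holds the records the manager wrote before crashing. */
    static Action recover(List<String> txLog, String txId) {
        // A "COMMIT-DECIDED" record means both systems voted OK before the
        // crash, so the only safe move is to push the commit through again.
        if (txLog.contains("COMMIT-DECIDED " + txId)) {
            return Action.RETRY_COMMIT;
        }
        // No commit decision on record: roll back queue and DBMS alike.
        return Action.ROLLBACK_BOTH;
    }

    public static void main(String[] args) {
        List<String> log = List.of("PREPARED tx-1", "COMMIT-DECIDED tx-1", "PREPARED tx-2");
        System.out.println(recover(log, "tx-1"));   // decided before the crash
        System.out.println(recover(log, "tx-2"));   // only prepared: undo it
    }
}
```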

Following these steps, either we achieve a distributed commit on the call to each microservice, or we leave both subsystems in a clean state, since we have rolled back in both, and we will have none of the unwanted ghost messages, lost messages or duplicate messages.

As a consequence, by following the 2PC/XA steps, we can avoid the inconsistencies described above in the problem of ordered commits.
Remember: the one that appears when we put the invocations of both subsystems, the DBMS and the messaging broker, in the same transactional method.

An uncomfortable question, in relation to point 7 described above:
what if, after having passed the first two points, where we have the OK
of the two systems, a catastrophe occurs that really prevents the commit or the rollback? Or when the transaction manager decides that a ROLLBACK should be done in both subsystems and cannot do it?


Wouldn't it be possible to have the best of both worlds? That is, take from the Sagas the idea of a choreographer that manages all the invocations, that is, the state of each microservice, plus a master global transaction manager that knows and manages the transaction manager of each microservice.

Thus, before launching a series of operations that require invoking message queues and possibly committing to their databases, we would make sure that all of them have returned an OK. That OK makes it more likely that the master transaction manager will get commits both in each message queue and in each database, allowing the orchestrator to advance through the following invocations to the other message queues and databases, until it finally has the response from the last microservice that we need to invoke to obtain the final message for whoever invoked the business logic. If any of them answered KO before starting the process, the invoker is told directly that the operation is not possible, with no need for unnecessary compensating operations, at least a priori.

This does not rule out that, between the moment the master global transaction manager received the OK from all the sub-managers and the actual commits, some of them may go down.

In that case, we have no choice but to do the reverse distributed rollback, but only in those cases of force majeure in which someone has broken the initial agreement.

In principle, I do not consider it necessary to follow the approach in which every service is involved in the coordination, first because it implies that each service has to be aware of the other services, which I consider breaks the principle of independence by creating a de facto strong coupling. I prefer a choreographer, a central controller object, with control over a master transaction manager that knows the start and end state of the transaction manager attached to each microservice; that is, a choreographer capable of idempotently managing the distributed transactions of each of the microservices, with the ability to roll back the intermediate states as soon as it can, should it have to.

Literally, the orchestrator would have, for each business-logic operation, an enumeration to describe internally the different calls, and a mechanism to manage the different transaction managers of each microservice.
An enum, or something more eccentric, like a state machine.
Remember, this manager must know, for each state, the state of its message queue and of its database.
It should try to secure the commit on both systems, or report that it will not be able to if one of them is unavailable.
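A minimal sketch of that enum-driven orchestrator (the PurchaseStep states and the readyToCommit check are invented; a real manager would ask each microservice's queue and database): every step's transaction manager must answer OK before any step runs, otherwise the whole operation is refused up front and nothing needs compensating.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the enum-driven orchestrator: each business operation is a
 *  fixed sequence of calls, and every per-microservice transaction manager
 *  is asked for an OK before any step is executed. */
public class OrchestratorSketch {

    enum PurchaseStep { CHECK_USER, RESERVE_STOCK, CHARGE_ACCOUNT, CONFIRM }

    /** One transaction manager per step; true = queue and DB are both ready. */
    interface MicroserviceTxManager {
        boolean readyToCommit(PurchaseStep step);
    }

    static List<PurchaseStep> runPurchase(MicroserviceTxManager tm) {
        List<PurchaseStep> executed = new ArrayList<>();
        // Phase 0: refuse the whole operation if any step cannot commit a priori.
        for (PurchaseStep step : PurchaseStep.values()) {
            if (!tm.readyToCommit(step)) {
                return executed;           // empty: nothing ran, nothing to compensate
            }
        }
        // Everyone said OK: walk the state machine in order.
        for (PurchaseStep step : PurchaseStep.values()) {
            executed.add(step);
        }
        return executed;
    }

    public static void main(String[] args) {
        System.out.println(runPurchase(step -> true));
        System.out.println(runPurchase(step -> step != PurchaseStep.CHARGE_ACCOUNT));
    }
}
```

The second call refuses the whole purchase without executing anything, which is precisely the point: a KO before starting means no compensating operations are needed, at least a priori.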

The performance may be poor. If, to resolve a business logic, you have to call, let's say, five microservices, each call would first need to ask its transaction manager, get the OK and the PID of each process that secures, a priori, the commit of each system; then launch the commit requests of each transaction of each microservice, leave them in a ready state; and then, if everything is globally OK, launch the commit commands.

If we have 5 microservices, each round of asking a transaction manager takes, for example, 10 ms on average, each actual commit operation takes 500 ms, and the microservices take input parameters such that each one returns something we need to invoke the next, then successfully resolving one invocation takes about 5 × (10 + 500) = 2550 ms, almost 3 seconds!

This approach may seem totally unfortunate, but in return you get a system capable of handling potentially millions of requests per second, and capable of operating robustly in long-running processes as well.

A user on Stack Overflow, @Tengiz, puts it like this:

It is the problem of eventual inconsistency that I referred to earlier.

Typically, 2PC is for immediate transactions.

    -> In principle, for transactions that you want to occur as quickly as possible, immediate.

Typically, Sagas are for long running transactions.

    -> Same as the previous one, but these transactions can take much longer, because there will be times when some component of the architecture is unavailable and the transaction cannot complete until it is.

The use cases are obvious afterwards: 2PC can allow you to commit the whole transaction in a request or so, spanning this request across systems and networks.
    Assuming each participating system and network follows the protocol, you can commit or roll back the entire transaction seamlessly.

    Saga allows you to split a transaction into multiple steps, spanning long periods of time (not necessarily systems and networks).

    2PC: Save Customer for every received Invoice request, while both are managed by 2 different systems.
    Sagas: Book a flight itinerary consisting of several connecting flights, while each individual flight is operated by different airlines.
    I personally consider Saga capable of doing what 2PC can do. Opposite is not accurate.

    I think Sagas are universal, while 2PC involves platform / vendor lockdown. "

I agree with you, pal.
With the Saga pattern, today, you have to implement the architecture yourself; that is, you have to implement the coordinator or orchestrator that manages the compensating, reverse transactionality. Deep down, that is the only thing it has to do well.

There are frameworks for Sagas, such as eventuate-tram, axon and microprofile-lra, while with the 2PC pattern you have to use a product that gives you the global transaction manager, or build it yourself. To this day, I don't know of any framework. Atomikos provides a commercial solution and an open source version that I have not had the opportunity to review yet, and they already warn that it has many bugs to resolve, bugs that are fixed in the Enterprise version.


Exactly-once delivery means the system tries to deliver and receive each message exactly once, avoiding lost or duplicate messages.

Lost messages, remember, are messages that could be committed in the database but not in the queue.

A duplicate message is one that could not be committed in the database but was committed in the queue; by not rolling back the queue, the system would try to commit to the queue again. Both are situations to avoid.

The problem arises from the characteristics of both systems, the message queue and the database. Let's recap.

The local transaction in the message broker involves:

  1. Read the message from the queue.
  2. Mark the message for deletion upon commit. Here two things can happen:

2.1. Commit at the broker, which means the message disappears from the broker.

2.2. Rollback at the broker, which means the message is handed back to the broker to be read again.

The local transaction in the DBMS involves:

1. Insert the record in the database.
2. Commit on the database.
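A minimal sketch of how these two local transactions can be combined on the consumer side. The BrokerSession and DbSession interfaces are hypothetical stand-ins for the real broker and database APIs, and the ordering shown (database first, broker commit only afterwards) is one possible choice: it avoids lost messages at the cost of possible duplicates on redelivery, which is exactly why idempotent processing matters.

```java
public class LocalTransactions {

    // Hypothetical broker session, mirroring the steps above.
    interface BrokerSession {
        String receive();      // 1. read the message (marks it for deletion)
        void commit();         // 2.1 commit: the message disappears from the broker
        void rollback();       // 2.2 rollback: the message is redelivered
    }

    // Hypothetical database session.
    interface DbSession {
        void insert(String record); // 1. insert the record
        void commit();              // 2. commit on the database
    }

    /** Processes one message: commit both sides, or roll the broker back on DB failure. */
    public static boolean process(BrokerSession broker, DbSession db) {
        String message = broker.receive();
        try {
            db.insert(message);
            db.commit();
            broker.commit();     // only delete the message once the DB has it
            return true;
        } catch (RuntimeException e) {
            broker.rollback();   // redeliver: the message is not lost
            return false;
        }
    }
}
```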

Remember, we want and need precise, strict control of the systems to perform ROLLBACK or COMMIT. We do not want different subsystems to do it automatically.

We want to be able to treat these transactions as one global transaction, committing or rolling back following the all-or-nothing philosophy.
We call a global transaction the set of distributed transactions across all the systems involved in executing distributed business logic.

The mechanism for receiving messages without inconsistencies is exactly the same as the mechanism for sending a message, described earlier.
Basically, the microservice will ask the transaction manager about the availability of the different subsystems, whether both the message broker and the database manager are in good condition to serve requests. Each of them will perform an operation that guarantees it can commit: for the broker, marking the message to be extracted from the queue; for the database, some SELECT * FROM statement, or whatever you think necessary to make sure the database is available and in good condition, and that you can retrieve an indicator, a process identifier (pid), of that state. When the manager obtains this OK from both systems, it sends the definitive COMMIT to both; if it receives a KO, it sends ROLLBACK to both.

Finally, the distributed transaction handler writes the final result of the operation to the LOG.

Remember also, all this work is to avoid the inconsistencies that can occur when we have the two commit requests in a method marked as @Transactional, both when writing and when reading.

Kafka and RabbitMQ, by not implementing XA natively, delegate this feature to consumers, so it must be managed programmatically on the consumer side.


Imagine the situations in which you would like or need a message to be delivered from one system to another exactly once. Ready? Situations occur to me such as: deposit money from account A into account B, withdraw money from account A to account B at the beginning of the month, or reserve object X in warehouse A for customer Y who has already paid.

Those are situations we would not like to happen more than once by mistake, right?

One way to achieve this is to use ACID messaging and transaction brokers on each of the systems.

This can be achieved in multiple ways: using JTA and XA, in the manner of 2PC, but also with messaging brokers that implement this messaging control on the consumer side. First you have to ensure that the message is in the broker, extract it, commit at the broker, and then commit at the destination database.
If we do this, the message is delivered once and processed at the destination once, with no lost messages, ghost messages or duplicate messages.
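Since brokers like Kafka and RabbitMQ delegate this control to the consumer, a common consumer-side technique is deduplication by message id: record each processed id together with the business data and skip ids already seen. This is a hypothetical sketch; in a real system the set of seen ids would be a table updated in the same database transaction as the business write, while here an in-memory set stands in for it.

```java
import java.util.HashSet;
import java.util.Set;

public class DedupConsumer {
    // Stands in for a "processed_messages" table in the business database.
    private final Set<String> processedIds = new HashSet<>();

    /** Returns true if the message was processed, false if it was a duplicate. */
    public boolean handle(String messageId, Runnable businessLogic) {
        // In a real system this check and the business write share one DB transaction.
        if (!processedIds.add(messageId)) {
            return false; // duplicate delivery: safe to acknowledge and drop
        }
        businessLogic.run();
        return true;
    }
}
```

With this in place, a redelivered message (for example, after a broker rollback) is acknowledged but produces no second commit.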

The point is that, depending on our chosen message broker and our database system, we will have to adjust in one way or another. My advice is to use at least a Consistent-type database, CA or CP, according to the CAP theorem.



What does it mean that an operation is synchronous?
Basically I understand it as an operation for which I have to wait for an answer and cannot attend to anything else.

Unlike an asynchronous operation, where someone will notify me when it is done and I can do other things in the meantime. When we work with microservices, the synchronous style can be more difficult and more error-prone, since there will be remote errors in the communication between external services; at a minimum we will always need high availability among the instances of the microservices, and extra care in the code, since we will have to take into account timeouts and try/catch blocks to catch the varied exceptions that can occur and act accordingly.
Faced with a timeout error, do we call again? How many times do we retry before giving up? What if we are invoking over and over with input data that the called system does not recognize? Or if the URL has changed? Or the API behind the URL?

These situations will occur when we have multiple versions of our microservices. Human mistakes happen. Let's not even mention needing a reverse rollback if a system invocation fails for whatever reason. In addition, errors may be introduced by eventual inconsistencies.

When should they be used? Well, when you have no choice, such as when you need a piece of information to continue that only another service can give you. For example, imagine we have to book a flight. It is a fairly common situation, so, as with the previously described asynchronous methods, we have to deal with the inconsistencies that can appear both when sending and when receiving a synchronous invocation.


Inconsistency faults can mostly come with network problems, for example:

* a caller calls an external service and this call does not arrive, no commit is made in the db.

* a caller calls an external service, it responds, but the response does not reach the caller, a commit has been made, at least in one of the two, probably in the db.

These network problems are usually caused by timeouts, probably due to the code of the application server that hosts said web service. Just yesterday I was reading about a bug related to this problem in a fairly recent version of Spring Boot, 2.2.6.RELEASE.
The point is that, for the invoker, these two errors are indistinguishable.

See link for better explanation:

The solution to these two problems is obvious, right? You have to retry until you get an OK, but, and this is very important, you need the external service to be idempotent.

What does it mean for a REST web service to be idempotent? That if you invoke it with the same parameters, on the same resource, it produces the same result, whether you have invoked it once or N times. For example, to create a new resource on an external system, we invoke a POST action the first time; once the service responds with the identifier of the new resource, we can PUT as many times as necessary in case of TIMEOUT. If the POST invocation fails, we can repeat the operation
as long as the web server has not returned the identifier. This identifier indicates whether or not the resource has been created; in this case it is a virtual resource, which does not imply that we have saved it in the database, but rather that the container of the business logic, in this case a web application server, provides us with this desirable feature of idempotent resources.

Watch out: a double commit can still occur. Imagine that the app invokes the remote service, the service commits on the database, but the response does not reach the app. You RETRY, and when the request arrives again as a PUT operation, at most it commits again; and it may happen N times. That is why it is so important that the operation be idempotent: at worst you do an UPDATE with the same information.

It is very important that the invoking party remembers to RETRY if the response code from the other web service does not arrive.
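The POST-then-PUT retry scheme described above could be sketched like this. HttpCall is a hypothetical interface standing in for a real HTTP client, and the retry limit is an assumption; the point is that once the identifier is known, every retry becomes an idempotent PUT.

```java
public class IdempotentRetry {

    // Hypothetical stand-in for a real HTTP client.
    interface HttpCall {
        String post() throws java.io.IOException;        // returns the new resource id
        void put(String id) throws java.io.IOException;  // idempotent update by id
    }

    /** Retries up to maxRetries; safe because PUT on the same id is idempotent. */
    public static String createWithRetry(HttpCall call, int maxRetries)
            throws java.io.IOException {
        String id = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                if (id == null) {
                    id = call.post();  // no id yet: the server never answered, repeat POST
                } else {
                    call.put(id);      // id known: at worst an UPDATE with the same data
                }
                return id;
            } catch (java.io.IOException timeout) {
                if (attempt == maxRetries) throw timeout; // give up after the limit
            }
        }
        return id;
    }
}
```

Note the caveat from the previous paragraph still applies: if the POST committed on the server but its response was lost, repeating the POST can create a duplicate, which is exactly the double-commit risk described above.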

Now imagine a situation where we have to invoke more than one web service, something very normal. The first invocation succeeds: there is a commit on the database and the server returns a 201. We invoke the second, and an exception occurs. We RETRY up to the number of times we have configured, and reach the RETRY limit. What do we do with the first? A rollback after a commit has already occurred? We will not be able to cleanly roll back previously inserted data. When these network problems occur in synchronous calls to microservices, we end up with orphan commits or duplicate commits, that is, inconsistencies. And you have to deal with it.


How could we solve these inconsistency problems when we are working with synchronous microservices?

The requirements are various: we do not want orphan commits in some microservice after a TIMEOUT has occurred in the call to another microservice; we do not want multiple commits on the same data in the same microservice; and we want to be able to perform the reverse ROLLBACK in the other microservices if we are not able to COMMIT after several attempts.

What we want is to be able to commit in all microservices if and only if we can do it in all microservices, and to minimize inconsistency errors if some commits have already happened.

We want to avoid data inconsistency problems when these network errors occur; basically, we will need to ROLLBACK in those microservices that have already committed. This is not a CIRCUIT BREAKER: that pattern avoids making more calls to a microservice after a number of failed calls, but it is not in charge of fixing possible data inconsistencies after an orphan commit. Circuit Breaker tries to close the REST calling circuit as quickly as possible if it detects a network problem; that is, it works at the level of HTTP calls. If HTTP is not available, Circuit Breaker will not work properly. Keep that in mind. It would be nice to have a circuit breaker at a lower level of the TCP/IP stack.

It would be nice if it did, really, but it gives us a clue about what to do: when our business logic detects an error in the call to some microservice, we have to tell all the microservices that have already committed to ROLLBACK.

The 2PC way of doing it is as follows:

The master microservice asks the other microservices to get ready, to initialize their transaction manager for the transaction in question; that is, the master makes a call and they respond with an identifier to be registered in case it has to ask them to ROLLBACK.

Once the master has an OK from all the services, it asks each microservice to run the PREPARE phase, and finally asks them all to COMMIT.

If there is a TIMEOUT error, we will need a component in each microservice that performs the ROLLBACK; we do not want to leave transactions in the prepared state. Leaving a transaction in the prepared state is a waste of resources; better to tell the DBMS to reclaim those resources.
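The prepare/commit/rollback flow just described can be sketched as follows. Participant is a hypothetical interface over each microservice's transaction manager, not a real JTA or Atomikos API: the master only commits if every participant answers OK to PREPARE, and otherwise rolls everyone back so nothing is left hanging in the prepared state.

```java
import java.util.List;

public class TwoPhaseCommit {

    // Hypothetical view of one microservice's transaction manager.
    interface Participant {
        boolean prepare();  // true if the participant guarantees it can commit
        void commit();
        void rollback();
    }

    /** Returns true if the global transaction committed on every participant. */
    public static boolean runGlobal(List<Participant> participants) {
        // Phase 1: PREPARE everywhere.
        for (Participant p : participants) {
            if (!p.prepare()) {
                // One KO: release prepared resources everywhere, don't leave them waiting.
                participants.forEach(Participant::rollback);
                return false;
            }
        }
        // Phase 2: all answered OK, so COMMIT everywhere.
        participants.forEach(Participant::commit);
        return true;
    }
}
```

A real coordinator would also write each decision to a log before sending it, so it can resolve in-doubt transactions after a crash.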

How would it be done differently? Ideally, if a timeout forces you to cancel the operation after multiple RETRYs against one microservice, you tell every service that has done the work, or shown it can do it, to ROLLBACK and return to the previous state.

Personally, I see this method of synchronous calls as more prone to errors and bugs, in addition to performing worse in high-performance architectures where the number of operations per second is a key factor.

An architecture involving REST services will always be limited by the number of invocations per second that the application server behind it can support.


Can having backups of our databases cause inconsistencies? The answer is a resounding yes: due to the distributed nature of these architectures, the information stored in the database of a service that has gone down, or has had a problem, may not be consistent with the information saved by another database that needs information from that fallen system.

Imagine the following situation, we have two microservices, each with its own database, or its own schema, probably each with information from the other service, since both are part of the business logic necessary for said transaction.

Now, imagine that one of the two databases is physically broken, having to change the disk, putting a new one.

What do we do? We may think of restoring a backup, but let's face it: a backup will always have less up-to-date information than existed before the catastrophic error, so the other microservice will have information that potentially will not be found in the backup.

It is an important problem and it must be taken into account that it also happens with the Sagas.

Remember that the Sagas also have the problem of eventual consistency, which has nothing to do with this problem.

It can also happen when our architecture is based on events, since these architectures in addition to the database, which can fail, have that message queue, which can also fail.

We need the data at all times to be consistent with each other, mutually consistent. Eventual inconsistency attacks this principle, this need in the world of distributed systems.

What can we do in this situation?

  1. Accept this situation

1.1. The data can be lost.

1.2. We must try to minimize the risk as much as possible, making backups that are consistent with each other, taken at the same time. It doesn't fix the problem, but it helps.

1.3. With an eXtended Architecture (XA), this problem only occurs when there are catastrophic database failures.

1.4. Without an XA architecture, this can happen even with normal operations.

  2. Use the same database for all services.

2.1. We back up everything at once. If the database, shared by all the services, breaks, we can go back to an earlier point.

Granted, we potentially lose all the operations performed after the backup. It can also be argued that we introduce a weak link by using the same machine, or a database cluster.

Obviously, if we choose this option, using the same machine to store the data of all the instances of the different microservices, we will have to have a different schema for each instantiated service.

  3. Make distributed backups that are consistent with each other, taken at the same time. Since the failure of one service's database, or of all at once, can occur, we must have a policy of backing up all databases at the same time. Then, if data is lost on a given day, we can return to the previous stable situation in which the services were consistent with each other.

We could also try to rebuild the new database from the information stored in the databases of the other microservices, together with the information stored in the messaging brokers, but it will undoubtedly be a costly operation that we have to plan for. We will have to create our own ad hoc solution for this problem.

  4. A solution proposed by Atomikos. They propose that each component of each microservice have its own copy; that is, if a microservice has a message broker and a database assigned to it, both must be duplicated, at least, and every interaction between the broker and the database, together with every interaction with the broker and database of another microservice, must be stored in a log, so that if any index has to be rebuilt, we can start from something reliable.

Hopefully that log will be stored somewhere other than where the databases of each instance and the data of the messaging brokers are stored, and replicated across several machines.

What if both microservices fail because you have them running on the same k8s pod or similar?


This framework proposes a series of components to manage the correct operation of a distributed microservices architecture.

Earlier I commented that every microservices architecture should manage a series of minimum features: a service must be able to announce itself to the rest of the services when its container starts, or simply when the service comes up; it must be possible to check the health status of the service and act accordingly; load balancing must decide which instance of a service to invoke; there must be RETRY in case of TIMEOUT, when the service has to be invoked again; and there must be a CIRCUIT BREAKER to break the chain of invocations when a service is being invoked with incorrect parameters or an incorrect URL.


Eureka is a REST-based (Representational State Transfer) service, mainly used in the AWS cloud to locate services in order to balance load and handle failures of the servers that host a service. Created by the Netflix people.

Eureka is basically a service running on a web application server in charge of managing the discovery of the instances of the new microservices.

Its objective is to register and locate existing microservices, report their location, their status and relevant data on each of them.

In addition, it facilitates load balancing and fault tolerance. We can install it using Docker, preferably.

To do this, we launch the following command:

docker pull springcloud/eureka

To launch it, listening on port 8761, the following command:

docker run -p 8761:8761 -t springcloud/eureka

Then, each microservice that wants to register itself with Eureka will have to add the dependency spring-cloud-starter-eureka-server, or better yet, the dependency spring-cloud-starter-parent, which selects spring-cloud-starter-eureka-server for us.

Add the annotation @EnableEurekaClient to the main class of our microservice and configure its properties file as follows:


spring:
  application:
    name: my-application-name

server:
  port: 8080

eureka:
  client:
    serviceUrl:
      defaultZone: ${EUREKA_URI:http://localhost:8761/eureka}


Zuul will serve us as the entry point to which all our requests will arrive, which will be secured, balanced, routed and monitored.

Basically, it will filter each request and coordinate with Eureka; that is, it is in charge of requesting from Eureka an instance of the specific microservice and routing the request towards the service we want to consume.

We can also see Zuul as a filter that will try to reject all suspicious requests based on security reasons, so that we can mark those requests for further analysis.

With Zuul we also have Hystrix and Ribbon. The first serves as a CIRCUIT BREAKER, that is, a mechanism for resilience and fault tolerance. The second acts as a load balancer, helping Zuul locate the instances of the services discovered by Eureka.

What are the advantages of Zuul?

It has several filters focused on managing different situations. It turns our system into a more agile one, able to react faster and more efficiently. It can handle authentication in a general way, since it is our entry point to the ecosystem. It can deploy filters hot, so we can test new features without stopping the application.

How is it used?

First, keep in mind that we want Zuul / Hystrix and Ribbon to be the entry point for our architecture, and we want to be able to apply security, balancing, and routing mechanisms to access resources.

That said, it will be better to add all these features to a spring-cloud application that will also be discovered before Eureka, to work together.

This example can be taken as a starting point, although many more features would have to be added, such as outsourcing the security tokens to avoid having to use the one you may have included in the project, adding spring-security5, adding support for Hystrix and Ribbon …

I would also look at this one in the future…


It is another Spring Cloud library, designed to create REST clients declaratively; that is, to access REST resources with a name of our choice.

It is best seen with an example.

@FeignClient("my-application-name")
public interface GreetingClient {

    @RequestMapping("/greeting")
    String greeting();
}


We mark our application class with the @EnableFeignClients annotation.



@EnableFeignClients
@Controller
public class FeignClientApplication {

    @Autowired
    private GreetingClient greetingClient;

    @RequestMapping("/get-greeting")
    public String greeting(Model model) {
        model.addAttribute("greeting", greetingClient.greeting());
        return "greeting-view";
    }
}



We compile and run; once the server is running, go to http://localhost:8080/get-greeting and whatever the greeting() method implementing the GreetingClient interface returns will appear in the HTML file when it accesses the content of the greeting variable.


Consul is a tool for service discovery and configuration. It is distributed, allowing high availability and scalability. Basically it is an alternative to Eureka, with some improvements. For example, unlike Eureka, the Consul cluster has a server that acts as leader and responds to requests; if it goes down, a new leader is chosen and this change is transparent to client applications (provided they have their own Consul client agent configured correctly). With Eureka, on the other hand, if a server goes down, it is the client applications that actively contact the next Eureka server configured in them.


Can I change the configuration without bringing down the spring-cloud-config instances?



It means trying to deliver the most recent agreed-upon data of a distributed data system. It does not have to be the most up-to-date value in existence.

JTA stands for Java Transaction API; basically it is the standard Java API for database transactions, both in monolithic and in distributed systems.

XA stands for eXtended Architecture, X/Open's standard open API for backend resources to take part in distributed transactions. It is desirable so that they do not have to depend on a particular database management system.

Kafka, RabbitMQ and other recent products do not support XA, so this reality must be taken into account if we want to implement 2PC with any of these brokers.

In .NET I would use DTC (Distributed Transaction Coordinator) instead of JTA. As for XA, nothing changes.







GRIT: Consistent Distributed Transactions across Polyglot Microservices with Multiple Databases

My deepest thanks

Guy Pardon

Chris Richardson

Eugen Paraschiv

Abraham Rodríguez

I do not work for Atomikos, nor for any company that provides microservices software. This post is meant to learn, for myself and for anyone who has found it useful.

I’m very sorry if you got bored reading this note

Congratulations if you have come this far.

