Compensations in Workflows
Context
Microservices are in use or are planned to be adopted. There are use cases that require (eventual) consistency across multiple microservices. The transaction cannot be executed at the database level because each microservice's storage is isolated. Distributed ACID transactions are not used either.
Problem
- Inconsistent states emerge because distributed transactions are not implemented.
Solution
Split the domain-level transaction (if it really exists) into multiple local technical transactions in the microservices. Each microservice then stays in a consistent state on its own, but consistency is not guaranteed for the whole system. If a part of the transaction fails, a compensation notification is delivered to all participants so that they can roll back their local transactions if necessary.
If the transaction is not necessary from a domain perspective, think about domain-motivated alternatives to get rid of the transaction on the technical level.
Depending on how workflows are organized in the system, different parts of the system may be responsible for managing the compensation notifications:
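The split into local transactions with compensations can be illustrated with a minimal sketch. It assumes an in-process coordinator and hypothetical step names (`reserve_stock`, `charge_payment`, `ship_order`); in a real system each action and compensation would be a call or message to a separate microservice, and none of these names come from the sources.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # local transaction in one microservice
    compensation: Callable[[], None]  # undoes that local transaction


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not commit, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


# Hypothetical example: the second local transaction fails.
log: List[str] = []


def fail_payment() -> None:
    raise RuntimeError("payment declined")


steps = [
    SagaStep("reserve_stock",
             lambda: log.append("stock reserved"),
             lambda: log.append("stock released")),
    SagaStep("charge_payment",
             fail_payment,
             lambda: log.append("payment refunded")),
    SagaStep("ship_order",
             lambda: log.append("order shipped"),
             lambda: log.append("shipment cancelled")),
]

succeeded = run_saga(steps)
```

In this run only the stock reservation is compensated, because the payment step never committed and the shipping step never started. The sketch deliberately does not handle a compensation that itself fails, which is the case where this approach becomes considerably more complex.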
- If there is a workflow orchestrator, it should be responsible for announcing compensations. However, the local rollbacks are only guaranteed if the notifications are eventually delivered, and ensuring this delivery can be a challenge.
- In workflow choreographies, the compensation notifications can be sent as events that are guaranteed to arrive eventually. However, especially in choreography-based workflows, very good monitoring is needed to notice when compensations are required. Another challenge is ensuring that the events are actually sent out after a local transaction. Consider using timeouts to detect situations in which messages are not processed within a given time, and use dead letter queues to cope with them.
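The timeout and dead letter queue idea can be sketched as follows. This is only an in-memory illustration under assumptions: the class `TimeoutMonitor`, its method names, and the message IDs are hypothetical, and a real setup would rely on the broker's own acknowledgement and dead-letter facilities. Time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PendingMessage:
    message_id: str
    sent_at: float  # seconds on any consistent clock


@dataclass
class TimeoutMonitor:
    """Tracks sent messages and moves overdue ones to a dead letter list."""
    timeout_seconds: float
    pending: Dict[str, PendingMessage] = field(default_factory=dict)
    dead_letters: List[PendingMessage] = field(default_factory=list)

    def sent(self, message_id: str, now: float) -> None:
        self.pending[message_id] = PendingMessage(message_id, now)

    def acknowledged(self, message_id: str) -> None:
        # The consumer confirmed processing, so stop tracking the message.
        self.pending.pop(message_id, None)

    def sweep(self, now: float) -> List[PendingMessage]:
        """Return messages not acknowledged in time; they need compensation."""
        expired = [m for m in self.pending.values()
                   if now - m.sent_at > self.timeout_seconds]
        for message in expired:
            del self.pending[message.message_id]
            self.dead_letters.append(message)
        return expired


monitor = TimeoutMonitor(timeout_seconds=30.0)
monitor.sent("order-1", now=0.0)
monitor.sent("order-2", now=10.0)
monitor.acknowledged("order-1")
expired = monitor.sweep(now=45.0)  # "order-2" is overdue by now
```

Each message returned by `sweep` is a candidate for running a compensation or for telling other services that their data may be out of date.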
Maturity
Proposed, evaluation required.
Sources of Evidence
L14:
- Distributed transactions are difficult to implement
- => microservices emphasize transaction-less coordination
- leading to eventual consistency
- problems are dealt with by compensating operations
L16:
- Choreography: no instance to track whether required actions were completed successfully
- add additional service to monitor (not trigger) the workflow
L21:
- Jolie provides mechanism for fault notification
- allow detailed control whether faults are propagated to client microservices
- => careful propagation allows restoring a correct distributed state for the whole system while preserving independence
L34:
- open challenge: how to do transactions across services
- transactions across multiple microservices are very complex
- non-ACID transaction proposed: compensation transaction
L58:
- Context: Synapse architecture (pub-sub system)
- Enforce same atomicity in message delivery as in publisher transaction support
- all writes in transaction included in the same message
- subscribers process messages in a transaction with highest level of isolation and atomicity the DB permits
- Publisher: hijack DB driver to do 2PC transaction:
- all or none of the following happen
- (1) commit transaction locally
- (2) increment version dependencies
- (3) publish message to the reliable message broker
Interview B:
- Sometimes need for transactions => compensation / rollback of domain transaction
- transactions over multiple databases don't work in a world where services are independent
- Saga pattern usually used
- distributed domain transaction split into multiple local technical transactions
- in case of error: need for domain compensation
- in case of error of compensation: need for another compensation
- can become arbitrarily complex
- end with error in a database column that requires manual fixing
- need to know something went wrong in the first place!
- different way of thinking as challenge when coming from the monolith (true / false if worked or not)
- saga pattern: service coordinating the business rules/transaction
- could be mini-workflow engine => [interpretation] at workflow orchestration
- Trick 1: flow within service => good service cut can solve the problem
- Trick 2: question domain knowledge
- often not as transactional as we want to implement it
- example: bank account cannot be opened because the wife's birthday is not known
- IT: can't open bank account
- programmers used to think in transactions
- reality at bank counter: open but without access of wife to account until paperwork done
- DDD helps to gain insights there
- => domain alternatives for compensation
- Context: data replication leading to eventual consistency
- compensation to signal a service that data might be out of date
- for a short period: eventual consistency => not transactional
- figure out if this is relevant from domain perspective
- click paths at user => usually doesn't even notice => not relevant
- need to detect when replication didn't work / took too long
- dead letter queue + timeout
- run a compensation
- or let other services know they have out-of-date data
Interview C:
- don't do transactions with microservices
- SAGA pattern, event sourcing to the rescue with additional benefits
- need for knowledge about SAGA pattern
- decide where eventual consistency is okay (inevitable in distributed system)
- even the best logs are no replacement for a debugger
- challenge to monitor what event is triggered in which way
- SAGA: need for events to arrive in order etc
- never seen big companies being confident in their compensation solutions
- the "next generation" architecture in many enterprise companies
- (-) if there is a defect => hard to trace down
Interview D:
- Event-based systems: need to check whether the last event was successfully executed within a given time span
- supervisor service
- also other strategies
- often forgotten
- => failure handling separated from regular processing flow => system less complex
- Context: data replication
- caching before leading system accepted the update
- mark data as "treat with caution"
- or excuse afterwards that data was not final
- => compensation actions to cope with inconsistent state
- often not used since too complex!