Bounded Retries
Context
Microservices are being adopted. Clients communicate with microservices, and microservices communicate with each other, over an unreliable network. Timeouts are used to free resources when a microservice instance does not respond.
Problem
- Transient downstream failures cause requests to fail.
Solution
Use a bounded retry strategy to retry network interactions on failure. Retry up to a fixed number of times so that the retries themselves do not overload the system. This upper limit can be complemented with an exponentially increasing waiting time between retries (exponential backoff) to avoid overloading the called microservice. This best practice applies to request-response-style communication as well as to messaging scenarios.
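A minimal sketch of a bounded retry with exponential backoff, assuming a synchronous call in Python. The names `call_with_bounded_retries` and `RetriesExhausted`, the default limits, and the choice of which exceptions count as transient are illustrative assumptions, not part of the pattern description.

```python
import time


class RetriesExhausted(Exception):
    """Raised when the initial attempt and all retries have failed."""


def call_with_bounded_retries(operation, max_retries=5, base_delay=0.1,
                              max_delay=5.0,
                              retry_on=(ConnectionError, TimeoutError),
                              sleep=time.sleep):
    """Call operation(); on a transient failure retry at most max_retries times.

    The waiting time doubles after each failed attempt (base_delay,
    2 * base_delay, 4 * base_delay, ...) and is capped at max_delay.
    """
    last_error = None
    for attempt in range(max_retries + 1):  # one initial attempt + retries
        try:
            return operation()
        except retry_on as error:
            last_error = error
            if attempt == max_retries:
                break  # upper bound reached, stop retrying
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RetriesExhausted(
        f"gave up after {max_retries} retries") from last_error
```

The caller passes the network interaction as a zero-argument callable, for example `call_with_bounded_retries(lambda: client.get(url))`, where `client` stands for whatever HTTP or messaging client the service uses. Exceptions not listed in `retry_on` propagate immediately, since retrying a non-transient failure would only add load.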
Retries may cause the called microservice to receive the same message multiple times. The implementation of the callee should ensure that such duplicate deliveries do not cause unexpected behavior, for example by processing messages idempotently.
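One common way to make the callee tolerant of duplicates is to deduplicate on a request identifier. The sketch below assumes the caller sends the same key with every retry of one logical request. The `PaymentService` class, its `deposit` operation, and the in-memory store are hypothetical; a real service would typically keep the processed keys in durable storage.

```python
class PaymentService:
    """Hypothetical callee that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of first processing
        self.balance = 0

    def deposit(self, idempotency_key, amount):
        # A retried request carries a key we have already seen:
        # return the stored result instead of applying the effect again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance += amount
        result = {"status": "accepted", "balance": self.balance}
        self._results[idempotency_key] = result
        return result
```

With this in place, a deposit that is delivered twice under the same key changes the balance only once, while a request with a new key is processed normally.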
Infrastructure components such as API facades, internal integration proxies, or service meshes may offer retry mechanisms. We advise against putting reaction logic (i.e., business logic) for failed retries into such infrastructure components unless the reaction is very generic, such as always performing up to two retries on timeouts. The reaction to the last failed bounded retry should be domain-motivated and located within the microservice.
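The split of responsibilities might look as follows in the service code. This sketch assumes the generic retries happen below the client (for example in a service mesh) and that the client raises an exception once they are exhausted. The names `DownstreamUnavailable`, `fetch_recommendations`, `get_recommendations`, and the fallback list are hypothetical and chosen only for illustration.

```python
class DownstreamUnavailable(Exception):
    """Signals that the generic, bounded retries have all failed."""


# Assumed domain decision: show a static list rather than fail the page.
DEFAULT_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]


def fetch_recommendations(client, user_id):
    """Return recommendations, reacting to exhausted retries in the service.

    The infrastructure only retries; what to do after the last failed
    retry is decided here, where the domain knowledge lives.
    """
    try:
        return client.get_recommendations(user_id)
    except DownstreamUnavailable:
        return list(DEFAULT_RECOMMENDATIONS)
```

A different service could react to the same exception by rejecting the request or by queuing it for later, which is why this decision is kept out of the shared infrastructure.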
Maturity
Proposed, evaluation required.
Sources of Evidence
L8:
- Failure handling mechanisms like retry and fallback directly within services' source code
L35:
- Bounded retries
- handle transient failures in the system
- retry API call with expectation that fault is temporary
- bounded number of times, usually with exponential backoff strategy to avoid overloading the callee microservice
- e.g. retry up to 5 times
- retries are observable from the network
Interview B:
- Service mesh can do retries
- okay as a default reaction pattern
- reaction to failing generic retries should not be part of service mesh anymore
Interview D:
- Service mesh can solve omission failures by retries
- if reaction agnostic of domain-knowledge, configurable, and default reaction pattern => okay to put into service mesh/API gateway
- further reactions based on domain knowledge should not be part of infrastructure