Bounded Retries

Context

Microservices are being adopted. Clients communicate with microservices, and microservices communicate with each other, over an unreliable network. Timeouts are used to free resources when a microservice instance does not respond.

Problem

Transient downstream failures cause requests to fail.

Solution

Use a bounded retry strategy to retry network interactions on failure. Retry up to a fixed number of times to avoid overloading the system. This upper limit on retries can be complemented with an exponentially increasing waiting time between retries (exponential backoff) to avoid overloading the called microservice. This best practice applies to request-response communication as well as to messaging scenarios.
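The retry strategy described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `TransientError`, `call_with_bounded_retries`, and the defaults for `max_retries` and `base_delay` are hypothetical names and values chosen for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout)."""


def call_with_bounded_retries(operation, max_retries=3, base_delay=0.1,
                              sleep=time.sleep):
    """Call operation(); on TransientError, retry at most max_retries times.

    The waiting time doubles after every failed attempt:
    base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            if attempt >= max_retries:
                # Retry budget exhausted: hand the failure to the caller.
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

With `max_retries=3`, the operation is attempted at most four times (one initial call plus three retries). Production implementations often also add random jitter to the waiting time and cap the maximum delay; both are omitted here for brevity.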

Retries might result in the called microservice receiving the same message multiple times. The implementation of the callee should make sure that such duplicate delivery does not cause unexpected behavior, for example by processing messages idempotently.
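One way a callee could tolerate duplicate delivery is to deduplicate on a request identifier supplied by the caller. The sketch below assumes such an identifier exists; `PaymentService`, `deposit`, and `request_id` are hypothetical names, and the in-memory dictionary stands in for whatever durable store a real service would use.

```python
class PaymentService:
    """Hypothetical callee that deduplicates by a caller-supplied request ID."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # request_id -> result already returned

    def deposit(self, request_id, amount):
        if request_id in self._results:
            # Duplicate delivery (e.g. a retry): replay the stored result
            # instead of applying the deposit a second time.
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance
```

For example, calling `deposit("r1", 50)` twice leaves the balance at 50, because the second call is recognized as a repeat of the first.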

Infrastructure components like API facades, internal integration proxies, or service meshes may offer retry mechanisms. We advise against putting reaction logic (that is, business logic) for failed retries into such infrastructure components unless the logic is very generic, such as always performing up to two retries on timeouts. The reaction to the last failed bounded retry should be domain-motivated and located within the microservice.
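The split of responsibilities might look like the following sketch. It assumes that a generic retry layer (in a service mesh or in a client library) has already given up and signals this with an error; `RetriesExhausted`, `recommendations_for`, `fetch_personalized`, and `default_items` are hypothetical names invented for the example.

```python
class RetriesExhausted(Exception):
    """Hypothetical error raised once the generic retry layer has given up."""


def recommendations_for(user_id, fetch_personalized, default_items):
    """Domain-level reaction to a final failure, kept inside the microservice.

    fetch_personalized is assumed to be wrapped by generic, domain-agnostic
    retries. What to do when those retries fail is a business decision, so it
    lives here and not in the infrastructure.
    """
    try:
        return fetch_personalized(user_id)
    except RetriesExhausted:
        # Domain decision for this hypothetical service: degrade gracefully
        # by serving non-personalized defaults instead of failing the request.
        return list(default_items)
```

A different domain could reasonably choose another reaction, such as failing the request or queuing it for later processing, which is why this decision is hard to express generically in a service mesh or API gateway.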

Maturity

Proposed, evaluation required.

Sources of Evidence

L8:

Failure handling mechanisms like retry and fallback directly within services' source code

L35:

Bounded retries
- handle transient failures in the system
- retry API call with expectation that fault is temporary
- bounded number of times, usually with exponential backoff strategy to avoid overloading the callee microservice
- e.g. retry up to 5 times
Retries are observable from the network

Interview B:

Service mesh can do retries
- okay as a default reaction pattern
- reaction to failing generic retries should not be part of service mesh anymore

Interview D:

Service mesh can solve omission failures by retries
- if reaction agnostic of domain-knowledge, configurable, and default reaction pattern => okay to put into service mesh/API gateway
- further reactions based on domain knowledge should not be part of infrastructure
