Timeouts

Context

Microservices are being adopted. Communication between microservices happens over an unreliable network.

Problem

Microservice instances become non-responsive since they are waiting for a response message that never arrives or is heavily delayed.
Resources become unavailable since they are not released while waiting for a response message that never arrives or is heavily delayed.

Solution

Use timeouts to detect messages that are heavily delayed or were lost somewhere in the system. On timeout, orphan the running request, free the allocated resources, and complete the request, marking it as failed if no compensation or further error handling mechanism is available.
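
As an illustration, a minimal sketch of a timeout around a request-response call, assuming Python's asyncio. The function `slow_service` is a hypothetical stand-in for a remote microservice, and the string results are placeholders for whatever failure handling applies in a real system.

```python
import asyncio


async def slow_service(delay_s: float) -> str:
    """Hypothetical stand-in for a remote microservice call."""
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    """Wait at most timeout_s for the response; otherwise give up."""
    try:
        # wait_for cancels the pending call when the timeout expires,
        # so the caller stops holding resources for it.
        return await asyncio.wait_for(slow_service(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # No compensation in this sketch: complete the request as failed.
        return "failed: timeout"


def run(delay_s: float, timeout_s: float) -> str:
    """Synchronous entry point for trying the sketch."""
    return asyncio.run(call_with_timeout(delay_s, timeout_s))
```

Calling `run(0.01, 1.0)` completes normally, while `run(1.0, 0.01)` gives up after the timeout and reports the request as failed.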

The timeout should be decided individually for each interaction, depending on its context. What counts as a viable time span for waiting on a message should depend on

technical factors, like roundtrip times, but also on
domain knowledge, like the time criticality of the executed job (e.g., is a user waiting on a response or not).
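
One possible way to combine the two factors is sketched below. The heuristic itself is an assumption made for illustration, not a rule from the sources: it derives a technical estimate from an observed roundtrip time and caps it with a deadline that comes from domain knowledge. The names `choose_timeout`, `observed_roundtrip_s`, `domain_deadline_s`, and `safety_factor` are hypothetical.

```python
def choose_timeout(
    observed_roundtrip_s: float,
    domain_deadline_s: float,
    safety_factor: float = 2.0,
) -> float:
    """Pick a timeout for one interaction (illustrative heuristic only).

    observed_roundtrip_s: technical factor, e.g. a measured high-percentile
        roundtrip time for this call.
    domain_deadline_s: domain factor, e.g. how long a waiting user or a
        time-critical job can tolerate.
    """
    technical_estimate = observed_roundtrip_s * safety_factor
    # Never wait longer than the domain allows, even if the network is slow.
    return min(technical_estimate, domain_deadline_s)
```

With a measured roundtrip of 0.25 s and a user-facing deadline of 2 s, the sketch yields 0.5 s; for a slow dependency with a 3 s roundtrip, the domain deadline of 2 s wins.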

Timeouts apply to synchronous request-response style messaging, but also to asynchronous message passing with events. In the latter case, the failure handling needs to be separated from the processing flow: either an additional higher-level component or one of the involved components monitors and supervises the processing flow. Although this might seem more complex at first glance than synchronous timeouts, it simplifies processing flows significantly by keeping failure handling separate.
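
The supervising component for event flows could look roughly like the following sketch. It is an assumption about one possible design: the class `FlowSupervisor` and its method names are hypothetical, and the clock is injectable only so the behavior can be tried without real waiting.

```python
import time
from typing import Callable, Dict, List


class FlowSupervisor:
    """Tracks started event flows and reports those that missed their deadline."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._deadlines: Dict[str, float] = {}

    def flow_started(self, flow_id: str, timeout_s: float) -> None:
        """Record the first event of a flow together with its deadline."""
        self._deadlines[flow_id] = self._clock() + timeout_s

    def flow_finished(self, flow_id: str) -> None:
        """Record the final event of a flow; it no longer needs supervision."""
        self._deadlines.pop(flow_id, None)

    def expire_overdue(self) -> List[str]:
        """Return and forget all flows whose deadline has passed.

        The caller decides how to react (compensate, alert, mark as failed);
        the processing components themselves stay free of failure handling.
        """
        now = self._clock()
        overdue = [fid for fid, deadline in self._deadlines.items() if deadline <= now]
        for fid in overdue:
            del self._deadlines[fid]
        return overdue
```

In use, the supervisor would subscribe to the first and final events of a flow and call `expire_overdue()` periodically.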

Infrastructure components like API facades, internal integration proxies, or service meshes may offer timeout handling. We advise against putting reaction logic (i.e., business logic) for timeouts into such infrastructure components unless the reaction is very generic, such as always retrying on a timeout. The detection of a timeout, however, may be delegated to the infrastructure component.
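
A generic reaction of the kind that may live in infrastructure is sketched below: retry on timeout, with no knowledge of the domain. The helper `retry_on_timeout` and its `max_attempts` parameter are hypothetical, and the sketch assumes the wrapped call signals a timeout by raising Python's built-in `TimeoutError`.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_on_timeout(call: Callable[[], T], max_attempts: int = 3) -> T:
    """Invoke call, retrying only when it times out.

    Other errors propagate unchanged: deciding how to react to them is
    domain logic and stays in the service.
    """
    last_error: Optional[TimeoutError] = None
    for _ in range(max_attempts):
        try:
            return call()
        except TimeoutError as exc:
            last_error = exc
    # All attempts timed out: hand the failure back to the caller.
    raise TimeoutError(f"gave up after {max_attempts} attempts") from last_error
```

Anything beyond this generic behavior, such as choosing a fallback or compensating a booking, would remain in the calling service.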

If waiting for timeouts is not a good option because it increases waiting times at the clients, we advise using circuit breakers, which build on timeouts and lead to a fail-fast strategy.
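
How a circuit breaker can build on timeouts is sketched below in a deliberately simplified form. The class `CircuitBreaker`, the exception `CircuitOpenError`, and the parameters `failure_threshold` and `reset_after_s` are hypothetical names; the sketch assumes timeouts surface as Python's built-in `TimeoutError` and uses the number of consecutive timeouts as the metric for opening the circuit.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._timeouts = 0  # consecutive timeouts seen so far
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                # Fail fast: the client does not wait for another timeout.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through. The timeout
            # count is kept, so a single further timeout reopens the circuit.
            self._opened_at = None
        try:
            result = fn()
        except TimeoutError:
            self._timeouts += 1
            if self._timeouts >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._timeouts = 0  # a success closes the circuit again
        return result
```

While the circuit is open, callers receive `CircuitOpenError` immediately instead of waiting for yet another timeout to expire.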

Maturity

Proposed, evaluation required.

Sources of Evidence

L5:

Communication strategy involves identifying right protocol, response time expectations, timeouts, and API design

L31:

Timeout as "problem" section for circuit breaker pattern: waiting for the timeout takes too long
- use timeout as metric to determine state of circuit breaker

L35:

API call of microservice can have the following failure results
- delayed response
- error response (error code)
- invalid response
- connection timeout
- failure to establish connection
Timeouts ensure API call to microservice completes in bounded time
- maintain responsiveness and release resources associated with API call in timely fashion

L58:

Context: Synapse (example application)
Subscribers deadlocked => queues were filling up since they could not consume update
Timeouts => Synapse's recovery mechanism kicked in, rebootstrapped subscribers, and system was unblocked
Need for mechanism to give up on waiting for late or lost messages
- Timeout should be configurable

Interview B:

Need for dead-letter queues
- listen and react with timeout
Timeout needs to be motivated by domain-knowledge
- dependent on use-case: think about what should happen when => react according to it
- => huge challenge for developers of enterprise computing
Events
- can emulate request-response with corresponding events
- also requires timeout
- more complex
Example fluege.de
- services that can't make it within timeout just are not shown
Timeout (among others) can be in the sidecar/service mesh

Interview D:

Request-Response => watch out for timeouts
- catch timeout exception => should be clear how we compensate based on domain logic
Events
- also monitor if event flow is finished in certain time
  - could be the first service watching for the final event
  - could be a supervisor on level above
- => often forgotten
- many claim this is complex
  - but it is just separating failure handling from processing flow => in sum
Role of infrastructure
- Service mesh should only detect timeouts, not react to it
- failure can be ignored, or lead to something fundamentally different => decide individually
- reaction only if it is a generic one
  - e.g. always make a retry
- don't pull domain logic out of service! otherwise will become a bottleneck
