Timeouts
Context
Microservices are being adopted. Communication between microservices happens over an unreliable network.
Problem
- Microservice instances become non-responsive since they are waiting for a response message that never arrives or is heavily delayed.
- Resources become unavailable since they are not released while waiting for a response message that never arrives or is heavily delayed.
Solution
Use timeouts to detect messages that are heavily delayed or lost somewhere in the system. On timeout, orphan (abandon) the running request, free the resources allocated for it, and complete the request. If no compensation or further error handling mechanism is available, complete it as failed.
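The behavior described above can be sketched for a single request as follows. This is a minimal illustration using Python's `asyncio`, not a prescribed implementation; `call_with_timeout`, `RequestFailed`, `slow_service`, and the `released` list are hypothetical names introduced only for this example.

```python
import asyncio

released = []  # records resources freed by the orphaned call (illustration only)


class RequestFailed(Exception):
    """Raised when a request is completed as failed after a timeout."""


async def call_with_timeout(operation, timeout_s, on_timeout=None):
    """Await operation() for at most timeout_s seconds.

    On timeout the pending call is cancelled, which runs its `finally`
    blocks and thereby frees its resources. The request then completes
    with the result of on_timeout() (compensation) or as failed.
    """
    try:
        return await asyncio.wait_for(operation(), timeout=timeout_s)
    except asyncio.TimeoutError:
        if on_timeout is not None:
            return on_timeout()
        raise RequestFailed(f"no response within {timeout_s}s") from None


async def slow_service():
    # Hypothetical downstream call whose response is heavily delayed.
    try:
        await asyncio.sleep(10)
        return "response"
    finally:
        released.append("connection")  # stands in for closing a connection


def demo():
    # Compensation path: fall back to a default value instead of failing.
    return asyncio.run(call_with_timeout(slow_service, 0.05, lambda: "fallback"))
```

Calling `demo()` returns the fallback value after roughly 50 ms instead of blocking for ten seconds. Without an `on_timeout` handler the same call raises `RequestFailed`.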
The timeout should be decided individually for each interaction, depending on its context. What counts as a viable time span for waiting on a message should depend on
- technical factors, like roundtrip times, but also on
- domain knowledge, like the time criticality of the executed job (e.g., is a user waiting on a response or not).
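One way to combine both kinds of factors is sketched below. The headroom multiplier, the two-second interactive budget, and the names `InteractionProfile` and `choose_timeout` are assumptions made for illustration; real values have to come from measurements and from the domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionProfile:
    name: str
    p99_roundtrip_s: float  # technical factor: measured roundtrip time
    user_is_waiting: bool   # domain factor: is a person blocked on the response?


def choose_timeout(profile, interactive_budget_s=2.0, headroom=3.0):
    """Derive a per-interaction timeout from technical and domain factors."""
    # Technical bound: leave headroom above the observed roundtrip time.
    technical_s = profile.p99_roundtrip_s * headroom
    if profile.user_is_waiting:
        # Domain bound: a waiting user caps how long we are willing to wait.
        return min(technical_s, interactive_budget_s)
    return technical_s
```

With these assumed defaults, an interactive search with a one-second roundtrip time is capped at 2.0 seconds, while a background job with a four-second roundtrip time may wait 12.0 seconds.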
Timeouts apply to synchronous request-response style messaging, but also to asynchronous message passing with events. In the latter case, failure handling needs to be separated from the processing flow: either an additional higher-level component or one of the involved components monitors and supervises the flow. Although this might seem more complex at first glance than synchronous timeouts, it simplifies processing flows significantly by keeping failure handling separate.
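A supervising component for event flows could look roughly like the following sketch. `FlowSupervisor` and its methods are hypothetical. The sketch assumes that the supervisor is notified when a flow starts and when its final event arrives, and that some scheduler calls `expire_overdue` periodically.

```python
import time


class FlowSupervisor:
    """Tracks event flows and reports those that miss their deadline."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._deadlines = {}  # flow id -> absolute deadline

    def flow_started(self, flow_id, timeout_s):
        self._deadlines[flow_id] = self._clock() + timeout_s

    def flow_finished(self, flow_id):
        # The final event arrived in time; nothing left to supervise.
        self._deadlines.pop(flow_id, None)

    def expire_overdue(self):
        """Return and forget all flows whose deadline has passed."""
        now = self._clock()
        overdue = [f for f, d in self._deadlines.items() if d <= now]
        for flow_id in overdue:
            del self._deadlines[flow_id]
        return overdue
```

The processing components only emit their events. Deciding what to do with an overdue flow (compensate, retry, or mark as failed) stays with whoever calls `expire_overdue`, which keeps failure handling out of the processing flow.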
Infrastructure components like API facades, internal integration proxies, or service meshes may offer timeout handling. We advise against putting reaction logic (= business logic) for timeouts into such infrastructure components unless the reaction is very generic, such as always retrying on a timeout. Detecting a timeout, however, can be delegated to the infrastructure component.
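The split between generic and domain-specific reactions can be illustrated with a small sketch. `retry_on_timeout` is a hypothetical helper standing in for what a proxy or sidecar might do; it assumes the wrapped call signals a timeout by raising Python's built-in `TimeoutError`.

```python
def retry_on_timeout(call, attempts=3):
    """Generic reaction: repeat the call when it times out.

    Contains no domain logic. Once all attempts are used up, the timeout
    is passed on so that the calling service can decide how to react.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except TimeoutError as exc:
            last_error = exc
    raise last_error
```

Anything beyond this, such as compensating, ignoring the failure, or taking a different business path, would remain in the service that catches the final `TimeoutError`.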
If waiting for timeouts to expire is not a good option because it increases waiting times at the clients, we advise using circuit breakers, which build on timeouts to implement a fail-fast strategy.
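How a circuit breaker can build on timeouts is sketched below in simplified form. `CircuitBreaker`, `CircuitOpenError`, the threshold, and the reset period are illustrative assumptions; production-grade breakers usually offer more states and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._timeouts = 0       # consecutive timeouts observed
        self._opened_at = None   # time at which the circuit was opened

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                # Fail fast instead of waiting for yet another timeout.
                raise CircuitOpenError("downstream recently timed out")
            # Reset period elapsed: allow a trial call; one more timeout reopens.
            self._opened_at = None
            self._timeouts = self._threshold - 1
        try:
            result = operation()
        except TimeoutError:
            self._timeouts += 1
            if self._timeouts >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._timeouts = 0  # a successful call closes the circuit again
        return result
```

While the circuit is open, clients receive `CircuitOpenError` immediately, so they no longer pay the full timeout on every call to a service that is known to be unresponsive.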
Maturity
Proposed, evaluation required.
Sources of Evidence
L5:
- Communication strategy involves identifying right protocol, response time expectations, timeouts, and API design
L31:
- Timeout as "problem" section for circuit breaker pattern: waiting for the timeout takes too long
- use timeout as metric to determine state of circuit breaker
L35:
- API call of microservice can have the following failure results
- delayed response
- error response (error code)
- invalid response
- connection timeout
- failure to establish connection
- Timeouts ensure that an API call to a microservice completes in bounded time
- maintain responsiveness and release resources associated with API call in timely fashion
L58:
- Context: Synapse (example application)
- Subscribers deadlocked => queues were filling up since they could not consume update
- Timeouts => Synapse's recovery mechanism kicked in, rebootstrapped subscribers, and system was unblocked
- Need for mechanism to give up on waiting for late or lost messages
- Timeout should be configurable
Interview B:
- Need for dead-letter queues
- listen and react with timeout
- Timeout needs to be motivated by domain-knowledge
- dependent on use-case: think about what should happen when => react accordingly
- => huge challenge for developers of enterprise computing
- Events
- can emulate request-response with corresponding events
- also requires timeout
- more complex
- Example fluege.de
- services that can't make it within timeout just are not shown
- Timeout (among others) can be in the sidecar/service mesh
Interview D:
- Request-Response => watch out for timeouts
- catch timeout exception => should be clear how we compensate based on domain logic
- Events
- also monitor if event flow is finished in certain time
- could be the first service watching for the final event
- could be a supervisor on level above
- => often forgotten
- many claim this is complex
- but it is just separating failure handling from processing flow => in sum
- Role of infrastructure
- Service mesh should only detect timeouts, not react to them
- failure can be ignored, or lead to something fundamentally different => decide individually
- reaction only if it is a generic one
- e.g. always make a retry
- don't pull domain logic out of service! otherwise will become a bottleneck