Design for Failure
Context
Microservices are in use or are planned to be adopted. They communicate with each other over the network to a certain degree. This inevitably leads to failures caused by the unreliable network.
Problem
- Failures happen frequently in production.
Solution
Design for failure. This means assuming that failures will arise due to the unreliable network. This way of thinking leads to a more robust design and decreases the impact of frequent failures caused by dependent microservices.
Proactively design mechanisms to cope with these kinds of failures, e.g., by
- degrading functionality,
- designing domain-motivated alternatives,
- implementing compensations in workflows,
- applying further technical error handling techniques,
- and employing CI/CD to deploy small changes so that newly introduced failures are easier to discover.
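Two of the mechanisms listed above, degradation of functionality and compensations in workflows, can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names (with_fallback, run_with_compensation), the fallback data, and the recommendation-service scenario are all hypothetical and chosen only for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical default shown when the recommendation service is unavailable.
FALLBACK_RECOMMENDATIONS: List[str] = ["bestseller-1", "bestseller-2"]


def with_fallback(call: Callable[[], List[str]], fallback: List[str]) -> List[str]:
    """Degradation of functionality: return a default if the remote call fails."""
    try:
        return call()
    except (ConnectionError, TimeoutError):
        # Degrade instead of propagating the network failure to the user.
        return list(fallback)


def run_with_compensation(steps: List[Tuple[Callable[[], None], Callable[[], None]]]) -> bool:
    """Compensation in a workflow: run (action, compensation) pairs in order.

    If an action fails, the compensations of all previously completed
    steps are executed in reverse order and False is returned.
    """
    completed: List[Callable[[], None]] = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return False
    return True


def flaky_recommendation_service() -> List[str]:
    """Stand-in for a call to a dependent microservice that is down."""
    raise ConnectionError("recommendation service unreachable")


def healthy_recommendation_service() -> List[str]:
    """Stand-in for the same call when the dependency is reachable."""
    return ["personalized-42"]
```

With these definitions, with_fallback(flaky_recommendation_service, FALLBACK_RECOMMENDATIONS) returns the static bestseller list instead of raising, while the healthy variant returns the personalized result. In a real system the compensations themselves can fail and would need their own handling (e.g., retries or manual intervention), which this sketch leaves out.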
Maturity
Proposed, evaluation required.
Sources of Evidence
L7:
- Comparison of microservice definition
- Design for failure (Lewis + Fowler) as subset of failure isolation (Newman)
- Principle present in other distributed styles as well, e.g. SOA
L8:
- Design for failure was very influential on microservice style
L19:
- "Design for failure" is an important requirement in successfully building microservice-based systems
- need for a divide-and-conquer model => break things into smaller chunks
- fast tooling for continuous delivery of many tiny changes
- => change one thing at a time
- (+) if it breaks we know that's the only thing that broke
L25:
- "Design for Failure" as microservice principle
- Continuous delivery, which allows deploying many small changes, can help devs change one thing at a time
LM45:
- Context: interviews and insights from multiple cases on technologies and software quality in MSA
- (among others) design for failure is among the microservice principles that are generally followed, albeit to varying degrees
LM48:
- Context: microservice migration; describes an example project (FX Core) and compares it back to the monolith
- describes design of two different failover modes
- active/active: replicas running alongside, share load
- active/passive: only a single instance is active; the passive one is idle and takes over in case of failure
- design for failure results in better reliability at large scale
Interview D:
- Microservices require design for failure (necessity-driven)
Interview F:
- Context: consistency
- only within an instance of a microservice
- requires thinking more about it: can't just write down a sequence
- need strategies for how to handle all the cases that can happen
- Is it okay to deliver an async response to the user?
- how to order requests of the same type?
- Need to react in a suitable manner
- "log exception, that should do" - it doesn't
- partially grounded in requirements: what to do when the process cannot be continued here
- => microservices enforce thinking about this explicitly
- easier with monoliths, most of these issues not present