Design for Failure

Context

Microservices are in use or planned for adoption. They communicate with each other over the network to some degree. Because the network is unreliable, failures are inevitable.

Problem

Failures happen frequently in production.

Solution

Design for failure: assume that failures will occur because the network is unreliable. This mindset leads to more robust designs and reduces the impact of frequent failures in the microservices a service depends on.

Proactively design mechanisms to cope with these kinds of failures, e.g., by

degrading functionality,
designing domain-motivated alternatives,
implementing compensations in workflows,
applying further technical error handling techniques,
and employing CI/CD to deploy small changes so that newly introduced failures are easier to discover.
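Two of the mechanisms listed above, degradation of functionality and compensation in workflows, can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `DependencyUnavailable`, `recommendations_with_fallback`, and `run_with_compensation` are hypothetical names, and the remote call is modeled as a plain callable.

```python
from typing import Callable, List, Tuple


class DependencyUnavailable(Exception):
    """Hypothetical error raised when a remote microservice cannot be reached."""


def recommendations_with_fallback(
    fetch_personalized: Callable[[str], List[str]],
    user_id: str,
    default_items: List[str],
) -> dict:
    """Degrade functionality: serve generic items if the dependency fails."""
    try:
        return {"items": fetch_personalized(user_id), "degraded": False}
    except (DependencyUnavailable, TimeoutError):
        # Domain-motivated alternative: a static default list instead of an error.
        return {"items": list(default_items), "degraded": True}


# A workflow step is a pair (action, compensation).
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

The `degraded` flag lets the caller decide whether to tell the user that the result is not personalized. The compensation helper assumes that each compensation itself succeeds; a production workflow would also need a strategy for failing compensations.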

Maturity

Proposed, evaluation required.

Sources of Evidence

L7:

Comparison of microservice definition
- Design for failure (Lewis + Fowler) as subset of failure isolation (Newman)
Principle present in other distributed styles as well, e.g. SOA

L8:

Design for failure was very influential on microservice style

L19:

"Design for failure" important requirement in successfully building ms-based systems
- need for divide-and-conquer model => break things into smaller chunks
  - fast tooling for continuous delivery of many tiny changes
- => change one thing at a time
- (+) if it breaks we know that's the only thing that broke

L25:

"Design for Failure" as microservice principle
- Continuous delivery allowing to deploy many small changes can help devs to change one thing at a time

LM45:

Context: interviews and insights from multiple cases on technologies and sw quality in MSA
(among others) design for failure is among the microservice principles that are generally followed, albeit to varying degrees

LM48:

Context: microservice migration, describes an example project (FX Core) and compares it back to the monolith
describes design of two different failover modes
- active/active: replicas run alongside each other and share load
- active/passive: only a single instance is active; the passive one is idle and takes over in case of failure
design for failure results in better reliability at large scale
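The two failover modes can be illustrated with a small sketch. It is not taken from the FX Core project; `ReplicaSet` and its `mode` parameter are hypothetical, replicas are modeled as plain callables, and a failed replica is assumed to raise `ConnectionError`.

```python
import itertools
from typing import Callable, Sequence


class ReplicaSet:
    """Routes requests to replicas in active/active or active/passive mode."""

    def __init__(self, replicas: Sequence[Callable[[str], str]],
                 mode: str = "active/passive") -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        if mode not in ("active/active", "active/passive"):
            raise ValueError("unknown mode: " + mode)
        self.replicas = list(replicas)
        self.mode = mode
        self._counter = itertools.count()

    def call(self, request: str) -> str:
        n = len(self.replicas)
        # active/active: rotate the starting replica to share load.
        # active/passive: always start with the first (active) replica.
        start = next(self._counter) % n if self.mode == "active/active" else 0
        last_error = None
        for offset in range(n):
            replica = self.replicas[(start + offset) % n]
            try:
                return replica(request)
            except ConnectionError as exc:
                last_error = exc  # try the next replica
        raise RuntimeError("all replicas failed") from last_error
```

In both modes a failing replica is skipped in favor of the next one; the modes differ only in whether healthy replicas share the load or one of them stays idle until needed.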

Interview D:

Microservices require design for failure; it is driven by necessity

Interview F:

Context: consistency
only within an instance of a microservice
requires more thought: can't just write down a sequence
- need strategies for handling all the cases that can happen
- Is it okay to deliver an async response to the user?
- how to order requests of the same type?
Need to react in a suitable manner
- "log exception, that should do" - it doesn't
- partially grounded in requirements: what to do when the process cannot be continued here
- => microservices enforce thinking about this explicitly
  - easier with monoliths, most of these issues not present
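The point that logging an exception is not enough can be made concrete with a small sketch. It assumes a hypothetical order workflow in which the payment call may fail with `ConnectionError`; `Outcome`, `place_order`, and the retry queue are illustrative names, not part of the interview.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque


@dataclass
class Outcome:
    status: str  # "completed" or "accepted"
    detail: str


def place_order(order_id: str,
                charge: Callable[[str], None],
                retry_queue: Deque[str]) -> Outcome:
    """Handle a failing dependency explicitly instead of only logging it."""
    try:
        charge(order_id)
    except ConnectionError:
        # Logging alone would lose the order. Park it for a later retry
        # and give the caller an asynchronous answer.
        retry_queue.append(order_id)
        return Outcome("accepted", "payment pending, will be retried")
    return Outcome("completed", "payment charged")
```

Whether an "accepted" response is acceptable to the user is a requirements question, which is exactly the kind of decision the interview says microservices force teams to make explicitly.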
