Circuit Breaker and Fail Fast

Context

Microservices are being adopted. Communication between clients and microservices, or among microservices themselves, happens over an unreliable network. Timeouts are used to free resources when interacting with unresponsive microservice instances.

Problem

  • Repeatedly waiting for requests to overloaded services to time out slows down the system.
  • Failures cascade because calling microservice instances are blocked until their own timeouts expire.

Solution

Use circuit breakers that fail fast and isolate the unhealthy microservice instance to prevent cascading failures.

A circuit breaker monitors the health of a microservice instance with periodic health checks and/or by counting request failures and timeouts. Once a microservice instance becomes unhealthy or the failure rate reaches a configured threshold, the circuit breaker blocks incoming traffic and immediately returns an error code or resorts to a fallback mechanism. Clients waste fewer resources on accessing unresponsive service instances, and the service instances get the chance to recover without additional load. The state of the circuit breaker can be expressed as a state machine with the states closed, open, and half-open.
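
This state machine can be illustrated with a short sketch. The Java class below is a minimal, illustrative sketch written for this description, not taken from any of the cited libraries; the class name CircuitBreaker, the consecutive-failure counting, and the two constructor parameters (failure threshold, open duration) are simplifying assumptions. It fails fast with a fallback while open and probes the protected call again via a half-open state after the wait period.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.function.Supplier;

    // Minimal circuit breaker sketch: closed -> open after too many consecutive
    // failures, open -> half-open after a wait period, half-open -> closed again
    // once a probe call succeeds.
    public class CircuitBreaker {
        public enum State { CLOSED, OPEN, HALF_OPEN }

        private final int failureThreshold;   // consecutive failures until the circuit opens
        private final Duration openDuration;  // how long to stay open before probing again

        private State state = State.CLOSED;
        private int consecutiveFailures = 0;
        private Instant openedAt;

        public CircuitBreaker(int failureThreshold, Duration openDuration) {
            this.failureThreshold = failureThreshold;
            this.openDuration = openDuration;
        }

        // Runs the protected call, or fails fast with the fallback while the circuit is open.
        public synchronized <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
            if (state == State.OPEN) {
                if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                    state = State.HALF_OPEN;   // let a probe request through
                } else {
                    return fallback.get();     // fail fast: no remote call, no waiting for a timeout
                }
            }
            try {
                T result = protectedCall.get();  // expected to throw on failure or timeout
                onSuccess();
                return result;
            } catch (RuntimeException e) {
                onFailure();
                return fallback.get();
            }
        }

        private void onSuccess() {
            consecutiveFailures = 0;
            state = State.CLOSED;    // normal operation, or a successful probe closed the circuit
        }

        private void onFailure() {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;  // block further calls and note when the circuit opened
                openedAt = Instant.now();
            }
        }

        public synchronized State state() {
            return state;
        }
    }

A production implementation would additionally need a sliding window or rate-based failure counting, handling of concurrent probes in the half-open state, and the periodic health checks mentioned above.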

In terms of implementation, there are three different kinds of circuit breakers:

  • Client-side circuit breakers live in the client code and prevent requests in the first place if the state is open.
  • Server-side circuit breakers live in the service code and require resources on the microservice instance.
  • Proxy circuit breakers sit between the client and the microservice instance. This option may be considered if changes to clients and services should be kept to a minimum and the concrete implementation should be easy to replace. However, proxies can become a bottleneck, so consider where the proxy runs in the network (e.g. natural routing node vs. non-natural routing node) and consider introducing a circuit breaker proxy per microservice instead of a global one.

One challenge with circuit breakers is choosing the appropriate reaction in the open state, where requests should fail immediately. Consider domain-motivated alternatives, such as returning cached or default data. If the circuit breaker is not under the governance of the microservice team, that team should at least be involved in defining the appropriate reaction.
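
As an illustration of a domain-motivated reaction in the open state, the following sketch reuses the CircuitBreaker class from the sketch above and falls back to a cached default answer instead of an error. The recommendation-service scenario and all names in it (RecommendationClient, recommendationsFor, fetchFromRecommendationService, the threshold of 5 failures and 30 seconds open duration) are hypothetical.

    import java.time.Duration;
    import java.util.List;

    // Hypothetical consumer: a client-side breaker around a recommendation service,
    // falling back to a cached default list instead of propagating an error or timeout.
    public class RecommendationClient {
        private final CircuitBreaker breaker =
                new CircuitBreaker(5, Duration.ofSeconds(30));  // 5 failures, 30 s open

        private final List<String> cachedDefaults = List.of("bestseller-1", "bestseller-2");

        public List<String> recommendationsFor(String userId) {
            return breaker.call(
                    () -> fetchFromRecommendationService(userId),  // remote call, may fail or time out
                    () -> cachedDefaults);                         // domain-motivated fallback
        }

        private List<String> fetchFromRecommendationService(String userId) {
            // placeholder for the actual HTTP/RPC call to the microservice instance
            throw new RuntimeException("remote call not implemented in this sketch");
        }
    }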

Maturity

Proposed, evaluation required.

Sources of Evidence

L3:

  • Context: example system
  • used Hystrix as circuit breaker together with other tools of the Netflix OSS stack
    • adds resilience during service calls
    • failing fast decreases coupling between services => contributes to independent deployments

L5:

  • circuit breaker as keyword for the communication/integration challenge of microservices

L7:

  • as part of microservice principles: isolate failure
    • e.g. by introducing circuit breakers to make services more robust

L8:

  • 5th wave: libraries for latency and fault-tolerant communication, e.g. Finagle, Hystrix, Proxygen, Resilience4j
    • lets services communicate more efficiently and reliably

L12:

  • Circuit breaker
    • fault tolerance should be embedded in every cloud-native application
    • makes even more sense in microservice architectures where microservices collaborate
    • failure in any single service may result in failure of the whole system
    • circuit breaker can mitigate the loss at the lowest level
    • Sketch: every higher-level microservice that calls other services contains a load balancer and a circuit breaker component
    • => more comfortable with new concepts and increased speed for the rest of the migration and for introducing new services

L13:

  • routing fabric that forwards to microservice instance
    • often provides load balancing and can isolate microservices in failed state

L16:

  • Circuit breaker
    • uses health status
    • uses number of unsuccessful calls until threshold is reached
    • if triggered => return an immediate error instead of sending the call, to prevent the broken service from being hit with additional requests
    • after an amount of time: test if the service recovered / check health status => half-open state
    • half-open state prevents the called service from becoming unavailable again due to high incoming traffic
    • works well with load balancer (see sketch below)
      • only put load on services whose circuit breaker is in the closed (healthy) state
      • half-open state: number of requests is lowered
      • open state: instance not used
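
The combination with a load balancer noted above could look roughly like the following sketch, which reuses the CircuitBreaker class from the Solution section. The class name BreakerAwareLoadBalancer and the weighting (double share for closed, single share for half-open, none for open) are illustrative assumptions, not taken from the source.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;

    // Sketch of a breaker-aware load balancer: instances with a closed breaker get a
    // larger share of the traffic, half-open instances a reduced share, open instances none.
    public class BreakerAwareLoadBalancer {
        private final Map<String, CircuitBreaker> breakersByInstance;

        public BreakerAwareLoadBalancer(Map<String, CircuitBreaker> breakersByInstance) {
            this.breakersByInstance = breakersByInstance;
        }

        public String chooseInstance() {
            List<String> candidates = new ArrayList<>();
            breakersByInstance.forEach((instance, breaker) -> {
                switch (breaker.state()) {
                    case CLOSED -> { candidates.add(instance); candidates.add(instance); }  // closed: weight 2
                    case HALF_OPEN -> candidates.add(instance);  // half-open: weight 1, reduced share while probing
                    case OPEN -> { }                             // open: skip, give the instance time to recover
                }
            });
            if (candidates.isEmpty()) {
                throw new IllegalStateException("no instance with a closed or half-open circuit");
            }
            return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
        }
    }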

L18:

  • service discovery aligned with monitoring
    • only gains / retains place in service registry if successfully passes monitoring test

L19:

  • Failure isolation
    • fail fast when failures happen
      • (+) better understanding and problem solving
    • make small incremental changes => see what changed when something breaks

L20:

  • Circuit breaker
    • ranked 3rd among gains pertaining to design patterns during the design stage
    • significantly covered
    • mitigates issues due to failures such as cascading failures
  • Fault isolation and updateability as gains
    • by bounded contexts
    • and by circuit breaker

L24:

  • circuit breaker as one of the most well-known microservice patterns
  • 1 of 4 benchmark candidate applications implements the circuit breaker pattern

L30:

  • need to monitor specific architectural patterns at runtime (see sketch below)
    • such as circuit breakers (open, half-open, ...)
    • patterns are used by including 3rd-party libraries such as the Netflix OSS stack, whose monitoring interfaces are already included
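
As a rough illustration of monitoring circuit breaker state at runtime, the sketch below exposes the current state of the CircuitBreaker class from the Solution section over a plain HTTP endpoint using the JDK's built-in com.sun.net.httpserver. The endpoint path, port, and breaker parameters are arbitrary choices for this example and are not part of any of the cited tools or their monitoring interfaces.

    import com.sun.net.httpserver.HttpServer;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.time.Duration;

    // Sketch: expose the current circuit breaker state over a plain HTTP endpoint so a
    // monitoring system can observe transitions between closed, open and half-open.
    public class BreakerStateEndpoint {
        public static void main(String[] args) throws IOException {
            CircuitBreaker breaker = new CircuitBreaker(5, Duration.ofSeconds(30));

            HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
            server.createContext("/circuit-breaker/state", exchange -> {
                byte[] body = breaker.state().name().getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();  // e.g. GET /circuit-breaker/state -> "CLOSED"
        }
    }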

L31:

  • Migration pattern MP10: Introduce circuit breaker
    • Context
      • end-user requests need interservice communication in the internal system
    • Problem
      • How to fail fast and not wait until the service call has timed out when calling a recently unavailable service
      • How to be more resilient when an unavailable service is called
    • Solution
      • use a circuit breaker in the consumer
      • does nothing if service provider available (closed circuit state)
      • monitors recent responses from the service provider & counts the failing responses
        • if reaches threshold => open circuit state
      • open circuit state: return meaningful response code / latest cached data if acceptable for the specific response
      • after a specific timeout: check the provider's availability => half-open circuit state
        • if success => closed circuit state
        • if not => open circuit state
    • Challenges
      • appropriate response in open circuit state can be challenging
      • need to coordinate with business stakeholders
      • if response not just exception => provider team should certify possibility of returning that response on their behalf
    • Technologies
      • Hystrix
  • Used MP10 in case study
    • since the number of service interactions would increase
    • each external request may be transformed into a chain of internal service calls
    • => importance of failing fast and not waiting for timeouts
    • found in zero of three cases

L34:

  • circuit breaker as keyword for fault tolerance

L35:

  • circuit breaker
    • prevent cascading failures
    • if repeated calls fail => switch to open mode; the caller returns a cached or default response to the upstream microservice
    • after fixed time: caller attempts to reestablish connectivity
      • success: closed state; resume normal operation
      • definition of success is implementation dependent, e.g. response time within a threshold, absence of errors in a time period (see sketch below)
  • no choreography language that can capture circuit breakers
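
Since the definition of success is implementation dependent, one option is to make the success criterion pluggable, as in the following sketch. The SuccessPolicy interface and its factory methods are purely illustrative assumptions, not taken from any cited source; such a policy could be consulted by the CircuitBreaker sketch in the Solution section when deciding between onSuccess and onFailure.

    import java.time.Duration;

    // Sketch: make the definition of "success" pluggable, since a call that returns a
    // result but exceeds a latency budget may still have to count as a failure.
    @FunctionalInterface
    interface SuccessPolicy {
        boolean isSuccess(Duration responseTime, Throwable error);
    }

    class SuccessPolicies {
        // success = no exception and response time within the given budget
        static SuccessPolicy withinLatencyBudget(Duration budget) {
            return (responseTime, error) -> error == null && responseTime.compareTo(budget) <= 0;
        }

        // success = no exception, regardless of latency
        static SuccessPolicy noErrors() {
            return (responseTime, error) -> error == null;
        }
    }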

L53:

  • Context: secure microservice framework for IoT
  • Circuit breaker as QoS policy
    • shows exhaustion of capability
    • dynamically generated to restrict further requests

L55:

  • Context: example system MAYA platform
  • API gateway based on Ribbon, collaborating with service registry and applying load balancing
    • API gateway offers circuit breaker pattern
      • prevents the system from getting stuck in case the target backend service fails to answer within a certain time

L61:

  • application must tolerate service failures
    • either by failing fast
    • or by gracefully degrading functionality
    • at Netflix: (among others) use circuit breaker pattern to avoid cascading failures
    • technology: Hystrix library
  • circuit breaker one of the most recurring design patterns (place 4)
    • related to reliability and portability

LN43:

  • inevitable that lots of services, potentially under heavy load, cannot respond in a timely manner or go down
  • Circuit breaker pattern handles failures fast
    • can provide fallback with default data instead of waiting for dependency
    • monitor failures
    • if enough failures => call to dependency won't be made but instead an error is returned
      • fallback also possible
      • => user might not even notice that something went wrong
    • => not adding more load on dependency, returned error gives dependency time to recover
  • Multiple technical solutions available
    • most famous one is Hystrix
      • library for latency and fault tolerance for distributed systems

LN44:

  • Context: security implications => fail fast
  • fault tolerance contributes to security (but does not translate to it)
    • if goal is to disrupt service, like via a DoS attack
  • Monolith: failure is often total
  • Distributed systems: partial failures (only specific nodes)
  • Microservice network should tolerate these partial failures and limit their propagation
  • Circuit breaker pattern prevents cascading failures, increases overall system resilience
    • adjusts node behaviour if peers fail partially or completely
    • fail fast principle: decrease likelihood of attacks succeeding, minimize damage

LM43:

  • Context: SLR findings about microservices in DevOps
  • circuit breaker pattern is the most recurring pattern found (~10%, 5 studies)
    • indicates that cascading failures are a major concern
  • Table 10: lists circuit breaker pattern

LM47:

  • Context: SLR with tactics to achieve quality attributes
  • Circuit breaker to prevent requests to service in case of failure
    • opens depending on a specific threshold when facing failures
    • closes again after a certain time and will not open again until errors are detected
  • Motivation: fault tolerance important to ensure high availability, many dependencies, cascading failures
  • Description: state machine
    • mainstream impl: Hystrix
    • Three states: open, closed, half-open
      • closed => requests passed to target
      • if count of faults or timeouts > threshold or critical fault detected: becomes open
      • open => prevents requests being passed to target
      • periodic observation of service health: may become half-open to pass a limited number of requests
    • types of circuit breakers
      • client-side: force clients to use circuit breakers, needs access to client code
        • Drawback: clients can be malicious
        • ping services periodically to get health info
      • service-side
        • within the service, decides whether a request is processed or not
        • Drawback: consumes resources even if the state is open, just to decline requests
        • can support throttling
      • proxy between client and service
        • can be used for a single service or multiple services, covering all clients
        • if closed for client + service => requests go through
        • one proxy for many services can become bottleneck
        • one proxy per service ensures services and clients are equally protected (see sketch below)
          • a single client is prevented from sending too many requests, and clients are more resilient against faulty services
  • Constraints
    • requires additional resources for handshake mechanism
    • challenging to recognise appropriate response in open circuit state
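
To avoid the bottleneck of a single global proxy noted above, a proxy can keep one circuit breaker per target service. The sketch below illustrates this with the CircuitBreaker class from the Solution section; the class name PerServiceBreakerProxy, the forwarding method, and the per-service default parameters are illustrative assumptions.

    import java.time.Duration;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    // Sketch of a proxy that keeps one circuit breaker per target service, so a single
    // global breaker neither becomes a bottleneck nor trips for all services at once.
    public class PerServiceBreakerProxy {
        private final Map<String, CircuitBreaker> breakers = new ConcurrentHashMap<>();

        public <T> T forward(String serviceName, Supplier<T> call, Supplier<T> fallback) {
            CircuitBreaker breaker = breakers.computeIfAbsent(
                    serviceName,
                    name -> new CircuitBreaker(5, Duration.ofSeconds(30)));  // illustrative defaults
            return breaker.call(call, fallback);
        }
    }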