Monitor Metrics at Different Levels

Context

Microservices are being adopted. Multiple microservice instances are deployed. Monitoring information is aggregated. Visualizations of different metrics might be available.

Problem

  • The collected metrics don't suffice to discover failures early
  • The collected metrics don't help to locate the defect in the system, which leads to long fixing times

Solution

Collect monitoring metrics at different levels of the deployed application, including on the domain level.

The most important goals of monitoring are to discover failures as fast as possible and locate the defect. For both goals, it makes sense to introduce metrics at different technical levels. Examples are:

  1. Cluster-level, e.g. for Kubernetes:
    • Nodes
    • Namespaces
    • Services
    • Pods
  2. Container-level
  3. Operating-system-level
  4. Application-level
  5. Domain-level

Cluster- and container-level metrics in particular might already be offered by the deployment platform, e.g. a cloud platform or a cluster manager like Kubernetes.
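
On Kubernetes, for instance, node- and pod-level usage can be read from the metrics API, provided the metrics-server is installed. The following is a minimal sketch with the official Python client; the namespace "shop" is a placeholder:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
metrics_api = client.CustomObjectsApi()

# Cluster level: CPU/memory usage per node, as reported by the metrics-server
nodes = metrics_api.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
for node in nodes["items"]:
    print(node["metadata"]["name"], node["usage"]["cpu"], node["usage"]["memory"])

# Container level: usage per container, here restricted to the "shop" namespace
pods = metrics_api.list_namespaced_custom_object("metrics.k8s.io", "v1beta1", "shop", "pods")
for pod in pods["items"]:
    for container in pod["containers"]:
        print(pod["metadata"]["name"], container["name"], container["usage"]["cpu"])
```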

Cluster-level monitoring information is important for whoever operates the cluster. Decisions like adding new nodes, removing unused ones, or adding disk space can be based on this information.

Container-level monitoring can be used for auto-scaling instances of a microservice.
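
How this looks in practice depends on the platform. As an illustrative sketch (not part of the pattern itself), the following Python snippet uses the Kubernetes client to create a HorizontalPodAutoscaler that scales a hypothetical "checkout" Deployment on container CPU usage; the names, namespace, and thresholds are assumptions:

```python
from kubernetes import client, config

config.load_kube_config()

# Scale the "checkout" Deployment between 2 and 10 replicas based on the
# container-level CPU metrics collected by the platform (autoscaling/v1).
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="checkout"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="checkout"
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,  # scale out above 70% average CPU
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="shop", body=hpa
)
```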

Operating-system-level monitoring has become less relevant since container-level monitoring already collects information about a deployed unit. However, metrics on the operating-system level might add value through their introspective perspective, especially when a container runs multiple processes (which is generally advised against - [see Docker Best Practices]).

Application-level monitoring should at least include a health status indicating whether a service instance is operational. It may also include the health status of downstream components such as the database, the message broker, or dependent microservices. Additional metrics, such as counting requests and responses and setting them in relation, may help to discover emerging inconsistencies within a microservice.
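
A minimal sketch of such an application-level endpoint, assuming Flask and the prometheus_client library; the downstream check and all metric names are illustrative, not prescribed by the pattern:

```python
from flask import Flask, jsonify
from prometheus_client import Counter, make_wsgi_app
from werkzeug.middleware.dispatcher import DispatcherMiddleware

app = Flask(__name__)

# Application-level counters: requests received vs. responses sent per endpoint
REQUESTS = Counter("http_requests_total", "Incoming requests", ["endpoint"])
RESPONSES = Counter("http_responses_total", "Outgoing responses", ["endpoint", "status"])

def database_reachable() -> bool:
    return True  # placeholder for a real downstream check, e.g. a lightweight query

@app.route("/health")
def health():
    REQUESTS.labels(endpoint="/health").inc()
    downstream = {"database": database_reachable()}
    code = 200 if all(downstream.values()) else 503
    RESPONSES.labels(endpoint="/health", status=str(code)).inc()
    return jsonify(status="UP" if code == 200 else "DOWN", downstream=downstream), code

# Expose the collected counters under /metrics for a Prometheus scraper
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {"/metrics": make_wsgi_app()})
```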

Domain-level monitoring might be implemented with metrics like revenue per time unit in the e-commerce domain, or user registrations and logins per time unit. A drop in such a metric indicates a very urgent incident and immediately conveys the urgency of dealing with it. It also introduces a sense of responsibility, since the impact of changes is directly correlated with the business success of the product.
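
Domain-level metrics can be exposed with the same tooling as technical ones. A minimal sketch with prometheus_client; the metric names, the port, and the currency are assumptions:

```python
from prometheus_client import Counter, start_http_server

# Domain-level metrics: business events instead of technical resource usage
ORDER_REVENUE = Counter("shop_order_revenue_euro_total", "Accumulated order revenue in EUR")
USER_REGISTRATIONS = Counter("shop_user_registrations_total", "Completed user registrations")

def on_order_placed(order_total_euro: float) -> None:
    # Called from the order domain logic whenever an order is accepted
    ORDER_REVENUE.inc(order_total_euro)

def on_user_registered() -> None:
    USER_REGISTRATIONS.inc()

if __name__ == "__main__":
    start_http_server(9102)  # exposes /metrics for scraping
```

Graphed as a rate per time unit (e.g. rate(shop_order_revenue_euro_total[5m]) in PromQL) and combined with an alert on a sudden drop, such a counter surfaces exactly the kind of urgent, business-visible incident described above.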

Maturity

Proposed, to be evaluated.

Sources of Evidence

L5:

  • Application performance monitoring
    • infrastructure-centric characteristic
    • measure individual microservices' performance => assess health and existing SLAs for the system

L14:

  • Monitoring important topic to detect failures quickly and automatically restore services
  • Technical metrics
    • how many requests per second is the db getting
  • Business relevant metrics
    • how many orders per minute are received

L16:

  • Each microservice to provide interface to hand over monitoring information
    • health status: ok, broken
      • prevent other microservices from calling broken services
      • escalate: if no healthy comm partner => self broken

L23:

  • variations in performance metrics across different microservices and datacenter resources
    • e.g. SDN resources: throughput and latency; CPU: utilization and throughput; database: query response time
    • => how to define performance metrics coherently across microservices?
  • Datacenter resource level monitoring: CPU percentage, TCP/IP performance
  • Microservice-level monitoring: end-to-end request processing latency and communication overhead
  • Cluster-wide monitoring frameworks
    • HW metrics: cluster, CPU, memory utilization, ...
  • Monitoring framework used by AWS EC2 and K8s
    • CPU, memory, filesystem, network usage statistics
    • => can't monitor microservice-level performance metrics
  • => need for holistic techniques including datacenter resources and microservices
    • value to scheduler and administrators
    • track and understand impact of runtime uncertainties on performance

L25:

  • monitor "health status"

L31:

  • store hierarchical resource usage => alert on unusual resource usage burst

L34:

  • Monitoring spans fault tolerance, performance, or even security

L61:

  • Serverless provides monitoring capabilities at different levels
    • OS, containers, communication, ...

LM47:

  • Context: SLR with tactics to achieve quality attributes
  • horizontal duplication relies on monitorability: info for decision for scale operations
  • Profiling tactic to detect performance issues, prepare for optimization
    • CPU, memory, bandwidth => scheduling and auto-scaling
    • CPU and memory profiling
  • Fault monitor
    • health monitoring of microservices: detect presence of fault to take recovery actions
    • types:
      • centralized monitor: collect results of service invocation based on health checks by service discovery
        • minimize downtime of microservice
      • symmetric monitor: neighbors of service monitor (successor, predecessor)
      • arbitral monitor: decentralized and independent group of nodes
        • failure needs to be confirmed by majority
  • Monitorability
    • Places:
      • infrastructure info (VM or container)
      • application info (response time)
      • environment info (network)
    • generating monitoring data
      • at different levels: host, platform, service
      • by instrumentation, logging, distributed tracing
      • help to know runtime information from different aspects
        • available hosts, service response time, failure rate, resource consumption

Interview A:

  • Define metrics for each microservice
    • CPU load
    • memory usage
    • ...
  • Liveness probes, health checks
  • Use K8s for auto-scaling based on these metrics
  • Container monitoring on different levels
    • Node, OS, K8s, pod, service, namespace, ...
    • in order to localize the error
  • Biggest challenge: even recognize that something is going wrong in production

Interview B:

  • Team has to monitor itself now instead of operations
  • Metrics
    • turnover per time unit as a business metric (otto.de does this)
      • introduces sense of responsibility