Use Dashboards and Visualizations

Context

Microservices are being adopted. Multiple microservice instances are deployed, and monitoring information is aggregated.

Problem

  • The collected metrics are not used consistently
  • It is hard to keep an overview of the collected monitoring data

Solution

Use dashboard visualization tools to display the most relevant monitoring information for each team.

Each team should have dedicated dashboards showing the most relevant monitoring information. To make sure the dashboards are checked regularly, you can, for example, display them on additional monitors in the team's rooms. If metrics are collected at a suitable level, dashboards and visualizations can also be offered to other stakeholders, such as project managers.

Visualizations can help put abnormal behavior into context and further simplify debugging the system.
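As an illustration of the solution, a per-team dashboard can be as simple as a view that summarizes each service's key metrics and highlights the ones that need attention. The following sketch is purely illustrative: the metric names, thresholds, and text rendering are assumptions, not part of the pattern.

```python
# Minimal sketch of a per-team "dashboard" view. All names and
# thresholds are illustrative assumptions, not from the pattern.

def render_dashboard(team, metrics, error_rate_limit=0.05):
    """Render the most relevant metrics for one team's services.

    `metrics` maps service name -> dict with 'error_rate' (fraction)
    and 'p95_ms' (95th-percentile latency in milliseconds).
    """
    lines = [f"Dashboard: {team}"]
    for service, m in sorted(metrics.items()):
        # Flag services whose error rate exceeds the (assumed) limit.
        status = "ALERT" if m["error_rate"] > error_rate_limit else "ok"
        lines.append(
            f"  {service:<12} err={m['error_rate']:.1%} "
            f"p95={m['p95_ms']:>4}ms [{status}]"
        )
    return "\n".join(lines)

board = render_dashboard("checkout-team", {
    "payments": {"error_rate": 0.12, "p95_ms": 480},
    "cart": {"error_rate": 0.01, "p95_ms": 95},
})
print(board)
```

A real setup would render this with a dashboard tool (e.g., Kibana or Grafana) on a team monitor, but the selection logic, showing only the metrics relevant to that team, is the essence of the pattern.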

Maturity

Proposed, to be evaluated.

Sources of Evidence

L3:

  • case study
  • Kibana dashboard for visualization
  • Elasticsearch for consolidating monitoring metrics

L14:

  • at otto.de
  • Each feature team has two large screens beside its team space
    • one to monitor deployment pipelines and other build-related information
    • the other shows graphs and metrics for all of the team's microservices
      • challenging if there are many microservices
      • => need for automatic anomaly detection
      • the dashboard then gives an overview of the graphs that are currently of interest
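The "graphs currently of interest" idea above could be sketched with a simple anomaly score: show only the series whose latest value deviates strongly from its recent history. The z-score rule, threshold, and example data here are assumptions for illustration; real deployments use proper anomaly detectors.

```python
import statistics

# Hedged sketch: pick which metric graphs are "currently of interest"
# by flagging series whose latest value deviates strongly from recent
# history (simple z-score; thresholds and data are illustrative).

def graphs_of_interest(series_by_name, z_limit=3.0):
    interesting = []
    for name, values in series_by_name.items():
        history, latest = values[:-1], values[-1]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history) or 1e-9  # guard flat series
        if abs(latest - mean) / stdev > z_limit:
            interesting.append(name)
    return interesting

series = {
    "orders.latency_ms": [100, 104, 98, 101, 400],   # spike -> show
    "cart.requests":     [50, 52, 49, 51, 50],       # steady -> hide
}
print(graphs_of_interest(series))  # → ['orders.latency_ms']
```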

L19:

  • Visualize health status of every service
    • quickly locate and respond to problems
  • With many fast-moving parts, the design of the visualization has to fit the purpose of providing helpful information as input for analysis

L31:

  • visualization tool to get an overall view of the system status
    • displays time-series data
    • helps devs refactor the architecture to remove performance bottlenecks
    • helps to detect anomalies

L43:

  • Build dashboard for staging (in picture)

L55:

  • collected monitoring information is displayed in graphical form to the MSI administrator
  • query language to interactively analyze the information coming from the platform
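The source does not specify that platform's query language, but the idea of interactively filtering collected monitoring data can be sketched with a tiny predicate-based query helper. Everything here, the `query` function and the record fields, is a hypothetical stand-in.

```python
# Illustrative stand-in for interactively querying collected
# monitoring data: filter records by field predicates. The actual
# platform's query language is not specified in the source.

def query(records, **predicates):
    """Return records matching all field=value predicates;
    callable predicates are applied as tests on the field value."""
    def matches(rec):
        for field, want in predicates.items():
            value = rec.get(field)
            ok = want(value) if callable(want) else value == want
            if not ok:
                return False
        return True
    return [r for r in records if matches(r)]

records = [
    {"service": "auth", "level": "ERROR", "latency_ms": 900},
    {"service": "auth", "level": "INFO", "latency_ms": 40},
    {"service": "cart", "level": "ERROR", "latency_ms": 35},
]
# "Find slow errors" as an interactive drill-down step:
slow_errors = query(records, level="ERROR", latency_ms=lambda v: v > 100)
print(slow_errors)  # only the slow auth error remains
```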

LN21:

  • effective way for debugging is tracing and visualizing system executions
  • microservices: much more complex and dynamic than traditional distributed systems
    • lack of natural correspondence between microservices
    • microservices can be dynamically created and destroyed
    • => unclear whether visualization tools for distributed systems can be used
  • 3 maturity levels of logging
    • basic log analysis
      • other code
    • visual log analysis
      • logs are visualized for fault localization
      • use conditions, regular expressions, and sorting
      • selected logs can be aggregated and visualized by different kinds of statistical charts
      • log retrieval and visualization combined to drill up and down through logs
      • example: first use histogram to learn distribution range of results and choose abnormal result
      • need to use centralized logging system to visualize logs
      • include info about microservice and instances
      • success depends on tools for log collection, retrieval, and visualization
      • example technology: ELK stack - Logstash for log collection, ElasticSearch for log indexing, Kibana for visualization
    • visual trace analysis
      • trace results from execution scenario and is composed of user-request trace segments
        • share the same user request ID (created for each user request)
        • passed along with the request to each directly or indirectly invoked microservice
        • => present in logs
      • use visualization tools to analyze invocation chain extracted from the traces
        • identify suspicious ranges of microservice invocations and executions
      • microservice can invoke multiple microservices in parallel => tree structure
        • e.g., vertically show nested invocations, horizontally show the duration of invocations with colored bars
      • highly depends on used tool
      • technology: Dynatrace, Zipkin
      • most companies implement their own tracing and visualization tools => specific to the implementation techniques of their MSA
  • visual analysis has advantage when dealing with interaction faults
    • evaluation: trace analysis 20h, visual log analysis 35h, basic log analysis 45h
    • since initial understanding, fault scoping, and fault localization are more time-consuming than the other steps => requires in-depth understanding of the logs
      • initial understanding: trace analysis 3h, visual log analysis 7h, basic log analysis 12h
    • participant feedback: very useful; how much depends on fault type and developer experience/skills/preferences
  • empirical study => see that other services were called as well; could analyze them => another potential fault location
    • => localization is more precise
    • most fault cases except caused by environmental settings can benefit from trace visualization, esp. those for microservice interaction
    • use of visualization tools for debugging is possible: use a microservice / its state as a node
      • difficulty: definition of microservice state => relies on dev experience
    • challenge: huge number of nodes and events => makes visualization analysis infeasible
      • zoom in/out and node/event clustering allow to focus on suspicious scopes
      • combine with fault localization techniques => suggest suspicious scopes in traces, results of code-level fault localization within specific microservices
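The trace-visualization idea above, segments correlated by a shared request ID, nested invocations shown vertically, durations shown horizontally as bars, can be sketched as follows. The segment format, service names, and ASCII rendering are all illustrative assumptions; real tools (Zipkin, Dynatrace) render richer views.

```python
# Hedged sketch of visual trace analysis: trace segments share a
# request ID, form an invocation tree, and are shown nested
# vertically with duration bars horizontally. Data is illustrative.

def render_trace(segments, parent=None, depth=0, ms_per_char=20):
    """Render segments (dicts with id, parent, service, duration_ms)
    as an indented tree with '#' bars scaled to duration."""
    lines = []
    for seg in segments:
        if seg["parent"] == parent:
            bar = "#" * max(1, seg["duration_ms"] // ms_per_char)
            lines.append(f"{'  ' * depth}{seg['service']:<10} "
                         f"{bar} {seg['duration_ms']}ms")
            # Recurse into microservices invoked by this segment.
            lines.extend(render_trace(segments, seg["id"], depth + 1,
                                      ms_per_char))
    return lines

trace = [  # one user request, correlated by a shared request ID
    {"id": 1, "parent": None, "service": "gateway", "duration_ms": 120},
    {"id": 2, "parent": 1, "service": "orders", "duration_ms": 80},
    {"id": 3, "parent": 2, "service": "payments", "duration_ms": 60},
    {"id": 4, "parent": 1, "service": "catalog", "duration_ms": 20},
]
print("\n".join(render_trace(trace)))
```

Reading such a view, a disproportionately long bar deep in the tree points at a suspicious invocation, which is the fault-scoping step the study describes.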

LM47:

  • Context: SLR with tactics to achieve quality attributes
  • Presenting monitored data
    • Motivation:
      • necessity to visualize monitored data for different stakeholders
      • stakeholders may have different needs => views
    • Description: present key metrics in a uniform way
      • multiple views => address different concerns, one for each stakeholder
        1. service-specific metrics for analyzing response time, failure rate, throughput, CPU and memory consumption
        2. IT landscape analysis for long-term reports: available hosts and data centers, host utilization, service allocation on hosts
        3. infrastructure metrics to analyze CPU and memory load for a specific host
        4. map of running services to provide an overview of all running services and their interactions
          • also long-term reports for these data
        5. all downstream calls of a service: supports root cause analysis
    • technology: Kibana dashboards; with ElasticSearch, Logstash

Interview A:

  • Standardize logging => using Kibana as UI part of it

Interview B:

  • Team has to monitor their service and look at their metrics
  • In the best case: see if something is escalating (before it is too late)
  • Dashboard per service-team showing their metrics
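The "see it before it is too late" point from this interview can be sketched as a trivial escalation check: flag a metric whose recent samples keep rising. The window size, the strictly-increasing rule, and the example data are assumptions chosen for illustration.

```python
# Sketch of the interviewee's "catch it before it is too late" idea:
# flag a metric as escalating when it keeps rising across a recent
# window. Window size and rule are illustrative assumptions.

def is_escalating(values, window=5):
    """True if the last `window` samples are strictly increasing."""
    recent = values[-window:]
    if len(recent) < window:
        return False  # not enough data to judge a trend
    return all(a < b for a, b in zip(recent, recent[1:]))

queue_depth = [3, 2, 4, 5, 9, 14, 22]   # growing backlog -> escalating
cpu_load =    [40, 42, 41, 43, 42, 40]  # noisy but stable
print(is_escalating(queue_depth), is_escalating(cpu_load))  # → True False
```

On a team dashboard, such a flag would move the affected service's graph into view before the metric crosses a hard alert threshold.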