Use a Tracing Mechanism
Context
Microservices are being adopted. Logging and monitoring information is aggregated in a central place. There might be dashboards and visualizations available.
Problem
- Even though failures are discovered, it is hard to locate the underlying defect in the system, which leads to long fixing times
- The full path that an interaction takes through the system (including the further interactions it evokes) is hard to follow, which makes it hard to trace a failure down to the defect
Solution
Introduce a tracing mechanism that makes the full path that interactions take through the system comprehensible. Since this is a system-wide component, we advise standardizing the trace format.
When failures are discovered, most often only symptoms of the defect are discovered and not the defect itself. Thus, we need to understand the path of the interaction across all involved microservice instances to narrow down the location of the defect. Usually, tracing uses a correlation ID carried in the traffic to follow related requests through the system.
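The correlation ID idea can be sketched in a few lines. This is a minimal illustration and not the API of any particular tracing tool: the header name `X-Correlation-ID` and the helper names are assumptions chosen for the example. A service reuses the ID it received (or creates one at the system edge), forwards it on every downstream call, and writes it into every log line.

```python
import uuid

# Hypothetical header name; the pattern only speaks of "a correlation ID in the traffic".
CORRELATION_HEADER = "X-Correlation-ID"


def ensure_correlation_id(incoming_headers):
    """Return the caller's correlation ID, or create a new one at the system edge."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive, so compare in lower case.
        if name.lower() == CORRELATION_HEADER.lower() and value:
            return value
    return uuid.uuid4().hex


def outgoing_headers(correlation_id, extra=None):
    """Build headers for a downstream call so the ID travels with the request."""
    headers = dict(extra or {})
    headers[CORRELATION_HEADER] = correlation_id
    return headers


def log_line(service, correlation_id, message):
    """Format a log line that can later be grouped by correlation ID."""
    return f"service={service} correlation_id={correlation_id} msg={message}"
```

With such helpers, all log lines belonging to one interaction can be collected from the central log store by filtering on a single ID.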
Together with additional metrics, tracing tools make it possible to
- trace a request with its headers and bodies through the system and see where errors occur,
- monitor the response time of each part of a request chain to discover bottlenecks (an introspective view of the system),
- build a dependency graph from empirical data to cross-check architecture diagrams,
- potentially gain further information depending on the chosen tool.
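As an illustration of the response-time point in the list above, the following sketch computes how much time each service spent on its own work within one request chain. The span records and their field names are hypothetical, and the calculation assumes that the child calls of a span do not overlap in time; real tracing tools handle parallel calls and clock skew more carefully.

```python
from collections import defaultdict

# Hypothetical span records for a single request; field names are assumptions.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "gateway", "start_ms": 0, "end_ms": 120},
    {"span_id": "b", "parent_id": "a", "service": "orders", "start_ms": 10, "end_ms": 110},
    {"span_id": "c", "parent_id": "b", "service": "inventory", "start_ms": 20, "end_ms": 90},
    {"span_id": "d", "parent_id": "b", "service": "pricing", "start_ms": 90, "end_ms": 100},
]


def self_times(spans):
    """Per service: span duration minus the duration of its direct child spans.

    Assumes child spans of one parent do not overlap in time.
    """
    child_total = defaultdict(int)
    for span in spans:
        if span["parent_id"] is not None:
            child_total[span["parent_id"]] += span["end_ms"] - span["start_ms"]

    result = defaultdict(int)
    for span in spans:
        duration = span["end_ms"] - span["start_ms"]
        result[span["service"]] += duration - child_total[span["span_id"]]
    return dict(result)


def slowest_service(spans):
    """Return the service with the largest self time, a bottleneck candidate."""
    times = self_times(spans)
    return max(times, key=times.get)
```

In the sample data, `slowest_service(SPANS)` points at `inventory`, even though `gateway` has the longest total duration, because most of the gateway's time is spent waiting for downstream calls.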
There is a plethora of tracing tools that can trace requests between microservice instances and correlate them to track their path through the system. Service meshes might offer this kind of information as well.
Maturity
Proposed, to be evaluated.
Sources of Evidence
L5:
- Distributed tracing = ability to track chains of service calls
- Tracing one of the most common problems in microservices
- tracing through all hops demands attention from academic community
- only few prominent solutions currently available in industry
L49:
- Need to track all messages transmitted between microservices => build the messaging graph
- Required for validation techniques
- Their solution: make this information part of standard contract
LN21:
- effective way for debugging is tracing and visualizing system executions
- microservices: much more complex and dynamic than traditional distributed systems
- lack of natural correspondence between microservices
- microservices can be dynamically created and destroyed
- => unclear whether visualization tools for distributed systems can be used
- 3 maturity levels of logging
- basic log analysis
- other code
- visual log analysis
- other code
- visual trace analysis
- trace results from execution scenario and is composed of user-request trace segments
- share the same user request ID (created for each user request)
- passed along with the request to each directly or indirectly invoked microservice
- => present in logs
- use visualization tools to analyze invocation chain extracted from the traces
- identify suspicious ranges of microservice invocations and executions
- microservice can invoke multiple microservices in parallel => tree structure
- e.g., vertically show nested invocations, horizontally show invocation durations with colored bars
- highly depends on used tool
- technology: Dynatrace, Zipkin
- most companies implement their own tracing and visualization tools => specific to implementation techniques of MSA
- visual analysis has advantage when dealing with interaction faults
- evaluation: trace analysis 20h, visual log analysis 35h, basic log analysis 45h
- since initial understanding, fault scoping, fault localization are more time consuming than other steps => requires in-depth understanding of the logs
- initial understanding: trace analysis 3h, visual log analysis 7h, basic log analysis 12h
- participant feedback: very useful; how much depends on fault type and dev experience/skills/preferences
- tracing supports understanding of microservice executions in the context of user requests and invocation chains
- empirical study => see that other services were called as well; could analyze them => another potential fault location
- => localization is more precise
- most fault cases, except those caused by environmental settings, can benefit from trace visualization, esp. those involving microservice interaction
- use of visualization tools for debugging is possible: use microservice / its state as node
- difficulty: definition of microservice state => relies on dev experience
- challenge: huge number of nodes and events => makes visualization analysis infeasible
- zoom in/out and node/event clustering allow to focus on suspicious scopes
- combine with fault localization techniques => suggest suspicious scopes in traces, results of code-level fault localization within specific microservices
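The visual trace analysis described in the LN21 notes (trace segments sharing a user request ID, nested invocations shown vertically, durations shown horizontally) can be sketched as follows. The log record layout, the field names, and the text rendering are assumptions made for illustration; they do not reproduce the tools used in the cited study.

```python
from collections import defaultdict

# Hypothetical log records; each carries the user request ID it belongs to.
LOGS = [
    {"request_id": "r1", "call_id": 1, "parent_call": None, "service": "ui", "start": 0, "end": 10},
    {"request_id": "r1", "call_id": 2, "parent_call": 1, "service": "order", "start": 1, "end": 9},
    {"request_id": "r2", "call_id": 1, "parent_call": None, "service": "ui", "start": 0, "end": 3},
    {"request_id": "r1", "call_id": 3, "parent_call": 2, "service": "stock", "start": 2, "end": 5},
    {"request_id": "r1", "call_id": 4, "parent_call": 2, "service": "payment", "start": 2, "end": 8},
]


def render_trace(logs, request_id):
    """Return text lines for one request: indentation = nesting, bar = duration."""
    # Keep only the trace segments that share the given user request ID.
    calls = [record for record in logs if record["request_id"] == request_id]

    # Group calls by their parent so the invocation tree can be walked.
    children = defaultdict(list)
    for call in calls:
        children[call["parent_call"]].append(call)

    lines = []

    def walk(parent, depth):
        for call in sorted(children[parent], key=lambda c: c["start"]):
            label = "  " * depth + call["service"]
            bar = " " * call["start"] + "#" * (call["end"] - call["start"])
            lines.append(f"{label:<14} |{bar}")
            walk(call["call_id"], depth + 1)

    walk(None, 0)
    return lines
```

Calling `"\n".join(render_trace(LOGS, "r1"))` yields one line per invocation; the two parallel calls made by `order` appear as sibling lines with overlapping bars, which is the tree structure mentioned in the notes.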
LN42:
- Table 1: industry grade microservice technologies
- Tracing: Azure Service Fabric - Event Tracing Windows; Lagom - Basic; MicroProfile - OpenTracing; Spring Suite (Boot) - Spring Cloud Sleuth
- identify dependency by using tools
- Retrace, Dynatrace, SchemaCrawler
LM47:
- Context: SLR with tactics to achieve quality attributes
- Distributed tracing allows to determine sequence of microservice calls
- means for root cause analysis
- log which service called which other service
- number of incoming and outgoing calls per service stored
- info combined with service API => call frequency on level of service instance / service type if aggregated
- analysis of communication frequency between services
- tools: New Relic, Dynatrace APM, Zipkin, OpenTracing, Apache HTrace
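The call-frequency analysis summarized in the LM47 notes can be sketched as a simple count over caller/callee pairs. The call records and the instance naming scheme `<type>-<n>` are assumptions made for this example; a real setup would take the pairs from the tracing backend.

```python
from collections import Counter

# Hypothetical call records extracted from traces: (caller instance, callee instance).
CALLS = [
    ("gateway-1", "orders-1"),
    ("gateway-1", "orders-2"),
    ("orders-1", "inventory-1"),
    ("orders-2", "inventory-1"),
    ("gateway-2", "orders-1"),
]


def service_type(instance):
    """Strip the instance suffix, assuming names of the form '<type>-<n>'."""
    return instance.rsplit("-", 1)[0]


def call_frequencies(calls, aggregate=False):
    """Count calls per (caller, callee) edge, plus outgoing and incoming totals.

    With aggregate=True, instances are merged into their service type.
    """
    if aggregate:
        calls = [(service_type(a), service_type(b)) for a, b in calls]
    edges = Counter(calls)
    outgoing = Counter(caller for caller, _ in calls)
    incoming = Counter(callee for _, callee in calls)
    return edges, outgoing, incoming
```

The `edges` counter doubles as a weighted dependency graph: every key is an observed dependency and its value is the communication frequency, at instance level by default or at service-type level when aggregated.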
LM48:
- Context: microservice migration; describes an example project (FX Core) and compares it back to the monolith
- foundation services for supportive functions, not business-related
- for centralized logging and monitoring (among others)
- TracingService (among others)
Interview A:
- Objective: find the error
- Use tracing tools
- trace request from beginning to end (e.g. DB) and back
- where does it get stuck, who is logging what
- => configure log level
- e.g. Zipkin
- first failure message usually just a symptom, not the component where the error is originally located
- error is not necessarily a bug, but also performance issue
- Service mesh usually offers tracing
Interview B:
- Tracing infrastructure and tracing format need to be standardized.
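To illustrate what a standardized trace format can look like, the sketch below parses and propagates a header in the style of the W3C Trace Context `traceparent` field (version, 32-hex-digit trace ID, 16-hex-digit parent ID, 2-hex-digit flags). Neither the pattern nor the interviews prescribe this format; it is used here only as one publicly specified example, the function names are hypothetical, and the parsing is simplified to the four-field layout.

```python
import re
import secrets

# Four dash-separated lowercase hex fields: version, trace-id, parent-id, flags.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(value):
    """Parse a traceparent-style header value; return None if it is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are defined as invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(parent):
    """Build the header for a downstream call: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)  # 16 hex digits
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{span_id}-{flags}"
```

Because every service reads and writes the same header layout, traces can be stitched together across teams and technologies without per-service adapters, which is the benefit a standardized format is meant to deliver.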