Aggregate Logging and Monitoring Information in a Central Place
Context
Microservices are being adopted. Multiple microservices are being deployed. Logging and monitoring are standardized across microservices. Logging and monitoring information is available locally in each microservice instance.
Problem
- Debugging the system is cumbersome and effortful
- The manual collection of logging and monitoring information from the microservice instances is too laborious
Solution
Introduce an aggregation mechanism for logging and monitoring data.
There are many tools to collect logging and monitoring data into a central place. Example technologies:
- Logstash in the ELK stack
- Fluentd in the EFK stack
- Pre-configured mechanisms if you use service meshes or deploy to a (serverless) cloud
All technical solutions have in common that the information must be made available by your application. Additionally, you have to standardize the information and formats used for logging and monitoring.
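As a minimal sketch of what "making the information available" can look like, the following Python snippet writes structured JSON log lines to stdout so that a collector such as Logstash or Fluentd can ship them to a central store. The field names and the service name are illustrative assumptions, not prescribed by any of the tools above.

```python
# Minimal sketch: emit structured JSON log lines to stdout so a collector
# (e.g., Logstash or Fluentd) can ship them to a central store.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(record.created)),
            "level": record.levelname,
            "service": "order-service",   # assumed service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)   # log to stdout, not to a local file
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")   # emits one JSON line that a collector can pick up
```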
Maturity
Proposed, to be evaluated.
Sources of Evidence
L3:
- monitoring solution comprises server and client containers
- Server: Kibana for visualization, Elasticsearch for consolidating metrics
- Scale and cluster Elasticsearch
- Clients: containers with Logstash to forward data to server (Elasticsearch cluster)
L5:
- Special attention to carefully design central logging and aggregation system
- to continue debugging system appropriately
- Logging and tracing vital for developers to understand system behavior as a whole
L8:
- sidecars become natural location for monitoring
- idea behind service mesh technologies like Linkerd: extend notion of self-contained sidecars
- Operators can dynamically monitor and manage behavior of multiple distributed sidecars by means of centralized control plane
L9:
- Configuration State Manager manages configurations of all services => is a kind of service discovery
- Each container has its "sidekick process"
- Monitors status of service and container periodically
- Registers or de-registers the service in CSM (rough sketch below)
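A rough sketch of such a sidekick loop, assuming the service offers an HTTP health endpoint; the endpoint URL, check interval, and CSM calls are placeholders, not the actual CSM API:

```python
# Sketch of a "sidekick" process that periodically checks its service and
# registers/de-registers it in the configuration state manager (CSM).
import time
import urllib.error
import urllib.request

SERVICE_HEALTH_URL = "http://localhost:8080/health"   # assumed health endpoint
CHECK_INTERVAL_S = 10

def service_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(SERVICE_HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def register_in_csm() -> None:
    print("registering service in CSM")       # placeholder for the real CSM call

def deregister_from_csm() -> None:
    print("de-registering service from CSM")  # placeholder for the real CSM call

if __name__ == "__main__":
    while True:
        if service_is_healthy():
            register_in_csm()
        else:
            deregister_from_csm()
        time.sleep(CHECK_INTERVAL_S)
```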
L18:
- Log aggregation once trivial with monolithic and three-tier applications
- became incredibly complex in this new paradigm
L31:
- Aggregate information in monitoring server
- Parse
- Aggregate
- Store somewhere to be queried (e.g. time series DB or indexing server); see the sketch below
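A rough sketch of what these parse/aggregate/store steps could look like on a monitoring server, assuming one JSON log record per line; the errors-per-minute aggregation and the storage target are illustrative placeholders:

```python
# Rough sketch of parse -> aggregate -> store on a monitoring server.
import json
import sys
from collections import Counter

errors_per_minute = Counter()

for line in sys.stdin:                          # parse: one JSON log record per line
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue                                # skip malformed lines
    if record.get("level") == "ERROR":          # aggregate: count errors per minute
        minute = record.get("timestamp", "")[:16]   # e.g. "2024-05-01T12:34"
        errors_per_minute[minute] += 1

# store: here we just print; a real setup would write to a time series DB or an
# indexing server (e.g., Elasticsearch) so the data can be queried later
for minute, count in sorted(errors_per_minute.items()):
    print(f"{minute} errors={count}")
```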
L34:
- Tracing / logging / monitoring information
- Solution from central to distributed logging
L41:
- System has centralized logging with LoggingService, ElasticSearch, and Kibana
- allows log aggregation from all services
- Same for monitoring: MonitoringService, Icinga, cAdvisor to aggregate monitored metrics
- Centralization
- gives team system status overview
- act proactively on suspicious and faulty behavior
L43:
- Microservice reference architecture
- Log aggregation as infrastructural service
L46:
- Runtime service / API monitoring and management: BAM, service monitoring (not explained further)
- Enterprise monitoring and tracking manager
- microservices and other applications publish standardized events (operational and log events, metrics)
- collect all events centrally and offer analytics operations
L55:
- macro-component implements ELK stack
- log gathering, analysis, monitoring services
- logs from every microservice collected, stored, analyzed => graphical presentation
- Need for query language to interactively analyze the information
L61:
- Serverless functions: operation taken care of by the platform
- operational aspects like monitoring and logging at different levels taken care of
- Class of infrastructure services
- Need for monitoring (logging, profiling) but also on system level: health management
- attracting strongest attention of researchers
L63:
- Context: Micado platform
- Microservice coordination logic layer
- uses Prometheus to collect information about various services (a minimal sketch of exposing such metrics follows below)
- to understand how execution environment performs
- used by Alert manager
- if bottleneck: instructs Occops to launch / shut down cloud instances
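To illustrate what "collecting information about various services" requires from the service side, here is a minimal sketch of a service exposing metrics for a Prometheus server to scrape, assuming the prometheus_client Python library; the metric names and port are illustrative, and the Alert manager/Occops parts of the platform are out of scope.

```python
# Sketch: a microservice exposing metrics so a Prometheus server can scrape them.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("app_requests_total", "Total handled requests")
QUEUE_DEPTH = Gauge("app_queue_depth", "Current work queue depth")

if __name__ == "__main__":
    start_http_server(8000)          # serves /metrics for Prometheus to scrape
    while True:
        REQUESTS.inc()               # pretend we handled a request
        QUEUE_DEPTH.set(random.randint(0, 10))
        time.sleep(1)
```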
LN21:
- fault isolation in concurrent programs and distributed systems => start with logging thread- and node-level execution info and then locate faults in them
- applying this to microservices is highly non-trivial since container instances in microservices are constantly changing, causing difficulty in log checking and overly fragmented logs
- survey: all participants depend on log analysis for fault analysis and debugging
- debugging triggered by failure reports describing symptoms and optionally reproduction steps
- debugging ended when fault fixed
- Table 4: maturity levels of debugging
- basic log analysis
- time, executed methods, values of parameters and variables, intermediate results, extra context info such as execution threads
- follows same procedure as monolithic systems => common logging tools to capture and collect execution logs
- to locate fault: manually examine large number of logs
- success depends heavily on the developer's experience with the system and the technology stack used
- visual log analysis
- logs are visualized for fault localization
- use conditions, regular expressions, and sorting
- selected logs can be aggregated and visualized by different kinds of statistical charts
- log retrieval and visualization combined to drill up and down through logs
- example: first use histogram to learn distribution range of results and choose abnormal result
- need to use centralized logging system to visualize logs
- include info about microservice and instances
- success depends on tools for log collection, retrieval, and visualization
- example technology: ELK stack - Logstash for log collection, ElasticSearch for log indexing, Kibana for visualization
- visual trace analysis
- technology: Zipkin
- see other code
- similar to basic log analysis, but with tracing info: each microservice collects tracing logs => collected by central logging system (see the sketch below)
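A much simplified sketch of the idea behind trace analysis: a trace/correlation ID is propagated between services and included in every log record, so the central logging system can stitch together the logs of one request. A real setup would use a tracing library such as a Zipkin client; the header and field names here are assumptions.

```python
# Simplified hand-rolled tracing: propagate a trace ID and log it with each event.
import json
import sys
import uuid

def handle_request(headers: dict) -> None:
    # Reuse the incoming trace ID, or start a new trace at the edge of the system.
    trace_id = headers.get("X-Trace-Id", uuid.uuid4().hex)
    log(trace_id, "request received")
    call_downstream(trace_id)
    log(trace_id, "request finished")

def call_downstream(trace_id: str) -> None:
    # Propagate the trace ID to the next microservice, e.g. as an HTTP header:
    # requests.get("http://inventory-service/items", headers={"X-Trace-Id": trace_id})
    log(trace_id, "calling inventory-service")

def log(trace_id: str, message: str) -> None:
    # One JSON line per event; the log collector ships these to the central store.
    print(json.dumps({"trace_id": trace_id, "message": message}), file=sys.stdout)

if __name__ == "__main__":
    handle_request({})   # no incoming trace ID -> a new one is generated
```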
LN43:
- In case of failure => need for good logging in microservices
- should be easily searchable, and all services should aggregate their logs in one place to make finding problems easier
LM43:
- Context: SLR findings about microservices in DevOps
- Monitoring perceived as problem by 7 studies
- Amazon CloudWatch can address logging, post-deployment monitoring, and adaptive actions monitoring by collecting data
- from logs, metrics, events
- to discover system-wide performance issues and take automated actions to keep the system running smoothly
- interpretation: "collecting" data means collecting them in a central place/tool
LM47:
- Context: SLR with tactics to achieve quality attributes
- Monitorability: becomes an essential and complex part of MSA due to its dynamic structure and behavior
- Different solutions:
- Instrumentation: host, platform, or service instrumentation
- agent on target that collects monitoring data
- depending on where agent is deployed, can collect different information
- Logging: all outgoing and incoming requests are written to the local log file of the instance
- timestamp, response time, response code, unique ID of source microservice instance, URL of target microservice instance, requested method
- log files periodically fetched and aggregated
- e.g. CloudWatch and Logstash (a rough sketch of such a log record follows after this list)
- Storing monitored data: where to keep the info collected from all places
- centralized storage: analysis of service interactions in one place
- less administrative overhead
- might become bottleneck / point of failure
- decentralized storage: at each host, platform, or service
- higher scalability
- less resilient: loss of local storage => loss of monitoring data
- more operational overhead
- Processing monitoring data
- Motivation: root cause localization, scaling up or down
- aggregation processing: store logs in aggregated form
- reduces storage amount, enables long-term analysis, e.g. long-term service runtime metrics
- some analyses not possible on aggregated data, e.g. root cause analysis
- non-aggregating processing: store in native form for detailed analysis
- huge amount of monitoring data and storage
- not possible to store over long period of time
- => for short time failure analysis
- both are not mutually exclusive; rather, use a combination
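To make the logging tactic above concrete, here is a sketch of what one request log record could look like when appended to the instance-local file that is later fetched and aggregated; the field names mirror the list above, while the file path and helper function are assumptions for illustration.

```python
# Sketch of the request-logging tactic: each instance appends one JSON record per
# request to a local log file, which a collector (e.g., Logstash or a CloudWatch
# agent) later fetches and aggregates centrally.
import json
import time

LOG_FILE = "/var/log/myservice/requests.log"   # assumed local log location

def log_request(source_instance_id: str, target_url: str, method: str,
                response_code: int, response_time_ms: float) -> None:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "response_time_ms": response_time_ms,
        "response_code": response_code,
        "source_instance_id": source_instance_id,
        "target_url": target_url,
        "method": method,
    }
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example call after an outgoing request has completed:
# log_request("orders-7f9c", "http://inventory-service/items/42", "GET", 200, 12.5)
```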
LM48:
- Context: microservice migration; describes an example project (FX Core) and compares back to the monolith
- foundation services for supportive functions, not business-related
- among others: LoggingService, MonitoringService
- active/active failover (multiple active replicas in parallel)
- infrastructure services, includes monitoring and logging
- challenge: missing system status overview
- centralized logging with LoggingService, ElasticSearch and Kibana for aggregating logs from all services
- same for monitoring with MonitoringService, Icinga and cAdvisor for aggregating monitoring metrics
- centralization: gives team complete system status overview
- proactively act on suspicious and faulty behavior
Interview A:
- Log to std-out in containers instead of local log files
- Monolith: logs in /var/lib etc.
- Microservices: too laborious with 30-40 containers, especially since containers can crash => logs are gone
- no sense in simply logging to a file
- Need to wrap your head around monitoring tools: logging, log analysis, debugging sessions,... => learn ELK stack
- also experience to operate it
- Today EFK (?) stack in production logging GBs of data per hour
- Logical need for those tools when doing microservices
- Standards
- Log to stdout
- Logging format: JSON => collect it, parse it, show it
- No need to have root privileges on a server to do log analysis
- here, the Kibana URL
- here, your filter
Interview C:
- not enough to just know how to set up a global ELK stack for logs with many distributed instances
- simple reading => cannot see the forest for the trees