Aggregate Logging and Monitoring Information in a Central Place
Context
Microservices are being adopted. Multiple microservices are being deployed. Logging and monitoring are standardized across microservices. Logging and monitoring information is available locally in each microservice instance.
Problem
- Debugging the system is cumbersome and effortful
- The manual collection of logging and monitoring information from the microservice instances is too laborious
Solution
Introduce an aggregation mechanism for logging and monitoring data.
There are many tools to collect logging and monitoring data into a central place. Example technologies:
- Logstash in the ELK stack
- Fluentd in the EFK stack
- Pre-configured mechanisms if you use service meshes or deploy to a (serverless) cloud
All technical solutions have in common that the information must be made available by your application. Additionally, you have to standardize the information and formats used for logging and monitoring.
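As a minimal sketch of what "making the information available" can look like, the following Python snippet writes structured JSON log lines to stdout so that a collector such as Logstash or Fluentd can ship them to a central store. The field names and the service name are illustrative assumptions, not prescribed by any of the tools above.

```python
# Minimal sketch: emit structured JSON log lines to stdout so a collector
# (e.g., Logstash or Fluentd) can ship them to a central store.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(record.created)),
            "level": record.levelname,
            "service": "order-service",   # assumed service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)   # log to stdout, not to a local file
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")   # emits one JSON line that a collector can pick up
```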
Maturity
Proposed, to be evaluated.
Sources of Evidence
L3:
- monitoring solution comprises server and client containers
- Server: Kibana for visualization, Elasticsearch for consolidating metrics
- Scale and cluster Elasticsearch
- Clients: containers with Logstash to forward data to server (Elasticsearch cluster)
L5:
- Special attention to carefully design central logging and aggregation system
- to continue debugging system appropriately
- Logging and tracing vital for developers to understand system behavior as a whole
L8:
- sidecars become natural location for monitoring
- idea behind service mesh technologies like Linkerd: extend notion of self-contained sidecars
- Operators can dynamically monitor and manage behavior of multiple distributed sidecars by means of centralized control plane
L9:
- Configuration State Manager manages configurations of all services => is a kind of service discovery
- Each container has its "sidekick process"
- Monitors status of service and container periodically
- Registers or de-registers the service in CSM (rough sketch below)
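A rough sketch of such a sidekick loop, assuming the service offers an HTTP health endpoint; the endpoint URL, check interval, and CSM calls are placeholders, not the actual CSM API:

```python
# Sketch of a "sidekick" process that periodically checks its service and
# registers/de-registers it in the configuration state manager (CSM).
import time
import urllib.error
import urllib.request

SERVICE_HEALTH_URL = "http://localhost:8080/health"   # assumed health endpoint
CHECK_INTERVAL_S = 10

def service_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(SERVICE_HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def register_in_csm() -> None:
    print("registering service in CSM")       # placeholder for the real CSM call

def deregister_from_csm() -> None:
    print("de-registering service from CSM")  # placeholder for the real CSM call

if __name__ == "__main__":
    while True:
        if service_is_healthy():
            register_in_csm()
        else:
            deregister_from_csm()
        time.sleep(CHECK_INTERVAL_S)
```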
L18:
- Log aggregation once trivial with monolithic and three-tier applications
- became incredibly complex in this new paradigm
L31:
- Aggregate information in monitoring server
- Parse
- Aggregate
- Store somewhere to be queried (e.g. time series DB or indexing server); see the sketch below
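A rough sketch of what these parse/aggregate/store steps could look like on a monitoring server, assuming one JSON log record per line; the errors-per-minute aggregation and the storage target are illustrative placeholders:

```python
# Rough sketch of parse -> aggregate -> store on a monitoring server.
import json
import sys
from collections import Counter

errors_per_minute = Counter()

for line in sys.stdin:                          # parse: one JSON log record per line
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue                                # skip malformed lines
    if record.get("level") == "ERROR":          # aggregate: count errors per minute
        minute = record.get("timestamp", "")[:16]   # e.g. "2024-05-01T12:34"
        errors_per_minute[minute] += 1

# store: here we just print; a real setup would write to a time series DB or an
# indexing server (e.g., Elasticsearch) so the data can be queried later
for minute, count in sorted(errors_per_minute.items()):
    print(f"{minute} errors={count}")
```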
L34:
- Tracing / logging / monitoring information
- Solution from central to distributed logging
L41:
- System has centralized logging with LoggingService, ElasticSearch, and Kibana
- allows log aggregation from all services
- Same for monitoring: MonitoringService, Icinga, cAdvisor to aggregate monitored metrics
- Centralization
- gives team system status overview
- act proactively on suspicious and faulty behavior
L43:
- Microservice reference architecture
- Log aggregation as infrastructural service
L46:
- Runtime service / API monitoring and management: BAM, service monitoring (not explained further)
- Enterprise monitoring and tracking manager
- microservices and other applications publish standardized events (operational and log events, metrics)
- collect all events centrally and offer analytics operations
L55:
- macro-component implements ELK stack
- log gathering, analysis, monitoring services
- logs from every microservice collected, stored, analyzed => graphical presentation
- Need for query language to interactively analyze the information
L61:
- Serverless functions: operation taken care of by the platform
- operational aspects like monitoring and logging at different levels taken care of
- Class of infrastructure services
- Need for monitoring (logging, profiling) but also on system level: health management
- attracting strongest attention of researchers
L63:
- Context: Micado platform
- Microservice coordination logic layer
- uses Prometheus to collect information about various services (a minimal sketch of exposing such metrics follows below)
- to understand how execution environment performs
- used by Alert manager
- if bottleneck: instructs Occops to launch / shut down cloud instances
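To illustrate what "collecting information about various services" requires from the service side, here is a minimal sketch of a service exposing metrics for a Prometheus server to scrape, assuming the prometheus_client Python library; the metric names and port are illustrative, and the Alert manager/Occops parts of the platform are out of scope.

```python
# Sketch: a microservice exposing metrics so a Prometheus server can scrape them.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("app_requests_total", "Total handled requests")
QUEUE_DEPTH = Gauge("app_queue_depth", "Current work queue depth")

if __name__ == "__main__":
    start_http_server(8000)          # serves /metrics for Prometheus to scrape
    while True:
        REQUESTS.inc()               # pretend we handled a request
        QUEUE_DEPTH.set(random.randint(0, 10))
        time.sleep(1)
```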
LN21:
- fault isolation in concurrent programs and distributed systems => start with logging thread- and node-level execution info and then locate faults in them
- applying this to microservices is highly non-trivial since container instances in microservices are constantly changing, causing difficulty in log checking and overly fragmented logs
- survey: all participants depend on log analysis for fault analysis and debugging
- debugging triggered by failure reports describing symptoms and optionally reproduction steps
- debugging ended when fault fixed
- Table 4: maturity levels of debugging
- basic log analysis
- time, executed methods, values of parameters and variables, intermediate results, extra context info such as execution threads
- follows same procedure as monolithic systems => common logging tools to capture and collect execution logs
- to locate fault: manually examine large number of logs
- success depends heavily on the developer's experience with the system and the technology stack used
- visual log analysis
- logs are visualized for fault localization
- use conditions, regular expressions, and sorting
- selected logs can be aggregated and visualized by different kinds of statistical charts
- log retrieval and visualization combined to drill up and down through logs
- example: first use histogram to learn distribution range of results and choose abnormal result
- need to use centralized logging system to visualize logs
- include info about microservice and instances
- success depends on tools for log collection, retrieval, and visualization
- example technology: ELK stack - Logstash for log collection, ElasticSearch for log indexing, Kibana for visualization
- visual trace analysis
- technology: Zipkin
- see other code
- similar to basic log analysis, but with tracing info: each microservice collects tracing logs => collected by central logging system (see the sketch below)
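A much simplified sketch of the idea behind trace analysis: a trace/correlation ID is propagated between services and included in every log record, so the central logging system can stitch together the logs of one request. A real setup would use a tracing library such as a Zipkin client; the header and field names here are assumptions.

```python
# Simplified hand-rolled tracing: propagate a trace ID and log it with each event.
import json
import sys
import uuid

def handle_request(headers: dict) -> None:
    # Reuse the incoming trace ID, or start a new trace at the edge of the system.
    trace_id = headers.get("X-Trace-Id", uuid.uuid4().hex)
    log(trace_id, "request received")
    call_downstream(trace_id)
    log(trace_id, "request finished")

def call_downstream(trace_id: str) -> None:
    # Propagate the trace ID to the next microservice, e.g. as an HTTP header:
    # requests.get("http://inventory-service/items", headers={"X-Trace-Id": trace_id})
    log(trace_id, "calling inventory-service")

def log(trace_id: str, message: str) -> None:
    # One JSON line per event; the log collector ships these to the central store.
    print(json.dumps({"trace_id": trace_id, "message": message}), file=sys.stdout)

if __name__ == "__main__":
    handle_request({})   # no incoming trace ID -> a new one is generated
```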
LN43:
- In case of failure => need for good logging in microservices
- should be easily searchable, and all services should aggregate their logs in one place to make finding problems easier
LM43:
- Context: SLR findings about microservices in DevOps
- Monitoring perceived as problem by 7 studies
- Amazon CloudWatch can address logging, post-deployment monitoring, and adaptive actions monitoring by collecting data
- from logs, metrics, events
- to discover system-wide performance issues and take automated actions to keep the system running smoothly
- interpretation: "collecting" data means collecting them in a central place/tool
LM47:
- Context: SLR with tactics to achieve quality attributes
- Monitorability: becomes an essential and complex part of MSA due to its dynamic structure and behavior
- Different solutions:
- Instrumentation: host, platform, or service instrumentation
- agent on target that collects monitoring data
- depending on where agent is deployed, can collect different information
- Logging: all outgoing and incoming requests are written to the local log file of the instance
- timestamp, response time, response code, unique ID of source microservice instance, URL of target microservice instance, requested method
- log files periodically fetched and aggregated
- e.g. CloudWatch and Logstash (a rough sketch of such a log record follows after this list)
- Storing monitored data: where to keep the info collected from all places
- centralized storage: analysis of service interactions in one place
- less administrative overhead
- might become bottleneck / point of failure
- decentralized storage: at each host, platform, or service
- higher scalability
- less resilient: loss of local storage => loss of monitoring data
- more operational overhead
- Processing monitoring data
- Motivation: root cause localization, scaling up or down
- aggregation processing: store logs in aggregated form
- reduces storage amount, enables long-term analysis, e.g. long-term service runtime metrics
- some analyses not possible on aggregated data, e.g. root cause analysis
- non-aggregating processing: store in native form for detailed analysis
- huge amount of monitoring data and storage
- not possible to store over long period of time
- => for short time failure analysis
- both are not mutually exclusive; rather, use a combination
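To make the logging tactic above concrete, here is a sketch of what one request log record could look like when appended to the instance-local file that is later fetched and aggregated; the field names mirror the list above, while the file path and helper function are assumptions for illustration.

```python
# Sketch of the request-logging tactic: each instance appends one JSON record per
# request to a local log file, which a collector (e.g., Logstash or a CloudWatch
# agent) later fetches and aggregates centrally.
import json
import time

LOG_FILE = "/var/log/myservice/requests.log"   # assumed local log location

def log_request(source_instance_id: str, target_url: str, method: str,
                response_code: int, response_time_ms: float) -> None:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "response_time_ms": response_time_ms,
        "response_code": response_code,
        "source_instance_id": source_instance_id,
        "target_url": target_url,
        "method": method,
    }
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example call after an outgoing request has completed:
# log_request("orders-7f9c", "http://inventory-service/items/42", "GET", 200, 12.5)
```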
LM48:
- Context: microservice migration; describes an example project (FX Core) and compares back to the monolith
- foundation services for supportive functions, not business-related
- among others: LoggingService, MonitoringService
- active/active failover (multiple active replicas in parallel)
- infrastructure services, includes monitoring and logging
- challenge: missing system status overview
- centralized logging with LoggingService, ElasticSearch and Kibana for aggregating logs from all services
- same for monitoring with MonitoringService, Icinga and cAdvisor for aggregating monitoring metrics
- centralization: gives team complete system status overview
- proactively act on suspicious and faulty behavior
Interview A:
- Log to std-out in containers instead of local log files
- Monolith: logs in /var/lib etc.
- Microservices: too laborious with 30-40 containers, especially since containers can crash => logs are gone
- no sense in simply logging to a file
- Need to wrap your head around monitoring tools: logging, log analysis, debugging sessions,... => learn ELK stack
- also experience to operate it
- Today EFK (?) stack in production logging GBs of data per hour
- Logical need for those tools when doing microservices
- Standards
- Log to stdout
- Logging format: JSON => collect it, parse it, show it
- No need to have root privileges on a server to do log analysis
- here, the Kibana URL
- here, your filter
Interview C:
- not enough to just know how to set up a global ELK stack for logs with many distributed instances
- simple reading => cannot see the forest for the trees