Automate Anomaly Detection and Alerting

Context

Microservices are being adopted. Logging and monitoring information is aggregated in a central place. There might be dashboards and visualizations available.

Problem

  • Manually detecting anomalies is difficult
    • There is too much monitoring data and there are too many visualizations

Solution

Automate anomaly detection instead of doing it by hand, and use alerting mechanisms to bring detected anomalies onto the screens of the microservice teams.

Automated anomaly detection might range from very simple solutions, such as static thresholds, up to very sophisticated ones that include machine learning. The need for more complex mechanisms arises from the runtime dynamics of microservices: frequent updates, instances scaling up and down, etcetera, make it hard to define the "normal" state. Conventional anomaly detection mechanisms might therefore lead to a lot of false alarms.
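
The simple end of that range can be sketched as a static threshold check. The metric name and the limit below are illustrative assumptions, not values from the pattern:

```python
# Minimal sketch of threshold-based anomaly detection (illustrative values).
def check_threshold(metric_name, value, limit):
    """Return an alert message if the metric exceeds its static limit, else None."""
    if value > limit:
        return f"ANOMALY: {metric_name}={value} exceeds limit {limit}"
    return None

# Example: a CPU usage sample checked against an assumed 80% limit.
alert = check_threshold("cpu_percent", 93.5, limit=80.0)
```

Static limits like this are exactly what the runtime dynamics described above tend to break, which motivates the more sophisticated approaches.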

If anomalies are detected, the developers need to be notified. We see solutions ranging from showing anomalies on the monitoring dashboards, through sending messages to a chat channel or via email, to automatically opening tickets that reference the anomaly.
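
The chat-channel variant can be sketched with a small notifier. The `{"text": ...}` payload shape matches common Slack-style incoming webhooks; the actual URL and payload fields depend on the chat system and are assumptions here:

```python
import json
import urllib.request

def build_alert_payload(anomaly_description):
    """Build a chat-webhook payload ({'text': ...} is Slack-style; adjust per system)."""
    return {"text": f"ANOMALY: {anomaly_description}"}

def send_chat_alert(webhook_url, anomaly_description):
    """Post the anomaly alert to the chat system's incoming webhook URL."""
    data = json.dumps(build_alert_payload(anomaly_description)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:  # network call
        return response.status
```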

Maturity

Proposed, to be evaluated.

Sources of Evidence

L3:

  • Each microservice monitored independently
  • Statistical models trained using monitoring data in normal situations
  • For each incoming monitoring data point, calculate a score => spot outliers as anomalies
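
The scoring idea from L3 can be sketched with a trivial statistical model: fit mean and standard deviation on data from normal operation, then score each incoming point by its distance from the mean. The sample values and the 3-sigma cutoff are illustrative assumptions:

```python
import statistics

def fit_model(normal_values):
    """Train a trivial statistical model on monitoring data from normal situations."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)

def anomaly_score(model, value):
    """Score an incoming data point: distance from the mean in standard deviations."""
    mean, stdev = model
    return abs(value - mean) / stdev

# Fit on latency samples (ms) from normal operation, then score a new point.
model = fit_model([100, 102, 98, 101, 99, 103, 97])
is_outlier = anomaly_score(model, 160) > 3.0  # 3-sigma cutoff (assumed)
```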

L8:

  • crucial issue: alert thresholds and filters
    • notify developers when something goes wrong
    • without overloading them with redundant or irrelevant information
  • even more challenging:
    • learn from past events and actions to better inform resource management decisions
    • => control theory and machine learning
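
One simple way to address L8's filtering concern, before reaching for control theory or machine learning, is to suppress repeats of the same alert within a cooldown window. The 10-minute default is an assumption:

```python
import time

class AlertFilter:
    """Suppress redundant alerts: the same alert key fires at most once per cooldown."""

    def __init__(self, cooldown_seconds=600):  # 10-minute window (assumed)
        self.cooldown = cooldown_seconds
        self.last_sent = {}

    def should_notify(self, alert_key, now=None):
        """Return True if this alert should reach developers, False if redundant."""
        now = time.time() if now is None else now
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown:
            return False  # identical alert fired recently: drop it
        self.last_sent[alert_key] = now
        return True
```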

L14:

  • Monitoring at otto.de
  • 2 monitors in each team room: 1 for CI, 1 for monitoring
  • Responsible for more microservices => not much space on screen
    • => basic monitoring and alarming not sufficient anymore
    • Need to automatically detect anomalies in all available metrics
    • => show them on dashboards as special points of interest

L19:

  • Use webhooks in GitHub to set up alerts => notify of certain events
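
On the receiving side, a minimal handler might route deliveries by event type. The event names follow GitHub's webhook vocabulary (the `X-GitHub-Event` header), but the routing targets returned here are placeholders:

```python
def route_github_event(event_type, payload):
    """Decide what to do with a GitHub webhook delivery.

    event_type is the X-GitHub-Event header value; the returned action
    strings are placeholders for real notification calls.
    """
    status = payload.get("deployment_status", {}).get("state")
    if event_type == "deployment_status" and status == "failure":
        return "notify-chat"  # broken deployment: alert the team channel
    if event_type == "push":
        return "trigger-ci"
    return "ignore"
```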

L24:

  • Continuous monitoring from DevOps => enables detecting operational anomalies

L30:

  • Microservices: difficult to determine normal behavior due to the frequent changes (updates, scaling, ...) => no "steady state"
  • normal techniques may raise many false alarms
  • Solution: explicitly incorporate logs of change events into the anomaly detection
    • => research field of anomaly detection
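
L30's idea of incorporating change-event logs can be sketched as a suppression check: an anomaly that coincides with a known deployment or scaling event is treated as expected rather than alerted. The 5-minute grace window is an assumption:

```python
def is_expected_anomaly(anomaly_time, change_event_times, grace_seconds=300):
    """Suppress anomalies that follow a known change event (deploy, scaling, ...).

    Times are UNIX timestamps; the 5-minute grace window is assumed.
    """
    return any(
        0 <= anomaly_time - event_time < grace_seconds
        for event_time in change_event_times
    )
```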

L31:

  • use history data to alert at unusual resource usage burst

L43:

  • Monoliths: not that many containers to monitor even though scaled horizontally
  • When there is more to monitor => need for good automated tool to notify persons who need to act when microservice fails

Interview A:

  • Broken deployment => webhooks to chat system
    • also on healthy CI => trigger for others to continue work (e.g. manual testing with HW product)
  • Could also use that for container monitoring (not yet)
  • Planned by interviewee:
    • Anomaly detected in production => open ticket for it
    • technologically easily feasible
    • Same for container monitoring and cluster monitoring (=> add new HW on time)
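
The interviewee's planned ticket automation could look like the sketch below. The ticket fields, tracker URL, and authentication scheme are placeholders for whatever issue tracker the teams use:

```python
import json
import urllib.request

def build_anomaly_ticket(anomaly):
    """Build a ticket body referencing the detected anomaly (field names assumed)."""
    return {
        "title": f"Anomaly detected: {anomaly['metric']}",
        "description": f"Value {anomaly['value']} flagged at {anomaly['timestamp']}",
        "labels": ["anomaly", "auto-generated"],
    }

def open_anomaly_ticket(tracker_url, api_token, anomaly):
    """Open a ticket in the issue tracker for a production anomaly."""
    request = urllib.request.Request(
        tracker_url,
        data=json.dumps(build_anomaly_ticket(anomaly)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",  # auth scheme assumed
        },
    )
    with urllib.request.urlopen(request) as response:  # network call
        return response.status
```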