TUNIVERSE

AutoMAP: Diagnose Your Microservice-based Web Applications Automatically

2023/01/22

Fault localization

challenging roots

  1. dynamic application structure
    • requires developing a centrally running component that collects historical data and generates a structure from which patterns can be analyzed and anomalies diagnosed.
    • this is time- and effort-consuming, and even unrealistic in some legacy systems
  2. indirect anomaly propagation
    • anomaly propagation is not bounded by the calling dependency, because the calling process may be asynchronous.
    • even when an anomaly occurs in a microservice that has not been called, it may affect other services on the same host or in the same container, and so propagate.
  3. multiple types of metric
    • a single type of metric is not enough to characterize the anomalies that occur in diverse services
    • an automated mechanism that selects metrics appropriately according to the characteristics of the involved services is still lacking

model

  1. determine performance metrics

  2. select a sampling interval parameter on raw metrics

    1. small interval ⇒ redundant fluctuations that do not reflect the actual calling dependency

    2. large interval ⇒ effective metric changes are lost and the propagation topology is missed

    3. choose the interval from the statistical average of calling intervals
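
The trade-off between a small and a large interval can be seen on a toy series. The sketch below is only an illustration: `resample_mean`, the 1-second raw step and the latency values are hypothetical and not taken from AutoMAP, and plain window averaging is an assumed way of applying a sampling interval to raw metrics.

```python
def resample_mean(series, raw_step, interval):
    """Average a raw metric series into windows of `interval` seconds.

    raw_step: seconds between raw samples (assumed constant).
    interval: target sampling interval, assumed to be a multiple of raw_step.
    """
    if interval <= 0 or interval % raw_step != 0:
        raise ValueError("interval must be a positive multiple of raw_step")
    per_window = interval // raw_step
    usable = len(series) - len(series) % per_window  # drop the incomplete tail
    return [
        sum(series[i:i + per_window]) / per_window
        for i in range(0, usable, per_window)
    ]


# Hypothetical latency samples taken 1 s apart, with a jump halfway through.
raw = [10, 12, 11, 13, 50, 52, 51, 49]

fine = resample_mean(raw, raw_step=1, interval=4)    # [11.5, 50.5]
coarse = resample_mean(raw, raw_step=1, interval=8)  # [31.0]
print(fine, coarse)
```

With the 4-second interval the jump is still visible; with the 8-second interval it is averaged away, which is the "large interval" failure mode in the list above.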

  3. construct the anomaly behavior graph using multiple types of metrics

    ⭐ experiences from historical diagnosis and the characteristics of various services

    ⇒ a model named Anomaly Behavior Graph to extract correlations from metrics and help us discover similar anomalies in records

    • four steps to construct the graph:

      • Step 1. Generate a complete, undirected, fully weighted graph G(V, E, W), where W(i,j,k) = 1 for ∀i, j ∈ [1, n] and ∀M(k), k ∈ [1, m];
      • Step 2. For each type of metric M(k), test the conditional independence of every pair v(i), v(j). Set W(i,j,k) = 0 if conditional independence between v(i) and v(j) is accepted;
        • independence condition: v(i) and v(j) are conditionally independent given a third service v(c) if P(v(i) ∩ v(j) | v(c)) = P(v(i) | v(c)) · P(v(j) | v(c)), with P(v(c)) > 0
      • Step 3. Remove edge e(i,j) if W(i,j,k) = 0 for ∀k ∈ [1, m]. Otherwise set W(i,j,k) ← W(i,j,k) / ‖W(i,j)‖₀, where ‖W(i,j)‖₀ is the number of non-zero weights on the edge;
      • Step 4. Orient v-structures and the remaining edges in G.
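
Steps 1 to 3 can be sketched as follows. This is a simplified illustration, not AutoMAP's implementation: it assumes conditional independence is tested with a Fisher z-test on (partial) Pearson correlations, conditions on at most one other service (written `c` here), and leaves out the orientation of Step 4. The function names, the `data[k][i]` layout and the default `alpha` are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def is_independent(r, n_samples, n_cond, alpha):
    """Fisher z-test: True if correlation r is not significant at level alpha."""
    r = max(min(r, 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n_samples - n_cond - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha


def build_skeleton(data, alpha=0.05):
    """data[k][i] is the series of metric k for service i (at least 5 samples)."""
    m, n = len(data), len(data[0])
    # Step 1: complete graph, every weight W(i,j,k) starts at 1.
    weights = {(i, j): [1.0] * m for i, j in combinations(range(n), 2)}
    # Step 2: zero W(i,j,k) when independence is accepted for metric k.
    for k in range(m):
        series = data[k]
        length = len(series[0])
        corr = [[pearson(series[a], series[b]) for b in range(n)] for a in range(n)]
        for i, j in weights:
            independent = is_independent(corr[i][j], length, 0, alpha)
            for c in range(n):
                if independent:
                    break
                if c in (i, j):
                    continue
                denom = math.sqrt((1 - corr[i][c] ** 2) * (1 - corr[j][c] ** 2))
                if denom == 0:
                    continue
                partial = (corr[i][j] - corr[i][c] * corr[j][c]) / denom
                independent = is_independent(partial, length, 1, alpha)
            if independent:
                weights[(i, j)][k] = 0.0
    # Step 3: drop edges with no surviving metric, normalize by the L0 norm.
    graph = {}
    for edge, w in weights.items():
        nonzero = sum(1 for v in w if v != 0)
        if nonzero:
            graph[edge] = [v / nonzero for v in w]
    return graph


x0 = [1, 2, 3, 4, 5, 6, 7, 8]
x1 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9, 7.2, 7.8]  # tracks x0 closely
x2 = [1, -1, -1, 1, 1, -1, -1, 1]              # unrelated pattern
latency = [x0, x1, x2]                         # hypothetical metric, 3 services
graph = build_skeleton([latency, latency])     # two identical metric types
print(graph)  # {(0, 1): [0.5, 0.5]}
```

Only the edge between services 0 and 1 survives, and its two unit weights are divided by the count of non-zero entries, as in Step 3.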

  4. addition and subtraction operations

  5. Automated Metric Weight Learning

    • Service Correlation

      1. Calculate the covariance of two metric series divided by the product of their standard deviations (the Pearson correlation coefficient), and use the absolute value of the result as the score.
      2. This score measures the strength of either positive or negative linear correlation between two services.
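
The score described above is the absolute Pearson correlation coefficient. A minimal sketch in plain Python follows; the function name and the choice to return 0 for a constant series are assumptions made for illustration.

```python
import math


def correlation_score(x, y):
    """Absolute Pearson correlation between two equally long metric series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # The 1/n factors of covariance and standard deviations cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series carries no correlation signal
    return abs(cov / math.sqrt(var_x * var_y))


# A perfectly negative linear relation still scores 1.0.
print(correlation_score([1, 2, 3, 4], [8, 6, 4, 2]))
```

Taking the absolute value is what lets the score treat a strong negative relation the same as a strong positive one.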

    • Result Precision

  6. heuristic root cause detection

    • forward transition

      • if e(i,j) = 1, the walk moves from service v(i) to v(j)

      • definition of probability matrix P

    • self-transition

      • if none of the current service's in- and out-neighbors shows high correlation, the walk stays longer on the currently visited service

      • denote the self-transition probability for the visiting node v(i)

    • backward transition

      • handles the case where the walk visits a service with a low correlation score

      • the lower ρ is, the more the walk is constrained to the original direction; the higher ρ is, the more it is encouraged to walk backward when needed.
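
How the three transition types could combine into one probability matrix is sketched below. The exact definitions used by AutoMAP are not reproduced in these notes, so the weights here are assumptions: a forward step to v(j) is weighted by the correlation score of v(j), a backward step is weighted by ρ times the score of the in-neighbor, the self-transition weight is the amount by which the current score exceeds its best neighbor weight, and each row is then normalized. `transition_matrix`, `random_walk` and the example scores are hypothetical.

```python
import random


def transition_matrix(adj, score, rho):
    """Row-normalized transition matrix for the walk (illustrative weights).

    adj[i][j] == 1 means there is an edge e(i,j); score[i] is the
    correlation score of service i; rho is the backward parameter.
    """
    n = len(adj)
    matrix = []
    for i in range(n):
        row = [0.0] * n
        for j in range(n):
            if i == j:
                continue
            if adj[i][j]:        # forward transition along e(i,j)
                row[j] = score[j]
            elif adj[j][i]:      # backward transition against e(j,i)
                row[j] = rho * score[j]
        # Self-transition: stay when no neighbor looks better than this node.
        row[i] = max(0.0, score[i] - max(row))
        total = sum(row)
        if total == 0:           # isolated or all-zero row: stay put
            row[i], total = 1.0, 1.0
        matrix.append([w / total for w in row])
    return matrix


def random_walk(matrix, start, steps, seed=0):
    """Count visits per service; more visits suggests a likelier root cause."""
    rng = random.Random(seed)
    visits = [0] * len(matrix)
    current = start
    for _ in range(steps):
        current = rng.choices(range(len(matrix)), weights=matrix[current])[0]
        visits[current] += 1
    return visits


adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]   # call chain 0 -> 1 -> 2
score = [0.2, 0.5, 0.9]                   # hypothetical correlation scores
P = transition_matrix(adj, score, rho=0.2)
visits = random_walk(P, start=0, steps=1000)
ranking = sorted(range(len(visits)), key=lambda i: -visits[i])
print(P, ranking)
```

In this toy chain the last service has the highest score and no better neighbor, so most of its probability mass goes to the self-transition and the walk tends to accumulate visits there.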

experiments

1. dataset

⭐ simulated and real-world

  • the real-world dataset consists of 20 incidents that occurred on a cloud platform, collected by the SRE team.
    • about 15 million metrics
    • collected over 7200 seconds (1 hour before and 1 hour after the anomaly was detected) from 1732 microservice APIs

2. baseline

compared on top-5 precision and on the average of top-1 to top-5 precision

  • single metric ⇒ M(lat) and M(thr)
    1. TBAC
    2. MonitorRank
    3. CloudRanger
  • pre-defined topology ⇒ behavior graph constructed on M(lat)
    1. NetMedic
    2. MS-Rank
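
The two comparison measures can be sketched as follows, assuming each incident has exactly one root cause and each algorithm returns a ranked list of services. The function names and the toy rankings are hypothetical, and the paper's exact definition may differ in detail.

```python
def precision_at_k(rankings, root_causes, k):
    """Fraction of incidents whose root cause appears in the top k results."""
    hits = sum(
        1 for ranked, cause in zip(rankings, root_causes) if cause in ranked[:k]
    )
    return hits / len(rankings)


def average_precision(rankings, root_causes, max_k=5):
    """Mean of precision_at_k for k = 1 .. max_k."""
    return sum(
        precision_at_k(rankings, root_causes, k) for k in range(1, max_k + 1)
    ) / max_k


# Three hypothetical incidents, all caused by service "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
root_causes = ["a", "a", "a"]

print(precision_at_k(rankings, root_causes, 1))   # 1 of 3 incidents
print(precision_at_k(rankings, root_causes, 5))   # all incidents
print(average_precision(rankings, root_causes))   # mean over k = 1..5
```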

3. experiments

  1. root cause identification

  • root cause detection is inaccurate if it is based only on the correlation score (TBAC);
  • compared to static algorithms (TBAC, NetMedic, MonitorRank, CloudRanger), the random walk scheme identifies the root cause with higher accuracy;
  • compared with the MS-Rank algorithm, the anomaly profile introduced in AutoMAP effectively improves result accuracy, especially for top-1 results.
  2. self-optimization

The precision of AutoMAP increases significantly as more rounds of tests are conducted. NetMedic does not support self-optimization, so it is unstable under a rapidly changing system architecture.

  3. Algorithm parameter - ℓ

  • a small ℓ has a significant impact on precision
  • precision increases more quickly when ℓ is higher
  4. Domain knowledge

    the role of domain knowledge gets weaker when more sampling data are available.

  5. Algorithm parameters - α and ρ
  • α: increasing it from 0.01 to 0.50 makes the overall execution time grow linearly. When more metric records are used to run AutoMAP, the impact of different α values is not obvious.

    ⇒ choose a relatively small α, making the correlation graph more consistent with ground truth

  • ρ: two environments are set up (α = 0.01, ℓ = 200 versus α = 0.50, ℓ = 1440).

    • When ρ is smaller, a higher α is needed to ensure that the random walk algorithm has more paths to choose from.

      Precision is not significantly improved when ρ is set close to 1.

      ⇒ choose a moderate backward transition parameter, e.g., ρ = 0.2.
