跳转至

21. Etcd

21.1. Insufficient Members

Etcd cluster should have an odd number of members

- alert: InsufficientMembers
  expr: count(etcd_server_id) > (count(etcd_server_id) / 2 - 1)
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Insufficient Members (instance {{ $labels.instance }})"
    description: "Etcd cluster should have an odd number of members\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.2. No Leader

Etcd cluster have no leader

- alert: NoLeader
  expr: etcd_server_has_leader == 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "No Leader (instance {{ $labels.instance }})"
    description: "Etcd cluster have no leader\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.3. High number of leader changes

Etcd leader changed more than 3 times during last hour

- alert: HighNumberOfLeaderChanges
  expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High number of leader changes (instance {{ $labels.instance }})"
    description: "Etcd leader changed more than 3 times during last hour\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.4. High number of failed GRPC requests

More than 1% GRPC request failure detected in Etcd for 5 minutes

- alert: HighNumberOfFailedGrpcRequests
  expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High number of failed GRPC requests (instance {{ $labels.instance }})"
    description: "More than 1% GRPC request failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.5. High number of failed GRPC requests

More than 5% GRPC request failure detected in Etcd for 5 minutes

- alert: HighNumberOfFailedGrpcRequests
  expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.05
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "High number of failed GRPC requests (instance {{ $labels.instance }})"
    description: "More than 5% GRPC request failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.6. GRPC requests slow

GRPC requests slowing down, 99th percentil is over 0.15s for 5 minutes

- alert: GrpcRequestsSlow
  expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "GRPC requests slow (instance {{ $labels.instance }})"
    description: "GRPC requests slowing down, 99th percentil is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.7. High number of failed HTTP requests

More than 1% HTTP failure detected in Etcd for 5 minutes

- alert: HighNumberOfFailedHttpRequests
  expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.01
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High number of failed HTTP requests (instance {{ $labels.instance }})"
    description: "More than 1% HTTP failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.8. High number of failed HTTP requests

More than 5% HTTP failure detected in Etcd for 5 minutes

- alert: HighNumberOfFailedHttpRequests
  expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.05
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "High number of failed HTTP requests (instance {{ $labels.instance }})"
    description: "More than 5% HTTP failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.9. HTTP requests slow

HTTP requests slowing down, 99th percentil is over 0.15s for 5 minutes

- alert: HttpRequestsSlow
  expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "HTTP requests slow (instance {{ $labels.instance }})"
    description: "HTTP requests slowing down, 99th percentil is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.10. Etcd member communication slow

Etcd member communication slowing down, 99th percentil is over 0.15s for 5 minutes

- alert: EtcdMemberCommunicationSlow
  expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.15
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Etcd member communication slow (instance {{ $labels.instance }})"
    description: "Etcd member communication slowing down, 99th percentil is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.11. High number of failed proposals

Etcd server got more than 5 failed proposals past hour

- alert: HighNumberOfFailedProposals
  expr: increase(etcd_server_proposals_failed_total[1h]) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High number of failed proposals (instance {{ $labels.instance }})"
    description: "Etcd server got more than 5 failed proposals past hour\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.12. High fsync durations

Etcd WAL fsync duration increasing, 99th percentil is over 0.5s for 5 minutes

- alert: HighFsyncDurations
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High fsync durations (instance {{ $labels.instance }})"
    description: "Etcd WAL fsync duration increasing, 99th percentil is over 0.5s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

21.13. High commit durations

Etcd commit duration increasing, 99th percentil is over 0.25s for 5 minutes

- alert: HighCommitDurations
  expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High commit durations (instance {{ $labels.instance }})"
    description: "Etcd commit duration increasing, 99th percentil is over 0.25s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"