9. Cassandra cassandra_exporter
1. instaclustr/cassandra_exporter
9.1. Cassandra Node is unavailable
Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}
  - alert: CassandraNodeIsUnavailable
    expr: sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint) < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra Node is unavailable (instance {{ $labels.instance }})
      description: "Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
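The snippets in this section are list items meant to sit under a rules key in a Prometheus rule file. A minimal sketch of a complete file follows; the file name cassandra_rules.yml and the group name cassandra-instaclustr are placeholders chosen for illustration, not names defined by the exporter or by Prometheus.

```yaml
# cassandra_rules.yml (hypothetical file name)
groups:
  - name: cassandra-instaclustr   # hypothetical group name
    rules:
      - alert: CassandraNodeIsUnavailable
        expr: sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint) < 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Cassandra Node is unavailable (instance {{ $labels.instance }})
          description: "Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
```

The file can be validated with "promtool check rules cassandra_rules.yml" before it is listed under rule_files in the Prometheus configuration.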
9.2. Cassandra many compaction tasks are pending
Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}
  - alert: CassandraManyCompactionTasksArePending
    expr: cassandra_table_estimated_pending_compactions > 100
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Cassandra many compaction tasks are pending (instance {{ $labels.instance }})
      description: "Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9.3. Cassandra commitlog pending tasks
Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}
  - alert: CassandraCommitlogPendingTasks
    expr: cassandra_commit_log_pending_tasks > 15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
      description: "Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9.4. Cassandra compaction executor blocked tasks
Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}
  - alert: CassandraCompactionExecutorBlockedTasks
    expr: cassandra_thread_pool_blocked_tasks{pool="CompactionExecutor"} > 15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9.5. Cassandra flush writer blocked tasks
Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}
  - alert: CassandraFlushWriterBlockedTasks
    expr: cassandra_thread_pool_blocked_tasks{pool="MemtableFlushWriter"} > 15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9.6. Cassandra connection timeouts total
Some connections between nodes are ending in timeout - {{ $labels.cassandra_cluster }}
  - alert: CassandraConnectionTimeoutsTotal
    expr: avg(cassandra_client_request_timeouts_total) by (cassandra_cluster,instance) > 5
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
      description: "Some connections between nodes are ending in timeout - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
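The metric name ends in _total, which by Prometheus convention marks a cumulative counter. Comparing the raw counter with 5 stays true once more than five timeouts have been recorded, until the counter resets. A possible variant that alerts on recent growth instead is sketched below; the 5m window and the threshold are assumptions to tune, and this is not the upstream rule.

```yaml
  - alert: CassandraConnectionTimeoutsTotal
    # Assumed variant: average number of new timeouts per series over the last 5 minutes
    expr: avg(increase(cassandra_client_request_timeouts_total[5m])) by (cassandra_cluster,instance) > 5
    for: 2m
    labels:
      severity: critical
```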
9.7. Cassandra storage exceptions
Something is going wrong with cassandra storage - {{ $labels.cassandra_cluster }}
  - alert: CassandraStorageExceptions
    expr: changes(cassandra_storage_exceptions_total[1m]) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra storage exceptions (instance {{ $labels.instance }})
      description: "Something is going wrong with cassandra storage - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
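changes() counts how often the sampled value differs between consecutive scrapes inside the window, not how many exceptions occurred. With a 60s scrape interval a 1m window usually holds only one or two samples, so changes() cannot exceed 1 and the rule above would not fire. A sketch of an alternative based on increase(), assuming the metric is a counter as its _total suffix suggests; the 5m window is an assumption, not part of the original rule.

```yaml
  - alert: CassandraStorageExceptions
    # Assumed variant: fire when more than one new exception was counted in the last 5 minutes
    expr: increase(cassandra_storage_exceptions_total[5m]) > 1
    for: 0m
    labels:
      severity: critical
```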
9.8. Cassandra tombstone dump
Cassandra tombstone dump - {{ $labels.cassandra_cluster }}
  - alert: CassandraTombstoneDump
    expr: avg(cassandra_table_tombstones_scanned{quantile="0.99"}) by (instance,cassandra_cluster,keyspace) > 100
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra tombstone dump (instance {{ $labels.instance }})
      description: "Cassandra tombstone dump - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9.9. Cassandra client request unavailable write
Some Cassandra client write requests are failing because nodes are unavailable - {{ $labels.cassandra_cluster }}
  - alert: CassandraClientRequestUnvailableWrite
    expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="write"}[1m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
      description: "Some Cassandra client write requests are failing because nodes are unavailable - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9.10. Cassandra client request unavailable read
Some Cassandra client read requests are failing because nodes are unavailable - {{ $labels.cassandra_cluster }}
  - alert: CassandraClientRequestUnvailableRead
    expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="read"}[1m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
      description: "Some Cassandra client read requests are failing because nodes are unavailable - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9.11. Cassandra client request write failure
Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}
  - alert: CassandraClientRequestWriteFailure
    expr: increase(cassandra_client_request_failures_total{operation="write"}[1m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request write failure (instance {{ $labels.instance }})
      description: "Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9.12. Cassandra client request read failure
Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}
  - alert: CassandraClientRequestReadFailure
    expr: increase(cassandra_client_request_failures_total{operation="read"}[1m]) > 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request read failure (instance {{ $labels.instance }})
      description: "Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
2. criteo/cassandra_exporter
1. Cassandra hints count
Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down
  - alert: CassandraHintsCount
    expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra hints count (instance {{ $labels.instance }})
      description: "Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
2. Cassandra compaction task pending
Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.
  - alert: CassandraCompactionTaskPending
    expr: avg_over_time(cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"}[1m]) > 100
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra compaction task pending (instance {{ $labels.instance }})
      description: "Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
3. Cassandra viewwrite latency
High viewwrite latency on {{ $labels.instance }} cassandra node
  - alert: CassandraViewwriteLatency
    expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile",service="cas"} > 100000
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra viewwrite latency (instance {{ $labels.instance }})
      description: "High viewwrite latency on {{ $labels.instance }} cassandra node\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
4. Cassandra bad hacker
Increase of Cassandra authentication failures
  - alert: CassandraBadHacker
    expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra bad hacker (instance {{ $labels.instance }})
      description: "Increase of Cassandra authentication failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
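rate() returns a per-second average, so the threshold above corresponds to more than five authentication failures per second, sustained for two minutes. On a lower-volume cluster the same signal can be expressed as a count per window. A sketch, where the 5m window and the threshold of 10 are illustrative assumptions rather than values from the original rule:

```yaml
  - alert: CassandraBadHacker
    # Assumed variant: more than 10 new authentication failures in the last 5 minutes
    expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[5m]) > 10
    for: 2m
    labels:
      severity: warning
```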
5. Cassandra node down
Cassandra node down
  - alert: CassandraNodeDown
    expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra node down (cluster {{ $labels.cluster }})
      description: "Cassandra node down\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
6. Cassandra commitlog pending tasks
Unexpected number of Cassandra commitlog pending tasks
  - alert: CassandraCommitlogPendingTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
      description: "Unexpected number of Cassandra commitlog pending tasks\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
7. Cassandra compaction executor blocked tasks
Some Cassandra compaction executor tasks are blocked
  - alert: CassandraCompactionExecutorBlockedTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra compaction executor tasks are blocked\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
8. Cassandra flush writer blocked tasks
Some Cassandra flush writer tasks are blocked
  - alert: CassandraFlushWriterBlockedTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra flush writer tasks are blocked\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
9. Cassandra repair pending tasks
Some Cassandra repair tasks are pending
  - alert: CassandraRepairPendingTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra repair pending tasks (instance {{ $labels.instance }})
      description: "Some Cassandra repair tasks are pending\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
10. Cassandra repair blocked tasks
Some Cassandra repair tasks are blocked
  - alert: CassandraRepairBlockedTasks
    expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Cassandra repair blocked tasks (instance {{ $labels.instance }})
      description: "Some Cassandra repair tasks are blocked\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
11. Cassandra connection timeouts total
Some connections between nodes are ending in timeout
  - alert: CassandraConnectionTimeoutsTotal
    expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
      description: "Some connections between nodes are ending in timeout\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
12. Cassandra storage exceptions
Something is going wrong with cassandra storage
  - alert: CassandraStorageExceptions
    expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra storage exceptions (instance {{ $labels.instance }})
      description: "Something is going wrong with cassandra storage\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
13. Cassandra tombstone dump
Too many tombstones scanned in queries
  - alert: CassandraTombstoneDump
    expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra tombstone dump (instance {{ $labels.instance }})
      description: "Too many tombstones scanned in queries\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
14. Cassandra client request unavailable write
Write failures have occurred because too many nodes are unavailable
  - alert: CassandraClientRequestUnvailableWrite
    expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
      description: "Write failures have occurred because too many nodes are unavailable\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
15. Cassandra client request unavailable read
Read failures have occurred because too many nodes are unavailable
  - alert: CassandraClientRequestUnvailableRead
    expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
      description: "Read failures have occurred because too many nodes are unavailable\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
16. Cassandra client request write failure
Many write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.
  - alert: CassandraClientRequestWriteFailure
    expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request write failure (instance {{ $labels.instance }})
      description: "Many write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
17. Cassandra client request read failure
Many read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.
  - alert: CassandraClientRequestReadFailure
    expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Cassandra client request read failure (instance {{ $labels.instance }})
      description: "Many read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
18. Cassandra cache hit rate key cache
Key cache hit rate is below 85%
  - alert: CassandraCacheHitRateKeyCache
    expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < 0.85
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }})
      description: "Key cache hit rate is below 85%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
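The threshold expression can be checked offline with promtool's unit-test format before the rule is deployed. The sketch below tests only the PromQL expression, not the alert annotations; the file name and the node1/node2 instance labels are made up for illustration.

```yaml
# keycache_test.yml (hypothetical file name); run with: promtool test rules keycache_test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # node1 sits below the 85% threshold, node2 above it (both instances are invented)
      - series: 'cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value", instance="node1"}'
        values: '0.8+0x5'
      - series: 'cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value", instance="node2"}'
        values: '0.95+0x5'
    promql_expr_test:
      # Only the series below the threshold should be returned by the comparison
      - expr: 'cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < 0.85'
        eval_time: 2m
        exp_samples:
          - labels: 'cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value", instance="node1"}'
            value: 0.8
```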