9. Cassandra: cassandra_exporter
9.1. Cassandra Node is unavailable
Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}
- alert: CassandraNodeIsUnavailable
  expr: sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint) < 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra Node is unavailable (instance {{ $labels.instance }})
    description: "Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
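Each `- alert:` entry in this section is one item of a `rules:` list, so it has to sit inside a rule group before Prometheus will load it. A minimal sketch of the surrounding file follows; the file name and the group name `cassandra` are illustrative assumptions, not part of the rule set.

```yaml
# cassandra.rules.yml (hypothetical file name)
groups:
  - name: cassandra            # hypothetical group name
    rules:
      # The rule body is copied from above; only the nesting is added.
      - alert: CassandraNodeIsUnavailable
        expr: sum(cassandra_endpoint_active) by (cassandra_cluster,instance,exported_endpoint) < 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Cassandra Node is unavailable (instance {{ $labels.instance }})
          description: "Cassandra Node is unavailable - {{ $labels.cassandra_cluster }} {{ $labels.exported_endpoint }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```

Such a file is typically listed under `rule_files` in the Prometheus configuration and can be validated with `promtool check rules`.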
9.2. Cassandra many compaction tasks are pending
Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}
- alert: CassandraManyCompactionTasksArePending
  expr: cassandra_table_estimated_pending_compactions > 100
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Cassandra many compaction tasks are pending (instance {{ $labels.instance }})
    description: "Many Cassandra compaction tasks are pending - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.3. Cassandra commitlog pending tasks
Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}
- alert: CassandraCommitlogPendingTasks
  expr: cassandra_commit_log_pending_tasks > 15
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
    description: "Cassandra commitlog pending tasks - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
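A threshold such as the one above can be sanity-checked before deployment with a `promtool test rules` unit test. The sketch below feeds a synthetic series into the same expression; the file name, the label values (`demo`, `node1:9500`), and the sample values are made up for illustration.

```yaml
# commitlog_test.yml (hypothetical file name)
# Run with: promtool test rules commitlog_test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Synthetic data: a constant 20 pending tasks for 10 minutes.
      - series: 'cassandra_commit_log_pending_tasks{cassandra_cluster="demo", instance="node1:9500"}'
        values: '20+0x10'
    promql_expr_test:
      # The alert expression should return the series, since 20 > 15.
      - expr: cassandra_commit_log_pending_tasks > 15
        eval_time: 5m
        exp_samples:
          - labels: 'cassandra_commit_log_pending_tasks{cassandra_cluster="demo", instance="node1:9500"}'
            value: 20
```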
9.4. Cassandra compaction executor blocked tasks
Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}
- alert: CassandraCompactionExecutorBlockedTasks
  expr: cassandra_thread_pool_blocked_tasks{pool="CompactionExecutor"} > 15
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
    description: "Some Cassandra compaction executor tasks are blocked - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.5. Cassandra flush writer blocked tasks
Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}
- alert: CassandraFlushWriterBlockedTasks
  expr: cassandra_thread_pool_blocked_tasks{pool="MemtableFlushWriter"} > 15
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
    description: "Some Cassandra flush writer tasks are blocked - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.6. Cassandra connection timeouts total
Some connections between nodes are ending in timeout - {{ $labels.cassandra_cluster }}
- alert: CassandraConnectionTimeoutsTotal
  expr: avg(cassandra_client_request_timeouts_total) by (cassandra_cluster,instance) > 5
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
    description: "Some connections between nodes are ending in timeout - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.7. Cassandra storage exceptions
Something is going wrong with Cassandra storage - {{ $labels.cassandra_cluster }}
- alert: CassandraStorageExceptions
  expr: changes(cassandra_storage_exceptions_total[1m]) > 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra storage exceptions (instance {{ $labels.instance }})
    description: "Something is going wrong with Cassandra storage - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
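`changes()` counts how often the sample value changed inside the range, so with a `[1m]` range the result is bounded by the number of scrapes that land within one minute. With a scrape interval of 30s or longer, `> 1` may rarely or never be satisfied. The variant below is a sketch only: the 5m window and the alert name are assumptions chosen for illustration, and it relies on the metric being a monotonically increasing counter.

```yaml
# Illustrative variant, not part of the original rule set.
- alert: CassandraStorageExceptionsIncrease   # hypothetical name
  # increase() estimates how much the counter grew over the window,
  # independent of how many scrapes fall inside it.
  expr: increase(cassandra_storage_exceptions_total[5m]) > 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra storage exceptions (instance {{ $labels.instance }})
    description: "Something is going wrong with Cassandra storage - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```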
9.8. Cassandra tombstone dump
Cassandra tombstone dump - {{ $labels.cassandra_cluster }}
- alert: CassandraTombstoneDump
  expr: avg(cassandra_table_tombstones_scanned{quantile="0.99"}) by (instance,cassandra_cluster,keyspace) > 100
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra tombstone dump (instance {{ $labels.instance }})
    description: "Cassandra tombstone dump - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.9. Cassandra client request unavailable write
Some Cassandra client write requests failed with unavailable exceptions - {{ $labels.cassandra_cluster }}
- alert: CassandraClientRequestUnvailableWrite
  expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="write"}[1m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
    description: "Some Cassandra client write requests failed with unavailable exceptions - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.10. Cassandra client request unavailable read
Some Cassandra client read requests failed with unavailable exceptions - {{ $labels.cassandra_cluster }}
- alert: CassandraClientRequestUnvailableRead
  expr: changes(cassandra_client_request_unavailable_exceptions_total{operation="read"}[1m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
    description: "Some Cassandra client read requests failed with unavailable exceptions - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.11. Cassandra client request write failure
Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}
- alert: CassandraClientRequestWriteFailure
  expr: increase(cassandra_client_request_failures_total{operation="write"}[1m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request write failure (instance {{ $labels.instance }})
    description: "Write failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9.12. Cassandra client request read failure
Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}
- alert: CassandraClientRequestReadFailure
  expr: increase(cassandra_client_request_failures_total{operation="read"}[1m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request read failure (instance {{ $labels.instance }})
    description: "Read failures have occurred, ensure there are not too many unavailable nodes - {{ $labels.cassandra_cluster }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
2. Cassandra: criteo/cassandra_exporter
1. Cassandra hints count
Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down
- alert: CassandraHintsCount
  expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:totalhints:count"}[1m]) > 3
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra hints count (instance {{ $labels.instance }})
    description: "Cassandra hints count has changed on {{ $labels.instance }}; some nodes may be down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
2. Cassandra compaction task pending
Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.
- alert: CassandraCompactionTaskPending
  expr: avg_over_time(cassandra_stats{name="org:apache:cassandra:metrics:compaction:pendingtasks:value"}[1m]) > 100
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra compaction task pending (instance {{ $labels.instance }})
    description: "Many Cassandra compaction tasks are pending. You might need to increase I/O capacity by adding nodes to the cluster.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
3. Cassandra viewwrite latency
High viewwrite latency on Cassandra node {{ $labels.instance }}
- alert: CassandraViewwriteLatency
  expr: cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:viewwrite:viewwritelatency:99thpercentile",service="cas"} > 100000
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra viewwrite latency (instance {{ $labels.instance }})
    description: "High viewwrite latency on Cassandra node {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
4. Cassandra bad hacker
Increase of Cassandra authentication failures
- alert: CassandraBadHacker
  expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:client:authfailure:count"}[1m]) > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra bad hacker (instance {{ $labels.instance }})
    description: "Increase of Cassandra authentication failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
5. Cassandra node down
Cassandra node down
- alert: CassandraNodeDown
  expr: sum(cassandra_stats{name="org:apache:cassandra:net:failuredetector:downendpointcount"}) by (service,group,cluster,env) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    # The aggregation drops the instance label, so the summary reports the cluster instead.
    summary: Cassandra node down (cluster {{ $labels.cluster }})
    description: "Cassandra node down\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
6. Cassandra commitlog pending tasks
Unexpected number of Cassandra commitlog pending tasks
- alert: CassandraCommitlogPendingTasks
  expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"} > 15
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
    description: "Unexpected number of Cassandra commitlog pending tasks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
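The `cassandra_stats{name="..."}` selectors used with this exporter are long and easy to mistype. One option is a recording rule that stores a selector under a shorter name, which the alert can then reference. In the sketch below, the group name and the recorded metric name `cassandra:commitlog_pending_tasks` are hypothetical.

```yaml
groups:
  - name: cassandra_criteo     # hypothetical group name
    rules:
      # Recording rule: stores the selector result under a shorter, hypothetical name.
      - record: cassandra:commitlog_pending_tasks
        expr: cassandra_stats{name="org:apache:cassandra:metrics:commitlog:pendingtasks:value"}
      # Same alert as above, written against the recorded series; the threshold is unchanged.
      - alert: CassandraCommitlogPendingTasks
        expr: cassandra:commitlog_pending_tasks > 15
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Cassandra commitlog pending tasks (instance {{ $labels.instance }})
          description: "Unexpected number of Cassandra commitlog pending tasks\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
```

Rules within a group are evaluated in order, so the alert sees the value recorded in the same evaluation cycle.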
7. Cassandra compaction executor blocked tasks
Some Cassandra compaction executor tasks are blocked
- alert: CassandraCompactionExecutorBlockedTasks
  expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:compactionexecutor:currentlyblockedtasks:count"} > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra compaction executor blocked tasks (instance {{ $labels.instance }})
    description: "Some Cassandra compaction executor tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
8. Cassandra flush writer blocked tasks
Some Cassandra flush writer tasks are blocked
- alert: CassandraFlushWriterBlockedTasks
  expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:memtableflushwriter:currentlyblockedtasks:count"} > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra flush writer blocked tasks (instance {{ $labels.instance }})
    description: "Some Cassandra flush writer tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
9. Cassandra repair pending tasks
Some Cassandra repair tasks are pending
- alert: CassandraRepairPendingTasks
  expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:pendingtasks:value"} > 2
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra repair pending tasks (instance {{ $labels.instance }})
    description: "Some Cassandra repair tasks are pending\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
10. Cassandra repair blocked tasks
Some Cassandra repair tasks are blocked
- alert: CassandraRepairBlockedTasks
  expr: cassandra_stats{name="org:apache:cassandra:metrics:threadpools:internal:antientropystage:currentlyblockedtasks:count"} > 0
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: Cassandra repair blocked tasks (instance {{ $labels.instance }})
    description: "Some Cassandra repair tasks are blocked\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
11. Cassandra connection timeouts total
Some connections between nodes are ending in timeout
- alert: CassandraConnectionTimeoutsTotal
  expr: rate(cassandra_stats{name="org:apache:cassandra:metrics:connection:totaltimeouts:count"}[1m]) > 5
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra connection timeouts total (instance {{ $labels.instance }})
    description: "Some connections between nodes are ending in timeout\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
12. Cassandra storage exceptions
Something is going wrong with Cassandra storage
- alert: CassandraStorageExceptions
  expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:storage:exceptions:count"}[1m]) > 1
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra storage exceptions (instance {{ $labels.instance }})
    description: "Something is going wrong with Cassandra storage\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
13. Cassandra tombstone dump
Too many tombstones scanned in queries
- alert: CassandraTombstoneDump
  expr: cassandra_stats{name="org:apache:cassandra:metrics:table:tombstonescannedhistogram:99thpercentile"} > 1000
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra tombstone dump (instance {{ $labels.instance }})
    description: "Too many tombstones scanned in queries\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
14. Cassandra client request unavailable write
Write failures have occurred because too many nodes are unavailable
- alert: CassandraClientRequestUnvailableWrite
  expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:unavailables:count"}[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request unavailable write (instance {{ $labels.instance }})
    description: "Write failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
15. Cassandra client request unavailable read
Read failures have occurred because too many nodes are unavailable
- alert: CassandraClientRequestUnvailableRead
  expr: changes(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:unavailables:count"}[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request unavailable read (instance {{ $labels.instance }})
    description: "Read failures have occurred because too many nodes are unavailable\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
16. Cassandra client request write failure
A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.
- alert: CassandraClientRequestWriteFailure
  expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:write:failures:oneminuterate"}[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request write failure (instance {{ $labels.instance }})
    description: "A lot of write failures encountered. A write failure is a non-timeout exception encountered during a write request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
17. Cassandra client request read failure
A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.
- alert: CassandraClientRequestReadFailure
  expr: increase(cassandra_stats{name="org:apache:cassandra:metrics:clientrequest:read:failures:oneminuterate"}[1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Cassandra client request read failure (instance {{ $labels.instance }})
    description: "A lot of read failures encountered. A read failure is a non-timeout exception encountered during a read request. Examine the reason map to find the root cause. The most common cause for this type of error is when batch sizes are too large.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
18. Cassandra cache hit rate key cache
Key cache hit rate is below 85%
- alert: CassandraCacheHitRateKeyCache
  expr: cassandra_stats{name="org:apache:cassandra:metrics:cache:keycache:hitrate:value"} < .85
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: Cassandra cache hit rate key cache (instance {{ $labels.instance }})
    description: "Key cache hit rate is below 85%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
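Every rule in this section sets `severity` to either `warning` or `critical`. The fragment below sketches how Alertmanager could route on that value; the receiver names `default`, `pager`, and `chat` are hypothetical, and their notification settings are omitted.

```yaml
# alertmanager.yml fragment (receiver names are illustrative assumptions)
route:
  receiver: default            # fallback for alerts matching no child route
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
    - matchers:
        - severity="warning"
      receiver: chat
receivers:
  - name: default
  - name: pager
  - name: chat
```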