ClickHouse® dashboard metrics
- General info
- Dashboard variables
- Panels and metrics
- 1. Uptime (logarithmic)
- 2. Failed Pods
- 3. Version
- 4. Tables / Databases
- 5. ReadOnly replicas
- 6. DNS and Distributed Connection Errors
- 7. Replication and ZooKeeper Exceptions
- 8. Delayed/Rejected/Pending Inserts
- 9. Queries (running)
- 10. Select Queries (started per sec)
- 11. Memory for Queries
- 12. Insert Queries (running)
- 13. Insert Queries (started per sec)
- 14. Rows Inserted
- 15. Replication Queue Jobs
- 16. Max Replica Delay
- 17. Zookeeper Transactions
- 18. Merges
- 19. Merged Rows
- 20. Merged Uncompressed Bytes
- 21. Active Parts
- 22. Detached parts
- 23. Max Part count for Partition
- 24. clickhouse-server Process Memory
- 25. Primary Keys Memory
- 26. Dictionary Memory
- 27. Disk Space Free
- 28. Table Stats
- 29. Clickhouse Data size on Disk
- 30. Background Tasks
- 31. Mutations
- 32. Marks Cache Hit Rate
- 33. CPU Time per second
- 34. Network / Disk CPU Time per second
- 35. Load Average 1m
- 36. CPU Time total
- 37. Connections
- Monitoring best practices
- Extra resources
Grafana's ClickHouse® dashboard provides comprehensive monitoring of a ClickHouse® DBMS cluster. The dashboard displays performance metrics, replication status, resource usage, and other critical cluster parameters.
To open a cluster dashboard:
- If you have not opened a project yet, select one.
- In the left-hand menu, select ClickHouse® Clusters.
- Select a cluster.
- Click Cluster monitoring.
This will open the cluster dashboard.
General info
Dashboard title: ClickHouse®.
UID: clickhouse-operator.
Refresh interval: 10 seconds.
Data source: Prometheus.
Dashboard variables
The dashboard uses the following variables for data filtering:
- Cluster (chi): Selects a ClickHouse® cluster.
- Server (hostname): Selects a specific server.
- Namespace (namespace): Stackland project where ClickHouse® Operator is deployed.
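As a rough illustration of how these variables might narrow a query, the sketch below builds a Prometheus-style label selector. It assumes that the variable names in parentheses (chi, hostname, namespace) are also the label names on the exported series, and that regex matchers are used; the helper name build_selector is hypothetical and not part of the dashboard.

```python
def build_selector(metric: str, chi: str, hostname: str = ".*", namespace: str = ".*") -> str:
    """Return a metric name with regex label matchers for the three dashboard variables.

    Assumption: the label names match the variable names (chi, hostname, namespace).
    """
    labels = {"chi": chi, "hostname": hostname, "namespace": namespace}
    # Each variable becomes a regex matcher; ".*" leaves that variable unfiltered.
    matchers = ",".join(f'{name}=~"{value}"' for name, value in labels.items())
    return f"{metric}{{{matchers}}}"


# Hypothetical cluster name "demo", all servers and namespaces.
selector = build_selector("chi_clickhouse_metric_Uptime", chi="demo")
print(selector)
```

Leaving hostname and namespace at their defaults corresponds to selecting all values for those variables.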
Panels and metrics
1. Uptime (logarithmic)
Description: ClickHouse® server uptime since last restart. The chart uses a logarithmic scale to make large values easier to read.
Metric: chi_clickhouse_metric_Uptime.
Unit of measurement: Seconds.
2. Failed Pods
Description: Number of pods where metrics-exporter fails to retrieve metrics from clickhouse-server. Any non-zero value indicates issues with server availability.
Metric: chi_clickhouse_metric_fetch_errors.
Unit of measurement: Count.
Recommendations: If errors occur, check the pod status by running kubectl get pods --all-namespaces | grep clickhouse.
3. Version
Description: ClickHouse® version deployed on the servers. The system shows the version in numeric format; e.g., 11.22.33 appears as 11022033.
Metric: chi_clickhouse_metric_VersionInteger.
Unit of measurement: Numeric version format.
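The numeric format can be converted back to a dotted version by hand. The sketch below assumes the encoding implied by the example above, that is, three decimal digits each for the minor and patch components; the function names are illustrative only.

```python
def decode_version_integer(value: int) -> str:
    """Split a packed version integer into major.minor.patch.

    Assumption: value = major * 1_000_000 + minor * 1_000 + patch.
    """
    major, remainder = divmod(value, 1_000_000)
    minor, patch = divmod(remainder, 1_000)
    return f"{major}.{minor}.{patch}"


def encode_version_integer(major: int, minor: int, patch: int) -> int:
    """Inverse of decode_version_integer, under the same assumption."""
    return major * 1_000_000 + minor * 1_000 + patch


print(decode_version_integer(11022033))  # 11.22.33
```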
4. Tables / Databases
Description: Total number of tables and databases in the cluster.
Metrics:
- chi_clickhouse_metric_NumberOfTables: Number of tables.
- chi_clickhouse_metric_NumberOfDatabases: Number of databases.
Unit of measurement: Count.
5. ReadOnly replicas
Description: Number of replicas in read-only mode. Any non-zero value indicates replication issues.
Metric: chi_clickhouse_metric_ReadonlyReplica.
Unit of measurement: Count.
Recommendations: Check the ZooKeeper connection, available disk space, and network connectivity between replicas.
6. DNS and Distributed Connection Errors
Description: DNS errors and connectivity failures between servers in distributed tables.
Metrics:
- chi_clickhouse_event_NetworkErrors: Network errors.
- chi_clickhouse_event_DistributedConnectionFailAtAll: Complete failures of distributed connections.
- chi_clickhouse_event_DistributedConnectionFailTry: Failed connection attempts.
- chi_clickhouse_event_DNSError: DNS errors.
Unit of measurement: Events per minute.
7. Replication and ZooKeeper Exceptions
Description: Replication metrics and exceptions when working with ZooKeeper.
Metrics:
- chi_clickhouse_metric_ReadonlyReplica: Read-only replicas.
- chi_clickhouse_event_ReplicaPartialShutdown: Partial replica shutdown.
- chi_clickhouse_event_ZooKeeperUserExceptions: ZooKeeper user exceptions.
- chi_clickhouse_event_ZooKeeperInit: ZooKeeper initialization.
- chi_clickhouse_metric_ZooKeeperSession: ZooKeeper sessions.
- chi_clickhouse_event_ZooKeeperHardwareExceptions: ZooKeeper hardware exceptions.
Unit of measurement: Events per minute.
8. Delayed/Rejected/Pending Inserts
Description: Metrics of delayed, rejected, and pending data inserts.
Metrics:
- chi_clickhouse_metric_DelayedInserts: Current number of delayed INSERT queries.
- chi_clickhouse_event_DelayedInserts: Total counter of delayed blocks.
- chi_clickhouse_event_RejectedInserts: Number of rejected blocks.
- chi_clickhouse_metric_DistributedFilesToInsert: Files pending insertion into distributed tables.
- chi_clickhouse_metric_BrokenDistributedFilesToInsert: Corrupted files in distributed tables.
Unit of measurement: Count.
Metric description:
- delayed query: Number of INSERT queries delayed due to a large number of active data parts.
- delayed blocks: Number of blocks with delayed insertion.
- rejected blocks: Number of blocks whose insertion was rejected with a Too many parts error.
Recommendations: Check the parts_to_delay_insert and parts_to_throw_insert settings in the system.merge_tree_settings table.
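To make the relationship between the two settings concrete, here is a deliberately simplified model of how an insert might be treated depending on the number of active parts. It assumes a plain threshold comparison; the real server evaluates parts per partition and computes an actual delay, and the threshold values in the usage example are illustrative, not defaults.

```python
def classify_insert(active_parts: int, parts_to_delay_insert: int, parts_to_throw_insert: int) -> str:
    """Classify an insert as accepted, delayed, or rejected.

    Simplified model (assumption): compare the active part count of the
    target partition against the two thresholds.
    """
    if active_parts >= parts_to_throw_insert:
        return "rejected"  # corresponds to a Too many parts error
    if active_parts >= parts_to_delay_insert:
        return "delayed"   # the insert is slowed down, not refused
    return "accepted"


# Illustrative thresholds only; read the real values from system.merge_tree_settings.
for parts in (100, 200, 300):
    print(parts, classify_insert(parts, parts_to_delay_insert=150, parts_to_throw_insert=300))
```

In this model, a growing count of delayed inserts is an early warning that rejections will follow if the part count keeps rising.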
9. Queries (running)
Description: Number of running queries per server and cluster-wide.
Metric: chi_clickhouse_metric_Query.
Unit of measurement: Count.
10. Select Queries (started per sec)
Description: Number of SELECT queries per second.
Metric: chi_clickhouse_event_SelectQuery.
Unit of measurement: Queries per second.
11. Memory for Queries
Description: Total memory allocated for running queries. Some memory allocations may not be accounted for.
Metric: chi_clickhouse_metric_MemoryTracking.
Unit of measurement: Bytes.
12. Insert Queries (running)
Description: Number of running INSERT queries. It does not include queries that failed parsing or were rejected due to limits, but does include internal ClickHouse®-initiated queries.
Metric: chi_clickhouse_event_InsertQuery.
Unit of measurement: Queries per minute.
13. Insert Queries (started per sec)
Description: Number of INSERT queries per second.
Metric: chi_clickhouse_event_InsertQuery.
Unit of measurement: Queries per second.
14. Rows Inserted
Description: Number of rows inserted into tables.
Metric: chi_clickhouse_event_InsertedRows.
Unit of measurement: Rows per minute.
15. Replication Queue Jobs
Description: Rate of data part exchange between replicas.
Metrics:
- chi_clickhouse_event_ReplicatedDataLoss: Data loss during replication.
- chi_clickhouse_event_ReplicatedPartChecks: Counter of data part checks.
- chi_clickhouse_event_ReplicatedPartChecksFailed: Counter of failed data part checks.
- chi_clickhouse_event_ReplicatedPartFetches: Network replication activity.
- chi_clickhouse_event_ReplicatedPartFailedFetches: Counter of failed attempts to fetch data parts.
- chi_clickhouse_event_ReplicatedPartFetchesOfMerged: Fetching merged data parts.
- chi_clickhouse_event_ReplicatedPartMerges: Merging replicated data parts.
- chi_clickhouse_metric_ReplicasSumInsertsInQueue: Replication lag. It shows the number of pending queries in the queue.
- chi_clickhouse_metric_ReplicasSumMergesInQueue: Data merge lag. It shows the number of merges not yet completed by replicas.
Unit of measurement: Events per minute.
16. Max Replica Delay
Description: Replica lag relative to the current time for direct inserts into *ReplicatedMergeTree tables.
Metrics:
- chi_clickhouse_metric_ReplicasMaxAbsoluteDelay: Absolute lag, in seconds.
- chi_clickhouse_metric_ReplicasMaxRelativeDelay: Relative lag, in seconds.
Unit of measurement: Seconds.
17. Zookeeper Transactions
Description: Number of ZooKeeper transactions per second.
Metric: chi_clickhouse_event_ZooKeeperTransactions.
Unit of measurement: Transactions per second.
18. Merges
Description: Rate of background merges for data parts.
Metric: chi_clickhouse_event_Merge.
Unit of measurement: Merges per minute.
19. Merged Rows
Description: Number of rows processed in merging.
Metric: chi_clickhouse_event_MergedRows.
Unit of measurement: Rows per minute.
20. Merged Uncompressed Bytes
Description: Size of uncompressed data processed in merging.
Metric: chi_clickhouse_event_MergedUncompressedBytes.
Unit of measurement: Bytes per minute.
21. Active Parts
Description: Number of active data parts in tables.
Metric: chi_clickhouse_table_parts (filtered by active="1").
Unit of measurement: Count.
22. Detached parts
Description: Number of detached data parts, along with the reason for detachment.
Metrics:
- chi_clickhouse_metric_DetachedParts: Number of detached data parts.
- chi_clickhouse_table_parts (filtered by active="0"): Inactive parts.
Unit of measurement: Count.
Reasons for detachment:
- detached_by_user: Detached by the user.
- broken: Corrupted parts.
- clone: Cloned parts.
- ignored: Ignored parts.
23. Max Part count for Partition
Description: Maximum number of physical data parts per logical partition.
Metric: chi_clickhouse_metric_MaxPartCountForPartition.
Unit of measurement: Count.
24. clickhouse-server Process Memory
Description: Memory usage by clickhouse-server (available in ClickHouse® 20.4 and later).
Metrics:
- chi_clickhouse_metric_MemoryCode: Executable code (CODE).
- chi_clickhouse_metric_MemoryResident: Resident set size (RSS).
- chi_clickhouse_metric_MemoryShared: Shared memory (SHR).
- chi_clickhouse_metric_MemoryDataAndStack: Data and stack (DATA).
- chi_clickhouse_metric_MemoryVirtual: Virtual memory (VIRT).
Unit of measurement: Bytes.
Memory type description:
- VIRT: Total virtual memory (VIRT = SWAP + RSS).
- SWAP: Amount of memory swapped out.
- RSS: Physical memory not swapped out (RSS = CODE + DATA).
- CODE: Memory for executable code (text resident set).
- DATA: Memory for non-executable data (data resident set).
- SHR: Shared memory available to other processes.
25. Primary Keys Memory
Description: Memory allocated for primary key storage.
Metric: chi_clickhouse_metric_MemoryPrimaryKeyBytesAllocated.
Unit of measurement: Bytes.
26. Dictionary Memory
Description: Memory allocated for dictionaries.
Metric: chi_clickhouse_metric_MemoryDictionaryBytesAllocated.
Unit of measurement: Bytes.
27. Disk Space Free
Description: Free disk space ratio. Make sure to consider configurations with multiple volumes, Kubernetes volume claims, and Object Storage as the storage backend.
Metric: chi_clickhouse_metric_DiskFreeBytes / chi_clickhouse_metric_DiskTotalBytes.
Unit of measurement: Fraction (0–1).
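The ratio can be reproduced outside the dashboard from the two raw values. The sketch below is a minimal example for a single disk; the function names are hypothetical, and the 10% default mirrors the critical threshold mentioned under Monitoring best practices.

```python
def disk_free_ratio(free_bytes: float, total_bytes: float) -> float:
    """Return free space as a fraction between 0 and 1."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes


def is_disk_space_critical(free_bytes: float, total_bytes: float, threshold: float = 0.10) -> bool:
    """True when the free fraction drops below the threshold (10% by default)."""
    return disk_free_ratio(free_bytes, total_bytes) < threshold


# Example values in GiB; only the ratio matters, so any consistent unit works.
print(disk_free_ratio(25, 100))         # 0.25
print(is_disk_space_critical(5, 100))   # True
```

With multiple volumes, the check would need to run per disk, since an average can hide one nearly full volume.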
28. Table Stats
Description: Table statistics, such as data size, row count, number of parts, and average row size.
Metrics:
- chi_clickhouse_table_parts_bytes: Data size, in bytes.
- chi_clickhouse_table_parts_rows: Number of rows.
- chi_clickhouse_table_parts: Number of parts.
Unit of measurement:
- Bytes
- Rows
- Parts
- BytePerRow (calculated field)
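BytePerRow is not exported as a metric; it is derived from the other two. A plausible reading, assumed here, is that it divides the data size by the row count for each table. The helper below is illustrative only.

```python
from typing import Optional


def bytes_per_row(parts_bytes: float, parts_rows: float) -> Optional[float]:
    """Average row size; None for empty tables to avoid dividing by zero.

    Assumption: BytePerRow = parts_bytes / parts_rows.
    """
    if parts_rows <= 0:
        return None
    return parts_bytes / parts_rows


print(bytes_per_row(1_000_000, 50_000))  # 20.0
```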
29. Clickhouse Data size on Disk
Description: Total disk space used by *MergeTree tables.
Metric: chi_clickhouse_metric_DiskDataBytes.
Unit of measurement: Bytes.
30. Background Tasks
Description: Number of active background tasks.
Metrics:
- chi_clickhouse_metric_BackgroundPoolTask: Merge, mutation, data fetch, and replication queue management tasks.
- chi_clickhouse_metric_BackgroundSchedulePoolTask: Periodic ReplicatedMergeTree tasks, such as cleanup of old parts, part mutations, and replica reinitialization.
- chi_clickhouse_metric_BackgroundMovePoolTask: Data movement tasks.
Unit of measurement: Count.
31. Mutations
Description: Number of active mutations (ALTER DELETE/ALTER UPDATE) and data parts pending mutation.
Metrics:
- chi_clickhouse_table_mutations: Number of mutations.
- chi_clickhouse_table_mutations_parts_to_do: Number of parts pending mutation.
Unit of measurement: Count.
32. Marks Cache Hit Rate
Description: Cache hit rate for mark files (.mrk) read from memory rather than disk.
Metric: chi_clickhouse_event_MarkCacheHits / (chi_clickhouse_event_MarkCacheHits + chi_clickhouse_event_MarkCacheMisses).
Unit of measurement: Fraction (0–1).
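The hit rate formula can be checked by hand from the two counters. The sketch below assumes both values are increases over the same time window; the function name is hypothetical.

```python
from typing import Optional


def mark_cache_hit_rate(hits: float, misses: float) -> Optional[float]:
    """hits / (hits + misses); None when there were no lookups in the window."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# 90 hits and 10 misses in the window give a hit rate of 0.9.
print(mark_cache_hit_rate(90, 10))
```

Returning None for an idle window keeps "no reads at all" distinct from a genuinely poor hit rate of 0.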
33. CPU Time per second
Description: CPU time spent on different types of activity.
Metrics:
- chi_clickhouse_event_RealTimeMicroseconds: Real execution time.
- chi_clickhouse_event_UserTimeMicroseconds: User CPU time.
- chi_clickhouse_event_SystemTimeMicroseconds: System CPU time.
- chi_clickhouse_event_OSIOWaitMicroseconds: I/O wait time.
- chi_clickhouse_event_OSCPUWaitMicroseconds: CPU wait time.
- chi_clickhouse_event_OSCPUVirtualTimeMicroseconds: Virtual CPU time.
Unit of measurement: Microseconds per second.
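A value in microseconds per second is easier to read once converted to CPU cores: 1,000,000 microseconds of CPU time per second equals one fully busy core. The sketch below derives a per-second rate from two samples of a cumulative counter and converts it. It is a simplified stand-in for what a rate function does, and the reset handling is an assumption rather than the dashboard's exact logic.

```python
def microseconds_per_second(prev_value: float, curr_value: float, interval_seconds: float) -> float:
    """Per-second increase of a cumulative microsecond counter between two samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: the counter was reset (for example, a server restart),
        # so treat the current value as the increase since the reset.
        delta = curr_value
    return delta / interval_seconds


def busy_cores(us_per_second: float) -> float:
    """Convert microseconds of CPU time per second into the equivalent number of busy cores."""
    return us_per_second / 1_000_000


rate = microseconds_per_second(1_000_000, 31_000_000, 60)
print(rate, busy_cores(rate))  # 500000.0 0.5
```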
34. Network / Disk CPU Time per second
Description: CPU time spent on network and disk operations.
Metrics:
- chi_clickhouse_event_DiskReadElapsedMicroseconds: Disk read time.
- chi_clickhouse_event_DiskWriteElapsedMicroseconds: Disk write time.
- chi_clickhouse_event_NetworkReceiveElapsedMicroseconds: Network receive time.
- chi_clickhouse_event_NetworkSendElapsedMicroseconds: Network send time.
Unit of measurement: Microseconds per second.
35. Load Average 1m
Description: Average system load over one minute (Unix load average). Load is considered high if it approaches the number of available CPUs or the CPU limits allocated to the ClickHouse® pod.
Metric: chi_clickhouse_metric_LoadAverage1.
Unit of measurement: Dimensionless quantity.
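Because the raw load average only makes sense relative to CPU capacity, one option is to normalize it by the CPU count or pod CPU limit. The sketch below does that; the 0.8 cutoff for "approaching" the limit is an arbitrary assumption, not a value taken from the dashboard.

```python
def load_per_cpu(load_average_1m: float, cpu_limit: float) -> float:
    """Load average divided by the number of CPUs (or the pod CPU limit)."""
    if cpu_limit <= 0:
        raise ValueError("cpu_limit must be positive")
    return load_average_1m / cpu_limit


def is_load_high(load_average_1m: float, cpu_limit: float, ratio: float = 0.8) -> bool:
    """True when the normalized load reaches the chosen ratio (assumed 0.8 here)."""
    return load_per_cpu(load_average_1m, cpu_limit) >= ratio


print(is_load_high(3.5, 4))  # True: 3.5 / 4 = 0.875
print(is_load_high(1.0, 4))  # False: 1.0 / 4 = 0.25
```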
36. CPU Time total
Description: Total CPU time spent on various activities over the selected period.
Metrics:
- chi_clickhouse_event_DiskReadElapsedMicroseconds: Disk read time.
- chi_clickhouse_event_DiskWriteElapsedMicroseconds: Disk write time.
- chi_clickhouse_event_NetworkReceiveElapsedMicroseconds: Network receive time.
- chi_clickhouse_event_NetworkSendElapsedMicroseconds: Network send time.
- chi_clickhouse_event_RealTimeMicroseconds: Real query execution time.
- chi_clickhouse_event_UserTimeMicroseconds: User CPU time.
- chi_clickhouse_event_SystemTimeMicroseconds: System CPU time.
- chi_clickhouse_event_OSIOWaitMicroseconds: I/O wait time.
- chi_clickhouse_event_OSCPUWaitMicroseconds: CPU wait time.
- chi_clickhouse_event_OSCPUVirtualTimeMicroseconds: CPU time spent on virtual OS processes.
- chi_clickhouse_event_ThrottlerSleepMicroseconds: Throttler wait time.
- chi_clickhouse_event_DelayedInsertsMilliseconds: Time spent on delayed inserts.
- chi_clickhouse_event_ZooKeeperWaitMicroseconds: ZooKeeper wait time.
- chi_clickhouse_event_CompileExpressionsMicroseconds: Expression compilation time.
- chi_clickhouse_event_MergesTimeMilliseconds: Merge time.
- chi_clickhouse_event_RWLockReadersWaitMilliseconds: Read lock wait time.
- chi_clickhouse_event_RWLockWritersWaitMilliseconds: Write lock wait time.
- chi_clickhouse_event_SelectQueryTimeMicroseconds: Time spent running SELECT queries.
- chi_clickhouse_event_InsertQueryTimeMicroseconds: Time spent running INSERT queries.
- chi_clickhouse_event_S3ReadMicroseconds: Object Storage read time.
- chi_clickhouse_event_S3WriteMicroseconds: Object Storage write time.
Unit of measurement: Microseconds.
Interval: 1 minute.
37. Connections
Description: Different connection types per server.
Metrics:
- chi_clickhouse_metric_TCPConnection: TCP connections (native protocol).
- chi_clickhouse_metric_HTTPConnection: HTTP connections.
- chi_clickhouse_metric_InterserverConnection: Inter-server connections.
- chi_clickhouse_metric_MySQLConnection: MySQL connections.
Unit of measurement: Count.
Monitoring best practices
Critical metrics
The following metrics require immediate attention when they deviate from normal values:
- Failed Pods: Must be 0. Any non-zero value indicates server unavailability.
- ReadOnly replicas: Must be 0. Any non-zero value indicates replication issues.
- DNS and Distributed Connection Errors: Should be as low as possible. High values indicate network issues.
- Delayed/Rejected Inserts: High values of this metric indicate write performance issues.
- Disk Space Free: Monitor free space. Free space below 10% is critical.
Performance metrics
To evaluate your cluster performance, pay attention to:
- Queries (running): Number of concurrent queries.
- Memory for Queries: Query memory usage.
- CPU Time per second: CPU load.
- Marks Cache Hit Rate: Cache efficiency, which should stay above 90%.
Replication metrics
For replication status monitoring:
- Replication Queue Jobs: Replication queue size.
- Max Replica Delay: Replica lag.
- Zookeeper Transactions: ZooKeeper transaction rate.
Storage metrics
For disk space usage monitoring:
- Active Parts: Number of active data parts.
- Detached parts: Number of detached parts, which should be kept to a minimum.
- Max Part count for Partition: Number of parts per partition.
- Clickhouse Data size on Disk: Total data size.