Monitoring the state of a ClickHouse® cluster and its hosts

Updated at January 14, 2026

Monitoring the cluster state
- Available charts
Monitoring the state of hosts
Setting up alerts in Yandex Monitoring
Cluster state and status
- Cluster states
- Cluster statuses

Data on the cluster and host state is available in the management console. You can view it on the Monitoring tab of the cluster management page or in Yandex Monitoring.

Diagnostic information about cluster states is presented as graphs.

Chart update rate:

Standard hosts and hosts with an increased RAM to vCPU ratio (memory-optimized): 15 seconds.
Hosts with a guaranteed vCPU share under 100% (burstable): 150 seconds.

Note

Charts automatically use the most appropriate unit multiples (MB, GB, and so on).

You can configure alerts in Yandex Monitoring to receive notifications about cluster failures. In Yandex Monitoring, there are two alert thresholds: Warning and Alarm. If the specified threshold is exceeded, you will receive alerts via the configured notification channels.
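As an illustration of how the two thresholds relate, here is a minimal sketch that classifies a metric value against Warning and Alarm thresholds. The `classify` function and the `AlertLevel` names are hypothetical and are not part of Yandex Monitoring. The sketch assumes a metric where higher values are worse and treats "exceeded" as strictly greater than the threshold.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "OK"
    WARNING = "Warning"
    ALARM = "Alarm"


def classify(value: float, warning: float, alarm: float) -> AlertLevel:
    """Classify a metric value, assuming higher values are worse."""
    # Check the stricter threshold first so Alarm takes priority over Warning.
    if value > alarm:
        return AlertLevel.ALARM
    if value > warning:
        return AlertLevel.WARNING
    return AlertLevel.OK


# Example: a data-parts-per-partition value checked against 150 / 250.
level = classify(160, warning=150, alarm=250)
print(level.value)
```

For metrics where lower values are worse, such as the number of healthy hosts, the comparison would need to be reversed.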

Monitoring the cluster state

To view detailed information on the state of a Managed Service for ClickHouse® cluster:

Management console

In the management console, select the folder the cluster is in.
Go to Managed Service for ClickHouse.
Click the name of your cluster and select the Monitoring tab.

The page that opens will display the performance charts for your ClickHouse® cluster and its hosts.
To get started with Yandex Monitoring metrics, dashboards, or alerts, click Open in Monitoring in the top panel.

Available charts

If your cluster includes a coordination service, the Monitoring page shows the following tabs depending on the selected service:

ClickHouse: State information for the whole ClickHouse® cluster and its hosts.
Keeper: State information for ClickHouse® Keeper hosts.
ZooKeeper: State information for ZooKeeper hosts.

If the cluster consists only of ClickHouse® hosts, the Monitoring page will have the Clusters tab, which shows the same information as the ClickHouse tab.

Regardless of the cluster configuration, the Monitoring page also has the Hosts tab for detailed host state information.

ClickHouse

Keeper

ZooKeeper

The ClickHouse tab displays the following charts:

Under Summary:
- Select queries: Number of select queries per second for a cluster.
- Insert queries: Number of insert queries per second for a cluster.
- Total queries: Total number of queries per second for a cluster.
- Inserted data: Data insertion rate for a cluster.
- Read data: Data read rate for a cluster.
- Merged data: Data merge rate for a cluster.
- CPU usage: Number of CPU cores used in a ClickHouse® subcluster.
- Memory usage: Use of RAM in a ClickHouse® subcluster.
- Disk space usage: Disk space used in a ClickHouse® subcluster.
Under Queries:
- Select queries per host: Number of select queries per second for each cluster host.
- Insert queries per host: Number of insert queries per second for each cluster host.
- Total queries per host: Total number of queries per second for each host.
- Failed select queries per host: Percentage of failed select queries per ClickHouse® subcluster host.
- Failed insert queries per host: Percentage of failed insert queries per ClickHouse® subcluster host.
- Failed queries per host: Percentage of failed queries per ClickHouse® subcluster host.
- Average select query time per host: Average select query time per ClickHouse® subcluster host.
- Average insert query time per host: Average insert query time per ClickHouse® subcluster host.
- Average query time per host: Average query time per ClickHouse® subcluster host.
Under Connections and Locks:
- Connections per host: Number of connections per cluster host.
- Active locks per host: Number of active locks per cluster host.
- Waiting locks per host: Number of waiting locks per cluster host.
Under Data Traffic:
- Read data per host: Data read rate per cluster host.
- Inserted data per host: Data insertion rate per cluster host.
- Merged data per host: Data merge rate per cluster host.
- Read rows per host: Row read rate per second for each cluster host.
- Inserted rows per host: Row insertion rate per second for each cluster host.
- Merged rows per host: Row merge rate per second for each cluster host.
Under Storage:
- Disk space usage per host, bytes: Disk space used per ClickHouse® subcluster host.
- Disk space usage per host, %: Percentage of the disk space used per ClickHouse® subcluster host.
- Inode usage, %: Percentage of inodes used per ClickHouse® subcluster host.
- Databases: Number of databases per cluster host.
- Tables: Number of tables per cluster host.
- Rows of MergeTree tables: Number of rows in MergeTree tables per cluster host.
- Data parts: Number of data parts per cluster host.
- Detached data parts: Number of detached data parts per cluster host.
Under Replication and Background Data Processing:
- Max replication delay across tables: Maximum table replication delay per cluster host. Values greater than a few seconds may indicate excessive load or replication issues.
- Replication queue: Replication queue size per cluster host.
- Max data parts per partition: Maximum number of data parts per partition for each cluster host. This value is limited by the DBMS settings. Approaching the limit indicates excessive load or low efficiency of data insertion.
- Merges and mutations pool tasks: Number of active merge and mutation tasks in the background pool per ClickHouse® subcluster host.
- Fetches pool tasks: Number of active fetch tasks in the background pool per ClickHouse® subcluster host.
- Move pool tasks: Number of active move tasks in the background pool per ClickHouse® subcluster host.
Under System Resources:
- CPU usage per host, cores: Number of CPU cores used per ClickHouse® subcluster host.
- Memory usage per host, bytes: RAM used per ClickHouse® subcluster host.
- CPU usage per host, %: CPU core usage percentage per ClickHouse® subcluster host.
- Memory usage per host, %: Percentage of RAM used per ClickHouse® subcluster host.
- Disk read per host: Disk read rate per ClickHouse® subcluster host.
- Disk write per host: Disk write rate per ClickHouse® subcluster host.
- Disk usage per host: Speed of disk operations per ClickHouse® subcluster host.
- Network data received per host: Network data receive rate per ClickHouse® subcluster host.
- Network data sent per host: Network data send rate per ClickHouse® subcluster host.
- Network usage per host: Network data exchange rate per ClickHouse® subcluster host.

The Keeper tab displays the following charts:

Under Service Metrics:
- Transactions: Number of transactions per second.
- Outstanding requests per Keeper host: Number of requests being processed per ClickHouse® Keeper host.
- Connections per Keeper host: Number of connections per ClickHouse® Keeper host.
- Transactions per ClickHouse® host: Number of transactions per second for each ClickHouse® host.
- Average transaction time per ClickHouse® host: Average transaction time per ClickHouse® host. It shows the time ClickHouse® spends accessing ClickHouse® Keeper.
- Average latency per Keeper host: Average response latency per ClickHouse® Keeper host.
- Znodes: Number of znodes.
- Ephemeral nodes: Number of ephemeral nodes.
- Watches: Number of watches.
Under System Metrics:
- CPU usage per host, cores: Number of CPU cores used per host.
- Memory usage per host, bytes: RAM used per ClickHouse® Keeper subcluster host.
- CPU usage per host, %: CPU core workload per host.
- Memory usage per host, %: Percentage of RAM used per host.
- Disk read per host: Disk read rate per host.
- Disk write per host: Disk write rate per host.
- Disk usage per host: Speed of disk operations per host.
- Network usage per host: Network data exchange rate per host.
- Network data received per host: Network data receive rate per host.
- Network data sent per host: Network data send rate per host.

The ZooKeeper tab displays the following charts:

Under Service Metrics:
- Transactions: Number of transactions per second.
- Outstanding requests per ZooKeeper host: Number of requests being processed per ZooKeeper host.
- Connections per ZooKeeper host: Number of connections per ZooKeeper host.
- Transactions per ClickHouse® host: Number of transactions per second for each ClickHouse® host.
- Average transaction time per ClickHouse® host: Average transaction time per ClickHouse® host. It shows the time ClickHouse® spends accessing ZooKeeper.
- Average latency per ZooKeeper host: Average response latency per ZooKeeper host.
- Znodes: Number of znodes.
- Ephemeral nodes: Number of ephemeral nodes.
- Watches: Number of watches.
Note

For more information about znodes, ephemeral nodes, and watches, see this ZooKeeper guide.
Under System Metrics:
- CPU usage per host, cores: Number of CPU cores used per host.
- Memory usage per host, bytes: RAM used per ZooKeeper subcluster host.
- CPU usage per host, %: CPU core workload per host.
- Memory usage per host, %: Percentage of RAM used per host.
- Disk read per host: Disk read rate per host.
- Disk write per host: Disk write rate per host.
- Disk usage per host: Speed of disk operations per host.
- Network usage per host: Network data exchange rate per host.
- Network data received per host: Network data receive rate per host.
- Network data sent per host: Network data send rate per host.

Monitoring the state of hosts

To view detailed information on the state of individual Managed Service for ClickHouse® hosts:

Management console

In the management console, select the folder the cluster is in.
Go to Managed Service for ClickHouse.
Click the name of your cluster and select the Monitoring tab.
Navigate to the Hosts tab and select the host.

Each host is labeled with its type: CLICKHOUSE, KEEPER, or ZOOKEEPER.

To get started with Yandex Monitoring metrics, dashboards, or alerts, click Open in Monitoring in the top panel.

ClickHouse®

Keeper

ZooKeeper

The following charts are displayed for ClickHouse® hosts:

Under Service Metrics:
- Availability: Host availability.
- Queries: Number of queries per second for each type.
- Connections: Number of HTTP and TCP connections.
- Failed queries: Percentage of failed queries for each type.
- Average query time: Average query time for each type.
- Locks: Number of active and waiting read and write locks.
- Processed data: Speed of reading, inserting, and merging data.
- Processed rows: Speed of reading, inserting, and merging rows per second.
- Background tasks: Number of merge and mutation, fetch, and move tasks in the background pool.
- Max replication delay across tables: Maximum replication delay across tables. Values greater than a few seconds may indicate excessive load or replication issues.
- Replication queue: Replication queue size.
- Max data parts per partition: Maximum number of data parts per partition. This value is limited by the DBMS settings. Approaching the limit indicates excessive load or low efficiency of data insertion.
- Threads: Number of threads used by ClickHouse®.
Under System Metrics:
- CPU usage, %: CPU core usage percentage.
- Memory usage, %: RAM usage percentage.
- Disk space usage, %: Disk space usage percentage.
- CPU usage, cores: Number of CPU cores used.
- Memory usage, bytes: RAM usage.
- Disk space usage, bytes: Disk space usage.
- Disk throughput: Disk throughput.
- Disk IOPS: Number of disk read and write operations.
- Network throughput: Network throughput.

The following charts are displayed for ClickHouse® Keeper hosts:

Under Service Metrics:
- Availability: Host availability.
- Role: Host role, Leader or Follower, in a ClickHouse® Keeper subcluster.
- Objects: Number of znodes, ephemeral nodes, and watches.
- Connections: Number of active database connections to the host.
- Outstanding requests: Number of outstanding requests.
- Latency: Number of requests with delayed response.
Under System Metrics:
- CPU usage, %: CPU core usage percentage.
- Memory usage, %: RAM usage percentage.
- Disk space usage, %: Disk space usage percentage.
- CPU usage, cores: Number of CPU cores used.
- Memory usage, bytes: RAM usage.
- Disk space usage, bytes: Disk space usage.
- Disk throughput: Disk throughput.
- Disk IOPS: Number of disk read and write operations.
- Network throughput: Network throughput.

The following charts are displayed for ZooKeeper hosts:

Under Service Metrics:
- Availability: Host availability.
- Role: Host role, Leader or Follower, in a ZooKeeper subcluster.
- Objects: Number of znodes, ephemeral nodes, and watches.
- Connections: Number of active database connections to the host.
- Outstanding requests: Number of outstanding requests.
- Request time: Read and write operation processing time.
Under System Metrics:
- CPU usage, %: CPU core usage percentage.
- Memory usage, %: RAM usage percentage.
- Disk space usage, %: Disk space usage percentage.
- CPU usage, cores: Number of CPU cores used.
- Memory usage, bytes: RAM usage.
- Disk space usage, bytes: Disk space usage.
- Disk throughput: Disk throughput.
- Disk IOPS: Number of disk read and write operations.
- Network throughput: Network throughput.

Setting up alerts in Yandex Monitoring

Management console

In the management console, select the folder containing the cluster for which you want to set up alerts.
Go to Monitoring.
Under Service dashboards, select:
- Managed Service for ClickHouse® — Cluster Overview to set up cluster alerts.
- Managed Service for ClickHouse® — ZooKeeper to set up ZooKeeper host alerts.
- Managed Service for ClickHouse® — Host Overview to set up host alerts.
In the relevant metrics chart, open the chart menu and select Create alert.
If the chart shows multiple metrics, select the data query to generate a metric and click Continue. For more information about the query language, see this Yandex Monitoring guide.
Set the Alarm and Warning thresholds to trigger the alert.
Click Create alert.

To have other cluster health indicators monitored automatically:

Management console

Create an alert.
Add a status metric.
In the alert parameters, set the alert thresholds.

Below are the recommended thresholds for some metrics:

Metric	Internal metric name	`Alarm`	`Warning`
Maximum number of data parts per partition	`ch_system_async_metrics_MaxPartCountForPartition`	`250`	`150`
Number of failed queries	`ch_system_events_FailedQuery_rate`	20% of the total number of queries	10% of the total number of queries
Storage space used	`disk.used_bytes`	95% of the storage size	80% of the storage size
Number of healthy hosts	`is_alive`	`<number_of_hosts> - 2`	`<number_of_hosts> - 1`

To determine the thresholds for the ch_system_events_FailedQuery_rate metric, use Total queries for the cluster.
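The relative thresholds in the table can be turned into absolute values with a short calculation. The sketch below is only an illustration: the function name and its parameters are hypothetical, and the total query rate is assumed to be a value you read from the Total queries chart.

```python
def recommended_thresholds(total_queries_per_second: float,
                           number_of_hosts: int) -> dict[str, dict[str, float]]:
    """Derive absolute thresholds from the recommended relative values."""
    return {
        # 20% and 10% of the total number of queries.
        "ch_system_events_FailedQuery_rate": {
            "alarm": total_queries_per_second * 20 / 100,
            "warning": total_queries_per_second * 10 / 100,
        },
        # <number_of_hosts> - 2 and <number_of_hosts> - 1.
        "is_alive": {
            "alarm": number_of_hosts - 2,
            "warning": number_of_hosts - 1,
        },
    }


# Example: 500 queries per second in total on a three-host cluster.
for metric, levels in recommended_thresholds(500, 3).items():
    print(metric, levels)
```

For `is_alive`, the alert is meant to fire when the number of healthy hosts drops to the threshold, so the comparison direction is the opposite of the one used for the failed query rate.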

For the disk.used_bytes metric, the Alarm and Warning thresholds are only set in bytes. For example, the recommended values for a 100 GB disk are as follows:

Alarm: 102,005,473,280 bytes (95%)
Warning: 85,899,345,920 bytes (80%)
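The byte values above correspond to a disk size counted in binary units, where 1 GB is taken as 2^30 bytes. A minimal sketch of the same arithmetic for any disk size follows; the function name is hypothetical, and integer arithmetic is used so that the results are exact.

```python
GIB = 2**30  # bytes in one binary gigabyte, as used in the example above


def disk_thresholds_bytes(storage_size_gb: int) -> tuple[int, int]:
    """Return (alarm, warning) thresholds in bytes: 95% and 80% of the disk."""
    size_bytes = storage_size_gb * GIB
    alarm = size_bytes * 95 // 100
    warning = size_bytes * 80 // 100
    return alarm, warning


alarm, warning = disk_thresholds_bytes(100)
print(f"Alarm: {alarm:,} bytes")      # Alarm: 102,005,473,280 bytes
print(f"Warning: {warning:,} bytes")  # Warning: 85,899,345,920 bytes
```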

You can check the current storage size in the cluster details. For a complete list of supported metrics, see this Monitoring guide.

Cluster state and status

The State of a cluster shows the health of its hosts, while the Status shows whether the cluster is started, stopped, or is at an intermediate stage.

To check the cluster’s state and status:

Management console

In the management console, select the folder the cluster is in.
Go to Managed Service for ClickHouse.
In the cluster row, hover over the indicator in the Availability column.

Cluster states

State	Description	Suggested actions
ALIVE	Cluster is operating normally.	No action is required.
DEGRADED	Cluster is not running at its full capacity: the state of at least one of the hosts is other than `ALIVE`.	Run the diagnostics: Go to the Hosts tab and see which hosts are not working. Go to the Operations tab and make sure all operations are completed. Make sure the cluster is not under maintenance. If you cannot find the cause yourself, contact support.
DEAD	The cluster is down: none of its hosts are running.	Make a support request stating the following: Cluster ID. IDs of the last operations performed on it. Time the cluster entered the `DEAD` state according to the availability charts.
UNKNOWN	Cluster state is unknown.	Make a support request stating the following: Cluster ID. IDs of the last operations performed on it. Time the cluster entered the `UNKNOWN` state according to the availability charts.
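The descriptions in the table can be read as a simple rule over the states of individual hosts. The sketch below only illustrates that reading and is not the service's actual logic: the function name is hypothetical, and treating an empty host list as `UNKNOWN` is an assumption.

```python
def cluster_state(host_states: list[str]) -> str:
    """Derive a cluster state from host states, following the table's descriptions."""
    if not host_states:
        # Assumption: with no host data, the state cannot be determined.
        return "UNKNOWN"
    alive = [state == "ALIVE" for state in host_states]
    if all(alive):
        return "ALIVE"      # every host is operating normally
    if not any(alive):
        return "DEAD"       # none of the hosts are running
    return "DEGRADED"       # at least one host is not ALIVE


print(cluster_state(["ALIVE", "ALIVE", "DEAD"]))  # DEGRADED
```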

Cluster statuses

Status	Description	Suggested actions
CREATING	Preparing for the first start	Wait a while and get started. The time it takes to create a cluster depends on the host class.
RUNNING	The cluster is operating normally	No action is required.
STOPPING	The cluster is stopping	After a while, the cluster status will switch to `STOPPED` and the cluster will be disabled. No action is required.
STOPPED	The cluster is stopped	Start the cluster to get it running again.
STARTING	Starting the cluster that was stopped earlier	After a while, the cluster status will switch to `RUNNING`. Wait a while and get started.
UPDATING	Updating the cluster's configuration	Once the update is complete, the cluster will get the status it had prior to the update: `RUNNING` or `STOPPED`.
ERROR	Error when performing an operation with the cluster or during a maintenance window	If the cluster remains in this status for a long time, contact support. You can see whether a cluster is available by its status.
STATUS_UNKNOWN	The cluster is unable to determine its status	If the cluster remains in this status for a long time, contact support.
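For scripts that react to the cluster status, it can help to group the statuses by the kind of action the table suggests. The grouping and the names below are hypothetical and reflect one reading of the table, not an official classification.

```python
# Statuses that resolve on their own after a while.
TRANSITIONAL = {"CREATING", "STOPPING", "STARTING", "UPDATING"}
# Statuses that may require contacting support if they persist.
NEEDS_ATTENTION = {"ERROR", "STATUS_UNKNOWN"}


def suggested_action(status: str) -> str:
    """Map a cluster status to a short summary of the suggested action."""
    if status == "RUNNING":
        return "No action is required."
    if status == "STOPPED":
        return "Start the cluster to get it running again."
    if status in TRANSITIONAL:
        return "Wait for the operation to complete."
    if status in NEEDS_ATTENTION:
        return "If the status persists for a long time, contact support."
    raise ValueError(f"Unexpected cluster status: {status}")


print(suggested_action("UPDATING"))  # Wait for the operation to complete.
```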

ClickHouse® is a registered trademark of ClickHouse, Inc.
