Monitoring the state of ClickHouse® clusters and hosts
Data on cluster and host states are available in the management console. You can view them on the Monitoring tab of the cluster management page or in Yandex Monitoring.
Diagnostic information about cluster states is presented as graphs.
Chart update rate:
- Standard hosts and hosts with an increased RAM to vCPU ratio (memory-optimized): 15 seconds.
- Hosts with a guaranteed vCPU share under 100% (burstable): 150 seconds.
Note
The most appropriate multiple units (MB, GB, and more) are automatically used in charts.
You can configure alerts in Yandex Monitoring to receive notifications about cluster failures. Yandex Monitoring has two alert thresholds: Warning and Alarm. If a metric exceeds the specified threshold, you will receive an alert via the configured notification channels.
Monitoring cluster state
To view detailed information about the Managed Service for ClickHouse® cluster state:
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and open the Monitoring tab.
- To get started with Yandex Monitoring metrics, dashboards, or alerts, click Open in Monitoring in the top panel.
If the cluster uses a mechanism for replication and distribution of queries using ZooKeeper, the Monitoring tab displays the following two tabs:
- ClickHouse: Charts that show the operation of the entire ClickHouse® cluster and hosts.
- ZooKeeper: Charts that indicate the performance of ZooKeeper hosts.
If replication is not used or ClickHouse® Keeper is used, charts from the ClickHouse tab are displayed on the general Monitoring tab.
The ClickHouse tab contains the following charts:
- Active locks per host: Number of active locks per host.
- Average insert query time per host: Average time it takes to execute insert queries on each host.
- Average query time per host: Average time it takes to execute queries on each host.
- Average select query time per host: Average time it takes to execute select queries on each host.
- Connections per host: Number of connections on each host.
- Data parts: Number of data parts.
- Databases: Number of databases.
- CPU cores usage: Number of processor cores used.
- CPU cores usage per host: Number of processor cores used on each host.
- CPU usage per host: Usage of processor cores on each host.
- Disk read per host: Disk read rate on each host (in bytes per second).
- Disk space usage: Shows how much disk space is used (in bytes).
- Disk space usage per host: Shows how much disk space is used on each host (two charts are displayed, in bytes and %).
- Disk usage per host: Speed of disk operations on each host (in bytes per second).
- Disk write per host: Disk write rate on each host (in bytes per second).
- Failed insert queries per host: Percentage of failed insert queries per second on each host.
- Failed queries per host: Total percentage of failed queries per second on each host.
- Failed select queries per host: Percentage of failed select queries per second on each host.
- Inode usage: Number of inodes used (two charts are displayed, in pieces and %).
- Insert queries: Number of insert queries per second.
- Insert queries per host: Number of insert queries per second on each host.
- Inserted data: Data insertion rate (in bytes per second).
- Inserted data per host: Data insertion rate on each host (in bytes per second).
- Inserted rows per host: Row insertion rate on each host (in rows per second).
- Max data parts per partition: Highest number of data parts per partition among all tables. The limit for this value is set in the DBMS settings. Approaching the limit indicates excessive load or low efficiency of data insertion.
- Max replication delay across tables: Longest replication delay across tables. Values greater than a few seconds may indicate excessive load or replication issues.
- Memory usage: Use of RAM, in bytes.
- Memory usage per host: Shows how much RAM is used on each host (two charts are displayed, in bytes and %).
- Merged data: Data merge rate (in bytes per second).
- Merged data per host: Data merge rate on each host (in bytes per second).
- Merged rows per host: Row merge rate on each host (in rows per second).
- Network data received per host: Network data receipt rate on each host (in bytes per second).
- Network data sent per host: Network data send rate on each host (in bytes per second).
- Network usage per host: Rate of data exchange over the network on each host (in bytes per second).
- Read data: Data read rate (in bytes per second).
- Read data per host: Data read rate on each host (in bytes per second).
- Read rows per host: Data read rate on each host (in rows per second).
- Replication queue: Total replication queue of all tables.
- Rows of MergeTree tables: Number of rows in MergeTree tables.
- Select queries: Number of select queries per second.
- Select queries per host: Number of select queries per second on each host.
- Tables: Number of tables.
- Total queries: Total number of queries per second.
- Queries per host: Total number of queries per second on each host.
- Waiting locks per host: Number of pending locks on each host.
The ZooKeeper tab contains the following charts:
- Average latency per ZooKeeper host: Average time it takes each ZooKeeper host to respond.
- Average transaction time per ClickHouse® host: Average time it takes each ClickHouse® host to execute a transaction. It indicates the time ClickHouse® spends to access ZooKeeper.
- Connections per ZooKeeper host: Number of connections on each ZooKeeper host.
- CPU cores usage: Number of processor cores used.
- CPU cores usage per host: Number of processor cores used on each host.
- CPU usage per host: Usage of processor cores on each host.
- Disk read per host: Disk read rate on each host (in bytes per second).
- Disk space usage: Shows how much disk space is used (in bytes). Two charts are displayed: one for each host and one for all hosts.
- Disk space usage per host: Shows how much disk space is used on each host (%).
- Disk usage per host: Speed of disk operations on each host (in bytes per second).
- Disk write per host: Disk write rate on each host (in bytes per second).
- Ephemeral nodes: Number of Ephemeral nodes.
Note
For more information about Znodes, Ephemeral nodes, and Watches, see the ZooKeeper documentation.
- Memory usage: Use of RAM, in bytes.
- Memory usage per host: Shows how much RAM is used on each host (two charts are displayed, in bytes and %).
- Network data received per host: Network data receipt rate on each host (in bytes per second).
- Network data sent per host: Network data send rate on each host (in bytes per second).
- Network usage per host: Rate of data exchange over the network on each host (in bytes per second).
- Outstanding requests per ZooKeeper host: Number of requests being processed on each ZooKeeper host.
- Transactions: Number of transactions per second.
- Transactions per ClickHouse® host: Number of transactions per second on each ClickHouse® host.
- Watches: Number of Watches.
- Znodes: Number of Znodes.
Monitoring the state of hosts
To view detailed information about the state of individual Managed Service for ClickHouse® hosts:
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Click the cluster name and open the Hosts tab.
- Select the Monitoring tab.
- Select the host from the drop-down list. The host type is shown next to its name: CLICKHOUSE or ZOOKEEPER.
This page displays charts showing the load on an individual host in the cluster:
- CPU usage: Usage of processor cores. As the load goes up, the Idle value goes down.
- Disk read/write bytes: Speed of disk operations, in bytes per second.
- Disk IOPS: Number of disk operations per second.
- Memory usage: Use of RAM, in bytes. At high loads, the value of the Free parameter goes down while those of other parameters go up.
- Network bytes: Speed of data exchange over the network, in bytes per second.
- Network packets: Number of packets exchanged over the network, per second.
Alert settings in Yandex Monitoring
- In the management console, select the folder with the cluster you want to configure alerts for.
- In the list of services, select Monitoring.
- Under Service dashboards, select:
  - Managed Service for ClickHouse® — Cluster Overview to configure cluster alerts.
  - Managed Service for ClickHouse® — ZooKeeper to configure ZooKeeper host alerts.
  - Managed Service for ClickHouse® — Host Overview to configure host alerts.
- In the appropriate chart with metrics, open the chart menu and select Create alert.
- If the chart shows multiple metrics, select a data query to generate a metric and click Continue. For more information about the query language, see the Yandex Monitoring documentation.
- Set the Alarm and Warning threshold values to trigger the alert.
- Click Create alert.
To have other cluster health indicators monitored automatically:
- Create an alert.
- Add a status metric.
- In the alert parameters, set up your alert thresholds.
The recommended thresholds are as follows:
Metric | Parameter | Alarm | Warning |
---|---|---|---|
Maximum number of data parts per partition | ch_system_async_metrics_MaxPartCountForPartition | 250 | 150 |
Number of failed queries | ch_system_events_FailedQuery_rate | 20% of the total number of queries | 10% of the total number of queries |
Storage space used | disk.used_bytes | 95% of the storage size | 80% of the storage size |
Number of healthy hosts | is_alive | <host_count>: 2 | <host_count>: 1 |
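As a rough illustration of how the two thresholds in the table split a metric's range, the sketch below classifies samples of ch_system_async_metrics_MaxPartCountForPartition against the recommended values of 150 and 250. The function name and the result labels are hypothetical, and the sketch assumes a "greater than or equal to" comparison for a metric where higher values are worse. It does not reproduce how Yandex Monitoring evaluates alerts internally.

```python
def classify(value: float, warning: float, alarm: float) -> str:
    """Return "ALARM", "WARNING", or "OK" for a metric where higher is worse.

    Assumption: a threshold counts as crossed when the value reaches it.
    """
    if value >= alarm:
        return "ALARM"
    if value >= warning:
        return "WARNING"
    return "OK"


# Recommended thresholds for ch_system_async_metrics_MaxPartCountForPartition.
MAX_PARTS_WARNING = 150
MAX_PARTS_ALARM = 250

# Hypothetical samples of the metric.
for sample in (90, 180, 260):
    print(sample, classify(sample, MAX_PARTS_WARNING, MAX_PARTS_ALARM))
```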
To determine the threshold values for the ch_system_events_FailedQuery_rate metric, use the Total queries chart for the cluster.
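The conversion from the recommended percentages to absolute threshold values might look like the sketch below. The function name and the sample rate of 400 queries per second are hypothetical; in practice you would use the typical value you observe in Total queries for your own cluster.

```python
def failed_query_thresholds(total_queries_per_second: float) -> dict:
    """Convert the recommended 20% / 10% shares into absolute rates."""
    return {
        "alarm": total_queries_per_second * 20 / 100,
        "warning": total_queries_per_second * 10 / 100,
    }


# Hypothetical reading from Total queries: 400 queries per second.
thresholds = failed_query_thresholds(400)
print(f"Alarm: {thresholds['alarm']} failed queries per second")
print(f"Warning: {thresholds['warning']} failed queries per second")
```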
For the disk.used_bytes metric, the Alarm and Warning thresholds are only set in bytes. For example, the recommended values for a 100 GB disk are as follows:
- Alarm: 102,005,473,280 bytes (95%)
- Warning: 85,899,345,920 bytes (80%)
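The figures for a 100 GB disk can be reproduced with a few lines of arithmetic. The sketch below assumes that 1 GB is counted as 2^30 bytes, which is what the example values imply; the function name is hypothetical.

```python
GB = 1024 ** 3  # assumption: 1 GB = 2**30 bytes, as implied by the example values


def disk_thresholds(storage_gb: int) -> tuple:
    """Return (alarm, warning) thresholds in bytes: 95% and 80% of the storage size."""
    total_bytes = storage_gb * GB
    alarm = total_bytes * 95 // 100
    warning = total_bytes * 80 // 100
    return alarm, warning


alarm, warning = disk_thresholds(100)
print(f"Alarm: {alarm:,} bytes")      # Alarm: 102,005,473,280 bytes
print(f"Warning: {warning:,} bytes")  # Warning: 85,899,345,920 bytes
```

Integer arithmetic is used so that the result is exact in bytes and does not pick up floating-point rounding.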
You can view the current storage size in the detailed information about the cluster. For a complete list of supported metrics, see the Monitoring documentation.
Cluster state and status
The State of a cluster shows the health of its hosts, while the Status shows whether the cluster is started, stopped, or is at an intermediate stage.
To view a cluster's state and status:
- In the management console, go to the folder page and select Managed Service for ClickHouse.
- Hover over the indicator in the Availability column in the required cluster row.
Cluster states
State | Description | Suggested actions |
---|---|---|
ALIVE | Cluster is operating normally. | No action is required. |
DEGRADED | Cluster is not running at its full capacity: the state of at least one of the hosts is other than ALIVE. | Run the diagnostics. |
DEAD | The cluster is down: none of its hosts are running. | Make a support request. |
UNKNOWN | Cluster state is unknown. | Make a support request. |
Cluster statuses
Status | Description | Suggested actions |
---|---|---|
CREATING | Preparing for the first launch | Wait a while and get started. The time it takes to create a cluster depends on the host class. |
RUNNING | Cluster is operating normally | No action is required. |
STOPPING | Stopping cluster | After a while, the cluster status will change to STOPPED and the cluster will be disabled. No action is required. |
STOPPED | Cluster stopped | Start the cluster to get it running again. |
STARTING | Starting the cluster that was stopped earlier | After a while, the cluster status will change to RUNNING. Wait a while and get started. |
UPDATING | Updating the cluster status | After the update is completed, the cluster status will change to RUNNING. Wait a while and get started. |
ERROR | An error occurred that does not allow the cluster to continue working | Run the initial diagnostics. |
STATUS_UNKNOWN | Cluster is unable to determine its own status | Run the initial diagnostics. |
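If you process cluster statuses in your own scripts, the table above can be encoded as a simple lookup. This is a minimal sketch: the dictionary, the function name, and the paraphrased action strings are hypothetical, and the fallback for unrecognized status strings is an assumption rather than documented behavior.

```python
# Suggested actions per cluster status, paraphrased from the table above.
WAIT = "Wait for the operation to finish."
DIAGNOSE = "Run the initial diagnostics."

ACTIONS = {
    "CREATING": WAIT,
    "RUNNING": "No action is required.",
    "STOPPING": "No action is required.",
    "STOPPED": "Start the cluster.",
    "STARTING": WAIT,
    "UPDATING": WAIT,
    "ERROR": DIAGNOSE,
    "STATUS_UNKNOWN": DIAGNOSE,
}


def suggested_action(status: str) -> str:
    """Look up the suggested action for a status string.

    Assumption: strings that are not in the table are treated like
    STATUS_UNKNOWN and mapped to the diagnostics step.
    """
    return ACTIONS.get(status.upper(), DIAGNOSE)


for status in ("RUNNING", "STOPPED", "ERROR"):
    print(status, "->", suggested_action(status))
```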
ClickHouse® is a registered trademark of ClickHouse, Inc.