Managed Service for YTsaurus cluster state monitoring

Updated on July 16, 2026

Cluster state monitoring
Setting up alerts in Yandex Monitoring
Cluster status

Data on the cluster and host state is available in the management console. You can view it on the Monitoring tab of the cluster management page or in Yandex Monitoring.

Diagnostic information about the cluster state is presented as charts.

Charts are updated every 15 seconds.

Note

Charts automatically use the most appropriate unit multiples (MB, GB, and so on).
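
As an illustration of this kind of automatic scaling, here is a minimal Python sketch that picks a unit for a raw byte count. The `format_bytes` function, the unit list, and the step of 1024 are assumptions made for the example; the service does not document how its charts choose units.

```python
def format_bytes(value: float) -> str:
    """Scale a byte count to the largest unit that keeps the number >= 1."""
    units = ["B", "KB", "MB", "GB", "TB"]  # assumed unit ladder
    step = 1024.0  # assumption: binary multiples
    index = 0
    # Divide until the value fits the current unit or the ladder ends.
    while abs(value) >= step and index < len(units) - 1:
        value /= step
        index += 1
    return f"{value:.1f} {units[index]}"


print(format_bytes(1536))          # a value just above one kilobyte
print(format_bytes(5 * 1024**3))   # a value of five gigabytes
```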

You can configure alerts in Yandex Monitoring to receive notifications about cluster failures. In Yandex Monitoring, there are two alert thresholds: Warning and Alarm. If the specified threshold is exceeded, you will receive alerts via the configured notification channels.
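
The two-threshold logic can be sketched as follows. This is a minimal illustration, not the Yandex Monitoring implementation: the `alert_status` function and the `OK` label are hypothetical, and the sketch assumes that higher values are worse and that a threshold is "exceeded" only when the value is strictly greater.

```python
def alert_status(value: float, warning: float, alarm: float) -> str:
    """Classify a metric value against Warning and Alarm thresholds."""
    if warning > alarm:
        # Assumption: Alarm is the stricter of the two thresholds.
        raise ValueError("warning threshold must not exceed alarm threshold")
    if value > alarm:
        return "ALARM"
    if value > warning:
        return "WARNING"
    return "OK"  # hypothetical label for "no threshold exceeded"


# Example: CPU utilization of 75% with Warning at 70% and Alarm at 90%.
print(alert_status(75, warning=70, alarm=90))
```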

Cluster state monitoring

To view detailed information on the health state of a Managed Service for YTsaurus cluster:

Management console

In the management console, navigate to the relevant folder.
In the list of services, select Managed Service for YTsaurus.
Click the name of your cluster and select the Monitoring tab.
To get started with metrics, dashboards, or alerts, select Yandex Monitoring in the list of services.

The page has two tabs, Cluster and Master, with the following charts:

Automaton thread CPU usage: CPU utilization by the automaton thread, i.e., the master primary execution thread, as a percentage.
- TOTAL: Aggregate CPU utilization across all master nodes.
- ms-N: CPU utilization by the automaton thread on master node N.
User write time: Time spent on user-initiated write operations.
- root: Time spent on root user write operations.
- robot-odin: Time spent on robot-odin user write operations.
- scheduler: Time spent on scheduler write operations.
- tablet_cell_changelogger: Time spent on tablet_cell_changelogger user write operations.
User read time: Time spent on user-initiated read operations.
- tablet_cell_changelogger: Time spent on tablet_cell_changelogger user read operations.
- root: Time spent on root user read operations.
- operations_client: Time spent on operations_client user read operations.
Master memory usage: Memory usage by the master, GB.
- ms-N: Memory usage on master node N.
LocalRead thread CPU usage: CPU utilization by the LocalRead thread, as a percentage.
- TOTAL: Aggregate CPU utilization across all master nodes.
- ms-N: CPU utilization by the LocalRead thread on master node N.
User write request rate: User-initiated write request rate, writes per second.
- root: root user write request rate.
- robot-odin: robot-odin user write request rate.
- scheduler: Scheduler-initiated write request rate.
- tablet_cell_changelogger: tablet_cell_changelogger user write request rate.
User read request rate: User-initiated read request rate, reads per second.
- root: root user read request rate.
- queue_agent: queue_agent user read request rate.
- operations_client: operations_client user read request rate.
Create chunk requests: Create chunk request rate, requests per second.
- TOTAL: Total number of create chunk requests from all users.
- root: Number of root user create chunk requests.
- robot-odin: Number of robot-odin user create chunk requests.
- scheduler: Number of scheduler-initiated create chunk requests.
- tablet_cell_snapshotter: Number of tablet_cell_snapshotter user create chunk requests.
Under Compute resources:
- CPU distributed vs total: Total and distributed CPU resources in the cluster.
  - distributes_resources: Amount of distributed CPU resources.
  - total_cluster_resources: Total CPU resources in the cluster.
- GPU distributed vs total: Total and distributed GPU resources in the cluster.
  - distributes_resources: Amount of distributed GPU resources.
  - total_cluster_resources: Total GPU resources in the cluster.
- Memory distributed vs total: Total and distributed memory resources in the cluster.
  - distributes_resources: Amount of distributed memory resources.
  - total_cluster_resources: Total memory resources in the cluster.
- Undistributed_resources: Amount of undistributed resources.
  - memory: Amount of undistributed memory resources.
  - cpu: Amount of undistributed CPU resources.
Under Scheduler pools:
- CPU: usage, demand, guarantee: CPU resource usage, demand, and guarantee.
  - usage: CPU resources currently in use.
  - demand: Requested CPU resources.
  - guarantee: Guaranteed CPU resources.
- RAM: usage, demand, guarantee: RAM resource usage, demand, and guarantee.
  - usage: RAM currently in use.
  - demand: Requested RAM.
  - guarantee: Guaranteed RAM.
- GPU: usage, demand, guarantee: GPU resource usage, demand, and guarantee.
  - usage: GPU resources currently in use.
  - demand: Requested GPU resources.
  - guarantee: Guaranteed GPU resources.
- Running operation count: Number of running operations.
  - lightweight usage: Number of running lightweight operations.
  - usage: Number of running operations.
- Total operation count: Total number of operations.
  - limit: Maximum allowed operations.
  - usage: Number of operations in use.
Under Problems:
- Offline nodes: Nodes in offline status.
- Banned nodes: Nodes in banned status.
- Decommissioned nodes: Nodes in decommissioned status.
- Alerts: Number of active alerts about issues on YTsaurus cluster nodes.
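
When reading the Compute resources charts, it can help to relate the distributed and total values. The Python sketch below does this under a stated assumption: that undistributed resources equal total minus distributed resources. The source does not spell out this relationship, and both function names are hypothetical.

```python
def undistributed(total: float, distributed: float) -> float:
    """Resources not yet distributed (assumed: total minus distributed)."""
    return max(total - distributed, 0.0)


def distributed_share(total: float, distributed: float) -> float:
    """Distributed resources as a percentage of the cluster total."""
    if total <= 0:
        return 0.0  # avoid division by zero for an empty cluster
    return 100.0 * distributed / total


# Example: 48 of 64 CPU cores are distributed.
print(undistributed(64, 48), distributed_share(64, 48))
```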

Setting up alerts in Yandex Monitoring

To configure cluster state indicator alerts:

Management console

In the management console, select the folder with the cluster for which you want to set up alerts.
In the list of services, select Monitoring.
Under Service dashboards, select Managed Service for YTsaurus — Cluster Resources.
On the chart of interest, open the chart menu and select Create alert.
If the chart displays multiple metrics, select the data query for the relevant metric and click Continue. For details on the query language, see the Yandex Monitoring documentation.
Set the Alarm and Warning threshold values to trigger the alert.
Click Create alert.

To have other cluster health indicators monitored automatically:

Management console

Create an alert.
Add a status metric.
In the alert parameters, set the alert thresholds.

For a complete list of supported metrics, see the Yandex Monitoring documentation.

Cluster status

The status shows whether the cluster is running, stopped, or in a transitional state.

To view the cluster status, proceed as follows:

Navigate to the folder dashboard and select Managed Service for YTsaurus.
In the cluster row, find the Status column.

Status	Description	Suggested actions
CREATING	Preparing for the first start.	Wait a while and get started. The time it takes to create a cluster depends on the host class.
RUNNING	The cluster is operating normally.	No action is required.
STOPPING	The cluster is stopping.	After a while, the cluster status will switch to `STOPPED` and the cluster will be disabled. No action is required.
STOPPED	The cluster is stopped.	Start the cluster to get it running again.
STARTING	Starting the cluster that was stopped earlier.	After a while, the cluster status will switch to `RUNNING`. Wait a while and get started.
UPDATING	Updating the cluster configuration.	Once the update is complete, the cluster will get the status it had prior to the update: `RUNNING` or `STOPPED`.
DELETING	Deleting the cluster.	Once cluster deletion is complete, the cluster will be disabled. No action is required.
ERROR	Error when performing an operation with the cluster or during a maintenance window.	If the cluster remains in this status for a long time, contact support.
STATUS_UNKNOWN	The cluster is unable to determine its status.	If the cluster remains in this status for a long time, contact support.
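
If you handle these statuses in your own tooling, the table above can be condensed into a small lookup. The Python sketch below is illustrative only: the status names come from the table, while the grouping into sets, the `suggested_action` function, and the returned action labels are assumptions that simplify the suggested actions.

```python
# Status names are taken from the table; the grouping is an assumption.
TRANSITIONAL = frozenset(
    {"CREATING", "STOPPING", "STARTING", "UPDATING", "DELETING"}
)
SUPPORT_IF_PERSISTENT = frozenset({"ERROR", "STATUS_UNKNOWN"})


def suggested_action(status: str) -> str:
    """Map a cluster status to a coarse, hypothetical action label."""
    if status == "RUNNING":
        return "none"
    if status == "STOPPED":
        return "start_cluster"
    if status in TRANSITIONAL:
        return "wait"  # the operation completes without user action
    if status in SUPPORT_IF_PERSISTENT:
        return "contact_support_if_persistent"
    raise ValueError(f"unknown cluster status: {status!r}")


print(suggested_action("UPDATING"))
```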
