© 2026 Direct Cursus Technology L.L.C.
Cluster health state monitoring in Managed Service for YTsaurus

Written by
Yandex Cloud
Updated on April 29, 2026
  • Cluster health state monitoring
  • Setting up alerts in Yandex Monitoring
  • Cluster status

Cluster and host state data is available in the management console. You can view it on the Monitoring tab of the cluster management page or in Yandex Monitoring.

Diagnostic information about the cluster state is presented as charts, which are updated every 15 seconds.

Note

Charts automatically use the most appropriate unit multiples (MB, GB, and so on).

You can configure alerts in Yandex Monitoring to receive notifications about cluster failures. Yandex Monitoring provides two alert thresholds: Warning and Alarm. If a metric exceeds the specified threshold, you will receive notifications via the configured notification channels.
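The two-threshold model can be illustrated with a short sketch. This is not the Yandex Monitoring implementation: the `AlertState` enum and the `evaluate_thresholds` helper are hypothetical names, and the sketch assumes that "exceeded" means strictly greater than the threshold and that the Alarm threshold is not lower than the Warning threshold.

```python
from enum import Enum


class AlertState(Enum):
    """Possible outcomes of a threshold check (hypothetical names)."""
    OK = "OK"
    WARNING = "WARNING"
    ALARM = "ALARM"


def evaluate_thresholds(value: float, warning: float, alarm: float) -> AlertState:
    """Classify a metric value against the Warning and Alarm thresholds.

    Assumption: a threshold is exceeded when the value is strictly greater
    than it, and the Alarm threshold is at least the Warning threshold.
    """
    if alarm < warning:
        raise ValueError("alarm threshold must not be lower than warning threshold")
    if value > alarm:
        return AlertState.ALARM
    if value > warning:
        return AlertState.WARNING
    return AlertState.OK


if __name__ == "__main__":
    # Illustrative values only: master memory usage in GB.
    for usage_gb in (10.0, 26.5, 31.0):
        state = evaluate_thresholds(usage_gb, warning=24.0, alarm=30.0)
        print(usage_gb, state.value)
```

Checking the Alarm threshold first ensures that a value above both thresholds is reported at the more severe level.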

Cluster health state monitoring

To view detailed information on the health state of a Managed Service for YTsaurus cluster:

Management console
  1. In the management console, navigate to the relevant folder.
  2. In the list of services, select Managed Service for YTsaurus.
  3. Click the name of your cluster and select the Monitoring tab.
  4. To get started with metrics, dashboards, or alerts, select Yandex Monitoring in the list of services.

The page has two tabs, Cluster and Master, with the following charts:

  • Automaton thread CPU usage: CPU utilization by the automaton thread, i.e., the master primary execution thread, as a percentage.
    • TOTAL: Aggregate CPU utilization across all master nodes.
    • ms-N: CPU utilization by the automaton thread on master node N.
  • User write time: Time spent on user-initiated write operations.
    • root: Time spent on root user write operations.
    • robot-odin: Time spent on robot-odin user write operations.
    • scheduler: Time spent on scheduler write operations.
    • tablet_cell_changelogger: Time spent on tablet_cell_changelogger user write operations.
  • User read time: Time spent on user-initiated read operations.
    • tablet_cell_changelogger: Time spent on tablet_cell_changelogger user read operations.
    • root: Time spent on root user read operations.
    • operations_client: Time spent on operations_client user read operations.
  • Master memory usage: Memory usage by the master, GB.
    • ms-N: Memory usage on master node N.
  • LocalRead thread CPU usage: CPU utilization by the LocalRead thread, as a percentage.
    • TOTAL: Aggregate CPU utilization across all master nodes.
    • ms-N: CPU utilization by the LocalRead thread on master node N.
  • User write request rate: User-initiated write request rate, writes per second.
    • root: root user write request rate.
    • robot-odin: robot-odin user write request rate.
    • scheduler: Scheduler-initiated write request rate.
    • tablet_cell_changelogger: tablet_cell_changelogger user write request rate.
  • User read request rate: User-initiated read request rate, reads per second.
    • root: root user read request rate.
    • queue_agent: queue_agent user read request rate.
    • operations_client: operations_client user read request rate.
  • Create chunk requests: Create chunk request rate, requests per second.
    • TOTAL: Total number of create chunk requests from all users.
    • root: Number of root user create chunk requests.
    • robot-odin: Number of robot-odin user create chunk requests.
    • scheduler: Number of scheduler-initiated create chunk requests.
    • tablet_cell_snapshotter: Number of tablet_cell_snapshotter user create chunk requests.
  • Under Compute resources:
    • CPU distributed vs total: Total and distributed CPU resources in the cluster.
      • distributes_resources: Amount of distributed CPU resources.
      • total_cluster_resources: Total CPU resources in the cluster.
    • GPU distributed vs total: Total and distributed GPU resources in the cluster.
      • distributes_resources: Amount of distributed GPU resources.
      • total_cluster_resources: Total GPU resources in the cluster.
    • Memory distributed vs total: Total and distributed memory resources in the cluster.
      • distributes_resources: Amount of distributed memory resources.
      • total_cluster_resources: Total memory resources in the cluster.
    • Undistributed_resources: Amount of undistributed resources.
      • memory: Amount of undistributed memory resources.
      • cpu: Amount of undistributed CPU resources.
  • Under Scheduler pools:
    • CPU: usage, demand, guarantee: CPU resource usage, demand, and guarantee.
      • usage: CPU resources currently in use.
      • demand: Requested CPU resources.
      • guarantee: Guaranteed CPU resources.
    • RAM: usage, demand, guarantee: RAM resource usage, demand, and guarantee.
      • usage: RAM currently in use.
      • demand: Requested RAM.
      • guarantee: Guaranteed RAM.
    • GPU: usage, demand, guarantee: GPU resource usage, demand, and guarantee.
      • usage: GPU resources currently in use.
      • demand: Requested GPU resources.
      • guarantee: Guaranteed GPU resources.
    • Running operation count: Number of running operations.
      • lightweight usage: Number of running lightweight operations.
      • usage: Number of running operations.
    • Total operation count: Total number of operations.
      • limit: Maximum allowed operations.
      • usage: Number of operations in use.
  • Under Problems:
    • Offline nodes: Nodes in offline status.
    • Banned nodes: Nodes in banned status.
    • Decommissioned nodes: Nodes in decommissioned status.
    • Alerts: Number of active alerts about issues on YTsaurus cluster nodes.
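As a rough illustration of how the "distributed vs total" charts can be read, the sketch below derives the undistributed remainder and the distributed share from one pair of readings. The `ResourceSample` class is a hypothetical container, not part of any Yandex Cloud or YTsaurus API, and the formula (undistributed equals total minus distributed) is an assumption about how the charts relate to each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSample:
    """One reading of a 'distributed vs total' chart (hypothetical container)."""
    distributed: float
    total: float

    @property
    def undistributed(self) -> float:
        # Assumption: undistributed = total - distributed, floored at zero.
        return max(self.total - self.distributed, 0.0)

    @property
    def distributed_share(self) -> float:
        # Fraction of the cluster total that is distributed.
        # Returns 0.0 for an empty cluster to avoid division by zero.
        return self.distributed / self.total if self.total > 0 else 0.0


if __name__ == "__main__":
    # Illustrative values only: 96 of 128 CPU cores distributed.
    cpu = ResourceSample(distributed=96.0, total=128.0)
    print(cpu.undistributed, cpu.distributed_share)
```

A large undistributed remainder alongside high pool demand may suggest that pool guarantees could be revisited, though that interpretation depends on your workload.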

Setting up alerts in Yandex Monitoring

To configure cluster state indicator alerts:

Management console
  1. In the management console, select the folder with the cluster for which you want to set up alerts.
  2. In the list of services, select Monitoring.
  3. Under Service dashboards, select Managed Service for YTsaurus — Cluster Resources.
  4. In the chart you need, click the menu icon and select Create alert.
  5. If the chart displays multiple metrics, select the data query for the relevant metric and click Continue. You can learn more about the query language in the Yandex Monitoring documentation.
  6. Set the Alarm and Warning threshold values to trigger the alert.
  7. Click Create alert.

To have other cluster health indicators monitored automatically:

Management console
  1. Create an alert.
  2. Add a status metric.
  3. In the alert parameters, set the alert thresholds.

For a complete list of supported metrics, see the Yandex Monitoring metrics reference.

Cluster status

The status shows whether the cluster is running, stopped, or in a transitional state.

To view the cluster status, proceed as follows:

  1. Navigate to the folder dashboard and select Managed Service for YTsaurus.
  2. In the cluster row, find the Status column.

| Status | Description | Suggested actions |
|---|---|---|
| CREATING | Preparing for the first start. | Wait a while and get started. The time it takes to create a cluster depends on the host class. |
| RUNNING | The cluster is operating normally. | No action is required. |
| STOPPING | The cluster is stopping. | After a while, the cluster status will switch to STOPPED and the cluster will be disabled. No action is required. |
| STOPPED | The cluster stopped. | Start the cluster to get it running again. |
| STARTING | Starting the cluster that was stopped earlier. | After a while, the cluster status will switch to RUNNING. Wait a while and get started. |
| UPDATING | Updating the cluster configuration. | Once the update is complete, the cluster will get the status it had prior to the update: RUNNING or STOPPED. |
| DELETING | Deleting the cluster. | Once cluster deletion is complete, the cluster will be disabled. No action is required. |
| ERROR | Error when performing an operation with the cluster or during a maintenance window. | If the cluster remains in this status for a long time, contact support. |
| STATUS_UNKNOWN | The cluster is unable to determine its status. | If the cluster remains in this status for a long time, contact support. |
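For automation, the transitional statuses above can be handled with a simple polling loop. The sketch below is only an illustration: `wait_for_stable_status` and its `get_status` callable are hypothetical, and how you actually obtain the status string (API, CLI, or SDK) is left to you. The timeout and polling interval are arbitrary example values.

```python
import time
from typing import Callable

# Statuses the table describes as in-progress states that resolve on their own.
TRANSITIONAL = {"CREATING", "STARTING", "STOPPING", "UPDATING", "DELETING"}


def wait_for_stable_status(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the cluster leaves a transitional status.

    `get_status` is a hypothetical callable returning the current status
    string. ERROR and STATUS_UNKNOWN are returned as-is so that the caller
    can decide whether to retry or contact support.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status not in TRANSITIONAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cluster still in {status} after {timeout_s} s")
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Simulated status sequence in place of a real status lookup.
    simulated = iter(["STARTING", "STARTING", "RUNNING"])
    print(wait_for_stable_status(lambda: next(simulated), sleep=lambda _: None))
```

Passing `sleep` as a parameter keeps the loop testable without real delays.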
