Managed Service for YTsaurus cluster health state monitoring
Data on the cluster and host state is available in the management console.
Diagnostic information about cluster states is presented as graphs.
Charts are updated every 15 seconds.
Note
Charts automatically use the most suitable unit multiples (MB, GB, and so on).
You can configure alerts in Yandex Monitoring to receive notifications about cluster failures. In Yandex Monitoring, there are two alert thresholds: Warning and Alarm. If the specified threshold is exceeded, you will receive alerts via the configured notification channels.
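As an illustration of how the two thresholds relate, here is a minimal Python sketch that classifies a metric value against a Warning and an Alarm threshold. It is not the Yandex Monitoring implementation: the `classify` function, the `AlertStatus` enum, the sample threshold values, and the "strictly greater than" comparison are all assumptions made for this example.

```python
from enum import Enum


class AlertStatus(Enum):
    """Hypothetical alert levels mirroring the Warning and Alarm thresholds."""
    OK = "OK"
    WARNING = "Warning"
    ALARM = "Alarm"


def classify(value: float, warning: float, alarm: float) -> AlertStatus:
    """Return the alert level for a metric where higher values are worse.

    Assumption: a threshold counts as exceeded when the value is strictly
    greater than it.
    """
    if warning > alarm:
        raise ValueError("warning threshold must not exceed alarm threshold")
    if value > alarm:
        return AlertStatus.ALARM
    if value > warning:
        return AlertStatus.WARNING
    return AlertStatus.OK


# Hypothetical thresholds for a CPU usage chart, in percent.
for sample in (42.0, 75.0, 93.5):
    print(sample, classify(sample, warning=70.0, alarm=90.0).value)
```

Keeping the Warning threshold below the Alarm threshold gives you an early notification before the metric reaches a critical level.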
Cluster health state monitoring
To view detailed information on the health state of a Managed Service for YTsaurus cluster:
- In the management console, navigate to the relevant folder.
- In the list of services, select Managed Service for YTsaurus.
- Click the name of your cluster and select the Monitoring tab.
- To get started with metrics, dashboards, or alerts, select Yandex Monitoring in the list of services.
The page has two tabs, Cluster and Master, with the following charts:
- Automaton thread CPU usage: CPU utilization by the automaton thread, i.e., the master primary execution thread, as a percentage.
  - TOTAL: Aggregate CPU utilization across all master nodes.
  - ms-N: CPU utilization by the automaton thread on master node N.
- User write time: Time spent on user-initiated write operations.
  - root: Time spent on root user write operations.
  - robot-odin: Time spent on robot-odin user write operations.
  - scheduler: Time spent on scheduler write operations.
  - tablet_cell_changelogger: Time spent on tablet_cell_changelogger user write operations.
- User read time: Time spent on user-initiated read operations.
  - tablet_cell_changelogger: Time spent on tablet_cell_changelogger user read operations.
  - root: Time spent on root user read operations.
  - operations_client: Time spent on client user read operations.
- Master memory usage: Memory usage by the master, in GB.
  - ms-N: Memory usage on master node N.
- LocalRead thread CPU usage: CPU utilization by the LocalRead thread.
  - TOTAL: Aggregate CPU utilization across all master nodes as a percentage.
  - ms-N: CPU utilization by the LocalRead thread on master node N.
- User write request rate: User-initiated write request rate, writes per second.
  - root: root user write request rate.
  - robot-odin: robot-odin user write request rate.
  - scheduler: Scheduler-initiated write request rate.
  - tablet_cell_changelogger: tablet_cell_changelogger user write request rate.
- User read request rate: User-initiated read request rate, reads per second.
  - root: root user read request rate.
  - queue_agent: queue_agent user read request rate.
  - operations_client: Client user read request rate.
- Create chunk requests: Create chunk request rate, requests per second.
  - TOTAL: Total number of create chunk requests from all users.
  - root: Number of root user create chunk requests.
  - robot-odin: Number of robot-odin user create chunk requests.
  - scheduler: Number of scheduler-initiated create chunk requests.
  - tablet_cell_snapshotter: Number of tablet_cell_snapshotter user create chunk requests.
- Under Compute resources:
  - CPU distributed vs total: Total and distributed CPU resources in the cluster.
    - distributes_resources: Amount of distributed CPU resources.
    - total_cluster_resources: Total CPU resources in the cluster.
  - GPU distributed vs total: Total and distributed GPU resources in the cluster.
    - distributes_resources: Amount of distributed GPU resources.
    - total_cluster_resources: Total GPU resources in the cluster.
  - Memory distributed vs total: Total and distributed memory resources in the cluster.
    - distributes_resources: Amount of distributed memory resources.
    - total_cluster_resources: Total memory resources in the cluster.
  - Undistributed_resources: Amount of undistributed resources.
    - memory: Amount of undistributed memory resources.
    - cpu: Amount of undistributed CPU resources.
- Under Scheduler pools:
  - CPU: usage, demand, guarantee: CPU resource usage, demand, and guarantee.
    - usage: CPU resources currently in use.
    - demand: Requested CPU resources.
    - guarantee: Guaranteed CPU resources.
  - RAM: usage, demand, guarantee: RAM resource usage, demand, and guarantee.
    - usage: RAM currently in use.
    - demand: Requested RAM.
    - guarantee: Guaranteed RAM.
  - GPU: usage, demand, guarantee: GPU resource usage, demand, and guarantee.
    - usage: GPU resources currently in use.
    - demand: Requested GPU resources.
    - guarantee: Guaranteed GPU resources.
  - Running operation count: Number of running operations.
    - lightweight usage: Number of running lightweight operations.
    - usage: Number of running operations.
  - Total operation count: Total number of operations.
    - limit: Maximum allowed number of operations.
    - usage: Current number of operations.
- Under Problems:
  - Offline nodes: Nodes in offline status.
  - Banned nodes: Nodes in banned status.
  - Decommissioned nodes: Nodes in decommissioned status.
  - Alerts: Number of active alerts about issues on YTsaurus cluster nodes.
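The distributed, total, usage, demand, and guarantee series described above lend themselves to a few derived values. The Python sketch below shows one possible way to read them once you have exported the numbers. The class names, field names, and sample figures are hypothetical, and two points are assumptions rather than documented behavior: that undistributed resources equal total minus distributed, and that a pool can be treated as underserved when its usage is below both its demand and its guarantee.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One point from a 'distributed vs total' chart (hypothetical container)."""
    distributed: float  # value of the distributed resources series
    total: float        # value of the total_cluster_resources series

    @property
    def undistributed(self) -> float:
        # Assumption: undistributed resources are total minus distributed.
        return self.total - self.distributed

    @property
    def distributed_share(self) -> float:
        # Fraction of cluster resources already distributed; 0.0 for an empty cluster.
        return self.distributed / self.total if self.total else 0.0


@dataclass
class PoolSample:
    """One point from a 'usage, demand, guarantee' chart (hypothetical container)."""
    usage: float
    demand: float
    guarantee: float

    @property
    def unmet_demand(self) -> float:
        # Requested resources that are not currently in use.
        return max(self.demand - self.usage, 0.0)

    @property
    def underserved(self) -> bool:
        # Assumption for this sketch: the pool uses less than it both
        # requests and is guaranteed.
        return self.usage < min(self.demand, self.guarantee)


cpu = ResourceSample(distributed=96.0, total=128.0)
pool = PoolSample(usage=20.0, demand=50.0, guarantee=32.0)
print(cpu.undistributed, cpu.distributed_share)
print(pool.unmet_demand, pool.underserved)
```

Values like these can help you decide which chart deserves an alert, for example, a pool whose usage stays below its guarantee while demand remains high.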
Setting up alerts in Yandex Monitoring
To configure cluster state indicator alerts:
- In the management console, select the folder with the cluster for which you want to set up alerts.
- In the list of services, select Monitoring.
- Under Service dashboards, select Managed Service for YTsaurus — Cluster Resources.
- In the chart you need, open the chart menu and select Create alert.
- If the chart displays multiple metrics, select the data query for the relevant metric and click Continue. You can learn more about the query language in the Yandex Monitoring documentation.
- Set the Alarm and Warning threshold values to trigger the alert.
- Click Create alert.
To have other cluster health indicators monitored automatically:
- Create an alert.
- Add a status metric.
- In the alert parameters, set the alert thresholds.
For a complete list of supported metrics, see the Yandex Monitoring documentation.
Cluster status
The status shows whether the cluster is started, stopped, or in a transitional state.
To view the cluster status, proceed as follows:
- Navigate to the folder dashboard and select Managed Service for YTsaurus.
- In the cluster row, find the Status column.
| Status | Description | Suggested actions |
|---|---|---|
| CREATING | Preparing for the first start. | Wait a while, then start using the cluster. The time it takes to create a cluster depends on the host class. |
| RUNNING | The cluster is operating normally. | No action is required. |
| STOPPING | The cluster is stopping. | After a while, the cluster status will switch to STOPPED. |
| STOPPED | The cluster stopped. | Start the cluster to get it running again. |
| STARTING | Starting the cluster that was stopped earlier. | After a while, the cluster status will switch to RUNNING. |
| UPDATING | Updating the cluster configuration. | Once the update is complete, the cluster will get the status it had prior to the update. |
| DELETING | Deleting the cluster. | Once cluster deletion is complete, the cluster will be disabled. No action is required. |
| ERROR | Error when performing an operation with the cluster or during a maintenance window. | If the cluster remains in this status for a long time, contact support. |
| STATUS_UNKNOWN | The cluster is unable to determine its status. | If the cluster remains in this status for a long time, contact support. |
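If you process cluster statuses in your own tooling, the table above can be condensed into a small lookup. The following Python sketch is only an illustration: the `ClusterStatus` enum reuses the status names from the table, while the grouping into transitional statuses, the `suggested_action` helper, and its shortened messages are assumptions made for this example and are not part of any Yandex Cloud API.

```python
from enum import Enum


class ClusterStatus(Enum):
    """Status names as listed in the table above."""
    CREATING = "CREATING"
    RUNNING = "RUNNING"
    STOPPING = "STOPPING"
    STOPPED = "STOPPED"
    STARTING = "STARTING"
    UPDATING = "UPDATING"
    DELETING = "DELETING"
    ERROR = "ERROR"
    STATUS_UNKNOWN = "STATUS_UNKNOWN"


# Statuses that resolve on their own once the current operation finishes
# (a grouping assumed for this sketch).
TRANSITIONAL = frozenset({
    ClusterStatus.CREATING,
    ClusterStatus.STOPPING,
    ClusterStatus.STARTING,
    ClusterStatus.UPDATING,
    ClusterStatus.DELETING,
})


def suggested_action(status: ClusterStatus) -> str:
    """Return a shortened version of the suggested action for a status."""
    if status is ClusterStatus.RUNNING:
        return "No action is required."
    if status is ClusterStatus.STOPPED:
        return "Start the cluster to get it running again."
    if status in TRANSITIONAL:
        return "Wait for the current operation to finish."
    # ERROR and STATUS_UNKNOWN
    return "If the status persists for a long time, contact support."


for name in ("RUNNING", "UPDATING", "ERROR"):
    print(name, "->", suggested_action(ClusterStatus(name)))
```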