Cluster, host, and shard state monitoring in Yandex StoreDoc

Updated at June 23, 2026

Cluster health state monitoring
Host state monitoring
Shard state monitoring
Setting up alerts in Yandex Monitoring
- Monitoring read-only mode transitions
Cluster health and status
- Cluster health states
- Cluster statuses

Data on the cluster and host state is available in the management console. You can view it on the Monitoring tab of the cluster management page or in Yandex Monitoring.

Diagnostic information about cluster states is presented as graphs.

Chart update rate:

Standard hosts and hosts with an increased RAM to vCPU ratio (memory-optimized): 15 seconds.
Hosts with a guaranteed vCPU share under 100% (burstable): 150 seconds.

Note

Charts automatically use the most appropriate unit multiples (MB, GB, and so on).

You can configure alerts in Yandex Monitoring to receive notifications about cluster failures. In Yandex Monitoring, there are two alert thresholds: Warning and Alarm. If the specified threshold is exceeded, you will receive alerts via the configured notification channels.
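As a rough illustration of how the two thresholds relate, the sketch below classifies a single metric value against a Warning and an Alarm threshold. It is not Yandex Monitoring code: the function name, the `AlertStatus` enum, and the strict "greater than" comparison are assumptions made for this example, and the sample numbers are arbitrary.

```python
from enum import Enum


class AlertStatus(Enum):
    OK = "OK"
    WARNING = "Warning"
    ALARM = "Alarm"


def evaluate(value: float, warning: float, alarm: float) -> AlertStatus:
    """Classify a value for a 'greater than' condition (assumed comparison)."""
    # Check the more severe threshold first so it takes precedence.
    if value > alarm:
        return AlertStatus.ALARM
    if value > warning:
        return AlertStatus.WARNING
    return AlertStatus.OK


# Hypothetical sample: a value between the two thresholds.
print(evaluate(45, warning=30, alarm=180))
```

For metrics where a lower value is worse, such as free disk space, the comparison would be reversed.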

Cluster health state monitoring

To view detailed information on the health state of a Yandex StoreDoc cluster:

Management console

Open the folder dashboard.
Navigate to Yandex StoreDoc.
Click the name of your cluster and select the Monitoring tab.

The page that opens will display performance charts for the cluster.

To get started with Yandex Monitoring metrics, dashboards, or alerts, click Open in Monium in the top panel.

The following charts are displayed for a cluster:

Hosts available for write: Cluster host write availability.
Hosts available for read: Cluster host read availability.

Under Traffic:

Queries on primary: Increase in commands and operations on cluster primary replicas.
Queries on secondaries: Increase in commands and operations on cluster secondary replicas.
Replicated queries: Increase in replicated commands and operations on cluster secondary replicas.
Documents affected on primary: Increase in documents added, updated, deleted, or returned by queries on cluster primary replicas.
Documents affected on secondaries: Increase in documents added, updated, deleted, or returned by queries on cluster secondary replicas.
Documents affected per host: Increase in documents added, updated, deleted, or returned by queries on each cluster host.
Total operations count on cluster: Total increase in commands and operations in the cluster.
Connections per host: Number of available and incoming connections on each cluster host.
Readers/writers active queue per host, top 5: Number of read and write operations in the five largest lock queues on each cluster host.

Under Latency:

Average operations time on primary: Average execution time for commands and operations on cluster primary replicas.
Average operations time on secondaries: Average execution time for commands and operations on cluster secondary replicas.
Average operation time per host: Average time it takes each cluster host to execute operations.
Total operations time on Primaries: Total execution time for all operations on cluster primary replicas.
Total operations time on Secondaries: Total execution time for all operations on cluster secondary replicas.
Total operations time on Cluster: Total execution time for all operations in the cluster.
Write operations time, top 5 collections: Total time spent on write operations for the five largest collections in the cluster.
Read operations time, top 5 collections: Total time spent on read operations for the five largest collections in the cluster.

Under DB Metrics:

Replication lag per host and write_concern wait: Replication delay and write concern timeout on each cluster host.
Scanned / returned: Average ratio of scanned keys and documents to returned documents in the cluster.
Scan and order per host: Increase in the number of non-index-based data sorts on each cluster host.
Data size on primary, top 5 databases: Data size for the five largest databases on cluster primary replicas.
Index size on primary, top 5 indexes: Size of the five largest indexes on cluster primary replicas.
TTL indexes activity: Increase in deleted documents and background deletion operations using TTL indexes.
Configured oplog size per host: Maximum operation log size on each cluster host.
Oplog window: Time interval for retaining replication data in the oplog collection on each cluster host.
Open cursors total: Total number of open cursors on cluster hosts. The following is displayed separately:
- Total cursors.
- Pinned cursors.
- No-timeout cursors.

Under Resources → CPU:

CPU usage per host: CPU usage percentage per cluster host.
CPU usage on Primaries: CPU load on cluster primary replicas.
CPU usage on Secondaries: CPU load on cluster secondary replicas.

Under Resources → Memory:

Memory usage per host: RAM usage per cluster host as a percentage.
Memory usage on Primaries: RAM usage on cluster primary replicas.
Memory usage on Secondaries: RAM usage on cluster secondary replicas.

Under Resources → Network:

Network usage per host: Total network load on each cluster host.
Network data sent per host: Network data send rate on each cluster host.
Network data received per host: Network data receive rate on each cluster host.

Under Resources → Data:

Disk space usage per host: Disk space usage per cluster host as a percentage of the total storage size.
Disk space usage on Primaries: Disk space usage on cluster primary replicas.
Disk space usage on Secondaries: Disk space usage on cluster secondary replicas.
Disk usage per host: Total disk read and write rate on each cluster host.
Disk write per host: Disk write rate on each cluster host.
Disk read per host: Disk read rate on each cluster host.

Under Errors:

Write conflicts per host: Increase in write conflicts on each cluster host.
Page faults per host: Number of page faults on each cluster host.
Asserts total: Increase in triggered asserts in the cluster.

Under WiredTiger:

WiredTiger checkpoint time on primary: Time required to create checkpoints on cluster primary replicas.
WiredTiger cache state on primary: Cache usage on cluster primary replicas.
WiredTiger transactions state on primary: Increase in transactions on cluster primary replicas.
WiredTiger concurrent transactions on primary: Current number of parallel transaction tickets on cluster primary replicas.
WiredTiger cache pages evicted on primary: Increase in evicted cache pages (both modified and not) on cluster primary replicas.

Under Mongos:

Mongos in balancer round: Indicates whether mongos is involved in the current balancing round.
Mongos active migrations: Maximum number of active chunk migration operations via mongos.
Mongos migrations: Maximum number of chunk migration operations (both successful and not) via mongos.

Host state monitoring

To view detailed information on the state of individual Yandex StoreDoc hosts:

Management console

Open the folder dashboard.
Navigate to Yandex StoreDoc.
Click the name of your cluster and select the Monitoring tab.
Navigate to the Hosts tab and select the host.

The page that opens will display performance charts for the cluster hosts.

To get started with Yandex Monitoring metrics, dashboards, or alerts, click Open in Monium in the top panel.

The following charts are displayed for the hosts:

CPU usage: CPU usage percentage by consumption type: user, system, steal, softirq, nice, irq, iowait, idle, and guest.
Memory usage: RAM usage by consumption type: used, total, shared, free, cached, buffers, available, and active.
Disk IOPS: Number of read and write operations per second.
Network bytes: Network data transmit and receive rate.
Network packets: Network packet transmit and receive rate (packets per second).
Disk space usage: Used and available disk space.

Under Disk Metrics Details:

Disk write latency (percentiles): Disk write latency, in percentiles.
Disk write bytes: Average and maximum disk write rate.
Disk write operations: Average and maximum number of write operations per second.
Disk read latency (percentiles): Disk read latency, in percentiles.
Disk read bytes: Average and maximum disk read rate.
Disk read operations: Average and maximum number of read operations per second.
Disk write throttler latency (percentiles): Write delay caused by exceeding the disk quota, in percentiles.
Disk read throttler latency (percentiles): Read delay caused by exceeding the disk quota, in percentiles.
Disk used quota: Percentage of disk operation quota usage.

Shard state monitoring

To view detailed information on the health state of Yandex StoreDoc shards:

Management console

Open the folder dashboard.
Navigate to Yandex StoreDoc.
Click the name of your cluster and select the Monitoring tab.
Navigate to the Shards tab and select a shard.

The page that opens will display health state charts for the selected shard and its hosts.

To get started with Yandex Monitoring metrics, dashboards, or alerts, click Open in Monium in the top panel.

The following charts are displayed for shards:

Hosts available for write: Shard host write availability.
Hosts available for read: Shard host read availability.

Under Traffic:

Queries on primary: Increase in commands and operations on shard primary replicas.
Queries on secondaries: Increase in commands and operations on shard secondary replicas.
Replicated queries: Increase in replicated commands and operations on shard secondary replicas.
Documents affected on primary: Increase in documents added, updated, deleted, or returned by queries on shard primary replicas.
Documents affected on secondaries: Increase in documents added, updated, deleted, or returned by queries on shard secondary replicas.
Documents affected per host: Increase in documents added, updated, deleted, or returned by queries on each shard host.
Total operations count on cluster: Total increase in commands and operations executed on the shard.
Connections per host: Number of available and incoming connections on each shard host.
Readers/writers active queue per host, top 5: Number of read and write operations in the five largest lock queues on each shard host.

Under Latency:

Average operations time on primary: Average execution time for commands and operations on shard primary replicas.
Average operations time on secondaries: Average execution time for commands and operations on shard secondary replicas.
Average operation time per host: Average time it takes each shard host to execute operations.
Total operations time on Primaries: Total execution time for all operations on shard primary replicas.
Total operations time on Secondaries: Total execution time for all operations on shard secondary replicas.
Total operations time on Cluster: Total execution time for all operations on the shard.
Write operations time, top 5 collections: Total time spent on write operations for the five largest collections on the shard.
Read operations time, top 5 collections: Total time spent on read operations for the five largest collections on the shard.

Under DB Metrics:

Replication lag per host and write_concern wait: Replication delay and write concern timeout on each shard host.
Scanned / returned: Average ratio of scanned keys and documents to returned documents on the shard.
Scan and order per host: Increase in the number of non-index-based data sorts on each shard host.
Data size on primary, top 5 databases: Data size for the five largest databases on shard primary replicas.
Index size on primary, top 5 indexes: Size of the five largest indexes on shard primary replicas.
TTL indexes activity: Increase in deleted documents and background deletion operations using TTL indexes on the shard.
Configured oplog size per host: Maximum operation log size on each shard host.
Oplog window: Time interval for retaining replication data in the oplog collection on each shard host.
Open cursors total: Total number of open cursors on shard hosts. The following is displayed separately:
- Total cursors
- Pinned cursors
- No-timeout cursors

Under Resources → CPU:

CPU usage per host: CPU usage percentage per shard host.
CPU usage on Primaries: CPU load on shard primary replicas.
CPU usage on Secondaries: CPU load on shard secondary replicas.

Under Resources → Memory:

Memory usage per host: RAM usage per shard host as a percentage.
Memory usage on Primaries: RAM usage on shard primary replicas.
Memory usage on Secondaries: RAM usage on shard secondary replicas.

Under Resources → Network:

Network usage per host: Total network load on each shard host.
Network data sent per host: Network data send rate on each shard host.
Network data received per host: Network data receive rate on each shard host.

Under Resources → Data:

Disk space usage per host: Disk space usage per shard host as a percentage of the total storage size.
Disk space usage on Primaries: Disk space usage on shard primary replicas.
Disk space usage on Secondaries: Disk space usage on shard secondary replicas.
Disk usage per host: Total disk read and write rate on each shard host.
Disk write per host: Disk write rate on each shard host.
Disk read per host: Disk read rate on each shard host.

Under Errors:

Write conflicts per host: Increase in write conflicts on each shard host.
Page faults per host: Number of page faults on each shard host.
Asserts total: Increase in triggered asserts on the shard.

Under WiredTiger:

WiredTiger checkpoint time on primary: Time required to create checkpoints on shard primary replicas.
WiredTiger cache state on primary: Cache usage on shard primary replicas.
WiredTiger transactions state on primary: Increase in transactions on shard primary replicas.
WiredTiger concurrent transactions on primary: Current number of parallel transaction tickets on shard primary replicas.
WiredTiger cache pages evicted on primary: Increase in evicted cache pages (both modified and not) on shard primary replicas.

Under Mongos:

Mongos in balancer round: Indicates whether mongos is involved in the current balancing round on the shard.
Mongos active migrations: Maximum number of active chunk migration operations via mongos on the shard.
Mongos migrations: Maximum number of chunk migration operations (both successful and not) via mongos on the shard.

Setting up alerts in Yandex Monitoring

Management console

In the management console, select the folder with the cluster for which you want to set up alerts.
Navigate to Monitoring.
Under Service dashboards, select:
- Yandex StoreDoc to set up cluster alerts.
- Yandex StoreDoc — Host Overview to set up host alerts.
On the relevant chart, click and select Create alert.
If the chart displays multiple metrics, select the data query for the relevant metric and click Continue. To learn more about the query language, see the Yandex Monitoring documentation.
Set the Alarm and Warning alert thresholds.
Click Create alert.

To have other cluster health indicators monitored automatically:

Management console

Create an alert.
Add a status metric.
In the alert parameters, set the alert thresholds.

Recommended threshold values for selected metrics:

Metric	Internal metric name	`Alarm`	`Warning`
Database write availability	`can_write`	`0`	—
Replication lag	`replset_status-replicationLag`	`180`	`30`
Storage space used	`disk.used_bytes`	90% of the storage size	70% of the storage size

The Alarm and Warning thresholds for the disk.used_bytes metric are specified exclusively in bytes. For example, recommended values for a 100 GB disk are as follows:

Alarm: 96636764160 bytes (90%)
Warning: 75161927680 bytes (70%)

You can check the current storage size in the cluster details. For a complete list of supported metrics, see the Yandex Monitoring documentation.
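The byte values above can be reproduced with a short calculation. The sketch below assumes 1 GB is counted as 2^30 bytes, which is what the 100 GB example values imply. The function name and the returned dictionary layout are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GB, as implied by the 100 GB example values


def used_bytes_thresholds(storage_gb: int) -> dict:
    """Return disk.used_bytes thresholds at 90% (Alarm) and 70% (Warning)."""
    total_bytes = storage_gb * GIB
    return {
        "alarm": total_bytes * 90 // 100,
        "warning": total_bytes * 70 // 100,
    }


print(used_bytes_thresholds(100))
# {'alarm': 96636764160, 'warning': 75161927680}
```

Integer arithmetic is used so that the result can be pasted into the threshold fields without rounding artifacts.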

Monitoring read-only mode transitions

To track storage fill levels on the cluster hosts and receive notifications when free space is about to run out:

Create an alert.
Add the disk.free_bytes metric.

To do this, create a query in the query builder:

service=managed-mongodb → name=disk.free_bytes → host=* → resource_id=* → resource_type=cluster.
Configure alert notification thresholds:
- Condition: Set the Less than or equals condition for free disk space that will trigger the alert.

  Recommended thresholds relative to storage size are as follows:

  Storage size, GB	`Alarm`	`Warning`
  ⩽ 600	`1G`: 1 GB	`1500M`: 1.5 GB
  > 600	`6G`: 6 GB	`10G`: 10 GB
- Advanced settings → Aggregation function: Select Minimum (the metric’s minimum value over the period).
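The query labels and the threshold table for disk.free_bytes can be captured in a small helper, for example when preparing alert settings for several clusters. This is only a sketch: the `QUERY_LABELS` dictionary and `free_space_thresholds` function are hypothetical names, and the threshold values are kept as the strings shown in the table rather than converted to bytes.

```python
# Labels from the query builder steps for the disk.free_bytes metric.
QUERY_LABELS = {
    "service": "managed-mongodb",
    "name": "disk.free_bytes",
    "host": "*",
    "resource_id": "*",
    "resource_type": "cluster",
}


def free_space_thresholds(storage_gb: float) -> dict:
    """Pick the recommended Alarm and Warning values for a storage size."""
    if storage_gb <= 600:
        return {"alarm": "1G", "warning": "1500M"}
    return {"alarm": "6G", "warning": "10G"}


print(free_space_thresholds(100))
print(free_space_thresholds(1024))
```

Both thresholds are meant for a Less than or equals condition, so the Alarm value is the smaller of the two.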

Cluster health and status

The State of a cluster shows the health of its hosts, while the Status shows whether the cluster is started, stopped, or in an intermediate state.

To view the health state and status of a cluster:

Open the folder dashboard.
Navigate to Yandex StoreDoc.
In the cluster row, hover over the indicator in the Availability column.

Cluster health states

State	Description	Suggested actions
ALIVE	Cluster is operating normally.	No action is required.
DEGRADED	Cluster is not running at its full capacity: the state of at least one of the hosts is other than `ALIVE`.	Run the diagnostics: Go to the Hosts tab and see which hosts are not working. Go to the Operations tab and make sure all operations are completed. Make sure the cluster is not under maintenance. If you cannot find the cause yourself, contact support.
DEAD	The cluster is down: none of its hosts are running.	Make a support request stating the following: Cluster ID. IDs of the last operations performed on it. Time the cluster entered the `DEAD` state according to the availability charts.
UNKNOWN	Cluster state is unknown.	Make a support request stating the following: Cluster ID. IDs of the last operations performed on it. Time the cluster entered the `UNKNOWN` state according to the availability charts.
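If you process cluster health states in a script, the table above can be expressed as a simple lookup. The sketch below is illustrative only: the enum and function names are hypothetical, the action texts are shortened summaries of the table, and how you obtain the state string is outside its scope.

```python
from enum import Enum


class ClusterHealth(Enum):
    ALIVE = "ALIVE"
    DEGRADED = "DEGRADED"
    DEAD = "DEAD"
    UNKNOWN = "UNKNOWN"


SUPPORT_REQUEST = (
    "Make a support request with the cluster ID, the IDs of the last "
    "operations, and the time the cluster entered this state."
)

SUGGESTED_ACTIONS = {
    ClusterHealth.ALIVE: "No action is required.",
    ClusterHealth.DEGRADED: (
        "Check the Hosts and Operations tabs and the maintenance schedule; "
        "contact support if the cause is unclear."
    ),
    ClusterHealth.DEAD: SUPPORT_REQUEST,
    ClusterHealth.UNKNOWN: SUPPORT_REQUEST,
}


def suggested_action(state: str) -> str:
    """Map a health state name to its suggested action (ValueError if unknown)."""
    return SUGGESTED_ACTIONS[ClusterHealth(state)]


print(suggested_action("DEGRADED"))
```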

Cluster statuses

Status	Description	Suggested actions
CREATING	Preparing for the first start	Wait until the cluster is created before using it. The time it takes to create a cluster depends on the host class.
RUNNING	The cluster is operating normally	No action is required.
STOPPING	The cluster is stopping	After a while, the cluster status will switch to `STOPPED` and the cluster will be disabled. No action is required.
STOPPED	The cluster is stopped	Start the cluster to get it running again.
STARTING	Starting the cluster that was stopped earlier	After a while, the cluster status will switch to `RUNNING`. Wait until it does before using the cluster.
UPDATING	Updating the cluster's configuration	Once the update is complete, the cluster will get the status it had prior to the update: `RUNNING` or `STOPPED`.
ERROR	Error when performing an operation with the cluster or during a maintenance window	If the cluster remains in this status for a long time, contact support. You can see whether a cluster is available by its status.
STATUS_UNKNOWN	The cluster is unable to determine its status	If the cluster remains in this status for a long time, contact support.
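The status table distinguishes stable statuses from intermediate ones. A script that waits for a cluster to become usable might encode that distinction as below. This is a sketch under stated assumptions: the set and function names are hypothetical, and only the transitions the table names explicitly (`STOPPING` to `STOPPED`, `STARTING` to `RUNNING`) are listed.

```python
from typing import Optional

# Statuses the table describes as intermediate stages.
TRANSITIONAL = {"CREATING", "STOPPING", "STARTING", "UPDATING"}

# Target statuses stated explicitly in the table.
EXPECTED_NEXT = {"STOPPING": "STOPPED", "STARTING": "RUNNING"}

# Statuses for which the table advises contacting support if they persist.
SUPPORT_IF_PERSISTENT = {"ERROR", "STATUS_UNKNOWN"}


def is_transitional(status: str) -> bool:
    """Return True if the cluster is expected to leave this status on its own."""
    return status in TRANSITIONAL


def expected_next(status: str) -> Optional[str]:
    """Return the status the table says will follow, or None if not stated."""
    return EXPECTED_NEXT.get(status)


print(is_transitional("UPDATING"), expected_next("STOPPING"))
```

`UPDATING` has no single target here because the table says the cluster returns to whichever status it had before the update.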
