High availability of a Yandex Managed Service for Valkey™ cluster
High availability is the ability of a system to quickly recover functionality in the event of failure, ensuring continuous service operation for clients.
The high availability of a Yandex Managed Service for Valkey™ cluster depends on the SLA and on the cluster configuration, as described in the sections below.
Number and placement of cluster hosts
For a non-sharded cluster
A single-host cluster does not provide high availability. If the master fails, your cluster becomes unavailable for reading and writing until the master is recovered.
If your cluster has at least two hosts, it remains available if one of them fails. A cluster is resilient to a failure of one availability zone if no zone contains more than half of all its hosts.
For a sharded cluster
If each cluster shard has at least two hosts, such a cluster remains available if one of the hosts fails. A cluster is resilient to a failure of one availability zone if no zone contains more than half of the hosts belonging to a single shard.
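The placement rules above can be checked mechanically. The following is a minimal Python sketch, not part of the service API: the function names and the input shape (a mapping of host name to availability zone, grouped by shard for sharded clusters) are assumptions made for illustration.

```python
from collections import Counter


def survives_zone_failure(host_zones):
    """Return True if no availability zone holds more than half of the hosts.

    host_zones maps a host name to its availability zone
    (a hypothetical input shape, chosen for this example).
    """
    if len(host_zones) < 2:
        # A single-host cluster or shard does not provide high availability.
        return False
    per_zone = Counter(host_zones.values())
    # "More than half" is checked with integer math: count * 2 > total.
    return all(count * 2 <= len(host_zones) for count in per_zone.values())


def sharded_cluster_survives_zone_failure(shards):
    """Apply the same rule to every shard; shards maps shard name -> host_zones."""
    return bool(shards) and all(
        survives_zone_failure(host_zones) for host_zones in shards.values()
    )
```

For example, two hosts placed in different zones pass the check, while three hosts with two of them in the same zone do not.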
Replication and master failover settings
High availability is achieved through replication and master failover, which work as follows:
- rdsync, a host status management agent by Yandex, was integrated into the Yandex Managed Service for Valkey™ architecture; this agent automatically selects a new master and switches over to it in the event of a master failure. To ensure the optimal performance of rdsync, the number of hosts in the cluster must be even.
- You can influence new master selection in a Yandex Managed Service for Valkey™ cluster by configuring priorities for cluster hosts.
- You can manually select a new master and switch over to it.
- If you enable public access for the master host, you must also enable it for the replicas; otherwise, the cluster will become unavailable following master failover.
- Yandex Managed Service for Valkey™ clusters use asynchronous replication, i.e., the result of a write request is committed to the master host, which then forwards the data to the cluster replicas.
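One practical consequence of asynchronous replication is that a write acknowledged by the master may not have reached a replica yet when a failover happens. The toy model below illustrates this under simplifying assumptions; `ToyMaster` and its methods are hypothetical names and do not reflect the actual replication protocol.

```python
class ToyMaster:
    """Toy model of asynchronous replication; not the real protocol."""

    def __init__(self):
        self.data = {}     # writes committed on the master
        self.pending = []  # writes not yet forwarded to the replica

    def write(self, key, value):
        # The client is acknowledged as soon as the master commits the write.
        self.data[key] = value
        self.pending.append((key, value))
        return "OK"

    def replicate_to(self, replica):
        # Forwarding happens later, independently of the acknowledgement.
        for key, value in self.pending:
            replica[key] = value
        self.pending.clear()


master = ToyMaster()
replica = {}

master.write("a", 1)
master.replicate_to(replica)  # "a" has reached the replica
master.write("b", 2)          # acknowledged, but not forwarded yet

# If the master fails now and the replica is promoted, "b" is missing.
lost_keys = sorted(set(master.data) - set(replica))
print(lost_keys)  # prints ['b']
```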
Connecting to a database
Cluster availability depends on the connection method and settings:
- Only use recommended clients for connection.
- Configure security groups.
- Set the values of the Timeout, Maxmemory policy, Maxmemory percent, Client output buffer limit normal, and Client output buffer limit pubsub Valkey™ settings so that there are no write operation failures or mass connection interruptions under normal operating conditions.
Storage settings
If the database storage is 100% full, the cluster will switch to read-only mode. To keep your cluster writable:
- Enable automatic storage expansion.
- Create an alert to monitor storage utilization.
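As an illustration of the alerting logic, the sketch below classifies storage utilization against two thresholds. The function name and the 80% and 90% thresholds are assumptions chosen for the example, not values prescribed by the service; pick thresholds that leave enough time to expand storage before it reaches 100%.

```python
def storage_alert_level(used_bytes, total_bytes, warn_pct=80.0, crit_pct=90.0):
    """Classify storage utilization; the thresholds are illustrative assumptions."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct >= crit_pct:
        return "critical"  # close to the point where the cluster becomes read-only
    if used_pct >= warn_pct:
        return "warning"
    return "ok"
```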
Persistence settings and the WAIT command
To increase fault tolerance:
- Enable persistence on replicas.
- Use the WAIT command with N/2 available replicas, where N is the number of cluster hosts.
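A minimal sketch of the WAIT recommendation follows. It assumes that N/2 is rounded down for odd N, and that `client` is any object exposing `set()` and `execute_command()`, as redis-py-style clients do; the helper names and the 100 ms timeout are hypothetical.

```python
def replicas_to_wait_for(cluster_hosts):
    """Return N/2 for N cluster hosts, assuming integer division (rounded down)."""
    if cluster_hosts < 1:
        raise ValueError("cluster_hosts must be at least 1")
    return cluster_hosts // 2


def set_and_wait(client, key, value, cluster_hosts, timeout_ms=100):
    """Write a key, then wait until enough replicas acknowledge the write.

    Returns True if at least N/2 replicas acknowledged before the timeout.
    """
    client.set(key, value)
    needed = replicas_to_wait_for(cluster_hosts)
    # WAIT returns the number of replicas that acknowledged the preceding writes.
    acked = client.execute_command("WAIT", needed, timeout_ms)
    return acked >= needed
```

A `False` result means the write was committed on the master but was not confirmed by enough replicas within the timeout, so the caller can decide whether to retry or report the write as not durable.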
Virtual machine type
Cluster availability depends on the type of VMs you use to deploy your hosts. A highly available cluster should use a VM type with a 100% vCPU guarantee. The burstable VM type with a 50% vCPU guarantee does not ensure high availability and should only be used for test environments.
Other settings
The following cluster parameters and settings may also affect its availability:
- Backup settings.
- Storage disk type you selected.
- Host classes.
- Quotas and limits.
Maintaining a cluster and modifying its parameters
The following operations may lead to interrupted connections, temporary performance degradation, or temporary cluster unavailability:
- Starting maintenance operations (the start time is set by selecting a maintenance window) may cause interrupted connections and temporary write unavailability of the cluster.
- Updating the Valkey™ version, changing the host class, changing the disk type, or increasing the storage size leads to interrupted connections, temporary performance degradation, or temporary write unavailability of the cluster.
- Updating the Databases Valkey™ setting leads to interrupted connections and temporary write unavailability of the cluster.
- Automatic and manual database backups may cause temporary performance degradation of the cluster.
Run these operations when the cluster load is minimal.