Updating subclusters
You can perform the following actions for any subcluster:
- Change the number of hosts.
- Change the host class.
- Change the autoscaling rule for data processing subclusters.
- Increase storage size.
- Edit security groups.
You can also switch availability zones for subclusters. This migrates them to a different availability zone together with the Yandex Data Processing cluster. The process depends on the cluster type:
- Migrating a lightweight cluster to a different availability zone.
- Migrating an HDFS cluster to a different availability zone.
Changing the number of hosts
You can change the number of hosts in data storage and processing subclusters:
- Navigate to the folder dashboard and select Yandex Data Processing.
- Click the cluster name and open the Subclusters tab.
- Click the menu icon for the subcluster you need and select Edit.
- Enter or select the required number of hosts in the Hosts field.
- Optionally, specify the decommissioning timeout.
- Click Save changes.
Yandex Data Processing will start adding hosts.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To change the number of hosts for a subcluster:
- View the description of the CLI command for updating a subcluster:

    yc dataproc subcluster update --help

- Set a new number of hosts in the command for updating the subcluster:

    yc dataproc subcluster update <subcluster_name_or_ID> \
      --cluster-name=<cluster_name> \
      --hosts-count=<number_of_hosts>

  You can request the subcluster name or ID with the list of cluster subclusters, and the cluster name, with the list of folder clusters.
- Open the current Terraform configuration file that defines your infrastructure.
  For more information about creating this file, see Creating clusters.
- In the description of the Yandex Data Processing cluster, edit the hosts_count value under subcluster_spec for your data storage or processing subcluster:

    resource "yandex_dataproc_cluster" "data_cluster" {
      ...
      cluster_config {
        ...
        subcluster_spec {
          name = "<subcluster_name>"
          ...
          hosts_count = <number_of_hosts_in_subcluster>
        }
      }
    }
Changing the host class
Warning
Changing host properties through the Yandex Compute Cloud interfaces may result in host failure. To change the cluster host settings, use the Yandex Data Processing interfaces, such as the management console, CLI, Terraform, or API.
You can change the computing power of hosts in a separate subcluster. The computing power you need depends on the driver deploy mode:
- In deployMode=cluster mode, when the driver is deployed on one of the cluster's compute hosts, 4-8 CPU cores and 16 GB RAM are sufficient for the subcluster with the master host.
- In deployMode=client mode, when the driver is deployed on the cluster's master host, the computing power depends on the job logic and the number of concurrent jobs.
For more information about driver deploy modes and computing resource consumption, see Resource allocation.
To change the host class for a subcluster:
- In the management console, select the folder with the cluster whose subcluster you want to change.
- Select Yandex Data Processing and the required cluster.
- Navigate to Subclusters.
- Click the menu icon for the subcluster you need and select Edit.
- Select the required platform and configuration under Host class.
- Optionally, specify the decommissioning timeout.
- Click Save changes.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To change the host class for a subcluster:
- View the description of the CLI command for updating a subcluster:

    yc dataproc subcluster update --help

- Request a list of available host classes (the ZONE IDS column specifies the availability zones where you can select the appropriate class):

    yc dataproc resource-preset list

  Result:

    +-----------+--------------------------------+-------+----------+
    |    ID     |            ZONE IDS            | CORES |  MEMORY  |
    +-----------+--------------------------------+-------+----------+
    | b3-c1-m4  | ru-central1-a, ru-central1-b,  |     2 | 4.0 GB   |
    |           | ru-central1-c                  |       |          |
    | ...                                                           |
    +-----------+--------------------------------+-------+----------+

- Specify the class in the command for updating the subcluster:

    yc dataproc subcluster update <subcluster_name_or_ID> \
      --cluster-name=<cluster_name> \
      --resource-preset=<host_class>

  You can request the subcluster name or ID with the list of cluster subclusters, and the cluster name, with the list of folder clusters.
- Open the current Terraform configuration file that defines your infrastructure.
  For more information about creating this file, see Creating clusters.
- In the description of the Yandex Data Processing cluster, edit the resource_preset_id value under subcluster_spec.resources for your subcluster:

    resource "yandex_dataproc_cluster" "data_cluster" {
      ...
      cluster_config {
        ...
        subcluster_spec {
          name = "<subcluster_name>"
          ...
          resources {
            resource_preset_id = "<subcluster_host_class>"
            ...
          }
        }
      }
    }

- Make sure the settings are correct.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:

      terraform validate

    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:

      terraform plan

    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:

        terraform apply

    - Confirm updating the resources.
    - Wait for the operation to complete.

For more information about resources you can create with Terraform, see the relevant provider documentation.
Yandex Data Processing will start updating the subcluster. Note that this will restart all hosts in the subcluster being updated.
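The validate, plan, and apply steps above could be chained in a small wrapper. This is only a sketch under stated assumptions: the function name apply_reviewed_plan, the TF variable, and the tfplan file name are made up for illustration. Note that it saves the plan to a file and applies that file, so Terraform does not ask for interactive confirmation as it does in the manual steps; review the plan output before relying on it.

```shell
#!/bin/sh
# TF is the Terraform binary; override it (for example, TF=echo) for a dry run.
# The variable name is an assumption made for this sketch.
TF="${TF:-terraform}"

# apply_reviewed_plan DIR (hypothetical helper)
# Runs in a subshell so the caller's working directory is left unchanged.
apply_reviewed_plan() (
  cd "$1" || exit 1
  # Stop early if the configuration files contain errors.
  "$TF" validate || exit 1
  # Save the plan so that apply executes exactly what was shown here.
  "$TF" plan -out=tfplan || exit 1
  # Applying a saved plan file does not prompt for confirmation.
  "$TF" apply tfplan
)

# Usage (the path is a placeholder):
#   apply_reviewed_plan /path/to/terraform/config
```

Running it with TF=echo prints the three Terraform invocations without touching any resources, which is a cheap way to check the sequence first.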
Changing the autoscaling rule for data processing subclusters
You can configure the autoscaling rule in data processing subclusters.
Make sure the cloud quota is sufficient to increase the VM resources. Open the Quotas page and check the remaining quota for:
- Total HDD capacity.
- Total SSD capacity.
- Number of disks.
- Number of vCPUs for instances.
- Number of instances.
To enable autoscaling, assign the following roles to the Yandex Data Processing cluster's service account:
- dataproc.agent: To enable the service account to get info on cluster host states, jobs, and log groups.
- dataproc.provisioner: To enable the service account to work with an autoscaling instance group. This will enable subcluster autoscaling.
To configure autoscaling for subclusters:
- Navigate to the folder dashboard and select Yandex Data Processing.
- Select a cluster and open the Subclusters tab.
- Click the menu icon for the subcluster you need and select Edit.
- Under Scaling, enable Autoscaling if it is disabled.
- Configure the autoscaling settings.
- The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU utilization, disable the Default scaling setting and specify the target CPU utilization level.
- Click Save changes.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To configure autoscaling for subclusters:
- View the description of the CLI command for updating a subcluster:

    yc dataproc subcluster update --help

- Configure the autoscaling settings in the command for updating the subcluster:

    yc dataproc subcluster update <subcluster_name_or_ID> \
      --cluster-name=<cluster_name> \
      --hosts-count=<minimum_number_of_hosts> \
      --max-hosts-count=<maximum_number_of_hosts> \
      --enable-preemptible=<use_of_preemptible_VMs> \
      --warmup-duration=<VM_warmup_period> \
      --stabilization-duration=<stabilization_period> \
      --measurement-duration=<load_measurement_interval> \
      --cpu-utilization-target=<target_CPU_utilization_level> \
      --autoscaling-decommission-timeout=<decommissioning_timeout>

  Where:
  - --hosts-count: Minimum number of hosts (VMs) in a subcluster. The minimum value is 1, and the maximum value is 32.
  - --max-hosts-count: Maximum number of hosts (VMs) in a subcluster. The minimum value is 1, and the maximum value is 100.
  - --enable-preemptible: Indicates if preemptible VMs are used. It can be either true or false.
  - --warmup-duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s, and the maximum value is 600s.
  - --stabilization-duration: Period, in seconds, during which the required number of VMs cannot be decreased, in <value>s format. The minimum value is 60s, and the maximum value is 1800s.
  - --measurement-duration: Period, in seconds, for which average utilization is calculated for each VM, in <value>s format. The minimum value is 60s (1 minute), and the maximum value is 600s (10 minutes).
  - --cpu-utilization-target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used for scaling based on the number of pending resources. The minimum value is 10, and the maximum value is 100.
  - --autoscaling-decommission-timeout: Decommissioning timeout in seconds. The minimum value is 0, and the maximum value is 86400 (24 hours).

  You can request the subcluster name or ID with the list of cluster subclusters, and the cluster name, with the list of folder clusters.
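Because each autoscaling flag has a documented range, it can help to check values locally before sending the update command. The sketch below encodes the limits listed above; the helper names in_range and check_autoscaling are hypothetical, durations are passed as plain seconds without the trailing s, and the final check that the minimum host count does not exceed the maximum is an assumption rather than something stated in this section.

```shell
#!/bin/sh
# in_range NAME VALUE MIN MAX (hypothetical helper)
# Succeeds only if VALUE is a non-negative integer within MIN..MAX.
in_range() {
  case "$2" in
    ''|*[!0-9]*) echo "$1: '$2' is not a non-negative integer" >&2; return 1 ;;
  esac
  if [ "$2" -lt "$3" ] || [ "$2" -gt "$4" ]; then
    echo "$1: $2 is outside $3..$4" >&2
    return 1
  fi
}

# check_autoscaling MIN_HOSTS MAX_HOSTS WARMUP STABILIZATION MEASUREMENT CPU_TARGET DECOMMISSION
# Durations are given in seconds, without the trailing "s".
check_autoscaling() {
  in_range hosts-count "$1" 1 32 &&
    in_range max-hosts-count "$2" 1 100 &&
    in_range warmup-duration "$3" 0 600 &&
    in_range stabilization-duration "$4" 60 1800 &&
    in_range measurement-duration "$5" 60 600 &&
    in_range cpu-utilization-target "$6" 10 100 &&
    in_range autoscaling-decommission-timeout "$7" 0 86400 || return 1
  # Assumption (not stated in the text above): minimum must not exceed maximum.
  if [ "$1" -gt "$2" ]; then
    echo "hosts-count ($1) exceeds max-hosts-count ($2)" >&2
    return 1
  fi
}

# Example values chosen for illustration only.
check_autoscaling 1 10 60 120 60 70 120 && echo "settings are within the documented limits"
```

The service still validates the request itself; a local check like this only surfaces an out-of-range value before the operation is submitted.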
To configure autoscaling for subclusters:
- Open the current Terraform configuration file that defines your infrastructure.
  For more information about creating this file, see Creating clusters.
- In the description of the Yandex Data Processing cluster, add a section named subcluster_spec.autoscaling_config with the required autoscaling settings for your subcluster:

    resource "yandex_dataproc_cluster" "data_cluster" {
      ...
      cluster_config {
        ...
        subcluster_spec {
          name = "<subcluster_name>"
          role = "COMPUTENODE"
          ...
          autoscaling_config {
            max_hosts_count        = <maximum_number_of_VMs_in_group>
            measurement_duration   = <load_measurement_interval>
            warmup_duration        = <warmup_period>
            stabilization_duration = <stabilization_period>
            preemptible            = <use_of_preemptible_VMs>
            cpu_utilization_target = <target_CPU_utilization_level>
            decommission_timeout   = <decommissioning_timeout>
          }
        }
      }
    }

  Where:
  - max_hosts_count: Maximum number of hosts (VMs) in a subcluster. The minimum value is 1, and the maximum value is 100.
  - measurement_duration: Period, in seconds, for which average utilization is calculated for each VM, in <value>s format. The minimum value is 60s (1 minute), and the maximum value is 600s (10 minutes).
  - warmup_duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s, and the maximum value is 600s.
  - stabilization_duration: Period, in seconds, during which the required number of VMs cannot be decreased, in <value>s format. The minimum value is 60s, and the maximum value is 1800s.
  - preemptible: Indicates if preemptible VMs are used. It can be either true or false.
  - cpu_utilization_target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used for scaling based on the number of pending resources. The minimum value is 10, and the maximum value is 100.
  - decommission_timeout: Decommissioning timeout in seconds. The minimum value is 0, and the maximum value is 86400 (24 hours).
- Make sure the settings are correct.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:

      terraform validate

    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:

      terraform plan

    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:

        terraform apply

    - Confirm updating the resources.
    - Wait for the operation to complete.

For more information about resources you can create with Terraform, see the relevant provider documentation.
Increasing storage size
Warning
Changing host properties through the Yandex Compute Cloud interfaces may result in host failure. To change the cluster host settings, use the Yandex Data Processing interfaces, such as the management console, CLI, Terraform, or API.
You can increase the amount of storage available to each host in a subcluster.
Note
Currently, you cannot reduce storage size. If required, re-create the Yandex Data Processing subcluster.
Make sure the cloud quota is sufficient to increase the VM resources. Open the Quotas page and check the remaining quota for:
- Total HDD capacity.
- Total SSD capacity.
- Number of disks.
To change the storage size for a subcluster:
- In the management console, select the folder with the cluster whose subcluster you want to change.
- Select Yandex Data Processing and the required cluster.
- Navigate to Subclusters.
- Click the menu icon for the subcluster you need and select Edit.
- Enter or select the required amount of storage under Storage size.
- Click Save changes.
Yandex Data Processing will start updating the subcluster.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To change the storage size for a subcluster:
- View the description of the CLI command for updating a subcluster:

    yc dataproc subcluster update --help

- Specify the required storage size in the update subcluster command:

    yc dataproc subcluster update <subcluster_name_or_ID> \
      --cluster-name=<cluster_name> \
      --disk-size=<storage_size_in_GB>

  You can request the subcluster name or ID with the list of cluster subclusters, and the cluster name, with the list of folder clusters.
If the new size is larger than the current one and the quotas are sufficient, Yandex Data Processing launches the operation to increase storage space.
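Since storage size can only grow, a small guard can catch an accidental shrink before the command is sent. The following is a minimal sketch: disk_resize_cmd is a hypothetical helper, the current size is assumed to be known and passed in by the caller, and the function only prints the update command rather than running it.

```shell
#!/bin/sh
# disk_resize_cmd SUBCLUSTER CLUSTER CURRENT_GB NEW_GB (hypothetical helper)
# Prints the update command only when NEW_GB is larger than CURRENT_GB.
disk_resize_cmd() {
  # Both sizes must be whole numbers of GB.
  for n in "$3" "$4"; do
    case "$n" in
      ''|*[!0-9]*) echo "sizes must be whole numbers of GB" >&2; return 1 ;;
    esac
  done
  # Reducing storage is not supported, and an equal size changes nothing.
  if [ "$4" -le "$3" ]; then
    echo "new size ($4 GB) must exceed the current size ($3 GB)" >&2
    return 1
  fi
  printf 'yc dataproc subcluster update %s --cluster-name=%s --disk-size=%s\n' \
    "$1" "$2" "$4"
}

# Example with made-up names and sizes; the command is printed, not executed.
disk_resize_cmd dataproc-sub-1 my-dataproc 128 256
```

Printing the command keeps the sketch side-effect free; once the output looks right, it can be copied into the terminal.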
To increase the subcluster storage size:
- Open the current Terraform configuration file that defines your infrastructure.
  For more information about creating this file, see Creating clusters.
- In the description of the Yandex Data Processing cluster, edit the disk_size value under subcluster_spec.resources for your subcluster:

    resource "yandex_dataproc_cluster" "data_cluster" {
      ...
      cluster_config {
        ...
        subcluster_spec {
          name = "<subcluster_name>"
          ...
          resources {
            disk_size = <storage_size_in_GB>
            ...
          }
        }
      }
    }

- Make sure the settings are correct.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:

      terraform validate

    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:

      terraform plan

    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:

        terraform apply

    - Confirm updating the resources.
    - Wait for the operation to complete.

For more information about resources you can create with Terraform, see the relevant provider documentation.
Editing security groups
- Navigate to the folder dashboard and select Yandex Data Processing.
- Click the cluster name and open the Hosts tab.
- Click the host name.
- Under Network, click the menu icon and select Edit.
- Select the required security groups.
- Click Save.
- Open the current Terraform configuration file that defines your infrastructure.
  For more information about creating this file, see Creating clusters.
- Edit the security_group_ids value in the cluster description:

    resource "yandex_dataproc_cluster" "data_cluster" {
      ...
      security_group_ids = [ "<list_of_cluster_security_group_IDs>" ]
    }

- Make sure the settings are correct.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:

      terraform validate

    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:

      terraform plan

    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:

        terraform apply

    - Confirm updating the resources.
    - Wait for the operation to complete.

For more information about the resources you can create with Terraform, see this provider article.
Warning
You may need to additionally set up security groups to connect to the cluster.