Updating subclusters
You can perform the following actions for any subcluster:
- Changing the number of hosts.
- Changing the host class.
- Changing the autoscaling rule for data processing subclusters.
- Increasing storage size.
- Changing security groups.
You can also change a subcluster's availability zone; in that case, subclusters are migrated to a different availability zone together with the Yandex Data Processing cluster. The process depends on the cluster type:
- Migrating a lightweight cluster to a different availability zone.
- Migrating an HDFS cluster to a different availability zone.
Changing the number of hosts
You can change the number of hosts in data storage and processing subclusters:
- Go to the folder page and select Yandex Data Processing.
- Click the cluster name and open the Subclusters tab.
- Click ⋮ for the subcluster you need and select Edit.
- Enter or select the required number of hosts in the Hosts field.
- (Optional) Specify the decommissioning timeout.
- Click Save changes.
Yandex Data Processing will run the add host operation.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To change the number of hosts for a subcluster:
- View a description of the CLI update subcluster command:

  yc dataproc subcluster update --help

- Set a new number of hosts in the update subcluster command:

  yc dataproc subcluster update <subcluster_name_or_ID> \
    --cluster-name=<cluster_name> \
    --hosts-count=<number_of_hosts>

  You can request a subcluster name or ID with a list of cluster subclusters, and a cluster name with a list of folder clusters.
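For example, an invocation that scales a subcluster to five hosts might look as follows (the subcluster and cluster names are hypothetical):

  # "compute-subcluster" and "my-dataproc" are placeholder names; substitute your own.
  yc dataproc subcluster update compute-subcluster \
    --cluster-name=my-dataproc \
    --hosts-count=5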
- Open the current Terraform configuration file with an infrastructure plan.

  For more information about how to create this file, see Creating clusters.

- In the Yandex Data Processing cluster description, edit the value of the hosts_count parameter under subcluster_spec for the relevant data storage or processing subcluster:

  resource "yandex_dataproc_cluster" "data_cluster" {
    ...
    cluster_config {
      ...
      subcluster_spec {
        name = "<subcluster_name>"
        ...
        hosts_count = <number_of_subcluster_hosts>
      }
    }
  }
Changing the host class
Warning
Changing host properties through the Yandex Compute Cloud interfaces may result in host failure. To change the cluster host settings, use the Yandex Data Processing interfaces, such as the management console, CLI, Terraform, or API.
You can change the computing power of hosts in an individual subcluster. The required computing power depends on the driver deploy mode:
- In deployMode=cluster mode, when the driver is deployed on one of the cluster's compute hosts, 4–8 CPU cores and 16 GB RAM are sufficient for the subcluster with the master host.
- In deployMode=client mode, when the driver is deployed on the cluster's master host, the required computing power depends on the job logic and the number of concurrently running jobs.
For more information about driver deploy modes and computing resource consumption, see Resource allocation.
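For reference, the deploy mode is set per job through Spark properties. The sketch below shows a Spark job submission that explicitly requests deployMode=cluster; the cluster name, job name, bucket path, and main class are hypothetical, and you should check the available flags with yc dataproc job create-spark --help for your CLI version:

  # All names and URIs below are placeholders; substitute your own values.
  yc dataproc job create-spark \
    --cluster-name=my-dataproc \
    --name=example-spark-job \
    --main-jar-file-uri=s3a://<bucket>/jobs/app.jar \
    --main-class=org.example.App \
    --properties=spark.submit.deployMode=cluster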
To change the host class for a subcluster:
- In the management console, select the folder with the cluster whose subcluster you want to change.
- Select Yandex Data Processing and the required cluster.
- Go to Subclusters.
- Click ⋮ for the subcluster you need and select Edit.
- Select the required platform and configuration under Host class.
- (Optional) Specify the decommissioning timeout.
- Click Save changes.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To change the host class for a subcluster:
- View a description of the CLI update subcluster command:

  yc dataproc subcluster update --help

- Request a list of available host classes (the ZONE IDS column specifies the availability zones where you can select the appropriate class):

  yc dataproc resource-preset list

  Result:

  +-----------+--------------------------------+-------+----------+
  |    ID     |            ZONE IDS            | CORES |  MEMORY  |
  +-----------+--------------------------------+-------+----------+
  | b3-c1-m4  | ru-central1-a, ru-central1-b,  |     2 | 4.0 GB   |
  |           | ru-central1-c                  |       |          |
  | ...                                                           |
  +-----------+--------------------------------+-------+----------+

- Specify the class in the update subcluster command:

  yc dataproc subcluster update <subcluster_name_or_ID> \
    --cluster-name=<cluster_name> \
    --resource-preset=<host_class>

  You can request a subcluster name or ID with a list of cluster subclusters, and a cluster name with a list of folder clusters.
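For example, to switch a subcluster to the b3-c1-m4 class from the listing above (the subcluster and cluster names are hypothetical):

  # "compute-subcluster" and "my-dataproc" are placeholder names; substitute your own.
  yc dataproc subcluster update compute-subcluster \
    --cluster-name=my-dataproc \
    --resource-preset=b3-c1-m4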
- Open the current Terraform configuration file with an infrastructure plan.

  For more information about how to create this file, see Creating clusters.

- In the Yandex Data Processing cluster description, edit the value of the resource_preset_id parameter under subcluster_spec.resources for the relevant subcluster:

  resource "yandex_dataproc_cluster" "data_cluster" {
    ...
    cluster_config {
      ...
      subcluster_spec {
        name = "<subcluster_name>"
        ...
        resources {
          resource_preset_id = "<subcluster_host_class>"
          ...
        }
      }
    }
  }
- Make sure the settings are correct.

  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.

  - Run the command:

    terraform validate

    If there are errors in the configuration files, Terraform will point to them.

- Confirm updating the resources.

  - Run the command to view planned changes:

    terraform plan

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.

  - If you are happy with the planned changes, apply them:

    - Run the command:

      terraform apply

    - Confirm the update of resources.

    - Wait for the operation to complete.

For more information about resources you can create using Terraform, see the provider documentation.
Yandex Data Processing will run the update subcluster operation. Note that all the hosts in the updated subcluster will be restarted.
Changing the autoscaling rule for data processing subclusters
You can configure the autoscaling rule in data processing subclusters.
Make sure the cloud quota is sufficient to increase the VM resources. Open the Quotas page and check that there is still capacity available for the following resources:
- Total HDD capacity
- Total SSD capacity
- Number of disks
- Number of vCPUs for instances
- Number of instances.
To enable autoscaling, make sure the Yandex Data Processing cluster service account has the following roles assigned:
- dataproc.agent: To enable the service account to get info on cluster host states, jobs, and log groups.
- dataproc.provisioner: To enable the service account to work with an autoscaling instance group. This will enable subcluster autoscaling.
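If either role is missing, you can assign it to the service account at the folder level, for example with the CLI. The sketch below is illustrative: the folder and service account IDs are placeholders, and you should verify the available flags with yc resource-manager folder add-access-binding --help for your CLI version.

  # <folder_ID> and <service_account_ID> are placeholders; substitute your own values.
  yc resource-manager folder add-access-binding <folder_ID> \
    --role dataproc.agent \
    --service-account-id <service_account_ID>

  yc resource-manager folder add-access-binding <folder_ID> \
    --role dataproc.provisioner \
    --service-account-id <service_account_ID>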
To configure autoscaling for subclusters:
- Go to the folder page and select Yandex Data Processing.
- Select a cluster and open the Subclusters tab.
- Click ⋮ for the appropriate subcluster and select Edit.
- Under Scaling, enable Autoscaling if it is disabled.
- Set autoscaling parameters.
- The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU usage, disable the Default scaling setting and specify the target CPU utilization level.
- Click Save changes.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To configure autoscaling for subclusters:
- View a description of the CLI update subcluster command:

  yc dataproc subcluster update --help

- Set the autoscaling parameters in the update subcluster command:

  yc dataproc subcluster update <subcluster_name_or_ID> \
    --cluster-name=<cluster_name> \
    --hosts-count=<minimum_number_of_hosts> \
    --max-hosts-count=<maximum_number_of_hosts> \
    --enable-preemptible=<use_preemptible_VMs> \
    --warmup-duration=<VM_warmup_time> \
    --stabilization-duration=<stabilization_period> \
    --measurement-duration=<utilization_measurement_interval> \
    --cpu-utilization-target=<target_CPU_utilization_level> \
    --autoscaling-decommission-timeout=<decommissioning_timeout>
Where:
- --hosts-count: Minimum number of hosts (VMs) in a subcluster. The minimum value is 1 and the maximum value is 32.
- --max-hosts-count: Maximum number of hosts (VMs) in a subcluster. The minimum value is 1 and the maximum value is 100.
- --enable-preemptible: Indicates if preemptible VMs are used. It may take either the true or false value.
- --warmup-duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s and the maximum value is 600s (10 minutes).
- --stabilization-duration: Period, in seconds, during which the required number of instances cannot be decreased, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 1800s (30 minutes).
- --measurement-duration: Period, in seconds, for which utilization measurements are averaged for each instance, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 600s (10 minutes).
- --cpu-utilization-target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending (the number of pending resources) will be used as the metric. The minimum value is 10 and the maximum value is 100.
- --autoscaling-decommission-timeout: Decommissioning timeout in seconds. The minimum value is 0 and the maximum value is 86400 (24 hours).
You can request a subcluster name or ID with a list of cluster subclusters, and a cluster name with a list of folder clusters.
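For example, the following illustrative command configures a subcluster to scale between 2 and 10 hosts on preemptible VMs with a 70% CPU utilization target (the subcluster and cluster names, as well as the parameter values, are hypothetical):

  # All names and values below are examples; adjust them to your workload.
  yc dataproc subcluster update compute-subcluster \
    --cluster-name=my-dataproc \
    --hosts-count=2 \
    --max-hosts-count=10 \
    --enable-preemptible=true \
    --warmup-duration=60s \
    --stabilization-duration=120s \
    --measurement-duration=60s \
    --cpu-utilization-target=70 \
    --autoscaling-decommission-timeout=120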
To configure autoscaling for subclusters:
- Open the current Terraform configuration file with an infrastructure plan.

  For more information about how to create this file, see Creating clusters.

- In the Yandex Data Processing cluster description, add a subcluster_spec.autoscaling_config section with the required autoscaling settings for the relevant subcluster:

  resource "yandex_dataproc_cluster" "data_cluster" {
    ...
    cluster_config {
      ...
      subcluster_spec {
        name = "<subcluster_name>"
        role = "COMPUTENODE"
        ...
        autoscaling_config {
          max_hosts_count        = <maximum_number_of_VMs_in_group>
          measurement_duration   = <utilization_measurement_interval>
          warmup_duration        = <warmup_time>
          stabilization_duration = <stabilization_period>
          preemptible            = <use_preemptible_VMs>
          cpu_utilization_target = <target_vCPU_utilization_level>
          decommission_timeout   = <decommissioning_timeout>
        }
      }
    }
  }
Where:
- max_hosts_count: Maximum number of hosts (VMs) in a subcluster. The minimum value is 1 and the maximum value is 100.
- measurement_duration: Period, in seconds, for which utilization measurements are averaged for each instance, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 600s (10 minutes).
- warmup_duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s and the maximum value is 600s (10 minutes).
- stabilization_duration: Period, in seconds, during which the required number of instances cannot be decreased, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 1800s (30 minutes).
- preemptible: Indicates if preemptible VMs are used. It may take either the true or false value.
- cpu_utilization_target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending (the number of pending resources) will be used as the metric. The minimum value is 10 and the maximum value is 100.
- decommission_timeout: Decommissioning timeout in seconds. The minimum value is 0 and the maximum value is 86400 (24 hours).
- Make sure the settings are correct.

  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.

  - Run the command:

    terraform validate

    If there are errors in the configuration files, Terraform will point to them.

- Confirm updating the resources.

  - Run the command to view planned changes:

    terraform plan

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.

  - If you are happy with the planned changes, apply them:

    - Run the command:

      terraform apply

    - Confirm the update of resources.

    - Wait for the operation to complete.

For more information about resources you can create using Terraform, see the provider documentation.
Increasing storage size
Warning
Changing host properties through the Yandex Compute Cloud interfaces may result in host failure. To change the cluster host settings, use the Yandex Data Processing interfaces, such as the management console, CLI, Terraform, or API.
You can increase the amount of storage available to each host in a particular subcluster.
Note
Currently, you cannot reduce storage size. If necessary, re-create the Yandex Data Processing subcluster.
Make sure the cloud quota is sufficient to increase the VM resources. Open the Quotas page and check that there is still capacity available for the following resources:
- Total HDD capacity.
- Total SSD capacity.
- Number of disks.
To change the storage size for a subcluster:
- In the management console, select the folder with the cluster whose subcluster you want to change.
- Select Yandex Data Processing and the required cluster.
- Go to Subclusters.
- Click ⋮ for the subcluster you need and select Edit.
- Enter or select the required amount of storage under Size of storage.
- Click Save changes.
Yandex Data Processing will run the update subcluster operation.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To change the storage size for a subcluster:
- View a description of the CLI update subcluster command:

  yc dataproc subcluster update --help

- Specify the required storage size in the update subcluster command:

  yc dataproc subcluster update <subcluster_name_or_ID> \
    --cluster-name=<cluster_name> \
    --disk-size=<storage_size_in_GB>

  You can request a subcluster name or ID with a list of cluster subclusters, and a cluster name with a list of folder clusters.
If the conditions above (sufficient quota) are met, Yandex Data Processing launches the operation to increase the storage size.
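For example, an illustrative command that increases the storage of a subcluster to 128 GB (the subcluster and cluster names are hypothetical):

  # "compute-subcluster" and "my-dataproc" are placeholder names; substitute your own.
  yc dataproc subcluster update compute-subcluster \
    --cluster-name=my-dataproc \
    --disk-size=128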
To increase the subcluster storage size:
- Open the current Terraform configuration file with an infrastructure plan.

  For more information about how to create this file, see Creating clusters.

- In the Yandex Data Processing cluster description, edit the value of the disk_size parameter under subcluster_spec.resources for the relevant subcluster:

  resource "yandex_dataproc_cluster" "data_cluster" {
    ...
    cluster_config {
      ...
      subcluster_spec {
        name = "<subcluster_name>"
        ...
        resources {
          disk_size = <storage_size_in_GB>
          ...
        }
      }
    }
  }
- Make sure the settings are correct.

  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.

  - Run the command:

    terraform validate

    If there are errors in the configuration files, Terraform will point to them.

- Confirm updating the resources.

  - Run the command to view planned changes:

    terraform plan

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.

  - If you are happy with the planned changes, apply them:

    - Run the command:

      terraform apply

    - Confirm the update of resources.

    - Wait for the operation to complete.

For more information about resources you can create using Terraform, see the provider documentation.
Changing security groups
- Go to the folder page and select Yandex Data Processing.
- Click the cluster name and open the Hosts tab.
- Click the host name.
- Under Network, click ⋮ and select Edit.
- Select the appropriate security groups.
- Click Save.
- Open the current Terraform configuration file with an infrastructure plan.

  For more information about how to create this file, see Creating clusters.

- Change the value of the security_group_ids parameter in the cluster description:

  resource "yandex_dataproc_cluster" "data_cluster" {
    ...
    security_group_ids = [ "<list_of_cluster_security_group_IDs>" ]
  }
- Make sure the settings are correct.

  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.

  - Run the command:

    terraform validate

    If there are errors in the configuration files, Terraform will point to them.

- Confirm updating the resources.

  - Run the command to view planned changes:

    terraform plan

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.

  - If you are happy with the planned changes, apply them:

    - Run the command:

      terraform apply

    - Confirm the update of resources.

    - Wait for the operation to complete.

For more information about resources you can create using Terraform, see the provider documentation.
Warning
You may need to additionally set up security groups to connect to the cluster.