Creating an autoscaling Yandex Data Processing cluster in another folder's network

Written by

Yandex Cloud

Improved by

Danila N.

Updated at July 7, 2026

View in Markdown

Roles for creating a cluster
Roles required to enable autoscaling in another folder's network
Configure your network
Configure security groups
Create a Yandex Data Processing cluster

Roles for creating a cluster

To create a Yandex Data Processing cluster, your Yandex Cloud account needs the following roles:

dataproc.editor: To create a cluster.
vpc.user: To use the cluster network.
iam.serviceAccounts.user: To assign a service account to the cluster and use it to create resources.

Make sure to assign the following roles to the Yandex Data Processing cluster’s service account:

dataproc.agent: To enable the service account to get info on cluster host states, jobs, and log groups.
dataproc.provisioner: To enable the service account to work with an autoscaling instance group. This will enable subcluster autoscaling.
resource-manager.auditor or higher for the folder where you want to create a Yandex Data Processing cluster: For connection to the cluster using OS Login.

Tip

To restrict the permissions of a cluster's service account (its IAM token is available when running jobs):

Specify a separate service account for cluster autoscaling when creating or updating the cluster via the Yandex Cloud CLI, Terraform, or API.
Assign the dataproc.provisioner role to this account only.

For more information about assigning roles, see this Yandex Identity and Access Management guide.

Roles required to enable autoscaling in another folder's network

To set up autoscaling in another folder's network:

In the management console, navigate to the appropriate folder.
Grant access to the folder and the following roles to the Yandex Data Processing cluster’s service account:
- resource-manager.viewer: To view cloud and folder metadata.
- vpc.user: To use the network.
- dns.editor: To manage DNS.

Configure your network

In the management console, navigate to the folder where you need to create a network.
Create a network with subnets.
Set up a NAT gateway in the subnet to host the Yandex Data Processing cluster.

Before creating a Yandex Data Processing cluster, you need to create and configure security groups. If the selected security groups lack the required rules, Yandex Cloud will prevent creating the Yandex Data Processing cluster.

In the management console, navigate to the folder where you created your network.
Create one or more security groups for the Yandex Data Processing cluster service traffic.
Add these rules:
- One inbound and one outbound rule for service traffic:
  - Port range: 0-65535
  - Protocol: Any
  - Source/Destination name: Security group
  - Security group: Current
- A separate rule for outbound HTTPS traffic. This will allow using Yandex Object Storage buckets, UI Proxy, and autoscaling of Yandex Data Processing clusters.
  
  You can set up this rule using one of the two methods:
  To all addresses
  
  To addresses used by Yandex Cloud
  - Port range: 443.
  - Protocol: TCP.
  - Destination name: CIDR.
  - CIDR blocks: 0.0.0.0/0.
  - Port range: 443.
  - Protocol: TCP.
  - Destination name: CIDR.
  - CIDR blocks:
    
    84.201.181.26/32: Getting the Yandex Data Processing cluster status, running jobs.
    
    158.160.167.170/32: UI Proxy.
    
    158.160.59.216/32: Yandex Data Processing cluster state monitoring, autoscaling.
    
    213.180.193.243/32: Access to Object Storage.
    
    84.201.181.184/32: Cloud Logging address.
- Rule to allow NTP server access for time sync:
  - Port range: 123.
  - Protocol: UDP.
  - Destination name: CIDR.
  - CIDR blocks: 0.0.0.0/0.

If you intend to use multiple security groups for your Yandex Data Processing cluster, allow all traffic between them.

Note

You can specify more granular security group rules, such as allowing traffic only in specific subnets.

Make sure to properly configure security groups for all subnets where the Yandex Data Processing cluster hosts will reside.

You can configure security groups after creating a Yandex Data Processing cluster to connect to Apache Hive™ Metastore or Yandex Data Processing cluster hosts via the internet or an intermediate VM.

Create a Yandex Data Processing cluster

A Yandex Data Processing cluster must include a subcluster with a master host and at least one subcluster for data storage or processing.

If you want to create a Yandex Data Processing cluster copy, import its configuration to Terraform.

Management console

CLI

Terraform

API

In the management console, select the folder where you want to create a Yandex Data Processing cluster.
Click Create resource and select Yandex Data Processing cluster from the drop-down list.
Enter a name for the Yandex Data Processing cluster. Optionally, add a cluster description.

Follow these naming requirements:
- The name must be unique within the folder.
- Length: between 3 and 63 characters.
- It can only contain lowercase Latin letters, numbers, and hyphens.
- It must start with a letter and cannot end with a hyphen.
Select the environment where you want to create your cluster (you cannot change the environment once the cluster is created):
- PRODUCTION: For stable versions of your applications.
- PRESTABLE: For testing purposes. The prestable environment is similar to the production environment and likewise covered by an SLA, but it is the first to get new features, improvements, and bug fixes. In the prestable environment, you can test new versions for compatibility with your application.
Add or delete cluster labels. You can use them to organize resources into logical groups.
Specify the following cluster settings:
- Image version and services you want to use in the Yandex Data Processing cluster.
  
  You can use an image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
  
  Tip
  
  To use the most recent image version, specify 2.1.
- Public part of the SSH key in the SSH key field. For information on how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
- Service account that needs access to the Yandex Data Processing cluster.
- Availability zone for the Yandex Data Processing cluster.
- Optionally, cluster component properties.
- Optionally, custom initialization scripts for cluster hosts. For each script, specify the following information:
  - URI to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
  - Optionally, script timeout in seconds. If your initialization script runs longer than the specified time period, it will be terminated.
  - Optionally, arguments enclosed in square brackets and separated by commas. The initialization script will use these arguments. Here is an example of arguments:
```
["arg1","arg2",...,"argN"]
```
- Format for specifying the Yandex Object Storage bucket name, List or ID.
- Bucket the cluster will use.
  
  Depending on the format you selected, either pick a name from the list or specify it manually. You can get it with the list of buckets in the folder.
  
  The Yandex Data Processing cluster’s service account must have READ and WRITE permissions for this bucket.
- Format for specifying a network for the Yandex Data Processing cluster.
- Network for the cluster.
- Security groups with the required permissions.
  
  Warning
  
  When you create a Yandex Data Processing cluster, security group settings are verified. If the Yandex Data Processing cluster cannot work properly with these settings, you will get a warning. You can find an example of working settings above.
- UI Proxy. Enabling the UI Proxy option will make available the web interfaces of the Yandex Data Processing components.
- Yandex Cloud Logging log group the cluster will send logs to.
  
  To save logs in a log group, assign the logging.writer role to the cluster service account.
Configure Yandex Data Processing subclusters. The following types are available:
- Subcluster with a master host, designated as Master. There can only be one such subcluster.
- Data storage subclusters, designated as Data. These are used to deploy storage components.
- Data processing subclusters, designated as Compute. These are used to deploy computing components. Storage of such a subcluster is intended only for temporarily storing files being processed.
For each Yandex Data Processing subcluster, specify:
- Subcluster name.
- Number of hosts (with only one master host allowed).
- Host class, which defines the platform and computing resources available to the host.
- Storage size and type.
- Subnet.
  
  In the subnet, you need to set up a NAT gateway for the Yandex Data Processing subcluster with a master host. For more information, see Configure your network.
- Access to Yandex Data Processing subcluster hosts from the internet. To enable access, select Public access. In this case, you can only connect to Yandex Data Processing subcluster hosts over SSL. For more information, see Connecting to a Yandex Data Processing cluster.
  
  Warning
  
  After you create a Yandex Data Processing cluster, you cannot request or disable public access to a subcluster. However, you can delete the Yandex Data Processing subcluster for data processing and recreate it with the public access settings you need.
Optionally, set up autoscaling for data processing subclusters:
1. In the Compute type subcluster settings, enable Autoscaling.
2. Configure the autoscaling settings.
3. The default metric used for autoscaling is yarn.cluster.containersPending. To enable CPU-based scaling, disable Default scaling and specify the target CPU utilization.
4. Click Add.
Optionally, add and configure additional subclusters for data storage or processing.
Optionally, enable accidental cluster deletion protection in the additional settings.

Even with cluster deletion protection enabled, you can still connect to the Yandex Data Processing cluster manually and delete the data.
Click Create cluster.

If you do not have the Yandex Cloud CLI yet, install and initialize it.

The folder used by default is the one specified when creating the CLI profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also specify a different folder for any command using --folder-name or --folder-id. If you access a resource by its name, the search will be limited to the default folder. If you access a resource by its ID, the search will be global, i.e., through all folders based on access permissions.

To create a Yandex Data Processing cluster:

Check whether the folder has any subnets for the Yandex Data Processing cluster hosts:
```
yc vpc subnet list
```
If there are no subnets in the folder, create the right ones in Yandex Virtual Private Cloud.
See the description of the CLI command for creating a Yandex Data Processing cluster:
```
yc dataproc cluster create --help
```
Specify the Yandex Data Processing cluster properties in this command (the example does not show all that are available):
```
yc dataproc cluster create <cluster_name> \
  --environment=<environment> \
  --bucket=<bucket_name> \
  --zone=<availability_zone> \
  --service-account-name=<cluster_service_account_name> \
  --autoscaling-service-account-name=<name_of_service_account_of_autoscaling_subclusters> \
  --version=<image_version> \
  --services=<list_of_components> \
  --ssh-public-keys-file=<path_to_public_SSH_key> \
  --subcluster name=<name_of_subcluster_with_master_host>,`
               `role=masternode,`
               `resource-preset=<host_class>,`
               `disk-type=<storage_type>,`
               `disk-size=<storage_size_in_GB>,`
               `subnet-name=<subnet_name>,`
               `assign-public-ip=<public_access_to_subcluster_host> \
  --subcluster name=<name_of_subcluster_for_storing_data>,`
               `role=datanode,`
               `resource-preset=<host_class>,`
               `disk-type=<storage_type>,`
               `disk-size=<storage_size_in_GB>,`
               `subnet-name=<subnet_name>,`
               `hosts-count=<number_of_hosts>,`
               `assign-public-ip=<public_access_to_subcluster_host> \
  --deletion-protection \
  --ui-proxy=<access_to_component_web_interfaces> \
  --log-group-id=<log_group_ID> \
  --security-group-ids=<list_of_security_group_IDs>
```
Note

The Yandex Data Processing cluster name must be unique within the folder. It may contain Latin letters, numbers, hyphens, and underscores. The name may be up to 63 characters long.

Where:
- --environment: Cluster environment, prestable or production.
- --bucket: Name of the Object Storage bucket to store job dependencies and results. The Yandex Data Processing cluster’s service account must have READ and WRITE permissions for this bucket.
- --zone: Availability zone where the Yandex Data Processing cluster hosts will reside.
- --service-account-name: Name of the Yandex Data Processing cluster service account.
- --autoscaling-service-account-name: Name of the service account for managing autoscaling subclusters. This is an optional setting.
- --version: Image version.
  
  You can use an image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
  
  Tip
  
  To use the most recent image version, set --version to 2.1.
- --services: List of components you want to use in your Yandex Data Processing cluster. If you skip this setting, the yarn, tez, and spark components will be used by default.
- --ssh-public-keys-file: Full path to the file with the public part of the SSH key for access to the Yandex Data Processing cluster hosts. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
- --subcluster: Properties of Yandex Data Processing subclusters:
  - name: Yandex Data Processing subcluster name.
  - role: Yandex Data Processing subcluster role, which can be masternode, datanode, or computenode.
  - resource-preset: Host class.
  - disk-type: Storage type, which can be network-ssd, network-hdd, or network-ssd-nonreplicated.
  - disk-size: Storage size, in GB.
  - subnet-name: Subnet name.
  - hosts-count: Number of hosts in the Yandex Data Processing subclusters for data storage or processing. The valid values range from 1 to 32.
  - assign-public-ip: Access to Yandex Data Processing subcluster hosts from the internet. It can either be true or false. If access is enabled, you can only connect to the Yandex Data Processing cluster over SSL. For more information, see Connecting to a Yandex Data Processing cluster.
    
    Warning
    
    After you create a Yandex Data Processing cluster, you cannot request or disable public access to a subcluster. However, you can delete the Yandex Data Processing subcluster for data processing and recreate it with the public access settings you need.
- --deletion-protection: Yandex Data Processing cluster deletion protection.
  
  Even with deletion protection enabled, one can still connect to the cluster manually and delete the data.
- --ui-proxy: Access to the web interfaces of the Yandex Data Processing components. It can be either true or false.
- --log-group-id: Log group ID.
- --security-group-ids: List of security group IDs.
To create a Yandex Data Processing cluster with multiple subclusters for data storage or processing, provide the required number of --subcluster arguments in the cluster creation command:
```
yc dataproc cluster create <cluster_name> \
  ...
  --subcluster <subcluster_properties> \
  --subcluster <subcluster_properties> \
  ...
```
To enable autoscaling in Yandex Data Processing subclusters for data processing, specify the following properties:
```
yc dataproc cluster create <cluster_name> \
  ...
  --subcluster name=<subcluster_name>,`
               `role=computenode`
               `...`
               `hosts-count=<minimum_number_of_hosts>`
               `max-hosts-count=<maximum_number_of_hosts>,`
               `preemptible=<using_preemptible_VMs>,`
               `warmup-duration=<VM_warmup_period>,`
               `stabilization-duration=<stabilization_period>,`
               `measurement-duration=<load_measurement_interval>,`
               `cpu-utilization-target=<target_CPU_utilization>,`
               `autoscaling-decommission-timeout=<decommissioning_timeout>
```
Where:
- hosts-count: Minimum number of hosts (VMs) per Yandex Data Processing subcluster. The minimum value is 1, and the maximum value is 32.
- max-hosts-count: Maximum number of hosts (VMs) per Yandex Data Processing subcluster. The valid values range from 1 to 100.
- preemptible: Specifies if preemptible VMs are used. It can either be true or false.
- warmup-duration: Time required to warm up a VM, in <value>s format. The minimum value is 0s and the maximum, 600s (ten minutes).
- stabilization-duration: Period, in seconds, during which the required number of VMs cannot be decreased, in <value>s format. The minimum value is 60s (one minute). The maximum value is 1800s (30 minutes).
- measurement-duration: Period, in seconds, for which average utilization is calculated for each VM, in <value>s format. The minimum value is 60s (one minute). The maximum value is 600s (ten minutes).
- cpu-utilization-target: Target CPU utilization, in %. Use this setting to enable CPU-based autoscaling. Otherwise, yarn.cluster.containersPending will be used for autoscaling based on the number of pending resources. The minimum value is 10 and the maximum, 100.
- autoscaling-decommission-timeout: Decommissioning timeout, in seconds. The minimum value is 0 and the maximum, 86400 (24 hours).
To create a Yandex Data Processing cluster based on dedicated host groups, specify their IDs as a comma-separated list in the --host-group-ids parameter:
```
yc dataproc cluster create <cluster_name> \
  ...
  --host-group-ids=<dedicated_host_group_IDs>
```
Alert

You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
To configure Yandex Data Processing cluster hosts using initialization scripts, specify them in one or multiple --initialization-action parameters:
```
yc dataproc cluster create <cluster_name> \
  ...
  --initialization-action uri=<initialization_script_URI>,`
                          `timeout=<script_timeout>,`
                          `args=["arg1","arg2","arg3",...]
```
Where:
- uri: Link to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
- timeout (optional): Script timeout, in seconds. If your initialization script runs longer than the specified time period, it will be terminated.
- args (optional): Arguments for the initialization script, provided as a comma-separated list.

With Terraform, you can quickly create a cloud infrastructure in Yandex Cloud and manage it using configuration files. These files store the infrastructure description written in HashiCorp Configuration Language (HCL). If you change the configuration files, Terraform automatically detects which part of your configuration is already deployed, and what should be added or removed.

Terraform is distributed under the Business Source License. The Yandex Cloud provider for Terraform is distributed under the MPL-2.0 license.

For more information about the provider resources, see the guides on the Terraform website or its mirror.

To create a Yandex Data Processing cluster:

In the command line, navigate to the directory that will contain the Terraform configuration files describing your infrastructure. If there is no such directory, create one.
If you do not have Terraform yet, install it and configure the Yandex Cloud provider.

To manage infrastructure using Terraform under a service account or user accounts (a Yandex account, a federated account, or a local user), authenticate using the appropriate method.

Create a configuration file with a description of your cloud network, subnets, security group, and NAT gateway.

The Yandex Data Processing cluster resides in a cloud network. If you already have a network in place, you do not need to describe it again.

The Yandex Data Processing cluster hosts reside in subnets of the selected cloud network. If you already have suitable subnets, you do not need to describe them again.

Below is an example of a configuration file describing a single-subnet cloud network, security group, NAT gateway, and route table:

resource "yandex_vpc_network" "test_network" {
  name = "<network_name>"
}

resource "yandex_vpc_subnet" "test_subnet" {
  name           = "<subnet_name>"
  zone           = "<availability_zone>"
  network_id     = yandex_vpc_network.test_network.id
  v4_cidr_blocks = ["<subnet>"]
  route_table_id = yandex_vpc_route_table.data-processing-rt.id
}

resource "yandex_vpc_gateway" "data-processing-gateway" {
  name = "data-processing-gateway"
  shared_egress_gateway {}
}

resource "yandex_vpc_route_table" "data-processing-rt" {
  network_id = "${yandex_vpc_network.test-network.id}"

  static_route {
    destination_prefix = "0.0.0.0/0"
    gateway_id         = "${yandex_vpc_gateway.data-processing-gateway.id}"
  }
}

resource "yandex_vpc_security_group" "data-processing-sg" {
  description = "Security group for DataProc"
  name        = "data-processing-security-group"
  network_id  = yandex_vpc_network.data-proc-network.id

  egress {
    description    = "Allow outgoing HTTPS traffic"
    protocol       = "TCP"
    port           = 443
    v4_cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description       = "Allow any incomging traffic within the security group"
    protocol          = "ANY"
    from_port         = 0
    to_port           = 65535
    predefined_target = "self_security_group"
  }

  egress {
    description       = "Allow any outgoing traffic within the security group"
    protocol          = "ANY"
    from_port         = 0
    to_port           = 65535
    predefined_target = "self_security_group"
  }

  egress {
    description    = "Allow outgoing traffic to NTP servers for time synchronization"
    protocol       = "UDP"
    port           = 123
    v4_cidr_blocks = ["0.0.0.0/0"]
  }
}

Create a configuration file describing the following resources:

Yandex Data Processing cluster service account that needs access to the Object Storage cluster.
Service account to manage autoscaling subclusters, if required.
Service account to create the Object Storage bucket.
Static key.
Object Storage bucket to store job execution results.

resource "yandex_iam_service_account" "data_proc_sa" {
  name        = "<service_account_name>"
  description = "<service_account_description>"
}

resource "yandex_iam_service_account" "data_proc_sa_autoscaling" {
  name        = "<service_account_name>"
  description = "<service_account_description>"
}

resource "yandex_resourcemanager_folder_iam_member" "dataproc-agent" {
  folder_id = "<folder_ID>"
  role      = "dataproc.agent"
  member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
}

resource "yandex_resourcemanager_folder_iam_member" "dataproc-provisioner" {
  folder_id = "<folder_ID>"
  role      = "dataproc.provisioner"
  member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa_autoscaling.id}"
}

resource "yandex_iam_service_account" "bucket_sa" {
  name        = "<service_account_name>"
  description = "<service_account_description>"
}

resource "yandex_resourcemanager_folder_iam_member" "storage-admin" {
  folder_id = "<folder_ID>"
  role      = "storage.admin"
  member    = "serviceAccount:${yandex_iam_service_account.bucket_sa.id}"
}

resource "yandex_iam_service_account_static_access_key" "bucket_sa_static_key" {
  service_account_id = yandex_iam_service_account.bucket_sa.id
}

resource "yandex_storage_bucket" "data_bucket" {
  depends_on = [
    yandex_resourcemanager_folder_iam_member.storage-admin
  ]

  grant {
    id          = yandex_iam_service_account.data_proc_sa.id
    type        = "CanonicalUser"
    permissions = ["READ","WRITE"]
  }

  bucket     = "<bucket_name>"
  access_key = yandex_iam_service_account_static_access_key.bucket_sa_static_key.access_key
  secret_key = yandex_iam_service_account_static_access_key.bucket_sa_static_key.secret_key
}

Create a configuration file describing the Yandex Data Processing cluster and its subclusters.

Here, you can also specify the properties of the Yandex Data Processing cluster components, jobs, and environment, if required.

Below is an example of a configuration file structure that describes a Yandex Data Processing cluster consisting of a master host subcluster, data storage subcluster, and data processing subcluster:

resource "yandex_dataproc_cluster" "data_cluster" {
  bucket                         = "${yandex_storage_bucket.data_bucket.bucket}"
  name                           = "<cluster_name>"
  description                    = "<cluster_description>"
  environment                    = "<cluster_environment>"
  service_account_id             = yandex_iam_service_account.data_proc_sa.id
  autoscaling_service_account_id = yandex_iam_service_account.data_proc_sa_autoscaling.id
  zone_id                        = "<availability_zone>"
  security_group_ids             = [yandex_vpc_security_group.data-processing-sg.id]
  deletion_protection            = <cluster_deletion_protection>

  depends_on = [
    yandex_resourcemanager_folder_iam_member.dataproc-provisioner,
    yandex_resourcemanager_folder_iam_member.dataproc-agent
  ]

  cluster_config {
    version_id = "<image_version>"

    hadoop {
      services   = ["<list_of_components>"]
      # List example: ["HDFS", "YARN", "SPARK", "TEZ", "MAPREDUCE", "HIVE"].
      properties = {
        "<component_property>" = <value>
        ...
      }
      ssh_public_keys = [
        file("<path_to_public_SSH_key>")
      ]
    }

    subcluster_spec {
      name = "<name_of_subcluster_with_master_host>"
      role = "MASTERNODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = 1
    }

    subcluster_spec {
      name = "<name_of_subcluster_for_storing_data>"
      role = "DATANODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = <number_of_hosts_in_subcluster>
    }

    subcluster_spec {
      name = "<data_processing_subcluster_name>"
      role = "COMPUTENODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = <number_of_hosts_in_subcluster>
    }
  }
}

Where deletion_protection means Yandex Data Processing cluster deletion protection. It can either be true or false.

Even with cluster deletion protection enabled, one can still delete a user or database or connect manually and delete the database contents.

You can use an image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.

Tip

To use the most recent image version, set version_id to 2.1.

To access the web interfaces of the Yandex Data Processing components, add the ui_proxy field set to true to the Yandex Data Processing cluster description:

resource "yandex_dataproc_cluster" "data_cluster" {
  ...
  ui_proxy = true
  ...
}

To configure autoscaling in Yandex Data Processing subclusters for data processing, add the autoscaling_config section with the settings you need to subcluster_spec of the subcluster in question:

subcluster_spec {
  name = "<subcluster_name>"
  role = "COMPUTENODE"
  ...
  autoscaling_config {
    max_hosts_count        = <maximum_number_of_VMs_in_group>
    measurement_duration   = <load_measurement_interval>
    warmup_duration        = <warmup_period>
    stabilization_duration = <stabilization_period>
    preemptible            = <use_of_preemptible_VMs>
    cpu_utilization_target = <target_CPU_utilization>
    decommission_timeout   = <decommissioning_timeout>
  }
}

Where:

max_hosts_count: Maximum number of hosts (VMs) per Yandex Data Processing subcluster. The minimum value is 1, and the maximum value is 100.
measurement_duration: Period, in seconds, for which average utilization is calculated for each VM, in <value>s format. The minimum value is 60s (1 minute), and the maximum value is 600s (10 minutes).
warmup_duration: Time required to warm up a VM, in <value>s format. The minimum value is 0s, and the maximum value is 600s.
stabilization_duration: Period, in seconds, during which the required number of VMs cannot be decreased, in <value>s format. The minimum value is 60s and the maximum value is 1800s.
preemptible: Indicates if preemptible VMs are used. It can be either true or false.
cpu_utilization_target: Target CPU utilization, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used for scaling based on the number of pending resources. The minimum value is 10, and the maximum value is 100.
decommission_timeout: Decommissioning timeout, in seconds. The minimum value is 0 and the maximum, 86400 (24 hours).

For more information about the resources you can create with Terraform, see this provider guide.

Make sure the Terraform configuration files are correct:
1. In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
2. Run this command:
```
terraform validate
```
  Terraform will show any errors found in your configuration files.
Create a Yandex Data Processing cluster:
1. Run this command to view the planned changes:
```
terraform plan
```
  If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
2. If everything looks correct, apply the changes:
  1. Run this command:
```
terraform apply
```
  2. Confirm updating the resources.
  3. Wait for the operation to complete.
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.

To create a Yandex Data Processing cluster, call the create API method and provide the following in the request:

ID of the folder to host the Yandex Data Processing cluster, in the folderId parameter.
Yandex Data Processing cluster name in the name parameter.
Cluster environment, PRESTABLE or PRODUCTION, in the environment parameter.
Yandex Data Processing cluster configuration in the configSpec parameter, including:
- Image version in the configSpec.versionId parameter.
  
  You can use an image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
  
  Tip
  
  To use the most recent image version, specify 2.1.
- List of components in the configSpec.hadoop.services parameter.
- Public part of the SSH key in the configSpec.hadoop.sshPublicKeys parameter.
- Settings of the Yandex Data Processing subclusters in the configSpec.subclustersSpec parameter.
Availability zone of the Yandex Data Processing cluster in the zoneId parameter.
ID of the Yandex Data Processing cluster service account in the serviceAccountId parameter.
ID of the service account for managing autoscaling subclusters in the autoscalingServiceAccountId parameter, if required.
Bucket name in the bucket parameter.
IDs of the Yandex Data Processing cluster security groups in the hostGroupIds parameter.
Yandex Data Processing cluster deletion protection settings in the deletionProtection parameter.

Even with deletion protection enabled, one can still connect to the cluster manually and delete the data.

To assign a public IP address to all hosts of a Yandex Data Processing subcluster, provide true in configSpec.subclustersSpec.assignPublicIp.

To create a Yandex Data Processing cluster based on dedicated host groups, provide a list of their IDs in the hostGroupIds parameter.

Alert

You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.

To configure Yandex Data Processing cluster hosts using initialization scripts, specify them in one or multiple configSpec.hadoop.initializationActions parameters.

After the Yandex Data Processing cluster status switches to Running, you can connect to the Yandex Data Processing subcluster hosts using the specified SSH key.

Creating an autoscaling Yandex Data Processing cluster in another folder's network

Roles for creating a clusterRoles for creating a cluster

Roles required to enable autoscaling in another folder's networkRoles required to enable autoscaling in another folder's network

Configure your networkConfigure your network

Configure security groupsConfigure security groups

Create a Yandex Data Processing clusterCreate a Yandex Data Processing cluster

Was the article helpful?

Roles for creating a cluster

Roles required to enable autoscaling in another folder's network

Configure your network

Configure security groups

Create a Yandex Data Processing cluster