

Creating a Yandex Data Processing cluster

Written by
Yandex Cloud
Improved by
Danila N.
Updated on July 21, 2025
  • Roles for creating a cluster
  • Configure your network
  • Configure security groups
  • Create a Yandex Data Processing cluster
  • Create a Yandex Data Processing cluster copy
  • Example
    • Creating a lightweight Yandex Data Processing cluster for Spark and PySpark jobs

Roles for creating a cluster

To create a Yandex Data Processing cluster, your Yandex Cloud account needs the following roles:

  • dataproc.editor: To create a cluster.
  • vpc.user: To use the cluster network.
  • iam.serviceAccounts.user: To link a service account to the cluster and create resources under that service account.

Make sure to assign the following roles to the Yandex Data Processing cluster’s service account:

  • dataproc.agent: To enable the service account to get info on cluster host states, jobs, and log groups.
  • dataproc.provisioner: To enable the service account to work with an autoscaling instance group. This will enable subcluster autoscaling.

For more information about assigning roles, see the Yandex Identity and Access Management documentation.
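For example, both service account roles can be assigned at the folder level with the yc resource-manager folder add-binding CLI command; the folder ID and service account ID below are placeholders to substitute with your own values:

  yc resource-manager folder add-binding <folder_ID> \
    --role dataproc.agent \
    --subject serviceAccount:<service_account_ID>

  yc resource-manager folder add-binding <folder_ID> \
    --role dataproc.provisioner \
    --subject serviceAccount:<service_account_ID>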

Configure your network

Configure internet access from the subnet to which the Yandex Data Processing subcluster with a master host will be connected, e.g., using a NAT gateway. This will enable the Yandex Data Processing subcluster to interact with Yandex Cloud services or hosts in other networks.

Configure security groups

Warning

Before creating a Yandex Data Processing cluster, you need to create and configure security groups. If the selected security groups lack the required rules, Yandex Cloud will not allow you to create the Yandex Data Processing cluster.

  1. Create one or more security groups for the Yandex Data Processing cluster’s control plane traffic.
  2. Add rules:
    • One rule for incoming and another one for outgoing service traffic:

      • Port range: 0-65535
      • Protocol: Any
      • Source/Destination name: Security group
      • Security group: Current
    • A separate rule for outgoing HTTPS traffic. This will allow you to use Yandex Object Storage buckets, UI Proxy, and autoscaling of Yandex Data Processing clusters.

      You can set up this rule in one of two ways:

      To all addresses:

      • Port range: 443
      • Protocol: TCP
      • Destination name: CIDR
      • CIDR blocks: 0.0.0.0/0

      To the addresses used by Yandex Cloud:

      • Port range: 443
      • Protocol: TCP
      • Destination name: CIDR
      • CIDR blocks:
        • 84.201.181.26/32: Getting the Yandex Data Processing cluster status, running jobs, UI Proxy.
        • 158.160.59.216/32: Monitoring the Yandex Data Processing cluster state, autoscaling.
        • 213.180.193.243/32: Access to Object Storage.
        • 84.201.181.184/32: Cloud Logging address.
    • Rule that allows access to NTP servers for time syncing:

      • Port range: 123
      • Protocol: UDP
      • Destination name: CIDR
      • CIDR blocks: 0.0.0.0/0

If you intend to use multiple security groups for your Yandex Data Processing cluster, allow all traffic across these groups.

Note

You can specify more granular rules for your security groups, such as allowing traffic only within specific subnets.

You must configure security groups correctly for all subnets where the Yandex Data Processing cluster’s hosts will reside.

You can set up security groups after creating a Yandex Data Processing cluster to connect to Metastore or Yandex Data Processing cluster hosts via the internet or an intermediate VM.
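As a sketch, the rules above can also be created with the CLI. The group name below is hypothetical, and the example assumes the key=value rule syntax of the yc vpc security-group create command:

  yc vpc security-group create \
    --name data-processing-sg \
    --network-name <network_name> \
    --rule "direction=ingress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group" \
    --rule "direction=egress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group" \
    --rule "direction=egress,port=443,protocol=tcp,v4-cidrs=[0.0.0.0/0]" \
    --rule "direction=egress,port=123,protocol=udp,v4-cidrs=[0.0.0.0/0]"

This variant allows HTTPS traffic to all addresses; to restrict it to the Yandex Cloud addresses listed above, specify those CIDR blocks in v4-cidrs instead of 0.0.0.0/0.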

Create a Yandex Data Processing cluster

A Yandex Data Processing cluster must include a subcluster with a master host and at least one subcluster for data storage or processing.

If you want to create a Yandex Data Processing cluster copy, import its configuration to Terraform.

Management console
CLI
Terraform
API
  1. In the management console, select the folder where you want to create a Yandex Data Processing cluster.

  2. Click Create resource and select Yandex Data Processing cluster from the drop-down list.

  3. Enter a name for the Yandex Data Processing cluster. Optionally, add a cluster description.

    Follow these naming requirements:

    • The name must be unique within the folder.
    • It must be from 2 to 63 characters long.
    • It can only contain lowercase Latin letters, numbers, and hyphens.
    • It must start with a letter and cannot end with a hyphen.
  4. Select the environment to create the cluster in (you cannot change the environment once the cluster is created):

    • PRODUCTION: For stable versions of your apps.
    • PRESTABLE: For testing purposes. The prestable environment is similar to the production environment and likewise covered by the SLA, but it is the first to get new functionalities, improvements, and bug fixes. In the prestable environment, you can test the compatibility of new versions with your application.
  5. Add or delete cluster labels. You can use them to split or join resources into logical groups.

  6. Specify the following cluster settings:

    • Image version and services you want to use in the Yandex Data Processing cluster.

      You can use an image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.

      Tip

      To use the most recent image version, specify 2.0.

    • Method of connecting to a Yandex Data Processing cluster:

      • SSH key: Using SSH keys.
      • Access via OS Login: Using OS Login.

      If you selected access via SSH key, specify its public part in the SSH key field. For information on how to generate and use SSH keys, see the Yandex Compute Cloud documentation.

    • Service account that needs access to the Yandex Data Processing cluster.

    • Availability zone for the Yandex Data Processing cluster.

    • Optionally, cluster component properties.

    • Optionally, custom initialization scripts for cluster hosts. For each script, specify the following information:

      • URI of the initialization script, using the https://, http://, hdfs://, or s3a:// scheme.

      • Optionally, script timeout (in seconds). If your initialization script runs longer than the specified time period, it will be terminated.

      • Optionally, arguments enclosed in square brackets and separated by commas. The initialization script will use these arguments. Sample arguments:

        ["arg1","arg2",...,"argN"]
        
    • Name selection format for the Yandex Object Storage bucket: List or ID.

    • Bucket the cluster will use.

      Depending on the format you selected, either pick a name from the list or specify it manually. You can find it in the list of buckets in the folder.

      The Yandex Data Processing cluster’s service account must have READ and WRITE permissions for this bucket.

    • Format in which to specify a network for the Yandex Data Processing cluster.

    • Network for the cluster.

    • Security groups with the required permissions.

      Warning

      When you create a Yandex Data Processing cluster, its security group settings are verified. If the Yandex Data Processing cluster cannot operate properly with these settings, you will get a warning. You can find a sample functional configuration above.

    • UI Proxy. Enabling the UI Proxy option makes the web interfaces of the Yandex Data Processing components available.

    • Yandex Cloud Logging log group the cluster will send logs to.

      To save logs in a log group, assign the logging.writer role to the cluster service account.

  7. Configure the Yandex Data Processing subclusters. The following types are available:

    • Subcluster with a master host is specified as Master. There can only be one such subcluster.
    • Data storage subclusters are specified as Data. These are used to deploy storage components.
    • Data processing subclusters are specified as Compute. These are used to deploy computing components. The storage of such a subcluster can only be used to temporarily store processed files.

    For each Yandex Data Processing subcluster, specify:

    • Subcluster name.

    • Number of hosts (there can only be one master host).

    • Host class, i.e., the platform and computing resources available to the host.

    • Storage size and type.

    • Subnet.

      In the subnet, you need to set up a NAT gateway for the Yandex Data Processing subcluster with a master host. For more information, see Configure your network.

    • Access to Yandex Data Processing subcluster hosts from the internet. Select the Public access option to enable access. In this case, you can only connect to Yandex Data Processing subcluster hosts over SSL. For more information, see Connecting to a Yandex Data Processing cluster.

      Warning

      After you create a Yandex Data Processing cluster, you cannot enable or disable public access to a subcluster. However, you can delete the Yandex Data Processing subcluster for data processing and create it again with the public access settings you need.

  8. Optionally, set up automatic scaling for data processing subclusters.

    1. In the Compute type subcluster settings, enable Autoscaling.
    2. Configure the autoscaling settings.
    3. The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU utilization, disable Default scaling and specify the target CPU utilization level.
    4. Click Add.
  9. Optionally, add and set up additional subclusters for data storage or processing.

  10. Optionally, use additional settings to enable protection against accidental cluster deletion.

    Even with cluster deletion protection enabled, you can still connect to the Yandex Data Processing cluster manually and delete the data.

  11. Click Create cluster.

If you do not have the Yandex Cloud CLI installed yet, install and initialize it.

By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.

To create a Yandex Data Processing cluster:

  1. Check whether the folder has any subnets for the Yandex Data Processing cluster hosts:

    yc vpc subnet list
    

    If there are no subnets in the folder, create the required subnets in Yandex Virtual Private Cloud.

  2. View the description of the CLI command for creating a Yandex Data Processing cluster:

    yc dataproc cluster create --help
    
  3. Specify the Yandex Data Processing cluster configuration in the create command (the list of supported settings in the example is not exhaustive):

    yc dataproc cluster create <cluster_name> \
      --environment=<environment> \
      --bucket=<bucket_name> \
      --zone=<availability_zone> \
      --service-account-name=<service_account_name> \
      --version=<image_version> \
      --services=<list_of_components> \
      --ssh-public-keys-file=<path_to_public_SSH_key> \
      --subcluster name=<name_of_subcluster_with_master_host>,`
                   `role=masternode,`
                   `resource-preset=<host_class>,`
                   `disk-type=<storage_type>,`
                   `disk-size=<storage_size_in_GB>,`
                   `subnet-name=<subnet_name>,`
                   `assign-public-ip=<public_access_to_subcluster_host> \
      --subcluster name=<name_of_subcluster_for_storing_data>,`
                   `role=datanode,`
                   `resource-preset=<host_class>,`
                   `disk-type=<storage_type>,`
                   `disk-size=<storage_size_in_GB>,`
                   `subnet-name=<subnet_name>,`
                   `hosts-count=<number_of_hosts>,`
                   `assign-public-ip=<public_access_to_subcluster_host> \
      --deletion-protection \
      --ui-proxy=<access_to_component_web_interfaces> \
      --log-group-id=<log_group_ID> \
      --security-group-ids=<list_of_security_group_IDs>
    

    Note

    The Yandex Data Processing cluster name must be unique within the folder. It may contain Latin letters, numbers, hyphens, and underscores. The name may be up to 63 characters long.

    Where:

    • --environment: Cluster environment, either prestable or production.

    • --bucket: Name of an Object Storage bucket where job dependencies and results will be stored. The Yandex Data Processing cluster’s service account must have READ and WRITE permissions for this bucket.

    • --zone: Availability zone where the Yandex Data Processing cluster hosts will reside.

    • --service-account-name: Name of the Yandex Data Processing cluster’s service account.

    • --version: Image version.

      You can use an image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.

      Tip

      To use the most recent image version, specify 2.0 in the --version parameter.

    • --services: List of components that you want to use in the Yandex Data Processing cluster. If you do not specify this parameter, these default components will be used: yarn, tez, and spark.

    • --ssh-public-keys-file: Full path to the file with the public part of the SSH key for access to the Yandex Data Processing cluster hosts. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.

      Note

      To connect to a cluster via OS Login, replace --ssh-public-keys-file with the --oslogin parameter in the command.

    • --subcluster: Properties of Yandex Data Processing subclusters:

      • name: Yandex Data Processing subcluster name.

      • role: Yandex Data Processing subcluster role, which can be masternode, datanode, or computenode.

      • resource-preset: Host class.

      • disk-type: Storage type, which can be network-ssd, network-hdd, or network-ssd-nonreplicated.

      • disk-size: Storage size in GB.

      • subnet-name: Subnet name.

      • hosts-count: Number of hosts in the Yandex Data Processing subclusters for data storage or processing. The minimum value is 1, and the maximum value is 32.

      • assign-public-ip: Access to Yandex Data Processing subcluster hosts from the internet. It can either be true or false. If access is enabled, you can only connect to the Yandex Data Processing cluster over SSL. For more information, see Connecting to a Yandex Data Processing cluster.

        Warning

        After you create a Yandex Data Processing cluster, you cannot enable or disable public access to a subcluster. However, you can delete the Yandex Data Processing subcluster for data processing and create it again with the public access settings you need.

    • --deletion-protection: Protection against accidental deletion of the Yandex Data Processing cluster.

      Even with deletion protection enabled, one can still connect to the cluster manually and delete the data.

    • --ui-proxy: Access to the web interfaces of the Yandex Data Processing components. It can be either true or false.

    • --log-group-id: Log group ID.

    • --security-group-ids: List of security group IDs.

    To create a Yandex Data Processing cluster with multiple data storage or processing subclusters, provide the required number of --subcluster arguments in the cluster creation command:

    yc dataproc cluster create <cluster_name> \
      ...
      --subcluster <subcluster_parameters> \
      --subcluster <subcluster_parameters> \
      ...
    
  4. To enable autoscaling in Yandex Data Processing subclusters for data processing, specify the following parameters:

    yc dataproc cluster create <cluster_name> \
      ...
      --subcluster name=<subcluster_name>,`
                   `role=computenode,`
                   `...`
                   `hosts-count=<minimum_number_of_hosts>,`
                   `max-hosts-count=<maximum_number_of_hosts>,`
                   `preemptible=<use_preemptible_VMs>,`
                   `warmup-duration=<VM_warmup_period>,`
                   `stabilization-duration=<stabilization_period>,`
                   `measurement-duration=<load_measurement_interval>,`
                   `cpu-utilization-target=<target_CPU_utilization_level>,`
                   `autoscaling-decommission-timeout=<decommissioning_timeout>
    

    Where:

    • hosts-count: Minimum number of hosts (VMs) per Yandex Data Processing subcluster. The minimum value is 1, and the maximum value is 32.
    • max-hosts-count: Maximum number of hosts (VMs) per Yandex Data Processing subcluster. The minimum value is 1, and the maximum value is 100.
    • preemptible: Indicates if preemptible VMs are used. It can either be true or false.
    • warmup-duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s, and the maximum value is 600s.
    • stabilization-duration: Period, in seconds, during which the required number of VMs cannot be decreased, in <value>s format. The minimum value is 60s, and the maximum value is 1800s.
    • measurement-duration: Period, in seconds, for which average utilization is calculated for each VM, in <value>s format. The minimum value is 60s (1 minute), and the maximum value is 600s (10 minutes).
    • cpu-utilization-target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used for scaling based on the number of pending resources. The minimum value is 10, and the maximum value is 100.
    • autoscaling-decommission-timeout: Decommissioning timeout in seconds. The minimum value is 0, and the maximum value is 86400 (24 hours).
  5. To create a Yandex Data Processing cluster based on dedicated host groups, specify their IDs as a comma-separated list in the --host-group-ids parameter:

    yc dataproc cluster create <cluster_name> \
      ...
      --host-group-ids=<dedicated_host_group_IDs>
    

    Alert

    You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.

  6. To configure Yandex Data Processing cluster hosts using initialization scripts, specify them in one or multiple --initialization-action parameters:

    yc dataproc cluster create <cluster_name> \
      ...
      --initialization-action uri=<initialization_script_URI>,`
                              `timeout=<script_timeout>,`
                              `args=["arg1","arg2","arg3",...]
    

    Where:

    • uri: Link to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
    • Optionally, timeout: Script timeout, in seconds. If your initialization script runs longer than this time period, it will be terminated.
    • Optionally, args: Arguments for the initialization script, provided as a comma-separated list.
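Putting the steps above together, a complete invocation for a lightweight cluster (no HDFS storage subcluster) might look like this; all names, the availability zone, and the sizes are illustrative only:

  yc dataproc cluster create my-dataproc \
    --environment=production \
    --bucket=my-dataproc-bucket \
    --zone=ru-central1-a \
    --service-account-name=dataproc-sa \
    --version=2.0 \
    --services=yarn,spark \
    --ssh-public-keys-file=~/.ssh/id_ed25519.pub \
    --subcluster name=master,role=masternode,resource-preset=s2.micro,disk-type=network-ssd,disk-size=40,subnet-name=my-subnet,assign-public-ip=false \
    --subcluster name=compute,role=computenode,resource-preset=s2.micro,disk-type=network-hdd,disk-size=128,subnet-name=my-subnet,hosts-count=2,assign-public-ip=false \
    --security-group-ids=<security_group_ID>

Because the image version is 2.0 (that is, 2.0.39 or higher) and a bucket is specified, the cluster can run without a data storage subcluster.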

With Terraform, you can quickly create a cloud infrastructure in Yandex Cloud and manage it using configuration files. These files store the infrastructure description written in HashiCorp Configuration Language (HCL). If you change the configuration files, Terraform automatically detects which part of your configuration is already deployed, and what should be added or removed.

Terraform is distributed under the Business Source License. The Yandex Cloud provider for Terraform is distributed under the MPL-2.0 license.

For more information about the provider resources, see the relevant documentation on the Terraform website or its mirror.

To create a Yandex Data Processing cluster:

  1. Using the command line, navigate to the directory that will contain the Terraform configuration files with an infrastructure plan. Create the directory if it does not exist.

  2. If you do not have Terraform yet, install it and configure the Yandex Cloud provider.

  3. Create a configuration file with a description of your cloud network, subnets, security group, and NAT gateway.

    The Yandex Data Processing cluster resides in a cloud network. If you already have a suitable network, you do not need to describe it again.

    The Yandex Data Processing cluster hosts reside in subnets of the selected cloud network. If you already have suitable subnets, you do not need to describe them again.

    Below is an example of a configuration file describing a single-subnet cloud network, security group, NAT gateway, and route table:

    resource "yandex_vpc_network" "test_network" {
      name = "<network_name>"
    }
    
    resource "yandex_vpc_subnet" "test_subnet" {
      name           = "<subnet_name>"
      zone           = "<availability_zone>"
      network_id     = yandex_vpc_network.test_network.id
      v4_cidr_blocks = ["<subnet>"]
      route_table_id = yandex_vpc_route_table.data-processing-rt.id
    }
    
    resource "yandex_vpc_gateway" "data-processing-gateway" {
      name = "data-processing-gateway"
      shared_egress_gateway {}
    }
    
    resource "yandex_vpc_route_table" "data-processing-rt" {
      network_id = "${yandex_vpc_network.test_network.id}"
    
      static_route {
        destination_prefix = "0.0.0.0/0"
        gateway_id         = "${yandex_vpc_gateway.data-processing-gateway.id}"
      }
    }
    
    resource "yandex_vpc_security_group" "data-processing-sg" {
      description = "Security group for DataProc"
      name        = "data-processing-security-group"
      network_id  = yandex_vpc_network.test_network.id
    
      egress {
        description    = "Allow outgoing HTTPS traffic"
        protocol       = "TCP"
        port           = 443
        v4_cidr_blocks = ["0.0.0.0/0"]
      }
    
      ingress {
        description       = "Allow any incoming traffic within the security group"
        protocol          = "ANY"
        from_port         = 0
        to_port           = 65535
        predefined_target = "self_security_group"
      }
    
      egress {
        description       = "Allow any outgoing traffic within the security group"
        protocol          = "ANY"
        from_port         = 0
        to_port           = 65535
        predefined_target = "self_security_group"
      }
    
      egress {
        description    = "Allow outgoing traffic to NTP servers for time synchronization"
        protocol       = "UDP"
        port           = 123
        v4_cidr_blocks = ["0.0.0.0/0"]
      }
    }
    
  4. Create a configuration file describing the following resources:

    • Service account that needs access to the Yandex Data Processing cluster and Object Storage bucket.
    • Service account for creating the Object Storage bucket.
    • Static key.
    • Object Storage bucket to store job execution results in.
    resource "yandex_iam_service_account" "data_proc_sa" {
      name        = "<service_account_name>"
      description = "<service_account_description>"
    }
    
    resource "yandex_resourcemanager_folder_iam_member" "dataproc-agent" {
      folder_id = "<folder_ID>"
      role      = "dataproc.agent"
      member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
    }
    
    resource "yandex_resourcemanager_folder_iam_member" "dataproc-provisioner" {
      folder_id = "<folder_ID>"
      role      = "dataproc.provisioner"
      member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
    }
    
    resource "yandex_iam_service_account" "bucket_sa" {
      name        = "<service_account_name>"
      description = "<service_account_description>"
    }
    
    resource "yandex_resourcemanager_folder_iam_member" "storage-admin" {
      folder_id = "<folder_ID>"
      role      = "storage.admin"
      member    = "serviceAccount:${yandex_iam_service_account.bucket_sa.id}"
    }
    
    resource "yandex_iam_service_account_static_access_key" "bucket_sa_static_key" {
      service_account_id = yandex_iam_service_account.bucket_sa.id
    }
    
    resource "yandex_storage_bucket" "data_bucket" {
      depends_on = [
        yandex_resourcemanager_folder_iam_member.storage-admin
      ]
    
      grant {
        id          = yandex_iam_service_account.data_proc_sa.id
        type        = "CanonicalUser"
        permissions = ["READ","WRITE"]
      }
    
      bucket     = "<bucket_name>"
      access_key = yandex_iam_service_account_static_access_key.bucket_sa_static_key.access_key
      secret_key = yandex_iam_service_account_static_access_key.bucket_sa_static_key.secret_key
    }
    
  5. Create a configuration file describing the Yandex Data Processing cluster and its subclusters.

    If required, here you can also specify the properties of the Yandex Data Processing cluster components, jobs, and the environment.

    Below is an example of a configuration file structure that describes a Yandex Data Processing cluster consisting of a subcluster with a master host, a data storage subcluster, and a data processing subcluster:

    resource "yandex_dataproc_cluster" "data_cluster" {
      bucket              = "${yandex_storage_bucket.data_bucket.bucket}"
      name                = "<cluster_name>"
      description         = "<cluster_description>"
      environment         = "<cluster_environment>"
      service_account_id  = yandex_iam_service_account.data_proc_sa.id
      zone_id             = "<availability_zone>"
      security_group_ids  = [yandex_vpc_security_group.data-processing-sg.id]
      deletion_protection = <cluster_deletion_protection>
      depends_on = [
        yandex_resourcemanager_folder_iam_member.dataproc-provisioner,
        yandex_resourcemanager_folder_iam_member.dataproc-agent
      ]
    
      cluster_config {
        version_id = "<image_version>"
    
        hadoop {
          services   = ["<list_of_components>"]
          # Sample list: ["HDFS", "YARN", "SPARK", "TEZ", "MAPREDUCE", "HIVE"].
          properties = {
            "<component_property>" = <value>
            ...
          }
          ssh_public_keys = [
            file("<path_to_public_SSH_key>")
          ]
        }
    
        subcluster_spec {
          name = "<name_of_subcluster_with_master_host>"
          role = "MASTERNODE"
          resources {
            resource_preset_id = "<host_class>"
            disk_type_id       = "<storage_type>"
            disk_size          = <storage_size_in_GB>
          }
          subnet_id   = yandex_vpc_subnet.test_subnet.id
          hosts_count = 1
        }
    
        subcluster_spec {
          name = "<name_of_subcluster_for_storing_data>"
          role = "DATANODE"
          resources {
            resource_preset_id = "<host_class>"
            disk_type_id       = "<storage_type>"
            disk_size          = <storage_size_in_GB>
          }
          subnet_id   = yandex_vpc_subnet.test_subnet.id
          hosts_count = <number_of_hosts_in_subcluster>
        }
    
        subcluster_spec {
          name = "<data_processing_subcluster_name>"
          role = "COMPUTENODE"
          resources {
            resource_preset_id = "<host_class>"
            disk_type_id       = "<storage_type>"
            disk_size          = <storage_size_in_GB>
          }
          subnet_id   = yandex_vpc_subnet.test_subnet.id
          hosts_count = <number_of_hosts_in_subcluster>
        }
      }
    }
    

    Where deletion_protection means deletion protection for the Yandex Data Processing cluster. It can either be true or false.

    Even with cluster deletion protection enabled, you can still connect to the Yandex Data Processing cluster manually and delete the data.

    You can use an image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.

    Tip

    To use the most recent image version, specify 2.0 in the version_id parameter.

    To connect to a cluster via OS Login, replace ssh_public_keys with oslogin set to true.
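    For example, the hadoop section from the configuration above would then take the following form (a sketch; all other settings stay unchanged):

    hadoop {
      services   = ["<list_of_components>"]
      properties = {
        "<component_property>" = <value>
      }
      oslogin = true
    }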

    To access the web interfaces of the Yandex Data Processing components, add the ui_proxy field set to true to the Yandex Data Processing cluster description:

    resource "yandex_dataproc_cluster" "data_cluster" {
      ...
      ui_proxy = true
      ...
    }
    

    To configure the autoscaling settings in Yandex Data Processing subclusters for data processing, add the autoscaling_config section with the settings you need to the subcluster_spec description of the appropriate subcluster:

    subcluster_spec {
      name = "<subcluster_name>"
      role = "COMPUTENODE"
      ...
      autoscaling_config {
        max_hosts_count        = <maximum_number_of_VMs_in_group>
        measurement_duration   = <load_measurement_interval>
        warmup_duration        = <warmup_period>
        stabilization_duration = <stabilization_period>
        preemptible            = <use_preemptible_VMs>
        cpu_utilization_target = <target_CPU_utilization_level>
        decommission_timeout   = <decommissioning_timeout>
      }
    }
    

    Where:

    • max_hosts_count: Maximum number of hosts (VMs) per Yandex Data Processing subcluster. The minimum value is 1, and the maximum value is 100.
    • measurement_duration: Period, in seconds, for which the average utilization is calculated for each VM, in <value>s format. The minimum value is 60s (1 minute), and the maximum value is 600s (10 minutes).
    • warmup_duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s, and the maximum value is 600s.
    • stabilization_duration: Period, in seconds, during which the required number of VMs cannot be decreased, in <value>s format. The minimum value is 60s, and the maximum value is 1800s.
    • preemptible: Indicates if preemptible VMs are used. It can either be true or false.
    • cpu_utilization_target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used for scaling based on the number of pending resources. The minimum value is 10, and the maximum value is 100.
    • decommission_timeout: Decommissioning timeout in seconds. The minimum value is 0, and the maximum value is 86400 (24 hours).
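    For illustration, a compute subcluster with autoscaling enabled might be described as in the sketch below. The subcluster name, resource preset, and thresholds are example values, not recommendations; the duration fields follow the <value>s format described above, and the subnet ID is a placeholder:

    ```hcl
    subcluster_spec {
      name = "compute-autoscaling"   # example subcluster name
      role = "COMPUTENODE"
      resources {
        resource_preset_id = "s2.micro"
        disk_type_id       = "network-ssd"
        disk_size          = 20
      }
      subnet_id   = "<subnet_ID>"
      hosts_count = 1

      autoscaling_config {
        max_hosts_count        = 10     # scale out to at most 10 VMs
        measurement_duration   = "60s"  # average the load over 1 minute
        warmup_duration        = "60s"  # ignore metrics while a VM warms up
        stabilization_duration = "120s" # do not scale in for 2 minutes
        preemptible            = false  # use regular (non-preemptible) VMs
        cpu_utilization_target = 75     # scale on CPU utilization, target 75%
        decommission_timeout   = 120    # give YARN 120 seconds to drain a host
      }
    }
    ```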

    For more information about resources you can create with Terraform, see the provider documentation.

  6. Make sure the Terraform configuration files are correct:

    1. In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.

    2. Run this command:

      terraform validate
      

      Terraform will show any errors found in your configuration files.

  7. Create a Yandex Data Processing cluster:

    1. Run this command to view the planned changes:

      terraform plan
      

      If you described the configuration correctly, the terminal will display a list of the resources to be created or changed along with their parameters. This is a verification step that does not apply changes to your resources.

    2. If everything looks correct, apply the changes:

      1. Run this command:

        terraform apply
        
      2. Confirm updating the resources.

      3. Wait for the operation to complete.

    All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.

To create a Yandex Data Processing cluster, use the create API method and include the following in the request:

  • ID of the folder to host the Yandex Data Processing cluster, in the folderId parameter.

  • Yandex Data Processing cluster name in the name parameter.

  • Cluster environment in the environment parameter: PRESTABLE or PRODUCTION.

  • Yandex Data Processing cluster configuration in the configSpec parameter, including:

    • Image version in the configSpec.versionId parameter.

      You can use an image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.

      Tip

      To use the most recent image version, specify 2.0.

    • List of components in the configSpec.hadoop.services parameter.

    • Public part of the SSH key in the configSpec.hadoop.sshPublicKeys parameter.

      Note

      To connect to a cluster via OS Login, provide configSpec.hadoop.osloginEnabled: true in the request instead of configSpec.hadoop.sshPublicKeys.

    • Settings of the Yandex Data Processing subclusters in the configSpec.subclustersSpec parameter.

  • Availability zone of the Yandex Data Processing cluster in the zoneId parameter.

  • ID of the Yandex Data Processing cluster’s service account in the serviceAccountId parameter.

  • Bucket name in the bucket parameter.

  • IDs of the Yandex Data Processing cluster’s security groups in the securityGroupIds parameter.

  • Yandex Data Processing cluster deletion protection settings in the deletionProtection parameter.

    Even with deletion protection enabled, you can still connect to the cluster manually and delete its data.

To assign a public IP address to all hosts of a Yandex Data Processing subcluster, provide the true value in the configSpec.subclustersSpec.assignPublicIp parameter.

To create a Yandex Data Processing cluster based on dedicated host groups, provide a list of their IDs in the hostGroupIds parameter.

Alert

You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.

To configure Yandex Data Processing cluster hosts using initialization scripts, specify them in one or multiple configSpec.hadoop.initializationActions parameters.

After the Yandex Data Processing cluster status switches to Running, you can connect to the Yandex Data Processing subcluster hosts.
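Putting the parameters above together, the body of a create request might look like the following sketch. The field names mirror the parameter paths listed above; all IDs, the availability zone, and the key value are placeholders, diskSize is given in bytes (20 GB here), and the exact schema should be checked against the API reference. Note that security group IDs belong in securityGroupIds, while hostGroupIds is reserved for dedicated host groups:

```json
{
  "folderId": "<folder_ID>",
  "name": "my-dataproc",
  "environment": "PRODUCTION",
  "configSpec": {
    "versionId": "2.0",
    "hadoop": {
      "services": ["SPARK", "YARN"],
      "sshPublicKeys": ["<public_key>"]
    },
    "subclustersSpec": [
      {
        "name": "master",
        "role": "MASTERNODE",
        "resources": {
          "resourcePresetId": "s2.micro",
          "diskTypeId": "network-ssd",
          "diskSize": "21474836480"
        },
        "subnetId": "<subnet_ID>",
        "hostsCount": "1",
        "assignPublicIp": true
      },
      {
        "name": "compute",
        "role": "COMPUTENODE",
        "resources": {
          "resourcePresetId": "s2.micro",
          "diskTypeId": "network-ssd",
          "diskSize": "21474836480"
        },
        "subnetId": "<subnet_ID>",
        "hostsCount": "1"
      }
    ]
  },
  "zoneId": "<availability_zone>",
  "serviceAccountId": "<service_account_ID>",
  "bucket": "dataproc-bucket",
  "securityGroupIds": ["<security_group_ID>"],
  "deletionProtection": true
}
```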

Create a Yandex Data Processing cluster copy

You can create a Yandex Data Processing cluster with the same settings as one you created earlier. To do so, import the source Yandex Data Processing cluster’s configuration into Terraform. You can then either create an identical copy or use the imported configuration as a baseline and modify it as needed. Importing is convenient when the source Yandex Data Processing cluster has many settings (for example, an HDFS cluster) and you need to create a similar one.

To create a Yandex Data Processing cluster copy:

Terraform
  1. If you do not have Terraform yet, install it.

  2. Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.

  3. Configure and initialize a provider. There is no need to create a provider configuration file manually; you can download it.

  4. Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.

  5. In the same working directory, place a .tf file with the following contents:

    resource "yandex_dataproc_cluster" "old" { }
    
  6. Write the initial Yandex Data Processing cluster’s ID to an environment variable:

    export DATAPROC_CLUSTER_ID=<cluster_ID>
    

    You can get the ID with the list of clusters in the folder.

  7. Import the initial Yandex Data Processing cluster’s settings into the Terraform configuration:

    terraform import yandex_dataproc_cluster.old ${DATAPROC_CLUSTER_ID}
    
  8. Get the imported configuration:

    terraform show
    
  9. Copy it from the terminal and paste it into the .tf file.

  10. Place the file in the new imported-cluster directory.

  11. Edit the configuration you copied so that you can create a new Yandex Data Processing cluster from it:

    • Specify the new Yandex Data Processing cluster name in the resource string and the name parameter.

    • Delete the created_at, host_group_ids, id, and subcluster_spec.id parameters.

    • If you are connecting to the cluster using an SSH key, change its format in the ssh_public_keys parameter. Initial format:

      ssh_public_keys = [
        <<-EOT
          <key>
        EOT,
      ]
      

      Required format:

      ssh_public_keys = [
        "<key>"
      ]
      
    • Optionally, make further changes if you need to customize the configuration.
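    After these edits, the top of the new resource description might look like the following sketch. The resource name, cluster name, and placeholder values are illustrative, and the imported blocks you keep are abbreviated for brevity:

    ```hcl
    resource "yandex_dataproc_cluster" "new" {     # new resource name
      name               = "new-dataproc-cluster"  # new cluster name
      bucket             = "dataproc-bucket"
      service_account_id = "<service_account_ID>"
      zone_id            = "<availability_zone>"

      cluster_config {
        hadoop {
          ssh_public_keys = [
            "<key>"   # single-line string instead of the imported heredoc
          ]
          # ...other imported hadoop settings...
        }
        # ...imported subcluster_spec blocks without their id parameters...
      }
    }
    ```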

  12. Get the authentication credentials in the imported-cluster directory.

  13. In the same directory, configure and initialize a provider. To avoid creating a configuration file with provider settings manually, download it.

  14. Place the configuration file in the imported-cluster directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.

  15. Make sure the Terraform configuration files are correct using this command:

    terraform validate
    

    Terraform will show any errors found in your configuration files.

  16. Create the required infrastructure:

    1. Run this command to view the planned changes:

      terraform plan
      

      If you described the configuration correctly, the terminal will display a list of the resources to be created or changed along with their parameters. This is a verification step that does not apply changes to your resources.

    2. If everything looks correct, apply the changes:

      1. Run this command:

        terraform apply
        
      2. Confirm updating the resources.

      3. Wait for the operation to complete.

    All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.

Example

Creating a lightweight Yandex Data Processing cluster for Spark and PySpark jobs

CLI

Create a Yandex Data Processing cluster to run Spark jobs without HDFS and data storage subclusters with the following test settings:

  • Name: my-dataproc.
  • Environment: production.
  • Bucket: dataproc-bucket.
  • Availability zone: ru-central1-c.
  • Service account: dataproc-sa.
  • Image version: 2.0.
  • Components: SPARK and YARN.
  • Path to the public part of the SSH key: /home/username/.ssh/id_rsa.pub.
  • A Yandex Data Processing subcluster with master hosts and a single compute subcluster for data processing, each with the following settings:
    • Class: s2.micro.
    • Network SSD storage (network-ssd): 20 GB.
    • Subnet: default-ru-central1-c.
    • Public access: Allowed.
  • Security group: enp6saqnq4ie244g67sb.
  • Protection against accidental deletion of the Yandex Data Processing cluster: Enabled.

Run this command:

yc dataproc cluster create my-dataproc \
   --environment=production \
   --bucket=dataproc-bucket \
   --zone=ru-central1-c \
   --service-account-name=dataproc-sa \
   --version=2.0 \
   --services=SPARK,YARN \
   --ssh-public-keys-file=/home/username/.ssh/id_rsa.pub \
   --subcluster name="master",`
                `role=masternode,`
                `resource-preset=s2.micro,`
                `disk-type=network-ssd,`
                `disk-size=20,`
                `subnet-name=default-ru-central1-c,`
                `assign-public-ip=true \
   --subcluster name="compute",`
                `role=computenode,`
                `resource-preset=s2.micro,`
                `disk-type=network-ssd,`
                `disk-size=20,`
                `subnet-name=default-ru-central1-c,`
                `assign-public-ip=true \
   --security-group-ids=enp6saqnq4ie244g67sb \
   --deletion-protection
