Yandex Managed Service for Apache Spark™

Creating an Apache Spark™ cluster

Written by Yandex Cloud
Updated on October 20, 2025

Each Apache Spark™ cluster contains computing resources to run Spark applications.

Roles for creating a cluster

To create an Apache Spark™ cluster, your Yandex Cloud account needs the following roles:

  • managed-spark.admin: To create a cluster.
  • vpc.user: To use the cluster network.
  • iam.serviceAccounts.user: To assign a service account to a cluster.

Make sure to assign the managed-spark.integrationProvider and storage.editor roles to the cluster service account. The cluster will thus get the permissions it needs to work with user resources.

For more information about assigning roles, see the Yandex Identity and Access Management documentation.
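
If you manage access with Terraform, the role assignments could look like the sketch below. This is a minimal sketch, assuming the roles are granted at the folder level with the yandex_resourcemanager_folder_iam_member resource; all IDs are placeholders.

  # Role your account needs to create clusters. The vpc.user and
  # iam.serviceAccounts.user roles can be granted the same way if missing.
  resource "yandex_resourcemanager_folder_iam_member" "spark_admin" {
    folder_id = "<folder_ID>"
    role      = "managed-spark.admin"
    member    = "userAccount:<your_user_account_ID>"
  }

  # Roles the cluster service account needs to work with user resources.
  resource "yandex_resourcemanager_folder_iam_member" "spark_integration_provider" {
    folder_id = "<folder_ID>"
    role      = "managed-spark.integrationProvider"
    member    = "serviceAccount:<service_account_ID>"
  }

  resource "yandex_resourcemanager_folder_iam_member" "storage_editor" {
    folder_id = "<folder_ID>"
    role      = "storage.editor"
    member    = "serviceAccount:<service_account_ID>"
  }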

Creating a cluster

You can create a cluster using the management console, Terraform, or the gRPC API.

Management console
  1. In the management console, select the folder where you want to create an Apache Spark™ cluster.

  2. Select Managed Service for Apache Spark™.

  3. Click Create cluster.

  4. Under Basic parameters:

    1. Give the cluster a name. The name must be unique within the folder.

    2. Optionally, enter a description for the cluster.

    3. Optionally, create labels:

      1. Click Add label.
      2. Enter a label in key: value format.
      3. Press Enter.
    4. Select an existing service account or create a new one.

      Make sure to assign the managed-spark.integrationProvider role to this service account.

  5. Under Network settings, select a network, subnet, and security group for the cluster.

  6. Specify the computing resources to run Spark applications on:

    • Driver configuration: Host class and number of driver hosts. The number can be either fixed or autoscaled.
    • Executor configuration: Host class and number of executor hosts. The number can be either fixed or autoscaled.
  7. Specify additional cluster settings, if required:

    1. Pip packages and Deb packages: Pip and deb package names for installing additional libraries and applications.

      To specify multiple packages, click Add.

      The package name format and version are defined by the install command: pip install for pip packages and apt install for deb packages.

    2. Maintenance window: Maintenance window settings:

      • To enable maintenance at any time, select arbitrary (default).
      • To specify the preferred maintenance start time, select by schedule and specify the desired day of the week and UTC hour. For example, you can choose a time when the cluster is least loaded.

      Maintenance operations are carried out on both enabled and disabled clusters. They may include updating the cluster software, applying patches, and so on.

    3. Metastore: Metastore server connected to your cluster.

    4. Deletion protection: Manages cluster protection against accidental deletion.

    5. Enable the History Server setting to be able to monitor Spark applications with the Spark History Server service. After the cluster is created, the service will be available via a link.

    6. Configure logging:

      1. Enable the Write logs setting.
      2. Select the log destination:
        • Folder: Select a folder from the list.
        • Group: Select a log group from the list or create a new one.
      3. Select Min. logging level from the list.
  8. Click Create.

Terraform

With Terraform, you can quickly create a cloud infrastructure in Yandex Cloud and manage it using configuration files. These files store the infrastructure description written in HashiCorp Configuration Language (HCL). If you change the configuration files, Terraform automatically detects which part of your configuration is already deployed and what should be added or removed.

Terraform is distributed under the Business Source License. The Yandex Cloud provider for Terraform is distributed under the MPL-2.0 license.

For more information about the provider resources, see the relevant documentation on the Terraform website or its mirror.

If you do not have Terraform yet, install it and configure the Yandex Cloud provider.
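
As a minimal sketch of the provider setup (authentication details are omitted and depend on your environment), the configuration might look like this:

  terraform {
    required_providers {
      yandex = {
        source = "yandex-cloud/yandex"
      }
    }
    required_version = ">= 0.13"
  }

  provider "yandex" {
    # Authentication (OAuth/IAM token or a service account key file) is not shown;
    # see the provider documentation for the supported options.
    folder_id = "<folder_ID>"
    zone      = "<availability_zone>"
  }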

To create a Yandex Managed Service for Apache Spark™ cluster:

  1. In the configuration file, describe the resources you are creating:

    • Yandex Managed Service for Apache Spark™ cluster: Cluster description.

    • Network: Description of the cloud network where a cluster will be located. If you already have a suitable network, you don't have to describe it again.

    • Subnets: Description of the subnets to connect the cluster hosts to. If you already have suitable subnets, you don't have to describe them again.

    Here is an example of the configuration file structure:

    resource "yandex_spark_cluster" "<cluster_name>" {
      description         = "<cluster_description>"
      name                = "<cluster_name>"
      folder_id           = "<folder_ID>"
      service_account_id  = "<service_account_ID>"
      deletion_protection = <protect_cluster_from_deletion>
    
      labels = {
        <label_list>
      }
    
      network = {
        subnet_ids         = ["<list_of_subnet_IDs>"]
        security_group_ids = ["<list_of_security_group_IDs>"]
      }
    
      config = {
        resource_pools = {
          driver = {
            resource_preset_id = "<host_class>"
            size               = <fixed_number_of_instances>
          }
          executor = {
            resource_preset_id = "<host_class>"
            size               = <fixed_number_of_instances>
          }
        }
      }
    
      logging = {
        enabled      = <enable_logging>
        folder_id    = "<folder_ID>"
      }
    
    }
    
    resource "yandex_vpc_network" "<network_name>" {
      name = "<network_name>"
    }
    
    resource "yandex_vpc_subnet" "<subnet_name>" {
      name           = "<subnet_name>"
      zone           = "<availability_zone>"
      network_id     = "<network_ID>"
      v4_cidr_blocks = ["<range>"]
    }
    

    Where:

    • description: Cluster description. This is an optional parameter.

    • name: Cluster name.

    • folder_id: Folder ID. This is an optional parameter. If the value is missing, the cluster will reside in the folder specified in the provider settings.

    • service_account_id: Service account ID.

    • deletion_protection: Cluster protection from accidental deletion, true or false. This is an optional parameter.

    • labels: List of labels. This is an optional parameter. Provide labels in <key> = "<value>" format.

    • subnet_ids: List of subnet IDs.

    • security_group_ids: List of security group IDs.

    • driver: Host configuration to run Apache Spark™ drivers. In this section, specify:

      • Host class in the resource_preset_id parameter.
      • Number of instances. Specify a fixed number in the size parameter or the minimum and maximum number for autoscaling in the min_size and max_size parameters.
    • executor: Host configuration to run Apache Spark™ executors. In this section, specify:

      • Host class in the resource_preset_id parameter.
      • Number of instances. Specify a fixed number in the size parameter or the minimum and maximum number for autoscaling in the min_size and max_size parameters (see the sketch after this list).
    • logging: Logging parameters. Logs generated by Apache Spark™ components will be sent to Yandex Cloud Logging. To enable logging:

      • Set the enabled = true value.

      • Specify one of two log storage locations:

        • folder_id: Folder ID. Logs will be written to the default log group for this folder.
        • log_group_id: Custom log group ID. Logs will be written to this group.
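
    The skeleton above uses fixed pool sizes. As a sketch of the autoscaling variant described for the driver and executor sections (the exact attribute layout is an assumption; check the provider schema), the size argument is replaced with min_size and max_size:

    config = {
      resource_pools = {
        driver = {
          resource_preset_id = "<host_class>"
          size               = <fixed_number_of_instances>
        }
        executor = {
          resource_preset_id = "<host_class>"
          # Autoscaling instead of a fixed size
          min_size = <minimum_number_of_instances>
          max_size = <maximum_number_of_instances>
        }
      }
    }
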
  2. If necessary, configure additional cluster settings:

    • To set up the maintenance window (for disabled clusters as well), add the maintenance_window section to the cluster description:

      resource "yandex_spark_cluster" "<cluster_name>" {
        ...
        maintenance_window {
          type = <maintenance_type>
          day  = <day_of_week>
          hour = <hour>
        }
        ...
      }
      

      Where:

      • type: Maintenance type. The possible values include:
        • ANYTIME: Anytime
        • WEEKLY: On a schedule
      • day: Day of week for the WEEKLY type, i.e., MON, TUE, WED, THU, FRI, SAT, or SUN.
      • hour: UTC hour for the WEEKLY type, from 1 to 24.
    • To enable Apache Spark™ History Server, add a section named history_server to the cluster configuration description:

      resource "yandex_spark_cluster" "<cluster_name>" {
      ...
        config = {
        ...
          history_server = {
            enabled = true
          }
        }
      }
      
    • To connect an Apache Hive™ Metastore server to your cluster, add the metastore section to the cluster configuration description:

      resource "yandex_spark_cluster" "<cluster_name>" {
      ...
        config = {
        ...
          metastore = {
            cluster_id = "<metastore_cluster_ID>"
          }
        }
      }
      
    • To connect additional deb and pip packages for running Apache Spark™ jobs, add the dependencies section to the cluster configuration description:

      resource "yandex_spark_cluster" "<cluster_name>" {
      ...
        config = {
        ...
          dependencies = {
            deb_packages = ["<list_of_deb_packages>"]
            pip_packages = ["<list_of_pip_packages>"]
          }
        }
      }
      

      Where deb_packages and pip_packages are package names. Their format depends on the installation command: apt install for deb packages and pip install for pip packages.
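
      For illustration only, a filled-in dependencies section with hypothetical package names could look like this:

      resource "yandex_spark_cluster" "<cluster_name>" {
      ...
        config = {
        ...
          dependencies = {
            # Hypothetical package names, for illustration only
            deb_packages = ["jq"]
            pip_packages = ["pandas==2.0.2", "requests"]
          }
        }
      }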

  3. Validate your configuration.

    1. In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.

    2. Run this command:

      terraform validate
      

      Terraform will show any errors found in your configuration files.

  4. Create a Yandex Managed Service for Apache Spark™ cluster.

    1. Run this command to view the planned changes:

      terraform plan
      

      If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.

    2. If everything looks correct, apply the changes:

      1. Run this command:

        terraform apply
        
      2. Confirm updating the resources.

      3. Wait for the operation to complete.

    This will create all the resources you need in the specified folder. You can check the new resources and their configuration using the management console.

For more information, see this Terraform provider article.

gRPC API

  1. Get an IAM token for API authentication and save it as an environment variable:

    export IAM_TOKEN="<IAM_token>"
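    # If the YC CLI is installed and configured (an assumption about your setup,
    # not the only method), you can generate the token with:
    # export IAM_TOKEN=$(yc iam create-token)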
    
  2. Clone the cloudapi repository:

    cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
    

    Below, we assume the repository contents are stored in the ~/cloudapi/ directory.

  3. Create a file named body.json and paste the following code into it:

    {
      "folder_id": "<folder_ID>",
      "name": "<cluster_name>",
      "description": "<cluster_description>",
      "labels": { <label_list> },
      "config": {
        "resource_pools": {
          "driver": {
            "resource_preset_id": "<resource_ID>",
            "scale_policy": {
              "fixed_scale": {
                "size": "<number_of_instances>"
              }
            }
          },
          "executor": {
            "resource_preset_id": "<resource_ID>",
            "scale_policy": {
              "auto_scale": {
                "min_size": "<minimum_number_of_instances>",
                "max_size": "<maximum_number_of_instances>"
              }
            }
          }
        },
        "history_server": {
          "enabled": <use_of_Apache_Spark_History_Server>
        },
        "dependencies": {
          "pip_packages": [ <list_of_pip_packages> ],
          "deb_packages": [ <list_of_deb_packages> ]
        },
        "metastore": {
          "cluster_id": "<cluster_ID>"
        }
      },
      "network": {
        "subnet_ids": [ <list_of_subnet_IDs> ],
        "security_group_ids": [ <list_of_security_group_IDs> ]
      },
      "deletion_protection": <deletion_protection>,
      "service_account_id": "<service_account_ID>",
      "logging": {
        "enabled": <use_of_logging>,
        "folder_id": "<folder_ID>"
      }
    }
    

    Where:

    • folder_id: Folder ID. You can request it with the list of folders in the cloud.

    • name: Cluster name.

    • description: Cluster description.

    • labels: List of labels. Provide labels in "<key>": "<value>" format.

    • config: Cluster configuration:

      • resource_pools: Resource pool configuration:

        • driver: Host configuration to run Apache Spark™ drivers.

          • resource_preset_id: Driver host class.

          • scale_policy: Host group scaling policy for the driver:

            • fixed_scale: Fixed scaling policy.

              • size: Number of driver hosts.
            • auto_scale: Automatic scaling policy.

              • min_size: Minimum number of driver hosts.
              • max_size: Maximum number of driver hosts.

              Specify either fixed_scale or auto_scale.

        • executor: Host configuration to run Apache Spark™ executors.

          • resource_preset_id: Executor host class.

          • scale_policy: Host group scaling policy for the executor:

            • fixed_scale: Fixed scaling policy.

              • size: Number of executor hosts.
            • auto_scale: Automatic scaling policy.

              • min_size: Minimum number of executor hosts.
              • max_size: Maximum number of executor hosts.

              Specify either fixed_scale or auto_scale.

        • history_server: History server parameters.

          • enabled: Flag that enables the history server. It lets you monitor Spark applications with the Spark History Server service.
        • dependencies: Lists of packages enabling you to install additional libraries and applications on the cluster.

          • pip_packages: List of pip packages.
          • deb_packages: List of deb packages.

          The package name format and version are defined by the install command: pip install for pip packages and apt install for deb packages.

        • metastore: Metastore parameters.

          • cluster_id: Apache Hive™ Metastore cluster ID.
      • network: Network settings:

        • subnet_ids: List of subnet IDs.
        • security_group_ids: List of security group IDs.
      • deletion_protection: Enables cluster protection against accidental deletion. The possible values are true or false.

        Even with deletion protection enabled, you can still connect to the cluster manually and delete the data.

      • service_account_id: Service account ID.

      • logging: Logging parameters:

        • enabled: Enables logging. Logs generated by Spark applications will go to Yandex Cloud Logging. The possible values are: true or false.
        • folder_id: Folder ID. Logs will be written to the default log group for this folder.
        • log_group_id: Custom log group ID. Logs will be written to this group.

        Specify either folder_id or log_group_id.
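
    Once you have replaced the placeholders, you can optionally check that body.json is valid JSON, for example with jq (assuming it is installed):

      jq . body.json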

  4. Use the ClusterService/Create call and send the following request, e.g., via gRPCurl:

    grpcurl \
        -format json \
        -import-path ~/cloudapi/ \
        -import-path ~/cloudapi/third_party/googleapis/ \
        -proto ~/cloudapi/yandex/cloud/spark/v1/cluster_service.proto \
        -rpc-header "Authorization: Bearer $IAM_TOKEN" \
        -d @ \
        spark.api.cloud.yandex.net:443 \
        yandex.cloud.spark.v1.ClusterService.Create \
        < body.json
    
  5. View the server response to make sure your request was successful.
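
    The Create call returns an operation object; its metadata typically includes the ID of the new cluster. If you want to check the operation status later, one option is the generic operation service shipped in the same cloudapi repository (a sketch; <operation_ID> is the id field from the response):

    grpcurl \
        -format json \
        -import-path ~/cloudapi/ \
        -import-path ~/cloudapi/third_party/googleapis/ \
        -proto ~/cloudapi/yandex/cloud/operation/operation_service.proto \
        -rpc-header "Authorization: Bearer $IAM_TOKEN" \
        -d '{"operation_id": "<operation_ID>"}' \
        operation.api.cloud.yandex.net:443 \
        yandex.cloud.operation.OperationService.Get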
