Updating an Apache Spark™ cluster
After creating a cluster, you can edit its basic and advanced settings.
To change the cluster settings:
- Open the folder dashboard.
- Go to Managed Service for Apache Spark™.
- Select your cluster and click Edit in the top panel.
- Under Basic parameters:
  - Edit the cluster name and description.
  - Delete or add new labels.
  - Select a service account or create a new one with the `managed-spark.integrationProvider` role. The cluster will thus get the permissions it needs to work with other resources.
- Under Network settings, select a security group for cluster network traffic.
- Under Driver configuration and Executor configuration, specify the number of instances and the computing resource configuration. The number of instances can be either fixed or autoscaled.
- Under Advanced settings:
  - Delete or add names of pip and deb packages.

    The package name format and version are defined by the install command: `pip install` for pip packages and `apt install` for deb packages.
  - In the Maintenance window setting, update the cluster maintenance time:
    - To enable maintenance at any time, select arbitrary (default).
    - To specify the preferred maintenance start time, select by schedule and specify the desired day of the week and UTC hour. For example, you can choose a time when the cluster is least loaded.

    Maintenance operations are carried out on both enabled and disabled clusters. They may include updating the DBMS, applying patches, and so on.
  - Select an Apache Hive™ Metastore cluster to connect as a metadata storage.
  - Enable or disable cluster deletion protection.
  - Enable or disable History Server. This option allows using the Spark History Server service to monitor applications.
  - Enable or disable Write logs. This option enables logging of Spark applications in the cluster:
    - Select the log destination:
      - Folder: Select a folder from the list.
      - Group: Select a log group from the list or create a new one.
    - Select Min. logging level from the list.
- Click Save.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the `yc config set folder-id <folder_ID>` command. You can also set a different folder for any specific command using the `--folder-name` or `--folder-id` parameter.
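For example, a quick sketch of both options, assuming the CLI is already initialized and the usual `cluster list` subcommand is available for the service; the folder ID below is a made-up placeholder:

```bash
# Make a (placeholder) folder the default for all subsequent yc commands
yc config set folder-id b1gexamplefolderid11

# Or point a single command at a different folder without changing the default
yc managed-spark cluster list --folder-id b1gexamplefolderid11
```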
To change the cluster settings:
- See the description of the CLI command for updating a cluster:

  ```bash
  yc managed-spark cluster update --help
  ```

- Provide a list of settings to update in the update cluster command (a filled-in sample command is shown after the parameter descriptions below):

  ```bash
  yc managed-spark cluster update <cluster_name_or_ID> \
    --new-name <cluster_name> \
    --description <cluster_description> \
    --labels <label_list> \
    --service-account-id <service_account_ID> \
    --security-group-ids <list_of_security_group_IDs> \
    --driver-preset-id <driver_resource_ID> \
    --driver-fixed-size <number_of_driver_instances> \
    --executor-preset-id <executor_resource_ID> \
    --executor-fixed-size <number_of_executor_instances> \
    --history-server-enabled <use_Spark_History_Server> \
    --metastore-cluster-id <Apache_Hive™_Metastore_cluster_ID> \
    --pip-packages <list_of_pip_packages> \
    --deb-packages <list_of_deb_packages> \
    --log-enabled \
    --log-folder-id <folder_ID> \
    --maintenance-window type=<maintenance_type>,`
                         `day=<day_of_week>,`
                         `hour=<hour> \
    --deletion-protection
  ```

  Where:
  - `--new-name`: Cluster name, unique within the cloud.
  - `--description`: Cluster description.
  - `--labels`: List of labels. Provide labels in `<key>=<value>` format.
  - `--service-account-id`: ID of the service account for access to Yandex Cloud services. Make sure to assign the `managed-spark.integrationProvider` role to this service account.
  - `--security-group-ids`: List of security group IDs.
  - Host configuration to run Apache Spark™ drivers:
    - `--driver-preset-id`: Driver host class.
    - `--driver-fixed-size`: Fixed number of driver hosts.
    - `--driver-min-size`: Minimum number of driver hosts for autoscaling.
    - `--driver-max-size`: Maximum number of driver hosts for autoscaling.

    Specify either a fixed number of hosts (`--driver-fixed-size`) or the minimum and maximum number of hosts (`--driver-min-size` and `--driver-max-size`) for autoscaling.
  - Host configuration to run Apache Spark™ executors:
    - `--executor-preset-id`: Executor host class.
    - `--executor-fixed-size`: Fixed number of executor hosts.
    - `--executor-min-size`: Minimum number of executor hosts for autoscaling.
    - `--executor-max-size`: Maximum number of executor hosts for autoscaling.

    Specify either a fixed number of hosts (`--executor-fixed-size`) or the minimum and maximum number of hosts (`--executor-min-size` and `--executor-max-size`) for autoscaling.
  - `--history-server-enabled`: Enables the Spark History Server monitoring service.
  - `--metastore-cluster-id`: Apache Hive™ Metastore cluster ID. This setting connects the Apache Hive™ Metastore metadata storage.
  - Lists of packages enabling you to install additional libraries and applications in the cluster:
    - `--pip-packages`: List of pip packages.
    - `--deb-packages`: List of deb packages.

    You can set version restrictions for the installed packages, e.g.:

    ```bash
    --pip-packages pandas==2.1.1,scikit-learn>=1.0.0,clickhouse-driver~=0.2.0
    ```

    The package name format and version are defined by the install command: `pip install` for pip packages and `apt install` for deb packages.
  - Logging parameters:
    - `--log-enabled`: Enables logging.
    - `--log-folder-id`: Folder ID. Logs will be written to the default log group for this folder.
    - `--log-group-id`: Custom log group ID. Logs will be written to this group.

    Specify either `--log-folder-id` or `--log-group-id`.
  - `--maintenance-window`: Maintenance window settings (including for stopped clusters), where `type` is the maintenance type:
    - `anytime`: At any time (default).
    - `weekly`: On a schedule. For this value, also specify the following:
      - `day`: Day of week: `MON`, `TUE`, `WED`, `THU`, `FRI`, `SAT`, or `SUN`.
      - `hour`: Hour of day (UTC), from `1` to `24`.
  - `--deletion-protection`: Enables cluster protection against accidental deletion.

    Even with deletion protection on, one can still connect to the cluster manually and delete it.
You can get the cluster name and ID with the list of clusters in the folder.
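For illustration, here is a filled-in sketch of an update command using only the flags described above; the cluster name, description, and folder ID are placeholders, not real resources:

```bash
yc managed-spark cluster update my-spark-cluster \
  --new-name my-spark-cluster-prod \
  --description "Spark cluster for nightly ETL jobs" \
  --maintenance-window type=weekly,day=SAT,hour=2 \
  --log-enabled \
  --log-folder-id b1gexamplefolderid11 \
  --deletion-protection
```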
To change the cluster settings:
- Open the current Terraform configuration file describing your infrastructure.

  For more information about creating this file, see Creating clusters.
- To change cluster settings, change the required field values in the configuration file.

  Alert

  Do not change the cluster name using Terraform. This will delete the existing cluster and create a new one.

  Here is an example of the configuration file structure (a filled-in sketch follows the parameter descriptions below):

  ```hcl
  resource "yandex_spark_cluster" "my_spark_cluster" {
    description         = "<cluster_description>"
    name                = "my-spark-cluster"
    folder_id           = "<folder_ID>"
    service_account_id  = "<service_account_ID>"
    deletion_protection = <protect_cluster_from_deletion>

    labels = {
      <label_list>
    }

    network = {
      subnet_ids         = ["<list_of_subnet_IDs>"]
      security_group_ids = ["<list_of_security_group_IDs>"]
    }

    config = {
      resource_pools = {
        driver = {
          resource_preset_id = "<host_class>"
          size               = <fixed_number_of_instances>
        }
        executor = {
          resource_preset_id = "<host_class>"
          size               = <fixed_number_of_instances>
        }
      }
      history_server = {
        enabled = <use_Spark_History_Server>
      }
      metastore = {
        cluster_id = "<Apache_Hive™_Metastore_cluster_ID>"
      }
      dependencies = {
        deb_packages = ["<list_of_deb_packages>"]
        pip_packages = ["<list_of_pip_packages>"]
      }
    }

    maintenance_window = {
      type = "<maintenance_type>"
      day  = "<day_of_week>"
      hour = "<hour>"
    }

    logging = {
      enabled   = <enable_logging>
      folder_id = "<folder_ID>"
    }
  }
  ```

  Where:
  - `description`: Cluster description.
  - `service_account_id`: Service account ID.
  - `deletion_protection`: Cluster deletion protection, `true` or `false`.
  - `labels`: List of labels. Provide labels in `<key> = "<value>"` format.
  - `security_group_ids`: List of security group IDs.
  - `driver`: Host configuration to run Apache Spark™ drivers. In this section, specify:
    - Host class in the `resource_preset_id` parameter.
    - Number of instances. Specify a fixed number in the `size` parameter, or the minimum and maximum number for autoscaling in the `min_size` and `max_size` parameters.
  - `executor`: Host configuration to run Apache Spark™ executors. In this section, specify:
    - Host class in the `resource_preset_id` parameter.
    - Number of instances. Specify a fixed number in the `size` parameter, or the minimum and maximum number for autoscaling in the `min_size` and `max_size` parameters.
  - `maintenance_window`: Maintenance window settings (including for disabled clusters). In this section, specify:
    - Maintenance type in the `type` parameter. The possible values include:
      - `ANYTIME`: Any time.
      - `WEEKLY`: On a schedule.
    - Day of week for the `WEEKLY` maintenance type in the `day` parameter: `MON`, `TUE`, `WED`, `THU`, `FRI`, `SAT`, or `SUN`.
    - UTC hour from `1` to `24` for the `WEEKLY` maintenance type in the `hour` parameter.
  - `history_server`: Connecting the Apache Spark™ History Server. To use the service, set the `enabled` parameter to `true`.
  - `metastore`: Connecting an Apache Hive™ Metastore metadata storage. Specify an Apache Hive™ Metastore cluster ID in the `cluster_id` parameter.
  - `dependencies`: Additional deb and pip packages for running Apache Spark™ jobs. In this section, specify:
    - `deb_packages`: Names of deb packages. Their format depends on the `apt install` installation command.
    - `pip_packages`: Names of pip packages. Their format depends on the `pip install` installation command.
  - `logging`: Logging parameters. Logs generated by Apache Spark™ components will be sent to Yandex Cloud Logging. To enable logging:
    - Set `enabled = true`.
    - Specify one of two log storage locations:
      - `folder_id`: Folder ID. Logs will be written to the default log group for this folder.
      - `log_group_id`: Custom log group ID. Logs will be written to this group.
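  As an illustration, here is a hedged, filled-in sketch of the resource using only the fields described above; the folder, service account, subnet, and security group IDs are made-up placeholders, and `<host_class>` must be replaced with a host class available in your installation:

  ```hcl
  resource "yandex_spark_cluster" "my_spark_cluster" {
    description         = "Spark cluster for nightly ETL jobs"
    name                = "my-spark-cluster"        # keep the existing name: renaming recreates the cluster
    folder_id           = "b1gexamplefolderid11"    # placeholder folder ID
    service_account_id  = "ajeexampleserviceacct1"  # placeholder service account with managed-spark.integrationProvider
    deletion_protection = true

    labels = {
      env = "production"
    }

    network = {
      subnet_ids         = ["e9bexamplesubnetid111"]  # placeholder subnet ID
      security_group_ids = ["enpexamplesecgroupid1"]  # placeholder security group ID
    }

    config = {
      resource_pools = {
        driver = {
          resource_preset_id = "<host_class>"
          size               = 1       # fixed number of driver hosts
        }
        executor = {
          resource_preset_id = "<host_class>"
          min_size           = 2       # autoscaling lower bound
          max_size           = 4       # autoscaling upper bound
        }
      }
      history_server = {
        enabled = true
      }
      dependencies = {
        pip_packages = ["pandas==2.1.1"]
      }
    }

    logging = {
      enabled   = true
      folder_id = "b1gexamplefolderid11"  # logs go to this folder's default log group
    }
  }
  ```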
- Validate your configuration.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:

    ```bash
    terraform validate
    ```

    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:

    ```bash
    terraform plan
    ```

    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:

      ```bash
      terraform apply
      ```

    - Confirm updating the resources.
    - Wait for the operation to complete.
For more information, see this Terraform provider article.
To change the cluster settings:
- Get an IAM token for API authentication and place it in an environment variable:

  ```bash
  export IAM_TOKEN="<IAM_token>"
  ```
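  If the Yandex Cloud CLI is installed and initialized, one common way to obtain the token is sketched below; any other supported method of getting an IAM token works as well:

  ```bash
  # Issue an IAM token for the current CLI profile and export it
  # for the gRPC requests below
  export IAM_TOKEN="$(yc iam create-token)"
  ```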
- Clone the cloudapi repository:

  ```bash
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  ```

  Below, we assume that the repository contents reside in the `~/cloudapi/` directory.
- Create a file named `body.json` and paste the following code into it:

  ```json
  {
    "cluster_id": "<cluster_ID>",
    "update_mask": "<list_of_settings_to_update>",
    "name": "<cluster_name>",
    "description": "<cluster_description>",
    "labels": {
      <label_list>
    },
    "config_spec": {
      "resource_pools": {
        "driver": {
          "resource_preset_id": "<driver_resource_ID>",
          "scale_policy": {
            "fixed_scale": {
              "size": "<number_of_driver_instances>"
            }
          }
        },
        "executor": {
          "resource_preset_id": "<executor_resource_ID>",
          "scale_policy": {
            "auto_scale": {
              "min_size": "<minimum_number_of_executor_instances>",
              "max_size": "<maximum_number_of_executor_instances>"
            }
          }
        }
      },
      "history_server": {
        "enabled": <use_Spark_History_Server>
      },
      "dependencies": {
        "pip_packages": [ <list_of_pip_packages> ],
        "deb_packages": [ <list_of_deb_packages> ]
      },
      "metastore": {
        "cluster_id": "<Apache_Hive™_Metastore_cluster_ID>"
      }
    },
    "network_spec": {
      "security_group_ids": [ <list_of_security_group_IDs> ]
    },
    "deletion_protection": <deletion_protection>,
    "service_account_id": "<service_account_ID>",
    "logging": {
      "enabled": <use_of_logging>,
      "log_group_id": "<log_group_ID>",
      "folder_id": "<folder_ID>"
    }
  }
  ```

  Where:
  - `cluster_id`: Cluster ID. You can get it with the list of clusters in the folder.
  - `update_mask`: List of settings you want to update, as an array of strings (`paths[]`).

    Format for listing settings:

    ```json
    "update_mask": {
      "paths": [
        "<setting_1>",
        "<setting_2>",
        ...
        "<setting_N>"
      ]
    }
    ```

    Warning

    When you update a cluster, all its parameters will reset to their defaults unless explicitly provided in the request. To avoid this, list the settings you want to change in the `update_mask` parameter. A minimal sample request body is shown after this list.
  - `name`: Cluster name.
  - `description`: Cluster description.
  - `labels`: List of labels provided in `"<key>": "<value>"` format.
  - `config_spec`: Cluster configuration:
    - `resource_pools`: Resource pool configuration:
      - `driver`: Host configuration to run Apache Spark™ drivers.
        - `resource_preset_id`: Driver host class.
        - `scale_policy`: Host group scaling policy for the driver:
          - `fixed_scale`: Fixed scaling policy.
            - `size`: Number of driver hosts.
          - `auto_scale`: Automatic scaling policy.
            - `min_size`: Minimum number of driver hosts.
            - `max_size`: Maximum number of driver hosts.

          Specify either `fixed_scale` or `auto_scale`.
      - `executor`: Host configuration to run Apache Spark™ executors.
        - `resource_preset_id`: Executor host class.
        - `scale_policy`: Host group scaling policy for the executor:
          - `fixed_scale`: Fixed scaling policy.
            - `size`: Number of executor hosts.
          - `auto_scale`: Automatic scaling policy.
            - `min_size`: Minimum number of executor hosts.
            - `max_size`: Maximum number of executor hosts.

          Specify either `fixed_scale` or `auto_scale`.
    - `history_server`: History server parameters.
      - `enabled`: Flag to enable the history server, which allows using the Spark History Server service to monitor applications.
    - `dependencies`: Lists of packages enabling you to install additional libraries and applications on the cluster.
      - `pip_packages`: List of pip packages.
      - `deb_packages`: List of deb packages.

      You can set version restrictions for the installed packages, e.g.:

      ```json
      "dependencies": {
        "pip_packages": [
          "pandas==2.1.1",
          "scikit-learn>=1.0.0",
          "clickhouse-driver~=0.2.0"
        ]
      }
      ```

      The package name format and version are defined by the install command: `pip install` for pip packages and `apt install` for deb packages.
    - `metastore`: Parameters of the cluster's metadata storage.
      - `cluster_id`: Apache Hive™ Metastore cluster ID.
  - `network_spec`: Network settings:
    - `security_group_ids`: List of security group IDs.
  - `deletion_protection`: Enables cluster protection against accidental deletion. The possible values are `true` or `false`.

    Even with deletion protection on, one can still connect to the cluster manually and delete it.
  - `service_account_id`: ID of the service account for access to Yandex Cloud services. Make sure to assign the `managed-spark.integrationProvider` role to this service account.
  - `logging`: Logging parameters:
    - `enabled`: Enables logging, `true` or `false`. Logs generated by Apache Spark™ components will be sent to Yandex Cloud Logging.
    - `folder_id`: Folder ID. Logs will be written to the default log group for this folder.
    - `log_group_id`: Custom log group ID. Logs will be written to this group.

    Specify either `folder_id` or `log_group_id`.
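  As promised above, here is a minimal sketch of a request body that updates only the description and deletion protection; the cluster ID is a placeholder, and the `paths` values must match the fields you actually change:

  ```json
  {
    "cluster_id": "c9qexampleclusterid11",
    "update_mask": {
      "paths": [
        "description",
        "deletion_protection"
      ]
    },
    "description": "Spark cluster for nightly ETL jobs",
    "deletion_protection": true
  }
  ```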
- Call the ClusterService.Update method, e.g., via the following gRPCurl request:

  ```bash
  grpcurl \
    -format json \
    -import-path ~/cloudapi/ \
    -import-path ~/cloudapi/third_party/googleapis/ \
    -proto ~/cloudapi/yandex/cloud/spark/v1/cluster_service.proto \
    -rpc-header "Authorization: Bearer $IAM_TOKEN" \
    -d @ \
    spark.api.cloud.yandex.net:443 \
    yandex.cloud.spark.v1.ClusterService.Update \
    < body.json
  ```

- Check the server response to make sure your request was successful.