Updating an Apache Spark™ cluster
After creating a cluster, you can edit its basic and advanced settings.
To change the cluster settings in the management console:
- Navigate to the folder page and select Managed Service for Apache Spark™.
- Select the cluster and click Edit in the top panel.
- Under Basic parameters:
  - Edit the cluster name and description.
  - Delete or add new labels.
  - Select a service account or create a new one with the `managed-spark.integrationProvider` role. The cluster will thus get the permissions it needs to work with other resources. A CLI sketch for assigning this role is shown after this procedure.
- Under Network settings, select a security group for cluster network traffic.
- Under Driver configuration and Executor configuration, specify the number of instances and the computing resource configuration. The number of instances can be either fixed or autoscalable.
- Under Advanced settings:
  - Delete or add names of pip and deb packages.

    The package name and version format is defined by the install command: `pip install` for pip packages and `apt install` for deb packages.
  - In the Maintenance window setting, update the cluster maintenance time:
    - To enable maintenance at any time, select arbitrary (default).
    - To specify the preferred maintenance start time, select by schedule and specify the desired day of the week and UTC hour. For example, you can choose a time when the cluster is least loaded.

    Maintenance operations are carried out on both enabled and disabled clusters. They may include updating the DBMS, applying patches, and so on.
  - Select an Apache Hive™ Metastore cluster to connect as a metadata storage.
  - Enable or disable cluster deletion protection.
  - Enable or disable History Server. This option allows using the service to monitor Spark History Server applications.
  - Enable or disable Write logs. This option enables logging of Spark applications in the cluster:
    - Select the log destination:
      - Folder: Select a folder from the list.
      - Group: Select a log group from the list or create a new one.
    - Select Min. logging level from the list.
- Click Save.
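The `managed-spark.integrationProvider` role mentioned in this procedure can also be assigned outside the console. Below is a minimal sketch using the `yc` CLI, assuming the role is granted at the folder level; the folder ID and service account ID are placeholders to replace with your own values:

```bash
# Sketch only: grant the integration role to a service account at the folder level.
# Replace <folder_ID> and <service_account_ID> with your own values.
yc resource-manager folder add-access-binding <folder_ID> \
  --role managed-spark.integrationProvider \
  --subject serviceAccount:<service_account_ID>
```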
To change the cluster settings using Terraform:
- Open the current Terraform configuration file that defines your infrastructure.

  For more information about creating this file, see Creating clusters.
- To change cluster settings, change the required field values in the configuration file.

  Alert

  Do not change the cluster name using Terraform. This will delete the existing cluster and create a new one.

  Here is an example of the configuration file structure:
resource "yandex_spark_cluster" "my_spark_cluster" { description = "<cluster_description>" name = "my-spark-cluster" folder_id = "<folder_ID>" service_account_id = "<service_account_ID>" deletion_protection = <protect_cluster_from_deletion> labels = { <label_list> } network = { subnet_ids = ["<list_of_subnet_IDs>"] security_group_ids = ["<list_of_security_group_IDs>"] } config = { resource_pools = { driver = { resource_preset_id = "<host_class>" size = <fixed_number_of_instances> } executor = { resource_preset_id = "<host_class>" size = <fixed_number_of_instances> } } history_server = { enabled = <use_of_Apache_Spark_History_Server> } metastore = { cluster_id = "<Apache_Hive™_Metastore_cluster_ID>" } dependencies = { deb_packages = ["<list_of_deb_packages>"] pip_packages = ["<list_of_pip_packages>"] } } maintenance_window = { type = "<maintenance_type>" day = "<day_of_week>" hour = "<hour>" } logging = { enabled = <enable_logging> folder_id = "<folder_ID>" } }Where:
  - `description`: Cluster description.
  - `service_account_id`: Service account ID.
  - `deletion_protection`: Cluster protection from accidental deletion, `true` or `false`.
  - `labels`: List of labels. Provide labels in `<key> = "<value>"` format.
  - `security_group_ids`: List of security group IDs.
  - `driver`: Host configuration to run Apache Spark™ drivers. In this section, specify:
    - Host class in the `resource_preset_id` parameter.
    - Number of instances. Specify a fixed number in the `size` parameter or the minimum and maximum number for autoscaling in the `min_size` and `max_size` parameters.
  - `executor`: Host configuration to run Apache Spark™ executors. In this section, specify:
    - Host class in the `resource_preset_id` parameter.
    - Number of instances. Specify a fixed number in the `size` parameter or the minimum and maximum number for autoscaling in the `min_size` and `max_size` parameters (see the sketch after this list).
  - `maintenance_window`: Maintenance window settings (including for disabled clusters). In this section, specify:
    - Maintenance type in the `type` parameter. The possible values are:
      - `ANYTIME`: Any time.
      - `WEEKLY`: On a schedule.
    - Day of week for the `WEEKLY` maintenance type in the `day` parameter: `MON`, `TUE`, `WED`, `THU`, `FRI`, `SAT`, or `SUN`.
    - UTC hour for the `WEEKLY` maintenance type in the `hour` parameter, from `1` to `24`.
  - `history_server`: Connecting Apache Spark™ History Server. To use the service, set the `enabled` parameter to `true`.
  - `metastore`: Connecting an Apache Hive™ Metastore metadata storage. Specify an Apache Hive™ Metastore cluster ID in the `cluster_id` parameter.
  - `dependencies`: Additional deb and pip packages for running Apache Spark™ jobs. In this section, specify:
    - `deb_packages`: Names of deb packages. Their format depends on the `apt install` installation command.
    - `pip_packages`: Names of pip packages. Their format depends on the `pip install` installation command.
  - `logging`: Logging parameters. Logs generated by Apache Spark™ components will be sent to Yandex Cloud Logging. To enable logging:
    - Set `enabled = true`.
    - Specify one of two log storage locations:
      - `folder_id`: Folder ID. Logs will be written to the default log group for this folder.
      - `log_group_id`: Custom log group ID. Logs will be written to this group.
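  The template above uses a fixed `size` for both resource pools. For illustration only, here is a sketch of how the `executor` pool and the maintenance window might look with autoscaling and a weekly schedule, assuming the `min_size`/`max_size` and `WEEKLY` settings described above; the host class and numbers are placeholders to adjust for your workload:

  ```hcl
  # Sketch only, not a complete resource: the executor block belongs inside
  # config.resource_pools and maintenance_window at the top level of the
  # resource, as in the template above. This executor pool autoscales
  # between 2 and 8 instances; maintenance runs on Saturday at 12:00 UTC.
  executor = {
    resource_preset_id = "<host_class>"
    min_size           = 2
    max_size           = 8
  }

  maintenance_window = {
    type = "WEEKLY"
    day  = "SAT"
    hour = "12"
  }
  ```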
- Validate your configuration.
  - In the command line, navigate to the directory that contains the current Terraform configuration files defining the infrastructure.
  - Run this command:

    ```bash
    terraform validate
    ```

    Terraform will show any errors found in your configuration files.
- Confirm updating the resources.
  - Run this command to view the planned changes:

    ```bash
    terraform plan
    ```

    If you described the configuration correctly, the terminal will display a list of the resources to update and their parameters. This is a verification step that does not apply changes to your resources.
  - If everything looks correct, apply the changes:
    - Run this command:

      ```bash
      terraform apply
      ```

    - Confirm updating the resources.
    - Wait for the operation to complete.
For more information, see this Terraform provider article.
To change the cluster settings using the gRPC API:
- Get an IAM token for API authentication and save it as an environment variable:

  ```bash
  export IAM_TOKEN="<IAM_token>"
  ```
- Clone the cloudapi repository:

  ```bash
  cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
  ```

  Below, we assume the repository contents are stored in the `~/cloudapi/` directory.
- Create a file named `body.json` and paste the following code into it:

  ```json
  {
    "cluster_id": "<cluster_ID>",
    "update_mask": "<list_of_parameters_to_update>",
    "name": "<cluster_name>",
    "description": "<cluster_description>",
    "labels": { <label_list> },
    "config_spec": {
      "resource_pools": {
        "driver": {
          "resource_preset_id": "<driver_resource_ID>",
          "scale_policy": {
            "fixed_scale": {
              "size": "<number_of_driver_instances>"
            }
          }
        },
        "executor": {
          "resource_preset_id": "<executor_resource_ID>",
          "scale_policy": {
            "auto_scale": {
              "min_size": "<minimum_number_of_executor_instances>",
              "max_size": "<maximum_number_of_executor_instances>"
            }
          }
        }
      },
      "history_server": {
        "enabled": <use_of_Apache_Spark_History_Server>
      },
      "dependencies": {
        "pip_packages": [ <list_of_pip_packages> ],
        "deb_packages": [ <list_of_deb_packages> ]
      },
      "metastore": {
        "cluster_id": "<Apache_Hive™_Metastore_cluster_ID>"
      }
    },
    "network_spec": {
      "security_group_ids": [ <list_of_security_group_IDs> ]
    },
    "deletion_protection": <deletion_protection>,
    "service_account_id": "<service_account_ID>",
    "logging": {
      "enabled": <use_of_logging>,
      "log_group_id": "<log_group_ID>",
      "folder_id": "<folder_ID>"
    }
  }
  ```

  Where:
  - `cluster_id`: Cluster ID. You can get it with the list of clusters in a folder.
  - `update_mask`: List of parameters to update as an array of `paths[]` strings.

    Format for listing settings:

    ```json
    "update_mask": {
        "paths": [
            "<setting_1>",
            "<setting_2>",
            ...
            "<setting_N>"
        ]
    }
    ```

    Warning

    When you update a cluster, all its parameters will reset to their defaults unless explicitly provided in the request. To avoid this, list the settings you want to change in the `update_mask` parameter.

    A minimal example of such a request body is shown after this list.
  - `name`: Cluster name.
  - `description`: Cluster description.
  - `labels`: List of labels. Provide labels in `"<key>": "<value>"` format.
  - `config_spec`: Cluster configuration:
    - `resource_pools`: Resource pool configuration:
      - `driver`: Host configuration to run Apache Spark™ drivers.
        - `resource_preset_id`: Driver host class.
        - `scale_policy`: Host group scaling policy for the driver:
          - `fixed_scale`: Fixed scaling policy.
            - `size`: Number of driver hosts.
          - `auto_scale`: Automatic scaling policy.
            - `min_size`: Minimum number of driver hosts.
            - `max_size`: Maximum number of driver hosts.

          Specify either `fixed_scale` or `auto_scale`.
      - `executor`: Host configuration to run Apache Spark™ executors.
        - `resource_preset_id`: Executor host class.
        - `scale_policy`: Host group scaling policy for the executor:
          - `fixed_scale`: Fixed scaling policy.
            - `size`: Number of executor hosts.
          - `auto_scale`: Automatic scaling policy.
            - `min_size`: Minimum number of executor hosts.
            - `max_size`: Maximum number of executor hosts.

          Specify either `fixed_scale` or `auto_scale`.
    - `history_server`: History server parameters.
      - `enabled`: Flag to enable the history server. It allows using the service to monitor Spark History Server applications.
    - `dependencies`: Lists of packages enabling you to install additional libraries and applications on the cluster.
      - `pip_packages`: List of pip packages.
      - `deb_packages`: List of deb packages.

      You can set version restrictions for the installed packages, e.g.:

      ```json
      "dependencies": {
          "pip_packages": [
              "pandas==2.0.2",
              "scikit-learn>=1.0.0",
              "clickhouse-driver~=0.2.0"
          ]
      }
      ```

      The package name and version format is defined by the install command: `pip install` for pip packages and `apt install` for deb packages.
    - `metastore`: Parameters of the cluster's metadata storage.
      - `cluster_id`: Apache Hive™ Metastore cluster ID.
  - `network_spec`: Network settings:
    - `security_group_ids`: List of security group IDs.
  - `deletion_protection`: Enables cluster protection against accidental deletion. The possible values are `true` or `false`.

    Even if deletion protection is enabled, one can still connect to the cluster manually and delete the data.
  - `service_account_id`: ID of the service account for access to Yandex Cloud services. Make sure to assign the `managed-spark.integrationProvider` role to this service account.
  - `logging`: Logging parameters:
    - `enabled`: Enables logging, `true` or `false`. Logs generated by Apache Spark™ components will be sent to Yandex Cloud Logging.
    - `folder_id`: Folder ID. Logs will be written to the default log group for this folder.
    - `log_group_id`: Custom log group ID. Logs will be written to this group.

    Specify either `folder_id` or `log_group_id`.
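  For illustration, here is a hypothetical minimal `body.json` that updates only the cluster name and description. Per the warning above, only the paths listed in `update_mask` are changed; the cluster ID and values are placeholders to replace with your own:

  ```json
  {
      "cluster_id": "<cluster_ID>",
      "update_mask": {
          "paths": ["name", "description"]
      },
      "name": "new-cluster-name",
      "description": "Updated cluster description"
  }
  ```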
- Use the ClusterService.Update call and send the following request, e.g., via gRPCurl:

  ```bash
  grpcurl \
      -format json \
      -import-path ~/cloudapi/ \
      -import-path ~/cloudapi/third_party/googleapis/ \
      -proto ~/cloudapi/yandex/cloud/spark/v1/cluster_service.proto \
      -rpc-header "Authorization: Bearer $IAM_TOKEN" \
      -d @ \
      spark.api.cloud.yandex.net:443 \
      yandex.cloud.spark.v1.ClusterService.Update \
      < body.json
  ```
- View the server response to make sure your request was successful.