Creating a Yandex Data Processing cluster
Roles for creating a cluster
To create a Yandex Data Processing cluster, your Yandex Cloud account needs the following roles:
- dataproc.editor: To create a cluster.
- vpc.user: To use the cluster network.
- iam.serviceAccounts.user: To link a service account to the cluster and create resources under that service account.
Make sure to assign the following roles to the Yandex Data Processing cluster service account:
- dataproc.agent: To enable the service account to get info on cluster host states, jobs, and log groups.
- dataproc.provisioner: To enable the service account to work with an autoscaling instance group. This will enable subcluster autoscaling.
For more information about assigning roles, see the Yandex Identity and Access Management documentation.
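As a rough sketch of the service account role assignment described above, the helper below prints (but does not run) one `yc resource-manager folder add-access-binding` command per role. The function name `print_sa_role_bindings` is hypothetical, the folder and service account IDs are placeholders, and folder-level assignment is an assumption; adjust the scope to your setup.

```shell
#!/usr/bin/env bash
# Dry-run helper (hypothetical name): prints the role-binding commands
# for the cluster service account instead of executing them.
print_sa_role_bindings() {
  local folder_id="$1" sa_id="$2" role
  for role in dataproc.agent dataproc.provisioner; do
    printf 'yc resource-manager folder add-access-binding %s --role %s --subject serviceAccount:%s\n' \
      "$folder_id" "$role" "$sa_id"
  done
}

# Placeholders; replace with real IDs before running the printed commands.
print_sa_role_bindings "<folder_ID>" "<service_account_ID>"
```

Review the printed commands and run them manually once the IDs are filled in.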
Configure a network
Configure internet access from the subnet to which the Yandex Data Processing subcluster with a master host will be connected, e.g., using a NAT gateway. This will enable the Yandex Data Processing subcluster to work with Yandex Cloud services or hosts in other networks.
Configure security groups
Warning
Create and configure security groups before creating a Yandex Data Processing cluster. If the selected security groups do not have the required rules, Yandex Cloud blocks cluster creation.
- Create one or more security groups for service traffic of the Yandex Data Processing cluster.
- Add rules:
  - One rule for inbound and another one for outbound service traffic:
    - Port range: 0-65535
    - Protocol: Any
    - Source/Destination name: Security group
    - Security group: Current
  - A separate rule for outgoing HTTPS traffic. This will allow you to use Yandex Object Storage buckets, UI Proxy, and autoscaling of Yandex Data Processing clusters.
    You can set up this rule using one of the two methods:
    To all addresses:
    - Port range: 443
    - Protocol: TCP
    - Destination name: CIDR
    - CIDR blocks: 0.0.0.0/0
    To the addresses used by Yandex Cloud:
    - Port range: 443
    - Protocol: TCP
    - Destination name: CIDR
    - CIDR blocks:
      - 84.201.181.26/32: Getting the Yandex Data Processing cluster status, running jobs, UI Proxy.
      - 158.160.59.216/32: Monitoring the Yandex Data Processing cluster state, autoscaling.
      - 213.180.193.243/32: Access to Object Storage.
      - 84.201.181.184/32: Cloud Logging address.
  - Rule that allows access to NTP servers for time syncing:
    - Port range: 123
    - Protocol: UDP
    - Destination name: CIDR
    - CIDR blocks: 0.0.0.0/0
- If you plan to use multiple security groups for your Yandex Data Processing cluster, allow all traffic between these groups.
Note
You can specify more detailed rules for your security groups, e.g., to allow traffic only in specific subnets.
You must configure security groups correctly for all subnets in which the Yandex Data Processing cluster hosts will reside.
You can set up security groups after creating a Yandex Data Processing cluster to connect to Metastore or Yandex Data Processing cluster hosts via the internet or an intermediate VM.
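The service-traffic rules above could be expressed with the CLI roughly as follows. This is a sketch only: the helper `print_sg_command` is hypothetical, it prints the command rather than running it, and the `--rule` property names (`direction`, `port`, `from-port`, `to-port`, `protocol`, `v4-cidrs`, `predefined=self_security_group`) are assumptions to check against `yc vpc security-group create --help`. It uses the "to all addresses" variant of the HTTPS rule.

```shell
#!/usr/bin/env bash
# Dry-run helper (hypothetical name): prints a yc command that would create
# a security group with the service-traffic, HTTPS, and NTP rules.
print_sg_command() {
  local sg_name="$1" network_name="$2"
  cat <<EOF
yc vpc security-group create \\
  --name ${sg_name} \\
  --network-name ${network_name} \\
  --rule "direction=ingress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group" \\
  --rule "direction=egress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group" \\
  --rule "direction=egress,port=443,protocol=tcp,v4-cidrs=[0.0.0.0/0]" \\
  --rule "direction=egress,port=123,protocol=udp,v4-cidrs=[0.0.0.0/0]"
EOF
}

# Placeholder names; nothing is created by this call.
print_sg_command "<security_group_name>" "<network_name>"
```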
Create a Yandex Data Processing cluster
A Yandex Data Processing cluster must include a subcluster with a master host and at least one subcluster for data storage or processing.
If you want to create a Yandex Data Processing cluster copy, import its configuration to Terraform.
- In the management console, select the folder where you want to create a Yandex Data Processing cluster.
- Click Create resource and select Yandex Data Processing cluster in the drop-down list.
- Enter a name for the Yandex Data Processing cluster. Optionally, add a cluster description.
  The naming requirements are as follows:
  - It must be unique within the folder.
  - It must be from 2 to 63 characters long.
  - It may contain lowercase Latin letters, numbers, and hyphens.
  - The first character must be a letter and the last character cannot be a hyphen.
- Add or delete cluster labels. You can use them to split or join resources into logical groups.
- Specify the following cluster settings:
  - Image version and services you want to use in the Yandex Data Processing cluster.
    You can use image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
    Tip
    To use the most recent image version, specify 2.0.
  - Public part of the SSH key in the SSH key field. For information on how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
  - Service account to which you need to grant access to the Yandex Data Processing cluster.
  - Availability zone for the Yandex Data Processing cluster.
  - (Optional) Cluster component properties.
  - (Optional) Custom initialization scripts for cluster hosts. For each script, specify the following information:
    - URI link to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
    - (Optional) Script timeout (in seconds). If your initialization script runs longer than the specified time period, it will be terminated.
    - (Optional) Arguments, enclosed in square brackets and separated by commas. The initialization script will use these arguments. Example of the arguments:
      ["arg1","arg2",...,"argN"]
  - Yandex Object Storage bucket name selection format, List or ID.
  - Bucket that will be used by the cluster.
    Depending on the format you selected, either pick a name from the list or specify it manually. You can get it with a list of buckets in the folder.
  - Format in which to specify a network for the Yandex Data Processing cluster.
  - Network for the cluster.
  - Security groups with the required permissions.
    Warning
    When you create a Yandex Data Processing cluster, security group settings are verified. If the Yandex Data Processing cluster cannot operate properly with these settings, you will get a warning. A sample functional configuration is provided above.
  - UI Proxy. If the UI Proxy option is enabled, Yandex Data Processing component web interfaces will be available.
  - Yandex Cloud Logging log group the cluster will send logs to.
    To save logs in a log group, assign the logging.writer role to the cluster service account.
- Configure the Yandex Data Processing subclusters. The following types are available:
  - Subcluster with a master host is specified as Master. There can only be one such subcluster.
  - Data storage subclusters are specified as Data. These are used to deploy storage components.
  - Data processing subclusters are specified as Compute. These are used to deploy computing components. The storage of such a subcluster can only be used to temporarily store processed files.
  For each Yandex Data Processing subcluster, specify:
  - Subcluster name.
  - Number of hosts (there can only be one master host).
  - Host class, i.e., the platform and computing resources available to the host.
  - Storage size and type.
  - Subnet.
    In the subnet, you need to set up a NAT gateway for the Yandex Data Processing subcluster with a master host. For more information, see Configure a network.
  - Access to Yandex Data Processing subcluster hosts from the internet. Select the Public access option to enable access. In this case, you can only connect to Yandex Data Processing subcluster hosts using SSL. For more information, see Connecting to a Yandex Data Processing cluster.
    Warning
    After you create a Yandex Data Processing cluster, you cannot enable or disable public access to the subcluster. However, you can delete the Yandex Data Processing subcluster for data processing and create it again with the public access settings you need.
- (Optional) Set up automatic scaling for data processing subclusters.
  - In the Compute type subcluster settings, enable Autoscaling.
  - Set autoscaling parameters.
  - The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU utilization, disable the Default scaling setting and specify the target CPU utilization level.
  - Click Add.
- (Optional) Add and set up additional subclusters for data storage or processing.
- (Optional) Use additional settings to enable protection against accidental cluster deletion.
  With the protection enabled, one can still connect to the Yandex Data Processing cluster manually and delete the data.
- Click Create cluster.
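The naming requirements listed in the console steps (2 to 63 characters, lowercase Latin letters, numbers, and hyphens, a leading letter, no trailing hyphen) can be checked locally before you submit a name. The function below is a minimal sketch with a hypothetical name, `is_valid_cluster_name`; it cannot check uniqueness within the folder.

```shell
#!/usr/bin/env bash
# Returns 0 if the name satisfies the console naming rules, 1 otherwise.
# Uniqueness within the folder is not checked here.
is_valid_cluster_name() {
  local LC_ALL=C   # keep [a-z] strictly ASCII lowercase
  local name="$1"
  # First character: a lowercase letter; middle: up to 61 letters, digits,
  # or hyphens; last character: a letter or digit. Total length: 2 to 63.
  [[ "$name" =~ ^[a-z][a-z0-9-]{0,61}[a-z0-9]$ ]]
}

if is_valid_cluster_name "my-dataproc"; then
  echo "valid"
else
  echo "invalid"
fi
```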
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To create a Yandex Data Processing cluster:
- Check whether the folder has any subnets for the Yandex Data Processing cluster hosts:
  yc vpc subnet list
  If there are no subnets in the folder, create the required subnets in Yandex Virtual Private Cloud.
- View the description of the CLI command to create a Yandex Data Processing cluster:
  yc dataproc cluster create --help
- Specify Yandex Data Processing cluster parameters in the create command (the list of supported parameters in the example is not exhaustive):
  yc dataproc cluster create <cluster_name> \
    --bucket=<bucket_name> \
    --zone=<availability_zone> \
    --service-account-name=<service_account_name> \
    --version=<image_version> \
    --services=<list_of_components> \
    --ssh-public-keys-file=<path_to_public_SSH_key> \
    --subcluster name=<name_of_subcluster_with_master_host>,`
                `role=masternode,`
                `resource-preset=<host_class>,`
                `disk-type=<storage_type>,`
                `disk-size=<storage_size_in_GB>,`
                `subnet-name=<subnet_name>,`
                `assign-public-ip=<public_access_to_subcluster_host> \
    --subcluster name=<name_of_subcluster_for_storing_data>,`
                `role=datanode,`
                `resource-preset=<host_class>,`
                `disk-type=<storage_type>,`
                `disk-size=<storage_size_in_GB>,`
                `subnet-name=<subnet_name>,`
                `hosts-count=<number_of_hosts>,`
                `assign-public-ip=<public_access_to_subcluster_host> \
    --deletion-protection \
    --ui-proxy=<access_to_component_web_interfaces> \
    --log-group-id=<log_group_ID> \
    --security-group-ids=<list_of_security_group_IDs>
Note
The Yandex Data Processing cluster name must be unique within the folder. It may contain Latin letters, numbers, hyphens, and underscores. The name may be up to 63 characters long.
Where:
  - --bucket: Name of an Object Storage bucket where job dependencies and results will be stored. The service account of the Yandex Data Processing cluster must have READ and WRITE permissions for this bucket.
  - --zone: Availability zone where the Yandex Data Processing cluster hosts will reside.
  - --service-account-name: Name of the Yandex Data Processing cluster service account.
  - --version: Image version.
    You can use image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
    Tip
    To use the most recent image version, specify 2.0 in the --version parameter.
  - --services: List of components that you want to use in the Yandex Data Processing cluster. If you do not specify this parameter, these default components will be used: yarn, tez, and spark.
  - --ssh-public-keys-file: Full path to the file with the public part of the SSH key for access to the Yandex Data Processing cluster hosts.
    For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
  - --subcluster: Parameters of Yandex Data Processing subclusters:
    - name: Yandex Data Processing subcluster name.
    - role: Yandex Data Processing subcluster role, which can be masternode, datanode, or computenode.
    - resource-preset: Host class.
    - disk-type: Storage type, which can be network-ssd, network-hdd, or network-ssd-nonreplicated.
    - disk-size: Storage size in GB.
    - subnet-name: Subnet name.
    - hosts-count: Number of hosts in the Yandex Data Processing subclusters for data storage or processing. The minimum value is 1, and the maximum value is 32.
    - assign-public-ip: Access to Yandex Data Processing subcluster hosts from the internet. It can be either true or false. If access is enabled, you can only connect to the Yandex Data Processing cluster using SSL. For more information, see Connecting to a Yandex Data Processing cluster.
      Warning
      After you create a Yandex Data Processing cluster, you cannot enable or disable public access to the subcluster. However, you can delete the Yandex Data Processing subcluster for data processing and create it again with the public access settings you need.
  - --deletion-protection: Protection against accidental deletion of the Yandex Data Processing cluster.
    Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
  - --ui-proxy: Access to Yandex Data Processing component web interfaces. It can be either true or false.
  - --log-group-id: Log group ID.
  - --security-group-ids: List of security group IDs.
  To create a Yandex Data Processing cluster with multiple data storage or processing subclusters, provide the required number of --subcluster arguments in the cluster creation command:
  yc dataproc cluster create <cluster_name> \
    ...
    --subcluster <subcluster_parameters> \
    --subcluster <subcluster_parameters> \
    ...
- To enable autoscaling in Yandex Data Processing subclusters for data processing, specify the following parameters:
  yc dataproc cluster create <cluster_name> \
    ...
    --subcluster name=<subcluster_name>,`
                `role=computenode,`
                `...`
                `hosts-count=<minimum_number_of_hosts>,`
                `max-hosts-count=<maximum_number_of_hosts>,`
                `preemptible=<use_preemptible_VMs>,`
                `warmup-duration=<VM_warmup_period>,`
                `stabilization-duration=<stabilization_period>,`
                `measurement-duration=<load_measurement_interval>,`
                `cpu-utilization-target=<target_CPU_utilization_level>,`
                `autoscaling-decommission-timeout=<decommissioning_timeout>
Where:
  - hosts-count: Minimum number of hosts (VMs) per Yandex Data Processing subcluster. The minimum value is 1, and the maximum value is 32.
  - max-hosts-count: Maximum number of hosts (VMs) per Yandex Data Processing subcluster. The minimum value is 1, and the maximum value is 100.
  - preemptible: Indicates if preemptible VMs are used. It can be either true or false.
  - warmup-duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s, and the maximum value is 600s.
  - stabilization-duration: Period, in seconds, during which the required number of VMs cannot be decreased, in <value>s format. The minimum value is 60s, and the maximum value is 1800s.
  - measurement-duration: Period, in seconds, for which the average utilization is calculated for each VM, in <value>s format. The minimum value is 60s (1 minute), and the maximum value is 600s (10 minutes).
  - cpu-utilization-target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used for scaling based on the number of pending resources. The minimum value is 10, and the maximum value is 100.
  - autoscaling-decommission-timeout: Decommissioning timeout in seconds. The minimum value is 0, and the maximum value is 86400 (24 hours).
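The value ranges listed for the autoscaling parameters can be checked before a command is submitted. The sketch below uses two hypothetical helpers, `in_range` and `check_autoscaling`, and takes durations as plain seconds (without the trailing `s`). It only mirrors the documented limits and is not part of the CLI.

```shell
#!/usr/bin/env bash
# in_range VALUE MIN MAX: succeeds if VALUE is a non-negative integer in [MIN, MAX].
in_range() {
  [[ "$1" =~ ^[0-9]+$ ]] && (( 10#$1 >= $2 && 10#$1 <= $3 ))
}

# check_autoscaling HOSTS MAX_HOSTS WARMUP STABILIZATION MEASUREMENT CPU_TARGET DECOMMISSION
# Prints "ok" if every value is within its documented range, or the first violation.
check_autoscaling() {
  in_range "$1" 1 32    || { echo "hosts-count must be 1..32"; return 1; }
  in_range "$2" 1 100   || { echo "max-hosts-count must be 1..100"; return 1; }
  in_range "$3" 0 600   || { echo "warmup-duration must be 0s..600s"; return 1; }
  in_range "$4" 60 1800 || { echo "stabilization-duration must be 60s..1800s"; return 1; }
  in_range "$5" 60 600  || { echo "measurement-duration must be 60s..600s"; return 1; }
  in_range "$6" 10 100  || { echo "cpu-utilization-target must be 10..100"; return 1; }
  in_range "$7" 0 86400 || { echo "autoscaling-decommission-timeout must be 0..86400"; return 1; }
  echo "ok"
}

# Example values chosen for illustration only.
check_autoscaling 1 10 60 120 60 70 120
```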
- To create a Yandex Data Processing cluster based on dedicated host groups, specify their IDs as a comma-separated list in the --host-group-ids parameter:
  yc dataproc cluster create <cluster_name> \
    ...
    --host-group-ids=<dedicated_host_group_IDs>
Alert
You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
- To configure Yandex Data Processing cluster hosts using initialization scripts, specify them in one or multiple --initialization-action parameters:
  yc dataproc cluster create <cluster_name> \
    ...
    --initialization-action uri=<initialization_script_URI>,`
                           `timeout=<script_timeout>,`
                           `args=["arg1","arg2","arg3",...]
  Where:
  - uri: Link to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
  - (Optional) timeout: Script timeout, in seconds. If your initialization script runs longer than this time period, it will be terminated.
  - (Optional) args: Arguments for the initialization script, provided as a comma-separated list.
With Terraform
Terraform is distributed under the Business Source License
For more information about the provider resources, see the documentation on the Terraform
To create a Yandex Data Processing cluster:
- Using the command line, navigate to the folder that will contain the Terraform configuration files with an infrastructure plan. Create the directory if it does not exist.
- If you do not have Terraform, install it and configure the Yandex Cloud provider.
- Create a configuration file describing the cloud network and subnets.
  The Yandex Data Processing cluster resides in a cloud network. If you already have a suitable network, you do not need to describe it again.
  Yandex Data Processing cluster hosts reside in subnets of the selected cloud network. If you already have suitable subnets, you do not need to describe them again.
  Example structure of a configuration file that describes a cloud network with a single subnet:
  resource "yandex_vpc_network" "test_network" {
    name = "<network_name>"
  }

  resource "yandex_vpc_subnet" "test_subnet" {
    name           = "<subnet_name>"
    zone           = "<availability_zone>"
    network_id     = yandex_vpc_network.test_network.id
    v4_cidr_blocks = ["<subnet>"]
  }
- Create a configuration file with a description of the following resources:
  - Service account to which you need to grant access to the Yandex Data Processing cluster.
  - Service account to create the Object Storage bucket.
  - Static key.
  - Object Storage bucket for storing job execution results.
  resource "yandex_iam_service_account" "data_proc_sa" {
    name        = "<service_account_name>"
    description = "<service_account_description>"
  }

  resource "yandex_resourcemanager_folder_iam_member" "dataproc-agent" {
    folder_id = "<folder_ID>"
    role      = "dataproc.agent"
    member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
  }

  resource "yandex_resourcemanager_folder_iam_member" "dataproc-provisioner" {
    folder_id = "<folder_ID>"
    role      = "dataproc.provisioner"
    member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
  }

  resource "yandex_iam_service_account" "bucket_sa" {
    name        = "<service_account_name>"
    description = "<service_account_description>"
  }

  resource "yandex_resourcemanager_folder_iam_member" "storage-editor" {
    folder_id = "<folder_ID>"
    role      = "storage.editor"
    member    = "serviceAccount:${yandex_iam_service_account.bucket_sa.id}"
  }

  resource "yandex_iam_service_account_static_access_key" "bucket_sa_static_key" {
    service_account_id = yandex_iam_service_account.bucket_sa.id
  }

  resource "yandex_storage_bucket" "data_bucket" {
    depends_on = [
      yandex_resourcemanager_folder_iam_member.storage-editor
    ]

    bucket     = "<bucket_name>"
    access_key = yandex_iam_service_account_static_access_key.bucket_sa_static_key.access_key
    secret_key = yandex_iam_service_account_static_access_key.bucket_sa_static_key.secret_key
  }
- Create a configuration file describing the Yandex Data Processing cluster and its subclusters.
  If required, here you can also specify the properties of the Yandex Data Processing cluster components, jobs, and the environment.
  Below is an example of a configuration file structure that describes a Yandex Data Processing cluster consisting of a subcluster with a master host, a data storage subcluster, and a data processing subcluster:
  resource "yandex_dataproc_cluster" "data_cluster" {
    bucket              = "<bucket_name>"
    name                = "<cluster_name>"
    description         = "<cluster_description>"
    service_account_id  = yandex_iam_service_account.data_proc_sa.id
    zone_id             = "<availability_zone>"
    security_group_ids  = ["<list_of_security_group_IDs>"]
    deletion_protection = <cluster_deletion_protection>

    cluster_config {
      version_id = "<image_version>"

      hadoop {
        # Example of the list: ["HDFS", "YARN", "SPARK", "TEZ", "MAPREDUCE", "HIVE"].
        services = ["<list_of_components>"]
        properties = {
          "<component_property>" = <value>
          ...
        }
        # file() reads the public key from the given path.
        ssh_public_keys = [
          file("<path_to_public_SSH_key>")
        ]
      }

      subcluster_spec {
        name = "<name_of_subcluster_with_master_host>"
        role = "MASTERNODE"
        resources {
          resource_preset_id = "<host_class>"
          disk_type_id       = "<storage_type>"
          disk_size          = <storage_size_in_GB>
        }
        subnet_id   = yandex_vpc_subnet.test_subnet.id
        hosts_count = 1
      }

      subcluster_spec {
        name = "<name_of_subcluster_for_storing_data>"
        role = "DATANODE"
        resources {
          resource_preset_id = "<host_class>"
          disk_type_id       = "<storage_type>"
          disk_size          = <storage_size_in_GB>
        }
        subnet_id   = yandex_vpc_subnet.test_subnet.id
        hosts_count = <number_of_hosts_in_subcluster>
      }

      subcluster_spec {
        name = "<data_processing_subcluster_name>"
        role = "COMPUTENODE"
        resources {
          resource_preset_id = "<host_class>"
          disk_type_id       = "<storage_type>"
          disk_size          = <storage_size_in_GB>
        }
        subnet_id   = yandex_vpc_subnet.test_subnet.id
        hosts_count = <number_of_hosts_in_subcluster>
      }
    }
  }
  Where deletion_protection means deletion protection for the Yandex Data Processing cluster. It can be either true or false.
  Enabled deletion protection will not prevent a manual connection to the cluster to delete data.
  You can use image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
  Tip
  To use the most recent image version, specify 2.0 in the version_id parameter.
  To access web interfaces of Yandex Data Processing components, add the ui_proxy field set to true to the Yandex Data Processing cluster description:
  resource "yandex_dataproc_cluster" "data_cluster" {
    ...
    ui_proxy = true
    ...
  }
  To configure autoscaling parameters in Yandex Data Processing subclusters for data processing, add the autoscaling_config section with the settings you need to the subcluster_spec description of the relevant subcluster:
  subcluster_spec {
    name = "<subcluster_name>"
    role = "COMPUTENODE"
    ...
    autoscaling_config {
      max_hosts_count        = <maximum_number_of_VMs_in_group>
      measurement_duration   = <load_measurement_interval>
      warmup_duration        = <warmup_period>
      stabilization_duration = <stabilization_period>
      preemptible            = <use_preemptible_VMs>
      cpu_utilization_target = <target_CPU_utilization_level>
      decommission_timeout   = <decommissioning_timeout>
    }
  }
  Where:
  - max_hosts_count: Maximum number of hosts (VMs) per Yandex Data Processing subcluster. The minimum value is 1, and the maximum value is 100.
  - measurement_duration: Period, in seconds, for which the average utilization is calculated for each VM, in <value>s format. The minimum value is 60s (1 minute), and the maximum value is 600s (10 minutes).
  - warmup_duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s, and the maximum value is 600s.
  - stabilization_duration: Period, in seconds, during which the required number of VMs cannot be decreased, in <value>s format. The minimum value is 60s, and the maximum value is 1800s.
  - preemptible: Indicates if preemptible VMs are used. It can be either true or false.
  - cpu_utilization_target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used for scaling based on the number of pending resources. The minimum value is 10, and the maximum value is 100.
  - decommission_timeout: Decommissioning timeout in seconds. The minimum value is 0, and the maximum value is 86400 (24 hours).
  For more information about the resources you can create with Terraform, see the provider documentation.
- Check that the Terraform configuration files are correct:
  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:
    terraform validate
    If there are errors in the configuration files, Terraform will point to them.
- Create a Yandex Data Processing cluster:
  - Run the command to view planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
  All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
To create a Yandex Data Processing cluster, use the create API method and include the following in the request:
- ID of the folder to host the Yandex Data Processing cluster, in the folderId parameter.
- Yandex Data Processing cluster name in the name parameter.
- Yandex Data Processing cluster configuration in the configSpec parameter, including:
  - Image version in the configSpec.versionId parameter.
    You can use image version 2.0.39 or higher to create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
    Tip
    To use the most recent image version, specify 2.0.
  - List of components in the configSpec.hadoop.services parameter.
  - Public part of the SSH key in the configSpec.hadoop.sshPublicKeys parameter.
  - Settings of the Yandex Data Processing subclusters in the configSpec.subclustersSpec parameter.
- Availability zone of the Yandex Data Processing cluster in the zoneId parameter.
- ID of the Yandex Data Processing cluster service account in the serviceAccountId parameter.
- Bucket name in the bucket parameter.
- IDs of the Yandex Data Processing cluster security groups in the securityGroupIds parameter.
- Yandex Data Processing cluster deletion protection settings in the deletionProtection parameter.
  Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
To assign a public IP address to all hosts of a Yandex Data Processing subcluster, provide the true value in the configSpec.subclustersSpec.assignPublicIp parameter.
To create a Yandex Data Processing cluster based on dedicated host groups, provide a list of their IDs in the hostGroupIds parameter.
Alert
You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
To configure Yandex Data Processing cluster hosts using initialization scripts, specify them in one or multiple configSpec.hadoop.initializationActions parameters.
After the Yandex Data Processing cluster status changes to Running, you can connect to the Yandex Data Processing subcluster hosts using the specified SSH key.
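To make the request parameters above more concrete, the sketch below prints an illustrative JSON body for the create method. The helper name `print_create_body` is hypothetical, all values are placeholders, the subcluster settings are left as a single placeholder rather than guessed, and `securityGroupIds` is an assumed field name that should be confirmed in the API reference before use.

```shell
#!/usr/bin/env bash
# Prints an illustrative request body for the create method.
# Nothing is sent; every value in angle brackets is a placeholder.
print_create_body() {
  local folder_id="$1" cluster_name="$2"
  cat <<EOF
{
  "folderId": "${folder_id}",
  "name": "${cluster_name}",
  "configSpec": {
    "versionId": "2.0",
    "hadoop": {
      "services": ["SPARK", "YARN"],
      "sshPublicKeys": ["<public_SSH_key>"]
    },
    "subclustersSpec": ["<subcluster_settings>"]
  },
  "zoneId": "<availability_zone>",
  "serviceAccountId": "<service_account_ID>",
  "bucket": "<bucket_name>",
  "securityGroupIds": ["<security_group_ID>"],
  "deletionProtection": true
}
EOF
}

print_create_body "<folder_ID>" "my-dataproc"
```

Replace the `subclustersSpec` placeholder with real subcluster objects as described in the API reference before sending the request.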
Create a Yandex Data Processing cluster copy
You can create a Yandex Data Processing cluster with the same settings as the one you previously created. To do so, you need to import the configuration of the source Yandex Data Processing cluster to Terraform. This way you can either create an identical copy or use the imported configuration as the baseline and modify it as needed. Importing is a convenient option when the source Yandex Data Processing cluster has lots of settings (e.g., it is an HDFS cluster) and you need to create a similar one.
To create a Yandex Data Processing cluster copy:
- If you do not have Terraform yet, install it.
- Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.
- Configure and initialize a provider. There is no need to create a provider configuration file manually: you can download it.
- Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
- In the same working directory, place a .tf file with the following contents:
  resource "yandex_dataproc_cluster" "old" { }
- Write the ID of the initial Yandex Data Processing cluster to the environment variable:
  export DATAPROC_CLUSTER_ID=<cluster_ID>
  You can request the ID with the list of clusters in the folder.
- Import the settings of the initial Yandex Data Processing cluster into the Terraform configuration:
  terraform import yandex_dataproc_cluster.old ${DATAPROC_CLUSTER_ID}
- Get the imported configuration:
  terraform show
- Copy it from the terminal and paste it into the .tf file.
- Place the file in the new imported-cluster directory.
- Edit the copied configuration so that you can create a new Yandex Data Processing cluster from it:
  - Specify the new Yandex Data Processing cluster name in the resource string and the name parameter.
  - Delete the created_at, host_group_ids, id, and subcluster_spec.id parameters.
  - Change the SSH key format in the ssh_public_keys parameter. Source format:
    ssh_public_keys = [
      <<-EOT
        <key>
      EOT,
    ]
    Required format:
    ssh_public_keys = [
      "<key>"
    ]
  - Optionally, make further changes if you need to customize the configuration.
- Get the authentication credentials in the imported-cluster directory.
- In the same directory, configure and initialize a provider. There is no need to create a provider configuration file manually: you can download it.
- Place the configuration file in the imported-cluster directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
- Check that the Terraform configuration files are correct using this command:
  terraform validate
  If there are any errors in the configuration files, Terraform will point them out.
- Create the required infrastructure:
  - Run the command to view planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
  All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
Example
Creating a lightweight Yandex Data Processing cluster for Spark and PySpark jobs
Create a Yandex Data Processing cluster to run Spark jobs without HDFS and data storage subclusters with the following test settings:
- Name: my-dataproc.
- Bucket: dataproc-bucket.
- Availability zone: ru-central1-c.
- Service account: dataproc-sa.
- Image version: 2.0.
- Components: SPARK and YARN.
- Path to the public part of the SSH key: /home/username/.ssh/id_rsa.pub.
- Yandex Data Processing subcluster with master hosts and a single compute subcluster for processing data:
  - Class: s2.micro.
  - Network SSD storage (network-ssd): 20 GB.
  - Subnet: default-ru-central1-c.
  - Public access: Allowed.
- Security group: enp6saqnq4ie244g67sb.
- Protection against accidental Yandex Data Processing cluster deletion: Enabled.
Run this command:
yc dataproc cluster create my-dataproc \
--bucket=dataproc-bucket \
--zone=ru-central1-c \
--service-account-name=dataproc-sa \
--version=2.0 \
--services=SPARK,YARN \
--ssh-public-keys-file=/home/username/.ssh/id_rsa.pub \
--subcluster name="master",`
`role=masternode,`
`resource-preset=s2.micro,`
`disk-type=network-ssd,`
`disk-size=20,`
`subnet-name=default-ru-central1-c,`
`assign-public-ip=true \
--subcluster name="compute",`
`role=computenode,`
`resource-preset=s2.micro,`
`disk-type=network-ssd,`
`disk-size=20,`
`subnet-name=default-ru-central1-c,`
`assign-public-ip=true \
--security-group-ids=enp6saqnq4ie244g67sb \
--deletion-protection