Creating a Yandex Data Proc cluster
To create a Yandex Data Proc cluster, a user must be assigned the editor and dataproc.agent roles. For more information, see the role description.
Configure a network
Configure internet access from the subnet to which the Yandex Data Proc subcluster with a master host will be connected, e.g., using a NAT gateway. This will enable the Yandex Data Proc subcluster to interact with Yandex Cloud services or hosts in other networks.
Configure security groups
Warning
You need to create and configure security groups before creating a Yandex Data Proc cluster. If the selected security groups do not have the required rules, Yandex Cloud will not let you create the Yandex Data Proc cluster.
-
Create one or more security groups for service traffic of the Yandex Data Proc cluster.
-
Add the following rules to them:
-
One rule for inbound and another one for outbound service traffic:
- Port range: 0-65535
- Protocol: Any
- Source/Destination name: Security group
- Security group: Current
-
A separate rule for outgoing HTTPS traffic. This will allow you to use Yandex Object Storage buckets, UI Proxy, and autoscaling of Yandex Data Proc clusters.
You can set up this rule using one of the two methods.
To all addresses:
- Port range: 443
- Protocol: TCP
- Destination name: CIDR
- CIDR blocks: 0.0.0.0/0
To the addresses used by Yandex Cloud:
- Port range: 443
- Protocol: TCP
- Destination name: CIDR
- CIDR blocks:
  - 84.201.181.26/32: Getting the Yandex Data Proc cluster status, running jobs, UI Proxy.
  - 158.160.59.216/32: Monitoring the Yandex Data Proc cluster state, autoscaling.
  - 213.180.193.243/32: Access to Object Storage.
-
A rule that allows access to NTP servers for time syncing:
- Port range: 123
- Protocol: UDP
- Destination name: CIDR
- CIDR blocks: 0.0.0.0/0
-
If you plan to use multiple security groups for your Yandex Data Proc cluster, allow all traffic between these groups.
Note
You can set more detailed rules for security groups, such as allowing traffic in only specific subnets.
You must configure security groups correctly for all subnets in which the Yandex Data Proc cluster hosts will reside.
You can set up security groups after creating a Yandex Data Proc cluster to connect to Metastore or Yandex Data Proc cluster hosts via the internet or an intermediate VM.
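The service-traffic, HTTPS, and NTP rules described in this section could also be expressed as a single CLI call. The sketch below only prints such a command and sends nothing to Yandex Cloud. The group name dataproc-sg, the network name default, the helper name build_sg_command, and the exact keys inside --rule are assumptions that do not come from this page, so check them against yc vpc security-group create --help before running the printed command.

```shell
#!/bin/sh
# Dry-run sketch: print a `yc vpc security-group create` command that mirrors
# the rules from this section. build_sg_command is a hypothetical helper.
# Assumption: the --rule key names below match your yc CLI version.

build_sg_command() {
  # $1: security group name, $2: network name (both are example values)
  printf 'yc vpc security-group create --name %s --network-name %s' "$1" "$2"
  # Rules, in order: service traffic in, service traffic out,
  # outgoing HTTPS to all addresses, outgoing NTP.
  for rule in \
    "direction=ingress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group" \
    "direction=egress,from-port=0,to-port=65535,protocol=any,predefined=self_security_group" \
    "direction=egress,port=443,protocol=tcp,v4-cidrs=[0.0.0.0/0]" \
    "direction=egress,port=123,protocol=udp,v4-cidrs=[0.0.0.0/0]"
  do
    printf ' \\\n  --rule "%s"' "$rule"
  done
  printf '\n'
}

build_sg_command dataproc-sg default
```

Printing the command first makes it easy to review each rule, or to swap the 0.0.0.0/0 HTTPS rule for the three Yandex Cloud addresses, before anything is created.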
Create a Yandex Data Proc cluster
A Yandex Data Proc cluster must include a subcluster with a master host and at least one subcluster for data storage or processing.
If you want to create a Yandex Data Proc cluster copy, import its configuration to Terraform.
-
In the management console, select the folder where you want to create a Yandex Data Proc cluster.
-
Click Create resource and select Data Proc cluster from the drop-down list.
-
Enter a name for the Yandex Data Proc cluster in the Cluster name field. The naming requirements are as follows:
- The name must be unique within the folder.
- It must be from 3 to 63 characters long.
- It may contain lowercase Latin letters, numbers, and hyphens.
- The first character must be a letter and the last character cannot be a hyphen.
-
Select a suitable image version and the services you want to use in the Yandex Data Proc cluster.
Using an image of version 2.0.39 or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
Tip
To use the most recent image version, specify 2.0.
-
Enter the public part of your SSH key in the SSH key field. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
-
Select or create a service account to which you will grant access to the Yandex Data Proc cluster. Make sure to assign the dataproc.agent role to the service account of the Yandex Data Proc cluster.
-
Select the availability zone for the Yandex Data Proc cluster.
-
If required, configure the properties of Yandex Data Proc cluster components, jobs, and the environment.
-
If necessary, specify custom initialization scripts for Yandex Data Proc cluster hosts. For each script, specify:
- URI: Link to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
- (Optional) Timeout: Script execution timeout, in seconds. If your initialization script runs longer than this time period, it will be terminated.
- (Optional) Arguments: List of arguments of your initialization script, enclosed in square brackets [] and separated by commas, such as: ["arg1","arg2",...,"argN"]
-
Select the name of a bucket in Object Storage to store job dependencies and results.
-
Select a network for the Yandex Data Proc cluster.
-
Select security groups that have the required permissions.
Warning
When you create a Yandex Data Proc cluster, security group settings are verified. If the Yandex Data Proc cluster cannot operate properly with these settings, a warning will appear. A sample functional configuration is provided above.
-
Enable the UI Proxy option to access the web interfaces of Yandex Data Proc components.
-
Yandex Cloud Logging stores Yandex Data Proc cluster logs. Select a log group from the list or create a new one.
To enable this functionality, assign the logging.writer role to the service account of the Yandex Data Proc cluster. For more information, see the Cloud Logging documentation.
-
Configure Yandex Data Proc subclusters: no more than one subcluster with a master host (Master), plus subclusters for data storage or processing.
The roles of data storage and data processing subclusters differ: you can deploy data storage components on data storage subclusters and computing components on data processing subclusters. Storage on a data processing subcluster is intended only for temporarily storing the files being processed.
For each Yandex Data Proc subcluster, you can configure:
- Number of hosts.
- Host class: Platform and computing resources available to the host.
- Storage size and type.
- Subnet of the network where the Yandex Data Proc cluster resides.
In the subnet, you need to set up a NAT gateway for the Yandex Data Proc subcluster with a master host. For more information, see Configure a network.
- To access Yandex Data Proc subcluster hosts from the internet, select Public access. In this case, you can only connect to Yandex Data Proc subcluster hosts using SSL. For more information, see Connecting to a Yandex Data Proc cluster.
Warning
After you create a Yandex Data Proc cluster, you cannot request or disable public access to the subcluster. However, you can delete the Yandex Data Proc subcluster for data processing and create it again with the public access settings you need.
-
In Yandex Data Proc subclusters for data processing, you can specify autoscaling parameters.
Note
To enable automatic scaling, assign the following roles to the cluster service account:
dataproc.editor
dataproc.agent
- Under Add subcluster, click Add.
- In the Roles field, select COMPUTENODE.
- Under Scaling, enable the Autoscaling setting.
- Set autoscaling parameters.
- The default metric used for autoscaling is yarn.cluster.containersPending. To enable scaling based on CPU usage, disable the Default scaling setting and specify the target CPU utilization level.
- Click Add.
-
If required, configure additional settings of the Yandex Data Proc cluster:
Deletion protection controls whether the Yandex Data Proc cluster is protected from accidental deletion by a user.
Enabled protection will not prevent a manual connection to the Yandex Data Proc cluster and data deletion.
-
Click Create cluster.
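The naming requirements for the Cluster name field can be checked locally before you fill in the form. The sketch below is a minimal illustration, assuming only the rules listed in the steps above (3 to 63 characters, lowercase Latin letters, numbers, and hyphens, a leading letter, no trailing hyphen). The helper name is_valid_cluster_name is hypothetical, and uniqueness within the folder cannot be verified this way.

```shell
#!/bin/sh
# is_valid_cluster_name is a hypothetical helper, not part of any Yandex tool.
# It succeeds only if the name satisfies the console naming rules:
# first character is a letter, last character is a letter or digit,
# 3 to 63 characters in total, only lowercase letters, digits, hyphens.

is_valid_cluster_name() {
  printf '%s' "$1" | grep -Eq '^[a-z][a-z0-9-]{1,61}[a-z0-9]$'
}

for candidate in my-dataproc 1cluster ab cluster-; do
  if is_valid_cluster_name "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: invalid"
  fi
done
```

The pattern encodes the length limit as one leading letter, 1 to 61 middle characters, and one final letter or digit, which adds up to 3 to 63 characters.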
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To create a Yandex Data Proc cluster:
-
Check whether the folder has any subnets for the Yandex Data Proc cluster hosts:
yc vpc subnet list
If there are no subnets in the folder, create the required subnets in Yandex Virtual Private Cloud.
-
View the description of the CLI command for creating a Yandex Data Proc cluster:
yc dataproc cluster create --help
-
Specify Yandex Data Proc cluster parameters in the create command (the list of supported parameters in the example is not exhaustive):
yc dataproc cluster create <cluster_name> \
  --bucket=<bucket_name> \
  --zone=<availability_zone> \
  --service-account-name=<service_account_name> \
  --version=<image_version> \
  --services=<list_of_components> \
  --ssh-public-keys-file=<path_to_public_SSH_key> \
  --subcluster name=<name_of_subcluster_with_master_host>,`
              `role=masternode,`
              `resource-preset=<host_class>,`
              `disk-type=<storage_type>,`
              `disk-size=<storage_size_in_GB>,`
              `subnet-name=<subnet_name>,`
              `assign-public-ip=<public_access_to_subcluster_host> \
  --subcluster name=<name_of_data_storage_subcluster>,`
              `role=datanode,`
              `resource-preset=<host_class>,`
              `disk-type=<storage_type>,`
              `disk-size=<storage_size_in_GB>,`
              `subnet-name=<subnet_name>,`
              `hosts-count=<number_of_hosts>,`
              `assign-public-ip=<public_access_to_subcluster_host> \
  --deletion-protection=<cluster_deletion_protection> \
  --ui-proxy=<access_to_component_web_interfaces> \
  --log-group-id=<log_group_ID> \
  --security-group-ids=<list_of_security_group_IDs>
Note
The Yandex Data Proc cluster name must be unique within the folder. It may contain Latin letters, numbers, hyphens, and underscores. The name may be up to 63 characters long.
Where:
-
--bucket: Name of the bucket in Object Storage that will store job dependencies and results. The service account of the Yandex Data Proc cluster must have READ and WRITE permissions for this bucket.
-
--zone: Availability zone where the Yandex Data Proc cluster hosts will reside.
-
--service-account-name: Name of the Yandex Data Proc cluster service account. Make sure to assign the dataproc.agent role to the service account of the Yandex Data Proc cluster.
-
--version: Image version.
Using an image of version 2.0.39 or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
Tip
To use the most recent image version, set the --version parameter value to 2.0.
-
--services: List of components that you want to use in the Yandex Data Proc cluster. If this parameter is omitted, the default set will be used: hdfs, yarn, mapreduce, tez, and spark.
-
--ssh-public-keys-file: Full path to the file with the public part of the SSH key for access to the Yandex Data Proc cluster hosts. For information about how to generate and use SSH keys, see the Yandex Compute Cloud documentation.
-
--subcluster: Parameters of Yandex Data Proc subclusters:
- name: Yandex Data Proc subcluster name.
- role: Yandex Data Proc subcluster role: masternode, datanode, or computenode.
- resource-preset: Host class.
- disk-type: Storage type (network-ssd, network-hdd, or network-ssd-nonreplicated).
- disk-size: Storage size in GB.
- subnet-name: Name of the subnet.
- hosts-count: Number of hosts in the Yandex Data Proc subclusters for data storage or processing. The minimum value is 1 and the maximum value is 32.
- assign-public-ip: Access to Yandex Data Proc subcluster hosts from the internet. Possible values: true or false. If access is enabled, you can only connect to the Yandex Data Proc cluster using SSL. For more information, see Connecting to a Yandex Data Proc cluster.
Warning
After you create a Yandex Data Proc cluster, you cannot request or disable public access to the subcluster. However, you can delete the Yandex Data Proc subcluster for data processing and create it again with the public access settings you need.
-
--deletion-protection: Deletion protection of the Yandex Data Proc cluster. Possible values: true or false.
Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
-
--ui-proxy: Access to Yandex Data Proc component web interfaces. Possible values: true or false.
-
--log-group-id: Log group ID.
-
--security-group-ids: List of security group IDs.
To create a Yandex Data Proc cluster with multiple data storage or processing subclusters, provide the required number of --subcluster arguments in the cluster create command:
yc dataproc cluster create <cluster_name> \
  ...
  --subcluster <subcluster_parameters> \
  --subcluster <subcluster_parameters> \
  ...
-
-
To enable autoscaling in Yandex Data Proc subclusters for data processing, specify the following parameters:
yc dataproc cluster create <cluster_name> \
  ...
  --subcluster name=<subcluster_name>,`
              `role=computenode,`
              `...`
              `hosts-count=<minimum_number_of_hosts>,`
              `max-hosts-count=<maximum_number_of_hosts>,`
              `preemptible=<use_preemptible_VMs>,`
              `warmup-duration=<VM_warmup_time>,`
              `stabilization-duration=<stabilization_period>,`
              `measurement-duration=<utilization_measurement_interval>,`
              `cpu-utilization-target=<target_CPU_utilization_level>,`
              `autoscaling-decommission-timeout=<decommissioning_timeout>
Where:
- hosts-count: Minimum number of hosts (VMs) in the Yandex Data Proc subcluster. The minimum value is 1 and the maximum value is 32.
- max-hosts-count: Maximum number of hosts (VMs) in the Yandex Data Proc subcluster. The minimum value is 1 and the maximum value is 100.
- preemptible: Indicates if preemptible VMs are used. Possible values: true or false.
- warmup-duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s and the maximum value is 600s (10 minutes).
- stabilization-duration: Interval, in seconds, during which the required number of instances cannot be decreased, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 1800s (30 minutes).
- measurement-duration: Period, in seconds, for which utilization measurements are averaged for each instance, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 600s (10 minutes).
- cpu-utilization-target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used as a metric (based on the number of pending resources). The minimum value is 10 and the maximum value is 100.
- autoscaling-decommission-timeout: Decommissioning timeout in seconds. The minimum value is 0 and the maximum value is 86400 (24 hours).
Note
To enable automatic scaling, assign the following roles to the cluster service account:
dataproc.editor
dataproc.agent
-
To create a Yandex Data Proc subcluster residing on groups of dedicated hosts, specify their IDs separated by commas in the --host-group-ids parameter:
yc dataproc cluster create <cluster_name> \
  ...
  --host-group-ids=<IDs_of_groups_of_dedicated_hosts>
Alert
You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
-
To configure Yandex Data Proc cluster hosts using initialization scripts, specify them in one or multiple --initialization-action parameters:
yc dataproc cluster create <cluster_name> \
  ...
  --initialization-action uri=<initialization_script_URI>,`
                         `timeout=<script_execution_timeout>,`
                         `args=["arg1","arg2","arg3",...]
Where:
- uri: Link to the initialization script in the https://, http://, hdfs://, or s3a:// scheme.
- (Optional) timeout: Script execution timeout, in seconds. If your initialization script runs longer than this time period, it will be terminated.
- (Optional) args: Comma-separated arguments to pass to the initialization script.
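The autoscaling limits listed for the --subcluster parameters can be checked before the CLI is called. The sketch below is one possible pre-flight check, assuming the ranges stated in this section and plain integer input (durations without the trailing s). The helper names check_range and check_autoscaling are hypothetical and are not yc commands.

```shell
#!/bin/sh
# check_range and check_autoscaling are hypothetical helpers.

# Succeeds if $2 is a non-negative integer within [$3, $4];
# otherwise prints a message naming parameter $1 and fails.
check_range() {
  case "$2" in
    ''|*[!0-9]*) echo "$1=$2 is not a non-negative integer" >&2; return 1 ;;
  esac
  if [ "$2" -lt "$3" ] || [ "$2" -gt "$4" ]; then
    echo "$1=$2 is outside $3..$4" >&2
    return 1
  fi
}

# Arguments, in order: hosts-count, max-hosts-count, warmup-duration,
# stabilization-duration, measurement-duration (all in seconds),
# cpu-utilization-target (%), autoscaling-decommission-timeout (seconds).
check_autoscaling() {
  check_range hosts-count "$1" 1 32 &&
  check_range max-hosts-count "$2" 1 100 &&
  check_range warmup-duration "$3" 0 600 &&
  check_range stabilization-duration "$4" 60 1800 &&
  check_range measurement-duration "$5" 60 600 &&
  check_range cpu-utilization-target "$6" 10 100 &&
  check_range autoscaling-decommission-timeout "$7" 0 86400
}

check_autoscaling 1 10 60 120 60 75 120 && echo "autoscaling settings are within the documented limits"
```

The check stops at the first value that is out of range and reports it on standard error, so a wrapper script can abort before any cluster is created.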
Terraform
For more information about the provider resources, see the Terraform provider documentation.
If you change the configuration files, Terraform automatically detects which part of your configuration is already deployed, and what should be added or removed.
To create a Yandex Data Proc cluster:
-
Using the command line, navigate to the folder that will contain the Terraform configuration files with an infrastructure plan. Create the directory if it does not exist.
-
If you don't have Terraform, install it and configure the Yandex Cloud provider.
-
Create a configuration file describing the cloud network and subnets.
The Yandex Data Proc cluster resides in a cloud network. If you already have a suitable network, you do not need to describe it again.
Yandex Data Proc cluster hosts reside in subnets of the selected cloud network. If you already have suitable subnets, you do not need to describe them again.
Example structure of a configuration file that describes a cloud network with a single subnet:
resource "yandex_vpc_network" "test_network" {
  name = "<network_name>"
}

resource "yandex_vpc_subnet" "test_subnet" {
  name           = "<subnet_name>"
  zone           = "<availability_zone>"
  network_id     = yandex_vpc_network.test_network.id
  v4_cidr_blocks = ["<subnet>"]
}
-
Create a configuration file describing the service account to access the Yandex Data Proc cluster, as well as the static key and the Object Storage bucket to store jobs and results.
resource "yandex_iam_service_account" "data_proc_sa" {
  name        = "<service_account_name>"
  description = "<service_account_description>"
}

resource "yandex_resourcemanager_folder_iam_member" "dataproc" {
  folder_id = "<folder_ID>"
  role      = "dataproc.agent"
  member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
}

resource "yandex_resourcemanager_folder_iam_member" "bucket-creator" {
  folder_id = "<folder_ID>"
  role      = "dataproc.editor"
  member    = "serviceAccount:${yandex_iam_service_account.data_proc_sa.id}"
}

resource "yandex_iam_service_account_static_access_key" "sa_static_key" {
  service_account_id = yandex_iam_service_account.data_proc_sa.id
}

resource "yandex_storage_bucket" "data_bucket" {
  depends_on = [
    yandex_resourcemanager_folder_iam_member.bucket-creator
  ]

  bucket     = "<bucket_name>"
  access_key = yandex_iam_service_account_static_access_key.sa_static_key.access_key
  secret_key = yandex_iam_service_account_static_access_key.sa_static_key.secret_key
}
-
Create a configuration file describing the Yandex Data Proc cluster and its subclusters.
If required, here you can also specify the properties of the Yandex Data Proc cluster components, jobs, and the environment.
Below is an example of a configuration file structure that describes a Yandex Data Proc cluster consisting of a subcluster with a master host, a data storage subcluster, and a data processing subcluster:
resource "yandex_dataproc_cluster" "data_cluster" {
  bucket              = "<bucket_name>"
  name                = "<cluster_name>"
  description         = "<cluster_description>"
  service_account_id  = yandex_iam_service_account.data_proc_sa.id
  zone_id             = "<availability_zone>"
  security_group_ids  = ["<list_of_security_group_IDs>"]
  deletion_protection = <cluster_deletion_protection>

  cluster_config {
    version_id = "<image_version>"

    hadoop {
      # Example of the list: ["HDFS", "YARN", "SPARK", "TEZ", "MAPREDUCE", "HIVE"].
      services = ["<list_of_components>"]
      properties = {
        "<component_property>" = <value>
        ...
      }
      # file() reads the public key from the given path.
      ssh_public_keys = [
        file("<path_to_public_SSH_key>")
      ]
    }

    subcluster_spec {
      name = "<name_of_subcluster_with_master_host>"
      role = "MASTERNODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = 1
    }

    subcluster_spec {
      name = "<name_of_data_storage_subcluster>"
      role = "DATANODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = <number_of_subcluster_hosts>
    }

    subcluster_spec {
      name = "<name_of_data_processing_subcluster>"
      role = "COMPUTENODE"
      resources {
        resource_preset_id = "<host_class>"
        disk_type_id       = "<storage_type>"
        disk_size          = <storage_size_in_GB>
      }
      subnet_id   = yandex_vpc_subnet.test_subnet.id
      hosts_count = <number_of_subcluster_hosts>
    }
  }
}
Where deletion_protection is the deletion protection of the Yandex Data Proc cluster. Possible values: true or false.
Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
Using an image of version 2.0.39 or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
Tip
To use the most recent image version, set the version_id parameter to 2.0.
To access web interfaces of Yandex Data Proc components, add the ui_proxy field set to true to the Yandex Data Proc cluster description:
resource "yandex_dataproc_cluster" "data_cluster" {
  ...
  ui_proxy = true
  ...
}
To configure autoscaling parameters in Yandex Data Proc subclusters for data processing, add the autoscaling_config section with the required settings to the subcluster_spec description of the relevant subcluster:
subcluster_spec {
  name = "<subcluster_name>"
  role = "COMPUTENODE"
  ...
  autoscaling_config {
    max_hosts_count        = <maximum_number_of_VMs_in_group>
    measurement_duration   = <utilization_measurement_interval>
    warmup_duration        = <warmup_time>
    stabilization_duration = <stabilization_period>
    preemptible            = <use_preemptible_VMs>
    cpu_utilization_target = <target_vCPU_utilization_level>
    decommission_timeout   = <decommissioning_timeout>
  }
}
Where:
- max_hosts_count: Maximum number of hosts (VMs) in the Yandex Data Proc subcluster. The minimum value is 1 and the maximum value is 100.
- measurement_duration: Period, in seconds, for which utilization measurements are averaged for each instance, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 600s (10 minutes).
- warmup_duration: Time required to warm up a VM instance, in <value>s format. The minimum value is 0s and the maximum value is 600s (10 minutes).
- stabilization_duration: Period, in seconds, during which the required number of instances cannot be decreased, in <value>s format. The minimum value is 60s (1 minute) and the maximum value is 1800s (30 minutes).
- preemptible: Indicates if preemptible VMs are used. Possible values: true or false.
- cpu_utilization_target: Target CPU utilization level, in %. Use this setting to enable scaling based on CPU utilization. Otherwise, yarn.cluster.containersPending will be used as a metric (based on the number of pending resources). The minimum value is 10 and the maximum value is 100.
- decommission_timeout: Decommissioning timeout in seconds. The minimum value is 0 and the maximum value is 86400 (24 hours).
For more information about resources you can create using Terraform, see the provider documentation.
-
Check that the Terraform configuration files are correct:
-
Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
-
Run the command:
terraform validate
If there are errors in the configuration files, Terraform will point to them.
-
-
Create a Yandex Data Proc cluster:
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
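The validate, plan, and apply steps above can be chained so that a later step runs only if the earlier one succeeds. The sketch below is one way to do that, not a prescribed workflow. The function name deploy_cluster and the TF_BIN variable are assumptions introduced for illustration: TF_BIN defaults to terraform, and setting it to echo prints the subcommands instead of running them.

```shell
#!/bin/sh
# deploy_cluster is a hypothetical wrapper around the three Terraform steps.
# TF_BIN is an assumed override hook: TF_BIN=echo turns this into a dry run.

deploy_cluster() {
  # $1: directory with the Terraform configuration files
  (
    cd "$1" || exit 1
    tf="${TF_BIN:-terraform}"
    # Each step runs only if the previous one succeeded.
    "$tf" validate &&
    "$tf" plan &&
    "$tf" apply
  )
}

# Usage (asks for confirmation at the apply step, as in the manual procedure):
#   deploy_cluster /path/to/configuration
# Dry run that only prints the subcommands:
#   TF_BIN=echo deploy_cluster /path/to/configuration
```

The subshell keeps the directory change local to the function, so the caller's working directory is not affected.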
To create a Yandex Data Proc cluster, use the create API method and include the following in the request:
-
ID of the folder where the Yandex Data Proc cluster must reside, in the folderId parameter.
-
Yandex Data Proc cluster name in the name parameter.
-
Yandex Data Proc cluster configuration in the configSpec parameter, including:
-
Image version in the configSpec.versionId parameter.
Using an image of version 2.0.39 or higher, you can create a lightweight cluster without HDFS and subclusters for data storage. In this case, be sure to add one or more data processing subclusters and specify a bucket name.
Tip
To use the most recent image version, specify 2.0.
-
Component list in the configSpec.hadoop.services parameter.
-
Public part of the SSH key in the configSpec.hadoop.sshPublicKeys parameter.
-
Settings of the Yandex Data Proc subclusters in the configSpec.subclustersSpec parameter.
-
Availability zone of the Yandex Data Proc cluster in the zoneId parameter.
-
ID of the Yandex Data Proc cluster's service account in the serviceAccountId parameter.
-
Bucket name in the bucket parameter.
-
IDs of the Yandex Data Proc cluster's security groups in the securityGroupIds parameter.
-
Yandex Data Proc cluster deletion protection settings in the deletionProtection parameter.
Cluster deletion protection will not prevent a manual connection to a cluster to delete data.
To assign a public IP address to all hosts of a Yandex Data Proc subcluster, provide the true value in the configSpec.subclustersSpec.assignPublicIp parameter.
To create a Yandex Data Proc cluster residing on groups of dedicated hosts, provide the list of the host group IDs in the hostGroupIds parameter.
Alert
You cannot edit this setting after you create a cluster. The use of dedicated hosts significantly affects cluster pricing.
To configure Yandex Data Proc cluster hosts using initialization scripts, specify them in one or multiple configSpec.hadoop.initializationActions parameters.
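To make the shape of the request easier to picture, the sketch below prints a JSON body assembled from the parameters named above. It sends nothing and is not a complete request. All values are placeholders, the helper name build_create_request is hypothetical, and the name and role fields inside subclustersSpec are assumptions modelled on the Terraform example. Host resources, subnets, host counts, and security groups are left out, so check the API reference for the full set of fields.

```shell
#!/bin/sh
# build_create_request is a hypothetical helper that prints a JSON body
# for the create method. Nothing is sent to the API.
# Assumption: the "name" and "role" keys inside subclustersSpec mirror the
# Terraform subcluster_spec block; other subcluster fields are omitted.

build_create_request() {
  # $1: folder ID, $2: cluster name
  cat <<EOF
{
  "folderId": "$1",
  "name": "$2",
  "configSpec": {
    "versionId": "2.0",
    "hadoop": {
      "services": ["SPARK", "YARN"],
      "sshPublicKeys": ["<public_SSH_key>"]
    },
    "subclustersSpec": [
      {"name": "master", "role": "MASTERNODE", "assignPublicIp": true},
      {"name": "compute", "role": "COMPUTENODE", "assignPublicIp": true}
    ]
  },
  "zoneId": "<availability_zone>",
  "serviceAccountId": "<service_account_ID>",
  "bucket": "<bucket_name>",
  "deletionProtection": true
}
EOF
}

build_create_request "<folder_ID>" my-dataproc
```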
After the Yandex Data Proc cluster's status changes to Running, you can connect to the Yandex Data Proc subcluster hosts using the specified SSH key.
Create a Yandex Data Proc cluster copy
You can create a Yandex Data Proc cluster with the settings of another cluster created earlier. To do so, you need to import the configuration of the source Yandex Data Proc cluster to Terraform. Thus you can either create an identical copy or use the imported configuration as the baseline and modify it as needed. Importing is a convenient option when the source Yandex Data Proc cluster has lots of settings (e.g., it is an HDFS cluster) and you need to create a similar one.
To create a Yandex Data Proc cluster copy:
-
If you do not have Terraform yet, install it.
-
Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.
-
Configure and initialize a provider. You do not need to create a provider configuration file manually: you can download it.
-
Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
-
In the same working directory, place a file with a .tf extension and the following contents:
resource "yandex_dataproc_cluster" "old" { }
-
Write the ID of the initial Yandex Data Proc cluster to the environment variable:
export DATAPROC_CLUSTER_ID=<cluster_ID>
-
Import the settings of the initial Yandex Data Proc cluster into the Terraform configuration:
terraform import yandex_dataproc_cluster.old ${DATAPROC_CLUSTER_ID}
-
Get the imported configuration:
terraform show
-
Copy it from the terminal and paste it into the file with the .tf extension.
-
Place the file in the new imported-cluster directory.
-
Edit the copied configuration so that you can create a new Yandex Data Proc cluster from it:
-
Specify the name of the new Yandex Data Proc cluster in the resource string and the name parameter.
-
Delete the created_at, host_group_ids, id, and subcluster_spec.id parameters.
-
Change the SSH key format in the ssh_public_keys parameter. Source format:
ssh_public_keys = [
  <<-EOT
  <key>
  EOT,
]
Required format:
ssh_public_keys = [
  "<key>"
]
-
(Optional) Make further modifications if you need a customized copy rather than an identical one.
-
In the imported-cluster directory, get the authentication data.
-
In the same directory, configure and initialize a provider. You do not need to create a provider configuration file manually: you can download it.
-
Place the configuration file in the imported-cluster directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
-
Make sure the Terraform configuration files are correct using this command:
terraform validate
If there are any errors in the configuration files, Terraform will point them out.
-
Create the required infrastructure:
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
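Part of the cleanup of the imported configuration can be scripted. The sketch below is a minimal illustration that drops the created_at, host_group_ids, and id attributes, including id inside subcluster_spec, from text in the terraform show format. The helper name strip_readonly_attrs and the sample fragment are hypothetical. It assumes each attribute sits on a single line, so a multi-line host_group_ids list and the SSH key format still need manual editing.

```shell
#!/bin/sh
# strip_readonly_attrs is a hypothetical helper: it reads configuration text
# on standard input and removes lines that assign created_at, host_group_ids,
# or id. Attributes such as zone_id or subnet_id are kept because the pattern
# matches the whole attribute name at the start of the line.

strip_readonly_attrs() {
  grep -Ev '^[[:space:]]*(created_at|host_group_ids|id)[[:space:]]*='
}

# Example with a made-up fragment of an imported configuration:
strip_readonly_attrs <<'EOF'
resource "yandex_dataproc_cluster" "old" {
    created_at = "<creation_timestamp>"
    id         = "<cluster_ID>"
    name       = "my-dataproc"
    zone_id    = "ru-central1-c"
}
EOF
```

In practice the input would come from a saved copy of the terraform show output, for example strip_readonly_attrs < imported.tf > cleaned.tf, where both file names are placeholders.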
Example
Creating a lightweight Yandex Data Proc cluster for Spark and PySpark jobs
Create a Yandex Data Proc cluster for running Spark jobs, without HDFS and data storage subclusters, using the following test settings:
-
Cluster name:
my-dataproc
-
Bucket name:
dataproc-bucket
-
Availability zone:
ru-central1-c
-
Service account:
dataproc-sa
-
Image version:
2.0
-
Components: SPARK and YARN
-
Path to the public part of the SSH key:
/home/username/.ssh/id_rsa.pub
-
A master subcluster for the master host and a single compute subcluster for data processing, both with the following settings:
- Class: s2.micro
- Network SSD storage (network-ssd): 20 GB
- Subnet: default-ru-central1-c
- Public access: Allowed
-
Security group:
enp6saqnq4ie244g67sb
-
Protection against accidental Yandex Data Proc cluster deletion: Enabled
Run the following command:
yc dataproc cluster create my-dataproc \
--bucket=dataproc-bucket \
--zone=ru-central1-c \
--service-account-name=dataproc-sa \
--version=2.0 \
--services=SPARK,YARN \
--ssh-public-keys-file=/home/username/.ssh/id_rsa.pub \
--subcluster name="master",`
`role=masternode,`
`resource-preset=s2.micro,`
`disk-type=network-ssd,`
`disk-size=20,`
`subnet-name=default-ru-central1-c,`
`assign-public-ip=true \
--subcluster name="compute",`
`role=computenode,`
`resource-preset=s2.micro,`
`disk-type=network-ssd,`
`disk-size=20,`
`subnet-name=default-ru-central1-c,`
`assign-public-ip=true \
--security-group-ids=enp6saqnq4ie244g67sb \
--deletion-protection=true