Yandex Data Processing templates
A Yandex Data Processing template is a special resource for rapid deployment of Yandex Data Processing clusters in DataSphere projects. Templates define cluster configuration and can be used by DataSphere to deploy the cluster multiple times.
To use Yandex Data Processing clusters, set the following project parameters:
- Default folder to enable integration with other Yandex Cloud services. A Yandex Data Processing cluster will be deployed in this folder based on the current cloud quotas. A fee for using the cluster will be debited from your cloud billing account.
- Service account to be used by DataSphere for creating and managing clusters. The service account needs the following roles:
  - `dataproc.agent` to use Yandex Data Processing clusters.
  - `dataproc.admin` to create clusters from Yandex Data Processing templates.
  - `vpc.user` to use the Yandex Data Processing cluster network.
  - `iam.serviceAccounts.user` to create resources in the folder on behalf of the service account.
- Subnet for DataSphere to communicate with the Yandex Data Processing cluster. Since the Yandex Data Processing cluster needs to access the internet, make sure to configure a NAT gateway in the subnet.
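The role requirements listed above can be captured as a small check. This is a minimal sketch and not part of any Yandex Cloud SDK: the `missing_roles` helper is hypothetical, and how the granted roles are obtained (console, CLI, or API) is left out. Only the four role names come from the list above.

```python
# Roles the DataSphere service account needs, as listed above.
REQUIRED_ROLES = frozenset({
    "dataproc.agent",
    "dataproc.admin",
    "vpc.user",
    "iam.serviceAccounts.user",
})


def missing_roles(granted_roles):
    """Return the required roles absent from granted_roles, sorted by name.

    Hypothetical helper: fetching the roles actually granted to the
    service account is outside the scope of this sketch.
    """
    return sorted(REQUIRED_ROLES - set(granted_roles))


if __name__ == "__main__":
    # Example input: a service account that has only two of the four roles.
    print(missing_roles(["dataproc.agent", "vpc.user"]))
```

An empty result means the service account holds every role the project settings call for.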
Note
If you specified a subnet in the project settings, allocating computing resources may take longer.
Warning
The Yandex Data Processing persistent cluster must have the `livy:livy.spark.deploy-mode : client` setting.
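As an illustration of the warning above, the sketch below checks a cluster's properties for the required Livy setting. It assumes the properties are available as a plain Python dict keyed by property name. The `has_client_deploy_mode` helper is hypothetical, and retrieving the properties from a real cluster is not shown.

```python
# Property key and value taken from the warning above.
LIVY_DEPLOY_MODE_KEY = "livy:livy.spark.deploy-mode"
LIVY_DEPLOY_MODE_VALUE = "client"


def has_client_deploy_mode(properties):
    """Return True if the properties mapping contains the required Livy setting.

    `properties` is assumed to be a dict of property names to string values.
    """
    return properties.get(LIVY_DEPLOY_MODE_KEY) == LIVY_DEPLOY_MODE_VALUE
```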
Information about Yandex Data Processing templates as a resource
The following information is stored about each template:
- Resource name.
- Resource creator.
- Cluster configuration.
- Template creation date in UTC format, such as `July 18, 2022, 14:23`.
You can view all Yandex Data Processing templates created in your project on the Yandex Data Processing resource page. It also provides a list of all Yandex Data Processing clusters available in the project. It contains both temporary clusters based on Yandex Data Processing templates and connected clusters deployed in Yandex Data Processing. To view detailed information about a template or cluster, click it.
Specifics of a temporary cluster based on a Yandex Data Processing template
To create a cluster from a Yandex Data Processing template, activate the template in your project. When running a project in the IDE, DataSphere creates a temporary cluster in the Yandex Cloud folder and subnet specified in the project settings.
DataSphere tracks the cluster's lifetime and automatically deletes it if no computations have been performed on it within two hours. The cluster will also be deleted if you force stop the computations running in the project.
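The deletion rule described above can be sketched as a simple decision function. This is only an illustration of the stated behavior, not DataSphere's actual implementation. The function name, its parameters, and the treatment of the exact two-hour boundary are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Idle period after which a temporary cluster is deleted, per the text above.
IDLE_TIMEOUT = timedelta(hours=2)


def should_delete_cluster(last_computation_at, now, force_stopped=False):
    """Illustrative rule: delete on force stop or after the idle timeout.

    Both timestamps must be timezone-aware. Treating exactly two hours of
    inactivity as expired is an assumption made for this sketch.
    """
    if force_stopped:
        return True
    return now - last_computation_at >= IDLE_TIMEOUT


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A cluster idle for three hours is past the two-hour timeout.
    print(should_delete_cluster(now - timedelta(hours=3), now))
```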
Configurations of temporary clusters
Automated Yandex Data Processing clusters are deployed on Yandex Compute Cloud VMs powered by Intel Cascade Lake (`standard-v2`).
You can calculate the total disk storage capacity (in GB) required for different cluster configurations using this formula:
<number_of_Yandex_Data_Processing_hosts> × 256 + 128
Cluster type | Number of hosts | Disk size | Host parameters |
---|---|---|---|
XS | 1 | 384 GB HDD | 4 vCPUs, 16 GB RAM |
S | 4 | 1152 GB SSD | 4 vCPUs, 16 GB RAM |
M | 8 | 2176 GB SSD | 16 vCPUs, 64 GB RAM |
L | 16 | 4224 GB SSD | 16 vCPUs, 64 GB RAM |
XL | 32 | 8320 GB SSD | 16 vCPUs, 64 GB RAM |
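The disk size column follows directly from the formula above. As a quick check, here is a minimal sketch that applies it to the host counts from the table. The `total_disk_gb` function and the `HOSTS_BY_CLUSTER_TYPE` mapping are illustrative names, not part of any DataSphere API.

```python
# Host counts per cluster type, taken from the table above.
HOSTS_BY_CLUSTER_TYPE = {"XS": 1, "S": 4, "M": 8, "L": 16, "XL": 32}


def total_disk_gb(host_count):
    """Total disk capacity in GB: <number_of_hosts> x 256 + 128."""
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    return host_count * 256 + 128


if __name__ == "__main__":
    for cluster_type, hosts in HOSTS_BY_CLUSTER_TYPE.items():
        print(f"{cluster_type}: {total_disk_gb(hosts)} GB")
```

For example, an M cluster with 8 hosts needs 8 × 256 + 128 = 2176 GB, which matches the table.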
Tip
Before running a project with an activated Yandex Data Processing template, make sure the quotas
You will be charged extra for using temporary clusters created based on Yandex Data Processing templates according to the Yandex Data Processing pricing policy.
Statuses of temporary Yandex Data Processing clusters
DataSphere creates a temporary Yandex Data Processing cluster once you open your project in the IDE.
The created cluster will appear in the list of available clusters on the Yandex Data Processing resource page. A temporary cluster can have one of the following statuses:
- `STARTING`: The cluster is being created.
- `UP`: The cluster has been created and is ready to run calculations.
- `DOWN`: There have been issues while creating the cluster.
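If you handle these statuses in your own scripts, one option is to model them as an enumeration. The sketch below is an assumption about how client code might be organized. The `ClusterStatus` class and `is_ready` helper are hypothetical names, and only the three status values come from the list above.

```python
from enum import Enum


class ClusterStatus(Enum):
    """Statuses of a temporary cluster, as listed above."""

    STARTING = "STARTING"  # The cluster is being created.
    UP = "UP"              # The cluster is ready to run calculations.
    DOWN = "DOWN"          # There were issues while creating the cluster.


def is_ready(status):
    """Return True only for the UP status.

    Accepts a status string or a ClusterStatus member; an unknown string
    raises ValueError.
    """
    return ClusterStatus(status) is ClusterStatus.UP
```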