Spark connector
DataSphere allows processing large amounts of data on Yandex Data Proc clusters. With a Spark connector, you can either use existing Yandex Data Proc clusters or create temporary clusters.
A Spark connector is a special resource that stores the connection and interaction settings for existing and temporary Yandex Data Proc clusters. The selected cluster is automatically connected or created when you start computations in the IDE. When creating the resource, you can also specify credentials for connecting to S3 object storage.
Information about a Spark connector as a resource
The following information is stored for each Spark connector:
- Unique resource ID.
- Resource creator.
- Creation and last update date in UTC format, such as April 22, 2024, 13:21.
- Yandex Data Proc cluster configuration.
- Settings for connecting to S3.
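If the connector's S3 settings are filled in, Spark jobs typically reach the bucket through the standard Hadoop `s3a` client. The fragment below is an illustrative sketch of the equivalent properties, assuming the stock S3A property names and Yandex Object Storage's `storage.yandexcloud.net` endpoint; the key values are placeholders that would come from the connector's S3 settings:

```
# spark-defaults-style fragment (illustrative; values come from the connector's S3 settings)
spark.hadoop.fs.s3a.endpoint    storage.yandexcloud.net
spark.hadoop.fs.s3a.access.key  <static-key-id>
spark.hadoop.fs.s3a.secret.key  <static-key-secret>
```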
Working with a Spark connector
You can create a Spark connector in the DataSphere interface.
Spark connectors are used in project notebooks. When you first run computations, you select the VM configuration the notebook code will run on. This VM resides on the network specified in the Spark connector, so it has network access to the Yandex Data Proc cluster but does not belong to it. The VM environment for working with the cluster differs from the standard DataSphere environment and provides access to the Yandex Data Proc cluster environment. You can also use sessions to work with the cluster.
Once created, the Spark connector becomes available in the project. Like any other resource, a Spark connector can be published in a community for use in other projects. To do this, you need at least the Editor role in the project and the Developer role in the community where you want to publish it. You can open access on the Access tab of the Spark connector view page. The resource shared with the community will appear on the community page under Community resources.
If you chose a temporary Yandex Data Proc cluster when creating the Spark connector, DataSphere will create a Yandex Data Proc cluster the first time you run computations in your notebook and will monitor it on its own. The cluster will be deleted if there are no computations on it for the period of time specified in the Stop inactive VM after parameter, or if you force shut down the notebook VM.
Configurations of temporary clusters
Temporary Yandex Data Proc clusters are deployed on Yandex Compute Cloud VMs powered by Intel Cascade Lake (standard-v2).
You can calculate the total disk storage capacity required for different cluster configurations using this formula:
<number_of_Data_Proc_hosts> × 256 + 128
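As a quick check, the formula reproduces the disk sizes listed in the configuration table; a small Python sketch:

```python
def total_disk_gb(hosts: int) -> int:
    """Total disk capacity (GB) of a temporary cluster: <hosts> * 256 + 128."""
    return hosts * 256 + 128

# Host counts per cluster type, matching the configuration table:
for name, hosts in [("XS", 1), ("S", 4), ("M", 8), ("L", 16), ("XL", 32)]:
    print(name, total_disk_gb(hosts))  # XS 384, S 1152, M 2176, L 4224, XL 8320
```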
| Cluster type | Number of hosts | Disk size | Host parameters |
|---|---|---|---|
| XS | 1 | 384 GB HDD | 4 vCPUs, 16 GB RAM |
| S | 4 | 1152 GB SSD | 4 vCPUs, 16 GB RAM |
| M | 8 | 2176 GB SSD | 16 vCPUs, 64 GB RAM |
| L | 16 | 4224 GB SSD | 16 vCPUs, 64 GB RAM |
| XL | 32 | 8320 GB SSD | 16 vCPUs, 64 GB RAM |
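When scripting around these presets, the table can be expressed as a small lookup. The helper below is illustrative only (the preset names and numbers come from the table; the selection function is not part of any DataSphere API): it picks the smallest preset whose host count and per-host resources meet a request.

```python
# Temporary cluster presets from the table above, ordered from smallest to largest:
# (name, hosts, vCPUs per host, RAM in GB per host)
PRESETS = [
    ("XS", 1, 4, 16),
    ("S", 4, 4, 16),
    ("M", 8, 16, 64),
    ("L", 16, 16, 64),
    ("XL", 32, 16, 64),
]

def smallest_preset(min_hosts: int, min_vcpus: int, min_ram_gb: int) -> str:
    """Return the smallest preset satisfying the requested host count and per-host resources."""
    for name, hosts, vcpus, ram in PRESETS:
        if hosts >= min_hosts and vcpus >= min_vcpus and ram >= min_ram_gb:
            return name
    raise ValueError("no preset satisfies the request")

print(smallest_preset(4, 16, 64))  # "M": first preset with >= 4 hosts of 16 vCPUs / 64 GB RAM
```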
Tip
Before running a project with a Spark connector that creates a temporary Yandex Data Proc cluster, make sure your cloud's quotas are sufficient to create the cluster's resources.
Temporary clusters created from Yandex Data Proc templates are charged additionally, according to the Yandex Data Proc pricing policy.