Spark connector
DataSphere allows processing large amounts of data on Yandex Data Processing clusters. With a Spark connector, you can either use existing Yandex Data Processing clusters or create temporary clusters.
A Spark connector is a special resource that stores connection and interaction settings for existing and temporary Yandex Data Processing clusters. The selected clusters are connected or created automatically when you start computations in the IDE. When creating the resource, you can also specify connection settings for S3 object storage.
Information about a Spark connector as a resource
The following information is stored for each Spark connector:
- Unique resource ID.
- Resource creator.
- Creation and last update date in UTC format, such as April 22, 2024, 13:21.
- Yandex Data Processing cluster configuration.
- Settings for connecting to S3.
Working with a Spark connector
You can create a Spark connector in the DataSphere interface.
Spark connectors are used in project notebooks. When you first run computations, you select the VM configuration the notebook code will run on. This VM resides on the network specified in the Spark connector, so it has network access to the Yandex Data Processing cluster but does not belong to it. By default, notebook cell code is executed on this VM. To execute code on the Yandex Data Processing cluster, you must request this explicitly when making a call (e.g., via SparkContext::runJob).
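For illustration, here is a minimal PySpark sketch of sending work to the cluster explicitly; it assumes a SparkContext named `sc` is already available through the Spark connector, and the data and per-partition function are purely illustrative.

```python
# Minimal sketch, assuming a SparkContext `sc` is provided by the Spark connector;
# the data and the per-partition function are illustrative, not a DataSphere API.
rdd = sc.parallelize(range(100), 4)  # distribute the data across 4 partitions

# runJob sends the function to the cluster executors explicitly; plain Python
# in a notebook cell would otherwise run on the notebook VM.
partition_sums = sc.runJob(rdd, lambda partition: [sum(partition)])
print(partition_sums)  # e.g., [300, 925, 1550, 2175]
```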
The VM environment for working with the cluster differs from the standard DataSphere environment and provides access to the Yandex Data Processing cluster environment. You can also use sessions to work with the cluster.
Once created, the Spark connector becomes available to the project. Like any other resource, you can publish the Spark connector in a community to use it in other projects. To do this, you need at least the Editor role in the project and the Developer role in the community where you want to publish it. You can grant access on the Access tab of the Spark connector view page. A resource shared with the community appears on the community page under Community resources.
If you selected a temporary Yandex Data Processing cluster when creating the Spark connector, DataSphere creates the cluster the first time you run computations in your notebook and manages it automatically. The cluster starts and stops together with the notebook VM. It is deleted if no computations run on it for the period specified in the Stop inactive VM after parameter, or if you force shut down the notebook VM.
Configurations of temporary clusters
Temporary Yandex Data Processing clusters are deployed on Yandex Compute Cloud VMs powered by Intel Cascade Lake (standard-v2).
You can calculate the total disk storage capacity required for different cluster configurations using this formula:
<number_of_Yandex_Data_Processing_hosts> × 256 + 128
| Cluster type | Number of hosts | Disk size | Host parameters |
|---|---|---|---|
| XS | 1 | 384 GB HDD | 4 vCPUs, 16 GB RAM |
| S | 4 | 1152 GB SSD | 4 vCPUs, 16 GB RAM |
| M | 8 | 2176 GB SSD | 16 vCPUs, 64 GB RAM |
| L | 16 | 4224 GB SSD | 16 vCPUs, 64 GB RAM |
| XL | 32 | 8320 GB SSD | 16 vCPUs, 64 GB RAM |
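As a quick check of the sizing formula against the table above, here is a small Python sketch; the function name and the size-to-hosts mapping are illustrative only.

```python
# Minimal sketch of the disk sizing formula above; the function name and the
# size-to-hosts mapping are illustrative, not part of any API.
def temp_cluster_disk_gb(num_hosts: int) -> int:
    return num_hosts * 256 + 128

for size, hosts in {"XS": 1, "S": 4, "M": 8, "L": 16, "XL": 32}.items():
    print(f"{size}: {temp_cluster_disk_gb(hosts)} GB")
# XS: 384 GB, S: 1152 GB, M: 2176 GB, L: 4224 GB, XL: 8320 GB
```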
Tip
Before running a project with a Spark connector that creates a temporary Yandex Data Processing cluster, make sure the quotas in your cloud are sufficient to create a cluster of the selected configuration.
Temporary clusters created from Yandex Data Processing templates are charged additionally according to the Yandex Data Processing pricing policy.