
Spark connector

Written by
Yandex Cloud
Updated at March 10, 2025
  • Information about a Spark connector as a resource
  • Working with a Spark connector
    • Configurations of temporary clusters

DataSphere allows processing large amounts of data on Yandex Data Processing clusters. With a Spark connector, you can either use existing Yandex Data Processing clusters or create temporary clusters.

A Spark connector is a special resource that stores connection and interaction settings for existing and temporary Yandex Data Processing clusters. The selected clusters are automatically connected or created when you start computing in the IDE. When creating a resource, you can also specify data for connection to the S3 object storage.

Information about a Spark connector as a resource

The following information is stored for each Spark connector:

  • Unique resource ID.
  • Resource creator.
  • Creation and last update dates in UTC, for example: April 22, 2024, 13:21.
  • Yandex Data Processing cluster configuration.
  • Settings for connecting to S3.

Working with a Spark connector

You can create a Spark connector in the DataSphere interface. When creating one, you choose the connection type for an existing Yandex Data Processing cluster: SparkContext or Spark Connect (available only for Yandex Data Processing clusters version 2.2 or higher). Temporary clusters always use the SparkContext connection.

Spark connectors are used in project notebooks. The first time you run computations, you select the VM configuration the notebook code will run on. This VM resides on the network specified in the Spark connector, so it has network access to the Yandex Data Processing cluster but does not belong to it. By default, notebook cell code runs on the VM itself. To run code on the Yandex Data Processing cluster, you must request this explicitly in the call (e.g., via SparkContext::runJob).
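As an illustration, a minimal sketch of a notebook cell that dispatches work to the cluster. It assumes the Spark connector exposes a live SparkContext as `sc` (the usual PySpark convention); the RDD contents and the per-partition function here are hypothetical, and the code requires a running cluster:

```python
# Hypothetical notebook cell; assumes the Spark connector provides a
# live SparkContext as `sc` (requires a running Yandex Data Processing cluster).
rdd = sc.parallelize(range(100), numSlices=4)

# This line runs on the notebook VM (the Spark driver), not on the cluster:
local_total = sum(range(100))

# sc.runJob explicitly dispatches the per-partition function to the
# cluster executors; each call returns one partial sum per partition:
partial_sums = sc.runJob(rdd, lambda part: [sum(part)])
cluster_total = sum(partial_sums)  # same value, computed on the cluster
```

The point of the sketch is the distinction the paragraph draws: ordinary Python in a cell stays on the VM, while `sc.runJob` (or any Spark action on an RDD) pushes the work onto the cluster.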

The VM environment for working with the cluster differs from the DataSphere standard environment and allows accessing the Yandex Data Processing cluster environment. You can also use sessions to work with the cluster.

Once created, the Spark connector becomes available in the project. Like any other resource, a Spark connector can be published in a community for use in other projects. To do this, you need at least the Editor role in the project and the Developer role in the community where you want to publish it. You can grant access on the Access tab of the Spark connector view page. A resource shared with a community appears on the community page under Community resources.

If you chose a temporary Yandex Data Processing cluster when creating the Spark connector, DataSphere will create the cluster the first time you run computations in your notebook and manage it automatically. The cluster starts and stops together with the notebook VM. It is deleted if no computations run on it for the period specified in the Stop inactive VM after parameter, or if you force-stop the notebook VM.

You can also work with Spark connectors from the DataSphere CLI.

Configurations of temporary clusters

Temporary Yandex Data Processing clusters are deployed on Yandex Compute Cloud VMs powered by Intel Cascade Lake (standard-v2).

You can calculate the total disk storage capacity required for different cluster configurations using this formula:

<number_of_Yandex_Data_Processing_hosts> × 256 + 128
Cluster type | Number of hosts | Disk size   | Host parameters
XS           | 1               | 384 GB HDD  | 4 vCPUs, 16 GB RAM
S            | 4               | 1152 GB SSD | 4 vCPUs, 16 GB RAM
M            | 8               | 2176 GB SSD | 16 vCPUs, 64 GB RAM
L            | 16              | 4224 GB SSD | 16 vCPUs, 64 GB RAM
XL           | 32              | 8320 GB SSD | 16 vCPUs, 64 GB RAM
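The formula can be checked against the table above, e.g. with a small helper (the function name is made up for illustration):

```python
def temp_cluster_disk_gb(num_hosts: int) -> int:
    """Total disk capacity in GB for a temporary Yandex Data Processing
    cluster, per the formula: <number_of_hosts> * 256 + 128."""
    return num_hosts * 256 + 128

# Each value matches the Disk size column of the configuration table:
for hosts, expected_gb in [(1, 384), (4, 1152), (8, 2176), (16, 4224), (32, 8320)]:
    assert temp_cluster_disk_gb(hosts) == expected_gb
```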

Tip

Before running a project with the Spark connector to create a temporary Yandex Data Processing cluster, make sure the quotas for creating HDDs or SSDs allow you to create a disk of a sufficient size.

Temporary clusters created based on Yandex Data Processing templates are billed separately, according to the Yandex Data Processing pricing policy.

See also

  • How to create, modify, and delete a Spark connector
  • Errors when using a Spark connector

© 2025 Direct Cursus Technology L.L.C.