

Using Yandex Managed Service for Apache Spark™ in DataSphere

Written by
Yandex Cloud
Updated at May 8, 2026
  • Getting started
    • Required paid resources
  • Set up your infrastructure
    • Create a folder
    • Create a network
    • Create and set up a NAT gateway for internet access
    • Create a service account for the cluster
  • Configure DataSphere
    • Create a project
    • Edit the project settings
  • Create a Managed Service for Apache Spark™ cluster and a Spark connector
  • Run the computations
  • Delete the resources you created

In DataSphere, you can use Apache Spark™ clusters deployed in Yandex Managed Service for Apache Spark™. Clusters are connected through Spark connectors; to run computations, DataSphere initiates a SparkConnect job in the cluster.

To set up integration with Managed Service for Apache Spark™ in DataSphere:

  1. Set up your infrastructure.
  2. Set up the DataSphere project.
  3. Create a Managed Service for Apache Spark™ cluster and a Spark connector.
  4. Run your computations.

If you no longer need the resources you created, delete them.

Getting started

Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.

  1. On the DataSphere home page, click Try for free and select an account to log in with: Yandex ID or a corporate account via identity federation (SSO).
  2. Select the Yandex Identity Hub organization you are going to use in Yandex Cloud.
  3. Create a community.
  4. Link your billing account to the DataSphere community you are going to work in. Make sure you have a linked billing account and its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account yet, create one in the DataSphere interface.

Required paid resources

  • Managed Service for Apache Spark™ cluster: computing resources of cluster components (see Managed Service for Apache Spark™ pricing).
  • NAT gateway: hourly use of the gateway and its outgoing traffic (see Virtual Private Cloud pricing).
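As a rough illustration of how these cost components add up, here is a toy estimate. Every rate and usage figure below is a hypothetical placeholder, not an actual Yandex Cloud price; see the pricing pages linked above for real rates.

```python
# Toy monthly cost estimate for this tutorial's paid resources.
# All rates are hypothetical placeholders, NOT real Yandex Cloud prices.
HOURS_PER_MONTH = 720  # 30-day month

def monthly_cost(cluster_rate_per_hour: float,
                 gateway_rate_per_hour: float,
                 egress_gb: float,
                 egress_rate_per_gb: float) -> float:
    """Sum the cluster, NAT gateway, and egress-traffic components."""
    cluster = cluster_rate_per_hour * HOURS_PER_MONTH
    gateway = gateway_rate_per_hour * HOURS_PER_MONTH
    traffic = egress_gb * egress_rate_per_gb
    return cluster + gateway + traffic

print(monthly_cost(1.50, 0.02, 100, 0.01))  # hypothetical inputs
```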

Set up your infrastructure

Create a folder

Create a folder where your Apache Spark™ cluster will run.

Management console
  1. In the management console, select a cloud and click Create folder.
  2. Name your folder, e.g., data-folder.
  3. Disable Create a default network to create your network and subnets manually.
  4. Click Create.

Learn more about clouds and folders.

Create a network

Create a network the Apache Spark™ cluster will operate in.

Management console
  1. In the management console, navigate to the data-folder you created earlier.

  2. In the list of services, select Virtual Private Cloud.

  3. In the top-right corner, click Create network.

  4. In the Name field, enter a name for the network: data-network.

    This will automatically create three subnets in different availability zones.

  5. Click Create network.

Create and set up a NAT gateway for internet access

Management console
  1. In data-folder, select Virtual Private Cloud.
  2. In the left-hand panel, select Gateways.
  3. Click Create and set the gateway parameters:
    1. Name the gateway, e.g., nat-for-cluster.
    2. Select the gateway Type: Egress NAT.
    3. Click Save.
  4. In the left-hand panel, select Routing tables.
  5. Click Create and specify the route table parameters:
    1. Enter a name, e.g., route-table.
    2. Select data-network.
    3. Click Add.
    4. In the window that opens, select Gateway in the Next hop field.
    5. In the Gateway field, select the NAT gateway you created. The destination prefix will be applied automatically.
    6. Click Add.
    7. Click Create routing table.
  6. Associate the route table with a subnet to route traffic from it through the NAT gateway:
    1. In the left-hand panel, select Subnets.
    2. In the row with the subnet you need, click ⋮.
    3. In the menu that opens, select Link routing table.
    4. In the window that opens, select your route table from the list.
    5. Click Link.

Create a service account for the cluster

Management console
  1. Navigate to data-folder.

  2. In the list of services, select Identity and Access Management.

  3. Click Create service account.

  4. Name the service account, e.g., sa-for-spark.

  5. Click Add role and assign the following roles to the service account:

    • managed-spark.user to use Apache Spark™ clusters.
    • dataproc.agent to get job information.
    • dataproc.user to run jobs in Apache Spark™ clusters.
    • vpc.user to use the Apache Spark™ cluster network.
    • iam.serviceAccounts.user to create resources in the folder on behalf of the service account.
  6. Click Create.
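The role checklist above can be encoded as a small preflight sketch. The role IDs are the ones listed in step 5; the missing_roles helper itself is illustrative and not part of any Yandex Cloud SDK.

```python
# Roles this tutorial assigns to the cluster service account.
REQUIRED_ROLES = {
    "managed-spark.user",        # use Apache Spark clusters
    "dataproc.agent",            # get job information
    "dataproc.user",             # run jobs in clusters
    "vpc.user",                  # use the cluster network
    "iam.serviceAccounts.user",  # create resources on the account's behalf
}

def missing_roles(assigned):
    """Return the required roles not present in `assigned`, sorted."""
    return sorted(REQUIRED_ROLES - set(assigned))

print(missing_roles(["vpc.user", "dataproc.user"]))
# -> ['dataproc.agent', 'iam.serviceAccounts.user', 'managed-spark.user']
```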

Configure DataSphere

To work with Apache Spark™ clusters in DataSphere, create and set up a project.

Create a project

  1. Open the DataSphere home page.
  2. In the left-hand panel, select Communities.
  3. Select the community where you want to create a project.
  4. On the community page, click Create project.
  5. In the window that opens, enter a name for the project. You can also add a description as needed.
  6. Click Create.

Edit the project settings

  1. Navigate to the Settings tab.

  2. Under Advanced settings, click Edit.

  3. Specify the parameters:

    • Default folder: data-folder.

    • Service account: sa-for-spark.

    • Subnet: data-network-ru-central1-a.

      Note

      If you specify a subnet in the project settings, the VM preparation during the first computation run may take longer.

    • Security groups, if used in your organization.

  4. Click Save.

Create a Managed Service for Apache Spark™ cluster and a Spark connector

  1. Create a Managed Service for Apache Spark™ cluster with your preferred configuration and the following properties:

    • Service account: sa-for-spark.
    • Network: data-network.
    • Subnet: data-network-ru-central1-a.
  2. Go to your DataSphere workspace.

  3. In the left-hand panel, select Communities.

  4. Open the community you created the project in and select the project you need.

  5. Under Project resources, click Spark connector.

  6. Click Create connector.

  7. In the Name field, enter a name for your connector. Follow these naming requirements:

    • The name must be from 3 to 63 characters long.
    • It may contain uppercase and lowercase Latin and Cyrillic letters, numbers, hyphens, underscores, and spaces.
    • The first character must be a letter. The last character cannot be a hyphen, underscore, or space.
  8. Under Yandex Data Processing cluster:

    1. Click the Select cluster tile.
    2. Select the previously created Managed Service for Apache Spark™ cluster from the list.
  9. Optionally, to use an Object Storage bucket for computations, under S3 settings, specify the static access key ID and the secret storing the secret part of the static key.

  10. Under Spark settings, specify the SparkConnect job settings:

    • To use the standard Apache Spark™ cluster settings for computations, select Use default settings.
    • To manually add or update job settings, specify one or more Key and Value settings.
  11. Click Create. You will see a page with detailed info on the connector you created.
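The naming rules from step 7 can be expressed as a single regular expression. This validator is an illustrative sketch, not the service's actual check:

```python
import re

# 3-63 chars; Latin/Cyrillic letters, digits, hyphens, underscores, spaces;
# must start with a letter; must not end with a hyphen, underscore, or space.
NAME_RE = re.compile(
    r"^[A-Za-zА-Яа-яЁё]"             # first character: a letter
    r"[A-Za-zА-Яа-яЁё0-9 _-]{1,61}"  # middle: any allowed characters
    r"[A-Za-zА-Яа-яЁё0-9]$"          # last: not a hyphen/underscore/space
)

def is_valid_connector_name(name: str) -> bool:
    """Sketch of the connector naming rules listed above."""
    return NAME_RE.fullmatch(name) is not None

print(is_valid_connector_name("my-spark-connector"))  # True
print(is_valid_connector_name("1st-connector"))       # False: starts with a digit
print(is_valid_connector_name("bad-"))                # False: ends with a hyphen
```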

Run the computations

  1. Open the DataSphere project:

    1. Select the project in your community or on the DataSphere home page in the Recent projects tab.

    2. Click Open project in JupyterLab and wait for the loading to complete.
    3. Open the notebook tab.
  2. In the cell, insert the code to compute, e.g.:

    # The `spark` variable is the session provided by the Spark connector.
    df = spark.createDataFrame([(1, "Sarah"), (2, "Maria")]).toDF("id", "name")
    df.show()
    
  3. Select Run → Run Selected Cells from the menu or press Shift + Enter.

  4. In the Notebook VM configurations window that opens, go to the With Yandex Data Processing cluster tab.

  5. Select the required configuration and connector.

  6. Click Select.

    A local PySpark session will become available in the notebook via the spark variable. To run code in the notebook cells, the system will create and initiate a SparkConnect job in the Apache Spark™ cluster.

To terminate the job in the Apache Spark™ cluster, stop the notebook VM.

Delete the resources you created

Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:

  1. Managed Service for Apache Spark™ cluster.
  2. NAT gateway.

© 2026 Direct Cursus Technology L.L.C.