

Using Yandex Managed Service for Apache Spark™ in DataSphere

Written by
Yandex Cloud
Updated at May 8, 2026
  • Getting started
    • Required paid resources
  • Set up your infrastructure
    • Create a folder
    • Create a network
    • Create and set up a NAT gateway for internet access
    • Create a service account for the cluster
  • Configure DataSphere
    • Create a project
    • Edit the project settings
  • Create a Managed Service for Apache Spark™ cluster and a Spark connector
  • Run the computations
  • Delete the resources you created

In DataSphere, you can use Apache Spark™ clusters deployed in Yandex Managed Service for Apache Spark™. Clusters are connected through Spark connectors; to run computations, DataSphere initiates a SparkConnect job in the cluster.

To set up integration with Managed Service for Apache Spark™ in DataSphere:

  1. Set up your infrastructure.
  2. Set up the DataSphere project.
  3. Create a Managed Service for Apache Spark™ cluster and a Spark connector.
  4. Run your computations.

If you no longer need the resources you created, delete them.

Getting started

Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.

  1. On the DataSphere home page, click Try for free and select an account to log in with: Yandex ID or a corporate account via identity federation (SSO).
  2. Select the Yandex Identity Hub organization you are going to use in Yandex Cloud.
  3. Create a community.
  4. Link your billing account to the DataSphere community you are going to work in. Make sure you have a linked billing account and its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account yet, create one in the DataSphere interface.

Required paid resources

  • Managed Service for Apache Spark™ cluster: computing resources of cluster components (see Managed Service for Apache Spark™ pricing).
  • NAT gateway: hourly use of the gateway and its outgoing traffic (see Virtual Private Cloud pricing).
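As a rough illustration of how these cost components add up, here is a toy estimate. Every rate and usage figure below is a hypothetical placeholder, not an actual Yandex Cloud price; see the pricing pages linked above for real rates.

```python
# Toy monthly cost estimate for this tutorial's paid resources.
# All rates are hypothetical placeholders, NOT real Yandex Cloud prices.
HOURS_PER_MONTH = 720  # 30-day month

def monthly_cost(cluster_rate_per_hour: float,
                 gateway_rate_per_hour: float,
                 egress_gb: float,
                 egress_rate_per_gb: float) -> float:
    """Sum the cluster, NAT gateway, and egress-traffic components."""
    cluster = cluster_rate_per_hour * HOURS_PER_MONTH
    gateway = gateway_rate_per_hour * HOURS_PER_MONTH
    traffic = egress_gb * egress_rate_per_gb
    return cluster + gateway + traffic

print(monthly_cost(1.50, 0.02, 100, 0.01))  # hypothetical inputs
```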

Set up your infrastructure

Create a folder

Create a folder where your Apache Spark™ cluster will run.

Management console
  1. In the management console, select a cloud and click Create folder.
  2. Name your folder, e.g., data-folder.
  3. Disable Create a default network to create your network and subnets manually.
  4. Click Create.

Learn more about clouds and folders.

Create a network

Create a network the Apache Spark™ cluster will operate in.

Management console
  1. In the management console, navigate to the data-folder you created earlier.

  2. In the list of services, select Virtual Private Cloud.

  3. In the top-right corner, click Create network.

  4. In the Name field, enter a name for the network: data-network.

    This will automatically create three subnets in different availability zones.

  5. Click Create network.

Create and set up a NAT gateway for internet access

Management console
  1. In data-folder, select Virtual Private Cloud.
  2. In the left-hand panel, select Gateways.
  3. Click Create and set the gateway parameters:
    1. Name the gateway, e.g., nat-for-cluster.
    2. Select the gateway Type: Egress NAT.
    3. Click Save.
  4. In the left-hand panel, select Routing tables.
  5. Click Create and specify the route table parameters:
    1. Enter a name, e.g., route-table.
    2. Select data-network.
    3. Click Add.
    4. In the window that opens, select Gateway in the Next hop field.
    5. In the Gateway field, select the NAT gateway you created. The destination prefix will be applied automatically.
    6. Click Add.
    7. Click Create routing table.
  6. Associate the route table with a subnet to route traffic from it through the NAT gateway:
    1. In the left-hand panel, select Subnets.
    2. In the row with the subnet you need, click ⋮.
    3. In the menu that opens, select Link routing table.
    4. In the window that opens, select your route table from the list.
    5. Click Link.

Create a service account for the cluster

Management console
  1. Navigate to data-folder.

  2. In the list of services, select Identity and Access Management.

  3. Click Create service account.

  4. Name the service account, e.g., sa-for-spark.

  5. Click Add role and assign the following roles to the service account:

    • managed-spark.user to use Apache Spark™ clusters.
    • dataproc.agent to get job information.
    • dataproc.user to run jobs in Apache Spark™ clusters.
    • vpc.user to use the Apache Spark™ cluster network.
    • iam.serviceAccounts.user to create resources in the folder on behalf of the service account.
  6. Click Create.
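The role checklist above can be encoded as a small preflight sketch. The role IDs are the ones listed in step 5; the missing_roles helper itself is illustrative and not part of any Yandex Cloud SDK.

```python
# Roles this tutorial assigns to the cluster service account.
REQUIRED_ROLES = {
    "managed-spark.user",        # use Apache Spark clusters
    "dataproc.agent",            # get job information
    "dataproc.user",             # run jobs in clusters
    "vpc.user",                  # use the cluster network
    "iam.serviceAccounts.user",  # create resources on the account's behalf
}

def missing_roles(assigned):
    """Return the required roles not present in `assigned`, sorted."""
    return sorted(REQUIRED_ROLES - set(assigned))

print(missing_roles(["vpc.user", "dataproc.user"]))
# -> ['dataproc.agent', 'iam.serviceAccounts.user', 'managed-spark.user']
```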

Configure DataSphere

To work with Apache Spark™ clusters in DataSphere, create and set up a project.

Create a project

  1. Open the DataSphere home page.
  2. In the left-hand panel, select Communities.
  3. Select the community where you want to create a project.
  4. On the community page, click Create project.
  5. In the window that opens, enter a name for the project. You can also add a description as needed.
  6. Click Create.

Edit the project settings

  1. Navigate to the Settings tab.

  2. Under Advanced settings, click Edit.

  3. Specify the parameters:

    • Default folder: data-folder.

    • Service account: sa-for-spark.

    • Subnet: data-network-ru-central1-a.

      Note

      If you specify a subnet in the project settings, the VM preparation during the first computation run may take longer.

    • Security groups, if used in your organization.

  4. Click Save.

Create a Managed Service for Apache Spark™ cluster and a Spark connector

  1. Create a Managed Service for Apache Spark™ cluster with your preferred configuration and the following properties:

    • Service account: sa-for-spark.
    • Network: data-network.
    • Subnet: data-network-ru-central1-a.
  2. Go to your DataSphere workspace.

  3. In the left-hand panel, select Communities.

  4. Open the community you created the project in and select the project you need.

  5. Under Project resources, click Spark connector.

  6. Click Create connector.

  7. In the Name field, enter a name for your connector. Follow these naming requirements:

    • The name must be from 3 to 63 characters long.
    • It may contain uppercase and lowercase Latin and Cyrillic letters, numbers, hyphens, underscores, and spaces.
    • The first character must be a letter. The last character cannot be a hyphen, underscore, or space.
  8. Under Yandex Data Processing cluster:

    1. Click the Select cluster tile.
    2. Select the previously created Managed Service for Apache Spark™ cluster from the list.
  9. Optionally, to use an Object Storage bucket for computations, under S3 settings, specify the static access key ID and the secret storing the secret part of the static key.

  10. Under Spark settings, specify the SparkConnect job settings:

    • To use the standard Apache Spark™ cluster settings for computations, select Use default settings.
    • To manually add or update job settings, specify one or more Key and Value settings.
  11. Click Create. You will see a page with detailed info on the connector you created.
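The naming rules from step 7 can be expressed as a single regular expression. This validator is an illustrative sketch, not the service's actual check:

```python
import re

# 3-63 chars; Latin/Cyrillic letters, digits, hyphens, underscores, spaces;
# must start with a letter; must not end with a hyphen, underscore, or space.
NAME_RE = re.compile(
    r"^[A-Za-zА-Яа-яЁё]"             # first character: a letter
    r"[A-Za-zА-Яа-яЁё0-9 _-]{1,61}"  # middle: any allowed characters
    r"[A-Za-zА-Яа-яЁё0-9]$"          # last: not a hyphen/underscore/space
)

def is_valid_connector_name(name: str) -> bool:
    """Sketch of the connector naming rules listed above."""
    return NAME_RE.fullmatch(name) is not None

print(is_valid_connector_name("my-spark-connector"))  # True
print(is_valid_connector_name("1st-connector"))       # False: starts with a digit
print(is_valid_connector_name("bad-"))                # False: ends with a hyphen
```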

Run the computations

  1. Open the DataSphere project:

    1. Select the project in your community or on the DataSphere home page in the Recent projects tab.

    2. Click Open project in JupyterLab and wait for the loading to complete.
    3. Open the notebook tab.
  2. In the cell, insert the code to compute, e.g.:

    # The `spark` variable is the session provided by the Spark connector.
    df = spark.createDataFrame([(1, "Sarah"), (2, "Maria")]).toDF("id", "name")
    df.show()
    
  3. Select Run → Run Selected Cells from the menu or press Shift + Enter.

  4. In the Notebook VM configurations window that opens, go to the With Yandex Data Processing cluster tab.

  5. Select the required configuration and connector.

  6. Click Select.

    A local PySpark session will become available in the notebook via the spark variable. To run code in the notebook cells, the system will create and initiate a SparkConnect job in the Apache Spark™ cluster.

To terminate the job in the Apache Spark™ cluster, stop the notebook VM.

Delete the resources you created

Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:

  1. Managed Service for Apache Spark™ cluster.
  2. NAT gateway.

© 2026 Direct Cursus Technology L.L.C.