Using Yandex Managed Service for Apache Spark™ in Yandex DataSphere
In DataSphere, you can use Apache Spark™ clusters deployed in Yandex Managed Service for Apache Spark™. Apache Spark™ clusters are connected via Spark connectors. To run computations, a SparkConnect job is initiated in a cluster.
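For reference, a Spark Connect session is a regular PySpark session whose builder points at a remote endpoint. In DataSphere you do not create it yourself (the Spark connector provides a ready-made spark variable), but a minimal sketch of such a session might look like this; the host name below is a placeholder, and 15002 is the default Spark Connect port:

```python
# Minimal Spark Connect sketch (PySpark >= 3.4). In DataSphere the
# `spark` variable is created for you by the Spark connector, so this
# is for illustration only; the endpoint is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://example-cluster-host:15002").getOrCreate()
print(spark.version)  # code in this session runs on the remote cluster
```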
To set up integration with Managed Service for Apache Spark™ in DataSphere:
- Set up your infrastructure.
- Set up the DataSphere project.
- Create a Managed Service for Apache Spark™ cluster and a Spark connector.
- Run your computations.
If you no longer need the resources you created, delete them.
Getting started
Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.
- On the DataSphere home page, click Try for free and select an account to log in with: Yandex ID or your working account with identity federation (SSO).
- Select the Yandex Identity Hub organization you are going to use in Yandex Cloud.
- Create a community.
- Link your billing account to the DataSphere community you are going to work in. Make sure you have a linked billing account with the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account yet, create one in the DataSphere interface.
Required paid resources
- Managed Service for Apache Spark™ cluster: computing resources of cluster components (see Managed Service for Apache Spark™ pricing).
- NAT gateway: hourly use of the gateway and its outgoing traffic (see Virtual Private Cloud pricing).
Set up your infrastructure
Create a folder
Create a folder where your Apache Spark™ cluster will run.
- In the management console, select a cloud and click Create folder.
- Name your folder, e.g., data-folder.
- Disable Create a default network to create your network and subnets manually.
- Click Create.
Learn more about clouds and folders.
Create a network
Create a network the Apache Spark™ cluster will operate in.
- In the management console, navigate to data-folder, which you created earlier.
- In the list of services, select Virtual Private Cloud.
- In the top-right corner, click Create network.
- In the Name field, enter a name for the network: data-network. This will automatically create three subnets in different availability zones.
- Click Create network.
Create and set up a NAT gateway for internet access
- In data-folder, select Virtual Private Cloud.
- In the left-hand panel, select Gateways.
- Click Create and set the gateway parameters:
  - Name the gateway, e.g., nat-for-cluster.
  - Select the gateway Type: Egress NAT.
  - Click Save.
- In the left-hand panel, select Routing tables.
- Click Create and specify the route table parameters:
  - Enter a name, e.g., route-table.
  - Select data-network.
  - Click Add.
  - In the window that opens, select Gateway in the Next hop field.
  - In the Gateway field, select the NAT gateway you created. The destination prefix will apply automatically.
  - Click Add.
  - Click Create routing table.
- Associate the route table with a subnet to route traffic from that subnet through the NAT gateway:
  - In the left-hand panel, select Subnets.
  - In the row with the subnet you need, click the actions icon.
  - In the menu that opens, select Link routing table.
  - In the window that opens, select your route table from the list.
  - Click Link.
Create a service account for the cluster
- Navigate to data-folder.
- In the list of services, select Identity and Access Management.
- Click Create service account.
- Name the service account, e.g., sa-for-spark.
- Click Add role and assign the following roles to the service account:
  - managed-spark.user to use Apache Spark™ clusters.
  - dataproc.agent to get job information.
  - dataproc.user to run jobs in Apache Spark™ clusters.
  - vpc.user to use the Apache Spark™ cluster network.
  - iam.serviceAccounts.user to create resources in the folder on behalf of the service account.
- Click Create.
Configure DataSphere
To work with Apache Spark™ clusters in DataSphere, create and set up a project.
Create a project
- Open the DataSphere home page.
- In the left-hand panel, select Communities.
- Select the community where you want to create a project.
- On the community page, click Create project.
- In the window that opens, enter a name for the project. You can also add a description as needed.
- Click Create.
Edit the project settings
- Navigate to the Settings tab.
- Under Advanced settings, click Edit.
- Specify the parameters:
  - Default folder: data-folder.
  - Service account: sa-for-spark.
  - Subnet: data-network-ru-central1-a.

    Note: If you specify a subnet in the project settings, preparing the VM for the first computation run may take longer.

  - Security groups, if used in your organization.
- Click Save.
Create a Managed Service for Apache Spark™ cluster and a Spark connector
- Create a Managed Service for Apache Spark™ cluster with your preferred configuration and the following properties:
  - Service account: sa-for-spark.
  - Network: data-network.
  - Subnet: data-network-ru-central1-a.
- Go to your DataSphere workspace.
- In the left-hand panel, select Communities.
- Open the community you created the project in and select the project you need.
- Under Project resources, click Spark connector.
- Click Create connector.
- In the Name field, enter a name for your connector. Follow these naming requirements:
  - The name must be from 3 to 63 characters long.
  - It may contain uppercase and lowercase Latin and Cyrillic letters, numbers, hyphens, underscores, and spaces.
  - The first character must be a letter. The last character cannot be a hyphen, underscore, or space.
- Under Yandex Data Processing cluster:
  - Click the Select cluster tile.
  - Select the previously created Managed Service for Apache Spark™ cluster from the list.
- Optionally, to use an Object Storage bucket for computations, under S3 settings, specify the static access key ID and the secret storing the secret part of the static key (see the sketch after this list).
- Under Spark settings, specify the SparkConnect job settings:
  - To use the standard Apache Spark™ cluster settings for computations, select Use default settings.
  - To manually add or update job settings, specify one or more settings as Key and Value pairs.
- Click Create. You will see a page with detailed information on the connector you created.
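As a rough illustration of how these settings surface in the notebook, reading data from Object Storage and inspecting a job setting might look like the sketch below. The bucket name, object path, and the spark.executor.memory override are all hypothetical:

```python
# Hypothetical sketch: "my-bucket" and the object path are placeholders.
# Reading works only if the connector's S3 settings (static key ID and
# secret) are configured for a bucket you can access.
df = spark.read.option("header", True).csv("s3a://my-bucket/data/users.csv")
df.printSchema()

# If you overrode a setting under Spark settings, e.g., the key
# spark.executor.memory with the value 4g, you can read it back:
print(spark.conf.get("spark.executor.memory"))
```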
Run your computations
- Open the DataSphere project:
  - Select the project in your community or on the DataSphere home page in the Recent projects tab.
  - Click Open project in JupyterLab and wait for the loading to complete.
  - Open the notebook tab.
- In the cell, insert the code to compute, e.g.:

  ```python
  # Build a two-column DataFrame and print it; the computation
  # runs in the Apache Spark™ cluster.
  df = spark.createDataFrame([(1, "Sarah"), (2, "Maria")]).toDF("id", "name")
  df.show()
  ```

- Select Run → Run Selected Cells from the menu or press Shift + Enter.
- In the Notebook VM configurations window that opens, go to the With Yandex Data Processing cluster tab.
- Select the required configuration and connector.
- Click Select.
A local PySpark session will become available in the notebook via the spark variable. To run code in the notebook cells, the system will create and start a SparkConnect job in the Apache Spark™ cluster.
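For example, a slightly larger computation using only standard PySpark APIs (the table and column names here are illustrative):

```python
from pyspark.sql import functions as F

# Build a small DataFrame and aggregate it; the work is executed
# in the cluster through the SparkConnect job.
sales = spark.createDataFrame(
    [("north", 10), ("south", 7), ("north", 5)],
    ["region", "amount"],
)
totals = sales.groupBy("region").agg(F.sum("amount").alias("total"))
totals.orderBy("region").show()
```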
To terminate the job in the Apache Spark™ cluster, stop the notebook VM.
Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them: