Integrating with Yandex DataSphere
You can use Apache Spark™ clusters deployed in Yandex Data Proc in your Yandex DataSphere projects. To set up integration with Yandex Data Proc in DataSphere:
- Prepare your infrastructure.
- Create a bucket.
- Create a Yandex Data Proc cluster.
- Set up the DataSphere project.
- Run your computations.
If you no longer need the resources you created, delete them.
Getting started
Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.
- On the DataSphere home page, click Try for free and select an account to log in with: Yandex ID or your working account in the identity federation (SSO).
- Select the organization in Yandex Cloud Organization that you are going to use in Yandex Cloud.
- Create a community.
- Link your billing account to the DataSphere community you are going to work in. Make sure that you have a billing account linked and its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account yet, create one in the DataSphere interface.
Required paid resources
The cost of maintaining the Yandex Data Proc cluster includes fees for the cluster's computing resources and storage (see Yandex Data Proc pricing).
Prepare the infrastructure
Log in to the Yandex Cloud management console.
If you have an active billing account, you can create or select a folder to deploy your infrastructure in on the cloud page.
Note
If you use an identity federation to access Yandex Cloud, billing details might be unavailable to you. In this case, contact your Yandex Cloud organization administrator.
Create a folder and network
Create a folder where your Yandex Data Proc cluster will run.
- In the management console, select a cloud and click Create folder.
- Name your folder, e.g., data-folder.
- Select Create a default network. This will create a network with subnets in each availability zone.
- Click Create.
Learn more about clouds and folders.
Create an egress NAT gateway
- In data-folder, select Virtual Private Cloud.
- In the left-hand panel, select Gateways.
- Click Create and set the gateway parameters:
  - Enter the gateway name, e.g., nat-for-cluster.
  - Gateway Type: Egress NAT.
  - Click Save.
- In the left-hand panel, select Routing tables.
- Click Create and specify the route table parameters:
  - Enter a name, e.g., route-table.
  - Select the data-network network.
  - Click Add a route.
  - In the window that opens, select Gateway in the Next hop field.
  - In the Gateway field, select the NAT gateway you created. The destination prefix will be filled in automatically.
  - Click Add.
- Click Create a routing table.
Next, link the route table to a subnet to route traffic from it via the NAT gateway:
- In the left-hand panel, select Subnets.
- In the row with the subnet you need, click the menu icon.
- In the menu that opens, select Link routing table.
- In the window that opens, select the created table from the list.
- Click Link.
Create a service account for the cluster
- Go to data-folder.
- In the Service accounts tab, click Create service account.
- Enter a name for the service account, e.g., sa-for-data-proc.
- Click Add role and assign the following roles to the service account:
  - dataproc.agent to create and use Yandex Data Proc clusters
  - dataproc.provisioner to enable subcluster autoscaling
  - vpc.user to use the Yandex Data Proc cluster network
  - iam.serviceAccounts.user to create resources in the folder on behalf of the service account
- Click Create.
Create an SSH key pair
To connect to the Yandex Data Proc cluster hosts securely, you will need SSH keys. You may skip this step if you have already generated your SSH keys.
- Open the terminal.
- Use the ssh-keygen command to create a new key:
    ssh-keygen -t ed25519
  After you run the command, you will be asked to specify the names of the files where the keys will be saved and to enter a password for the private key. Press Enter to use the default name and path suggested by the command.
  The key pair will be saved at the path you chose (by default, in the .ssh directory in your home directory). The public key will be saved in a .pub file.
If you do not have OpenSSH
- Run cmd.exe or powershell.exe (make sure to update PowerShell first).
- Use the ssh-keygen command to create a new key. Run this command:
    ssh-keygen -t ed25519
  After you run the command, you will be asked to specify the names of the files where the keys will be saved and to enter a password for the private key. Press Enter to use the default name and path suggested by the command.
  The key pair will be saved at the path you chose (by default, in the .ssh directory in your user profile directory). The public key will be saved in a .pub file.
Create keys using the PuTTY app:
- Download and install PuTTY.
- Make sure that the directory where you installed PuTTY is included in PATH:
  - Right-click My computer. Click Properties.
  - In the window that opens, select Advanced system settings, then Environment variables (located in the lower part of the window).
  - Under System variables, find PATH and click Edit.
  - In the Variable value field, append the path to the directory where you installed PuTTY.
- Launch the PuTTYgen app.
- Select EdDSA as the pair type to generate. Click Generate and move the cursor in the field above it until key creation is complete.
- In Key passphrase, enter a strong password. Enter it again in the field below.
- Click Save private key and save the private key. Do not share its passphrase with anyone.
- Save the public key to a text file. To do this, copy the single-line public key from the text field to a text file named id_ed25519.pub.
Warning
Save the private key in a secure location, as you will not be able to connect to the VM without it.
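Before pasting the public key into the cluster settings, it can help to confirm that the saved line is a single-line ed25519 key. The sketch below is an illustration only: the function name is hypothetical, and the check relies on the standard OpenSSH public key layout (key type, a base64 blob that repeats the key type, and an optional comment).

```python
import base64


def looks_like_ed25519_public_key(line: str) -> bool:
    """Check that a line has the shape 'ssh-ed25519 <base64> [comment]'."""
    parts = line.strip().split()
    if len(parts) < 2 or parts[0] != "ssh-ed25519":
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        blob = base64.b64decode(parts[1], validate=True)
    except ValueError:
        return False
    # The decoded blob starts with a 4-byte length followed by the key type.
    return blob[4:15] == b"ssh-ed25519"
```

To check a saved key, read the file first, for example `looks_like_ed25519_public_key(open("id_ed25519.pub").read())`, assuming the file is in the current directory.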
Configure DataSphere
To work with Yandex Data Proc clusters in DataSphere, create and set up a project.
Create a project
- Open the DataSphere home page.
- In the left-hand panel, select Communities.
- Select the community to create a project in.
- On the community page, click Create project.
- In the window that opens, enter a name and description (optional) for the project.
- Click Create.
Edit the project settings
- Go to the Settings tab.
- Under Advanced settings, click Edit.
- Specify the parameters:
  - Default folder: data-folder.
  - Service account: sa-for-data-proc.
  - Subnet: Subnet of the ru-central1-a availability zone in data-folder.
    Note
    If you specify a subnet in the project settings, allocating computing resources may take longer.
  - Security groups, if used in your organization.
- Click Save.
Create a bucket
- In the management console, select the folder where you want to create your bucket.
- In the list of services, select Object Storage.
- Click Create bucket.
- In the Name field, enter a name for the bucket.
- In the Object read access, Object listing access, and Read access to settings fields, select Restricted.
- Click Create bucket.
Create a Yandex Data Proc cluster
Before creating a cluster, make sure that your cloud has enough total SSD space (200 GB is allocated for a new cloud by default).
You can view your current resources under Quotas.
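The SSD check above is simple arithmetic: multiply the number of hosts by the storage size for each subcluster and compare the sum with the quota. A minimal sketch follows, assuming a hypothetical cluster layout; the subcluster sizes and the "ssd" label are invented for illustration, and only the 200 GB default comes from the text.

```python
DEFAULT_SSD_QUOTA_GB = 200  # default for a new cloud, as stated above


def total_ssd_gb(subclusters):
    """Sum hosts * storage size over subclusters that use SSD storage."""
    return sum(
        s["hosts"] * s["disk_gb"]
        for s in subclusters
        if s["disk_type"] == "ssd"
    )


def fits_quota(subclusters, quota_gb=DEFAULT_SSD_QUOTA_GB):
    """Return True if the planned SSD space does not exceed the quota."""
    return total_ssd_gb(subclusters) <= quota_gb


# Hypothetical plan: one master host and two data hosts.
planned = [
    {"name": "master", "hosts": 1, "disk_gb": 40, "disk_type": "ssd"},
    {"name": "data", "hosts": 2, "disk_gb": 100, "disk_type": "ssd"},
]
print(total_ssd_gb(planned), fits_quota(planned))
```

With this plan the total is 240 GB, which is above the default quota, so you would either reduce the storage sizes or request a higher quota before creating the cluster.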
- In the management console, select the folder where you want to create a cluster.
- Click Create resource and select Yandex Data Proc cluster from the drop-down list.
- Enter a name for the cluster in the Cluster name field. It must be unique within the folder.
- In the Version field, select 2.0.
- In the Services field, select LIVY, SPARK, YARN, and HDFS.
- Enter the public part of your SSH key in the SSH key field.
- In the Service account field, select sa-for-data-proc.
- In the Availability zone field, select ru-central1-a.
- If required, set the properties of Hadoop and its components in the Properties field, such as:
    hdfs:dfs.replication : 2
    hdfs:dfs.blocksize : 1073741824
    spark:spark.driver.cores : 1
  The available properties are listed in the official documentation for the components.
- Select the created bucket in the Bucket name field.
- Select a network for the cluster.
- Enable UI Proxy to access the web interfaces of Yandex Data Proc components.
- Configure subclusters: no more than one main subcluster with a Master host and subclusters for data storage or computing.
  Note
  To run computations on clusters, make sure you have at least one Compute or Data subcluster. Compute and Data subcluster roles are different: you can deploy data storage components on Data subclusters and data processing components on Compute subclusters. The Compute subcluster storage is only used to temporarily store processed files.
- For each subcluster, you can configure:
  - Number of hosts.
  - Host class: Platform and computing resources available to the host.
  - Storage size and type.
  - Subnet of the network where the cluster is located.
- For Compute subclusters, you can specify the autoscaling parameters.
- When you have set up all the subclusters, click Create cluster.
Yandex Data Proc will run the cluster create operation. After the cluster status changes to Running, you can connect to any active subcluster using the specified SSH key.
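The example values for the Properties field follow a `component:key : value` pattern. As a rough illustration of how such lines break down, the sketch below groups them by component prefix. The parser is hypothetical and is not part of Yandex Data Proc; it only mirrors the layout of the three example lines.

```python
from collections import defaultdict


def parse_properties(text: str) -> dict:
    """Group 'component:key : value' lines by their component prefix."""
    grouped = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split the value off first, then the component prefix.
        name, _, value = line.partition(" : ")
        component, _, key = name.partition(":")
        grouped[component][key.strip()] = value.strip()
    return dict(grouped)


example = """
hdfs:dfs.replication : 2
hdfs:dfs.blocksize : 1073741824
spark:spark.driver.cores : 1
"""
props = parse_properties(example)
print(props)
```

Note that the example block size, 1073741824 bytes, is exactly 1 GiB (1024 ** 3).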
The Yandex Data Proc cluster you created will be added to your DataSphere project under Project resources ⟶ Yandex Data Proc ⟶ Available clusters.
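The subcluster rules from the steps above (no more than one Master subcluster, and at least one Compute or Data subcluster to run computations) can be written down as a small check. This is a sketch for planning only: the function and the plain role strings are hypothetical and do not call any Yandex Cloud API.

```python
def check_subclusters(roles):
    """Return a list of problems found in a list of subcluster role names."""
    problems = []
    if roles.count("Master") > 1:
        problems.append("more than one Master subcluster")
    if not any(role in ("Compute", "Data") for role in roles):
        problems.append("no Compute or Data subcluster to run computations on")
    return problems


# Hypothetical layout: one Master, one Data, and one Compute subcluster.
print(check_subclusters(["Master", "Data", "Compute"]))
```

An empty list means the layout satisfies both rules.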
Run your computations on the cluster
- Open the DataSphere project:
  - Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
  - Click Open project in JupyterLab and wait for the loading to complete.
  - Open the notebook tab.
- In the cell, insert the code to compute. For example:
    #!spark --cluster <cluster_name>
    import random

    # Return True if a random point in the unit square falls inside the unit circle.
    def inside(p):
        x, y = random.random(), random.random()
        return x*x + y*y < 1

    NUM_SAMPLES = 1_000_000
    # sc is the Spark context available in the cell.
    count = sc.parallelize(range(0, NUM_SAMPLES)) \
        .filter(inside).count()
    print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
  Where #!spark --cluster <cluster_name> is a required system command to run computations on a cluster.
  Wait for the computation to start. While it is in progress, you will see logs under the cell.
- Write data to S3 by specifying the bucket name:
    #!spark
    data = [[1, "tiger"], [2, "lion"], [3, "snow leopard"]]
    # Build the DataFrame from the rows defined above.
    df = spark.createDataFrame(data, schema="id LONG, name STRING")
    # Write everything as a single CSV partition with a header row.
    df.repartition(1).write.option("header", True).csv("s3://<bucket_name>/")
- Run the cells. To do this, select Run ⟶ Run Selected Cells or press Shift + Enter.
  The file will appear in the bucket. To view bucket contents in the JupyterLab interface, create and activate an S3 connector in your project.
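To get a rough idea of what the written file contains, you can build the same rows locally with the standard csv module. This is only a local sketch: the helper name is hypothetical, and the file that Spark writes to the bucket may differ in file naming and quoting details.

```python
import csv
import io


def preview_csv(rows, header):
    """Render rows as CSV text with a header line, similar to the header option."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


data = [[1, "tiger"], [2, "lion"], [3, "snow leopard"]]
print(preview_csv(data, ["id", "name"]))
```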
Note
To get more than 100 MB of the Yandex Data Proc cluster data, use an S3 connector.
To learn more about running computations on Yandex Data Proc clusters in DataSphere, see this concept.
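The Pi example above is a Monte Carlo estimate: the share of random points in the unit square that land inside the quarter circle approaches π/4. If you want to sanity-check the logic before running it on a cluster, the same computation can be sketched in plain Python without Spark. The `estimate_pi` function and the fixed seed are additions for illustration, not part of the tutorial code.

```python
import random


def inside(_):
    """Return True if a random point in the unit square is inside the unit circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1


def estimate_pi(num_samples, seed=0):
    """Estimate Pi from the share of random points that fall inside the circle."""
    random.seed(seed)  # fixed seed so that repeated runs give the same result
    count = sum(1 for i in range(num_samples) if inside(i))
    return 4.0 * count / num_samples


print("Pi is roughly %f" % estimate_pi(100_000))
```

The local version uses fewer samples than the cluster example, so expect a coarser estimate.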
Delete the resources you created
Warning
As a user of a cluster deployed in Yandex Data Proc, you manage its lifecycle yourself. The cluster will run, and fees will be charged, until you shut it down.
Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:
- Objects from the bucket
- Bucket
- Cluster
- Project
- Subnet
- Route table
- NAT gateway
- Network
- Service account