Ways to use Apache Spark™ clusters in DataSphere
Yandex Data Processing allows you to deploy Apache Spark™ clusters. You can use Yandex Data Processing clusters to run distributed training.
Cluster deployment options
To work with Yandex Data Processing clusters in DataSphere, you can use either an existing persistent cluster or a temporary one. If you have no existing Yandex Data Processing clusters, or you only need a cluster for a short time, use temporary Yandex Data Processing clusters. You can create them in one of the following ways:
- Spark connector (preferred)
- Yandex Data Processing template
Regardless of the deployment option, all Yandex Data Processing clusters are charged based on the Yandex Data Processing pricing policy.
Setting up a DataSphere project to work with Yandex Data Processing clusters
To use Yandex Data Processing clusters, set the following project parameters:
- Default folder to enable integration with other Yandex Cloud services. A Yandex Data Processing cluster will be deployed in this folder based on the current cloud quotas. The fee for using the cluster will be debited from your cloud billing account.
- Service account that DataSphere will use to create and manage clusters. The service account needs the following roles:
  - `dataproc.agent` to use Yandex Data Processing clusters.
  - `dataproc.admin` to create clusters from Yandex Data Processing templates.
  - `vpc.user` to use the Yandex Data Processing cluster network.
  - `iam.serviceAccounts.user` to create resources in the folder on behalf of the service account.
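As a sketch, the roles above can be assigned to the service account with the Yandex Cloud CLI; the folder ID and service account name below are placeholders for your own values:

```shell
# Assign each required role to the service account in the project folder.
# <folder-id> and my-datasphere-sa are placeholders.
for role in dataproc.agent dataproc.admin vpc.user iam.serviceAccounts.user; do
  yc resource-manager folder add-access-binding <folder-id> \
    --role "$role" \
    --service-account-name my-datasphere-sa
done
```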
- Subnet for DataSphere to communicate with the Yandex Data Processing cluster. Since the Yandex Data Processing cluster needs internet access, make sure to configure a NAT gateway in the subnet.
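A NAT gateway for the subnet can be set up with the Yandex Cloud CLI roughly as follows; all names and IDs here are placeholders, and the exact flags may differ depending on your CLI version:

```shell
# Create an egress NAT gateway.
yc vpc gateway create --name nat-gateway

# Create a route table that sends all outbound traffic through the gateway.
# <gateway-id> is the ID returned by the previous command.
yc vpc route-table create --name nat-route-table \
  --network-name my-network \
  --route destination=0.0.0.0/0,gateway-id=<gateway-id>

# Attach the route table to the subnet used by the project.
yc vpc subnet update my-subnet --route-table-id <route-table-id>
```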
Note
If you specified a subnet in the project settings, allocating computing resources may take longer.
Warning
The persistent Yandex Data Processing cluster must have the `livy:livy.spark.deploy-mode : client` setting.
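When creating a persistent cluster with the Yandex Cloud CLI, this Livy property can be passed explicitly. A minimal sketch (the cluster name is a placeholder, and other required flags such as the zone, service account, and subcluster configuration are omitted for brevity):

```shell
# Set the Livy deploy mode to client at cluster creation time.
yc dataproc cluster create my-spark-cluster \
  --property livy:livy.spark.deploy-mode=client
```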