Connecting to an Apache Spark™ cluster
This section describes how to connect to a Yandex Managed Service for Apache Spark™ cluster via Spark Connect.
Connecting via Spark Connect
- Create an IAM token and save it to an environment variable:

    export TOKEN=$(yc iam create-token)

- In your cluster, create a SparkConnect job without specifying any parameters.

- Copy the Spark Connect Server endpoint of the new connection job.

  The endpoint is part of the job information: see the Connection URL field in the management console or the connect_url field in the CLI and API.

- Install the pyspark package and its dependencies in your environment using the pip package manager.

  Warning
  The PySpark version in your environment must match the Apache Spark™ version in the cluster. Currently, connections are supported for PySpark versions 3.5.6 and 3.5.7.

- Run the following code to connect to the cluster:
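
    import os
    from pyspark.sql import SparkSession

    # Spark Connect Server endpoint copied from the job information
    url_spark = "<cluster_connection_endpoint>"
    # IAM token saved to the TOKEN environment variable earlier
    TOKEN = os.environ.get("TOKEN")

    # Open a remote session over SSL, authenticating with the IAM token
    spark = SparkSession.builder.remote(f"{url_spark}/;use_ssl=true;token={TOKEN}").getOrCreate()

    # Build a small test DataFrame and print it
    df = spark.createDataFrame([(1, "Sarah"), (2, "Maria")]).toDF(*["id", "name"])
    df.show()

  Result:

    +---+-----+
    | id| name|
    +---+-----+
    |  1|Sarah|
    |  2|Maria|
    +---+-----+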
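
The steps above depend on two preconditions that are easy to miss: the local PySpark version must be one of the supported ones, and the TOKEN variable must be set before the connection string is built. The following is a minimal sketch of pre-flight checks you could run before opening the session. The helper names (is_supported_pyspark, installed_pyspark_version, build_remote_url) are hypothetical and not part of PySpark or the service. Stripping a trailing slash from the endpoint is an added assumption meant to avoid a double slash in the connection string.

```python
from importlib import metadata
from typing import Optional

# PySpark versions this section lists as currently supported.
SUPPORTED_PYSPARK_VERSIONS = ("3.5.6", "3.5.7")


def is_supported_pyspark(version: str) -> bool:
    """Return True if the given PySpark version is listed as supported."""
    return version in SUPPORTED_PYSPARK_VERSIONS


def installed_pyspark_version() -> Optional[str]:
    """Return the installed pyspark version, or None if pyspark is missing."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None


def build_remote_url(endpoint: str, token: str) -> str:
    """Build the connection string in the same form as the example above.

    Hypothetical helper: it only validates its inputs and formats the string.
    """
    if not endpoint:
        raise ValueError("Spark Connect Server endpoint is empty")
    if not token:
        raise ValueError("TOKEN is empty; run: export TOKEN=$(yc iam create-token)")
    # Assumption: drop a trailing slash so the result has a single "/" before ";".
    return f"{endpoint.rstrip('/')}/;use_ssl=true;token={token}"
```

A possible use is to call installed_pyspark_version(), stop if is_supported_pyspark() returns False for it, and then pass build_remote_url(url_spark, os.environ.get("TOKEN", "")) to SparkSession.builder.remote(). Failing early this way gives a clearer message than a connection error from the server.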