Setting up and using Python virtual environments
Using PySpark often requires installing or updating Python packages in Yandex Data Processing clusters. In such cases, modifying the primary Python environment of a cluster may prove too time-consuming and impractical. As a workaround, you can use isolated virtual environments when running each job:
- For compatibility, set up your virtual environment on a temporary Yandex Data Processing cluster, then pack it into an archive and save the archive in an Object Storage bucket.
- Use the virtual environment from the archive when running jobs in Yandex Data Processing clusters.
You can use Python virtual environments when running PySpark jobs normally, when integrating a Yandex Data Processing cluster with Yandex DataSphere, and in Zeppelin notebooks.
Setting up a Python virtual environment
- Create a service account with the `dataproc.agent` and `dataproc.provisioner` roles.
- In Object Storage, create a bucket to store cluster logs and the virtual environment archive.
- Configure an ACL for the bucket, granting the `READ` and `WRITE` permissions to the service account.
- Create a temporary Yandex Data Processing cluster. When creating it, specify:
  - The same Yandex Data Processing version as that of the cluster where you are going to use the environment. This ensures compatibility.
  - Components: `SPARK` and `YARN`.
  - Bucket for storing logs.
  - Service account with bucket access.
  - Optionally, public access to the master host subcluster.

  We recommend specifying the minimum required settings for host resources.
- Use SSH to connect to the temporary Yandex Data Processing cluster.
- Create and activate a virtual environment using the built-in venv module:

  ```bash
  python -m venv pyspark_venv && \
  source pyspark_venv/bin/activate
  ```
- Install `venv-pack` and any other Python modules you need:

  ```bash
  pip install venv-pack <list_of_modules>
  ```

  Here is an example:

  ```bash
  pip install venv-pack pyarrow pandas catboost
  ```
- Archive the resulting environment with the `venv-pack` command:

  ```bash
  venv-pack -o <archive_name>.tar.gz
  ```
- Upload the environment archive to the Object Storage bucket you created earlier (you can verify the upload as shown in the sketch after this list):

  ```bash
  hdfs dfs -copyFromLocal <archive_name>.tar.gz s3a://<bucket_name>/
  ```
- Delete the temporary Yandex Data Processing cluster to avoid paying for it.
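Before deleting the temporary cluster, you can optionally check that the archive has reached the bucket. This is a minimal sketch run on the same temporary cluster; the bucket name is a placeholder:

```bash
# List the bucket contents and confirm the environment archive is present
hdfs dfs -ls s3a://<bucket_name>/
```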
Using a virtual environment
To use your virtual environment in the Yandex Data Processing cluster, grant the cluster service account read access to the bucket with the archive. There are two ways to do this:

- Edit the bucket ACL, granting the `READ` permission to the cluster service account.
- Assign the `storage.viewer` role to the service account (see the sketch after this list).
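For example, the role can be assigned with the CLI. This is a minimal sketch, assuming the yc CLI is configured and that assigning the role at the folder level is acceptable (it grants read access to all buckets in the folder, which is broader than a per-bucket ACL); the folder and service account ID are placeholders:

```bash
# Assign storage.viewer to the cluster service account at the folder level
yc resource-manager folder add-access-binding <folder_name_or_ID> \
  --role storage.viewer \
  --subject serviceAccount:<service_account_ID>
```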
Using a virtual environment when running PySpark jobs normally
When creating a PySpark job, set the following Spark properties:

- `spark.submit.deployMode=cluster`: Driver deploy mode. Make sure to run jobs in `cluster` mode to set up your virtual environment properly. For more information, see Resource allocation.
- `spark.yarn.dist.archives='s3a://<bucket_name>/<archive_name>.tar.gz#<alias>'`: Path to the archive with your environment. Enter an environment alias of your choice after the `#` character. The alias will serve as the name of the subdirectory the archive will be unpacked to.
- `spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python`: Overrides the Python interpreter command for the YARN Application Master process.
- `spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./<alias>/bin/python`: Overrides the Python interpreter command for the job driver.
This makes the dependencies within the virtual environment available when running jobs.
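For illustration, the same properties can also be passed to a plain `spark-submit` invocation if you submit the job directly from the cluster master host instead of through the Yandex Data Processing job interface. This is a sketch only; `job.py`, the bucket, archive, and alias names are placeholders:

```bash
# Submit a PySpark job in cluster mode using the packed virtual environment
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#<alias> \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./<alias>/bin/python \
  job.py
```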
Using a virtual environment when integrating a Yandex Data Processing cluster with Yandex DataSphere
Yandex DataSphere and Yandex Data Processing integration is based on using the Apache Livy component.

To use Python virtual environments in a cluster integrated with Yandex DataSphere, configure the following additional settings:
- Set the `cluster` driver deploy mode on the Yandex Data Processing cluster's side. To do this, provide the `livy:livy.spark.deploy-mode=cluster` component property value when updating the cluster. For more information, see Resource allocation.
- On the Yandex DataSphere side, create a Livy session with the following settings:

  ```
  %create_livy_session \
      --cluster <cluster_name_or_ID> --id <Livy_session_ID> \
      --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#<alias> \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python \
      --conf <other_Spark_context_parameters>
  ```

  Where:

  - `<cluster_name_or_ID>`: Name or ID of the Yandex Data Processing cluster used for integration.
  - `<Livy_session_ID>`: Arbitrary string to identify the Livy session in the cluster.
  - `<bucket_name>`: Bucket with the environment archive.
  - `<archive_name>`: Ready-to-use archive with the Python environment.
  - `<alias>`: Environment alias of your choice. The alias will serve as the name of the subdirectory the archive will be unpacked to.
  - `<other_Spark_context_parameters>`: Specify them if needed. For a complete list of parameters, see this Spark article.
Specify the session you created when running Python code in the cluster. You will be able to use dependencies included in the virtual environment.
Using a virtual environment in a Zeppelin notebook
To use Python virtual environments for operations in a Zeppelin notebook:
- In the Zeppelin interface, navigate to the Interpreter settings and edit the spark section, setting the Spark context mode to `Per Note` and `Isolated`. This will allow different notebooks to use different virtual environments.
- Create a new cell named `%spark.conf` and use it to provide variables for a Spark session:

  ```
  %spark.conf
  spark.submit.deployMode cluster
  spark.yarn.dist.archives s3a://<bucket_name>/<archive_name>.tar.gz#<alias>
  spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python
  spark.pyspark.python ./<alias>/bin/python
  ```

  Where:

  - `spark.submit.deployMode cluster`: Driver deploy mode. Make sure to run jobs in `cluster` mode to set up your virtual environment properly. For more information, see Resource allocation.
  - `spark.yarn.dist.archives 's3a://<bucket_name>/<archive_name>.tar.gz#<alias>'`: Path to the archive with your environment. Enter an environment alias of your choice after the `#` character. The alias will serve as the name of the subdirectory the archive will be unpacked to.
  - `spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python`: Overrides the Python interpreter command for the YARN Application Master process.
  - `spark.pyspark.python ./<alias>/bin/python`: Overrides the Python interpreter command for the job driver.
  When executing the next cell that uses Spark, such as a cell in `%spark.pyspark` or `%spark.sql` mode, a Spark session will be created with the specified settings. The virtual environment dependencies will be available in this session.