Setting up and using Python virtual environments
When using PySpark, you often need to install or update Python packages in Yandex Data Processing clusters. Changing the primary Python environment in a cluster for this purpose can be too time-consuming and impractical. As an alternative, you can use isolated virtual environments when running each job:
- For compatibility, a virtual environment is set up on a temporary Yandex Data Processing cluster. It is then placed in an archive that is saved in an Object Storage bucket.
- Use the virtual environment from the archive when running jobs in Yandex Data Processing clusters.
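Python virtual environments can be used:
- At regular runs of PySpark jobs.
- When integrating the Yandex Data Processing cluster with Yandex DataSphere.
- In a Zeppelin notebook.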
Setting up a Python virtual environment
- Create a service account with the dataproc.agent and dataproc.provisioner roles.
- In Object Storage, create a bucket to store cluster logs and a virtual environment archive.
- Configure an ACL for the bucket by granting the READ and WRITE permissions to the service account.
- Create a temporary Yandex Data Processing cluster. When creating it, specify:
  - Yandex Data Processing version, same as that of the cluster to use the environment on. This ensures compatibility.
  - Components: SPARK, YARN.
  - Bucket for storing logs.
  - Service account with bucket access.
  - (Optional) Public access to the subcluster with the master host.
  We recommend specifying the minimum required settings of host resources.
- Use SSH to connect to the temporary Yandex Data Processing cluster.
- Create and activate a virtual environment with Python's built-in venv module:
      python -m venv pyspark_venv && \
      source pyspark_venv/bin/activate
- Install venv-pack and the other Python modules you need in the environment:
      pip install venv-pack <list_of_modules>
  Example:
      pip install venv-pack pyarrow pandas catboost
- Archive the resulting environment with the venv-pack command:
      venv-pack -o <archive_name>.tar.gz
- Send the archive with the environment to the previously created Object Storage bucket:
      hdfs dfs -copyFromLocal <archive_name>.tar.gz s3a://<bucket_name>/
- Delete the temporary Yandex Data Processing cluster to avoid paying for it.
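Because jobs later refer to the interpreter as ./<alias>/bin/python, it can help to confirm before uploading that the archive has bin/python at its top level. The following is a minimal sketch that uses only the Python standard library. The function name archive_has_interpreter is an illustrative assumption and is not part of Yandex Data Processing or venv-pack.

```python
import tarfile
from pathlib import PurePosixPath


def archive_has_interpreter(archive_path: str, interpreter: str = "bin/python") -> bool:
    """Return True if the .tar.gz archive contains `interpreter` at its top level."""
    wanted = PurePosixPath(interpreter)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # PurePosixPath drops a leading "./", so "./bin/python" matches "bin/python".
            if PurePosixPath(member.name) == wanted:
                return True
    return False
```

For example, archive_has_interpreter("pyspark_venv.tar.gz") is expected to return True for an archive built with the steps above, assuming venv-pack stores the environment contents at the archive root rather than inside a wrapping directory.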
Using a virtual environment
To use the prepared virtual environment in the Yandex Data Processing cluster, grant the cluster service account permission to read data from the bucket that stores the archive. There are two ways to do this:
- Edit the bucket ACL by granting the READ permission to the cluster service account.
- Assign the storage.viewer role to the service account.
Using a virtual environment at regular runs of PySpark jobs
When creating a PySpark job, set the following Spark properties:
- spark.submit.deployMode=cluster: Driver deploy mode.
  Make sure to run jobs in cluster mode to set up your virtual environment properly. For more information about the driver deploy mode, see Resource allocation.
- spark.yarn.dist.archives='s3a://<bucket_name>/<archive_name>.tar.gz#<alias>': Path to the archive with the prepared environment.
  Enter an environment alias of your choice after the # character. The alias will be used as the name of the subdirectory the archive will be unpacked to.
- spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python: Overrides the command that starts the Python interpreter for the YARN Application Master process.
- spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./<alias>/bin/python: Overrides the command that starts the Python interpreter for the job driver.
This makes the dependencies within the virtual environment available during job execution.
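The four properties above all derive from three values: the bucket name, the archive file name, and the alias. As a rough sketch, they could be generated in one place so that the alias after # always matches the interpreter path. The helper name venv_spark_properties and the sample values are assumptions made for illustration only.

```python
from typing import Dict


def venv_spark_properties(bucket: str, archive: str, alias: str) -> Dict[str, str]:
    """Build the Spark properties listed above for a packed virtual environment.

    `archive` is the full file name, for example "pyspark_venv.tar.gz".
    """
    # The alias names the directory the archive is unpacked to,
    # so the interpreter path must use the same alias.
    interpreter = f"./{alias}/bin/python"
    return {
        "spark.submit.deployMode": "cluster",
        "spark.yarn.dist.archives": f"s3a://{bucket}/{archive}#{alias}",
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": interpreter,
        "spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON": interpreter,
    }


# Hypothetical values; print the properties as key=value lines.
for key, value in venv_spark_properties("my-bucket", "pyspark_venv.tar.gz", "venv").items():
    print(f"{key}={value}")
```

Deriving both interpreter paths from the same alias avoids the most likely mistake here: an alias in spark.yarn.dist.archives that does not match the directory used in the interpreter path.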
Using a virtual environment when integrating the Yandex Data Processing cluster with Yandex DataSphere
Yandex DataSphere and Yandex Data Processing integration is based on Apache Livy.
To use Python virtual environments with the Yandex DataSphere integration, configure additional settings:
- Set the cluster driver deploy mode on the Yandex Data Processing cluster side. To do this, provide the livy:livy.spark.deploy-mode=cluster component property value when updating the cluster.
  For more information about the driver deploy mode, see Resource allocation.
- On the Yandex DataSphere side, create a Livy session with the following settings:
      %create_livy_session \
          --cluster <cluster_name_or_ID> --id <Livy_session_ID> \
          --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#<alias> \
          --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python \
          --conf <other_Spark_context_parameters> ...
  Where:
  - <cluster_name_or_ID>: Name or ID of the Yandex Data Processing cluster used for integration.
  - <Livy_session_ID>: Arbitrary string to identify the Livy session in the cluster.
  - <bucket_name>: Bucket with the environment archive.
  - <archive_name>: Ready-to-use archive with the Python environment.
  - <alias>: Environment alias of your choice. The alias will be used as the name of the subdirectory the archive will be unpacked to.
  - <other_Spark_context_parameters>: Specified if needed. For a complete list of parameters, see the Spark documentation.
Specify the created session when running Python code in the cluster. Dependencies included in the virtual environment will be available for use.
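If several sessions differ only in the archive or alias, the %create_livy_session command shown above can be assembled from its parts. This is a hedged sketch: the function name livy_session_magic and the sample values are assumptions for illustration, and it uses only the --cluster, --id, and --conf options that appear in the command above.

```python
from typing import Dict, Optional


def livy_session_magic(
    cluster: str,
    session_id: str,
    bucket: str,
    archive: str,
    alias: str,
    extra_conf: Optional[Dict[str, str]] = None,
) -> str:
    """Render a %create_livy_session command that uses a packed virtual environment."""
    conf = {
        "spark.yarn.dist.archives": f"s3a://{bucket}/{archive}#{alias}",
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": f"./{alias}/bin/python",
    }
    # Optional extra Spark context parameters are appended after the required ones.
    conf.update(extra_conf or {})
    lines = [f"%create_livy_session --cluster {cluster} --id {session_id}"]
    lines += [f"    --conf {key}={value}" for key, value in conf.items()]
    # Join with a trailing backslash so the command continues on the next line.
    return " \\\n".join(lines)


# Hypothetical values for illustration.
print(livy_session_magic("my-cluster", "venv-session", "my-bucket", "pyspark_venv.tar.gz", "venv"))
```

The printed text can then be pasted into a notebook cell. The function only builds a string and does not contact the cluster.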
Using a virtual environment in a Zeppelin notebook
To use Python virtual environments when working in a Zeppelin notebook:
- In the Zeppelin interface, go to the Interpreter settings and edit the spark section by setting the Spark context mode to Per Note and Isolated. This will allow different notebooks to use different virtual environments.
- Create a new cell named %spark.conf and use it to provide variables for a Spark session:
      %spark.conf
      spark.submit.deployMode cluster
      spark.yarn.dist.archives s3a://<bucket_name>/<archive_name>.tar.gz#<alias>
      spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python
      spark.pyspark.python ./<alias>/bin/python
  Where:
  - spark.submit.deployMode cluster: Driver deploy mode.
    Make sure to run jobs in cluster mode to set up your virtual environment properly. For more information about the driver deploy mode, see Resource allocation.
  - spark.yarn.dist.archives 's3a://<bucket_name>/<archive_name>.tar.gz#<alias>': Path to the archive with the prepared environment.
    Enter an environment alias of your choice after the # character. The alias will be used as the name of the subdirectory the archive will be unpacked to.
  - spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python: Overrides the command that starts the Python interpreter for the YARN Application Master process.
  - spark.pyspark.python ./<alias>/bin/python: Overrides the command that starts the Python interpreter for the job driver.
  When executing the next cell that uses Spark, such as a cell in %spark.pyspark or %spark.sql mode, a Spark session will be created with the specified settings. The virtual environment dependencies will be available in this session.
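To check that a session really picked up the virtual environment, one option is to run a short piece of Python, for example in a %spark.pyspark cell, that reports which interpreter is running and whether the expected modules can be found. This is a minimal sketch: describe_environment is a hypothetical helper, it inspects only the Python process it runs in, and the module names are the ones from the pip install example earlier on this page.

```python
import importlib.util
import sys
from typing import Dict, Iterable


def describe_environment(modules: Iterable[str]) -> Dict[str, object]:
    """Report the running interpreter and which top-level modules are importable."""
    return {
        # With the virtual environment active, this path should point inside ./<alias>/.
        "interpreter": sys.executable,
        # find_spec returns None for a top-level module that is not installed.
        "modules": {name: importlib.util.find_spec(name) is not None for name in modules},
    }


print(describe_environment(["pyarrow", "pandas", "catboost"]))
```

If the interpreter path does not contain the alias directory, the session was most likely created before the %spark.conf cell ran or with different settings.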