Setting up and using Python virtual environments
Using PySpark often requires installing or updating Python packages in Yandex Data Processing clusters. In such cases, modifying the primary Python environment of a cluster may prove too time-consuming and impractical. As a workaround, you can use isolated virtual environments when running each job:
- For compatibility, set up your virtual environment on a temporary Yandex Data Processing cluster, then pack it into an archive and save the archive in an Object Storage bucket.
- Use the virtual environment from the archive when running jobs in Yandex Data Processing clusters.
You can use Python virtual environments when running PySpark jobs normally, when integrating a Yandex Data Processing cluster with Yandex DataSphere, and in Zeppelin notebooks.
Setting up a Python virtual environment
- Create a service account with the `dataproc.agent` and `dataproc.provisioner` roles.
- In Object Storage, create a bucket to store cluster logs and the virtual environment archive.
- Configure an ACL for the bucket, granting the `READ` and `WRITE` permissions to the service account.
- Create a temporary Yandex Data Processing cluster. When creating it, specify:
  - The same Yandex Data Processing version as that of the cluster where you are going to use the environment. This ensures compatibility.
  - Components: `SPARK` and `YARN`.
  - Bucket for storing logs.
  - Service account with bucket access.
  - Optionally, public access to the master host subcluster.

  We recommend specifying the minimum required settings for host resources.
- Use SSH to connect to the temporary Yandex Data Processing cluster.
- Create and activate a virtual environment using the built-in venv module:

  ```bash
  python -m venv pyspark_venv && \
  source pyspark_venv/bin/activate
  ```
- Install `venv-pack` and any other Python modules you need:

  ```bash
  pip install venv-pack <list_of_modules>
  ```

  Here is an example:

  ```bash
  pip install venv-pack pyarrow pandas catboost
  ```
- Archive the resulting environment with the `venv-pack` command:

  ```bash
  venv-pack -o <archive_name>.tar.gz
  ```
- Upload the environment archive to the Object Storage bucket you created earlier (you can verify the upload as shown in the sketch after this list):

  ```bash
  hdfs dfs -copyFromLocal <archive_name>.tar.gz s3a://<bucket_name>/
  ```
- Delete the temporary Yandex Data Processing cluster to avoid paying for it.
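Before deleting the temporary cluster, you can optionally check that the archive has reached the bucket. This is a minimal sketch run on the same temporary cluster; the bucket name is a placeholder:

```bash
# List the bucket contents and confirm the environment archive is present
hdfs dfs -ls s3a://<bucket_name>/
```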
Using a virtual environment
To use your virtual environment in the Yandex Data Processing cluster, grant the cluster service account read access to the bucket with the archive. There are two ways to do this:

- Edit the bucket ACL, granting the `READ` permission to the cluster service account.
- Assign the `storage.viewer` role to the service account (see the sketch after this list).
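For example, the role can be assigned with the CLI. This is a minimal sketch, assuming the yc CLI is configured and that assigning the role at the folder level is acceptable (it grants read access to all buckets in the folder, which is broader than a per-bucket ACL); the folder and service account ID are placeholders:

```bash
# Assign storage.viewer to the cluster service account at the folder level
yc resource-manager folder add-access-binding <folder_name_or_ID> \
  --role storage.viewer \
  --subject serviceAccount:<service_account_ID>
```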
Using a virtual environment when running PySpark jobs normally
When creating a PySpark job, set the following Spark properties:

- `spark.submit.deployMode=cluster`: Driver deploy mode. Make sure to run jobs in `cluster` mode to set up your virtual environment properly. For more information, see Resource allocation.
- `spark.yarn.dist.archives='s3a://<bucket_name>/<archive_name>.tar.gz#<alias>'`: Path to the archive with your environment. Enter an environment alias of your choice after the `#` character. The alias will serve as the name of the subdirectory the archive will be unpacked to.
- `spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python`: Overrides the Python interpreter command for the YARN Application Master process.
- `spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./<alias>/bin/python`: Overrides the Python interpreter command for the job driver.
This makes the dependencies within the virtual environment available when running jobs.
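For illustration, the same properties can also be passed to a plain `spark-submit` invocation if you submit the job directly from the cluster master host instead of through the Yandex Data Processing job interface. This is a sketch only; `job.py`, the bucket, archive, and alias names are placeholders:

```bash
# Submit a PySpark job in cluster mode using the packed virtual environment
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#<alias> \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python \
  --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./<alias>/bin/python \
  job.py
```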
Using a virtual environment when integrating a Yandex Data Processing cluster with Yandex DataSphere
Yandex DataSphere and Yandex Data Processing integration is based on using the Apache Livy component.

To use Python virtual environments in a cluster integrated with Yandex DataSphere, configure the following additional settings:
- Set the `cluster` driver deploy mode on the Yandex Data Processing cluster's side. To do this, provide the `livy:livy.spark.deploy-mode=cluster` component property value when updating the cluster. For more information, see Resource allocation.
- On the Yandex DataSphere side, create a Livy session with the following settings:

  ```
  %create_livy_session \
      --cluster <cluster_name_or_ID> --id <Livy_session_ID> \
      --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#<alias> \
      --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python \
      --conf <other_Spark_context_parameters>
  ```

  Where:

  - `<cluster_name_or_ID>`: Name or ID of the Yandex Data Processing cluster used for integration.
  - `<Livy_session_ID>`: Arbitrary string to identify the Livy session in the cluster.
  - `<bucket_name>`: Bucket with the environment archive.
  - `<archive_name>`: Ready-to-use archive with the Python environment.
  - `<alias>`: Environment alias of your choice. The alias will serve as the name of the subdirectory the archive will be unpacked to.
  - `<other_Spark_context_parameters>`: Specify them if needed. For a complete list of parameters, see this Spark article.
Specify the session you created when running Python code in the cluster. You will be able to use dependencies included in the virtual environment.
Using a virtual environment in a Zeppelin notebook
To use Python virtual environments for operations in a Zeppelin notebook:
- In the Zeppelin interface, navigate to the Interpreter settings and edit the spark section, setting the Spark context mode to `Per Note` and `Isolated`. This will allow different notebooks to use different virtual environments.
- Create a new cell named `%spark.conf` and use it to provide variables for a Spark session:

  ```
  %spark.conf
  spark.submit.deployMode cluster
  spark.yarn.dist.archives s3a://<bucket_name>/<archive_name>.tar.gz#<alias>
  spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python
  spark.pyspark.python ./<alias>/bin/python
  ```

  Where:

  - `spark.submit.deployMode cluster`: Driver deploy mode. Make sure to run jobs in `cluster` mode to set up your virtual environment properly. For more information, see Resource allocation.
  - `spark.yarn.dist.archives 's3a://<bucket_name>/<archive_name>.tar.gz#<alias>'`: Path to the archive with your environment. Enter an environment alias of your choice after the `#` character. The alias will serve as the name of the subdirectory the archive will be unpacked to.
  - `spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python`: Overrides the Python interpreter command for the YARN Application Master process.
  - `spark.pyspark.python ./<alias>/bin/python`: Overrides the Python interpreter command for the job driver.
  When executing the next cell that uses Spark, such as a cell in `%spark.pyspark` or `%spark.sql` mode, a Spark session will be created with the specified settings. The virtual environment dependencies will be available in this session.