Setting up and using Python virtual environments

Written by Yandex Cloud
Updated on September 25, 2025
  • Setting up a Python virtual environment
  • Using a virtual environment
    • Using a virtual environment when running PySpark jobs normally
    • Using a virtual environment when integrating a Yandex Data Processing cluster with Yandex DataSphere
    • Using a virtual environment in a Zeppelin notebook

Using PySpark often requires installing or updating Python packages in Yandex Data Processing clusters. Modifying the primary Python environment in a cluster may prove too time-consuming and impractical. Instead, you can use an isolated virtual environment for each job you run:

  1. Set up a virtual environment.

    For compatibility, set up your virtual environment on a temporary Yandex Data Processing cluster, then pack it into an archive and save the archive to an Object Storage bucket.

  2. Use the virtual environment from the archive when running jobs in Yandex Data Processing clusters.

    You can use Python virtual environments:

    • When running PySpark jobs normally.
    • When running PySpark jobs in Yandex Data Processing clusters integrated with Yandex DataSphere.
    • When running PySpark jobs in Zeppelin notebooks.

Setting up a Python virtual environment

  1. Create a service account with the dataproc.agent and dataproc.provisioner roles (for a CLI sketch of steps 1 through 3, see the example after this list).

  2. In Object Storage, create a bucket to store cluster logs and a virtual environment archive.

  3. Configure an ACL for the bucket by granting the READ and WRITE permissions to the service account.

  4. Create a temporary Yandex Data Processing cluster. When creating it, specify:

    • The same Yandex Data Processing version as that of the cluster you are going to use the environment on; this ensures compatibility.
    • Components:
      • SPARK
      • YARN
    • Bucket for storing logs.
    • Service account with bucket access.
    • Optionally, public access to the master host subcluster.

    We recommend specifying the minimum required settings for host resources.

  5. Use SSH to connect to the temporary Yandex Data Processing cluster.

  6. Create and activate a virtual environment using Python's built-in venv module:

    python -m venv pyspark_venv && \
    source pyspark_venv/bin/activate
    
  7. Install venv-pack and other Python environment modules you need:

    pip install venv-pack <list_of_modules>
    

    Here is an example:

    pip install venv-pack pyarrow pandas catboost
    
  8. Pack the resulting environment into an archive using the venv-pack command:

    venv-pack -o <archive_name>.tar.gz
    
  9. Upload the environment archive to the Object Storage bucket you created earlier:

    hdfs dfs -copyFromLocal <archive_name>.tar.gz s3a://<bucket_name>/
    
  10. Delete the temporary Yandex Data Processing cluster to avoid paying for it.
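
If you prefer the command line, steps 1 through 3 can also be performed with the Yandex Cloud CLI (yc). The sketch below is illustrative only: the service account name dataproc-venv-sa is an arbitrary example, the folder ID, service account ID, and bucket name are placeholders, and you should check the exact flags against yc --help for your CLI version.

    # Step 1: create a service account and assign it the Yandex Data Processing roles
    yc iam service-account create --name dataproc-venv-sa
    yc resource-manager folder add-access-binding <folder_ID> \
        --role dataproc.agent \
        --subject serviceAccount:<service_account_ID>
    yc resource-manager folder add-access-binding <folder_ID> \
        --role dataproc.provisioner \
        --subject serviceAccount:<service_account_ID>

    # Step 2: create a bucket for cluster logs and the environment archive
    yc storage bucket create --name <bucket_name>

    # Step 3: the bucket ACL (READ and WRITE for the service account) can be
    # configured in the management console or with `yc storage bucket update`.
    # Step 4: the temporary cluster itself can be created in the management
    # console or with `yc dataproc cluster create`.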

Using a virtual environment

To use your virtual environment in a Yandex Data Processing cluster, grant the cluster's service account permission to read data from the bucket that stores the archive. There are two ways to do this:

  • Edit the bucket ACL by granting the READ permission to the cluster service account.
  • Assign the storage.viewer role to the service account.
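
As an illustration, the second option can be done with the Yandex Cloud CLI; this is a minimal sketch where the folder ID and service account ID are placeholders:

    # Grant the cluster service account read access to Object Storage in the folder
    yc resource-manager folder add-access-binding <folder_ID> \
        --role storage.viewer \
        --subject serviceAccount:<service_account_ID>

Note that assigning storage.viewer at the folder level grants read access to all buckets in that folder, while the ACL option limits access to the single bucket with the archive.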

Using a virtual environment when running PySpark jobs normally

When creating a PySpark job, set the following Spark property values at the job level:

  • spark.submit.deployMode=cluster: Driver deploy mode.

    Make sure to run jobs in cluster mode to set up your virtual environment properly. For more information, see Resource allocation.

  • spark.yarn.dist.archives='s3a://<bucket_name>/<archive_name>.tar.gz#<alias>': Path to the archive with your environment.

    Enter an environment alias of your choice after the # character. The alias will serve as the name of the subdirectory the archive will be unpacked to.

  • spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python: Overrides the Python interpreter command for the YARN Application Master process.

  • spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./<alias>/bin/python: Overrides the Python interpreter command for the job driver.

This makes the dependencies within the virtual environment available when running jobs.
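
For illustration, the same properties map onto a plain spark-submit invocation if you submit the job manually from the cluster master host over SSH instead of through the Yandex Data Processing job interface. This is a sketch under that assumption; main.py and the venv alias are placeholders:

    # Run a PySpark job in cluster mode with the Python interpreter taken
    # from the virtual environment archive stored in Object Storage
    spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#venv \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/bin/python \
        --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./venv/bin/python \
        main.py

When you create the job through the management console, CLI, or API instead, pass the same four properties at the job level.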

Using a virtual environment when integrating a Yandex Data Processing cluster with Yandex DataSphere

Integration between Yandex DataSphere and Yandex Data Processing is based on the Apache Livy component that is part of Yandex Data Processing. For more information, see Integration with Yandex DataSphere.

To use Python virtual environments in clusters integrated with Yandex DataSphere, configure the following additional settings:

  1. Set the driver deploy mode to cluster on the Yandex Data Processing cluster’s side. To do this, set the livy:livy.spark.deploy-mode=cluster component property when updating the cluster.

    For more information, see Resource allocation.

  2. On the Yandex DataSphere side, create a Livy session with the following settings:

    %create_livy_session \
        --cluster <cluster_name_or_ID> --id <Livy_session_ID> \
        --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#<alias> \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python \
        --conf <other_Spark_context_parameters> ...
    

    Where:

    • <cluster_name_or_ID>: Name or ID of the Yandex Data Processing cluster used for integration.
    • <Livy_session_ID>: Arbitrary string to identify the Livy session in the cluster.
    • <bucket_name>: Bucket with the environment archive.
    • <archive_name>: Ready-to-use archive with the Python environment.
    • <alias>: Environment alias of your choice. The alias will serve as the name of the subdirectory the archive will be unpacked to.
    • <other_Spark_context_parameters>: Specify them if needed. For a complete list of parameters, see the Spark documentation.

Specify the session you created when running Python code in the cluster. You will be able to use dependencies included in the virtual environment.

Using a virtual environment in a Zeppelin notebook

To use Python virtual environments for operations in a Zeppelin notebook:

  1. In the Zeppelin interface, navigate to the Interpreter settings and edit the spark section by setting the Spark context mode to Per Note and Isolated. This will allow different notebooks to use different virtual environments.

  2. Create a new cell starting with %spark.conf and use it to set variables for the Spark session:

    %spark.conf
    spark.submit.deployMode cluster
    spark.yarn.dist.archives s3a://<bucket_name>/<archive_name>.tar.gz#<alias>
    spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python
    spark.pyspark.python ./<alias>/bin/python
    

    Where:

    • spark.submit.deployMode cluster: Driver deploy mode.

      Make sure to run jobs in cluster mode to set up your virtual environment properly. For more information, see Resource allocation.

    • spark.yarn.dist.archives s3a://<bucket_name>/<archive_name>.tar.gz#<alias>: Path to the archive with your environment.

      Enter an environment alias of your choice after the # character. The alias will serve as the name of the subdirectory the archive will be unpacked to.

    • spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python: Overrides the Python interpreter command for the YARN Application Master process.

    • spark.pyspark.python ./<alias>/bin/python: Overrides the Python interpreter command for the job driver.

    When executing the next cell that uses Spark, such as a cell in %spark.pyspark or %spark.sql mode, a Spark session will be created with the specified settings. The virtual environment dependencies will be available in this session.
