
Setting up and using Python virtual environments

Written by
Yandex Cloud
Updated on December 26, 2024
  • Setting up a Python virtual environment
  • Using a virtual environment
    • Using a virtual environment for regular runs of PySpark jobs
    • Using a virtual environment when integrating the Yandex Data Processing cluster with Yandex DataSphere
    • Using a virtual environment in a Zeppelin notebook

When using PySpark, you often need to install or update Python packages in Yandex Data Processing clusters. However, changing the primary Python environment in a cluster can be too time-consuming and impractical. As an alternative, you can run each job in its own isolated virtual environment:

  1. Set up a virtual environment.

    For compatibility, a virtual environment is set up on a temporary Yandex Data Processing cluster. It is then placed in an archive that is saved in an Object Storage bucket.

  2. Use the virtual environment from the archive when running jobs in Yandex Data Processing clusters.

    Python virtual environments can be used:

    • For regular runs of PySpark jobs
    • When running PySpark jobs in Yandex Data Processing clusters integrated with Yandex DataSphere
    • When running PySpark jobs in Zeppelin notebooks

Setting up a Python virtual environment

  1. Create a service account with the dataproc.agent and dataproc.provisioner roles.

  2. In Object Storage, create a bucket to store cluster logs and a virtual environment archive.

  3. Configure an ACL for the bucket by granting the READ and WRITE permissions to the service account.

  4. Create a temporary Yandex Data Processing cluster. When creating it, specify:

    • The same Yandex Data Processing version as that of the cluster the environment will be used on; this ensures compatibility.
    • Components:
      • SPARK
      • YARN
    • Bucket for storing logs.
    • Service account with bucket access.
    • (Optional) Public access to the subcluster with the master host.

    We recommend specifying the minimum required host resources (see the CLI sketch after this list).

  5. Use SSH to connect to the temporary Yandex Data Processing cluster.

  6. Create and activate a virtual environment using Python's built-in venv module:

    python -m venv pyspark_venv && \
    source pyspark_venv/bin/activate
    
  7. Install venv-pack and other Python environment modules you need:

    pip install venv-pack <list_of_modules>
    

    Example:

    pip install venv-pack pyarrow pandas catboost
    
  8. Archive the resulting environment with the venv-pack command:

    venv-pack -o <archive_name>.tar.gz
    
  9. Upload the archive with the environment to the previously created Object Storage bucket (you can verify the upload as shown after this list):

    hdfs dfs -copyFromLocal <archive_name>.tar.gz s3a://<bucket_name>/
    
  10. Delete the temporary Yandex Data Processing cluster to avoid paying for it.
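For reference, creating the temporary cluster from the command line might look like the sketch below. This is an illustration rather than an exact recipe: the flag set follows the yc CLI as commonly documented, all values are placeholders, and you should verify the options against yc dataproc cluster create --help:

    yc dataproc cluster create \
        --name=tmp-venv-builder \
        --version=<image_version> \
        --services=spark,yarn \
        --bucket=<bucket_name> \
        --service-account-name=<service_account_name> \
        --ssh-public-keys-file=~/.ssh/<public_key_file> \
        --subcluster name=master,role=masternode,resource-preset=s2.micro,disk-size=40,hosts-count=1,subnet-name=<subnet_name>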
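To confirm that the archive reached the bucket, you can list the bucket contents from any machine, even after the cluster is deleted. A minimal sketch assuming the AWS CLI is installed and configured with static access keys for Object Storage:

    aws --endpoint-url=https://storage.yandexcloud.net \
        s3 ls s3://<bucket_name>/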

Using a virtual environment

To use the prepared virtual environment in the Yandex Data Processing cluster, grant the cluster service account read access to the bucket that stores the archive. There are two ways to do this:

  • Edit the bucket ACL by granting the READ permission to the cluster service account.
  • Assign the storage.viewer role to the service account.
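For example, assigning the storage.viewer role at the folder level might look like this (a sketch assuming the yc CLI; the folder and service account names are placeholders, and a narrower bucket-level grant may be preferable):

    yc resource-manager folder add-access-binding <folder_name_or_ID> \
        --role storage.viewer \
        --service-account-name <service_account_name>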

Using a virtual environment for regular runs of PySpark jobs

When creating a PySpark job, set the following Spark property values at the job level:

  • spark.submit.deployMode=cluster: Driver deploy mode.

    Make sure to run jobs in cluster mode to set up your virtual environment properly. For more information about the driver deploy mode, see Resource allocation.

  • spark.yarn.dist.archives='s3a://<bucket_name>/<archive_name>.tar.gz#<alias>': Path to the archive with the prepared environment.

    Enter an environment alias of your choice after the # character. The alias will be used as the name of the subdirectory the archive will be unpacked to.

  • spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python: Overrides the command that runs the Python interpreter for the YARN Application Master process.

  • spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./<alias>/bin/python: Overrides the command that runs the Python interpreter for the job driver.

This makes the dependencies within the virtual environment available during job execution.
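If you run the job manually from the cluster master host rather than through the management console, CLI, or API, the equivalent spark-submit invocation might look like this (a sketch; job.py and the venv alias are placeholders):

    spark-submit \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#venv \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/bin/python \
        --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./venv/bin/python \
        job.py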

Using a virtual environment when integrating the Yandex Data Processing cluster with Yandex DataSphere

Yandex DataSphere and Yandex Data Processing integration is based on using the Apache Livy component as part of Yandex Data Processing. For more information about setting up integration with Yandex DataSphere, see Integration with Yandex DataSphere.

To use Python virtual environments in a cluster integrated with Yandex DataSphere, configure additional settings:

  1. Set the cluster driver deploy mode on the Yandex Data Processing cluster side. To do this, provide the livy:livy.spark.deploy-mode=cluster component property value when updating the cluster.

    For more information about the driver deploy mode, see Resource allocation.

  2. On the Yandex DataSphere side, create a Livy session with the following settings:

    %create_livy_session \
        --cluster <cluster_name_or_ID> --id <Livy_session_ID> \
        --conf spark.yarn.dist.archives=s3a://<bucket_name>/<archive_name>.tar.gz#<alias> \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./<alias>/bin/python \
        --conf <other_Spark_context_parameters> ...
    

    Where:

    • <cluster_name_or_ID>: Name or ID of the Yandex Data Processing cluster used for integration.
    • <Livy_session_ID>: Arbitrary string to identify the Livy session in the cluster.
    • <bucket_name>: Bucket with the environment archive.
    • <archive_name>: Ready-to-use archive with the Python environment.
    • <alias>: Environment alias of your choice. The alias will be used as the name of the subdirectory the archive will be unpacked to.
    • <other_Spark_context_parameters>: Specified if needed. For a complete list of parameters, see the Spark documentation.
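For illustration, a session created with hypothetical values (my-cluster, session1, my-bucket, and pyspark_venv.tar.gz are placeholders) might look like this:

    %create_livy_session \
        --cluster my-cluster --id session1 \
        --conf spark.yarn.dist.archives=s3a://my-bucket/pyspark_venv.tar.gz#venv \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./venv/bin/python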

Specify the created session when running Python code in the cluster. Dependencies included in the virtual environment will be available for use.

Using a virtual environment in a Zeppelin notebook

To use Python virtual environments when working in a Zeppelin notebook:

  1. In the Zeppelin interface, go to the Interpreter settings, and edit the spark section by setting the Spark context mode to Per Note and Isolated. This will allow different notebooks to use different virtual environments.

  2. Create a new cell starting with %spark.conf and use it to provide variables for the Spark session:

    %spark.conf
    spark.submit.deployMode cluster
    spark.yarn.dist.archives s3a://<bucket_name>/<archive_name>.tar.gz#<alias>
    spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python
    spark.pyspark.python ./<alias>/bin/python
    

    Where:

    • spark.submit.deployMode cluster: Driver deploy mode.

      Make sure to run jobs in cluster mode to set up your virtual environment properly. For more information about the driver deploy mode, see Resource allocation.

    • spark.yarn.dist.archives s3a://<bucket_name>/<archive_name>.tar.gz#<alias>: Path to the archive with the prepared environment.

      Enter an environment alias of your choice after the # character. The alias will be used as the name of the subdirectory the archive will be unpacked to.

    • spark.yarn.appMasterEnv.PYSPARK_PYTHON ./<alias>/bin/python: Overrides the command that runs the Python interpreter for the YARN Application Master process.

    • spark.pyspark.python ./<alias>/bin/python: Overrides the command that runs the Python interpreter for the job driver.

    When executing the next cell that uses Spark, such as a cell in %spark.pyspark or %spark.sql mode, a Spark session will be created with the specified settings. The virtual environment dependencies will be available in this session.
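To check that the environment is actually picked up, you can run a quick verification in such a cell, assuming pandas was packed into the archive (a sketch):

    %spark.pyspark
    # Both imports must resolve from the unpacked virtual environment
    import sys
    import pandas
    print(sys.executable)       # expected to point inside the ./<alias> directory
    print(pandas.__version__)

If both imports succeed, the session is using the packed environment rather than the cluster's primary Python.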
