Resource relationships in Yandex Managed Service for Apache Spark™

Written by Yandex Cloud
Updated at October 20, 2025

In this article:
  • Concepts and entities
    • Host pools
  • Job
  • Parallelism and autoscaling

Yandex Managed Service for Apache Spark™ is a managed service for deploying Apache Spark™ clusters and running Spark/PySpark jobs in the Yandex Cloud infrastructure.

Yandex Managed Service for Apache Spark™ enables you to:

  • Deploy clusters and manage them via the management console, Terraform, and API, without the need to administer clusters and VMs manually.
  • Configure separate independent host pools for drivers and executors.
  • Run jobs in Python, Scala, and Java while flexibly managing their settings.
  • Define job dependencies in a declarative way, including pip, .deb, and Maven packages.
  • Track job progress in the Spark app web interface and analyze results using logs in Yandex Cloud Logging and Spark History Server.
  • Export job results to Object Storage, databases, and analytical data marts.
  • Integrate with Apache Hive™ Metastore in the cluster configuration or in the job parameters.
  • Set the preferred time and day of week for cluster maintenance.

Concepts and entities

The main entity Yandex Managed Service for Apache Spark™ works with is the Apache Spark™ cluster.

Running a job in a cluster creates a Spark application. This application includes the following processes:

  • Driver: This is the coordinator process that builds the job execution plan, breaks the work down into stages and tasks, tracks the order of task execution, and requests the creation and release of executors.
  • Executors: These are processes handling individual tasks within a job.
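
For illustration, the minimal PySpark sketch below (the application name and data are arbitrary) shows this division of labor: the transformations only extend the plan held by the driver, and it is the final action that makes the driver schedule tasks on the executors.

```python
# Minimal PySpark sketch (illustrative only; the application name is arbitrary).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()

# Transformations: the driver only records them in the execution plan.
df = spark.range(0, 1_000_000)
squares = df.selectExpr("id", "id * id AS sq")

# Action: the driver splits the plan into stages and tasks and has the
# executors run one task per partition; results are collected on the driver.
result = squares.selectExpr("sum(sq) AS total").collect()
print(result[0]["total"])

spark.stop()
```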

Host pools

An Apache Spark™ cluster consists of three host groups (pools):

  • Driver host pool to run drivers and some service components (networking, logging, metrics, etc.).

  • Executor host pool to run executors and some service components.

  • Auxiliary pool to run Spark History Server and other system components.

You can configure the host pools for drivers and executors by specifying the number of hosts in each pool and their class. The number of hosts in each of the two pools can be either fixed or autoscalable. The pools are configured when you create or update an Apache Spark™ cluster.

Job

A job is a specification for running a Spark application on the selected Apache Spark™ cluster.

A job includes:

  • Target Apache Spark™ cluster.

    Note

    Some cluster configuration elements set default job parameters which can be overridden at the job level. You cannot override the following: network settings, service account, .deb packages, use of Spark History Server, and logging settings.

  • Job type: Spark for Java and Scala or PySpark for Python.

  • Executable module and entry point: JAR file or main class for a Spark job or a PY file for a PySpark job.

  • Arguments for running the job.

  • Spark properties: Job run resources, dynamic allocation and parallelism parameters, SQL parameters, etc.

  • Libraries and files: JAR files, Maven packages, additional files and archives.
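
For illustration only, a PySpark entry point for such a job might look like the sketch below. The argument order, storage URIs, and the column used in the aggregation are assumptions made for this example, not defaults of the service.

```python
# job.py: a sketch of a PySpark job entry point. The argument layout, storage
# URIs, and the "category" column are placeholders, not service-defined values.
import sys

from pyspark.sql import SparkSession


def main() -> None:
    # Job arguments are passed to the entry point, e.g. input and output locations.
    input_uri, output_uri = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("example-job").getOrCreate()

    # Spark properties set in the job specification (resources, dynamic allocation,
    # SQL parameters, etc.) are visible to the application through spark.conf.
    print("shuffle partitions:", spark.conf.get("spark.sql.shuffle.partitions"))

    # A simple aggregation whose result is written back to storage.
    df = spark.read.parquet(input_uri)
    df.groupBy("category").count().write.mode("overwrite").parquet(output_uri)

    spark.stop()


if __name__ == "__main__":
    main()
```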

Parallelism and autoscaling

A single driver host can run one or more jobs at the same time. The exact number depends on the following:

  • Driver host class, i.e., memory size and number of cores per host.
  • Resources requested by each job.
  • Host resources reserved for internal needs.
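
As a rough, purely illustrative calculation (the host class, per-job requests, and the amount reserved for internal needs below are assumed numbers, not service guarantees), these factors bound per-host driver concurrency as follows:

```python
# Assumed numbers for illustration only: a driver host with 4 cores and 16 GiB RAM,
# 1 core and 2 GiB reserved for internal needs, and each job requesting
# spark.driver.cores=1 and spark.driver.memory=2g.
host_cores, host_memory_gib = 4, 16
reserved_cores, reserved_memory_gib = 1, 2
job_driver_cores, job_driver_memory_gib = 1, 2

by_cores = (host_cores - reserved_cores) // job_driver_cores                  # 3
by_memory = (host_memory_gib - reserved_memory_gib) // job_driver_memory_gib  # 7
print(min(by_cores, by_memory))  # about 3 concurrent drivers, i.e. 3 jobs per host
```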

The number of driver hosts can be:

  • Fixed, if autoscaling is disabled for driver hosts.
  • Autoscalable, i.e., changing dynamically depending on the number of available cores and memory size, if autoscaling is enabled for driver hosts.

Overall, the number of jobs an Apache Spark™ cluster can run simultaneously depends on the following factors:

  • Specified number of driver hosts or autoscaling settings.
  • Host class.
  • Parameters of each job.
  • Amount of resources reserved for internal needs.
