Resource relationships in Yandex Managed Service for Apache Spark™
Yandex Managed Service for Apache Spark™ is a managed service for deploying Apache Spark™ clusters and running Spark/PySpark jobs in the Yandex Cloud infrastructure.
Yandex Managed Service for Apache Spark™ enables you to:
- Deploy clusters and manage them via the management console, Terraform, and API, without the need to administer clusters and VMs manually.
- Configure separate independent host pools for drivers and executors.
- Run jobs in Python, Scala, and Java while flexibly managing their settings.
- Define job dependencies in a declarative way, including pip, .deb, and Maven packages.
- Track job progress in the Spark application web interface and analyze results using logs in Yandex Cloud Logging and Spark History Server.
- Export job results to Object Storage, databases, and analytical data marts.
- Integrate with Apache Hive™ Metastore in the cluster configuration or in the job parameters.
- Set the preferred time and day of week for cluster maintenance.
Concepts and entities
The main entity Yandex Managed Service for Apache Spark™ operates with is an Apache Spark™ cluster.
Running a job in a cluster creates a Spark application. This application includes the following processes:
- Driver: This is the coordinator process that builds the job execution plan, breaks the work down into stages and tasks, tracks task execution, and sends requests to create executors and releases them.
- Executors: These are processes handling individual tasks within a job.
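For illustration, here is a minimal PySpark sketch of how this split plays out (the application name and numbers are arbitrary): the script itself runs in the driver process, which records the plan and only distributes tasks to executors when an action is triggered.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()

# The driver only records the plan at this point: nothing runs yet.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=16)
squares = rdd.map(lambda x: x * x)

# Calling an action makes the driver split the plan into stages and
# tasks and send them to executors, which compute partial results
# that the driver then combines.
total = squares.sum()
print(f"Sum of squares: {total}")

spark.stop()
```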
Host pools
An Apache Spark™ cluster consists of three host groups (pools):
- Driver host pool to run drivers and some service components (networking, logging, metrics, etc.).
- Executor host pool to run executors and some service components.
- Auxiliary pool to run Spark History Server and other system components.
You can configure the driver and executor host pools by specifying the host class and the number of hosts in each pool. The number of hosts in each of these two pools can be either fixed or autoscaled. The pools are configured when you create or update an Apache Spark™ cluster.
Job
A job is a specification for running a Spark application on the selected Apache Spark™ cluster.
A job includes:
- Target Apache Spark™ cluster.
  Note
  Some cluster configuration elements set the default job parameters, which can be overridden at the job level. You cannot override the following: network settings, service account, .deb packages, use of Spark History Server, and logging settings.
- Job type: Spark for Java and Scala or PySpark for Python.
- Executable module and entry point: JAR file or main class for a Spark job, or a PY file for a PySpark job.
- Arguments for running the job.
- Spark properties: job run resources, dynamic allocation and parallelism parameters, SQL parameters, etc.
- Libraries and files: JAR files, Maven packages, additional files, and archives.
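As a hypothetical example of these components, the sketch below shows what the executable module of a PySpark job might look like: a PY file with an entry point that receives the job arguments (the Object Storage paths and column names are made up) and reads a Spark property supplied in the job specification.
```python
# Hypothetical entry point (PY file) for a PySpark job: the target
# cluster, Spark properties, and dependencies come from the job
# specification; the script only consumes its arguments.
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def main(source_path: str, target_path: str) -> None:
    spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

    # Spark properties from the job specification are visible here,
    # e.g. parallelism settings.
    shuffle_partitions = spark.conf.get("spark.sql.shuffle.partitions")
    print(f"spark.sql.shuffle.partitions = {shuffle_partitions}")

    df = spark.read.parquet(source_path)
    result = df.groupBy("region").agg(F.sum("amount").alias("total"))

    # Export results to Object Storage (the s3a paths are assumptions).
    result.write.mode("overwrite").parquet(target_path)
    spark.stop()


if __name__ == "__main__":
    # Job arguments, e.g.:
    #   s3a://my-bucket/raw/sales/ s3a://my-bucket/marts/sales_by_region/
    main(sys.argv[1], sys.argv[2])
```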
Parallelism and autoscaling
One driver host may simultaneously run one or multiple jobs. The exact number depends on the following:
- Driver host class, i.e., memory size and number of cores per host.
- Resources requested by each job.
- Host resources reserved for internal needs.
The number of driver hosts can be:
- Fixed, if autoscaling is disabled for driver hosts.
- Autoscalable, i.e., changing dynamically depending on the number of available cores and memory size, if autoscaling is enabled for driver hosts.
Overall, the number of jobs an Apache Spark™ cluster can run simultaneously depends on the following factors:
- Specified number of driver hosts or autoscaling settings.
- Host class.
- Parameters of each job.
- Amount of resources reserved for internal needs.
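To make the job-level side of this concrete, below is an illustrative set of standard Spark properties that shape how much of the cluster a single job consumes. The specific values and the Python dict format are assumptions; in the service they would normally be supplied via the job's Spark properties rather than in code.
```python
# Illustrative Spark properties for one job; the keys are standard
# Spark settings, the values are arbitrary examples.
job_spark_properties = {
    # Driver resources: together with the driver host class, these
    # affect how many jobs can share one driver host.
    "spark.driver.cores": "1",
    "spark.driver.memory": "2g",
    # Executor resources: affect how many executors fit on one
    # executor host.
    "spark.executor.cores": "2",
    "spark.executor.memory": "4g",
    # Dynamic allocation lets the job grow and shrink its executor set
    # within these bounds as the workload changes.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "8",
    # Parallelism of shuffle stages.
    "spark.sql.shuffle.partitions": "64",
}
```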