Yandex Data Processing resource relationships
Yandex Data Processing helps implement distributed data storage and processing based on the Apache Hadoop ecosystem.
Resources
The service's main entity is a cluster. It groups together all resources available in Hadoop, including storage and computing capacity.
Each cluster consists of subclusters, which group hosts that perform the same function:
- Subcluster with a master host (`masternode`), e.g., NameNode for HDFS or ResourceManager for YARN. Each cluster may only have one subcluster with a master host.
- Data storage subclusters (`Data` or `datanode`), e.g., DataNode for HDFS.
- Data processing subclusters (`Compute` or `computenode`), e.g., NodeManager for YARN (see the sketch after this list).
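On a running cluster, these roles are visible through the standard Hadoop and YARN command-line tools. Below is a minimal Python sketch that wraps two such commands; it assumes it is run on a cluster host where the `hdfs` and `yarn` binaries are available (the HDFS report may additionally require admin rights).

```python
import subprocess

def show(cmd):
    """Run a Hadoop CLI command and print its report."""
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)

# DataNodes registered with the NameNode (data storage subclusters)
show(["hdfs", "dfsadmin", "-report"])

# NodeManagers registered with the ResourceManager (data processing subclusters)
show(["yarn", "node", "-list"])
```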
Subclusters of a single cluster must reside in the same cloud network and availability zone.
Hosts in each subcluster are created with the computing resources defined by the specified host class. For a list of available host classes and their specifications, see Host classes.
VMs for cluster hosts can be hosted on:
- Regular Yandex Cloud hosts: physical servers for hosting cluster VMs. These hosts are selected randomly from the pool of available hosts that meet the requirements of the selected subcluster configuration.
- Dedicated Yandex Cloud hosts: physical servers that host only your VMs. These hosts can run both the cluster and any of your other services that support dedicated hosts. The hosts are selected from the dedicated host groups specified when creating a cluster. This placement option ensures physical isolation of the VMs; a Yandex Data Processing cluster on dedicated hosts retains all the features of a regular cluster.

For more information about dedicated hosts, see the Yandex Compute Cloud documentation.
For information about network configuration and network access to clusters, see Networking in Yandex Data Processing.
Warning
Changing host properties through the Yandex Compute Cloud interfaces may result in host failure. To change the cluster host settings, use the Yandex Data Processing interfaces, such as the management console, CLI, Terraform, or API.
Lightweight clusters
Starting from image version 2.0.39, you can use a lightweight cluster configuration without HDFS and data storage subclusters. For example, such a cluster may include only YARN and Spark. Lightweight clusters are faster to create and use host computing resources more efficiently. We recommend them for running individual jobs that process data in Spark or PySpark.
Benefits of lightweight clusters:
- The Spark Driver runs on the subcluster with the master host. This lets you allocate different resources to the subcluster with the master host, which runs the Spark Driver, and to the data processing subclusters, which run the Spark Executors (see the sketch after this list).
- In regular clusters, at least one Spark Driver and one Spark Executor instance runs on each data processing subcluster. In lightweight clusters, the Spark Driver can use all free resources of the subcluster with the master host, while the Spark Executors can use all free resources of the data processing subclusters. This improves host utilization.
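As a minimal sketch of this split, the Spark properties below size the driver for the subcluster with the master host and the executors for the data processing subclusters. All values are illustrative, not recommendations; driver settings must be supplied when the job is submitted, e.g., as job properties.

```python
# A minimal sketch: Spark properties that size the driver and the executors
# independently. Pass these as properties when submitting the job (driver
# settings take effect only at launch, not from inside a running job).
# All values are illustrative; match them to your subclusters' host classes.
spark_properties = {
    # Driver: runs on the subcluster with the master host
    "spark.driver.memory": "4g",
    "spark.driver.cores": "2",
    # Executors: run on the data processing subclusters
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.executor.instances": "4",
}
```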
Requirements for using lightweight clusters:
- The HDFS component is not selected.
- No data storage subclusters are used in a cluster.
- The cluster contains one or more data processing subclusters.
- The cluster settings specify a bucket in Yandex Object Storage (see the sketch after this list).
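To illustrate the last requirement: with no HDFS, a job on a lightweight cluster reads and writes data in Yandex Object Storage through the `s3a://` scheme. Here is a minimal PySpark sketch, with a hypothetical bucket name:

```python
from pyspark.sql import SparkSession

# A minimal sketch for a lightweight cluster: no HDFS, so all data
# lives in Object Storage and is addressed via the s3a:// scheme.
# "example-bucket" is a hypothetical placeholder.
spark = SparkSession.builder.appName("lightweight-demo").getOrCreate()

numbers = spark.range(1_000_000)  # a small generated dataset
squares = numbers.selectExpr("id", "id * id AS square")
squares.write.mode("overwrite").parquet("s3a://example-bucket/out/squares")
```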
For more information about resource allocation, see Spark jobs.
Security
Since a Yandex Data Processing cluster can run jobs without direct SSH access to the cluster, it writes job execution results to an S3 bucket for the user's convenience. The logs are written under the service account specified during cluster creation. For more information, see Service accounts.
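For example, you can browse these logs with any S3-compatible client. Below is a minimal boto3 sketch; the static access keys, bucket name, and key prefix are hypothetical placeholders.

```python
import boto3

# A minimal sketch: list job logs in the bucket via the S3-compatible API.
# Assumes static access keys with read access to the bucket; the bucket
# name and key prefix below are hypothetical placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.yandexcloud.net",  # Yandex Object Storage endpoint
    aws_access_key_id="<static key ID>",
    aws_secret_access_key="<secret key>",
)

response = s3.list_objects_v2(Bucket="example-logs-bucket", Prefix="dataproc/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```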
We recommend using at least two separate S3 buckets for a Yandex Data Processing cluster:
- One for the source data, where the service account has read-only access.
- Another one for the operation logs and results, where the service account has full access.
This separation minimizes the risk of accidentally modifying or deleting the source data.
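Here is a minimal PySpark sketch of this layout, with hypothetical bucket names and a hypothetical `event_date` field: the job reads from the source bucket and writes results only to the second one.

```python
from pyspark.sql import SparkSession

# A minimal sketch of the two-bucket layout. Bucket names are hypothetical:
# the service account reads from "example-source-data" (read-only) and
# writes to "example-job-output" (full access).
spark = SparkSession.builder.appName("two-bucket-job").getOrCreate()

events = spark.read.json("s3a://example-source-data/events/")  # read-only bucket
daily = events.groupBy("event_date").count()                   # hypothetical field
daily.write.mode("overwrite").parquet("s3a://example-job-output/reports/daily")
```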