General questions about Yandex Data Processing

Written by

Yandex Cloud

Updated at July 16, 2026

Which clusters can I move to a different availability zone?

You can move lightweight clusters and HDFS clusters.

What should I do if data is unevenly distributed across storage subcluster hosts?

Connect to the cluster master host and run this command to rebalance the data:

sudo -u hdfs hdfs balancer

You can adjust the HDFS balancer parameters. For example, to change the maximum amount of data to transfer, add the following argument: -D dfs.balancer.max-size-to-move=<data-size-in-bytes>.
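As a rough illustration of the argument above, the snippet below converts a limit given in GiB to bytes and prints the resulting balancer command. The 10 GiB value is an arbitrary assumption, not a recommended setting.

```shell
# Convert the limit from GiB to bytes (10 GiB is an arbitrary example value).
limit_gib=10
limit_bytes=$((limit_gib * 1024 * 1024 * 1024))

# Print the command; run it on the master host once the value looks right.
echo "sudo -u hdfs hdfs balancer -D dfs.balancer.max-size-to-move=${limit_bytes}"
```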

Where can I view Yandex Data Processing cluster logs?

You can find cluster logs in its log group. To monitor the events of a cluster and its hosts, specify the relevant log group in the cluster settings when creating or updating the cluster. If you do not select a log group, logs are sent to and stored in the default log group of the cluster's folder. For more information, see Working with logs.

Can I get logs of my operations in Yandex Cloud?

Yes, you can request information about operations with your resources from Yandex Cloud logs. To do so, contact support.

Why is my cluster slow even though the computing resources are not fully utilized?

Your storage may have insufficient maximum IOPS and bandwidth to process the current number of requests. In this case, throttling occurs, which degrades the entire cluster performance.

The maximum IOPS and bandwidth values increase by a fixed value when the storage size increases by a certain step. The step and increment values depend on the disk type:

| Disk type | Step, GB | Max IOPS increase (read/write) | Max bandwidth increase (read/write), MB/s |
| --- | --- | --- | --- |
| `network-hdd` | 256 | 300/300 | 30/30 |
| `network-ssd` | 32 | 1,000/1,000 | 15/15 |
| `network-ssd-nonreplicated`, `network-ssd-io-m3` | 93 | 28,000/5,600 | 110/82 |
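The arithmetic behind the table can be sketched as follows. This is an estimate under stated assumptions: it counts only full steps, uses the read values from the table, and ignores any per-disk maximums, which this section does not list. The function name `estimate_limits` and the disk sizes are hypothetical.

```shell
# Estimate disk limits as (number of full steps) x (per-step increase).
# Arguments: size_gb step_gb iops_per_step bandwidth_per_step
estimate_limits() {
    steps=$(( $1 / $2 ))
    echo "steps=${steps} max_iops=$(( steps * $3 )) max_bandwidth_mbs=$(( steps * $4 ))"
}

# network-ssd, 256 GB: step 32 GB, +1,000 IOPS and +15 MB/s per step
estimate_limits 256 32 1000 15
# network-hdd, 1024 GB: step 256 GB, +300 IOPS and +30 MB/s per step
estimate_limits 1024 256 300 30
```

Under these assumptions, a 256 GB `network-ssd` disk spans 8 steps, and a 1024 GB `network-hdd` disk spans 4.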

To increase the maximum IOPS and bandwidth values and make throttling less likely, consider migrating to a different cluster with larger host storage or higher-performance disks. You can migrate data to a new cluster, e.g., using Apache Hive™ Metastore.

I get the `^M: bad interpreter` error when running the initialization script. How do I fix this?

Initialization scripts run on Linux (Ubuntu), which uses LF as the line ending. Scripts created in Windows use CR/LF line endings and may therefore fail with the ^M: bad interpreter error. To fix this error, save the script file with Linux (LF) line endings. For more information, see Syntax errors.
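If re-saving the file in an editor is not convenient, the carriage returns can also be stripped from the command line. The sketch below demonstrates this on a throwaway file; `to_unix_line_endings` is a hypothetical helper name, and the demo script content is made up.

```shell
# Hypothetical helper: remove every carriage return character from a file,
# leaving LF as the only line ending.
to_unix_line_endings() {
    tr -d '\r' < "$1" > "$1.lf" && mv "$1.lf" "$1"
}

# Demo on a temporary file written with Windows (CR/LF) line endings.
demo_script=$(mktemp)
printf '#!/bin/bash\r\necho "init done"\r\n' > "$demo_script"
to_unix_line_endings "$demo_script"
```

Note that the helper removes all carriage returns, including any that appear in the middle of a line, which is normally what you want for a shell script.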

When I run a PySpark job, I get an error related to `com/amazonaws/auth/AWSCredentialsProvider`. How do I fix this?

If a Yandex Data Processing cluster is connected to an Apache Hive™ Metastore cluster, you may get the following error when running PySpark jobs:

previously initiated loading for a different type with name "com/amazonaws/auth/AWSCredentialsProvider";

To fix this, add the spark:spark.sql.hive.metastore.sharedPrefixes property set to com.amazonaws,ru.yandex.cloud to the Yandex Data Processing cluster.

When using dynamic partition overwrites, I get an error related to `PathOutputCommitProtocol`. How do I fix it?

When data processing uses dynamic partition overwrites, you may get this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o264.parquet.
: java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite

To fix it, add the following properties to the Yandex Data Processing cluster:

spark:spark.sql.sources.partitionOverwriteMode : dynamic
spark:spark.sql.parquet.output.committer.class : org.apache.parquet.hadoop.ParquetOutputCommitter
spark:spark.sql.sources.commitProtocolClass : org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol

You can also add properties when creating a job.
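For a job launched with spark-submit, the same settings could be passed as `--conf` arguments. This is a sketch, not a confirmed recipe: it assumes the `spark:` prefix is needed only for cluster-level properties, and `job.py` is a hypothetical script name.

```shell
# Build the --conf arguments from the three properties above, without the
# cluster-level "spark:" prefix (assumed to be unnecessary at job level).
conf_args=""
for prop in \
    "spark.sql.sources.partitionOverwriteMode=dynamic" \
    "spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter" \
    "spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol"
do
    conf_args="${conf_args} --conf ${prop}"
done

# Print the resulting command line; job.py is a hypothetical script name.
echo "spark-submit${conf_args} job.py"
```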

Why does the `NAT should be enabled on the subnet` error occur and how do I fix it?

This error occurs when trying to create a Yandex Data Processing cluster in a subnet with no NAT gateway configured. To fix it, configure a network for Yandex Data Processing.

Why does the `Using fileUris is forbidden on lightweight cluster` error occur and how do I fix it?

This error occurs because the lightweight cluster configuration does not include HDFS. To fix it, create a cluster with HDFS support.

We also recommend using Yandex Object Storage buckets to work with jobs. You can upload the scripts your jobs run to these buckets. Each script is stored as an object that you can reference by link, so your jobs can use Object Storage links instead of links in file:/ format.

Why does the `Create Yandex Data Processing cluster Error: 0 Address space exhausted` error occur and how do I fix it?

This error means that your Yandex Data Processing cluster's subnet has no free IP addresses available for assignment to cluster hosts. To check how many IP addresses are available, view the list of addresses used in the subnet and its mask.
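To relate a subnet mask to its size, the total number of addresses in an IPv4 CIDR block can be computed from the prefix length. The sketch below counts every address in the block; some of them may be reserved by the platform, so the number available to cluster hosts will be lower. `subnet_size` is a hypothetical helper name.

```shell
# Total number of IPv4 addresses in a block with the given prefix length.
subnet_size() {
    echo $(( 1 << (32 - $1) ))
}

subnet_size 24   # a /24 block
subnet_size 28   # a /28 block
```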

To fix this error, do one of the following:

- Delete unused resources taking up the subnet's IP addresses.
- Create a subnet with a CIDR block that matches your cluster configuration. Next, create a Yandex Data Processing cluster in the new subnet.

For more information about subnet sizes, see this Yandex Virtual Private Cloud guide.

Why is my cluster's status `Unknown`?

If your cluster's status changed from Alive to Unknown:

Make sure you have set up a network for Yandex Data Processing. For a cluster to run, you need to create and configure the following network resources:
- Network
- Subnet
- NAT gateway
- Route table
- Security group
- Service account for the cluster
- Bucket to store job dependencies and results
Review the logs that describe the cluster state over the specified period:
```
yc logging read \
   --group-id=<log_group_ID> \
   --resource-ids=<cluster_ID> \
   --filter=log_type=yandex-dataproc-agent \
   --since 'YYYY-MM-DDThh:mm:ssZ' \
   --until 'YYYY-MM-DDThh:mm:ssZ'
```
In --since and --until, specify the period boundaries. Use the YYYY-MM-DDThh:mm:ssZ time format, e.g., 2020-08-10T12:00:00Z. Provide all times in UTC.
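The period boundaries can be generated rather than typed by hand. The sketch below assumes GNU date (as on Ubuntu) and an arbitrary one-hour window; it only prints the command, and the log group and cluster IDs remain placeholders.

```shell
# Assumes GNU date; "1 hour ago" is an arbitrary example window.
ts_format='+%Y-%m-%dT%H:%M:%SZ'
since_ts=$(date -u -d '1 hour ago' "$ts_format")
until_ts=$(date -u "$ts_format")

# Print the command with the period filled in; the IDs remain placeholders.
echo "yc logging read --group-id=<log_group_ID> --resource-ids=<cluster_ID> --filter=log_type=yandex-dataproc-agent --since '${since_ts}' --until '${until_ts}'"
```

The `-u` flag makes `date` emit UTC, which matches the requirement that all times be given in UTC.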

For more information, see Working with logs.

What is the minimum compute capacity required for a subcluster with a master host?

It depends on the driver deploy mode:

- In deployMode=cluster mode, when the driver runs on one of the cluster's compute hosts, a subcluster with the master host requires 4–8 CPU cores and 16 GB of RAM.
- In deployMode=client mode, when the driver runs on the cluster's master host, the compute capacity depends on the job logic and the number of concurrent jobs.
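When a job is launched with spark-submit, the deploy mode is typically chosen with the `--deploy-mode` flag. The sketch below only prints the command line for each mode; `submit_cmd` is a hypothetical helper and `job.py` a hypothetical application file.

```shell
# Print a spark-submit command line for the given driver deploy mode.
# job.py is a hypothetical application file.
submit_cmd() {
    case "$1" in
        cluster|client) echo "spark-submit --master yarn --deploy-mode $1 job.py" ;;
        *) echo "unknown deploy mode: $1" >&2; return 1 ;;
    esac
}

submit_cmd cluster   # driver runs on one of the compute hosts
submit_cmd client    # driver runs on the master host, as described above
```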

For more information on driver deploy modes and computing resource usage, see Resource allocation.

In Yandex Cloud, compute capacity depends on the host class. For the capacity of each class, see Host classes.

How do I upgrade my image version in Yandex Data Processing?

Yandex Data Processing has no native mechanism for image version upgrades. To upgrade your image version, create a new cluster.

To make sure you always use the latest version, automate the creation and removal of temporary Yandex Data Processing clusters using Yandex Managed Service for Apache Airflow™. To run jobs automatically, you can use Yandex DataSphere alongside Managed Service for Apache Airflow™.

How do I run jobs?

You can do this in several ways:

- Create jobs in Yandex Data Processing. Once created, they will run automatically.
- Run Apache Hive jobs using the Yandex Cloud CLI or Hive CLI.
- Run Spark or PySpark applications using Spark Shell, spark-submit, or the Yandex Cloud CLI.
- Use spark-submit to run jobs from remote hosts that are not part of the Yandex Data Processing cluster.
- Set up integration with Yandex Managed Service for Apache Airflow™ or Yandex DataSphere. This will automate running the jobs.

What limits apply to security groups?

You can create no more than five security groups per network. Each group may have a maximum of 50 rules. Learn more about limits in Yandex Virtual Private Cloud.

Can I get superuser privileges on hosts?

Yes. To switch to the superuser, enter the following command after connecting to the host:

  sudo su

However, you do not have to switch to the superuser: just use sudo.

How can I fix the no permission error when attaching a service account to a cluster?

Error message:

ERROR: rpc error: code = PermissionDenied desc = you do not have permission to access the requested service account or service account does not exist

This error occurs when you attach a service account to a cluster while creating or modifying it, and your Yandex Cloud account does not have permission to use that service account.

Solution
Assign the iam.serviceAccounts.user role or higher to your Yandex Cloud account.
