General questions about Yandex Data Processing
- What should I do if data on storage subcluster hosts is distributed unevenly?
- Why is the cluster slow even though the computing resources are not used fully?
- I get the "^M: bad interpreter" error when running the initialization script. How do I fix this?
- Why does the "NAT should be enabled on the subnet" error occur and how do I fix it?
- Why does the "Using fileUris is forbidden on lightweight cluster" error occur and how do I fix it?
- What is the minimum computing power required for a subcluster with a master host?
- How do I upgrade the image version in Yandex Data Processing?
- How can I fix the no permission error when connecting a service account to the cluster?
Which clusters can be moved to a different availability zone?
You can move lightweight clusters and HDFS clusters.
What should I do if data on storage subcluster hosts is distributed unevenly?
Connect to the cluster master host and run this command to rebalance the data:
sudo -u hdfs hdfs balancer
You can configure the balancer parameters. For example, to change the maximum amount of data to transfer, add the following argument: -D dfs.balancer.max-size-to-move=<data-size-in-bytes>.
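The size limit is given in bytes, which is easy to get wrong by hand. The following Python sketch builds the balancer command from a human-readable size. The helper name balancer_command and the choice of GiB as the input unit are assumptions made for illustration; only the command and the property name come from the answer above.

```python
def balancer_command(max_size_gib):
    """Build the HDFS balancer command with a transfer limit given in GiB.

    Hypothetical helper: the command and property name come from the text,
    the GiB input unit is an assumption of this sketch.
    """
    max_bytes = max_size_gib * 1024 ** 3  # the property expects bytes
    return [
        "sudo", "-u", "hdfs",
        "hdfs", "balancer",
        "-D", f"dfs.balancer.max-size-to-move={max_bytes}",
    ]


# Print the command line to run on the master host.
print(" ".join(balancer_command(10)))
```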
Where can I view Yandex Data Processing cluster logs?
You can find cluster logs in the cluster's log group. To track the events of a cluster and its individual hosts, specify the relevant log group in the cluster settings when creating or updating the cluster. If no log group is selected for the cluster, logs are sent to and stored in the default log group of the cluster's folder. For more information, see Working with logs.
Can I get logs of my operations with services?
Yes, you can request log records about your resources from Yandex Cloud services. For more information, see Data requests.
Why is the cluster slow even though the computing resources are not used fully?
The maximum storage IOPS and bandwidth values may be insufficient for processing the current number of requests. In this case, throttling is triggered and the performance of the entire cluster degrades.
The maximum IOPS and bandwidth values increase by a fixed value when the storage size increases by a certain step. The step and increment values depend on the disk type:
Disk type | Step, GB | Max IOPS increase (read/write) | Max bandwidth increase (read/write), MB/s |
---|---|---|---|
network-hdd | 256 | 300/300 | 30/30 |
network-ssd | 32 | 1,000/1,000 | 15/15 |
network-ssd-nonreplicated | 93 | 28,000/5,600 | 110/82 |
To increase the maximum IOPS and bandwidth values and make throttling less likely, consider switching to a different cluster with larger host storage or a faster disk type. You can transfer data to a new cluster, for example, using Hive Metastore.
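To get a rough idea of how storage size maps to limits, the table values can be applied per step. The Python sketch below is an estimate only: it assumes that only full steps count and that no upper cap applies, neither of which is stated above. The function name estimate_limits is hypothetical.

```python
# Step size and per-step increments copied from the table above.
# Each entry: (step in GB, (read IOPS, write IOPS), (read MB/s, write MB/s)).
DISK_STEPS = {
    "network-hdd": (256, (300, 300), (30, 30)),
    "network-ssd": (32, (1000, 1000), (15, 15)),
    "network-ssd-nonreplicated": (93, (28000, 5600), (110, 82)),
}


def estimate_limits(disk_type, size_gb):
    """Estimate max IOPS and bandwidth for a disk of the given size.

    Assumptions of this sketch: only full steps count, and the service
    applies no upper cap. Check the actual limits before relying on it.
    """
    step_gb, iops, bandwidth = DISK_STEPS[disk_type]
    steps = size_gb // step_gb
    return {
        "read_iops": steps * iops[0],
        "write_iops": steps * iops[1],
        "read_mbps": steps * bandwidth[0],
        "write_mbps": steps * bandwidth[1],
    }


print(estimate_limits("network-ssd", 128))
```

Comparing the estimate for the current and the planned storage size shows how much headroom a larger disk or a different disk type would add.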
I get the "^M: bad interpreter" error when running the initialization script. How do I fix this?
Initialization scripts run in Linux (Ubuntu). Scripts created in Windows may terminate with the ^M: bad interpreter error because they use CR/LF line endings, whereas Linux uses LF. To fix the error, save the script file with Linux (LF) line endings. For more information, see Syntax errors.
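If re-saving the file in an editor is not convenient, the line endings can be converted programmatically before the script is uploaded. This is a minimal Python sketch; the function names and the example file name are hypothetical.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows CR/LF line endings with Linux LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite a script file in place with LF line endings."""
    script = Path(path)
    script.write_bytes(to_unix_line_endings(script.read_bytes()))


# Example (hypothetical file name):
# convert_file("init-script.sh")
```

The conversion works on bytes so that the rest of the file, including its encoding, stays untouched.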
When I run a PySpark job, I get an error related to "com/amazonaws/auth/AWSCredentialsProvider". How do I fix this?
If a Yandex Data Processing cluster is connected to a Metastore cluster, you may get the following error when running PySpark jobs:
previously initiated loading for a different type with name "com/amazonaws/auth/AWSCredentialsProvider";
To fix this, add the spark:spark.sql.hive.metastore.sharedPrefixes property with the com.amazonaws,ru.yandex.cloud value to the Yandex Data Processing cluster.
When using dynamic partition overwrites, I get an error related to "PathOutputCommitProtocol". How do I fix it?
When data processing uses dynamic partition overwrites, you may get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o264.parquet.
: java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
To fix it, add the following properties to the Yandex Data Processing cluster:
spark:spark.sql.sources.partitionOverwriteMode : dynamic
spark:spark.sql.parquet.output.committer.class : org.apache.parquet.hadoop.ParquetOutputCommitter
spark:spark.sql.sources.commitProtocolClass : org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
You can also add properties when creating a job.
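When the properties are passed to an individual job through spark-submit rather than set on the cluster, they can be turned into --conf arguments. The Python sketch below assumes that the leading spark: is a cluster-level prefix and that spark-submit expects the key without it; verify this against your setup. The helper name to_spark_conf_args is hypothetical.

```python
# Cluster-level properties as listed above.
CLUSTER_PROPERTIES = {
    "spark:spark.sql.sources.partitionOverwriteMode": "dynamic",
    "spark:spark.sql.parquet.output.committer.class":
        "org.apache.parquet.hadoop.ParquetOutputCommitter",
    "spark:spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
}

PREFIX = "spark:"  # assumed to be a cluster-level prefix, not part of the Spark key


def to_spark_conf_args(properties):
    """Convert cluster-style properties into spark-submit --conf arguments."""
    args = []
    for key, value in properties.items():
        spark_key = key[len(PREFIX):] if key.startswith(PREFIX) else key
        args += ["--conf", f"{spark_key}={value}"]
    return args


print(" ".join(to_spark_conf_args(CLUSTER_PROPERTIES)))
```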
Why does the "NAT should be enabled on the subnet" error occur and how do I fix it?
This error occurs when trying to create a Yandex Data Processing cluster in a subnet with no NAT gateway configured. To fix it, configure a network for Yandex Data Processing.
Why does the "Using fileUris is forbidden on lightweight cluster" error occur and how do I fix it?
This error occurs because the lightweight cluster configuration does not include HDFS. To fix the error, create a cluster with HDFS support.
We also recommend using Yandex Object Storage buckets to work with jobs. You can upload the scripts that run your jobs to a bucket. The scripts are stored as objects that you can get links to. As a result, you can use Object Storage links instead of file:/ format links in your jobs.
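Before submitting a job to a lightweight cluster, it can help to check its file links for the file:/ format. The Python sketch below uses only the standard library; the function name and the example links are hypothetical.

```python
from urllib.parse import urlparse


def find_local_file_uris(uris):
    """Return the links that use the file:/ format."""
    return [uri for uri in uris if urlparse(uri).scheme == "file"]


# Hypothetical job links: one in a bucket, one on the local file system.
job_uris = ["s3a://my-bucket/scripts/job.py", "file:///home/user/job.py"]
for uri in find_local_file_uris(job_uris):
    print(f"Replace with an Object Storage link: {uri}")
```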
Why does the "Create Yandex Data Processing cluster Error: 0 Address space exhausted" error occur and how do I fix it?
The error means that your Yandex Data Processing cluster's subnet has run out of IPs that can be allocated to cluster hosts. To check how many IPs are available, view the list of addresses used in the subnet and its mask.
To fix the error, do one of the following:
- Delete the unnecessary resources taking up the subnet's IPs.
- Create a subnet with a CIDR block that suits your cluster's configuration. Next, create a Yandex Data Processing cluster in the new subnet.
For more information about subnet sizes, see the Yandex Virtual Private Cloud documentation.
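The size of a subnet follows directly from its CIDR mask, so the remaining capacity can be estimated from the list of used addresses. The Python sketch below gives an upper bound only: it does not account for any addresses the platform may reserve in each subnet. The function name and the sample addresses are hypothetical.

```python
import ipaddress


def free_addresses(cidr, used):
    """Estimate how many addresses in the subnet are still unallocated.

    This is an upper bound: addresses that the platform may reserve
    in each subnet are not subtracted.
    """
    network = ipaddress.ip_network(cidr)
    used_in_subnet = {
        ipaddress.ip_address(address)
        for address in used
        if ipaddress.ip_address(address) in network
    }
    return network.num_addresses - len(used_in_subnet)


# A /28 subnet holds 16 addresses in total.
print(free_addresses("10.1.0.0/28", ["10.1.0.3", "10.1.0.4"]))
```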
Why is my cluster's status "Unknown"?
If your cluster's status changed from Alive to Unknown:
- Make sure you have set up a network for Yandex Data Processing. For a cluster to run, you need to create and configure the following resources:
  - Network
  - Subnet
  - NAT gateway
  - Route table
  - Security group
  - Service account for the cluster
  - Bucket to store job dependencies and results
- Review the logs that describe the cluster status over the specified period:
  yc logging read \
    --group-id=<log_group_ID> \
    --resource-ids=<cluster_ID> \
    --filter=log_type=yandex-dataproc-agent \
    --since 'YYYY-MM-DDThh:mm:ssZ' \
    --until 'YYYY-MM-DDThh:mm:ssZ'
  In the --since and --until parameters, specify the period boundaries. Time format: YYYY-MM-DDThh:mm:ssZ, e.g., 2020-08-10T12:00:00Z. Use the UTC time zone.
For more information, see Working with logs.
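Converting local times to the required UTC format by hand is error-prone. The Python sketch below formats timezone-aware datetimes and assembles the command using only the flags shown above. The function names are hypothetical, and the IDs are placeholders to replace with your own.

```python
from datetime import datetime, timedelta, timezone


def format_utc(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def logging_read_command(group_id, cluster_id, since, until):
    """Assemble the yc logging read command for the given period."""
    return [
        "yc", "logging", "read",
        f"--group-id={group_id}",
        f"--resource-ids={cluster_id}",
        "--filter=log_type=yandex-dataproc-agent",
        "--since", format_utc(since),
        "--until", format_utc(until),
    ]


# Example: the last hour, with placeholder IDs.
now = datetime.now(timezone.utc)
command = logging_read_command(
    "<log_group_ID>", "<cluster_ID>", now - timedelta(hours=1), now
)
print(" ".join(command))
```

Pass datetimes with an explicit time zone: a naive datetime would be interpreted as local time during the conversion.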
What is the minimum computing power required for a subcluster with a master host?
It depends on the driver deploy mode:
- In deployMode=cluster mode, when the driver is deployed on one of the cluster's compute hosts, 4-8 CPU cores and 16 GB RAM are sufficient for the subcluster with the master host.
- In deployMode=client mode, when the driver is deployed on the cluster's master host, the computing power depends on the job logic and the number of concurrent jobs.
For more information about driver deploy modes and computing resource consumption, see Resource allocation.
In Yandex Cloud, computing power depends on the host class. To see which resources each class provides, see Host classes.
How do I upgrade the image version in Yandex Data Processing?
The service has no built-in mechanism for image version upgrades. To upgrade your image version, create a new cluster.
To make sure the version you use is always up-to-date, automate the creation and removal of temporary Yandex Data Processing clusters using Yandex Managed Service for Apache Airflow™. Besides Managed Service for Apache Airflow™, you can also use Yandex DataSphere to run jobs automatically.
How do I run jobs?
There are several ways to do it:
- Create jobs in Yandex Data Processing. Once created, they will run automatically.
- Run Apache Hive jobs using the Yandex Cloud CLI or Hive CLI.
- Run Spark or PySpark applications using Spark Shell, spark-submit, or the Yandex Cloud CLI.
- Use spark-submit to run jobs from remote hosts that are not part of the Yandex Data Processing cluster.
- Set up integration with Yandex Managed Service for Apache Airflow™ or Yandex DataSphere. This will automate running the jobs.
What security group limits are there?
You can create no more than five security groups per network. Each group may have a maximum of 50 rules. Learn more about limits in Yandex Virtual Private Cloud.
Can I get superuser permissions on hosts?
Yes. To switch to superuser, enter the following command after connecting to the host:
sudo su
However, you do not have to switch to superuser: just use sudo.
How can I fix the no permission error when connecting a service account to the cluster?
Error message:
ERROR: rpc error: code = PermissionDenied desc = you do not have permission to access the requested service account or service account does not exist
This error occurs if you link a service account to a cluster while creating or modifying it and your account does not have permission to use that service account.
To fix this error, assign your Yandex Cloud account the iam.serviceAccounts.user role or higher.