General questions about Yandex Data Proc
- What do I do if data on storage subcluster hosts is distributed unevenly?
- Why is the cluster slow even though the computing resources are not used fully?
- I get the "^M: bad interpreter" error when running the initialization script. How do I fix this?
- Why does the "NAT should be enabled on the subnet" error occur and how do I fix it?
- Why does the "Using fileUris is forbidden on lightweight cluster" error occur and how do I fix it?
- What is the minimum computing power required for a subcluster to run with a master host?
What clusters can I move to a different availability zone?
You can move lightweight clusters and HDFS clusters. Migration of Hive Metastore clusters is not available. If you use such clusters in the ru-central1-c availability zone, Yandex Cloud technical support will notify you once the relevant migration guide is available.
What do I do if data on storage subcluster hosts is distributed unevenly?
Connect to the cluster master host and run the command to rebalance the data:
sudo -u hdfs hdfs balancer
You can configure the balancer parameters. For example, to change the maximum amount of data to transfer, add the -D dfs.balancer.max-size-to-move=<data-size-in-bytes> argument.
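Because the size is given in bytes, a small conversion helper can prevent off-by-a-factor mistakes. The following is a minimal sketch, not part of the service: the gib_to_bytes function and the 20 GiB value are illustrative assumptions, and the script only prints the resulting command so that you can review it before running it on the master host.

```shell
# Hypothetical helper: convert a whole number of GiB to bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Illustrative value only: 20 GiB expressed in bytes.
max_move="$(gib_to_bytes 20)"

# Print the command instead of running it, so the sketch is safe to try anywhere.
echo "sudo -u hdfs hdfs balancer -D dfs.balancer.max-size-to-move=${max_move}"
```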
Where can I view Yandex Data Proc cluster logs?
You can find cluster logs in the cluster's log group. To track the events of a cluster and its individual hosts, specify the relevant log group in the cluster settings when creating or updating the cluster. If no log group has been selected for the cluster, the default log group in the cluster's folder will be used to send and store logs. For more information, see Working with logs.
Can I get logs of my operations with services?
Yes, you can request log records about your resources from Yandex Cloud services. For more information, see Data requests.
Why is the cluster slow even though the computing resources are not used fully?
The maximum storage IOPS and bandwidth values may be insufficient for processing the current number of requests. In this case, throttling is triggered and the performance of the entire cluster degrades.
The maximum IOPS and bandwidth values increase by a fixed value when the storage size increases by a certain step. The step and increment values depend on the disk type:
Disk type | Step, GB | Max IOPS increase (read/write) | Max bandwidth increase (read/write), MB/s |
---|---|---|---|
network-hdd | 256 | 300/300 | 30/30 |
network-ssd | 32 | 1,000/1,000 | 15/15 |
network-ssd-nonreplicated | 93 | 28,000/5,600 | 110/82 |
To increase the maximum IOPS and bandwidth values and make throttling less likely, consider switching to a different cluster with larger host storage or a faster disk type. You can transfer data to a new cluster, for example, using Hive Metastore.
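To see how the step rule plays out for a given disk size, the arithmetic can be sketched as below. This is a rough estimate under two assumptions that this section does not confirm: only full steps count, and no absolute per-disk cap applies. The scaled_limit function name is hypothetical; the step and increment values come from the table above.

```shell
# Hypothetical helper: estimate a limit that grows by <increment> for every full <step> GB.
# Arguments: disk size in GB, step in GB, increment per step.
scaled_limit() {
  echo $(( $1 / $2 * $3 ))
}

# network-ssd, 128 GB disk: step 32 GB, +1,000 IOPS and +15 MB/s per step.
echo "Estimated max IOPS: $(scaled_limit 128 32 1000)"
echo "Estimated max bandwidth, MB/s: $(scaled_limit 128 32 15)"
```

Running the same calculation for a larger disk size or a different disk type shows how much headroom a storage change would add before you create a new cluster.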
I get the "^M: bad interpreter" error when running the initialization script. How do I fix this?
Since the script runtime environment is Linux (Ubuntu), scripts created in Windows may fail with the ^M: bad interpreter error because they use the CR/LF newline sequence (in Linux, it is LF). To fix the error, save the script file in Linux format. For more information, see Syntax errors.
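If re-saving the file in an editor is not convenient, the line endings can also be converted on a Linux host. The sketch below assumes GNU sed (the default on Ubuntu); the strip_cr function name and the throwaway demo file are illustrative.

```shell
# Remove a trailing carriage return (CR) from every line of a file, in place.
# Assumes GNU sed, which Ubuntu ships by default.
strip_cr() {
  sed -i 's/\r$//' "$1"
}

# Demo on a throwaway file that imitates a script saved with CR/LF line endings.
demo="$(mktemp)"
printf '#!/bin/bash\r\necho "init done"\r\n' > "$demo"
strip_cr "$demo"

# Count the lines that still contain a CR character.
cr="$(printf '\r')"
echo "Lines with CR left: $(grep -c "$cr" "$demo")"
rm -f "$demo"
```

For a real script, call strip_cr with the path to your initialization script before uploading it.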
Why does the "NAT should be enabled on the subnet" error occur and how do I fix it?
This error occurs when trying to create a Yandex Data Proc cluster in a subnet with no NAT gateway configured. To fix it, configure a network for Yandex Data Proc.
Why does the "Using fileUris is forbidden on lightweight cluster" error occur and how do I fix it?
This error occurs because the lightweight cluster configuration does not include HDFS. To fix the error, create a cluster with HDFS support.
We also recommend using Yandex Object Storage buckets to work with jobs. You can upload the scripts that run your jobs to a bucket, where they are stored as objects that you can get links to. You can then use Object Storage links instead of file:/ format links in your jobs.
Why does the "Create Yandex Data Proc cluster Error: 0 Address space exhausted" error occur and how do I fix it?
The error means that your Yandex Data Proc cluster's subnet has run out of IPs that can be allocated to cluster hosts. To check how many IPs are available, view the list of addresses used in the subnet and its mask.
To fix the error, do one of the following:
- Delete the unnecessary resources taking up the subnet's IPs.
- Create a subnet with a CIDR that suits your cluster's configuration. Then create a Yandex Data Proc cluster in the new subnet.
For more information about subnet sizes, see the Yandex Virtual Private Cloud documentation.
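When choosing a CIDR, it can help to check how many addresses a given prefix length provides. The following is a minimal sketch: the cidr_size function and the example CIDR values are hypothetical, and the result should be treated as an upper bound, since some addresses in a subnet are typically reserved or already taken by other resources.

```shell
# Hypothetical helper: total number of IPv4 addresses in a CIDR block such as 10.1.0.0/24.
cidr_size() {
  prefix="${1#*/}"                 # keep only the part after the slash
  echo $(( 1 << (32 - prefix) ))   # 2 to the power of the number of host bits
}

# Example CIDR values are illustrative.
for cidr in 10.1.0.0/28 10.1.0.0/24 10.1.0.0/20; do
  echo "$cidr: $(cidr_size "$cidr") addresses"
done
```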
Why is my cluster's status Unknown?
If your cluster's status changed from Alive to Unknown:
- Make sure you have set up a network for Yandex Data Proc. For a cluster to run, you need to create and configure the following resources:
  - Network
  - Subnet
  - NAT gateway
  - Route table
  - Security group
  - Service account for the cluster
  - Bucket to store job dependencies and results
- Review the logs that describe the cluster status over the specified period:
  yc logging read \
    --group-id=<log_group_ID> \
    --resource-ids=<cluster_ID> \
    --filter=log_type=yandex-dataproc-agent \
    --since 'YYYY-MM-DDThh:mm:ssZ' \
    --until 'YYYY-MM-DDThh:mm:ssZ'
  Specify the period in the --since and --until parameters in YYYY-MM-DDThh:mm:ssZ format, e.g., 2020-08-10T12:00:00Z. The time must be in UTC.
  For more information, see Working with logs.
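The timestamps for the log query can be generated rather than typed by hand. The sketch below assumes GNU date (its -d option), as found on Ubuntu; the utc_stamp function and the one-hour window are illustrative, and the log group and cluster ID placeholders are left unfilled. It prints the command instead of running it.

```shell
# Format a point in time as YYYY-MM-DDThh:mm:ssZ in UTC.
# Assumes GNU date (the -d option), as found on Ubuntu.
utc_stamp() {
  date -u -d "$1" +%Y-%m-%dT%H:%M:%SZ
}

# Illustrative window: the last hour.
since="$(utc_stamp '1 hour ago')"
until_ts="$(utc_stamp 'now')"

# Print the command with the ID placeholders left for you to fill in.
echo "yc logging read --group-id=<log_group_ID> --resource-ids=<cluster_ID> --filter=log_type=yandex-dataproc-agent --since '${since}' --until '${until_ts}'"
```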
What is the minimum computing power required for a subcluster to run with a master host?
It depends on the driver deploy mode:
- In deployMode=cluster mode, when the driver is deployed on one of the cluster's compute hosts, 4–8 CPU cores and 16 GB RAM are sufficient for the subcluster with the master host.
- In deployMode=client mode, when the driver is deployed on the cluster's master host, the computing power depends on the job logic and the number of concurrently running jobs.
For more information about driver deploy modes and computing resource consumption, see Resource allocation.
In Yandex Cloud, the computing power depends on the host class. To see what each class provides, see Host classes.
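With spark-submit, the deploy mode is commonly selected with the --deploy-mode option. The sketch below only composes and prints the two command lines for comparison; the spark_submit_cmd function, the app.py file name, and the choice of --master yarn are assumptions for illustration, not settings confirmed by this section.

```shell
# Hypothetical helper: compose a spark-submit command line for a given deploy mode.
# app.py is a placeholder application file.
spark_submit_cmd() {
  mode="$1"   # "cluster" or "client"
  echo "spark-submit --master yarn --deploy-mode ${mode} app.py"
}

# Driver runs on one of the compute hosts.
spark_submit_cmd cluster
# Driver runs on the host where the command is submitted, e.g., the master host.
spark_submit_cmd client
```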
How do I upgrade the image version in Yandex Data Proc?
The service has no built-in mechanism for updating image versions. To update the version of your image, create a new cluster.
To make sure the version you use is always up-to-date, automate the creation and removal of temporary Yandex Data Proc clusters using Yandex Managed Service for Apache Airflow™. To run jobs automatically, apart from Managed Service for Apache Airflow™ you can also use Yandex DataSphere.
How do I run jobs?
There are several ways to do it:
- Create jobs in Yandex Data Proc. Once created, they will run automatically.
- Run Apache Hive jobs using the Yandex Cloud CLI or Hive CLI.
- Run Spark or PySpark applications using Spark Shell, spark-submit, or the Yandex Cloud CLI.
- Use spark-submit to run jobs from remote hosts that are not part of the Yandex Data Proc cluster.
- Set up integration with Yandex Managed Service for Apache Airflow™ or Yandex DataSphere. This will automate running the jobs.
What security group limits are there?
You can create no more than five security groups per network. Each group may have a maximum of 50 rules. Learn more about limits in Yandex Virtual Private Cloud.