

Diagnostics and troubleshooting of Spark application performance issues

Written by Yandex Cloud
Updated at January 29, 2025
  • Primary diagnostics of Spark application performance
  • Troubleshooting common performance issues
    • High garbage collection time
    • Multiple executors are competing for CPU resources
    • Regular heartbeat errors when running jobs

If your Spark applications are slow:

  • Check their operation to identify the root cause of performance issues.
  • Try using one of the methods for troubleshooting common issues.

Primary diagnostics of Spark application performance

If the performance of a Spark application is low, run primary diagnostics:

  • Check the application queue to make sure the application's performance is not blocked by other applications.

  • View the application details and check the status of jobs and the time when they actually started and completed on the Event Timeline chart.

  • Check the resources allocated to the application (see the sketch after this list):

    • Make sure the application has enough executors and the available executors are not idle.
    • Make sure the use of resources within a single executor is balanced.
  • Check the SQL query execution plans and the duration of individual tasks.

  • Check the application logs for warnings about failures.
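
The executor checks in the list above can be scripted with the standard Spark monitoring REST API exposed by the driver UI. The sketch below is a minimal example, assuming the driver UI of a running application is reachable on port 4040; the address and the idle/GC criteria are placeholders to adapt to your cluster.

import requests

BASE_URL = "http://localhost:4040/api/v1"  # assumption: address of the driver UI

# List the applications known to this UI and inspect their executors.
for app in requests.get(f"{BASE_URL}/applications", timeout=10).json():
    executors = requests.get(
        f"{BASE_URL}/applications/{app['id']}/executors", timeout=10
    ).json()
    for ex in executors:
        total_ms = ex.get("totalDuration", 0)  # time spent running tasks, ms
        gc_ms = ex.get("totalGCTime", 0)       # time spent in garbage collection, ms
        gc_share = gc_ms / total_ms if total_ms else 0.0
        is_idle = ex.get("activeTasks", 0) == 0 and ex.get("completedTasks", 0) == 0
        print(f"app={app['id']} executor={ex['id']} idle={is_idle} gc_share={gc_share:.0%}")

The same figures are available on the Executors tab of the Spark UI; the script only makes it easier to review many executors at once.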

Troubleshooting common performance issues

High garbage collection time

If you checked the resources allocated to the application and found that GC Time accounts for a large share of the total Task Time:

  • Make sure you have enough memory allocated to the executor.
  • Configure the garbage collector manually (see the sketch after this list). To learn how to do this, see the Apache Spark documentation.
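
The following is a minimal sketch of both steps for a PySpark application; the memory size and the G1GC options are placeholder values to tune for your workload, not recommended settings, and the property names are standard Spark configuration options.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-tuning-example")  # hypothetical application name
    # A larger executor heap makes garbage collection runs less frequent.
    .config("spark.executor.memory", "8g")
    # Manual collector configuration: for example, switch executors to G1GC.
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")
    .getOrCreate()
)

If the application is already running when you change these values, restart it: executor memory and JVM options only take effect when the executors are launched.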

Multiple executors are competing for CPU resources

With its default settings, the YARN scheduler does not take into account the CPU resources available on a node when allocating executors. This may slow down jobs that run resource-intensive computations.

To avoid this, enable an alternative algorithm for tracking resource availability during executor allocation by setting the following cluster-level property:

capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

For more information about the YARN scheduler, see the Hadoop documentation.
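
To check on a cluster host which resource calculator is actually in effect, you can read the generated capacity scheduler configuration. This is a minimal sketch, assuming the standard /etc/hadoop/conf location of the Hadoop configuration files on the host.

import xml.etree.ElementTree as ET

CONF_PATH = "/etc/hadoop/conf/capacity-scheduler.xml"  # assumption: default Hadoop config path

# Look up the resource-calculator property in the capacity scheduler config.
root = ET.parse(CONF_PATH).getroot()
for prop in root.iter("property"):
    if prop.findtext("name") == "yarn.scheduler.capacity.resource-calculator":
        print(prop.findtext("value"))
        break
else:
    # If the property is absent, YARN falls back to DefaultResourceCalculator,
    # which considers memory only.
    print("not set: DefaultResourceCalculator (memory only) is in effect")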

Regular heartbeat errors when running jobs

When running Spark jobs, executors send regular heartbeat messages to the driver informing it of executor status and task progress. If the driver does not get any heartbeat messages from an executor during a certain interval, it considers this executor unhealthy and requests the YARN resource manager to forcibly terminate it. In this case, the driver logs will contain a message of the following type:

23/02/23 20:22:09 WARN TaskSetManager: Lost task 28.0 in stage 13.0 (TID 242) 
        (rc1c-dataproc-*****.mdb.yandexcloud.net executor 5): ExecutorLostFailure 
        (executor 5 exited caused by one of the running tasks) 
        Reason: Executor heartbeat timed out after 138218 ms

Such errors may be caused by cluster networking issues. In practice, however, heartbeat timeouts most often occur because an executor runs out of memory. In that case, job logs may not even contain a java.lang.OutOfMemoryError entry, because logging itself fails due to the same memory shortage.

If you regularly get heartbeat errors when running jobs and there are no signs of network errors, increase the amount of RAM available to each parallel task. To do this, change your cluster's component properties (see the sketch below):

  • Reduce the number of CPU cores per executor in the spark.executor.cores parameter.
  • Increase the amount of RAM reserved per executor in the spark.executor.memory parameter.

For more information about these parameters, see the Spark documentation.
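
As a rough example, on an executor with 4 cores and 8 GB of RAM, halving spark.executor.cores roughly doubles the memory available to each parallel task. A minimal sketch of a per-application override in PySpark follows; the values are placeholders, and at the cluster level you would set the same Spark properties as component properties instead.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("heartbeat-tuning-example")  # hypothetical application name
    # Fewer concurrent tasks per executor: each running task gets more memory.
    .config("spark.executor.cores", "2")
    # A larger executor heap reduces the chance of the silent out-of-memory
    # condition that shows up as heartbeat timeouts.
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)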
