

Diagnostics and troubleshooting of Spark application performance issues

Written by Yandex Cloud
Updated at January 29, 2025
  • Primary diagnostics of Spark application performance
  • Troubleshooting common performance issues
    • High garbage collection time
    • Multiple executors are competing for CPU resources
    • Regular heartbeat errors when running jobs

If your Spark applications are slow:

  • Check their operation to identify the root cause of performance issues.
  • Try using one of the methods for troubleshooting common issues.

Primary diagnostics of Spark application performance

If the performance of a Spark application is low, run primary diagnostics:

  • Check the application queue to make sure the application's performance is not blocked by other applications.

  • View the application details and check the status of jobs and the time when they actually started and completed on the Event Timeline chart.

  • Check the resources allocated to the application (see the sketch after this list):

    • Make sure the application has enough executors and the available executors are not idle.
    • Make sure the use of resources within a single executor is balanced.
  • Check the SQL query execution plans and the duration of individual tasks.

  • Check the application logs for warnings about failures.
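
The executor checks in the list above can be scripted with the standard Spark monitoring REST API exposed by the driver UI. The sketch below is a minimal example, assuming the driver UI of a running application is reachable on port 4040; the address and the idle/GC criteria are placeholders to adapt to your cluster.

import requests

BASE_URL = "http://localhost:4040/api/v1"  # assumption: address of the driver UI

# List the applications known to this UI and inspect their executors.
for app in requests.get(f"{BASE_URL}/applications", timeout=10).json():
    executors = requests.get(
        f"{BASE_URL}/applications/{app['id']}/executors", timeout=10
    ).json()
    for ex in executors:
        total_ms = ex.get("totalDuration", 0)  # time spent running tasks, ms
        gc_ms = ex.get("totalGCTime", 0)       # time spent in garbage collection, ms
        gc_share = gc_ms / total_ms if total_ms else 0.0
        is_idle = ex.get("activeTasks", 0) == 0 and ex.get("completedTasks", 0) == 0
        print(f"app={app['id']} executor={ex['id']} idle={is_idle} gc_share={gc_share:.0%}")

The same figures are available on the Executors tab of the Spark UI; the script only makes it easier to review many executors at once.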

Troubleshooting common performance issues

High garbage collection time

If you checked the resources allocated to the application and found that GC Time accounts for a large share of the total Task Time:

  • Make sure you have enough memory allocated to the executor.
  • Configure the garbage collector manually (see the sketch after this list). To learn how to do this, see the Apache Spark documentation.
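
The following is a minimal sketch of both steps for a PySpark application; the memory size and the G1GC options are placeholder values to tune for your workload, not recommended settings, and the property names are standard Spark configuration options.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-tuning-example")  # hypothetical application name
    # A larger executor heap makes garbage collection runs less frequent.
    .config("spark.executor.memory", "8g")
    # Manual collector configuration: for example, switch executors to G1GC.
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")
    .getOrCreate()
)

If the application is already running when you change these values, restart it: executor memory and JVM options only take effect when the executors are launched.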

Multiple executors are competing for CPU resources

With its default settings, the YARN scheduler does not take into account the CPU resources available on a node when allocating executors. This may slow down jobs that run resource-intensive computations.

To avoid this, enable an alternative algorithm for tracking resource availability during executor allocation by setting the following cluster-level property:

capacity-scheduler:yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DominantResourceCalculator

For more information about the YARN scheduler, see the Hadoop documentation.
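
To check on a cluster host which resource calculator is actually in effect, you can read the generated capacity scheduler configuration. This is a minimal sketch, assuming the standard /etc/hadoop/conf location of the Hadoop configuration files on the host.

import xml.etree.ElementTree as ET

CONF_PATH = "/etc/hadoop/conf/capacity-scheduler.xml"  # assumption: default Hadoop config path

# Look up the resource-calculator property in the capacity scheduler config.
root = ET.parse(CONF_PATH).getroot()
for prop in root.iter("property"):
    if prop.findtext("name") == "yarn.scheduler.capacity.resource-calculator":
        print(prop.findtext("value"))
        break
else:
    # If the property is absent, YARN falls back to DefaultResourceCalculator,
    # which considers memory only.
    print("not set: DefaultResourceCalculator (memory only) is in effect")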

Regular heartbeat errors when running jobs

When running Spark jobs, executors send regular heartbeat messages to the driver informing it of executor status and task progress. If the driver does not get any heartbeat messages from an executor during a certain interval, it considers this executor unhealthy and requests the YARN resource manager to forcibly terminate it. In this case, the driver logs will contain a message of the following type:

23/02/23 20:22:09 WARN TaskSetManager: Lost task 28.0 in stage 13.0 (TID 242) 
        (rc1c-dataproc-*****.mdb.yandexcloud.net executor 5): ExecutorLostFailure 
        (executor 5 exited caused by one of the running tasks) 
        Reason: Executor heartbeat timed out after 138218 ms

Such errors may be caused by cluster networking issues. In practice, however, heartbeat timeouts most often occur because an executor runs out of memory. In that case, job logs may not even contain a java.lang.OutOfMemoryError entry, because logging itself fails due to the same memory shortage.

If you regularly get heartbeat errors when running jobs and there are no signs of network errors, increase the amount of RAM available to each parallel task. To do this, change your cluster's component properties (see the sketch below):

  • Reduce the number of CPU cores per executor in the spark.executor.cores parameter.
  • Increase the amount of RAM reserved per executor in the spark.executor.memory parameter.

For more information about these parameters, see the Spark documentation.
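
As a rough example, on an executor with 4 cores and 8 GB of RAM, halving spark.executor.cores roughly doubles the memory available to each parallel task. A minimal sketch of a per-application override in PySpark follows; the values are placeholders, and at the cluster level you would set the same Spark properties as component properties instead.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("heartbeat-tuning-example")  # hypothetical application name
    # Fewer concurrent tasks per executor: each running task gets more memory.
    .config("spark.executor.cores", "2")
    # A larger executor heap reduces the chance of the silent out-of-memory
    # condition that shows up as heartbeat timeouts.
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)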
