Diagnostics and examples of troubleshooting performance issues in Spark applications

Written by

Yandex Cloud

Updated at May 19, 2026

Initial Spark application performance diagnostics
Examples of troubleshooting performance issues in Spark applications
- Slow data loading from S3
- Poor parallelism on a cluster with a large number of vCPUs

If your Spark applications are slow:

Check their execution to identify the cause of performance issues.
Try using one of the examples below to troubleshoot your performance issues.

Initial Spark application performance diagnostics

If your Spark application underperforms, run the initial diagnostics:

Check the state of the cluster and its hosts and make sure the metrics are within normal limits.
View detailed application info and check the jobs' statuses, actual start and end times on the Event Timeline chart:
- Identify jobs that take the most time.
- Look into jobs that threw an error and find out why.
- Analyze the overall application execution time.
- Check the sequence of actions.
View detailed info on stages:
- Identify stages that take the most time.
- Analyze shuffle operations to detect excessive data movement.
- Pay attention to the GC Time metric: a high value may indicate memory issues.
- Look at DAG Visualization to understand the sequence and dependencies between operations.
Check the resources allocated to the application:
- Make sure the application has enough executors and the available ones are not idle.
- Make sure the resources within each executor are used in a balanced way.
Check the SQL query plans and the execution time for each operation.
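The stage-level checks above can also be scripted. The sketch below is a minimal example, not an official tool. It assumes the Spark History Server REST API is reachable at an address you supply (the `http://localhost:18080` in the docstring is a placeholder) and that stage records carry the fields `stageId`, `name`, `executorRunTime`, `jvmGcTime`, `shuffleReadBytes`, and `shuffleWriteBytes`. Field availability depends on your Spark version, so missing fields are treated as zero. The function names `fetch_stages` and `summarize_stages` are illustrative.

```python
import json
from urllib.request import urlopen


def fetch_stages(base_url, app_id):
    """Fetch stage records for one application from the History Server REST API.

    base_url is an assumption, for example "http://localhost:18080".
    """
    with urlopen(f"{base_url}/api/v1/applications/{app_id}/stages") as resp:
        return json.load(resp)


def summarize_stages(stages, top=5):
    """Return the `top` stages by executor run time with GC and shuffle figures."""
    rows = []
    for s in stages:
        run_ms = s.get("executorRunTime", 0)
        gc_ms = s.get("jvmGcTime", 0)
        rows.append({
            "stageId": s.get("stageId"),
            "name": s.get("name", ""),
            "runTimeMs": run_ms,
            # Share of executor run time spent in garbage collection.
            "gcShare": gc_ms / run_ms if run_ms else 0.0,
            # Total shuffle traffic of the stage, read plus write.
            "shuffleBytes": s.get("shuffleReadBytes", 0) + s.get("shuffleWriteBytes", 0),
        })
    rows.sort(key=lambda r: r["runTimeMs"], reverse=True)
    return rows[:top]
```

Note that `executorRunTime` is summed across all tasks of a stage, so it differs from the wall-clock Duration shown in the UI. Use it to shortlist stages, then confirm in the Stages tab.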

Examples of troubleshooting performance issues in Spark applications

Slow data loading from S3

Issue: Long Spark application execution time.

Diagnostic steps

Identify the affected job.
- Open the Jobs tab in Spark History Server.
- Compare Duration across different jobs.
- Find the job that takes the most time.
Analyze the affected job's stages.
- Navigate to the affected job and open the Stages tab.
- Sort the stages by Duration.
- Analyze the stages that take the most time.
- Check the Input Size metrics to determine how much data was read.
Result: You have found the stage responsible for reading data from S3.
Check the cluster monitoring metrics.
- Check the metrics related to vCPU.
Result: You have detected near-full vCPU utilization during the data read stage.

Data loading from S3 takes place over the internet and is limited by available vCPU resources. The more vCPUs you have, the faster Spark is able to process the incoming data stream.
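To put a number on the read stage, you can relate its Input Size to its Duration. The helpers below are hypothetical and not part of Spark. They assume you take both values from the Stages tab and know how many vCPUs are available to the executors.

```python
def read_throughput_mib_s(input_bytes, duration_s):
    """Effective read throughput of a stage in MiB/s."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return input_bytes / (1024 * 1024) / duration_s


def per_vcpu_throughput_mib_s(input_bytes, duration_s, vcpus):
    """Stage throughput divided by the number of vCPUs (an input you supply)."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return read_throughput_mib_s(input_bytes, duration_s) / vcpus
```

For example, a stage that reads 10 GiB in 256 seconds runs at 40 MiB/s, or 5 MiB/s per vCPU on 8 vCPUs. The per-vCPU figure is only a rough baseline for comparing runs before and after resizing. It is meaningful only if monitoring confirms that the stage is CPU-bound.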

Solution

To fix the issue, we recommend increasing the number of cluster host vCPUs.

Poor parallelism on a cluster with a large number of vCPUs

Issue: Inefficient utilization of vCPU resources due to an insufficient number of tasks, which leaves cores idle and increases execution time.

Diagnostic steps

Check cluster resource utilization.
- Check the cluster's configuration and available vCPU resources.
Analyze the stages in Spark History Server.
- Open the Stages tab in Spark History Server.
- Sort stages by Duration.
- Select the longest stage.
- Pay attention to the Total Tasks column representing the total number of tasks per stage.
- Compare the number of tasks with the number of available vCPUs.
Result: You have found the stage that lacks tasks. Some cores are executing one task each, while others are executing two. Some cores are idle.
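The comparison of Total Tasks with available vCPUs can be expressed as simple arithmetic. The sketch below uses a simplified model that is an assumption, not Spark's actual scheduler: each core runs one task at a time (`spark.task.cpus=1`), tasks take roughly equal time, and they run in "waves". The function name `task_waves` is illustrative.

```python
import math


def task_waves(total_tasks, total_cores):
    """Describe how a stage's tasks map onto cores under the wave model."""
    if total_tasks <= 0 or total_cores <= 0:
        raise ValueError("total_tasks and total_cores must be positive")
    waves = math.ceil(total_tasks / total_cores)
    # Cores that still have work in the final wave; the rest wait for the stage to end.
    busy_in_last_wave = total_tasks - (waves - 1) * total_cores
    return {
        "waves": waves,
        "busy_in_last_wave": busy_in_last_wave,
        "idle_in_last_wave": total_cores - busy_in_last_wave,
        # Fraction of core time slots actually used across all waves.
        "slot_utilization": total_tasks / (waves * total_cores),
    }
```

For example, 40 tasks on 32 cores need two waves: 8 cores run two tasks each, 24 cores run one task and then sit idle, and only 62.5% of the available slots are used.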

Solution

To fix the issue, we recommend breaking the stages down into smaller sub-operations. To distribute work more evenly, increase the number of partitions for shuffle operations by setting the spark.sql.shuffle.partitions property.
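As a starting point for choosing the value, the sketch below derives a partition count from the total number of cores. The multiplier is an assumption based on the general Spark tuning guidance of 2 to 3 tasks per CPU core, and the helper names are hypothetical. Treat the result as a first guess to validate against the Stages tab, not as a prescribed setting.

```python
def suggested_shuffle_partitions(total_cores, tasks_per_core=3):
    """Suggest a shuffle partition count as a multiple of the available cores."""
    if total_cores <= 0 or tasks_per_core <= 0:
        raise ValueError("total_cores and tasks_per_core must be positive")
    return total_cores * tasks_per_core


def shuffle_partitions_conf(total_cores, tasks_per_core=3):
    """Build the key=value string to pass to spark-submit after --conf."""
    n = suggested_shuffle_partitions(total_cores, tasks_per_core)
    return f"spark.sql.shuffle.partitions={n}"
```

For a cluster with 96 cores, this yields `spark.sql.shuffle.partitions=288`, which you can pass as `spark-submit --conf spark.sql.shuffle.partitions=288` or set in a running session with `spark.conf.set("spark.sql.shuffle.partitions", "288")`. The Spark default for this property is 200. If adaptive query execution is enabled, Spark may coalesce shuffle partitions at run time, so the observed number of tasks can be lower than the configured value.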
