© 2026 Direct Cursus Technology L.L.C.
Yandex Managed Service for Apache Spark™


Diagnostics and examples of troubleshooting performance issues in Spark applications

Written by
Yandex Cloud
Updated at March 19, 2026
  • Initial Spark application performance diagnostics
  • Examples of troubleshooting performance issues in Spark applications
    • Slow data loading from S3
    • Poor parallelism on a cluster with a large number of vCPUs

If your Spark applications are slow:

  • Check their execution to identify the cause of performance issues.
  • Try using one of the examples below to troubleshoot your performance issues.

Initial Spark application performance diagnostics

If your Spark application underperforms, run the initial diagnostics:

  • Check the state of the cluster and its hosts and make sure the metrics are within normal limits.

  • View detailed application info and check the jobs' statuses and their actual start and end times on the Event Timeline chart:

    • Identify jobs that take the most time.
    • Look into jobs that threw an error and find out why.
    • Analyze the overall application execution time.
    • Check the sequence of actions.
  • View detailed info on stages:

    • Identify stages that take the most time.
    • Analyze shuffle operations to detect excessive data movement.
    • Pay attention to the GC Time metric: a high value may indicate memory issues.
    • Look at DAG Visualization to understand the sequence and dependencies between operations.
  • Check the resources allocated to the application:

    • Make sure the application has enough executors and the available ones are not idle.
    • Make sure the resources within each executor are used in a balanced way.
  • Check the SQL query plans and the execution time for each operation.
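The executor check above can be automated against the Spark History Server REST API, which serves executor statistics at `/api/v1/applications/<app-id>/executors`. Below is a minimal sketch in plain Python that flags idle or unevenly loaded executors from that JSON; the field names (`id`, `totalTasks`) match the Spark REST API, while the imbalance threshold is an illustrative choice, not a Spark default.

```python
# Sketch: flag idle or unevenly loaded executors from the JSON returned by
# Spark's monitoring REST API (/api/v1/applications/<app-id>/executors).
# The 2x imbalance ratio below is an illustrative threshold, not a Spark value.

def find_unbalanced_executors(executors, ratio=2.0):
    """Return executor ids that are idle or did far fewer tasks than the busiest one."""
    workers = [e for e in executors if e["id"] != "driver"]
    idle = [e["id"] for e in workers if e["totalTasks"] == 0]
    busiest = max((e["totalTasks"] for e in workers), default=0)
    lagging = [
        e["id"] for e in workers
        if 0 < e["totalTasks"] * ratio < busiest
    ]
    return {"idle": idle, "lagging": lagging}

# Sample records in the shape of the REST API response (fabricated values):
sample = [
    {"id": "driver", "totalTasks": 0},
    {"id": "1", "totalTasks": 400},
    {"id": "2", "totalTasks": 120},  # far behind executor 1
    {"id": "3", "totalTasks": 0},    # idle
]
print(find_unbalanced_executors(sample))  # {'idle': ['3'], 'lagging': ['2']}
```

In a live cluster you would fetch the JSON from the History Server URL first; here the sample list stands in for that response.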

Examples of troubleshooting performance issues in Spark applications

Slow data loading from S3

Issue: Long Spark application execution time.

Diagnostic steps

  1. Identify the affected job.

    • Open the Jobs tab in Spark History Server.
    • Compare Duration across different jobs.
    • Find the job that takes the most time.
  2. Analyze the affected job's stages.

    • Navigate to the affected job and open the Stages tab.
    • Sort the stages by Duration.
    • Analyze the stages that take the most time.
    • Check the Input Size metrics to figure out the amount of data read.

    Result: You have found the stage responsible for reading data from S3.

  3. Check the cluster monitoring metrics.

    Check metrics related to vCPU.

    Result: You have detected near-full vCPU utilization during the data read stage.

Data loading from S3 takes place over the internet and is limited by available vCPU resources. The more vCPUs you have, the faster Spark is able to process the incoming data stream.

Solution

To fix the issue, we recommend increasing the number of cluster host vCPUs.
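As a back-of-envelope check of whether adding vCPUs will help: if the read stage is CPU-bound, its wall-clock time scales roughly inversely with the total vCPU count. The sketch below uses illustrative numbers (dataset size and per-vCPU throughput are assumptions, not Yandex Cloud limits).

```python
# Back-of-envelope estimate for a CPU-bound S3 read stage.
# Assumption: throughput scales linearly with vCPUs, so doubling
# the vCPU count roughly halves the stage duration.

def estimated_read_seconds(dataset_gib, mib_per_sec_per_vcpu, total_vcpus):
    """Rough stage duration under linear scaling with vCPU count."""
    total_mib = dataset_gib * 1024
    return total_mib / (mib_per_sec_per_vcpu * total_vcpus)

# Assumed workload: 200 GiB dataset, ~30 MiB/s of decode throughput per vCPU.
before = estimated_read_seconds(200, 30, 16)  # 16 vCPUs
after = estimated_read_seconds(200, 30, 32)   # 32 vCPUs
print(round(before), round(after))  # the 32-vCPU estimate is half the 16-vCPU one
```

If doubling vCPUs in such an estimate does not noticeably change the observed stage time, the bottleneck is likely elsewhere (network or storage) and resizing will not fix it.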

Poor parallelism on a cluster with a large number of vCPUs

Issue: Inefficient utilization of vCPU resources due to an insufficient number of tasks, which leaves cores idle and increases execution time.

Diagnostic steps

  1. Check cluster resource utilization.

    Check the cluster's configuration and available vCPU resources.

  2. Analyze the stages in Spark History Server.

    • Open the Stages tab in Spark History Server.
    • Sort stages by Duration.
    • Select the longest stage.
    • Pay attention to the Total Tasks column representing the total number of tasks per stage.
    • Correlate the number of tasks to the number of available vCPUs.

    Result: You have found the stage with too few tasks. Some cores are executing one task each, while others are executing two. Some cores are idle.

Solution

To fix the issue, we recommend breaking the stages down into smaller sub-operations. Increase the number of partitions for shuffle operations by configuring the spark.sql.shuffle.partitions property to achieve a more even work distribution.
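A common heuristic for picking the partition count is a small multiple of the total executor cores, so every core gets several waves of tasks per shuffle stage. The sketch below assumes that heuristic; the 3x multiplier is a rule of thumb, not a Spark default (Spark ships with `spark.sql.shuffle.partitions` set to 200 out of the box).

```python
# Sketch: size spark.sql.shuffle.partitions as a multiple of the total
# executor cores so no core sits idle during shuffle stages.
# tasks_per_core=3 is an illustrative rule of thumb, not a Spark default.

def shuffle_partitions(executors, cores_per_executor, tasks_per_core=3):
    """Partition count that keeps all cores busy with several waves of tasks."""
    return executors * cores_per_executor * tasks_per_core

n = shuffle_partitions(executors=8, cores_per_executor=4)
print(n)  # 96 partitions for 32 cores

# Applied in a Spark session (not run here):
# spark.conf.set("spark.sql.shuffle.partitions", n)
```

Keeping the count a multiple of the core total avoids the uneven "some cores run two tasks, some run one" pattern found in the diagnostics above.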
