Component properties

Written by Yandex Cloud. Updated on February 18, 2025.

In this article:
  • Updating component properties
  • Available component properties
  • JVM settings for Spark applications set in Yandex Data Processing by default
  • Spark settings for integration with Yandex Object Storage
  • Installing Python packages
  • Using Apache Spark Thrift Server

The properties of cluster components, jobs, and environments are stored in the following format:

<key>:<value>

The key can either be a simple string or contain a prefix indicating that it belongs to a specific component:

<key_prefix>:<key_body>:<value>

Here is an example:

hdfs:dfs.replication : 2
hdfs:dfs.blocksize : 1073741824
spark:spark.driver.cores : 1
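
For illustration only, here is a short Python sketch (not part of any Yandex Data Processing tooling) that splits such prefixed keys into per-component dictionaries:

from collections import defaultdict

raw = {
    "hdfs:dfs.replication": "2",
    "hdfs:dfs.blocksize": "1073741824",
    "spark:spark.driver.cores": "1",
}

by_component = defaultdict(dict)
for name, value in raw.items():
    prefix, _, key = name.partition(":")  # "<key_prefix>:<key_body>"
    by_component[prefix][key] = value

print(dict(by_component))
# {'hdfs': {'dfs.replication': '2', 'dfs.blocksize': '1073741824'},
#  'spark': {'spark.driver.cores': '1'}}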

Updating component properties

You can update the component properties in the following ways:

  • At the cluster level when creating or updating it. The properties provided this way apply to any new cluster jobs by default.
  • At the level of an individual job when creating it. Properties provided this way apply only to that job and override the cluster-level values set for it (see the sketch below).
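
Properties with the spark: prefix end up in the Spark application configuration, so a job-level value behaves like any per-application Spark setting and wins over the cluster default. A minimal PySpark sketch (the property and values are illustrative):

from pyspark.sql import SparkSession

# Suppose the cluster sets spark:spark.sql.shuffle.partitions : 200.
# A job-level setting overrides it for this application only.
spark = (
    SparkSession.builder
    .appName("per-job-override-sketch")
    .config("spark.sql.shuffle.partitions", "64")  # job-level override
    .getOrCreate()
)

print(spark.conf.get("spark.sql.shuffle.partitions"))  # 64 for this job
spark.stop()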

Available component properties

The available properties are listed in the official documentation for the components:

Prefix | Path to the configuration file | Documentation
--- | --- | ---
core | /etc/hadoop/conf/core-site.xml | Hadoop
hdfs | /etc/hadoop/conf/hdfs-site.xml | HDFS
yarn | /etc/hadoop/conf/yarn-site.xml | YARN
mapreduce | /etc/hadoop/conf/mapred-site.xml | MapReduce
capacity-scheduler | /etc/hadoop/conf/capacity-scheduler.xml | CapacityScheduler
resource-type | /etc/hadoop/conf/resource-types.xml | ResourceTypes
node-resources | /etc/hadoop/conf/node-resources.xml | NodeResources
spark | /etc/spark/conf/spark-defaults.conf | Spark
hbase | /etc/hbase/conf/hbase-site.xml | HBase
hbase-policy | /etc/hbase/conf/hbase-policy.xml | HBase
hive | /etc/hive/conf/hive-site.xml | Hive
hivemetastore | /etc/hive/conf/hivemetastore-site.xml | Hive Metastore
hiveserver2 | /etc/hive/conf/hiveserver2-site.xml | HiveServer2
tez | /etc/tez/conf/tez-site.xml | Tez 0.9.2 and Tez 0.10.0
zeppelin | /etc/zeppelin/conf/zeppelin-site.xml | Zeppelin

Settings for running jobs are specified in special properties:

  • dataproc:version: Version of the dataproc-agent, which runs jobs, reports the cluster state, and proxies the UI. Used for debugging. The default value is latest.

  • dataproc:max-concurrent-jobs: Number of concurrent jobs. The default value is auto, calculated from the min-free-memory-to-enqueue-new-job and job-memory-footprint properties (see the estimate sketched after this list).

  • dataproc:min-free-memory-to-enqueue-new-job: Minimum size of free memory to run a job, in bytes. The default value is 1073741824 (1 GB).

  • dataproc:job-memory-footprint: Memory size to run a job on the cluster's master host, used to estimate the maximum number of jobs per cluster. The default value is 536870912 (512 MB).

  • dataproc:spark_executors_per_vm: Maximum number of containers per computing host when running Spark jobs. The default values are:

    • 1: For lightweight clusters.
    • 2: For clusters with HDFS.
  • dataproc:spark_driver_memory_fraction: Computing host memory share reserved for the driver when running Spark jobs. The default value is 0.25.
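
The auto value for dataproc:max-concurrent-jobs is derived from the two memory properties above. The exact formula lives inside dataproc-agent; the following Python arithmetic is only a rough, assumed reading of how such a limit could be estimated:

# Illustrative estimate only; not dataproc-agent's actual calculation.
master_free_memory = 8 * 1024**3   # assumed free memory on the master host, bytes
min_free = 1073741824              # dataproc:min-free-memory-to-enqueue-new-job (1 GB)
footprint = 536870912              # dataproc:job-memory-footprint (512 MB)

max_jobs = max(1, (master_free_memory - min_free) // footprint)
print(max_jobs)  # 14 with the numbers above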

JVM settings for Spark applications set in Yandex Data Processing by default

By default, the following settings are applied on Yandex Data Processing clusters to improve JVM performance:

  • spark:spark.driver.extraJavaOptions:
    • -XX:+UseConcMarkSweepGC
    • -XX:CMSInitiatingOccupancyFraction=70
    • -XX:MaxHeapFreeRatio=70
    • -XX:+CMSClassUnloadingEnabled
    • -XX:OnOutOfMemoryError='kill -9 %p'
  • spark:spark.executor.extraJavaOptions:
    • -verbose:gc
    • -XX:+PrintGCDetails
    • -XX:+PrintGCDateStamps
    • -XX:+UseConcMarkSweepGC
    • -XX:CMSInitiatingOccupancyFraction=70
    • -XX:MaxHeapFreeRatio=70
    • -XX:+CMSClassUnloadingEnabled
    • -XX:OnOutOfMemoryError='kill -9 %p'

If you want to update JVM settings, provide them in a single space-separated string. For example, for the spark:spark.driver.extraJavaOptions cluster property:

-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=60 -XX:MaxHeapFreeRatio=80

Note

Changing the spark:spark.driver.defaultJavaOptions or spark:spark.executor.defaultJavaOptions cluster properties to values conflicting with the extraJavaOptions settings may result in cluster configuration errors.

Spark settings for integration with Yandex Object Storage

The following settings are available for Apache Spark:

Configuration | Default value | Description
--- | --- | ---
fs.s3a.access.key | — | Static key ID
fs.s3a.secret.key | — | Secret key
fs.s3a.endpoint | storage.yandexcloud.net | Endpoint to connect to Object Storage
fs.s3a.signing-algorithm | Empty value | Signature algorithm
fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider | Credentials provider

For more information, see the Apache Hadoop documentation.
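
These settings can also be supplied per Spark session: Spark forwards any spark.hadoop.* configuration entries to the underlying Hadoop configuration. A minimal PySpark sketch (the key pair and bucket name are placeholders; on Yandex Data Processing the endpoint below is already the default):

from pyspark.sql import SparkSession

# "spark.hadoop." entries are passed through to the Hadoop (fs.s3a.*) config.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "storage.yandexcloud.net")
    .config("spark.hadoop.fs.s3a.access.key", "<static_key_id>")  # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "<secret_key>")     # placeholder
    .getOrCreate()
)

df = spark.read.csv("s3a://<bucket_name>/data.csv", header=True)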

Installing Python packages

To install additional Python packages, you can use the conda or pip package managers. Provide the package name in the cluster properties as follows:

Package manager | Key | Value | Example
--- | --- | --- | ---
conda | conda:<package_name> | Package version number according to the conda specification | conda:koalas : 1.5.0
pip | pip:<package_name> | Package version number according to the pip specification | pip:requests : 2.31.0
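
Once installed this way, the package can be imported in job code as usual. For example, with pip:requests : 2.31.0 set on the cluster, a PySpark job script can do the following (the URL is a placeholder):

import requests  # available because pip:requests was set in the cluster properties

resp = requests.get("https://example.com/health")  # placeholder URL
print(resp.status_code)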

Using Apache Spark Thrift Server

You can use Apache Spark Thrift Server as a distributed SQL query engine on top of Apache Spark.

To enable it, set dataproc:hive.thrift.impl : spark; the server then becomes available on TCP port 10000. The default value is dataproc:hive.thrift.impl : hive, which launches Apache HiveServer2 on TCP port 10000 when the Hive service is used.

This feature is available starting with image version 2.0.48.
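
Any Hive-compatible client can then connect to the server on port 10000. A minimal Python sketch using the pyhive package (an assumption: pyhive and its Thrift dependencies must be installed, e.g., via the pip: cluster properties; the host name is a placeholder):

from pyhive import hive  # assumes pyhive is installed, e.g., via pip: properties

conn = hive.connect(host="<master_host_fqdn>", port=10000)  # placeholder host
cur = conn.cursor()
cur.execute("SHOW DATABASES")
print(cur.fetchall())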
