Component properties
The properties of cluster components, jobs, and environments are stored in the following format:
<key>:<value>
The key can either be a simple string or contain a prefix indicating that it belongs to a specific component:
<key_prefix>:<key_body>:<value>
For example:
hdfs:dfs.replication : 2
hdfs:dfs.blocksize : 1073741824
spark:spark.driver.cores : 1
Updating component properties
You can update the component properties in the following ways:
- At the cluster level when creating or updating it. The properties provided this way apply to any new cluster jobs by default.
- At the level of an individual job when creating it. The properties provided this way only apply to this job and override the cluster-level properties set for it.
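The sketch below illustrates the override behavior described above with plain dictionary logic; the property names come from the examples in this section, and the merge is illustrative only, not a call to any actual API.

```python
# Illustrative only: job-level properties override cluster-level ones.
cluster_properties = {
    "hdfs:dfs.replication": "2",
    "spark:spark.driver.cores": "1",
}

job_properties = {
    # Overrides the cluster-level value, but only for this job.
    "spark:spark.driver.cores": "2",
}

# Effective properties for the job: cluster-level defaults plus job-level overrides.
effective_properties = {**cluster_properties, **job_properties}
print(effective_properties)
# {'hdfs:dfs.replication': '2', 'spark:spark.driver.cores': '2'}
```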
Available component properties
The available properties are listed in the official documentation for the components:
Prefix | Path to the configuration file | Documentation |
---|---|---|
core | /etc/hadoop/conf/core-site.xml | Hadoop |
hdfs | /etc/hadoop/conf/hdfs-site.xml | HDFS |
yarn | /etc/hadoop/conf/yarn-site.xml | YARN |
mapreduce | /etc/hadoop/conf/mapred-site.xml | MapReduce |
capacity-scheduler | /etc/hadoop/conf/capacity-scheduler.xml | CapacityScheduler |
resource-type | /etc/hadoop/conf/resource-types.xml | ResourceTypes |
node-resources | /etc/hadoop/conf/node-resources.xml | NodeResources |
spark | /etc/spark/conf/spark-defaults.xml | Spark |
hbase | /etc/hbase/conf/hbase-site.xml | HBase |
hbase-policy | /etc/hbase/conf/hbase-policy.xml | HBase |
hive | /etc/hive/conf/hive-site.xml | Hive |
hivemetastore | /etc/hive/conf/hivemetastore-site.xml | Hive Metastore |
hiveserver2 | /etc/hive/conf/hiveserver2-site.xml | Hive Server2 |
tez | /etc/tez/conf/tez-site.xml | Tez 0.9.2 |
zeppelin | /etc/zeppelin/conf/zeppelin-site.xml | Zeppelin |
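As a rough illustration of how a prefixed key is routed to its component's configuration file, the sketch below splits a key at the first colon and looks the prefix up in a map built from the table above; this is illustrative dictionary logic, not the mechanism the cluster actually uses.

```python
# Illustrative only: map a prefixed property key, such as "hdfs:dfs.blocksize",
# to the configuration file of its component, per the table above.
CONFIG_FILES = {
    "core": "/etc/hadoop/conf/core-site.xml",
    "hdfs": "/etc/hadoop/conf/hdfs-site.xml",
    "yarn": "/etc/hadoop/conf/yarn-site.xml",
    "spark": "/etc/spark/conf/spark-defaults.xml",
    # ...remaining prefixes from the table
}

def target_file(property_key: str) -> str:
    """Return the configuration file that receives the given property."""
    prefix, _, _key_body = property_key.partition(":")
    return CONFIG_FILES.get(prefix, "<unknown prefix>")

print(target_file("hdfs:dfs.blocksize"))        # /etc/hadoop/conf/hdfs-site.xml
print(target_file("spark:spark.driver.cores"))  # /etc/spark/conf/spark-defaults.xml
```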
Settings for running jobs are specified in special properties:
- dataproc:version: Version of the dataproc-agent that runs jobs, sends cluster state properties, and proxies the UI. This property is used for debugging. Its default value is latest.
- dataproc:max-concurrent-jobs: Number of concurrent jobs. The default value is auto (calculated based on the min-free-memory-to-enqueue-new-job and job-memory-footprint properties; see the sketch after this list).
- dataproc:min-free-memory-to-enqueue-new-job: Minimum amount of free memory required to run a job, in bytes. The default value is 1073741824 (1 GB).
- dataproc:job-memory-footprint: Memory size required to run a job on the cluster's master host, used to estimate the maximum number of jobs per cluster. The default value is 536870912 (512 MB).
- dataproc:spark_executors_per_vm: Maximum number of containers per computing host when running Spark jobs. The default values are:
  - 1 for lightweight clusters.
  - 2 for clusters with HDFS.
- dataproc:spark_driver_memory_fraction: Fraction of the computing host memory reserved for the driver when running Spark jobs. The default value is 0.25.
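The formula behind the auto value of dataproc:max-concurrent-jobs is not spelled out here, so the sketch below is only a plausible back-of-the-envelope estimate based on the two memory properties above; the master host memory figure is an arbitrary assumption, and the agent's actual calculation may differ.

```python
# Back-of-the-envelope estimate only; the actual calculation performed by the
# dataproc-agent may differ. All figures are in bytes.
min_free_memory = 1073741824       # dataproc:min-free-memory-to-enqueue-new-job (default, 1 GB)
job_memory_footprint = 536870912   # dataproc:job-memory-footprint (default, 512 MB)
master_host_memory = 8 * 1024**3   # assumed master host RAM (8 GB, arbitrary)

# One plausible estimate: memory left after the free-memory reserve,
# divided by the per-job footprint.
estimated_max_jobs = (master_host_memory - min_free_memory) // job_memory_footprint
print(estimated_max_jobs)  # 14 with the figures above
```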
JVM settings for Spark applications set in Yandex Data Processing by default
Generally, the following default settings are applied to Yandex Data Processing clusters to improve JVM performance:
- spark:spark.driver.extraJavaOptions:
  -XX:+UseConcMarkSweepGC
  -XX:CMSInitiatingOccupancyFraction=70
  -XX:MaxHeapFreeRatio=70
  -XX:+CMSClassUnloadingEnabled
  -XX:OnOutOfMemoryError='kill -9 %p'
- spark:spark.executor.extraJavaOptions:
  -verbose:gc
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+UseConcMarkSweepGC
  -XX:CMSInitiatingOccupancyFraction=70
  -XX:MaxHeapFreeRatio=70
  -XX:+CMSClassUnloadingEnabled
  -XX:OnOutOfMemoryError='kill -9 %p'
If you want to update JVM settings, provide them in a single space-separated string. For example, for the spark:spark.driver.extraJavaOptions cluster property:
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=60 -XX:MaxHeapFreeRatio=80
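If you build this property programmatically, the options still have to end up in one space-separated string; the snippet below assembles that string in plain Python and is not tied to any particular SDK.

```python
# Build the space-separated value for spark:spark.driver.extraJavaOptions.
# The options match the example above; joining them is plain string logic.
driver_java_options = " ".join([
    "-XX:+UseConcMarkSweepGC",
    "-XX:CMSInitiatingOccupancyFraction=60",
    "-XX:MaxHeapFreeRatio=80",
])

properties = {"spark:spark.driver.extraJavaOptions": driver_java_options}
print(properties["spark:spark.driver.extraJavaOptions"])
# -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=60 -XX:MaxHeapFreeRatio=80
```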
Note
Changing the spark:spark.driver.defaultJavaOptions or spark:spark.executor.defaultJavaOptions cluster properties to values that conflict with the extraJavaOptions settings may result in cluster configuration errors.
Spark settings for integration with Yandex Object Storage
The following settings are available for Apache Spark:
Setting | Default value | Description |
---|---|---|
fs.s3a.access.key | — | Static key ID |
fs.s3a.secret.key | — | Secret key |
fs.s3a.endpoint | storage.yandexcloud.net | Endpoint to connect to Object Storage |
fs.s3a.signing-algorithm | Empty value | Signature algorithm |
fs.s3a.aws.credentials.provider | org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider | Credentials provider |
For more information, see the Apache Hadoop documentation.
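For example, a Spark job can pass these settings through the spark.hadoop.* configuration prefix and then read data over the s3a:// scheme. In the sketch below, the key values and bucket name are placeholders; on a real cluster the credentials are normally supplied via cluster properties or a credentials provider rather than hardcoded in job code.

```python
from pyspark.sql import SparkSession

# Minimal sketch: placeholders must be replaced with real values.
spark = (
    SparkSession.builder
    .appName("object-storage-example")
    # Hadoop S3A settings passed through the spark.hadoop.* prefix.
    .config("spark.hadoop.fs.s3a.access.key", "<static_key_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .config("spark.hadoop.fs.s3a.endpoint", "storage.yandexcloud.net")
    .getOrCreate()
)

# Read a CSV file from an Object Storage bucket over the s3a:// scheme.
df = spark.read.option("header", "true").csv("s3a://<bucket_name>/input/data.csv")
df.show(10)
```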
Installing Python packages
To install additional Python packages, you can use the conda or pip package managers. Provide the package name in the cluster properties as follows:
Package manager | Key | Value | Example |
---|---|---|---|
conda | conda:<package_name> | Package version number according to the conda specification | conda:koalas : 1.5.0 |
pip | pip:<package_name> | Package version number according to the pip specification | pip:requests : 2.31.0 |
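For instance, requesting both example packages from the table when creating a cluster amounts to passing properties like the following (shown here as a plain Python mapping rather than a specific SDK call):

```python
# Cluster properties requesting extra Python packages, using the
# conda:<package_name> and pip:<package_name> keys from the table above.
extra_python_packages = {
    "conda:koalas": "1.5.0",
    "pip:requests": "2.31.0",
}
```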
Using Apache Spark Thrift Server
You can use Apache Spark Thrift Server.
To enable it, set dataproc:hive.thrift.impl : spark; the server will then be available on TCP port 10000.
The default value is dataproc:hive.thrift.impl : hive, which launches Apache HiveServer2 on TCP port 10000 if the Hive service is in use.
This feature is available starting with image version 2.0.48.
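Once Spark Thrift Server is enabled, any Thrift/HiveServer2-compatible client can connect to TCP port 10000 on the cluster master host. The sketch below uses the pyhive package as one such client; the master host name is a placeholder, and the availability of pyhive on the connecting machine is an assumption.

```python
# Connect to Spark Thrift Server on the cluster master host (placeholder name).
# Assumes the pyhive client is installed: pip install "pyhive[hive]"
from pyhive import hive

conn = hive.connect(host="<master_host_FQDN>", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
print(cursor.fetchall())
cursor.close()
conn.close()
```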