
Setting up Delta Lake in multi-cluster mode

Written by Yandex Cloud
Updated at October 20, 2025
  • Set up your infrastructure
  • Set up the component properties to work with Delta Lake
  • Delta Lake use case
  • Additional multi-cluster mode settings for production clusters in Yandex Data Processing
    • Setting up Managed Service for YDB throughput
    • Setting up auto cleanup

In multi-cluster mode, Yandex Data Processing uses a Yandex Managed Service for YDB database to manage access to Delta Lake tables from different clusters and Apache Spark™ jobs.

For more information about Delta Lake, see the Delta Lake in Yandex Data Processing section and the Delta Lake documentation.

Note

Delta Lake is not part of Yandex Data Processing. It is not covered by Yandex Cloud support, and its usage is not governed by the Yandex Data Processing Terms of Use.

Set up your infrastructure

  1. Create a serverless Managed Service for YDB database.

  2. Create a service account with the ydb.editor role for access to YDB.

  3. Create a static access key for the service account.

  4. Create a Yandex Lockbox secret and store the static key data in it as two key-value pairs:

    • Key: key-id; value: <static_key_ID>.
    • Key: key-secret; value: <static_key_secret_part>.
  5. Set up one or more Yandex Data Processing clusters to work with Delta Lake:

    1. If you do not have a Yandex Data Processing cluster, create one.

    2. If you attached a Yandex Object Storage bucket to your cluster:

      1. Create a folder named warehouse in the bucket.
      2. Set spark.sql.warehouse.dir to s3a://<bucket_name>/warehouse/.
    3. Create an Apache Hive™ Metastore cluster and connect it to your Yandex Data Processing cluster.

  6. Assign the lockbox.payloadViewer role to the service account you used to create the Yandex Data Processing clusters. You can grant the role:

    • For the secret you created earlier only
    • At the entire folder level

    A CLI sketch of these infrastructure steps is provided after this list.
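
For reference, the shell sketch below shows how steps 1–4 and 6 could be performed with the Yandex Cloud CLI instead of the management console. All names and IDs are placeholders, and the command flags are assumptions based on common yc CLI usage: verify them against yc --help for your CLI version before running anything.

# 1. Create a serverless YDB database (the name is a placeholder).
yc ydb database create delta-docapi-db --serverless

# 2. Create a service account and grant it the ydb.editor role in the folder.
yc iam service-account create --name delta-ydb-sa
yc resource-manager folder add-access-binding <folder_ID> \
    --role ydb.editor \
    --subject serviceAccount:<service_account_ID>

# 3. Create a static access key; note the key ID and secret part from the output.
yc iam access-key create --service-account-name delta-ydb-sa

# 4. Store the static key data in a Lockbox secret as two key-value pairs.
yc lockbox secret create --name delta-lake-keys \
    --payload '[{"key": "key-id", "text_value": "<static_key_ID>"}, {"key": "key-secret", "text_value": "<static_key_secret_part>"}]'

# 6. Allow the cluster service account to read the secret (per-secret scope).
yc lockbox secret add-access-binding --name delta-lake-keys \
    --role lockbox.payloadViewer \
    --subject serviceAccount:<cluster_service_account_ID>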

Set up the component properties to work with Delta Lake

  1. Download the archive with the required Delta Lake libraries and add-ons to connect to Managed Service for YDB:

    • Delta Lake 2.0.2 for Yandex Data Processing 2.1.0 or 2.1.3
    • Delta Lake 2.3.0 for Yandex Data Processing 2.1.4 and higher

    You can check out the source code for add-ons to connect to YDB in the repository:

    • Add-ons for Delta Lake 2.0.2
    • Add-ons for Delta Lake 2.3.0
  2. Add the downloaded archive to the dependencies of all clusters or individual jobs that need access to Delta Lake tables. There are two ways to do this:

    • Save the archive to the Object Storage bucket and provide the file URL in the spark.jars property:

      spark.jars=s3a://<bucket_name>/<file_path>

      Make sure the cluster service account has read access to the bucket.

    • Copy the archive to all cluster nodes manually or using initialization scripts and provide the full file path in the spark.driver.extraClassPath and spark.executor.extraClassPath properties.

  3. Set the following properties at the level of clusters or individual Apache Spark™ jobs that need access to Delta Lake tables:

    • Set spark.sql.extensions to io.delta.sql.DeltaSparkSessionExtension.
    • Set spark.sql.catalog.spark_catalog to org.apache.spark.sql.delta.catalog.DeltaCatalog.
    • Set spark.delta.logStore.s3a.impl to ru.yandex.cloud.custom.delta.YcS3YdbLogStore.
    • Set spark.io.delta.storage.S3DynamoDBLogStore.ddb.endpoint to the Document API endpoint value available on the Overview tab of your database in the management console.
    • Set spark.io.delta.storage.S3DynamoDBLogStore.ddb.lockbox to the Lockbox secret ID available on the Overview tab of your secret in the management console.

You can now use Delta Lake in multi-cluster mode.
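
For reference, when these settings are applied at the cluster level rather than per job, the combined configuration might look like the sketch below. The spark: prefix reflects the <component>:<property> format used for Yandex Data Processing cluster properties (an assumption to verify for your setup); the bucket path, endpoint, and secret ID are placeholders.

spark:spark.jars=s3a://<bucket_name>/<file_path>
spark:spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark:spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark:spark.delta.logStore.s3a.impl=ru.yandex.cloud.custom.delta.YcS3YdbLogStore
spark:spark.io.delta.storage.S3DynamoDBLogStore.ddb.endpoint=<Document_API_endpoint>
spark:spark.io.delta.storage.S3DynamoDBLogStore.ddb.lockbox=<secret_ID>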

Delta Lake use case

This use case was tested on a Yandex Data Processing cluster running version 2.1.7.

  1. Use SSH to connect to the Yandex Data Processing cluster's master host.

  2. Run an Apache Spark™ session in the cluster by providing the required parameters:

    spark-sql  \
        --conf spark.jars=s3a://<bucket_name>/yc-delta23-multi-dp21-1.1-fatjar.jar \
        --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
        --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.YcDeltaCatalog \
        --conf spark.delta.logStore.s3a.impl=ru.yandex.cloud.custom.delta.YcS3YdbLogStore \
        --conf spark.io.delta.storage.S3DynamoDBLogStore.ddb.endpoint=<Document_API_endpoint> \
        --conf spark.io.delta.storage.S3DynamoDBLogStore.ddb.lockbox=<secret_ID>
    
  3. In your active session, create a database and switch to it:

    CREATE DATABASE testdelta;
    USE testdelta;
    
  4. Create a test table and populate it with data:

    CREATE TABLE tab1(a INTEGER NOT NULL, b VARCHAR(100)) USING DELTA;
    INSERT INTO tab1 VALUES (1,'One'), (2,'Two'), (3,'Three');
    
  5. Update the b column values by appending the a column values converted to strings:

    UPDATE tab1 SET b=b || ' ** ' || CAST(a AS VARCHAR(10));
    
  6. Check the result:

    SELECT * FROM tab1;
    
    3	Three ** 3
    2	Two ** 2
    1	One ** 1
    

Additional multi-cluster mode settings for production clusters in Yandex Data Processing

To improve Delta Lake performance and streamline data storage when using multi-cluster mode, configure additional YDB settings.

Setting up Managed Service for YDB throughput

By default, a serverless YDB database is created with a throughput limit of 10 request units per second. This may be insufficient for intensive use of Delta Lake tables.

To avoid Delta Lake performance degradation due to insufficient YDB throughput, track the Document API units overflow indicator on the YDB monitoring charts and increase the throughput limit if needed.

The total throughput of all YDB databases in the cloud is capped by a quota. If necessary, contact support to have the quota increased.

Setting up auto cleanup

When working with Delta Lake, metadata versions that are not in use may accumulate in a YDB table or Object Storage buckets. You can streamline storage use and boost Delta Lake performance using a ready-made script that will automatically clean up outdated metadata from the YDB table and bucket on a regular basis.

The script is installed in the cloud as two serverless functions:

  • Function for cleaning up data in the YDB table. It is invoked automatically once an hour.
  • Function for cleaning up data in the buckets. It is invoked automatically once a day.

To add these cleanup functions to your cloud:

  1. If you do not have the Yandex Cloud CLI installed yet, install and initialize it.

  2. Download files from the cf-cleanup folder:

    • cfunc.py: Cleanup script source code.
    • delta-prefixes.txt: File with prefixes of paths to temporary Delta Lake files in buckets.
    • pack.sh: ZIP archive creation script.
    • requirements.txt: File with environment requirements to install functions.

    Save these files to the cf-cleanup folder in the working directory.

  3. Make the pack.sh file executable:

    chmod +x ./cf-cleanup/pack.sh
    
  4. In the delta-prefixes.txt file, specify the paths to the Object Storage bucket folders with temporary Delta Lake files. Provide each path on a separate line in the following format:

    BucketName Mode PathPrefix
    

    The Mode field can take the following values:

    • W: Warehouse, path for storing multiple databases.
    • D: Database, path for storing a single database.
    • T: Table, path for storing a specific single table.

    Example:

    mybucket1 W warehouse/
    mybucket2 D warehouse/testdelta2.db/
    mybucket3 T warehouse/testdelta3.db/tab1/
    
  5. Upload the delta-prefixes.txt file to your Object Storage bucket.

  6. Download and save the files for managing the cleanup functions to the working directory:

    • ddb-maint-config.sh: Setup parameters
    • ddb-maint-setup.sh: Setup script
    • ddb-maint-remove.sh: Removal script
  7. In the ddb-maint-config.sh file, specify the following parameters (a filled-in example follows this procedure):

    • sa_name: Name of the service account that will be created to use the functions.
    • cf_ddb_name: Name of the serverless database cleanup function; it must be unique within the folder.
    • cf_s3_name: Name of the serverless bucket cleanup function; it must be unique within the folder.
    • docapi_endpoint: Document API endpoint. You can find it on the Overview tab of your YDB database in the management console.
    • docapi_table: Name of the Delta Lake table to clean.
    • s3_prefix_file: Path to the delta-prefixes.txt file in the Object Storage bucket, e.g., s3://<bucket_name>/delta-prefixes.txt.
  8. Run the setup script in your local directory:

    bash ./ddb-maint-setup.sh
    

This adds the functions for cleaning up temporary files in the YDB table and Object Storage buckets to your cloud. You can check the new functions in the management console.
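
For reference, a filled-in ddb-maint-config.sh might look like the sketch below. All values are hypothetical placeholders; only the variable names come from step 7 above.

# ddb-maint-config.sh: example values (placeholders, adjust for your setup)
sa_name="delta-cleanup-sa"                   # service account created for the functions
cf_ddb_name="delta-ydb-cleanup"              # YDB table cleanup function, unique within the folder
cf_s3_name="delta-s3-cleanup"                # bucket cleanup function, unique within the folder
docapi_endpoint="<Document_API_endpoint>"    # from the Overview tab of the YDB database
docapi_table="<Delta_Lake_table_name>"       # table used by Delta Lake in the YDB database
s3_prefix_file="s3://<bucket_name>/delta-prefixes.txt"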

If you no longer need the cleanup functions, run this script to remove them:

bash ./ddb-maint-remove.sh

The spark.io.delta.storage.S3DynamoDBLogStore.ddb.ttl Spark property sets the TTL for metadata records, which is 86400 seconds (24 hours) by default. The actual TTL for a specific record may be longer, as it depends on when the cleanup function was run.
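
For example, to shorten the metadata retention period to 12 hours, you could set this property at the cluster level or for an individual job, the same way as the other properties in this article (43200 is only an illustrative value):

spark.io.delta.storage.S3DynamoDBLogStore.ddb.ttl=43200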
