Yandex project
© 2025 Yandex.Cloud LLC

Setting up Apache Iceberg™ in a Yandex Data Processing cluster

Written by
Yandex Cloud
Updated on December 26, 2024
  • Set up component properties to work with Apache Iceberg™
  • Apache Iceberg™ usage example

Yandex Data Processing 2.0 or higher supports using Apache Iceberg™ tables together with the Apache Spark™ engine.

For more information about Apache Iceberg™, see Apache Iceberg™ in Yandex Data Processing and the official documentation.

Note

Apache Iceberg™ is not part of Yandex Data Processing. It is not covered by Yandex Cloud support and its usage is not governed by the Yandex Data Processing Terms of Use.

Set up component properties to work with Apache Iceberg™

  1. Set the spark:spark.sql.extensions property to org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions. You can do this either at cluster or individual job level.

  2. Add the Apache Iceberg™ libraries to the dependencies of your cluster or individual job. The library versions must be compatible with the Yandex Data Processing image version.

    Image 2.0.x

    To add the required library versions, use one of the following methods:

    • Set up access to the Maven repository and set the spark:spark.jars.packages property to org.apache.iceberg:iceberg-spark-runtime-3.0_2.12:1.0.0.

      You can set up Maven access in two ways:

      • In your cluster's security group, allow network access to the Maven Central repository.
      • Configure an alternative Maven repository and allow traffic to it in the cluster security group.
    • Download the iceberg-spark-runtime-3.0_2.12-1.0.0.jar library file and grant access to it by one of the following methods:

      • Save the file to a Yandex Object Storage bucket and provide the file's URL in the spark:spark.jars property.

        The file's URL has the following format: s3a://<bucket_name>/<file_path>.

        This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.

      • Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the spark:spark.driver.extraClassPath and spark:spark.executor.extraClassPath properties.

    Image 2.1.0–2.1.3

    To add the required library versions, use one of the following methods:

    • Set up access to the Maven repository and set the spark:spark.jars.packages property to org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.3.

      You can set up Maven access in two ways:

      • In your cluster's security group, allow network access to the Maven Central repository.
      • Configure an alternative Maven repository and allow traffic to it in the cluster security group.
    • Download the iceberg-spark-runtime-3.2_2.12-1.4.3.jar library file and grant access to it by one of the following methods:

      • Save the file to a Yandex Object Storage bucket and provide the file's URL in the spark:spark.jars property.

        The file's URL has the following format: s3a://<bucket_name>/<file_path>.

        This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.

      • Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the spark:spark.driver.extraClassPath and spark:spark.executor.extraClassPath properties.

    Image 2.1.4–2.1.x

    To add the required library versions, use one of the following methods:

    • Set up access to the Maven repository and set the spark:spark.jars.packages property to org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2.

      You can set up Maven access in two ways:

      • In your cluster's security group, allow network access to the Maven Central repository.
      • Configure an alternative Maven repository and allow traffic to it in the cluster security group.
    • Download the iceberg-spark-runtime-3.3_2.12-1.5.2.jar library file and grant access to it by one of the following methods:

      • Save the file to a Yandex Object Storage bucket and provide the file's URL in the spark:spark.jars property.

        The file's URL has the following format: s3a://<bucket_name>/<file_path>.

        This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.

      • Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the spark:spark.driver.extraClassPath and spark:spark.executor.extraClassPath properties.

    Image 2.2.x

    To add the required library versions, use one of the following methods:

    • Set up access to the Maven repository and set the spark:spark.jars.packages property to org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2.

      You can set up Maven access in two ways:

      • In your cluster's security group, allow network access to the Maven Central repository.
      • Configure an alternative Maven repository and allow traffic to it in the cluster security group.
    • Download the iceberg-spark-runtime-3.5_2.12-1.5.2.jar library file and grant access to it by one of the following methods:

      • Save the file to a Yandex Object Storage bucket and provide the file's URL in the spark:spark.jars property.

        The file's URL has the following format: s3a://<bucket_name>/<file_path>.

        This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.

      • Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the spark:spark.driver.extraClassPath and spark:spark.executor.extraClassPath properties.

You can now use Apache Iceberg™ in your Yandex Data Processing cluster.
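
The image-to-library mapping above can be captured in a small helper, which may be handy when property values are generated by a script. The following Python sketch is illustrative only: the function names (iceberg_artifact, maven_properties, object_storage_properties) and the prefix argument are hypothetical and are not part of any Yandex Cloud SDK. It only builds the property names and values listed in this section; applying them to a cluster or job is left to the tool you normally use.

```python
from typing import Dict, Tuple

# Extension class named in step 1 of this section.
ICEBERG_EXTENSIONS = (
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
)


def iceberg_artifact(image_version: str) -> Tuple[str, str]:
    """Return (Spark series, Iceberg version) for an image version such as '2.1.4'.

    Hypothetical helper: the mapping mirrors the four image ranges in this guide.
    """
    parts = tuple(int(p) for p in image_version.split("."))
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {image_version!r}")
    if parts[:2] == (2, 0):
        return "3.0", "1.0.0"
    if parts[:2] == (2, 1):
        # Image 2.1 is split by patch version, so the patch number is required.
        if len(parts) < 3:
            raise ValueError("a patch version is needed for image 2.1")
        return ("3.2", "1.4.3") if parts[2] <= 3 else ("3.3", "1.5.2")
    if parts[:2] == (2, 2):
        return "3.5", "1.5.2"
    raise ValueError(f"no Iceberg mapping known for image {image_version!r}")


def maven_properties(image_version: str) -> Dict[str, str]:
    """Properties for the Maven method (spark:spark.jars.packages)."""
    spark, iceberg = iceberg_artifact(image_version)
    return {
        "spark:spark.sql.extensions": ICEBERG_EXTENSIONS,
        "spark:spark.jars.packages": (
            f"org.apache.iceberg:iceberg-spark-runtime-{spark}_2.12:{iceberg}"
        ),
    }


def object_storage_properties(
    image_version: str, bucket: str, prefix: str
) -> Dict[str, str]:
    """Properties for the Object Storage method (spark:spark.jars).

    Assumes the JAR was uploaded under <prefix>/ with its original file name.
    """
    spark, iceberg = iceberg_artifact(image_version)
    jar = f"iceberg-spark-runtime-{spark}_2.12-{iceberg}.jar"
    return {
        "spark:spark.sql.extensions": ICEBERG_EXTENSIONS,
        "spark:spark.jars": f"s3a://{bucket}/{prefix}/{jar}",
    }
```

For example, maven_properties("2.1.4") yields the spark:spark.jars.packages value for the 3.3_2.12:1.5.2 runtime. Note the different separators: Maven coordinates use a colon before the version, while the JAR file name uses a hyphen.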

Apache Iceberg™ usage example

The use case was tested on a Yandex Data Processing cluster version 2.0 with:

  • Spark and Hadoop components installed.
  • Object Storage bucket connected, and the cluster's service account having read and write permissions for this bucket.
  • Access to the Maven Central repository configured.
  • Component properties configured to enable downloading Apache Iceberg™ libraries from Maven Central.

To create an Apache Iceberg™ table and start working with it:

  1. Specify the settings for the catalog that will contain the table.

    Apache Iceberg™ works with tables at the catalog level. Settings are specified separately for each catalog; you cannot specify settings for all catalogs at once.

    To configure a Hadoop catalog named sample, set up these properties at the cluster or individual job level as follows:

    • spark:spark.sql.catalog.sample property to org.apache.iceberg.spark.SparkCatalog

    • spark:spark.sql.catalog.sample.type property to hadoop

    • spark:spark.sql.catalog.sample.warehouse property to s3a://<bucket_name>/warehouse/

      Table data will be stored in the bucket at the warehouse/ path.

    For more information about the properties that affect catalog settings, see the Apache Iceberg™ documentation.

  2. Use SSH to connect to the Yandex Data Processing cluster's master host.

  3. Run a Spark SQL session:

    spark-sql
    

    You will perform all further actions within this session.

  4. Switch to the db database in the sample catalog:

    USE sample.db;
    
  5. Create a two-column table named mytable:

    CREATE TABLE mytable (id bigint, data string) USING iceberg;
    
  6. View table information:

    DESC FORMATTED mytable;
    

    Result example:

    id      bigint
    data    string
    
    # Partitioning
    Not partitioned
    
    # Detailed Table Information
    Name    sample.db.mytable
    Location        s3a://<bucket_name>/warehouse/db/mytable
    Provider        iceberg
    Owner   ubuntu
    Table Properties        [current-snapshot-id=none,format=iceberg/parquet]
    
  7. Insert a few rows into the table:

    INSERT INTO mytable VALUES (1, 'a'), (2, 'b'), (3, 'c');
    
  8. Run a test query against the table:

    SELECT count(1), data FROM mytable GROUP BY data;
    

    Result example:

    1       a
    1       b
    1       c
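
The same steps can also be run from a PySpark job instead of an interactive spark-sql session. The sketch below is a rough equivalent under several assumptions: the Apache Iceberg™ libraries are already available to the cluster or job as described earlier, the function names catalog_conf and run_example are hypothetical, and the property keys are passed to the Spark session without the spark: prefix, on the assumption that the prefix only designates the component in Yandex Data Processing properties.

```python
from typing import Dict


def catalog_conf(catalog: str, bucket: str) -> Dict[str, str]:
    """Spark session properties for a Hadoop-type Iceberg catalog.

    Mirrors the three catalog properties from step 1, plus the SQL extension.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": f"s3a://{bucket}/warehouse/",
    }


def run_example(bucket: str) -> None:
    """Create, fill, and query the sample table from steps 4 to 8."""
    # Imported here so that catalog_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("iceberg-example")
    for key, value in catalog_conf("sample", bucket).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # Same statements as in the spark-sql session above.
    spark.sql("USE sample.db")
    spark.sql(
        "CREATE TABLE IF NOT EXISTS mytable (id bigint, data string) USING iceberg"
    )
    spark.sql("INSERT INTO mytable VALUES (1, 'a'), (2, 'b'), (3, 'c')")
    spark.sql("SELECT count(1), data FROM mytable GROUP BY data").show()

    spark.stop()
```

Calling run_example("<bucket_name>") on the master host or in a submitted job should produce the same three groups as the query in step 8. Unlike step 5, the sketch uses CREATE TABLE IF NOT EXISTS so it can be rerun; each rerun still inserts the three rows again, so the counts grow with every run.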
    
