Setting up Apache Iceberg™ in a Yandex Data Processing cluster

Updated at December 26, 2024

Yandex Data Processing 2.0 or higher supports using Apache Iceberg™ tables together with the Apache Spark™ engine.

For more information about Apache Iceberg™, see Apache Iceberg™ in Yandex Data Processing and the official documentation.

Note

Apache Iceberg™ is not part of Yandex Data Processing. It is not covered by Yandex Cloud support and its usage is not governed by the Yandex Data Processing Terms of Use.

Set up component properties to work with Apache Iceberg™

Set the spark:spark.sql.extensions property to org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions. You can do this at either the cluster or the individual job level.
Add the Apache Iceberg™ libraries to the dependencies of your cluster or individual job. The library versions must be compatible with the Yandex Data Processing image version.
Image 2.0.x

To add the required library versions, use one of the following methods:
- Set up access to the Maven repository and set the spark:spark.jars.packages property to org.apache.iceberg:iceberg-spark-runtime-3.0_2.12:1.0.0.
  
  You can set up Maven access in two ways:
  - In your cluster's security group, allow network access to the Maven Central repository.
  - Configure an alternative Maven repository and allow traffic to it in the cluster security group.
- Download the iceberg-spark-runtime-3.0_2.12-1.0.0.jar library file and make it available in one of the following ways:
  - Save the file to a Yandex Object Storage bucket and provide the file's URL in the spark:spark.jars property.
    
    The file's URL has the following format: s3a://<bucket_name>/<file_path>.
    
    This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.
  - Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the spark:spark.driver.extraClassPath and spark:spark.executor.extraClassPath properties.

Image 2.1.0–2.1.3

To add the required library versions, use one of the following methods:
- Set up access to the Maven repository and set the spark:spark.jars.packages property to org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.3.
  
  You can set up Maven access in two ways:
  - In your cluster's security group, allow network access to the Maven Central repository.
  - Configure an alternative Maven repository and allow traffic to it in the cluster security group.
- Download the iceberg-spark-runtime-3.2_2.12-1.4.3.jar library file and make it available in one of the following ways:
  - Save the file to a Yandex Object Storage bucket and provide the file's URL in the spark:spark.jars property.
    
    The file's URL has the following format: s3a://<bucket_name>/<file_path>.
    
    This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.
  - Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the spark:spark.driver.extraClassPath and spark:spark.executor.extraClassPath properties.

Image 2.1.4–2.1.x

To add the required library versions, use one of the following methods:
- Set up access to the Maven repository and set the spark:spark.jars.packages property to org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2.
  
  You can set up Maven access in two ways:
  - In your cluster's security group, allow network access to the Maven Central repository.
  - Configure an alternative Maven repository and allow traffic to it in the cluster security group.
- Download the iceberg-spark-runtime-3.3_2.12-1.5.2.jar library file and make it available in one of the following ways:
  - Save the file to a Yandex Object Storage bucket and provide the file's URL in the spark:spark.jars property.
    
    The file's URL has the following format: s3a://<bucket_name>/<file_path>.
    
    This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.
  - Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the spark:spark.driver.extraClassPath and spark:spark.executor.extraClassPath properties.

Image 2.2.x

To add the required library versions, use one of the following methods:
- Set up access to the Maven repository and set the spark:spark.jars.packages property to org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2.
  
  You can set up Maven access in two ways:
  - In your cluster's security group, allow network access to the Maven Central repository.
  - Configure an alternative Maven repository and allow traffic to it in the cluster security group.
- Download the iceberg-spark-runtime-3.5_2.12-1.5.2.jar library file and make it available in one of the following ways:
  - Save the file to a Yandex Object Storage bucket and provide the file's URL in the spark:spark.jars property.
    
    The file's URL has the following format: s3a://<bucket_name>/<file_path>.
    
    This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.
  - Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the spark:spark.driver.extraClassPath and spark:spark.executor.extraClassPath properties.

You can now use Apache Iceberg™ in your Yandex Data Processing cluster.
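
The per-image instructions above differ only in the library coordinates, so the choice can be expressed as a small lookup. The following is a minimal Python sketch, not an official tool: the helper names iceberg_package and iceberg_properties are hypothetical, and the sketch assumes the image version is given as three dot-separated numbers such as 2.1.4.

```python
from typing import Dict

# Prefix shared by all Maven coordinates listed in the per-image instructions.
_RUNTIME = "org.apache.iceberg:iceberg-spark-runtime-"


def iceberg_package(image_version: str) -> str:
    """Return the Maven coordinates listed above for a given image version.

    Hypothetical helper: expects a version such as "2.1.4" (three numbers).
    """
    parts = image_version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected a version like 2.1.4, got {image_version!r}")
    major, minor, patch = (int(p) for p in parts)

    if (major, minor) == (2, 0):
        return _RUNTIME + "3.0_2.12:1.0.0"
    if (major, minor) == (2, 1):
        # Images 2.1.0-2.1.3 and 2.1.4 and later use different runtimes.
        return _RUNTIME + ("3.2_2.12:1.4.3" if patch <= 3 else "3.3_2.12:1.5.2")
    if (major, minor) == (2, 2):
        return _RUNTIME + "3.5_2.12:1.5.2"
    raise ValueError(f"no Iceberg runtime listed for image {image_version}")


def iceberg_properties(image_version: str) -> Dict[str, str]:
    """Build the two properties from the setup steps (Maven-based method)."""
    return {
        "spark:spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark:spark.jars.packages": iceberg_package(image_version),
    }


if __name__ == "__main__":
    for key, value in iceberg_properties("2.1.4").items():
        print(f"{key}={value}")
```

The sketch covers only the Maven-based method. If you supply the JAR file through spark:spark.jars or the extraClassPath properties instead, the same version mapping applies to the file name.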

Apache Iceberg™ usage example

The use case was tested on a Yandex Data Processing cluster version 2.0 with:

- Spark and Hadoop components installed.
- Object Storage bucket connected, and the cluster's service account having read and write permissions for this bucket.
- Access to the Maven Central repository configured.
- Component properties configured to enable downloading Apache Iceberg™ libraries from Maven Central.

To create an Apache Iceberg™ table and start working with it:

Specify the settings for the catalog that will contain the table.

Apache Iceberg™ works with tables at the level of individual catalogs. Settings are specified separately for each catalog; you cannot set them for all catalogs at once.

To configure the sample Hadoop catalog, set the following properties at the cluster or individual job level:
- spark:spark.sql.catalog.sample to org.apache.iceberg.spark.SparkCatalog
- spark:spark.sql.catalog.sample.type to hadoop
- spark:spark.sql.catalog.sample.warehouse to s3a://<bucket_name>/warehouse/
  
  Table data will be stored in the bucket at the warehouse/ path.
For more information about the properties that affect catalog settings, see the Apache Iceberg™ documentation.
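
Because every property key embeds the catalog name, the three settings above can be generated from the name and the bucket. Here is a minimal Python sketch of that pattern; the function name catalog_properties and the bucket name my-bucket are hypothetical, and only the Hadoop catalog type from this example is covered.

```python
from typing import Dict


def catalog_properties(catalog: str, bucket: str,
                       path: str = "warehouse/") -> Dict[str, str]:
    """Build the three per-catalog properties used in this example.

    Hypothetical helper: `catalog` is the catalog name (sample above),
    `bucket` is the Object Storage bucket, `path` is the warehouse path in it.
    """
    prefix = f"spark:spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": f"s3a://{bucket}/{path}",
    }


if __name__ == "__main__":
    # "my-bucket" is a placeholder; substitute your own bucket name.
    for key, value in catalog_properties("sample", "my-bucket").items():
        print(f"{key}={value}")
```

Calling the function with a different name yields a separate set of keys, which reflects the point that each catalog is configured on its own.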
Use SSH to connect to the Yandex Data Processing cluster's master host.
Run a Spark SQL session:
```
spark-sql
```
You will perform all further actions within this session.
Switch to the db database in the sample catalog:
```
USE sample.db;
```

Create a two-column table named mytable:

```
CREATE TABLE mytable (id bigint, data string) USING iceberg;
```

View table information:

```
DESC FORMATTED mytable;
```

Result example:

```
id      bigint
data    string

# Partitioning
Not partitioned

# Detailed Table Information
Name    sample.db.mytable
Location        s3a://<bucket_name>/warehouse/db/mytable
Provider        iceberg
Owner   ubuntu
Table Properties        [current-snapshot-id=none,format=iceberg/parquet]
```

Insert several rows into the table:

```
INSERT INTO mytable VALUES (1, 'a'), (2, 'b'), (3, 'c');
```

Run a test query against the table:

```
SELECT count(1), data FROM mytable GROUP BY data;
```

Result example:

```
1       a
1       b
1       c
```
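
For scripted runs, the same steps could be driven from Python instead of an interactive spark-sql session. The sketch below is one possible arrangement under stated assumptions: it expects an existing SparkSession object that already has the Apache Iceberg™ extension and the sample catalog properties configured, and the names STATEMENTS, QUERY, and run_example are illustrative only.

```python
# Statements from the walkthrough above, in the order they are run.
STATEMENTS = [
    "USE sample.db",
    "CREATE TABLE mytable (id bigint, data string) USING iceberg",
    "INSERT INTO mytable VALUES (1, 'a'), (2, 'b'), (3, 'c')",
]

# The test query from the last step.
QUERY = "SELECT count(1), data FROM mytable GROUP BY data"


def run_example(spark):
    """Run the walkthrough on an existing session and return the query rows.

    `spark` is assumed to be a SparkSession (anything with a compatible
    sql() method works) that is already configured for the sample catalog.
    """
    for statement in STATEMENTS:
        spark.sql(statement)
    return spark.sql(QUERY).collect()
```

In a pyspark shell, the predefined spark variable could be passed as run_example(spark). A second run would fail at the CREATE TABLE statement because the table already exists.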
