Setting up Apache Iceberg™ in a Yandex Data Processing cluster
Yandex Data Processing 2.0 or higher supports using Apache Iceberg™ tables together with the Apache Spark™ engine.
For more information about Apache Iceberg™, see Apache Iceberg™ in Yandex Data Processing and the official documentation.
Note
Apache Iceberg™ is not part of Yandex Data Processing. It is not covered by Yandex Cloud support, and its usage is not governed by the Yandex Data Processing Terms of Use.
Set up component properties to work with Apache Iceberg™
- Set the `spark:spark.sql.extensions` property to `org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions`. You can do this either at the cluster or individual job level (see the example after this list).
- Add the Apache Iceberg™ libraries to the dependencies of your cluster or individual job. The library versions must be compatible with the Yandex Data Processing version.

  Image 2.0.x

  To add the required library versions, use one of the following methods:
  - Set up access to the Maven repository and set the `spark:spark.jars.packages` property to `org.apache.iceberg:iceberg-spark-runtime-3.0_2.12:1.0.0`.

    You can set up Maven access in two ways:

    - In your cluster's security group, allow network access to the Maven Central repository.
    - Configure an alternative Maven repository and allow traffic to it in the cluster security group.

  - Download the `iceberg-spark-runtime-3.0_2.12-1.0.0.jar` library file and grant access to it by one of the following methods:

    - Save the file to a Yandex Object Storage bucket and provide the file's URL in the `spark:spark.jars` property. The file's URL has the following format: `s3a://<bucket_name>/<file_path>`. This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.
    - Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the `spark:spark.driver.extraClassPath` and `spark:spark.executor.extraClassPath` properties.
  Image 2.1.0–2.1.3

  To add the required library versions, use one of the following methods:

  - Set up access to the Maven repository and set the `spark:spark.jars.packages` property to `org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.3`.

    You can set up Maven access in two ways:

    - In your cluster's security group, allow network access to the Maven Central repository.
    - Configure an alternative Maven repository and allow traffic to it in the cluster security group.

  - Download the `iceberg-spark-runtime-3.2_2.12-1.4.3.jar` library file and grant access to it by one of the following methods:

    - Save the file to a Yandex Object Storage bucket and provide the file's URL in the `spark:spark.jars` property. The file's URL has the following format: `s3a://<bucket_name>/<file_path>`. This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.
    - Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the `spark:spark.driver.extraClassPath` and `spark:spark.executor.extraClassPath` properties.
  Image 2.1.4–2.1.x

  To add the required library versions, use one of the following methods:

  - Set up access to the Maven repository and set the `spark:spark.jars.packages` property to `org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2`.

    You can set up Maven access in two ways:

    - In your cluster's security group, allow network access to the Maven Central repository.
    - Configure an alternative Maven repository and allow traffic to it in the cluster security group.

  - Download the `iceberg-spark-runtime-3.3_2.12-1.5.2.jar` library file and grant access to it by one of the following methods:

    - Save the file to a Yandex Object Storage bucket and provide the file's URL in the `spark:spark.jars` property. The file's URL has the following format: `s3a://<bucket_name>/<file_path>`. This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.
    - Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the `spark:spark.driver.extraClassPath` and `spark:spark.executor.extraClassPath` properties.
  Image 2.2.x

  To add the required library versions, use one of the following methods:

  - Set up access to the Maven repository and set the `spark:spark.jars.packages` property to `org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2`.

    You can set up Maven access in two ways:

    - In your cluster's security group, allow network access to the Maven Central repository.
    - Configure an alternative Maven repository and allow traffic to it in the cluster security group.

  - Download the `iceberg-spark-runtime-3.5_2.12-1.5.2.jar` library file and grant access to it by one of the following methods:

    - Save the file to a Yandex Object Storage bucket and provide the file's URL in the `spark:spark.jars` property. The file's URL has the following format: `s3a://<bucket_name>/<file_path>`. This bucket must be specified in the cluster settings. Make sure the cluster service account has read access to the bucket.
    - Copy the file to all the cluster nodes manually or using initialization scripts and provide the full file path in the `spark:spark.driver.extraClassPath` and `spark:spark.executor.extraClassPath` properties.
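If you set these properties for an individual job that you submit yourself, the same settings can be passed straight to Spark. The sketch below is only an illustration: it assumes the 2.2.x image, Maven Central access allowed in the security group, and a hypothetical application file named `my_job.py`; the `spark:` prefix is used when setting properties through Yandex Data Processing and is not part of the Spark setting name itself.

```bash
# Minimal sketch: pass the Iceberg extension and library to a single job.
# Adjust the iceberg-spark-runtime version to match your image.
# my_job.py is a placeholder for your application.
spark-submit \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.jars.packages=org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  my_job.py
```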
You can now use Apache Iceberg™ in your Yandex Data Processing cluster.
Apache Iceberg™ usage example
The use case was tested on a Yandex Data Processing cluster version 2.0 with:
- Spark and Hadoop components installed.
- Object Storage bucket connected, and the cluster's service account having read and write permissions for this bucket.
- Access to the Maven Central repository configured.
- Component properties configured to enable downloading Apache Iceberg™ libraries from Maven Central.
To create an Apache Iceberg™ table and start working with it:
- Specify the settings for the folder to contain the table.

  Apache Iceberg™ operates with tables at the individual folder level. Settings are specified per folder; you cannot specify them for all folders at once.

  To configure the `sample` Hadoop folder, set the following properties at the cluster or individual job level:

  - `spark:spark.sql.catalog.sample` to `org.apache.iceberg.spark.SparkCatalog`
  - `spark:spark.sql.catalog.sample.type` to `hadoop`
  - `spark:spark.sql.catalog.sample.warehouse` to `s3a://<bucket_name>/warehouse/`

  Table data will be stored in the bucket under the `warehouse/` path.

  For more information about the properties that affect folder settings, see the Apache Iceberg™ documentation.
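  If you prefer not to store these settings in the cluster configuration, they can also be passed to a single session. This is a minimal sketch, assuming the `spark:spark.sql.extensions` property from the setup section is already applied at the cluster level and `<bucket_name>` is your Object Storage bucket:

  ```bash
  # Start a Spark SQL session with the sample Hadoop catalog configured
  # for this session only.
  spark-sql \
    --conf spark.sql.catalog.sample=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.sample.type=hadoop \
    --conf spark.sql.catalog.sample.warehouse=s3a://<bucket_name>/warehouse/
  ```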
- Use SSH to connect to the Yandex Data Processing cluster's master host.
- Run a Spark SQL session:

      spark-sql

  You will perform all further actions within this session.
- Switch to the `db` database in the `sample` folder:

      USE sample.db;
- Create a two-column table named `mytable`:

      CREATE TABLE mytable (id bigint, data string) USING iceberg;
- View table information:

      DESC FORMATTED mytable;

  Result example:

      id                          bigint
      data                        string

      # Partitioning
      Not partitioned

      # Detailed Table Information
      Name                        sample.db.mytable
      Location                    s3a://<bucket_name>/warehouse/db/mytable
      Provider                    iceberg
      Owner                       ubuntu
      Table Properties            [current-snapshot-id=none,format=iceberg/parquet]
- Insert some entries into the table:

      INSERT INTO mytable VALUES (1, 'a'), (2, 'b'), (3, 'c');
- Run a test query against the table:

      SELECT count(1), data FROM mytable GROUP BY data;

  Result example:

      1       a
      1       b
      1       c
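You can also run the check without an interactive session. A minimal sketch, assuming the catalog settings above are applied at the cluster level and using the fully qualified table name shown in the DESC FORMATTED output:

```bash
# Run the test query non-interactively with the fully qualified table name.
spark-sql -e "SELECT count(1), data FROM sample.db.mytable GROUP BY data;"
```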