Setting up Apache Iceberg™ in a Yandex Data Processing cluster
Yandex Data Processing 2.0 or higher supports using Apache Iceberg™ tables together with the Apache Spark™ engine.
For more information about Apache Iceberg™, see Apache Iceberg™ in Yandex Data Processing and the official documentation.
Note
Apache Iceberg™ is not part of Yandex Data Processing. It is not covered by Yandex Cloud support, and its use is not governed by the Yandex Data Processing Terms of Use.
Set up component properties to work with Apache Iceberg™
- Set the `spark:spark.sql.extensions` property to `org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions`. You can do this either at the cluster or individual job level (a session-level sketch follows this list).

- Add the Apache Iceberg™ libraries to the dependencies of your cluster or individual job. The library version must be compatible with your Yandex Data Processing version; follow the instructions for your image below.
  Image 2.0.x

  To add the required library version, use one of the following methods:

  - Set up access to the Maven repository and set the `spark:spark.jars.packages` property to `org.apache.iceberg:iceberg-spark-runtime-3.0_2.12:1.0.0`. You can set up Maven access in two ways:

    - In your cluster's security group, allow network access to the Maven Central repository.
    - Configure an alternative Maven repository and allow traffic to it in the cluster's security group.

  - Download the `iceberg-spark-runtime-3.0_2.12-1.0.0.jar` library file and grant access to it using one of the following methods:

    - Save the file to a Yandex Object Storage bucket and provide the file's URL in the `spark:spark.jars` property. The URL has the following format: `s3a://<bucket_name>/<file_path>`. The bucket must be specified in the cluster settings, and the cluster's service account must have read access to it.

    - Copy the file to all cluster nodes manually or using initialization scripts, and provide the full file path in the `spark:spark.driver.extraClassPath` and `spark:spark.executor.extraClassPath` properties.
  Image 2.1.0–2.1.3

  To add the required library version, use one of the following methods:

  - Set up access to the Maven repository and set the `spark:spark.jars.packages` property to `org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.3`. You can set up Maven access in two ways:

    - In your cluster's security group, allow network access to the Maven Central repository.
    - Configure an alternative Maven repository and allow traffic to it in the cluster's security group.

  - Download the `iceberg-spark-runtime-3.2_2.12-1.4.3.jar` library file and grant access to it using one of the following methods:

    - Save the file to a Yandex Object Storage bucket and provide the file's URL in the `spark:spark.jars` property. The URL has the following format: `s3a://<bucket_name>/<file_path>`. The bucket must be specified in the cluster settings, and the cluster's service account must have read access to it.

    - Copy the file to all cluster nodes manually or using initialization scripts, and provide the full file path in the `spark:spark.driver.extraClassPath` and `spark:spark.executor.extraClassPath` properties.
  Image 2.1.4–2.1.x

  To add the required library version, use one of the following methods:

  - Set up access to the Maven repository and set the `spark:spark.jars.packages` property to `org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.2`. You can set up Maven access in two ways:

    - In your cluster's security group, allow network access to the Maven Central repository.
    - Configure an alternative Maven repository and allow traffic to it in the cluster's security group.

  - Download the `iceberg-spark-runtime-3.3_2.12-1.5.2.jar` library file and grant access to it using one of the following methods:

    - Save the file to a Yandex Object Storage bucket and provide the file's URL in the `spark:spark.jars` property. The URL has the following format: `s3a://<bucket_name>/<file_path>`. The bucket must be specified in the cluster settings, and the cluster's service account must have read access to it.

    - Copy the file to all cluster nodes manually or using initialization scripts, and provide the full file path in the `spark:spark.driver.extraClassPath` and `spark:spark.executor.extraClassPath` properties.
  Image 2.2.x

  To add the required library version, use one of the following methods:

  - Set up access to the Maven repository and set the `spark:spark.jars.packages` property to `org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2`. You can set up Maven access in two ways:

    - In your cluster's security group, allow network access to the Maven Central repository.
    - Configure an alternative Maven repository and allow traffic to it in the cluster's security group.

  - Download the `iceberg-spark-runtime-3.5_2.12-1.5.2.jar` library file and grant access to it using one of the following methods:

    - Save the file to a Yandex Object Storage bucket and provide the file's URL in the `spark:spark.jars` property. The URL has the following format: `s3a://<bucket_name>/<file_path>`. The bucket must be specified in the cluster settings, and the cluster's service account must have read access to it.

    - Copy the file to all cluster nodes manually or using initialization scripts, and provide the full file path in the `spark:spark.driver.extraClassPath` and `spark:spark.executor.extraClassPath` properties.
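The same two properties can also be set for an individual PySpark session. The following is a minimal sketch, not a definitive configuration: it assumes a 2.2.x image (hence the 3.5_2.12 runtime), network access to Maven Central, and a hypothetical application name. Note that the `spark:` prefix is only used when setting properties through Yandex Data Processing cluster or job settings; inside Spark, the property names are used as-is.

```python
# Minimal sketch (assumptions: image 2.2.x, Maven Central reachable,
# application name is a hypothetical example).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-setup-check")
    # Enable the Iceberg SQL extensions
    # (same value as the spark:spark.sql.extensions property above).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Fetch the Iceberg Spark runtime from Maven when the session starts
    # (same value as the spark:spark.jars.packages property for image 2.2.x).
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2",
    )
    .getOrCreate()
)

# Quick check that the session started.
print(spark.version)
```

Keep in mind that `spark.jars.packages` only takes effect if it is set before the Spark session is created, so configuring it in the cluster or job properties, as described above, is the more reliable option.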
You can now use Apache Iceberg™ in your Yandex Data Processing cluster.
Apache Iceberg™ use case example
This use case was tested on a Yandex Data Processing cluster version 2.0 with:
- Spark and Hadoop components installed.
- An Object Storage bucket connected, with the cluster's service account granted read and write permissions for this bucket.
- Access to the Maven Central repository configured.
- Component properties configured to enable downloading Apache Iceberg™ libraries from Maven Central.
To create an Apache Iceberg™ table and start working with it:
- Specify the settings for the folder that will contain the table.

  Apache Iceberg™ operates with tables at the level of an individual folder. Folder settings are specified per folder; you cannot specify settings for all folders at once.

  To configure a Hadoop folder named `sample`, set the following properties at the cluster or individual job level:

  - `spark:spark.sql.catalog.sample` to `org.apache.iceberg.spark.SparkCatalog`
  - `spark:spark.sql.catalog.sample.type` to `hadoop`
  - `spark:spark.sql.catalog.sample.warehouse` to `s3a://<bucket_name>/warehouse/`

  Table data will be stored in the bucket under the `warehouse/` path.

  For more information about the properties that affect folder settings, see this Apache Iceberg™ guide. You can also set these properties at the session level, as in the PySpark sketch after these steps.
- Use SSH to connect to the Yandex Data Processing cluster's master host.
- Run a Spark SQL session:

  `spark-sql`

  You will perform all further actions within this session.
- Create a database named `db` in the `sample` folder:

  `CREATE DATABASE sample.db;`
- Switch to the `db` database in the `sample` folder:

  `USE sample.db;`
- Create a two-column table named `mytable`:

  `CREATE TABLE mytable (id bigint, data string) USING iceberg;`
- View the table details:

  `DESC FORMATTED mytable;`

  Result example:

    id                bigint
    data              string

    # Partitioning
    Not partitioned

    # Detailed Table Information
    Name              sample.db.mytable
    Location          s3a://<bucket_name>/warehouse/db/mytable
    Provider          iceberg
    Owner             ubuntu
    Table Properties  [current-snapshot-id=none,format=iceberg/parquet]
- Insert some entries into the table:

  `INSERT INTO mytable VALUES (1, 'a'), (2, 'b'), (3, 'c');`
- Run a test query against the table:

  `SELECT count(1), data FROM mytable GROUP BY data;`

  Result example:

    1  a
    1  b
    1  c
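If you prefer to run this example as a PySpark job rather than an interactive `spark-sql` session, the sketch below reproduces the same steps. It is a minimal sketch under the following assumptions: the Iceberg extension and library properties from the previous section are already configured for the cluster or job, and `<bucket_name>` is a placeholder for your Object Storage bucket. The `sample` folder properties from step 1 are set here at the session level, which is one of the options that step allows.

```python
# PySpark sketch of the example above (assumes the Iceberg extension and
# runtime library are already configured for the cluster or job).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-example")
    # The "sample" Hadoop folder from step 1, configured at the session level.
    .config("spark.sql.catalog.sample", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.sample.type", "hadoop")
    .config("spark.sql.catalog.sample.warehouse", "s3a://<bucket_name>/warehouse/")
    .getOrCreate()
)

# The same SQL statements that were run in the spark-sql session.
spark.sql("CREATE DATABASE IF NOT EXISTS sample.db")
spark.sql("USE sample.db")
spark.sql("CREATE TABLE IF NOT EXISTS mytable (id bigint, data string) USING iceberg")
spark.sql("INSERT INTO mytable VALUES (1, 'a'), (2, 'b'), (3, 'c')")
spark.sql("SELECT count(1), data FROM mytable GROUP BY data").show()
```

`IF NOT EXISTS` is added so the sketch can be re-run without failing on objects that already exist; each re-run of the `INSERT` statement adds three more rows.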