Setting up Delta Lake in single-cluster mode
Yandex Data Proc 2.0 or higher supports using Delta Lake in single-cluster mode.
For more information about Delta Lake, see the Delta Lake in Yandex Data Proc section of the Delta Lake documentation.
Note

Delta Lake is not part of Yandex Data Proc. It is not covered by Yandex Cloud support, and its usage is not governed by the Yandex Data Proc Terms of Use.
Warning
If different Spark jobs in single-cluster mode update table data concurrently, this may cause data loss.

Set up your Spark jobs to avoid concurrent data modifications, or use multi-cluster mode. For more information, see the Delta Lake documentation.
Prepare the infrastructure
1. If you do not have a Yandex Data Proc cluster, create one.

2. If you attached a Yandex Object Storage bucket for data storage to your cluster:

   - Create a folder named `warehouse` in the bucket.
   - Set the `spark.sql.warehouse.dir` property to `s3a://<bucket_name>/warehouse/`.

3. Create a Hive Metastore cluster and connect it to your Yandex Data Proc cluster.
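The bucket preparation step can be sketched with the AWS CLI, which works against the S3-compatible API of Object Storage. This is a minimal sketch, assuming the AWS CLI is already configured with static access keys for the cluster service account; the bucket name is a placeholder:

```shell
# Sketch: create an empty warehouse/ "folder" (zero-byte key with a
# trailing slash) in the bucket over the S3-compatible API.
aws s3api put-object \
    --endpoint-url https://storage.yandexcloud.net \
    --bucket <bucket_name> \
    --key warehouse/
```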
Set up the component properties to work with Delta Lake
1. Set the following properties at the level of your cluster or individual job:

   - `spark.sql.extensions` to `io.delta.sql.DeltaSparkSessionExtension`;
   - `spark.sql.catalog.spark_catalog` to `org.apache.spark.sql.delta.catalog.DeltaCatalog`.
2. Add the Delta Lake libraries to the dependencies of your cluster or individual job. The required library versions depend on the Yandex Data Proc version:

   **Yandex Data Proc 2.0.x**

   Use one of these methods:

   - Download the `delta-core_2.12-0.8.0.jar` library file, save it to your Object Storage bucket, and provide the file URL in the `spark.jars` property:

     `spark.jars=s3a://<bucket_name>/<file_path>`

     Make sure the cluster service account is granted read access to the bucket.

   - Set up cluster access to the Maven repository and set the `spark.jars.packages` property to `io.delta:delta-core_2.12:0.8.0`.

     You can set up Maven access in two ways:

     - In your cluster's security group, allow network access to the Maven Central repository.
     - Configure an alternative Maven repository and allow traffic to it in the cluster security group.

   - Download the `delta-core_2.12-0.8.0.jar` library file, copy it to all the cluster nodes manually or using initialization scripts, and provide the full file path in the `spark.driver.extraClassPath` and `spark.executor.extraClassPath` properties.
   **Yandex Data Proc 2.1.0 or 2.1.3**

   Use one of these methods:

   - Download the `delta-core_2.12-2.0.2.jar` and `delta-storage-2.0.2.jar` library files, save them to your Object Storage bucket, and provide the comma-separated file URLs in the `spark.jars` property:

     `spark.jars=s3a://<bucket_name>/<path_to_core_file>,s3a://<bucket_name>/<path_to_storage_file>`

     Make sure the cluster service account is granted read access to the bucket.

   - Set up cluster access to the Maven repository and set the `spark.jars.packages` property to `io.delta:delta-core_2.12:2.0.2,io.delta:delta-storage:2.0.2`.

     You can set up Maven access in two ways:

     - In your cluster's security group, allow network access to the Maven Central repository.
     - Configure an alternative Maven repository and allow traffic to it in the cluster security group.

   - Download the `delta-core_2.12-2.0.2.jar` and `delta-storage-2.0.2.jar` library files, copy them to all the cluster nodes manually or using initialization scripts, and provide the full file paths in the `spark.driver.extraClassPath` and `spark.executor.extraClassPath` properties.
   **Yandex Data Proc 2.1.4 and higher**

   Use one of these methods:

   - Download the `delta-core_2.12-2.3.0.jar` and `delta-storage-2.3.0.jar` library files, save them to your Object Storage bucket, and provide the comma-separated file URLs in the `spark.jars` property:

     `spark.jars=s3a://<bucket_name>/<path_to_core_file>,s3a://<bucket_name>/<path_to_storage_file>`

     Make sure the cluster service account is granted read access to the bucket.

   - Set up cluster access to the Maven repository and set the `spark.jars.packages` property to `io.delta:delta-core_2.12:2.3.0,io.delta:delta-storage:2.3.0`.

     You can set up Maven access in two ways:

     - In your cluster's security group, allow network access to the Maven Central repository.
     - Configure an alternative Maven repository and allow traffic to it in the cluster security group.

   - Download the `delta-core_2.12-2.3.0.jar` and `delta-storage-2.3.0.jar` library files, copy them to all the cluster nodes manually or using initialization scripts, and provide the full file paths in the `spark.driver.extraClassPath` and `spark.executor.extraClassPath` properties.
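Taken together, the per-job configuration described above can be sketched as a single `spark-submit` invocation. This is a sketch, not the only valid form: it assumes Yandex Data Proc 2.1.4 or higher, cluster network access to Maven Central, and a hypothetical job file name:

```shell
# Sketch: enable Delta Lake for one PySpark job by passing all the
# required properties as --conf flags (Data Proc 2.1.4+, Delta 2.3.0).
spark-submit \
    --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
    --conf spark.jars.packages=io.delta:delta-core_2.12:2.3.0,io.delta:delta-storage:2.3.0 \
    <your_job>.py
```

Setting the same properties at the cluster level instead makes them apply to every job without repeating the flags.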
You can now use Delta Lake in your Yandex Data Proc cluster.

If the above Spark properties are provided at the cluster level, you can use Spark Thrift Server to work with Delta Lake tables.
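For example, you could connect to Spark Thrift Server from the master host with `beeline`, the JDBC client shipped with Spark. This is a sketch assuming the default Thrift Server port (10000); the host name is a placeholder:

```shell
# Sketch: open a JDBC session against Spark Thrift Server; Delta Lake
# tables are then available via plain SQL in this session.
beeline -u "jdbc:hive2://<master_host_FQDN>:10000"
```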
Delta Lake usage example
The use case was tested on a Yandex Data Proc cluster version 2.0 with access to the Maven Central repository.
1. Use SSH to connect to the Yandex Data Proc cluster's master host.

2. Run a Spark session in the cluster by providing the required parameters:

   ```bash
   spark-sql \
       --conf spark.jars.packages=io.delta:delta-core_2.12:0.8.0 \
       --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
       --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
   ```
3. Within the active session, create a database and switch to it:

   ```sql
   CREATE DATABASE testdelta;
   USE testdelta;
   ```
4. Create a test table and populate it with data:

   ```sql
   CREATE TABLE tab1(a INTEGER NOT NULL, b VARCHAR(100)) USING DELTA;
   INSERT INTO tab1 VALUES (1,'One'), (2,'Two'), (3,'Three');
   ```
5. Replace the `b` column values by appending the `a` column values, converted to strings:

   ```sql
   UPDATE tab1 SET b=b || ' ** ' || CAST(a AS VARCHAR(10));
   ```
6. Check the result:

   ```sql
   SELECT * FROM tab1;
   ```

   Result:

   ```text
   3	Three ** 3
   2	Two ** 2
   1	One ** 1
   ```
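Because the table is stored in the Delta format, every operation above is recorded in the table's transaction log. As a follow-up check, you could inspect that log with Delta Lake's `DESCRIBE HISTORY` SQL command in a non-interactive `spark-sql` call; this sketch reuses the same session parameters as the example:

```shell
# Sketch: list the CREATE/INSERT/UPDATE operations recorded for tab1
# in the Delta transaction log.
spark-sql \
    --conf spark.jars.packages=io.delta:delta-core_2.12:0.8.0 \
    --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
    --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
    -e "DESCRIBE HISTORY testdelta.tab1;"
```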