Delta Lake in Yandex Data Processing
Delta Lake:
- Adds an optimized storage layer for table data with ACID transaction support.
- Enables scalable processing of metadata.
- Allows updating data in analytical tables stored as Parquet files in HDFS or S3-compatible storage.
- Supports both batch queries and streaming data operations.
You can set up Delta Lake in Yandex Data Processing clusters:
- In single-cluster mode for Yandex Data Processing 2.0 and 2.1
- In multi-cluster mode for Yandex Data Processing 2.1 and higher
Although single-cluster mode allows using tables from different clusters and Apache Spark™ jobs, concurrent writes from multiple sources may cause table data loss. To avoid this, you need to configure data writes additionally.
In multi-cluster mode, access to Delta Lake tables from different clusters and Apache Spark™ jobs is managed by an auxiliary database. In Yandex Cloud, this role is performed by Yandex Managed Service for YDB.
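As a rough illustration of what such a setup involves on the Spark side, the sketch below builds the Spark properties that the open-source Delta Lake project documents for enabling its SQL extension and catalog. The helper name `delta_spark_properties` and the JAR URI in the usage note are hypothetical; the exact properties required for each Yandex Data Processing mode are an assumption here and should be checked against the setup guide for that mode.

```python
from typing import Dict, Iterable


def delta_spark_properties(jar_uris: Iterable[str] = ()) -> Dict[str, str]:
    """Build Spark properties that enable Delta Lake in a Spark session.

    The two fixed keys are the standard open-source Delta Lake settings.
    jar_uris is an optional list of library locations (hypothetical values);
    it is passed through the standard comma-separated spark.jars property.
    """
    props = {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }
    jars = ",".join(jar_uris)
    if jars:
        props["spark.jars"] = jars
    return props
```

For example, `delta_spark_properties(["s3a://my-bucket/jars/delta-core.jar"])` (a placeholder URI) returns a dictionary that can be supplied as cluster or job properties.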
Note
Delta Lake is not part of Yandex Data Processing. It is not covered by Yandex Cloud support, and its usage is not governed by the Yandex Data Processing Terms of Use.
For more information about Delta Lake, see the Delta Lake documentation.
Delta Lake and Yandex Data Processing version compatibility
Delta Lake and Yandex Data Processing versions are only compatible if the Delta Lake version is compatible with the Apache Spark™ version used in the cluster. The table below lists compatible versions and links to library files that you will need to set up Delta Lake in your cluster.
| Yandex Data Processing version | Apache Spark™ version | Delta Lake version | JAR files |
|---|---|---|---|
| 2.0.x | 3.0.3 | | |
| 2.1.0 and 2.1.3 | 3.2.1 | | |
| 2.1.4 and higher | 3.3.2 | | |
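The image-to-Spark mapping in the table can be expressed as a small lookup, which may be handy when a script needs to pick matching library files. This is a minimal sketch: the function name `spark_version_for` is hypothetical, and it assumes that "2.1.4 and higher" refers to 2.1.x patch releases only. Versions not listed in the table return `None` rather than a guess.

```python
from typing import Optional


def spark_version_for(image_version: str) -> Optional[str]:
    """Return the Apache Spark version for a Yandex Data Processing image.

    Follows the compatibility table: 2.0.x -> 3.0.3, 2.1.0 and 2.1.3 -> 3.2.1,
    2.1.4 and higher -> 3.3.2. Returns None for versions the table does not list.
    """
    parts = image_version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like '2.1.4', got {image_version!r}")
    major, minor, patch = (int(p) for p in parts)

    if (major, minor) == (2, 0):
        return "3.0.3"
    if (major, minor) == (2, 1):
        if patch in (0, 3):
            return "3.2.1"
        if patch >= 4:  # assumption: "and higher" means later 2.1.x patches
            return "3.3.2"
    return None
```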
Note
Yandex Data Processing 2.1.x clusters are at the Preview stage and are provided upon request. To request access, contact support.
Delta Lake 2.x key advantages
Here are the key advantages of Delta Lake 2.x as compared to 0.8.0:
- Support for multi-cluster mode provides automated orchestration of changes to data in a single table from different Apache Spark™ jobs and Yandex Data Processing clusters.
- The idempotent data write feature allows maintaining exactly-once processing of data streams.
- The Change Data Feed feature allows tracking changes to data in Delta Lake tables.
- The Z-Ordering feature implements multidimensional clustering of Delta Lake tables. It speeds up queries that filter on the columns used for clustering.
- Support for dynamic partition overwrites.
- Query performance optimization by merging small files into larger ones.
- Support for table rollbacks to the previous state.
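Several of these features are exposed through Delta Lake SQL statements and write options. The sketch below only builds the statement text and option dictionaries; it assumes the results are passed to `spark.sql(...)` or `DataFrameWriter.options(...)` in a session where Delta Lake 2.x is already configured. The helper names and the `events` table in the usage note are hypothetical, and identifiers are inserted as-is without validation.

```python
from typing import Dict, Sequence


def compact_sql(table: str) -> str:
    """Merge small files into larger ones."""
    return f"OPTIMIZE {table}"


def zorder_sql(table: str, columns: Sequence[str]) -> str:
    """Compact the table and cluster its data by the given columns."""
    if not columns:
        raise ValueError("at least one column is required for Z-Ordering")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


def enable_change_data_feed_sql(table: str) -> str:
    """Turn on Change Data Feed for an existing table."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"


def restore_sql(table: str, version: int) -> str:
    """Roll the table back to an earlier version."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


def idempotent_write_options(app_id: str, batch_version: int) -> Dict[str, str]:
    """Writer options that let Delta Lake skip a batch it has already committed."""
    return {"txnAppId": app_id, "txnVersion": str(batch_version)}


def dynamic_overwrite_options() -> Dict[str, str]:
    """Writer option that overwrites only the partitions present in the new data."""
    return {"partitionOverwriteMode": "dynamic"}
```

For example, `spark.sql(zorder_sql("events", ["user_id", "event_date"]))` would cluster a hypothetical `events` table, and `df.write.format("delta").mode("overwrite").options(**dynamic_overwrite_options())` would limit an overwrite to the affected partitions.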