Tips for setting up and using Delta Lake
Optimizing data writes to S3-compatible storage
If a job writes some of its data in formats other than Delta Lake tables, configure S3A committers to optimize writes of that data to S3-compatible storage.
If all the data a job writes is stored in Delta Lake tables, there is no need to configure S3A committers: Delta Lake uses its own algorithm to control writes to S3-compatible storage, and its functionality is equivalent to that of S3A committers.
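The exact settings depend on the storage and the committer you choose; the lines below are a minimal spark-defaults.conf sketch for the S3A magic committer, based on the Hadoop S3A committer and Spark cloud integration documentation (the choice of the magic committer is an assumption, not a Delta Lake requirement):
# Minimal S3A committer setup (spark-defaults.conf style); requires Spark's hadoop-cloud module on the classpath
spark.hadoop.fs.s3a.committer.name                 magic
spark.hadoop.fs.s3a.committer.magic.enabled        true
spark.sql.sources.commitProtocolClass              org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class           org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter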
Boosting OPTIMIZE operator performance
The degree of parallelism of the OPTIMIZE operator is controlled by the spark.databricks.delta.optimize.maxThreads property and is 10 by default. To speed up the optimization of large tables, increase the property value. You can use much larger values, e.g., 100 or 1000, if the cluster resources allow running that many concurrent operations.
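For example, to raise the limit for the current session before compacting a large table (my_large_table is a placeholder name):
SET spark.databricks.delta.optimize.maxThreads = 100;
OPTIMIZE my_large_table;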
Syntax for converting partitioned tables
The CONVERT TO DELTA operator converts standard Spark SQL tables to the Delta Lake format. To convert a partitioned table, specify the partitioning columns in the statement:
CONVERT TO DELTA table_name PARTITIONED BY (part_col_1 INT, part_col_2 INT);
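The same PARTITIONED BY clause is used when converting a path-based Parquet dataset; in the sketch below, the bucket, path, and column names are placeholders:
CONVERT TO DELTA parquet.`s3a://my-bucket/events` PARTITIONED BY (year INT, month INT);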
Forcing table change history cleanup
By default, Delta Lake stores the history of table changes for 30 days. The retention period is set at the table level in the delta.logRetentionDuration parameter and can be edited using this command:
ALTER TABLE <table_schema_and_name> SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval <interval>');
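For example, to shorten the retention to seven days for a hypothetical sales.events table and then verify the property value:
ALTER TABLE sales.events SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days');
SHOW TBLPROPERTIES sales.events ('delta.logRetentionDuration');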
To learn more about managing table parameters, see the Delta Lake documentation.
To force the table change history cleanup:
- Rearrange the table data to optimize access:
OPTIMIZE <table_name>;
- Allow deleting the entire change history by disabling the retention duration check:
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
- Clear the change history:
VACUUM <table_name> RETAIN 0 HOURS;
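After the cleanup, consider re-enabling the safety check so that later VACUUM runs cannot accidentally delete recent history (a suggested follow-up, not part of the procedure above):
SET spark.databricks.delta.retentionDurationCheck.enabled = true;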