Tips for setting up and using Delta Lake

Updated at January 23, 2025

Optimizing data writes to S3-compatible storage
Boosting OPTIMIZE operator performance
Syntax for converting partitioned tables
Forcing table change history cleanup

Optimizing data writes to S3-compatible storage

If a job also writes data in formats other than Delta Lake tables, configure S3A committers to optimize writes to S3-compatible storage.

If all data within a job is stored in Delta Lake tables, there is no need to configure S3A committers. Delta Lake uses its own algorithm to control data writes to S3-compatible storage. Its functionality is equivalent to that of S3A committers.

Boosting OPTIMIZE operator performance

The OPTIMIZE operator in Delta Lake 2.0.2 speeds up table reads by merging multiple small files into larger ones. The merge runs as several concurrent jobs. The maximum number of these jobs is controlled by the spark.databricks.delta.optimize.maxThreads property, which defaults to 10.

To speed up the optimization procedure when handling large tables, increase the property value. You can use much larger values, e.g., 100 or 1000, if the cluster resources allow running this many concurrent operations.
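As a minimal sketch, assuming the property can be set at the session level with a SQL SET statement before OPTIMIZE runs, the tuning might look like this. The value 100 is illustrative, not a recommendation.

```sql
-- Assumption: 100 is an example value; choose it based on the
-- resources the cluster can spare for concurrent merge jobs.
SET spark.databricks.delta.optimize.maxThreads = 100;

-- <table_name> is a placeholder for the Delta Lake table to compact.
OPTIMIZE <table_name>;
```

A setting made this way applies only to the current session, so other jobs on the cluster keep the default.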

Syntax for converting partitioned tables

The CONVERT TO DELTA operator converts standard Spark SQL tables to Delta Lake format. To convert a partitioned table, specify the partitioning columns and their types in the statement:

```
CONVERT TO DELTA table_name PARTITIONED BY (part_col_1 INT, part_col_2 INT);
```
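The same statement can also target a Parquet directory by path rather than by table name. The sketch below assumes the data is stored as Parquet in S3-compatible storage; `<bucket>` and `<path_to_table>` are hypothetical placeholders, and the partitioning columns repeat the ones from the example above.

```sql
-- Path-based form: the table is identified by its storage location.
-- <bucket> and <path_to_table> are placeholders, not real names.
CONVERT TO DELTA parquet.`s3a://<bucket>/<path_to_table>`
PARTITIONED BY (part_col_1 INT, part_col_2 INT);
```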

Forcing table change history cleanup

By default, Delta Lake stores the history of table changes for 30 days. This period is set at the table level in the delta.logRetentionDuration parameter; you can edit it using this command:

```
ALTER TABLE <table_schema_and_name> SET TBLPROPERTIES ('delta.logRetentionDuration' = "interval <interval>")
```

To learn more about managing the table parameters, see the Delta Lake documentation.
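For illustration, a seven-day retention period could be set and then checked as follows. The interval value is an assumption chosen for the example; SHOW TBLPROPERTIES is the standard Spark SQL statement for listing table properties.

```sql
-- Assumption: 7 days is an example interval, not a recommended value.
ALTER TABLE <table_schema_and_name>
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days');

-- List the table properties to confirm the new value was applied.
SHOW TBLPROPERTIES <table_schema_and_name>;
```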

To force the table change history cleanup:

1. Rearrange the table data to optimize the access:
```
OPTIMIZE <table_name>;
```

2. Allow deleting the entire history of changes:
```
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
```

3. Clear the change history:
```
VACUUM <table_name> RETAIN 0 HOURS;
```
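Because VACUUM with zero retention physically deletes files, it may be worth previewing the deletion first and restoring the safety check afterwards. A possible sketch, assuming the DRY RUN option of VACUUM is available in your Delta Lake version:

```sql
-- Optional preview, run before the actual VACUUM: lists the files that
-- would be deleted without deleting them. It still requires the
-- retention check to be disabled, as in the SET command above.
VACUUM <table_name> RETAIN 0 HOURS DRY RUN;

-- After the cleanup, turn the retention check back on for the session
-- (true is the default value of this property).
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
```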
