Hybrid storage in Yandex MPP Analytics for PostgreSQL
Note
This feature is at the Preview stage and free of charge.
Yandex MPP Analytics for PostgreSQL clusters support hybrid storage: some data can be stored in cluster storage and some in cold storage. By default, data is stored in cluster storage on disks of the selected type. Rarely used data that must be retained for a long time can be moved from cluster storage to cold storage, that is, to a Yandex Object Storage service bucket. This reduces storage costs.
You can enable hybrid storage in Yandex MPP Analytics for PostgreSQL clusters when creating or updating a cluster.
Warning
Once hybrid storage is enabled in a cluster, you cannot disable it.
Hybrid storage scope of use
You can use hybrid storage only for append-optimized (AO and AOCO) tables. Data is migrated between cluster storage and cold storage in whole tables only. Tables are processed using the Yezzey extension.
The data of append-optimized tables is stored as a set of segment files, compressed and encrypted, in an Object Storage service bucket. The number of segment files depends on the number of segments in the cluster and on the table structure.
I/O management when using hybrid storage
SQL queries against append-optimized tables generate many requests to segment files in storage. SQL query execution time depends on how efficiently the I/O requests to segment files are scheduled. Without I/O request scheduling, storage performance degrades, and both RAM consumption and SQL query execution time increase. A scheduler protects the cluster from performance degradation when executing SQL queries that process massive amounts of data.
When data is stored in cluster storage, I/O request scheduling is performed by the operating system on the cluster hosts.
When data is stored in cold storage, the operating system cannot schedule I/O requests to the Object Storage service bucket. Therefore, to avoid performance degradation, Yandex MPP Analytics for PostgreSQL clusters use YProxy by Yandex Cloud to schedule such requests. Even if a table resides in cold storage, YProxy minimizes the impact on SQL query execution time.
When accessing cold storage data, throughput is limited to 1 GB per second from a single segment host, regardless of the number of segments per host. For example, if you have 8 segment hosts in your cluster, the maximum throughput for cold storage data for the whole cluster will not exceed 8 GB per second.
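The arithmetic behind this limit can be sketched in a few lines of Python. This is an illustrative estimate only: the function names are hypothetical, and the scan-time figure assumes that data is spread evenly across segment hosts and that cold storage throughput is the only bottleneck.

```python
def max_cold_read_throughput_gb_s(segment_hosts: int,
                                  per_host_limit_gb_s: float = 1.0) -> float:
    """Upper bound on cluster-wide cold storage throughput, in GB/s.

    The 1 GB/s default is the per-host limit stated above; it does not
    depend on the number of segments per host.
    """
    if segment_hosts < 1:
        raise ValueError("a cluster needs at least one segment host")
    return segment_hosts * per_host_limit_gb_s


def min_scan_seconds(data_gb: float, segment_hosts: int,
                     per_host_limit_gb_s: float = 1.0) -> float:
    """Lower bound on the time to read data_gb from cold storage.

    Assumption: data is evenly distributed across segment hosts.
    """
    return data_gb / max_cold_read_throughput_gb_s(segment_hosts,
                                                   per_host_limit_gb_s)


print(max_cold_read_throughput_gb_s(8))  # 8.0 GB/s for 8 segment hosts
print(min_scan_seconds(4000, 8))         # 500.0 s to read 4000 GB at best
```

Under these assumptions, adding segment hosts raises the throughput ceiling, while adding segments to existing hosts does not.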
To learn more about the hybrid storage architecture, see this Habr article.
Hybrid storage cleanup
For AO and AOCO tables in hybrid storage, the VACUUM and VACUUM FULL operations are performed the same way as for similar tables in cluster storage. However, these operations do not affect the data in cold storage, so unnecessary segment files are deleted automatically using the following procedure:
- When you run the VACUUM and VACUUM FULL operations for hybrid storage tables, the list of segment files in cold storage gets updated. Only the files containing the current table rows remain in the list.
- Yezzey saves the names of unlisted files and the time of the operation in a service table. These files become candidates for deletion.
- Independently of the cleanup operations, Yezzey checks the entries in this service table on a weekly basis. If a file's record was created earlier than the oldest automatic backup, the file is moved from the Object Storage service bucket to the recycle bin, and its record is removed from the service table. A week later, the file is deleted from the recycle bin.
This procedure ensures that files required for recovery from a hybrid storage cluster backup are not deleted from cold storage.
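The deletion rule described above can be illustrated with a small Python simulation. This is a sketch of the logic only, not Yezzey's actual implementation: the `Candidate` record, the in-memory service table and recycle bin, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# One week in the recycle bin, as described in the procedure above.
RECYCLE_RETENTION = timedelta(weeks=1)


@dataclass
class Candidate:
    file_name: str         # hypothetical segment file name
    recorded_at: datetime  # time of the VACUUM that unlisted the file


def weekly_check(candidates, oldest_backup_at, now, recycle_bin):
    """Move eligible files to the recycle bin; return the remaining records.

    A file is eligible only if its record is older than the oldest
    automatic backup, so no remaining backup can still reference it.
    """
    remaining = []
    for c in candidates:
        if c.recorded_at < oldest_backup_at:
            recycle_bin[c.file_name] = now  # remember when it was moved
        else:
            remaining.append(c)
    return remaining


def purge_recycle_bin(recycle_bin, now):
    """Permanently delete files that spent at least a week in the bin."""
    expired = [name for name, moved_at in recycle_bin.items()
               if now - moved_at >= RECYCLE_RETENTION]
    for name in expired:
        del recycle_bin[name]
    return expired


now = datetime(2024, 6, 15)
oldest_backup = datetime(2024, 6, 1)
service_table = [
    Candidate("seg0_old", datetime(2024, 5, 1)),      # older than the backup
    Candidate("seg1_recent", datetime(2024, 6, 10)),  # backup may need it
]
recycle_bin = {}

service_table = weekly_check(service_table, oldest_backup, now, recycle_bin)
purged = purge_recycle_bin(recycle_bin, now + timedelta(weeks=1))
```

In this run, `seg0_old` is moved to the recycle bin and purged a week later, while `seg1_recent` stays in the service table because a backup taken before its record was created may still need it.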
Greenplum® and Greenplum Database® are registered trademarks or trademarks of Broadcom Inc. in the United States and/or other countries.