Greenplum®/Cloudberry connector
The Greenplum®/Cloudberry connector is developed by Yandex and is based on the PostgreSQL connector.
The connector supports parallel data reading from several Greenplum® segments at the same time and direct segment reading over the GPFDIST protocol, which greatly improves query performance for large-scale data reads. You can use both data reading methods at the same time to optimize the use of resources in Trino and Greenplum® clusters.
The Greenplum®/Cloudberry connector is available in Trino 476 or higher.
Parallel data reading
During parallel reading from a table, data is parallelized based on the gp_segment_id metadata column value.
The level of parallelism depends on the number of segments in the Greenplum® cluster. The maximum parallelism level is limited by the greenplum.max-read-parallelism connector setting and the relevant max_read_parallelism session property.
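The rule above can be made concrete with a minimal Python sketch. It is not the connector's implementation: the `plan_splits` function, the `ReadSplit` class, the round-robin assignment, and the assumption that segment IDs run from 0 to N-1 are all hypothetical. Only the `gp_segment_id` column and the parallelism cap come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReadSplit:
    """One parallel read and the Greenplum segment IDs it covers (hypothetical type)."""
    segment_ids: tuple[int, ...]

    def predicate(self) -> str:
        # gp_segment_id is the metadata column the text says reads are split on.
        ids = ", ".join(str(s) for s in self.segment_ids)
        return f"gp_segment_id IN ({ids})"


def plan_splits(segment_count: int, max_read_parallelism: int) -> list[ReadSplit]:
    """Hypothetical planner: spread segment IDs round-robin across parallel reads."""
    if segment_count < 1 or max_read_parallelism < 1:
        raise ValueError("both arguments must be positive")
    # Parallelism cannot exceed the segment count or the configured maximum.
    parallelism = min(segment_count, max_read_parallelism)
    buckets: list[list[int]] = [[] for _ in range(parallelism)]
    for segment_id in range(segment_count):  # assumes IDs 0..N-1
        buckets[segment_id % parallelism].append(segment_id)
    return [ReadSplit(tuple(bucket)) for bucket in buckets]


if __name__ == "__main__":
    for split in plan_splits(segment_count=6, max_read_parallelism=4):
        print(split.predicate())
```

With the default maximum of 1, this sketch yields a single read that covers every segment, which corresponds to no parallelism.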
Parallel reading is illustrated in the following diagram:
When parallel reading is used, the connector performs only partial row filtering during LIMIT pushdown.
Reading data over the GPFDIST protocol
The connector can read data directly from Greenplum® segments via GPFDIST servers, which are enabled with the greenplum.gpfdist.server.enabled connector setting.
In a Trino cluster, you can create no more than eight catalogs with active GPFDIST servers.
Direct reading from Greenplum® segments follows these steps:
- The connector creates an external table giving the address of the Trino worker to read data:
  CREATE WRITABLE EXTERNAL TEMPORARY TABLE <external_table_name> ... LOCATION('gpfdist://<Trino_worker_address>');
- The connector runs the following query:
  INSERT INTO <external_table_name> SELECT ... FROM <table_name_in_Greenplum®>;
- The Greenplum® segments send data to the Trino worker at the specified address.
Data reading from segments is illustrated in the following diagram:
The use of the GPFDIST protocol for data reads introduces the following limitations for the connector:
- No support for reading multidimensional arrays.
- No support for reading string type arrays.
- No support for the AS_JSON array processing mode.
- When LIMIT and ORDER BY are pushed down at the same time (Top-N pushdown), the connector sorts data only partially. This does not affect the validity of the query results.
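One way to see why a partial sort leaves the result valid is the Python sketch below. It assumes, without confirmation from the source, that each segment-level read returns its own top N rows and that the engine re-applies Top-N over everything it receives. The function names are hypothetical.

```python
import heapq
from itertools import chain


def segment_top_n(rows: list[int], n: int) -> list[int]:
    """Assumed per-segment result: the n smallest values held by that segment."""
    return heapq.nsmallest(n, rows)


def final_top_n(partials: list[list[int]], n: int) -> list[int]:
    """Assumed engine-side step: re-apply Top-N over all partial results."""
    return heapq.nsmallest(n, chain.from_iterable(partials))


if __name__ == "__main__":
    # Three segments, query shape: ORDER BY value LIMIT 3.
    segments = [[9, 2, 7], [4, 1, 8], [6, 3, 5]]
    partials = [segment_top_n(rows, 3) for rows in segments]
    print(final_top_n(partials, 3))
```

Under these assumptions, every row in the global top N is also in the top N of its own segment, so the final result matches a full sort followed by LIMIT.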
Connector settings
The connector's basic settings and their corresponding session properties match those of the PostgreSQL connector.
| Configuration | Description | Default value |
|---|---|---|
| greenplum.gpfdist.server.enabled | Enables GPFDIST servers on Trino workers. | false |
| greenplum.gpfdist.max-processing-threads | Maximum size of the thread pool for asynchronous GPFDIST query processing. | 32 |
| greenplum.gpfdist.max-query-threads | Maximum size of the thread pool that creates external Greenplum® tables and initiates data writes to an external table. | 32 |
| greenplum.gpfdist.read.enabled | Enables reading data directly from Greenplum® segments over the GPFDIST protocol. | false |
| greenplum.gpfdist.read.buffer-size | Buffer size for GPFDIST data reads, specified as a data size. Matches the corresponding session property. | 32MB |
| greenplum.gpfdist.retry-timeout | Maximum time a Greenplum® segment will wait for a response to a GPFDIST query, specified as a duration. If the value is other than | null |
| greenplum.max-read-parallelism | Maximum parallelism for data reads from Greenplum®. Matches the max_read_parallelism session property. | 1 (no parallelism) |
| greenplum.segment-fetch-required | Determines the connector's behavior if it fails to obtain the number of Greenplum® segments. Matches the corresponding session property. | true |
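As an illustration of how these settings might be combined, the Python sketch below renders a fragment in the key=value format used by Trino catalog properties files. The property names come from the table. The chosen values, the `render_catalog_properties` helper, and the assumption that the catalog is configured through such a file are illustrative only. A managed deployment may expose the same settings through a different interface.

```python
from __future__ import annotations


def render_catalog_properties(settings: dict[str, object]) -> str:
    """Render settings as key=value lines (hypothetical helper)."""
    lines = []
    for key, value in settings.items():
        # Properties files conventionally use lowercase booleans.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{key}={text}")
    return "\n".join(lines)


# Illustrative values: both read methods enabled, parallelism raised from the default of 1.
example_settings = {
    "greenplum.gpfdist.server.enabled": True,
    "greenplum.gpfdist.read.enabled": True,
    "greenplum.gpfdist.read.buffer-size": "32MB",
    "greenplum.max-read-parallelism": 4,
}

if __name__ == "__main__":
    print(render_catalog_properties(example_settings))
```

Enabling the server and read settings together is an assumption here. The sketch omits connection settings, which follow the PostgreSQL connector.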