Sharding in Yandex MPP Analytics for PostgreSQL
Sharding is a horizontal scaling strategy for Yandex MPP Analytics for PostgreSQL clusters that spreads parts of each database table across different segment hosts. Every read or write request in Yandex MPP Analytics for PostgreSQL uses all cluster segments.
Distribution key
To optimize JOIN operations on large tables, you can specify a distribution key explicitly. When tables are joined on the fields specified in the key, the join is performed locally at the segment level, which speeds up query processing.
Learn more about the distribution key in the Greenplum 6 documentation.
To create a table with a distribution key, list one or more fields in the DISTRIBUTED BY clause:
CREATE TABLE tableName
(
    column1 type1,
    column2 type2,
    ...
    columnN typeN
) DISTRIBUTED BY (column1);
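As a concrete illustration of the locally performed join described above, here is a minimal sketch. The `customers` and `orders` tables and their columns are hypothetical names chosen for this example. Both tables use `customer_id`, with the same data type, as the distribution key, so rows with the same `customer_id` should end up on the same segment.

```sql
-- Hypothetical tables for illustration only.
CREATE TABLE customers
(
    customer_id bigint,
    name        text
) DISTRIBUTED BY (customer_id);

CREATE TABLE orders
(
    order_id    bigint,
    customer_id bigint,
    amount      numeric
) DISTRIBUTED BY (customer_id);

-- The join condition matches the distribution key of both tables,
-- so matching rows are expected to be on the same segment already.
SELECT c.name, sum(o.amount) AS total_amount
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.name;
```

You can run `EXPLAIN` on the query to check whether the plan has to redistribute or broadcast either table before the join.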
A poorly chosen key can cause most of the data to be stored in a single segment. This degrades cluster performance and can shut down the segment if its host runs out of storage space. For this reason, do not use the following as a distribution key:
- Date and time fields.
- Fields that may contain a large number of identical values.
- Fields with a large number of NULL values.
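One way to check how evenly an existing table is spread is to count its rows per segment using the `gp_segment_id` system column available on Greenplum tables. This is a sketch rather than a prescribed procedure: `tableName` and `column2` are the placeholder names from the example above, and the choice of `column2` as the new key is purely illustrative.

```sql
-- Count the rows stored on each segment.
-- Strongly uneven counts suggest a poorly chosen distribution key.
SELECT gp_segment_id, count(*) AS row_count
FROM tableName
GROUP BY gp_segment_id
ORDER BY gp_segment_id;

-- If the distribution is skewed, the key can be changed.
-- This redistributes the table data across the segments.
ALTER TABLE tableName SET DISTRIBUTED BY (column2);
```

Redistributing a large table moves its data between segments, so it is likely best done outside peak load.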
Note
If you do not specify a distribution key when creating a table, data is distributed across segments either by the primary key (if one is defined) or by the first table field.
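To see which key was applied by default, you can inspect the table after creating it. The sketch below uses a hypothetical `events` table and assumes that the `pg_get_table_distributedby` function is available in your cluster, as it is in Greenplum 6. If it is not, the `\d+ events` command in psql reports the distribution policy as well.

```sql
-- Hypothetical table created without a DISTRIBUTED BY clause.
CREATE TABLE events
(
    event_id bigint PRIMARY KEY,
    payload  text
);

-- Show the distribution policy that was applied to the table.
SELECT pg_get_table_distributedby('events'::regclass);
```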
Greenplum® and Greenplum Database® are registered trademarks or trademarks of Broadcom Inc. in the United States and/or other countries.
Apache® and Apache Cloudberry™ are registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.