Sharding in Greenplum®
Sharding is a horizontal Greenplum® cluster scaling strategy that puts parts of each DB table on different segment hosts. Every write or read request in Greenplum® utilizes all cluster segments.
By default, all tables in Greenplum® are created with random distribution, i.e., a segment for storing each row is determined randomly. This ensures even data distribution across segments.
Distribution key
To optimize JOIN
operations with large tables, you can explicitly specify a distribution key
To create a table with a distribution key, provide one or more required fields in the DISTRIBUTED BY
clause:
CREATE TABLE tableName
(
column1 type1,
column2 type2,
...
columnN typeN
) DISTRIBUTED BY (column1);
If you choose a key incorrectly, most of the data might be stored in a single segment. This will degrade the cluster performance or shut down the segment if its host runs out of storage space. This is why you should not select the following as your distribution key:
- Date and time fields.
- Fields that may contain a large number of identical values.
- Fields with a large number of
NULL
values.
Greenplum® and Greenplum Database® are registered trademarks or trademarks of VMware, Inc. in the United States and/or other countries.