Sharding in Greenplum®
Sharding is a horizontal Greenplum® cluster scaling strategy where parts of each database table are placed on different segment hosts. Every write or read request in Greenplum® utilizes all cluster segments.
Distribution key
To optimize JOIN
operations on large tables, you can specify a distribution key
To create a table with a distribution key, provide one or more required fields in the DISTRIBUTED BY
clause:
CREATE TABLE tableName
(
column1 type1,
column2 type2,
...
columnN typeN
) DISTRIBUTED BY (column1);
If you choose a key incorrectly, most of the data might be stored in a single segment. This will degrade the cluster performance or shut down the segment if its host runs out of storage space. This is why you should not select the following as your distribution key:
- Date and time fields.
- Fields that may contain a large number of identical values.
- Fields with a large number of
NULL
values.
Note
If you do not specify a distribution key when creating a table, data will be distributed across host segments either by the primary key (if specified) or by the first table field.
Greenplum® and Greenplum Database® are registered trademarks or trademarks of VMware, Inc. in the United States and/or other countries.