Expanding a cluster
You can expand a Managed Service for Greenplum® cluster by adding segment hosts to it. You need to add at least two hosts. You can also increase the number of segments per host.
When a cluster is being expanded, its data is automatically redistributed evenly across all hosts in the cluster (including added hosts). For more information, see Expanding a Greenplum® cluster.
Expanding a cluster
- Go to the folder page and select Managed Service for Greenplum.
- Select a cluster and open the Hosts tab.
- Click Expand cluster in the top-right corner.
- Specify cluster expansion settings:
  - Number of new segment hosts: Number of segment hosts to add to the cluster.
    The minimum value (default) is 2; the maximum value cannot exceed the limit on the number of segment hosts in the cluster (minus existing hosts).
  - Add segments to host: Number of segments to add per host.
    Warning
    Segments will be added to all hosts, both new and existing ones.
    The minimum value (default) is 0 (the number of segments per host remains unchanged); the maximum value depends on the host class.
  - Block cluster from load: Closes the cluster for access.
    If this setting is enabled, you cannot connect to the cluster and it receives no new user requests. This accelerates cluster expansion.
    Warning
    If you close the cluster from load and disable background data redistribution, you will lose access to the cluster until its expansion is complete. The expansion process can be time-consuming.
  - Background data redistribution: Enables background data redistribution.
    - If background mode is disabled, the redistribution process runs once, as soon as the cluster expansion preparation is over.
      The cluster will remain in the Updating status until all cluster expansion stages are completed.
      Warning
      If data redistribution terminates upon reaching the timeout with some tables remaining unprocessed, you will need to redistribute those tables manually.
    - If background mode is enabled, the redistribution process is delayed and runs on a schedule as a routine maintenance operation.
      The cluster will remain in the Updating status only during preparation for cluster expansion.
  - Redistribution timeout: Timeout (in seconds) after which the data redistribution process is interrupted.
    Reaching the timeout does not stop the process immediately: data redistribution will be completed for the tables currently being processed (IN PROGRESS status). You can request the status of the tables if the cluster is not closed from load.
    The minimum value is 0; in this case, the timeout is calculated automatically depending on the cluster configuration and data size.
    The maximum value depends on whether background data redistribution is enabled:
    - If enabled, the maximum value is 28800 (eight hours).
    - If disabled, the maximum value is not limited.
  - Number of redistricting streams: Number of threads started during the data redistribution process.
    Using more threads speeds up data redistribution but also increases the cluster load.
    The minimum value (default) is 0 (the number is calculated automatically depending on the cluster configuration and data size); the maximum value is 25.
- Click Expand.
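The limits listed for the expansion settings can be checked before submitting the form. The following is a minimal Python sketch, not part of the service: the `ExpandSettings` class, its field names, and the `check_settings` and `blocks_access_until_done` helpers are hypothetical names introduced for illustration. Only the numeric limits (minimum of 2 hosts, 28800 seconds in background mode, at most 25 threads) come from the descriptions above; the upper bounds that depend on quotas and the host class are not checked.

```python
from dataclasses import dataclass
from typing import List

MIN_NEW_SEGMENT_HOSTS = 2        # minimum (and default) number of added hosts
MAX_BACKGROUND_TIMEOUT = 28800   # eight hours; applies only in background mode
MAX_THREADS = 25                 # maximum number of redistribution threads


@dataclass
class ExpandSettings:
    """Hypothetical container mirroring the settings described above."""
    new_segment_hosts: int = 2
    added_segments_per_host: int = 0
    block_from_load: bool = False
    background_redistribution: bool = False
    timeout_seconds: int = 0     # 0: calculated automatically
    threads: int = 0             # 0: calculated automatically


def check_settings(s: ExpandSettings) -> List[str]:
    """Return a list of violated limits; an empty list means none were found."""
    problems = []
    if s.new_segment_hosts < MIN_NEW_SEGMENT_HOSTS:
        problems.append("at least 2 segment hosts must be added")
    if s.added_segments_per_host < 0:
        problems.append("added segments per host cannot be negative")
    if s.timeout_seconds < 0:
        problems.append("timeout cannot be negative")
    if s.background_redistribution and s.timeout_seconds > MAX_BACKGROUND_TIMEOUT:
        problems.append("timeout cannot exceed 28800 s in background mode")
    if not 0 <= s.threads <= MAX_THREADS:
        problems.append("number of threads must be between 0 and 25")
    return problems


def blocks_access_until_done(s: ExpandSettings) -> bool:
    """True for the combination the warning above describes."""
    return s.block_from_load and not s.background_redistribution


settings = ExpandSettings(new_segment_hosts=4, background_redistribution=True,
                          timeout_seconds=7200)
print(check_settings(settings), blocks_access_until_done(settings))
```

An empty list only means that none of the fixed limits is violated; the service may still reject values that exceed quota-dependent or host-class-dependent maximums.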
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To expand a cluster:
- View the description of the CLI cluster expand command:
  yc managed-greenplum cluster expand --help
- To expand a cluster, run this command:
  yc managed-greenplum cluster expand <cluster_name_or_ID> \
    --segment-host-count <number_of_added_segment_hosts> \
    --add-segments-per-host-count <number_of_added_segments_per_host> \
    --close-cluster <temporary_ban_on_connection_to_cluster> \
    --delay-redistribution <background_data_redistribution> \
    --duration-seconds <data_redistribution_timeout> \
    --parallel <number_of_data_redistribution_threads>
  Where:
  - --segment-host-count: Number of segment hosts to add to the cluster.
    The minimum value (default) is 2; the maximum value cannot exceed the limit on the number of segment hosts in the cluster (minus existing hosts).
  - --add-segments-per-host-count: Number of segments to add per host.
    Warning
    Segments will be added to all hosts, both new and existing ones.
    The minimum value (default) is 0 (the number of segments per host remains unchanged); the maximum value depends on the host class.
  - --close-cluster: Closes the cluster for access: true or false.
    If this setting is enabled (true), you cannot connect to the cluster and it does not receive new user requests. This accelerates cluster expansion.
    Warning
    If you close the cluster from load and disable background data redistribution, you will lose access to the cluster until its expansion is complete. The expansion process can be time-consuming.
  - --delay-redistribution: Enables background data redistribution: true or false.
    - If background mode is disabled (false), the redistribution process runs once, as soon as the cluster expansion preparation is over.
      The cluster will remain in the Updating status until all cluster expansion stages are completed.
      Warning
      If data redistribution terminates upon reaching the timeout with some tables remaining unprocessed, you will need to redistribute those tables manually.
    - If background mode is enabled (true), the redistribution process is delayed and runs on a schedule as a routine maintenance operation.
      The cluster will remain in the Updating status only during preparation for cluster expansion.
  - --duration-seconds: Timeout (in seconds) after which the data redistribution process is interrupted.
    Reaching the timeout does not stop the process immediately: data redistribution will be completed for the tables currently being processed (IN PROGRESS status). You can request the status of the tables if the cluster is not closed from load.
    The minimum value is 0; in this case, the timeout is calculated automatically depending on the cluster configuration and data size.
    The maximum value depends on whether background data redistribution is enabled:
    - If enabled, the maximum value is 28800 (eight hours).
    - If disabled, the maximum value is not limited.
  - --parallel: Number of threads started during the data redistribution process.
    Using more threads speeds up data redistribution but also increases the cluster load.
    The minimum value (default) is 0 (the number is calculated automatically depending on the cluster configuration and data size); the maximum value is 25.
You can get the cluster ID and name from the list of clusters in the folder.
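To make the command template concrete, here is a minimal Python sketch that assembles the argument list for the expand command and prints it without running anything. The function `build_expand_command` and the cluster name `my-greenplum` are hypothetical. The flag names are taken from the template above; passing booleans as separate `true`/`false` values mirrors that template and is an assumption, so check `yc managed-greenplum cluster expand --help` for the exact syntax.

```python
import shlex
from typing import List


def build_expand_command(cluster: str,
                         segment_host_count: int = 2,
                         add_segments_per_host_count: int = 0,
                         close_cluster: bool = False,
                         delay_redistribution: bool = False,
                         duration_seconds: int = 0,
                         parallel: int = 0) -> List[str]:
    """Assemble the argument list; nothing is executed here."""

    def as_flag(value: bool) -> str:
        # Assumption: booleans are spelled as lowercase true/false.
        return "true" if value else "false"

    return [
        "yc", "managed-greenplum", "cluster", "expand", cluster,
        "--segment-host-count", str(segment_host_count),
        "--add-segments-per-host-count", str(add_segments_per_host_count),
        "--close-cluster", as_flag(close_cluster),
        "--delay-redistribution", as_flag(delay_redistribution),
        "--duration-seconds", str(duration_seconds),
        "--parallel", str(parallel),
    ]


# Hypothetical example: add four hosts and redistribute data in the background.
cmd = build_expand_command("my-greenplum",
                           segment_host_count=4,
                           delay_redistribution=True,
                           duration_seconds=7200)
print(shlex.join(cmd))  # shell-quoted command line for review
```

Printing the command first makes it easy to review the values before running it in a terminal.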
To expand a cluster, use the expand REST API method for the Cluster resource or the ClusterService/Expand gRPC API call and provide the following in the request:
- Cluster ID in the clusterId parameter.
  You can get the cluster ID from the list of clusters in the folder.
- segmentHostCount parameter: Number of segment hosts to add to the cluster.
  The minimum value (default) is 2; the maximum value cannot exceed the limit on the number of segment hosts in the cluster (minus existing hosts).
- addSegmentsPerHostCount parameter: Number of segments to add per host.
  Warning
  Segments will be added to all hosts, both new and existing ones.
  The minimum value (default) is 0 (the number of segments per host remains unchanged); the maximum value depends on the host class.
- closeCluster parameter: Closes the cluster for access: true or false.
  If this setting is enabled (true), you cannot connect to the cluster and it does not receive new user requests. This accelerates cluster expansion.
  Warning
  If you close the cluster from load and disable background data redistribution, you will lose access to the cluster until its expansion is complete. The expansion process can be time-consuming.
- delayRedistribution parameter: Enables background data redistribution: true or false.
  - If background mode is disabled (false), the redistribution process runs once, as soon as the cluster expansion preparation is over.
    The cluster will remain in the Updating status until all cluster expansion stages are completed.
    Warning
    If data redistribution terminates upon reaching the timeout with some tables remaining unprocessed, you will need to redistribute those tables manually.
  - If background mode is enabled (true), the redistribution process is delayed and runs on a schedule as a routine maintenance operation.
    The cluster will remain in the Updating status only during preparation for cluster expansion.
- duration parameter: Timeout (in seconds) after which the data redistribution process is interrupted.
  Reaching the timeout does not stop the process immediately: data redistribution will be completed for the tables currently being processed (IN PROGRESS status). You can request the status of the tables if the cluster is not closed from load.
  The minimum value is 0; in this case, the timeout is calculated automatically depending on the cluster configuration and data size.
  The maximum value depends on whether background data redistribution is enabled:
  - If enabled, the maximum value is 28800 (eight hours).
  - If disabled, the maximum value is not limited.
- parallel parameter: Number of threads started during the data redistribution process.
  Using more threads speeds up data redistribution but also increases the cluster load.
  The minimum value (default) is 0 (the number is calculated automatically depending on the cluster configuration and data size); the maximum value is 25.
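As an illustration of how the request parameters fit together, the Python sketch below collects them into one dictionary and serializes it to JSON. The function `build_expand_request` and the cluster ID value are hypothetical. The parameter names come from the list above; how they are split between the URL path and the request body, and the exact JSON types the API expects, are not covered here and should be checked against the API reference.

```python
import json


def build_expand_request(cluster_id: str,
                         segment_host_count: int = 2,
                         add_segments_per_host_count: int = 0,
                         close_cluster: bool = False,
                         delay_redistribution: bool = False,
                         duration: int = 0,
                         parallel: int = 0) -> dict:
    """Collect the documented parameters under their API names."""
    return {
        "clusterId": cluster_id,
        "segmentHostCount": segment_host_count,
        "addSegmentsPerHostCount": add_segments_per_host_count,
        "closeCluster": close_cluster,
        "delayRedistribution": delay_redistribution,
        "duration": duration,   # seconds; 0 means "calculate automatically"
        "parallel": parallel,   # 0 means "calculate automatically"
    }


# Hypothetical cluster ID, used for illustration only.
request = build_expand_request("<cluster_ID>", segment_host_count=2,
                               delay_redistribution=True, duration=3600)
print(json.dumps(request, indent=2))
```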
Data redistribution monitoring
There are several ways to monitor the data redistribution process:
- If the cluster is closed from load and background data redistribution is disabled, you can only track progress using Yandex Cloud monitoring tools.
  Follow the Expansion progress chart on the Monitoring tab of the cluster management page or look up the gp_expand.* metrics using Yandex Monitoring.
- If the cluster is not closed from load, you can also run SQL queries against the gpexpand data schema objects.
  Note
  This schema is available only during cluster expansion; it is deleted afterwards.
  To track the progress of the data redistribution process using SQL queries:
  - Connect to the postgres database as a user with the mdb_admin role.
  - Run a query to get the current status of the process:
    SELECT dbname, fq_name, rank, status, expansion_started, expansion_finished FROM gpexpand.status_detail ORDER BY rank;
    The query result contains information about the progress of the data redistribution process. The status column indicates the current processing status of each table:
    - NOT STARTED: Table data redistribution is scheduled, and the table is in the redistribution queue.
      If the cluster is not closed from load, you can change the table priority to move the table within the queue if needed.
    - IN PROGRESS: Table data redistribution is in progress.
    - COMPLETED: Table data redistribution is completed.
    - NO LONGER EXISTS: Table no longer exists.
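Once the rows of gpexpand.status_detail have been fetched, the statuses can be tallied to see how much work is left. The Python sketch below assumes the rows are already available as dictionaries; the `summarize` function and the sample rows (table names, ranks) are hypothetical and only illustrate the four status values described above.

```python
from collections import Counter


def summarize(rows):
    """Count tables per status and return how many are still pending."""
    counts = Counter(row["status"] for row in rows)
    # Tables that are queued or currently being processed.
    pending = counts["NOT STARTED"] + counts["IN PROGRESS"]
    return counts, pending


# Hypothetical rows, shaped like the result of the status query.
sample_rows = [
    {"fq_name": "public.orders", "rank": 1, "status": "COMPLETED"},
    {"fq_name": "public.lineitem", "rank": 2, "status": "IN PROGRESS"},
    {"fq_name": "public.customer", "rank": 2, "status": "NOT STARTED"},
    {"fq_name": "public.tmp_load", "rank": 2, "status": "NO LONGER EXISTS"},
]

counts, pending = summarize(sample_rows)
print(dict(counts), "tables still pending:", pending)
```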
Additional actions
Changing table redistribution priority
During the cluster expansion preparation phase, each table is assigned a rank that affects its redistribution queue priority. The higher its rank, the higher the table's queue priority and, therefore, the sooner its data redistribution will be completed.
If needed, you can change the rank of one or more tables, provided the cluster is not closed from load. To do this, modify the table in the gpexpand data schema.
Note
This schema is available only during cluster expansion, then it gets deleted.
To change the rank of redistributed tables:
- Connect to the postgres database as a user with the mdb_admin role.
- Run a query to get the current status of the data redistribution process:
  SELECT dbname, fq_name, rank, status, expansion_started, expansion_finished FROM gpexpand.status_detail ORDER BY rank;
  The result includes full table names (fq_name) and their ranks (rank).
- Change the rank for one or more tables:
  UPDATE gpexpand.status_detail SET rank=<required_rank> WHERE fq_name IN (<list_of_full_table_names>);
For more information, see the Greenplum® documentation.
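As an illustration of filling in the UPDATE template, the Python sketch below renders the statement for a list of full table names. The helper names `sql_literal` and `render_rank_update` and the sample table names are hypothetical. Quoting by doubling single quotes assumes standard SQL string literals; when running the statement from a program, parameter binding in the database driver is generally a safer choice than building the string by hand.

```python
from typing import List


def sql_literal(value: str) -> str:
    """Quote a string as an SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def render_rank_update(rank: int, table_names: List[str]) -> str:
    """Render the UPDATE statement from the template above."""
    if not table_names:
        raise ValueError("at least one table name is required")
    names = ", ".join(sql_literal(name) for name in table_names)
    return (
        f"UPDATE gpexpand.status_detail SET rank={int(rank)} "
        f"WHERE fq_name IN ({names});"
    )


# Hypothetical table names taken from the fq_name column.
print(render_rank_update(1, ["public.orders", "public.lineitem"]))
```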
Running data redistribution manually
If background data redistribution was not used during cluster expansion and the data redistribution timeout was too short, the process may end before data redistribution is complete for all tables.
This will leave some of the data distributed unevenly. As a result, the Unknown error: Partially Distributed Data error will appear in the cluster logs. To fix the error, run data redistribution manually:
- Find out the total number of segments in the cluster:
  - Get detailed information about the cluster.
  - Multiply the number of segments per host by the number of segment hosts.
- Find tables with unevenly distributed data:
  SELECT * FROM (SELECT oid, relname FROM pg_class) pgc JOIN (SELECT localoid AS oid, numsegments FROM gp_distribution_policy) gpdp USING (oid) WHERE numsegments != <total_number_of_segments>;
  A list of tables will be displayed stating the following for each table:
  - oid: Table ID.
  - relname: Table name.
  - numsegments: Number of segments storing the data.
  For more information, see the Greenplum® documentation on the gp_distribution_policy table.
- Run data redistribution using the information you got earlier.
  To redistribute the data of an individual table, run the following query:
  - For an ordinary table:
    ALTER TABLE ONLY <table_name> EXPAND TABLE;
  - For a partitioned table:
    ALTER TABLE <table_name> SET WITH (REORGANIZE=true) <distribution_policy>;
    To learn the Greenplum® distribution policy value for an individual table partition, run the following query:
    SELECT pg_get_table_distributedby(<partition_OID>) AS distribution_policy;
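The arithmetic and the follow-up statements from the steps above can be sketched in Python. This is an illustration under stated assumptions: the helper names, the host and segment counts, and the sample rows are hypothetical, and the generated statement covers ordinary (non-partitioned) tables only. Table names are assumed to be valid identifiers that need no quoting.

```python
def total_segments(segments_per_host: int, segment_hosts: int) -> int:
    """Total segments = segments per host x number of segment hosts."""
    return segments_per_host * segment_hosts


def tables_to_redistribute(rows, total):
    """Names of tables whose data is stored on fewer or more segments."""
    return [row["relname"] for row in rows if row["numsegments"] != total]


def expand_statement(table_name: str) -> str:
    """Statement for an ordinary table, following the template above."""
    return f"ALTER TABLE ONLY {table_name} EXPAND TABLE;"


# Hypothetical cluster: 6 segment hosts with 4 segments each.
total = total_segments(segments_per_host=4, segment_hosts=6)

# Hypothetical rows with the oid, relname, and numsegments columns.
rows = [
    {"oid": 16385, "relname": "orders", "numsegments": 24},
    {"oid": 16390, "relname": "lineitem", "numsegments": 16},
]

for name in tables_to_redistribute(rows, total):
    print(expand_statement(name))
```

The query in the steps above already filters on numsegments, so in practice every returned row needs redistribution; the filter is repeated here only to make the comparison explicit.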
Greenplum® and Greenplum Database® are registered trademarks or trademarks of VMware, Inc. in the United States and/or other countries.