Expanding a cluster
You can expand a Yandex MPP Analytics for PostgreSQL cluster by adding segment hosts to it. You need to add at least two hosts. You can also increase the number of segments per host.
When a cluster is being expanded, its data is automatically redistributed evenly across all hosts in the cluster (including added hosts). For more information, see Expanding a Greenplum® cluster.
Expanding a cluster
-
Navigate to the folder dashboard and select Yandex MPP Analytics for PostgreSQL.
-
Select a cluster and open the Hosts tab.
-
Click Expand cluster in the top-right corner.
-
Specify the cluster expansion settings:
-
Number of new segment hosts: Number of segment hosts to expand the cluster by.
The minimum value (default) is 2; the maximum value cannot exceed the limit on the number of segment hosts in the cluster (minus existing hosts).
-
Add segments to host: Number of segments added per host.
Warning
Segments will be added to all hosts, both new and existing ones.
The minimum value (default) is 0 (the number of segments per host remains unchanged); the maximum value depends on the host class.
The segment host class and the number of segments per host affect the maximum amount of memory allocated to each Greenplum® server process. If you select a host class with a small amount of RAM and specify a large number of segments, an error may occur.
-
Block cluster from load: Closing the cluster for access.
If this setting is enabled, you cannot connect to the cluster and it receives no new user requests. This will accelerate cluster expansion.
Warning
If you close the cluster from load and disable background data redistribution, you will lose access to the cluster until its expansion is complete.
The expansion process can be time-consuming.
-
Background data redistribution: Using background data redistribution.
-
If background mode is disabled, the redistribution process will start once as soon as preparation for the cluster expansion is over.
The cluster will remain in the Updating status until all cluster expansion stages are completed.
Warning
If data redistribution terminates upon reaching the timeout with some tables remaining unprocessed, you will need to redistribute those tables manually.
-
If background mode is enabled, the redistribution process will be delayed. The process will run on a schedule as a routine maintenance operation.
The cluster will remain in the Updating status only during preparation for cluster expansion.
-
-
Redistribution timeout: Timeout (in seconds) after which the data redistribution process will be interrupted.
Reaching the timeout does not stop the process immediately: data redistribution will be completed for the tables currently being processed (IN PROGRESS status). You can request the status of the tables if the cluster is not closed from load.
The minimum value is 0: in this case, the timeout will be calculated automatically depending on the cluster configuration and data size.
The maximum value depends on whether background data redistribution is enabled:
- If enabled, the maximum value is 28800 (eight hours).
- If disabled, the maximum value is not limited.
-
Number of redistricting streams: Number of threads that will be started during the data redistribution process.
Using more threads will speed up data redistribution but it will also increase the cluster load.
The minimum value (default) is 0 (the number will be calculated automatically depending on the cluster configuration and data size), and the maximum value is 25.
-
-
Click Expand.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To expand a cluster:
-
View the description of the CLI command for expanding a cluster:
yc managed-greenplum cluster expand --help
-
To expand a cluster, run this command:
yc managed-greenplum cluster expand <cluster_name_or_ID> \
  --segment-host-count <number_of_added_segment_hosts> \
  --add-segments-per-host-count <number_of_added_segments_per_host> \
  --close-cluster <temporary_ban_on_connection_to_cluster> \
  --delay-redistribution <background_data_redistribution> \
  --duration-seconds <data_redistribution_timeout> \
  --parallel <number_of_data_redistribution_threads>
Where:
-
--segment-host-count: Number of segment hosts to expand the cluster by.
The minimum value (default) is 2; the maximum value cannot exceed the limit on the number of segment hosts in the cluster (minus existing hosts).
-
--add-segments-per-host-count: Number of segments added per host.
Warning
Segments will be added to all hosts, both new and existing ones.
The minimum value (default) is 0 (the number of segments per host remains unchanged); the maximum value depends on the host class.
The segment host class and the number of segments per host affect the maximum amount of memory allocated to each Greenplum® server process. If you select a host class with a small amount of RAM and specify a large number of segments, an error may occur.
-
--close-cluster: Closing the cluster for access: true or false.
If this setting is enabled (true), you cannot connect to the cluster and it does not receive new user requests. This will accelerate cluster expansion.
Warning
If you close the cluster from load and disable background data redistribution, you will lose access to the cluster until its expansion is complete.
The expansion process can be time-consuming.
-
--delay-redistribution: Using background data redistribution: true or false.
-
If background mode is disabled (false), the redistribution process will be run once as soon as the cluster expansion preparation is over.
The cluster will remain in the Updating status until all cluster expansion stages are completed.
Warning
If data redistribution terminates upon reaching the timeout with some tables remaining unprocessed, you will need to redistribute those tables manually.
-
If background mode is enabled (true), the redistribution process will be delayed. The process will run on a schedule as a routine maintenance operation.
The cluster will remain in the Updating status only during preparation for cluster expansion.
-
-
--duration-seconds: Timeout (in seconds) after which the data redistribution process will be interrupted.
Reaching the timeout does not stop the process immediately: data redistribution will be completed for the tables currently being processed (IN PROGRESS status). You can request the status of the tables if the cluster is not closed from load.
The minimum value is 0: in this case, the timeout will be calculated automatically depending on the cluster configuration and data size.
The maximum value depends on whether background data redistribution is enabled:
- If enabled, the maximum value is 28800 (eight hours).
- If disabled, the maximum value is not limited.
-
--parallel: Number of threads that will be started during the data redistribution process.
Using more threads will speed up data redistribution but it will also increase the cluster load.
The minimum value (default) is 0 (the number will be calculated automatically depending on the cluster configuration and data size), and the maximum value is 25.
You can get the cluster ID and name with the list of clusters in the folder.
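The fixed limits listed above can be checked before the CLI command is run. The following is a minimal sketch, not part of the Yandex Cloud CLI: check_expand_args is a hypothetical helper name, and it covers only the limits stated in this section. It knows nothing about the segment host quota or the limits of a particular host class.

```shell
# Hypothetical pre-flight check (not part of the yc CLI).
# Arguments: <new_hosts> <added_segments_per_host> <background: true|false> <timeout_seconds> <threads>
check_expand_args() {
  hosts=$1; segs=$2; background=$3; timeout=$4; threads=$5

  if [ "$hosts" -lt 2 ]; then
    echo "segment-host-count must be at least 2"; return 1
  fi
  if [ "$segs" -lt 0 ]; then
    echo "add-segments-per-host-count must be 0 or greater"; return 1
  fi
  if [ "$timeout" -lt 0 ]; then
    echo "duration-seconds must be 0 or greater"; return 1
  fi
  # The eight-hour cap applies only when background redistribution is enabled.
  if [ "$background" = "true" ] && [ "$timeout" -gt 28800 ]; then
    echo "duration-seconds must not exceed 28800 with background redistribution"; return 1
  fi
  if [ "$threads" -lt 0 ] || [ "$threads" -gt 25 ]; then
    echo "parallel must be between 0 and 25"; return 1
  fi
  echo "ok"
}
```

For example, `check_expand_args 2 0 true 28800 0` prints `ok`, while a timeout of 28801 with background redistribution enabled is rejected.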
-
-
Get an IAM token for API authentication and set it as an environment variable:
export IAM_TOKEN="<IAM_token>"
-
Call the Cluster.Expand method, e.g., via the following cURL request:
curl \
  --request POST \
  --header "Authorization: Bearer $IAM_TOKEN" \
  --header "Content-Type: application/json" \
  --url 'https://mdb.api.cloud.yandex.net/managed-greenplum/v1/clusters/<cluster_ID>/expand' \
  --data '{
            "segmentHostCount": "<number_of_added_segment_hosts>",
            "addSegmentsPerHostCount": "<number_of_added_segments_per_host>",
            "duration": "<data_redistribution_timeout>",
            "parallel": "<number_of_data_redistribution_threads>",
            "closeCluster": "<temporary_ban_on_connection_to_cluster>",
            "delayRedistribution": "<background_data_redistribution>"
          }'
You can request the cluster ID with the list of clusters in the folder.
Request body parameters:
-
segmentHostCount: Number of segment hosts to expand the cluster by.
The minimum value (default) is 2; the maximum value cannot exceed the limit on the number of segment hosts in the cluster (minus existing hosts).
-
addSegmentsPerHostCount: Number of segments added per host.
Warning
Segments will be added to all hosts, both new and existing ones.
The minimum value (default) is 0 (the number of segments per host remains unchanged); the maximum value depends on the host class.
The segment host class and the number of segments per host affect the maximum amount of memory allocated to each Greenplum® server process. If you select a host class with a small amount of RAM and specify a large number of segments, an error may occur.
-
duration: Timeout (in seconds) after which the data redistribution process will be interrupted.
Reaching the timeout does not stop the process immediately: data redistribution will be completed for the tables currently being processed (IN PROGRESS status). You can request the status of the tables if the cluster is not closed from load.
The minimum value is 0: in this case, the timeout will be calculated automatically depending on the cluster configuration and data size.
The maximum value depends on whether background data redistribution is enabled:
- If enabled, the maximum value is 28800 (eight hours).
- If disabled, the maximum value is not limited.
-
parallel: Number of threads that will be started during the data redistribution process.
Using more threads will speed up data redistribution but it will also increase the cluster load.
The minimum value (default) is 0 (the number will be calculated automatically depending on the cluster configuration and data size), and the maximum value is 25.
-
closeCluster: Closing the cluster for access, true or false.
If this setting is enabled (true), you cannot connect to the cluster and it does not receive new user requests. This will accelerate cluster expansion.
Warning
If you close the cluster from load and disable background data redistribution, you will lose access to the cluster until its expansion is complete.
The expansion process can be time-consuming.
-
delayRedistribution: Using background data redistribution, true or false.
-
If background mode is disabled (false), the redistribution process will be run once as soon as the cluster expansion preparation is over.
The cluster will remain in the Updating status until all cluster expansion stages are completed.
Warning
If data redistribution terminates upon reaching the timeout with some tables remaining unprocessed, you will need to redistribute those tables manually.
-
If background mode is enabled (true), the redistribution process will be delayed. The process will run on a schedule as a routine maintenance operation.
The cluster will remain in the Updating status only during preparation for cluster expansion.
-
-
-
Check the server response to make sure your request was successful.
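When the request is scripted, it may be convenient to generate the request body instead of editing it by hand. The sketch below is an illustration only: build_expand_body is a hypothetical helper, and it assumes the API accepts numeric values as quoted strings (as in the request template above) and the two flags as JSON true or false. Check the API reference before relying on these assumptions.

```shell
# Hypothetical helper: prints a JSON body for the Cluster.Expand request.
# Arguments: <new_hosts> <added_segments_per_host> <timeout_seconds> <threads>
#            <close_cluster: true|false> <background: true|false>
build_expand_body() {
  printf '{\n'
  printf '  "segmentHostCount": "%s",\n' "$1"
  printf '  "addSegmentsPerHostCount": "%s",\n' "$2"
  printf '  "duration": "%s",\n' "$3"
  printf '  "parallel": "%s",\n' "$4"
  # Assumption: boolean flags are passed as unquoted JSON literals.
  printf '  "closeCluster": %s,\n' "$5"
  printf '  "delayRedistribution": %s\n' "$6"
  printf '}\n'
}
```

The output can be saved to a file, e.g., `build_expand_body 2 0 0 0 false true > body.json`, and passed to cURL with `--data @body.json`.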
-
Get an IAM token for API authentication and set it as an environment variable:
export IAM_TOKEN="<IAM_token>"
-
Clone the cloudapi repository:
cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi
Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
-
Use the ClusterService.Expand call and send the following request, e.g., via gRPCurl:
grpcurl \
  -format json \
  -import-path ~/cloudapi/ \
  -import-path ~/cloudapi/third_party/googleapis/ \
  -proto ~/cloudapi/yandex/cloud/mdb/greenplum/v1/cluster_service.proto \
  -rpc-header "Authorization: Bearer $IAM_TOKEN" \
  -d '{
        "cluster_id": "<cluster_ID>",
        "segment_host_count": "<number_of_added_segment_hosts>",
        "add_segments_per_host_count": "<number_of_added_segments_per_host>",
        "duration": "<data_redistribution_timeout>",
        "parallel": "<number_of_data_redistribution_threads>",
        "close_cluster": "<temporary_ban_on_connection_to_cluster>",
        "delay_redistribution": "<background_data_redistribution>"
      }' \
  mdb.api.cloud.yandex.net:443 \
  yandex.cloud.mdb.greenplum.v1.ClusterService.Expand
You can request the cluster ID with the list of clusters in the folder.
Request body parameters:
-
segment_host_count: Number of segment hosts to expand the cluster by.
The minimum value (default) is 2; the maximum value cannot exceed the limit on the number of segment hosts in the cluster (minus existing hosts).
-
add_segments_per_host_count: Number of segments added per host.
Warning
Segments will be added to all hosts, both new and existing ones.
The minimum value (default) is 0 (the number of segments per host remains unchanged); the maximum value depends on the host class.
The segment host class and the number of segments per host affect the maximum amount of memory allocated to each Greenplum® server process. If you select a host class with a small amount of RAM and specify a large number of segments, an error may occur.
-
duration: Timeout (in seconds) after which the data redistribution process will be interrupted.
Reaching the timeout does not stop the process immediately: data redistribution will be completed for the tables currently being processed (IN PROGRESS status). You can request the status of the tables if the cluster is not closed from load.
The minimum value is 0: in this case, the timeout will be calculated automatically depending on the cluster configuration and data size.
The maximum value depends on whether background data redistribution is enabled:
- If enabled, the maximum value is 28800 (eight hours).
- If disabled, the maximum value is not limited.
-
parallel: Number of threads that will be started during the data redistribution process.
Using more threads will speed up data redistribution but it will also increase the cluster load.
The minimum value (default) is 0 (the number will be calculated automatically depending on the cluster configuration and data size), and the maximum value is 25.
-
close_cluster: Closing the cluster for access, true or false.
If this setting is enabled (true), you cannot connect to the cluster and it does not receive new user requests. This will accelerate cluster expansion.
Warning
If you close the cluster from load and disable background data redistribution, you will lose access to the cluster until its expansion is complete.
The expansion process can be time-consuming.
-
delay_redistribution: Using background data redistribution, true or false.
-
If background mode is disabled (false), the redistribution process will be run once as soon as the cluster expansion preparation is over.
The cluster will remain in the Updating status until all cluster expansion stages are completed.
Warning
If data redistribution terminates upon reaching the timeout with some tables remaining unprocessed, you will need to redistribute those tables manually.
-
If background mode is enabled (true), the redistribution process will be delayed. The process will run on a schedule as a routine maintenance operation.
The cluster will remain in the Updating status only during preparation for cluster expansion.
-
-
-
Check the server response to make sure your request was successful.
Data redistribution monitoring
There are several ways to monitor the data redistribution process:
-
If the cluster is closed from load and background data redistribution is disabled, you can only track progress using Yandex Cloud monitoring tools.
Follow the Expansion progress chart on the Monitoring tab of the cluster management page or request the gp_expand.* metrics using Yandex Monitoring.
-
If the cluster is not closed from load, you can also run SQL queries against objects in the gpexpand data schema.
Note
This schema is available only during cluster expansion; it is deleted afterwards.
To track the progress of the data redistribution process using SQL queries:
-
Connect to the postgres database as a user with the mdb_admin role.
-
Run a query to get the current process status:
SELECT dbname, fq_name, rank, status, expansion_started, expansion_finished FROM gpexpand.status_detail ORDER BY rank;
The query result will contain various details on the data redistribution status. The status column will indicate the current processing status for specific tables:
-
NOT STARTED: Table data redistribution is scheduled, and the table is in the redistribution queue.
If the cluster is not closed from load, you can modify the table priority to change its position in the queue if needed.
-
IN PROGRESS: Table data redistribution is in progress.
-
COMPLETED: Table data redistribution is completed.
-
NO LONGER EXISTS: The table no longer exists.
-
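To see overall progress at a glance, the per-table rows can be reduced to a count per status. The sketch below assumes the status query is run through psql in unaligned, tuples-only mode with a pipe separator, so that each row looks like dbname|fq_name|rank|status|...; summarize_status is a hypothetical helper name, and the connection parameters in the comment are placeholders.

```shell
# Hypothetical helper: counts tables per redistribution status.
# Reads pipe-separated rows (dbname|fq_name|rank|status|...) from standard input.
summarize_status() {
  awk -F'|' '{ count[$4]++ } END { for (s in count) print s ": " count[s] }' | sort
}

# Possible usage (connection string is a placeholder):
#   psql "<connection_string>" -At -F '|' \
#     -c "SELECT dbname, fq_name, rank, status FROM gpexpand.status_detail" | summarize_status
```

The output has one line per status found in the input, in the form `<status>: <number_of_tables>`.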
Additional actions
Changing the table redistribution priority
During preparation for cluster expansion, each table is assigned a rank that affects its redistribution queue priority. The higher its rank, the higher the table's queue priority and, therefore, the faster its data redistribution will be completed.
If you need to, you can change the rank for one or more tables provided the cluster is not closed from load. To do this, you must modify the table in the gpexpand data schema.
Note
This schema is available only during cluster expansion; it is deleted afterwards.
To edit the ranks of redistributed tables:
-
Connect to the postgres database as a user with the mdb_admin role.
-
Run a query to get the current status of the data redistribution process:
SELECT dbname, fq_name, rank, status, expansion_started, expansion_finished FROM gpexpand.status_detail ORDER BY rank;
This will include full table names (fq_name) and their ranks (rank).
-
Change the rank for one or more tables:
UPDATE gpexpand.status_detail SET rank=<required_rank> WHERE fq_name IN (<list_of_full_table_names>);
For more information, see this Greenplum® guide.
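When several tables need the same rank, the UPDATE statement can be generated from a list of names. This is a minimal sketch: build_rank_update is a hypothetical helper, and the table names in the usage example are made up. It only prints the statement; running it is left to the SQL client of your choice.

```shell
# Hypothetical helper: prints an UPDATE statement assigning one rank to several tables.
# Arguments: <rank> <full_table_name>...
build_rank_update() {
  rank=$1
  shift
  list=""
  for name in "$@"; do
    # Double any single quotes so the name is a valid SQL string literal.
    quoted=$(printf '%s' "$name" | sed "s/'/''/g")
    if [ -z "$list" ]; then
      list="'$quoted'"
    else
      list="$list, '$quoted'"
    fi
  done
  printf 'UPDATE gpexpand.status_detail SET rank=%s WHERE fq_name IN (%s);\n' "$rank" "$list"
}
```

For example, `build_rank_update 10 public.orders public.events` prints a statement whose IN list is `('public.orders', 'public.events')`.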
Running data redistribution manually
If background data redistribution was not used during cluster expansion and the set timeout for the data redistribution process was not long enough, the process may end before data redistribution is over for all tables.
This will leave some of the data distributed unevenly. As a result, the Unknown error: Partially Distributed Data error will appear in the cluster logs. To fix the error, run data redistribution manually:
-
Calculate the total number of segments in the cluster:
- Get detailed information about a cluster.
- Multiply the number of segments per host by the number of segment hosts.
-
Find tables with unevenly distributed data:
SELECT * FROM (SELECT oid, relname FROM pg_class) pgc JOIN (SELECT localoid AS oid, numsegments FROM gp_distribution_policy) gpdp USING (oid) WHERE numsegments != <total_number_of_segments>;
A list of tables will be displayed stating the following for each table:
- oid: Table ID.
- relname: Table name.
- numsegments: Number of segments storing the data.
For more information, see the Greenplum® guide on the gp_distribution_policy table.
-
Run data redistribution using the information you got earlier.
To redistribute the data of an individual table, run the following query:
-
For an ordinary table:
ALTER TABLE ONLY <table_name> EXPAND TABLE;
-
For a partitioned table:
ALTER TABLE <table_name> SET WITH (REORGANIZE=true) <distribution_policy>;
To learn the Greenplum® distribution policy value for an individual table partition, run the following query:
SELECT pg_get_table_distributedby(<partition_OID>) AS distribution_policy;
-
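The first and last steps above can be sketched in shell. Both helper names are hypothetical, and the sketch makes two assumptions: the rows from the search query are exported as relname|numsegments, and every listed table is an ordinary (non-partitioned) table whose name needs no schema qualification or quoting. Partitioned tables still require the SET WITH (REORGANIZE=true) form shown above.

```shell
# Total number of segments = segments per host x number of segment hosts.
# Arguments: <segments_per_host> <segment_hosts>
total_segments() {
  echo $(( $1 * $2 ))
}

# Reads "relname|numsegments" rows from standard input and prints an
# EXPAND TABLE statement for each table whose segment count differs from the total.
# Argument: <total_number_of_segments>
build_expand_statements() {
  awk -F'|' -v total="$1" '$2 != total { print "ALTER TABLE ONLY " $1 " EXPAND TABLE;" }'
}
```

For a cluster with 4 segments per host and 6 segment hosts, `total_segments 4 6` prints 24; tables already spread across 24 segments produce no statement.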
Greenplum® and Greenplum Database® are registered trademarks or trademarks of Broadcom Inc. in the United States and/or other countries.