Transferring data from a Greenplum® source endpoint
Yandex Data Transfer enables you to migrate data from a Greenplum® database and implement various data transfer, processing, and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Prepare the Greenplum® database for the transfer.
- Set up a source endpoint in Yandex Data Transfer.
- Set up one of the supported data targets.
- Create a transfer and start it.
- Perform required operations with the database and control the transfer.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data from Greenplum®
- Migration: Moving data from one storage to another. Migration often means moving a database from an obsolete local system to a managed cloud one.
- Uploading data to data marts: Transferring prepared data to storage for subsequent visualization.
For a detailed description of possible Yandex Data Transfer data transfer scenarios, see Tutorials.
Configuring the Greenplum® source endpoint
When creating or editing an endpoint, you can define:
- Yandex Managed Service for Greenplum® cluster connection or custom installation settings, including those based on Yandex Compute Cloud VMs. These are required parameters.
- Additional parameters.
Managed Service for Greenplum® cluster
Warning
To create or edit an endpoint of a managed database, you need to have the managed-greenplum.viewer role or the viewer primitive role assigned for the folder where this managed database cluster resides.
Connecting to the database with the cluster ID specified in Yandex Cloud.
- Managed Service for Greenplum® cluster: Specify the ID of the cluster to connect to.
- User: Specify the username that Data Transfer will use to connect to the database.
- Password: Enter the user's password to the database.
- Database: Specify the name of the database in the selected cluster.
- Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
Thus, you will be able to apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
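Before creating the endpoint, you may want to verify that the database, user, and password you plan to enter actually open a session on the cluster. Below is a minimal sketch using psycopg2 (not part of Data Transfer itself); the master host FQDN, port, and credentials are placeholders to substitute with values from your cluster.

```python
# Pre-flight connection check with the same database, user, and password
# that will go into the endpoint settings. Placeholders throughout.
import psycopg2

conn = psycopg2.connect(
    host="<master_host_FQDN>",  # placeholder: master host of the managed cluster
    port=5432,                  # assumption: default Greenplum port
    dbname="<database>",
    user="<user>",
    password="<password>",
    connect_timeout=10,
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])    # the version string should mention Greenplum 6
conn.close()
```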
Custom installation
Connecting to the database with explicitly specified network addresses and ports.
- Coordinator host: Specify the IP address or FQDN of the primary master host to connect to.
- Coordinator port: Specify the port for Data Transfer to use to connect to the primary master host.
- Coordinator mirror host: Specify the IP address or FQDN of the standby master host to connect to (leave the field empty if your cluster only has one master host).
- Coordinator mirror port: Specify the port for Data Transfer to use to connect to the standby master host (leave the field empty if your cluster only has one master host).
- Greenplum® cluster segments: Specify segment host connection information. If you omit these settings, segment host addresses will be retrieved automatically from the master host's housekeeping table.
- CA certificate: Upload the certificate file or add its contents as text if transmitted data must be encrypted, for example, to meet PCI DSS requirements.
- Subnet ID: Select or create a subnet in the desired availability zone.
If this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
- Database: Specify the name of the database in the selected cluster.
- User: Specify the username that Data Transfer will use to connect to the database.
- Password: Enter the user's password to the database.
- Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
This will let you apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
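The custom-installation fields map directly onto standard libpq connection parameters. The sketch below (psycopg2; hosts, ports, credentials, and the CA file path are placeholders) tries the coordinator first and falls back to its mirror, using the same CA certificate you would upload to the endpoint.

```python
# Sketch: translate the custom-installation endpoint fields into a test connection.
import psycopg2

def connect(host: str, port: int):
    return psycopg2.connect(
        host=host,
        port=port,
        dbname="<database>",
        user="<user>",
        password="<password>",
        sslmode="verify-full",          # encrypt traffic and verify the server certificate
        sslrootcert="/path/to/ca.pem",  # the CA certificate you upload to the endpoint
        connect_timeout=10,
    )

try:
    conn = connect("<coordinator_host>", 5432)         # Coordinator host / port
except psycopg2.OperationalError:
    conn = connect("<coordinator_mirror_host>", 5432)  # Coordinator mirror host / port
conn.close()
```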
Table filter
- Included tables: Only data from the tables listed here will be transferred.
If a table is partitioned, you can use this field to specify both the entire table and individual partitions.
Make sure the user performing the data transfer has all the necessary privileges on the tables you add to the list.
If you add new tables while editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the historical data for these tables will not be uploaded. To add a table along with its historical data, use the List of objects for transfer field in the transfer settings.
- Excluded tables: Data from the listed tables is not transferred.
If a table is partitioned, list all of its partitions to exclude it completely.
The lists include the schema name (a description of DB contents, structure, and integrity constraints) and the table name. Both lists support expressions in the following formats (illustrated in the sketch after this settings list):
- <schema_name>.<table_name>: Fully qualified table name.
- <schema_name>.*: All tables in the specified schema.
- <table_name>: Table in the default schema.
Included and excluded table names must meet the ID naming rules in Greenplum®. Double quotes within a table name are not supported. Outer quotes are only used as delimiters and are deleted when paths are processed.
- Snapshot consistency: When enabled, Data Transfer applies additional steps to the source to ensure snapshot consistency.
- Service object schema: Schema for placing the transfer's auxiliary objects.
The schema name must meet the ID naming rules in Greenplum®. Double quotes in a schema name are not supported.
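To illustrate the three expression formats supported by the table filter, here is a small matcher that follows the documented semantics. This is only an illustration, not the service's actual implementation; the default schema is assumed to be public.

```python
# Illustration of how filter expressions could be matched against
# fully qualified table names; not the service's implementation.
def matches(expr: str, schema: str, table: str, default_schema: str = "public") -> bool:
    if "." in expr:
        expr_schema, expr_table = expr.split(".", 1)
        if expr_table == "*":                                 # <schema_name>.*
            return schema == expr_schema
        return (schema, table) == (expr_schema, expr_table)   # <schema_name>.<table_name>
    return schema == default_schema and table == expr         # <table_name>

assert matches("sales.orders", "sales", "orders")
assert matches("sales.*", "sales", "orders_2024_01")   # any table in schema "sales"
assert matches("orders", "public", "orders")           # table in the default schema
assert not matches("sales.*", "public", "orders")
```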
Specifics of working with the Greenplum® source
Data Transfer only supports Greenplum® version 6. Greenplum® versions 4 and 5 are not supported.
The service performs operations on a Greenplum® cluster at the READ COMMITTED isolation level.
Data Transfer supports operation with parallel copy enabled for a Greenplum® source.
During operation with parallel copy enabled, Data Transfer maintains an open transaction on the Greenplum® master host. If this transaction is interrupted, a transfer will return an error.
With parallel copy disabled, a transfer will move data from Greenplum® objects such as TABLE, VIEW, FOREIGN TABLE, and EXTERNAL TABLE. Data from these objects will be treated as data from ordinary tables and processed by the target accordingly. With parallel copy enabled, a transfer will only move tables (TABLE objects). Tables with the DISTRIBUTED REPLICATED [distribution policy](https://docs.vmware.com/en/VMware-Greenplum/6-19/admin_guide/distribution.html) will not be transferred.
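Before enabling parallel copy, it can be useful to check which tables use the DISTRIBUTED REPLICATED policy and would therefore be skipped. Below is a sketch of such a check, assuming Greenplum® 6, where gp_distribution_policy.policytype = 'r' marks replicated tables; connection parameters are placeholders.

```python
# Sketch: list tables that use the DISTRIBUTED REPLICATED policy and therefore
# will not be transferred with parallel copy enabled (assumes Greenplum 6).
import psycopg2

QUERY = """
SELECT n.nspname, c.relname
FROM gp_distribution_policy p
JOIN pg_class c ON c.oid = p.localoid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE p.policytype = 'r'
ORDER BY 1, 2;
"""

with psycopg2.connect(host="<coordinator_host>", port=5432,
                      dbname="<database>", user="<user>",
                      password="<password>") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for schema, table in cur.fetchall():
            print(f"{schema}.{table}: DISTRIBUTED REPLICATED, will be skipped")
```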
Snapshot consistency
When starting a transfer with parallel copy disabled (the default), the service copies data by interacting only with the Greenplum® cluster's master host. The copied tables are accessed in ACCESS SHARE lock mode. Snapshot consistency is achieved through Greenplum® mechanisms.
When starting a transfer with parallel copy enabled, the service copies data by interacting with both the Greenplum® cluster's master host and its segment hosts in utility mode. The tables being copied are locked in ACCESS SHARE or SHARE mode, depending on the Snapshot consistency setting.
To guarantee snapshot consistency, transfers with parallel copy enabled need the data in the transferred tables to remain static. With ACCESS SHARE locks (the default), the service does not guarantee that the data remains static: this must be ensured externally. With SHARE locks, Greenplum® mechanisms guarantee that data in the source tables remains static.
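As an illustration of the difference between the two modes: while a transfer holds only ACCESS SHARE locks, you can keep the data static yourself, for example by holding SHARE locks on the transferred tables from a separate session for the duration of the copy. A sketch with placeholder table names and connection parameters:

```python
# Sketch: hold SHARE locks from a separate session so that writes are blocked
# (reads still work) while a snapshot running under ACCESS SHARE locks proceeds.
import psycopg2

TABLES = ["sales.orders", "sales.customers"]  # hypothetical tables being transferred

conn = psycopg2.connect(host="<coordinator_host>", port=5432,
                        dbname="<database>", user="<user>", password="<password>")
conn.autocommit = False
with conn.cursor() as cur:
    for t in TABLES:
        cur.execute(f"LOCK TABLE {t} IN SHARE MODE;")  # blocks INSERT/UPDATE/DELETE
    input("Tables are locked; run the snapshot, then press Enter to release...")
conn.rollback()  # releases the locks
conn.close()
```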
Greenplum® and Greenplum Database® are registered trademarks or trademarks of VMware, Inc. in the United States and/or other countries.
Configuring the data target
Configure one of the supported data targets.
For a complete list of supported sources and targets in Yandex Data Transfer, see Available Transfers.
After configuring the data source and target, create and start the transfer.