Loading data from Yandex Object Storage to Yandex Managed Service for Greenplum® using Yandex Data Transfer
You can migrate data from Object Storage to the Managed Service for Greenplum® table using Data Transfer. To do this:
- Prepare the test data.
- Create a database in the target cluster.
- Prepare and activate the transfer.
- Test the transfer.
If you no longer need the resources you created, delete them.
Getting started
-
Prepare the infrastructure:
ManuallyUsing Terraform-
Create a Managed Service for Greenplum® target cluster in any applicable configuration with publicly available hosts and the following settings:
- Username:
user1
- Password:
<user_password>
- Username:
-
If using security groups in your cluster, make sure they are configured correctly and allow connecting to the cluster.
-
Create a service account named
storage-viewer
with thestorage.viewer
role. The transfer will use it to access the bucket. -
Create a static access key for the
storage-viewer
service account.
-
If you do not have Terraform yet, install it.
-
Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.
-
Configure and initialize a provider. There is no need to create a provider configuration file manually, you can download it
. -
Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
-
Download the object-storage-to-greenplum.tf
configuration file to the same working directory.This file describes:
- Network.
- Subnet.
- Security group required to connect to a cluster.
- Service account to be used to create and access the bucket.
- Yandex Lockbox secret which will store the static key of the service account to configure the source endpoint.
- Object Storage source bucket.
- Managed Service for Greenplum® target cluster.
- Transfer.
-
In the
object-storage-to-greenplum.tf
file, specify the values of the following variables:folder_id
: Cloud folder ID, same as in the provider settings.bucket_name
: Bucket name consistent with the naming conventions.gp_version
: Greenplum® version.gp_password
: Greenplum® user password.
-
Make sure the Terraform configuration files are correct using this command:
terraform validate
If there are any errors in the configuration files, Terraform will point them out.
-
Create the required infrastructure:
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console
. -
-
-
Enable the Access from Data Transfer option in the target cluster.
Prepare the test data
-
Prepare two CSV files with test data:
-
demo_data1.csv
:1,Anna 2,Robert 3,Umar 4,Algul 5,Viktor
-
demo_data2.csv
:6,Maria 7,Alex
-
-
Upload the
demo_data1.csv
file to the Object Storage bucket.
Create a database in the target cluster
-
Connect to the auxiliary
postgres
database in the Managed Service for Greenplum® target cluster asuser1
. -
Create a database named
db1
:CREATE DATABASE db1;
Prepare and activate the transfer
-
Create a source endpoint of the
Object Storage
type with the following settings:-
Database type:
Object Storage
-
Bucket: Bucket name in Object Storage
-
Access Key ID: Public part of the service account static key. If you created your infrastructure with Terraform, copy the key value from the Yandex Lockbox secret.
-
Secret Access Key: Private part of the service account static key. If you created your infrastructure with Terraform, copy the key value from the Yandex Lockbox secret.
-
Endpoint:
https://storage.yandexcloud.net
-
Region:
ru-central1
-
Data format:
CSV
-
Delimiter: Comma (
,
) -
Table:
table1
-
Result table schema: Select
Manual
and specify field names and data types:Id
:Int64
Name
:UTF8
Leave the default values for other properties.
-
-
Create a target endpoint of the
Greenplum®
type and specify the cluster connection parameters in it:- Connection type:
Managed Service for Greenplum cluster
- Managed Service for Greenplum cluster:
<Greenplum®_target_cluster_name>
from the drop-down list - Database:
db1
- User:
user1
- Password:
<user_password>
- Connection type:
-
Create and activate the transfer:
ManuallyUsing Terraform-
Create a transfer of the Snapshot and replication type that will use the created endpoints.
-
Activate the transfer and wait for its status to change to Replicating.
-
In the
object-storage-to-greenplum.tf
file, specify these variables:source_endpoint_id
: Source endpoint ID.target_endpoint_id
: Target endpoint ID.transfer_enabled
:1
to create a transfer.
-
Make sure the Terraform configuration files are correct using this command:
terraform validate
If there are any errors in the configuration files, Terraform will point them out.
-
Create the required infrastructure:
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
-
-
The transfer will be activated automatically. Wait for its status to change to Replicating.
-
Test the transfer
Check the transfer performance by testing the copy and replication processes.
Test the copy process
-
Connect to
db1
in the Managed Service for Greenplum® target cluster. -
Run the following query:
SELECT * FROM public.table1;
Response example
__file_name | __row_index | Id | Name ---------------+-------------+----+-------- demo_data1.csv | 1 | 1 | Anna demo_data1.csv | 2 | 2 | Robert demo_data1.csv | 3 | 3 | Umar demo_data1.csv | 4 | 4 | Algul demo_data1.csv | 5 | 5 | Viktor
Test the replication process
-
Upload the
demo_data2.csv
file to the Object Storage bucket. -
Make sure the data from
demo_data2.csv
has been added to the target database:-
Connect to
db1
in the Managed Service for Greenplum® target cluster. -
Run the following query:
SELECT * FROM public.table1;
Response example
__file_name | __row_index | Id | Name ---------------+-------------+----+-------- demo_data1.csv | 1 | 1 | Anna demo_data1.csv | 2 | 2 | Robert demo_data1.csv | 3 | 3 | Umar demo_data1.csv | 4 | 4 | Algul demo_data1.csv | 5 | 5 | Viktor demo_data2.csv | 1 | 6 | Maria demo_data2.csv | 2 | 7 | Alex
-
Delete the resources you created
Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:
-
Delete the other resources depending on how they were created:
ManuallyUsing Terraform-
Delete all objects from the bucket.
-
In the terminal window, go to the directory containing the infrastructure plan.
-
Delete the
object-storage-to-greenplum.tf
configuration file. -
Make sure the Terraform configuration files are correct using this command:
terraform validate
If there are any errors in the configuration files, Terraform will point them out.
-
Confirm updating the resources.
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
All the resources described in the
object-storage-to-greenplum.tf
configuration file will be deleted. -
-