Transferring data between Yandex Data Processing clusters using Metastore
You can migrate metadata between Yandex Data Processing clusters that use the Hive DBMS: first export the metadata from the source cluster, then import it into the target cluster using Hive Metastore.
To transfer metadata between Yandex Data Processing clusters:
- Create a test table.
- Export data.
- Connect Yandex Data Processing to Metastore.
- Import data.
- Check the result.
If you no longer need the resources you created, delete them.
Note
Metastore is at the Preview stage.
Getting started
Prepare the infrastructure:
**Manually**

- Create a service account named `dataproc-s3-sa` and assign it the `dataproc.agent`, `dataproc.provisioner`, and `storage.uploader` roles.
- In Yandex Object Storage, create a bucket named `dataproc-bucket`. Grant the `READ and WRITE` permission for this bucket to the service account.
- Create a cloud network named `dataproc-network`.
- In this network, create a subnet named `dataproc-subnet`.
- Set up a NAT gateway for the subnet you created.
- Create a security group named `dataproc-security-group` with the following rules:
  | What service requires the rule | Why the rule is required | Rule settings |
  | --- | --- | --- |
  | Yandex Data Processing | For incoming service traffic | Port range: `0-65535`<br>Protocol: `Any`<br>Source: `Security group`<br>Security group: `Self` |
  | Yandex Data Processing | For incoming traffic, to allow access to NTP servers for time syncing | Port range: `123`<br>Protocol: `UDP`<br>Source: `CIDR`<br>CIDR blocks: `0.0.0.0/0` |
  | Yandex Data Processing | For incoming traffic, to connect from the internet via SSH to subcluster hosts with public access | Port range: `22`<br>Protocol: `TCP`<br>Source: `CIDR`<br>CIDR blocks: `0.0.0.0/0` |
  | Metastore | For incoming client traffic | Port range: `30000-32767`<br>Protocol: `Any`<br>Source: `CIDR`<br>CIDR blocks: `0.0.0.0/0` |
  | Metastore | For incoming load balancer traffic | Port range: `10256`<br>Protocol: `Any`<br>Source: `Load balancer health checks` |
  | Yandex Data Processing | For outgoing service traffic | Port range: `0-65535`<br>Protocol: `Any`<br>Destination: `Security group`<br>Security group: `Self` |
  | Yandex Data Processing | For outgoing HTTPS traffic | Port range: `443`<br>Protocol: `TCP`<br>Destination: `CIDR`<br>CIDR blocks: `0.0.0.0/0` |
  | Yandex Data Processing | For outgoing traffic, to allow access to NTP servers for time syncing | Port range: `123`<br>Protocol: `UDP`<br>Destination: `CIDR`<br>CIDR blocks: `0.0.0.0/0` |
  | Yandex Data Processing | For outgoing traffic, to allow Yandex Data Processing cluster connections to Metastore | Port range: `9083`<br>Protocol: `Any`<br>Destination: `CIDR`<br>CIDR blocks: `0.0.0.0/0` |
- Create two Yandex Data Processing clusters named `dataproc-source` and `dataproc-target` with the following settings:

  - Services:
    - `HDFS`
    - `HIVE`
    - `SPARK`
    - `YARN`
    - `ZEPPELIN`
  - Service account: `dataproc-s3-sa`.
  - Availability zone: the zone where `dataproc-subnet` resides.
  - Properties: `spark:spark.sql.hive.metastore.sharedPrefixes` with the `com.amazonaws,ru.yandex.cloud` value. It is required for PySpark jobs and for integration with Metastore.
  - Bucket name: `dataproc-bucket`.
  - Network: `dataproc-network`.
  - Security groups: `dataproc-security-group`.
  - UI Proxy: Enabled.
  - Subnet for the Yandex Data Processing subclusters: `dataproc-subnet`.
  - Public access for the master host: Enabled.
**Terraform**

- If you do not have Terraform yet, install it.
- Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.
- Configure and initialize a provider. There is no need to create a provider configuration file manually: you can download it.
- Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
- Download the `metastore-import.tf` configuration file to the same working directory.

  This file describes:

  - Network.
  - NAT gateway and route table required for Yandex Data Processing.
  - Subnet.
  - Security group for Yandex Data Processing and Metastore.
  - Service account.
  - Static access key for creating the Yandex Object Storage bucket.
  - Bucket.
  - Two Yandex Data Processing clusters.

- Specify the following in the `metastore-import.tf` file:

  - `folder_id`: Cloud folder ID, same as in the provider settings.
  - `dp_ssh_key`: Absolute path to the public key for the Yandex Data Processing clusters. For more information, see Connecting to a Yandex Data Processing host via SSH.

- Check that the Terraform configuration files are correct using this command:

  ```bash
  terraform validate
  ```

  If there are any errors in the configuration files, Terraform will point them out.

- Create the required infrastructure:

  - Run the command to view the planned changes:

    ```bash
    terraform plan
    ```

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to create and their parameters. This is a test step: no resources are created yet.

  - If you are happy with the planned changes, apply them:

    - Run the command:

      ```bash
      terraform apply
      ```

    - Confirm the creation of the resources.
    - Wait for the operation to complete.

  All the required resources will be created in the specified folder. You can check resource availability and settings in the management console.

Warning

Do not assign a bucket policy for the bucket; otherwise, the Metastore cluster will not be able to write any data to it.
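Before moving on, you can optionally check that the service account can actually write to the bucket. Below is a minimal sketch, not part of the official steps: it assumes a static access key for `dataproc-s3-sa` is exported in the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables, which boto3 picks up automatically.

```python
# Hypothetical sanity check: Yandex Object Storage exposes an S3-compatible
# API, so boto3 works once pointed at the storage.yandexcloud.net endpoint.
import boto3

s3 = boto3.client("s3", endpoint_url="https://storage.yandexcloud.net")

# Write a small test object and list the bucket contents.
s3.put_object(Bucket="dataproc-bucket", Key="write-test.txt", Body=b"ok")
print(s3.list_objects_v2(Bucket="dataproc-bucket").get("KeyCount"))
```

If the `put_object` call fails with an access error, re-check the service account roles and make sure no bucket policy is attached to the bucket.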
Create a test table

In the `dataproc-source` cluster, create a test table named `countries`:
- Go to the folder page and select Yandex Data Processing.
- Open the `dataproc-source` cluster page.
- Click the Zeppelin Web UI link under UI Proxy.
- Select Notebook, then select Create new note.
- In the window that opens, specify the name for the note and click Create.
- To run a PySpark job, paste a Python script into the input line:

  ```python
  %pyspark
  from pyspark.sql.types import *

  schema = StructType([
      StructField('Name', StringType(), True),
      StructField('Capital', StringType(), True),
      StructField('Area', IntegerType(), True),
      StructField('Population', IntegerType(), True)
  ])
  df = spark.createDataFrame([
      ('Australia', 'Canberra', 7686850, 19731984),
      ('Austria', 'Vienna', 83855, 7700000)
  ], schema)
  df.write.mode("overwrite").option("path", "s3a://dataproc-bucket/countries").saveAsTable("countries")
  ```
- Click Run all paragraphs and wait until the job is complete.
- Replace the Python code in the input line with this SQL query:

  ```sql
  %sql
  SELECT * FROM countries;
  ```

- Click Run all paragraphs.

  Result:

  | Name      | Capital  | Area    | Population |
  | --------- | -------- | ------- | ---------- |
  | Australia | Canberra | 7686850 | 19731984   |
  | Austria   | Vienna   | 83855   | 7700000    |
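To confirm that the table metadata points at the bucket rather than at cluster-local storage, you can run one more paragraph in the same Zeppelin note. This is an optional check, not part of the original steps:

```python
%pyspark
# Optional check: the Location field in the output should point at
# s3a://dataproc-bucket/countries, the path set when the table was saved.
spark.sql("DESCRIBE FORMATTED countries").show(50, truncate=False)
```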
Export data

To transfer the data from one Yandex Data Processing cluster to another, back up the data in the `dataproc-source` cluster using `pg_dump`:
- Use SSH to connect to the `dataproc-source` cluster's master host:

  ```bash
  ssh ubuntu@<master_host_FQDN>
  ```

  You can learn how to get the FQDN here.

- Create a backup and save it to the `metastore_dump.sql` file:

  ```bash
  pg_dump --data-only --schema public \
      postgres://hive:hive-p2ssw0rd@localhost/metastore > metastore_dump.sql
  ```

  The `--data-only` flag exports only the table rows, without the schema definitions.

- Disconnect from the master host.
- Download the `metastore_dump.sql` file to the current directory on your local machine:

  ```bash
  scp ubuntu@<master_host_FQDN>:metastore_dump.sql .
  ```
- Upload the `metastore_dump.sql` file to the `dataproc-bucket` bucket.
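You can upload the file through the management console or with any S3-compatible tool. As an illustration only, here is a minimal boto3 sketch; it assumes a static access key with write permission for the bucket is available in the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESSKEY`-style environment variables that boto3 reads automatically:

```python
# A sketch of uploading the dump via the S3-compatible API of Object Storage.
import boto3

s3 = boto3.client("s3", endpoint_url="https://storage.yandexcloud.net")
s3.upload_file("metastore_dump.sql", "dataproc-bucket", "metastore_dump.sql")
```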
Connect Yandex Data Processing to Metastore

- Create a Metastore cluster with the following parameters:

  - Service account: `dataproc-s3-sa`.
  - Network: `dataproc-network`.
  - Subnet: `dataproc-subnet`.
  - Security groups: `dataproc-security-group`.

- Add the `spark:spark.hive.metastore.uris` property with the `thrift://<Metastore_cluster_IP_address>:9083` value to the `dataproc-target` cluster settings.

  To find out the Metastore cluster IP address, select Yandex Data Processing in the management console, then select the Metastore page in the left-hand panel. You will see the cluster IP address under General information.
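Once the property is applied, you can optionally verify from a Zeppelin note on `dataproc-target` that Spark now talks to the Metastore cluster. A minimal sketch; before the import, expect to see only the `default` database:

```python
%pyspark
# If spark.hive.metastore.uris is applied, this query is served by the
# Metastore cluster rather than the cluster-local Hive metastore.
spark.sql("SHOW DATABASES").show()
```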
Import data

- Open the Metastore cluster page.
- Click Import.
- In the window that opens, specify the `dataproc-bucket` bucket and the `metastore_dump.sql` file.
- Click Import.
- Wait for the import to complete. You can check the import status on the Metastore cluster page under Operations.
Check the result

- Open the `dataproc-target` cluster page.
- Click the Zeppelin Web UI link under UI Proxy.
- Select Notebook, then select Create new note.
- In the window that opens, specify the name for the note and click Create.
- Run the following SQL query:

  ```sql
  %sql
  SELECT * FROM countries;
  ```

- Click Run all paragraphs.

  Result:

  | Name      | Capital  | Area    | Population |
  | --------- | -------- | ------- | ---------- |
  | Australia | Canberra | 7686850 | 19731984   |
  | Austria   | Vienna   | 83855   | 7700000    |
The metadata from the `dataproc-source` cluster was successfully imported into the `dataproc-target` cluster.
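The same check can also be done from PySpark instead of `%sql`; a small optional sketch:

```python
%pyspark
# Read the imported table through the Metastore and confirm that both rows
# created in dataproc-source are visible from dataproc-target.
df = spark.table("countries")
df.show()
print(df.count())  # expected: 2
```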
Delete the resources you created

Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:

- Delete the objects from the bucket.
- Delete the other resources depending on how they were created. If you created them with Terraform:

  - In the terminal window, go to the directory containing the infrastructure plan.
  - Delete the `metastore-import.tf` configuration file.
  - Check that the Terraform configuration files are correct using this command:

    ```bash
    terraform validate
    ```

    If there are any errors in the configuration files, Terraform will point them out.

  - Run the command to view the planned changes:

    ```bash
    terraform plan
    ```

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to delete and their parameters. This is a test step: no resources are deleted yet.

  - If you are happy with the planned changes, apply them:

    - Run the command:

      ```bash
      terraform apply
      ```

    - Confirm the deletion of the resources.
    - Wait for the operation to complete.

    All the resources described in the `metastore-import.tf` configuration file will be deleted.