Importing data from Yandex Managed Service for MySQL® clusters using Sqoop
The Sqoop utility allows you to import databases to the Yandex Data Processing cluster. Depending on the Yandex Data Processing cluster configuration, you can import data to:
- Yandex Object Storage bucket
- HDFS directory
- Apache Hive
- Apache HBase
To use Sqoop to import the source cluster databases to the Yandex Data Processing target cluster:
- Prepare the source cluster.
- Import the database.
- Verify the import.

If you no longer need the resources you created, delete them.
Note
Sqoop is not supported for Yandex Data Processing clusters of version 2.0 and higher. Use Apache Spark™ features instead.
Getting started
Note
Place the clusters and the VM instance in the same cloud network.
- Create a cloud network.
- Create a subnet in the ru-central1-c availability zone.
- Set up a NAT gateway for the new subnet: this is required for the Yandex Data Processing cluster to work.
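If you prefer the CLI for these network prerequisites, the steps map to commands like the following. This is a minimal sketch: the resource names and the subnet CIDR are illustrative assumptions, and the NAT gateway is attached through a route table, as in the standard Yandex Cloud flow.

# Hedged sketch: network, subnet, and NAT gateway via the YC CLI.
# All names and the CIDR below are assumptions, not values from this tutorial.
yc vpc network create --name my-network
yc vpc subnet create \
    --name my-subnet \
    --zone ru-central1-c \
    --network-name my-network \
    --range 10.1.0.0/24
yc vpc gateway create --name my-nat-gateway
yc vpc route-table create \
    --name my-route-table \
    --network-name my-network \
    --route destination=0.0.0.0/0,gateway-id=<gateway_ID>
yc vpc subnet update my-subnet --route-table-id <route_table_ID>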
You can create other resources manually or using Terraform.
Manually
- Create a Managed Service for MySQL® cluster in any suitable configuration with the following settings (a hedged YC CLI sketch of these manual steps follows this list):
  - DB name: db1
  - Username: user1
- To import the data to the Object Storage bucket:
  - Create a bucket with restricted access.
  - Create a service account with the roles required for working with Object Storage.
  - Grant this service account read and write permissions for this bucket.
- Create a Yandex Data Processing cluster in any suitable configuration. Specify the settings for the storage to import the data to:

  Object Storage
  - Service account: Name of the previously created service account.
  - Bucket name: Name of the previously created bucket.
  - Services: Sqoop.

  HDFS directory
  - Services: HBase, HDFS, Sqoop, Yarn, Zookeeper.

  Apache Hive
  - Services: HDFS, Hive, Mapreduce, Sqoop, Yarn.
  - Properties: the hive:hive.execution.engine key with the mr value.

  Apache HBase
  - Services: HBase, HDFS, Sqoop, Yarn, Zookeeper.
- Create a virtual machine to connect to the Managed Service for MySQL® and Yandex Data Processing clusters.
- If you are using security groups for the clusters and the VM instance, configure them to allow connecting:
  - To the VM instance via SSH.
  - From the VM instance to the Managed Service for MySQL® cluster hosts.
  - From the VM instance to the Yandex Data Processing cluster hosts.
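The console steps above can also be scripted with the YC CLI. Below is a minimal, hedged sketch of the same resources for the Object Storage variant; every name, version, password, and size is an illustrative assumption, so check yc ... create --help for the exact flags before running it.

# Hedged YC CLI sketch of the manual steps above.

# Managed Service for MySQL® cluster with the db1 database and the user1 user:
yc managed-mysql cluster create \
    --name mysql-source \
    --environment production \
    --network-name my-network \
    --mysql-version 8.0 \
    --resource-preset s2.micro \
    --disk-type network-ssd --disk-size 20 \
    --host zone-id=ru-central1-c,subnet-id=<subnet_ID> \
    --database name=db1 \
    --user name=user1,password=<password>

# Service account and a restricted-access bucket:
yc iam service-account create --name dataproc-sa
yc storage bucket create --name <bucket_name>

# Yandex Data Processing cluster (Sqoop requires an image version below 2.0):
yc dataproc cluster create \
    --name dataproc-target \
    --version 1.4 \
    --services hdfs,yarn,mapreduce,sqoop \
    --bucket <bucket_name> \
    --service-account-name dataproc-sa \
    --zone ru-central1-c \
    --ssh-public-keys-file ~/.ssh/id_ed25519.pub \
    --subcluster name=master,role=masternode,resource-preset=s2.micro,disk-type=network-ssd,disk-size=40,subnet-id=<subnet_ID> \
    --subcluster name=data,role=datanode,resource-preset=s2.micro,disk-type=network-ssd,disk-size=40,subnet-id=<subnet_ID>,hosts-count=1

# Virtual machine with a public IP for connecting to both clusters:
yc compute instance create \
    --name connect-vm \
    --zone ru-central1-c \
    --network-interface subnet-id=<subnet_ID>,nat-ip-version=ipv4 \
    --create-boot-disk image-folder-id=standard-images,image-family=ubuntu-2004-lts \
    --ssh-key ~/.ssh/id_ed25519.pub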
Using Terraform
- If you do not have Terraform yet, install it.
- Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file (see the sketch at the end of this section).
- Configure and initialize a provider. There is no need to create a provider configuration file manually; you can download it.
- Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
- Download the clusters-mysql-data-proc-and-vm.tf configuration file and save it to the same working directory.
  This file describes:
- Security groups for the clusters and VM.
- Service account for the Yandex Data Processing cluster.
- Object Storage bucket.
- Managed Service for MySQL® cluster.
- Yandex Data Processing cluster.
- Virtual machine with public internet access.
- Specify the infrastructure parameters in the clusters-mysql-data-proc-and-vm.tf configuration file under locals:
  - folder_id: ID of the folder to create resources in.
  - network_id: ID of the previously created cloud network.
  - subnet_id: ID of the previously created subnet.
  - storage_sa_id: ID of the service account to use for creating a bucket in Object Storage.
  - data_proc_sa: Name of the service account for the Yandex Data Processing cluster. The name must be unique within the folder.
  - my_cluster_version: MySQL® version of the Managed Service for MySQL® cluster.
  - my_cluster_password: Password of the user1 user of the db1 Managed Service for MySQL® database.
  - vm_image_id: ID of the public image with Ubuntu and no GPU, e.g., for Ubuntu 20.04 LTS.
  - vm_username and vm_public_key: Username and absolute path to a public SSH key that will be used to access the virtual machine. By default, the specified username is ignored in the Ubuntu 20.04 LTS image; a user with the ubuntu username is created instead. Use it to connect to the instance.
  - bucket_name: Object Storage bucket name. The name must be unique across the entire Object Storage.
  - dp_public_key: Absolute path to a public SSH key for the Yandex Data Processing cluster. For an SSH connection to the hosts of a Yandex Data Processing cluster of version 1.x, use the root username.
- Make sure the Terraform configuration files are correct using this command:
  terraform validate
  If there are any errors in the configuration files, Terraform will point them out.
- Create the required infrastructure:
  - Run the command to view the planned changes:
    terraform plan
    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:
      terraform apply
    - Confirm the update of resources.
    - Wait for the operation to complete.
  All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console.
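If you chose environment variables for authentication, the flow down to the apply step looks roughly like this. A minimal sketch: the token and IDs are placeholders, and YC_TOKEN, YC_CLOUD_ID, and YC_FOLDER_ID are the variables the yandex Terraform provider reads.

# Hedged sketch of the Terraform flow with credentials in environment variables:
export YC_TOKEN=<OAuth_or_IAM_token>
export YC_CLOUD_ID=<cloud_ID>
export YC_FOLDER_ID=<folder_ID>

terraform init        # initializes the provider in the working directory
terraform validate    # checks configuration syntax
terraform plan        # dry run: shows the resources to be created
terraform apply       # creates the infrastructure after confirmation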
Preparing the source cluster
- Connect to the db1 database of the Yandex Managed Service for MySQL® cluster as user1 (for example, with the mysql CLI; see the sketch after this list).
- Add test data to the database. The example uses a simple table with people's names and ages:
  - Create a table:
    CREATE TABLE persons (
        Name VARCHAR(30) NOT NULL,
        Age INTEGER DEFAULT 0,
        PRIMARY KEY (Name)
    );
  - Populate the table with data:
    INSERT INTO persons (Name, Age)
    VALUES ('Anna', 19), ('Michael', 65), ('Fred', 28), ('Alsou', 50),
           ('Max', 27), ('John', 34), ('Dmitry', 42), ('Oleg', 19),
           ('Alina', 20), ('Maria', 28);
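As a quick check that the source data is in place, you can query the table from the VM with the mysql client, assuming it is installed there. A hedged sketch: the host below is built from the example cluster ID used in the next section, and depending on the cluster's SSL settings you may also need the --ssl-ca option with the Yandex Cloud CA certificate.

# Hedged sketch: counting the test rows in db1 as user1.
mysql --host=c-c9qgcd6lplrs********.rw.mdb.yandexcloud.net \
      --port=3306 \
      --user=user1 \
      --password \
      --execute="SELECT COUNT(*) FROM persons;" \
      db1   # should return 10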
Importing the database
To enable parallel copying of the database, pass the name of the column used for splitting to the --split-by parameter. The example uses the age column.
Let's assume that:
- FQDN of the Yandex Data Processing subcluster host for data storage: rc1c-dataproc-d-vfw6fa8x********.mdb.yandexcloud.net.
- Bucket name in Object Storage: <bucket_name>.
- Directory names in Object Storage and HDFS: import-directory.
- Apache Hive database name: db-hive.
- Name of the Apache HBase column family: family1.
- Names of the HBase and Hive tables: import-table.
- Managed Service for MySQL® cluster ID: c9qgcd6lplrs********.
Object Storage
- Run this command:
  sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect "jdbc:mysql://c-c9qgcd6lplrs********.rw.mdb.yandexcloud.net:3306/db1" \
      --username "user1" \
      -P \
      --table "persons" \
      --target-dir "s3a://<bucket_name>/import-directory" \
      --split-by "age"
HDFS directory
- Run this command:
  sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect "jdbc:mysql://c-c9qgcd6lplrs********.rw.mdb.yandexcloud.net:3306/db1" \
      --username "user1" \
      --table "persons" \
      --target-dir "import-directory" \
      -P \
      --split-by "age"
Apache Hive
- Run this command:
  sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect "jdbc:mysql://c-c9qgcd6lplrs********.rw.mdb.yandexcloud.net:3306/db1" \
      --username "user1" \
      -P \
      --table "persons" \
      --hive-import \
      --create-hive-table \
      --hive-database "db-hive" \
      --hive-table "import-table" \
      --split-by "age"
Apache HBase
- Run this command:
  sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect "jdbc:mysql://c-c9qgcd6lplrs********.rw.mdb.yandexcloud.net:3306/db1" \
      --username "user1" \
      -P \
      --table "persons" \
      --hbase-create-table \
      --column-family "family1" \
      --hbase-table "import-table" \
      --split-by "age"
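Before running an import, you can sanity-check the JDBC connectivity from the Yandex Data Processing host. A small sketch using Sqoop's list-tables tool with the same connection string:

# Hedged sketch: verify the connection by listing the tables in db1.
sqoop list-tables \
    --connect "jdbc:mysql://c-c9qgcd6lplrs********.rw.mdb.yandexcloud.net:3306/db1" \
    --username "user1" \
    -P
# Expected output includes the persons table.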
Verify the import
If the import was successful, you will see the contents of the persons table.

Object Storage
Download the files with the import results from the bucket, for example, using the AWS CLI (see the sketch after these instructions).
HDFS directory
- Connect over SSH to the Yandex Data Processing subcluster host used for data storage.
- Run this command:
  hdfs dfs -cat /user/root/import-directory/*
Apache Hive
- Connect over SSH to the Yandex Data Processing subcluster host used for data storage.
- Run this command:
  hive -e "SELECT * FROM import-table;"
Apache HBase
- Connect over SSH to the Yandex Data Processing subcluster host used for data storage.
- Run this command:
  echo -e "scan 'import-table'" | hbase shell -n
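For the Object Storage case, one way to fetch the results is the AWS CLI pointed at the Object Storage endpoint. A hedged sketch: it assumes you have configured a static access key for the service account via aws configure.

# Hedged sketch: download the import results from the bucket.
aws --endpoint-url=https://storage.yandexcloud.net \
    s3 cp --recursive "s3://<bucket_name>/import-directory" ./import-directory
cat ./import-directory/*   # rows exported from the persons table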
Delete the resources you created
Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:
To delete the infrastructure created with Terraform:
- In the terminal window, go to the directory containing the infrastructure plan.
- Delete the clusters-mysql-data-proc-and-vm.tf configuration file.
- Make sure the Terraform configuration files are correct using this command:
  terraform validate
  If there are any errors in the configuration files, Terraform will point them out.
- Run the command to view the planned changes:
  terraform plan
  If the resource configuration descriptions are correct, the terminal will display a list of the resources to delete and their parameters. This is a test step. No resources are updated.
- If you are happy with the planned changes, apply them:
  - Run the command:
    terraform apply
  - Confirm the deletion of resources.
  - Wait for the operation to complete.
  This will delete all the resources described in the clusters-mysql-data-proc-and-vm.tf configuration file.
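Alternatively, instead of deleting the configuration file and re-applying, you can remove every resource the configuration describes in one step; Terraform asks for confirmation before deleting anything.

# Hedged sketch: tear down everything described by the configuration.
terraform destroy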
Delete the resources you created manually:
- Virtual machine.
- Object Storage bucket.
- Managed Service for MySQL® cluster.
- Yandex Data Processing cluster.