Using Sqoop
Sqoop
-
Details about creating connect strings and setting up drivers for Sqoop.
-
Scoop commands to import data to:
Note
Sqoop is not supported for Yandex Data Processing clusters version 2.0 and higher. Alternatively, use Apache Spark™ features
Creating a JDBC connect string
A JDBC connect string has the following format:
jdbc:postgresql://<DB_server_address>:5432/<DB_name>
For Yandex Managed Service for PostgreSQL, use a string with a special FQDN pointing at the master host:
jdbc:postgresql://c-<cluster_ID>.rw.mdb.yandexcloud.net:6432/<DB_name>
You can get the cluster ID with a list of clusters in the folder.
jdbc:mysql://<DB_server_address>:3306/<DB_name>
For Yandex Managed Service for MySQL®, use a string with a special FQDN pointing at the master host:
jdbc:mysql://c-<cluster_ID>.rw.mdb.yandexcloud.net:3306/<DB_name>
You can get the cluster ID with a list of clusters in the folder.
Installing a JDBC driver
For Sqoop to connect to the database using a JDBC connect string, install a JDBC driver:
Yandex Data Processing has a pre-installed Yandex Managed Service for PostgreSQL driver, so no additional action is needed.
Connect over SSH to the Yandex Data Processing subcluster host that stores the data and run this command:
MYSQL_VER="8.0.25" && \
wget "https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${MYSQL_VER}.tar.gz" \
--output-document 'mysql-connector-java.tar.gz' && \
sudo tar -xf mysql-connector-java.tar.gz -C /var/lib/sqoop/ && \
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/var/lib/sqoop/mysql-connector-java-${MYSQL_VER}.jar
Note
To install another driver version, change the MYSQL_VER
variable in the command.
Importing with Sqoop
To Object Storage
This type of import is available if the Sqoop
component is enabled in the Yandex Data Processing cluster.
To import the data to the Object Storage bucket:
-
When creating or editing the Yandex Data Processing cluster, specify the bucket name to import the data to Object Storage. Make sure that the Yandex Data Processing service account has write permissions for this bucket.
-
Create connect strings for JDBC.
-
Connect over SSH to the Yandex Data Processing subcluster host to store the data.
-
Install drivers for Sqoop if they are not installed.
-
Run this command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \ --connect <JDBC_connect_string> \ --username <database_username> \ --P \ --table '<database_table_name>' \ --target-dir 's3a://<bucket_name_for import>/<destination_directory>' \ --split-by '<table_column>'
Where
--split-by
is the table column used for splitting.Warning
Do not specify the name of an existing bucket directory.
To the HDFS directory
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
HBase
HDFS
Sqoop
Yarn
Zookeeper
To import the data to the HDFS directory:
-
Create connect strings for JDBC.
-
Connect over SSH to the Yandex Data Processing subcluster host to store the data.
-
Install drivers for Sqoop if they are not installed.
-
Run this command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \ --connect <JDBC_connect_string> \ --username <database_username> \ --table '<database_table_name>' \ --target-dir '<HDFS_directory>' \ --P \ --split-by '<table_column>'
Where
--split-by
is the table column used for splitting.Warning
Do not specify the name of an existing HDFS directory.
To Apache Hive
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
HDFS
Hive
Mapreduce
Sqoop
Yarn
To import the data to the Hive table:
-
When creating or editing a Yandex Data Processing cluster, add to the cluster properties the
hive:hive.execution.engine
key set tomr
. -
Create connect strings for JDBC.
-
Connect over SSH to the Yandex Data Processing subcluster host to store the data.
-
Install drivers for Sqoop if they are not installed.
-
Create a Hive database:
hive -e "CREATE DATABASE <Hive_database_name>;"
-
Run this command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \ --connect <JDBC_connect_string> \ --username <source_database_username> \ --P \ --table '<table_name_in_the_source_database>' \ --hive-import \ --create-hive-table \ --hive-database '<Hive_database_name>' \ --hive-table '<Hive_table_name>' \ --split-by '<table_column>'
Where
--split-by
is the table column used for splitting.Warning
Do not specify the name of an existing Hive table.
To Apache HBase
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
HBase
HDFS
Sqoop
Yarn
Zookeeper
To import data to Apache HBase:
-
Create connect strings for JDBC.
-
Connect over SSH to the Yandex Data Processing subcluster host to store the data.
-
Install drivers for Sqoop if they are not installed.
-
Run this command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \ --connect <JDBC_connect_string> \ --username <source_database_username> \ --P \ --table '<table_name_in_the_source_database>' \ --hbase-create-table \ --column-family '<HBase_column_family>' \ --hbase-table '<HBase_table_name>' \ --split-by '<table_column>'
Where
--split-by
is the table column used for splitting.Warning
Do not specify the name of an existing HBase table.