Using Sqoop
This section describes:

- Details about creating connect strings and setting up drivers for Sqoop.
- Sqoop commands for importing data to:
  - Object Storage
  - an HDFS directory
  - Apache Hive
  - Apache HBase
Note
Sqoop is not supported for Yandex Data Processing clusters of version 2.0 and higher. Use Apache Spark™ features instead.
Creating a JDBC connect string
A JDBC connect string has the following format:

For PostgreSQL:

jdbc:postgresql://<DB_server_address>:5432/<DB_name>
For Yandex Managed Service for PostgreSQL, use a string with a special FQDN pointing at the master host:
jdbc:postgresql://c-<cluster_ID>.rw.mdb.yandexcloud.net:6432/<DB_name>
You can get the cluster ID with the list of clusters in the folder.
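For example, for a hypothetical cluster with ID c9qabc123 and a database named db1, the connect string would be jdbc:postgresql://c-c9qabc123.rw.mdb.yandexcloud.net:6432/db1. If you use the YC CLI, one way to look up cluster IDs is:

# Lists Managed PostgreSQL clusters in the current folder along with their IDs
yc managed-postgresql cluster list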
For MySQL:

jdbc:mysql://<DB_server_address>:3306/<DB_name>
For Yandex Managed Service for MySQL®, use a string with a special FQDN pointing at the master host:
jdbc:mysql://c-<cluster_ID>.rw.mdb.yandexcloud.net:3306/<DB_name>
You can get the cluster ID with the list of clusters in the folder.
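For example, with the same hypothetical cluster ID c9qabc123 and database db1, the connect string would be jdbc:mysql://c-c9qabc123.rw.mdb.yandexcloud.net:3306/db1. With the YC CLI:

# Lists Managed MySQL clusters in the current folder along with their IDs
yc managed-mysql cluster list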
Installing a JDBC driver
For Sqoop to connect to the database using a JDBC connect string, install a JDBC driver:
For PostgreSQL: Yandex Data Processing comes with a pre-installed PostgreSQL JDBC driver, so no additional actions are required.

For MySQL: connect over SSH to the host of the Yandex Data Processing subcluster that stores the data and run this command:
MYSQL_VER="8.0.25" && \
wget "https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${MYSQL_VER}.tar.gz" \
--output-document 'mysql-connector-java.tar.gz' && \
sudo tar -xf mysql-connector-java.tar.gz -C /var/lib/sqoop/ && \
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/var/lib/sqoop/mysql-connector-java-${MYSQL_VER}.jar
Note
To install another driver version, change the MYSQL_VER variable in the command.
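As a quick sanity check (a minimal sketch; the exact file layout inside the archive may vary by connector version), you can verify that the driver was unpacked into the Sqoop library directory:

# List the Sqoop library directory and look for the unpacked connector files
ls /var/lib/sqoop | grep 'mysql-connector'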
Importing with Sqoop
To Object Storage
This type of import is available if the Sqoop component is enabled in the Yandex Data Processing cluster.
To import data to an Object Storage bucket:
- When creating or editing the Yandex Data Processing cluster, specify the name of the bucket to import the data to. Make sure the Yandex Data Processing service account has write permissions for this bucket.
- Create a JDBC connect string.
- Connect over SSH to the host of the Yandex Data Processing subcluster that will store the data.
- Install a JDBC driver for Sqoop if required.
- Run this command (a worked example follows this list):

  sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <database_username> \
      --P \
      --table '<database_table_name>' \
      --target-dir 's3a://<bucket_name_for_import>/<destination_directory>' \
      --split-by '<table_column>'

  Where --split-by is the table column used for splitting.

  Warning
  Do not specify the name of an existing bucket directory.
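For illustration, here is a hypothetical sketch of the command for a Managed PostgreSQL source; the cluster ID c9qabc123, database db1, user user1, table persons, bucket my-bucket, and column id are all made-up placeholders:

# Hypothetical example: import the persons table into s3a://my-bucket/persons-import
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect 'jdbc:postgresql://c-c9qabc123.rw.mdb.yandexcloud.net:6432/db1' \
    --username user1 \
    --P \
    --table 'persons' \
    --target-dir 's3a://my-bucket/persons-import' \
    --split-by 'id'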
To an HDFS directory
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
- HBase
- HDFS
- Sqoop
- Yarn
- Zookeeper
To import data to an HDFS directory:
- Create a JDBC connect string.
- Connect over SSH to the host of the Yandex Data Processing subcluster that will store the data.
- Install a JDBC driver for Sqoop if required.
- Run this command:

  sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <database_username> \
      --table '<database_table_name>' \
      --target-dir '<HDFS_directory>' \
      --P \
      --split-by '<table_column>'

  Where --split-by is the table column used for splitting.

  Warning
  Do not specify the name of an existing HDFS directory.
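After the job finishes, you can spot-check the result; a minimal sketch, assuming the same <HDFS_directory> placeholder (Sqoop map-only imports typically write part-m-NNNNN files):

# List the imported files and preview the first one
hdfs dfs -ls '<HDFS_directory>'
hdfs dfs -cat '<HDFS_directory>/part-m-00000' | head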
To Apache Hive
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
- HDFS
- Hive
- MapReduce
- Sqoop
- Yarn
To import data to a Hive table:
- When creating or editing a Yandex Data Processing cluster, set the hive:hive.execution.engine property to mr in the cluster properties.
- Create a JDBC connect string.
- Connect over SSH to the host of the Yandex Data Processing subcluster that will store the data.
- Install a JDBC driver for Sqoop if required.
- Create a Hive database:

  hive -e "CREATE DATABASE <Hive_database_name>;"

- Run this command:

  sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <source_database_username> \
      --P \
      --table '<table_name_in_source_database>' \
      --hive-import \
      --create-hive-table \
      --hive-database '<Hive_database_name>' \
      --hive-table '<Hive_table_name>' \
      --split-by '<table_column>'

  Where --split-by is the table column used for splitting.

  Warning
  Do not specify the name of an existing Hive table.
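To spot-check the imported table, you can query it from the Hive CLI; a minimal sketch using the same placeholders:

# Preview the first rows of the imported Hive table
hive -e "SELECT * FROM <Hive_database_name>.<Hive_table_name> LIMIT 10;"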
To Apache HBase
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
- HBase
- HDFS
- Sqoop
- Yarn
- Zookeeper
To import data to Apache HBase:
- Create a JDBC connect string.
- Connect over SSH to the host of the Yandex Data Processing subcluster that will store the data.
- Install a JDBC driver for Sqoop if required.
- Run this command:

  sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <source_database_username> \
      --P \
      --table '<table_name_in_source_database>' \
      --hbase-create-table \
      --column-family '<HBase_column_family>' \
      --hbase-table '<HBase_table_name>' \
      --split-by '<table_column>'

  Where --split-by is the table column used for splitting.

  Warning
  Do not specify the name of an existing HBase table.
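To spot-check the imported data, you can scan the table from the HBase shell; a minimal sketch using the same placeholder:

# Show the first 10 rows of the imported HBase table
echo "scan '<HBase_table_name>', {LIMIT => 10}" | hbase shell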