Using Sqoop

This article covers:

-
Creating JDBC connection strings and setting up drivers for Sqoop.
-
Sqoop commands for importing data to Object Storage, an HDFS directory, Apache Hive, and Apache HBase.
Note
Sqoop is not supported for Yandex Data Processing clusters version 2.0 and higher. Use Apache Spark™ features instead.
Creating a JDBC connection string
A JDBC connection string has the following format:

PostgreSQL:

jdbc:postgresql://<DB_server_address>:5432/<DB_name>

For Yandex Managed Service for PostgreSQL, use a string with a special FQDN pointing at the master host:

jdbc:postgresql://c-<cluster_ID>.rw.mdb.yandexcloud.net:6432/<DB_name>

You can get the cluster ID from the list of clusters in the folder.
MySQL:

jdbc:mysql://<DB_server_address>:3306/<DB_name>

For Yandex Managed Service for MySQL®, use a string with a special FQDN pointing at the master host:

jdbc:mysql://c-<cluster_ID>.rw.mdb.yandexcloud.net:3306/<DB_name>

You can get the cluster ID from the list of clusters in the folder.
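As a minimal sketch, both special-FQDN strings can be assembled from a cluster ID and database name in shell variables (the cluster IDs and database name below are hypothetical placeholders):

```shell
# Assemble the special-FQDN JDBC strings from a cluster ID and database name.
# Both cluster IDs and the database name are hypothetical placeholders.
PG_CLUSTER_ID="c9qv4it5dhop3eqa8f8b"
MY_CLUSTER_ID="c9q8ju12p9gs1vet9252"
DB_NAME="db1"

PG_URL="jdbc:postgresql://c-${PG_CLUSTER_ID}.rw.mdb.yandexcloud.net:6432/${DB_NAME}"
MY_URL="jdbc:mysql://c-${MY_CLUSTER_ID}.rw.mdb.yandexcloud.net:3306/${DB_NAME}"

echo "${PG_URL}"
echo "${MY_URL}"
```

Note that the two services listen on different ports: 6432 for PostgreSQL and 3306 for MySQL®.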
Installing a JDBC driver
For Sqoop to connect to the database using a JDBC connection string, install the JDBC driver:
PostgreSQL: Yandex Data Processing has a pre-installed Yandex Managed Service for PostgreSQL driver, so no additional actions are required.

MySQL: connect to the Yandex Data Processing data storage subcluster host over SSH and run this command:
MYSQL_VER="8.0.25" && \
wget "https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${MYSQL_VER}.tar.gz" \
--output-document 'mysql-connector-java.tar.gz' && \
sudo tar -xf mysql-connector-java.tar.gz -C /var/lib/sqoop/ && \
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/var/lib/sqoop/mysql-connector-java-${MYSQL_VER}.jar
Note
To install another driver version, edit the MYSQL_VER variable in the command.
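Only the MYSQL_VER variable changes when you switch versions; the download URL and the jar path on the classpath are both derived from it. A small sketch (the version below is purely illustrative, not a recommendation):

```shell
# Derive the download URL and expected jar path from the driver version.
# 8.0.28 is an illustrative version number, not a recommendation.
MYSQL_VER="8.0.28"

URL="https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${MYSQL_VER}.tar.gz"
JAR="/var/lib/sqoop/mysql-connector-java-${MYSQL_VER}.jar"

echo "${URL}"
echo "${JAR}"
```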
Importing with Sqoop
To Object Storage
This type of import is available if the Sqoop component is enabled in the Yandex Data Processing cluster.
To import data to an Object Storage bucket:
-
When creating or updating your Yandex Data Processing cluster, specify the name of the Object Storage bucket to import the data to. Make sure the Yandex Data Processing service account has write permissions for this bucket.
-
Connect over SSH to the Yandex Data Processing data storage subcluster host.
-
Install drivers for Sqoop if missing.
-
Run this command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect <JDBC_connection_string> \
    --username <database_username> \
    -P \
    --table '<database_table_name>' \
    --target-dir 's3a://<bucket_name_for_import>/<destination_directory>' \
    --split-by '<table_column>'

Where --split-by is the table column used for splitting.

Warning
Do not specify the name of an existing bucket directory.
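Sqoop writes the imported data as part files under the target directory. As a rough post-import check (assuming the s3a connector is configured on the cluster; all names are placeholders), you could list and preview them:

```shell
# List the imported part files in the bucket and preview the first one.
# The bucket and directory names are placeholders from the command above.
hadoop fs -ls 's3a://<bucket_name_for_import>/<destination_directory>'
hadoop fs -cat 's3a://<bucket_name_for_import>/<destination_directory>/part-m-00000' | head
```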
To an HDFS directory
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
HBase, HDFS, Sqoop, Yarn, and Zookeeper.
To import data to an HDFS directory:
-
Connect over SSH to the host of the Yandex Data Processing subcluster where the data will be stored.
-
Install drivers for Sqoop if missing.
-
Run this command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect <JDBC_connection_string> \
    --username <database_username> \
    --table '<database_table_name>' \
    --target-dir '<HDFS_directory>' \
    -P \
    --split-by '<table_column>'

Where --split-by is the table column used for splitting.

Warning
Do not specify the name of an existing HDFS directory.
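As with the Object Storage import, the data lands as part files in the target directory. A rough check (the directory name is a placeholder):

```shell
# List the imported part files in HDFS and preview the first one.
# <HDFS_directory> is the placeholder from the import command above.
hdfs dfs -ls '<HDFS_directory>'
hdfs dfs -cat '<HDFS_directory>/part-m-00000' | head
```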
To Apache Hive
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
HDFS, Hive, MapReduce, Sqoop, and Yarn.
To import data to a Hive table:
-
When creating or updating your Yandex Data Processing cluster, add the
hive:hive.execution.engine key set to mr to the cluster properties. -
Connect over SSH to the Yandex Data Processing data storage subcluster host.
-
Install drivers for Sqoop if missing.
-
Create a Hive database:
hive -e "CREATE DATABASE <Hive_database_name>;"
-
Run this command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect <JDBC_connection_string> \
    --username <source_database_username> \
    -P \
    --table '<source_database_table_name>' \
    --hive-import \
    --create-hive-table \
    --hive-database '<Hive_database_name>' \
    --hive-table '<Hive_table_name>' \
    --split-by '<table_column>'

Where --split-by is the table column used for splitting.

Warning
Do not specify the name of an existing Hive table.
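A rough way to confirm the rows arrived in the newly created table (both names are placeholders from the command above):

```shell
# Count the rows Sqoop loaded into the new Hive table.
# The database and table names are placeholders.
hive -e "SELECT COUNT(*) FROM <Hive_database_name>.<Hive_table_name>;"
```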
To Apache HBase
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
HBase, HDFS, Sqoop, Yarn, and Zookeeper.
To import data to Apache HBase:
-
Connect over SSH to the Yandex Data Processing data storage subcluster host.
-
Install drivers for Sqoop if missing.
-
Run this command:
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect <JDBC_connection_string> \
    --username <source_database_username> \
    -P \
    --table '<table_name_in_source_database>' \
    --hbase-create-table \
    --column-family '<HBase_column_family>' \
    --hbase-table '<HBase_table_name>' \
    --split-by '<table_column>'

Where --split-by is the table column used for splitting.

Warning
Do not specify the name of an existing HBase table.
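To spot-check the result, you could preview a few rows of the newly created table from the HBase shell (the table name is a placeholder from the command above):

```shell
# Scan the first rows of the new HBase table non-interactively.
# <HBase_table_name> is a placeholder.
echo "scan '<HBase_table_name>', {LIMIT => 5}" | hbase shell
```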