Using Sqoop
This section covers:
- Details about creating connect strings and setting up drivers for Sqoop.
- Sqoop commands to import data to:
  - Object Storage
  - An HDFS directory
  - Apache Hive
  - Apache HBase
Note
Sqoop is not supported in Yandex Data Processing clusters of version 2.0 and higher. Use Apache Spark™ features instead.
Creating a JDBC connect string
PostgreSQL
A JDBC connect string has the following format:
jdbc:postgresql://<database_server_address>:5432/<database_name>
For Yandex Managed Service for PostgreSQL, use a string with a special FQDN pointing at the master host:
jdbc:postgresql://c-<cluster_ID>.rw.mdb.yandexcloud.net:6432/<database_name>
You can get the cluster ID from the list of clusters in the folder.
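For example, you can look up the cluster ID with the YC CLI; the cluster ID and database name below are hypothetical:

# List Managed Service for PostgreSQL clusters in the current folder
# and copy the ID of the cluster you need.
yc managed-postgresql cluster list

# Resulting connect string for hypothetical cluster c9qgd1mop0f8s1sklq8c and database db1:
# jdbc:postgresql://c-c9qgd1mop0f8s1sklq8c.rw.mdb.yandexcloud.net:6432/db1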
MySQL
A JDBC connect string has the following format:
jdbc:mysql://<database_server_address>:3306/<database_name>
For Yandex Managed Service for MySQL®, use a string with a special FQDN pointing at the master host:
jdbc:mysql://c-<cluster_ID>.rw.mdb.yandexcloud.net:3306/<database_name>
You can get the cluster ID from the list of clusters in the folder.
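Similarly for MySQL®, with a hypothetical cluster ID and database name:

# List Managed Service for MySQL clusters in the current folder.
yc managed-mysql cluster list

# Resulting connect string for hypothetical cluster c9q8j9nmn2vg4fkgnmds and database db1:
# jdbc:mysql://c-c9q8j9nmn2vg4fkgnmds.rw.mdb.yandexcloud.net:3306/db1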
Installing a JDBC driver
For Sqoop to connect to the database using a JDBC connect string, install a JDBC driver:
PostgreSQL
Yandex Data Processing has a pre-installed Yandex Managed Service for PostgreSQL driver, so no additional action is needed.
MySQL
Connect over SSH to the Yandex Data Processing subcluster host that stores the data and run this command:
# Download the MySQL Connector/J JDBC driver, unpack it to the Sqoop
# library directory, and add the jar to the Hadoop classpath.
MYSQL_VER="8.0.25" && \
wget "https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${MYSQL_VER}.tar.gz" \
    --output-document 'mysql-connector-java.tar.gz' && \
sudo tar -xf mysql-connector-java.tar.gz -C /var/lib/sqoop/ && \
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/var/lib/sqoop/mysql-connector-java-${MYSQL_VER}.jar
Note
To install a different driver version, change the MYSQL_VER variable in this command.
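As a quick sanity check (not part of the official procedure), you can confirm the driver is in place:

# Verify that the connector was unpacked and is on the classpath.
ls /var/lib/sqoop/ | grep mysql-connector-java
echo "$HADOOP_CLASSPATH"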
Importing with Sqoop
To Object Storage
This type of import is available if the Sqoop component is enabled in the Yandex Data Processing cluster.
To import data to an Object Storage bucket:
- When creating or editing the Yandex Data Processing cluster, specify the name of the bucket to import the data to. Make sure that the Yandex Data Processing service account has write permissions for this bucket.
- Create a JDBC connect string.
- Connect over SSH to the Yandex Data Processing subcluster host that will store the data.
- Install drivers for Sqoop if they are not installed.
- Run this command:

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect <JDBC_connect_string> \
    --username <database_user_name> \
    -P \
    --table '<database_table_name>' \
    --target-dir 's3a://<name_of_target_bucket>/<target_folder>' \
    --split-by '<table_column>'
Where --split-by is the table column used for splitting.

Warning
Do not specify the name of an existing bucket directory.
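For reference, a filled-in command might look as follows; the connect string, user, table, and bucket names are all hypothetical:

# Import the persons table from a PostgreSQL database into an Object Storage bucket.
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect jdbc:postgresql://c-c9qgd1mop0f8s1sklq8c.rw.mdb.yandexcloud.net:6432/db1 \
    --username user1 \
    -P \
    --table 'persons' \
    --target-dir 's3a://my-bucket/import-dir' \
    --split-by 'id'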
To the HDFS directory
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
- HBase
- HDFS
- Sqoop
- YARN
- ZooKeeper
To import data to the HDFS directory:
- Create a JDBC connect string.
- Connect over SSH to the Yandex Data Processing subcluster host that will store the data.
- Install drivers for Sqoop if they are not installed.
- Run this command:

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect <JDBC_connect_string> \
    --username <database_user_name> \
    --table '<database_table_name>' \
    --target-dir '<HDFS_folder>' \
    -P \
    --split-by '<table_column>'
Where --split-by is the table column used for splitting.

Warning
Do not specify the name of an existing HDFS directory.
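After the import completes, you can list the result, for example (the directory name is hypothetical):

# List the part files Sqoop wrote to the target HDFS directory.
hdfs dfs -ls /user/root/import-dir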
To Apache Hive
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
- HDFS
- Hive
- MapReduce
- Sqoop
- YARN
To import data to a Hive table:
- When creating or editing the Yandex Data Processing cluster, add the hive:hive.execution.engine key with the mr value to the cluster properties.
- Create a JDBC connect string.
- Connect over SSH to the Yandex Data Processing subcluster host that will store the data.
- Install drivers for Sqoop if they are not installed.
- Create a Hive database:

hive -e "CREATE DATABASE <Hive_database_name>;"

- Run this command:

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect <JDBC_connect_string> \
    --username <source_database_user_name> \
    -P \
    --table '<source_database_table_name>' \
    --hive-import \
    --create-hive-table \
    --hive-database '<Hive_database_name>' \
    --hive-table '<Hive_table_name>' \
    --split-by '<table_column>'
Where --split-by is the table column used for splitting.

Warning
Do not specify the name of an existing Hive table.
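Once the import finishes, you can preview the new table; the database and table names here are hypothetical:

# Print the first rows of the imported Hive table.
hive -e "SELECT * FROM db1.persons LIMIT 5;"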
To Apache HBase
This type of import is available if the following services are enabled in the Yandex Data Processing cluster:
- HBase
- HDFS
- Sqoop
- YARN
- ZooKeeper
To import data to Apache HBase:
- Create a JDBC connect string.
- Connect over SSH to the Yandex Data Processing subcluster host that will store the data.
- Install drivers for Sqoop if they are not installed.
- Run this command:

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
    --connect <JDBC_connect_string> \
    --username <source_database_user_name> \
    -P \
    --table '<source_database_table_name>' \
    --hbase-create-table \
    --column-family '<HBase_column_family>' \
    --hbase-table '<HBase_table_name>' \
    --split-by '<table_column>'
Where --split-by is the table column used for splitting.

Warning
Do not specify the name of an existing HBase table.
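To check the result, you can scan a few rows from the HBase shell; the table name is hypothetical:

# Print the first five rows of the imported HBase table.
echo "scan 'persons', {LIMIT => 5}" | hbase shell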