Using Sqoop

Written by
Yandex Cloud
Updated at January 29, 2025
  • Creating a JDBC connect string
  • Installing a JDBC driver
  • Importing with Sqoop
    • To Object Storage
    • To the HDFS directory
    • To Apache Hive
    • To Apache HBase

Sqoop is used to import data from external databases into a Yandex Data Processing cluster. This section includes:

  • Details about creating connect strings and setting up drivers for Sqoop.

  • Sqoop commands to import data to:

    • Yandex Object Storage
    • HDFS directory
    • Apache Hive
    • Apache HBase

Note

Sqoop is not supported in Yandex Data Processing clusters of version 2.0 or higher. Use Apache Spark™ features instead.

Creating a JDBC connect string

A JDBC connect string has the following format:

PostgreSQL

jdbc:postgresql://<DB_server_address>:5432/<DB_name>

For Yandex Managed Service for PostgreSQL, use a string with a special FQDN pointing at the master host:

jdbc:postgresql://c-<cluster_ID>.rw.mdb.yandexcloud.net:6432/<DB_name>

You can get the cluster ID from the list of clusters in the folder.

MySQL®

jdbc:mysql://<DB_server_address>:3306/<DB_name>

For Yandex Managed Service for MySQL®, use a string with a special FQDN pointing at the master host:

jdbc:mysql://c-<cluster_ID>.rw.mdb.yandexcloud.net:3306/<DB_name>

You can get the cluster ID from the list of clusters in the folder.
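
For example, assuming the YC CLI is installed and configured, you can look up the cluster ID and assemble the string yourself; the cluster ID and database name below are made-up values:

yc managed-postgresql cluster list   # or: yc managed-mysql cluster list

jdbc:postgresql://c-c9qabc1def2ghi3jkl4m.rw.mdb.yandexcloud.net:6432/db1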

Installing a JDBC driver

For Sqoop to connect to the database using a JDBC connect string, install a JDBC driver:

PostgreSQL

Yandex Data Processing comes with a pre-installed JDBC driver for PostgreSQL, so no additional action is needed.

MySQL®

Connect over SSH to the Yandex Data Processing subcluster host that stores the data and run this command:

MYSQL_VER="8.0.25" && \
wget "https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${MYSQL_VER}.tar.gz" \
    --output-document 'mysql-connector-java.tar.gz' && \
sudo tar -xf mysql-connector-java.tar.gz -C /var/lib/sqoop/ && \
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/var/lib/sqoop/mysql-connector-java-${MYSQL_VER}/mysql-connector-java-${MYSQL_VER}.jar

Note

To install another driver version, change the MYSQL_VER variable in the command.
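
To verify that Sqoop can reach the database through the installed driver, you can list the database schemas available to the user; the connect string and username below are placeholders:

sqoop list-databases \
    --connect <JDBC_connect_string> \
    --username <database_username> \
    -P

If the driver is set up correctly, the command prints the schemas available to that user.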

Importing with Sqoop

To Object Storage

This type of import is available if the Sqoop component is enabled in the Yandex Data Processing cluster.

To import the data to the Object Storage bucket:

  1. When creating or editing the Yandex Data Processing cluster, specify the name of the bucket the data will be imported to. Make sure the Yandex Data Processing service account has write permissions for this bucket.

  2. Create connect strings for JDBC.

  3. Connect over SSH to the Yandex Data Processing subcluster host that stores the data.

  4. Install a JDBC driver for Sqoop if it is not installed yet.

  5. Run this command:

    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <database_username> \
      -P \
      --table '<database_table_name>' \
      --target-dir 's3a://<bucket_name_for_import>/<destination_directory>' \
      --split-by '<table_column>'
    

    Where --split-by is the table column used to split the data between parallel import tasks.

    Warning

    Do not specify the name of an existing bucket directory.
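
    For illustration, here is the same command with hypothetical values filled in: a Managed Service for PostgreSQL source, a bucket named my-bucket, and a persons table whose primary key is id (all of these names are made up):

    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect jdbc:postgresql://c-c9qabc1def2ghi3jkl4m.rw.mdb.yandexcloud.net:6432/db1 \
      --username user1 \
      -P \
      --table 'persons' \
      --target-dir 's3a://my-bucket/csv-export' \
      --split-by 'id'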

To the HDFS directory

This type of import is available if the following services are enabled in the Yandex Data Processing cluster:

  • HBase
  • HDFS
  • Sqoop
  • YARN
  • ZooKeeper

To import the data to the HDFS directory:

  1. Create connect strings for JDBC.

  2. Connect over SSH to the Yandex Data Processing subcluster host that stores the data.

  3. Install a JDBC driver for Sqoop if it is not installed yet.

  4. Run this command:

    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <database_username> \
      --table '<database_table_name>' \
      --target-dir '<HDFS_directory>' \
      -P \
      --split-by '<table_column>'
    

    Where --split-by is the table column used to split the data between parallel import tasks.

    Warning

    Do not specify the name of an existing HDFS directory.
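
    Once the import completes, you can inspect the result in HDFS; for example (the directory below is whatever you passed in --target-dir, and part-m-00000 is the first mapper's output file):

    hdfs dfs -ls '<HDFS_directory>'
    hdfs dfs -cat '<HDFS_directory>/part-m-00000' | head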

To Apache Hive

This type of import is available if the following services are enabled in the Yandex Data Processing cluster:

  • HDFS
  • Hive
  • MapReduce
  • Sqoop
  • YARN

To import the data to the Hive table:

  1. When creating or editing a Yandex Data Processing cluster, add the hive:hive.execution.engine property set to mr to the cluster properties.

  2. Create connect strings for JDBC.

  3. Connect over SSH to the Yandex Data Processing subcluster host that stores the data.

  4. Install a JDBC driver for Sqoop if it is not installed yet.

  5. Create a Hive database:

    hive -e "CREATE DATABASE <Hive_database_name>;"
    
  6. Run this command:

    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <source_database_username> \
      -P \
      --table '<table_name_in_the_source_database>' \
      --hive-import \
      --create-hive-table \
      --hive-database '<Hive_database_name>' \
      --hive-table '<Hive_table_name>' \
      --split-by '<table_column>'
    

    Where --split-by is the table column used to split the data between parallel import tasks.

    Warning

    Do not specify the name of an existing Hive table.
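
    To check the imported data, you can query the new table from the same host, for example:

    hive -e "SELECT * FROM <Hive_database_name>.<Hive_table_name> LIMIT 10;"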

To Apache HBase

This type of import is available if the following services are enabled in the Yandex Data Processing cluster:

  • HBase
  • HDFS
  • Sqoop
  • YARN
  • ZooKeeper

To import data to Apache HBase:

  1. Create connect strings for JDBC.

  2. Connect over SSH to the Yandex Data Processing subcluster host that stores the data.

  3. Install a JDBC driver for Sqoop if it is not installed yet.

  4. Run this command:

    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <source_database_username> \
      -P \
      --table '<table_name_in_the_source_database>' \
      --hbase-create-table \
      --column-family '<HBase_column_family>' \
      --hbase-table '<HBase_table_name>' \
      --split-by '<table_column>'
    

    Where --split-by is the table column used to split the data between parallel import tasks.

    Warning

    Do not specify the name of an existing HBase table.
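
    To check the imported data, you can scan the new table from the HBase shell, for example:

    echo "scan '<HBase_table_name>', {LIMIT => 10}" | hbase shell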
