Adding data to ClickHouse®

Written by
Yandex Cloud
Updated at April 9, 2025
  • Inserting data from a file
  • Inserting data through buffering
    • Asynchronous data inserts
    • Inserting data via a buffer table
  • Inserting data and specifying the format schema
  • Example of using a format schema when inserting data
    • Getting started
    • Installing dependencies
    • Preparing format schemas
    • Preparing scripts
    • Inserting data

For regular data inserts into tables, use an INSERT INTO statement:

INSERT INTO db_name.table_name VALUES (v11, v12, v13), (v21, v22, v23), ...

Insert queries should be run no more than once per second. To group multiple small queries into a large one, use buffering.

You can learn more about INSERT INTO in the ClickHouse® documentation.

Inserting data from a file

To insert local file data into a table, use an INSERT INTO statement, such as:

INSERT INTO db_name.table_name FROM INFILE '<full_path_to_file>'
[COMPRESSION '<compression_format>'] FORMAT <data_format>;

With the COMPRESSION option, you can transfer compressed files. Use it to upload large amounts of data. The option is supported when using clickhouse-client or the HTTP interface. If no compression format is specified, it is identified by the file extension. Possible values of the compression format: none, gzip, deflate, br, xz, zstd, lz4, and bz2.
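
For example, a query like this one loads a gzip-compressed CSV file (the file path is illustrative, and the table matches the db1.users table used in the examples below):

INSERT INTO db1.users FROM INFILE '/tmp/users.csv.gz'
COMPRESSION 'gzip' FORMAT CSV;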

For a list of supported data formats, see the ClickHouse® documentation. To learn how to set up Cap'n Proto and Protobuf data format schemas, see Managing data format schemas.

Inserting data through buffering

When you insert data into ClickHouse®, some of the computing resources are spent on housekeeping operations. Each time you run an INSERT query, ClickHouse® creates a separate data part in the storage. In addition to table rows, such parts contain auxiliary files with metadata. ClickHouse® then merges data parts in the background. The more merges are required, the more resources are used.

As a result, one thousand queries that each insert a single row put more load on the cluster than a single query that inserts one thousand rows. Therefore, we recommend inserting data into tables in large chunks of 1,000 to 100,000 rows.

If small chunks of data are delivered from various sources, use one of the following buffering options:

  • Asynchronous inserts (recommended)
  • Buffer tables

Asynchronous data inserts

If the Async insert setting is enabled for the user, all insert queries from this user first go to a RAM buffer. Data from the buffer is flushed to a table when one of the following conditions is met:

  • The buffer size has reached the Async insert max data size setting value.
  • The time set in the Async busy timeout setting has elapsed since the first INSERT query after the previous data flush.

To enable asynchronous data inserts, set the Async insert setting to 1.
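
Assuming the Async insert setting maps to the ClickHouse® async_insert setting, you can also enable asynchronous inserts for an individual query; a minimal sketch (the table and values are illustrative):

INSERT INTO db1.users SETTINGS async_insert=1
VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara');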

Note that row deduplication is not available when using asynchronous inserts.

For more information about asynchronous data inserts, see the ClickHouse® documentation.

Inserting data via a buffer table

A buffer table is created in RAM. It accumulates the data it receives and flushes it to the main table when the preset conditions are met.

To create a buffer table, use the Buffer table engine. It accepts the following basic parameters as input:

Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)

Where:

  • database: DB name.
  • table: Name of the table to flush the data to.
  • num_layers: Number of buffers. A table is physically stored in memory as multiple independent buffers.
  • min_time, max_time, min_rows, max_rows, min_bytes, and max_bytes: Parameters that set conditions for flushing the data to the main table. The time is measured in seconds.

The data is flushed to the main table upon reaching all minimum values or one of the maximum values. If the size of a received chunk exceeds the max_rows or max_bytes value, the data is written to the main table directly rather than buffered.

To learn more about additional engine parameters and limits for tables based on the Buffer engine, see the ClickHouse® documentation.

Example

  1. Create a regular table named users in the db1 database:

    CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String)
    ENGINE = MergeTree() ORDER BY id;
    
  2. Create a buffer table named users_buffer and link it to the users main table:

    CREATE TABLE db1.users_buffer AS db1.users ENGINE = Buffer(db1, users, 1, 10, 100, 10000, 1000000, 10000000, 100000000);
    
  3. Send data to the buffer table:

    INSERT INTO db1.users_buffer VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara');
    
  4. Check the data in the main table; it will appear there within 100 seconds (max_time):

    SELECT * FROM db1.users;
    

Result:

┌─id─┬─name─┐
│  1 │ Vlad │
│  2 │ John │
│  3 │ Zara │
└────┴──────┘

Inserting data and specifying the format schema

Managed Service for ClickHouse® lets you INSERT and SELECT data in different formats. Most of these formats are self-describing: they already contain a format schema that describes the acceptable data types, their order, and their representation in the format. For example, this lets you insert data directly from a file.

Note

A format schema describes the format of data input or output, while a data schema describes the structure and layout of the ClickHouse® databases and tables that store this data. These concepts are not interchangeable.

The Cap'n Proto and Protobuf data formats (including ProtobufSingle) do not contain a format schema: data is stored in binary form without any structure information. Before you start processing data in these formats (for example, before inserting data into a table), add a format schema to the Managed Service for ClickHouse® cluster. It will help you correctly interpret the number, order, and type of values when processing binary data.

You can add one or more such format schemas to a Managed Service for ClickHouse® cluster and use them to input and output data in the relevant formats.

Warning

To use the format schemas you added, insert the data into Managed Service for ClickHouse® using the HTTP interface. In this case, data serialization and deserialization is performed on the server side based on the schemas you added.

For more information about data formats, see the ClickHouse® documentation.
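
As a rough sketch, a query that uses an attached format schema has the following shape; the schema name and message type here mirror the walkthrough below, and the binary payload is sent in the HTTP request body:

INSERT INTO db1.users SETTINGS format_schema='schema-capnproto:User' FORMAT CapnProto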

Example of using a format schema when inserting data

The example was tested in the following environment:

  • Virtual machine in Yandex Cloud running Ubuntu 20.04 LTS.
  • Bash: 5.0.16.
  • clickhouse-client: 20.10.2.20.
  • capnproto: 0.7.0.
  • protobuf-compiler: 3.6.1.
  • Python: 3.8.5; pip3: 20.0.2.

Let's assume you have created a single-host Managed Service for ClickHouse® cluster named chcluster with the db1 database, and you need to insert user data into the db1.users table. Let's further assume that each user record contains the following information:

  • id: User ID.
  • name: Username.

To insert user data in the Cap'n Proto and Protobuf formats into the db1.users table:

  1. Install the dependencies.
  2. Prepare data format schemas.
  3. Prepare scripts.
  4. Insert data.

Getting started

  1. Examine the data format that will be used for insertion so that you can prepare the correct format schemas.

    In this scenario, for demonstration, it is assumed that:

    • The user ID (id) is represented as an unsigned 64-bit integer (UInt64 in Cap'n Proto and ClickHouse® or uint64 in Protobuf).
    • The username (name) is represented as a string (Text in Cap'n Proto, string in Protobuf, or String in ClickHouse®).

    To learn more about supported data types, see the documentation for Cap'n Proto, Protobuf, and ClickHouse®.

  2. Connect to the cluster and create the db1.users table if it does not exist yet:

    CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String)
    ENGINE = MergeTree() ORDER BY id;
    

Installing dependencies

sudo apt update && sudo apt install -y python3 python3-pip capnproto protobuf-compiler && \
pip3 install protobuf varint pycapnp

Preparing format schemas

  1. Create a file with a schema description:

    Cap'n Proto
    Protobuf

    user.capnp

    @0xbec0f3f99cec4fbf;
    
    struct User {
      id @0 :UInt64;
      name @1 :Text;
    }
    

    To learn more about the file format, see the documentation for Cap'n Proto.

    user.proto

    syntax = "proto3";
    
    message User {
      uint64 id = 1;
      string name = 2;
    };
    

    To learn more about the file format, see the documentation for Protobuf.

  2. Upload the file to Object Storage and get a link to it.

  3. Connect the format schema to the chcluster cluster:

    • For the Cap'n Proto format schema (the user.capnp file), set schema-capnproto as the name.
    • For the Protobuf format schema (the user.proto file), set schema-protobuf as the name.

Preparing scripts

These Python scripts prepare test user data in the appropriate formats and insert it into a cluster table.

Note

Python scripts are provided for demonstration. You can prepare and insert binary data in the required format by creating a similar script in a different programming language.

To prepare scripts:

  1. Compile the user.proto file of the Protobuf schema for Python:

    protoc user.proto --python_out .
    

    The user_pb2.py file will be generated.

  2. Create files with Python code:

    Cap'n Proto
    Protobuf

    capnproto-example.py

    import requests
    import io
    import capnp
    from user_capnp import User
    
    DB_HOST="<ClickHouse®>_host_FQDN"
    DB_NAME="db1"
    DB_USER="<DB_username>"
    DB_PASS="<DB_user_password>"
    CA_CERT="/usr/local/share/ca-certificates/Yandex/RootCA.crt"
    
    SCHEMA_NAME = 'schema-capnproto'
    SCHEMA_TYPE = "CapnProto"
    SCHEMA_CLASS = "User"
    
    def execute_query(query, data=None):
        url = 'https://{host}:8443/'.format(host=DB_HOST)
        params = {
            'database': DB_NAME,
            'query': query.strip()
        }
        auth = {
            'X-ClickHouse-User': DB_USER,
            'X-ClickHouse-Key': DB_PASS
        }
    
        rs = requests.post(url,
                           params=params,
                           headers=auth,
                           data=data,
                           verify=CA_CERT)
    
        rs.raise_for_status()
    
    def add_user(fileobj, user_id, user_name):
        user = User.new_message()
        user.id = user_id
        user.name = user_name
    
        fileobj.write(user.to_bytes())
    
    message = io.BytesIO()
    
    add_user(message, 11, 'John')
    add_user(message, 12, 'Bob')
    add_user(message, 13, 'Jane')
    
    execute_query(
        '''
        INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type}
        '''.format(database=DB_NAME,
                   type=SCHEMA_TYPE,
                   name=SCHEMA_NAME,
                   cls=SCHEMA_CLASS), data=message.getvalue())
    

    This script:

    1. Gets the User class from the linked user.capnp file (from user_capnp import User).
    2. Executes requests to the cluster over HTTPS using SSL.
    3. Writes the test dataset to User class objects (def add_user ...) and adds these objects to the message byte stream.
    4. Inserts data from the message byte stream into the db1.users table, using the User class from the schema-capnproto format schema in the cluster.

    protobuf-example.py

    import requests
    import io
    import varint
    from user_pb2 import User
    
    DB_HOST="<ClickHouse®>_host_FQDN"
    DB_NAME="db1"
    DB_USER="<DB_username>"
    DB_PASS="<DB_user_password>"
    CA_CERT="/usr/local/share/ca-certificates/Yandex/RootCA.crt"
    
    SCHEMA_NAME = 'schema-protobuf'
    SCHEMA_TYPE = "Protobuf"
    SCHEMA_CLASS = "User"
    
    def execute_query(query, data=None):
        url = 'https://{host}:8443/'.format(host=DB_HOST)
        params = {
            'database': DB_NAME,
            'query': query.strip()
        }
        auth = {
            'X-ClickHouse-User': DB_USER,
            'X-ClickHouse-Key': DB_PASS
        }
    
        rs = requests.post(url,
                           params=params,
                           headers=auth,
                           data=data,
                           verify=CA_CERT)
    
        rs.raise_for_status()
    
    def add_user(fileobj, user_id, user_name):
        user = User()
        user.id = user_id
        user.name = user_name
    
        fileobj.write(varint.encode(user.ByteSize()))
        fileobj.write(user.SerializeToString())
    
    message = io.BytesIO()
    add_user(message, 21, 'Stephen')
    add_user(message, 22, 'Olivia')
    add_user(message, 23, 'Tim')
    
    execute_query(
        '''INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type}
        '''.format(database=DB_NAME,
                   type=SCHEMA_TYPE,
                   name=SCHEMA_NAME,
                   cls=SCHEMA_CLASS), data=message.getvalue())
    

    This script:

    1. Gets the User class from the linked user_pb2.py file obtained after compiling the proto file (from user_pb2 import User).
    2. Writes the test dataset to User class objects (def add_user ...) and adds these objects to the message byte stream.
    3. Inserts data from the message byte stream into the db1.users table, using the User class from the schema-protobuf format schema in the cluster.

    To learn how to get the host FQDN, see this guide.

Inserting data

  1. Run the scripts you prepared previously:

    Cap'n Proto
    Protobuf
    python3 capnproto-example.py
    
    python3 protobuf-example.py
    
  2. Connect to the cluster and make sure the data was inserted successfully by running a SELECT query:

    Cap'n Proto
    Protobuf
    SELECT * FROM db1.users;
    
    ┌─id─┬─name─┐
    │ 11 │ John │
    │ 12 │ Bob  │
    │ 13 │ Jane │
    └────┴──────┘
    
    SELECT * FROM db1.users;
    
    ┌─id─┬─name────┐
    │ 21 │ Stephen │
    │ 22 │ Olivia  │
    │ 23 │ Tim     │
    └────┴─────────┘
    

ClickHouse® is a registered trademark of ClickHouse, Inc.
