

Adding data to ClickHouse®

Written by
Yandex Cloud
Updated at April 9, 2025
  • Inserting data from a file
  • Inserting data through buffering
    • Asynchronous data inserts
    • Inserting data via a buffer table
  • Inserting data and specifying the format schema
  • Example of using a format schema when inserting data
    • Getting started
    • Installing dependencies
    • Preparing format schemas
    • Preparing scripts
    • Inserting data

For regular data inserts into tables, use an INSERT INTO statement:

INSERT INTO db_name.table_name VALUES (v11, v12, v13), (v21, v22, v23), ...

Insert queries should be run no more than once per second. To group multiple small queries into a large one, use buffering.

You can learn more about INSERT INTO in the ClickHouse® documentation.
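For example, for a table with id and name columns, such as the db1.users table used in the examples later in this article, a single query inserting three rows at once might look like this:

INSERT INTO db1.users VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara');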

Inserting data from a file

To insert local file data into a table, use an INSERT INTO statement, such as:

INSERT INTO db_name.table_name FROM INFILE '<full_path_to_file>'
[COMPRESSION '<compression_format>'] FORMAT <data_format>;

With the COMPRESSION option, you can transfer compressed files. Use it to upload large amounts of data. The option is supported when using clickhouse-client or the HTTP interface. If no compression format is specified, it is identified by the file extension. Possible values of the compression format: none, gzip, deflate, br, xz, zstd, lz4, and bz2.

For a list of supported data formats, see the ClickHouse® documentation. To learn how to set up Cap'n Proto and Protobuf data format schemas, see Managing data format schemas.
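As an illustration, a hypothetical gzip-compressed CSV file (here, /tmp/users.csv.gz) with id and name values could be loaded into the db1.users table from clickhouse-client as follows:

INSERT INTO db1.users FROM INFILE '/tmp/users.csv.gz' COMPRESSION 'gzip' FORMAT CSV;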

Inserting data through buffering

When you insert data into ClickHouse®, some of the computing resources are spent on housekeeping operations. Each time you run an INSERT query, ClickHouse® creates a separate data part in the storage. In addition to table rows, such parts contain auxiliary files with metadata. ClickHouse® then merges data parts in the background. The more merges are required, the more resources are used.

As a result, the load on the cluster from one thousand queries that each insert a single row will exceed that from a single query inserting one thousand rows. Therefore, we recommend inserting data into tables in large chunks of 1,000 to 100,000 rows.

If small chunks of data are delivered from various sources, use one of the following buffering options:

  • Asynchronous inserts (recommended)
  • Buffer tables

Asynchronous data inserts

If the Async insert setting is enabled for the user, all insert queries from this user first go to a RAM buffer. Data from the buffer is flushed to the table when one of the following conditions is met:

  • The buffer size has reached the Async insert max data size setting value.
  • The time set in the Async busy timeout setting has passed since the first INSERT query after the previous buffer flush.

To enable asynchronous data inserts, set the Async insert setting to 1.

Note that row deduplication is not available when using asynchronous inserts.

For more information about asynchronous data inserts, see the ClickHouse® documentation.
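In addition to the user-level Async insert setting, asynchronous inserts can also be requested for an individual query via the standard ClickHouse® async_insert setting, provided the user is allowed to change it. A minimal sketch, assuming the db1.users table from the examples below:

INSERT INTO db1.users SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (4, 'Anna'), (5, 'Ivan');

With wait_for_async_insert = 1, the query returns only after the buffered data has been flushed to the table.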

Inserting data via a buffer table

A buffer table is created in RAM. It accumulates the data it receives and then flushes it to the main table when the preset conditions are met.

To create a buffer table, use the Buffer table engine. It accepts the following basic parameters as input:

Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)

Where:

  • database: DB name.
  • table: Name of the table to flush the data to.
  • num_layers: Number of buffers. A table is physically stored in memory as multiple independent buffers.
  • min_time, max_time, min_rows, max_rows, min_bytes, and max_bytes: Parameters that set conditions for flushing the data to the main table. The time is measured in seconds.

The data is flushed to the main table upon reaching all minimum values or one of the maximum values. If the size of a received chunk exceeds the max_rows or max_bytes value, the data is written to the main table directly rather than buffered.

To learn more about additional engine parameters and limits for tables based on the Buffer engine, see the ClickHouse® documentation.

Example

  1. Create a regular table named users in the db1 database:

    CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String)
    ENGINE = MergeTree() ORDER BY id;
    
  2. Create a buffer table named users_buffer and link it to the users main table:

    CREATE TABLE db1.users_buffer AS db1.users ENGINE = Buffer(db1, users, 1, 10, 100, 10000, 1000000, 10000000, 100000000);
    
  3. Send data to the buffer table:

    INSERT INTO db1.users_buffer VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara');
    
  4. Check the data in the main table; it will appear there within 100 seconds (max_time):

    SELECT * FROM db1.users;
    

Result:

┌─id─┬─name─┐
│  1 │ Vlad │
│  2 │ John │
│  3 │ Zara │
└────┴──────┘
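Before the flush conditions are met, the inserted rows are still readable through the buffer table itself: when selecting from a Buffer table, ClickHouse® reads data both from the buffer and from the main table. For example:

SELECT * FROM db1.users_buffer;

This query returns the inserted rows even if they have not been flushed to db1.users yet.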

Inserting data and specifying the format schema

Managed Service for ClickHouse® lets you INSERT and SELECT data in different formats. Most of these formats are self-descriptive: they already contain a format schema that describes acceptable data types, their order, and their representation in the format. This lets you, for example, insert data directly from a file.

Note

A format schema describes the format of input or output data, while a data schema describes the structure and layout of the ClickHouse® databases and tables that store this data. These concepts are not interchangeable.

The Cap'n Proto and Protobuf data formats (including ProtobufSingle) do not contain a format schema: the data is stored in binary form without any structure information. Before you start processing data in these formats (for example, before inserting it into a table), add a format schema to the Managed Service for ClickHouse® cluster. It will help correctly interpret the number, order, and type of values when processing the binary data.

You can add one or more such format schemas to a Managed Service for ClickHouse® cluster and use them to input and output data in the relevant formats.

Warning

To use the format schemas you added, insert the data into Managed Service for ClickHouse® over the HTTP interface. In this case, data serialization and deserialization are performed on the server side based on the schemas you added.

For more information about data formats, see the ClickHouse® documentation.

Example of using a format schema when inserting data

The example was tested in the following environment:

  • Virtual machine in Yandex Cloud running Ubuntu 20.04 LTS.
  • Bash: 5.0.16.
  • clickhouse-client: 20.10.2.20.
  • capnproto: 0.7.0.
  • protobuf-compiler: 3.6.1.
  • Python: 3.8.5; pip3: 20.0.2.

Let's assume you have created a single-host Managed Service for ClickHouse® cluster named chcluster with the db1 database, and you need to insert user data into the db1.users table. Let's further assume that each user record contains the following information:

  • id: User ID.
  • name: Username.

To insert user data in the Cap'n Proto and Protobuf formats into the db1.users table:

  1. Install the dependencies.
  2. Prepare data format schemas.
  3. Prepare scripts.
  4. Insert data.

Getting started

  1. Examine the format of the data you are going to insert so that you can prepare the correct format schemas.

    For demonstration purposes, this scenario assumes that:

    • The user ID (id) is represented as an unsigned 64-bit integer (UInt64 in Cap'n Proto and ClickHouse®, or uint64 in Protobuf).
    • The username (name) is represented as a string (Text in Cap'n Proto, string in Protobuf, or String in ClickHouse®).

    To learn more about supported data types, see the documentation for Cap'n Proto, Protobuf, and ClickHouse®.

  2. Connect to the cluster and create the db1.users table of the preferred format if it is not there yet:

    CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String)
    ENGINE = MergeTree() ORDER BY id;
    

Installing dependencies

sudo apt update && sudo apt install -y python3 python3-pip capnproto protobuf-compiler && \
pip3 install protobuf varint pycapnp

Preparing format schemas

  1. Create a file with a schema description:


    user.capnp

    @0xbec0f3f99cec4fbf;
    
    struct User {
      id @0 :UInt64;
      name @1 :Text;
    }
    

    To learn more about the file format, see the documentation for Cap'n Proto.

    user.proto

    syntax = "proto3";
    
    message User {
      uint64 id = 1;
      string name = 2;
    };
    

    To learn more about the file format, see the documentation for Protobuf.

  2. Upload the file to Object Storage and get a link to it.

  3. Connect the format schema to the chcluster cluster:

    • For the Cap'n Proto format schema (the user.capnp file), set schema-capnproto as the name.
    • For the Protobuf format schema (the user.proto file), set schema-protobuf as the name.

Preparing scripts

These Python scripts prepare test user data in the appropriate formats and insert it into a cluster table.

Note

Python scripts are provided for demonstration. You can prepare and insert binary data in the required format by creating a similar script in a different programming language.

To prepare scripts:

  1. Compile the user.proto file of the Protobuf schema for Python:

    protoc user.proto --python_out .
    

    The user_pb2.py file will be generated.

  2. Create files with Python code:


    capnproto-example.py

    import requests
    import io
    import capnp
    from user_capnp import User
    
    DB_HOST="<ClickHouse®>_host_FQDN"
    DB_NAME="db1"
    DB_USER="<DB_username>"
    DB_PASS="<DB_user_password>"
    CA_CERT="/usr/local/share/ca-certificates/Yandex/RootCA.crt"
    
    SCHEMA_NAME = 'schema-capnproto'
    SCHEMA_TYPE = "CapnProto"
    SCHEMA_CLASS = "User"
    
    def execute_query(query, data=None):
        url = 'https://{host}:8443/'.format(host=DB_HOST)
        params = {
            'database': DB_NAME,
            'query': query.strip()
        }
        auth = {
            'X-ClickHouse-User': DB_USER,
            'X-ClickHouse-Key': DB_PASS
        }
    
        rs = requests.post(url,
                           params=params,
                           headers=auth,
                           data=data,
                           verify=CA_CERT)
    
        rs.raise_for_status()
    
    def add_user(fileobj, user_id, user_name):
        user = User.new_message()
        user.id = user_id
        user.name = user_name
    
        fileobj.write(user.to_bytes())
    
    message = io.BytesIO()
    
    add_user(message, 11, 'John')
    add_user(message, 12, 'Bob')
    add_user(message, 13, 'Jane')
    
    execute_query(
        '''
        INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type}
        '''.format(database=DB_NAME,
                   type=SCHEMA_TYPE,
                   name=SCHEMA_NAME,
                   cls=SCHEMA_CLASS), data=message.getvalue())
    

    This script:

    1. Gets the User class from the linked user.capnp file (from user_capnp import User).
    2. Executes requests to the cluster over HTTPS using SSL.
    3. Writes the test dataset to the User class object (def add_user ...) and adds this object to the message byte stream (io.BytesIO).
    4. Inserts data from the message byte stream into the db1.users table based on the User class from the schema-capnproto format schema in the cluster.

    protobuf-example.py

    import requests
    import io
    import varint
    from user_pb2 import User
    
    DB_HOST="<ClickHouse®>_host_FQDN"
    DB_NAME="db1"
    DB_USER="<DB_username>"
    DB_PASS="<DB_user_password>"
    CA_CERT="/usr/local/share/ca-certificates/Yandex/RootCA.crt"
    
    SCHEMA_NAME = 'schema-protobuf'
    SCHEMA_TYPE = "Protobuf"
    SCHEMA_CLASS = "User"
    
    def execute_query(query, data=None):
        url = 'https://{host}:8443/'.format(host=DB_HOST)
        params = {
            'database': DB_NAME,
            'query': query.strip()
        }
        auth = {
            'X-ClickHouse-User': DB_USER,
            'X-ClickHouse-Key': DB_PASS
        }
    
        rs = requests.post(url,
                           params=params,
                           headers=auth,
                           data=data,
                           verify=CA_CERT)
    
        rs.raise_for_status()
    
    def add_user(fileobj, user_id, user_name):
        user = User()
        user.id = user_id
        user.name = user_name
    
        fileobj.write(varint.encode(user.ByteSize()))
        fileobj.write(user.SerializeToString())
    
    message = io.BytesIO()
    add_user(message, 21, 'Stephen')
    add_user(message, 22, 'Olivia')
    add_user(message, 23, 'Tim')
    
    execute_query(
        '''INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type}
        '''.format(database=DB_NAME,
                   type=SCHEMA_TYPE,
                   name=SCHEMA_NAME,
                   cls=SCHEMA_CLASS), data=message.getvalue())
    

    This script:

    1. Gets the User class from the linked user_pb2.py file obtained after compiling the proto file (from user_pb2 import User).
    2. Writes the test dataset to the User class object (def add_user ...) and adds this object to the message byte stream (io.BytesIO).
    3. Inserts data from the message byte stream into the db1.users table based on the User class from the schema-protobuf format schema in the cluster.

    To learn how to get a host FQDN, see this guide.

Inserting data

  1. Run the scripts you prepared previously:

    Cap'n Proto:

    python3 capnproto-example.py

    Protobuf:

    python3 protobuf-example.py

  2. Connect to the cluster and make sure the data was successfully inserted by running a SELECT query:

    Cap'n Proto:

    SELECT * FROM db1.users;

    ┌─id─┬─name─┐
    │ 11 │ John │
    │ 12 │ Bob  │
    │ 13 │ Jane │
    └────┴──────┘

    Protobuf:

    SELECT * FROM db1.users;

    ┌─id─┬─name────┐
    │ 21 │ Stephen │
    │ 22 │ Olivia  │
    │ 23 │ Tim     │
    └────┴─────────┘


ClickHouse® is a registered trademark of ClickHouse, Inc.
