Adding data to ClickHouse®
For regular data inserts into tables, use an INSERT INTO statement:
INSERT INTO db_name.table_name VALUES (v11, v12, v13), (v21, v22, v23), ...
Insert queries should be run no more than once per second. To group multiple small queries into a large one, use buffering.
You can learn more about INSERT INTO in the ClickHouse® documentation.
Inserting data from a file
To insert local file data into a table, use an INSERT INTO statement, such as:
INSERT INTO db_name.table_name FROM INFILE '<full_path_to_file>'
[COMPRESSION '<compression_format>'] FORMAT <data_format>;
With the COMPRESSION option, you can transfer compressed files. Use it to upload large amounts of data. The option is supported when using clickhouse-client. Supported compression formats: none, gzip, deflate, br, xz, zstd, lz4, and bz2.
For a list of supported data formats, see the ClickHouse® documentation.
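For example, a gzip-compressed CSV file could be loaded as follows. This is a minimal sketch: the db1.users table matches the examples later in this article, and the file path is a placeholder for illustration.
INSERT INTO db1.users FROM INFILE '/tmp/users.csv.gz' COMPRESSION 'gzip' FORMAT CSV;
-- The path above is hypothetical; replace it with the full path to your file.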
Inserting data through buffering
When you insert data into ClickHouse®, part of the computing resources is spent on housekeeping operations. Each time you run an INSERT query, ClickHouse® creates a separate data part in the storage. In addition to table rows, such parts contain auxiliary files with metadata. ClickHouse® then merges data parts in the background. The more merges are required, the more resources are used.
As a result, one thousand queries each inserting a single row put more load on the cluster than a single query inserting one thousand rows. Therefore, we recommend inserting data into tables in large chunks of 1,000 to 100,000 rows.
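For example, instead of sending each row in its own query, accumulate rows and send them in one statement. A sketch using the db1.users table from the examples below (the values are arbitrary):
-- Preferred: one query with many rows
INSERT INTO db1.users VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara');
-- Avoid: a separate INSERT per row, since each query creates its own data part
-- INSERT INTO db1.users VALUES (1, 'Vlad');
-- INSERT INTO db1.users VALUES (2, 'John');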
If small chunks of data are delivered from various sources, use one of the following buffering options:
- Asynchronous inserts (recommended)
- Buffer tables
Asynchronous data inserts
If the Async insert setting is configured for the user, all INSERT queries from this user are first placed in a buffer in RAM. Data from the buffer is flushed to the table when one of the following conditions is met:
- The buffer size has reached the Async insert max data size setting value.
- The time set in the Async busy timeout setting has passed since the first INSERT query run after flushing the data.
To enable asynchronous data inserts, set the Async insert setting to 1.
Note that row deduplication is not performed for asynchronous inserts.
For more information about asynchronous data inserts, see the ClickHouse® documentation.
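Asynchronous inserts can also be requested at the query level with the async_insert and wait_for_async_insert ClickHouse® settings. A minimal sketch, assuming the db1.users table from the examples below (the inserted values are arbitrary):
INSERT INTO db1.users SETTINGS async_insert = 1, wait_for_async_insert = 1 VALUES (4, 'Anna'), (5, 'Ivan');
-- async_insert = 1 buffers the rows on the server instead of creating a data part per query;
-- wait_for_async_insert = 1 makes the query return only after the buffer has been flushed.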
Inserting data via a buffer table
A buffer table is created in RAM. It accumulates data it receives and then flushes the data to the main table if the preset conditions are met.
To create a buffer table, use the Buffer table engine. It takes the following basic parameters:
Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
Where:
- database: DB name.
- table: Name of the table to flush the data to.
- num_layers: Number of buffers. A table is physically stored in memory as multiple independent buffers.
- min_time, max_time, min_rows, max_rows, min_bytes, and max_bytes: Parameters that set the conditions for flushing the data to the main table. The time is measured in seconds.
The data is flushed to the main table upon reaching all minimum values or one of the maximum values. If the size of a received chunk exceeds the max_rows or max_bytes value, the data is written to the main table directly rather than buffered.
To learn more about additional engine parameters and limits for tables based on the Buffer engine, see the ClickHouse® documentation.
Example
-
Create a regular table named users in the db1 database:
CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String) ENGINE = MergeTree() ORDER BY id;
-
Create a buffer table named users_buffer and link it to the users main table:
CREATE TABLE db1.users_buffer AS db1.users ENGINE = Buffer(db1, users, 1, 10, 100, 10000, 1000000, 10000000, 100000000);
-
Send data to the buffer table:
INSERT INTO db1.users_buffer VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara');
-
Check the data in the main table; it will appear there within 100 seconds (max_time):
SELECT * FROM db1.users;
Result:
┌─id─┬─name─┐
│ 1 │ Vlad │
│ 2 │ John │
│ 3 │ Zara │
└────┴──────┘
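Before the flush conditions are met, the inserted rows can already be queried through the buffer table itself: a SELECT from a Buffer table returns the data in the buffer combined with the data in the destination table. A quick check using the tables above:
SELECT * FROM db1.users_buffer;
-- Returns the rows still held in the buffer together with the rows already flushed to db1.users.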
Inserting data and specifying the format schema
Managed Service for ClickHouse® lets you INSERT and SELECT data in different formats. Most of those formats are self-descriptive: they already contain a format schema that describes acceptable data types, their order, and their representation in the format. For example, this lets you insert data directly from a file.
Note
A format schema describes the format of data input or output, while a data schema describes the structure and layout of the ClickHouse® databases and tables that store this data. These concepts are not interchangeable.
The Cap'n Proto and Protobuf formats are not self-descriptive: inputting or outputting data in these formats requires an external format schema.
You can add one or more such format schemas to a Managed Service for ClickHouse® cluster and use them to input and output data in the relevant formats.
Warning
To use the format schemas you added, insert the data into Managed Service for ClickHouse® using the HTTP interface.
For more information about data formats, see the ClickHouse® documentation.
Example of using a format schema when inserting data
The example was tested in the following environment:
- Virtual machine in Yandex Cloud running Ubuntu 20.04 LTS
- Bash: 5.0.16
- clickhouse-client: 20.10.2.20
- capnproto: 0.7.0
- protobuf-compiler: 3.6.1
- Python: 3.8.5; pip3: 20.0.2
Let's assume a single-host Managed Service for ClickHouse® cluster named chcluster is created with the db1 database, and you need to insert user data into the db1.users table. Let's assume that each user record contains the following information:
- id: User ID
- name: Username
To insert user data in the Cap'n Proto and Protobuf formats into the db1.users table:
Getting started
-
Examine the data format that will be used for insertion so that the correct format schemas are prepared.
In this scenario, for demonstration, it is assumed that:
- The user id is represented as an unsigned 64-bit integer (UInt64 in Cap'n Proto and ClickHouse® or uint64 in Protobuf).
- The username name is represented as a string (Text in Cap'n Proto, string in Protobuf, or String in ClickHouse®).
To learn more about supported data types, see the documentation for Cap'n Proto, Protobuf, and ClickHouse®.
-
Connect to the cluster and create the db1.users table of the preferred format if it is not there yet:
CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String) ENGINE = MergeTree() ORDER BY id;
Installing dependencies
sudo apt update && sudo apt install -y python3 python3-pip capnproto protobuf-compiler && \
pip3 install protobuf varint pycapnp
Preparing format schemas
-
Create a file with a schema description:

Cap'n Proto (user.capnp):

@0xbec0f3f99cec4fbf;

struct User {
  id @0 :UInt64;
  name @1 :Text;
}

To learn more about the file format, see the documentation for Cap'n Proto.

Protobuf (user.proto):

syntax = "proto3";

message User {
  uint64 id = 1;
  string name = 2;
};

To learn more about the file format, see the documentation for Protobuf.
-
Upload the file to Object Storage and get a link to it.
-
Connect the format schema to the chcluster cluster:
- For the Cap'n Proto format schema (the user.capnp file), set schema-capnproto as the name.
- For the Protobuf format schema (the user.proto file), set schema-protobuf as the name.
Preparing scripts
These Python scripts prepare test data about users in the appropriate formats and insert them into a cluster table.
Note
Python scripts are provided for demonstration. You can prepare and insert binary data in the required format by creating a similar script in a different programming language.
To prepare scripts:
-
Compile the user.proto file of the Protobuf schema for Python:
protoc user.proto --python_out .
The user_pb2.py file will be generated.
-
Create files with Python code:
Cap'n Proto (capnproto-example.py):
import requests
import io
import capnp
from user_capnp import User

DB_HOST = "<ClickHouse®_host_FQDN>"
DB_NAME = "db1"
DB_USER = "<DB_username>"
DB_PASS = "<DB_user_password>"
CA_CERT = "/usr/local/share/ca-certificates/Yandex/RootCA.crt"

SCHEMA_NAME = 'schema-capnproto'
SCHEMA_TYPE = "CapnProto"
SCHEMA_CLASS = "User"

def execute_query(query, data=None):
    url = 'https://{host}:8443/'.format(host=DB_HOST)
    params = {'database': DB_NAME, 'query': query.strip()}
    auth = {'X-ClickHouse-User': DB_USER, 'X-ClickHouse-Key': DB_PASS}
    rs = requests.post(url, params=params, headers=auth, data=data, verify=CA_CERT)
    rs.raise_for_status()

def add_user(fileobj, user_id, user_name):
    user = User.new_message()
    user.id = user_id
    user.name = user_name
    fileobj.write(user.to_bytes())

message = io.BytesIO()

add_user(message, 11, 'John')
add_user(message, 12, 'Bob')
add_user(message, 13, 'Jane')

execute_query(
    '''
    INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type}
    '''.format(database=DB_NAME, type=SCHEMA_TYPE, name=SCHEMA_NAME, cls=SCHEMA_CLASS),
    data=message.getvalue())
This script:
- Gets the User class from the linked user.capnp file (from user_capnp import User).
- Executes requests to the cluster over HTTPS using SSL.
- Writes the test dataset to the User class object (def add_user ...) and adds this object to the message I/O byte stream.
- Inserts data from the message byte stream into the db1.users table based on the User class data from the schema-capnproto format schema in the cluster.
Protobuf (protobuf-example.py):
import requests
import io
import varint
from user_pb2 import User

DB_HOST = "<ClickHouse®_host_FQDN>"
DB_NAME = "db1"
DB_USER = "<DB_username>"
DB_PASS = "<DB_user_password>"
CA_CERT = "/usr/local/share/ca-certificates/Yandex/RootCA.crt"

SCHEMA_NAME = 'schema-protobuf'
SCHEMA_TYPE = "Protobuf"
SCHEMA_CLASS = "User"

def execute_query(query, data=None):
    url = 'https://{host}:8443/'.format(host=DB_HOST)
    params = {'database': DB_NAME, 'query': query.strip()}
    auth = {'X-ClickHouse-User': DB_USER, 'X-ClickHouse-Key': DB_PASS}
    rs = requests.post(url, params=params, headers=auth, data=data, verify=CA_CERT)
    rs.raise_for_status()

def add_user(fileobj, user_id, user_name):
    user = User()
    user.id = user_id
    user.name = user_name
    fileobj.write(varint.encode(user.ByteSize()))
    fileobj.write(user.SerializeToString())

message = io.BytesIO()

add_user(message, 21, 'Stephen')
add_user(message, 22, 'Olivia')
add_user(message, 23, 'Tim')

execute_query(
    '''
    INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type}
    '''.format(database=DB_NAME, type=SCHEMA_TYPE, name=SCHEMA_NAME, cls=SCHEMA_CLASS),
    data=message.getvalue())
This script:
- Gets the User class from the linked user_pb2.py file obtained after compiling the proto file (from user_pb2 import User).
- Writes the test dataset to the User class object (def add_user ...) and adds this object to the message I/O byte stream.
- Inserts data from the message byte stream into the db1.users table based on the User class data from the schema-protobuf format schema in the cluster.
To learn how to get a host FQDN, see this guide.
Inserting data
-
Run the scripts you prepared previously:
Cap'n Proto:
python3 capnproto-example.py
Protobuf:
python3 protobuf-example.py
-
Connect to the cluster and make sure that the data was successfully inserted by running the SELECT query:
Cap'n Proto:
SELECT * FROM db1.users;

┌─id─┬─name─┐
│ 11 │ John │
│ 12 │ Bob  │
│ 13 │ Jane │
└────┴──────┘

Protobuf:
SELECT * FROM db1.users;

┌─id─┬─name────┐
│ 21 │ Stephen │
│ 22 │ Olivia  │
│ 23 │ Tim     │
└────┴─────────┘
ClickHouse® is a registered trademark of ClickHouse, Inc.