Adding data to ClickHouse®
To insert data into a table, use an INSERT INTO statement:
INSERT INTO db_name.table_name VALUES (v11, v12, v13), (v21, v22, v23), ...
We recommend sending insert queries no more than once per second. You can use buffering to combine multiple small queries into a single large one.
Learn more about INSERT INTO from this ClickHouse® guide
Inserting data from a file
To insert data into a table from a local file, use the INSERT INTO statement in the following format:
INSERT INTO db_name.table_name FROM INFILE '<full_path_to_file>'
[COMPRESSION '<compression_format>'] FORMAT <data_format>;
The COMPRESSION allows you to transfer compressed files, which is perfect for uploading large amounts of data. This option is supported when working through the clickhouse-clientnone, gzip, deflate, br, xz, zstd, lz4, and bz2.
For a list of supported data formats, refer to this ClickHouse® guide
Inserting data using buffering
When inserting data into ClickHouse®, a portion of the computational resources is consumed by overhead operations. Each INSERT query creates a new data part in ClickHouse® storage. In addition to the data rows, each part contains auxiliary metadata files. To reduce the number of data parts, ClickHouse® merges them in the background. The more merge operations required, the more resources are consumed.
As a result, a thousand individual insert queries will impose more load on the cluster than a single query inserting one thousand rows. Therefore, we recommend inserting data in batches of 1,000 to 100,000 rows.
If data arrives in small batches from different sources, use one of the following buffering mechanisms:
- Asynchronous inserts (recommended)
- Buffer tables
Asynchronous inserts
If the Async insert setting is enabled for a user, then all insert queries from that user are first buffered in RAM. Data is flushed from the buffer to the destination table when one of the following conditions is met:
- The buffer size reaches the Async insert max data size setting.
- The time since the first
INSERTafter the last buffer flush exceeds the value of the Async busy timeout setting.
To enable asynchronous data inserts, set Async insert to 1.
Note that asynchronous inserts do not support row deduplication
For more details about asynchronous data inserts, see this ClickHouse® guide
Using a buffer table for inserts
A buffer table, created in RAM, accumulates incoming data and flushes it to the target table once preset conditions are met.
To create a buffer table, use the Buffer table engine. It requires the following parameters:
Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
Where:
database: Database name.table: Target table name.num_layers: Number of buffers. A table is physically stored in memory as multiple independent buffers.min_time,max_time,min_rows,max_rows,min_bytes, andmax_bytes: Conditions for flushing data from the buffer to the target table. The time is specified in seconds.
Flushing occurs when all minimum conditions are satisfied, or when any one of the maximum limits is reached. If the size of incoming data batch exceeds the max_rows or max_bytes settings, the system writes it directly to the target table, bypassing the buffer.
For details on additional Buffer engine parameters and table limitations, see this ClickHouse® guide
Example
-
Create a regular table named
usersin thedb1database:CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String) ENGINE = MergeTree() ORDER BY id; -
Create a buffer table named
users_bufferlinked to the target tableusers:CREATE TABLE db1.users_buffer AS db1.users ENGINE = Buffer(db1, users, 1, 10, 100, 10000, 1000000, 10000000, 100000000); -
Insert data into the buffer table:
INSERT INTO db1.users_buffer VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara'); -
Verify the data appears in the target table within the
max_timeinterval of 100 seconds:SELECT * FROM db1.users;
Result:
┌─id─┬─name─┐
│ 1 │ Vlad │
│ 2 │ John │
│ 3 │ Zara │
└────┴──────┘
Data insertion with schema specification
In Managed Service for ClickHouse®, you can INSERT and SELECT data in different formats. Most of these formats are self-descriptive. This means that they already contain a format schema that describes valid data types, their order, and representation in this format. Thus, for example, you can insert data directly from a file.
Note
A format schema describes the format of data input or output, while a data schema describes the structure and layout of ClickHouse® databases and tables storing this data. These concepts are not interchangeable.
Cap'n Proto
You can add one or multiple format schemas to your Managed Service for ClickHouse® cluster and use them to input and output data in the relevant formats.
Warning
To use the format schemas you added, insert the data into Managed Service for ClickHouse® via the HTTP interface
For more information about data formats, see this ClickHouse® guide
Example of using a schema for data insertion
This example was tested in the following environment:
- Yandex Cloud VM running Ubuntu 20.04 LTS.
- Bash:
5.0.16. - clickhouse-client:
20.10.2.20. - capnproto:
0.7.0. - protobuf-compiler:
3.6.1. - Python:
3.8.5; pip3:20.0.2.
Suppose you have created a single-host Managed Service for ClickHouse® cluster named chcluster with a database named db1. You need to insert user data into the db1.users table, where each record contains the following information:
id: User ID.name: Username.
To insert Cap'n Proto and Protobuf user data into the db1.users table:
Getting started
-
Examine the input data format in order to prepare the appropriate schema definitions.
In our demonstration example, we assume the following:
- The user ID (
id) is an unsigned 64-bit integer (Uint64in Cap'n Proto and ClickHouse®, anduint64in Protobuf). - The username (
name) is a string (Textin Cap'n Proto,stringin Protobuf,Stringin ClickHouse®).
For more information on supported data types, see the relevant Cap'n Proto
, Protobuf , and ClickHouse® guides . - The user ID (
-
Connect to the cluster and create the
db1.userstable if it does not exist, using the required schema:CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String) ENGINE = MergeTree() ORDER BY id;
Installing dependencies
sudo apt update && sudo apt install -y python3 python3-pip capnproto protobuf-compiler && \
pip3 install protobuf varint pycapnp
Defining the schemas
-
Create a schema definition file:
Cap'n ProtoProtobufuser.capnp@0xbec0f3f99cec4fbf; struct User { id @0 :UInt64; name @1 :Text; }For file format details, see this Cap'n Proto guide
.user.protosyntax = "proto3"; message User { uint64 id = 1; string name = 2; };For file format detials, see this Protobuf guide
. -
Upload the file to Object Storage and retrieve its public URL.
-
Add your schema to the
chclustercluster:- For the Cap'n Proto schema (
user.capnp), use the nameschema-capnproto. - For the Protobuf schema (
user.protobuf), use the nameschema-protobuf.
- For the Cap'n Proto schema (
Preparing scripts
The following Python scripts generate test user data in required formats and insert it into your cluster table.
Note
Python is used here solely for demonstration. The core logic of generating binary data in required format and inserting it into the table can be implemented in any programming language.
To prepare the scripts:
-
Generate the Python code from the
user.protoProtobuf schema file:protoc user.proto --python_out .This command will generate the
user_pb2.pyfile. -
Create files with the following Python code:
Cap'n ProtoProtobufcapnproto-example.pyimport requests import io import capnp from user_capnp import User DB_HOST="<ClickHouse®>_host_FQDN" DB_NAME="db1" DB_USER="<DB_username>" DB_PASS="<DB_user_password>" CA_CERT="/usr/local/share/ca-certificates/Yandex/RootCA.crt" SCHEMA_NAME = 'schema-capnproto' SCHEMA_TYPE = "CapnProto" SCHEMA_CLASS = "User" def execute_query(query, data=None): url = 'https://{host}:8443/'.format(host=DB_HOST) params = { 'database': DB_NAME, 'query': query.strip() } auth = { 'X-ClickHouse-User': DB_USER, 'X-ClickHouse-Key': DB_PASS } rs = requests.post(url, params=params, headers=auth, data=data, verify=CA_CERT) rs.raise_for_status() def add_user(fileobj, user_id, user_name): user = User.new_message() user.id = user_id user.name = user_name fileobj.write(user.to_bytes()) message = io.BytesIO() add_user(message, 11, 'John') add_user(message, 12, 'Bob') add_user(message, 13, 'Jane') execute_query( ''' INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type} '''.format(database=DB_NAME, type=SCHEMA_TYPE, name=SCHEMA_NAME, cls=SCHEMA_CLASS), data=message.getvalue())The script above does the following:
- Imports the
Userclass generated from theuser.capnpschema (from user_capnp import User). - Queries the cluster via HTTPS with SSL encryption.
- Creates a User class object, populates it with the test dataset, (
def add_user ...) and serializes it into amessagebitstream for I/O. - Deserializes the
messagebitstream into theUserobject using theschema-capnprotoschema and inserts the resulting data into thedb1.userscluster table.
protobuf-example.pyimport requests import io import varint from user_pb2 import User DB_HOST="<ClickHouse®>_host_FQDN" DB_NAME="db1" DB_USER="<DB_username>" DB_PASS="<DB_user_password>" CA_CERT="/usr/local/share/ca-certificates/Yandex/RootCA.crt" SCHEMA_NAME = 'schema-protobuf' SCHEMA_TYPE = "Protobuf" SCHEMA_CLASS = "User" def execute_query(query, data=None): url = 'https://{host}:8443/'.format(host=DB_HOST) params = { 'database': DB_NAME, 'query': query.strip() } auth = { 'X-ClickHouse-User': DB_USER, 'X-ClickHouse-Key': DB_PASS } rs = requests.post(url, params=params, headers=auth, data=data, verify=CA_CERT) rs.raise_for_status() def add_user(fileobj, user_id, user_name): user = User() user.id = user_id user.name = user_name fileobj.write(varint.encode(user.ByteSize())) fileobj.write(user.SerializeToString()) message = io.BytesIO() add_user(message, 21, 'Stephen') add_user(message, 22, 'Olivia') add_user(message, 23, 'Tim') execute_query( '''INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type} '''.format(database=DB_NAME, type=SCHEMA_TYPE, name=SCHEMA_NAME, cls=SCHEMA_CLASS), data=message.getvalue())The script above does the following:
- Imports the
Userclass from theuser_pb2.pymodule generated by compiling the relevant proto file (from user_pb2 import User). - Creates a User class object, populates it with the test dataset, (
def add_user ...) and serializes it into amessagebitstream for I/O. - Deserializes the
messagebitstream into theUserobject using theschema-protobufschema and inserts the resulting data into thedb1.userscluster table.
To learn how to get a host’s FQDN, see this guide.
- Imports the
Inserting data
-
Run the scripts you prepared at the previous step:
Cap'n ProtoProtobufpython3 capnproto-example.pypython3 protobuf-example.py -
Connect to the cluster and verify the data was inserted by running the
SELECTquery:Cap'n ProtoProtobufSELECT * FROM db1.users; ┌─id─┬─name─┐ │ 11 │ John │ │ 12 │ Bob │ │ 13 │ Jane │ └────┴──────┘SELECT * FROM db1.users; ┌─id─┬─name────┐ │ 21 │ Stephen │ │ 22 │ Olivia │ │ 23 │ Tim │ └────┴─────────┘
ClickHouse® is a registered trademark of ClickHouse, Inc