Adding data to ClickHouse®
For regular data inserts into tables, use an INSERT INTO statement:
INSERT INTO db_name.table_name VALUES (v11, v12, v13), (v21, v22, v23), ...
Insert queries should be run no more than once per second. To group multiple small queries into a large one, use buffering.
You can learn more about INSERT INTO in the ClickHouse® documentation.
Inserting data from a file
To insert local file data into a table, use an INSERT INTO statement, such as:
INSERT INTO db_name.table_name FROM INFILE '<full_path_to_file>'
[COMPRESSION '<compression_format>'] FORMAT <data_format>;
With the COMPRESSION option, you can transfer compressed files. Use it to upload large amounts of data. The option is supported when using clickhouse-client. Supported compression formats: none, gzip, deflate, br, xz, zstd, lz4, and bz2.
For a list of supported data formats, see the ClickHouse® documentation.
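For example, a gzip-compressed CSV file could be loaded as follows. This is a minimal sketch: the db1.users table matches the examples later in this article, and the file path is a placeholder for illustration.
INSERT INTO db1.users FROM INFILE '/tmp/users.csv.gz' COMPRESSION 'gzip' FORMAT CSV;
-- The path above is hypothetical; replace it with the full path to your file.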
Inserting data through buffering
When you insert data into ClickHouse®, part of the computing resources is spent on housekeeping operations. Each time you run an INSERT query, ClickHouse® creates a separate data part in the storage. In addition to table rows, such parts contain auxiliary files with metadata. ClickHouse® then merges data parts in the background. The more merges are required, the more resources are used.
As a result, one thousand queries each inserting a single row put more load on the cluster than a single query inserting one thousand rows. Therefore, we recommend inserting data into tables in large chunks of 1,000 to 100,000 rows.
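For example, instead of sending each row in its own query, accumulate rows and send them in one statement. A sketch using the db1.users table from the examples below (the values are arbitrary):
-- Preferred: one query with many rows
INSERT INTO db1.users VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara');
-- Avoid: a separate INSERT per row, since each query creates its own data part
-- INSERT INTO db1.users VALUES (1, 'Vlad');
-- INSERT INTO db1.users VALUES (2, 'John');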
If small chunks of data are delivered from various sources, use one of the following buffering options:
- Asynchronous inserts (recommended)
- Buffer tables
Asynchronous data inserts
If the Async insert setting is configured for the user, all INSERT queries from this user are first placed in a buffer in RAM. Data from the buffer is flushed to the table when one of the following conditions is met:
- The buffer size has reached the Async insert max data size setting value.
- The time set in the Async busy timeout setting has passed since the first INSERT query run after flushing the data.
To enable asynchronous data inserts, set the Async insert setting to 1.
Note that row deduplication is not performed for asynchronous inserts.
For more information about asynchronous data inserts, see the ClickHouse® documentation.
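Asynchronous inserts can also be requested at the query level with the async_insert and wait_for_async_insert ClickHouse® settings. A minimal sketch, assuming the db1.users table from the examples below (the inserted values are arbitrary):
INSERT INTO db1.users SETTINGS async_insert = 1, wait_for_async_insert = 1 VALUES (4, 'Anna'), (5, 'Ivan');
-- async_insert = 1 buffers the rows on the server instead of creating a data part per query;
-- wait_for_async_insert = 1 makes the query return only after the buffer has been flushed.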
Inserting data via a buffer table
A buffer table is created in RAM. It accumulates data it receives and then flushes the data to the main table if the preset conditions are met.
To create a buffer table, use the Buffer table engine. It takes the following basic parameters:
Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
Where:
- database: DB name.
- table: Name of the table to flush the data to.
- num_layers: Number of buffers. A table is physically stored in memory as multiple independent buffers.
- min_time, max_time, min_rows, max_rows, min_bytes, and max_bytes: Parameters that set the conditions for flushing the data to the main table. The time is measured in seconds.
The data is flushed to the main table upon reaching all minimum values or one of the maximum values. If the size of a received chunk exceeds the max_rows or max_bytes value, the data is written to the main table directly rather than buffered.
To learn more about additional engine parameters and limits for tables based on the Buffer engine, see the ClickHouse® documentation.
Example
-
Create a regular table named users in the db1 database:
CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String) ENGINE = MergeTree() ORDER BY id;
-
Create a buffer table named users_buffer and link it to the users main table:
CREATE TABLE db1.users_buffer AS db1.users ENGINE = Buffer(db1, users, 1, 10, 100, 10000, 1000000, 10000000, 100000000);
-
Send data to the buffer table:
INSERT INTO db1.users_buffer VALUES (1, 'Vlad'), (2, 'John'), (3, 'Zara');
-
Check the data in the main table; it will appear there within 100 seconds (max_time):
SELECT * FROM db1.users;
Result:
┌─id─┬─name─┐
│ 1 │ Vlad │
│ 2 │ John │
│ 3 │ Zara │
└────┴──────┘
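Before the flush conditions are met, the inserted rows can already be queried through the buffer table itself: a SELECT from a Buffer table returns the data in the buffer combined with the data in the destination table. A quick check using the tables above:
SELECT * FROM db1.users_buffer;
-- Returns the rows still held in the buffer together with the rows already flushed to db1.users.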
Inserting data and specifying the format schema
Managed Service for ClickHouse® lets you INSERT and SELECT data in different formats. Most of those formats are self-descriptive: they already contain a format schema that describes acceptable data types, their order, and their representation in the format. For example, this lets you insert data directly from a file.
Note
A format schema describes the format of data input or output, while a data schema describes the structure and layout of the ClickHouse® databases and tables that store this data. These concepts are not interchangeable.
The Cap'n Proto and Protobuf formats are not self-descriptive: inputting or outputting data in these formats requires an external format schema.
You can add one or more such format schemas to a Managed Service for ClickHouse® cluster and use them to input and output data in the relevant formats.
Warning
To use the format schemas you added, insert the data into Managed Service for ClickHouse® using the HTTP interface.
For more information about data formats, see the ClickHouse® documentation.
Example of using a format schema when inserting data
The example was tested in the following environment:
- Virtual machine in Yandex Cloud running Ubuntu 20.04 LTS
- Bash: 5.0.16
- clickhouse-client: 20.10.2.20
- capnproto: 0.7.0
- protobuf-compiler: 3.6.1
- Python: 3.8.5; pip3: 20.0.2
Let's assume a single-host Managed Service for ClickHouse® cluster named chcluster is created with the db1 database, and you need to insert user data into the db1.users table. Let's assume that each user record contains the following information:
- id: User ID
- name: Username
To insert user data in the Cap'n Proto and Protobuf formats into the db1.users table:
Getting started
-
Examine the data format that will be used for insertion so that the correct format schemas are prepared.
In this scenario, for demonstration, it is assumed that:
- The user id is represented as an unsigned 64-bit integer (UInt64 in Cap'n Proto and ClickHouse® or uint64 in Protobuf).
- The username name is represented as a string (Text in Cap'n Proto, string in Protobuf, or String in ClickHouse®).
To learn more about supported data types, see the documentation for Cap'n Proto, Protobuf, and ClickHouse®.
-
Connect to the cluster and create the db1.users table of the preferred format if it is not there yet:
CREATE TABLE IF NOT EXISTS db1.users (id UInt64, name String) ENGINE = MergeTree() ORDER BY id;
Installing dependencies
sudo apt update && sudo apt install -y python3 python3-pip capnproto protobuf-compiler && \
pip3 install protobuf varint pycapnp
Preparing format schemas
-
Create a file with a schema description:

Cap'n Proto (user.capnp):

@0xbec0f3f99cec4fbf;

struct User {
  id @0 :UInt64;
  name @1 :Text;
}

To learn more about the file format, see the documentation for Cap'n Proto.

Protobuf (user.proto):

syntax = "proto3";

message User {
  uint64 id = 1;
  string name = 2;
};

To learn more about the file format, see the documentation for Protobuf.
-
Upload the file to Object Storage and get a link to it.
-
Connect the format schema to the chcluster cluster:
- For the Cap'n Proto format schema (the user.capnp file), set schema-capnproto as the name.
- For the Protobuf format schema (the user.proto file), set schema-protobuf as the name.
Preparing scripts
These Python scripts prepare test data about users in the appropriate formats and insert them into a cluster table.
Note
Python scripts are provided for demonstration. You can prepare and insert binary data in the required format by creating a similar script in a different programming language.
To prepare scripts:
-
Compile the user.proto file of the Protobuf schema for Python:
protoc user.proto --python_out .
The user_pb2.py file will be generated.
-
Create files with Python code:
Cap'n Proto (capnproto-example.py):
import requests
import io
import capnp
from user_capnp import User

DB_HOST = "<ClickHouse®_host_FQDN>"
DB_NAME = "db1"
DB_USER = "<DB_username>"
DB_PASS = "<DB_user_password>"
CA_CERT = "/usr/local/share/ca-certificates/Yandex/RootCA.crt"

SCHEMA_NAME = 'schema-capnproto'
SCHEMA_TYPE = "CapnProto"
SCHEMA_CLASS = "User"

def execute_query(query, data=None):
    url = 'https://{host}:8443/'.format(host=DB_HOST)
    params = {'database': DB_NAME, 'query': query.strip()}
    auth = {'X-ClickHouse-User': DB_USER, 'X-ClickHouse-Key': DB_PASS}
    rs = requests.post(url, params=params, headers=auth, data=data, verify=CA_CERT)
    rs.raise_for_status()

def add_user(fileobj, user_id, user_name):
    user = User.new_message()
    user.id = user_id
    user.name = user_name
    fileobj.write(user.to_bytes())

message = io.BytesIO()

add_user(message, 11, 'John')
add_user(message, 12, 'Bob')
add_user(message, 13, 'Jane')

execute_query(
    '''
    INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type}
    '''.format(database=DB_NAME, type=SCHEMA_TYPE, name=SCHEMA_NAME, cls=SCHEMA_CLASS),
    data=message.getvalue())
This script:
- Gets the User class from the linked user.capnp file (from user_capnp import User).
- Executes requests to the cluster over HTTPS using SSL.
- Writes the test dataset to the User class object (def add_user ...) and adds this object to the message I/O byte stream.
- Inserts data from the message byte stream into the db1.users table based on the User class data from the schema-capnproto format schema in the cluster.
Protobuf (protobuf-example.py):
import requests
import io
import varint
from user_pb2 import User

DB_HOST = "<ClickHouse®_host_FQDN>"
DB_NAME = "db1"
DB_USER = "<DB_username>"
DB_PASS = "<DB_user_password>"
CA_CERT = "/usr/local/share/ca-certificates/Yandex/RootCA.crt"

SCHEMA_NAME = 'schema-protobuf'
SCHEMA_TYPE = "Protobuf"
SCHEMA_CLASS = "User"

def execute_query(query, data=None):
    url = 'https://{host}:8443/'.format(host=DB_HOST)
    params = {'database': DB_NAME, 'query': query.strip()}
    auth = {'X-ClickHouse-User': DB_USER, 'X-ClickHouse-Key': DB_PASS}
    rs = requests.post(url, params=params, headers=auth, data=data, verify=CA_CERT)
    rs.raise_for_status()

def add_user(fileobj, user_id, user_name):
    user = User()
    user.id = user_id
    user.name = user_name
    fileobj.write(varint.encode(user.ByteSize()))
    fileobj.write(user.SerializeToString())

message = io.BytesIO()

add_user(message, 21, 'Stephen')
add_user(message, 22, 'Olivia')
add_user(message, 23, 'Tim')

execute_query(
    '''
    INSERT INTO {database}.users SETTINGS format_schema='{name}:{cls}' FORMAT {type}
    '''.format(database=DB_NAME, type=SCHEMA_TYPE, name=SCHEMA_NAME, cls=SCHEMA_CLASS),
    data=message.getvalue())
This script:
- Gets the User class from the linked user_pb2.py file obtained after compiling the proto file (from user_pb2 import User).
- Writes the test dataset to the User class object (def add_user ...) and adds this object to the message I/O byte stream.
- Inserts data from the message byte stream into the db1.users table based on the User class data from the schema-protobuf format schema in the cluster.
To learn how to get a host FQDN, see this guide.
Inserting data
-
Run the scripts you prepared previously:
Cap'n Proto:
python3 capnproto-example.py
Protobuf:
python3 protobuf-example.py
-
Connect to the cluster and make sure that the data was successfully inserted by running the SELECT query:
Cap'n Proto:
SELECT * FROM db1.users;

┌─id─┬─name─┐
│ 11 │ John │
│ 12 │ Bob  │
│ 13 │ Jane │
└────┴──────┘

Protobuf:
SELECT * FROM db1.users;

┌─id─┬─name────┐
│ 21 │ Stephen │
│ 22 │ Olivia  │
│ 23 │ Tim     │
└────┴─────────┘
ClickHouse® is a registered trademark of ClickHouse, Inc.