Fetching data from Yandex Managed Service for Apache Kafka® to Yandex Managed Service for ClickHouse®
- Getting started
- Set up integration with Apache Kafka® for the Managed Service for ClickHouse® cluster
- In the Managed Service for ClickHouse® cluster, create tables on the Kafka engine
- Send test data to Managed Service for Apache Kafka® topics
- Check that the test data is present in the Managed Service for ClickHouse® cluster tables
- Delete the resources you created
A Managed Service for ClickHouse® cluster can get data from Apache Kafka® topics in real time. Managed Service for ClickHouse® automatically inserts data sent to ClickHouse® tables on the Kafka
engine
To set up data delivery from Managed Service for Apache Kafka® to Managed Service for ClickHouse®:
- Set up integration with Apache Kafka® for the Managed Service for ClickHouse® cluster.
- In the Managed Service for ClickHouse® cluster, create tables on the Kafka engine.
- Send test data to Managed Service for Apache Kafka® topics.
- Check that the test data is present in the Managed Service for ClickHouse® cluster tables.
If you no longer need the resources you created, delete them.
Getting started
Prepare the infrastructure
-
Create the required number of Managed Service for Apache Kafka® clusters in any suitable configuration. To connect to clusters from a user's local machine instead of the Yandex Cloud cloud network, enable public access to clusters when creating them.
-
Create a Managed Service for ClickHouse® cluster with a single shard and a database named
db1
. To connect to the cluster from the user's local machine rather than doing so from the Yandex Cloud cloud network, enable public access to the cluster when creating it.Note
You can set up Apache Kafka® integration when creating a cluster. In this tutorial, integration will be set up later.
-
Create the required number of topics in Managed Service for Apache Kafka® clusters. Make sure topic names are unique.
-
To enable producers and consumers to work with topics, create two users per Managed Service for Apache Kafka® cluster:
- A user with the
ACCESS_ROLE_PRODUCER
role for the producer. - A user with the
ACCESS_ROLE_CONSUMER
role for the consumer.
Users in different clusters may have the same names.
- A user with the
-
If you do not have Terraform yet, install it.
-
Get the authentication credentials. You can add them to environment variables or specify them later in the provider configuration file.
-
Configure and initialize a provider. There is no need to create a provider configuration file manually, you can download it
. -
Place the configuration file in a separate working directory and specify the parameter values. If you did not add the authentication credentials to environment variables, specify them in the configuration file.
-
Download the data-from-kafka-to-clickhouse.tf
configuration file to the same working directory.This file describes:
-
Network.
-
Subnet.
-
Default security group and rules required to connect to the clusters from the internet.
-
Managed Service for Apache Kafka® cluster.
-
Topic and two Managed Service for Apache Kafka® users on whose behalf the producer and consumer will connect to the topic, respectively.
To create multiple topics or clusters, duplicate blocks with their description and specify new unique names. Users in different clusters may have the same names.
-
A Managed Service for ClickHouse® cluster with a single shard and a database named
db1
.
-
-
In
data-from-kafka-to-clickhouse.tf
, specify:- Managed Service for Apache Kafka® version.
- Usernames and passwords of users with the
ACCESS_ROLE_PRODUCER
andACCESS_ROLE_CONSUMER
roles in Managed Service for Apache Kafka® clusters. - Names of the Managed Service for Apache Kafka® clusters' topics.
- Username and password that will be used to access a Managed Service for ClickHouse® cluster.
-
Make sure the Terraform configuration files are correct using this command:
terraform validate
If there are any errors in the configuration files, Terraform will point them out.
-
Create the required infrastructure:
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
All the required resources will be created in the specified folder. You can check resource availability and their settings in the management console
. -
Configure additional settings
-
Install the utilities:
-
kafkacat
to read and write data to Apache Kafka® topics.sudo apt update && sudo apt install --yes kafkacat
Check that you can use it to connect to Managed Service for Apache Kafka® clusters over SSL.
-
clickhouse-client
: To connect to the database in the Managed Service for ClickHouse® cluster.-
Connect the DEB repository
ClickHouse®:sudo apt update && sudo apt install --yes apt-transport-https ca-certificates dirmngr && \ sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4 && \ echo "deb https://packages.clickhouse.com/deb stable main" | sudo tee \ /etc/apt/sources.list.d/clickhouse.list
-
Install the dependencies:
sudo apt update && sudo apt install --yes clickhouse-client
-
Download the configuration file for
clickhouse-client
:mkdir -p ~/.clickhouse-client && \ wget "https://storage.yandexcloud.net/doc-files/clickhouse-client.conf.example" \ --output-document ~/.clickhouse-client/config.xml
Check that you can use it to connect to the Managed Service for ClickHouse® cluster over SSL.
-
-
jq
for JSON file stream processing.sudo apt update && sudo apt-get install --yes jq
-
Set up integration with Apache Kafka® for the Managed Service for ClickHouse® cluster
Depending on the number of Managed Service for Apache Kafka® clusters:
- If there is a single Apache Kafka® cluster, specify authentication data under DBMS settings → Kafka. In this case, the Managed Service for ClickHouse® cluster will use these authentication credentials to access any topic.
- If there are multiple Apache Kafka® clusters, specify authentication data for each Managed Service for Apache Kafka® topic in the Managed Service for ClickHouse® cluster settings under DBMS settings → Kafka topics.
Authentication data:
- Name: Topic name (for multiple Apache Kafka® clusters).
- Sasl mechanism:
SCRAM-SHA-512
. - Sasl password: User password for the consumer.
- Sasl username: Username for the consumer.
- Security protocol:
SASL_SSL
.
-
Depending on the number of Managed Service for Apache Kafka® clusters:
-
If there is a single Apache Kafka® cluster, uncomment the
clickhouse.config.kafka
section in thedata-from-kafka-to-clickhouse.tf
file:config { kafka { security_protocol = "SECURITY_PROTOCOL_SASL_SSL" sasl_mechanism = "SASL_MECHANISM_SCRAM_SHA_512" sasl_username = "<username_for_the_consumer>" sasl_password = "<user_password_for_the_consumer>" } }
-
If there are multiple Apache Kafka® clusters, uncomment the
clickhouse.config.kafka_topic
section and specify the authentication credentials for each Managed Service for Apache Kafka® topic:config { kafka_topic { name = "<topic_name>" settings { security_protocol = "SECURITY_PROTOCOL_SASL_SSL" sasl_mechanism = "SASL_MECHANISM_SCRAM_SHA_512" sasl_username = "<username_for_the_consumer>" sasl_password = "<user_password_for_the_consumer>" } } }
If there are multiple topics in the clusters, duplicate the
kafka_topic
section as many times as required and specify the respective topic names.
-
-
Make sure the settings are correct.
-
Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
-
Run the command:
terraform validate
If there are errors in the configuration files, Terraform will point to them.
-
-
Confirm updating the resources.
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
-
In the Managed Service for ClickHouse® cluster, create tables on the Kafka engine
For example, Apache Kafka® topics receive some data from car sensors in JSON format. This data will be sent as Apache Kafka® messages, each containing a string like this:
{"device_id":"iv9a94th6rzt********","datetime":"2020-06-05 17:27:00","latitude":"55.70329032","longitude":"37.65472196","altitude":"427.5","speed":"0","battery_voltage":"23.5","cabin_temperature":"17","fuel_level":null}
The Managed Service for ClickHouse® cluster will insert data into tables run on the Kafka
engine in JSONEachRow format
For each Apache Kafka® topic, create a separate table in your Managed Service for ClickHouse® cluster to write incoming data to:
-
Connect to the
db1
database of the Managed Service for ClickHouse® cluster usingclickhouse-client
. -
Run the following query:
CREATE TABLE IF NOT EXISTS db1.<topic_table_name> ( device_id String, datetime DateTime, latitude Float32, longitude Float32, altitude Float32, speed Float32, battery_voltage Nullable(Float32), cabin_temperature Float32, fuel_level Nullable(Float32) ) ENGINE = Kafka() SETTINGS kafka_broker_list = '<broker_host_FQDN>:9091' kafka_topic_list = '<topic_name>', kafka_group_name = 'sample_group', kafka_format = 'JSONEachRow';
The created tables will be automatically populated with messages that are read from Managed Service for Apache Kafka® topics. To read data, Managed Service for ClickHouse® uses the settings previously set for users with the ACCESS_ROLE_CONSUMER
role.
To learn more about creating a table on the Kafka
engine, see the ClickHouse® documentation
Send test data to Managed Service for Apache Kafka® topics
-
Create a
sample.json
file with the following test data:{ "device_id": "iv9a94th6rzt********", "datetime": "2020-06-05 17:27:00", "latitude": 55.70329032, "longitude": 37.65472196, "altitude": 427.5, "speed": 0, "battery_voltage": 23.5, "cabin_temperature": 17, "fuel_level": null } { "device_id": "rhibbh3y08qm********", "datetime": "2020-06-06 09:49:54", "latitude": 55.71294467, "longitude": 37.66542005, "altitude": 429.13, "speed": 55.5, "battery_voltage": null, "cabin_temperature": 18, "fuel_level": 32 } { "device_id": "iv9a94th6rzt********", "datetime": "2020-06-07 15:00:10", "latitude": 55.70985913, "longitude": 37.62141918, "altitude": 417.0, "speed": 15.7, "battery_voltage": 10.3, "cabin_temperature": 17, "fuel_level": null }
-
Send the data from the
sample.json
file to each Managed Service for Apache Kafka® topic usingjq
andkafkacat
:jq -rc . sample.json | kafkacat -P \ -b <broker_host_FQDN>:9091 \ -t <topic_name> \ -k key \ -X security.protocol=SASL_SSL \ -X sasl.mechanisms=SCRAM-SHA-512 \ -X sasl.username="<username_for_the_producer>" \ -X sasl.password="<user_password_for_the_producer>" \ -X ssl.ca.location=/usr/local/share/ca-certificates/Yandex/RootCA.crt -Z
Data is sent on behalf of users with the ACCESS_ROLE_PRODUCER
role. To learn more about setting up an SSL certificate and working with kafkacat
, see Connecting to an Apache Kafka® cluster from applications.
Check that the test data is present in the Managed Service for ClickHouse® cluster tables
To access the data, use a materialized view. When a materialized view is added to a table on the Kafka
engine, it starts collecting data in the background. This allows you to continuously receive messages from Apache Kafka® and convert them to the required format using SELECT
.
Note
Since ClickHouse® can read a message from a topic only once, we do not recommend reading data directly from the table.
To create a materialized view:
-
Connect to the
db1
database of the Managed Service for ClickHouse® cluster usingclickhouse-client
. -
Run the following queries for each table on the
Kafka
engine:CREATE TABLE db1.temp_<topic_table_name> ( device_id String, datetime DateTime, latitude Float32, longitude Float32, altitude Float32, speed Float32, battery_voltage Nullable(Float32), cabin_temperature Float32, fuel_level Nullable(Float32) ) ENGINE = MergeTree() ORDER BY device_id;
CREATE MATERIALIZED VIEW db1.<view_name> TO db1.temp_<topic_table_name> AS SELECT * FROM db1.<topic_table_name>;
To get all the data from the appropriate materialized view:
-
Connect to the
db1
database of the Managed Service for ClickHouse® cluster usingclickhouse-client
. -
Run the following query:
SELECT * FROM db1.<view_name>;
The query will return a table with data sent to the respective Managed Service for Apache Kafka® topic.
To learn more about how to work with data received from Apache Kafka®, see the ClickHouse® documentation
Delete the resources you created
Some resources are not free of charge. To avoid paying for them, delete the resources you no longer need:
-
Delete the clusters:
-
If you reserved public static IP addresses for the clusters, release and delete them.
To delete the infrastructure created with Terraform:
-
In the terminal window, go to the directory containing the infrastructure plan.
-
Delete the
data-from-kafka-to-clickhouse.tf
configuration file. -
Make sure the Terraform configuration files are correct using this command:
terraform validate
If there are any errors in the configuration files, Terraform will point them out.
-
Confirm updating the resources.
-
Run the command to view planned changes:
terraform plan
If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
-
If you are happy with the planned changes, apply them:
-
Run the command:
terraform apply
-
Confirm the update of resources.
-
Wait for the operation to complete.
-
All resources described in the configuration file
data-from-kafka-to-clickhouse.tf
will be deleted. -
ClickHouse® is a registered trademark of ClickHouse, Inc