Transferring data to a ClickHouse® target endpoint
Yandex Data Transfer enables you to migrate data to a ClickHouse® database and implement various data transfer, processing, and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Configure one of the supported data sources.
- Prepare the ClickHouse® database for the transfer.
- Configure the target endpoint in Yandex Data Transfer.
- Create a transfer and start it.
- Perform required operations with the database and control the transfer.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data to ClickHouse®
-
Migration: Moving data from one repository to another; it often suggests transferring a database to the cloud, from outdated local databases to managed cloud ones.
-
Data delivery is a process of delivering arbitrary data to target storage. It includes data retrieval from a queue and its deserialization with subsequent transformation to target storage format.
-
Uploading data to data marts is a process of transferring prepared data to storage for subsequent visualization.
- Loading PostgreSQL data to ClickHouse®.
- Loading data from Object Storage to ClickHouse®.
- Loading data from YDB to a ClickHouse® data mart.
For a detailed description of possible Yandex Data Transfer data transfer scenarios, see Tutorials.
Configuring the data source
Configure one of the supported data sources:
- PostgreSQL
- MySQL
- ClickHouse®
- Greenplum®
- Apache Kafka®
- Airbyte®
- Yandex Metrica
- YDS
- Yandex Object Storage
- Oracle
- Elasticsearch
- OpenSearch
For a complete list of supported sources and targets in Yandex Data Transfer, see Available Transfers.
Preparing the target database
-
If you need to transfer multiple databases, create a separate transfer for each one of them.
-
Create a user with access to the target database.
Once started, the transfer will connect to the target on behalf of this user.
-
Assign the created security group to the Managed Service for ClickHouse® cluster.
-
Make sure the settings for the network hosting the cluster allow public connections from IP addresses used by Data Transfer
. -
Create a target database. Its name must be the same as the source database name. If you need to transfer multiple databases, create a separate transfer for each one of them.
-
Create a user with access to the target database.
Once started, the transfer will connect to the target on behalf of this user.
Configuring the ClickHouse® target endpoint
When creating or updating an endpoint, you can define:
- Yandex Managed Service for ClickHouse® cluster connection or custom installation settings, including those based on Yandex Compute Cloud VMs. These are required parameters.
- Additional parameters.
Managed Service for ClickHouse® cluster
Warning
To create or edit an endpoint of a managed database, you need to have the managed-clickhouse.viewer
role or the viewer
primitive role assigned for the folder where this managed database cluster resides.
Connecting to the database with the cluster ID specified in Yandex Cloud.
-
Managed cluster: ID of cluster to connect to.
-
User: Specify the username that Data Transfer will use to connect to the database.
-
Password: Enter the user password to the database.
-
-
Database: Specify the name of the database in the selected cluster.
-
Security groups: Select the cloud network to host the endpoint and security groups for network traffic. This will allow you to apply the specified security group rules to the VMs and clusters in the selected network without changing their settings. For more information, see Networking in Yandex Data Transfer.
Make sure the selected security groups are configured.
- Endpoint type:
clickhouse-target
.
-
--cluster-id
: ID of the cluster you need to connect to. -
--database
: Database name. -
--user
: Username that Data Transfer will use to connect to the database. -
--security-group
: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.Make sure the specified security groups are configured.
-
To set a user password to access the DB, use one of the following parameters:
-
--raw-password
: Password as text. -
--password-file
: The path to the password file.
-
- Endpoint type:
clickhouse_target
.
-
connection.connection_options.mdb_cluster_id
: ID of cluster to connect to. -
subnet_id
: ID of the subnet hosting the cluster. If not specified, the cluster must be accessible from the internet.If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
-
security_groups
: Security groups for network traffic.Security group rules apply to a transfer. They allow opening up network access from the transfer VM to the cluster. For more information, see Networking in Yandex Data Transfer.
Security groups and the
subnet_id
subnet, if the latter is specified, must belong to the same network as the cluster.Note
In Terraform, it is not required to specify a network for security groups.
Make sure the specified security groups are configured.
-
connection.connection_options.database
: Database name. -
connection.connection_options.user
: Username that Data Transfer will use to connect to the database. -
connection.connection_options.password.raw
: Password in text form.
Here is an example of the configuration file structure:
resource "yandex_datatransfer_endpoint" "<endpoint_name_in_Terraform>" {
name = "<endpoint_name>"
settings {
clickhouse_target {
security_groups = ["<list_of_security_group_IDs>"]
subnet_id = "<subnet_ID>"
connection {
connection_options {
mdb_cluster_id = "<cluster_ID>"
database = "<migrated_database_name>"
user = "<username_for_connection>"
password {
raw = "<user_password>"
}
}
}
<additional_endpoint_settings>
}
}
}
For more information, see the Terraform provider documentation
-
securityGroups
: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.Make sure the specified security groups are configured.
-
mdbClusterId
: ID of the cluster you need to connect to. -
database
: Database name. -
user
: Username that Data Transfer will use to connect to the database. -
password.raw
: Database user password (in text form).
Custom installation
Connecting to the database with explicitly specified network addresses and ports.
-
Shards
- Shard: Specify a row that will allow the service to distinguish shards from each other. If sharding is disabled in your custom installation, specify any name.
- Hosts: Specify FQDNs or IP addresses of the hosts in the shard.
-
HTTP port: Set the number of the port that Data Transfer will use for the connection.
When connecting via the HTTP port:
- For optional fields, default values are used (if any).
- Recording complex types is supported (such as
array
andtuple
).
-
Native port: Set the number of the native port that Data Transfer will use for the connection.
-
SSL: Enable if the cluster supports only encrypted connections.
-
CA certificate: If encryption of transmitted data is required, for example, to meet the PCI DSS
requirements, upload the certificate file or add its contents as text. -
Subnet ID: Select or create a subnet in the desired availability zone.
If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
-
User: Specify the username that Data Transfer will use to connect to the database.
-
Password: Enter the user password to the database.
-
Database: Specify the name of the database in the selected cluster.
-
Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
Thus, you will be able to apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
- Endpoint type:
clickhouse-target
.
-
--host
: List of IP addresses or FQDNs of hosts to connect to, in{shard_name}:{host_IP_address_or_FQDN}
format. If sharding is disabled in your custom installation, specify any shard name. -
http-port
: Port number Data Transfer will use for HTTP connections. -
native-port
: Port number Data Transfer will use for connections to the ClickHouse® native interface. -
--ca-certificate
— If the transmitted data needs to be encrypted, for example, to meet the requirements of PCI DSS . -
--subnet-id
: ID of the subnet the host resides in. -
--database
: Database name. -
--user
: Username that Data Transfer will use to connect to the database. -
--security-group
: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer. -
To set a user password to access the DB, use one of the following parameters:
-
--raw-password
: Password as text. -
--password-file
: The path to the password file.
-
- Endpoint type:
clickhouse_target
.
-
Shard settings:
connection.connection_options.on_premise.shards.name
: Shard name that the service will use to distinguish shards from each other. If sharding is disabled in your custom installation, specify any name.connection.connection_options.on_premise.shards.hosts
: specify the FQDNs or IP addresses of the hosts in the shard.
-
connection.connection_options.on_premise.http_port
: Port number that Data Transfer will use for HTTP connections. -
connection.connection_options.on_premise.native_port
: Port number that Data Transfer will use for connections to the ClickHouse® native interface. -
connection.connection_options.on_premise.tls_mode.enabled.ca_certificate
: CA certificate used if the data being transferred must be encrypted to comply with the PCI DSS requirements. -
subnet_id
: ID of the subnet hosting the cluster. If not specified, the cluster must be accessible from the internet.If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
-
security_groups
: Security groups for network traffic.Security group rules apply to a transfer. They allow opening up network access from the transfer VM to the VM with the database. For more information, see Networking in Yandex Data Transfer.
Security groups must belong to the same network as the
subnet_id
subnet, if the latter is specified.Note
In Terraform, it is not required to specify a network for security groups.
-
connection.connection_options.database
: Database name. -
connection.connection_options.user
: Username that Data Transfer will use to connect to the database. -
connection.connection_options.password.raw
: Password in text form.
Here is an example of the configuration file structure:
resource "yandex_datatransfer_endpoint" "<endpoint_name_in_Terraform>" {
name = "<endpoint_name>"
settings {
clickhouse_target {
security_groups = ["<list_of_security_group_IDs>"]
subnet_id = "<subnet_ID>"
connection {
connection_options {
on_premise {
http_port = "<port_for_HTTP_connections>"
native_port = "<port_for_connecting_to_native_interface>"
shards {
name = "<shard_name>"
hosts = [ "list_of_IP_addresses_or_FQDN_hosts_of_the_shard" ]
}
tls_mode {
enabled {
ca_certificate = "<PEM_certificate>"
}
}
}
database = "<migrated_database_name>"
user = "<username_for_connection>"
password {
raw = "<user_password>"
}
}
}
<additional_endpoint_settings>
}
}
}
For more information, see the Terraform provider documentation
onPremise
: Database connection parameters:-
shards
: Shard settings:name
: Shard name the service will use to distinguish shards one from another. If sharding is disabled in your custom installation, specify any name.hosts
: Specify FQDNs or IP addresses of the hosts in the shard.
-
httpPort
: Port number Data Transfer will use for HTTP connections. -
nativePort
: Port number Data Transfer will use for connections to the ClickHouse® native interface. -
tlsMode
: Parameters of encryption of transmitted data if it is required, for example, to meet the requirements of PCI DSS . -
subnetId
: ID of the subnet the host resides in.
-
-
securityGroups
: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer. -
database
: Database name. -
user
: Username that Data Transfer will use to connect to the database. -
password.raw
: Database user password (in text form).
Additional settings
-
Cleanup policy: Select a way to clean up data in the target database before the transfer:
-
Don't cleanup
: Select this option only for replication without data copying. -
Drop
: Completely delete the tables included in the transfer (default).Use this option to always transfer the latest version of the table schema to the target database from the source whenever the transfer is activated.
-
Truncate
: Delete only the data from the tables included in the transfer but keep the schema.Use this option if the schema in the target database differs from the one that would have been transferred from the source during the transfer.
-
-
Sharding settings: Specify the settings for sharding:
-
No sharding: No sharding is used.
-
Sharding by column value: Name of the table column to shard the data by. Uniform distribution between shards will depend on the hash of this column value. Specify the name of the column to be sharded in the appropriate field.
For sharding by specific column values, specify them in the Mapping field. This field defines the mapping between the column and shard index values (the sequential number of the shard in the name-sorted list of shards), to enable sharding by specific data values.
-
Sharding by transfer ID: Data will be distributed between shards based on the transfer ID value. The transfer will ignore the Mapping setting and will only shard the data based on the transfer ID.
-
Uniform random sharding: Data will be randomly distributed between shards. Each shard will contain approximately the same amount of data.
-
-
Rename tables: Specify the settings for renaming tables during a transfer, if required.
-
Flush interval: Specify the delay with which the data should arrive at the target cluster. Increase the value in this field if ClickHouse® fails to merge data parts.
-
--cluster-name
: Name of the cluster to transfer the data to. -
--alt-name
: Rules for renaming the source database tables when transferring them to the target database. The values are specified in<source_table_name>:<target_table_name>
format. -
Data sharding settings:
-
--shard-by-column-hash
: Name of the table column to shard the data by. Uniform distribution between shards will depend on the hash of this column value. -
--custom-sharding-column-name
: Name of the table column to shard the data by. Data sharding is based on the column values specified by the--custom-sharding-mapping-string
setting. -
--custom-sharding-mapping-string
: Mapping of the values from the column specified in the--custom-sharding-column-name
setting and shards. The setting values are specified in<colimn_value>:<shard_name>
format. -
--shard-by-transfer-id
: Data will be distributed between shards based on the transfer ID value. The parameter contains no value.
You can only specify one of the sharding options:
--shard-by-column-hash
--custom-sharding-column-name
and--custom-sharding-mapping-string
--shard-by-transfer-id
-
-
cleanup_policy
: Way to clean up data in the target database before the transfer:-
CLICKHOUSE_CLEANUP_POLICY_DISABLED
: Do not clean up (default).Select this option only for replication without data copying.
-
CLICKHOUSE_CLEANUP_POLICY_DROP
: Completely delete the tables included in the transfer.Use this option to always transfer the latest version of the table schema to the target database from the source whenever the transfer is activated.
-
CLICKHOUSE_CLEANUP_POLICY_TRUNCATE
: Delete only the data from the tables included in the transfer but keep the schema.Use this option if the schema in the target database differs from the one that would have been transferred from the source during the transfer.
-
-
clickhouse_cluster_name
: Name of the cluster to transfer the data to. -
alt_names
: Rules for renaming the source database tables when transferring them to the target database.from_name
: Source table name.to_name
: Target table name.
-
Data sharding settings:
-
sharding.column_value_hash.column_name
: Name of the table column to shard the data by. Uniform distribution between shards will depend on the hash of this column value. -
sharding.transfer_id
: Data is distributed between shards based on the transfer ID value. Thetransfer_id
section contains no parameters. -
sharding.custom_mapping
: Sharding by column value:-
column_name
: Name of the table column to shard the data by. -
mapping
: Mapping of column values and shards:column_value.string_value
: Column value.shard_name
: Shard name.
-
-
sharding.round_robin
: Data will be randomly distributed between shards. Each shard will contain approximately the same amount of data. Theround_robin
section contains no parameters.
You can specify only one of the sharding options:
sharding.column_value_hash.column_name
,sharding.transfer_id
,sharding.custom_mapping
, orsharding.round_robin
. If no sharding option is specified, all data will be transferred to a single shard. -
-
altNames
: Rules for renaming the source database tables when transferring them to the target database.fromName
: Source table name.toName
: Target table name.
-
cleanupPolicy
: Way to clean up data in the target database before the transfer:-
CLICKHOUSE_CLEANUP_POLICY_DISABLED
: Do not clean up (default).Select this option only for replication without data copying.
-
CLICKHOUSE_CLEANUP_POLICY_DROP
: Completely delete the tables included in the transfer.Use this option to always transfer the latest version of the table schema to the target database from the source whenever the transfer is activated.
-
CLICKHOUSE_CLEANUP_POLICY_TRUNCATE
: Delete only the data from the tables included in the transfer but keep the schema.Use this option if the schema in the target database differs from the one that would have been transferred from the source during the transfer.
-
-
sharding
: Settings for data sharding. You can only specify one of the sharding options:-
columnValueHash.columnName
: Name of the table column to shard the data by. Uniform distribution between shards will depend on the hash of this column value. -
customMapping
: Sharding by column value:-
columnName
: Name of the table column to shard the data by. -
mapping
: Mapping of column values and shards:columnValue.stringValue
: Column value.shardName
: Shard name.
-
-
transferId
: Data will be distributed between shards based on the transfer ID value. The parameter contains no value. -
roundRobin
: Data will be randomly distributed between shards. Each shard will contain approximately the same amount of data. The parameter contains no value.
If no sharding option is specified, all data will be transferred to a single shard.
-
-
clickhouseClusterName
: Name of the cluster to transfer the data to.
After configuring the data source and target, create and start the transfer.
Troubleshooting data transfer issues
See a full list of recommendations in the Troubleshooting section.
No new tables are added
No new tables are added to Snapshot and increment transfers.
Solution:
-
Create tables in the target database manually. For the transfer to work, do the following when creating a table:
-
Add the transfer service fields to it:
__data_transfer_commit_time timestamp, __data_transfer_delete_time timestamp
-
Use the
ReplacingMergeTree
engine:ENGINE = ReplacingMergeTree
-
-
Create a separate transfer of the Snapshot and increment type and add only new tables to the list of objects to transfer. Deactivating the original Snapshot and increment transfer is not required. Activate the new transfer, and once it switches to the Replicating status, deactivate it.
To add other tables, put them into the list of objects to transfer in the created separate transfer (replacing other objects in that list), reactivate it, and, once it switches to the Replicating status, deactivate it.
Note
Since two transfers were simultaneously migrating data, you will have duplicate records in the new tables on the target. To hide them, run the
SELECT * from TABLE <table_name> FINAL
query; to delete them,OPTIMIZE TABLE <table_name>
.
Cannot transfer data
An attempt to transfer data from the ClickHouse® source fails with this error:
Syntax error: failed at position 25 ('-'): <error_details>. Expected one of: token, Dot, UUID, alias, AS, identifier, FINAL, SAMPLE, INTO OUTFILE, FORMAT, SETTINGS, end of query
Solution:
Yandex Data Transfer cannot transfer a database if its name contains a hyphen. You need to rename your database, if you can.
ClickHouse® is a registered trademark of ClickHouse, Inc