Transferring data from a ClickHouse® source endpoint
Yandex Data Transfer enables you to migrate data from a ClickHouse® database and implement various data transfer, processing, and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Prepare the ClickHouse® database for the transfer.
- Set up a source endpoint in Yandex Data Transfer.
- Set up one of the supported data targets.
- Create a transfer and start it.
- Perform the required operations with the database and see how the transfer is going.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data from ClickHouse®
Migration: Moving data from one storage to another. Migration often means migrating a database from obsolete local databases to managed cloud ones.
For a detailed description of possible Yandex Data Transfer scenarios, see Tutorials.
Preparing the source database
Note
Yandex Data Transfer cannot transfer a ClickHouse® database if its name contains a hyphen.
If transferring tables with engines other than ReplicatedMergeTree
and Distributed
in a ClickHouse® multi-host cluster, the transfer will fail with the following error: the following tables have not Distributed or Replicated engines and are not yet supported
.
-
Make sure the tables you are transferring use the
MergeTree
family engines. Only these tables and materialized views (MaterializedView) will be transferred.In case of a multi-host cluster, only tables and materialized views with the
ReplicatedMergeTree
orDistributed
engines will be transferred. Make sure these tables and views are present on all the cluster hosts. -
Create a user with access to the source database. In the user settings, specify a value of at least
1000000
for the Max execution time parameter.
-
Make sure the tables you are transferring use the
MergeTree
family engines. Only these tables and materialized views (MaterializedView) will be transferred.In case of a multi-host cluster, only tables and materialized views with the
ReplicatedMergeTree
orDistributed
engines will be transferred. Make sure these tables and views are present on all the cluster hosts. -
If not planning to use Cloud Interconnect or VPN
for connections to an external cluster, make such cluster accessible from the Internet from IP addresses used by Data Transfer .For details on linking your network up with external resources, see this concept.
-
Create a user with access to the source database. In the user settings, specify a value of at least
1000000
for the Max execution time parameter.
Configuring the ClickHouse® source endpoint
When creating or updating an endpoint, you can define:
- Yandex Managed Service for ClickHouse® cluster connection or custom installation settings, including those based on Yandex Compute Cloud VMs. These are required parameters.
- Additional parameters.
Managed Service for ClickHouse® cluster
Warning
To create or edit an endpoint of a managed database, you need to have the managed-clickhouse.viewer
role or the viewer
primitive role assigned for the folder where this managed database cluster resides.
Connecting to the database with the cluster ID specified in Yandex Cloud.
-
Managed cluster: From the list, select the name of the cluster you want to connect to.
-
clickhouseClusterName
: Specify the shard group to transfer the data from. If this value is not set, data from all shards will be transferred. -
User: Specify the username that Data Transfer will use to connect to the database.
-
Password: Enter the user's password to the database.
-
Database: Specify the name of the database in the selected cluster.
-
Security groups: Select the cloud network to host the endpoint and security groups for network traffic. This will allow you to apply the specified security group rules to the VMs and clusters in the selected network without changing their settings. For more information, see Networking in Yandex Data Transfer.
Make sure the selected security groups are configured.
- Endpoint type:
clickhouse-source
.
-
--cluster-id
: ID of the cluster you need to connect to. -
--cluster-name
: Shard group to transfer the data from. If this parameter is not set, data from all shards will be transferred. -
--database
: Database name. -
--user
: Username that Data Transfer will use to connect to the database. -
--security-group
: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.Make sure the specified security groups are configured.
-
To set a user password to access the DB, use one of the following parameters:
-
--raw-password
: Password as text. -
--password-file
: The path to the password file.
-
- Endpoint type:
clickhouse_source
.
-
connection.connection_options.mdb_cluster_id
: ID of cluster to connect to. -
clickhouse_cluster_name
: Shard group to transfer the data from. If this parameter is not set, data from all shards will be transferred. -
subnet_id
: ID of the subnet the cluster is in. The transfer will use this subnet to access the cluster. If the ID is not specified, the cluster must be accessible from the internet.If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
-
security_groups
: Security groups for network traffic.Security group rules apply to a transfer. They allow opening up network access from the transfer VM to the cluster. For more information, see Networking in Yandex Data Transfer.
Security groups and the
subnet_id
subnet, if the latter is specified, must belong to the same network as the cluster.Note
In Terraform, it is not required to specify a network for security groups.
Make sure the specified security groups are configured.
-
connection.connection_options.database
: Database name. -
connection.connection_options.user
: Username that Data Transfer will use to connect to the database. -
connection.connection_options.password.raw
: Password in text form.
Here is an example of the configuration file structure:
resource "yandex_datatransfer_endpoint" "<endpoint_name_in_Terraform>" {
name = "<endpoint_name>"
settings {
clickhouse_source {
clickhouse_cluster_name="<shard_group>"
security_groups = ["<list_of_security_group_IDs>"]
subnet_id = "<subnet_ID>"
connection {
connection_options {
mdb_cluster_id = "<cluster_ID>"
database = "<name_of_database_to_migrate>"
user = "<username_for_connection>"
password {
raw = "<user_password>"
}
}
}
<additional_endpoint_settings>
}
}
}
For more information, see the Terraform provider documentation
-
securityGroups
: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer.Make sure the specified security groups are configured.
-
mdbClusterId
: ID of the cluster you need to connect to. -
clickhouseClusterName
: Shard group to transfer the data from. If this parameter is not set, data from all shards will be transferred. -
database
: Database name. -
user
: Username that Data Transfer will use to connect to the database. -
password.raw
: Database user password (in text form).
Custom installation
Connecting to the database with explicitly specified network addresses and ports.
-
Shards
- Shard: Specify a row that will allow the service to distinguish shards from each other. If sharding is disabled in your custom installation, specify any name.
- Hosts: Specify FQDNs or IP addresses of the hosts in the shard.
-
HTTP port: Set the number of the port that Data Transfer will use for the connection.
When connecting via the HTTP port:
- For optional fields, default values are used (if any).
- Recording complex types is supported (such as
array
andtuple
).
-
Native port: Set the number of the native port that Data Transfer will use for the connection.
-
SSL: Enable if the cluster supports only encrypted connections.
-
CA certificate: If encryption of transmitted data is required, for example, to meet the PCI DSS
requirements, upload the certificate file or add its contents as text. -
Subnet ID: Select or create a subnet in the required availability zone. The transfer will use this subnet to access the cluster.
If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
-
User: Specify the username that Data Transfer will use to connect to the database.
-
Password: Enter the user's password to the database.
-
Database: Specify the name of the database in the selected cluster.
-
Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
Thus, you will be able to apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
- Endpoint type:
clickhouse-source
.
-
--cluster-name
: Name of the cluster to transfer the data from. -
--host
: List of IP addresses or FQDNs of hosts to connect to, in{shard_name}:{host_IP_address_or_FQDN}
format. If sharding is disabled in your custom installation, specify any shard name. -
http-port
: Port number Data Transfer will use for HTTP connections. -
native-port
: Port number Data Transfer will use for connections to the ClickHouse® native interface. -
--ca-certificate
— If the transmitted data needs to be encrypted, for example, to meet the requirements of PCI DSS . -
--subnet-id
: ID of the subnet the host is in. The transfer will use that subnet to access the host. -
--database
: Database name. -
--user
: Username that Data Transfer will use to connect to the database. -
--security-group
: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer. -
To set a user password to access the DB, use one of the following parameters:
-
--raw-password
: Password as text. -
--password-file
: The path to the password file.
-
- Endpoint type:
clickhouse_source
.
-
Shard settings:
connection.connection_options.on_premise.shards.name
: Shard name that the service will use to distinguish shards from each other. If sharding is disabled in your custom installation, specify any name.connection.connection_options.on_premise.shards.hosts
: specify the FQDNs or IP addresses of the hosts in the shard.
-
connection.connection_options.on_premise.http_port
: Port number that Data Transfer will use for HTTP connections. -
connection.connection_options.on_premise.native_port
: Port number that Data Transfer will use for connections to the ClickHouse® native interface. -
connection.connection_options.on_premise.tls_mode.enabled.ca_certificate
: CA certificate used if the data being transferred must be encrypted to comply with the PCI DSS requirements. clickhouse_cluster_name
: Name of the cluster to transfer the data from.-
subnet_id
: ID of the subnet the cluster is in. The transfer will use this subnet to access the cluster. If the ID is not specified, the cluster must be accessible from the internet.If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
-
security_groups
: Security groups for network traffic.Security group rules apply to a transfer. They allow opening up network access from the transfer VM to the VM with the database. For more information, see Networking in Yandex Data Transfer.
Security groups must belong to the same network as the
subnet_id
subnet, if the latter is specified.Note
In Terraform, it is not required to specify a network for security groups.
-
connection.connection_options.database
: Database name. -
connection.connection_options.user
: Username that Data Transfer will use to connect to the database. -
connection.connection_options.password.raw
: Password in text form.
Here is an example of the configuration file structure:
resource "yandex_datatransfer_endpoint" "<endpoint_name_in_Terraform>" {
name = "<endpoint_name>"
settings {
clickhouse_source {
clickhouse_cluster_name="<cluster_name>"
security_groups = ["<list_of_security_group_IDs>"]
subnet_id = "<subnet_ID>"
connection {
connection_options {
on_premise {
http_port = "<HTTP_connection_port>"
native_port = "<port_for_native_interface_connection>"
shards {
name = "<shard_name>"
hosts = [ “list of IP addresses and FQDNs of shard hosts" ]
}
tls_mode {
enabled {
ca_certificate = "<certificate_in_PEM_format>"
}
}
}
database = "<name_of_database_to_migrate>"
user = "<username_for_connection>"
password {
raw = "<user_password>"
}
}
}
<additional_endpoint_settings>
}
}
}
For more information, see the Terraform provider documentation
onPremise
: Database connection parameters:-
shards
: Shard settings:name
: Shard name the service will use to distinguish shards one from another. If sharding is disabled in your custom installation, specify any name.hosts
: Specify FQDNs or IP addresses of the hosts in the shard.
-
httpPort
: Port number Data Transfer will use for HTTP connections. -
nativePort
: Port number Data Transfer will use for connections to the ClickHouse® native interface. -
tlsMode
: Parameters of encryption of transmitted data if it is required, for example, to meet the requirements of PCI DSS . -
subnetId
: ID of the subnet the host is in. The transfer will use that subnet to access the host.
-
clickhouseClusterName
: Name of the cluster to transfer the data from.-
securityGroups
: Security groups for network traffic, whose rules will apply to VMs and clusters without changing their settings. For more information, see Networking in Yandex Data Transfer. -
database
: Database name. -
user
: Username that Data Transfer will use to connect to the database. -
password.raw
: Database user password (in text form).
Table filter
-
Included tables: Data is only transferred from listed tables.
When you add new tables when editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the data history for these tables will not get uploaded. To add a table with its historical data, use the List of objects for transfer field in the transfer settings.
-
Excluded tables: Data from the listed tables is not transferred.
Included and excluded table names must meet the ID naming rules in ClickHouse®. For more information, see the ClickHouse® documentation
Leave the lists empty to transfer all the tables.
-
--include_table
: List of included tables. Only data from the tables listed here will be transferred.When you add new tables when editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the data history for these tables will not get uploaded. To add a table with its historical data, use the List of objects for transfer field in the transfer settings.
-
--exclude_table
: List of excluded tables. Data from the listed tables will not be transferred.
If no lists are specified, data from all tables will be transferred.
-
include_tables
: List of included tables. Only data from the tables listed here will be transferred.When you add new tables when editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the data history for these tables will not get uploaded. To add a table with its historical data, use the List of objects for transfer field in the transfer settings.
-
exclude_tables
: List of excluded tables. Data from the listed tables will not be transferred.
If no lists are specified, data from all tables will be transferred.
-
includeTables
: List of included tables. Only data from the tables listed here will be transferred.When you add new tables when editing an endpoint used in Snapshot and increment or Replication transfers with the Replicating status, the data history for these tables will not get uploaded. To add a table with its historical data, use the List of objects for transfer field in the transfer settings.
-
excludeTables
: List of excluded tables. Data from the listed tables will not be transferred.
If no lists are specified, data from all tables will be transferred.
Configuring the data target
Configure the target endpoint:
For a complete list of supported sources and targets in Yandex Data Transfer, see Available transfers.
After configuring the data source and target, create and start the transfer.
Troubleshooting data transfer issues
For more troubleshooting tips, see Troubleshooting.
New tables cannot be added
No new tables are added to Snapshot and increment transfers.
Solution:
-
Create tables in the target database manually. For the transfer to work, do the following when creating a table:
-
Add the transfer service fields to it:
__data_transfer_commit_time timestamp, __data_transfer_delete_time timestamp
-
Use
ReplacingMergeTree
:ENGINE = ReplacingMergeTree
-
-
Create a separate transfer of the Snapshot and increment type and add only new tables to the list of objects to transfer. Deactivating the original Snapshot and increment transfer is not required. Activate the new transfer, and once it switches to the Replicating status, deactivate it.
To add other tables, put them into the list of objects to transfer in the created separate transfer (replacing other objects in that list), reactivate it, and, once it switches to the Replicating status, deactivate it.
Note
Since two transfers were simultaneously migrating data, you will have duplicate records in the new tables on the target. Run the
SELECT * from TABLE <table_name> FINAL
query to hide duplicate records orOPTIMIZE TABLE <table_name>
to delete them.
Data is not transferred
An attempt to transfer data from a ClickHouse® source fails with this error:
Syntax error: failed at position 25 ('-'): <error_details>. Expected one of: token, Dot, UUID, alias, AS, identifier, FINAL, SAMPLE, INTO OUTFILE, FORMAT, SETTINGS, end of query
Solution:
Yandex Data Transfer cannot transfer a database if its name contains a hyphen. You need to rename your database, if you can.
Unsupported date range
If the migrated data contains dates outside the supported ranges, ClickHouse® returns the following error:
TYPE_ERROR [target]: failed to run (abstract1 source): failed to push items from 0 to 1 in batch:
Push failed: failed to push 1 rows to ClickHouse shard 0:
ClickHouse Push failed: Unable to exec changeItem: clickhouse:
dateTime <field_name> must be between 1900-01-01 00:00:00 and 2262-04-11 23:47:16
Supported date ranges in ClickHouse®:
- For the
DateTime64
type fields: 1900-01-01 to 2299-12-31. For more information, see the ClickHouse® documentation . - For the
DateTime
type fields: 1970-01-01 to 2106-02-07. For more information, see the ClickHouse® documentation .
Solution: use one of the following options:
- Convert all dates in the source DB to a range supported by ClickHouse®.
- In the source endpoint parameters, exclude the table with incorrect dates from the transfer.
- In the transfer parameters, specify the Convert values to string transformer. This will change the field type during the transfer.
ClickHouse® is a registered trademark of ClickHouse, Inc