Transferring data from an Elasticsearch source endpoint
Yandex Data Transfer enables you to migrate search and analytics data from an Elasticsearch database and implement various scenarios of data transfer, processing and transformation. To implement a transfer:
- Explore possible data transfer scenarios.
- Prepare the Elasticsearch database for the transfer.
- Set up an endpoint source in Yandex Data Transfer.
- Set up one of the supported data targets.
- Create a transfer and start it.
- Perform required operations with the database and control the transfer.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data from Elasticsearch
Migration: Moving data from one repository to another; it often suggests transferring a database to the cloud, from outdated local databases to managed cloud ones.
For a detailed description of possible Yandex Data Transfer data transfer scenarios, see Tutorials.
Preparing the source database
Make sure the settings for the network hosting the cluster allow public connections from IP addresses used by Data Transfer
Configuring the Elasticsearch source endpoint
When creating or editing an endpoint, you can define:
- custom installation settings, including those based on Yandex Compute Cloud VMs. These are required parameters.
- Additional parameters.
Custom installation
Connecting to nodes with explicitly specified network addresses and ports.
-
Data nodes: Click
to add a new data node. For each node, specify:-
Host: IP address or FQDN of the host with the
DATA
role you want to connect to. -
Port: Port number Data Transfer will use for connections to the
DATA
host.
-
-
SSL: Select this option if a secure SSL connection is used.
-
CA certificate: Upload the certificate file or add its contents as text if transmitted data must be encrypted, for example, to meet PCI DSS
requirements. -
Subnet ID: Select or create a subnet in the required availability zone.
If the value in this field is specified for both endpoints, both subnets must be hosted in the same availability zone.
-
User: Specify the username Data Transfer will use to connect to the cluster.
-
Password: Enter the user password to the cluster.
-
Security groups: Select the cloud network to host the endpoint and security groups for network traffic.
Thus, you will be able to apply the specified security group rules to the VMs and clusters in the selected network without changing the settings of these VMs and clusters. For more information, see Networking in Yandex Data Transfer.
Additional settings
- Dump an index with type mapping: Select this option to move data types from a source to a target before a transfer is started. If the option is disabled and no index schema is set on the target, data types on the target will be identified automatically during a transfer.
Warning
If a source index includes data types that are not supported on the target, enabling this option may cause a transfer run error. In this case, disable the option and create an index schema on the target manually.
Configuring the data target
Configure the target endpoint:
- OpenSearch.
- ClickHouse®.
- Greenplum®.
- Yandex Managed Service for YDB.
- Yandex Object Storage.
- Apache Kafka®.
- YDS.
For a complete list of supported sources and targets in Yandex Data Transfer, see Available Transfers.
After configuring the data source and target, create and start the transfer.
Troubleshooting data transfer issues
See a full list of recommendations in the Troubleshooting section.
Transfer failure
Error messages:
object field starting or ending with a [.] makes object resolution ambiguous <field_description>
Index -1 out of bounds for length 0
The transfer is aborted because the keys in the documents being transferred are not valid for the Elasticsearch target. Invalid keys are empty keys and keys that:
- Consist of spaces.
- Consist of periods.
- Have a period at the beginning or end.
- Have two or more periods in a row.
- Include periods separated by spaces.
Solution:
In the target endpoint additional settings, enable Sanitize documents keys and reactivate the transfer.
Document duplication on the target
When repeatedly transferring data, documents get duplicated on the target.
All documents transferred from the same source table fall under the same index named <schemaName.tableName>
on the target. In this case, the target automatically generates document IDs (_id
) by default. As a result, identical documents are assigned different IDs and get duplicated.
There is no duplication if the primary keys are specified in the source table or endpoint conversion rules. Document IDs are then generated at the transfer stage using the primary key values.
Generation is performed as follows:
- If the key value contains a period (
.
), it is escaped with\
:some.key
-->some\.key
. - All the primary key values are converted into a string:
<some_key1>.<some_key2>.<...>
. - The resulting string is converted by the url.QueryEscape
function. - If the length of the resulting string does not exceed 512 characters, it is used as the
_id
. If it is longer than 512 characters, it is hashed with SHA-1 and the resulting hash is used as the_id
.
As a result, documents with the same primary keys will receive the same ID when the data is transferred again, and the document transferred last will overwrite the existing one.
Solution:
- Set the primary key for one or more columns in the source table or in the endpoint conversion rules.
- Run the transfer.