Transferring data from a Yandex Object Storage source endpoint
Yandex Data Transfer enables you to migrate data from Yandex Object Storage to Yandex Cloud managed databases and implement various data processing and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Set up a source endpoint in Yandex Data Transfer.
- Set up one of the supported data targets.
- Create a transfer and start it.
- Perform required operations with the storage and control the transfer.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data from Object Storage
You can implement scenarios for migrating and delivering data from Yandex Object Storage to managed databases for storage in the cloud, processing, and loading into data marts for subsequent visualization.
Uploading data to data marts is a process of transferring prepared data to storage for subsequent visualization.
- Loading data from Object Storage to PostgreSQL.
- Loading data from Object Storage to Greenplum®.
- Loading data from Object Storage to MySQL®.
- Loading data from Object Storage to ClickHouse®.
- Loading data from Object Storage to YDB.
For a detailed description of possible Yandex Data Transfer scenarios, see Tutorials.
Configuring the Object Storage source endpoint
This feature is in the Preview stage.
When creating or updating an endpoint, configure access to S3-compatible storage.
Settings
Bucket configuration parameters
- Bucket: Bucket name.
- AWS Access Key ID and AWS Secret Access Key: ID and contents of the AWS key used to access a private bucket.
- (Optional) Path prefix: Prefix for folders and files that may be used when performing a search within the bucket.
- (Optional) Endpoint: Endpoint for an Amazon S3-compatible service. Leave this field empty to use Amazon.
- Region: Region to send requests to.
- Use SSL: Select this option if the remote server uses a secure SSL/TLS connection.
- Verify SSL certificate: Allow self-signed certificates.
Path pattern: Enter the path pattern. If the bucket only includes files, use the `**` value.
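The bucket settings above map to the standard parameters of an S3-compatible client, so you can sanity-check them before creating the endpoint. Below is a minimal sketch, assuming the boto3 library; the bucket name, prefix, endpoint URL, and region are placeholder values, not part of the endpoint configuration itself.

```python
# Sketch: verify bucket access with the same parameters the endpoint will use.
# All values (keys, bucket, prefix, endpoint, region) are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="<AWS Access Key ID>",
    aws_secret_access_key="<AWS Secret Access Key>",
    endpoint_url="https://storage.yandexcloud.net",  # omit to use Amazon S3
    region_name="ru-central1",
    use_ssl=True,   # corresponds to the "Use SSL" option
    verify=True,    # corresponds to the "Verify SSL certificate" option
)

# List objects under the same path prefix the endpoint will scan.
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```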
Event queue configuration
- Queue name: Name of the queue set up in the S3 bucket to get `s3:ObjectCreated` events.
- Queue owner account ID: ID of the AWS account used to create the queue. Leave this field empty if the S3 bucket and queue were created by the same account.
- AWS Access Key ID: ID of the AWS key used as part of the credentials to read from the SQS queue. Leave this field empty if the credentials used for the S3 bucket are applicable.
- AWS Secret Access Key: AWS secret used as part of the credentials to read from the SQS queue. Leave this field empty if the credentials used for the S3 bucket are applicable.
- Endpoint: Endpoint for an S3-compatible service. Leave this field empty to use AWS.
- Region: AWS region to send requests to. Leave this field empty if it matches the bucket region.
- Use SSL: Select this option if the remote server uses a secure SSL/TLS connection.
- Verify SSL certificate: Allow self-signed certificates.
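With AWS-style tooling, the bucket is subscribed to the queue through an event notification configuration. The snippet below is a hedged sketch using boto3 with placeholder names and ARNs; the exact setup for your storage and queue service may differ, and the queue must already have an access policy that lets the bucket send messages to it.

```python
# Sketch: subscribe an SQS-compatible queue to s3:ObjectCreated events.
# The bucket name and queue ARN are placeholders.
import boto3

s3 = boto3.client("s3")  # credentials and endpoint configured as in the bucket example above

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:<region>:<account-id>:<queue-name>",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```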
Data format
- Delimiter: Delimiter character.
- Quote char: Character used to mark the beginning and end of a string.
- Escape char: Character used to escape special characters.
- Encoding: Character encoding.
- Double quote: Select this option to treat two consecutive double quotes as a single quote character.
- Allow line breaks in values: Select this option if your text data values might include newline characters.
- Block size: Maximum length of a file segment loaded into memory when reading data.
- Advanced settings: CSV ConvertOptions settings to override, specified as a JSON string (see the example below).
- Additional reader options: CSV ReadOptions settings to override, specified as a JSON string.
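The advanced CSV settings follow Apache Arrow's pyarrow.csv option names, with the overrides passed as JSON strings. The sketch below shows plausible values; which keys the endpoint actually accepts is an assumption, so check them against the pyarrow.csv.ConvertOptions and pyarrow.csv.ReadOptions documentation.

```python
# Sketch: building JSON strings for "Advanced settings" (ConvertOptions)
# and "Additional reader options" (ReadOptions). Keys mirror pyarrow.csv
# option names; the exact supported subset is an assumption.
import json

convert_options = {
    "null_values": ["", "NULL", "\\N"],               # strings to read as NULL
    "strings_can_be_null": True,
    "column_types": {"id": "int64", "price": "float64"},
}

read_options = {
    "skip_rows": 1,                                   # skip a header row
    "column_names": ["id", "name", "price"],          # explicit column names
}

print(json.dumps(convert_options))  # paste into "Advanced settings"
print(json.dumps(read_options))     # paste into "Additional reader options"
```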
- Allow line breaks in values: Allow newline characters inside JSON values. Enabling this parameter may affect performance. Leave this field empty to use the default `FALSE` value.
- Unexpected field behavior: How to handle JSON fields that fall outside the `explicit_schema` (if one is specified). See the PyArrow documentation.
- Block size: Size, in bytes, of the file segments processed concurrently in memory for each file. If the schema cannot be inferred when handling large amounts of data, increasing this value usually solves the problem; setting it too high may lead to OOM errors (see the example below).
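These JSON settings correspond to PyArrow's JSON reader options. The sketch below, assuming the pyarrow library and a placeholder file and schema, shows how the same knobs behave when reading a JSON Lines file locally.

```python
# Sketch: how the JSON settings above map to PyArrow's JSON reader.
# The file name and schema are placeholders for illustration only.
import pyarrow as pa
import pyarrow.json as pj

parse_options = pj.ParseOptions(
    newlines_in_values=False,           # "Allow line breaks in values" (default FALSE)
    explicit_schema=pa.schema([("id", pa.int64()), ("name", pa.string())]),
    unexpected_field_behavior="infer",  # "Unexpected field behavior": ignore | error | infer
)
read_options = pj.ReadOptions(block_size=1 << 20)  # "Block size" in bytes

table = pj.read_json("sample.jsonl", read_options=read_options, parse_options=parse_options)
print(table.schema)
```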
- Descriptor protobuf: Upload a protobuf descriptor file (the example after this list shows how to inspect the messages it contains).
- Protobuf message packaging type: Specify the message packaging method:
  - `Protoseq`: Delimited sequence of target messages.
  - `Repeated`: Target message is specified in the `repeated` field of a single wrapper message.
  - `Single message`: Single target message.
- Protobuf message name: If the packaging type is `Repeated`, specify the name of the message whose single `repeated` field contains the target message. Otherwise, specify the target message name.
- List of primary keys: List the fields to add to the result as primary keys.
- Included fields: List the message fields to transfer. If not specified, the output will include all the message fields.
- Use NULL value in key columns: Select this option to allow the `null` value in key columns.
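A protobuf descriptor file is a serialized FileDescriptorSet, such as one produced by protoc with the --descriptor_set_out flag. The sketch below, assuming the protobuf Python package and a hypothetical schema.desc file, lists the message names the descriptor contains, which helps when filling in the Protobuf message name setting.

```python
# Sketch: list the message names available in a protobuf descriptor file.
# "schema.desc" is a placeholder; such a file can be generated with, e.g.,
#   protoc --include_imports --descriptor_set_out=schema.desc your_schema.proto
from google.protobuf import descriptor_pb2

with open("schema.desc", "rb") as f:
    descriptor_set = descriptor_pb2.FileDescriptorSet.FromString(f.read())

for proto_file in descriptor_set.file:
    for message in proto_file.message_type:
        full_name = f"{proto_file.package}.{message.name}" if proto_file.package else message.name
        print(full_name)
```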
Dataset
- Schema: Specify the schema of an auxiliary table that will be used for the connection.
- Table: Specify the name of an auxiliary table that will be used for the connection.
- Add system columns: Add the `__file_name` and `__row_index` system columns to the result table schema. `__file_name` matches the name of the source S3 object; `__row_index` matches the count of rows populated with data in the S3 object.

  Warning: Disabling this option may negatively affect data targets that require a primary key, since the result schema may become mandatory in such cases.

- Result table schema: Specify a JSON schema in `{"<column>": "<data_type>"}` format or list the fields for the schema of the resulting table (see the example below). If you select Automatically, the schema will be inferred automatically.
Additionally, you can specify how to process rows that fail the type check: continue the transfer, retry type detection, or abort the transfer.
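A hedged example of a result table schema in this format is shown below; the column names are placeholders, and the exact set of supported data type names should be checked against the endpoint documentation.

```python
# Sketch: a result table schema in {"<column>": "<data_type>"} format.
# Column and type names are illustrative placeholders only.
import json

result_schema = {
    "id": "int64",
    "name": "string",
    "price": "double",
    "created_at": "datetime",
}

print(json.dumps(result_schema, indent=2))
```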
Configuring the data target
Configure one of the supported data targets:
For a complete list of supported sources and targets in Yandex Data Transfer, see Available transfers.
After configuring the data source and target, create and start the transfer.
Troubleshooting data transfer issues
For more troubleshooting tips, see Troubleshooting.
Source data update error
Error message:
Push failed: kind: update not supported
Object Storage only supports inserting new data but does not support updating it. If data is updated at the source, the transfer will fail with the above error.
Solution: Use sources supporting data insertion only or select a target other than Object Storage.