Transferring data from a Yandex Object Storage source endpoint
Yandex Data Transfer enables you to migrate data from Yandex Object Storage to Yandex Cloud managed databases and implement various data processing and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Set up a source endpoint in Yandex Data Transfer.
- Set up one of the supported data targets.
- Create a transfer and start it.
- Perform required operations with the storage and control the transfer.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data from Object Storage
You can implement scenarios for migrating and delivering data from Yandex Object Storage to managed databases for storage in the cloud, processing, and loading into data marts for subsequent visualization.
Uploading data to data marts is a process of transferring prepared data to storage for subsequent visualization.
- Loading data from Object Storage to PostgreSQL.
- Loading data from Object Storage to Greenplum®.
- Loading data from Object Storage to MySQL®.
- Loading data from Object Storage to ClickHouse®.
- Loading data from Object Storage to YDB.
For a detailed description of possible Yandex Data Transfer scenarios, see Tutorials.
Configuring the Object Storage source endpoint
This feature is in the Preview stage.
When creating or updating an endpoint, configure access to S3-compatible storage.
Settings
Bucket configuration parameters
- Bucket: Bucket name.
- AWS Access Key ID and AWS Secret Access Key: ID and contents of the AWS key used to access a private bucket.
- (Optional) Path prefix: Prefix for folders and files that may be used when performing a search within the bucket.
- (Optional) Endpoint: Endpoint for an Amazon S3-compatible service. Leave this field empty to use Amazon.
- Region: Region to send requests to.
- Use SSL: Select this option if the remote server uses a secure SSL/TLS connection.
- Verify SSL certificate: Allow self-signed certificates.
Path pattern: Enter the path pattern. If the bucket only includes files, use the `**` value.
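The bucket settings above map to the standard parameters of an S3-compatible client, so you can sanity-check them before creating the endpoint. Below is a minimal sketch, assuming the boto3 library; the bucket name, prefix, endpoint URL, and region are placeholder values, not part of the endpoint configuration itself.

```python
# Sketch: verify bucket access with the same parameters the endpoint will use.
# All values (keys, bucket, prefix, endpoint, region) are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="<AWS Access Key ID>",
    aws_secret_access_key="<AWS Secret Access Key>",
    endpoint_url="https://storage.yandexcloud.net",  # omit to use Amazon S3
    region_name="ru-central1",
    use_ssl=True,   # corresponds to the "Use SSL" option
    verify=True,    # corresponds to the "Verify SSL certificate" option
)

# List objects under the same path prefix the endpoint will scan.
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```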
Event queue configuration
- Queue name: Name of the queue set up in the S3 bucket to get `s3:ObjectCreated` events.
- Queue owner account ID: ID of the AWS account used to create the queue. Leave this field empty if the S3 bucket and queue were created by the same account.
- AWS Access Key ID: ID of the AWS key used as part of the credentials to read from the SQS queue. Leave this field empty if the credentials used for the S3 bucket are applicable.
- AWS Secret Access Key: AWS secret used as part of the credentials to read from the SQS queue. Leave this field empty if the credentials used for the S3 bucket are applicable.
- Endpoint: Endpoint for an S3-compatible service. Leave this field empty to use AWS.
- Region: AWS region to send requests to. Leave this field empty if it matches the bucket region.
- Use SSL: Select this option if the remote server uses a secure SSL/TLS connection.
- Verify SSL certificate: Allow self-signed certificates.
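With AWS-style tooling, the bucket is subscribed to the queue through an event notification configuration. The snippet below is a hedged sketch using boto3 with placeholder names and ARNs; the exact setup for your storage and queue service may differ, and the queue must already have an access policy that lets the bucket send messages to it.

```python
# Sketch: subscribe an SQS-compatible queue to s3:ObjectCreated events.
# The bucket name and queue ARN are placeholders.
import boto3

s3 = boto3.client("s3")  # credentials and endpoint configured as in the bucket example above

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:<region>:<account-id>:<queue-name>",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```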
Data format
- Delimiter: Delimiter character.
- Quote char: Character used to mark the beginning and end of a string.
- Escape char: Character used to escape special characters.
- Encoding: Character encoding.
- Double quote: Select this option to treat two consecutive double quotes as a single quote character.
- Allow line breaks in values: Select this option if your text data values might include newline characters.
- Block size: Maximum length of a file segment loaded into memory when reading data.
- Advanced settings: CSV ConvertOptions settings to override, specified as a JSON string (see the example below).
- Additional reader options: CSV ReadOptions settings to override, specified as a JSON string.
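The advanced CSV settings follow Apache Arrow's pyarrow.csv option names, with the overrides passed as JSON strings. The sketch below shows plausible values; which keys the endpoint actually accepts is an assumption, so check them against the pyarrow.csv.ConvertOptions and pyarrow.csv.ReadOptions documentation.

```python
# Sketch: building JSON strings for "Advanced settings" (ConvertOptions)
# and "Additional reader options" (ReadOptions). Keys mirror pyarrow.csv
# option names; the exact supported subset is an assumption.
import json

convert_options = {
    "null_values": ["", "NULL", "\\N"],               # strings to read as NULL
    "strings_can_be_null": True,
    "column_types": {"id": "int64", "price": "float64"},
}

read_options = {
    "skip_rows": 1,                                   # skip a header row
    "column_names": ["id", "name", "price"],          # explicit column names
}

print(json.dumps(convert_options))  # paste into "Advanced settings"
print(json.dumps(read_options))     # paste into "Additional reader options"
```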
- Allow line breaks in values: Allow newline characters inside JSON values. Enabling this parameter may affect performance. Leave this field empty to use the default `FALSE` value.
- Unexpected field behavior: How to handle JSON fields that fall outside the `explicit_schema` (if one is specified). See the PyArrow documentation.
- Block size: Size, in bytes, of the file segments processed concurrently in memory for each file. If the schema cannot be inferred when handling large amounts of data, increasing this value usually solves the problem; setting it too high may lead to OOM errors (see the example below).
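These JSON settings correspond to PyArrow's JSON reader options. The sketch below, assuming the pyarrow library and a placeholder file and schema, shows how the same knobs behave when reading a JSON Lines file locally.

```python
# Sketch: how the JSON settings above map to PyArrow's JSON reader.
# The file name and schema are placeholders for illustration only.
import pyarrow as pa
import pyarrow.json as pj

parse_options = pj.ParseOptions(
    newlines_in_values=False,           # "Allow line breaks in values" (default FALSE)
    explicit_schema=pa.schema([("id", pa.int64()), ("name", pa.string())]),
    unexpected_field_behavior="infer",  # "Unexpected field behavior": ignore | error | infer
)
read_options = pj.ReadOptions(block_size=1 << 20)  # "Block size" in bytes

table = pj.read_json("sample.jsonl", read_options=read_options, parse_options=parse_options)
print(table.schema)
```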
- Descriptor protobuf: Upload a protobuf descriptor file (the example after this list shows how to inspect the messages it contains).
- Protobuf message packaging type: Specify the message packaging method:
  - `Protoseq`: Delimited sequence of target messages.
  - `Repeated`: Target message is specified in the `repeated` field of a single wrapper message.
  - `Single message`: Single target message.
- Protobuf message name: If the packaging type is `Repeated`, specify the name of the message whose single `repeated` field contains the target message. Otherwise, specify the target message name.
- List of primary keys: List the fields to add to the result as primary keys.
- Included fields: List the message fields to transfer. If not specified, the output will include all the message fields.
- Use NULL value in key columns: Select this option to allow the `null` value in key columns.
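A protobuf descriptor file is a serialized FileDescriptorSet, such as one produced by protoc with the --descriptor_set_out flag. The sketch below, assuming the protobuf Python package and a hypothetical schema.desc file, lists the message names the descriptor contains, which helps when filling in the Protobuf message name setting.

```python
# Sketch: list the message names available in a protobuf descriptor file.
# "schema.desc" is a placeholder; such a file can be generated with, e.g.,
#   protoc --include_imports --descriptor_set_out=schema.desc your_schema.proto
from google.protobuf import descriptor_pb2

with open("schema.desc", "rb") as f:
    descriptor_set = descriptor_pb2.FileDescriptorSet.FromString(f.read())

for proto_file in descriptor_set.file:
    for message in proto_file.message_type:
        full_name = f"{proto_file.package}.{message.name}" if proto_file.package else message.name
        print(full_name)
```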
Dataset
- Schema: Specify the schema of an auxiliary table that will be used for the connection.
- Table: Specify the name of an auxiliary table that will be used for the connection.
- Add system columns: Add the `__file_name` and `__row_index` system columns to the result table schema. `__file_name` matches the name of the source S3 object; `__row_index` matches the count of rows populated with data in the S3 object.

  Warning: Disabling this option may negatively affect data targets that require a primary key, since the result schema may become mandatory in such cases.

- Result table schema: Specify a JSON schema in `{"<column>": "<data_type>"}` format or list the fields for the schema of the resulting table (see the example below). If you select Automatically, the schema will be inferred automatically.
Additionally, you can specify how to process rows that fail the type check: continue the transfer, retry type detection, or abort the transfer.
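A hedged example of a result table schema in this format is shown below; the column names are placeholders, and the exact set of supported data type names should be checked against the endpoint documentation.

```python
# Sketch: a result table schema in {"<column>": "<data_type>"} format.
# Column and type names are illustrative placeholders only.
import json

result_schema = {
    "id": "int64",
    "name": "string",
    "price": "double",
    "created_at": "datetime",
}

print(json.dumps(result_schema, indent=2))
```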
Configuring the data target
Configure one of the supported data targets:
For a complete list of supported sources and targets in Yandex Data Transfer, see Available transfers.
After configuring the data source and target, create and start the transfer.
Troubleshooting data transfer issues
For more troubleshooting tips, see Troubleshooting.
Source data update error
Error message:
Push failed: kind: update not supported
Object Storage only supports inserting new data but does not support updating it. If data is updated at the source, the transfer will fail with the above error.
Solution: Use sources supporting data insertion only or select a target other than Object Storage.