Transferring data from an S3 source endpoint
Yandex Data Transfer enables you to migrate data from S3 storage to Yandex Cloud managed databases and implement various data processing and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Prepare the S3 database for the transfer.
- Set up a source endpoint in Yandex Data Transfer.
- Set up one of the supported data targets.
- Create a transfer and start it.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data from S3
You can migrate and deliver data from Amazon Simple Storage Service (S3) to managed databases for storage in the cloud, processing, and loading into data marts for visualization.
For a detailed description of possible Yandex Data Transfer scenarios, see Tutorials.
Preparing the S3 database
If you are using a private bucket as a source, grant the read and list permissions to the account you will use for connection.
For more information, see the Airbyte® documentation.
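As an illustration of the read and list permissions mentioned above, here is a minimal sketch of an AWS-style bucket policy built in Python. It assumes the source is Amazon S3 and the connecting account is an IAM user. The bucket name `example-bucket`, the account ARN, and the helper `build_read_list_policy` are hypothetical placeholders; other S3-compatible storages may manage access differently.

```python
import json


def build_read_list_policy(bucket: str, principal_arn: str) -> dict:
    """Build a bucket policy that grants list and read access to one account."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Placeholder values (assumptions, not taken from this guide).
policy = build_read_list_policy(
    "example-bucket",
    "arn:aws:iam::123456789012:user/transfer-reader",
)
print(json.dumps(policy, indent=2))
```

The policy is split into two statements because listing is granted on the bucket resource, while reading is granted on the objects under it.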
Settings
When creating or updating an endpoint, configure access to S3-compatible storage.
- Dataset: Specify the name of an auxiliary table that will be used for the connection.
- Path Pattern: Enter the path pattern. If the bucket only includes files, use the ** value.
- Schema: Specify a JSON schema in {"<column>": "<data type>"} format. Use the {} value for automatic schema detection based on files.
- format: Select a format that matches your files (CSV, parquet, Avro, or JSON Lines).
  - CSV: Specify the settings of CSV files:
    - Delimiter: Delimiter character.
    - Quote char: Character used to quote values that contain reserved characters.
    - Escape char: Character used to escape special characters.
    - Encoding: Encoding of the files.
    - Double quote: Enable this option if two consecutive quote characters inside a quoted value represent a single quote character in the data.
    - Newlines in values: Enable this option if your text data values might include newline characters.
    - Block size: Size of a data chunk used to read data from files, in bytes.
    - Additional reader options: CSV ConvertOptions to override, specified as a JSON string.
    - Advanced options: CSV ReadOptions to override, specified as a JSON string.
  - parquet: Specify the settings of parquet files:
    - Buffer size: Size of the buffer used to deserialize specific parts of columns.
    - Columns: Columns for reading data. Leave this field empty to read all the columns.
    - Batch size: Maximum number of records in a batch.
  - JSON Lines: Specify the settings for JSON Lines:
    - Allow newlines in values: Enable this option to allow newlines in JSON values. This may affect the transfer speed.
    - Unexpected field behavior: Specify how to handle JSON fields outside the explicit_schema (if it is set). For more information, see the PyArrow documentation.
    - Block Size: Specify the size, in bytes, of the block read from each file and processed in memory at a time. If the value is too large, an Out of memory error may occur during the transfer.
- S3: Amazon Web Services: Specify settings for the S3 provider:
  - Bucket: Bucket name.
  - Access Key ID and Secret Access Key: ID and contents of the AWS key used to access a private bucket.
  - (Optional) Path prefix: Prefix of the folders and files to process. It narrows the search within the bucket.
  - (Optional) Endpoint: Endpoint of an S3-compatible service, if you are not using Amazon S3. Leave this field empty to use the Amazon service.
  - Use SSL: Enable to connect to custom servers over HTTPS. It is ignored when using the Amazon service.
  - Verify SSL certificate: Disable to skip verification of the server's SSL certificate, for example, if you use self-signed certificates. It is ignored when using the Amazon service.
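The meaning of the CSV options Delimiter, Quote char, Double quote, and Newlines in values can be illustrated with Python's standard `csv` module. This is only a sketch of the parsing semantics: the endpoint uses its own reader, and the sample data below is made up.

```python
import csv
import io

# Made-up sample: ";" as the delimiter, a doubled quote inside a quoted
# value, and a quoted value that spans two lines.
raw = 'id;comment\n1;"He said ""hi"""\n2;"line one\nline two"\n'

reader = csv.reader(
    io.StringIO(raw),
    delimiter=";",     # corresponds to Delimiter
    quotechar='"',     # corresponds to Quote char
    doublequote=True,  # corresponds to Double quote: "" inside quotes means "
)
rows = list(reader)

for row in rows:
    print(row)
```

With double-quote handling enabled, the second row's comment is read as `He said "hi"`, and the quoted line break in the third row stays inside a single value instead of starting a new record.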
For more information about the settings, see the Airbyte® documentation.
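The Schema, Additional reader options, and Advanced options fields all take JSON values, so it can help to generate them rather than type them by hand. The sketch below makes several assumptions: the column names are hypothetical, the type names follow JSON Schema style (`string`, `integer`, `number`, `boolean`), and the option keys are PyArrow `ConvertOptions` and `ReadOptions` parameter names. Whether the endpoint accepts each key should be checked against the documentation referenced above.

```python
import json


def to_setting(value: dict) -> str:
    """Serialize a dict into a compact JSON string for a settings field."""
    return json.dumps(value, separators=(",", ":"))


# Schema field: {"<column>": "<data type>"} (hypothetical columns).
schema = {"id": "integer", "name": "string", "price": "number", "in_stock": "boolean"}
schema_field = to_setting(schema)

# An empty object requests automatic schema detection based on files.
auto_schema_field = to_setting({})

# Additional reader options: PyArrow ConvertOptions parameters (assumed set).
convert_options_field = to_setting(
    {"strings_can_be_null": True, "null_values": ["NA", "NULL"]}
)

# Advanced options: PyArrow ReadOptions parameters (assumed set).
read_options_field = to_setting({"column_names": ["id", "name", "price", "in_stock"]})

print(schema_field)
print(auto_schema_field)
print(convert_options_field)
print(read_options_field)
```

Serializing with `json.dumps` keeps the quoting and boolean literals (`true` rather than `True`) valid, which is easy to get wrong when writing JSON strings manually.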
Airbyte® is a registered trademark of Airbyte, Inc. in the United States and/or other countries.
Configuring the data target
Configure one of the supported data targets:
- MySQL.
- MongoDB.
- ClickHouse®.
- Greenplum®.
- Yandex Managed Service for YDB.
- Apache Kafka®.
- YDS.
- PostgreSQL.
For a complete list of supported sources and targets in Yandex Data Transfer, see Available Transfers.
Make sure that the network hosting the target cluster is configured to allow connections from the internet. To enable internet access, set up routing.
After configuring the data source and target, create and start the transfer.