Transferring data from a Yandex Data Streams source endpoint
Yandex Data Transfer enables you to migrate data from a Data Streams queue and implement various data transfer, processing, and transformation scenarios. To implement a transfer:
- Explore possible data transfer scenarios.
- Prepare the Data Streams database for the transfer.
- Set up a source endpoint in Yandex Data Transfer.
- Set up one of the supported data targets.
- Create a transfer and start it.
- Perform the required operations with the database and monitor the transfer.
- In case of any issues, use ready-made solutions to resolve them.
Scenarios for transferring data from Data Streams
- Migration: Moving data from one storage to another. Migration often means moving a database from obsolete local systems to managed cloud databases.
  Mirroring data across Data Streams queues is a separate migration task.
- Data delivery: A process of delivering arbitrary data to a target storage. It includes retrieving data from a queue, deserializing it, and transforming it to the target storage format.
For a detailed description of possible Yandex Data Transfer data transfer scenarios, see Tutorials.
Preparing the source database
- Create a service account with the `yds.editor` role.
- (Optional) Create a processing function.
Processing function example
```javascript
const yc = require("yandex-cloud");
const { Parser } = require("@robojones/nginx-log-parser");

module.exports.handler = async function (event, context) {
    const schema =
        '$remote_addr - $remote_user [$time_local] "$request" $status $bytes_sent "$http_referer" "$http_user_agent"';
    const parser = new Parser(schema);

    return {
        Records: event.Records.map((record) => {
            const decodedData = new Buffer(record.kinesis.data, "base64")
                .toString("ascii")
                .trim();
            try {
                const result = parser.parseLine(decodedData);
                if (result.request == "") {
                    // empty request - drop message
                    return {
                        eventID: record.eventID,
                        invokeIdentityArn: record.invokeIdentityArn,
                        eventVersion: record.eventVersion,
                        eventName: record.eventName,
                        eventSourceARN: record.eventSourceARN,
                        result: "Dropped"
                    };
                }
                return {
                    // successfully parsed message
                    eventID: record.eventID,
                    invokeIdentityArn: record.invokeIdentityArn,
                    eventVersion: record.eventVersion,
                    eventName: record.eventName,
                    eventSourceARN: record.eventSourceARN,
                    kinesis: {
                        data: new Buffer(JSON.stringify(result)).toString("base64"),
                    },
                    result: "Ok"
                };
            } catch (err) {
                // error - fail message
                return {
                    eventID: record.eventID,
                    invokeIdentityArn: record.invokeIdentityArn,
                    eventVersion: record.eventVersion,
                    eventName: record.eventName,
                    eventSourceARN: record.eventSourceARN,
                    result: "ProcessingFailed",
                };
            }
        })
    };
};
```
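To sanity-check such a function before deploying it, you can call the handler locally with a synthetic event. The sketch below is only an illustration: the file names (`index.js`, `test.js`) and the sample log line are assumptions, and the event simply mirrors the fields the example handler above reads (`Records[].kinesis.data` with base64-encoded payloads).
```javascript
// test.js: minimal local check of the processing function above (illustrative only).
// Assumes the function above is saved as index.js in the same directory.
const { handler } = require("./index");

// A synthetic nginx access log line matching the schema used in the handler.
const logLine =
    '203.0.113.10 - - [01/Jan/2024:00:00:00 +0300] "GET /index.html HTTP/1.1" 200 512 "-" "curl/8.0"';

const event = {
    Records: [
        {
            eventID: "1",
            invokeIdentityArn: "",
            eventVersion: "1.0",
            eventName: "test",
            eventSourceARN: "",
            // The handler expects the raw queue data to be base64-encoded.
            kinesis: { data: Buffer.from(logLine).toString("base64") },
        },
    ],
};

handler(event, {}).then((response) => {
    const record = response.Records[0];
    if (record.result === "Ok") {
        // Decode the transformed payload returned by the handler.
        console.log(Buffer.from(record.kinesis.data, "base64").toString());
    } else {
        console.log("Record was not parsed:", record.result);
    }
});
```
Assuming the dependencies used by the function (yandex-cloud, @robojones/nginx-log-parser) are installed locally, running node test.js should print either the transformed JSON payload or the processing result for the record.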
- (Optional) Prepare a data schema file in JSON format.
  Sample file with a data schema (a filled-in example follows the list of supported types below):
```json
[
    {
        "name": "<field_name>",
        "type": "<type>"
    },
    ...
    {
        "name": "<field_name>",
        "type": "<type>"
    }
]
```
Supported types:
- any
- boolean
- datetime
- double
- int8
- int16
- int32
- int64
- string
- uint8
- uint16
- uint32
- uint64
- utf8
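For example, a schema file describing the fields produced by the nginx log processing function above might look like the sketch below; the field names and types here are purely illustrative and must match the fields actually present in your messages:
```json
[
    { "name": "remote_addr", "type": "string" },
    { "name": "request", "type": "string" },
    { "name": "status", "type": "int32" },
    { "name": "bytes_sent", "type": "int64" },
    { "name": "http_user_agent", "type": "utf8" }
]
```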
Configuring the Data Streams source endpoint
When creating or editing an endpoint, you can define:
- Stream connection settings in Yandex Data Streams. These are required parameters.
- Additional settings.
Basic settings
- Database: Select a Yandex Managed Service for YDB database registered in Yandex Data Streams as a source.
- Stream: Specify the name of the data stream associated with the database.
- Service account: Select or create a service account with the `yds.editor` role that Data Transfer will use to connect to the data source.
- Security groups: Select the cloud network you want to place your endpoint in. You can manage cloud networks in Virtual Private Cloud.
Advanced settings
Use advanced settings to specify transformation and conversion rules. Data is processed in the following order:
- Transformation. Data in JSON format is passed to a Yandex Cloud Functions function. The function body contains metadata and raw data from the queue. The function processes the data and returns it to Data Transfer.
- Conversion. Data is parsed in preparation for delivery to the target.
If no transformation rules are set, parsing is applied to the raw data from the queue. If no conversion rules are set, the data goes directly to the target.
- Transformation rules:
  - Cloud function: Select one of the functions created in Cloud Functions.
  - Service account: Select or create a service account that the processing function will run under.
  - Number of retries: Set the number of attempts to invoke the processing function.
  - Buffer size for function: Set the buffer size (in bytes). When the buffer is full, the data is transferred to the processing function.
    The maximum buffer size is 3.5 MB. For more information about the restrictions that apply when working with functions in Cloud Functions, see the corresponding section.
  - Flush interval: Set the interval (in seconds) after which the data from the stream is transferred to the processing function, even if the buffer is not yet full.
    Note
    If the buffer becomes full or the flush interval expires, the data is transferred to the processing function.
  - Invocation timeout: Set the allowed timeout for the response from the processing function (in seconds).
    Warning
    Values in the Flush interval and Invocation timeout fields are specified with the `s` postfix, for example, `10s`.
- Conversion rules:
  - Data format: Select one of the available formats:
    - JSON: JSON format.
    - AuditTrails.v1 parser: Audit Trails log format.
    - CloudLogging parser: Cloud Logging log format.
    - Debezium CDC parser: Debezium CDC. It allows specifying Confluent Schema Registry in the settings.

    For JSON, specify:
    - Data scheme: Specify the schema as a list of fields or upload a file with a description of the schema in JSON format.
      Sample data schema:
      ```json
      [
          {
              "name": "request",
              "type": "string"
          }
      ]
      ```
    - Enable NULL values in keys: Select this option to allow the `null` value in key columns.
    - Add a column for missing keys: Select this option to have fields that are missing from the schema appear in the `_rest` column (see the example after this list).
    - Unescape string values: Select this option to remove quotation marks from string values. Otherwise, the string field values remain unchanged.

    For Debezium CDC, specify the following: the Schema Registry URL, the authentication method (including the username and password if authentication is used), and the CA certificate.
- Compression codecs: Set the required data compression format (`GZIP`, `ZSTD`, or `Raw`).
- Should continue working, if consumer read lag exceed TTL of topic: Select this option to continue the transfer if the topic's TTL is exceeded (some data will be lost). If this option is not selected, the transfer will be aborted with a data loss error.
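To illustrate how the JSON conversion settings above are expected to combine (a sketch based on the option descriptions, not output captured from a real transfer): suppose the schema declares only the request field and Add a column for missing keys is enabled. An incoming message such as
```json
{ "request": "GET /index.html HTTP/1.1", "status": 200, "bytes_sent": 512 }
```
would then be expected to produce a target row where request is a regular column and the undeclared fields are collected into the `_rest` column, roughly:
```json
{ "request": "GET /index.html HTTP/1.1", "_rest": { "status": 200, "bytes_sent": 512 } }
```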
Configuring the data target
Configure one of the supported data targets:
- PostgreSQL.
- MySQL®.
- MongoDB.
- ClickHouse®.
- Greenplum®.
- Yandex Managed Service for YDB.
- Yandex Object Storage.
- Apache Kafka®.
- YDS.
- Elasticsearch.
- OpenSearch.
For a complete list of supported sources and targets in Yandex Data Transfer, see Available Transfers.
After configuring the data source and target, create and start the transfer.
Troubleshooting data transfer issues
See a full list of recommendations in the Troubleshooting section.
Transfer failure
A Replication or Snapshot and increment transfer is interrupted with an error.
Error message:
/Ydb.PersQueue.V1.PersQueueService/AddReadRule failed: OVERLOADED
Transfers are aborted because the cloud quota is exceeded.
Solution:
- Increase the Number of schema transactions per minute property in the Managed Service for YDB quotas for the cloud with the required database and reactivate the transfer.
Cloud Functions redirects
In rare cases, the following error may occur during transfers from Data Streams or Apache Kafka®:
redirect to SOME_URL is requested but no redirects are allowed.
Possible cause:
A Cloud Functions processing function is configured on the source endpoint, and it returns a redirect to another URL instead of data.
Solution:
Such redirects are not allowed for security reasons. Avoid using redirects in the Cloud Functions function used in the transfer.