Data transformation
Transformation converts data using special transformer functions. These functions run on the data stream and are applied to each data change item to transform it. A transformer can operate at both the metadata and data levels.
You can set up data transformations when creating or updating a transfer.
Data can only be transformed if the source and target are of different types.
Transformers are set as a list. When activating a transfer, a transformation plan is made for the tables that match the specified criteria. Transformers are applied to the tables in the sequence specified in the list.
Transformer types
Some transformers may have limitations and only apply to some source-target pairs. The list of transformers will be updated moving forward. Currently, the following types are available:
- Renaming tables
- Column filter
- Data masking
- Splitting tables into subtables
- Replacing primary keys
- Converting column values to strings
- Sharding
- String filter for APPEND-ONLY sources
Renaming tables
You can set rules for renaming tables by specifying the current names of the tables in the source and new names for these tables in the target.
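Conceptually, a rename rule maps a source table name to a new target name. Here is a minimal sketch of that idea; the rule format and table names are illustrative, not the service's actual configuration syntax:

```python
# Hypothetical rename rules: source table name -> new name in the target.
RENAME_RULES = {
    "public.users": "app_users",
    "public.orders": "app_orders",
}

def target_table_name(source_name: str) -> str:
    """Return the new name if a rename rule matches, else keep the original."""
    return RENAME_RULES.get(source_name, source_name)
```

Tables without a matching rule keep their original names.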
Column filter
You can set up a list of table columns to transfer:
- List the tables to filter using lists of included and excluded tables.
- List the columns to transfer to the target tables using lists of included and excluded columns.
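The two steps above can be sketched as a pair of predicates. This is a simplified illustration under assumptions of my own (glob-style table patterns, an empty include list meaning "everything"); the service's actual matching semantics may differ:

```python
import fnmatch

def should_transfer(table: str, include: list[str], exclude: list[str]) -> bool:
    """A table passes the filter if it matches an include pattern
    (or the include list is empty) and matches no exclude pattern."""
    included = any(fnmatch.fnmatch(table, p) for p in include) if include else True
    return included and not any(fnmatch.fnmatch(table, p) for p in exclude)

def filter_columns(row: dict, include_cols: list[str], exclude_cols: list[str]) -> dict:
    """Keep only the columns allowed by the include/exclude column lists."""
    cols = include_cols if include_cols else list(row)
    return {c: v for c, v in row.items() if c in cols and c not in exclude_cols}
```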
Data masking
You can hash data as follows:
- List the tables to mask data in using lists of included and excluded tables.
- Specify the name of the column for data masking (a regular expression). For each column, set a salt value (a string of the password type). This string will be used in the `HMAC(sha256, salt)` function applied to the column data.
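The masking described above, HMAC with SHA-256 keyed by the salt, can be sketched with Python's standard library (the hex encoding of the digest is an assumption; the service may encode the result differently):

```python
import hashlib
import hmac

def mask_value(value: str, salt: str) -> str:
    """Mask a column value with HMAC-SHA256, using the salt as the key."""
    return hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()
```

The same value masked with the same salt always yields the same output, so masked columns remain usable for joins and equality comparisons without exposing the original data.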
Splitting tables into subtables
A transfer splits the X table into multiple tables (X_1, X_2, ..., X_n) based on the data. A row that was in the X table before the split ends up in the X_i table, where i is determined by the column list and the split string.
Example:
If the column list contains two columns, `month of birth` and `gender`, and the split string is `@`, the information about the employee John, born February 11, 1984, will get from the `Employees` table into the `Employees@February@male` table, which is the new table name.
To partition tables:
- Use lists of included and excluded tables to list the required tables.
- Specify the columns in the tables to be partitioned.
- Specify the split string to be used for merging components in a new table name.
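The naming scheme from the Employees example can be sketched as follows (the column names in the sample row are hypothetical):

```python
def subtable_name(base: str, row: dict, columns: list[str], split_string: str) -> str:
    """Build the subtable name by joining the base table name and the values
    of the split columns with the split string."""
    return split_string.join([base] + [str(row[c]) for c in columns])

# Employee John, born February 11, 1984:
name = subtable_name(
    "Employees",
    {"month_of_birth": "February", "gender": "male"},
    ["month_of_birth", "gender"],
    "@",
)
# name == "Employees@February@male"
```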
Replacing primary keys
To override primary keys, specify a list of included and excluded tables and a list of columns to be used as primary keys.
Converting column values to strings
To convert column values to string values, specify a list of included and excluded tables and a list of columns. The values will be converted depending on the source type:
| Source type | Format |
|---|---|
| Any | Serialized JSON format |
| Date | String in 2006-01-02 format |
| DateTime | String in 2006-01-02T15:04:05.999999999Z07:00 format |
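The layouts in the table are Go-style reference times: `2006-01-02` denotes `YYYY-MM-DD`, and `2006-01-02T15:04:05.999999999Z07:00` is RFC 3339 with fractional seconds. A Python sketch of the same conversion logic (the dispatch order and JSON encoding are assumptions for illustration):

```python
import json
from datetime import date, datetime

def to_string(value):
    """Convert a column value per the table above: dates as YYYY-MM-DD,
    datetimes as RFC 3339-style ISO strings, everything else as JSON."""
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.isoformat()
    if isinstance(value, date):
        return value.isoformat()
    return json.dumps(value)
```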
Sharding
Sharding is used when delivering data to queues, primarily to Yandex Data Streams, where throughput is limited and data sharding helps improve transfer performance.
Two sharding types are supported:
- Column list (column-based sharding), where rows with identical values in the specified columns are placed in the same shard.
- Random sharding, where rows are randomly distributed across shards based on an arbitrary UUID, which is regenerated at each data plane restart. If the number of shards changes, a new UUID is generated and the data is resharded.
We recommend setting the number of shards to 3 to 10 times the number of partitions in the target. This helps distribute data across partitions more evenly.
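Column-based sharding guarantees that rows with identical values in the shard columns land in the same shard. A minimal sketch of that property, assuming a hash of the concatenated column values (the actual hash function used by the service is not specified here):

```python
import hashlib

def shard_for(row: dict, shard_columns: list[str], num_shards: int) -> int:
    """Column-based sharding sketch: rows with equal values in the shard
    columns always map to the same shard index."""
    key = "\x00".join(str(row[c]) for c in shard_columns)
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because only the listed columns feed the hash, other columns do not affect shard placement.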
Considerations for data delivery to Data Streams
You can only use one worker for column-based sharding. The reason is that, when delivering data to Data Streams, the transformer generates the data source ID (source_id). Different workers would generate identical source IDs, but Data Streams does not allow writes with multiple identical source_id values.
String filter for APPEND-ONLY sources
This filter only applies to transfers using queues (such as Apache Kafka®) as a data source. When running a transfer, only the strings meeting the specified criteria remain in a changefeed.
- List the tables to filter data in using lists of included and excluded tables.
- Set a filtering criterion. For the criterion, you can specify comparison operations for numeric, string, and Boolean values, comparison with NULL, and checking whether a substring is a part of a string and whether a value belongs to a set.
Filtering criteria are determined by a set of rules separated by the AND keyword. The following operations are supported: >, >=, <, <=, =, !=, ~ (string contains the substring), !~ (string does not contain the substring), IN (value belongs to a set), NOT IN (value does not belong to a set). All columns listed in the filter must be present in the table you are filtering.
Here is an example of a filter string:
```
aid > 0 AND id >= 10 AND f > 1000.1 AND f <= 1000.5 AND d != 3.14 AND i < 0 AND bi >= -9223372036854775808 AND biu <= 9223372036854775807 AND da > 1999-01-04 AND da <= 2000-03-04 AND ts > 2000-01-01T00:00:00 AND c != 'a' AND str ~ 'hello' AND t !~ 'bye-bye' AND nil = NULL AND val != NULL
```
The filter supports single-level nesting (parentheses), and only with the IN and NOT IN operators.
Here is an example of a filter string using the IN and NOT IN operators:
```
i IN (3, 5, 7) AND str IN ('hello', 'hi') AND c NOT IN ('a', 'b', 'c')
```
When using the filter:
- If the value type specified in the filter does not match the column type in the table, the transformer is not applied (rows are not filtered).
- If the filter includes a string column, the type of this column in the filtered table must be set to `UTF8` for those sources where a parser explicitly specifies column types (e.g., for YDS). The transformer does not support the `STRING` column type (rows are not filtered).
You can specify several filter criteria at once. If multiple criteria are specified, data is transferred if it satisfies at least one of them (similar to an OR operation).
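The combination logic above, AND within a criterion, OR across criteria, can be sketched as follows. The predicates stand in for parsed filter rules; the real filter string syntax is the one described earlier in this section:

```python
def matches(row: dict, criteria: list) -> bool:
    """criteria is a list of AND-groups; a row passes if every rule in
    at least one group holds (OR across groups, AND within a group)."""
    return any(all(rule(row) for rule in group) for group in criteria)

# Two hypothetical criteria, roughly "id >= 10 AND str IN ('hello', 'hi')"
# and "id < 0":
criteria = [
    [lambda r: r["id"] >= 10, lambda r: r["str"] in ("hello", "hi")],
    [lambda r: r["id"] < 0],
]
```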