Data partitioning
Yandex Object Storage allows you to store very large amounts of data. Queries may only affect some of this data. If you describe the markup rules for the structure of your data storage in Yandex Query, then you do not even need to read data that is not required for a query from Yandex Object Storage. This makes query execution much faster without affecting the results.
For example, data is stored in the following structure of folders:
year=2021
month=01
month=02
month=03
year=2022
month=01
In the query below, only the data for February 2021 should be processed, while the other data is not required.
SELECT
*
FROM
objectstorage.'/'
WITH
(
Schema =
(
data String,
year Int,
month Int
)
)
WHERE
year=2021
AND month=02
If no data partitioning scheme is specified, all stored data is read from Yandex Object Storage. However, as a result of processing, data for any other period is discarded.
If you explicitly specify the storage structure indicating that data in Yandex Object Storage is stored in folders by year and month
SELECT
*
FROM
objectstorage.'/'
WITH
(
Schema =
(
data String,
year Int,
month Int
),
partitioned_by =
(
year,
month
)
)
WHERE
year=2021
AND month=02
when executing a query from Yandex Object Storage, instead of reading all the data, only data for February 2021 is read. This will greatly reduce the amount of data to handle and speed up data processing while the results of both queries will be identical.
Note
In the example above, operations with data are performed at the level of connections. This example is given for illustrative purposes only. We strongly recommend that you work with data using bindings and do not work with connections directly.
Syntax
When working at the connection level, partitioning is set using the partitioned_by
parameter.
SELECT
*
FROM
<connection>.<path>
WITH
(
schema=(<field_1>, <field_2>, <field_3>),
partitioned_by=(<field_2>, <field_3>)
)
The partitioned_by
parameter lists data schema columns used to partition data stored in Yandex Object Storage. The order of specifying fields in the partitioned_by
parameter determines the nesting of Yandex Object Storage folders.
For example, partitioned_by=(year, month)
defines the folder structure
year=2021
month=01
month=02
month=03
year=2022
month=01
And partitioned_by=(month, year)
defines a different folder structure
month=01
year=2021
year=2022
month=02
year=2021
month=03
year=2021
Supported data types
Partitioning is only possible by the following set of YQL data types:
- Uint16, Uint32, Uint64
- Int16, Int32, Int64
- String
- Bool
If other data types are used for partitioning, an error is returned.
Supported storage path formats
The storage path format where each folder name explicitly specifies a column name is called Hive-Metastore format
This format looks as follows:
month=01
year=2021
year=2022
month=02
year=2021
month=03
year=2021
Warning
The basic partitioning mode in Yandex Query only supports Hive format.
To specify arbitrary storage paths, use Partition projection.