Data formats and compression algorithms
Below are the data formats and compression algorithms supported in Yandex Query.
Supported data formats
Yandex Query Language supports the following data formats:
Csv_with_names
This format is based on CSV. Data is comma-separated and stored in columns, with the first file line containing column names.
Sample data:
Year,Manufacturer,Model,Price
1997,Ford,E350,3000.00
1999,Chevy,"Venture «Extended Edition»",4900.00
Sample query
SELECT
*
FROM <connection>.<path>
WITH
(
format=csv_with_names,
SCHEMA
(
Year int,
Manufacturer string,
Model string,
Price double
)
)
Query results:
# | Manufacturer | Model | Price | Year |
---|---|---|---|---|
1 | Ford | E350 | 3000 | 1997 |
2 | Chevy | Venture «Extended Edition» | 4900 | 1999 |
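To illustrate the file layout only, the following Python sketch parses the sample data above with the standard `csv` module. It is not part of Yandex Query: the `SAMPLE` string and the `read_csv_with_names` helper are hypothetical names used for this example, and the type conversions simply mirror the SCHEMA from the sample query.

```python
import csv
import io

# Sample data from above: the first line holds the column names.
SAMPLE = (
    "Year,Manufacturer,Model,Price\n"
    "1997,Ford,E350,3000.00\n"
    '1999,Chevy,"Venture «Extended Edition»",4900.00\n'
)


def read_csv_with_names(text):
    """Parse comma-separated text whose first line contains column names."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        # Convert values to the types declared in the SCHEMA of the sample query.
        rows.append(
            {
                "Year": int(record["Year"]),
                "Manufacturer": record["Manufacturer"],
                "Model": record["Model"],
                "Price": float(record["Price"]),
            }
        )
    return rows


rows = read_csv_with_names(SAMPLE)
for row in rows:
    print(row)
```

The quoted `Model` value is read as a single field with the surrounding quotes removed, which matches the query results shown above.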
Tsv_with_names
This format is based on TSV. Data is tab-separated (0x9 code) and stored in columns, with the first file line containing column names.
Sample data:
Year Manufacturer Model Price
1997 Ford E350 3000.00
1999 Chevy "Venture «Extended Edition»" 4900.00
Sample query
SELECT
*
FROM <connection>.<path>
WITH
(
format=tsv_with_names,
SCHEMA
(
Year int,
Manufacturer string,
Model string,
Price double
)
)
Query results:
# | Manufacturer | Model | Price | Year |
---|---|---|---|---|
1 | Ford | E350 | 3000 | 1997 |
2 | Chevy | Venture «Extended Edition» | 4900 | 1999 |
Json_list
This format is based on a JSON representation of data. Each file must contain a list of objects in a valid JSON representation.
Example of correct data (represented as a list of JSON objects):
[
{ "Year": 1997, "Manufacturer": "Ford", "Model": "E350", "Price": 3000.0 },
{ "Year": 1999, "Manufacturer": "Chevy", "Model": "Venture «Extended Edition»", "Price": 4900.00 }
]
Example of INCORRECT data (each line contains a separate object in JSON format, but these objects are not represented as a list):
{ "Year": 1997, "Manufacturer": "Ford", "Model": "E350", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Chevy", "Model": "Venture «Extended Edition»", "Price": 4900.00 }
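The difference between the two examples above can be checked with a short Python sketch. It uses only the standard `json` module and is an illustration of the file structure, not of how Yandex Query parses files; `is_json_list`, `CORRECT`, and `INCORRECT` are hypothetical names introduced for this example.

```python
import json

# The correct example: one JSON document that is a list of objects.
CORRECT = """[
{ "Year": 1997, "Manufacturer": "Ford", "Model": "E350", "Price": 3000.0 },
{ "Year": 1999, "Manufacturer": "Chevy", "Model": "Venture «Extended Edition»", "Price": 4900.00 }
]"""

# The incorrect example: separate objects that are not wrapped in a list.
INCORRECT = """{ "Year": 1997, "Manufacturer": "Ford", "Model": "E350", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Chevy", "Model": "Venture «Extended Edition»", "Price": 4900.00 }"""


def is_json_list(text):
    """Return True if the whole text is one JSON document that is a list of objects."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Several objects one after another do not form a single JSON document.
        return False
    return isinstance(parsed, list) and all(isinstance(item, dict) for item in parsed)


print(is_json_list(CORRECT))
print(is_json_list(INCORRECT))
```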
Json_each_row
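This format is based on a JSON representation of data. Each file line must contain a separate object in a valid JSON representation, and these objects are not combined into a list.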
Example of correct data (each line contains a separate object in JSON format, but these objects are not represented as a list):
{ "Year": 1997, "Manufacturer": "Ford", "Model": "E350", "Price": 3000.0 },
{ "Year": 1999, "Manufacturer": "Chevy", "Model": "Venture «Extended Edition»", "Price": 4900.00 }
Sample query
SELECT
*
FROM <connection>.<path>
WITH
(
format=json_each_row,
SCHEMA
(
Year int,
Manufacturer string,
Model string,
Price double
)
)
Query results:
# | Manufacturer | Model | Price | Year |
---|---|---|---|---|
1 | Ford | E350 | 3000 | 1997 |
2 | Chevy | Venture «Extended Edition» | 4900 | 1999 |
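As a rough illustration of the line-by-line layout, the Python sketch below parses the sample data above one line at a time. It is an assumption-laden example rather than a description of the Yandex Query parser: `read_json_each_row` is a hypothetical helper, and it strips an optional trailing comma so that the sample parses exactly as written above.

```python
import json

# Sample data from above: every line is a separate JSON object.
SAMPLE = (
    '{ "Year": 1997, "Manufacturer": "Ford", "Model": "E350", "Price": 3000.0 },\n'
    '{ "Year": 1999, "Manufacturer": "Chevy", "Model": "Venture «Extended Edition»", "Price": 4900.00 }\n'
)


def read_json_each_row(text):
    """Parse text in which every non-empty line is a separate JSON object."""
    rows = []
    for line in text.splitlines():
        # Assumption of this sketch: a comma after an object is ignored.
        candidate = line.strip().rstrip(",")
        if candidate:
            rows.append(json.loads(candidate))
    return rows


rows = read_json_each_row(SAMPLE)
for row in rows:
    print(row["Year"], row["Manufacturer"], row["Model"], row["Price"])
```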
Raw
This format allows reading raw data as is. The data read this way can be processed using YQL.
Use this format if the built-in features for parsing source data in Yandex Query are insufficient.
Sample query
SELECT
*
FROM <connection>.<path>
WITH
(
format=raw,
SCHEMA
(
Data String
)
)
Query results:
Year,Manufacturer,Model,Price
1997,Ford,E350,3000.00
1999,Chevy,\"Venture «Extended Edition»\",4900.00
Json_as_string
This format is based on a JSON representation of data.
In this format, each file must contain one of the following:
- An object in a valid JSON representation on each file line.
- A list of objects in a valid JSON representation.
Example of correct data (each line contains a separate JSON object):
{ "Year": 1997, "Manufacturer": "Ford", "Model": "E350", "Price": 3000.0 }
{ "Year": 1999, "Manufacturer": "Chevy", "Model": "Venture «Extended Edition»", "Price": 4900.00 }
Sample query
SELECT
*
FROM <connection>.<path>
WITH
(
format=json_as_string,
SCHEMA
(
Data Json
)
)
Query results:
# | Data |
---|---|
1 | {"Manufacturer": "Ford", "Model": "E350", "Price": 3000, "Year": 1997} |
2 | {"Manufacturer": "Chevy", "Model": "Venture «Extended Edition»", "Price": 4900, "Year": 1999} |
Parquet
This format allows you to read the contents of a file in Apache Parquet format.
Data compression algorithms supported in Parquet files:
- No compression
- SNAPPY
- GZIP
- LZO
- BROTLI
- LZ4
- ZSTD
- LZ4_RAW
Sample query
SELECT
*
FROM <connection>.<path>
WITH
(
format=parquet,
SCHEMA
(
Year int,
Manufacturer string,
Model string,
Price double
)
)
Query results:
# | Manufacturer | Model | Price | Year |
---|---|---|---|---|
1 | Ford | E350 | 3000 | 1997 |
2 | Chevy | Venture «Extended Edition» | 4900 | 1999 |
Example of reading data
Sample query for reading data from Yandex Object Storage:
SELECT
*
FROM
connection.`folder/filename.csv`
WITH(
format='csv_with_names',
SCHEMA
(
Year int,
Manufacturer String,
Model String,
Price Double
)
);
Where:
Field | Description |
---|---|
connection | Yandex Object Storage connection name |
folder/filename.csv | Path to the file in the Yandex Object Storage bucket |
SCHEMA | Data schema description in the file |
Supported compression algorithms
Reads
Yandex Query supports the following compression algorithms for data reads:
Compression format | Name in Query |
---|---|
Gzip | gzip |
Zstd | zstd |
LZ4 | lz4 |
Brotli | brotli |
Bzip2 | bzip2 |
Xz | xz |
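For testing reads of compressed files, data can be compressed locally before it is uploaded. The Python sketch below is a minimal example under stated assumptions: it covers only the three formats available in the Python standard library (gzip, bzip2, xz), it assumes that the names in the table refer to the standard container formats these modules produce, and `CODECS`, `PAYLOAD`, and `round_trip` are hypothetical names used for illustration.

```python
import bz2
import gzip
import lzma

# Names from the table above mapped to Python standard-library codecs.
# zstd, lz4 and brotli are left out because this sketch only uses stdlib modules.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

# A small csv_with_names payload encoded as bytes.
PAYLOAD = (
    "Year,Manufacturer,Model,Price\n"
    "1997,Ford,E350,3000.00\n"
).encode("utf-8")


def round_trip(name, data):
    """Compress and then decompress data with the codec registered under name."""
    compress, decompress = CODECS[name]
    return decompress(compress(data))


for name in CODECS:
    restored = round_trip(name, PAYLOAD)
    print(name, restored == PAYLOAD)
```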
The parquet file format supports its own internal compression algorithms. Yandex Query enables reading data in parquet format using the following compression algorithms:
Compression format | Name in Query |
---|---|
Raw | raw |
Snappy | snappy |
Writing data to Yandex Object Storage
Currently, the following data write formats are supported:
Data format | Name in Query |
---|---|
CSV | csv_with_names |
Parquet | parquet |
Query supports the following compression algorithms for data writes:
Compression format | Name in Query |
---|---|
Gzip | gzip |
Zstd | zstd |
LZ4 | lz4 |
Brotli | brotli |
Bzip2 | bzip2 |
Xz | xz |
The parquet file format supports its own internal compression algorithms. Query allows writing data in parquet format using the following compression algorithms:
Compression format | Name in Query |
---|---|
Snappy | No name (used by default) |
Writing data to Yandex Data Streams
Data Streams only lets you write data as a byte stream that is interpreted on the receiving side.
File format and compression algorithm settings do not apply to data writes in Data Streams.