Working with data in Yandex Object Storage

Updated at May 7, 2025

Getting started
- Required paid resources
Set up your infrastructure
Get started in Query
Connect to Object Storage data
- Create a data binding
Configure partitioning in Object Storage
- Hive partitioning
- Partition projection
How to delete the resources you created

Yandex Query is an interactive service for serverless data analysis. You can use it to process information from various storages without having to create a dedicated cluster. The service supports Yandex Object Storage, Yandex Managed Service for PostgreSQL, and Yandex Managed Service for ClickHouse® data storages.

In this tutorial, you will connect to an Object Storage data source and run queries against it from the JupyterLab notebook using Query.

If you no longer need the resources you created, delete them.

Getting started

Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.

On the DataSphere home page, click Try for free and select an account to log in with: Yandex ID or your working account with the identity federation (SSO).
Select the Yandex Cloud Organization organization you are going to use in Yandex Cloud.
Create a community.
Link your billing account to the DataSphere community you are going to work in. Make sure you have a linked billing account and its status is ACTIVE or TRIAL_ACTIVE. If you do not have a billing account yet, create one in the DataSphere interface.

Required paid resources

The cost of infrastructure support for working with Object Storage data includes:

Fee for DataSphere computing resource usage.
Fee for storing data in a bucket.
Fee for the amount of read data when executing queries using Query.

Set up your infrastructure

Log in to the Yandex Cloud management console and select the organization you use to access DataSphere. On the Yandex Cloud Billing page, make sure you have a billing account linked.

If you have an active billing account, go to the cloud page to create or select a folder to deploy your infrastructure in.

Note

If you use an identity federation to access Yandex Cloud, billing details might be unavailable to you. In this case, contact your Yandex Cloud organization administrator.

Create a folder

Management console

In the management console, select a cloud and click Create folder.
Give your folder a name, e.g., data-folder.
Click Create.

Create a service account for the DataSphere project

Management console

Navigate to data-folder.
From the list of services, select Identity and Access Management.
Click Create service account.
Enter a name for the service account, e.g., yq-sa.
Click Add role and assign the following roles to the service account:
- yq.editor: To run queries using Query.
- storage.viewer: To view the contents of the Object Storage bucket and objects.
Click Create.

Add the service account to a project

To enable the service account to run a DataSphere project, add it to the list of project members.

Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
In the Members tab, click Add member.
Select the yq-sa account and click Add.
Change your service account role to Editor.

Create an authorized key for the service account

To allow the service account to send queries to Query, create an authorized key.

Note

Authorized keys do not expire. If something goes wrong, you can create a new authorized key at any time and authenticate again.

Management console

In the management console, navigate to data-folder.
From the list of services, select Identity and Access Management.
In the left-hand panel, select Service accounts.
In the list that opens, select the yq-sa service account.
Click Create new key in the top panel and select Create authorized key.
Select the encryption algorithm and click Create.
Click Download file with keys.

Create a secret

To get an authorized key from the notebook, create a secret with the contents of the authorized key file.

Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
Under Project resources, click Secret.
Click Create.
In the Name field, enter the name for the secret: yq_access_key.
In the Value field, paste the full contents of the authorized key file you downloaded.
Click Create.
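
Before pasting the key file into the secret, it can help to confirm the file is intact JSON. The sketch below is a minimal check, not part of the official procedure. The field names (`id`, `service_account_id`, `private_key`) are assumed from a typical authorized key file, and the sample value is a dummy placeholder.

```python
import json

# Field names assumed from a typical authorized key file; adjust if yours differs.
EXPECTED_FIELDS = ("id", "service_account_id", "private_key")


def missing_key_fields(raw: str) -> list[str]:
    """Return the expected fields that are absent or empty in the key JSON."""
    data = json.loads(raw)  # raises json.JSONDecodeError on truncated or malformed input
    return [name for name in EXPECTED_FIELDS if not data.get(name)]


# Dummy stand-in for the downloaded file contents (not a real key).
sample = json.dumps(
    {
        "id": "key-id-placeholder",
        "service_account_id": "sa-id-placeholder",
        "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    }
)
print(missing_key_fields(sample))  # an empty list means nothing is missing
```

If the secret is exposed to the notebook as an environment variable named yq_access_key (an assumption here, consistent with the --env-auth option used later), the same function could be called on `os.environ["yq_access_key"]`.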

Create a notebook

Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
Click Open project in JupyterLab and wait for the loading to complete.
In the top panel, click File and select New ⟶ Notebook.
Select a kernel and click Select.

Get started in Query

The yandex_query_magic package provides magic commands for working in Jupyter. Install it to send queries to Query. Paste the code into the yq-storage.ipynb notebook cells.

Open the DataSphere project:
1. Select the relevant project in your community or on the DataSphere homepage in the Recent projects tab.
2. Click Open project in JupyterLab and wait for the loading to complete.
3. Open the notebook tab.
Install the yandex_query_magic package:
```
%pip install yandex_query_magic
```
Once the installation is complete, from the top panel, select Kernel ⟶ Restart kernel....
Load the extension:
```
%load_ext yandex_query_magic
```
Configure the connection by specifying the data-folder ID and the name of the authorized key secret:
```
%yq_settings --folder-id <folder_ID> --env-auth yq_access_key
```
Run a test query to Query:
```
%yq select "Hello, world!"
```

Connect to Object Storage data

As an example, let's use data on New York Yellow Taxi rides. The data was pre-uploaded to a public Object Storage bucket named yq-sample-data, in the nyc_taxi_csv folder.

Note

Yandex Cloud provides the New York City taxi trips dataset as is. Yandex Cloud makes no representations, express or implied, warranties, or conditions pertaining to your use of the specified dataset. To the extent allowed by your local laws, Yandex Cloud shall not be liable for any loss or damage, including direct, consequential, special, indirect, incidental, or exemplary, resulting from your use of the dataset.

NYC Taxi and Limousine Commission (TLC):

The data was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The taxi trip data is not generated by the TLC, and the TLC makes no representations whatsoever about the accuracy of this data.

Take a look at the Dataset source and its Use Policy.

Create a Query connection:

Management console

In the management console, select data-folder.
From the list of services, select Yandex Query.
In the left-hand panel, select Connections.
Click Create new.
Enter a name for the connection, e.g., storage-connection.
Select the Object Storage connection type and specify the Connection type parameters:
- In the Bucket auth field, select Public.
- Specify the bucket name: yq-sample-data.
Click Create.

Create a data binding

A data binding contains information about file formats and locations in the bucket, as well as the list of data fields and their types. Query can process data in CSV, TSV, JSON, and Parquet formats. For data in CSV, TSV, and JSON formats, you can additionally use an external archiver: gzip, zstd, and others. For data in Parquet format, you can use Snappy, LZ4, ZSTD, and other compression types.

To create a data binding:

Management console

In the management console, select data-folder.
From the list of services, select Yandex Query.
In the left-hand panel, select Bindings.
Click Create.
Under Connection parameters:
- Type: Select Object Storage.
- Connection: Select storage-connection.
Under Binding parameters:
1. Enter a name for the binding, e.g., yq_binding.
2. In the Path field, specify the path to the data: nyc_taxi_sample/data/.
3. Select the compression type: gzip.
Under Format settings, select csv_with_names in the Format field.
Under Columns, add two columns with these names and data types:
- tpep_pickup_datetime: DATETIME
- trip_distance: DOUBLE
To verify the data you specified is correct, click Preview. A table should appear below.
Click Create.

To check the connection, get the table data from the notebook cell:

```
%yq SELECT * FROM yq_binding LIMIT 100;
```
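
To make the binding settings concrete, the following sketch builds a small file of the kind they describe: CSV with a header line (csv_with_names), compressed with gzip, with the two columns declared in the binding. The row values and the timestamp format are made up for illustration and are not taken from the real dataset.

```python
import csv
import gzip
import io


def make_gzip_csv(rows: list[tuple[str, float]]) -> bytes:
    """Serialize rows as CSV with a header line, then gzip the result."""
    text = io.StringIO()
    writer = csv.writer(text, lineterminator="\n")
    writer.writerow(["tpep_pickup_datetime", "trip_distance"])  # header = column names
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def read_gzip_csv(blob: bytes) -> list[dict[str, str]]:
    """Decompress and parse the file back, keyed by the header names."""
    text = gzip.decompress(blob).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample values, not taken from the real dataset.
blob = make_gzip_csv([("2021-01-01T00:15:00Z", 2.5), ("2021-01-01T00:42:00Z", 7.1)])
records = read_gzip_csv(blob)
print(records[0])
```

Because the header line carries the column names, the names in the binding need to match it; the types (DATETIME, DOUBLE) are applied by the binding, since CSV itself stores everything as text.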

Configure partitioning in Object Storage

In Query, you can significantly reduce query time by describing how data is partitioned (placed) in Object Storage, so that a query reads only the data it needs. For example, if files in a bucket are sorted into folders by year, Query will only look for data in the folders that match the query.

Query supports two options for specifying data placement rules: Hive partitioning and Partition projection.

Hive partitioning

In Apache Hive™, data is placed in a folder structure in this format: key=value/key2=value2/.... This format is used in systems originally built around the Apache Hadoop™ ecosystem, such as Apache Spark™.

For example, data from the Yellow Taxi dataset is placed in folders according to the Hive partitioning rules:

```
year=2021/month=1/...
year=2021/month=2/...
...
year=2022/month=1/...
year=2022/month=2/...
```
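
The key=value layout can be sketched in a few lines of Python. This is an illustration of the naming scheme only, not of how Query reads it; the helper names and the `data.csv` file name are hypothetical.

```python
def hive_path(**partitions: int) -> str:
    """Build a key=value/key2=value2/ prefix in keyword order."""
    return "".join(f"{key}={value}/" for key, value in partitions.items())


def parse_hive_path(path: str) -> dict[str, str]:
    """Extract key=value segments from a path; other segments (e.g. file names) are skipped."""
    parts = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            parts[key] = value
    return parts


print(hive_path(year=2021, month=1))                  # year=2021/month=1/
print(parse_hive_path("year=2022/month=2/data.csv"))  # {'year': '2022', 'month': '2'}
```

Note that the parsed values are strings; it is the partitioned column type in the binding (UINT32 in this tutorial) that gives them a numeric type.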

Configure Hive partitioning:

Clone the yq_binding data binding:
Management console
1. In the management console, select data-folder.
2. From the list of services, select Yandex Query.
3. In the left-hand panel, click Bindings.
4. In the row with yq_binding, click the options button and select Clone.
5. Enter a name for the binding: yq_tutorial_hive_partitioned.
6. Under Partitioned columns:
  1. Select Basic.
  2. Add a column named year with the UINT32 data type.
  3. Add a column named month with the UINT32 data type.
7. Click Modify.

To check the result, run this query:

```
%yq SELECT * FROM yq_tutorial_hive_partitioned WHERE year=2021
```
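
The benefit of the `WHERE year=2021` filter can be pictured as skipping every object whose path does not match. The sketch below shows the idea on a hypothetical list of object keys; it is a conceptual illustration and makes no claim about how Query implements this internally.

```python
def prune_by_partition(keys: list[str], column: str, value: int) -> list[str]:
    """Keep only object keys whose Hive-style path contains column=value."""
    wanted = f"{column}={value}"
    return [key for key in keys if wanted in key.split("/")]


# Hypothetical object keys; the real file names in the bucket may differ.
keys = [
    "year=2021/month=1/part-0.csv",
    "year=2021/month=2/part-0.csv",
    "year=2022/month=1/part-0.csv",
]
print(prune_by_partition(keys, "year", 2021))
```

Only the two 2021 keys remain, so the 2022 object would never need to be read.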

Partition projection

Sometimes data in Object Storage is placed according to unique rules. In such cases, you can use partition projection to specify your own data path template.

When using partition projection, you need to describe each component of the year and month path separately, specifying the ranges of their values. Then you need to specify the path template in the Object Storage bucket that stores the data.

Configure partition projection:

Clone the yq_binding data binding:
Management console
1. In the management console, select data-folder.
2. From the list of services, select Yandex Query.
3. In the left-hand panel, click Bindings.
4. In the row with yq_binding, click the options button and select Clone.
5. Enter a name for the binding: yq_tutorial_partition_projection.
6. Under Partitioned columns:
  1. Select Extended.
  2. Add a column named year with the UINT32 data type:
    Format type: integer
    Interval:
    Min: 2019
    Max: 2024
    Step: 1
    Fixed width: 2
  3. Add a column named month with the UINT32 data type:
    Format type: integer
    Interval:
    Min: 1
    Max: 12
    Step: 1
    Fixed width: 2
7. Click Modify.

To check the result, run this query:

```
%yq SELECT * FROM yq_tutorial_partition_projection WHERE year=2021
```
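
To see what the year and month settings above describe, here is a rough sketch that enumerates the paths such a projection would cover. It rests on several assumptions: that Max is inclusive, that Fixed width means zero-padding to at least that many digits, and that the path template looks like `${year}/${month}/`. The template string in particular is a hypothetical example, not the actual layout of the tutorial bucket.

```python
from itertools import product
from string import Template


def projected_values(min_value: int, max_value: int, step: int, width: int) -> list[str]:
    """List the values in [min, max] with the given step, zero-padded to at least `width` digits."""
    return [str(v).zfill(width) for v in range(min_value, max_value + 1, step)]


def projected_paths(template: str, **columns: list[str]) -> list[str]:
    """Substitute every combination of column values into the path template."""
    names = list(columns)
    return [
        Template(template).substitute(dict(zip(names, combo)))
        for combo in product(*columns.values())
    ]


years = projected_values(2019, 2024, 1, 2)
months = projected_values(1, 12, 1, 2)
# The template string is a hypothetical example, not the tutorial bucket's actual layout.
paths = projected_paths("${year}/${month}/", year=years, month=months)
print(len(paths), paths[0], paths[-1])
```

Under these assumptions the ranges describe 72 candidate paths (6 years by 12 months), and padding turns month 1 into `01` while leaving four-digit years unchanged.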

How to delete the resources you created

To stop paying for the resources you created:
