Working with data in Yandex Object Storage
Yandex Query is an interactive service for serverless data analysis. It enables you to process information from different storages without the need to create a dedicated cluster. The service supports working with Yandex Object Storage, Yandex Managed Service for PostgreSQL, and Yandex Managed Service for ClickHouse® data storages.
In this tutorial, you will connect to the Object Storage data source and run queries against it from the JupyterLab notebook using Query.
- Prepare your infrastructure.
- Get started in Query.
- Connect to the Object Storage data.
- Configure data partitioning in Object Storage.
If you no longer need the resources you created, delete them.
Getting started
Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.
- On the DataSphere home page
, click Try for free and select an account to log in with: Yandex ID or your working account in the identity federation (SSO). - Select the Yandex Cloud Organization organization you are going to use in Yandex Cloud.
- Create a community.
- Link your billing account to the DataSphere community you are going to work in. Make sure that you have a billing account linked and its status is
ACTIVE
orTRIAL_ACTIVE
. If you do not have a billing account yet, create one in the DataSphere interface.
Required paid resources
For working with Object Storage data, the cost of infrastructure support includes:
- Fee for DataSphere computing resource usage.
- Fee for storing data in a bucket.
- Fee for the amount of read data when executing Query queries.
Prepare the infrastructure
Log in to the Yandex Cloud management console
If you have an active billing account, you can create or select a folder to deploy your infrastructure in, on the cloud page
Note
If you use an identity federation to access Yandex Cloud, billing details might be unavailable to you. In this case, contact your Yandex Cloud organization administrator.
Create a folder
- In the management console
, select a cloud and click Create folder. - Give your folder a name, e.g.,
data-folder
. - Click Create.
Create a service account for the DataSphere project
- Go to the
data-folder
folder. - In the Service accounts tab, click Create service account.
- Enter a name for the service account, e.g.,
yq-sa
. - Click Add role and assign the following roles to the service account:
datasphere.community-projects.editor
: To run DataSphere computations.yq.editor
: To send Query queries.storage.viewer
: To view the contents of the Object Storage bucket and objects.
- Click Create.
Add the service account to a project
To enable the service account to run a DataSphere project, add it to the list of project members.
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - In the Members tab, click Add member.
- Select the
yq-sa
account and click Add.
Create an authorized key for a service account
To allow the service account to send Query queries, create an authorized key.
Note
Authorized keys do not expire, but you can always get new authorized keys and authenticate again if something goes wrong.
- In the management console
, go todata-folder
. - At the top of the screen, go to the Service accounts tab.
- Select the
yq-sa
service account. - Click Create new key in the top panel and select Create authorized key.
- Select the encryption algorithm and click Create.
- Click Download file with keys.
Create a secret
To get an authorized key from the notebook, create a secret with the contents of the authorized key file.
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - Under Project resources, click
Secret. - Click Create.
- In the Name field, enter the secret name:
yq_access_key
. - In the Value field, paste the full contents of the downloaded file with the authorized key.
- Click Create.
Create a notebook
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - Click Open project in JupyterLab and wait for the loading to complete.
- In the top panel, click File and select New ⟶ Notebook.
- Select a kernel and click Select.
Get started in Query
The yandex_query_magic
package provides magic commands for working in Jupyter. Install it to send queries to Query. Paste the code into the yq-storage.ipynb
notebook cells.
-
Open the DataSphere project:
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - Click Open project in JupyterLab and wait for the loading to complete.
- Open the notebook tab.
-
-
Install the
yandex_query_magic
package:%pip install yandex_query_magic
-
Once the installation is complete, from the top panel, select Kernel ⟶ Restart kernel....
-
Upload the extension:
%load_ext yandex_query_magic
-
Configure the connection by specifying the
data-folder
ID and the name of the authorized key secret:%yq_settings --folder-id <folder_ID> --env-auth yq_access_key
-
Run a test query to Query:
%yq select "Hello, world!"
Connect to data in Object Storage
As an example, let's take the data on New York Yellow Taxi rides. The data was pre-uploaded to an Object Storage public bucket named yq-sample-data
, in the nyc_taxi_csv
folder.
Note
Yandex Cloud provides the New York City taxi trips dataset as is. Yandex Cloud makes no representations, express or implied, warranties, or conditions pertaining to your use of the specified dataset. To the extent allowed by your local laws, Yandex Cloud shall not be liable for any loss or damage, including direct, consequential, special, indirect, incidental, or exemplary, resulting from your use of the dataset.
NYC Taxi and Limousine Commission (TLC):
The data was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The taxi trip data is not generated by the TLC, and the TLC makes no representations whatsoever about the accuracy of this data.
Take a look at the Dataset source
Create a Query connection:
-
In the management console
, selectdata-folder
. -
In the list of services, select Yandex Query.
-
In the left-hand panel, select Connections.
-
Click
Create new. -
Enter a name for the connection, e.g.,
storage-connection
. -
Select the Object Storage connection type and specify the Connection type parameters:
- In the Bucket auth field, select
Public
and specify the bucket name. - Enter the name of the bucket:
yq-sample-data
.
- In the Bucket auth field, select
-
Click Create.
Create a data binding
A data binding contains information about file formats and locations in the bucket, as well as a list of data fields and their types. Query can process data in CSV, TSV, Json, and Parquet formats. For data in CSV, TSV, Json formats, you can additionally use an external archiver: gzip, zstd, and others. For data in Parquet format, Snappy, LZ4, ZSTD, and other compression types are supported.
To create a data binding:
-
In the management console
, selectdata-folder
. -
In the list of services, select Yandex Query.
-
In the left-hand panel, select Bindings.
-
Click
Create. -
Under Connection parameters:
- Type: Select Object Storage.
- Connection: Select
storage-connection
.
-
Under Binding parameters:
- Enter a name for the binding, e.g.,
yq_binding
. - In the Path field, specify the path to the data:
nyc_taxi_sample/data/
. - Select the compression type:
gzip
.
- Enter a name for the binding, e.g.,
-
Under Format settings, select
csv_with_names
in the Format field. -
Under Columns, add two columns with the following names and data types:
tpep_pickup_datetime
:DATETIME
trip_distance
:DOUBLE
-
To verify that the specified data is correct, click Preview. The table should appear below.
-
Click Create.
To check the connection, get the table data from the notebook cell:
%yq SELECT * FROM yq_binding LIMIT 100;
Configure partitioning in Object Storage
In Query, you can significantly reduce query time if you set up rules for partitioning (placing) data in Object Storage to read only the required data. For example, if files in a bucket are sorted in folders by year, Query will only look for data in the specified folders when queried.
Query supports two options for specifying data placement rules: Hive partitioning and Partition projection.
Hive partitioning
In Apache Hive™, data is placed in a folder structure in key=value/key2=value2/...
format. This format is used in systems originally built around the Apache Hadoop™ ecosystem, such as Apache Spark™.
For example, data from the Yellow Taxi dataset is placed in folders according to the Hive partitioning rules:
year=2021/month=1/...
year=2021/month=2/...
- ...
year=2022/month=1/...
year=2022/month=2/...
Configure Hive partitioning:
-
Clone the
yq_binding
data binding:Management console-
In the management console
, selectdata-folder
. -
In the list of services, select Yandex Query.
-
In the left-hand panel, click Bindings.
-
In the line with
yq_binding
, click and select Clone. -
Enter a name for the binding:
yq_tutorial_hive_partitioned
. -
Under Partitioned columns:
- Select Basic.
- Add a column named
year
with theUINT32
data type. - Add a column named
month
with theUINT32
data type.
-
Click Modify.
-
-
To check the result, run the following query:
%yq SELECT * FROM yq_tutorial_hive_partitioned WHERE year=2021
Partition projection
Sometimes data in Object Storage is placed according to unique rules. In such cases, you can use partition projection to specify your own data path template.
When using partition projection, you need to describe each component of the year
and month
path separately, specifying the ranges of their values. Then you need to specify the path template in the Object Storage bucket that stores the data.
Configure partition projection:
-
Clone the
yq_binding
data binding:Management console-
In the management console
, selectdata-folder
. -
In the list of services, select Yandex Query.
-
In the left-hand panel, click Bindings.
-
In the line with
yq_binding
, click and select Clone. -
Enter a name for the binding:
yq_tutorial_partition_projection
. -
Under Partitioned columns:
- Select Extended.
- Add a column named
year
with theUINT32
data type:- Format type:
integer
- Interval:
- Min:
2019
- Max:
2024
- Step: 1
- Min:
- Fixed width: 2
- Format type:
- Add a column named
month
with theUINT32
data type:- Format type:
integer
- Interval:
- Min:
1
- Max:
12
- Step: 1
- Min:
- Fixed width: 2
- Format type:
-
Click Modify.
-
-
To check the result, run the following query:
%yq SELECT * FROM yq_tutorial_partition_projection WHERE year=2021
How to delete the resources you created
To stop paying for the resources you created: