Federated data queries
Yandex Query is an interactive service for serverless data analysis. It enables you to process information from different storages without the need to create a dedicated cluster. The service supports working with Yandex Object Storage, Yandex Managed Service for PostgreSQL, and Yandex Managed Service for ClickHouse® data storages.
Data from these systems can be processed either individually or as part of one common query: such queries are called federated.
In this tutorial, you will create three separate data storages: customers, purchased items, and purchase dates. Using a federated query from a notebook cell, you can get data from all the storages at the same time.
- Prepare your infrastructure.
- Get started in Query.
- Connect to the Object Storage data.
- Connect to the Managed Service for ClickHouse® data.
- Connect to the Managed Service for PostgreSQL data.
- Run a federated query.
If you no longer need the resources you created, delete them.
Getting started
Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.
- On the DataSphere home page
, click Try for free and select an account to log in with: Yandex ID or your working account in the identity federation (SSO). - Select the Yandex Cloud Organization organization you are going to use in Yandex Cloud.
- Create a community.
- Link your billing account to the DataSphere community you are going to work in. Make sure that you have a billing account linked and its status is
ACTIVE
orTRIAL_ACTIVE
. If you do not have a billing account yet, create one in the DataSphere interface.
Required paid resources
The cost of supporting the infrastructure for running federated queries in this example includes:
- Fee for DataSphere computing resource usage.
- Fee for storing data in a bucket.
- Fee for a running Managed Service for ClickHouse® cluster.
- Fee for a running Managed Service for PostgreSQL cluster.
- Fee for the amount of read data when executing Query queries.
Prepare the infrastructure
Log in to the Yandex Cloud management console
If you have an active billing account, you can create or select a folder to deploy your infrastructure in, on the cloud page
Note
If you use an identity federation to access Yandex Cloud, billing details might be unavailable to you. In this case, contact your Yandex Cloud organization administrator.
Create a folder
- In the management console
, select a cloud and click Create folder. - Give your folder a name, e.g.,
data-folder
. - Click Create.
Create a service account for the DataSphere project
- In the management console
, go todata-folder
. - In the Service accounts tab, click Create service account.
- Enter a name for the service account, e.g.,
yq-sa
. - Click Add role and assign the following roles to the service account:
datasphere.community-project.editor
: To run DataSphere computations.yq.editor
: To send Query queries.storage.viewer
: To view the contents of the Object Storage bucket and objects.managed-clickhouse.viewer
: To view the contents of the Managed Service for ClickHouse® cluster.managed-postgresql.viewer
: To view the contents of the Managed Service for PostgreSQL cluster.
- Click Create.
Add the service account to a project
To enable the service account to run a DataSphere project, add it to the list of project members.
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - In the Members tab, click Add member.
- Select the
yq-sa
account and click Add.
Create an authorized key for a service account
To allow the service account to send Query queries, create an authorized key.
Note
Authorized keys do not expire, but you can always get new authorized keys and authenticate again if something goes wrong.
- In the management console
, go todata-folder
. - At the top of the screen, go to the Service accounts tab.
- Select the
yq-sa
service account. - Click Create new key in the top panel and select Create authorized key.
- Select the encryption algorithm and click Create.
- Click Download file with keys.
Create a secret
To get an authorized key from the notebook, create a secret with the contents of the authorized key file.
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - Under Project resources, click
Secret. - Click Create.
- In the Name field, enter the secret name:
yq_access_key
. - In the Value field, paste the full contents of the downloaded file with the authorized key.
- Click Create.
Create a notebook
Queries to the Managed Service for PostgreSQL database through Query will be sent from the notebook.
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - Click Open project in JupyterLab and wait for the loading to complete.
- In the top panel, click File and select New ⟶ Notebook.
- Select a kernel and click Select.
Get started in Query
The yandex_query_magic
package provides magic commands for working in Jupyter. Install it to send queries to Query. Paste the code into the yq-storage.ipynb
notebook cells.
-
Open the DataSphere project:
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - Click Open project in JupyterLab and wait for the loading to complete.
- Open the notebook tab.
-
-
Install the
yandex_query_magic
package:%pip install yandex_query_magic
-
Once the installation is complete, from the top panel, select Kernel ⟶ Restart kernel....
-
Upload the extension:
%load_ext yandex_query_magic
-
Configure the connection by specifying the
data-folder
ID and the name of the authorized key secret:%yq_settings --folder-id <folder_ID> --env-auth yq_access_key
-
Run a test query to Query:
%yq select "Hello, world!"
Connect to the Object Storage data
To work with the Object Storage data, you will need a bucket with a table, a Query connection, and a data binding.
Create a data bucket
The Object Storage bucket will contain the purchase dates.
-
In the management console
, go todata-folder
. -
In the list of services, select Object Storage.
-
At the top right, click Create bucket.
-
In the ** Name** field, enter a name for the bucket.
-
In the Object read access, Object listing access, and Read access to settings fields, select Restricted.
-
Click Create bucket.
-
Create a file named
visits.csv
and put a table with purchase dates in it:date|persons_id|item_id 2024-05-14 12:12:00|1|2 2024-05-15 13:13:00|2|1
-
Go to the created bucket and click Upload.
-
In the window that opens, select
visits.csv
and click Open. -
Click Upload.
Create a connection to Object Storage
-
In the management console
, go todata-folder
. -
In the list of services, select Yandex Query.
-
In the left-hand panel, select Connections.
-
Click
Create new. -
Enter a name for the connection, e.g.,
storage-connection
. -
Select the Object Storage connection type and specify the Connection type parameters.
-
In the Bucket auth field, select
Private
and set the parameters:- Cloud and Folder:
data-folder
. - Bucket: Select the created bucket.
- Service account:
yq-sa
.
- Cloud and Folder:
-
Click Create.
Create a data binding
A data binding contains information about file formats and locations in the bucket, as well as a list of data fields and their types.
To create a data binding:
-
In the management console
, selectdata-folder
. -
In the list of services, select Yandex Query.
-
In the left-hand panel, select Bindings.
-
Click
Create. -
Under Connection parameters:
- Type: Select Object Storage.
- Connection: Select
storage-connection
.
-
Under Binding parameters:
- Enter a name for the binding, e.g.,
visits
. - In the Path field, specify the path to the table in the bucket:
/visits.csv
.
- Enter a name for the binding, e.g.,
-
Under Format settings:
- Format: Select
csv_with_names
. - Delimiter {%secondary%}(optional){%secondary%}:
|
.
- Format: Select
-
Under Columns, add three columns with the following names and data types:
date
:DATETIME
person_id
:INT32
items_id
:INT32
-
To verify that the specified data is correct, click Preview. Your table should appear below.
-
Click Create.
To check the connection, get the table data from the notebook cell:
%yq SELECT * FROM visits LIMIT 100;
Connect to the Managed Service for ClickHouse® data
To work with the Managed Service for ClickHouse® data, you will need a cluster with a table and a Query connection.
Create a Managed Service for ClickHouse® cluster
Any running Managed Service for ClickHouse® cluster with the Yandex Query access option enabled is suitable for sending queries.
-
In the management console
, selectdata-folder
. -
Select Managed Service for ClickHouse.
-
Click Create cluster.
-
In the Cluster name field, enter a name for the cluster, e.g.,
clickhouse
. -
Under DBMS settings:
- In the User management via SQL field, select Enabled from the drop-down list.
- Specify Username and Password.
-
Under Service settings:
- Select the
yq-sa
service account. - Enable the Yandex Query access and Access from the management console options.
- Select the
-
You may leave default values for the other settings.
-
Click Create cluster.
Create a table
The Managed Service for ClickHouse® table will contain the names of the items.
-
In the management console
, open theclickhouse
cluster page, and go to the SQL tab. -
Enter Username and Password you specified when creating the cluster.
-
In the input window on the right, sequentially run the following SQL queries:
CREATE TABLE items(id INT, description VARCHAR) ENGINE=MergeTree ORDER BY id;
INSERT INTO items(id, description) VALUES(1, 'Refrigerator')
INSERT INTO items(id, description) VALUES(2, 'TV')
-
Click Execute.
Create a connection to Managed Service for ClickHouse®
-
In the management console
, selectdata-folder
. -
In the list of services, select Yandex Query.
-
In the left-hand panel, select Connections.
-
Click
Create new. -
Enter a name for the connection, e.g.,
clickhouse
. -
Select the Managed Service for ClickHouse connection type.
-
Under Connection type parameters:
- Cluster: Select the previously created
clickhouse
cluster. - Service account: Select the
yq-sa
service account. - Enter Login and Password specified when you created the cluster.
- Cluster: Select the previously created
-
Click Create.
To check the connection, run a query in the notebook cell. For example:
%yq SELECT * FROM clickhouse.items
Connect to the Managed Service for PostgreSQL data
To work with the Managed Service for PostgreSQL data, you will need a cluster with a table and a Query connection.
Create a Managed Service for PostgreSQL cluster
Any running Managed Service for PostgreSQL cluster with the Yandex Query access option enabled is suitable for sending queries.
- In the management console
, selectdata-folder
. - Select Managed Service for PostgreSQL.
- Click Create cluster.
- In the Cluster name field, enter a name for the cluster, e.g.,
postgresql
. - Under Database:
- Specify the DB name, e.g.,
db1
. - Specify Username and Password.
- Specify the DB name, e.g.,
- Under Service settings, enable Yandex Query access and Access from the management console.
- You may leave default values for the other settings.
- Click Create cluster.
Create a table
The Managed Service for PostgreSQL table will contain the names of the customers.
-
In the management console
, open thepostgresql
cluster page, and go to the SQL tab. -
Enter Username and Password you specified when creating the cluster.
-
In the input window on the right, sequentially run the following SQL queries:
CREATE TABLE persons(person_id INT, name VARCHAR);
INSERT INTO persons(person_id, name) values(1, 'Ivan')
INSERT INTO persons(person_id, name) values(2, 'Peter')
Create a connection to Managed Service for PostgreSQL
-
In the management console
, selectdata-folder
. -
In the list of services, select Yandex Query.
-
In the left-hand panel, select Connections.
-
Click
Create new. -
Enter a name for the connection, e.g.,
postgresql
. -
Select the Managed Service for PostgreSQL connection type.
-
Under Connection type parameters:
- Cluster: Select the previously created
postgresql
cluster. - Service account:
yq-sa
. - Database:
db1
. - Enter Login and Password specified when you created the cluster.
- Cluster: Select the previously created
-
Click Create.
To check the connection, run a query in the notebook cell. For example:
%yq SELECT * FROM postgresql.persons
Run a federated query
Working with federated data sources (cross-service analytics) is no different than working with regular data sources. You can simultaneously access external data sources, such as tables in the DB and data in Object Storage, from a query and perform any allowed YQL operations on them.
To combine and get data from all the three tables, run a federated query in a notebook cell:
%%yq SELECT
v.person_id,
date,
p.name,
c.description
FROM visits AS v
INNER JOIN postgresql.persons AS p ON
v.person_id = p.person_id
INNER JOIN clickhouse.items AS c ON
v.item_id = c.id
How to delete the resources you created
To stop paying for the resources you created:
- Delete the bucket.
- Delete the Managed Service for ClickHouse® cluster.
- Delete the Managed Service for PostgreSQL cluster.
- Delete the project.
ClickHouse® is a registered trademark of ClickHouse, Inc