Analyzing data with Yandex Query
Yandex Query is an interactive service for serverless data analysis. It allows you to process data from various storages using SQL queries without creating a dedicated data processing cluster. Yandex Query works with Yandex Object Storage, Managed Service for PostgreSQL, and Managed Service for ClickHouse® data storages.
To analyze DataSphere data using Query:
- Install and configure the
yandex_query_magic
package. - Create a query template.
- Process the execution results.
Getting started
Before getting started, register in Yandex Cloud, set up a community, and link your billing account to it.
- On the DataSphere home page
, click Try for free and select an account to log in with: Yandex ID or your working account in the identity federation (SSO). - Select the Yandex Cloud Organization organization you are going to use in Yandex Cloud.
- Create a community.
- Link your billing account to the DataSphere community you are going to work in. Make sure that you have a billing account linked and its status is
ACTIVE
orTRIAL_ACTIVE
. If you do not have a billing account yet, create one in the DataSphere interface.
Configure the infrastructure to work with Yandex Query:
- Go to the management console
and link the billing account to the cloud. - Create a folder where Yandex Query will run.
Required paid resources
The cost of analyzing data using Yandex Query includes:
- Fee for DataSphere computing resource usage.
- Fee for data read by Yandex Query when executing queries.
yandex_query_magic
package
Install and configure the -
Open the DataSphere project:
-
Select the relevant project in your community or on the DataSphere homepage
in the Recent projects tab. - Click Open project in JupyterLab and wait for the loading to complete.
- Open the notebook tab.
-
-
Install the
yandex_query_magic
package by running this command in the notebook cell:
%pip install yandex_query_magic --upgrade
- Configure the
yandex_query_magic
package by using theyq_settings
line command:
%yq_settings --folder-id <folder_ID> ...
Available parameters:
--folder-id <folder_ID>
: ID of the folder to run Query queries.--env-auth <environment_variable>
: Enables authentication with the authorized key contained in the Yandex DataSphere secret. Create a DataSphere secret and specify its name in the--env-auth
parameter.
Test the package
You can use the %yq line magic
command with single-line SQL queries. In this case, the %yq
keyword is used to execute a query.
Run the following commands in the notebook:
%load_ext yandex_query_magic
%yq_settings --env-auth <Yandex DataSphere_secret_name> --folder-id <folder_ID>
%yq SELECT "Hello, world!"
Where:
%yq
: Magic command name.SELECT "Hello, world!"
: Query query text.
To execute multi-line SQL queries, you need to use %%yq cell magic
. The query text must begin with the %%yq
keyword:
%%yq --folder-id <folder_ID> --name "My query" --raw-results
SELECT
col1,
COUNT(*)
FROM table
GROUP BY col1
Where:
--folder-id
: ID of the folder to run Query queries. The default folder is the one specified earlier through%yq_settings
. If not specified, it defaults to the folder in which the VM is running.--name
: (Optional) Query name.--description
: (Optional) Query description.--raw-results
: (Optional) Returns raw Query query processing results. For format specifications, see Correspondence of YQL and JSON types.
Creating query templates using the mustache syntax
You can use the templates of the computations exchanged between Jupyter and Query to work with queries or perform standard operations without writing code. To allow this, Query has a built-in support for the mustache syntax{{}}
key symbols. You can use the mustache syntax with Jinja2
The {{ yq-name
}} built-in mustache templates allow you to insert variables from the Jupyter runtime environment directly into SQL queries. Such variables will also be automatically converted into the required Query data structures. For example:
myQuery = "select * from Departments"
%yq {{myQuery}}
The {{myQuery}}
mustache string will indicate the name of the source variable for the text. Also, the select * from Departments
text will be sent to Query for execution.
Using mustache templates streamlines the integration between Jupyter and Query. Let's assume you have the lst=["Academy", "Physics"]
Python list containing the names of departments whose data you want to process. Without the mustache syntax support in Query, first, you would need to convert the Python list into a string and then input it into the SQL query. Query example:
var lstStr = ",".join(lst)
sqlQuery = f'select "Academy" in ListCreate({lstStr});
%yq {{sqlQuery}}
I.e., working with complex data types requires a detailed knowledge of the Query SQL syntax. With the mustache syntax, you can write a much simpler query:
%yq select "Academy" in {{lst}}
Here, lst
will be identified as a Python list and automatically converted into its SQL equivalent. This is the final query that will be sent to Query:
%yq select "Academy" in ListCreate("Academy", "Physics") as lst
Jinja2
To perform standard tasks in Jupyter and Query, we recommend using the built-in mustache syntax. For more advanced templating, you can use Jinja2.
To install Jinja2, run this command:
%pip install Jinja2
Example of using a Jinja template with the for
cycle:
{% for user in users %}
command = "select * from users where name='{{ user }}'"
{% endfor %}
You can also use Jinja templates to perform various data processing operations. This example illustrates operations performed on department names based on whether the student belongs to the department:
{% if student.department == "Academy" %}
{{ student.department|upper }}
{% elif upper(student.department) != "MATHS DEPARTMENT" %}
{{ student.department|capitalize }}
{% endif %}
To make sure Jinja conversions comply with the Query rules, use the to_yq
filter. Here is what the lst=["Academy", "Physics"]
Python list from the above example looks like in a Jinja template:
%%yq --jinja2
select "Academy" in {{lst|to_yq}}
If you need to disable templating, use the --no-var-expansion
argument:
%%yq --no-var-expansion
...
Built-in mustache templates
In Yandex Query, the built-in mustache templates are enabled by default and help to streamline basic operations with Jupyter variables:
lst=["Academy", "Physics"]
%yq select "Academy" in {{lst}}
Using Pandas DataFrame variables
Example of using yandex_query_magic
and the mustache syntax with Pandas DataFrame
-
Declare the variable in Jupyter:
df = pandas.DataFrame({'_float': [1.0], '_int': [1], '_datetime': [pd.Timestamp('20180310')], '_string': ['foo']})
You can use df
as a variable in queries to Yandex Query. During query execution, the df
value is used to create a df
temporary table. The table can be used within the Yandex Query query that is currently running.
-
Get the data:
%%yq SELECT * FROM mytable INNER JOIN {{df}} ON mytable.id=df._int
Table of Pandas types mapping to Query types:
Pandas type | YQL type | Comment |
---|---|---|
int64 | Int64 | Exceeding the int64 limit will result in a query execution error. |
float64 | Double | |
datetime64[ns] | Timestamp | Precision to the microsecond. Specifying nanoseconds (in the nanosecond field |
str | String |
Using Python dict variables
Example of using yandex_query_magic
and the mustache syntax with a Python dict:
-
Declare the variable in Jupyter:
dct = {"a": "1", "b": "2", "c": "test", "d": "4"}
Now you can use the
dct
variable in Query queries. When executing a query,dct
will convert into a mapping YQL Dict object:Key Value a "1" b "2" c "test" d "4" -
Get the data:
%%yq SELECT "a" in {{dct}}
Table of Python dict types mapping to Query types:
Python type | YQL type | Comment |
---|---|---|
int | Int64 | Exceeding the int64 limit will result in a query execution error. |
float | Double | |
datetime | Timestamp | |
str | String |
You can also convert a dictionary into a Pandas DataFrame table using a constructor:
df = pandas.DataFrame(dct)
Using Python list variables
Example of using yandex_query_magic
and the mustache syntax with a Python list:
-
Declare the variable in Jupyter:
lst = [1,2,3]
Then, you can use the
lst
variable in Query queries. When executing a query,lst
will convert into a mapping YQL Dict object: -
Get the data:
%%yq SELECT 1 IN {{lst}}
Table of Python list types mapping to Query types:
Python type | YQL type | Comment |
---|---|---|
int | Int64 | Exceeding the int64 limit will result in a query execution error. |
float | Double | |
datetime | Timestamp | |
str | String |
You can also convert a list into a Pandas DataFrame table using a constructor:
df = pandas.DataFrame(lst,
columns =['column1', 'column2', 'column3'])
Jinja templates
Jinja templates provide a convenient way to generate SQL queries. They allow you to automatically insert various data, e.g, search conditions, so that you do not need to write each query manually. This streamlines work, prevents errors, and results in more readable code.
With Jinja templates, you can also automate generating queries that contain repeated elements. For example, you can use loops in your template to add a list of values to check in a query. This provides you with additional flexibility and enables you to write complex queries faster when you need to handle large amounts of data.
The steps below explain how to filter data in Yandex Query using a Python variable.
-
Declare the variable in Jupyter:
name = "John"
-
When running the following code in the Jupyter cell, make sure to specify the
jinja2
flag before executing an SQL query for it to be interpreted as a Jinja2 template :%%yq <other_parameters> --jinja2 SELECT "{{name}}"
Parameters:
to_yq
filter
Jinja2 is a general-purpose templating engine. When processing variable values, it uses a standard string representation of data types.
For example, you have a defined lst=["Academy", "Physics"]
Python list. This is how you can use it in a Jinja template:
%%yq --jinja2
select "Academy" in {{lst}}
This will result in the Unexpected token '['
error. Jinja converts the lst
variable to an ["Academy", "Physics"]
string according to Python rules but disregards the Yandex Query-specific features of SQL queries, which causes the error.
To specify that Jinja conversions must comply with the Yandex Query rules, use the to_yq
filter. Then, the same query in the Jinja syntax will look like this:
%%yq --jinja2
select "Academy" in {{lst|to_yq}}
The to_yq
Jinja filter converts data to the Yandex Query syntax in exactly the same manner as built-in mustache templates.
Capture command results
To capture a line magic command result, you can use the assignment command:
varname = %yq <query>
To capture a cell magic command result, you can specify the variable name and the <<
operator at the beginning of the query text:
%%yq
varname << <query>
Then, you can use the result as a standard Jupyter variable.
For example, this is how you capture a command result to output
variables using cell magic:
output = %yq SELECT 1 as column1
And this is how you can capture a command result to a output2
variable using line magic:
%%yq
output2 << SELECT 'Two' as column2, 3 as column3
Then, you can use these variables as standard IPython variables. For example, you can print them:
output
By default, the %yq
and %%yq
commands output a Pandas DataFramePandas DataFrame
conversion, you can use the --raw-results argument.
The output
variable in the above example will have the following structure:
column1 | |
---|---|
0 | 1 |
The output2
variable will look like this:
column2 | column3 | |
---|---|---|
0 | Two | 3 |
If a query cannot return a result by its nature (e.g., insert into table select * from another_table
), it will return the None
value. If a query returns multiple sets of results, they will be displayed as a list
of individual results.
When executing a query, yandex_query_magic
outputs additional data, e.g., query ID, start time, and execution duration:
To hide the execution progress information for a cell, you can use the additional %%capture
command.
%%capture
%%yq
<query>
In this case, the execution progress will not be output to the console.