Automating Yandex Query tasks using Yandex Managed Service for Apache Airflow™
Managed Service for Apache Airflow™ is a popular tool for automating data operations. Yandex Query supports integration with Managed Service for Apache Airflow™ using the apache-airflow-providers-yandex
To create an infrastructure for automation of Yandex Query tasks using Managed Service for Apache Airflow™, follow these steps:
- Prepare your cloud.
- Create a service account.
- Create a cloud network and subnets.
- Prepare a bucket in Object Storage.
- Configure an egress NAT.
- Create a Managed Service for Apache Airflow™ cluster.
- Prepare the DAG file and run the graph.
- Check the result.
If you no longer need the resources you created, delete them.
Prepare your cloud
Sign up for Yandex Cloud and create a billing account:
- Go to the management console
and log in to Yandex Cloud or create an account if you do not have one yet. - On the Yandex Cloud Billing
page, make sure you have a billing account linked and it has theACTIVE
orTRIAL_ACTIVE
status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page
Learn more about clouds and folders.
Required paid resources
The cost of support for the new infrastructure includes:
- Fee for bucket usage (see Yandex Object Storage pricing).
- Fee for using the NAT gateway (see Yandex Virtual Private Cloud pricing).
- Fee for the amount of data read (see Yandex Query pricing).
- Fee for using the Yandex Managed Service for Apache Airflow™ cluster (see Managed Service for Apache Airflow™ pricing).
Create a service account
Create a service account named airflow-sa
with the editor
role for the folder where the Managed Service for Apache Airflow™ cluster will be created:
- In the management console
, select the folder where you want to create a service account. - Go to the Service accounts tab.
- Click Create service account.
- Enter a name for the service account:
airflow-sa
. - Click
Add role and select theeditor
role. - Click Create.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name
or --folder-id
parameter.
-
Create a service account named
airflow-sa
:yc iam service-account create airflow-sa
Result:
id: nfersamh4sjq******** folder_id: b1gc1t4cb638******** created_at: "2023-09-21T10:36:29.726397755Z" name: airflow-sa
Save the
id
of theairflow-sa
service account and the folder where it was created (folder_id
). -
Assign the
editor
role for the folder to the service account by specifying the folder and service account IDs you previously saved:yc resource-manager folder add-access-binding <folder_ID> \ --role editor \ --subject serviceAccount:<service_account_ID>
To create a service account, use the create REST API method for the ServiceAccount resource or the ServiceAccountService/Create gRPC API call.
To assign the editor
role for the folder to the service account, use the setAccessBindings method for the ServiceAccount resource or the ServiceAccountService/SetAccessBindings gRPC API call.
Create a cloud network and subnets
All resources you create in the tutorial will belong to the same cloud network.
- In the management console
, select the folder. - In the list of services, select Virtual Private Cloud.
- Click Create network.
- In the Name field, specify
yq-network
. - In the Advanced field, select
Create subnets
. - Click Create network.
-
Create a network named
yq-network
:yc vpc network create yq-network
Result:
id: enptrcle5q3d******** folder_id: b1g9hv2loamq******** created_at: "2022-04-04T05:25:03Z" name: yq-network default_security_group_id: enpbsnnop4ak********
For more information about the
yc vpc network create
command, see the CLI reference. -
Create subnets:
-
Availability zone:
ru-central1-a
yc vpc subnet create yq-network-ru-central1-a \ --zone ru-central1-a \ --network-name yq-network \ --range 10.1.0.0/16
Result:
id: b0c3pte4o2kn******** folder_id: b1g9hv2loamq******** created_at: "2022-04-04T09:28:08Z" name: yq-network-ru-central1-a network_id: enptrcle5q3d******** zone_id: ru-central1-a v4_cidr_blocks: - 10.1.0.0/16
-
Availability zone:
ru-central1-b
:yc vpc subnet create yq-network-ru-central1-b \ --zone ru-central1-b \ --network-name yq-network \ --range 10.2.0.0/16
-
Availability zone:
ru-central1-d
:yc vpc subnet create yq-network-ru-central1-d \ --zone ru-central1-d \ --network-name yq-network \ --range 10.3.0.0/16
For more information about the
yc vpc subnet create
command, see the CLI reference. -
-
To create a network, use the create REST API method for the Network resource or the NetworkService/Create gRPC API call.
-
To create subnets, use the create REST API method for the Subnet resource or the SubnetService/Create gRPC API call.
Prepare a bucket in Object Storage
Create a bucket
- In the management console
, select the folder. - In the list of services, select Object Storage.
- Click Create bucket.
- In the ** Name** field, enter a unique name for the bucket, e.g.,
airflow-bucket
. - In the Object read access and Object listing access fields, select
Public
. - Click Create bucket.
If you do not have the AWS CLI yet, install and configure it.
-
Create a bucket by specifying a unique name for it:
aws --endpoint-url https://storage.yandexcloud.net \ s3 mb s3://<bucket_name>
Result:
make_bucket: s3://airflow-bucket
-
Enable public access to reading objects and their list in the bucket you created:
aws --endpoint-url https://storage.yandexcloud.net \ s3api put-bucket-acl \ --bucket <bucket_name> \ --acl public-read
To create a bucket, use the create REST API method for the Bucket resource, the BucketService/Create gRPC API call, or the create S3 API method.
Set up the bucket ACL
Edit the ACL of the new bucket to give the READ
permission to the airflow-sa
service account.
Configure an egress NAT
Create a NAT gateway
- In the management console
, select the folder. - In the list of services, select Virtual Private Cloud.
- In the left-hand panel, select Gateways.
- In the window that opens, click Create gateway:
- In the Name field, enter the name:
yq-nat
. - In the Type field, select
Egress NAT
. - Click Save.
- In the Name field, enter the name:
Create a NAT gateway in the default folder:
yc vpc gateway create \
--name yq-nat
Result:
id: enpkq1sb7hed********
folder_id: b1g681qpemb4********
created_at: "2024-05-19T13:20:36Z"
name: yq-nat
shared_egress_gateway: {}
For more information about the yc vpc gateway create
command, see the CLI reference.
Save the NAT gateway id
. You will need it when creating a route table.
To create a NAT gateway, use the create REST API method for the Gateway resource or the GatewayService/Create gRPC API call.
Create a route table
- In the left-hand panel, select Routing tables.
- Click Create routing table and specify the route table parameters:
- Enter the name:
yq-route-table
. - Select the network:
yq-network
. - Click Add a route.
- In the Next hop field, select
Gateway
. - In the Gateway field, select the
yq-nat
NAT gateway. The destination prefix will be propagated automatically.
- In the Next hop field, select
- Click Add.
- Enter the name:
- Click Create a routing table.
Create a route table with the yq-nat
NAT gateway as the next hop and the 0.0.0.0/0
destination prefix:
yc vpc route-table create \
--name=yq-route-table \
--network-name=yq-network \
--route destination=0.0.0.0/0,gateway-id=<NAT_gateway_ID>
Result:
id: enp4v8foko6s********
folder_id: b1g681qpemb4********
created_at: "2024-05-19T13:22:47Z"
name: yq-route-table
network_id: enppoggov6ub********
static_routes:
- destination_prefix: 0.0.0.0/0
gateway_id: enpkq1sb7hed********
For more information about the yc vpc route-table create
command, see the CLI reference.
To create a route table, use the create REST API method for the RouteTable resource or the RouteTableService/Create gRPC API call.
Link the route table to a subnet
Link the route table to a subnet to route subnet traffic via the NAT gateway:
- In the left-hand panel, select
Subnets. - In the
yq-network-ru-central1-a
row, click . - Click Link routing table and select
yq-route-table
. - Click Link.
Run this command:
yc vpc subnet update yq-network-ru-central1-a \
--route-table-name=yq-route-table
Result:
id: e9b6n3jj3gh6********
folder_id: b1g681qpemb4********
created_at: "2024-05-19T13:24:58Z"
name: yq-network-ru-central1-a
network_id: enppoggov6ub********
zone_id: ru-central1-a
v4_cidr_blocks:
- 10.1.0.0/16
route_table_id: enp4v8foko6s********
dhcp_options: {}
For more information about the yc vpc subnet update
command, see the CLI reference.
To link a route table to a subnet, use the update REST API method for the Subnet resource or the SubnetService/Update gRPC API call.
Warning
This API method overrides all parameters of the object being modified that were not explicitly passed in the request to the default values. To avoid this, list the settings you want to change in the updateMask
parameter (one line separated by commas).
Create a Managed Service for Apache Airflow™ cluster
-
In the management console
, select the folder you want to create a cluster in. -
Select Managed Service for Apache Airflow™.
-
Click Create a cluster.
-
Under Basic parameters, enter a name for your cluster. The name must be unique within the folder.
-
Under Access settings, set a password for the admin user. The password must be not less than 8 characters long and contain at least:
- One uppercase letter
- One lowercase letter
- One digit
- One special character
Note
Save the password locally or memorize it. The service does not show passwords after the registry is created.
-
Under Network settings, select:
-
Availability zone:
ru-central1-a
-
Cloud network:
yq-network
-
Subnet:
yq-network-ru-central1-a
-
Security group: default
Security group settings do not affect access to the Apache Airflow™ web interface.
-
-
Under Dependencies, specify the pip package name and version restriction:
apache-airflow-providers-yandex>=3.10
-
Under DAG file storage, select the bucket you created earlier.
-
Click Create.
Prepare the DAG file and run the graph
As an example, we use a directed acyclic graph (DAG) with two vertices:
yq_operator
: Runs a simple query to Yandex Query.output_operator
: Outputs the result of theyq_operator
vertex execution.
To prepare a DAG:
-
Create a local file named
yq_dag.py
and copy the following script into it:yq_dag.py
import datetime from airflow.models.dag import DAG from airflow.providers.yandex.operators.yq import YQExecuteQueryOperator from airflow.operators.python_operator import PythonOperator with DAG( dag_id="yq_hello_world_operator", schedule_interval="@hourly", start_date=datetime.datetime.now(), ) as dag: yq_operator = YQExecuteQueryOperator( task_id="yq_operator", sql="SELECT 'Hello, world!'" ) def print_context(ds=None, **kwargs): ti = kwargs["ti"] print(ti.xcom_pull(task_ids="yq_operator")) output_operator = PythonOperator( task_id="output_operator", provide_context=True, python_callable=print_context ) yq_operator >> output_operator if __name__ == "__main__": dag.test()
-
Upload the DAG file into the Managed Service for Apache Airflow™ cluster. Create a
files/dags
folder in theairflow-bucket
and upload theyq_dag.py
file to it. -
Make sure the new
yq_hello_world_operator
DAG file has appeared in the DAGs section.Note
It may take a few minutes to upload a DAG file from the bucket.
-
To run a DAG, first click in the line with its name, and then click Trigger DAG.
Check the result
- In the DAGs section, open the
yq_hello_world_operator
graph. - Go to the Grid section.
- Select the yq_operator task.
- Go to the XCom section.
- Make sure the
return_value
line has the'rows': [['Hello, world!']]
element. This indicates the query was successful.
Delete the resources you created
To delete the infrastructure and stop paying for the resources you created:
- Delete the Object Storage bucket.
- Disassociate and delete the routing table.
- Delete the NAT gateway.
- Delete the Apache Airflow™ cluster.
- Delete the subnets, network, and the service account, if required.