Automating Yandex Query tasks using Yandex Managed Service for Apache Airflow™
- Prepare your cloud
- Create a service account
- Create a static access key
- Create a cloud network and subnets
- Create a bucket in Object Storage
- Configure an egress NAT
- Create a Managed Service for Apache Airflow™ cluster
- Prepare the DAG file and run the graph
- Check the result
- How to delete the resources you created
Managed Service for Apache Airflow™ is a popular tool for automation of data operations. Yandex Query supports integration with Managed Service for Apache Airflow™ using the apache-airflow-providers-yandex
To create an infrastructure for automation of Yandex Query tasks using Managed Service for Apache Airflow™, follow these steps:
- Prepare your cloud.
- Create a service account.
- Create a static access key.
- Create a cloud network and subnets.
- Create a bucket in Object Storage.
- Configure an egress NAT.
- Create a Managed Service for Apache Airflow™ cluster.
- Prepare the DAG file and run the graph.
- Check the result.
If you no longer need the resources you created, delete them.
Prepare your cloud
Sign up for Yandex Cloud and create a billing account:
- Go to the management console
and log in to Yandex Cloud or create an account if you do not have one yet. - On the Yandex Cloud Billing
page, make sure you have a billing account linked and it has theACTIVE
orTRIAL_ACTIVE
status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page
Learn more about clouds and folders.
Required paid resources
The cost of support for the new infrastructure includes:
- Fee for bucket usage (see Yandex Object Storage pricing).
- Fee for using the NAT gateway (see Yandex Virtual Private Cloud pricing).
- Fee for the amount of data read (see Yandex Query pricing).
- Fee for using the Yandex Managed Service for Apache Airflow™ cluster (see Managed Service for Apache Airflow™ pricing).
Create a service account
Create a service account named airflow-sa
with the editor
role for the folder where the Managed Service for Apache Airflow™ cluster will be created:
- In the management console
, select the folder where you want to create your service account. - Go to the Service accounts tab.
- Click Create service account.
- Enter the service account name:
airflow-sa
. - Click
Add role and select theeditor
role. - Click Create.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name
or --folder-id
parameter.
-
Create a service account named
airflow-sa
:yc iam service-account create airflow-sa
Result:
id: nfersamh4sjq******** folder_id: b1gc1t4cb638******** created_at: "2023-09-21T10:36:29.726397755Z" name: airflow-sa
Save the
id
of theairflow-sa
service account and the folder it was created in (folder_id
). -
Assign the service account the
editor
role for the folder by specifying the folder and service account IDs you saved earlier:yc resource-manager folder add-access-binding <folder_ID> \ --role editor \ --subject serviceAccount:<service_account_ID>
To create a service account, use the create REST API method for the ServiceAccount resource or the ServiceAccountService/Create gRPC API call.
To assign the service account the editor
role for the folder, use the setAccessBindings method for the ServiceAccount resource or the ServiceAccountService/SetAccessBindings gRPC API call.
Create a static access key
Create a static access key for the airflow-sa
service account:
-
In the management console
, select the folder containing theairflow-sa
service account. -
At the top of the screen, go to the Service accounts tab.
-
Select the
airflow-sa
service account. -
In the top panel, click
Create new key and select Create static access key. -
Specify the key description and click Create.
-
Save the key ID and secret key. You will need them when creating a cluster.
Alert
After you close the dialog, the private key value will become unavailable.
-
Run this command:
yc iam access-key create --service-account-name airflow-sa
Result:
access_key: id: aje6t3vsbj8l******** service_account_id: nfersamh4sjq******** created_at: "2018-11-22T14:37:51Z" key_id: 0n8X6WY6S24N******** secret: JyTRFdqw8t1kh2-OJNz4JX5ZTz9Dj1rI********
-
Save the key ID and secret key. You will need them when creating a cluster.
To create an access key, use the create REST API method for the AccessKey resource or the AccessKeyService/Create gRPC API call.
Create a cloud network and subnets
All resources you create in the tutorial will belong to the same cloud network.
- In the management console
, select the folder. - In the list of services, select Virtual Private Cloud.
- Click Create network.
- In the Name field, specify
yq-network
. - In the Advanced field, select
Create subnets
. - Click Create network.
-
Create a network named
yq-network
:yc vpc network create yq-network
Result:
id: enptrcle5q3d******** folder_id: b1g9hv2loamq******** created_at: "2022-04-04T05:25:03Z" name: yq-network default_security_group_id: enpbsnnop4ak********
For more information about the
yc vpc network create
command, see the CLI reference. -
Create subnets:
-
In the
ru-central1-a
availability zone:yc vpc subnet create yq-network-ru-central1-a \ --zone ru-central1-a \ --network-name yq-network \ --range 10.1.0.0/16
Result:
id: b0c3pte4o2kn******** folder_id: b1g9hv2loamq******** created_at: "2022-04-04T09:28:08Z" name: yq-network-ru-central1-a network_id: enptrcle5q3d******** zone_id: ru-central1-a v4_cidr_blocks: - 10.1.0.0/16
-
In the
ru-central1-b
availability zone:yc vpc subnet create yq-network-ru-central1-b \ --zone ru-central1-b \ --network-name yq-network \ --range 10.2.0.0/16
-
In the
ru-central1-d
availability zone:yc vpc subnet create yq-network-ru-central1-d \ --zone ru-central1-d \ --network-name yq-network \ --range 10.3.0.0/16
For more information about the
yc vpc subnet create
command, see the CLI reference. -
-
To create a network, use the create REST API method for the Network resource or the NetworkService/Create gRPC API call.
-
To create subnets, use the create REST API method for the Subnet resource or the SubnetService/Create gRPC API call.
Create a bucket in Object Storage
- In the management console
, select the folder. - In the list of services, select Object Storage.
- Click Create bucket.
- In the ** Name** field, enter a unique name for the bucket, e.g.,
airflow-bucket
. - In the Object read access and Object listing access fields, select
Public
. - Click Create bucket.
If you do not have the AWS CLI yet, install and configure it.
-
Create a bucket by specifying a unique name for it:
aws --endpoint-url https://storage.yandexcloud.net \ s3 mb s3://<bucket_name>
Result:
make_bucket: s3://airflow-bucket
-
Enable public access to reading objects and their list in the bucket you created:
aws --endpoint-url https://storage.yandexcloud.net \ s3api put-bucket-acl \ --bucket <bucket_name> \ --acl public-read
To create a bucket, use the create REST API method for the Bucket resource, the BucketService/Create gRPC API call, or the create S3 API method.
Configure an egress NAT
Create a NAT gateway
- In the management console
, select the folder. - In the list of services, select Virtual Private Cloud.
- In the left-hand panel, select Gateways.
- In the window that opens, click Create gateway:
- In the Name field, enter
yq-nat
. - In the Type field, select
Egress NAT
. - Click Save.
- In the Name field, enter
Create a NAT gateway in the default folder:
yc vpc gateway create \
--name yq-nat
Result:
id: enpkq1sb7hed********
folder_id: b1g681qpemb4********
created_at: "2024-05-19T13:20:36Z"
name: yq-nat
shared_egress_gateway: {}
For more information about the yc vpc gateway create
command, see the CLI reference.
Save the NAT gateway ID (id
). You will need it when creating a route table.
To create a NAT gateway, use the create REST API method for the Gateway resource or the GatewayService/Create gRPC API call.
Create a route table
- In the left-hand panel, select Routing tables.
- Click Create routing table and specify the route table parameters:
- Enter the name:
yq-route-table
. - Select the
yq-network
network. - Click Add a route.
- In the Next hop field, select
Gateway
. - In the Gateway field, select the
yq-nat
NAT gateway. The destination prefix will be propagated automatically.
- In the Next hop field, select
- Click Add.
- Enter the name:
- Click Create a routing table.
Create a route table with the yq-nat
NAT gateway as the next hop and the 0.0.0.0/0
destination prefix:
yc vpc route-table create \
--name=yq-route-table \
--network-name=yq-network \
--route destination=0.0.0.0/0,gateway-id=<NAT_gateway_ID>
Result:
id: enp4v8foko6s********
folder_id: b1g681qpemb4********
created_at: "2024-05-19T13:22:47Z"
name: yq-route-table
network_id: enppoggov6ub********
static_routes:
- destination_prefix: 0.0.0.0/0
gateway_id: enpkq1sb7hed********
For more information about the yc vpc route-table create
command, see the CLI reference.
To create a route table, use the create REST API method for the RouteTable resource or the RouteTableService/Create gRPC API call.
Link the route table to a subnet
Link the route table to a subnet to route subnet traffic via the NAT gateway:
- In the left-hand panel, select
Subnets. - In the
yq-network-ru-central1-a
subnet line, click . - Click Link routing table and select the
yq-route-table
route table. - Click Link.
Run this command:
yc vpc subnet update yq-network-ru-central1-a \
--route-table-name=yq-route-table
Result:
id: e9b6n3jj3gh6********
folder_id: b1g681qpemb4********
created_at: "2024-05-19T13:24:58Z"
name: yq-network-ru-central1-a
network_id: enppoggov6ub********
zone_id: ru-central1-a
v4_cidr_blocks:
- 10.1.0.0/16
route_table_id: enp4v8foko6s********
dhcp_options: {}
For more information about the yc vpc subnet update
command, see the CLI reference.
To link a route table to a subnet, use the update REST API method for the Subnet resource or the SubnetService/Update gRPC API call.
Warning
This API method overrides all parameters of the object being modified that were not explicitly passed in the request to the default values. To avoid this, list the settings you want to change in the updateMask
parameter (one line separated by commas).
Create a Managed Service for Apache Airflow™ cluster
-
In the management console
, select the folder where you want to create a cluster. -
Select Managed Service for Apache Airflow™.
-
Click Create a cluster.
-
Under Basic parameters, enter a name for your cluster. The name must be unique within the folder.
-
Under Access settings, set a password for the admin user. The password must be not less than 8 characters long and contain at least:
- One uppercase letter
- One lowercase letter
- One digit
- One special character
Note
Save the password locally or memorize it. The service does not show passwords after the registry is created.
-
Under Network settings, select:
-
Availability zone:
ru-central1-a
-
Cloud network:
yq-network
-
Subnet:
yq-network-ru-central1-a
-
Security group: default
Security group settings do not affect access to the Apache Airflow™ web interface.
-
-
Under Dependencies, specify the pip package name and version restriction:
apache-airflow-providers-yandex>=3.10
-
Under DAG file storage:
- Select the bucket you created ealier.
- Specify the static access key for the
airflow-sa
service account and the key ID.
-
Click Create.
Prepare the DAG file and run the graph
Below, we consider a directed acyclic graph (DAG) with two vertices:
yq_operator
: Runs a simple query to Yandex Query.output_operator
: Outputs the result of theyq_operator
vertex execution.
To prepare a DAG:
-
Create a local file named
yq_dag.py
and copy the following script into it:yq_dag.pyimport datetime from airflow.models.dag import DAG from airflow.providers.yandex.operators.yq import YQExecuteQueryOperator from airflow.operators.python_operator import PythonOperator with DAG( dag_id="yq_hello_world_operator", schedule_interval="@hourly", start_date=datetime.datetime.now() ) as dag: yq_operator = YQExecuteQueryOperator(task_id="yq_operator", sql="SELECT 'Hello, world!'") def print_context(ds=None, **kwargs): ti = kwargs['ti'] print(ti.xcom_pull(task_ids='yq_operator')) output_operator = PythonOperator( task_id = "output_operator", provide_context=True, python_callable = print_context ) yq_operator >> output_operator if __name__ == "__main__": dag.test()
-
Upload the DAG file into the Managed Service for Apache Airflow™ cluster. To do this, create a folder named
files/dags
in theairflow-bucket
bucket and upload theyq_dag.py
file to it. -
Make sure that a new DAG file named
yq_hello_world_operator
has appeared in the DAGs section.Note
It may take a few minutes to upload a DAG file from the bucket.
-
To run DAG, first click in the line with its name and then click Trigger DAG.
Check the result
- In the DAGs section, open the
yq_hello_world_operator
graph. - Go to the Grid section.
- Select the yq_operator task.
- Go to the XCom section.
- Make sure the
return_value
line has the'rows': [['Hello, world!']]
element. This indicates the query was successful.
How to delete the resources you created
To delete the infrastructure and stop paying for the resources you created:
- Delete the Object Storage bucket.
- Unlink the route table from the subnet.
- Delete the route table.
- Delete the NAT gateway.
- Delete the Apache Airflow™ cluster.
- Delete the subnets, network, and the service account, if required.