Creating an Apache Airflow™ cluster
Every Managed Service for Apache Airflow™ cluster consists of a set of Apache Airflow™ components, each of which can run as multiple instances. The instances may reside in different availability zones.
Roles for creating a cluster
To create a Managed Service for Apache Airflow™ cluster, your Yandex Cloud account needs the following roles:
- managed-airflow.editor: To create a cluster.
- vpc.user: To use the cluster network.
- iam.serviceAccounts.user: To link a service account to a cluster.
Make sure to assign the managed-airflow.integrationProvider role to the cluster's service account. This gives the cluster the permissions it needs to work with user resources. For more information, see Impersonation.
For more information about assigning roles, see the Yandex Identity and Access Management documentation.
Creating a cluster
- In the management console, select the folder where you want to create a cluster.
- Select Managed Service for Apache Airflow™.
- Click Create a cluster.
- Under Basic parameters:
  - Enter a name for the cluster. The name must be unique within the folder.
  - (Optional) Enter a cluster description.
  - (Optional) Create labels:
    - Click Add label.
    - Enter a label in key: value format.
    - Press Enter.
- Under Access settings:
  - Set a password for the admin user. The password must be at least 8 characters long and contain at least:
    - One uppercase letter
    - One lowercase letter
    - One digit
    - One special character

    Note
    Save the password locally or memorize it. The service does not show passwords after the cluster is created.
  - Select an existing service account or create a new one.
- Under Network settings, select:
  - Availability zones for the cluster
  - Cloud network
  - Subnet in each of the selected availability zones

    Yandex Cloud manages Managed Service for Apache Airflow™ cluster components in the auxiliary subnet. Make sure the IP address range of the subnets you selected does not overlap with the 10.248.0.0/13 auxiliary subnet address range. Otherwise, you will get an error when creating a cluster.
  - Security group for the cluster network traffic

    Security group settings do not affect access to the Apache Airflow™ web interface.
- Set the number of instances and a computing resource configuration for the Managed Service for Apache Airflow™ components:
  - Web server
  - Scheduler
  - Workers

    Note
    If the task queue is empty, the number of workers stays at the minimum value. When tasks appear, the number of workers increases up to the maximum value.
  - (Optional) Triggerer services
- (Optional) Under Dependencies, specify pip and deb package names to install additional libraries and applications in the cluster to run DAG files.

  To specify multiple packages, click Add.

  If required, you can set version restrictions for the installed packages, for example:

      pandas==2.0.2
      scikit-learn>=1.0.0
      clickhouse-driver~=0.2.0

  The package name format and version are defined by the install command: pip install for pip packages and apt install for deb packages.

  Warning
  To install pip and deb packages from public repositories, specify a network with configured egress NAT under Network settings.
- Under DAG file storage, select a bucket or create a new one. This bucket will store DAG files.

  Make sure to grant the READ permission for this bucket to the cluster service account.
- (Optional) Under Advanced settings, enable cluster deletion protection.
- (Optional) Under Airflow configuration:
  - Specify Apache Airflow™ additional properties, e.g., api.maximum_page_limit as a key and 150 as its value.

    Fill in the fields manually or import the settings from a configuration file (see the configuration file example).
  - Enable the Use Lockbox Secret Backend option to use secrets in Yandex Lockbox to store Apache Airflow™ configuration data, variables, and connection parameters.

    To extract the required information from the secret, the cluster service account must have the lockbox.payloadViewer role.

    You can assign this role either at the folder level or at the individual secret level.
- (Optional) Under Log settings, enable logging. Logs generated by Apache Airflow™ components will be sent to Yandex Cloud Logging. Set logging parameters:
  - In the Destination field, specify the logging destination:
    - Folder: Select the folder. Logs will be written to the default log group for this folder.
    - Log group: Select a custom log group or create a new one.
  - Select the minimum logging level: TRACE, DEBUG, INFO (default), WARN, ERROR, or FATAL.
- Click Create.
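The subnet requirement from the network settings step can be checked before you fill in the form. The following is a minimal sketch using only the Python standard library; the function name and the sample CIDR ranges are illustrative assumptions, and only the 10.248.0.0/13 range comes from this page.

```python
import ipaddress

# Auxiliary subnet range quoted in the network settings step.
AUXILIARY_RANGE = ipaddress.ip_network("10.248.0.0/13")


def overlapping_subnets(cidrs):
    """Return the CIDR strings that overlap the auxiliary range."""
    return [
        cidr
        for cidr in cidrs
        if ipaddress.ip_network(cidr).overlaps(AUXILIARY_RANGE)
    ]


# Hypothetical subnet ranges, one per availability zone.
candidates = ["10.1.0.0/24", "10.250.0.0/16", "192.168.10.0/24"]
print(overlapping_subnets(candidates))
```

An empty result means none of the listed ranges collides with the auxiliary subnet. Any range that is returned should be replaced before you create the cluster.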
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To create a Managed Service for Apache Airflow™ cluster:
- View the description of the create cluster CLI command:

      yc managed-airflow cluster create --help

- Specify cluster parameters in the create command (the list of supported parameters in the example is not exhaustive):

      yc managed-airflow cluster create \
        --name <cluster_name> \
        --description <cluster_description> \
        --labels <label_list> \
        --admin-password <administrator_password> \
        --service-account-id <service_account_ID> \
        --subnet-ids <subnet_IDs> \
        --security-group-ids <security_group_IDs> \
        --webserver count=<number_of_instances>,resource-preset-id=<resource_ID> \
        --scheduler count=<number_of_instances>,resource-preset-id=<resource_ID> \
        --worker min-count=<minimum_number_of_instances>,max-count=<maximum_number_of_instances>,resource-preset-id=<resource_ID> \
        --triggerer count=<number_of_instances>,resource-preset-id=<resource_ID> \
        --deb-packages <list_of_deb_packages> \
        --pip-packages <list_of_pip_packages> \
        --dags-bucket <bucket_name> \
        --deletion-protection \
        --lockbox-secrets-backend \
        --airflow-config <list_of_properties> \
        --log-enabled \
        --log-folder-id <folder_ID> \
        --log-min-level <logging_level>

Where:
  - --name: Cluster name.
  - --description: Cluster description.
  - --labels: List of labels. Provide labels in <key>=<value> format.
  - --admin-password: Admin user password. The password must be at least 8 characters long and contain at least:
    - One uppercase letter
    - One lowercase letter
    - One digit
    - One special character
  - --service-account-id: Service account ID.
  - --subnet-ids: List of subnet IDs.
  - --security-group-ids: List of security group IDs.
  - --webserver, --scheduler, --worker, --triggerer: Managed Service for Apache Airflow™ component configuration:
    - count: Number of instances in the cluster for the web server, scheduler, and Triggerer.
    - min-count, max-count: Minimum and maximum number of instances in the cluster for the worker.
    - resource-preset-id: ID of the computing resources of the web server, scheduler, worker, and Triggerer. The possible values are:
      - c1-m2: 1 vCPU, 2 GB RAM
      - c1-m4: 1 vCPU, 4 GB RAM
      - c2-m4: 2 vCPUs, 4 GB RAM
      - c2-m8: 2 vCPUs, 8 GB RAM
      - c4-m8: 4 vCPUs, 8 GB RAM
      - c4-m16: 4 vCPUs, 16 GB RAM
      - c8-m16: 8 vCPUs, 16 GB RAM
      - c8-m32: 8 vCPUs, 32 GB RAM
  - --deb-packages, --pip-packages: Lists of deb and pip packages enabling you to install additional libraries and applications in the cluster for running DAG files.

    If required, you can set version restrictions for the installed packages, for example:

        --pip-packages "pandas==2.0.2,scikit-learn>=1.0.0,clickhouse-driver~=0.2.0"

    The package name format and version are defined by the install command: pip install for pip packages and apt install for deb packages.
  - --dags-bucket: Name of the bucket to store DAG files in.
  - --deletion-protection: Enables cluster protection against accidental deletion.

    With deletion protection enabled, you will still be able to manually connect to the cluster and delete it.
  - --lockbox-secrets-backend: Enables using secrets in Yandex Lockbox to store Apache Airflow™ configuration data, variables, and connection parameters.
  - --airflow-config: Apache Airflow™ additional properties. Provide them in <configuration_section>.<key>=<value> format, such as the following:

        --airflow-config core.load_examples=False

  - Logging parameters:
    - --log-enabled: Enables logging. Logs generated by Apache Airflow™ will be sent to Yandex Cloud Logging.
    - --log-folder-id: Folder ID. Logs will be written to the default log group for this folder.
    - --log-group-id: Custom log group ID. Logs will be written to this group.
    - --log-min-level: Minimum logging level. Possible values: TRACE, DEBUG, INFO (default), WARN, ERROR, and FATAL.

    You can specify only one of the parameters: --log-folder-id or --log-group-id.
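If you call the CLI from a script, it may help to assemble the command as an argument list rather than as one long string. The sketch below is illustrative only: the function name, the default of one instance per component, and the sample values are assumptions. It uses only flags described above and also assumes that several subnet IDs are passed as a single comma-separated value.

```python
import shlex


def build_create_command(name, admin_password, service_account_id, subnet_ids,
                         dags_bucket, worker_min, worker_max,
                         preset="c1-m2", pip_packages=()):
    """Assemble the argument list for `yc managed-airflow cluster create`."""
    if worker_min > worker_max:
        raise ValueError("worker_min must not exceed worker_max")
    # Assumption for this sketch: one instance of each non-worker component.
    component = f"count=1,resource-preset-id={preset}"
    args = [
        "yc", "managed-airflow", "cluster", "create",
        "--name", name,
        "--admin-password", admin_password,
        "--service-account-id", service_account_id,
        # Assumption: multiple subnet IDs are joined with commas.
        "--subnet-ids", ",".join(subnet_ids),
        "--webserver", component,
        "--scheduler", component,
        "--worker",
        f"min-count={worker_min},max-count={worker_max},resource-preset-id={preset}",
        "--dags-bucket", dags_bucket,
    ]
    if pip_packages:
        # Comma-separated, as in the --pip-packages example.
        args += ["--pip-packages", ",".join(pip_packages)]
    return args


# Hypothetical values; replace them with your own names and IDs.
command = build_create_command(
    name="airflow-demo",
    admin_password="<administrator_password>",
    service_account_id="<service_account_ID>",
    subnet_ids=["<subnet_ID_1>", "<subnet_ID_2>"],
    dags_bucket="<bucket_name>",
    worker_min=1,
    worker_max=3,
    pip_packages=["pandas==2.0.2", "scikit-learn>=1.0.0"],
)
print(shlex.join(command))
```

The returned list can be passed to subprocess.run as is, which avoids shell quoting problems with version specifiers such as >=.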
With Terraform
Terraform is distributed under the Business Source License.
For more information about the provider resources, see the Terraform provider documentation.
If you do not have Terraform, install it and configure the Yandex Cloud provider.
To create a Managed Service for Apache Airflow™ cluster:
- In the configuration file, describe the resources you are creating:
  - Managed Service for Apache Airflow™ cluster: Cluster description.
  - Network: Description of the cloud network where the cluster will be located. If you already have a suitable network, you do not have to describe it again.
  - Subnets: Description of the subnets to connect the cluster hosts to. If you already have suitable subnets, you do not have to describe them again.

  Here is an example of the configuration file structure:

      resource "yandex_airflow_cluster" "<cluster_name>" {
        name        = "<cluster_name>"
        description = "<cluster_description>"

        labels = { <label_list> }

        admin_password     = "<administrator_password>"
        service_account_id = "<service_account_ID>"
        subnet_ids         = ["<list_of_subnet_IDs>"]
        security_group_ids = ["<list_of_security_group_IDs>"]

        webserver = {
          count              = <number_of_instances>
          resource_preset_id = "<resource_ID>"
        }

        scheduler = {
          count              = <number_of_instances>
          resource_preset_id = "<resource_ID>"
        }

        worker = {
          min_count          = <minimum_number_of_instances>
          max_count          = <maximum_number_of_instances>
          resource_preset_id = "<resource_ID>"
        }

        triggerer = {
          count              = <number_of_instances>
          resource_preset_id = "<resource_ID>"
        }

        pip_packages = ["<list_of_pip_packages>"]
        deb_packages = ["<list_of_deb_packages>"]

        code_sync = {
          s3 = {
            bucket = "<bucket_name>"
          }
        }

        deletion_protection = <deletion_protection>

        lockbox_secrets_backend = {
          enabled = <usage_of_secrets>
        }

        airflow_config = {
          <configuration_section> = {
            <key> = "<value>"
          }
        }

        logging = {
          enabled   = <use_of_logging>
          folder_id = "<folder_ID>"
          min_level = "<logging_level>"
        }
      }

      resource "yandex_vpc_network" "<network_name>" { name = "<network_name>" }

      resource "yandex_vpc_subnet" "<subnet_name>" {
        name           = "<subnet_name>"
        zone           = "<availability_zone>"
        network_id     = "<network_ID>"
        v4_cidr_blocks = ["<range>"]
      }

Where:
  - name: Cluster name.
  - description: Cluster description.
  - labels: List of labels. Provide labels in <key> = "<value>" format.
  - admin_password: Admin user password. The password must be at least 8 characters long and contain at least:
    - One uppercase letter
    - One lowercase letter
    - One digit
    - One special character
  - service_account_id: Service account ID.
  - subnet_ids: List of subnet IDs.
  - security_group_ids: List of security group IDs.
  - webserver, scheduler, worker, triggerer: Managed Service for Apache Airflow™ component configuration:
    - count: Number of instances in the cluster for the web server, scheduler, and Triggerer.
    - min_count, max_count: Minimum and maximum number of instances in the cluster for the worker.
    - resource_preset_id: ID of the computing resources of the web server, scheduler, worker, and Triggerer. The possible values are:
      - c1-m2: 1 vCPU, 2 GB RAM
      - c1-m4: 1 vCPU, 4 GB RAM
      - c2-m4: 2 vCPUs, 4 GB RAM
      - c2-m8: 2 vCPUs, 8 GB RAM
      - c4-m8: 4 vCPUs, 8 GB RAM
      - c4-m16: 4 vCPUs, 16 GB RAM
      - c8-m16: 8 vCPUs, 16 GB RAM
      - c8-m32: 8 vCPUs, 32 GB RAM
  - deb_packages, pip_packages: Lists of deb and pip packages enabling you to install additional libraries and applications in the cluster for running DAG files.

    If required, you can set version restrictions for the installed packages, for example:

        pip_packages = ["pandas==2.0.2","scikit-learn>=1.0.0","clickhouse-driver~=0.2.0"]

    The package name format and version are defined by the install command: pip install for pip packages and apt install for deb packages.
  - code_sync.s3.bucket: Name of the bucket to store DAG files in.
  - deletion_protection: Enables cluster protection against accidental deletion. The possible values are true or false.

    With deletion protection enabled, you will still be able to manually connect to the cluster and delete it.
  - lockbox_secrets_backend.enabled: Enables using secrets in Yandex Lockbox to store Apache Airflow™ configuration data, variables, and connection parameters. The possible values are true or false.
  - airflow_config: Apache Airflow™ additional properties, e.g., core for configuration section, load_examples for key, and False for value.
  - logging: Logging parameters:
    - enabled: Enables logging. Logs generated by Apache Airflow™ components will be sent to Yandex Cloud Logging. The possible values are true or false.
    - folder_id: Folder ID. Logs will be written to the default log group for this folder.
    - log_group_id: Custom log group ID. Logs will be written to this group.
    - min_level: Minimum logging level. Possible values: TRACE, DEBUG, INFO (default), WARN, ERROR, and FATAL.

    You can specify only one of the parameters: folder_id or log_group_id.
- Make sure the settings are correct.
  - Using the command line, navigate to the folder that contains the up-to-date Terraform configuration files with an infrastructure plan.
  - Run the command:

        terraform validate

    If there are errors in the configuration files, Terraform will point to them.
- Create a Managed Service for Apache Airflow™ cluster.
  - Run the command to view planned changes:

        terraform plan

    If the resource configuration descriptions are correct, the terminal will display a list of the resources to modify and their parameters. This is a test step. No resources are updated.
  - If you are happy with the planned changes, apply them:
    - Run the command:

          terraform apply

    - Confirm the update of resources.
    - Wait for the operation to complete.

  All the resources you need will then be created in the specified folder. You can check the new resources and their settings using the management console.

For more information, see the Terraform provider documentation.
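The admin_password rules listed above can be checked locally before you run terraform apply. This is a minimal sketch, not the service's own validation. In particular, it assumes that a "special character" is any ASCII punctuation character, which this page does not specify.

```python
import string


def password_problems(password):
    """Return a list of rule violations; an empty list means the password passes."""
    problems = []
    if len(password) < 8:
        problems.append("shorter than 8 characters")
    if not any(ch.isupper() for ch in password):
        problems.append("no uppercase letter")
    if not any(ch.islower() for ch in password):
        problems.append("no lowercase letter")
    if not any(ch.isdigit() for ch in password):
        problems.append("no digit")
    # Assumption: "special character" means ASCII punctuation.
    if not any(ch in string.punctuation for ch in password):
        problems.append("no special character")
    return problems


# Hypothetical sample values.
print(password_problems("Airflow#2024"))
print(password_problems("short"))
```

Returning every violation at once, rather than stopping at the first, lets you fix the password in a single pass.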
- Get an IAM token for API authentication and put it into the environment variable:

      export IAM_TOKEN="<IAM_token>"

- Create a file named body.json and add the following contents to it:

      {
        "folderId": "<folder_ID>",
        "name": "<cluster_name>",
        "description": "<cluster_description>",
        "labels": { <label_list> },
        "config": {
          "versionId": "<Apache_Airflow™_version>",
          "airflow": {
            "config": { <list_of_properties> }
          },
          "webserver": {
            "count": "<number_of_instances>",
            "resources": {
              "resourcePresetId": "<resource_ID>"
            }
          },
          "scheduler": {
            "count": "<number_of_instances>",
            "resources": {
              "resourcePresetId": "<resource_ID>"
            }
          },
          "triggerer": {
            "count": "<number_of_instances>",
            "resources": {
              "resourcePresetId": "<resource_ID>"
            }
          },
          "worker": {
            "minCount": "<minimum_number_of_instances>",
            "maxCount": "<maximum_number_of_instances>",
            "resources": {
              "resourcePresetId": "<resource_ID>"
            }
          },
          "dependencies": {
            "pipPackages": [ <list_of_pip_packages> ],
            "debPackages": [ <list_of_deb_packages> ]
          },
          "lockbox": {
            "enabled": <usage_of_secrets>
          }
        },
        "network": {
          "subnetIds": [ <list_of_subnet_IDs> ],
          "securityGroupIds": [ <list_of_security_group_IDs> ]
        },
        "codeSync": {
          "s3": {
            "bucket": "<bucket_name>"
          }
        },
        "deletionProtection": <deletion_protection>,
        "serviceAccountId": "<service_account_ID>",
        "logging": {
          "enabled": <use_of_logging>,
          "minLevel": "<logging_level>",
          "folderId": "<folder_ID>"
        },
        "adminPassword": "<administrator_password>"
      }

Where:
  - folderId: Folder ID. You can request it with a list of folders in the cloud.
  - name: Cluster name.
  - description: Cluster description.
  - labels: List of labels. Provide labels in "<key>": "<value>" format.
  - config: Cluster configuration:
    - versionId: Apache Airflow™ version.
    - airflow.config: Apache Airflow™ additional properties. Provide them in "<configuration_section>.<key>": "<value>" format, for example:

          "airflow": { "config": { "core.load_examples": "False" } }

    - webserver, scheduler, triggerer, worker: Managed Service for Apache Airflow™ component configuration:
      - count: Number of instances in the cluster for the web server, scheduler, and Triggerer.
      - minCount, maxCount: Minimum and maximum number of instances in the cluster for the worker.
      - resources.resourcePresetId: ID of the computing resources of the web server, scheduler, worker, and Triggerer. The possible values are:
        - c1-m2: 1 vCPU, 2 GB RAM
        - c1-m4: 1 vCPU, 4 GB RAM
        - c2-m4: 2 vCPUs, 4 GB RAM
        - c2-m8: 2 vCPUs, 8 GB RAM
        - c4-m8: 4 vCPUs, 8 GB RAM
        - c4-m16: 4 vCPUs, 16 GB RAM
        - c8-m16: 8 vCPUs, 16 GB RAM
        - c8-m32: 8 vCPUs, 32 GB RAM
    - dependencies: Lists of packages enabling you to install additional libraries and applications for running DAG files in the cluster:
      - pipPackages: List of pip packages.
      - debPackages: List of deb packages.

      If required, you can set version restrictions for the installed packages, for example:

          "dependencies": { "pipPackages": [ "pandas==2.0.2", "scikit-learn>=1.0.0", "clickhouse-driver~=0.2.0" ] }

      The package name format and version are defined by the install command: pip install for pip packages and apt install for deb packages.
    - lockbox.enabled: Enables using secrets in Yandex Lockbox to store Apache Airflow™ configuration data, variables, and connection parameters. The possible values are true or false.
  - network: Network settings:
    - subnetIds: List of subnet IDs.
    - securityGroupIds: List of security group IDs.
  - codeSync.s3.bucket: Name of the bucket to store DAG files in.
  - deletionProtection: Enables cluster protection against accidental deletion. The possible values are true or false.

    With deletion protection enabled, you will still be able to manually connect to the cluster and delete it.
  - serviceAccountId: Service account ID.
  - logging: Logging parameters:
    - enabled: Enables logging. Logs generated by Apache Airflow™ components will be sent to Yandex Cloud Logging. The possible values are true or false.
    - minLevel: Minimum logging level. Possible values: TRACE, DEBUG, INFO, WARN, ERROR, and FATAL.
    - folderId: Folder ID. Logs will be written to the default log group for this folder.
    - logGroupId: Custom log group ID. Logs will be written to this group.

    Specify either folderId or logGroupId.
  - adminPassword: Admin user password. The password must be at least 8 characters long and contain at least:
    - One uppercase letter
    - One lowercase letter
    - One digit
    - One special character

    Note
    Save the password locally or memorize it. The service does not show passwords after the cluster is created.
- Use the Cluster.create method and make a request, e.g., via cURL:

      curl \
        --request POST \
        --header "Authorization: Bearer $IAM_TOKEN" \
        --url 'https://airflow.api.cloud.yandex.net/managed-airflow/v1/clusters' \
        --data '@body.json'

- View the server response to make sure the request was successful.
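The same request can be prepared from a script. The sketch below builds it with the Python standard library, using the URL and the Authorization header from the cURL example. The function names are illustrative, and the Content-Type header is an assumption that does not appear in the cURL example.

```python
import json
import urllib.request

# Endpoint from the cURL example.
CREATE_URL = "https://airflow.api.cloud.yandex.net/managed-airflow/v1/clusters"


def build_create_request(body, iam_token):
    """Build (but do not send) the POST request that creates a cluster."""
    return urllib.request.Request(
        CREATE_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {iam_token}",
            # Assumption: the body is declared as JSON explicitly.
            "Content-Type": "application/json",
        },
    )


def send(request):
    """Send the request and return the HTTP status and the decoded response body."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


# Hypothetical minimal body; a real request needs the full body.json contents.
example = build_create_request({"name": "airflow-demo"}, "<IAM_token>")
print(example.get_method(), example.full_url)
```

Building the request separately from sending it makes it possible to inspect the headers and payload before anything reaches the API.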
- Get an IAM token for API authentication and put it into the environment variable:

      export IAM_TOKEN="<IAM_token>"

- Clone the cloudapi repository:

      cd ~/ && git clone --depth=1 https://github.com/yandex-cloud/cloudapi

  Below, we assume the repository contents are stored in the ~/cloudapi/ directory.
- Create a file named body.json and add the following contents to it:

      {
        "folder_id": "<folder_ID>",
        "name": "<cluster_name>",
        "description": "<cluster_description>",
        "labels": { <label_list> },
        "config": {
          "version_id": "<Apache_Airflow™_version>",
          "airflow": {
            "config": { <list_of_properties> }
          },
          "webserver": {
            "count": "<number_of_instances>",
            "resources": {
              "resource_preset_id": "<resource_ID>"
            }
          },
          "scheduler": {
            "count": "<number_of_instances>",
            "resources": {
              "resource_preset_id": "<resource_ID>"
            }
          },
          "triggerer": {
            "count": "<number_of_instances>",
            "resources": {
              "resource_preset_id": "<resource_ID>"
            }
          },
          "worker": {
            "min_count": "<minimum_number_of_instances>",
            "max_count": "<maximum_number_of_instances>",
            "resources": {
              "resource_preset_id": "<resource_ID>"
            }
          },
          "dependencies": {
            "pip_packages": [ <list_of_pip_packages> ],
            "deb_packages": [ <list_of_deb_packages> ]
          },
          "lockbox": {
            "enabled": <usage_of_secrets>
          }
        },
        "network": {
          "subnet_ids": [ <list_of_subnet_IDs> ],
          "security_group_ids": [ <list_of_security_group_IDs> ]
        },
        "code_sync": {
          "s3": {
            "bucket": "<bucket_name>"
          }
        },
        "deletion_protection": <deletion_protection>,
        "service_account_id": "<service_account_ID>",
        "logging": {
          "enabled": <use_of_logging>,
          "min_level": "<logging_level>",
          "folder_id": "<folder_ID>"
        },
        "admin_password": "<administrator_password>"
      }

Where:
  - folder_id: Folder ID. You can request it with a list of folders in the cloud.
  - name: Cluster name.
  - description: Cluster description.
  - labels: List of labels. Provide labels in "<key>": "<value>" format.
  - config: Cluster configuration:
    - version_id: Apache Airflow™ version.
    - airflow.config: Apache Airflow™ additional properties. Provide them in "<configuration_section>.<key>": "<value>" format, for example:

          "airflow": { "config": { "core.load_examples": "False" } }

    - webserver, scheduler, triggerer, worker: Managed Service for Apache Airflow™ component configuration:
      - count: Number of instances in the cluster for the web server, scheduler, and Triggerer.
      - min_count, max_count: Minimum and maximum number of instances in the cluster for the worker.
      - resources.resource_preset_id: ID of the computing resources of the web server, scheduler, worker, and Triggerer. The possible values are:
        - c1-m2: 1 vCPU, 2 GB RAM
        - c1-m4: 1 vCPU, 4 GB RAM
        - c2-m4: 2 vCPUs, 4 GB RAM
        - c2-m8: 2 vCPUs, 8 GB RAM
        - c4-m8: 4 vCPUs, 8 GB RAM
        - c4-m16: 4 vCPUs, 16 GB RAM
        - c8-m16: 8 vCPUs, 16 GB RAM
        - c8-m32: 8 vCPUs, 32 GB RAM
    - dependencies: Lists of packages enabling you to install additional libraries and applications for running DAG files in the cluster:
      - pip_packages: List of pip packages.
      - deb_packages: List of deb packages.

      If required, you can set version restrictions for the installed packages, for example:

          "dependencies": { "pip_packages": [ "pandas==2.0.2", "scikit-learn>=1.0.0", "clickhouse-driver~=0.2.0" ] }

      The package name format and version are defined by the install command: pip install for pip packages and apt install for deb packages.
    - lockbox.enabled: Enables using secrets in Yandex Lockbox to store Apache Airflow™ configuration data, variables, and connection parameters. The possible values are true or false.
  - network: Network settings:
    - subnet_ids: List of subnet IDs.
    - security_group_ids: List of security group IDs.
  - code_sync.s3.bucket: Name of the bucket to store DAG files in.
  - deletion_protection: Enables cluster protection against accidental deletion. The possible values are true or false.

    With deletion protection enabled, you will still be able to manually connect to the cluster and delete it.
  - service_account_id: Service account ID.
  - logging: Logging parameters:
    - enabled: Enables logging. Logs generated by Apache Airflow™ components will be sent to Yandex Cloud Logging. The possible values are true or false.
    - min_level: Minimum logging level. Possible values: TRACE, DEBUG, INFO, WARN, ERROR, and FATAL.
    - folder_id: Folder ID. Logs will be written to the default log group for this folder.
    - log_group_id: Custom log group ID. Logs will be written to this group.

    Specify either folder_id or log_group_id.
  - admin_password: Admin user password. The password must be at least 8 characters long and contain at least:
    - One uppercase letter
    - One lowercase letter
    - One digit
    - One special character

    Note
    Save the password locally or memorize it. The service does not show passwords after the cluster is created.
- Use the ClusterService/Create call and make a request, e.g., via gRPCurl:

      grpcurl \
        -format json \
        -import-path ~/cloudapi/ \
        -import-path ~/cloudapi/third_party/googleapis/ \
        -proto ~/cloudapi/yandex/cloud/airflow/v1/cluster_service.proto \
        -rpc-header "Authorization: Bearer $IAM_TOKEN" \
        -d @ \
        airflow.api.cloud.yandex.net:443 \
        yandex.cloud.airflow.v1.ClusterService.Create \
        < body.json

- View the server response to make sure the request was successful.
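The REST and gRPC request bodies on this page carry the same fields and differ only in key style: camelCase for REST, snake_case for gRPC. If you already have a REST body, a small converter can derive the gRPC one. This is a sketch under stated assumptions: the function names are illustrative, and it treats the contents of labels and of config.airflow.config as user-defined keys that must not be renamed.

```python
import json
import re

# Assumption: keys under these paths are user-defined and stay untouched.
OPAQUE_PATHS = {("labels",), ("config", "airflow", "config")}


def camel_to_snake(name):
    """Convert a camelCase key such as resourcePresetId to resource_preset_id."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def to_grpc_body(value, path=()):
    """Recursively rename dictionary keys, skipping user-defined sections."""
    if isinstance(value, dict):
        if path in OPAQUE_PATHS:
            return dict(value)
        return {
            camel_to_snake(key): to_grpc_body(child, path + (key,))
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [to_grpc_body(item, path) for item in value]
    return value


# Hypothetical, shortened REST body used only for illustration.
rest_body = {
    "folderId": "<folder_ID>",
    "labels": {"teamName": "data"},
    "config": {
        "airflow": {"config": {"core.load_examples": "False"}},
        "worker": {
            "minCount": 1,
            "maxCount": 3,
            "resources": {"resourcePresetId": "c1-m2"},
        },
    },
    "codeSync": {"s3": {"bucket": "<bucket_name>"}},
}
print(json.dumps(to_grpc_body(rest_body), indent=2))
```

Keeping one body as the source of truth and deriving the other reduces the chance that the two files drift apart.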