Resource relationships in Managed Service for Apache Airflow™
Managed Service for Apache Airflow™ helps you deploy and maintain clusters of Apache Airflow™.
About Apache Airflow™
Apache Airflow™ is an open-source platform that enables you to create, schedule, and monitor batch-oriented workflows. A workflow defines job relationships and their execution sequence. It is presented as a directed acyclic graph (DAG). DAGs in Apache Airflow™ can be used for automation and scheduled runs of any processes, e.g., data processing in Apache Spark™.
Apache Airflow™ uses the workflows-as-code approach: each workflow is implemented as a Python 3.8 script. A file with this script is called a DAG file. It describes jobs, their run schedule, and the dependencies between them. This approach lets you store workflows in a version control system, run tests on them, and use whatever technologies the workflows require.
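To illustrate what "directed acyclic graph" and "dependencies between jobs" mean in practice, here is a minimal sketch in plain Python. It is not a DAG file and does not use the Apache Airflow™ API; the job names and the `execution_order` helper are hypothetical. It only shows how a set of dependencies yields an execution sequence and how a cycle can be detected by a test before a workflow is deployed.

```python
from collections import deque


def execution_order(deps):
    """Return one valid run order for jobs, or raise ValueError on a cycle.

    deps maps each job name to the set of jobs that must finish before it.
    Every job that appears as a dependency must also be a key in deps.
    """
    remaining = {job: set(upstream) for job, upstream in deps.items()}
    # Jobs with no upstream dependencies can run first.
    ready = deque(sorted(job for job, upstream in remaining.items() if not upstream))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        # Completing a job may unblock the jobs that were waiting for it.
        for other, upstream in sorted(remaining.items()):
            if job in upstream:
                upstream.remove(job)
                if not upstream:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError("dependencies contain a cycle, so this is not a DAG")
    return order


# Hypothetical jobs of a simple data processing workflow.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

order = execution_order(dependencies)
print(order)  # ['extract', 'transform', 'load', 'report']
```

A check like this can run as an ordinary unit test in a version control pipeline, which is one practical benefit of keeping workflows as code.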
Apache Airflow™ is not used for streaming and continuous data processing. If such processing is required, you can develop a solution based on Yandex Managed Service for Apache Kafka®.
For more information, see the Apache Airflow™ documentation.
Managed Service for Apache Airflow™ architecture
The Managed Service for Apache Airflow™ architecture is presented on the diagram:
Each Apache Airflow™ cluster runs in a separate Kubernetes node group with the required network infrastructure. This infrastructure includes a virtual network, a security group, and a service account. Node groups are isolated from each other, both through virtual networks and through Kubernetes itself. Node groups are managed by a common Kubernetes master, and Apache Airflow™ clusters use a common PostgreSQL cluster for data storage.
To ensure isolated data storage, the service limits the use of the PostgreSQL cluster:
- A separate database is created for each Apache Airflow™ cluster in the PostgreSQL cluster. Clusters can connect only to their own database.
- Apache Airflow™ clusters can work only with tables created by Apache Airflow™. You cannot create or modify schemas, tables, functions, procedures, or triggers yourself.
- Read and write speed, as well as the available database storage space, are limited.
Warning
Any malicious attempt to bypass these restrictions will result in your cluster being locked under Clause 7 of the Acceptable Use Policy.
Apache Airflow™ cluster
The main entity that Managed Service for Apache Airflow™ operates on is a cluster. Apache Airflow™ components are deployed inside the cluster. Cluster resources may reside in different availability zones. You can learn more about Yandex Cloud availability zones in Platform overview.
A workflow running in a cluster may access any Yandex Cloud resource within the cloud network where the cluster is located. For example, a workflow can send requests to Yandex Cloud VMs or managed DB clusters. You can build a workflow using multiple resources, e.g., a workflow that collects data from one DB and sends it to another DB or Yandex Data Processing.
Apache Airflow™ main components
The main Apache Airflow™ components are shown below:
Apache Airflow™ components:
- Web server: Server in Yandex Cloud hosting an Apache Airflow™ instance. The web server receives user commands sent through the Apache Airflow™ web interface and checks, runs, and debugs Python scripts in DAG files.
  To learn more about working with the web interface, see the Apache Airflow™ documentation.
- Scheduler: Server in Yandex Cloud that controls the job run schedule. The scheduler gets schedule information from DAG files. It uses this schedule to notify workers that it is time to run a DAG file.
- Workers: Executors of jobs specified in DAG files. The workers run jobs on the schedule received from the scheduler.
- Triggerer: Service that releases a worker if it goes idle while executing a job with a long event timeout (optional component).
- DAG file storage: Yandex Object Storage bucket that stores DAG files. This storage can be accessed by web servers, schedulers, workers, and Triggerer.
To ensure fault tolerance and enhance performance, web servers, schedulers, and Triggerer may exist in multiple instances. Their number is set when creating a cluster.
For workers, you can also set the minimum and maximum number of instances while creating a cluster. Their number will be scaled dynamically between these limits. This feature is provided by KEDA.
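The text does not describe the exact scaling rule, so the following is only a sketch of how a worker count could be derived from the current load and kept between the minimum and maximum set for the cluster. The `desired_workers` function, the `tasks_per_worker` parameter, and the rounding rule are all assumptions made for illustration; they are not the service's or KEDA's actual configuration.

```python
import math


def desired_workers(active_tasks, tasks_per_worker, min_workers, max_workers):
    """Return a worker count for the current load, clamped to the cluster limits.

    Assumed rule: one worker handles up to tasks_per_worker tasks at a time.
    """
    if tasks_per_worker <= 0 or min_workers > max_workers:
        raise ValueError("invalid scaling parameters")
    needed = math.ceil(active_tasks / tasks_per_worker)
    # Never go below the minimum or above the maximum set for the cluster.
    return max(min_workers, min(max_workers, needed))


print(desired_workers(0, 16, 1, 4))    # 1: idle cluster stays at the minimum
print(desired_workers(40, 16, 1, 4))   # 3: 40 tasks need three workers
print(desired_workers(200, 16, 1, 4))  # 4: capped at the maximum
```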
Apache Airflow™ component configurations
A configuration determines the computing power allocated to the web server, scheduler, workers, and the Triggerer service. There are two available configuration types:
- standard: With a 4:1 ratio of RAM (in GB) to vCPUs.
  - 1 vCPU, 4 GB RAM
  - 2 vCPUs, 8 GB RAM
  - 4 vCPUs, 16 GB RAM
  - 8 vCPUs, 32 GB RAM
- cpu-optimized: With a reduced 2:1 ratio of RAM (in GB) to vCPUs. These configurations may be useful for clusters with higher processor performance requirements.
  - 1 vCPU, 2 GB RAM
  - 2 vCPUs, 4 GB RAM
  - 4 vCPUs, 8 GB RAM
  - 8 vCPUs, 16 GB RAM
You can select configurations when creating a cluster or change them while editing it.
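As a quick way to compare the two configuration types, the presets listed above can be written down as data. The sketch below is illustrative only: the `Preset` type, the `CONFIGURATIONS` dictionary, and the `smallest_preset` helper are hypothetical names, not part of any service API. The vCPU and RAM values are copied from the lists above.

```python
from typing import NamedTuple


class Preset(NamedTuple):
    vcpus: int
    ram_gb: int


# Presets from the lists above, ordered from smallest to largest.
CONFIGURATIONS = {
    "standard": [Preset(1, 4), Preset(2, 8), Preset(4, 16), Preset(8, 32)],
    "cpu-optimized": [Preset(1, 2), Preset(2, 4), Preset(4, 8), Preset(8, 16)],
}


def smallest_preset(config_type, min_vcpus, min_ram_gb):
    """Return the smallest preset of the given type that meets both minimums, or None."""
    for preset in CONFIGURATIONS[config_type]:
        if preset.vcpus >= min_vcpus and preset.ram_gb >= min_ram_gb:
            return preset
    return None


print(smallest_preset("standard", 2, 6))       # Preset(vcpus=2, ram_gb=8)
print(smallest_preset("cpu-optimized", 2, 6))  # Preset(vcpus=4, ram_gb=8)
```

In this example, a component that needs at least 2 vCPUs and 6 GB of RAM fits the 2 vCPU standard preset, while the cpu-optimized type requires stepping up to 4 vCPUs to get enough memory.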
Triggerer
The Triggerer service reduces worker idle time.
DAGs may contain jobs that send requests to an external system (such as an Apache Spark™ cluster) and wait for it to respond for a certain period of time. If standard operators are used, such a job keeps occupying a worker for the entire wait.
Deferrable operators help avoid a situation like this. They allow pausing a job, releasing a worker, and isolating the external system request into a separate process called a trigger. All triggers are independent from each other and processed by Triggerer asynchronously, with separate resources allocated for it in the cluster. Once a response is received from the external system, a trigger fires, and the scheduler returns the job to the worker.
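The idea of handling many waiting triggers asynchronously can be sketched with Python's standard `asyncio` module. This is a simulation under stated assumptions, not the Apache Airflow™ Triggerer implementation: the `trigger` and `triggerer` functions, the job names, and the delays are hypothetical, and `asyncio.sleep` stands in for waiting on an external system.

```python
import asyncio


async def trigger(name, delay, fired):
    """Stand-in for a trigger: wait for an external response without holding a worker."""
    await asyncio.sleep(delay)  # simulated wait for the external system
    fired.append(name)          # the trigger fires once the response arrives


async def triggerer(waits):
    """Run all triggers concurrently in one event loop and return the firing order."""
    fired = []
    await asyncio.gather(*(trigger(name, delay, fired) for name, delay in waits.items()))
    return fired


# Hypothetical deferred jobs and the time each external system takes to respond.
waits = {"spark_job": 0.3, "api_call": 0.1, "file_arrival": 0.2}
fired_order = asyncio.run(triggerer(waits))
print(fired_order)  # ['api_call', 'file_arrival', 'spark_job']
```

All three waits share a single process and finish in roughly the time of the longest one, which is why moving waits out of workers and into triggers frees worker capacity for jobs that do actual work.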
See how to work with Triggerer in the figure below:
For more information about deferrable operators, triggers, and the Triggerer service, see the Apache Airflow™ documentation.