Apache Airflow Architecture
- neovijayk
- Jun 9, 2020
- 5 min read
In this article we will take a look at the architecture of Apache Airflow. We can generally divide Apache Airflow architecture into two types:
Single-node architecture (Single machine)
Multi-node architecture (Network of machines)
But before moving forward, let's look at some important points about task execution in Apache Airflow:
Every task execution is independent of the others.
Tasks execute on different workers (running as separate processes or on different machines).
Tasks do not communicate with each other and no data is exchanged between tasks during their execution.
Because tasks are not all crowded onto a single process on a single machine, they do not create performance bottlenecks; for example, task 1 can run on machine 1 while task 2 runs on machine 2.
If Airflow is installed on a single machine, all tasks run on that machine; if it is installed on a distributed system, tasks can run on different machines.
Also, in any Apache Airflow architecture there are four main components:
WebUI: the portal for users to view the related status of the DAGs.
Metadata DB: the metastore of Airflow for storing various metadata including job status, task instance status, etc.
Scheduler: a multi-process service that parses the DAG bag, creates DAG objects, and triggers the executor to run tasks whose dependencies are met.
Executor: A message queuing process that orchestrates worker processes to execute tasks.
Now we will take a look at the Airflow architecture.
Single-node architecture (Single machine):
In single-node architecture all components are on the same node or on the same machine.
To use the single-node architecture, Airflow has to be configured with the LocalExecutor mode.
In this mode, the worker pulls tasks to run from an IPC (Inter Process Communication) queue.
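Concretely, the executor is selected in airflow.cfg. Below is a minimal sketch of a single-node configuration; the PostgreSQL connection string and the values shown are illustrative assumptions, not required settings:

```
[core]
# Run tasks as parallel subprocesses on this one machine
executor = LocalExecutor

# LocalExecutor needs a database that supports parallel connections,
# e.g. PostgreSQL or MySQL (this connection string is a placeholder)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

# Maximum number of task instances that can run at once on this node
parallelism = 32
```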
Figure: Single-node architecture
In the above single node architecture:
We start the webserver, which serves the UI in the browser; from here we can see the status of DAGs and tasks and can also fire SQL queries against the metadata database.
The metadata (backend) database holds all the metadata: user login credentials (if any are created), the status of DAGs and tasks, and progress updates.
The scheduler checks the status of the DAGs and tasks in the metadata database, creates new task instances if necessary, and sends the tasks to the queue.
The LocalExecutor (in the single-machine case) pushes the scheduled tasks onto the queue.
The queuing service stores the task commands to be run.
Airflow workers retrieve the commands from the queue, execute them, and update the metadata (a minimal example DAG for this setup is sketched below).
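To make this flow concrete, here is a minimal sketch of a DAG that such a single-node setup would run; the dag_id, schedule, and callable are illustrative placeholders. The scheduler parses this file from the DAGs folder, the LocalExecutor queues the task command, and a local worker process executes it:

```python
# A minimal DAG sketch for the single-node setup; dag_id and the callable
# are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def say_hello():
    print("hello from a LocalExecutor worker process")

dag = DAG(
    dag_id="single_node_example",
    start_date=datetime(2020, 6, 1),
    schedule_interval="@daily",
)

hello = PythonOperator(
    task_id="say_hello",
    python_callable=say_hello,
    dag=dag,
)
```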
Multi-node architecture (Network of machines):
In a multi-node architecture, the daemons are spread across different machines.
To use this architecture, Airflow has to be configured with the CeleryExecutor mode.
In this mode, a Celery backend has to be set up (for example, Redis).
Celery is an asynchronous task queue based on distributed message passing.
Airflow uses it to execute several tasks concurrently on several worker servers using multiprocessing. This mode makes it easy to scale up the Airflow cluster by simply adding new workers.
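For reference, here is a minimal sketch of the airflow.cfg entries for this mode, assuming Redis as the Celery broker and the metadata database doubling as the result backend; hosts and credentials are placeholders:

```
[core]
executor = CeleryExecutor

[celery]
# Message broker that holds queued task commands (Redis here; RabbitMQ also works)
broker_url = redis://localhost:6379/0

# Where Celery stores task results; usually the Airflow metadata database
result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
```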
Figure: Multi-node architecture
We start the webserver on machine 1, which serves the UI in the browser; from here we can see the status of DAGs and tasks and can also fire SQL queries against the metadata database. The scheduler runs on the same machine.
The metadata (backend) database (which can be on the same machine or on a different one) holds all the metadata: user login credentials (if any are created), the status of DAGs and tasks, and progress updates. Note that the database has to handle many parallel connections, since workers on different machines continuously update task status back to it.
The scheduler checks the status of the DAGs and tasks in the metadata database, creates new task instances if necessary, and sends the tasks to the queue.
In the multi-node architecture (multiple machines) we use the CeleryExecutor instead of the LocalExecutor; it pushes the scheduled tasks to the message broker.
In the multi-node case, a broker such as RabbitMQ (or Redis) acts as the queuing service and stores the task commands to be run.
Airflow workers retrieve the commands from the queues, execute them, and update the metadata; tasks can also be routed to specific workers through named queues, as sketched below.
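One practical consequence of this setup is that tasks can be routed to specific Celery workers through named queues. Below is a minimal sketch (queue names and commands are illustrative); each worker would be started on its machine with something like `airflow worker -q machine_1_queue`:

```python
# Routing tasks to specific Celery workers via named queues; the queue
# names and bash commands here are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="queue_routing_example",
    start_date=datetime(2020, 6, 1),
    schedule_interval=None,
)

# Picked up only by a worker started with: airflow worker -q machine_1_queue
t1 = BashOperator(
    task_id="run_on_machine_1",
    bash_command="echo running on machine 1",
    queue="machine_1_queue",
    dag=dag,
)

# Picked up only by a worker started with: airflow worker -q machine_2_queue
t2 = BashOperator(
    task_id="run_on_machine_2",
    bash_command="echo running on machine 2",
    queue="machine_2_queue",
    dag=dag,
)

t1 >> t2
```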
Now let’s try to understand the above two architectures using the following two examples:
Example 1: Implementing the simple sequential tasks t1 >> t2 >> t3 in a workflow
Example 2: Implementing the parallel tasks t1 >> [t2, t3] >> t4, where t2 and t3 will be executed in parallel (both patterns are sketched as DAG code below)
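Expressed as Airflow DAG code, the two dependency patterns look roughly like this (a sketch using DummyOperator placeholders and illustrative dag_ids):

```python
# A sketch of both example dependency patterns using DummyOperator
# placeholders; dag_ids and dates are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Example 1: simple sequential tasks
with DAG(dag_id="sequential_example",
         start_date=datetime(2020, 6, 1),
         schedule_interval=None) as dag_seq:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")
    t1 >> t2 >> t3  # t2 runs after t1, t3 runs after t2

# Example 2: t2 and t3 run in parallel between t1 and t4
with DAG(dag_id="parallel_example",
         start_date=datetime(2020, 6, 1),
         schedule_interval=None) as dag_par:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")
    t4 = DummyOperator(task_id="t4")
    t1 >> [t2, t3] >> t4  # t2 and t3 become eligible together once t1 succeeds
```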
Let’s assume we have set up Airflow as shown in the image, where the Airflow webserver and scheduler are on the same machine or node and we have two Celery workers on two different machines or nodes.
Now we will implement example 1, the simple sequential tasks t1 >> t2 >> t3, on this setup as follows:

In the beginning, the scheduler creates a new task instance in the metadata database with the scheduled state
Then the scheduler uses the CeleryExecutor, which sends the task to the message broker
The celery worker then receives the command from the queue
The celery worker updates the metadata to set the status of the task instance to running
The celery worker executes the command
Once the task is finished, the celery worker updates the metadata to set the status of the task instance to success
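These status updates correspond to Airflow's built-in task-instance state constants; a minimal sketch of the lifecycle, assuming the airflow.utils.state module as it exists in Airflow 1.10:

```python
# The task-instance states from the walkthrough above, using the state
# constants that Airflow itself defines.
from airflow.utils.state import State

# Order in which the metadata database sees a successful task instance:
lifecycle = [State.SCHEDULED, State.QUEUED, State.RUNNING, State.SUCCESS]

for state in lifecycle:
    print(state)  # the constants are plain strings: scheduled, queued, running, success
```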
Similarly, in the case of example 2, implementing the parallel tasks t1 >> [t2, t3] >> t4, where t2 and t3 will be executed in parallel:
After task 1 is completed, the scheduler pushes both task 2 and task 3 to the queue to be executed by the celery workers
Each celery worker (worker node 1, worker node 2) receives one task from the queue, and the two workers execute their assigned tasks simultaneously
The scheduler now waits for both tasks to be reported as successful before sending the next one (t4) to the queue
Kubernetes and Apache Airflow:
One of the important components of the Airflow architecture is the executor: a message queuing process that orchestrates worker processes to execute tasks. As we saw, there are quite a few executors supported by Airflow.
The Kubernetes (k8s) operator and executor were added in Airflow 1.10, which provides native Kubernetes execution support for Airflow.
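For illustration, here is a minimal sketch of the KubernetesPodOperator from Airflow 1.10's contrib package; the image, command, and names are placeholder assumptions:

```python
# A minimal KubernetesPodOperator sketch (Airflow 1.10 contrib package);
# the image, command, and names below are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG(
    dag_id="k8s_example",
    start_date=datetime(2020, 6, 1),
    schedule_interval=None,
)

run_in_pod = KubernetesPodOperator(
    task_id="hello_pod",
    name="hello-pod",          # name of the pod created in the cluster
    namespace="default",       # Kubernetes namespace to launch the pod in
    image="python:3.7",        # any container image; a placeholder here
    cmds=["python", "-c"],
    arguments=["print('hello from a Kubernetes pod')"],
    get_logs=True,             # stream pod logs back into the Airflow task log
    dag=dag,
)
```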
Google Cloud Composer:
A fully managed workflow orchestration service built on Apache Airflow. Complex workflows, simplified.
Cloud Composer is a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span across clouds and on-premises data centers.
Built on the popular Apache Airflow open source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.
That is it for this article. If you have any questions, feel free to ask in the comment section below. Also, if you liked this article, please like and subscribe to my blog 🙂