Airflow introduction
- neovijayk
- Jun 13, 2020
- 3 min read
Most Data Science or Machine Learning startups face a common problem: getting data from different data sources, performing some transformations, and storing the result in a destination storage platform for use in a Machine Learning project. They want to reduce the time and effort it takes to create this data pipeline, and to automate the data-fetching task, so that the machine learning or data science team can focus more on ML model development.
Recently I found a tool for this: Apache Airflow. It is not used only for data pipelines; it can do more than that. Let’s see what Apache Airflow is and how we can use it in Data Science projects.
About Apache Airflow:
First, what is Apache Airflow?
According to Airflow’s website: Airflow is a platform to programmatically author, schedule and monitor workflows (sequences of tasks written in Python) or data pipelines.
It is an open-source data pipeline framework written in Python that allows you to configure, schedule, and monitor data pipelines programmatically.
It also provides a browser-based UI for management.
Directed Acyclic Graphs (DAGs) are Airflow’s core. I find DAGs very useful since they make it easy to see the flow of the tasks.
What is a Workflow?
A sequence of tasks written in Python, scheduled to be executed in a defined order.
Frequently used to handle big data processing pipelines
A workflow in Airflow is described as a DAG (Directed Acyclic Graph). A DAG consists of Tasks.
An example of a typical workflow is as follows:

DAG example to show workflow
The workflow in the DAG above, if executed, will perform the following tasks:
First, download data from the sources
Then send the data somewhere else for processing
Then monitor when the processing is completed
Then get the results and generate the report
And at the end, send the report out by email
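As a rough sketch (with hypothetical task names, placeholder Python callables, and Airflow 1.10-style import paths), that workflow could be written like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def download_data():
    print("downloading data from the sources")

def process_data():
    print("sending data somewhere else for processing")

def monitor_processing():
    print("monitoring until processing is completed")

def generate_report():
    print("getting the results and generating the report")

def email_report():
    print("sending the report out by email")


with DAG(
    dag_id="example_workflow",
    start_date=datetime(2020, 6, 1),
    schedule_interval="@daily",
) as dag:
    download = PythonOperator(task_id="download_data", python_callable=download_data)
    process = PythonOperator(task_id="process_data", python_callable=process_data)
    monitor = PythonOperator(task_id="monitor_processing", python_callable=monitor_processing)
    report = PythonOperator(task_id="generate_report", python_callable=generate_report)
    email = PythonOperator(task_id="email_report", python_callable=email_report)

    # The >> operator defines the execution order: a simple linear chain.
    download >> process >> monitor >> report >> email
```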
Example of DAG:

Defining operators and their order of execution
In the DAG we define the execution order of the operators programmatically. For example, we define the name of the DAG (as test), the start date, the schedule (once), and three operators together with their order of execution, as follows:
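A minimal sketch of such a definition might look like this (the BashOperator tasks are placeholders of my own, and the import paths assume Airflow 1.10.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="test",                    # name of the DAG
    start_date=datetime(2020, 6, 1),  # date from which the DAG can run
    schedule_interval="@once",        # run the DAG only once
)

# Three operators (simple BashOperators used as placeholders)
task_1 = BashOperator(task_id="task_1", bash_command="echo 'task 1'", dag=dag)
task_2 = BashOperator(task_id="task_2", bash_command="echo 'task 2'", dag=dag)
task_3 = BashOperator(task_id="task_3", bash_command="echo 'task 3'", dag=dag)

# Order of execution: task_1, then task_2, then task_3
task_1 >> task_2 >> task_3
```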
Each DAG may or may not have a schedule. schedule_interval is defined as a DAG argument and receives, preferably, a cron expression as a str or a datetime.timedelta object. Alternatively, you can also use one of the cron “presets” (such as @once, @hourly, @daily).
Note: Use schedule_interval=None and not schedule_interval='None' when you don’t want to schedule your DAG.
(Source: link)
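For illustration, here are a few ways schedule_interval could be set (the DAG names and the schedule values are hypothetical):

```python
from datetime import datetime, timedelta

from airflow import DAG

START = datetime(2020, 6, 1)

# cron expression passed as a str
cron_dag = DAG("cron_example", start_date=START, schedule_interval="0 6 * * *")

# datetime.timedelta object
delta_dag = DAG("timedelta_example", start_date=START, schedule_interval=timedelta(hours=6))

# one of the cron "presets"
preset_dag = DAG("preset_example", start_date=START, schedule_interval="@daily")

# no schedule at all: None (not the string 'None'); the DAG is only triggered manually
manual_dag = DAG("manual_example", start_date=START, schedule_interval=None)
```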
Operators:
Airflow’s Python API Reference lists the available Operators. Let’s take a look at them:
Operators allow for generation of certain types of tasks that become nodes in the DAG when instantiated. All operators derive from BaseOperator and inherit many attributes and methods that way. There are 3 main types of operators:
Operators that perform an action, or tell another system to perform an action
Transfer operators move data from one system to another
Sensors are a certain type of operator that will keep running until a certain criterion is met.
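As a small illustration, an action operator and a sensor could be combined like this (the task ids and file path are hypothetical, and the import paths assume Airflow 1.10.x); transfer operators follow the same pattern:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.sensors.file_sensor import FileSensor

with DAG(
    dag_id="operator_examples",
    start_date=datetime(2020, 6, 1),
    schedule_interval=None,
) as dag:
    # Sensor: keeps running (poking) until /tmp/input.csv exists
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/tmp/input.csv",
        poke_interval=60,  # re-check every 60 seconds
    )

    # Action operator: performs an action (here, a shell command)
    count_lines = BashOperator(
        task_id="count_lines",
        bash_command="wc -l /tmp/input.csv",
    )

    # Transfer operators (moving data from one system to another) follow the
    # same pattern, but are omitted here to keep the sketch self-contained.
    wait_for_file >> count_lines
```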
Web-based UI:
From the web-based UI we can view all DAGs, together with their execution schedules and the status of recent DAG runs.

DAG overview in the list screen
It is also possible to look at a detailed view of a DAG using the Tree View option provided.

Tree view of DAG
Two other important key terms in Apache Airflow:
Plugin: an extension to allow users to easily extend Airflow with various custom hooks, operators, sensors, macros, and web views.
Pools: concurrency limit configuration for a set of Airflow tasks.
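As a sketch, a task is assigned to a pool through the pool argument; the pool itself (here the hypothetical api_pool) would be created beforehand under Admin > Pools in the UI:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="pool_example",
    start_date=datetime(2020, 6, 1),
    schedule_interval=None,
) as dag:
    # Every task assigned to "api_pool" shares that pool's slot limit,
    # so only a limited number of such tasks run at the same time.
    call_api = BashOperator(
        task_id="call_external_api",
        bash_command="echo 'calling an external API'",
        pool="api_pool",
    )
```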
Where can we use Apache Airflow in Machine Learning projects?
The two most important processes where I can use this tool are:
Data pipelines
Machine Learning pipelines
That’s it. If you have any questions feel free to ask also if you find this article useful please like and subscribe to my blog. 🙂