Airflow Installation
- neovijayk
- Jun 9, 2020
- 7 min read
In this article we will take a look at two different approaches to installing Apache Airflow on an Ubuntu 18.04 VM instance on the Google Cloud Platform.
We will install Apache Airflow using two approaches: Approach 1, which worked for me, and Approach 2, which you will find on the official site and many other sites and which has worked for many people but did not work for me. So in this article I will explain:
Approach 1 and
Approach 2
Apache Airflow- Sub-packages available to download
Approach 1 for the installation of Apache Airflow
In this approach we will use a Jupyter notebook for the installation of Airflow, that is, we will run all the CLI commands from the notebook. Open a new Jupyter notebook to execute the commands below.
With Approach 2 I was facing errors while installing Apache Airflow using the commands mentioned on the official web page. After studying the errors I found that the cause was pip. With Approach 1 I did not face any such issue.
Install and update pip:
First we will install software-properties-common by running this command in a Jupyter notebook cell:
!sudo apt-get install software-properties-common
But what is “software-properties-common”?
This software provides an abstraction of the used apt repositories.
It allows you to easily manage your distribution and independent software vendor software sources.
In practice that means it provides some useful scripts for adding and removing PPAs plus the DBUS backends to do the same via the Software and Updates GUI.
Without it, you would need to add and remove repositories (such as PPAs) manually by editing /etc/apt/sources.list and/or any subsidiary files in /etc/apt/sources.list.d.
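For example, adding and removing a PPA with the scripts this package provides looks like the following (the PPA name here is purely hypothetical; adapt it to whatever repository you actually need):
!sudo add-apt-repository -y ppa:example/ppa          # adds a .list entry under /etc/apt/sources.list.d/
!sudo add-apt-repository -y --remove ppa:example/ppa # removes it again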
Now adding repositories:
!sudo apt-add-repository universe
Repositories on a Linux server:
Well-known Linux distributions include CentOS, Debian and Ubuntu.
Each of these distributions offers standard repositories for software packages, from which applications can be easily installed.
These standard repositories contain a wide variety of software packages, but as these packages are curated by the distributions, they are chosen for stability and licensing.
It can happen that a necessary software package is not available in the standard repositories. In these cases, extra repositories can be added to your server, thus allowing different or newer software to be installed.
In Ubuntu there are four main repositories:
1. Main, 2. Restricted, 3. Universe, and 4. Multiverse.
Main is the default basic repository of officially supported software, as curated by Canonical.
Restricted is a repository containing supported software which is not open source, such as MP3 or Flash.
Universe is maintained by the greater Ubuntu community of users and developers. These packages are not officially supported, but tend to be newer releases. Multiverse contains software that is not free, i.e. restricted by copyright or legal issues.
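To see which of these components are currently enabled on your VM, you can inspect the apt sources directly (the paths assume a stock Ubuntu 18.04 image):
!grep -h '^deb ' /etc/apt/sources.list /etc/apt/sources.list.d/*.list 2>/dev/null
Each line ends with the enabled components, for example "main restricted universe multiverse".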
Download and install the updates:
!sudo apt-get update
What is apt-get update?
apt-get update downloads the package lists from the repositories and “updates” them to get information on the newest versions of packages and their dependencies.
It will do this for all repositories and PPAs.
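After updating, you can optionally check which repository a given package would be installed from and which version is the candidate; for example, for the python-pip package we install next:
!apt-cache policy python-pip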
!sudo apt-get install python-pip --yes
Install Airflow
!python --version # check the python version
Python 3.6.3 :: Anaconda, Inc.
Now set up an environment variable and install Apache Airflow using pip:
!SLUGIFY_USES_TEXT_UNIDECODE=yes pip install apache-airflow
After successful installation you will get the following message:
Successfully built setproctitle
Installing collected packages: json-merge-patch, setproctitle, apache-airflow
Successfully installed apache-airflow-1.10.5 json-merge-patch-0.2 setproctitle-1.1.10
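If you later want to reproduce exactly this environment, you can pin the Airflow version explicitly with pip (1.10.5 is simply the version that happened to be current at the time of writing):
!pip install apache-airflow==1.10.5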
Now we will verify whether Airflow has been installed correctly or not. Open a new terminal and execute the following command:
airflow --version
If Airflow is installed correctly, the output will show the Airflow version installed on the machine.
Initialise database:
Now we will initialise the database. I am going to use SQLite as the backend database for Airflow.
!airflow initdb
[2019-09-30 03:15:08,935] {__init__.py:51} INFO - Using executor SequentialExecutor
DB: sqlite:////home/shraddha_sane/airflow/airflow.db
[2019-09-30 03:15:09,706] {db.py:369} INFO - Creating tables
INFO [alembic.runtime.migration] Context impl SQLiteImpl.....
Open a new terminal and execute the following command; we can see that the airflow folder now contains the following files:
~/airflow$ ls
airflow-webserver.pid airflow.cfg airflow.db logs unittests.cfg
Create a DAG folder and set the path of this folder in the Airflow config file:
After this step we will create a DAG folder in which we can put the Python .py files that contain our DAG programs. You can create this DAG folder outside the airflow directory as well; we just need to provide the path of the DAG folder in the config file (airflow.cfg). For this explanation I am going to create the DAG folder in the home directory:
!mkdir ~/DAG # creating a DAG folder in the home directory
Now open the config file to set the path. From the terminal, run the following command:
~/airflow$ nano airflow.cfg
This will open the airflow.cfg file. Here we will change the value of dags_folder to the path of the DAG folder we just created, as follows:
Change dags_folder and set it to the path of the created DAG folder, that is ~/DAG in our example.

After that, save the config file. There are more options available in this config file that we can change according to our needs. Please read through the file to learn about the options provided to us.
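If you prefer to make this change from the shell instead of nano, a minimal sketch with sed (assuming the default config location ~/airflow/airflow.cfg) is:
~/airflow$ sed -i "s|^dags_folder = .*|dags_folder = $HOME/DAG|" airflow.cfg
~/airflow$ grep ^dags_folder airflow.cfg   # verify the new value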
Now once again initialise the SQLite database:
~/airflow$ airflow initdb
This command will pick up the changes, and they will be reflected in the browser UI of Airflow.
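To have something to look at in the UI, you can drop a minimal test DAG into ~/DAG. The file name and DAG below are just an illustrative sketch (written here via a shell heredoc), using operators that ship with Airflow 1.10:
cat > ~/DAG/hello_dag.py <<'EOF'
# Minimal example DAG: one task that echoes a greeting once a day
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "hello_dag",
    start_date=datetime(2019, 9, 1),
    schedule_interval="@daily",
    catchup=False,
)

say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)
EOF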
Open the Airflow UI in the browser: start the Airflow web server and scheduler
Open two new terminals: one to start the web server (you can set the port as well) and the other for the scheduler. On terminal 1:
~/$ airflow webserver --port=5000
This command will start the Airflow web server on port 5000. I am using this port number because I have created a firewall rule for it on my server so that it can be accessed from outside. If you are installing on your local machine there is no need to provide a port number; just run airflow webserver and you can access the Airflow UI on port 8080 (the default).
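If you are on a Google Cloud VM and have not opened the port yet, a firewall rule can be created with the gcloud CLI (from Cloud Shell or any machine with the Cloud SDK) roughly as follows; the rule name and source range are placeholders you should adapt to your own project:
gcloud compute firewall-rules create allow-airflow-ui --allow=tcp:5000 --source-ranges=0.0.0.0/0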
Now start the scheduler by running the following command on terminal 2:
~/$ airflow scheduler
Note that this command may print the following line if you are using SQLite as the back-end database for Airflow:
[2019-10-01 09:54:31,728] {dag_processing.py:748} ERROR - Cannot use more than 1 thread when using sqlite. Setting parallelism to 1
You can ignore this; keep these two terminals running and open the Airflow UI. As the message says, SQLite does not support parallelism, that is, running two threads or processes simultaneously.
From the IP address and the port number we can access the browser-based Airflow UI. Since I am using a Google Cloud Platform VM instance, I will use its external IP address with the port number mentioned above, that is 5000, in the browser (because I am accessing a process running on the server from my local machine), like this:
35.200.194.11:5000
This will open the Airflow UI in the browser. That is it, we have installed Airflow successfully. Now we will take a look at Approach 2 for the installation.
Approach 2:
Create a new directory "airflow" in the home (~) directory, set it as the Airflow home and install Airflow in it:
~$ mkdir airflow
~$ export AIRFLOW_HOME=~/airflow
~$ echo $AIRFLOW_HOME

Upon running these commands, Airflow will create the $AIRFLOW_HOME folder and lay down an airflow.cfg file with defaults that get you going fast. The last command will print the path set for AIRFLOW_HOME. Note that Airflow needs a home; ~/airflow is the default, but you can lay the foundation somewhere else if you prefer.
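Also note that an export only lasts for the current shell session. If you want AIRFLOW_HOME to survive new terminals and reboots, you can append it to your shell profile, for example:
~$ echo 'export AIRFLOW_HOME=~/airflow' >> ~/.bashrc
~$ source ~/.bashrc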
Now, inside the airflow folder, we will install Apache Airflow. Before installing Apache Airflow, upgrade pip:
~$ cd airflow/
~/airflow$ pip install --upgrade pip
~/airflow$ pip install apache-airflow
~/airflow$ airflow initdb
If this worked for you, then you can start the web server and scheduler as described above in Approach 1.
Apache Airflow- Sub-packages available to download
The apache-airflow PyPI basic package only installs what’s needed to get started. Subpackages (extra features) can be installed depending on what will be useful in your environment.
What other features (other than SQLite) are available to install along with Airflow?
Installing Airflow on its own is fine for testing the waters, but in order to build something somewhat meaningful, we'll need to install one of Airflow's many "extra features". Each Airflow "feature" we install enables a built-in integration between Airflow and a service, most commonly a database. Airflow installs the SQLite feature by default.
Airflow needs a database in which to create the tables necessary for running Airflow. Chances are we won't be using a local SQLite database when we use Airflow in production, so I've opted to use a Postgres database:
Airflow leverages the familiar SQLAlchemy library to handle database connections. As a result, the act of setting database connection strings should all be familiar.
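For illustration, with the postgres extra installed (shown below), pointing Airflow at Postgres is a matter of editing the sql_alchemy_conn line in airflow.cfg and re-initialising the database; the user, password and database name here are placeholders for your own Postgres setup:
~/airflow$ grep ^sql_alchemy_conn airflow.cfg   # shows the current (SQLite) connection string
~/airflow$ sed -i "s|^sql_alchemy_conn = .*|sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db|" airflow.cfg
~/airflow$ airflow initdb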
Airflow has features for much more than just databases. Some features which can be installed with airflow include Redis, Slack, HDFS, RabbitMQ, and a whole lot more. To see everything available, check out the list: https://airflow.apache.org/installation.html
You can also install Airflow with support for extra features like gcp or postgres as follows:
~/airflow$ pip install 'apache-airflow[postgres,gcp]'
Behind the scenes, Airflow does conditional imports of operators that require these extra dependencies. The full list of subpackages and what they enable is on Airflow's official website. Some of the important ones are as follows:
Subpackage, install command, and what it enables:
all (pip install 'apache-airflow[all]'): all Airflow features known to man
mysql (pip install 'apache-airflow[mysql]'): MySQL operators and hook, support as an Airflow backend. The MySQL server version has to be 5.6.4+. The exact upper bound depends on the version of the mysqlclient package. For example, mysqlclient 1.3.12 can only be used with MySQL server 5.6.4 through 5.7.
gcp (pip install 'apache-airflow[gcp]'): Google Cloud Platform
google_auth (pip install 'apache-airflow[google_auth]'): Google auth backend
kubernetes (pip install 'apache-airflow[kubernetes]'): Kubernetes Executor and operator
password (pip install 'apache-airflow[password]'): password authentication for users
postgres (pip install 'apache-airflow[postgres]'): PostgreSQL operators and hook, support as an Airflow backend
When to use SQLite?
Airflow requires a database to be initiated before you can run tasks. If you’re just experimenting and learning Airflow, you can stick with the default SQLite option.
We can use SQLite as the back-end database for Airflow. This database is suitable when we want to run Airflow on a single server and for development purposes.
Why is it recommended to use either MySQL or PostgreSQL instead of SQLite?
Although SQLite is good for experimentation and development purposes, if you run Airflow in production (or in a distributed system) it is recommended to use either MySQL or PostgreSQL.
Out of the box, Airflow uses a sqlite database, which you should outgrow fairly quickly since no parallelization is possible using this database backend. It works in conjunction with the airflow.executors.sequential_executor.SequentialExecutor which will only run task instances sequentially. While this is very limiting, it allows you to get up and running quickly and take a tour of the UI and the command line utilities.
If you want to take a real test drive of Airflow, you should consider setting up a real database backend and switching to the LocalExecutor. As Airflow was built to interact with its metadata using the great SqlAlchemy library, you should be able to use any database backend supported as a SqlAlchemy backend. Hence it is recommended to use MySQL or Postgres.
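As a sketch, once a MySQL or Postgres backend is configured in sql_alchemy_conn (see the example above), switching the executor is again a one-line change in airflow.cfg followed by re-initialising the database:
~/airflow$ sed -i "s|^executor = .*|executor = LocalExecutor|" airflow.cfg
~/airflow$ airflow initdb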
Why does Airflow need a database?
Airflow needs a database in which to create the tables necessary for running Airflow.
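With the default SQLite backend you can inspect those tables directly, assuming the sqlite3 command-line tool is available on the VM (sudo apt-get install sqlite3 if it is not):
~$ sqlite3 ~/airflow/airflow.db '.tables'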
That’s it for this article. If you have any questions feel free to ask in the comment section below, and please like and subscribe to my blog. 🙂