
Airflow Installation

  • neovijayk
  • Jun 9, 2020
  • 7 min read

In this article we will take a look at two different approaches to installing Apache Airflow on an Ubuntu 18.04 VM instance on the Google Cloud Platform.


We will install Apache Airflow using two approaches: Approach 1, which worked for me, and Approach 2, which you will find on the official site and on many other sites; it has worked for many people but did not work for me. So in this article I will explain:

  1. Approach 1,

  2. Approach 2, and

  3. Apache Airflow – Sub-packages available to download

Approach 1 for the installation of Apache Airflow

In this approach we will use a Jupyter notebook for the installation of Airflow, that is, we will run all the CLI commands from the Jupyter notebook. Open a new Jupyter notebook to execute all the commands.

In Approach 2 I faced errors while installing Apache Airflow using the commands mentioned on the official web page. After studying the errors I found that the cause was pip. In Approach 1 I did not face such issues.

Install and update pip:

First we will install software-properties-common by running this command in a Jupyter notebook cell:

!sudo apt-get install software-properties-common

But what is “software-properties-common”?

  • This software provides an abstraction of the used apt repositories.

  • It allows you to easily manage your distribution and independent software vendor software sources.

  • In practice that means it provides some useful scripts for adding and removing PPAs plus the DBUS backends to do the same via the Software and Updates GUI.

  • Without it, you would need to add and remove repositories (such as PPAs) manually by editing /etc/apt/sources.list and/or any subsidiary files in /etc/apt/sources.list.d

(for more details refer to this link)

Now add the universe repository:

!sudo apt-add-repository universe

What are repositories on a Linux server?

  • Well-known Linux distributions include CentOS, Debian, and Ubuntu.

  • Each of these distributions offers standard repositories of software packages, from which applications can be easily installed.

  • These standard repositories contain a wide variety of software packages, but as these packages are curated by the distributions, they are chosen for stability and licensing.

  • It can happen that a necessary software package is not available in the standard repositories. In these cases, extra repositories can be added to your server, thus allowing different or newer software to be installed.

In Ubuntu there are four main repositories:

1. Main, 2. Restricted, 3. Universe, and 4. Multiverse.

  • Main is the default basic repository of officially supported software, as curated by Canonical.

  • Restricted is a repository containing supported software which is not open source, such as MP3 or Flash.

  • Universe is maintained by the greater Ubuntu community of users and developers. These are not officially supported, but tend to be the newer releases.

Download and install the updates:

!sudo apt-get update

What is apt-get update?

  • apt-get update downloads the package lists from the repositories and “updates” them to get information on the newest versions of packages and their dependencies.

  • It will do this for all repositories and PPAs.

!sudo apt-get install python-pip --yes

Install Airflow

!python --version  # check the python version

Python 3.6.3 :: Anaconda, Inc.

Now set up an environment variable and install Apache Airflow using pip:

!export SLUGIFY_USES_TEXT_UNIDECODE=yes
!pip install apache-airflow

After successful installation you will get following message:

Successfully built setproctitle
Installing collected packages: json-merge-patch, setproctitle, apache-airflow
Successfully installed apache-airflow-1.10.5 json-merge-patch-0.2 setproctitle-1.1.10

Now we will verify whether we have installed Airflow correctly. Execute the following command (from a notebook cell; drop the leading ! if you run it in a terminal instead):

!airflow --version

If Airflow is correctly installed, the output will show the Airflow version installed on the machine.
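Alternatively, you can check from Python itself, for example from a notebook cell. A minimal sketch, assuming the notebook runs in the same Python environment that pip installed Airflow into:

# Import the package and print the installed version
import airflow

print(airflow.__version__)  # e.g. 1.10.5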

Initialise database:

Now we will initialise the database. I am going to use SQLite as the backend database for Airflow.

!airflow initdb

[2019-09-30 03:15:08,935] {__init__.py:51} INFO - Using executor SequentialExecutor
DB: sqlite:////home/shraddha_sane/airflow/airflow.db
[2019-09-30 03:15:09,706] {db.py:369} INFO - Creating tables
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.....

Open a new terminal and execute the following command; we can see that the airflow folder now contains the following files:

~/airflow$ ls
airflow-webserver.pid  airflow.cfg  airflow.db  logs  unittests.cfg

Create a DAG folder and set the path of this folder in airflow config file:

After this step we will create a DAG folder in which we can put the Python .py files containing our DAG definitions. You can create this DAG folder outside the airflow directory as well; you just need to provide the path of the DAG folder in the config file (airflow.cfg). For this explanation I am going to create the DAG folder in the home directory:

!mkdir ~/DAG   # creating a DAG folder in the home directory
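To make this concrete, here is a minimal sketch of the kind of .py file you could drop into ~/DAG. The file name hello_dag.py, the dag_id and the single task are made up for illustration; the imports follow the Airflow 1.10 style used in this article:

# hello_dag.py - a minimal example DAG to place in the ~/DAG folder (illustrative only)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10-style import

default_args = {
    "owner": "airflow",
    "start_date": datetime(2020, 6, 1),
}

# The DAG object groups tasks and tells the scheduler how often to run them
with DAG(
    dag_id="hello_dag",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A single task that simply echoes a message
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow'",
    )

Once the dags_folder path is set in airflow.cfg (next step), a file like this should show up in the Airflow UI.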

Now open the config file to set the path. From the terminal run the following command


~/airflow$ nano airflow.cfg


This will open the airflow.cfg file. Here we will change the value of dags_folder to the path of the created DAG folder location, as follows:

Change the dags_folder and set it to the path of the created DAG folder, that is ~/DAG in our example.



After that, save the config file. In this config file there are more options available that we can change according to our needs. Please read this file to learn more about the options provided to us.
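If you want to double-check that the new path is being picked up, here is a small optional sketch run from Python (assuming the default AIRFLOW_HOME of ~/airflow; run it in a fresh Python session so the config file is re-read):

# Print the dags_folder value that Airflow reads from airflow.cfg
from airflow.configuration import conf

print(conf.get("core", "dags_folder"))  # should show the path of your DAG folder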

Now once again initialise the SQLite database:

~/airflow$ airflow initdb

This command will apply the changes, and the changes will be reflected in the browser UI of Airflow.

Open the Airflow UI in the browser: start the Airflow web server and scheduler

Open two new terminals: one to start the web server (you can set the port as well) and the other for the scheduler. On terminal 1:

~$ airflow webserver --port=5000

This command will start the Airflow web server on port 5000. I have created a firewall rule for this port on my server so that it can be accessed from outside, which is why I am using this port number. If you are installing on your local machine there is no need to provide a port number: just run airflow webserver, and in that case you can access the Airflow UI on port 8080 (the default).

Now start the scheduler by running following command on terminal 2:

~$ airflow scheduler

Note that this command may print the following line if you are using SQLite as the back-end database for Airflow:

[2019-10-01 09:54:31,728] {dag_processing.py:748} ERROR - Cannot use more than 1 thread when using sqlite. Setting parallelism to 1

You can ignore this; keep these two terminals running and open the Airflow UI. The message appears because, as printed, SQLite does not support parallelism, that is, running two threads or processes simultaneously.

From the IP address and the port number we can access the Airflow browser-based UI. Since I am using a Google Cloud Platform VM instance (a remote server), I will use its external IP address with the port number mentioned above, that is 5000, in the browser (since I am accessing the process running on the server from my local machine), like this:

35.200.194.11:5000

This will open the Airflow UI in the browser. That's it: we have installed Airflow successfully. Now we will take a look at Approach 2 for the installation.

Approach 2:

Create a new directory “airflow” in the home (~) directory, set it as the Airflow home, and install Airflow in it:

~$ mkdir airflow
~$ export AIRFLOW_HOME=~/airflow
~$ echo $AIRFLOW_HOME






Upon running Airflow for the first time, it will create the $AIRFLOW_HOME folder and lay down an “airflow.cfg” file with defaults that get you going fast. The last command will print the path set for AIRFLOW_HOME. Note that Airflow needs a home; ~/airflow is the default, but you can lay the foundation somewhere else if you prefer.

Now we will install Apache Airflow in the airflow folder. Before installing Apache Airflow, upgrade pip:

~$ cd airflow/
~/airflow$ pip install --upgrade pip
~/airflow$ pip install apache-airflow
~/airflow$ airflow initdb

If this worked for you, then you can start the web server and the scheduler as described above in Approach 1.

Apache Airflow – Sub-packages available to download

The apache-airflow PyPI basic package only installs what’s needed to get started. Subpackages (extra features) can be installed depending on what will be useful in your environment.

What other features (other than SQLite) are available to install along with Airflow?

  • Installing Airflow on its own is fine for testing the waters, but in order to build something somewhat meaningful, we'll need to install one of Airflow's many “extra features”. Each Airflow “feature” we install enables a built-in integration between Airflow and a service, most commonly a database. Airflow installs the SQLite feature by default.

  • Airflow needs a database to create the tables necessary for running Airflow. Chances are we won't be using a local SQLite database when we use Airflow in production, so I've opted to use a Postgres database.

  • Airflow leverages the familiar SQLAlchemy library to handle database connections. As a result, the act of setting database connection strings should all be familiar (a small sketch after this list shows the form such a connection string takes).

  • Airflow has features for much more than just databases. Some features which can be installed with airflow include Redis, Slack, HDFS, RabbitMQ, and a whole lot more. To see everything available, check out the list: https://airflow.apache.org/installation.html
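As a hedged illustration of the SQLAlchemy-style connection string mentioned above, the sketch below builds a Postgres URI of the form Airflow expects in the sql_alchemy_conn setting of airflow.cfg and checks that it can connect. The host, user, password and database names are placeholders, and it assumes the psycopg2 driver is installed (the postgres extra pulls it in):

# Sanity-check a Postgres connection URI of the kind used for Airflow's
# sql_alchemy_conn setting. All connection details below are placeholders.
from sqlalchemy import create_engine, text

conn_uri = "postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db"

engine = create_engine(conn_uri)
with engine.connect() as connection:
    # If this prints the server version, Airflow should be able to use the same URI
    print(connection.execute(text("SELECT version()")).scalar())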

You can also install Airflow with support for extra features like gcp or postgres as follows:

~/airflow$ pip install apache-airflow[postgres,gcp]

Behind the scenes, Airflow does conditional imports of operators that require these extra dependencies. There is a list of the subpackages and what they enable on Airflow's official website. Some of the important ones are as follows:

subpackage   | install command                           | enables
all          | pip install 'apache-airflow[all]'         | All Airflow features known to man
mysql        | pip install 'apache-airflow[mysql]'       | MySQL operators and hook, support as an Airflow backend. The version of MySQL server has to be 5.6.4+. The exact version upper bound depends on version of mysqlclient package. For example, mysqlclient 1.3.12 can only be used with MySQL server 5.6.4 through 5.7.
gcp          | pip install 'apache-airflow[gcp]'         | Google Cloud Platform
google_auth  | pip install 'apache-airflow[google_auth]' | Google auth backend
kubernetes   | pip install 'apache-airflow[kubernetes]'  | Kubernetes Executor and operator
password     | pip install 'apache-airflow[password]'    | Password authentication for users
postgres     | pip install 'apache-airflow[postgres]'    | PostgreSQL operators and hook, support as an Airflow backend

When to use SQLite?

  • Airflow requires a database to be initiated before you can run tasks. If you’re just experimenting and learning Airflow, you can stick with the default SQLite option.

  • We can use SQLite as the back-end database for Airflow. This database is suitable when we want to use Airflow on a single server and for development purposes.

Why is it recommended to use MySQL or PostgreSQL instead of SQLite?

  • Although SQLite is good for experimentation and development purposes, if you use Airflow in production (or in a distributed system) it is recommended to use either MySQL or PostgreSQL.

  • Out of the box, Airflow uses a sqlite database, which you should outgrow fairly quickly since no parallelization is possible using this database backend. It works in conjunction with the airflow.executors.sequential_executor.SequentialExecutor which will only run task instances sequentially. While this is very limiting, it allows you to get up and running quickly and take a tour of the UI and the command line utilities.

  • If you want to take a real test drive of Airflow, you should consider setting up a real database backend and switching to the LocalExecutor. As Airflow was built to interact with its metadata using the great SQLAlchemy library, you should be able to use any database backend supported as a SQLAlchemy back-end. Hence it is recommended to use MySQL or Postgres.

Why does Airflow need a database?

Airflow needs a database to create the tables necessary for running Airflow.

That’s it for this article. If you have any questions feel free to ask in the comment section below, and please like and subscribe to my blog. 🙂
