Apache Airflow on AWS ECS
- neovijayk
- Nov 7, 2020
- 7 min read
Disclaimer: this post assumes basic knowledge of Airflow, AWS ECS, VPC (security groups, etc.) and Docker. I suggest an architecture that may not be perfect nor the best fit for your particular case. If that's so, take from this read whatever is useful to you.
A little context
Where I work, we use Apache Airflow extensively. We have approximately 15 DAGs, which may not seem like a lot, but some of them have many steps (tasks) that involve downloading big SQL backups, transforming some values using Python and re-uploading them into our warehouse.

Screenshot of our Airflow dashboard
At first, we started using the Sequential Executor (no parallelism, just 1 task running at a time) because it's easy to set up and we didn't have many DAGs. As time went on, the number of DAGs kept increasing and some of them presented the opportunity to make use of parallelism, so with some configuration changes, we moved to the Local Executor.

Example DAG making use of a parallel setup
You may ask… why doesn't this setup cut it anymore?
Well, in both cases there was only 1 container deployed in AWS ECS doing everything: serving the web UI, scheduling the jobs and running the worker processes that execute them. This wasn't scalable; the only option we had was scaling vertically (you know, adding more vCPU, more RAM and such).
Furthermore, if something in the container fails, the whole thing fails (no high availability). Also, the whole service must be public for the webserver to be accessible through the internet; if you want to make certain components private (such as the scheduler and workers), that is NOT possible here.
Why create this guide?
There isn't any guide on how to deploy Airflow on AWS while making use of its extensive catalog of services. It's easy to deploy the whole thing locally using docker-compose or on an EC2 instance, but is that really what you want? What about completely isolated nodes talking to each other inside the same VPC? Making private what needs to be private and public what needs to be public?
The architecture

Architecture diagram made with Lucidchart
This whole diagram might look complicated at first glance, and maybe even frightening, but don't worry. What you have to take away from it is probably just the following:
One can only connect to Airflow’s webserver or Flower (we’ll talk about Flower later) through an ingress. There’s no point of access from the outside to the scheduler, workers, Redis or even the metadata database. You don’t want connections from the outside there.
Everything's inside the same VPC, to make things easier. Every object has its own security group, allowing connections only from the correct services (see the sketch after this list).
Everything is an AWS Fargate task, again, we’ll talk about Fargate later.
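As an illustration of that point about security groups, this is roughly how you could allow only the Airflow tasks' security group to reach Redis on its port. The group IDs are placeholders; in practice you'd add a similar rule per component.

```bash
# Illustrative sketch: Redis accepts 6379 only from the security group
# attached to the Airflow tasks (both group IDs are placeholders).
aws ec2 authorize-security-group-ingress \
  --group-id sg-0redis0000example \
  --protocol tcp \
  --port 6379 \
  --source-group sg-0airflow000example
```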
What services are we gonna use? Why?
Mainly, AWS ECS and Fargate are the stars of the show here.
Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service […] You can choose to run your ECS clusters using AWS Fargate, which is serverless compute for containers. Fargate removes the need to provision and manage servers, lets you specify and pay for resources per application, and improves security through application isolation by design.
A nice analogy about serverless computing is this one I read in this cool post:
Serverless computing is like driverless cars; there’s still a driver, just not one you can see. You don’t need to ask if the driver is hungry, tired, drunk or needs to stop for a bathroom break. If we ever let driverless cars drive on our roads it will be because we don’t have to care about the driver, just the fact that they will take us where we want to go — Mary Branscombe
I like to say that ECS is just a chill Kubernetes: without much to configure, it's ready to deploy your apps using just the Docker image and some extra settings, such as how much CPU or RAM you want your app to be able to use, whether you want auto-scaling, a Load Balancer out of the box, and what-not.
We should also set up a metadata database; for that, we're going to use the convenient RDS.
The Guide
Before anything else, we have to decide which Docker image we're gonna use and set it as the base image to build on top of. For this, I've used Puckel's Airflow. This Docker image gives you all you need to set up Airflow with any of the 3 main executors. With over 5 million downloads, it's safe to say that this guy has done a great job.
Dockerfile
Let's create our custom Dockerfile. Mine boils down to copying my own configuration, entrypoint and DAGs on top of Puckel's image.
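Here's a minimal sketch of what such a Dockerfile could look like, assuming puckel/docker-airflow as the base and the file layout described in the next paragraph. The exact tag and the requirements.txt copy are assumptions, not the exact file I used.

```dockerfile
# Minimal sketch, not the exact Dockerfile from this post.
# Base image: Puckel's Airflow (the tag is an assumption).
FROM puckel/docker-airflow:1.10.9

# WORKDIR in the base image is already /usr/local/airflow,
# so "." resolves to that directory.
COPY airflow.cfg ./airflow.cfg
COPY dags ./dags
COPY requirements.txt ./requirements.txt

# Custom Celery-only entrypoint (make sure it's executable before building)
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```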
I added a personal airflow.cfg (which has configurations for S3 logging and SMTP server credentials), a custom entrypoint.sh and a dags folder that has all my DAGs. In this case, . is already defined in the base image to be /usr/local/airflow via the WORKDIR instruction.
My entrypoint boils down to the following.
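This is a simplified sketch of an entrypoint along those lines, not my exact file; the POSTGRES_* and REDIS_* variable names, the pip install location and the Airflow config keys are assumptions.

```bash
#!/usr/bin/env bash
# Simplified sketch of a Celery-only entrypoint (not the exact file).
# Assumes POSTGRES_* and REDIS_* are injected by the ECS task definition.

export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT:-5432}/${POSTGRES_DB}"
export AIRFLOW__CELERY__RESULT_BACKEND="db+postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT:-5432}/${POSTGRES_DB}"
export AIRFLOW__CELERY__BROKER_URL="redis://${REDIS_HOST}:${REDIS_PORT:-6379}/1"

# Install project requirements if a requirements.txt is present
if [[ -e /usr/local/airflow/requirements.txt ]]; then
  pip install --user -r /usr/local/airflow/requirements.txt
fi

case "$1" in
  webserver)
    airflow initdb            # create/upgrade the metadata database
    exec airflow webserver
    ;;
  scheduler|worker|flower)
    exec airflow "$@"         # run the requested component
    ;;
  *)
    exec "$@"                 # fall back to any other command
    ;;
esac
```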
I got rid of the lines in the base image's entrypoint that handled the different executors and made it Celery-only, also getting rid of an annoying wait_for_port function that somehow didn't work.
What this whole thing does is: first, it sets up useful environment variables and then, depending on the command given in docker run, it follows a switch that executes different portions of code. Say you're launching a worker: it's going to install your Python requirements and then execute the worker process. If it's the webserver, it'll install the requirements as well, but it's also going to initialize the database with the airflow initdb command and then serve Airflow's UI.
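In other words, the first argument to the container picks its role. A couple of illustrative runs (the image name is a placeholder):

```bash
# Serve the web UI (also runs airflow initdb first)
docker run --rm -p 8080:8080 my-airflow-image webserver

# Start a Celery worker instead
docker run --rm my-airflow-image worker
```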
Local testing
If you want to test the whole thing and make sure everything works, you can do so with Puckel's Celery docker-compose YAML file.
After a while, you should be able to access localhost:8080 and see Airflow's dashboard. You can also access localhost:5555 and see Flower. From this point, run some example DAGs, or even yours, and see for yourself how things are processed: a trigger in the webserver, the scheduler grabbing the task and sending it to the queue, and finally, a worker picking it up and running it.
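For reference, spinning it up locally looks roughly like this, assuming you've cloned puckel/docker-airflow and pointed the compose file at your custom image (the compose file name may differ between versions):

```bash
# Start the Celery stack (webserver, scheduler, workers, Redis, Postgres, Flower)
docker-compose -f docker-compose-CeleryExecutor.yml up -d --scale worker=2

# Airflow UI:  http://localhost:8080
# Flower UI:   http://localhost:5555
```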
Uploading the docker images
For this tutorial, we’re going to keep it simple and use AWS ECR. ECR is just a Docker image repository.
ECR is integrated with ECS out of the box.

Source: AWS ECR
To create a repository, hop into the ECR console, click on Create repository and choose whatever name feels adequate. Tip: you can have a repository for your staging Airflow and one for production. Remember, all Airflow processes are going to use the same image to avoid redundancy and be DRY.

Screenshot of my AWS ECR
Now enter your fresh new repository and click on View push commands. That'll walk you through pushing your Airflow image to the repo. If an error like unable to locate credentials comes up during the first step, you probably haven't set your awscli credentials; look at this.
Once you've pushed the image and can see it in the ECR console, you're ready for the next step!
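Those push commands boil down to something like the following; the account ID, region and repository name are placeholders, so use the ones your console shows:

```bash
# Authenticate Docker against your ECR registry
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Build, tag and push the custom Airflow image
docker build -t airflow-celery .
docker tag airflow-celery:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/airflow-celery:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/airflow-celery:latest
```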
Deploying the services on ECS
Let's start by creating an ECS cluster: go to Services and choose ECS.
Since this is probably your first time, you're gonna see a screen presenting the service to you and an easy first-cluster button. What you need to do here is just create a Networking only cluster.

Screenshot of cluster creation in AWS ECS
At this point, you will probably have a window that looks like this. This is a matter of choice: whether you use the same VPC as the rest of your instances or create another one specifically for this cluster. I did the latter. If you choose to do that, I think just 1 subnet will be sufficient.
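If you prefer the CLI, creating the Fargate-only cluster itself is a single call; note that the console wizard can also create the VPC and subnet for you, whereas with the CLI you'd create those separately.

```bash
# Create an empty, Fargate-only ("Networking only") cluster
aws ecs create-cluster --cluster-name airflow-cluster
```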
Setting up the database (if you don’t have it already)
We’re going to set up a PostgreSQL 9.6 micro instance database. If you’re familiar with how to do this, feel free to do it and skip to the next step.
Go to Services -> RDS, go to the Databases section and Create database. Select the PostgreSQL logo and go for version 9.6.X; whatever minor version is fine. Now, I'm still deliberating on whether I'm just super cheap or the Airflow metadata database really doesn't need to be THAT robust, so I opted for a free tier micro instance. If you find that that isn't enough for you, it's easy to upgrade later, so don't worry.
The next configurations are up to you: whatever instance name, username and password, just make sure it's going to be created in the same VPC that the ECS cluster uses.
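A rough CLI equivalent of those console steps could look like this; the identifier, credentials, subnet group, security group and the exact 9.6 minor version are placeholders or assumptions:

```bash
# Free-tier micro Postgres instance for the Airflow metadata database
aws rds create-db-instance \
  --db-instance-identifier airflow-metadata \
  --db-instance-class db.t2.micro \
  --engine postgres \
  --engine-version 9.6.20 \
  --allocated-storage 20 \
  --master-username airflow \
  --master-user-password '<choose-a-strong-password>' \
  --db-subnet-group-name <subnet-group-in-the-ecs-vpc> \
  --vpc-security-group-ids sg-0123456789abcdef0
```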
Create Task definitions
Great, we have our empty cluster now. Let’s create our task definitions.
Task definitions are like blueprints: they define how your service is going to be executed, which container it's going to use, how much CPU and RAM are assigned to it, which ports are mapped, what environment variables it has, etc.
Go to Task Definitions in the left panel and click on Create new Task Definition.

Screenshot of AWS ECS
Remember, we want Fargate, so select it and hit Next step.

Screenshot of AWS ECS
From now on, we’ll have to create a task definition for the webserver, the scheduler, and the workers.
I’ll walk you through all the necessary configurations you must provide for every task to work correctly.
Task Definition Name: an identifier name. Choose something descriptive like airflow-webserver, airflow-worker, etc.
Task Role: the IAM role that's going to be injected into the container. Choose one that has permissions for what your task must do: extract secrets from Secrets Manager, log with the awslogs log driver, query buckets from S3. If you're not sure, just use the basic ecsTaskExecutionRole; if it's not present in the dropdown, check here.
Network Mode: awsvpc, since we're using Fargate.
Task execution role: the role that's going to be able to pull the image from AWS ECR and log to CloudWatch. ecsTaskExecutionRole has both these policies.
Task size: almost completely depends on you. Most of your resources will go to the workers, since they're gonna do all the dirty work. Just to offer a guide, these are my configurations:
Now click on Add container. A right panel is gonna pop up.
Container name: the container's identifier name.
Image: the ECR repository URL. Example: 1234567890.dkr.ecr.us-east-1.amazonaws.com/airflow-celery:latest. For Redis, use docker.io/redis:5.0.5.
Port mappings: for the webserver, 8080. For Flower, 5555. For the workers, 8793 (to access the logs). For Redis, 6379.
Under the ENVIRONMENT section, in Command, choose webserver, flower, worker or scheduler depending on which task you're creating.
You can also make use of environment variables! You can either use value to hardcode the env var or use valueFrom to pull it from Secrets Manager or AWS Parameter Store. But please, don't inject secrets without security measures. More info here.
For ALL services except Flower, you MUST set the POSTGRES_ variables, the ones we referenced in entrypoint.sh, remember? Without those, the services are going to fail miserably trying to connect to a non-existent database. The sketch below shows where these variables fit.
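To tie it together, here's roughly what an equivalent worker task definition could look like if registered from the CLI instead of the console. Every account ID, ARN, endpoint and value is a placeholder, and the variable names are the ones the entrypoint sketch above expects. Note that pulling a secret via valueFrom requires the execution role to have access to Secrets Manager.

```bash
aws ecs register-task-definition \
  --family airflow-worker \
  --requires-compatibilities FARGATE \
  --network-mode awsvpc \
  --cpu 1024 --memory 3072 \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --container-definitions '[{
    "name": "airflow-worker",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/airflow-celery:latest",
    "command": ["worker"],
    "portMappings": [{"containerPort": 8793}],
    "environment": [
      {"name": "POSTGRES_HOST", "value": "<your-rds-endpoint>"},
      {"name": "POSTGRES_PORT", "value": "5432"},
      {"name": "POSTGRES_USER", "value": "airflow"},
      {"name": "POSTGRES_DB",   "value": "airflow"},
      {"name": "REDIS_HOST",    "value": "<redis-host>"},
      {"name": "REDIS_PORT",    "value": "6379"}
    ],
    "secrets": [
      {"name": "POSTGRES_PASSWORD",
       "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:airflow/postgres-password"}
    ]
  }]'
```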
