Running Apache Airflow locally on Kubernetes (minikube)
The goal of this guide is to show how to run Airflow entirely on a Kubernetes cluster. This means that all Airflow components (i.e. webserver, scheduler and workers) run within the cluster.
Before we begin…
What does this article cover?
- How to define the Kubernetes objects needed to run Airflow, and why we need each of them
- How to deploy the Airflow components and run a DAG
What doesn’t this article cover?
- Doesn’t explain Airflow in detail
- Doesn’t explain Kubernetes in detail
Important: this deployment guide is not suitable for production environments
Source code can be found at https://github.com/ipeluffo/airflow-on-kubernetes.
Prerequisites
At the moment of writing, I’m using the following tools on macOS 10.14.6:
- Apache Airflow 1.10.9 (via the puckel/docker-airflow:1.10.9 Docker image)
- Docker
- minikube
- kubectl
Introduction
Airflow is described on its website as:
Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.
Kubernetes is described on its website as:
Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications.
What are we going to build?
We’ll deploy the Airflow webserver, the scheduler and a PostgreSQL instance as pods on a local minikube cluster, configured with the KubernetesExecutor so that each task runs in its own pod.
Why Airflow on Kubernetes?
Airflow offers a very flexible toolset to programmatically create workflows of any complexity. To run the individual tasks, Airflow relies on an executor, which determines how tasks are executed: locally, distributed with Celery, and so on.
In version 1.10.0, Airflow introduced a new executor called KubernetesExecutor that runs tasks on Kubernetes pods created on demand. This way, Airflow doesn’t waste resources on idle workers, as happens with other executors.
Docker image
To run Airflow’s components, we need a Docker image that Kubernetes will use to run the pods inside the cluster.
For this tutorial we’ll use the image puckel/docker-airflow (GitHub: https://github.com/puckel/docker-airflow), more specifically the image tag 1.10.9, which provides a flexible, configurable Airflow image based on the latest version of Airflow at the moment of writing (1.10.9).
The only issue is that the docker-airflow image doesn’t provide support for the KubernetesExecutor: https://github.com/puckel/docker-airflow/blob/1.10.9/README.md#usage
However, we can take advantage of some of the flexibility in the image definition to use the KubernetesExecutor anyway.
Customizations
Airflow dependency
One key thing that is not present in the image is Airflow’s extra Kubernetes dependency: https://github.com/puckel/docker-airflow/blob/1.10.9/Dockerfile#L62
Although the Dockerfile allows us to add more dependencies by setting the AIRFLOW_DEPS build argument, in this tutorial we’re not going to build our own custom Docker image, so we’re using another customization available in the entrypoint script of the image: https://github.com/puckel/docker-airflow/blob/1.10.9/script/entrypoint.sh#L26-L29
Thanks to that part of the entrypoint script, we can provide a requirements.txt file that is installed dynamically when the container starts. The downside is that every new pod has to install the dependency before starting, which makes startup slower. A better solution would be to build a custom, optimized Docker image for our purposes, but that is out of the scope of this guide.
Environment variables
Related to the lack of a customized Docker image, we also need to adjust Airflow’s configuration so it uses the KubernetesExecutor.
Similar to what was described above, we could create our own configuration file and load it into a custom image, or we can override individual configuration settings using environment variables.
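Airflow maps environment variables of the form AIRFLOW__<SECTION>__<KEY> onto the corresponding airflow.cfg entries. As a quick illustration of the naming convention (these are examples, not the full set of variables we’ll use):

```sh
# Overrides [core] executor = KubernetesExecutor
AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
# Overrides [core] load_examples = False
AIRFLOW__CORE__LOAD_EXAMPLES=False
```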
Kubernetes objects
ConfigMap
From Kubernetes site:
ConfigMaps allow you to decouple configuration artifacts from image content to keep containerized applications portable.
In our case, we’re going to create two ConfigMap objects, described below.
ConfigMap: requirements.txt
As described above, we need to add a requirements file in order to install Airflow’s Kubernetes dependency (i.e. apache-airflow[kubernetes]).
Kubernetes provides different ways to create the file; in this case we’ll use a ConfigMap that will later be mounted as a Volume to create the necessary file in the pod’s file system. After all, the objective of this guide is not only to show Airflow running on Kubernetes, but also to use and learn about the different tools that Kubernetes provides.
Below is the ConfigMap definition:
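The complete manifest lives in the companion repository. Assuming the object is named requirements-configmap (an illustrative name that must match what the requirements volume references later), it has roughly this shape:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: requirements-configmap   # illustrative name, referenced later by the requirements volume
data:
  requirements.txt: |
    apache-airflow[kubernetes]==1.10.9
```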
ConfigMap: environment variables
To customise Airflow’s configuration, we’ll set environment variables that override the file-based configuration. To achieve this, we can define the env vars directly within the Kubernetes object definition, or we can create a ConfigMap and configure the object to load the env vars from it.
Below is the ConfigMap for our custom environment variables:
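A sketch of that ConfigMap, containing the variables explained below (the object name airflow-envvars-configmap and all of the values are assumptions to adapt to your own setup):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: airflow-envvars-configmap   # illustrative; referenced by AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF below
data:
  EXECUTOR: Kubernetes              # the image entrypoint turns this into AIRFLOW__CORE__EXECUTOR=KubernetesExecutor
  LOAD_EX: "y"                      # load Airflow's example DAGs
  POSTGRES_HOST: postgres           # the Service name of the PostgreSQL instance defined later
  POSTGRES_USER: airflow
  POSTGRES_PASSWORD: airflow
  POSTGRES_DB: airflow
  # Overridden to avoid the double-bracket parsing issue mentioned below
  AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS: '{"_request_timeout":[60,60]}'
  AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: puckel/docker-airflow
  AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: "1.10.9"
  AIRFLOW__KUBERNETES__DAGS_VOLUME_HOST: /data/dags          # path on the minikube node
  AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM: airflow-logs-pvc   # the PersistentVolumeClaim defined later
  AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF: airflow-envvars-configmap
```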
Following is the explanation for each of the env vars:
- EXECUTOR: we need this one to dynamically set Airflow’s executor. The Docker image’s entrypoint script uses this env var to set the Airflow executor configuration.
- POSTGRES_*: these env vars are needed because our deployment runs a Postgres server to which the Airflow components connect to store information such as connections, variables and the state of DAGs and tasks.
- LOAD_EX: this env var is used to load Airflow’s example DAGs. Feel free to disable it if you don’t want to see or use the default DAGs.
- AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS: when developing this guide, I found that Airflow failed to parse the configuration file because of the double brackets in the value shipped with the Docker image, so we override it here.
- AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: all env vars with the prefix AIRFLOW__KUBERNETES__ are specific to the Kubernetes integration in Airflow. As the name suggests, this one specifies the Docker image to be used for the workers. In the context of Kubernetes, each worker runs in a pod.
- AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: this env var specifies the Docker image tag for the workers.
- AIRFLOW__KUBERNETES__DAGS_VOLUME_HOST: we’ll see this in more detail later. For now, it specifies the path of the volume on the host (i.e. the cluster node) where the DAG files are stored.
- AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM: this env var specifies the Kubernetes volume claim used to store and read logs. We’ll talk about this in more detail later.
- AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF: this specifies the name of the ConfigMap that stores the env vars (i.e. this one). It allows workers to load the env vars from this ConfigMap when they run.
Volumes
For each Airflow component (i.e. Kubernetes pod) we’re going to set up three volumes for different purposes using multiple Kubernetes tools:
- Volume for Logs
- Volume for requirements file
- Volume for DAGs
Volume: Logs
There are multiple alternatives for saving Airflow’s logs in a Kubernetes deployment. In this guide, we’ll define a Volume that allows us to persist the logs from all of Airflow’s components. If we didn’t set up a volume, the workers’ logs would be lost after the workers finish.
To achieve this, we need to create a PersistentVolumeClaim object:
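A minimal sketch of such a claim (the name airflow-logs-pvc, the access mode and the requested size are assumptions; the name must match the AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM env var):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airflow-logs-pvc      # must match AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM
spec:
  accessModes:
    - ReadWriteMany           # several pods read and write logs concurrently
  resources:
    requests:
      storage: 1Gi            # arbitrary size for a local test setup
```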
This volume claim will allow pods to create a volume attached to it. For more information about PersistentVolume and PersistentVolumeClaim objects: https://kubernetes.io/docs/concepts/storage/persistent-volumes/.
Important things about this object:
- metadata.name: this is the value used in the env var that sets the name of the persistent volume claim to be used by Airflow’s workers.
- accessModes: since the volume claim will be read and written by multiple pods, we need to set an access mode that allows them all to use it.
Volume: requirements file
As mentioned in the section of ConfigMap
s, the requirements file is declared as a ConfigMap
which will
be mounted as a file using a volume. This volume is defined within a Kubernetes object and looks like:
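Inside a pod spec it could look like this (the volume and ConfigMap names are the illustrative ones used earlier):

```yaml
volumes:
  - name: requirements-configmap    # referenced by a volumeMount in the container spec
    configMap:
      name: requirements-configmap  # must match the ConfigMap's metadata.name
```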
We need to set configMap.name to the same value that we used in the ConfigMap definition.
Volume: DAGs directory
For the DAGs directory, we’ll use a tool that is only suitable for a local deployment, but that gives us a lot of flexibility when we need to test changes on DAGs.
Below is the definition of this volume:
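Assuming the node path /data/dags (it must match AIRFLOW__KUBERNETES__DAGS_VOLUME_HOST and the target of the minikube mount shown later), the volume entry looks roughly like this:

```yaml
volumes:
  - name: dags-host-volume
    hostPath:
      path: /data/dags    # a path on the minikube node, not on your machine
      type: Directory
```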
In this case we mount a volume of type hostPath. This means that the pod’s volume is attached to a path (a file or a directory) on the cluster node. This last point is very important: the path must exist on the cluster node, not on the host machine. Then, to make this work, we can either copy the files into a directory on the cluster node or mount a host directory into the Kubernetes node (in our case the minikube cluster).
Since this is a test environment where we’d like to easily change DAGs and run tests, we’ll mount a host folder into the minikube cluster by running the following command:
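Assuming the DAGs live in a local dags/ folder and using the /data/dags node path from the sketches above, the command would look like this:

```sh
# Mount the local dags/ directory into the minikube node at /data/dags.
# The command stays in the foreground and must keep running while you work.
minikube mount ./dags:/data/dags
```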
When running this command, minikube prints a confirmation of the mounted path and then stays in the foreground; it has to keep running for as long as the pods need access to the DAGs folder.
Finally, with the mount running, pods are able to mount the cluster node’s folder, where they can read and write files that live on our host machine.
PostgreSQL
In this and the following sections, we’ll define the necessary Kubernetes objects to run the pods that we need to run Airflow.
First, we’ll define a Deployment and a Service to run the PostgreSQL instance that Airflow will use. We could also define Pod objects directly, but in this case they’ll be created automatically when we create the Deployment.
If you’re not familiar with these Kubernetes concepts, I recommend having a quick read of the links below:
- Deployment: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
- Service: https://kubernetes.io/docs/concepts/services-networking/service/
Service definition:
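Roughly (the Service name postgres is the important part, since other pods will use it as the hostname; labels and ports are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres        # other pods reach PostgreSQL through this DNS name
spec:
  selector:
    app: postgres       # must match the labels of the PostgreSQL pods
  ports:
    - port: 5432
      targetPort: 5432
```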
This Service object will allow us to expose the pod’s port to be able to connect to the PostgreSQL instance. The Kubernetes cluster runs a DNS service that allows other pods to connect to this service using its name (i.e. postgres).
Deployment definition:
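Roughly, assuming a stock postgres image and the same credentials used in the env vars ConfigMap (image tag and names are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres            # matched by the Service selector
    spec:
      containers:
        - name: postgres
          image: postgres:11     # illustrative image tag
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_USER
              value: airflow
            - name: POSTGRES_PASSWORD
              value: airflow
            - name: POSTGRES_DB
              value: airflow
```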
For the deployment, we specify a single replica and the env vars needed to set up the default user and database created when the container starts.
Airflow webserver
Now, we’re going to go through the definition of the Service and Deployment objects for the Airflow webserver. In this case, we need a Service because we need to connect to the webserver from our machine, and for that we have to expose its port outside the cluster.
Service definition:
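Roughly (the service name, labels and NodePort value are assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: airflow-webserver
spec:
  type: NodePort                # exposes the port on the cluster node so we can reach it from our machine
  selector:
    app: airflow-webserver      # must match the labels of the webserver pods
  ports:
    - port: 8080
      targetPort: 8080
      nodePort: 30809           # illustrative; NodePorts must be in the 30000-32767 range
```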
Most important things from the definition:
- selector: this will be used by the service to identify which pods should receive the traffic sent to the service.
- type: NodePort: this allows us to expose the service port so we’re able to connect to the service from our machine. For more info: https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types
- port: the port exposed to connect to the webserver.
Deployment full definition:
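Roughly, tying together the image, the env vars ConfigMap and the three volumes defined earlier (all names are the illustrative ones used above; the mount paths follow the puckel image’s default AIRFLOW_HOME of /usr/local/airflow):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-webserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: airflow-webserver
  template:
    metadata:
      labels:
        app: airflow-webserver
    spec:
      volumes:
        - name: requirements-configmap
          configMap:
            name: requirements-configmap
        - name: dags-host-volume
          hostPath:
            path: /data/dags
            type: Directory
        - name: logs-persistent-storage
          persistentVolumeClaim:
            claimName: airflow-logs-pvc
      containers:
        - name: airflow-webserver
          image: puckel/docker-airflow:1.10.9
          envFrom:
            - configMapRef:
                name: airflow-envvars-configmap   # loads all the env vars defined earlier
          ports:
            - containerPort: 8080
          resources:
            limits:
              memory: 2Gi                         # illustrative memory cap
          volumeMounts:
            - name: requirements-configmap
              mountPath: /requirements.txt
              subPath: requirements.txt           # mounts the ConfigMap entry as a single file
            - name: dags-host-volume
              mountPath: /usr/local/airflow/dags
            - name: logs-persistent-storage
              mountPath: /usr/local/airflow/logs
```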
A few comments about this definition:
- volumes section: as mentioned in the Volumes section of this guide, we need to define the volumes to be used by the pods. In this case, we define one volume for the DAGs, one for the requirements file and one for the logs persistent volume claim.
- volumeMounts: to be able to use the volumes, we need to specify how they should be mounted within the pod, and this section is used for that.
- ports: defines the pod’s port to expose.
- resources: in this example, we just specify the maximum amount of memory the pod is allowed to use.
Airflow scheduler
In the case of the scheduler, we only need to create a deployment since it doesn’t expose any
service that other workers need to connect to. However, the scheduler is responsible of creating
workers (i.e. pods) to run Airflow’s tasks so we need to give the needed permissions
to the scheduler to be able to manage pods on the cluster like creating and deleting them. To
achieve this, we need to define a ClusterRole
and ClusterRoleBinding
Kubernetes objects.
Deployment definition:
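A sketch of the container section (the rest of the pod spec mirrors the webserver Deployment; names are the illustrative ones used above):

```yaml
      containers:
        - name: airflow-scheduler
          image: puckel/docker-airflow:1.10.9
          args: ["scheduler"]         # overrides the image's default "webserver" command
          envFrom:
            - configMapRef:
                name: airflow-envvars-configmap
          # volumeMounts for dags, logs and requirements.txt as in the webserver Deployment
```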
The only difference with the webserver deployment definition is the args setting, which overrides the Docker image command run by the entrypoint script.
Airflow scheduler permissions:
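A sketch of these two objects (the names, namespace and service account are assumptions; the verbs are what the executor needs in order to create, watch and delete worker pods):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pods-permissions
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pods-permissions-binding
subjects:
  - kind: ServiceAccount
    name: default          # the scheduler pod runs with the default service account in this setup
    namespace: default
roleRef:
  kind: ClusterRole
  name: pods-permissions
  apiGroup: rbac.authorization.k8s.io
```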
It’s easy to see that the ClusterRole grants specific permissions to manage pods.
Running Airflow in Kubernetes
To make it easier to create and delete all the resources in the Kubernetes cluster, I created two scripts (a rough equivalent is sketched after the list):
- script-apply.sh: creates all the Kubernetes objects.
- script-delete.sh: deletes all the objects; it can take some time to delete the persistent volume claim.
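A minimal equivalent of what they do, assuming all the manifests live in a kubernetes/ folder, would be:

```sh
# script-apply.sh: create (or update) every object defined in the manifests
kubectl apply -f kubernetes/

# script-delete.sh: remove all of those objects again
kubectl delete -f kubernetes/
```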
After starting minikube, if nothing else is running in the cluster, listing the resources with kubectl get all should only show the default kubernetes service.
Now we run the script-apply.sh script to create everything.
Then, listing the resources again should show all the objects created by the script: the deployments, the services and their pods.
To access the Airflow webserver:
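Since the webserver Service is of type NodePort, one way to open it is through minikube (the service name airflow-webserver is the assumed one from the sketches above):

```sh
# Opens the webserver's NodePort service in the browser (or prints its URL).
minikube service airflow-webserver
```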
And we should see the Airflow UI homepage.
Let’s try running a DAG.
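We can switch the DAG on from the UI, or do the same from the Airflow CLI inside the webserver pod (the pod name below is a placeholder to replace with the real one from kubectl get pods):

```sh
# Unpause the example DAG and trigger a run (Airflow 1.10 CLI syntax).
kubectl exec -it <airflow-webserver-pod> -- airflow unpause example_complex
kubectl exec -it <airflow-webserver-pod> -- airflow trigger_dag example_complex
```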
If we run the example_complex DAG and wait a few seconds, we should see its tasks start to run.
And if we check the pods on Kubernetes with kubectl get pods, we can see new pods created for the running tasks.
Let’s check the logs of one of the successful tasks.
However, if we wait for the DAG to finish, we’ll see that it fails.
And we can check in the task’s logs why it failed.
If we check the pods in the cluster again, we can see that all the pods that ran Airflow’s tasks have finished and were removed.
Lastly, we can clean up the Kubernetes cluster by removing all the objects with the script-delete.sh script.
Conclusion
We reached the end of this guide where we saw that running a whole Airflow deployment on a local Kubernetes cluster is straightforward.
As mentioned at the beginning, the objective of this guide is to use several tools from
Kubernetes and many things should be changed for a deployment in a production environment.
This deployment allows developers to test quickly: if you have a DAG, you should be able to drop it into the dags folder and see it on the UI almost immediately.
So that’s it for the guide. I hope it was useful and helps you continue learning about Kubernetes and Apache Airflow.