Airflow on EC2 - qyjohn/Hands-on-Linux GitHub Wiki

This is a quick-start tutorial on setting up Airflow in standalone mode on an EC2 instance.

Ubuntu 20.04

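Launch an EC2 instance with Ubuntu 20.04, SSH into it, and install Airflow. pip3 installs the airflow command under ~/.local/bin; if the shell reports that airflow cannot be found, add that directory to PATH or log in again.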

sudo apt update
sudo apt install python3-pip
pip3 install 'apache-airflow==2.2.2' \
 --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.8.txt"
airflow standalone
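
The constraint file name encodes both the Airflow version and the Python minor version, so it has to match the python3 on the instance. The sketch below builds the URL from those two values instead of hard-coding it. The helper names constraint_url and detect_python_version are made up for this example and are not part of Airflow or pip.

```shell
# Build the Airflow constraints URL for a given Airflow and Python version.
# constraint_url is a hypothetical helper name used only in this sketch.
constraint_url() {
  airflow_version="$1"
  python_version="$2"   # major.minor only, e.g. 3.8
  printf 'https://raw.githubusercontent.com/apache/airflow/constraints-%s/constraints-%s.txt\n' \
    "$airflow_version" "$python_version"
}

# Print the major.minor version of the python3 on PATH (empty if python3 is missing).
detect_python_version() {
  python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null
}

AIRFLOW_VERSION="2.2.2"
PYTHON_VERSION="$(detect_python_version)"
echo "Constraint file: $(constraint_url "$AIRFLOW_VERSION" "$PYTHON_VERSION")"

# To install with the computed URL:
#   pip3 install "apache-airflow==${AIRFLOW_VERSION}" \
#     --constraint "$(constraint_url "$AIRFLOW_VERSION" "$PYTHON_VERSION")"
```

Running the block only prints the URL; the install command is left as a comment so nothing is installed by accident.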

The default username is admin, and the automatically generated password is printed on screen. By default Airflow listens on port 8080. We will use Apache2 as a reverse proxy to forward traffic from port 80 to port 8080, so that the web UI is reachable without specifying a port.

sudo apt install apache2

Overwrite /etc/apache2/sites-available/000-default.conf with the following content:

<VirtualHost *:80>
  ProxyPreserveHost On
  ProxyPass / http://localhost:8080/
  ProxyPassReverse / http://localhost:8080/
</VirtualHost>
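
If you prefer not to edit the file by hand, the same configuration can be generated from the shell. This is a minimal sketch: proxy_vhost is a hypothetical helper name, and the backend port is a parameter only so the example is reusable.

```shell
# proxy_vhost PORT: print a port-80 virtual host that proxies to localhost:PORT.
# proxy_vhost is a hypothetical helper name used only in this sketch.
proxy_vhost() {
  port="${1:-8080}"
  cat <<EOF
<VirtualHost *:80>
  ProxyPreserveHost On
  ProxyPass / http://localhost:${port}/
  ProxyPassReverse / http://localhost:${port}/
</VirtualHost>
EOF
}

# Preview the generated configuration on stdout.
proxy_vhost 8080

# To write it into place on Ubuntu:
#   proxy_vhost 8080 | sudo tee /etc/apache2/sites-available/000-default.conf
```

Printing to stdout first makes it easy to review the configuration before overwriting the existing file.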

Enable the required Apache modules and restart Apache:

sudo a2enmod proxy && sudo a2enmod proxy_http
sudo service apache2 restart

At this point you should be able to access the Airflow web UI at http://ip-address-of-ec2-instance/, provided the instance's security group allows inbound traffic on port 80.

Amazon Linux 2 - Kernel 5.10

Launch an EC2 instance with Amazon Linux 2 AMI (HVM) - Kernel 5.10. SSH into the EC2 instance and install Airflow.

pip3 install 'apache-airflow==2.2.2' \
 --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.7.txt"

If you start Airflow now with airflow standalone, it fails with the following error, because the SQLite library on the instance is older than the 3.15.0 minimum that Airflow requires:

Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/airflow", line 5, in <module>
    from airflow.__main__ import main
  File "/home/ec2-user/.local/lib/python3.7/site-packages/airflow/__init__.py", line 34, in <module>
    from airflow import settings
  File "/home/ec2-user/.local/lib/python3.7/site-packages/airflow/settings.py", line 35, in <module>
    from airflow.configuration import AIRFLOW_HOME, WEBSERVER_CONFIG, conf  # NOQA F401
  File "/home/ec2-user/.local/lib/python3.7/site-packages/airflow/configuration.py", line 1129, in <module>
    conf.validate()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/airflow/configuration.py", line 224, in validate
    self._validate_config_dependencies()
  File "/home/ec2-user/.local/lib/python3.7/site-packages/airflow/configuration.py", line 278, in _validate_config_dependencies
    f"error: sqlite C library version too old (< {min_sqlite_version}). "
airflow.exceptions.AirflowConfigException: error: sqlite C library version too old (< 3.15.0). See https://airflow.apache.org/docs/apache-airflow/2.2.2/howto/set-up-database.html#setting-up-a-sqlite-database
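
To see which SQLite library Python is actually linked against, you can compare its version with the minimum from the error message. The following is a rough sketch; version_ge is a hypothetical helper name, and it assumes a sort that supports -V (GNU coreutils does).

```shell
# version_ge A B: succeed if dotted version A is greater than or equal to B.
# Relies on version sort (sort -V); version_ge is a hypothetical helper name.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_SQLITE="3.15.0"   # minimum quoted in the Airflow error message
# sqlite3.sqlite_version is the version of the SQLite C library, not of the Python module.
CURRENT_SQLITE="$(python3 -c 'import sqlite3; print(sqlite3.sqlite_version)' 2>/dev/null)"

if [ -z "$CURRENT_SQLITE" ]; then
  echo "could not determine the SQLite version (is python3 installed?)"
elif version_ge "$CURRENT_SQLITE" "$MIN_SQLITE"; then
  echo "SQLite $CURRENT_SQLITE is new enough"
else
  echo "SQLite $CURRENT_SQLITE is older than $MIN_SQLITE"
fi
```

The same check is useful after the rebuild, to confirm that Python picks up the new library.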

Following the "Setting up a SQLite Database" page of the Airflow documentation, build and install a newer SQLite from source:

sudo yum install gcc make tcl
wget https://www.sqlite.org/src/tarball/sqlite.tar.gz
tar xzf sqlite.tar.gz
cd sqlite/
./configure
make
sudo make install
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
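
The export above only affects the current shell session, so a new SSH login would hit the SQLite error again. One way to make it persistent is to append the line to a shell profile exactly once. This is a sketch under assumptions: append_line_once and PROFILE_FILE are hypothetical names, and the demo writes to a scratch file rather than your real ~/.bashrc.

```shell
# append_line_once FILE LINE: append LINE to FILE unless an identical line is already there.
# append_line_once is a hypothetical helper name used only in this sketch.
append_line_once() {
  file="$1"
  line="$2"
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a scratch file; set PROFILE_FILE="$HOME/.bashrc" to apply it for real.
PROFILE_FILE="${PROFILE_FILE:-$(mktemp)}"

# Single quotes keep $LD_LIBRARY_PATH literal, so it is expanded at login time.
append_line_once "$PROFILE_FILE" 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH'
# Second call is a no-op because the line already exists.
append_line_once "$PROFILE_FILE" 'export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH'

cat "$PROFILE_FILE"
```

Checking for the line before appending keeps the profile clean if the setup steps are run more than once.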

Now you should be able to start Airflow in standalone mode:

airflow standalone

The default username is admin, and the automatically generated password is printed on screen. By default Airflow listens on port 8080. We will use Apache (packaged as httpd on Amazon Linux 2) as a reverse proxy to forward traffic from port 80 to port 8080, so that the web UI is reachable without specifying a port.

sudo yum install httpd

Create /etc/httpd/conf.d/default-site.conf with the following content:

<VirtualHost *:80>
    ProxyPreserveHost On
    ProxyPass / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>

Start Apache to proxy the traffic:

sudo systemctl start httpd

At this point you should be able to access the Airflow web UI at http://ip-address-of-ec2-instance/.
