Airflow - saeed349/quant_infra GitHub Wiki
Airflow orchestrates the entire process, and over the years I have worked with several flavors of it, including the managed offerings from the major cloud platforms. For my personal needs I currently run Airflow in Docker on an EC2 machine with the LocalExecutor. That choice tends to draw skepticism about whether it can handle substantial workloads. In the past I used the CeleryExecutor, which required beefier machines for comparable tasks. But at a previous job I saw a Quant efficiently run intricate workloads on separate machines via SSH hooks, and that pattern fit my objectives well.
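The SSH-hook pattern can be sketched as a DAG that delegates heavy work to a remote box via Airflow's SSH provider. This is a minimal configuration sketch, not my actual setup: the connection ID, paths, and command below are placeholders, and it assumes the `apache-airflow-providers-ssh` package is installed.

```python
# Sketch: offload a heavy task to a remote machine over SSH.
# Assumes an Airflow connection "ml_box_ssh" (hypothetical name)
# points at the worker machine.
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="remote_ml_job",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # The LocalExecutor slot only waits on the SSH session, so the
    # Airflow box stays lightly loaded while the remote machine works.
    train = SSHOperator(
        task_id="train_model",
        ssh_conn_id="ml_box_ssh",
        command="cd /opt/ml && ./run_training.sh",  # placeholder command
        cmd_timeout=3600,
    )
```

Because the Airflow machine only holds an open SSH session per task, a small instance with the LocalExecutor can coordinate work that actually runs on much larger, specialized machines.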
Consequently, I partition my operations: live trading, ML jobs, and web scraping each run on their own machines. The Airflow machine handles only ETL workloads, with the warehouse taking on the lion's share of the computation. Report generation, being relatively lightweight, also runs on the same machine. Notably, I ship Airflow logs to OpenSearch using Filebeat, which gives me one place to monitor the DAGs and their associated processes and eliminates the need for disparate log shippers across different machines.
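A Filebeat configuration for this looks roughly like the following sketch. The log path, host, and credentials are illustrative placeholders, not my actual values:

```yaml
# filebeat.yml sketch: ship Airflow task logs to OpenSearch.
filebeat.inputs:
  - type: filestream
    id: airflow-task-logs
    paths:
      - /opt/airflow/logs/**/*.log   # placeholder log directory

# OpenSearch accepts Filebeat's Elasticsearch output (use an
# OSS/compatible Filebeat build; version compatibility varies).
output.elasticsearch:
  hosts: ["https://opensearch.example.com:9200"]
  username: "filebeat_writer"
  password: "${FILEBEAT_PW}"
```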
Using the user data option on EC2, a startup script can automatically start Docker and bring up Airflow whenever the instance is turned on.
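A minimal user data script for this could look like the sketch below. It assumes Docker and the Compose plugin are already installed on the AMI and that the Airflow compose file lives in `/opt/airflow` (an illustrative path):

```bash
#!/bin/bash
# EC2 user data sketch: bring Airflow up on boot.
set -euo pipefail

systemctl enable --now docker   # ensure the Docker daemon is running
cd /opt/airflow                 # placeholder path to docker-compose.yaml
docker compose up -d            # start the Airflow containers detached
```

Note that by default EC2 user data runs only on the first boot; to rerun it on every start, use a cloud-init per-boot directive or install it as a systemd unit instead.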
Scaling Options
A few self-hosted scaling options are:
- Running Celery workers on EC2 machines and adding more machines as needed; package management gets easier that way. The reason I haven't done this is that each of my machines carries a customized set of packages for its specific workload (ML, web scraping, live trading, etc.), and I didn't want the effort of managing different versions of Celery worker machines and making the process more complex.
- Hosting Airflow and its workers on AWS ECS or EKS.
Self-hosted Airflow is a good fit for small teams. However, once you need multiple team deployments, individual user access, or a diverse range of dynamic workloads, scaling it becomes a full-time responsibility. At that point, hiring Site Reliability Engineers (SREs) or opting for a managed service becomes imperative. Having explored several services, I believe Astronomer stands out as a versatile solution accommodating a broad spectrum of use cases.