Data Ingestion Pipeline
All Extract, Transform, and Load (ETL) operations run within Airflow. Below is a straightforward flow that retrieves price updates from a vendor API and then updates a table. The example is demonstrated on Snowflake, but the workflow is largely the same in Databricks.
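To make the flow concrete, here is a minimal sketch of such a DAG using the Airflow TaskFlow API and the Snowflake provider. The vendor endpoint, connection id, table names, and column names are all placeholders for illustration, not the actual code in this repo.

```python
# Sketch of a daily price-update DAG: pull prices from a vendor API,
# stage them, and MERGE into the final table. All names are illustrative.
from datetime import datetime

import requests
from airflow.decorators import dag, task
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

VENDOR_URL = "https://api.example-vendor.com/v1/eod_prices"  # hypothetical endpoint
SNOWFLAKE_CONN_ID = "snowflake_default"


@dag(schedule="0 22 * * 1-5", start_date=datetime(2024, 1, 1), catchup=False)
def daily_price_update():

    @task
    def extract(ds=None):
        """Pull the latest end-of-day prices from the vendor API."""
        resp = requests.get(VENDOR_URL, params={"date": ds}, timeout=60)
        resp.raise_for_status()
        return resp.json()  # e.g. a list of {"symbol": ..., "date": ..., "close": ...}

    @task
    def load(rows):
        """Stage the rows and merge them into the final price table."""
        hook = SnowflakeHook(snowflake_conn_id=SNOWFLAKE_CONN_ID)
        hook.insert_rows(
            table="STG_PRICES",
            rows=[(r["symbol"], r["date"], r["close"]) for r in rows],
            target_fields=["SYMBOL", "TRADE_DATE", "CLOSE"],
        )
        hook.run("""
            MERGE INTO PRICES t
            USING STG_PRICES s
              ON t.SYMBOL = s.SYMBOL AND t.TRADE_DATE = s.TRADE_DATE
            WHEN MATCHED THEN UPDATE SET t.CLOSE = s.CLOSE
            WHEN NOT MATCHED THEN INSERT (SYMBOL, TRADE_DATE, CLOSE)
                                  VALUES (s.SYMBOL, s.TRADE_DATE, s.CLOSE);
        """)

    load(extract())


daily_price_update()
```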
For the historical data, the required resources have to be created up front, before the flow above can run.
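As a sketch of that one-time setup, assuming the same placeholder table names as the DAG above:

```python
# One-time resource creation before the historical backfill -- illustrative only.
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

hook = SnowflakeHook(snowflake_conn_id="snowflake_default")
hook.run("""
    CREATE TABLE IF NOT EXISTS STG_PRICES (
        SYMBOL      VARCHAR,
        TRADE_DATE  DATE,
        CLOSE       FLOAT
    );
""")
hook.run("""
    CREATE TABLE IF NOT EXISTS PRICES (
        SYMBOL      VARCHAR,
        TRADE_DATE  DATE,
        CLOSE       FLOAT,
        PRIMARY KEY (SYMBOL, TRADE_DATE)
    );
""")
```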
A crucial realization across datasets, whether delivered via APIs, FTPs, flat files, or web scraping, is that the downstream process of ingesting data into the warehouse follows a consistent pattern. To streamline it, I advocate creating reusable templates, represented as base classes in Airflow. This eliminates the need to hand-write SQL for ingestion, merging, or resource creation; instead, the SQL statements are generated dynamically from the incoming file's schema (see the sketch below). Following the Medallion Architecture, staging data before merging it into a final table proves valuable for reconciling errors when the vendor restates data. The signal generation process involves more bespoke elements, such as UDTFs and joins across multiple datasets, but the concept of reusable templates remains valid, and updating the tables follows a similar pattern. For signals, I lean on far more resources, such as temporary views and tables, depending on the type of signal being generated.
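The following is a rough sketch of the template idea: given a file's schema, generate the staging DDL and the MERGE statement instead of hand-writing SQL per dataset. The class and method names are illustrative, not the repo's actual base classes.

```python
# Generate staging DDL and MERGE SQL from a dataset's schema definition.
class IngestionTemplate:
    def __init__(self, dataset: str, schema: dict[str, str], keys: list[str]):
        self.staging_table = f"STG_{dataset.upper()}"
        self.final_table = dataset.upper()
        self.schema = schema   # e.g. {"SYMBOL": "VARCHAR", "TRADE_DATE": "DATE", "CLOSE": "FLOAT"}
        self.keys = keys       # merge keys, e.g. ["SYMBOL", "TRADE_DATE"]

    def create_staging_sql(self) -> str:
        cols = ", ".join(f"{c} {t}" for c, t in self.schema.items())
        return f"CREATE TABLE IF NOT EXISTS {self.staging_table} ({cols});"

    def merge_sql(self) -> str:
        on = " AND ".join(f"t.{k} = s.{k}" for k in self.keys)
        updates = ", ".join(f"t.{c} = s.{c}" for c in self.schema if c not in self.keys)
        cols = ", ".join(self.schema)
        vals = ", ".join(f"s.{c}" for c in self.schema)
        return (
            f"MERGE INTO {self.final_table} t USING {self.staging_table} s ON {on} "
            f"WHEN MATCHED THEN UPDATE SET {updates} "
            f"WHEN NOT MATCHED THEN INSERT ({cols}) VALUES ({vals});"
        )


# Usage: the same template serves any vendor file once its schema is known.
tmpl = IngestionTemplate(
    dataset="prices",
    schema={"SYMBOL": "VARCHAR", "TRADE_DATE": "DATE", "CLOSE": "FLOAT"},
    keys=["SYMBOL", "TRADE_DATE"],
)
print(tmpl.create_staging_sql())
print(tmpl.merge_sql())
```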
Despite the availability of streaming data processing in both Snowflake and Databricks, and the option to use Kafka for real-time processing, I have found that the majority of my use cases can be fulfilled with batch processing. For intraday data updates, I prefer a streaming OHLC WebSocket over repeatedly polling the API for pricing data, which avoids the overhead of repeated API calls and integrates seamlessly into the batch framework (see the sketch below), eliminating the need to set up Kafka or introduce additional complexity. If you have an extensive use case that genuinely requires a message queue like Kafka and real-time streaming, I would encourage you to explore platforms like Deephaven with Redpanda before going the Kafka route: it is a leaner option that caters to various quant trading use cases, and the product is a spinoff from a hedge fund.
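A minimal sketch of folding a streaming OHLC WebSocket feed into the batch path: buffer incoming bars in memory and flush them through the same staging-plus-MERGE route on an interval. The feed URL, message format, and `write_to_staging` function are assumptions made for illustration.

```python
# Buffer OHLC bars from a WebSocket feed and flush them in batches.
import asyncio
import json

import websockets

FEED_URL = "wss://stream.example-vendor.com/ohlc"  # hypothetical feed
FLUSH_EVERY = 60                                   # seconds between flushes

buffer: list[dict] = []


def write_to_staging(rows: list[dict]) -> None:
    """Placeholder for the batch-ingestion entry point described above."""
    print(f"flushing {len(rows)} bars to staging")


async def consume():
    async with websockets.connect(FEED_URL) as ws:
        async for message in ws:
            buffer.append(json.loads(message))     # one OHLC bar per message


async def flush():
    while True:
        await asyncio.sleep(FLUSH_EVERY)
        if buffer:
            batch = list(buffer)
            buffer.clear()
            write_to_staging(batch)                # same staging + MERGE path as the batch jobs


async def main():
    await asyncio.gather(consume(), flush())


if __name__ == "__main__":
    asyncio.run(main())
```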
When dealing with intraday data (historical and live), a prudent approach is to pull a single low timeframe and resample it into whatever higher timeframes are required. This minimizes repeated queries to the vendor API and improves efficiency.
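For example, with pandas the 1-minute bars can be resampled locally into 5-minute or hourly bars, so the vendor API is only queried once for the lowest timeframe. Column names here are illustrative; adjust them to whatever the feed delivers.

```python
# Resample a 1-minute OHLCV frame (DatetimeIndex) into higher timeframes.
import pandas as pd

AGG = {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}


def resample_ohlcv(df_1min: pd.DataFrame, rule: str) -> pd.DataFrame:
    """df_1min: DatetimeIndex at 1-minute resolution with OHLCV columns."""
    return df_1min.resample(rule).agg(AGG).dropna(subset=["open"])


# Example: derive 5-minute and hourly bars from the same 1-minute pull.
# five_min = resample_ohlcv(df_1min, "5min")
# hourly   = resample_ohlcv(df_1min, "1h")
```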
I refrain from using Snowpipe because it is cost-inefficient for smaller file sizes. The same caution extends to other serverless technologies within Snowflake, such as Tasks, which, despite their advantages, can incur unexpected costs. Beginners in particular should be careful with serverless options: keeping a warehouse active costs at worst a fixed hourly amount, whereas serverless features like Snowpipe and Tasks can rack up unexpected expenses from even minor oversights.