
Quant Platform for Cash Equities

Embarking on a journey through the realms of data analysis, quantitative research, and machine learning has led me beyond the expected boundaries. My passion extends not only to the intricacies of these fields but also to the essential work of quant infrastructure development and ETL processes, which is critical for unlocking the full potential of data analysis and ML endeavors.

In 2019, my exploration took a systematic turn as I ventured into testing and running strategies for the forex spot market, resulting in the birth of MBATS. As I transitioned to the realm of cash equities, scalability became paramount, prompting the migration of key architectural components to the cloud. Now, four years later, my research and trading infrastructure has undergone a significant transformation. Following a recent shift from a full-time role, my sabbatical became an opportunity for personal data projects and a revamp of my infrastructure for scalable and cost-efficient operations. This journey also included helping a former colleague establish the data infrastructure for a new fund. Eager to contribute to the community, I aim to share the architecture, design, and insights gleaned from this buildout, driven by valuable feedback received from prior projects such as MBATS.

Quantconnect is a remarkable platform and has undergone significant evolution over the years. But sometimes our requirements get a bit more intricate, like working with multiple external datasets that need staging and performing resource-intensive computations on those datasets; then we need to find ways to boost the platform's performance and capabilities. Fortunately, Quantconnect was built on open-source technologies by an amazing community, and they have made the platform so flexible and feature-rich that I was able to integrate it easily with my external tools.

The objectives for this setup were pretty straightforward:

  • Be able to ingest and wrangle a wide range of market and alternative datasets (structured and unstructured).
  • Perform complex computations on these datasets.
  • Implement deep learning models, reinforcement learning agents, and intricate simulations.
  • Backtest strategies and transition these strategies into production with ease.
  • Run fully systematic strategies on US Cash Equities.
  • Build tools, reports, and alerts to help me with discretionary trading.
  • Avoid reinventing the wheel: leverage a powerhouse platform like Quantconnect to handle the heavy lifting wherever possible.
  • Do all of the above at the lowest cost possible.

Throughout my career, I've had the privilege of collaborating with exceptionally talented individuals in the quantitative space. The current iteration of my approach draws inspiration from these collective experiences. It's important to note that I'm not asserting this as the definitive approach, as there is no singular right way to tackle these challenges. I've witnessed portfolio managers and traders achieve success with everything from tools as simple as Excel to complex calculations and analytics performed directly within transactional databases using SQL. The key takeaway I learned is that success lies not in adhering to a specific method but in finding the approach that fits the unique nature of your work and your style.

In larger institutions, standardized infrastructures are common, providing portfolio managers and traders with frameworks to conduct research and develop trading strategies. These systems are highly sophisticated, incorporating a myriad of middle office, back office, and ancillary systems crucial to overall operations. However, the objective here is not to replicate such comprehensive systems. Instead, the focus is on the design and thought process behind a setup that has proven effective for me and a few others, helping us rapidly and cost-effectively iterate over ideas and efficiently transition them into production.

Use cases

I will briefly explain two use cases that explain why I need an external system outside Quantconnect to do some of the heavy lifting.

Alternative data use case (hypothetical):

image

I am using data from three distinct vendors to generate a signal for constructing a long-short portfolio in US Equities. SafeGraph contributes a sizable foot-traffic dataset comprising multiple tables covering Points of Interest (POI), traffic, and reference data. Similarly, Similarweb provides data on web and app traffic, an equally challenging dataset, and the S3 Short Interest dataset provides the short interest and bearish bets on stocks. Effectively handling these datasets necessitates a framework such as PySpark/Snowpark and involves intermediate staging and joins due to the substantial volume of data. In the above diagram I have also skipped the pricing and reference datasets that are used to combine these datasets.
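
To make the staging-and-join work concrete, here is a minimal PySpark sketch of the kind of pipeline described above. The table names, column names, and the composite signal itself are hypothetical placeholders, not the actual vendor schemas or my production logic.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("alt_data_signal").getOrCreate()

# Hypothetical staged vendor tables in the warehouse
poi = spark.table("safegraph.poi")                        # POI reference table
traffic = spark.table("safegraph.foot_traffic")           # visits per POI per day
web = spark.table("similarweb.web_traffic")               # visits per domain per day
short_interest = spark.table("s3_partners.short_interest")

# Stage: aggregate foot traffic to the ticker level via the POI reference table
foot_by_ticker = (
    traffic.join(poi, "poi_id")
           .groupBy("ticker", "date")
           .agg(F.sum("visits").alias("store_visits"))
)

# Combine the three vendor datasets into one panel keyed by (ticker, date)
panel = (
    foot_by_ticker
    .join(web.select("ticker", "date", "web_visits"), ["ticker", "date"], "left")
    .join(short_interest.select("ticker", "date", "si_pct_float"), ["ticker", "date"], "left")
)

# Toy composite signal: traffic activity minus a crowded-short penalty
activity = F.col("store_visits") + F.coalesce(F.col("web_visits"), F.lit(0))
signal = panel.withColumn("signal", F.log1p(activity) - F.coalesce(F.col("si_pct_float"), F.lit(0)))
signal.write.mode("overwrite").saveAsTable("research.alt_data_signal")
```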

Compute use case:

Say I have to compute an indicator/feature for every stock in a US cash equity universe of 5,000 active common stocks. I am not referring to common technical indicators, but to something where you have to use packages like pandas, SciPy, and NumPy, where it's a nightmare to implement in SQL or PySpark and your only option is Python.

  • To give a few examples of such operations from the literature:
    • Implementing the pattern-matching algorithm from Andrew Lo's paper, which requires running multiple kernel regressions and pattern-matching routines.
    • Running fractional differentiation of time series, as in the book Advances in Financial Machine Learning by Dr. Marcos López de Prado.
    • Running a reinforcement learning algorithm like this.

While it's undoubtedly feasible to achieve faster results in languages like C++, I deliberately chose Python to simplify the process. Python offers a robust set of packages and higher-level libraries for implementing these functions efficiently. The challenge lies in scaling these operations to thousands of tickers, particularly when dealing with lower timeframes or tick data. To address this, Python-centric frameworks like PySpark, Snowpark, Dask, Ray, Modin, and Nvidia GPU acceleration come into play. However, given that our data is already stored in a warehouse, leveraging the technology it provides is the optimal approach. Fortunately, UDFs, Pandas UDFs, and UDTFs come to the rescue, available in both Databricks and Snowflake.

Writing these operations in Python also offers the substantial benefit of facilitating a seamless transition to other distributed computing frameworks, such as Modin, with minimal intervention in the future.
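
As an illustration of the UDF approach mentioned above, here is a minimal sketch that uses a grouped pandas UDF (applyInPandas) in PySpark to run a per-symbol pandas/NumPy computation, in this case a simplified fixed-window fractional differentiation, across the whole universe. The table name, column names, and parameter choices are assumptions for the example, not a production implementation.

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
prices = spark.table("market.daily_prices")  # hypothetical table: symbol, date, close

def frac_diff(series: pd.Series, d: float = 0.4, window: int = 50) -> pd.Series:
    # Fixed-window fractional differentiation weights (simplified illustration)
    w = [1.0]
    for k in range(1, window):
        w.append(-w[-1] * (d - k + 1) / k)
    w = np.array(w[::-1])
    values = series.to_numpy()
    out = np.full(len(values), np.nan)
    for i in range(window - 1, len(values)):
        out[i] = np.dot(w, values[i - window + 1 : i + 1])
    return pd.Series(out, index=series.index)

def compute_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per symbol as ordinary pandas code on the executors
    pdf = pdf.sort_values("date")
    pdf["frac_diff_close"] = frac_diff(pdf["close"])
    return pdf[["symbol", "date", "frac_diff_close"]]

features = prices.groupBy("symbol").applyInPandas(
    compute_features,
    schema="symbol string, date date, frac_diff_close double",
)
features.write.mode("overwrite").saveAsTable("research.frac_diff_features")
```

Because the per-symbol logic stays in plain pandas/NumPy, the same compute_features function could later be dropped into Modin, Dask, or Ray with little change.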

High Level Architecture

image

  1. First, the data is retrieved from vendor APIs and stored in S3, later transferred to a Data Warehouse such as Databricks/Snowflake.
  2. Subsequently, all business logic is executed within the warehouse using SQL and PySpark/Snowpark. This encompasses various tasks, including but not limited to:
    • Trading universe selection.
    • Generating signals from alternative data.
    • Merging datasets (market and alternative datasets).
    • Creating features for downstream ML modeling.
  3. ML research and modeling are conducted on bare EC2 machines or Lambda Labs machines with Lambda Stack as the package management layer. Logging, experiment tracking, and resource management are handled through Weights & Biases (hosted). The model is registered and served via MLflow.
  4. Backtesting is performed in QuantConnect Cloud for supported markets, while for unsupported markets and datasets, backtests are executed locally via QuantConnect Lean CLI.
  5. Live strategies are executed on both the QuantConnect Cloud platform and Lean CLI based on the strategy, market, and resource requirements.
  6. ETL, ELT, ML model training and inference, and strategy deployment & management are all orchestrated using a self-hosted Airflow on AWS EC2 (a minimal DAG sketch follows this list).
  7. Data Warehouse to QuantConnect Cloud data transfer and AWS resource management are handled using AWS Lambda.
  8. All logs, including live trading and backtesting logs, as well as order files from QuantConnect, are loaded, parsed, and visualized in AWS OpenSearch (analogous to the ELK stack), with Filebeat as the shipper.
  9. Plotly Dash is employed for visualization and charting for discretionary trading.
  10. Slack serves as the reporting and alert platform. All Airflow DAG success statuses are reported here, and signal, trading, and risk/performance PDF reports are generated and distributed via Slack as well.
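
To give a flavor of the orchestration in step 6, here is a minimal Airflow DAG sketch. The DAG id, schedule, task names, and the helper functions they call are hypothetical placeholders rather than my actual DAGs.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_vendor_files(**context):
    ...  # pull vendor API data and land it in S3

def load_warehouse(**context):
    ...  # copy the staged files into Databricks/Snowflake

def generate_signals(**context):
    ...  # run the SQL/PySpark signal pipeline inside the warehouse

def push_signals(**context):
    ...  # publish the signal table for QuantConnect (REST API / Download)

with DAG(
    dag_id="eq_ls_signal_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * 1-5",  # weekday mornings before the US open
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_vendor_files", python_callable=ingest_vendor_files)
    load = PythonOperator(task_id="load_warehouse", python_callable=load_warehouse)
    signals = PythonOperator(task_id="generate_signals", python_callable=generate_signals)
    publish = PythonOperator(task_id="push_signals", python_callable=push_signals)

    # Linear dependency chain: ingest -> load -> signals -> publish
    ingest >> load >> signals >> publish
```

Slack notifications on DAG success or failure (step 10) would hang off the same DAG via callbacks or a final reporting task.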

Other Important Systems

Looking at the diagram above, one might notice that a few key components are missing: optimizers, an OMS, risk frameworks/engines, a reconciliation setup, and back-office systems. In the realm of my personal research and trading, Quantconnect addresses all the fundamental requirements. It covers the basics, such as optimizers for both strategy parameters and the portfolio, a comprehensive risk management framework, and reporting functionality for both backtesting and live trading.

However, for a real fund or a startup fund, certain crucial components become imperative, and operations involve integrating with external entities like clients, prime brokers, fund administrators, audit teams, and more. The startup fund I am currently consulting for recognizes these requirements and plans to use Enfusion for its back/middle office operations. Notably, Quantconnect stands out in its ability to cater to both retail traders and the institutional space. Its low-latency infrastructure, integrations with popular trading venues, and low barrier to entry create a compelling use case for funds. Additionally, the platform's great community fosters an excellent environment for retail traders and emerging quants to thrive. In essence, it forms a dynamic ecosystem where a symbiotic relationship between the retail and institutional crowds is facilitated by the open platform and the remarkable team at Quantconnect.

Feeding signals into Quantconnect

image

As mentioned earlier, data undergoes processing to generate signals, and if machine learning models are part of the strategy, their inferences are stored in the warehouse. These inferences are then further processed into trading signals. Subsequently, these signals are dispatched to the strategy, either through a REST API (leveraging AWS Lambda) for Quantconnect Cloud or via the data warehouse's Python package for Quantconnect Lean (via QC Download).
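
For the REST API path, a minimal sketch of what the AWS Lambda side could look like is shown below, assuming a Snowflake warehouse and a hypothetical research.latest_signals table; credential management and error handling are simplified for illustration.

```python
import json
import os

import snowflake.connector

def lambda_handler(event, context):
    # Strategy id comes from the API Gateway query string; default is a placeholder
    strategy_id = (event.get("queryStringParameters") or {}).get("strategy_id", "st_eq_ls_1")

    conn = snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        warehouse=os.environ["SF_WAREHOUSE"],
    )
    try:
        cur = conn.cursor()
        cur.execute(
            """
            SELECT symbol, signal_timestamp, signal
            FROM research.latest_signals
            WHERE strategy_id = %s
            """,
            (strategy_id,),
        )
        rows = [
            {"symbol": r[0], "signal_timestamp": str(r[1]), "signal": float(r[2])}
            for r in cur.fetchall()
        ]
    finally:
        conn.close()

    # API Gateway proxy-style response consumed by the QuantConnect strategy
    return {"statusCode": 200, "body": json.dumps(rows)}
```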

Signal Examples

In a typical Equity Long-Short strategy with a rebalancing window, the signals or portfolio weights can be passed into the Alpha model within the strategy. This is achieved using the Algorithm Framework. I highly recommend designing equity strategies with the Algorithm Framework, which provides numerous options for customizing risk models, execution models, portfolio construction and optimization models, among other features. It is analogous to the Zipline Pipeline from the Quantopian platform, which, like Quantconnect, was a revolutionary platform.

| symbol | signal_timestamp | timestamp | signal | strategy_id |
| --- | --- | --- | --- | --- |
| BBG000N9MNX3 | 1/19/23 0:00 | 1/19/23 0:00 | 0.2 | st_eq_ls_1 |
| BBG000BWQYZ5 | 1/19/23 0:00 | 1/19/23 0:00 | 0.3 | st_eq_ls_1 |
| BBG000BPH459 | 1/19/23 0:00 | 1/19/23 0:00 | -0.25 | st_eq_ls_1 |
| BBG000B9XRY4 | 1/19/23 0:00 | 1/19/23 0:00 | -0.25 | st_eq_ls_1 |
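
Below is a hedged sketch of how a weight table like the one above could be fed into an Alpha model under the Algorithm Framework. The hard-coded tickers and weights stand in for the parsed signal table (which in practice arrives via the REST API or Download), and the FIGI-to-ticker mapping is assumed to happen upstream.

```python
from AlgorithmImports import *

class SignalTableAlphaModel(AlphaModel):
    def __init__(self, signals):
        # signals: dict of ticker -> target weight, e.g. {"AAPL": 0.2, "XOM": -0.25}
        self.signals = signals
        self.emitted = False

    def Update(self, algorithm, data):
        # Emit one batch of weighted insights; a real model would re-emit every rebalance
        if self.emitted:
            return []
        self.emitted = True
        insights = []
        for ticker, weight in self.signals.items():
            symbol = algorithm.Symbol(ticker)
            direction = InsightDirection.Up if weight > 0 else InsightDirection.Down
            # The weight argument is consumed by InsightWeightingPortfolioConstructionModel
            insights.append(Insight.Price(symbol, timedelta(days=5), direction,
                                          None, None, None, abs(weight)))
        return insights

    def OnSecuritiesChanged(self, algorithm, changes):
        pass

class EquityLongShortAlgorithm(QCAlgorithm):
    def Initialize(self):
        self.SetStartDate(2023, 1, 1)
        self.SetCash(1_000_000)
        # Placeholder for the parsed signal table: {ticker: weight}
        signals = {"AAPL": 0.2, "MSFT": 0.3, "XOM": -0.25, "GE": -0.25}
        for ticker in signals:
            self.AddEquity(ticker, Resolution.Daily)
        self.AddAlpha(SignalTableAlphaModel(signals))
        self.SetPortfolioConstruction(InsightWeightingPortfolioConstructionModel())
        self.SetExecution(ImmediateExecutionModel())
```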

I have another strategy that doesn't align with the Algorithm Framework since it requires implementing a somewhat intricate order structure for each position, akin to a bracket order. In such scenarios, I structure the signal data as outlined below to meet the strategy's specific needs. Fortunately, the Quantconnect platform is adaptable enough to accommodate the structural nuances of most strategies I've contemplated.

| symbol | signal_timestamp | timestamp | direction | entry_type | entry_price | stop_loss | take_profit | signal_id | strategy_id | position_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BBG000B9XRY4 | 1/19/23 4:00 | 1/19/23 0:00 | buy | market |  | 200 | 180 | a94a8fe5ccb19ba61c4c0873d391e987982fbbd3 | st_eq_pat_1 | FALSE |
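
For this second style of strategy, here is a hedged sketch of how one parsed signal row could be turned into a bracket-style position in a plain QCAlgorithm. The ticker, sizing, and price levels are placeholders, and a production version would also cancel the remaining exit order in OnOrderEvent once either the stop or the target fills.

```python
from AlgorithmImports import *

class BracketSignalAlgorithm(QCAlgorithm):
    def Initialize(self):
        self.SetStartDate(2023, 1, 1)
        self.SetCash(100_000)
        self.symbol = self.AddEquity("AAPL", Resolution.Minute).Symbol
        # One parsed signal row; price levels and ids are placeholders
        self.signal = {"direction": "buy", "entry_type": "market",
                       "stop_loss": 180.0, "take_profit": 200.0,
                       "signal_id": "a94a8fe5", "strategy_id": "st_eq_pat_1"}
        self.entered = False

    def OnData(self, data):
        if self.entered or not data.ContainsKey(self.symbol):
            return
        qty = self.CalculateOrderQuantity(self.symbol, 0.10)  # size to 10% of equity
        # Tag every order with strategy_id|signal_id so fills can be tied back to the
        # signal table once the logs and order files land in OpenSearch
        tag = f"{self.signal['strategy_id']}|{self.signal['signal_id']}"
        self.MarketOrder(self.symbol, qty, False, tag)                           # entry
        self.StopMarketOrder(self.symbol, -qty, self.signal["stop_loss"], tag)   # protective stop
        self.LimitOrder(self.symbol, -qty, self.signal["take_profit"], tag)      # profit target
        self.entered = True
```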

Preventing biases

One of the significant challenges in developing signals outside the backtester lies in the potential for look-ahead biases. Here are some practices that have proven effective for me:

  • I employ the same Quantconnect strategy for both backtesting and live trading, utilizing identical signal table structures in both scenarios.
  • I conduct point-in-time tests on the pipelines generating signals. For instance, I choose a timestamp in the past, threshold all upstream tables to that timestamp, and then generate signals. Missing data, data mismatches, and discrepancies from batch runs serve as red flags. Automation of these tests is also feasible; a sketch of such a test follows this list.
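
A minimal sketch of such a point-in-time test is shown below. The table names, the ingested_at column, and the generate_signals() entry point are hypothetical stand-ins for my actual pipeline.

```python
from pyspark.sql import SparkSession, functions as F
from research.pipeline import generate_signals  # hypothetical signal-pipeline entry point

spark = SparkSession.builder.getOrCreate()
as_of = "2023-01-19"

# Threshold every upstream table to the chosen point in time
upstream = {
    name: spark.table(name).where(F.col("ingested_at") <= as_of)
    for name in ("safegraph.foot_traffic", "similarweb.web_traffic", "s3_partners.short_interest")
}

# Regenerate signals as of the past timestamp and fetch what the batch run stored that day
rebuilt = generate_signals(upstream, as_of)
stored = spark.table("research.signals").where(F.col("signal_timestamp") == as_of)

# Any row that differs between the stored batch output and the point-in-time rebuild is a
# red flag for look-ahead bias, silent restatements, or missing data
mismatches = (
    rebuilt.select("symbol", "strategy_id", F.col("signal").alias("rebuilt_signal"))
    .join(stored.select("symbol", "strategy_id", F.col("signal").alias("stored_signal")),
          ["symbol", "strategy_id"], "outer")
    .where(
        F.col("rebuilt_signal").isNull()
        | F.col("stored_signal").isNull()
        | (F.abs(F.col("rebuilt_signal") - F.col("stored_signal")) > 1e-9)
    )
)
assert mismatches.count() == 0, f"{mismatches.count()} signal mismatches as of {as_of}"
```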

When implementing strategies using fundamental data within the Algorithm Framework, I prefer running them on Quantconnect Cloud and integrating external signals via the REST API. This is mainly because procuring, maintaining, and developing downstream business logic for local fundamental data is more challenging than it may seem. I have played around quite a bit with S&P and FactSet fundamental data at work, and there is real effort involved in building out all the business logic, so if battle-tested data is readily available, I opt for that route.

For strategies based on price and alternative data, especially in markets not supported by the Quantconnect Cloud platform, I prefer running the strategy locally on LEAN CLI. However, for supported markets, I conduct backtests on the Cloud platform to obtain more accurate and quicker results. The Quantconnect ecosystem, with its remarkable flexibility and customizability, accommodates a diverse range of applications seamlessly. The effective support system is noteworthy; even with the lowest support tier (bronze), a support inquiry from me about a broker integration bug was addressed promptly, posted on GitHub within two days, and resolved with a pull request the following week.

In the outlined workflow, logs and order files from live executions and backtests are ingested into the OpenSearch database (a fork of Elasticsearch). Quantconnect strategies are designed to pass signal_ids or strategy_ids as tags on orders. Combining the logs and order files with these signal_ids makes it easy to build position-level information, facilitating live dashboards on OpenSearch (Kibana) and Dash. Additionally, a process updates the signal tables with order and position status, so a strategy's state can easily be recovered if it stops midway, or reset when needed.

I have recently experimented with using the Quantconnect platform for discretionary trading through the above setup. Executing trades this way simplifies post-trade analysis and reporting significantly. For instance, obtaining intraday equity curves, a challenging task with most retail brokers, becomes straightforward with control over logging in place.

In the subsequent sections, I have included links to detailed explanations of the datasets used and the infrastructure.
