Core Equity Datasets

Let's talk about how we sourced datasets for the US cash equities market in this project. This section covers the following points.

  • Core datasets that are required.
  • The vendor we are using.
  • What identifier to choose for connecting your datasets.
  • Common problems, with examples covering share classes, dual listings, multiple listings, ticker changes, corporate actions, etc.
  • The importance of Point in Time (PIT).
  • Creating a trading universe.
  • Why it is better not to source and build everything yourself and instead use as much as possible from Quantconnect.
  • Some other important equity datasets.
  • Alternative data and the importance of security master.
  • Some other vendors that provide equity and futures datasets at an affordable price.
  • Most popular institutional grade vendors.

Core datasets

Below is a compilation of the core datasets needed to start systematic trading, ordered with the most significant at the top.

  • Market price data
  • Reference data
  • Corporate actions
  • Trading calendar
  • Security master
  • Sector classification
  • Fundamental data
  • Tick trade data (for TCA).

Market price data

This includes EOD OHLCV data as well as intraday OHLCV data. For institutional clients, ensuring accurate volume information at different levels is crucial. Evaluating vendors and comparing their numbers is a critical part of selecting this dataset, especially when dealing with EMEA-domiciled companies listed in multiple countries. Notable examples include Ferrari and Shell, both of which are listed on multiple international exchanges: Ferrari is listed in both New York and Milan, while Shell is listed in Amsterdam, London, and New York. Depending on your strategy, you may need to consider whether entity-level volume or trading-venue-level volume is more important. In this project, we use pricing data from Polygon.
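
As a concrete starting point, here is a minimal sketch of pulling daily OHLCV bars from Polygon's aggregates REST endpoint. The POLYGON_API_KEY environment variable and the column handling are my assumptions; check Polygon's documentation for the exact response schema and your plan's rate limits.

```python
# Minimal sketch: daily OHLCV bars from Polygon's v2 aggregates endpoint.
import os
import pandas as pd
import requests

def fetch_daily_bars(ticker: str, start: str, end: str) -> pd.DataFrame:
    url = f"https://api.polygon.io/v2/aggs/ticker/{ticker}/range/1/day/{start}/{end}"
    params = {"adjusted": "true", "sort": "asc",
              "apiKey": os.environ["POLYGON_API_KEY"]}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    df = pd.DataFrame(resp.json().get("results", []))
    if df.empty:
        return df
    # Polygon returns epoch milliseconds in "t"; o/h/l/c/v are the OHLCV fields.
    df["date"] = pd.to_datetime(df["t"], unit="ms")
    df = df.rename(columns={"o": "open", "h": "high", "l": "low",
                            "c": "close", "v": "volume"})
    return df.set_index("date")[["open", "high", "low", "close", "volume"]]

bars = fetch_daily_bars("RACE", "2024-01-02", "2024-06-28")  # Ferrari's NYSE line
```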

Reference data

This dataset includes market cap, security identifiers, primary exchange, the type of instrument (common share, ADR, etc.), and other necessary details of the security. It's crucial that the dataset is point-in-time, allowing us to track details such as which identifier was valid at each point and whether a stock is currently active or delisted. The significance of this is further illustrated in the following section. For this dataset, we utilize Polygon, but our operations heavily depend on Quantconnect's security master.
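
As an illustration, Polygon's ticker-details endpoint accepts a date parameter, which is what lets you ask what a ticker's reference record looked like on a given day. The sketch below follows my reading of the v3 docs; the field names should be verified against the live response.

```python
# Hedged sketch: point-in-time ticker details from Polygon's v3 reference endpoint.
import os
import requests

def ticker_details(ticker: str, as_of: str) -> dict:
    url = f"https://api.polygon.io/v3/reference/tickers/{ticker}"
    params = {"date": as_of, "apiKey": os.environ["POLYGON_API_KEY"]}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json().get("results", {})

details = ticker_details("AAPL", "2024-06-28")
# Typical fields of interest for a reference dataset: name, primary_exchange,
# type (common stock, ADR, ETF, ...), composite_figi, active, market_cap.
print({k: details.get(k) for k in
       ["name", "primary_exchange", "type", "composite_figi", "active"]})
```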

Corporate actions

Stock splits, dividends, mergers and acquisitions, rights issues, and spin-offs. Having total return (capital gains as well as cash distributions such as dividends) is critical for analyzing strategies with overnight positions. Here I am going with what Quantconnect provides as well.
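
To show why corporate actions matter for overnight strategies, here is a small sketch that turns raw closes, split ratios and cash dividends into a total-return index. The column names are illustrative, not from any particular vendor's schema.

```python
# Sketch: total-return index from unadjusted closes, splits and cash dividends.
import pandas as pd

def total_return_index(df: pd.DataFrame) -> pd.Series:
    """df is indexed by date with columns:
    close       - unadjusted close
    dividend    - cash dividend per share going ex that day (0 otherwise)
    split_ratio - e.g. 2.0 for a 2-for-1 split effective that day (1.0 otherwise)
    """
    df = df.sort_index()
    # On a split day the share count changes, so compare against the
    # split-adjusted prior close when computing the daily return.
    adj_prev_close = df["close"].shift(1) / df["split_ratio"].fillna(1.0)
    daily_tr = (df["close"] + df["dividend"].fillna(0.0)) / adj_prev_close - 1.0
    return (1.0 + daily_tr.fillna(0.0)).cumprod()  # growth of $1 invested at the start
```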

Trading Calendar

Public holidays, market holidays, market session timings, expirations, etc.
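
If you need to build this yourself rather than rely on Quantconnect, the pandas_market_calendars package (an assumption on my part, not something this project depends on) covers sessions and holidays for most major venues:

```python
# Hedged sketch: NYSE sessions and trading days via pandas_market_calendars.
import pandas_market_calendars as mcal

nyse = mcal.get_calendar("NYSE")

# One row per trading session with market_open / market_close timestamps
# (early closes such as the day after Thanksgiving show up here too).
schedule = nyse.schedule(start_date="2024-01-01", end_date="2024-12-31")
print(schedule.head())

# All valid trading days in a window, handy for aligning EOD datasets.
print(nyse.valid_days(start_date="2024-01-01", end_date="2024-01-31"))
```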

Security master

Consider this a robust database/table designed to store a variety of essential reference data and instruments. It's typically a derived dataset compiled from various other datasets, which makes it hard to settle on a single definition of a security master. From a front-office or trading standpoint, it serves as the dataset connecting core datasets and alternative datasets. For the middle office it can be an essential part of execution, and from a back-office perspective it acts as the means to reconcile position and trade information between front-office and back-office operations. Some vendors offer variations of this, including extensive cross-reference datasets with detailed information about the securities. In this context, I am aligning with what Quantconnect provides.
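
As a rough illustration of what a minimal front-office security-master row could hold in this project's setup, here is a hedged dataclass sketch; real implementations are effective-dated database tables, and the field names are mine, not Quantconnect's.

```python
# Sketch of one effective-dated security-master record (illustrative fields only).
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SecurityMasterRecord:
    figi: str                     # base identifier chosen in this project
    qc_symbol: str                # Quantconnect Symbol this FIGI maps to
    ticker: str                   # exchange ticker valid during the window below
    primary_exchange: str
    security_type: str            # common share, ADR, ETF, ...
    gics_sector: Optional[str]    # classification used for risk/reporting
    effective_from: date          # point-in-time validity window
    effective_to: Optional[date]  # None means the row is still current
    is_active: bool               # False once the security is delisted
```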

Sector classification

As the name suggests, this dataset contains information about the sector and industry a company falls into. Like the identifiers, there are standards for these datasets, and each is owned by a different organization (government or private). A few popular ones are

  • SIC - US government standard (used by the SEC and OSHA)
  • NAICS - US Census Bureau (US)
  • ANZSIC - Australian Bureau of Statistics.
  • NACE - EU
  • ICB - FTSE Russell (London Stock Exchange Group)
  • GICS - S&P Global and MSCI

Polygon provides SIC and Quantconnect has it from MorningStar. This dataset has applications across the process, from diversification/risk in strategy building to back-office workflows. But where I have seen it used the most is in standardized reporting and for connecting other datasets. For example, if you are a risk manager in the credit-risk team of a sell-side bank and your hedge fund counterparties report their sector exposures using different standards, it's very hard to do an apples-to-apples comparison. The same goes for choosing alternative datasets: some vendors provide industry-level metrics based on a specific standard, so we need that standard to connect them to our universe or portfolio. So the idea is to choose what's most popular, and from my experience that's GICS.
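
A typical downstream use looks like the sketch below: join holdings to a sector map keyed on your base identifier and roll weights up to GICS sectors for reporting or risk. The identifiers and weights are dummy values.

```python
# Illustrative sketch: portfolio weights rolled up to GICS sector exposures.
import pandas as pd

holdings = pd.DataFrame({
    "figi": ["FIGI_AAA", "FIGI_BBB", "FIGI_CCC"],   # dummy identifiers
    "weight": [0.40, 0.35, 0.25],
})
sector_map = pd.DataFrame({
    "figi": ["FIGI_AAA", "FIGI_BBB", "FIGI_CCC"],
    "gics_sector": ["Information Technology", "Health Care", "Energy"],
})

exposure = (
    holdings.merge(sector_map, on="figi", how="left")
            .groupby("gics_sector")["weight"].sum()
            .sort_values(ascending=False)
)
print(exposure)
```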

Fundamental data

This encompasses data from the balance sheet, income statement, ratios, and other company-reported metrics. Handling this dataset can be quite tricky, as most vendors lack a genuine Point-In-Time (PIT) dataset for it. Restatements and errors are common in company-reported metrics, so depending on your strategy, you need to decide which characteristics are crucial. Vendors put considerable effort into standardizing these metrics. Despite Polygon offering a version of this dataset, I opted to align with what Quantconnect provides, which is sourced from MorningStar. Here's a comprehensive report from S&P Global detailing the difference between PIT and Lagged PIT data. S&P is also among the few vendors that offer high-quality PIT global fundamental datasets.
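
The practical consequence of PIT is that fundamentals should be joined to prices on the date the figure became known, not the fiscal period it describes. Below is a small merge_asof sketch of that idea; the column names and numbers are made up.

```python
# Sketch: point-in-time join of fundamentals to prices using the knowledge date.
import pandas as pd

prices = pd.DataFrame({
    "date": pd.to_datetime(["2023-02-01", "2023-05-01", "2023-08-01"]),
    "figi": "FIGI_XXX",
    "close": [100.0, 105.0, 98.0],
})
fundamentals = pd.DataFrame({
    "knowledge_date": pd.to_datetime(["2023-01-25", "2023-04-27", "2023-07-27"]),
    "figi": "FIGI_XXX",
    "eps_ttm": [5.1, 5.3, 5.0],   # dummy trailing EPS as first reported
})

pit = pd.merge_asof(
    prices.sort_values("date"),
    fundamentals.sort_values("knowledge_date"),
    left_on="date", right_on="knowledge_date",
    by="figi", direction="backward",   # only figures already published on each date
)
pit["pe_ttm"] = pit["close"] / pit["eps_ttm"]
print(pit[["date", "close", "eps_ttm", "pe_ttm"]])
```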

Tick data

There are multiple uses for this data. You could use it for alpha research, but I have seen it used most in backtesting and execution analysis, such as transaction cost analysis and calculating VWAP. Polygon provides both trades and quotes (NBBO) from the SIP feed. Another popular vendor in this space is TickData, and fortunately Quantconnect provides this dataset for US cash indices (taking away the pain of integrating TickWrite). In our situation, I do most of my backtests on the Quantconnect cloud platform, leveraging the granular data available there.
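
For reference, the TCA-style calculation mentioned above is simple once you have the trade ticks; here is a minimal sketch of a session VWAP and a per-minute VWAP from a trades DataFrame with illustrative timestamp/price/size columns.

```python
# Sketch: VWAP from tick trades (columns: timestamp, price, size).
import pandas as pd

def session_vwap(trades: pd.DataFrame) -> float:
    # VWAP = sum(price * size) / sum(size) over the trades in the window.
    return (trades["price"] * trades["size"]).sum() / trades["size"].sum()

def minute_vwap(trades: pd.DataFrame) -> pd.Series:
    t = trades.set_index("timestamp").sort_index()
    notional = (t["price"] * t["size"]).resample("1min").sum()
    volume = t["size"].resample("1min").sum()
    return notional / volume
```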

Choosing the right identifier

Ticker is what most people are used to using, but it is a nomenclature mostly defined by the exchanges, and it changes from venue to venue. To give a recent example, when Facebook changed its name to Meta the trading ticker also changed, but that doesn't mean we have to treat it as a different time series, as nothing changed other than the name. For these reasons, ticker is the least effective way of identifying a security. On top of that, there are multiple levels at which an entity/security can be identified, and a common flow is as follows.

[Figure: identifier hierarchy, from Entity ID down to Company ID, Security ID and Trading Item ID]

Here all the IDs are depicted to represent the hierarchies; they are not real identifiers. An entity with an Entity ID could be a conglomerate with a portfolio of companies under it, and we could assign a Company ID to each; these could be public or private companies, and so could the main entity. If a portfolio company is traded and listed in multiple countries, we assign a Security ID to each of those securities, and finally, in each country there could be multiple exchanges where they are traded, so we would need a Trading Item ID. From a pure execution point of view, the Trading Item ID is what matters, but from an analysis point of view, all the levels above are important. For example, the fundamentals of a company are obtained at the Company ID level, while the same fundamentals apply at multiple Security ID levels if the company is traded in different countries.
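
A toy sketch of that hierarchy in code (the IDs and field names are made up) helps make the point that fundamentals attach at the company level while executions attach at the trading-item level:

```python
# Toy sketch of the entity -> company -> security -> trading-item hierarchy.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TradingItem:        # one listing on one exchange, what you actually execute
    trading_item_id: str
    exchange: str
    ticker: str

@dataclass
class Security:           # a company's security in one country
    security_id: str
    country: str
    listings: List[TradingItem] = field(default_factory=list)

@dataclass
class Company:            # the issuer; fundamentals live at this level
    company_id: str
    name: str
    securities: List[Security] = field(default_factory=list)

@dataclass
class Entity:             # ultimate parent / conglomerate
    entity_id: str
    companies: List[Company] = field(default_factory=list)
```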

Entity relationships can get very complicated, and fortunately there are some good standard identifiers that we can use for our strategies. The most popular ones are FIGI, ISIN, CUSIP, and SEDOL. Each major vendor has its own entity-mapping IDs, and most of them let you seamlessly map from their IDs to these four, or at least to one of them. The same goes for Quantconnect. In our project we have chosen FIGI as our base identifier, and using FIGI we connect to Quantconnect's identifier, called the Symbol. Now why FIGI? Here are my reasons:

  • FIGI is backed by Bloomberg and it's open, so you can easily look up current FIGIs (see the lookup sketch after this list).
  • ISIN, CUSIP and SEDOL are all owned by different organizations and come with licensing agreements. And even part of FIGI is not free.
  • If you are a Bloomberg terminal user (which most institutions are) and you use EMSX, Bloomberg's execution service that acts as a portal to the sell side/prime brokers, things get much easier if you use FIGI.
  • And last but not least, Polygon provides FIGI.
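
Looking FIGIs up is straightforward through the free OpenFIGI mapping API; the sketch below follows my understanding of the v3 endpoint, and an optional API key header simply raises the rate limits.

```python
# Hedged sketch: map tickers to FIGIs with the OpenFIGI v3 mapping API.
import requests

jobs = [
    {"idType": "TICKER", "idValue": "AAPL", "exchCode": "US"},
    {"idType": "TICKER", "idValue": "RACE", "exchCode": "US"},
]
resp = requests.post("https://api.openfigi.com/v3/mapping", json=jobs, timeout=30)
resp.raise_for_status()

for job, result in zip(jobs, resp.json()):
    for match in result.get("data", []):
        print(job["idValue"], match.get("figi"),
              match.get("exchCode"), match.get("compositeFIGI"))
```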

Even if we are using FIGI, FIGIs exist at different hierarchy levels, just like the other identifiers.

[Figure: FIGI hierarchy levels (share-class, composite and exchange-level FIGIs)]

Even FIGI is not a perfect solution. FIGI as a standard came around 2010, so much of the mapping prior to that is not perfect. A well-used identifier at the trading level is SEDOL + exchange symbol, as it overcomes a few issues FIGI has. But unfortunately SEDOL is not free: those who provide it have a licensing agreement with the LSE (London Stock Exchange) or require the client to have an agreement with the LSE.

A few examples showing the significance of having the right set of identifiers and knowledge of entities:

Dual Listings

Alphabet Inc.

2004 - IPO with ticker GOOG

2014 - Dual listing based on share classes: the original shares become GOOGL (Class A) and new Class C shares trade as GOOG.

2015 - The company changed its name from Google to Alphabet.

During these changes, not only the tickers but also the major identifiers could change, so it's important to have an entity mapping in place.

Since there are two securities from the same underlying entity, if we don't remove one of them in our universe selection algorithm, we would end up with unwanted extra exposure to that entity.

Multiple Listings

Certain companies have listings in multiple countries, so it's important to understand which one is the primary listing. This matters if you are running a global strategy, as you don't want unwanted concentrated exposure to one entity. Ferrari and Shell are good examples: Ferrari is listed in both New York and Milan (primary), and Shell is listed in Amsterdam, London (primary) and New York.

Mergers, acquisitions and delistings

One example of a company that changed its ticker after a merger is DowDuPont Inc. In 2017, Dow Chemical and DuPont merged to form DowDuPont. Following the completion of the merger, DowDuPont restructured and in 2019 split into three separate companies: Dow Inc., DuPont de Nemours, Inc., and Corteva, Inc., each with its own ticker symbol. This is a common occurrence in mergers and demergers, where the resulting companies often adopt new ticker symbols reflecting their changed corporate identities.

Trading Universe

Now that we have defined all our datasets, one of the first steps in most long-short strategies is defining a universe of stocks on which to run them. The criteria often involve selecting the most liquid stocks above a certain market-capitalization threshold. Ideally we would have a universe defined at the beginning of each month, going back over the full history we are interested in, in a point-in-time manner. A very rudimentary selection process is:

  1. Every day in the US equity market there are more than 10K symbols trading in the stock asset class. These include common stocks, ETFs, ADRs, warrants, rights, etc., and in our case we are only interested in single names, i.e. common stock.
  2. Then we need to select the primary listings (more relevant in foreign markets) and remove dual listings.
  3. Once we have that, we can filter by market cap, e.g. a 100mn USD cutoff.
  4. Finally, we filter on the average dollar volume (closing price multiplied by total volume traded) over the last n days to make sure we are only selecting liquid instruments, e.g. a 1mn USD cutoff.

The Quantconnect algorithm framework has even more advanced universe definition methods, and I would encourage you to use those in your final strategy. But you can also implement the above as a coarse universe definition in your pipeline, as an upstream process before calculating custom indicators or signals on a universe of stocks.
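
Outside of Quantconnect, the four steps above translate into a short pandas routine like the sketch below, run once per rebalance date; the ref and px frames and their column names are hypothetical stand-ins for the reference and pricing datasets.

```python
# Sketch: rudimentary point-in-time coarse universe for one rebalance date.
import pandas as pd

def coarse_universe(ref: pd.DataFrame, px: pd.DataFrame,
                    min_mcap: float = 100e6, min_adv: float = 1e6,
                    adv_window: int = 21) -> list:
    # 1. single names only: keep common stock, drop ETFs / ADRs / warrants / rights
    ref = ref[ref["security_type"] == "CS"]
    # 2. primary listings only, then drop secondary share classes of the same company
    ref = ref[ref["is_primary_listing"]]
    ref = ref.sort_values("market_cap", ascending=False).drop_duplicates("company_id")
    # 3. market-cap cutoff
    ref = ref[ref["market_cap"] >= min_mcap]
    # 4. average dollar volume over the last adv_window days
    adv = (
        px.sort_values("date")
          .assign(dollar_volume=lambda d: d["close"] * d["volume"])
          .groupby("figi")["dollar_volume"]
          .apply(lambda s: s.tail(adv_window).mean())
    )
    liquid = adv[adv >= min_adv].index
    return ref.loc[ref["figi"].isin(liquid), "figi"].tolist()
```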

Some other commonly used datasets for cash equity strategies

  • Analyst estimates (detailed and consensus).
  • Short interest and 13F datasets
  • Factor models
  • Ownership reports and other company filings (8-K, 10-Q) from the SEC.
  • Earnings transcripts analysis.
  • News and sentiment data.
  • Options data

Alternative data

This is where the edge (or alpha) primarily resides. Drawing from my experience in various data strategy teams over the years, I've learned that time is money, and the ability to iterate over ideas and datasets swiftly is a crucial KPI for measuring the success of a data strategy initiative.

Several aspects contribute to a successful alternative data strategy process, and here are a few key considerations:

  • Data Scouting: Focus on identifying partners, maintaining regular meetings, expanding your partner network, and nurturing those relationships.

  • Data Evaluation and Onboarding Process: Assess the duration it takes to move from identifying a potential dataset to testing it. Understand the legal and compliance processes involved and explore ways to expedite these procedures.

  • Data Engineering: Evaluate the efficiency of integrating a new dataset into your system.

  • Cross-reference Dataset: A robust cross-reference dataset simplifies the integration of a diverse range of alternative datasets.

Biases in alternative data

Biases are common in alternative datasets; here are a few common biases with examples.

  • Selection bias: Selection bias occurs when the sample from which data is collected is not representative of the entire population, leading to skewed results.
    • Example: Suppose you are analyzing foot-traffic data in a retail setting to understand customer behavior around certain brands. Imagine the data vendor you are working with mostly covers high-end shopping malls known for their upscale stores. The retail foot-traffic dataset would then likely be skewed towards affluent demographics, and taking the full sample might not be representative of the actual customer base of the brands you are trying to analyze.
    • One plausible way to address this is to stratify the dataset using additional metadata (like foot-traffic audience demographics or per-capita income of the ROI (region of interest) by census tract) and then sample from these strata to build a representative dataset on which to do the analysis.
  • Look-ahead bias: Look-ahead bias occurs when future information is unintentionally used in the analysis or decision making process, leading to distorted and unrealistic results.
    • This is very common in POS (point-of-sale) transaction datasets, where the aggregator collects sales information from many different product vendors/consumers. Sometimes these reported numbers are incomplete, and there can be restatements, errors, etc., but the aggregator might fold them in without the details of those restatements. The historical data is then not point-in-time and suffers from look-ahead bias, because historical data points include future updates and restatements without any timestamps attached.
    • Getting a point-in-time (PIT) dataset is the best way to deal with this bias. If that's not available, which is the case with a lot of datasets where the current data is PIT but the history is not, then understanding the characteristics of these restatements or updates can help us introduce lags into the transactions to approximate PIT and partially remove the bias (a small sketch of this lagging approach is shown after this list).
    • If there are enough samples in the PIT data, we can take a smaller sample that is representative of our universe for the analysis. But there will also be situations where the PIT sample is too small or not representative. Depending on the dataset, we can also impute missing values to obtain estimates and therefore a bigger sample for our analysis. For example, in the vendor POS dataset, if we are missing a few transactions for a particular analysis date, we could impute them by looking at the missing vendors' historical transactions and their relationship to cohorts in the same dataset.
  • Survivorship bias: This bias arises when only successful or currently thriving entities are considered in an analysis, leading to an optimistic perspective on performance.
    • For instance, in an equity price dataset, excluding companies that have been delisted introduces survivorship bias into downstream analytics and backtesting, as the dataset only captures the prices of currently active stocks and overlooks those that did not survive. By incorporating the full history of all stocks, including those that did not survive, we avoid this bias.
  • Collection/sampling bias: Sampling bias occurs when the method of selecting participants or data points for a study systematically excludes certain individuals or groups, leading to an unrepresentative sample. It's commonly seen in survey datasets.
    • Example: In a survey about consumer interest for a particular retail product sold in physical stores, if the sample is drawn only from online forums, it may not represent the actual consumer population, introducing sampling bias. Identifying the proper group of consumers and having the survey at physical stores would yield better representative data.
  • Blindspot bias: This can be considered a subset of collection bias, but it is less obvious. It often happens in datasets we perceive as less biased, where some underlying characteristic or mapping component of the data makes it biased.
    • Example: A foot-traffic dataset might have a point-in-time measure of foot traffic for a particular region of interest (ROI), such as a retail location. But if the ROI data used to supplement the core foot-traffic data doesn't have correct or updated ROI polygons, due to cases like shop relocation or closure, then we would be measuring the wrong traffic. Keeping all the connected datasets (reference, mapping, etc.) up to date and PIT is important for analytics and modelling. Asking vendors about their data collection and normalization methodologies can help reveal hidden biases like these, and this is a crucial step especially when you are working with newer data vendors. Another method is to define causal graphs of the core and supplementary datasets, including the mapping process, to understand and quantify their relationships and the impact of changes in one on another.
  • Reporting bias: Refers to the intentional or unintentional distortion of information in a company's financial statements, disclosures or reports. There are several types of reporting biases; here are a few of them:
    • Selective disclosure: providing information selectively to specific stakeholders to influence their decision making.
    • Off-balance sheet financing: intentionally hiding liabilities to make the balance sheet look better.
    • Revenue recognition bias: entities may recognize revenue prematurely or delay revenue recognition to achieve specific financial reporting goals.
    • Non-standard measures: adjusted EBITDA, churn rate, customer acquisition cost, and organic revenue growth are all non-standard measures and can misrepresent the actual financial health of the company.
    • The best resolution for this bias starts at the source: companies need to stick to reporting standards (e.g. US GAAP or IFRS), and the data vendors who collect this information for redistribution should have proper standardization methodologies that recognize these biases. As data users, evaluating a vendor's standardization methodology is an important way to guard against this bias.
  • Confirmation bias: A classic example of cognitive bias flowing through data, commonly observed in analyst estimate datasets, where analysts show strong conviction in their initial outlook on a company. For this reason, aggregated consensus datasets, forward earnings, etc. often lag the actual security prices.
  • Ethical bias: Commonly seen in ESG datasets where an entity engages in greenwashing, i.e. a company's self-reported ESG initiatives may not align with its actual environmental impact. Some vendors and corporations have come under heavy scrutiny for such practices, as this is a fairly new type of reporting and dataset. As data users we can stick to datasets that adhere to internationally recognized sustainability reporting standards like SASB (Sustainability Accounting Standards Board).
  • Model bias: This happens in situations where a machine learning or a statistical model consistently makes inaccurate predictions or decisions due to underlying biases in the dataset or the model's design.
    • An illustrative instance is credit-scoring datasets, where a model is employed to forecast the probability of a borrower defaulting on a loan. If the underlying data is imbalanced across certain demographics or groups, it can lead to biased predictions.
    • Most of the time, fixing these issues involves correcting the imbalances and biases in the underlying dataset before training the model, or choosing models that learn and correct for these imbalances during training.
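
Returning to the look-ahead-bias point above, here is the kind of minimal lagging sketch I had in mind: when history is not PIT, delay each observation by a conservative estimate of its reporting/restatement delay before letting it enter the backtest. The column names and the five-day lag are assumptions for illustration.

```python
# Sketch: approximate point-in-time availability by lagging non-PIT observations.
import pandas as pd

def approximate_pit(panel: pd.DataFrame, lag_days: int = 5) -> pd.DataFrame:
    """panel has rows of (observation_date, figi, value), where observation_date is
    the period the value describes, not the date it actually became known."""
    out = panel.copy()
    # Assume each value only became usable lag_days after its observation date.
    out["available_from"] = out["observation_date"] + pd.Timedelta(days=lag_days)
    return out

# Downstream, join signals on available_from rather than observation_date so the
# backtest never consumes a number before it would realistically have been delivered.
```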

Other Vendors

Databento

Has a usage-based pricing model that opens up a lot of institutional-grade datasets to retail traders, for example:

  • CME Futures pricing dataset
  • NASDAQ TotalView - full order book data for equities.

Intrinio

More expensive than Polygon, but with more offerings, aimed at institutional investors. Offers data sharing via Snowflake as an option.

Kibot

Cheap futures and equities pricing datasets, both EOD and intraday.

Institutional Vendors

For all the core datasets mentioned at the beginning of this page, here are the most popular vendors in the institutional space that offer most of them. Pricing from these vendors runs into the hundreds of thousands of dollars per year, which puts them out of the question for most retail clients.

  • Bloomberg
  • S&P Global
  • Refinitiv
  • Factset