# Point in Time (PiT) Data
Point-in-time (PIT) datasets refer to datasets that include all reported information with associated timestamps, allowing for historical analysis and backtesting. These datasets are particularly valuable in overcoming the challenges associated with standard (non-PIT) datasets, which overwrite historical data with updates, leading to biases.
PIT data is crucial when modelling financial datasets for several reasons; here are a few.
- Avoiding Biases: Standard datasets often suffer from biases such as survivorship and look-ahead bias, which can skew the performance of machine learning or other statistical models. PIT data, by including all reported, timestamped information, helps mitigate these biases. A common example is seen in equity pricing data, where historical data may be missing companies that have been delisted or filed for bankruptcy, introducing survivorship bias into the dataset.
  - Even when a dataset is classified as PIT because it carries collection timestamps, that does not mean it is free of biases; the source could still be susceptible to other biases that undermine its effectiveness as a PIT dataset. It is therefore crucial to understand the data generation process to evaluate how PIT the dataset really is.
  - Biases can also seep in when constructing derived datasets from PIT sources, where future data is accidentally used to compute certain statistics. A common beginner mistake I have seen is quantizing returns using the full sample before modelling, which introduces look-ahead bias into the models (see the sketch after this list).
- Backtesting against real conditions: PIT data allows strategies to be backtested against real conditions, providing a realistic assessment of strategy performance.
- Incorporation of restatements: By incorporating known restatements in the data, PIT datasets eliminate assumptions about data availability and prevent lags. Restatements are far more common in core market datasets than we imagine, and they can be seen in datasets ranging from prices to company-reported fundamentals. These restatements can happen for several reasons, such as corrections, technical errors, fraud, or accounting changes. A well-known fraud-related restatement came in October 2001, when Enron Corporation massively restated its results for the periods 1997 to 2000 to correct accounting violations. The original data, unavailable in many non-PIT databases, remains available to PIT users to this day. So it is very important to keep a record of all information as it flows in.
- Analyzing events: In finance, PIT data is particularly important for event studies. These studies analyze the impact of specific events, such as earnings announcements, policy changes, or economic crises, on financial markets. Utilizing data at precise time points is essential for accurately measuring the effects of these events.
- Temporal dependencies: Financial data often exhibits temporal and cross-sectional dependencies, where past events (within the same signal and across multiple signals) may influence future outcomes. Utilizing PIT data enables models to comprehend and capture these dependencies, thereby enhancing their predictive capabilities.
- Model validation: Reconciliation and model validation rely heavily on comparing predictions against the actual outcomes at specific points in time. Not only does this help researchers assess the robustness and accuracy of their models under different market conditions, it also plays a crucial role in operations, compliance and regulatory commitments. PIT data is crucial for performing such validations.
- Importance of PIT reference datasets: Multiple other datasets need to work together with the core dataset for it to be effective and PIT in nature. For example, even if the market pricing data is PIT, if the identifier and entity reference datasets associated with those prices are not PIT, the price dataset can be rendered ineffective for many strategies. The same holds true for alternative datasets. I have worked with a foot-traffic vendor who had created a really good PIT foot-traffic dataset, but the accompanying Region of Interest (ROI) dataset, which defines the polygons/regions that map to entities, was only partially PIT, leading to biased results in the downstream models.
- Data quality: Data quality issues are inevitable, and they can lead to operational disruptions in research and trading. The ability to trace these issues back to their source is crucial, and having accurate timestamps linked to a dataset's creation enables us to associate an issue with the events that generated the data.
- Operational Issues: A good portion of hedge fund failures stem from issues related to compliance, AML, regulatory requirements and other operational concerns. The most challenging aspect of these issues is the difficulty of tracing them back to their origins. This is where PIT datasets and databases play a really critical role, as they store all information as it is received, offering a vital tool for retrospective analysis.
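To make the derived-data pitfall from the first bullet concrete, here is a small SQL sketch. It assumes a hypothetical RETURNS(FIGI, DATE, RET) table and uses an expanding z-score instead of quantile buckets purely to keep the illustration short; the principle is the same either way: any statistic fed into a model should only be computed from data available up to that row's date.

```sql
-- Hypothetical RETURNS(FIGI, DATE, RET) table, used only for illustration.
SELECT
    FIGI,
    DATE,
    RET,
    -- Look-ahead bias: mean/stddev are computed over the FULL history, so early
    -- rows are normalized with information that was not yet known at the time.
    (RET - AVG(RET) OVER (PARTITION BY FIGI))
        / NULLIF(STDDEV(RET) OVER (PARTITION BY FIGI), 0) AS biased_zscore,
    -- PIT-safe: the expanding window only uses data available up to each DATE.
    (RET - AVG(RET) OVER (PARTITION BY FIGI ORDER BY DATE
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
        / NULLIF(STDDEV(RET) OVER (PARTITION BY FIGI ORDER BY DATE
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 0) AS pit_zscore
FROM RETURNS;
```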
## Point In Time Database Design
To design a Point-in-Time (PIT) database, the fundamental concept is to create a structure that preserves the historical states of data without physically deleting or updating information. Traditional relational databases often use commands like UPDATE and DELETE, which erase or modify existing data; a PIT database aims to overcome this. It supports two essential concepts: History and Audit Trail. History encompasses all information, both current and historical, that is considered true at a specific moment, while the Audit Trail captures information believed to be true at some previous point in time. In a PIT database, physical deletes are forbidden, and inserts are flagged with a timestamp using a DateCreated column. Logical deletes are introduced to mark the removal of a row without actually deleting it. For updates, instead of modifying an existing row, the PIT database flags the existing row as no longer relevant and then inserts a new row with the updated values. Key columns like DateCreated, DateEffective, DateEnd and DateReplaced are added to track changes and facilitate queries.
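As a minimal sketch (not a prescriptive schema), the concepts above might translate into a table like the following; every table name, column and value here is illustrative:

```sql
-- Illustrative PIT table; the versioning columns follow the convention described above.
CREATE TABLE SECURITY_PRICE_PIT (
    FIGI            VARCHAR,                -- security identifier
    PRICE_DATE      DATE,                   -- the date the price refers to
    CLOSE_PRICE     NUMERIC(18, 6),
    DATE_CREATED    TIMESTAMP,              -- when this row was physically inserted
    DATE_EFFECTIVE  TIMESTAMP,              -- when this version became the accepted truth
    DATE_END        TIMESTAMP,              -- when it stopped being the truth (NULL = still current)
    DATE_REPLACED   TIMESTAMP,              -- when a newer version superseded it (NULL = not replaced)
    IS_DELETED      BOOLEAN DEFAULT FALSE   -- logical delete flag; rows are never physically deleted
);

-- An "update" never modifies the business data in place: the current version is
-- flagged as ended/replaced (only the versioning columns are touched) ...
UPDATE SECURITY_PRICE_PIT
   SET DATE_END = CURRENT_TIMESTAMP, DATE_REPLACED = CURRENT_TIMESTAMP
 WHERE FIGI = 'BBG_EXAMPLE_FIGI' AND PRICE_DATE = '2024-01-02' AND DATE_END IS NULL;

-- ... and the corrected value is inserted as a brand new row.
INSERT INTO SECURITY_PRICE_PIT
    (FIGI, PRICE_DATE, CLOSE_PRICE, DATE_CREATED, DATE_EFFECTIVE)
VALUES
    ('BBG_EXAMPLE_FIGI', '2024-01-02', 101.25, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);
```

A logical delete works the same way: set IS_DELETED = TRUE (and DATE_END) on the current version instead of issuing a physical DELETE.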
### Example Design
Let's design a PIT database in an OLAP framework (Snowflake) for storing a global cash equities pricing dataset obtained from a vendor (e.g. Bloomberg Data License). While the architecture detailed previously could be implemented in a single table, I chose to separate the Historical and Audit Trail into two distinct tables (MAIN and STAGING). This separation was aimed at optimizing storage and enhancing query performance. The vendor would send us historical and daily parquet files in AWS S3 for multiple regions (North America, LATAM, EMEA, Asia). Along with the scheduled EOD (end of day) file deliveries for the different regions, there would also be closing price updates and historical restatements for these regions. The objective was to build PIT tables for the different regions and also build clean historical tables at the regional level as well as at the global level.
Below is an example flow that shows how this was achieved.
S3 location -> Snowflake staging table -> Merge CDC into regional tables -> Merge CDC into global table.
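Taking the North America (NAMR) region that the steps below walk through, a rough sketch of the first hop (S3 -> Snowflake staging) might look like the following. All object names and the CLOSE_PRICE column are hypothetical, and the exact functions permitted inside a Snowpipe COPY transformation (e.g. CURRENT_TIMESTAMP()) should be verified against your Snowflake setup:

```sql
-- External stage over the vendor's S3 delivery location (hypothetical path/integration).
CREATE OR REPLACE STAGE NAMR_PRICE_S3_STAGE
  URL = 's3://example-bucket/bbg/prices/namr/'
  STORAGE_INTEGRATION = S3_INT
  FILE_FORMAT = (TYPE = PARQUET);

-- Insert-only staging table (the PIT audit trail). CLOSE_PRICE stands in for the
-- many fields a real vendor file carries.
CREATE OR REPLACE TABLE NAMR_STAG (
    FIGI             VARCHAR,
    DATE             DATE,
    CLOSE_PRICE      NUMBER(18, 6),
    FILENAME         VARCHAR,         -- which vendor file the row came from
    LOADED_TIMESTAMP TIMESTAMP_NTZ    -- when the row landed in Snowflake
);

-- Snowpipe that auto-ingests new files and stamps each row with its source file
-- name and load time.
CREATE OR REPLACE PIPE NAMR_PRICE_PIPE AUTO_INGEST = TRUE AS
COPY INTO NAMR_STAG (FIGI, DATE, CLOSE_PRICE, FILENAME, LOADED_TIMESTAMP)
FROM (
    SELECT $1:figi::VARCHAR,
           $1:date::DATE,
           $1:close_price::NUMBER(18, 6),
           METADATA$FILENAME,
           CURRENT_TIMESTAMP()
    FROM @NAMR_PRICE_S3_STAGE
);
```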
Now let's take the case of one region (North America, NAMR) and briefly explain how to create the PIT table (audit trail/staging) and the non-PIT table (historical/main).
- Data is delivered by Bloomberg into our S3 bucket (via S3 access points and a Lambda trigger), and this triggers a data ingestion process into Snowflake staging tables via Snowpipe (a serverless ingestion tool in Snowflake for loading data from S3); a rough sketch of this hop is shown after the flow above.
  - Along with the vendor's data, we add another two fields: the filename and the loaded_timestamp (the time at which the data is loaded).
  - We only ever insert into this table; there are no updates or deletes. This table acts as our source of PIT data (the audit trail).
- Now we'll define a stream on these staging tables. A stream lets you capture the CDC (change data capture) records rather than the whole table, and it is cleared out whenever its contents are consumed by a DML operation such as our merge (see the sketch after this list).
- Then we'll merge the data from staging into the main table, as in the pseudocode below.
```sql
-- NAMR_STAG     - staging table
-- NAMR_STAG_STR - staging stream
-- NAMR_MAIN     - final regional table
-- table_b       - deduplicated NAMR_STAG_STR
-- Let's assume the unique key for each data point is (FIGI, DATE)
MERGE INTO NAMR_MAIN AS A
USING table_b AS B
    ON A.FIGI = B.FIGI AND A.DATE = B.DATE
WHEN NOT MATCHED THEN
    {insert_statement}
WHEN MATCHED AND B.LOADED_TIMESTAMP > A.LOADED_TIMESTAMP THEN
    {update_statement}
```
- Performing the merge on loaded_timestamp (which indicates when the data was loaded into Snowflake staging) lets us apply updates, or load restated data that needs to overwrite existing rows, into the historical table.
- Once all the regional files are loaded, the global price table is updated with a similar merge operation. With this setup we end up with a clean historical table whose timestamps show when each data point became available, and for detailed backtesting we can easily reconstruct the PIT view from the staging table (see the sketch below). Most often, if it is a PIT dataset, the vendor will also provide a timestamp field indicating when the data became available on their end. This field serves a dual purpose: it facilitates PIT analysis and acts as a check for accuracy and potential data delays.
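Pulling the stream and backtesting pieces together, here is a minimal sketch under the same assumptions as the ingestion sketch above (a NAMR_STAG table with an illustrative CLOSE_PRICE column): the stream on the insert-only staging table, the deduplicated table_b that feeds the merge, and an as-of query that rebuilds the dataset exactly as it was known at a chosen time.

```sql
-- CDC stream over the insert-only staging table.
CREATE OR REPLACE STREAM NAMR_STAG_STR ON TABLE NAMR_STAG;

-- "table_b": deduplicate the stream so each (FIGI, DATE) keeps only its latest
-- delivery before it is merged into NAMR_MAIN.
CREATE OR REPLACE TEMPORARY TABLE table_b AS
SELECT FIGI, DATE, CLOSE_PRICE, FILENAME, LOADED_TIMESTAMP
FROM NAMR_STAG_STR
QUALIFY ROW_NUMBER() OVER (PARTITION BY FIGI, DATE ORDER BY LOADED_TIMESTAMP DESC) = 1;

-- PIT reconstruction for backtesting: the prices exactly as they were known at a
-- chosen point in time (illustrative timestamp).
SET as_of = '2023-06-30 21:00:00'::TIMESTAMP_NTZ;

SELECT FIGI, DATE, CLOSE_PRICE, FILENAME, LOADED_TIMESTAMP
FROM NAMR_STAG
WHERE LOADED_TIMESTAMP <= $as_of
QUALIFY ROW_NUMBER() OVER (PARTITION BY FIGI, DATE ORDER BY LOADED_TIMESTAMP DESC) = 1;
```

The same as-of pattern can also be run against a vendor-provided availability timestamp when one exists, which doubles as a cross-check for delivery delays.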
It's noteworthy that while some popular databases and data warehouses offer time-travel capabilities for obtaining a PIT snapshot at a specific time, I intentionally avoid utilizing these features. I consider such operations not only costly, but also believe they should be reserved for platform management and recovery purposes rather than for replicating a dedicated PIT database. An exception exists in the case of a few timeseries databases that possess an inbuilt PIT architecture, facilitating the seamless storage of historical data with optimized audit trails.