Data Governance and Data Quality - saeed349/quant_infra GitHub Wiki
Data Catalogue
In the early phase, documenting each external dataset in a shareable Excel file is a good start, noting its usage and storage location. As user engagement with the data increases, the catalogue evolves to incorporate new features, such as updating catalogue values from live tables. In a fund, datasets with similar purposes may originate from different vendors, so when new Portfolio Managers (PMs) or Quants join, understanding what others are using becomes a priority. Measuring access and usage of these datasets is therefore crucial as the firm expands.
Data Lineage
Databricks provides comprehensive support through Unity Catalog and Delta Live Tables, which create lineage automatically. When constructing pipelines with dbt, lineage information can be obtained, and with its recent Python support it can decode PySpark lineage as well. While Airflow offers commendable lineage options, obtaining them requires designing Directed Acyclic Graphs (DAGs) with its full set of connectors, whereas in our case Airflow serves primarily as a scheduler.
Data Quality
In this project, I ensure data quality by having a few key systems in place:
- Checking for anomalies and test failures at the end of each DAG run. If any data-quality tests pertaining to that DAG fail, a Slack alert is sent.
- DAG failures also trigger Slack alerts.
- A daily data-quality report, generated using PyFPDF, that shows the latest timestamps, number of rows added, null counts, etc. for the pricing datasets and indicator pipelines.
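The Slack-alert step at the end of a DAG run can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the webhook URL, DAG name and check names are placeholders, and a real deployment would more likely hook this into an Airflow `on_failure_callback` or a Slack provider operator.

```python
import json
import urllib.request


def build_slack_alert(dag_id: str, failed_checks: list[str]) -> dict:
    """Build a Slack incoming-webhook payload summarising failed checks."""
    lines = "\n".join(f"- {name}" for name in failed_checks)
    return {"text": f":rotating_light: DAG `{dag_id}` failed data-quality checks:\n{lines}"}


def send_slack_alert(webhook_url: str, payload: dict) -> None:
    """POST the payload to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Using the standard library keeps the alerting path dependency-free, so the alert itself cannot fail because of a broken environment.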
General Data Quality Guidelines
Data is ever-changing: as firms grow, new datasets introduce novel challenges, necessitating the evolution of infrastructure and data management processes. Consequently, data pipelines, integral pieces of the puzzle, must also evolve. From my experience, building a data pipeline is just 40% of the journey; the real challenge unfolds in the remaining 60%, dedicated to vigilant monitoring, relentless maintenance and timely updates.
From my experience, here are a few challenges around monitoring data ingestion pipelines:
- Timely pipeline status and alerts: It's essential to gather details regarding the success or failure of jobs (Airflow DAGs, Spark jobs, etc.), analyze logs for error insights and establish a streamlined system for receiving alerts on any failures so they can be promptly escalated to the appropriate team.
- Early detection of data quality issues: Proactively identifying and addressing anomalies to deliver accurate and reliable data and prevent downstream downtime for data end users like PMs, Quants and Researchers.
- Missing data and data delays: Identifying and escalating issues around missing datapoints and data delivery SLA (service level agreement) breaches to data source providers (internal or external).
- Monitoring and controlling resource utilization: In modern serverless data stacks, data jobs may become unwieldy, consuming excessive resources due to errors or recursive retries. It is crucial to implement monitoring solutions that provide alerts on resource usage to prevent unexpected costs associated with these data operations.
- Recourse: Given the inevitability of data ingestion issues, identifying stakeholders, communicating delays, and planning and implementing corrective actions become crucial steps. Thoroughly documenting the problem is also essential for future reference.
Here are a few common data issues that I often encounter and that require monitoring:
- Timeliness: Monitoring data arrival times to identify delays and ensure timely downstream processing. Having data delivery SLAs with your data counterparties is a great way to quantify and create alerts for this measure.
- Accuracy: Verifying that the ingested data is accurate and consistent with the source.
- Completeness: Monitoring if all expected data is ingested without missing values.
- Consistency: Checking for unexpected data points and anomalies. These include:
- null values, uniqueness, and validating field values to ensure they conform to a defined list or range;
- distribution checks, e.g. whether the price of a security increased 10x from yesterday.
- Schema changes: Vendor/source file or API schemas can sometimes change without prior communication, so we need to catch these schema changes before further downstream processing. Having prior agreements with the data counterparty around schema or DDL operations is critical so that you can be prepared.
- Semantic checks: Checks that validate the referential integrity of the data, for example, verifying that we receive prices from a vendor for every active instrument in our security master.
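Several of the checks above can be expressed as small, reusable functions. The sketch below is illustrative only: it assumes rows arrive as lists of dicts, and the field names, the 10x ratio and the 60-minute SLA are placeholder values.

```python
from datetime import datetime, timedelta, timezone


def check_nulls(rows: list[dict], field: str) -> list[int]:
    """Completeness: return indices of rows where a required field is missing."""
    return [i for i, row in enumerate(rows) if row.get(field) is None]


def check_unique(rows: list[dict], field: str) -> list:
    """Uniqueness: return duplicated values of a key field."""
    seen, dupes = set(), []
    for row in rows:
        value = row[field]
        if value in seen:
            dupes.append(value)
        seen.add(value)
    return dupes


def check_price_jump(today: dict, yesterday: dict, max_ratio: float = 10.0) -> list[str]:
    """Distribution: flag securities whose price moved more than max_ratio x overnight."""
    flagged = []
    for symbol, price in today.items():
        prev = yesterday.get(symbol)
        if prev and (price / prev > max_ratio or prev / price > max_ratio):
            flagged.append(symbol)
    return flagged


def check_schema(rows: list[dict], expected_fields: set) -> dict:
    """Schema: report columns missing from, or unexpected in, the delivered data."""
    actual = set(rows[0]) if rows else set()
    return {"missing": expected_fields - actual, "unexpected": actual - expected_fields}


def check_timeliness(last_arrival: datetime, sla_minutes: int = 60) -> bool:
    """Timeliness: True if the latest delivery breached the SLA window."""
    return datetime.now(timezone.utc) - last_arrival > timedelta(minutes=sla_minutes)
```

In practice, checks like these would run inside the pipeline framework (dbt tests, Great Expectations, Airflow check operators) rather than as hand-rolled functions, but the underlying logic is the same.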
Here is a list of actions and processes that we can implement to alleviate issues in data pipelines.
- Automated data tests: Automated data tests as part of the pipeline play a crucial role in ensuring the quality and integrity of datasets. Open-source tools like Great Expectations, dbt tests and Airflow SQL check operators provide robust mechanisms to automate data validation, enabling data engineers and analysts to maintain reliable and accurate data pipelines.
- Monitoring and Alerting platform: Establishing a robust system to promptly identify and alert about pipeline failures and data quality issues as soon as they occur.
- Standardization and abstraction of code: Standardizing ETL and ELT code is crucial, particularly when multiple developers and stakeholders are involved. It is essential for these developers to reach a consensus on development practices to ensure consistency, encompassing aspects such as design patterns, tools, documentation, etc. Additionally, creating templates and abstracting commonly used code is beneficial: for instance, when multiple pipelines fetch data via REST APIs and save it to AWS S3, it is more efficient to package the core components and inherit from them, rather than duplicating the code for each pipeline. This also saves a lot of time in the maintenance of these pipelines.
- Documentation: It is crucial to document all critical processes related to data pipelines, covering aspects such as ETL design, vendor contact information, data quality concerns, schema structure, end users, etc., and to ensure that these documents are easily accessible and kept up to date.
- Establishing KPIs and SLAs for data delivery: These need to be agreed with data providers as well as with the stakeholders who consume the end data. This improves the overall process and helps quantify and manage expectations.
- Prompt recourse and communication with stakeholders: When things go wrong with pipelines, it's crucial to respond promptly and to be transparent with the concerned stakeholders as soon as possible.
- Data Catalogue: It is crucial to understand the various data assets in our possession, including their ownership and related metadata.
- Data lineage: Pipeline-level and field-level lineage can help us easily debug issues and identify the root cause of data quality problems, and it helps immensely in other data projects like migrations and audits. It's also important to keep point-in-time records of lineage so that we can go back in time and observe what caused an issue.
- Data Stewards, Data owners and Data governance: Having a good data governance initiative across the firm is a great way to document data assets and to decentralize accountability.
- Data contracts: While not an essential aspect, standardizing SLAs, schema formats and other aspects of datasets can be a great foundation for building a scalable data platform. The Linux Foundation's Open Data Contract Standard (ODCS) is an open-source data contract standard.
- Data observability platforms: Not a necessity, but having the different aspects of data observability, like lineage, catalogue, contracts, etc., integrated in one place will save a lot of time and give a better user experience.
- Business Continuity and Resource Planning: At all times, data pipelines should be assigned a primary and secondary owner to ensure redundancy in resources for maintenance and updates. Implementing a rotating support system with personnel addressing various urgent requirements is essential. Moreover, fostering familiarity among data engineers with each other's work promotes cross-collaboration, enabling individuals to contribute to and enhance one another's efforts.
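The automated-data-tests point above can be illustrated with a minimal check-runner in the spirit of Great Expectations or dbt tests. The structure and names here are hypothetical and not from any of those libraries; each check is just a callable that returns a falsy value on success or a description of the failure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_checks(rows: list[dict], checks: dict[str, Callable]) -> list[CheckResult]:
    """Run each named check over the dataset and collect pass/fail results."""
    results = []
    for name, check in checks.items():
        failure = check(rows)
        results.append(CheckResult(name=name, passed=not failure, detail=str(failure or "")))
    return results
```

A DAG's final task could call `run_checks` and hand any failures to the alerting layer, which is essentially what the dedicated tools do with far richer expectation libraries and reporting.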
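The standardization-and-abstraction point, packaging the core of the REST-to-S3 pattern and inheriting from it per vendor, might look like the minimal sketch below. The class and method names are invented for illustration, and the upload hook is a stub where a real pipeline would call boto3's `put_object`.

```python
import json
from abc import ABC, abstractmethod


class RestToS3Pipeline(ABC):
    """Shared skeleton for pipelines that pull from a REST API and land in S3."""

    def __init__(self, bucket: str, prefix: str):
        self.bucket = bucket
        self.prefix = prefix

    @abstractmethod
    def fetch(self) -> list[dict]:
        """Vendor-specific: call the REST API and return rows."""

    def transform(self, rows: list[dict]) -> list[dict]:
        """Optional hook; default is a pass-through."""
        return rows

    def write(self, rows: list[dict], key: str) -> str:
        """Serialise the rows and upload them, returning the target path."""
        path = f"s3://{self.bucket}/{self.prefix}/{key}"
        self._upload(path, json.dumps(rows))
        return path

    def _upload(self, path: str, body: str) -> None:
        raise NotImplementedError("wire up boto3 put_object here")

    def run(self, key: str) -> str:
        """Standard orchestration: fetch, transform, write."""
        return self.write(self.transform(self.fetch()), key)
```

Each new vendor pipeline then only implements `fetch` (and optionally `transform`), so fixes to retry logic, serialisation or the S3 layout happen once in the base class instead of in every copy.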