
2.6 Validating and Transforming Source Data to Comply with the Matrix Schema

Overview

This SOP describes the procedure for validating and transforming source data to align with the Matrix schema. It guides users through checking data for schema compliance, generating reports, and preparing transformation steps so that the data can be integrated into the Matrix pipeline. Following this procedure helps ensure data sources can be integrated into the Matrix graph without schema conflicts, supporting data quality and pipeline robustness.

Objectives

  • Ensure consistent data formatting across sources.
  • Facilitate early detection of schema violations and missing fields.
  • Provide a structured, auditable pathway from raw input to pipeline-ready format.
  • Maintain compatibility with evolving schema versions.
  • Support reproducibility through versioned reports and transformation logs.

Prerequisites

  • Python ≥ 3.11
  • matrix-validator installed via pip or GitHub.
  • Source data in KGX format (.tsv for nodes and edges); a minimal example is shown after this list.
    • The nodes file must include: id, name, category (and any other columns the schema defines)
    • The edges file must include: subject, predicate, object, provided_by, etc.
  • Familiarity with the Matrix schema definitions.
  • The schema version being validated against (defaults to the latest Matrix schema, or can be specified via the tool config)
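
For illustration, a minimal pair of KGX files might look like the following (columns are tab-separated, shown here with spaces; the rows are invented examples, and real sources typically carry additional schema-defined columns):

nodes.tsv:

id           name   category
CHEBI:15377  water  biolink:ChemicalEntity

edges.tsv:

subject      predicate           object         provided_by
CHEBI:15377  biolink:related_to  MONDO:0005148  example_source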

Installation

There are two ways to install:

  1. Install from PyPI: pip install matrix-validator
  2. Install a specific release directly from GitHub: pip install git+https://github.com/everycure-org/[email protected]#egg=matrix-validator

Step-by-Step Instructions

1. Initial Validation of Raw Data

Run the validator on the raw source files (unmodified inputs in KGX format) to create a baseline snapshot of schema alignment. This first run should be done before any transformations:

poetry run matrix --validator polars --edges raw_edges.tsv --nodes raw_nodes.tsv

Optional: pin the schema version to validate against using an environment variable.

2. Schema Alignment and Transformation

Based on the validation report, implement a transformation script (commonly referred to as a parser) that patches the raw data into a schema-compliant format. This might include:

  • Normalizing CURIEs
  • Filling required fields (e.g., category, provided_by)
  • Renaming or mapping edge predicates
  • Resolving data type mismatches

This patched version should be saved as a .1 revision of the original source.
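
Below is a minimal sketch of such a parser using polars (the same library behind the validator's polars backend; a recent polars version is assumed). The file names, prefix fix, fill-in values, and predicate mapping are illustrative assumptions rather than part of any real source; adapt them to whatever your validation report flags.

import polars as pl

nodes = pl.read_csv("raw_nodes.tsv", separator="\t")
edges = pl.read_csv("raw_edges.tsv", separator="\t")

# Normalize CURIE prefixes (illustrative: lowercase "chebi:" -> "CHEBI:").
nodes = nodes.with_columns(pl.col("id").str.replace(r"^chebi:", "CHEBI:"))

# Fill required fields the source left empty (fallback values are assumptions).
nodes = nodes.with_columns(pl.col("category").fill_null("biolink:NamedThing"))
edges = edges.with_columns(pl.col("provided_by").fill_null("example_source"))

# Map non-Biolink edge predicates onto Biolink predicates.
edges = edges.with_columns(pl.col("predicate").replace({"related_to": "biolink:related_to"}))

# Save the result as the .1 revision of the original source.
nodes.write_csv("raw_nodes.1.tsv", separator="\t")
edges.write_csv("raw_edges.1.tsv", separator="\t")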

3. Re-run Validator on Patched Data

Run the validator again on the patched data:

poetry run matrix --validator polars --edges path/to/patched_edges.tsv --nodes path/to/patched_nodes.tsv

This acts as your “unit test” for parser compliance. Output reports from this step should be committed alongside the data for transparency.

Confirm that the data now passes validation or only triggers known, acceptable warnings.

4. Document Changes

Prepare a changelog or patch history file (e.g., history.yaml or patch_notes.md) describing:

  • Original schema violations
  • Transformations applied
  • Remaining known issues (if any)
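
For example, a history.yaml for the robokop patch documented by the history.sh script below might look like this (the field names and violation descriptions are illustrative; no fixed layout is currently prescribed):

source: robokop
revision: 30fd1bfc18cd5ccb.1
violations:
  - edges header contained non-compliant columns
  - boolean node columns were incorrectly encoded
transformations:
  - clean_edges_header applied to the edges file
  - fix_robokop_bool_columns and clean_nodes_header applied to the nodes file
known_issues: []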

This history should accompany the patched files for reproducibility and transparency. Here is an example history.sh script that preserves the provenance of the data-side modifications that were made:

#!/bin/bash
# Clean the edges header, writing directly into the .1 revision directory.
clean_edges_header -i ./30fd1bfc18cd5ccb/robokop-30fd1bfc18cd5ccb_edges.tsv -o ./30fd1bfc18cd5ccb.1/robokop-30fd1bfc18cd5ccb.1_edges.tsv
# Fix the boolean node columns, then clean the header of the resulting nodes file.
fix_robokop_bool_columns -i ./30fd1bfc18cd5ccb/robokop-30fd1bfc18cd5ccb_nodes.tsv -o ./30fd1bfc18cd5ccb.1/robokop-30fd1bfc18cd5ccb.1_nodes.tsv
clean_nodes_header -i ./30fd1bfc18cd5ccb.1/robokop-30fd1bfc18cd5ccb.1_nodes.tsv -o ./30fd1bfc18cd5ccb.1/robokop-30fd1bfc18cd5ccb.1_nodes_tmp.tsv
# Replace the intermediate nodes file with the fully cleaned version.
mv ./30fd1bfc18cd5ccb.1/robokop-30fd1bfc18cd5ccb.1_nodes_tmp.tsv ./30fd1bfc18cd5ccb.1/robokop-30fd1bfc18cd5ccb.1_nodes.tsv

Notes for Developers

If you are extending the validator or developing transformation scripts, see Getting-Started-for-Developers.

Prefix-level Validation:

In our knowledge graph, CURIEs (Compact URIs) take the form PREFIX:identifier (e.g., CHEBI:15377, MONDO:0005148). Prefix-level validation ensures semantic correctness by checking:

  • That all CURIEs use valid prefixes as defined by the Biolink Model
  • That prefixes align with the expected class types, such as:
    • CHEBI: → chemical entities
    • MONDO: → diseases
    • HGNC: → genes

The Validator class automatically sets up these checks by:

  1. Loading the official prefix list from biolink-model-prefix-map.json (via the biolink_model.prefixmaps package).
  2. Accepting supplemental prefixes if specified by the user in a TOML config file:
[biolink]
supplemental_prefixes = ["FOO", "BAR"]

These are merged with the default prefix list.

  3. Mapping prefixes to Biolink classes using preferred_prefixes_per_class.json, which defines which prefixes are appropriate for each Biolink category (e.g., HGNC for biolink:Gene).

Subclasses of the Validator are responsible for using this information to:

  • Validate that each node’s id has a prefix appropriate for its category
  • Validate that edge subject and object CURIEs are compatible with the expected domain and range of the edge predicate

This helps enforce semantic alignment between identifiers and their declared types, and is especially important for schema validation and downstream reasoning.
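
As a rough illustration, the core of such a check reduces to a few lines of Python. The real Validator subclasses are more involved; the JSON shapes and file paths below are assumptions made for the sketch.

import json

# Assumed shapes: biolink-model-prefix-map.json maps prefix -> IRI expansion;
# preferred_prefixes_per_class.json maps category -> list of preferred prefixes.
with open("biolink-model-prefix-map.json") as f:
    valid_prefixes = set(json.load(f))
valid_prefixes |= {"FOO", "BAR"}  # supplemental prefixes from the TOML config

with open("preferred_prefixes_per_class.json") as f:
    preferred = json.load(f)  # e.g. {"biolink:Gene": ["HGNC", ...]}

def check_node(node_id: str, category: str) -> list[str]:
    """Return violations for a single node's id/category pair."""
    prefix = node_id.split(":", 1)[0]
    if prefix not in valid_prefixes:
        return [f"{node_id}: unknown prefix '{prefix}'"]
    if prefix not in preferred.get(category, []):
        return [f"{node_id}: prefix '{prefix}' not preferred for {category}"]
    return []

print(check_node("HGNC:1100", "biolink:Gene"))    # passes if HGNC is preferred for genes
print(check_node("CHEBI:15377", "biolink:Gene"))  # flags the prefix/category mismatch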

Future Enhancements

This process will become more streamlined as:

  • Schema definitions stabilize
  • More reusable transformation functions are added to utility modules
  • Integration of automatic patch suggestions from validator output is explored

There are also plans to support schema-evolution tracking using diff tools (e.g., LinkML-diff or custom YAML comparators) to identify breaking changes across schema versions and inform the necessary parser updates.