Profiling Signals - accelerate-data/migration-utility GitHub Wiki

Profiling Signals Reference

Detailed signal tables and output field definitions for the six profiling questions answered by /profiling-table. See Profiling Table for the main workflow documentation.

Q1 -- What kind of model is this? (Required)

Determines materialization strategy, dbt tests, and whether SCD2 history logic is needed.

Write-pattern signals:

Proc pattern Classification
Pure INSERT, no UPDATE or DELETE fact_transaction
INSERT with GROUP BY (aggregation before write) fact_aggregate
TRUNCATE + INSERT with descriptive VARCHAR columns dim_non_scd
TRUNCATE + INSERT with measure + FK columns fact_periodic_snapshot
MERGE with simple WHEN MATCHED THEN UPDATE only dim_scd1
MERGE with expire matched row + insert history row (valid_to, is_current) dim_scd2
INSERT then UPDATE targeting milestone date columns fact_accumulating_snapshot
Cross-join INSERT of low-cardinality flag combinations dim_junk

Column shape signals:

Column pattern Signal
valid_from / valid_to / is_current / current_flag dim_scd2
Multiple milestone date columns (order_date, ship_date, close_date) fact_accumulating_snapshot
snapshot_date / as_of_date / period_date fact_periodic_snapshot
Multiple BIT/TINYINT flag columns, all low-cardinality dim_junk candidate
Surrogate PK (_sk) + separate natural key column Dimension (SCD1 or SCD2)
FK columns (_sk) + numeric measure columns Fact table

classification output fields:

Field Type Description
resolved_kind string Enum: dim_non_scd, dim_scd1, dim_scd2, dim_junk, fact_transaction, fact_periodic_snapshot, fact_accumulating_snapshot, fact_aggregate
rationale string Why this classification was chosen
source string Enum: catalog, llm, catalog+llm

Q2 -- Primary key candidate (Required)

Required for unique_key in incremental models. Missing it forces full-refresh or produces duplicates.

Signals beyond catalog PKs:

Source Notes
MERGE ON clause in proc code Business key / table grain -- strongest code-level signal
UPDATE / DELETE WHERE col = @param Single-row lookup key

primary_key output fields:

Field Type Description
columns string[] PK column names
primary_key_type string Enum: surrogate, natural, composite, unknown
rationale string Why these columns were identified as PK and why this type
source string Enum: catalog, llm, catalog+llm

Q3 -- Foreign key candidates (Nice-to-have)

Needed for relationships tests and detecting role-playing / degenerate dimensions.

FK type resolution:

fk_type Rule
standard One fact column joins one dimension key with no multi-role pattern
role_playing Two or more distinct fact columns join the same dimension relation+key
degenerate Column behaves as business key in fact usage but no dimension join target found

foreign_keys[] output fields:

Field Type Description
column string FK column name on this table
references_source_relation string Referenced table (source-side identifier)
references_column string Referenced column
fk_type string Enum: standard, role_playing, degenerate
rationale string Why this FK type was chosen
source string Enum: catalog, llm, catalog+llm

Q4 -- Natural key vs surrogate key (Required)

Determines whether the model calls dbt_utils.generate_surrogate_key and whether the incremental unique_key is a raw column or generated hash.

Signals:

Signal Notes
NEWID() / NEWSEQUENTIALID() / NEXT VALUE FOR in proc body Definitive proc-assigned surrogate
Column name suffix _sk / _guid Surrogate
Column name suffix _code / _number / _num Natural
MERGE ON uses different column from INSERT's PK column MERGE ON = natural key; INSERT PK = surrogate key

natural_key output fields:

Field Type Description
columns string[] Natural key column names
rationale string Why these columns are the natural key
source string Enum: catalog, llm, catalog+llm

Q5 -- Incremental watermark (Required)

Without a watermark column, the model can only be table (full refresh). Operationally unacceptable for large fact tables.

Signals:

Source Notes
WHERE clause in proc body WHERE load_date > @last_run -- nearly definitive
Column name patterns modified_at, updated_at, load_date, etl_date, batch_date, _dt, _ts, _dttm
CDC / CT metadata Informs strategy but does not identify the watermark column

watermark output fields:

Field Type Description
column string Watermark column name
rationale string Why this column was chosen
source string Enum: catalog, llm, catalog+llm

Q6 -- PII handling candidates (Nice-to-have)

Does not affect SQL correctness. Missing PII detection is a compliance risk.

Signals beyond catalog sensitivity classifications:

Source Notes
Column name patterns email, ssn, dob, phone, address, zip, credit_card, passport, national_id, ip_address, birth_date, first_name, last_name, full_name
Column type + context VARCHAR/NVARCHAR with PII-suggestive names

pii_actions[] output fields:

Field Type Description
column string Column containing PII
entity string PII entity type (e.g., email_address, phone_number)
suggested_action string Enum: mask, drop, tokenize, keep
rationale string Why this action was suggested
source string Enum: catalog, llm, catalog+llm

Suggested action meanings:

Action When
mask Default for confirmed PII
drop Column not needed downstream
tokenize Joinability must be preserved
keep Explicit business justification

The source attribution field

Every decision point in the profile includes a source field indicating how the answer was derived:

Value Meaning
catalog Entirely from catalog signals (PKs, FKs, identity columns, sensitivity classifications)
llm Entirely from LLM reasoning over proc body, column names, or patterns
catalog+llm Catalog provided partial signal, LLM completed the answer

Catalog signals are treated as facts. The LLM fills in what the catalog does not answer.

Full profile example

{
  "profile": {
    "status": "ok",
    "writer": "silver.usp_load_dimcustomer",
    "classification": {
      "resolved_kind": "dim_scd1",
      "rationale": "MERGE with WHEN MATCHED THEN UPDATE on all non-key columns. No valid_from/valid_to columns present, ruling out SCD2.",
      "source": "catalog+llm"
    },
    "primary_key": {
      "columns": ["CustomerKey"],
      "primary_key_type": "surrogate",
      "rationale": "CustomerKey is an identity column (catalog PK constraint PK_DimCustomer). Auto-increment seed=1, increment=1.",
      "source": "catalog"
    },
    "natural_key": {
      "columns": ["CustomerAlternateKey"],
      "rationale": "MERGE ON clause uses CustomerAlternateKey to match source to target, confirming it as the business key.",
      "source": "llm"
    },
    "watermark": {
      "column": "ModifiedDate",
      "rationale": "Proc WHERE clause filters on ModifiedDate > @LastLoadDate. DATETIME2 type confirms suitability as watermark.",
      "source": "llm"
    },
    "foreign_keys": [
      {
        "column": "GeographyKey",
        "references_source_relation": "silver.DimGeography",
        "references_column": "GeographyKey",
        "fk_type": "standard",
        "rationale": "Declared FK constraint FK_DimCustomer_Geography. Single column, single dimension -- standard FK.",
        "source": "catalog"
      }
    ],
    "pii_actions": [
      {
        "column": "EmailAddress",
        "entity": "email_address",
        "suggested_action": "mask",
        "rationale": "Column name matches PII pattern 'email'. NVARCHAR type confirms string content.",
        "source": "llm"
      },
      {
        "column": "Phone",
        "entity": "phone_number",
        "suggested_action": "mask",
        "rationale": "Column name matches PII pattern 'phone'.",
        "source": "llm"
      }
    ],
    "warnings": [],
    "errors": []
  }
}