Profiling Signals - accelerate-data/migration-utility GitHub Wiki
Profiling Signals Reference
Detailed signal tables and output field definitions for the six profiling questions answered by /profiling-table. See Profiling Table for the main workflow documentation.
Q1 -- What kind of model is this? (Required)
Determines materialization strategy, dbt tests, and whether SCD2 history logic is needed.
Write-pattern signals:
| Proc pattern | Classification |
|---|---|
| Pure INSERT, no UPDATE or DELETE | fact_transaction |
| INSERT with GROUP BY (aggregation before write) | fact_aggregate |
| TRUNCATE + INSERT with descriptive VARCHAR columns | dim_non_scd |
| TRUNCATE + INSERT with measure + FK columns | fact_periodic_snapshot |
| MERGE with simple WHEN MATCHED THEN UPDATE only | dim_scd1 |
| MERGE with expire matched row + insert history row (valid_to, is_current) | dim_scd2 |
| INSERT then UPDATE targeting milestone date columns | fact_accumulating_snapshot |
| Cross-join INSERT of low-cardinality flag combinations | dim_junk |
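The write-pattern rows above can be read as an ordered series of checks. The following is a minimal sketch, assuming the proc body is available as a plain string; the function name `classify_write_pattern`, the regular expressions, and the check order are illustrative assumptions, not the actual /profiling-table logic. Rows that also depend on column semantics (TRUNCATE + INSERT, milestone updates, junk dimensions) are deliberately left unresolved.

```python
import re
from typing import Optional


def classify_write_pattern(proc_sql: str) -> Optional[str]:
    """Classify a proc body from write-pattern signals alone (hypothetical helper)."""
    sql = proc_sql.upper()

    def has(pattern: str) -> bool:
        return re.search(pattern, sql) is not None

    if has(r"\bMERGE\b"):
        # Expiring the matched row and writing a history row points to SCD2.
        if has(r"\bVALID_TO\b") or has(r"\bIS_CURRENT\b"):
            return "dim_scd2"
        if has(r"\bWHEN\s+MATCHED\s+THEN\s+UPDATE\b"):
            return "dim_scd1"
        return None

    if has(r"\bINSERT\b") and not has(r"\b(UPDATE|DELETE|TRUNCATE)\b"):
        # Aggregation before the write separates the two fact kinds.
        return "fact_aggregate" if has(r"\bGROUP\s+BY\b") else "fact_transaction"

    # Remaining patterns need column-shape signals or LLM reasoning.
    return None
```

Returning `None` instead of guessing keeps the ambiguous cases visible so they can be settled by the column shape signals.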
Column shape signals:
| Column pattern | Signal |
|---|---|
| valid_from / valid_to / is_current / current_flag | dim_scd2 |
| Multiple milestone date columns (order_date, ship_date, close_date) | fact_accumulating_snapshot |
| snapshot_date / as_of_date / period_date | fact_periodic_snapshot |
| Multiple BIT/TINYINT flag columns, all low-cardinality | dim_junk candidate |
| Surrogate PK (_sk) + separate natural key column | Dimension (SCD1 or SCD2) |
| FK columns (_sk) + numeric measure columns | Fact table |
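As an illustration of how the name-based and type-based rows could be checked, here is a small sketch. It assumes the columns arrive as a mapping from column name to SQL type name; `column_shape_signals` and its constants are hypothetical. It covers only the rows that can be tested without cardinality or key metadata.

```python
from typing import Dict, List

SCD2_NAMES = {"valid_from", "valid_to", "is_current", "current_flag"}
SNAPSHOT_NAMES = {"snapshot_date", "as_of_date", "period_date"}
FLAG_TYPES = {"BIT", "TINYINT"}


def column_shape_signals(columns: Dict[str, str]) -> List[str]:
    """Return the classifications hinted at by column names and types."""
    names = {name.lower() for name in columns}
    signals: List[str] = []
    if names & SCD2_NAMES:
        signals.append("dim_scd2")
    if names & SNAPSHOT_NAMES:
        signals.append("fact_periodic_snapshot")
    flag_count = sum(1 for sql_type in columns.values() if sql_type.upper() in FLAG_TYPES)
    if flag_count >= 2:
        # Cardinality is not visible from types alone, so this stays a candidate.
        signals.append("dim_junk candidate")
    return signals
```

A table can emit more than one signal; the sketch returns all of them and leaves reconciliation with the write-pattern result to the caller.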
classification output fields:
| Field | Type | Description |
|---|---|---|
| resolved_kind | string | Enum: dim_non_scd, dim_scd1, dim_scd2, dim_junk, fact_transaction, fact_periodic_snapshot, fact_accumulating_snapshot, fact_aggregate |
| rationale | string | Why this classification was chosen |
| source | string | Enum: catalog, llm, catalog+llm |
Q2 -- Primary key candidate (Required)
Required for unique_key in incremental models. Without it, the model must either run as a full refresh or risk appending duplicate rows.
Signals beyond catalog PKs:
| Source | Notes |
|---|---|
| MERGE ON clause in proc code | Business key / table grain -- strongest code-level signal |
| UPDATE / DELETE WHERE col = @param | Single-row lookup key |
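To make the MERGE ON signal concrete, the sketch below pulls the target-side columns out of the first MERGE ... ON clause in a proc body. It assumes the target table has an alias, that column names are not bracket-quoted, and that the ON clause ends at the first WHEN; `merge_on_columns` is a hypothetical helper and a regex is only a rough stand-in for real SQL parsing.

```python
import re
from typing import List


def merge_on_columns(proc_sql: str, target_alias: str) -> List[str]:
    """Return target-side columns referenced in the first MERGE ... ON clause."""
    on_clause = re.search(
        r"\bMERGE\b.*?\bON\b(.*?)\bWHEN\b",
        proc_sql,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if on_clause is None:
        return []
    columns = re.findall(
        rf"\b{re.escape(target_alias)}\.(\w+)",
        on_clause.group(1),
        flags=re.IGNORECASE,
    )
    # Preserve first-seen order while dropping repeats.
    return list(dict.fromkeys(columns))
```

More than one returned column suggests a composite business key.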
primary_key output fields:
| Field | Type | Description |
|---|---|---|
| columns | string[] | PK column names |
| primary_key_type | string | Enum: surrogate, natural, composite, unknown |
| rationale | string | Why these columns were identified as PK and why this type |
| source | string | Enum: catalog, llm, catalog+llm |
Q3 -- Foreign key candidates (Nice-to-have)
Needed for relationships tests and detecting role-playing / degenerate dimensions.
FK type resolution:
| fk_type | Rule |
|---|---|
| standard | One fact column joins one dimension key with no multi-role pattern |
| role_playing | Two or more distinct fact columns join the same dimension relation+key |
| degenerate | Column behaves as business key in fact usage but no dimension join target found |
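The three rules can be expressed as a grouping step over the observed joins. This is a sketch under stated assumptions: each candidate is a `(fact column, dimension relation, dimension key)` tuple, a candidate with no join target has already been judged to behave as a business key, and `resolve_fk_types` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# (fact column, dimension relation or None, dimension key or None)
Join = Tuple[str, Optional[str], Optional[str]]


def resolve_fk_types(joins: List[Join]) -> Dict[str, str]:
    """Assign standard, role_playing, or degenerate to each fact column."""
    columns_by_target: Dict[Tuple[str, Optional[str]], Set[str]] = defaultdict(set)
    for column, relation, key in joins:
        if relation is not None:
            columns_by_target[(relation, key)].add(column)

    result: Dict[str, str] = {}
    for column, relation, key in joins:
        if relation is None:
            result[column] = "degenerate"  # no dimension join target found
        elif len(columns_by_target[(relation, key)]) >= 2:
            result[column] = "role_playing"  # several columns share one target
        else:
            result[column] = "standard"
    return result
```

For example, `order_date_sk` and `ship_date_sk` both joining `dim_date.date_sk` would resolve to `role_playing`.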
foreign_keys[] output fields:
| Field | Type | Description |
|---|---|---|
| column | string | FK column name on this table |
| references_source_relation | string | Referenced table (source-side identifier) |
| references_column | string | Referenced column |
| fk_type | string | Enum: standard, role_playing, degenerate |
| rationale | string | Why this FK type was chosen |
| source | string | Enum: catalog, llm, catalog+llm |
Q4 -- Natural key vs surrogate key (Required)
Determines whether the model calls dbt_utils.generate_surrogate_key and whether the incremental unique_key is a raw column or generated hash.
Signals:
| Signal | Notes |
|---|---|
| NEWID() / NEWSEQUENTIALID() / NEXT VALUE FOR in proc body | Definitive proc-assigned surrogate |
| Column name suffix _sk / _guid | Surrogate |
| Column name suffix _code / _number / _num | Natural |
| MERGE ON uses different column from INSERT's PK column | MERGE ON = natural key; INSERT PK = surrogate key |
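The suffix and generator-function signals are simple enough to sketch directly. The helper names `key_kind_from_name` and `proc_assigns_surrogate` are hypothetical; the sketch covers only the first three rows and returns `None` when a name carries no recognizable suffix.

```python
import re
from typing import Optional

SURROGATE_SUFFIXES = ("_sk", "_guid")
NATURAL_SUFFIXES = ("_code", "_number", "_num")
GENERATOR_PATTERN = re.compile(
    r"\bNEWID\s*\(|\bNEWSEQUENTIALID\s*\(|\bNEXT\s+VALUE\s+FOR\b",
    re.IGNORECASE,
)


def key_kind_from_name(column: str) -> Optional[str]:
    """Guess surrogate vs natural from the column name suffix."""
    name = column.lower()
    if name.endswith(SURROGATE_SUFFIXES):
        return "surrogate"
    if name.endswith(NATURAL_SUFFIXES):
        return "natural"
    return None


def proc_assigns_surrogate(proc_sql: str) -> bool:
    """True when the proc body calls a key generator such as NEWID()."""
    return GENERATOR_PATTERN.search(proc_sql) is not None
```

A name such as `CustomerKey` matches neither suffix list, so it falls through to the catalog or the MERGE ON comparison.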
natural_key output fields:
| Field | Type | Description |
|---|---|---|
| columns | string[] | Natural key column names |
| rationale | string | Why these columns are the natural key |
| source | string | Enum: catalog, llm, catalog+llm |
Q5 -- Incremental watermark (Required)
Without a watermark column, the model can only be materialized as table (full refresh), which is operationally unacceptable for large fact tables.
Signals:
| Source | Notes |
|---|---|
| WHERE clause in proc body | WHERE load_date > @last_run -- nearly definitive |
| Column name patterns | modified_at, updated_at, load_date, etl_date, batch_date, _dt, _ts, _dttm |
| CDC / CT metadata | Informs strategy but does not identify the watermark column |
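A rough sketch of the first two signals follows. It assumes the WHERE-clause comparison uses `>` or `>=` against a `@parameter`, and treats the name patterns ending in an underscore prefix (`_dt`, `_ts`, `_dttm`) as suffixes. Both function names are hypothetical, and the regex is a heuristic, not a SQL parser.

```python
import re
from typing import List, Optional

WATERMARK_NAMES = {"modified_at", "updated_at", "load_date", "etl_date", "batch_date"}
WATERMARK_SUFFIXES = ("_dt", "_ts", "_dttm")


def watermark_from_where(proc_sql: str) -> Optional[str]:
    """Column compared with > or >= against a @parameter in a WHERE clause."""
    match = re.search(
        r"\bWHERE\b[^;]*?(\w+)\s*>=?\s*@\w+",
        proc_sql,
        flags=re.IGNORECASE,
    )
    return match.group(1) if match else None


def watermark_candidates(columns: List[str]) -> List[str]:
    """Columns whose names match the watermark naming patterns."""
    return [
        column
        for column in columns
        if column.lower() in WATERMARK_NAMES
        or column.lower().endswith(WATERMARK_SUFFIXES)
    ]
```

When both helpers return something, the WHERE-clause result is the stronger signal, in line with the "nearly definitive" note in the table.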
watermark output fields:
| Field | Type | Description |
|---|---|---|
| column | string | Watermark column name |
| rationale | string | Why this column was chosen |
| source | string | Enum: catalog, llm, catalog+llm |
Q6 -- PII handling candidates (Nice-to-have)
Does not affect SQL correctness, but undetected PII is a compliance risk.
Signals beyond catalog sensitivity classifications:
| Source | Notes |
|---|---|
| Column name patterns | email, ssn, dob, phone, address, zip, credit_card, passport, national_id, ip_address, birth_date, first_name, last_name, full_name |
| Column type + context | VARCHAR/NVARCHAR with PII-suggestive names |
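The name-pattern signal could be approximated with a normalized substring match, as in the sketch below. Stripping underscores so that `EmailAddress` and `email_address` compare equally is an assumption of this example, and `pii_name_matches` is a hypothetical helper. Substring matching produces false positives, so a match is only a candidate for review.

```python
from typing import List

PII_PATTERNS = [
    "email", "ssn", "dob", "phone", "address", "zip", "credit_card",
    "passport", "national_id", "ip_address", "birth_date",
    "first_name", "last_name", "full_name",
]


def pii_name_matches(column: str) -> List[str]:
    """Return every PII name pattern contained in the column name."""
    normalized = column.lower().replace("_", "")
    return [
        pattern
        for pattern in PII_PATTERNS
        if pattern.replace("_", "") in normalized
    ]
```

A column can match several patterns at once; `EmailAddress` matches both `email` and `address`, and choosing the entity is left to later reasoning.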
pii_actions[] output fields:
| Field | Type | Description |
|---|---|---|
| column | string | Column containing PII |
| entity | string | PII entity type (e.g., email_address, phone_number) |
| suggested_action | string | Enum: mask, drop, tokenize, keep |
| rationale | string | Why this action was suggested |
| source | string | Enum: catalog, llm, catalog+llm |
Suggested action meanings:
| Action | When |
|---|---|
| mask | Default for confirmed PII |
| drop | Column not needed downstream |
| tokenize | Joinability must be preserved |
| keep | Explicit business justification |
The source attribution field
Every decision point in the profile includes a source field indicating how the answer was derived:
| Value | Meaning |
|---|---|
| catalog | Entirely from catalog signals (PKs, FKs, identity columns, sensitivity classifications) |
| llm | Entirely from LLM reasoning over proc body, column names, or patterns |
| catalog+llm | Catalog provided partial signal, LLM completed the answer |
Catalog signals are treated as facts. The LLM fills in what the catalog does not answer.
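One way to picture the three values is as the outcome of two yes/no questions about a decision. The sketch below is purely illustrative; `resolve_source` and its boolean inputs are assumptions of this example.

```python
def resolve_source(catalog_contributed: bool, llm_contributed: bool) -> str:
    """Map which inputs contributed to a decision onto the source enum."""
    if catalog_contributed and llm_contributed:
        return "catalog+llm"
    if catalog_contributed:
        return "catalog"
    if llm_contributed:
        return "llm"
    raise ValueError("a decision needs at least one contributing source")
```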
Full profile example
{
  "profile": {
    "status": "ok",
    "writer": "silver.usp_load_dimcustomer",
    "classification": {
      "resolved_kind": "dim_scd1",
      "rationale": "MERGE with WHEN MATCHED THEN UPDATE on all non-key columns. No valid_from/valid_to columns present, ruling out SCD2.",
      "source": "catalog+llm"
    },
    "primary_key": {
      "columns": ["CustomerKey"],
      "primary_key_type": "surrogate",
      "rationale": "CustomerKey is an identity column (catalog PK constraint PK_DimCustomer). Auto-increment seed=1, increment=1.",
      "source": "catalog"
    },
    "natural_key": {
      "columns": ["CustomerAlternateKey"],
      "rationale": "MERGE ON clause uses CustomerAlternateKey to match source to target, confirming it as the business key.",
      "source": "llm"
    },
    "watermark": {
      "column": "ModifiedDate",
      "rationale": "Proc WHERE clause filters on ModifiedDate > @LastLoadDate. DATETIME2 type confirms suitability as watermark.",
      "source": "llm"
    },
    "foreign_keys": [
      {
        "column": "GeographyKey",
        "references_source_relation": "silver.DimGeography",
        "references_column": "GeographyKey",
        "fk_type": "standard",
        "rationale": "Declared FK constraint FK_DimCustomer_Geography. Single column, single dimension -- standard FK.",
        "source": "catalog"
      }
    ],
    "pii_actions": [
      {
        "column": "EmailAddress",
        "entity": "email_address",
        "suggested_action": "mask",
        "rationale": "Column name matches PII pattern 'email'. NVARCHAR type confirms string content.",
        "source": "llm"
      },
      {
        "column": "Phone",
        "entity": "phone_number",
        "suggested_action": "mask",
        "rationale": "Column name matches PII pattern 'phone'.",
        "source": "llm"
      }
    ],
    "warnings": [],
    "errors": []
  }
}
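A consumer of this output may want to check that every enum-typed field holds one of the documented values. The sketch below walks a parsed profile and reports violations. The enum sets are copied from the field tables on this page; `enum_violations` is a hypothetical helper and not part of the tool.

```python
from typing import Any, List

ENUMS = {
    "resolved_kind": {
        "dim_non_scd", "dim_scd1", "dim_scd2", "dim_junk",
        "fact_transaction", "fact_periodic_snapshot",
        "fact_accumulating_snapshot", "fact_aggregate",
    },
    "primary_key_type": {"surrogate", "natural", "composite", "unknown"},
    "fk_type": {"standard", "role_playing", "degenerate"},
    "suggested_action": {"mask", "drop", "tokenize", "keep"},
    "source": {"catalog", "llm", "catalog+llm"},
}


def enum_violations(node: Any, path: str = "profile") -> List[str]:
    """Walk a parsed profile and list enum fields holding unknown values."""
    problems: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in ENUMS and isinstance(value, str):
                if value not in ENUMS[key]:
                    problems.append(f"{child}: unexpected value {value!r}")
            else:
                problems.extend(enum_violations(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(enum_violations(item, f"{path}[{index}]"))
    return problems
```

Passing the value of the `"profile"` key from the example above (after `json.loads`) should yield an empty list, since every enum field there uses a documented value.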