Skill: Setup DDL

Purpose

Extracts DDL and catalog metadata from a live SQL Server (or Fabric Warehouse) and writes local artifact files that downstream skills consume. Produces manifest.json, per-object DDL files in ddl/, and per-object catalog JSON files in catalog/. This is the first skill in the pipeline and must complete before any discovery, scoping, or profiling work.

Invocation

/setup-ddl

No arguments. The skill runs an interactive workflow that prompts for database and schema selection before extracting anything.

Trigger phrases: "set up DDL", "extract DDL from SQL Server", "populate DDL", "connect to the remote database and get DDL", "pull DDL from the source database".

Prerequisites

  • toolbox binary on PATH -- the mssql MCP server requires genai-toolbox. Run toolbox --version to verify. Install from https://github.com/googleapis/genai-toolbox/releases if missing.

  • Environment variables set -- the mssql MCP server reads these at startup:

    | Variable | Description | Example |
    |---|---|---|
    | MSSQL_HOST | SQL Server hostname or IP | localhost |
    | MSSQL_PORT | SQL Server port | 1433 |
    | MSSQL_DB | Default database (use master if no specific default) | master |
    | SA_PASSWORD | SQL login password | (from env) |

  • Project root -- the skill confirms pwd with the user before proceeding. If manifest.json already exists, the skill reads source_database and extracted_schemas from it and skips database selection.

Pipeline

Step 1 -- Select database (interactive)

Lists user databases (database_id > 4) via sys.databases and asks the user to pick one. Skipped when manifest.json already exists.
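
A minimal sketch of that listing query, assuming only columns that sys.databases actually exposes (the exact column selection beyond name and database_id is a guess):

SELECT name, database_id, state_desc
FROM sys.databases
WHERE database_id > 4        -- database_id <= 4 are the system databases
ORDER BY name;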

Step 2 -- Select schemas (interactive)

Lists non-system schemas with object counts (tables, procedures, views, functions) and presents an "all" option. The user picks one or more schemas.
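
A rough sketch of how the per-schema counts could be gathered; the system-schema filter and the type codes (U, P, V, FN/IF/TF) are assumptions taken from the object type map in the staging-file table further down:

SELECT s.name AS schema_name,
       SUM(CASE WHEN o.type = 'U' THEN 1 ELSE 0 END)                 AS tables,
       SUM(CASE WHEN o.type = 'P' THEN 1 ELSE 0 END)                 AS procedures,
       SUM(CASE WHEN o.type = 'V' THEN 1 ELSE 0 END)                 AS views,
       SUM(CASE WHEN o.type IN ('FN', 'IF', 'TF') THEN 1 ELSE 0 END) AS functions
FROM sys.schemas s
LEFT JOIN sys.objects o ON o.schema_id = s.schema_id
WHERE s.schema_id < 16384                                    -- exclude fixed database-role schemas
  AND s.name NOT IN ('sys', 'INFORMATION_SCHEMA', 'guest')   -- exclude system schemas
GROUP BY s.name
ORDER BY s.name;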

Step 3 -- Extraction preview and confirm (interactive)

Runs count queries and catalog signal availability checks, then presents a summary:

Extraction preview for [database]
Schemas: silver, gold

  Object counts:
    Tables:     42
    Procedures: 18
    Views:      5
    Functions:  3

  Catalog signals available:
    Primary keys:    38 constraints
    Foreign keys:    22 constraints
    Identity cols:   15 columns
    CDC-tracked:     2 tables

  DDL files will be written to:     ./ddl/
  Catalog files will be written to: ./catalog/

The user must confirm before any files are written.
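
Two of the availability checks behind the preview might look like the following; the schema list mirrors the example above, and the exact query shapes are assumptions:

SELECT COUNT(*) AS primary_key_constraints
FROM sys.key_constraints kc
JOIN sys.schemas s ON s.schema_id = kc.schema_id
WHERE kc.type = 'PK' AND s.name IN ('silver', 'gold');

SELECT COUNT(*) AS cdc_tracked_tables
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE t.is_tracked_by_cdc = 1 AND s.name IN ('silver', 'gold');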

Step 4 -- Write manifest (deterministic)

uv run --project <shared-path> setup-ddl write-manifest \
  --technology sql_server \
  --database <database> \
  --schemas <comma-separated-schemas>

Step 5 -- Export procedures, views, and functions (MCP + CLI)

For each object type, queries OBJECT_DEFINITION() via the mssql MCP tool, saves the result to .staging/<type>.json, then runs:

uv run --project <shared-path> setup-ddl assemble-modules \
  --input ./.staging/<type>.json \
  --type <procedures|views|functions>
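
A plausible shape for the definition query, shown for procedures; swapping the type filter to V or FN/IF/TF for views and functions is an assumption:

SELECT s.name AS schema_name,
       o.name AS object_name,
       OBJECT_DEFINITION(o.object_id) AS definition
FROM sys.objects o
JOIN sys.schemas s ON s.schema_id = o.schema_id
WHERE o.type = 'P'                       -- 'V' for views; 'FN', 'IF', 'TF' for functions
  AND s.name IN ('silver', 'gold');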

Step 6 -- Export tables (MCP + CLI)

Queries sys.tables, sys.columns, sys.types, and sys.identity_columns, saves to .staging/table_columns.json, then runs:

uv run --project <shared-path> setup-ddl assemble-tables \
  --input ./.staging/table_columns.json
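
A plausible shape for the column-metadata query behind table_columns.json; the exact column list and aliases are assumptions (the catalog table schema below only requires name, sql_type, is_nullable, and is_identity):

SELECT s.name            AS schema_name,
       t.name            AS table_name,
       c.name            AS column_name,
       ty.name           AS sql_type,
       c.max_length,
       c.precision,
       c.scale,
       c.is_nullable,
       c.is_identity,
       ic.seed_value,
       ic.increment_value
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
JOIN sys.columns c ON c.object_id = t.object_id
JOIN sys.types  ty ON ty.user_type_id = c.user_type_id
LEFT JOIN sys.identity_columns ic
       ON ic.object_id = c.object_id AND ic.column_id = c.column_id
WHERE s.name IN ('silver', 'gold')
ORDER BY s.name, t.name, c.column_id;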

Step 7 -- Extract catalog signals and references (MCP + CLI)

Runs 12 catalog signal queries plus 3 DMF reference queries via MCP, saving each result to .staging/. Then processes all staging files in one pass:

uv run --project <shared-path> setup-ddl write-catalog \
  --staging-dir ./.staging \
  --database <database>
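
Each DMF reference query is a per-object call to sys.dm_sql_referenced_entities. A sketch for the example procedure used later on this page (when the DMF cannot resolve an object it raises an error, which is what produces the ERROR rows noted under Troubleshooting):

SELECT referenced_schema_name,
       referenced_entity_name,
       is_selected,
       is_updated
FROM sys.dm_sql_referenced_entities('silver.usp_load_DimCustomer', 'OBJECT');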

Step 8 -- AST enrichment (deterministic)

uv run --project <shared-path> catalog-enrich --project-root .

Augments catalog files with AST-derived references that catalog queries miss: CTAS/SELECT INTO targets, TRUNCATE targets, and indirect writers through EXEC call chains. Entries carry "detection": "ast_scan".
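
For illustration, these are the kinds of statements the catalog DMF misses but the AST scan picks up; the object names here are hypothetical:

SELECT CustomerKey, FirstName
INTO silver.DimCustomer_stage              -- CTAS / SELECT INTO target
FROM silver.DimCustomer;

TRUNCATE TABLE silver.DimCustomer_stage;   -- TRUNCATE target

EXEC silver.usp_load_DimCustomer;          -- indirect writer reached through an EXEC call chain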

Step 9 -- Report

Displays a summary of all extracted files and tells the user they can run discover or scoping next.

Reads

This skill reads from a live SQL Server via the mssql MCP tool. No local catalog files are consumed.

Writes

manifest.json

| Field | Type | Required | Description |
|---|---|---|---|
| schema_version | string | yes | Always "1.0" |
| technology | string | yes | Source technology. Enum: sql_server, fabric_warehouse, fabric_lakehouse, snowflake |
| dialect | string | yes | sqlglot dialect. Enum: tsql, spark, snowflake |
| source_database | string | yes | Name of the source database |
| extracted_schemas | string[] | yes | List of schemas included in the extraction |
| extracted_at | string | yes | ISO 8601 timestamp of extraction |
| sandbox | object | no | Sandbox metadata (run_id, database) -- added later by the test harness |

Technology-to-dialect mapping:

| Technology | Dialect | Delimiter |
|---|---|---|
| sql_server | tsql | GO |
| fabric_warehouse | tsql | GO |
| fabric_lakehouse | spark | ; |
| snowflake | snowflake | ; |

ddl/ directory

| File | Contents |
|---|---|
| ddl/tables.sql | CREATE TABLE statements, one per table |
| ddl/procedures.sql | CREATE PROCEDURE statements |
| ddl/views.sql | CREATE VIEW statements |
| ddl/functions.sql | CREATE FUNCTION statements |

catalog/tables/<schema>.<table>.json

Initial fields written by setup-ddl (profile and scoping sections are added later by other skills):

| Field | Type | Required | Description |
|---|---|---|---|
| schema | string | yes | Schema name |
| name | string | yes | Table name |
| ddl_hash | string | no | SHA-256 of normalized source DDL |
| stale | boolean | no | true when object was present in prior extraction but absent in latest |
| columns | array | no | Column definitions: name, sql_type, is_nullable, is_identity |
| primary_keys | array | yes | Declared PKs: constraint_name, columns[] |
| unique_indexes | array | yes | Unique indexes: index_name, columns[] |
| foreign_keys | array | yes | FK constraints: constraint_name, columns[], referenced_schema, referenced_table, referenced_columns[] |
| auto_increment_columns | array | yes | Identity/auto-increment columns: column, seed, increment, mechanism |
| change_capture | object | no | CDC/change tracking: enabled, mechanism (enum: cdc, change_tracking, stream, change_data_feed) |
| sensitivity_classifications | array | no | PII labels: column, label, information_type |
| referenced_by | object | yes | Inbound references grouped by type (procedures, views, functions), each split into in_scope and out_of_scope |

catalog/procedures/<schema>.<proc>.json

See Skill Analyzing Object for the full procedure catalog schema.

Catalog signal staging files

The catalog signal queries and DMF reference queries produce these staging files in .staging/:

| Staging file | Signal |
|---|---|
| table_columns.json | Column definitions with types, nullability, identity |
| pk_unique.json | Primary key and unique index constraints |
| foreign_keys.json | Foreign key relationships |
| identity_columns.json | Identity columns |
| cdc.json | CDC-tracked tables |
| change_tracking.json | Change tracking tables (graceful -- tolerates missing feature) |
| sensitivity.json | Sensitivity classifications (graceful) |
| object_types.json | Object type map (U, V, P, FN, IF, TF) |
| definitions.json | All proc/view/function definitions for routing flag scan |
| proc_params.json | Procedure parameters |
| proc_dmf.json | DMF references for procedures |
| view_dmf.json | DMF references for views |
| func_dmf.json | DMF references for functions |

JSON Format

manifest.json example

{
  "schema_version": "1.0",
  "technology": "sql_server",
  "dialect": "tsql",
  "source_database": "AdventureWorksDW",
  "extracted_schemas": ["dbo", "silver", "gold"],
  "extracted_at": "2025-03-15T14:30:00Z"
}

catalog/tables/silver.dimcustomer.json example (initial)

{
  "schema": "silver",
  "name": "DimCustomer",
  "columns": [
    { "name": "CustomerKey", "sql_type": "BIGINT", "is_nullable": false, "is_identity": true },
    { "name": "FirstName", "sql_type": "NVARCHAR(50)", "is_nullable": true, "is_identity": false },
    { "name": "Region", "sql_type": "NVARCHAR(50)", "is_nullable": true, "is_identity": false }
  ],
  "primary_keys": [
    { "constraint_name": "PK_DimCustomer", "columns": ["CustomerKey"] }
  ],
  "unique_indexes": [],
  "foreign_keys": [
    {
      "constraint_name": "FK_DimCustomer_Region",
      "columns": ["Region"],
      "referenced_schema": "silver",
      "referenced_table": "DimGeography",
      "referenced_columns": ["GeographyKey"]
    }
  ],
  "auto_increment_columns": [
    { "column": "CustomerKey", "seed": 1, "increment": 1, "mechanism": "identity" }
  ],
  "change_capture": { "enabled": true, "mechanism": "cdc" },
  "sensitivity_classifications": [
    { "column": "FirstName", "label": "Confidential", "information_type": "Name" }
  ],
  "referenced_by": {
    "procedures": {
      "in_scope": [
        { "schema": "silver", "name": "usp_load_DimCustomer", "is_selected": true, "is_updated": true, "detection": "catalog_query" }
      ],
      "out_of_scope": []
    },
    "views": { "in_scope": [], "out_of_scope": [] },
    "functions": { "in_scope": [], "out_of_scope": [] }
  }
}

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| toolbox: command not found | genai-toolbox not installed | Install from https://github.com/googleapis/genai-toolbox/releases and add to PATH |
| MSSQL_HOST is not set | Missing environment variable | Set MSSQL_HOST, MSSQL_PORT, MSSQL_DB, and SA_PASSWORD in .env or shell |
| USE statement not carrying across calls | Each MCP call is a discrete connection | Prepend USE [<database>]; to every SQL block -- the skill handles this automatically (see the sketch after this table) |
| Change tracking query fails | Feature not enabled on server | Graceful degradation -- the query uses TRY/CATCH and saves an error marker instead of failing |
| Sensitivity classifications query fails | Feature not available (e.g., SQL Server edition) | Graceful degradation -- same TRY/CATCH pattern |
| DMF reference returns ERROR: rows | sys.dm_sql_referenced_entities cannot resolve a proc (e.g., missing dependent object) | Error rows are recorded in the catalog; the skill continues with remaining objects |
| Dynamic SQL procs missing from referenced_by | sys.dm_sql_referenced_entities resolves at definition time, not runtime | Known limitation -- these procs require LLM analysis via Skill Analyzing Object |
| Re-running on existing project | ddl/ and catalog/ already exist | The skill warns and asks for confirmation -- a re-run fully rebuilds both directories |
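
A sketch of the per-call pattern behind the USE and graceful-degradation rows above; the database name comes from the manifest example, and the exact query text is an assumption:

USE [AdventureWorksDW];   -- prepended to every block; each MCP call is a discrete connection

BEGIN TRY
    SELECT s.name AS schema_name, t.name AS table_name
    FROM sys.change_tracking_tables ctt
    JOIN sys.tables t  ON t.object_id = ctt.object_id
    JOIN sys.schemas s ON s.schema_id = t.schema_id;
END TRY
BEGIN CATCH
    -- saved to .staging/change_tracking.json as an error marker instead of failing the run
    SELECT ERROR_NUMBER() AS error_number, ERROR_MESSAGE() AS error_message;
END CATCH;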