
Phase 21: Scoring Criteria Refactor

Auto-generated from .planning/phases/21-scoring-criteria-refactor
Last synced: 2026-04-28

Context & Decisions

Phase 21: Scoring Criteria Refactor — scoring criteria module refactor with dynamic, rubric-driven dimensions — Context

Gathered: 2026-04-27 Status: Ready for planning

## Phase Boundary

Refactor MR session scoring's hardcoded 5-dimension system so that ScoringRubric becomes the single source of truth for scoring. Administrators can freely define scoring dimension names, counts, weights, and criteria, and every scoring flow (LLM scoring, mock scoring, frontend display) reads dimensions dynamically from the rubric.

Out of scope: Dry Run Scoring (SOP coverage scoring) and Skill Quality Evaluation (Skill content quality scoring) remain unchanged -- they evaluate entirely different things.

## Implementation Decisions

D-01: Refactor scope — Session Scoring only

  • Only the 5 hardcoded dimensions of MR session scoring are refactored to become rubric-driven
  • Dry Run Scoring (dry_run_engine.py) and Skill Quality Evaluation (skill_evaluation_service.py) are unaffected
  • The mechanism that injects a Skill's ## Assessment Rubric free text into the LLM prompt is kept unchanged

D-02: Skill overlay model — custom scoring criteria are added in the Rubric

  • Administrators can add custom dimensions to a ScoringRubric on top of the defaults (e.g., "clinical data citation accuracy")
  • Different Skills/Scenarios can be linked to different Rubrics, so different Skills can carry different scoring criteria
  • The ## Assessment Rubric section in Skill markdown continues to serve as supplementary scoring guidance text for the LLM

D-03: Fully free-form dimensions

  • Administrators can freely add/remove/edit any dimension in a Rubric, including the default 5
  • No locked dimensions; maximum flexibility
  • The Rubric's dimensions JSON keeps its existing format: [{name, weight, criteria[], max_score}]
  • Weights must sum to 100 (already enforced by the existing schema validation; a validator sketch follows below)
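
A minimal sketch of the kind of weight-sum check already in place (the actual validator lives in backend/app/schemas/scoring_rubric.py; the class and field names here are assumptions):

```python
from pydantic import BaseModel, field_validator


class RubricCreate(BaseModel):
    name: str
    # Existing format: [{name, weight, criteria[], max_score}]
    dimensions: list[dict]

    @field_validator("dimensions")
    @classmethod
    def weights_must_sum_to_100(cls, dims: list[dict]) -> list[dict]:
        total = sum(d["weight"] for d in dims)
        if total != 100:
            raise ValueError(f"dimension weights must sum to 100, got {total}")
        return dims
```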

D-04: Data migration — drop the old columns

  • An Alembic migration converts each existing Scenario's 5 weight_* column values into Rubric records
  • A corresponding Rubric is auto-created for every existing Scenario (preserving its original weight configuration)
  • A rubric_id FK is added to the Scenario model
  • After the migration, the old columns weight_key_message, weight_objection_handling, weight_communication, weight_product_knowledge, and weight_scientific_info are dropped
  • The get_scoring_weights() method is removed

D-05: Mandatory Rubric association

  • Every Scenario must be linked to a Rubric (rubric_id NOT NULL); see the model sketch after this list
  • The migration automatically creates and links a Rubric for each existing Scenario
  • Creating a new Scenario requires selecting or creating a Rubric
  • The get_default_rubric() fallback is no longer needed for scoring (though it may be kept as the default suggestion when creating a new Scenario)
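
A sketch of the Scenario model after these decisions, assuming the project's SQLAlchemy 2.0 declarative Base (import path assumed) and the String(36) UUID convention used in the migration example later on this page; all other Scenario columns are omitted:

```python
import uuid

from sqlalchemy import ForeignKey, Integer, String
from sqlalchemy.orm import Mapped, mapped_column, relationship

from app.core.database import Base  # assumed location of the declarative Base


class Scenario(Base):
    __tablename__ = "scenarios"

    id: Mapped[str] = mapped_column(
        String(36), primary_key=True, default=lambda: str(uuid.uuid4())
    )
    # D-05: every Scenario must reference a rubric -- no scoring fallback chain
    rubric_id: Mapped[str] = mapped_column(
        String(36), ForeignKey("scoring_rubrics.id"), nullable=False
    )
    pass_threshold: Mapped[int] = mapped_column(Integer, default=70)

    rubric: Mapped["ScoringRubric"] = relationship()
    # The 5 weight_* columns and get_scoring_weights() are removed (D-04)
```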

D-06: Fully dynamic scoring prompt

  • Remove the hardcoded dim_names dictionary in scoring_engine.py
  • Remove the descriptions of the 5 specific dimensions (key_message, objection_handling, etc.) from the prompt instructions
  • Dimension names, weights, and scoring guidance are all generated dynamically from the Rubric's dimensions JSON
  • The Rubric's criteria[] field is injected directly as the scoring guidance for each dimension

D-07: Fully dynamic frontend ScoringWeights handling

  • Remove the hardcoded WEIGHT_KEYS array and I18N_KEYS mapping
  • ScoringWeights fetches the dimension list from the Rubric API and generates sliders dynamically
  • After a Rubric is selected in the Scenario Editor, the weight configuration for its dimensions is displayed
  • The existing Rubric management page /admin/scoring-rubrics remains the standalone entry point for Rubric CRUD

D-08: Dynamic mock score generator

  • _generate_mock_scores() no longer hardcodes 5 dimension blocks
  • Scores for any number of dimensions are generated dynamically from the Rubric's dimensions
  • Each dimension's mock score, strengths, weaknesses, and suggestions are generated dynamically from generic templates

Claude's Discretion

  • Concrete implementation details of the Alembic migration (batch mode for SQLite, etc.)
  • Design of the generic strengths/weaknesses copy templates in the mock score generator
  • Exact interaction design of the Rubric selection UI in the Scenario Editor (dropdown vs. modal)
  • Frontend i18n handling (whether custom dimension names need i18n)
  • Test structure and mock data patterns

<canonical_refs>

Canonical References

Downstream agents MUST read these before planning or implementing.

Scoring Engine (Session Scoring)

  • backend/app/services/scoring_engine.py — LLM scoring prompt construction; where dim_names is hardcoded
  • backend/app/services/scoring_service.py — scoring orchestration, mock generator, rubric weight resolution logic
  • backend/app/services/rubric_service.py — Rubric CRUD, get_default_rubric()

Data Models

  • backend/app/models/scenario.py — Scenario model, the 5 weight_* columns, get_scoring_weights()
  • backend/app/models/scoring_rubric.py — ScoringRubric model, JSON dimensions
  • backend/app/schemas/scoring_rubric.py — Rubric Pydantic schemas, weight validation
  • backend/app/schemas/scenario.py — Scenario schemas (weight fields to be removed, rubric_id added)

Frontend Scoring Components

  • frontend/src/components/admin/scoring-weights.tsx — hardcoded WEIGHT_KEYS; needs to become dynamic
  • frontend/src/components/scoring/radar-chart.tsx — already dynamic (receives ScorePoint[])
  • frontend/src/components/scoring/dimension-bars.tsx — already dynamic (receives ScoreDetail[])
  • frontend/src/components/scoring/feedback-card.tsx — scoring feedback card
  • frontend/src/pages/admin/scoring-rubrics.tsx — Rubric management page

Skill Scoring Integration

  • backend/app/services/scoring_service.py:235 — _extract_skill_criteria() function (unchanged)

Migrations

  • backend/alembic/versions/16f9f0ba6e9d_add_scoring_rubrics_table.py — existing Rubric migration

</canonical_refs>

<code_context>

Existing Code Insights

Reusable Assets

  • ScoringRubric model: already exists; its dimensions JSON format [{name, weight, criteria[], max_score}] already supports dynamic dimensions
  • rubric_service.py: CRUD is complete (create/get/list/update/delete)
  • RadarChart + DimensionBars: already consume dynamic data (ScorePoint[] / ScoreDetail[]); no major changes needed
  • /admin/scoring-rubrics page: Rubric management UI already exists
  • Rubric schema validation: DimensionConfig already validates that weights sum to 100

Established Patterns

  • Service layer: business logic lives in services/*.py; routers handle HTTP only
  • Pydantic v2: ConfigDict(from_attributes=True),field validators
  • Alembic: batch operations for SQLite (Gotcha #1)
  • TanStack Query hooks per domain

Integration Points

  • scoring_service.py:69-77 — current rubric-vs-scenario-weights resolution logic; must become rubric-only
  • scoring_engine.py:103-113 — dim_names dict and dimensions_config generation; must become dynamic
  • scoring_service.py:298-440 — the mock scorer's 5 hardcoded dimension blocks
  • scenario.py:52-60 — get_scoring_weights() and the 5 weight_* columns
  • Scenario Editor (frontend/src/components/admin/scenario-editor.tsx) — needs a Rubric selector added

</code_context>

## Specific Ideas
  • Different Skills can be linked to different Rubrics, covering the need for "different scoring criteria when the same scenario is run with different Skills"
  • HCP influence on scoring flows only through the LLM prompt context; no numeric modifiers are needed
  • The 4 scoring contexts are fully independent: Session Scoring (this refactor), Dry Run (SOP coverage), Skill Criteria Injection (text injection), Skill Quality Eval (content quality)
## Deferred Ideas

None — discussion stayed within phase scope


Phase: 21-scoring-criteria-refactor Context gathered: 2026-04-27

Plans (3)

| Plan | File | Status |
|------|------|--------|
| 21-01 | 21-01-PLAN.md | Complete |
| 21-02 | 21-02-PLAN.md | Complete |
| 21-03 | 21-03-PLAN.md | Complete |

Research


Phase 21: Scoring Criteria Refactor - Research

Researched: 2026-04-27
Domain: Scoring system refactoring -- eliminate hardcoded dimensions, make ScoringRubric the SSOT
Confidence: HIGH

Summary

This phase is a refactoring of an existing, working scoring system. The core problem is that 5 scoring dimensions (key_message, objection_handling, communication, product_knowledge, scientific_info) are hardcoded in 7 locations across the codebase: the Scenario ORM model (5 weight columns), the scoring engine prompt template (dimension-specific instructions), the mock score generator (5 hardcoded dimension blocks), the frontend ScoringWeights component (typed to exactly 5 keys), the frontend Scenario TypeScript types, the analytics recommendation service (dimension-to-column mapping), and the i18n locale files (hardcoded dimension translations).

A ScoringRubric model already exists with a JSON dimensions field that supports arbitrary dimension names, weights, and criteria. The rubric editor UI already supports dynamic dimensions. The refactoring goal is to make this rubric the single source of truth so all scoring flows read dimensions from the rubric rather than from hardcoded scenario columns. This is a structural cleanup, not a feature addition -- the user-facing behavior (multi-dimensional scoring with configurable weights) remains the same, but becomes truly configurable.

Primary recommendation: Add a rubric_id FK to the Scenario model, remove the 5 weight_* columns via Alembic migration with data migration (converting existing weight values to rubric records), then refactor all downstream consumers (scoring engine, mock generator, frontend components) to read dimensions from the rubric. The two separate scoring systems (session scoring and Skill quality scoring) must remain independent -- they have different dimensions and different purposes.

Standard Stack

Core (existing -- no new dependencies)

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| SQLAlchemy 2.0 | existing | ORM with async sessions | Project standard [VERIFIED: codebase] |
| Alembic | existing | Schema migrations with batch mode for SQLite | Project standard [VERIFIED: codebase] |
| Pydantic v2 | existing | Request/response schemas with validators | Project standard [VERIFIED: codebase] |
| FastAPI | existing | API layer with dependency injection | Project standard [VERIFIED: codebase] |
| react-hook-form + zod | existing | Frontend form validation | Project standard [VERIFIED: codebase] |
| recharts | existing | RadarChart, charts for scoring visualization | Project standard [VERIFIED: codebase] |

No New Dependencies Required

This refactoring uses exclusively existing libraries. No new packages need to be installed. [VERIFIED: codebase audit]

Architecture Patterns

Current Architecture (Before Refactor)

Scenario model
  ├── weight_key_message: int = 30
  ├── weight_objection_handling: int = 25
  ├── weight_communication: int = 20
  ├── weight_product_knowledge: int = 15
  ├── weight_scientific_info: int = 10
  └── get_scoring_weights() -> dict  # Returns hardcoded 5-key dict

ScoringRubric model (exists but underused)
  └── dimensions: JSON  # [{name, weight, criteria[], max_score}]

scoring_service.py
  ├── Reads scenario.get_scoring_weights() as fallback
  └── Reads rubric dimensions only if a default rubric exists

scoring_engine.py
  ├── SCORING_PROMPT_TEMPLATE: hardcoded dimension instructions
  └── dim_names dict: maps 5 keys to display names

Target Architecture (After Refactor)

Scenario model
  ├── rubric_id: FK -> scoring_rubrics.id (NOT NULL per D-05)
  ├── pass_threshold: int = 70
  └── (weight_* columns REMOVED)

ScoringRubric model (SSOT)
  └── dimensions: JSON  # [{name, weight, criteria[], max_score}]

scoring_service.py
  ├── Resolves rubric: always via scenario.rubric_id (NOT NULL, no fallback needed)
  └── Passes rubric dimensions to scoring engine

scoring_engine.py
  ├── Builds prompt dynamically from rubric dimensions
  └── No hardcoded dimension names or instructions

Frontend
  ├── ScenarioEditor: rubric selector (required field) instead of ScoringWeights
  └── All scoring components: read dimensions from score.details (already dynamic)

Pattern 1: Direct Rubric Lookup (No Fallback Chain)

What: Direct rubric lookup via scenario.rubric_id (NOT NULL per D-05)
When to use: Every time a session needs to be scored
Example:

# Source: [codebase pattern from rubric_service.py, simplified per D-05]
async def resolve_rubric_dimensions(db: AsyncSession, scenario) -> list[dict]:
    """Resolve rubric dimensions for scoring.
    
    Per D-05: rubric_id is NOT NULL, so direct lookup always succeeds.
    get_default_rubric() fallback is no longer needed for scoring.
    """
    import json as _json
    from app.services.rubric_service import get_rubric
    
    rubric = await get_rubric(db, scenario.rubric_id)
    dims = rubric.dimensions
    return _json.loads(dims) if isinstance(dims, str) else dims

Pattern 2: Dynamic Prompt Building

What: Build the scoring prompt from rubric dimensions, not hardcoded names
When to use: LLM scoring engine
Example:

# Replace hardcoded dim_names dict with rubric-driven config
def build_dimensions_config(rubric_dimensions: list[dict]) -> str:
    lines = []
    for dim in rubric_dimensions:
        name = dim["name"]
        weight = dim["weight"]
        criteria = dim.get("criteria", [])
        criteria_text = "; ".join(criteria) if criteria else "General assessment"
        lines.append(f"- {name}: weight={weight}%, criteria: {criteria_text}")
    return "\n".join(lines)

Pattern 3: Dynamic Mock Score Generation

What: Generate mock scores for arbitrary dimension sets
When to use: Mock scoring fallback when the LLM is unavailable
Example:

import random

from app.models.scenario import Scenario

def _generate_mock_scores(
    rubric_dimensions: list[dict],
    scenario: Scenario,
    messages: list,
    key_messages_status: list[dict],
) -> dict:
    """Generate mock scores for N arbitrary dimensions."""
    # Illustrative base score heuristic; the real formula is at Claude's discretion
    base_score = min(85, 70 + len(messages))
    dimensions = []
    for dim_config in rubric_dimensions:
        name = dim_config["name"]
        score = min(95, max(60, base_score + random.randint(-8, 10)))
        dimensions.append({
            "dimension": name,
            "score": score,
            "weight": dim_config["weight"],
            # Generic template copy -- final wording is a design decision (D-08)
            "strengths": [f"Solid performance on {name}"],
            "weaknesses": [f"Occasional gaps in {name}"],
            "suggestions": [f"Review the rubric criteria for {name}"],
        })
    # Weighted overall: each dimension contributes score * weight / 100
    overall = sum(d["score"] * d["weight"] / 100 for d in dimensions)
    return {"overall_score": round(overall, 1), "dimensions": dimensions}

Anti-Patterns to Avoid

  • Merging session scoring with Skill quality scoring: These are two separate systems with different dimensions (5 MR-facing vs 6 content-quality). They must remain independent. [VERIFIED: codebase -- Skill scoring uses sop_completeness, knowledge_accuracy, etc.]
  • Merging with DryRun scoring: DryRun uses executability_score and coverage_percent, which are completely different metrics. Do not touch DryRun scoring. [VERIFIED: codebase]
  • Breaking backward compatibility on stored data: Existing ScoreDetail rows reference dimension names like key_message. These must remain readable even after the refactoring. New sessions will use rubric-defined names.
  • Removing ScoringWeights component entirely: Deprecate but keep the file until all references are migrated. The rubric editor already handles dynamic dimensions.

Hardcoded Dimension Locations (Complete Inventory)

| # | File | What's Hardcoded | Action |
|---|------|------------------|--------|
| 1 | backend/app/models/scenario.py | 5 weight_* columns + get_scoring_weights() method | Remove columns, add rubric_id FK |
| 2 | backend/app/schemas/scenario.py | 5 weight_* fields in Create/Update/Response + validate_weights_sum | Remove weight fields, add rubric_id |
| 3 | backend/app/services/scoring_engine.py | dim_names dict mapping 5 keys to labels, per-dimension instructions in SCORING_PROMPT_TEMPLATE | Build dynamically from rubric |
| 4 | backend/app/services/scoring_service.py | _generate_mock_scores() with 5 hardcoded dimension blocks | Rewrite as loop over rubric dimensions |
| 5 | backend/app/services/analytics_service.py | weight_map dict in get_recommended_scenarios() mapping dimension to Scenario.weight_* columns | Rewrite to query via rubric dimensions |
| 6 | backend/app/services/scenario_service.py | clone_scenario() copies 5 weight_* fields | Copy rubric_id instead |
| 7 | frontend/src/components/admin/scoring-weights.tsx | ScoringWeightsProps typed to 5 keys, WEIGHT_KEYS, I18N_KEYS | Deprecate component (rubric editor replaces it) |
| 8 | frontend/src/components/admin/scenario-editor.tsx | 5 weight fields in zod schema, ScoringWeights usage, form values | Replace with rubric selector |
| 9 | frontend/src/types/scenario.ts | ScoringWeights interface with 5 keys, Scenario/ScenarioCreate types | Remove weight fields, add rubric_id |
| 10 | frontend/public/locales/en-US/admin.json | scenarios.keyMessageDelivery etc. (5 entries) | Keep for backward compat, mark deprecated |
| 11 | frontend/public/locales/en-US/scoring.json | dimensions.keyMessage etc. (5 entries) | Keep for backward compat, add dynamic fallback |
| 12 | backend/scripts/seed_phase2.py | Scenario seeds with hardcoded weight values | Update to create rubrics and reference rubric_id |
| 13 | backend/app/startup_seed.py | Potentially seeds default rubric | Verify/update |

Components Already Dynamic (No Changes Needed)

| Component | Why It's Already Dynamic |
|-----------|--------------------------|
| RadarChart (scoring) | Reads currentScores: ScorePoint[] -- dimension comes from data |
| DimensionBars | Reads details: ScoreDetail[] -- iterates whatever is in the array |
| FeedbackCard | Reads single ScoreDetail -- displays detail.dimension as string |
| ScoreSummary | Only shows overall score and pass/fail -- no dimension awareness |
| ReportSection | Reads improvements array -- dimension comes from data |
| PerformanceRadar (analytics) | Reads currentScores: DimensionPoint[] -- dynamic |
| SkillGapHeatmap | Builds columns from data -- already fully dynamic |
| RubricEditor | Already supports dynamic dimensions with useFieldArray |
| RubricTable | Shows dimension count badge -- no hardcoded names |
| ScoreDetail model (backend) | dimension: String(50) -- already stores arbitrary names |
| SessionScore model (backend) | No dimension awareness -- stores overall score only |

Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Weight sum validation | Custom validator | Existing field_validator in RubricCreate schema | Already validated, tested, handles edge cases [VERIFIED: scoring_rubric.py:30] |
| Dynamic radar charts | Custom chart component | Existing recharts RadarChart with data-driven config | Already renders N-dimensional data from array input [VERIFIED: radar-chart.tsx] |
| Proportional weight redistribution | Manual slider math | The rubric editor's existing individual sliders | No need to port adjustWeights logic [VERIFIED: rubric-editor.tsx] |
| JSON dimension parsing | Manual JSON.parse | Existing parse_dimensions_json validator in RubricResponse | Handles both string and list inputs [VERIFIED: scoring_rubric.py:78] |

Key insight: The rubric system already has ~80% of what's needed. The refactoring is mainly about removing the parallel hardcoded system and wiring the existing rubric system as the only path.

Common Pitfalls

Pitfall 1: SQLite Batch Migration for Column Removal

What goes wrong: SQLite does not support ALTER TABLE DROP COLUMN natively. Attempting to drop the 5 weight columns will fail.
Why it happens: Alembic generates standard ALTER TABLE SQL that SQLite cannot execute.
How to avoid: Use render_as_batch=True in Alembic's env.py (already configured per CLAUDE.md Gotcha #1). The migration must use with op.batch_alter_table('scenarios') as batch_op: to recreate the table.
Warning signs: Migration fails with "near DROP: syntax error" on SQLite.

Pitfall 2: Breaking Existing ScoreDetail Records

What goes wrong: Existing scored sessions have ScoreDetail rows with dimension values like key_message, objection_handling, etc. If the frontend tries to display these using new rubric-based labels, they may show raw snake_case keys.
Why it happens: Dimension names in ScoreDetail are stored as strings, not FK references. They persist the name used at scoring time.
How to avoid: The frontend already displays detail.dimension as a raw string. Add a dimension display name mapping utility that checks the rubric first, then falls back to the i18n translation, then to the raw key. Historical data remains readable.
Warning signs: Old session reports show key_message instead of "Key Message Delivery".

Pitfall 3: Analytics Recommendation Query Breaks

What goes wrong: get_recommended_scenarios() in analytics_service.py maps dimension names to Scenario.weight_* columns to find scenarios targeting the user's weakest dimension. After the columns are removed, this query breaks.
Why it happens: The weight_map dict directly references ORM column attributes that no longer exist.
How to avoid: Rewrite the recommendation algorithm to: (1) find the user's weakest dimension from ScoreDetail records, (2) for each active scenario, load its rubric, (3) rank scenarios by the weight of the weakest dimension in their rubric. Slightly more complex, but correct; see the sketch below.
Warning signs: 500 errors on the user dashboard after migration.
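
A minimal rewrite sketch along those lines, assuming a Scenario.rubric relationship and an is_active flag (both assumptions; align with the actual analytics_service.py signature):

```python
import json

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import selectinload

from app.models.scenario import Scenario


async def get_recommended_scenarios(
    db: AsyncSession, weakest_dimension: str, limit: int = 3
) -> list[Scenario]:
    """Rank active scenarios by how heavily their rubric weights the weakest dimension."""
    result = await db.execute(
        select(Scenario)
        .options(selectinload(Scenario.rubric))  # eager-load to avoid async lazy-load
        .where(Scenario.is_active.is_(True))
    )
    scenarios = list(result.scalars())

    def weight_for(scenario: Scenario) -> int:
        dims = scenario.rubric.dimensions
        dims = json.loads(dims) if isinstance(dims, str) else dims
        return next((d["weight"] for d in dims if d["name"] == weakest_dimension), 0)

    # Highest weight on the weak dimension first
    scenarios.sort(key=weight_for, reverse=True)
    return scenarios[:limit]
```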

Pitfall 4: Null rubric_id on Existing Scenarios

What goes wrong: After adding the rubric_id FK and removing the weight columns, existing scenarios have rubric_id=NULL and no weight data.
Why it happens: The data migration must create rubric records from existing weight values before dropping the columns.
How to avoid: Three-step migration within a single Alembic file: (1) add the rubric_id column as nullable, (2) create rubric records from existing weight values using op.execute raw SQL with uuid4() and update scenarios to point to them, (3) alter rubric_id to NOT NULL, then drop the weight columns.
Warning signs: All existing scenarios lose their scoring configuration.

Pitfall 5: Test File Explosions

What goes wrong: There are 42+ backend test files and 44+ frontend test files referencing the 5 hardcoded dimensions. Updating all of them at once creates massive, error-prone diffs.
Why it happens: Tests hardcode scenario weights and dimension names in fixtures.
How to avoid: Create a shared test fixture/factory that generates rubric-based scenarios (see the sketch below) and migrate tests to it. Tests that only assert that "some dimensions exist" (not specific names) may need minimal changes.
Warning signs: Hundreds of test failures after the model change.
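
A minimal pytest factory sketch along these lines; the db_session fixture, constructor arguments, and default criteria are assumptions to be aligned with the project's existing conftest:

```python
import json
import uuid

import pytest

from app.models.scenario import Scenario
from app.models.scoring_rubric import ScoringRubric

DEFAULT_DIMENSIONS = [
    {"name": "key_message", "weight": 30, "criteria": [], "max_score": 100.0},
    {"name": "objection_handling", "weight": 25, "criteria": [], "max_score": 100.0},
    {"name": "communication", "weight": 20, "criteria": [], "max_score": 100.0},
    {"name": "product_knowledge", "weight": 15, "criteria": [], "max_score": 100.0},
    {"name": "scientific_info", "weight": 10, "criteria": [], "max_score": 100.0},
]


@pytest.fixture
def scenario_factory(db_session):
    """Build a rubric-backed Scenario, replacing per-test weight_* fixtures."""

    async def _make(dimensions: list[dict] | None = None, **scenario_kwargs):
        rubric = ScoringRubric(
            id=str(uuid.uuid4()),
            name="Test Rubric",
            dimensions=json.dumps(dimensions or DEFAULT_DIMENSIONS),
        )
        db_session.add(rubric)
        await db_session.flush()  # project convention: flush, not commit
        scenario = Scenario(rubric_id=rubric.id, **scenario_kwargs)
        db_session.add(scenario)
        await db_session.flush()
        return scenario

    return _make
```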

Pitfall 6: Prompt Template Regression

What goes wrong: The LLM scoring prompt template has dimension-specific instructions ("For key_message, consider which key messages were delivered..."). After making the prompt dynamic, the LLM may produce lower-quality scores because it lacks domain-specific guidance.
Why it happens: Generic instructions produce generic scores. The current per-dimension instructions encode domain expertise.
How to avoid: Move the dimension-specific instructions INTO the rubric's criteria field. The prompt builder reads criteria from the rubric and includes them in the prompt. The default rubric should contain the existing detailed instructions as criteria entries, as illustrated below.
Warning signs: Score quality drops after the refactor.
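
For example, a default-rubric dimension might carry the former prompt guidance as criteria entries (the wording below is illustrative, not the actual prompt text):

```python
key_message_dimension = {
    "name": "key_message",
    "weight": 30,
    "max_score": 100.0,
    # Former hardcoded prompt instructions live on as rubric criteria,
    # which the prompt builder injects per dimension (D-06)
    "criteria": [
        "Consider which of the scenario's key messages were delivered",
        "Assess whether each key message was conveyed accurately and in context",
    ],
}
```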

Data Migration Strategy

Step 1: Add rubric_id Column (nullable initially)

# In Alembic migration, first batch_alter_table call
with op.batch_alter_table("scenarios") as batch_op:
    batch_op.add_column(
        sa.Column("rubric_id", sa.String(36),
                  sa.ForeignKey("scoring_rubrics.id"), nullable=True)
    )

Step 2: Create Rubrics from Existing Weight Combinations and Link Scenarios

# In the same Alembic migration, use op.execute / connection.execute
# For each unique weight combination in the scenarios table:
# 1. Create a ScoringRubric record with those weights as dimensions JSON
# 2. For scenarios with that weight combination, SET rubric_id to the new rubric's id
# The default 30/25/20/15/10 combination gets is_default=True

Step 3: Enforce NOT NULL and Drop Weight Columns

# After all scenarios have rubric_id populated:
with op.batch_alter_table("scenarios") as batch_op:
    batch_op.alter_column("rubric_id", nullable=False)
    batch_op.drop_column("weight_key_message")
    batch_op.drop_column("weight_objection_handling")
    batch_op.drop_column("weight_communication")
    batch_op.drop_column("weight_product_knowledge")
    batch_op.drop_column("weight_scientific_info")

Handling the Default 30/25/20/15/10 Split

Most scenarios likely use the default weights (30/25/20/15/10). The migration should:

  1. Create ONE default rubric with these weights and is_default=True
  2. Point all default-weight scenarios to this rubric
  3. Create separate rubrics only for scenarios with custom weights

Separate Scoring Systems (DO NOT MERGE)

| System | Dimensions | Used By | Location |
|--------|------------|---------|----------|
| Session Scoring (this refactor) | Configurable via rubric (default 5) | F2F + Conference scoring | scoring_service.py, scoring_engine.py |
| Skill Quality Scoring | 6 fixed (sop_completeness, knowledge_accuracy, etc.) | Skill Evaluator agent | skill-evaluator/references/evaluation-dimensions.md |
| DryRun Scoring | 2 fixed (executability_score, coverage_percent) | Dry Run results | dry_run_service.py |

These three systems are architecturally separate and must remain so. The Skill Quality dimensions evaluate content quality (is the training material good?), while Session dimensions evaluate MR performance (did the MR perform well?). DryRun dimensions evaluate SOP executability. They serve fundamentally different purposes.

Code Examples

Alembic Migration: Add rubric_id with data migration, then remove weight columns

# Source: [CLAUDE.md Gotcha #1 pattern, adapted for this use case]
import json
import uuid

import sqlalchemy as sa
from alembic import op

def upgrade() -> None:
    # Step 1: Add rubric_id column as nullable
    with op.batch_alter_table("scenarios") as batch_op:
        batch_op.add_column(
            sa.Column("rubric_id", sa.String(36), 
                      sa.ForeignKey("scoring_rubrics.id"), nullable=True)
        )
    
    # Step 2: Data migration -- create rubrics for each unique weight combo
    conn = op.get_bind()
    
    # Find unique weight combinations
    scenarios = conn.execute(sa.text(
        "SELECT id, weight_key_message, weight_objection_handling, "
        "weight_communication, weight_product_knowledge, weight_scientific_info "
        "FROM scenarios"
    )).fetchall()
    
    # Group by weight combo, create one rubric per unique combo
    weight_combos = {}
    for row in scenarios:
        combo_key = (row[1], row[2], row[3], row[4], row[5])
        if combo_key not in weight_combos:
            weight_combos[combo_key] = []
        weight_combos[combo_key].append(row[0])
    
    for (wkm, woh, wc, wpk, wsi), scenario_ids in weight_combos.items():
        rubric_id = str(uuid.uuid4())
        is_default = (wkm == 30 and woh == 25 and wc == 20 and wpk == 15 and wsi == 10)
        # The [...] placeholders stand for the per-dimension criteria text
        # migrated from the old prompt instructions (see Pitfall 6)
        dims = json.dumps([
            {"name": "key_message", "weight": wkm, "criteria": [...], "max_score": 100.0},
            {"name": "objection_handling", "weight": woh, "criteria": [...], "max_score": 100.0},
            {"name": "communication", "weight": wc, "criteria": [...], "max_score": 100.0},
            {"name": "product_knowledge", "weight": wpk, "criteria": [...], "max_score": 100.0},
            {"name": "scientific_info", "weight": wsi, "criteria": [...], "max_score": 100.0},
        ])
        conn.execute(sa.text(
            "INSERT INTO scoring_rubrics (id, name, description, scenario_type, dimensions, is_default, created_by) "
            "VALUES (:id, :name, :desc, :stype, :dims, :is_default, :created_by)"
        ), {"id": rubric_id, "name": f"Migrated {'Default' if is_default else 'Custom'} Rubric",
            "desc": "Auto-created from scenario weight columns during migration",
            "stype": "f2f", "dims": dims, "is_default": is_default, "created_by": "system"})
        
        for sid in scenario_ids:
            conn.execute(sa.text(
                "UPDATE scenarios SET rubric_id = :rid WHERE id = :sid"
            ), {"rid": rubric_id, "sid": sid})
    
    # Step 3: Enforce NOT NULL and drop weight columns
    with op.batch_alter_table("scenarios") as batch_op:
        batch_op.alter_column("rubric_id", nullable=False)
        batch_op.drop_column("weight_key_message")
        batch_op.drop_column("weight_objection_handling")
        batch_op.drop_column("weight_communication")
        batch_op.drop_column("weight_product_knowledge")
        batch_op.drop_column("weight_scientific_info")

Dynamic Scoring Prompt Builder

# Source: [adapted from existing scoring_engine.py build_scoring_prompt]
def build_dimensions_instructions(rubric_dimensions: list[dict]) -> str:
    """Build dimension-specific scoring instructions from rubric criteria."""
    lines = []
    for dim in rubric_dimensions:
        name = dim["name"]
        weight = dim["weight"]
        criteria = dim.get("criteria", [])
        lines.append(f"- {name} (weight={weight}%)")
        if criteria:
            for criterion in criteria:
                lines.append(f"  * {criterion}")
    return "\n".join(lines)

Frontend Rubric Selector (replacing ScoringWeights)

// Source: [adapted from existing scenario-editor.tsx pattern]
// In ScenarioEditor form, replace ScoringWeights with the block below.
// Assumes `rubrics` comes from the existing useRubrics() TanStack Query hook (see IC-01).
<div className="grid gap-2">
  <Label>{t("scenarios.scoringRubric")}</Label>
  <Controller
    control={form.control}
    name="rubric_id"
    render={({ field }) => (
      <Select value={field.value ?? ""} onValueChange={field.onChange}>
        <SelectTrigger>
          <SelectValue placeholder="Select scoring rubric" />
        </SelectTrigger>
        <SelectContent>
          {rubrics.map((r) => (
            <SelectItem key={r.id} value={r.id}>
              {r.name} ({r.dimensions.length} dimensions)
            </SelectItem>
          ))}
        </SelectContent>
      </Select>
    )}
  />
</div>

State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| 5 weight columns on Scenario | rubric_id FK to ScoringRubric | This phase | Unlimited configurable dimensions |
| Hardcoded prompt instructions | Rubric criteria field drives prompt | This phase | Admin controls scoring guidance |
| Mock generator with 5 blocks | Loop over rubric dimensions | This phase | Mock works with any dimension count |
| ScoringWeights component (5 sliders) | Rubric selector dropdown | This phase | Scenario editor simplified |

Deprecated after this phase:

  • ScoringWeights component -- replaced by rubric selector in ScenarioEditor
  • Scenario.get_scoring_weights() method -- replaced by rubric resolution
  • ScoringWeightsProps TypeScript interface -- no longer used
  • WEIGHT_KEYS and I18N_KEYS constants in scoring-weights.tsx -- no longer used

Assumptions Log

| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | Most existing scenarios use the default 30/25/20/15/10 weights | Data Migration Strategy | If many custom weight combos exist, the migration creates many rubrics -- not harmful but less clean |
| A2 | Scoring prompt quality is maintained by moving the existing per-dimension instructions into rubric criteria | Common Pitfalls #6 | If the criteria field is too short or the LLM ignores it, score quality may degrade |
| A3 | Frontend scoring display components are truly dynamic and need no changes | Components Already Dynamic | If any component has hidden hardcoded dimension assumptions, it will break |

Open Questions

  1. Should rubric_id be required or nullable on Scenario? (RESOLVED)

    • Decision: NOT NULL per D-05. The data migration creates rubric records for all existing scenarios before enforcing the constraint. No fallback chain is needed for scoring -- every scenario always has a rubric. get_default_rubric() is retained only as a UI convenience for pre-selecting a rubric when creating new scenarios, not as a scoring fallback.
  2. Should scenario editor show rubric dimensions inline or just a selector? (RESOLVED)

    • Decision: Selector with read-only preview per UI-SPEC IC-01. The scenario editor shows a rubric selector dropdown. When a rubric is selected, a read-only dimension preview (name + weight bar + criteria summary) appears below. Full dimension editing goes to the rubric management page via a "Manage Rubrics" link.
  3. Should the data migration run in Alembic or as a seed script? (RESOLVED)

    • Decision: Single Alembic migration with inline data migration per D-04. The migration uses raw SQL via op.get_bind() to: (1) read existing weight combinations, (2) create ScoringRubric records, (3) update scenario.rubric_id, (4) enforce NOT NULL, (5) drop weight columns. This keeps the schema change and data migration atomic. The seed scripts (seed_phase2.py, startup_seed.py) are updated separately to create scenarios with explicit rubric_id references.

Project Constraints (from CLAUDE.md)

  • NEVER modify schema without Alembic migration -- all column changes require proper migrations
  • render_as_batch for SQLite -- column drops require batch mode (Gotcha #1)
  • async with for all DB sessions -- all new service code must use async patterns
  • Service layer = business logic, routers = HTTP only -- rubric resolution belongs in service
  • Create returns 201, Delete returns 204 -- maintain API conventions
  • No raw SQL -- use SQLAlchemy ORM or Alembic for all queries (exception: Alembic data migration uses op.execute for raw SQL within the migration itself, which is the standard Alembic pattern)
  • db.flush() per project convention -- not db.commit() (session middleware handles commit)
  • Pydantic v2 schemas with from_attributes=True -- all schema updates must use ConfigDict
  • TypeScript strict: true -- no any types in frontend changes
  • TanStack Query hooks per domain -- any new hooks follow existing pattern
  • Conventional commits -- e.g., refactor(scoring): remove hardcoded dimensions from scenario model
  • server_default in migrations -- for SQLite compatibility with existing rows

Sources

Primary (HIGH confidence)

  • [Codebase audit] -- All 13 hardcoded locations identified by grep + file read
  • [backend/app/models/scenario.py] -- Current 5 weight columns
  • [backend/app/models/scoring_rubric.py] -- Existing rubric model with JSON dimensions
  • [backend/app/services/scoring_engine.py] -- Hardcoded dim_names and prompt template
  • [backend/app/services/scoring_service.py] -- Mock generator and rubric fallback logic
  • [backend/app/services/analytics_service.py] -- weight_map recommendation query
  • [frontend/src/components/admin/scoring-weights.tsx] -- 5-key typed component
  • [frontend/src/components/admin/rubric-editor.tsx] -- Already dynamic with useFieldArray
  • [frontend/src/components/scoring/radar-chart.tsx] -- Already data-driven
  • [frontend/src/components/scoring/dimension-bars.tsx] -- Already iterates ScoreDetail[]
  • [CLAUDE.md] -- Project conventions and gotchas

Secondary (MEDIUM confidence)

  • [backend/app/services/meta_skill_templates/] -- Skill quality scoring dimensions are separate
  • [backend/scripts/seed_phase2.py] -- Seed data patterns

Metadata

Confidence breakdown:

  • Hardcoded locations: HIGH -- complete grep audit of entire codebase
  • Migration strategy: HIGH -- follows established Alembic patterns in project
  • Frontend impact: HIGH -- verified each component's data flow
  • Backward compatibility: HIGH -- ScoreDetail stores dimension as string, historical data safe
  • Prompt quality after refactor: MEDIUM -- depends on criteria field quality (A2)

Research date: 2026-04-27 Valid until: 2026-05-27 (stable internal refactoring, no external dependency risk)

UI Specification


Phase 21 -- UI Design Contract

Visual and interaction contract for the Scoring Criteria Refactor frontend. Generated by gsd-ui-researcher, verified by gsd-ui-checker. This phase is a refactoring -- the primary UI change is replacing the hardcoded ScoringWeights component with a Rubric selector in the Scenario Editor. Scoring display components (RadarChart, DimensionBars, FeedbackCard) are already dynamic and require no changes.


Design System

| Property | Value |
|----------|-------|
| Tool | Custom Radix UI wrapper (shadcn-style) |
| Preset | Figma Make Design System for SaaS (medical brand override) |
| Component library | Radix UI primitives (Dialog, Select, Tabs, ScrollArea, etc.) |
| Icon library | lucide-react ^0.460.0 |
| Font | Inter + Noto Sans SC (sans), JetBrains Mono (mono) |

Source: Existing project design system (Phase 01 + Phase 10 polish), confirmed via frontend/src/styles/index.css.


Spacing Scale

Declared values (must be multiples of 4):

| Token | Value | Usage |
|-------|-------|-------|
| xs | 4px | Icon gaps, badge internal padding |
| sm | 8px | Compact element spacing, dimension preview list gap |
| md | 16px | Default form field gaps, rubric selector margin |
| lg | 24px | Card body padding, section spacing in scenario editor |
| xl | 32px | Layout gaps between major sections |
| 2xl | 48px | Page-level spacing |
| 3xl | 64px | Full-page empty state centering |

Exceptions: Dimension preview list items use 28px row height (7 x 4px) for dense dimension display without excessive vertical space.


Typography

| Role | Size | Weight | Line Height |
|------|------|--------|-------------|
| Body | 14px (text-sm) | 400 | 1.5 |
| Label | 14px (text-sm) | 500 | 1.5 |
| Heading | 20px (text-xl) | 500 | 1.2 |
| Display | 28px (text-2xl) | 600 | 1.2 |

Source: Established project tokens from frontend/src/styles/index.css @layer base rules. No new typography scales introduced in this phase.


Color

| Role | Value | Usage |
|------|-------|-------|
| Dominant (60%) | var(--background) #FFFFFF | Page backgrounds, dialog backgrounds |
| Secondary (30%) | var(--card) #FFFFFF / var(--muted) #F9FAFB | Cards, rubric preview area, dimension list bg |
| Accent (10%) | var(--primary) #1E40AF | Rubric selector focus ring, "Manage Rubrics" link text, selected rubric highlight |
| Destructive | var(--destructive) #EF4444 | Delete rubric confirmation only |

Accent reserved for:

  • Rubric selector focus ring and selected state border
  • "Manage Rubrics" navigation link text
  • Dimension weight progress bar fill in rubric preview
  • Primary CTA button ("Save Scenario")

Scoring semantic colors (unchanged, already in design system):

  • var(--strength) #22C55E -- strengths in scoring feedback
  • var(--weakness) #F97316 -- weaknesses in scoring feedback
  • var(--improvement) #A855F7 -- improvement suggestions in scoring feedback
  • var(--chart-1..5) -- RadarChart dimension colors (already dynamic, no changes)

Component Inventory

Changed Components

| Component | File | Change Description |
|-----------|------|--------------------|
| ScenarioEditor | frontend/src/components/admin/scenario-editor.tsx | Remove ScoringWeights import + 5 weight form fields. Add rubric_id Select field with Rubric selector dropdown. Add read-only dimension preview below selector. |
| ScoringWeights | frontend/src/components/admin/scoring-weights.tsx | DEPRECATED -- file retained for test reference but no longer imported by ScenarioEditor. |

New UI Elements (within existing components)

| Element | Parent | Description |
|---------|--------|-------------|
| Rubric Selector | ScenarioEditor | `Select` dropdown listing available rubrics by name with dimension count badge. Uses existing Radix Select pattern. |
| Rubric Dimension Preview | ScenarioEditor | Read-only list below the selector showing each dimension name, weight %, and criteria summary. Rendered when a rubric is selected. |
| "Manage Rubrics" Link | ScenarioEditor | Text link below rubric selector navigating to /admin/scoring-rubrics. Uses text-primary color. |

Unchanged Components (already dynamic, verified)

| Component | File | Why No Change |
|-----------|------|---------------|
| RadarChart | frontend/src/components/scoring/radar-chart.tsx | Reads ScorePoint[] -- dimension comes from data |
| DimensionBars | frontend/src/components/scoring/dimension-bars.tsx | Iterates ScoreDetail[] -- any dimension count works |
| FeedbackCard | frontend/src/components/scoring/feedback-card.tsx | Displays single ScoreDetail.dimension as string |
| ScoreSummary | frontend/src/components/scoring/score-summary.tsx | Only shows overall score, no dimension awareness |
| ReportSection | frontend/src/components/scoring/report-section.tsx | Reads improvements array from data |
| RubricEditor | frontend/src/components/admin/rubric-editor.tsx | Already supports dynamic dimensions with useFieldArray |
| RubricTable | frontend/src/components/admin/rubric-table.tsx | Shows dimension count badge, no hardcoded names |
| ScoringRubricsPage | frontend/src/pages/admin/scoring-rubrics.tsx | Full Rubric CRUD already functional |

Interaction Contracts

IC-01: Rubric Selector in Scenario Editor

Trigger: Admin opens Scenario Editor dialog (create or edit).

Layout: The rubric selector replaces the ScoringWeights card. Position in form: after Key Messages, before Pass Threshold.

Behavior:

  1. Selector loads available rubrics via useRubrics() TanStack Query hook (already exists).
  2. Each <SelectItem> shows: rubric name + dimension count badge (e.g., "Default F2F Rubric (5 dimensions)").
  3. Default rubrics (where is_default === true) are listed first with a "(Default)" suffix.
  4. When editing an existing scenario, the selector pre-fills with scenario.rubric_id.
  5. When creating a new scenario, the selector defaults to the default rubric for the selected mode (f2f or conference).
  6. Changing the selected rubric immediately updates the dimension preview below.

Dimension Preview (read-only):

  • Rendered inside a <Card> with bg-muted/50 background, padding 16px.
  • Each dimension row: dimension name (left, text-sm font-medium), weight percentage (right, text-sm text-muted-foreground), and a thin progress bar (h-1.5 bg-primary rounded-full) showing weight relative to 100.
  • Criteria text shown as text-xs text-muted-foreground truncated to 1 line with ellipsis per dimension.
  • If no rubric selected, show placeholder text: "Select a scoring rubric to see dimensions".

"Manage Rubrics" link:

  • Below the dimension preview card.
  • Text: "Manage Rubrics" (en-US) / "管理评分标准" (zh-CN).
  • Style: text-sm text-primary hover:underline cursor-pointer.
  • Behavior: opens /admin/scoring-rubrics in the same tab (uses navigate()).

IC-02: Scenario Type Data Flow

Trigger: Admin selects a scenario mode (f2f or conference).

Behavior: When mode changes, if rubric_id is currently the default rubric for the previous mode, auto-switch to the default rubric for the new mode. If rubric_id was manually selected (non-default), keep it unchanged.

IC-03: Historical Score Dimension Display

Trigger: User views scoring feedback for any session (old or new).

Behavior:

  • New sessions: dimension names come from the rubric associated with the scenario at scoring time. Names display as stored in ScoreDetail.dimension.
  • Old sessions (pre-refactor): dimension names are stored as snake_case keys like key_message. Display using a fallback chain:
    1. Check i18n scoring:dimensions.{key} -- if translated, use translation.
    2. Otherwise, convert snake_case to Title Case (e.g., key_message becomes "Key Message").
  • This fallback utility function is shared across RadarChart, DimensionBars, and FeedbackCard.

IC-04: Form Validation Changes

Trigger: Admin submits Scenario Editor form.

Validation:

  • rubric_id is required (zod: z.string().min(1, "Scoring rubric is required")).
  • The 5 weight_* fields are removed from the zod schema entirely.
  • pass_threshold remains unchanged (0-100 number).

Copywriting Contract

| Element | en-US | zh-CN |
|---------|-------|-------|
| Rubric selector label | Scoring Rubric * | 评分标准 * |
| Rubric selector placeholder | Select scoring rubric | 选择评分标准 |
| Default rubric suffix | (Default) | (默认) |
| Dimension count badge | {N} dimensions | {N} 个维度 |
| Dimension preview empty | Select a scoring rubric to see dimensions | 选择评分标准以查看评分维度 |
| Manage rubrics link | Manage Rubrics | 管理评分标准 |
| Rubric required error | Scoring rubric is required | 评分标准不能为空 |
| Deprecated weight removal notice (admin toast) | Scoring weights moved to rubric configuration | 评分权重已移至评分标准配置 |

Existing Copy Retained (backward compatibility)

| Element | Key | Status |
|---------|-----|--------|
| Key Message Delivery | scoring:dimensions.keyMessage | KEEP -- used for historical score display |
| Objection Handling | scoring:dimensions.objectionHandling | KEEP -- used for historical score display |
| Communication Skills | scoring:dimensions.communicationSkills | KEEP -- used for historical score display |
| Product Knowledge | scoring:dimensions.productKnowledge | KEEP -- used for historical score display |
| Scientific Information | scoring:dimensions.scientificInfo | KEEP -- used for historical score display |
| Scoring weights admin labels | admin:scenarios.keyMessageDelivery etc. | KEEP -- mark as deprecated in comments |

Type Changes Contract

Removed from frontend/src/types/scenario.ts

// REMOVE: ScoringWeights interface
// REMOVE from Scenario: weight_key_message, weight_objection_handling, 
//   weight_communication, weight_product_knowledge, weight_scientific_info
// REMOVE from ScenarioCreate: all weight_* optional fields
// REMOVE from ScenarioUpdate: inherited weight_* fields

Added to frontend/src/types/scenario.ts

// ADD to Scenario:
rubric_id: string;
rubric?: Rubric;  // optional populated relation

// ADD to ScenarioCreate:
rubric_id: string;  // required

// ADD to ScenarioUpdate:
rubric_id?: string;

No Changes to frontend/src/types/rubric.ts

The Rubric, RubricCreate, RubricUpdate, DimensionConfig types are already correct.


i18n Namespace Changes

| Namespace | Action | Keys |
|-----------|--------|------|
| admin | ADD | scenarios.scoringRubric, scenarios.selectRubric, scenarios.rubricDefault, scenarios.dimensionCount, scenarios.dimensionPreviewEmpty, scenarios.manageRubrics, scenarios.rubricRequired |
| admin | DEPRECATE (keep) | scenarios.scoringWeights, scenarios.keyMessageDelivery, scenarios.objectionHandling, scenarios.communicationSkills, scenarios.productKnowledge, scenarios.scientificInfo |
| scoring | KEEP | dimensions.keyMessage, dimensions.objectionHandling, dimensions.communicationSkills, dimensions.productKnowledge, dimensions.scientificInfo -- needed for historical display fallback |

Registry Safety

| Registry | Blocks Used | Safety Gate |
|----------|-------------|-------------|
| shadcn official (Radix wrappers) | Select, Card, Label, Badge, Dialog, Button, Input, Slider (all pre-existing) | Not required -- already in project |
| No third-party registries | none | Not applicable |

No new UI components are installed from any registry for this phase. All UI elements are composed from existing project components.


Dimension Display Name Utility

A shared utility function is required for consistent dimension name display across all scoring components.

File: frontend/src/lib/dimension-display.ts

Contract:

/**
 * Resolve a display-friendly name for a scoring dimension.
 * 
 * Priority chain:
 * 1. i18n translation key `scoring:dimensions.{camelCase(key)}`
 * 2. Title Case conversion of the raw key (snake_case -> Title Case)
 * 
 * This ensures backward compatibility: old sessions with dimension
 * names like "key_message" display as "Key Message Delivery" via i18n,
 * while new sessions with rubric-defined names like "Clinical Data Accuracy"
 * display as-is (no i18n key exists, Title Case of the original is the name).
 */
export function getDimensionDisplayName(dimension: string, t: TFunction): string;

Checker Sign-Off

  • Dimension 1 Copywriting: PASS
  • Dimension 2 Visuals: PASS
  • Dimension 3 Color: PASS
  • Dimension 4 Typography: PASS
  • Dimension 5 Spacing: PASS
  • Dimension 6 Registry Safety: PASS

Approval: pending
