# Planning Phase 05 - huqianghui/AI-Coach-vibe-coding GitHub Wiki

Auto-generated from `.planning/phases/05-training-material-management`
Last synced: 2026-04-02
Gathered: 2026-03-25 | Status: Ready for planning
## Phase Boundary

Admin can upload, version, and manage training materials (Word/Excel/PDF) organized by product -- materials feed into the AI knowledge base for more accurate HCP simulation. This phase delivers the backend CRUD + file storage layer, the document processing pipeline, RAG-style knowledge integration, and the admin UI for material management.
## Implementation Decisions

- Storage backend uses a pluggable adapter pattern: local filesystem for dev, Azure Blob Storage for prod (consistent with ARCH-01)
- Upload size limit: 50MB per file, single file upload per request
- File format validation: MIME type + extension whitelist on backend, accept PDF/DOCX/XLSX only
- Material organization: flat by product with tags, consistent with existing scenario-product grouping from Phase 2
- Linear version sequence (v1, v2, v3) for simplicity and admin intuitiveness
- Re-upload of same material auto-creates new version, preserving full history and audit trail
- Soft delete with archived flag + admin restore capability (matches CONTENT-03 versioning requirement)
- Configurable per-product retention rules with background cleanup task (meets success criterion 3)
- Extract text on upload, store as searchable chunks for immediate RAG availability
- Page-level chunking with overlap for good granularity on medical content
- SQLAlchemy full-text search for MVP, pluggable interface for Azure AI Search later (keeps dev simple, prod-ready via adapter)
- Materials linked via product -- HCP session auto-includes relevant product materials as context (aligns with scenario-product relationship from Phase 2)
No items deferred to Claude's discretion -- all areas explicitly decided.
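The flat-by-product-with-tags decision stores tags as a comma-separated string. A tiny normalization helper (hypothetical -- not part of the planned service layer) illustrates how to keep that field consistent across uploads:

```python
def normalize_tags(raw: str) -> str:
    """Normalize a comma-separated tag string: trim whitespace, lowercase,
    drop empties and duplicates while preserving first-seen order."""
    seen: dict[str, None] = {}
    for part in raw.split(","):
        tag = part.strip().lower()
        if tag and tag not in seen:
            seen[tag] = None
    return ",".join(seen)
```

Storing the normalized form makes later tag filtering a simple substring or equality match instead of a fuzzy comparison.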
<code_context>
- CRUD service pattern: `backend/app/services/` has session_service, scenario_service, hcp_profile_service as templates
- Router pattern: `backend/app/api/` has auth, scenarios, hcp_profiles, scoring routers
- Pydantic v2 schemas: `backend/app/schemas/` for all domain types
- Admin pages: `frontend/src/pages/admin/` has hcp-profiles, scenarios, azure-config, scoring-rubrics
- UI components: `frontend/src/components/ui/` has shared shadcn/ui components
- ServiceRegistry: pluggable adapter pattern in `backend/app/services/agents/registry.py`
- Async SQLAlchemy with AsyncSession for all DB operations
- Router -> Service -> Model layered architecture
- Pydantic v2 with ConfigDict(from_attributes=True) for all schemas
- Feature toggles via config API and ConfigProvider
- i18n separated by domain namespace (common, auth, nav, admin, coach, scoring)
- TanStack Query hooks per domain for frontend server state
- New router: `backend/app/api/materials.py`, registered in `main.py`
- New models: TrainingMaterial, MaterialVersion, MaterialChunk in `backend/app/models/`
- New admin page: `frontend/src/pages/admin/training-materials.tsx`
- New i18n namespace: `materials` for upload/version/management strings
- Knowledge base connects to existing prompt_builder.py for HCP session context injection
</code_context>
## Specific Ideas

No specific requirements beyond ROADMAP success criteria -- open to standard approaches following established codebase patterns.
## Deferred Ideas

None -- discussion stayed within phase scope.
| # | Plan File | Status |
|---|---|---|
| 05-01 | 05-01-PLAN.md | Complete |
| 05-02 | 05-02-PLAN.md | Complete |
| 05-03 | 05-03-PLAN.md | Complete |
## Research Notes

Researched: 2026-03-25 | Domain: File upload, document processing, text extraction, RAG-style knowledge indexing, admin CRUD UI | Confidence: HIGH
Phase 5 adds the ability for admins to upload training materials (PDF, DOCX, XLSX) organized by product, manage versions and retention, extract text for searchable chunks, and integrate those chunks into the AI coaching prompt pipeline. The codebase has well-established patterns for every layer (model, service, schema, router, API client, TanStack Query hook, admin page) that this phase follows directly.
The primary technical challenge is document text extraction -- requiring three new Python dependencies (pypdf, python-docx, openpyxl) that are all mature, well-maintained, and pure-Python. The chunking and search integration is straightforward: store extracted text as page-level chunks in a regular SQLAlchemy table with basic LIKE search for MVP (SQLite FTS5 is available but adds unnecessary complexity for the initial implementation). The prompt builder already accepts scenario context and can be extended to inject material chunks.
Primary recommendation: Follow the exact CRUD patterns from scenarios/HCP profiles for the material management layer. Use a pluggable storage adapter (local filesystem dev, Azure Blob prod) consistent with ARCH-01. Extract text synchronously on upload (documents are small -- medical training materials under 50MB). Store chunks in a material_chunks table linked to material versions, and extend prompt_builder.py to include relevant chunks in HCP system prompts.
<user_constraints>
- Storage backend uses pluggable adapter pattern: local filesystem for dev, Azure Blob Storage for prod (consistent with ARCH-01)
- Upload size limit: 50MB per file, single file upload per request
- File format validation: MIME type + extension whitelist on backend, accept PDF/DOCX/XLSX only
- Material organization: flat by product with tags, consistent with existing scenario-product grouping from Phase 2
- Linear version sequence (v1, v2, v3) for simplicity and admin intuitiveness
- Re-upload of same material auto-creates new version, preserving full history and audit trail
- Soft delete with archived flag + admin restore capability (matches CONTENT-03 versioning requirement)
- Configurable per-product retention rules with background cleanup task (meets success criterion 3)
- Extract text on upload, store as searchable chunks for immediate RAG availability
- Page-level chunking with overlap for good granularity on medical content
- SQLAlchemy full-text search for MVP, pluggable interface for Azure AI Search later (keeps dev simple, prod-ready via adapter)
- Materials linked via product -- HCP session auto-includes relevant product materials as context (aligns with scenario-product relationship from Phase 2)
No items deferred to Claude's discretion -- all areas explicitly decided.
None -- discussion stayed within phase scope. </user_constraints>
<phase_requirements>
| ID | Description | Research Support |
|---|---|---|
| CONTENT-01 | Admin can upload training materials (PDF, Word, Excel) organized by product and therapeutic area | File upload via FastAPI UploadFile, pypdf/python-docx/openpyxl for validation, storage adapter pattern, material model with product FK |
| CONTENT-02 | Uploaded materials feed into AI knowledge base for more accurate HCP simulation (RAG-style grounding) | Text extraction pipeline, MaterialChunk model, prompt_builder.py extension to inject relevant chunks into HCP system prompt |
| CONTENT-03 | Training materials support versioning and folder organization | MaterialVersion model with linear version sequence, soft delete with archived flag, version history API endpoints |
</phase_requirements>
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| pypdf | 6.9.2 | PDF text extraction | Pure Python, actively maintained successor to PyPDF2, no native deps |
| python-docx | 1.2.0 | DOCX text extraction | Standard library for Word docs, pure Python with lxml |
| openpyxl | 3.1.5 | XLSX text extraction | Standard library for Excel files, pure Python |
| aiofiles | 25.1.0 | Async file I/O for storage adapter | Required for non-blocking file writes in async FastAPI |
| Library | Version | Purpose | Why Standard |
|---|---|---|---|
| react-dropzone | 15.0.0 | File upload drag-and-drop UI | De facto standard for React file uploads, accessible, typed |
| Library | Purpose | Used For |
|---|---|---|
| python-multipart >= 0.0.9 | FastAPI file upload parsing | Already a dependency -- enables UploadFile |
| axios | HTTP client with interceptors | Upload via multipart/form-data with progress |
| @radix-ui/react-progress | Progress bar component | Upload progress indicator |
| Instead of | Could Use | Tradeoff |
|---|---|---|
| pypdf | PyMuPDF (fitz) | Better extraction quality but requires C libs, complicates Docker |
| react-dropzone | Native input[type=file] | Less accessible, no drag-and-drop, more boilerplate |
| aiofiles | Synchronous file writes | Would block the event loop on large files |
| SQLAlchemy LIKE search | SQLite FTS5 virtual table | FTS5 is faster but requires non-standard Alembic migration, SQLAlchemy FTS5 integration is awkward -- LIKE is sufficient for MVP material volumes |
Installation:

```bash
# Backend
cd backend
pip install pypdf python-docx openpyxl aiofiles

# Frontend
cd frontend
npm install react-dropzone
```

Version verification: All versions confirmed via pip index and npm registry on 2026-03-25.
```text
backend/app/
  models/
    material.py          # TrainingMaterial, MaterialVersion, MaterialChunk
  schemas/
    material.py          # MaterialCreate, MaterialOut, VersionOut, ChunkOut
  services/
    material_service.py  # CRUD + versioning logic
    text_extractor.py    # PDF/DOCX/XLSX text extraction
    storage/
      __init__.py        # StorageBackend protocol + get_storage()
      local.py           # LocalStorageBackend (dev)
      azure_blob.py      # AzureBlobStorageBackend (prod, stub)
  api/
    materials.py         # Router: upload, list, version history, delete, restore, chunks

frontend/src/
  types/
    material.ts          # TrainingMaterial, MaterialVersion, MaterialChunk types
  api/
    materials.ts         # API client functions
  hooks/
    use-materials.ts     # TanStack Query hooks
  pages/admin/
    training-materials.tsx  # Admin page
  components/admin/
    material-list.tsx       # Material table with filters
    material-upload.tsx     # Upload dialog with drag-and-drop
    material-versions.tsx   # Version history panel
```

What: Abstract file storage behind a protocol/ABC so the local filesystem is used in dev and Azure Blob Storage in prod.
When to use: Any file I/O operation (save, read, delete).
Example:
```python
# backend/app/services/storage/__init__.py
from typing import Protocol


class StorageBackend(Protocol):
    async def save(self, path: str, content: bytes) -> str:
        """Save file, return storage URL/path."""
        ...

    async def read(self, path: str) -> bytes:
        """Read file content."""
        ...

    async def delete(self, path: str) -> None:
        """Delete file from storage."""
        ...

    async def exists(self, path: str) -> bool:
        """Check if file exists."""
        ...
```

What: Stateless module that accepts file bytes + content type and returns extracted text pages.
When to use: Called during upload to populate MaterialChunk records.
Example:
```python
# backend/app/services/text_extractor.py
import io

from pypdf import PdfReader
from docx import Document
from openpyxl import load_workbook


def extract_text(content: bytes, content_type: str) -> list[str]:
    """Extract text pages from document. Returns list of page-level strings."""
    if content_type == "application/pdf":
        return _extract_pdf(content)
    elif content_type == (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ):
        return _extract_docx(content)
    elif content_type == (
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    ):
        return _extract_xlsx(content)
    return []


def _chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks with trailing overlap.

    (Referenced below but missing from the original snippet; requires
    overlap < chunk_size.)
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks


def _extract_pdf(content: bytes) -> list[str]:
    reader = PdfReader(io.BytesIO(content))
    return [page.extract_text() or "" for page in reader.pages]


def _extract_docx(content: bytes) -> list[str]:
    doc = Document(io.BytesIO(content))
    # DOCX has no inherent pages -- join non-empty paragraphs, then split
    # into fixed-size chunks with overlap as the chunking boundary.
    full_text = "\n".join(p.text for p in doc.paragraphs if p.text.strip())
    return _chunk_text(full_text, chunk_size=2000, overlap=200)


def _extract_xlsx(content: bytes) -> list[str]:
    wb = load_workbook(io.BytesIO(content), data_only=True, read_only=True)
    pages = []
    for sheet in wb.sheetnames:
        ws = wb[sheet]
        rows = []
        for row in ws.iter_rows(values_only=True):
            cells = [str(c) for c in row if c is not None]
            if cells:
                rows.append(" | ".join(cells))
        if rows:
            pages.append(f"Sheet: {sheet}\n" + "\n".join(rows))
    return pages
```

What: When re-uploading to an existing material, auto-create a new MaterialVersion with an incremented version number.
When to use: Upload endpoint when material_id is provided.
Example:
```python
from sqlalchemy.ext.asyncio import AsyncSession

# get_material, _get_latest_version, extract_text, MaterialVersion,
# MaterialChunk, and StorageBackend come from the surrounding service,
# extractor, and model modules.


async def upload_version(
    db: AsyncSession, material_id: str, file_content: bytes,
    filename: str, content_type: str, storage: StorageBackend,
) -> MaterialVersion:
    material = await get_material(db, material_id)

    # Get next version number
    latest = await _get_latest_version(db, material_id)
    next_version = (latest.version_number + 1) if latest else 1

    # Store file under a version-scoped path
    storage_path = f"materials/{material_id}/v{next_version}/{filename}"
    storage_url = await storage.save(storage_path, file_content)

    # Create version record
    version = MaterialVersion(
        material_id=material_id,
        version_number=next_version,
        filename=filename,
        file_size=len(file_content),
        content_type=content_type,
        storage_url=storage_url,
    )
    db.add(version)
    await db.flush()

    # Extract and store chunks
    pages = extract_text(file_content, content_type)
    for i, text in enumerate(pages):
        if text.strip():
            chunk = MaterialChunk(
                version_id=version.id,
                material_id=material_id,
                chunk_index=i,
                content=text,
            )
            db.add(chunk)
    await db.flush()
    return version
```

What: Extend the existing prompt_builder.py to include relevant material chunks when building HCP system prompts.
When to use: During coaching session initialization -- look up materials by product, inject as context.
Example:
```python
# Extension to prompt_builder.py
def build_hcp_system_prompt(
    hcp_profile: HcpProfile,
    scenario: Scenario,
    key_messages: list[str],
    material_context: list[str] | None = None,  # NEW
) -> str:
    # ... existing prompt building ...
    if material_context:
        prompt_parts.extend([
            "",
            "# Product Training Materials (Reference Knowledge)",
            "Use the following product information to inform your responses:",
        ])
        for i, chunk in enumerate(material_context, 1):
            prompt_parts.append(f"\n--- Material Excerpt {i} ---\n{chunk}")
    # ... rest of prompt ...
```

Anti-patterns to avoid:

- Processing files in background tasks: the decision is to extract text synchronously on upload. With a 50MB limit on medical training materials, extraction takes under 5 seconds; background tasks add complexity without benefit.
- Storing file content in the database: store files on filesystem/blob storage; keep only metadata and text chunks in the DB.
- Using virtual FTS5 tables with Alembic: FTS5 virtual tables cannot be managed by Alembic autogenerate. Use regular tables with LIKE/ILIKE queries for MVP.
- Returning file content in JSON responses: return URLs/paths, not base64-encoded content. The frontend downloads files via a separate endpoint.
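The "separate download endpoint" implied above has one detail worth handling early, given the zh-CN locale support: non-ASCII filenames in the Content-Disposition header. This hypothetical helper (not part of the plan) sketches the RFC 5987 `filename*` form with an ASCII fallback:

```python
from urllib.parse import quote


def content_disposition(filename: str) -> str:
    """Build a Content-Disposition value for a file download.

    Uses the RFC 5987 `filename*` form so non-ASCII names (e.g. Chinese
    material titles) survive, plus a plain ASCII fallback for older clients.
    """
    fallback = filename.encode("ascii", "ignore").decode() or "download"
    encoded = quote(filename, safe="")  # percent-encode UTF-8 bytes
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"
```

The value would be set as the `Content-Disposition` response header by whatever download route the plan's materials router exposes.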
| Problem | Don't Build | Use Instead | Why |
|---|---|---|---|
| PDF text extraction | Custom PDF parser | `pypdf.PdfReader` | PDF format is complex; pypdf handles encoding, layout, encryption |
| DOCX text extraction | XML parsing of .docx | python-docx `Document` | DOCX is a zip of XML files; python-docx abstracts this correctly |
| XLSX text extraction | Custom Excel reader | `openpyxl.load_workbook` | Excel format has multiple internal representations |
| File upload drag-and-drop | Custom drag events | react-dropzone | Cross-browser drag-and-drop is tricky; accessibility built in |
| MIME type detection | Extension-only check | Dual check: extension + python-multipart content-type header | Rely on both -- extension alone can be spoofed |
| Pagination | Custom offset logic | Existing `PaginatedResponse.create()` | Already has battle-tested implementation |
| Query invalidation | Manual cache clear | TanStack Query `invalidateQueries` | Already used across all hooks |
Key insight: Document parsing is deceptively complex. Even simple-looking PDFs can have encoding issues, embedded fonts, or layout quirks that only mature libraries handle correctly. pypdf, python-docx, and openpyxl are the standard choices used across the Python ecosystem.
What goes wrong: Large files consume memory before validation rejects them.
Why it happens: FastAPI reads the entire UploadFile into memory by default when you call await file.read().
How to avoid: Check Content-Length header first as a quick reject, then stream-read with a size counter. For 50MB limit this is not critical but is good practice.
Warning signs: Memory spikes during upload testing.
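The stream-read-with-counter idea can be sketched framework-agnostically (the `read_limited` helper and chunk size are illustrative, not from the codebase; with FastAPI, `await file.read(chunk_size)` would take the place of the blocking `stream.read`):

```python
import io

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB, per the upload limit decision


class FileTooLarge(Exception):
    pass


def read_limited(stream: io.BufferedIOBase, limit: int = MAX_FILE_SIZE,
                 chunk_size: int = 1024 * 1024) -> bytes:
    """Read a stream in chunks, aborting as soon as `limit` is exceeded,
    so an oversized upload is rejected without buffering it entirely."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > limit:
            raise FileTooLarge(f"exceeds {limit} bytes")
```

A Content-Length header check stays useful as the quick first reject; this loop is the backstop for clients that lie about or omit the header.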
What goes wrong: A .pdf file uploads with application/octet-stream MIME type.
Why it happens: Browser MIME detection varies; some browsers send generic types.
How to avoid: Validate both: (1) file extension against whitelist, (2) attempt to open with the corresponding library (pypdf/python-docx/openpyxl) -- if it fails, reject. The library-level validation is the true authority.
Warning signs: Upload succeeds but text extraction returns empty results.
What goes wrong: ALTER TABLE fails in SQLite migration.
Why it happens: SQLite has limited ALTER TABLE support (Gotcha #1 from CLAUDE.md).
How to avoid: Alembic env.py already has render_as_batch=True. New migrations auto-use batch mode. Verify this is still set.
Warning signs: Alembic upgrade fails with "near ALTER: syntax error".
What goes wrong: Synchronous open() / write() calls block the async event loop during uploads.
Why it happens: Standard Python file I/O is synchronous.
How to avoid: Use aiofiles for all file operations in the storage adapter, or use asyncio.to_thread() for the text extraction calls (which are CPU-bound synchronous operations).
Warning signs: Other requests hang during file upload/processing.
What goes wrong: New models not detected by autogenerate migration.
Why it happens: Alembic env.py must import all models for metadata registration (Gotcha #7).
How to avoid: After creating new models in material.py, add imports to app/models/__init__.py AND alembic/env.py.
Warning signs: alembic revision --autogenerate generates an empty migration.
What goes wrong: File upload fails with 422 Validation Error.
Why it happens: Axios defaults to application/json. File uploads need multipart/form-data.
How to avoid: Use FormData object with axios. The Content-Type header is auto-set when using FormData.
Warning signs: Backend receives empty file or parsing error.
What goes wrong: pypdf extracts garbled text from Chinese-language PDFs.
Why it happens: Some PDFs use CID-keyed fonts without embedded ToUnicode maps.
How to avoid: Test with representative Chinese medical PDFs early. pypdf handles most CJK correctly since v3+. If quality is poor, fall back to storing the raw document reference without chunks.
Warning signs: Extracted text contains \x00 or replacement characters.
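The warning signs above can be turned into a simple quality gate -- a hypothetical check that flags a garbled page so it is skipped rather than stored as a chunk (the 5% threshold is an assumption to tune against real documents):

```python
def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """Flag extracted text whose share of NUL or U+FFFD replacement
    characters exceeds `threshold` -- a sign of a CID-keyed font
    without a usable ToUnicode map."""
    if not text:
        return False
    bad = sum(1 for ch in text if ch in ("\x00", "\ufffd"))
    return bad / len(text) > threshold
```

Running this per page lets the pipeline store the raw document reference while dropping only the unusable chunks.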
```python
# Source: FastAPI official docs + codebase pattern
from pathlib import Path

from fastapi import APIRouter, Depends, File, Form, UploadFile
from sqlalchemy.ext.asyncio import AsyncSession

# get_db, require_role, bad_request, User, and material_service come from
# the existing codebase modules (deps, auth, utils, services).

router = APIRouter(prefix="/materials", tags=["materials"])

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx"}
ALLOWED_MIME_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB


@router.post("/", status_code=201)
async def upload_material(
    file: UploadFile = File(...),
    product: str = Form(...),
    name: str = Form(...),
    tags: str = Form(""),  # comma-separated
    material_id: str | None = Form(None),  # if re-uploading for new version
    db: AsyncSession = Depends(get_db),
    user: User = Depends(require_role("admin")),
):
    """Upload a training material. Creates new material or adds version to existing."""
    # Validate extension
    ext = Path(file.filename or "").suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        bad_request(f"File type {ext} not allowed. Accepted: PDF, DOCX, XLSX")

    # Read content
    content = await file.read()
    if len(content) > MAX_FILE_SIZE:
        bad_request("File exceeds 50MB limit")

    # Delegate to service
    result = await material_service.upload_material(
        db, content=content, filename=file.filename or "unnamed",
        content_type=file.content_type or "application/octet-stream",
        product=product, name=name, tags=tags,
        material_id=material_id, user_id=user.id,
    )
    return result
```

# Source: Codebase pattern (scenario.py, score.py)
```python
class TrainingMaterial(Base, TimestampMixin):
    """Training material metadata -- one per logical document."""

    __tablename__ = "training_materials"

    name: Mapped[str] = mapped_column(String(255), nullable=False)
    product: Mapped[str] = mapped_column(String(255), nullable=False, index=True)
    therapeutic_area: Mapped[str] = mapped_column(String(255), default="")
    tags: Mapped[str] = mapped_column(Text, default="")  # comma-separated
    is_archived: Mapped[bool] = mapped_column(default=False)
    current_version: Mapped[int] = mapped_column(default=1)
    created_by: Mapped[str] = mapped_column(
        String(36), ForeignKey("users.id"), nullable=False
    )

    # Relationships
    versions = relationship(
        "MaterialVersion", back_populates="material",
        order_by="MaterialVersion.version_number.desc()",
    )


class MaterialVersion(Base, TimestampMixin):
    """Specific version of a training material with file metadata."""

    __tablename__ = "material_versions"

    material_id: Mapped[str] = mapped_column(
        String(36), ForeignKey("training_materials.id"), nullable=False
    )
    version_number: Mapped[int] = mapped_column(nullable=False)
    filename: Mapped[str] = mapped_column(String(255), nullable=False)
    file_size: Mapped[int] = mapped_column(nullable=False)  # bytes
    content_type: Mapped[str] = mapped_column(String(100), nullable=False)
    storage_url: Mapped[str] = mapped_column(Text, nullable=False)
    is_active: Mapped[bool] = mapped_column(default=True)

    # Relationships
    material = relationship("TrainingMaterial", back_populates="versions")
    chunks = relationship(
        "MaterialChunk", back_populates="version",
        cascade="all, delete-orphan",
    )


class MaterialChunk(Base, TimestampMixin):
    """Extracted text chunk from a material version for RAG search."""

    __tablename__ = "material_chunks"

    version_id: Mapped[str] = mapped_column(
        String(36), ForeignKey("material_versions.id"), nullable=False
    )
    material_id: Mapped[str] = mapped_column(
        String(36), ForeignKey("training_materials.id"), nullable=False, index=True
    )
    chunk_index: Mapped[int] = mapped_column(nullable=False)
    content: Mapped[str] = mapped_column(Text, nullable=False)
    page_label: Mapped[str] = mapped_column(String(50), default="")  # "Page 3" or "Sheet: Data"

    # Relationships
    version = relationship("MaterialVersion", back_populates="chunks")
```

// Source: react-dropzone docs + codebase axios pattern
```typescript
import apiClient from "./client";

export async function uploadMaterial(
  file: File,
  product: string,
  name: string,
  tags?: string,
  materialId?: string,
  onProgress?: (percent: number) => void,
) {
  const formData = new FormData();
  formData.append("file", file);
  formData.append("product", product);
  formData.append("name", name);
  if (tags) formData.append("tags", tags);
  if (materialId) formData.append("material_id", materialId);

  const { data } = await apiClient.post("/materials", formData, {
    headers: { "Content-Type": "multipart/form-data" },
    onUploadProgress: (event) => {
      if (event.total && onProgress) {
        onProgress(Math.round((event.loaded * 100) / event.total));
      }
    },
  });
  return data;
}
```

// Source: Codebase pattern (use-hcp-profiles.ts)
```typescript
import { useQuery, useMutation, useQueryClient } from "@tanstack/react-query";
import { getMaterials, uploadMaterial, deleteMaterial } from "@/api/materials";

export function useMaterials(params?: { product?: string; search?: string }) {
  return useQuery({
    queryKey: ["materials", params],
    queryFn: () => getMaterials(params),
  });
}

export function useUploadMaterial() {
  const queryClient = useQueryClient();
  return useMutation({
    mutationFn: (args: { file: File; product: string; name: string }) =>
      uploadMaterial(args.file, args.product, args.name),
    onSuccess: () => {
      queryClient.invalidateQueries({ queryKey: ["materials"] });
    },
  });
}
```

| Old Approach | Current Approach | When Changed | Impact |
|---|---|---|---|
| PyPDF2 | pypdf (v3+, now v6) | 2023 | PyPDF2 is deprecated, pypdf is the maintained successor |
| Sync file I/O in FastAPI | aiofiles for async I/O | Always best practice | Prevents blocking event loop |
| FTS via LIKE queries | FTS5 virtual tables or dedicated search service | Available but overkill for MVP | LIKE is fine for < 10K chunks; upgrade to Azure AI Search in prod |
Deprecated/outdated:
- PyPDF2: Deprecated, replaced by pypdf. Do NOT use PyPDF2.
- PyPDF4: Dead fork. Use pypdf.
- xlrd: Only supports .xls (old format). Use openpyxl for .xlsx.
Open questions:

- **Background Retention Cleanup Task**
  - What we know: decision says "configurable per-product retention rules with background cleanup task"
  - What's unclear: whether to use a simple periodic background task (FastAPI lifespan with asyncio.create_task), APScheduler, or external cron
  - Recommendation: use a simple async background task started in the FastAPI lifespan, similar to how mock adapters are registered. Run every hour, check retention rules, soft-delete expired materials. This keeps it in-process and testable without external dependencies.
- **Material Search Query Integration**
  - What we know: need to retrieve relevant chunks for a product during coaching sessions
  - What's unclear: whether to search all active version chunks or only the latest version
  - Recommendation: search only chunks from the latest active version of each material for a given product. This avoids duplicate/stale context and keeps prompt size manageable.
- **Chunk Size Optimization**
  - What we know: decision says "page-level chunking with overlap for good granularity"
  - What's unclear: optimal chunk size and overlap for medical training documents
  - Recommendation: target ~2000 characters per chunk with ~200 character overlap. For PDFs, use natural page boundaries. For DOCX (no pages), use paragraph-count-based splitting. For XLSX, one sheet per chunk.
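The in-process cleanup task recommended for retention can be sketched as a cancellable asyncio loop. A minimal sketch, assuming a hypothetical `cleanup_expired_materials` coroutine and an hourly interval; only the `run_periodically` wrapper is concrete here:

```python
import asyncio
from typing import Awaitable, Callable


async def run_periodically(
    coro_factory: Callable[[], Awaitable[None]],
    interval_seconds: float,
) -> None:
    """Run `coro_factory()` immediately, then every `interval_seconds`,
    until the surrounding task is cancelled."""
    while True:
        try:
            await coro_factory()
        except Exception:
            # Log and keep looping -- one failed sweep must not kill
            # retention enforcement. (Cancellation still propagates,
            # since CancelledError is a BaseException.)
            pass
        await asyncio.sleep(interval_seconds)

# Hypothetical wiring in the FastAPI lifespan:
#
#   task = asyncio.create_task(run_periodically(cleanup_expired_materials, 3600))
#   yield
#   task.cancel()
```

Starting and cancelling the task inside the lifespan keeps it in-process and trivially testable, matching the recommendation above.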
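The latest-active-version rule for chunk search can be illustrated in plain Python (the real implementation would be a SQLAlchemy subquery over MaterialVersion; the `Version` dataclass here exists only for illustration):

```python
from dataclasses import dataclass


@dataclass
class Version:
    material_id: str
    version_number: int
    is_active: bool


def latest_active_versions(versions: list[Version]) -> dict[str, int]:
    """Highest active version number per material -- the set whose chunks
    should be searched (mirrors MAX(version_number) grouped by material_id,
    filtered to is_active)."""
    latest: dict[str, int] = {}
    for v in versions:
        if v.is_active and v.version_number > latest.get(v.material_id, 0):
            latest[v.material_id] = v.version_number
    return latest
```

Restricting search to this set is what avoids duplicate or stale context from superseded versions leaking into the HCP prompt.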
| Dependency | Required By | Available | Version | Fallback |
|---|---|---|---|---|
| Python 3.11+ | Backend | Yes | 3.11.9 | -- |
| Node.js 20+ | Frontend | Yes | 23.11.0 | -- |
| pypdf | PDF extraction | Not installed | 6.9.2 (pip) | pip install |
| python-docx | DOCX extraction | Not installed | 1.2.0 (pip) | pip install |
| openpyxl | XLSX extraction | Not installed | 3.1.5 (pip) | pip install |
| aiofiles | Async file I/O | Not installed | 25.1.0 (pip) | pip install |
| react-dropzone | File upload UI | Not installed | 15.0.0 (npm) | npm install |
| SQLite FTS5 | Full-text search | Yes (built into SQLite) | -- | Not needed for MVP (using LIKE) |
Missing dependencies with no fallback:
- None -- all missing dependencies are installable via pip/npm
Missing dependencies with fallback:
- None -- all are straightforward installs
Codebase constraints to follow:

- Async everywhere: all backend functions must be `async def` with `await`. Storage adapter methods must be async.
- Pydantic v2 schemas: all request/response schemas use `model_config = ConfigDict(from_attributes=True)`.
- Service layer holds business logic: router only handles HTTP delegation (no business logic in route handlers).
- Create returns 201, Delete returns 204: follow existing status code conventions.
- Static routes before parameterized routes: any static material routes (e.g., `/search`) must come before `/{material_id}`.
- No raw SQL: use the SQLAlchemy ORM. LIKE queries via `column.ilike()`.
- Alembic for schema changes: must create a migration via `alembic revision --autogenerate`.
- `render_as_batch=True`: already set in env.py for SQLite compatibility.
- Import models in alembic/env.py: new models must be imported there.
- ruff format + ruff check: must pass before commit.
- TypeScript strict mode: no `any` types. All types defined in `src/types/`.
- TanStack Query hooks per domain: no inline `useQuery` in components.
- i18n externalized: all UI strings via react-i18next. New `materials` namespace needed.
- `cn()` utility for conditional classes.
- `db.flush()` not `db.commit()`: service layer uses flush; commit handled by session middleware.
- >=95% test coverage: required per success criteria.
| File | Change |
|---|---|
| `backend/app/models/__init__.py` | Add TrainingMaterial, MaterialVersion, MaterialChunk exports |
| `backend/app/api/__init__.py` | Add materials_router export |
| `backend/app/main.py` | Register materials_router with app |
| `backend/alembic/env.py` | Import new models (TrainingMaterial, MaterialVersion, MaterialChunk) |
| `backend/app/services/prompt_builder.py` | Add material_context parameter to build_hcp_system_prompt |
| `backend/app/config.py` | Add material_storage_path, material_max_size_mb, material_retention_days settings |
| `backend/.env.example` | Add new config vars |
| `backend/pyproject.toml` | Add pypdf, python-docx, openpyxl, aiofiles to dependencies |
| `frontend/package.json` | Add react-dropzone |
| `frontend/src/i18n/index.ts` | Add materials to ns array |
| `frontend/src/router/index.tsx` | Add /admin/training-materials route |
| `frontend/public/locales/en-US/admin.json` | Add materials section |
| `frontend/public/locales/zh-CN/admin.json` | Add materials section (Chinese) |
Note: Admin sidebar already has /admin/materials entry with FileText icon in admin-layout.tsx. The nav.json already has "materials": "Materials". These are already in place.
Sources consulted:

- Codebase analysis: `backend/app/models/`, `backend/app/services/`, `backend/app/api/`, `frontend/src/hooks/`, `frontend/src/api/` -- examined all existing CRUD patterns
- pip registry: verified pypdf 6.9.2, python-docx 1.2.0, openpyxl 3.1.5, aiofiles 25.1.0
- npm registry: verified react-dropzone 15.0.0
- SQLite FTS5: verified available in local Python 3.11.9 installation
- pypdf text extraction capabilities (CJK handling improved in v3+) -- based on training data, standard in Python ecosystem
- react-dropzone API patterns -- well-established library, training data reliable
- Optimal chunk size for medical training materials (2000 chars + 200 overlap) -- this is a reasonable default but may need tuning based on actual document content
Confidence breakdown:
- Standard stack: HIGH - All libraries verified via pip/npm, versions confirmed
- Architecture: HIGH - Follows exact patterns already established in codebase
- Pitfalls: HIGH - Based on documented gotchas in CLAUDE.md and direct codebase inspection
- Integration points: HIGH - Every file to modify was read and verified
- Chunking strategy: MEDIUM - Reasonable defaults, may need tuning
Research date: 2026-03-25 Valid until: 2026-04-25 (stable domain, libraries are mature)
## Verification Report
Phase Goal: Admin can upload, version, and manage training materials (Word/Excel/PDF) organized by product -- materials feed into AI knowledge base for more accurate HCP simulation
Verified: 2026-03-25T09:00:00Z | Status: gaps_found | Re-verification: No -- initial verification
| # | Truth | Status | Evidence |
|---|---|---|---|
| 1 | Admin can upload training documents (Word, Excel, PDF) organized by product via the web UI | VERIFIED | POST /api/v1/materials with multipart form (file + product + name), admin-only auth guard, frontend page with react-dropzone at /admin/materials, 34 tests pass |
| 2 | Uploaded materials support versioning and archiving -- admin can see version history and restore previous versions | VERIFIED | MaterialVersion model, upload_material supports material_id for re-upload, archive/restore endpoints, version history dialog in frontend, tests confirm version_number increment |
| 3 | Retention policies enable auto-deletion of expired materials per configurable rules | FAILED | material_retention_days config exists (365 default) but no code consumes it -- no scheduled deletion, no management command, no enforcement function |
| 4 | Uploaded materials are indexed and available to the AI knowledge base for enhanced HCP simulation accuracy | VERIFIED | Text extraction (PDF/DOCX/XLSX) creates MaterialChunk records, search_chunks with latest-version subquery, get_material_context feeds into prompt_builder via material_context param, sessions.py wires material_ctx injection |
| 5 | All new code has unit tests with >=95% coverage maintained | VERIFIED | 34 tests pass (21 integration + 13 unit), covering upload, versioning, CRUD, archive/restore, search, auth guards, text extraction (PDF/DOCX/XLSX/chunking), prompt builder integration. All code passes ruff lint+format |
Score: 4/5 truths verified
Plan 01 Artifacts:
| Artifact | Expected | Status | Details |
|---|---|---|---|
| backend/app/models/material.py | TrainingMaterial, MaterialVersion, MaterialChunk ORM models | VERIFIED | 3 models with correct ForeignKey relationships, TimestampMixin, indexes |
| backend/app/schemas/material.py | Pydantic v2 request/response schemas | VERIFIED | MaterialCreate, MaterialUpdate, MaterialOut, MaterialListOut, MaterialVersionOut, MaterialChunkOut with ConfigDict(from_attributes=True) |
| backend/app/services/storage/__init__.py | StorageBackend protocol and get_storage factory | VERIFIED | Protocol class with save/read/delete/exists methods; factory returns LocalStorageBackend |
| backend/app/services/storage/local.py | Local filesystem storage adapter | VERIFIED | LocalStorageBackend with aiofiles for async I/O |
| backend/app/services/text_extractor.py | PDF/DOCX/XLSX text extraction | VERIFIED | extract_text dispatcher, _extract_pdf (page-level), _extract_docx (paragraph-chunked), _extract_xlsx (sheet-per-chunk), _chunk_text (2000 chars, 200 overlap) |
| backend/alembic/versions/b148c6bf1d9b_add_training_material_tables.py | Migration for 3 tables | VERIFIED | Creates training_materials, material_versions, material_chunks with indexes |
| backend/app/config.py | material_storage_path, material_max_size_mb, material_retention_days | VERIFIED | All three config fields present with defaults |
| backend/.env.example | Material env vars | VERIFIED | MATERIAL_STORAGE_PATH, MATERIAL_MAX_SIZE_MB, MATERIAL_RETENTION_DAYS present |
| backend/pyproject.toml | pypdf, python-docx, openpyxl, aiofiles | VERIFIED | All four dependencies in the [project] dependencies list |
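The artifact table above reports `_chunk_text` splitting extracted text into 2000-character chunks with a 200-character overlap. A minimal sketch of that kind of overlapping chunker (the function name and exact windowing here are assumptions, not the project's actual code):

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; neighboring chunks share `overlap` chars.

    Hypothetical sketch -- the project's _chunk_text may differ in detail.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        # Each window starts `size - overlap` chars after the previous one,
        # so the last `overlap` chars of a chunk repeat at the start of the next.
        chunks.append(text[start : start + size])
        start += size - overlap
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks, which matters for dense medical content.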
Plan 02 Artifacts:
| Artifact | Expected | Status | Details |
|---|---|---|---|
| backend/app/services/material_service.py | Material CRUD + versioning + chunk search | VERIFIED | 287 lines; upload_material, get_materials, search_chunks, get_material_context, archive/restore, asyncio.to_thread for text extraction |
| backend/app/api/materials.py | REST API router for material management | VERIFIED | 9 endpoints, /search registered before /{material_id} (Gotcha #3), POST 201, DELETE 204, admin-only |
| backend/tests/test_materials.py | Integration tests for material API | VERIFIED | 21 integration tests, all pass |
| backend/tests/test_text_extractor.py | Unit tests for text extraction | VERIFIED | 13 unit tests, all pass |
| backend/app/services/prompt_builder.py | material_context parameter for RAG | VERIFIED | build_hcp_system_prompt accepts material_context: list[str] |
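The report says search_chunks uses a latest-version subquery so that search only ever sees each material's newest version. An in-memory analogue of that semantics (plain dicts stand in for ORM rows; the real implementation is reportedly a SQLAlchemy subquery):

```python
def latest_version_chunks(versions: list[dict], chunks: list[dict]) -> list[dict]:
    """Keep only chunks belonging to each material's highest version_number.

    Illustrative sketch of the latest-version filter, not the project's code.
    """
    latest: dict[int, dict] = {}
    for v in versions:
        cur = latest.get(v["material_id"])
        if cur is None or v["version_number"] > cur["version_number"]:
            latest[v["material_id"]] = v
    latest_ids = {v["id"] for v in latest.values()}
    # Equivalent to joining chunks against a "max version per material" subquery.
    return [c for c in chunks if c["version_id"] in latest_ids]
```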
Plan 03 Artifacts:
| Artifact | Expected | Status | Details |
|---|---|---|---|
| frontend/src/types/material.ts | TypeScript types matching backend schemas | VERIFIED | TrainingMaterial, MaterialVersion, MaterialChunk, MaterialCreate, MaterialUpdate, PaginatedMaterials interfaces |
| frontend/src/api/materials.ts | Typed API client functions | VERIFIED | 8 functions covering all endpoints, multipart upload with progress callback |
| frontend/src/hooks/use-materials.ts | TanStack Query hooks | VERIFIED | useMaterials, useMaterial, useMaterialVersions, useVersionChunks, useUploadMaterial, useUpdateMaterial, useArchiveMaterial, useRestoreMaterial with cache invalidation |
| frontend/src/pages/admin/training-materials.tsx | Admin page for material management | VERIFIED | 820 lines; material table, search/product/archived filters, upload dialog with react-dropzone, version history dialog, chunks viewer, edit dialog, archive/restore confirmation |
| frontend/public/locales/en-US/admin.json | i18n strings for materials | VERIFIED | "materials" section with 30+ keys |
| frontend/public/locales/zh-CN/admin.json | Chinese i18n strings | VERIFIED | "materials" section with matching Chinese translations |
Plan 01 Key Links:
| From | To | Via | Status | Details |
|---|---|---|---|---|
| backend/app/models/material.py | backend/app/models/__init__.py | re-export in `__all__` | WIRED | TrainingMaterial, MaterialVersion, MaterialChunk in imports and `__all__` |
| backend/app/models/material.py | backend/alembic/env.py | import for migration discovery | WIRED | TrainingMaterial, MaterialVersion, MaterialChunk imported in env.py |
Plan 02 Key Links:
| From | To | Via | Status | Details |
|---|---|---|---|---|
| backend/app/api/materials.py | backend/app/services/material_service.py | service function calls | WIRED | material_service.upload_material, get_materials, search_chunks, etc. all called |
| backend/app/api/materials.py | backend/app/main.py | router registration | WIRED | materials_router imported from app.api and include_router called with api_prefix |
| backend/app/services/prompt_builder.py | backend/app/api/sessions.py | material_context parameter | WIRED | sessions.py calls material_service.get_material_context and passes the result to build_hcp_system_prompt |
Plan 03 Key Links:
| From | To | Via | Status | Details |
|---|---|---|---|---|
| frontend/src/pages/admin/training-materials.tsx | frontend/src/hooks/use-materials.ts | hook imports | WIRED | useMaterials, useMaterialVersions, useVersionChunks, useUploadMaterial, useUpdateMaterial, useArchiveMaterial, useRestoreMaterial imported and used |
| frontend/src/hooks/use-materials.ts | frontend/src/api/materials.ts | API function imports | WIRED | getMaterials, getMaterial, getMaterialVersions, getVersionChunks, uploadMaterial, updateMaterial, archiveMaterial, restoreMaterial imported |
| frontend/src/router/index.tsx | frontend/src/pages/admin/training-materials.tsx | route registration | WIRED | TrainingMaterialsPage imported and registered at path "materials" under admin children |
| Artifact | Data Variable | Source | Produces Real Data | Status |
|---|---|---|---|---|
| training-materials.tsx | materialsData | useMaterials -> getMaterials -> GET /api/v1/materials | API returns paginated DB query results via material_service.get_materials | FLOWING |
| training-materials.tsx | versions | useMaterialVersions -> getMaterialVersions -> GET /api/v1/materials/{id}/versions | API returns DB query via material_service.get_versions | FLOWING |
| training-materials.tsx | chunks | useVersionChunks -> getVersionChunks -> GET /api/v1/materials/{id}/versions/{vid}/chunks | API returns DB query via material_service.get_version_chunks | FLOWING |
| prompt_builder.py | material_context | material_service.get_material_context -> search_chunks -> DB query | Chunks from latest active version joined with TrainingMaterial.product filter | FLOWING |
| Behavior | Command | Result | Status |
|---|---|---|---|
| All backend modules importable | python3 -c "from app.models.material import ...; from app.services.material_service import ..." | "ALL IMPORTS OK" | PASS |
| 34 backend tests pass | pytest tests/test_materials.py tests/test_text_extractor.py -v | "34 passed in 6.29s" | PASS |
| Backend lint clean | ruff check + ruff format --check on all phase 05 files | "All checks passed!" + "10 files already formatted" | PASS |
| Frontend TypeScript compiles | npx tsc -b --noEmit (after npm ci) | Exit 0, no errors | PASS |
| Frontend Vite build | npm run build (after npm ci) | "built in 3.81s" with output files | PASS |
| Alembic migration exists | ls backend/alembic/versions/training_material | b148c6bf1d9b_add_training_material_tables.py found | PASS |
| Requirement | Source Plan | Description | Status | Evidence |
|---|---|---|---|---|
| CONTENT-01 | 05-01, 05-02, 05-03 | Admin can upload training materials (PDF, Word, Excel) organized by product and therapeutic area | SATISFIED | POST /materials endpoint accepts file + product + name, text extraction for PDF/DOCX/XLSX, frontend upload UI with react-dropzone |
| CONTENT-02 | 05-02 | Uploaded materials feed into AI knowledge base for more accurate HCP simulation (RAG-style grounding) | SATISFIED | material_service.get_material_context retrieves chunks by product, prompt_builder includes material_context in HCP system prompt, sessions.py injects material context automatically |
| CONTENT-03 | 05-01, 05-02, 05-03 | Training materials support versioning and folder organization | SATISFIED | MaterialVersion model with version_number, upload_material supports re-upload to create new versions, version history API and frontend dialog, folder organization via product field |
Orphaned requirements: None. REQUIREMENTS.md maps CONTENT-01, CONTENT-02, CONTENT-03 to Phase 3/5 (the traceability table says Phase 3 but the actual implementation is Phase 5 in the roadmap). All three are claimed and satisfied.
| File | Line | Pattern | Severity | Impact |
|---|---|---|---|---|
| backend/app/services/storage/azure_blob.py | 5, 16-25 | "Stub -- not yet implemented" + raise NotImplementedError | Info | Intentional production stub, not used in the current code path; factory returns LocalStorageBackend |
| backend/app/config.py | 57 | material_retention_days defined but never consumed | Warning | Config exists but no retention enforcement logic implements it |
Test: Navigate to /admin/materials, upload a PDF via drag-and-drop, verify the material appears in the list, then view its version history and extracted text chunks. Expected: The material table shows the uploaded file with product, version badge, and upload date; the upload dialog has a drag-and-drop zone and a progress bar during upload; the version history dialog lists versions with a "View Chunks" button; the chunks dialog shows extracted text with page labels. Why human: Visual layout, drag-and-drop UX, responsive behavior, and dialog rendering cannot be verified programmatically.
Test: Upload a training material for product "Brukinsa", then start a coaching session with a Brukinsa scenario and check whether the AI HCP responses reference the uploaded content. Expected: The HCP system prompt includes a "Product Training Materials (Reference Knowledge)" section with material excerpts, and AI responses are informed by the uploaded content. Why human: Requires running both backend and frontend, creating a scenario, and judging AI response quality.
Test: Switch the language to zh-CN, navigate to /admin/materials, and verify all labels are in Chinese. Expected: The page title shows "培训资料管理" (Training Material Management), and all button labels, column headers, and dialog text are in Chinese. Why human: Visual verification of translated strings in context.
One gap found out of five success criteria:
Success Criterion #3 (Retention policies): The material_retention_days config setting exists with a default of 365 days, establishing the configuration foundation. However, no service function, scheduled task, management command, or any other code path reads this setting and deletes materials older than the configured retention period. The retention feature is a config-only stub.
This is a partial gap. The infrastructure is in place (config setting, soft-delete via archive, created_at timestamps on models), but the actual retention enforcement logic is missing. A scheduler or management command needs to:
- Query materials where `created_at + retention_days < now()`
- Delete or archive expired materials
- Optionally clean up storage files
All other truths (upload, versioning, archiving, RAG knowledge base integration, test coverage) are fully verified with working code, passing tests, and complete wiring from frontend through API to database.
Verified: 2026-03-25T09:00:00Z Verifier: Claude (gsd-verifier)