2) Design Documentation - snuhcs-course/swpp-2025-project-team-03 GitHub Wiki

0. Table of Contents

1. Document Revision History

| Version | Date | Message |
|---|---|---|
| 0.1 | 2025-10-02 | Initial draft created |
| 1.1 | 2025-10-14 | Update API specification |
| 1.2 | 2025-10-14 | Add Table of Contents |
| 1.3 | 2025-10-16 | Update Testing Plan & ERD |
| 1.4 | 2025-10-16 | Update Frontend class diagram |
| 2.1 | 2025-10-23 | Update Testing Plan |
| 2.2 | 2025-10-30 | Update API specification (ver.2) |
| 2.3 | 2025-10-30 | Update ERD & Data models |
| 2.4 | 2025-10-30 | Update architecture diagram |
| 2.5 | 2025-11-01 | Update Iter3 final ERD & API specification (ver.2) |
| 2.6 | 2025-11-02 | Update Testing Plan & Results |
| 3.1 | 2025-11-09 | Added detailed architecture, data flow, and DevOps specifications |
| 3.2 | 2025-11-11 | Added Testing Plan |
| 3.3 | 2025-11-15 | Added Risk Mitigation & Update Architecture, ERD, WireFrame |
| 4.1 | 2025-11-16 | Update Testing Plan |
| 4.2 | 2025-11-20 | Update system architecture description & typo |
| 4.3 | 2025-11-26 | Update ERD & API Overview |
| 4.4 | 2025-11-29 | Added AI Feature Pipeline |
| 4.5 | 2025-11-30 | Restructure AI Feature Pipeline & update test results |

2. System Design

2.1 System Architecture

  • Frontend: Android (Jetpack Compose + Retrofit + Hilt DI)
  • Backend: Django REST Framework (team3 server)
  • Database: PostgreSQL (team3 server)
  • Storage: AWS S3 bucket
  • AI Integration: OpenAI GPT-4o/GPT-4o-mini, Google Cloud Speech-to-Text API
  • Pattern: Client-Server architecture with RESTful API integration and JWT authentication

Deployment Topology

  • Environments: Backend server running on a single Linux machine using Gunicorn (15 workers, 4 threads per worker; max 60 concurrent requests), with the PostgreSQL database hosted on the same machine and a single AWS S3 bucket for object storage.
  • Network: Application and database run in the same private network/VPC; ingress handled directly by Gunicorn behind Nginx.
  • Secrets Management: Environment variables via .env; migration to a managed secrets store is on the roadmap.
  • CI/CD: GitHub Actions handles format checks and automatically triggers deployments; server code is updated over SSH/rsync and the backend is restarted.

architecture-iter5-2

Runtime Components

| Layer | Component | Responsibility | Tech Stack | Observability |
|---|---|---|---|---|
| Presentation | Android App (Jetpack Compose) | Render dashboards, microphone capture, file management | Kotlin, Jetpack Compose, Hilt, Retrofit | Standard Android logging |
| API Gateway | Nginx + Gunicorn | TLS termination, reverse proxy, static asset serving | Nginx, Gunicorn (15 workers, 4 threads/worker) | Nginx access logs, Django request logging |
| Application | Django services (accounts, courses, assignments, questions, reports, submissions, etc.) | Role-based access control, AI pipeline orchestration (external APIs, PyTorch inference), and quiz lifecycle automation from parsed PDFs | Python 3.12, Django REST Framework | Django logging, health-check endpoints |
| Data | PostgreSQL 15 | Persist transactional data (assignments, enrollments, submissions, etc.) with timezone-aware timestamps | PostgreSQL, psycopg3 | pg_stat_statements, slow-query logging |
| AI Worker (External) | In-process external API calls (synchronous) | STT processing, prompt generation, semantic evaluation, and tail-question generation executed inside the request cycle | LangChain, LangGraph, OpenAI GPT-4o/GPT-4o-mini, Google Cloud Speech-to-Text | Django application logging, API latency tracking |
| ML Worker (Internal) | In-process model inference (synchronous) | Confidence scoring via XGBoost regression on acoustic and semantic features extracted from student responses | XGBoost 3.0.5, joblib, scikit-learn | Django application logging, inference latency tracking |
| Storage | AWS S3 | Store PDFs, assets, etc. | S3 Intelligent-Tiering, presigned URLs | S3 object access logs, lifecycle policies |

Data Flow Overview

  1. Teacher uploads PDF: File stored in S3 via presigned URL → Django backend extracts text using PyMuPDF → text chunks sent to GPT-4o for question generation → questions saved in database and associated with assignment.
  2. Student launches quiz: Android app fetches assignment metadata and personal assignment questions via /personal_assignments/{id}/questions/ → UI displays questions sequentially.
  3. Student records response: Audio recorded on device → uploaded as multipart/form-data to /personal_assignments/answer/ → backend runs Google Cloud STT → extracts acoustic and semantic features → XGBoost model predicts confidence score → GPT evaluates correctness → tail question generated via LangGraph if needed.
  4. Results storage: Submission saved with text_answer, state, eval_grade, started_at, submitted_at → personal assignment status updated to reflect progress.
  5. Teacher dashboards: Query aggregated statistics on demand via Django ORM (no materialized views); endpoints include /assignments/teacher-dashboard-stats/, /courses/classes/{id}/students-statistics/, /reports/{class_id}/{student_id}/ for curriculum analysis.

data-flow
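As an illustration of step 1, a client-side upload against the presigned URL could look roughly like the sketch below (the request/response field names such as upload_url are assumptions, not the actual contract; endpoint paths follow §3.1):

import requests

API = "https://voicetutor.example.com/api"      # hypothetical base URL
HEADERS = {"Authorization": "Bearer <JWT>"}     # JWT obtained from /auth/login/

# 1) Create the assignment; the backend returns a presigned S3 upload URL.
created = requests.post(
    f"{API}/assignments/create/", headers=HEADERS,
    json={"classId": 1, "title": "Photosynthesis review", "dueAt": "2025-12-01T09:00:00+09:00"},
).json()["data"]

# 2) Upload the PDF directly to S3 via the presigned URL (no auth header required).
with open("material.pdf", "rb") as f:
    requests.put(created["upload_url"], data=f, headers={"Content-Type": "application/pdf"})

# 3) Optionally confirm the object exists before question generation starts.
ok = requests.get(f"{API}/assignments/{created['id']}/s3-check/", headers=HEADERS).json()["success"]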

Adaptive Feedback Pipeline

  1. Feature Extraction (extract_all_features): Runs Google STT, prosodic analysis (extract_acoustic_features), and semantic coherence metrics (extract_features_from_script). Acoustic cues (pause ratios, f0 slope, silence %) and semantic embeddings feed downstream models, embodying Hasan et al.’s affect perception requirements.
  2. Confidence Inference (run_inference): XGBoost regression produces a continuous certainty score (1–8) and letter grade from the multimodal feature vector, following Pelánek & Jarušek’s guidance to combine behavioral and linguistic signals.
  3. Semantic Planner (planner_node): A temperature-0 GPT-4o-mini call with a minimal JSON prompt returns {"is_correct": true|false} based solely on meaning equivalence between the model answer and the transcript, tolerant of ASR artifacts.
  4. Rule-based Routing (decide_bucket_confidence): Correctness × confidence yields buckets A–D with configurable high/low threshold (default 3.45). This enforces the four-quadrant scaffold aligned with Abar et al.’s ZPD scaffolding findings.
  5. Strategy Selection (decide_plan): Adjusts follow-up frequency by bucket and recalled_time; high-performing learners can graduate to correctness-only responses after repeat success.
  6. Tail Question Actor (actor_node): Bucket-specific strategy strings plus few-shot exemplars guide GPT to emit concise Korean JSON payloads (topic, question, model answer, explanation, difficulty). Sanitizers enforce JSON validity and strip LaTeX/backslash artifacts for real-time use.
  7. State Graph Orchestration (langgraph.StateGraph): Planner → derive → actor/only_correct nodes compiled once and invoked via generate_tail_question(), enabling deterministic flows and granular unit tests.
| Bucket | Planner Verdict | Confidence Range | Strategy Focus | Actor Difficulty |
|---|---|---|---|---|
| A | Correct | ≥ high_thr | Enrichment / cross-concept transfer | hard |
| B | Correct | < high_thr | Reinforcement & confidence building | medium |
| C | Incorrect | ≥ high_thr | Misconception diagnosis & correction | medium |
| D | Incorrect | < high_thr | Foundational scaffolding & guided recall | easy |

bucket-flow
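A minimal sketch of this routing rule in Python, using the default threshold of 3.45 (the function names mirror decide_bucket_confidence and decide_plan above, but the bodies here are illustrative):

HIGH_THR = 3.45  # default high/low confidence threshold

def decide_bucket_confidence(is_correct: bool, confidence: float, high_thr: float = HIGH_THR) -> str:
    """Map the planner verdict x confidence score onto buckets A-D."""
    if is_correct:
        return "A" if confidence >= high_thr else "B"
    return "C" if confidence >= high_thr else "D"

def decide_plan(bucket: str, recalled_time: int) -> str:
    """Skip tail-question generation for bucket A or after repeated follow-ups."""
    return "ONLY_CORRECT" if bucket == "A" or recalled_time >= 4 else "ASK"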

External Integrations

  • OpenAI: GPT-4o / GPT-4o-mini for quiz generation, scoring rubric alignment, and tail question creation; invoked synchronously per answer submission.
  • Google Cloud STT: Speech-to-text transcription (16kHz mono) for uploaded WAV files using the latest_long model with word confidence.
  • AWS S3: Storage for assignment materials and audio artifacts via presigned URL uploads/checks.

2.2 Class Diagrams & Data Models


Frontend – Class Diagrams


fe-classdiagram-iter4-1

fe-classdiagram-iter4-2

fe-classdiagram-iter4-3

fe-classdiagram-iter4-4

fe-classdiagram-iter4-5

fe-classdiagram-iter4-6

fe-classdiagram-iter4-7

fe-classdiagram-iter4-8

fe-classdiagram-iter4-9

fe-classdiagram-iter4-10

fe-classdiagram-iter4-11

fe-classdiagram-iter4-12

fe-classdiagram-iter4-13

fe-classdiagram-iter4-14

fe-classdiagram-iter4-16

fe-classdiagram-iter4-17

fe-classdiagram-iter4-18


Backend – Service & Entity Structure


be-classdiagram-iter4-1

be-classdiagram-iter4-2

be-classdiagram-iter4-3

be-classdiagram-iter4-4

be-classdiagram-iter4-5

be-classdiagram-iter4-6

be-classdiagram-iter4-7

be-classdiagram-iter4-8

be-classdiagram-iter4-9


Database – ERD


erd-iter5

The diagram above is an Entity Relationship Diagram (ERD) outlining the Django models for our tutoring and AI quiz service. Each entity corresponds to a key feature in the platform, and the relationships define how classes, assignments, and personalized learning are connected.

Accounts are distinguished by role (student or teacher), and students register for classes through enrollments. Each class contains a foreign key to the subject, and assignments are created under these classes with structured deadlines and question sets. Materials such as PDFs can be attached to assignments for learning support.

Students receive personalized assignments that track progress and completion, while individual questions include explanations and model answers. Answers are stored with correctness and grading information, enabling both automatic evaluation and tail questions. Overall, this schema supports personalized learning by enabling classroom management, customized tasks, AI-based scoring, and continuous feedback.

Data Dictionary Highlights

| Table | Purpose | Key Columns | Notes |
|---|---|---|---|
| accounts_user | Stores teacher & student identities | id, email, role, display_name, locale, is_active | Email unique; soft delete tracked via is_active |
| courses_courseclass | Represents a class/cohort | id, teacher_id, subject_id, name, description, created_at | Student count computed via ORM (no stored trigger) |
| assignments_assignment | Assignment metadata | id, course_class_id, subject_id, title, description, total_questions, due_at, grade, created_at | S3 materials stored via related assignments_material |
| questions_question | Canonical question bank entry | id, assignment_id, content, answer, difficulty, curriculum_code | Supports multilingual content via a translation table |
| personal_assignments_personalassignment | Assignment per student | id, assignment_id, student_id, status, solved_num, started_at, submitted_at, created_at | Unique constraint on (student, assignment) |
| submissions_submission | Individual answer attempt (Answer model) | id, question_id, student_id, text_answer, state, eval_grade, started_at, submitted_at, created_at | Correctness + confidence stored on each answer |

Reports are generated on demand via reports.utils.analyze_achievement.parse_curriculum. The module identifies a student’s weaknesses by mapping their performance to the Korean curriculum achievement standards.

Domain Events

Currently there is no external event bus. Key state transitions occur inside the API layer:

  • AssignmentPublished: creates personal assignments for enrolled students and issues S3 upload keys.
  • SubmissionEvaluated: updates personal-assignment status and creates tail questions when needed.
  • ReportRequested: invokes curriculum analysis synchronously and returns the response.

3. Implementation Details

3.1 Backend API Overview

All endpoints are served under /api/ and return the standard envelope { "success": bool, "data": any, "message": str | null, "error": str | null }. Detailed API documentation and test interface are available at /swagger/.
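For example, a thin client helper that unwraps this envelope could look like the sketch below (illustrative only; only the envelope shape and the /core/health/ path are taken from this document):

import requests

BASE = "https://voicetutor.example.com/api"  # hypothetical host

def call(method: str, path: str, **kwargs):
    """Call an endpoint and unwrap the standard response envelope."""
    body = requests.request(method, f"{BASE}{path}", **kwargs).json()
    if not body["success"]:
        raise RuntimeError(body.get("error") or body.get("message") or "request failed")
    return body["data"]

print(call("GET", "/core/health/"))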

Core

Method Path Description
GET /core/health/ Health check endpoint
GET /core/error/ Intentionally raises an error (debug/testing)
GET /core/logs/tail Return the last n lines of nohup.out (n query param required)

Authentication

Method Path Description
POST /auth/signup/ Create teacher/student account and issue JWT pair
POST /auth/login/ Authenticate and obtain JWT pair
POST /auth/logout/ Client-initiated logout (stateless)
DELETE /auth/account/ Delete user account and all associated data

Assignments

Method Path Description
GET /assignments/ List assignments (teacherId, classId, status filters)
POST /assignments/create/ Create assignment and return S3 presigned upload URL
GET /assignments/{id}/ Retrieve assignment detail (materials included)
PUT /assignments/{id}/ Update assignment metadata
DELETE /assignments/{id}/ Delete assignment
POST /assignments/{id}/submit/ Placeholder endpoint for manual submission workflow
GET /assignments/{id}/questions/ List generated/base questions
GET /assignments/{id}/results/ Completion summary for personal assignments
GET /assignments/{assignment_id}/s3-check/ Validate uploaded PDF in S3
GET /assignments/teacher-dashboard-stats/ Aggregate counts for teacher dashboard

Questions

Method Path Description
POST /questions/create/ Generate base questions/summary from uploaded material

Courses – Students

Method Path Description
GET /courses/students/ List students (filter by teacherId/classId)
GET /courses/students/{id}/ Student profile with enrollments
PUT /courses/students/{id}/ Update student fields
GET /courses/students/{id}/classes/ List classes that the student is enrolled in

Courses – Classes

Method Path Description
GET /courses/classes/ List classes (optional teacherId)
POST /courses/classes/ Create class
GET /courses/classes/{id}/ Class detail
PUT /courses/classes/{id}/ Update class metadata
DELETE /courses/classes/{id}/ Delete class
GET /courses/classes/{id}/students/ List enrolled students
PUT /courses/classes/{id}/students/ Enroll student via id/name/email
GET /courses/classes/{classId}/students-statistics/ Per-student completion stats
DELETE /courses/classes/{classId}/student/{studentId} Remove student from class

Personal Assignments & Submissions

Method Path Description
GET /personal_assignments/ List personal assignments (student_id or assignment_id required)
GET /personal_assignments/{id}/questions/ Base + tail questions for personal assignment
GET /personal_assignments/{id}/statistics/ Aggregated stats
POST /personal_assignments/{id}/complete/ Mark as submitted
POST /personal_assignments/answer/ Upload WAV answer (multipart)
GET /personal_assignments/answer/ Fetch next question (personal_assignment_id query)
GET /personal_assignments/{id}/correctness/ List answered questions with correctness
GET /personal_assignments/recentanswer/ Most recent in-progress assignment for student

Reports & Catalog

Method Path Description
GET /reports/{class_id}/{student_id}/ Curriculum analysis report generated on demand
GET /catalog/subjects/ List available subjects

Tail Question Service (Implementation Notes)

  • Feature extraction (submissions/utils/feature_extractor/*) converts audio to transcripts plus acoustic/semantic features.
  • Confidence scoring (submissions/utils/inference.py) uses XGBoost to predict certainty.
  • Tail-question generation (submissions/utils/tail_question_generator/generate_questions_routed.py) routes planner verdicts through bucket strategies with LangGraph + GPT-4o-mini.
  • submissions/views.AnswerSubmitView orchestrates the flow synchronously; tests mock STT/LLM for deterministic coverage.

3.2 Frontend Architecture & State Management

  • Layering: MVVM with ViewModel mediating between UI Composables and repository layer. Repository abstracts Retrofit services and handles API communication.
  • Navigation: NavHost with composable-based navigation for student and teacher flows; route-based navigation with parameters.
  • State Handling: StateFlow and MutableStateFlow for reactive state management; Compose rememberSaveable preserves quiz progress through configuration changes.
  • Offline Strategy: OfflineManager class implemented with file-based caching and pending action queue.
  • Dependency Injection: Hilt modules for API clients, repositories, ViewModels. Testing uses @HiltAndroidTest with custom HiltTestRunner and FakeApiService for deterministic tests.
  • File Management: FileManager handles PDF and audio file operations with URI-based file saving and type detection.

3.3 DevOps, Observability & Governance

  • Infrastructure: Single Linux VM running Nginx + Gunicorn (15 workers, 4 threads per worker, max 60 concurrent requests) + PostgreSQL 15; deployment scripted via shell/rsync.
  • Monitoring: Basic health checks (/core/health/) + Django request logging; Log tail endpoint (/core/logs/tail) available for debugging.
  • Logging: Structured logs to stdout (captured by journald); sensitive fields scrubbed manually in log statements.
  • Security Controls: HTTPS termination at Nginx, JWT authentication, and restrictive S3 bucket policies.
  • Backup & Recovery: Regular database backups; S3 artifact storage with Intelligent-Tiering.
  • Compliance: Raw user audio is not stored; consent metadata and request logs are tracked for auditability as part of the compliance roadmap.
  • Frontend Code Quality: Spotless for Kotlin formatting, ktlint, JaCoCo for coverage verification.
  • Backend Code Quality: isort + black + ruff formatting, pre-commit hooks & workflows for automatic checks. pytest-cov for testing.

3.4 Frontend–Backend Integration Status

  • Implemented & in use:
    • Authentication (signup, login, logout with JWT)
    • Assignment CRUD with S3 presigned URLs for PDF uploads
    • Personal assignment answer submission flow with audio file uploads
    • Tail-question generation and retrieval
    • Class and student management (CRUD operations)
    • Teacher dashboard statistics
    • Curriculum analysis reports (/reports/<class_id>/<student_id>/)
  • Integration notes:
    • Frontend uses Retrofit with JWT token injection via Interceptor
    • Error responses are mapped to user-friendly Korean messages via ErrorMessageMapper
    • File uploads use S3 presigned URLs for PDFs and multipart form data for audio files
    • Backend serves presigned S3 URLs to the frontend
    • Network timeouts configured: 60s connect, 120s read, 60s write
  • Known limitations:
    • Room database for local caching is not currently used

4. AI Feature Pipeline

The VoiceTutor system employs a multi-stage AI pipeline to extract meaningful signals from student responses, generating adaptive questions and optimizing curriculum alignment. This section details the question generation workflows, confidence analysis, and achievement standard inference.

4.1 Base Question Generation

Base questions are generated from PDF learning materials via a multi-stage pipeline orchestrated in backend/questions/views.QuestionCreateView.post:

  1. PDF summarization: The uploaded PDF is first converted to text via OCR/extraction, then summarized using GPT-4o to produce a concise abstract capturing key concepts.
  2. Achievement code inference: The summary is passed to infer_relevant_achievement_codes_from_summary() (detailed in §4.3), which returns 1-5 curriculum achievement standard codes aligned with the material content.
  3. Question generation: The summary and inferred achievement codes are provided as context to GPT-4o via a structured prompt template (multi_quiz_prompt in backend/questions/utils/base_question_generator.py).

Prompt Engineering:

  • System role: "Expert educational quiz designer" who creates review questions aligned with learning materials.
  • Few-shot examples: Three hand-crafted exemplars demonstrating high-quality questions with clear educational intent (e.g., conceptual reasoning rather than rote recall).
  • Task specification: Generate n distinct questions (typically 3-5 per material), each focusing on a different topic within the summary.
  • Quality constraints:
    • Each question must assess understanding or reasoning, not mere factual recall.
    • Avoid generic phrasing (e.g., "Why is X important?").
    • Questions should be suitable for elementary/middle school Korean students.
    • Difficulty levels (easy/medium/hard) are explicitly labeled.
    • No LaTeX notation; math is written in plain text (e.g., "x > 4").
  • Achievement alignment: Questions should reflect the goals represented by the inferred achievement codes.
  • Output format: JSON array of objects, each containing {question, model_answer, explanation, difficulty}.

Model configuration: GPT-4o with temperature=0.5 (moderate creativity) and 90-second timeout. The system retries up to 3 times on API failures or malformed JSON outputs.
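A condensed sketch of this call-and-retry pattern, assuming the official openai Python client; build_quiz_prompt is a hypothetical stand-in for multi_quiz_prompt and the JSON validation is illustrative:

import json
import time

from openai import OpenAI

client = OpenAI(timeout=90)  # 90-second request timeout

def generate_base_questions(summary: str, achievement_codes: list[str], n: int = 5) -> list[dict]:
    """Ask GPT-4o for n questions and retry on API errors or malformed JSON."""
    prompt = build_quiz_prompt(summary, achievement_codes, n)  # hypothetical prompt builder
    for attempt in range(3):
        try:
            resp = client.chat.completions.create(
                model="gpt-4o",
                temperature=0.5,
                messages=[
                    {"role": "system", "content": "You are an expert educational quiz designer."},
                    {"role": "user", "content": prompt},
                ],
            )
            questions = json.loads(resp.choices[0].message.content)
            if isinstance(questions, list) and len(questions) == n:
                return questions  # items: {question, model_answer, explanation, difficulty}
        except Exception:
            pass
        time.sleep(2 ** attempt)  # simple backoff before the next attempt
    raise RuntimeError("question generation failed after 3 attempts")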

Factory instantiation: Generated questions are persisted via BaseQuestionFactory.create_question(), which sets recalled_num=0 and base_question=None to mark them as root nodes in the question chain.

4.2 Tail Question Generation

Tail (follow-up) questions are dynamically generated in response to student answers using a LangGraph workflow (backend/submissions/utils/tail_question_generator/generate_questions_routed.py). The workflow implements a planner-actor architecture with rule-based routing to adapt question difficulty and focus based on student performance.

Workflow Architecture:

The LangGraph state machine consists of four nodes:

  1. Planner node: A GPT-4o-mini agent (temperature=0, JSON mode) evaluates the student's transcribed answer against the model answer using chain-of-thought reasoning. It outputs:

    • is_correct: Binary correctness judgment.
    • reasoning: Step-by-step justification of the assessment.
  2. Derive & route node: Deterministic Python function that:

    • Combines is_correct (from planner) with the ML-inferred confidence score (eval_grade) to assign the student to one of four buckets via a correctness × confidence matrix:
      • Bucket A: Correct + High confidence (≥ high_thr) → "ONLY_CORRECT" (no follow-up).
      • Bucket B: Correct + Low confidence (< high_thr) → "ASK" (reinforcement question).
      • Bucket C: Incorrect + High confidence → "ASK" (misconception correction).
      • Bucket D: Incorrect + Low confidence → "ASK" (scaffolding/hint question).
    • Selects a strategy and few-shot example tailored to the bucket (e.g., "Ask a slightly harder conceptual extension" for B, "Identify the specific misconception and provide a contrasting example" for C).
    • Decides the plan: "ASK" (generate tail question) vs. "ONLY_CORRECT" (skip generation if recalled_time ≥ 4 or bucket A).
  3. Actor node (conditional on plan == "ASK"): A GPT-4o-mini agent (temperature=0.7, JSON mode) generates the tail question. The prompt includes:

    • The original question, model answer, and student answer.
    • The selected strategy and few-shot example.
    • Student learning context: A formatted summary of the student's recent performance across the assignment, including:
      • question_chain: Previous attempts on this question number (base + prior tails) with correctness, confidence, and difficulty.
      • overall_accuracy: Assignment-wide accuracy rate.
      • avg_confidence: Mean confidence score.
      • recent_trend: Sequence of recent correctness outcomes (newest first).
      • weak_concepts: Optionally provided list of concepts where the student struggles.
    • Personalization instructions: The actor is instructed to:
      • Avoid repeating questions the student has already answered correctly.
      • Address recurring misconceptions with contrasting examples.
      • Adjust difficulty based on the student's accuracy trend (increase if improving, simplify if struggling).
      • Tailor the question's scaffolding level to the student's confidence patterns.
    • Output format: JSON object with {topic, question, model_answer, explanation, difficulty}.
  4. Only_correct node (conditional on plan == "ONLY_CORRECT"): Returns a minimal result with no generated question, incrementing recalled_time to track progress.

Routing Logic: After the derive node, the graph conditionally routes to either actor (if ASK) or only_correct (if ONLY_CORRECT), then terminates.

Return Value: The final state contains:

  • is_correct, confidence, bucket, plan: Metadata about the student's performance.
  • recalled_time: Incremented count of follow-ups on this question chain (capped at 4).
  • tail_question: The generated question object (or empty dict if skipped).
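The graph wiring itself can be sketched as follows (the node functions are the ones described above and are omitted here; the state schema is an assumption made for illustration):

from typing import TypedDict

from langgraph.graph import END, StateGraph

class TailState(TypedDict, total=False):
    question: str
    model_answer: str
    student_answer: str
    confidence: float
    recalled_time: int
    is_correct: bool
    bucket: str
    plan: str
    tail_question: dict

# planner_node, derive_and_route, actor_node, only_correct_node are the node
# functions described above; their bodies are omitted in this sketch.
builder = StateGraph(TailState)
builder.add_node("planner", planner_node)
builder.add_node("derive", derive_and_route)
builder.add_node("actor", actor_node)
builder.add_node("only_correct", only_correct_node)

builder.set_entry_point("planner")
builder.add_edge("planner", "derive")
builder.add_conditional_edges(
    "derive",
    lambda state: state["plan"],  # "ASK" or "ONLY_CORRECT"
    {"ASK": "actor", "ONLY_CORRECT": "only_correct"},
)
builder.add_edge("actor", END)
builder.add_edge("only_correct", END)

graph = builder.compile()  # compiled once and reused by generate_tail_question()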

4.2.1 Multimodal Feature Extraction

When a student submits a spoken answer via WAV audio file, the system extracts a 25-dimensional feature vector across three complementary modalities: acoustic, linguistic, and semantic. The orchestration is performed by extract_all_features() in backend/submissions/utils/feature_extractor/extract_all_features.py, which sequentially invokes Google Cloud Speech-to-Text API for transcription, followed by feature extractors operating on both the audio signal and the resulting transcript.

Acoustic Features (7 dimensions) are extracted from the raw audio waveform using extract_acoustic_features.py:

  • Prosodic features: Fundamental frequency (f0) statistics computed via PyWorld's DIO + StoneMask algorithm, including min_f0_hz, max_f0_hz, range_f0_hz, and linear slope measures (tot_slope_f0_st_per_s, end_slope_f0_st_per_s) in semitones per second. These capture pitch variation patterns correlated with speaker confidence and affective state.
  • Temporal features: Speech rate (voc_speed in words/sec) and silence distribution (percent_silence), derived from energy-based voice activity detection with morphological filtering to remove short spurious segments.
  • Pause features: pause_cnt_ratio computed as the ratio of detected pauses (silent intervals ≥ 0.5s) to word count.

Linguistic Features (5 dimensions) are extracted from the STT transcript using extract_features_from_script.py:

  • Disfluency markers: repeat_cnt_ratio (repetition count ratio) and filler_words_cnt_ratio (frequency of Korean filler words such as "어", "음", "그니까" detected via rule-based lexicon matching with fuzzy tolerance).
  • Lexical metrics: word_speed (words per sentence), avg_sentence_len (average sentence length in words).

Semantic Features (13 dimensions) are computed by encoding the transcript into sentence embeddings using a locally cached Korean SBERT model (snunlp/KR-SBERT-V40K-klueNLI-augSTS, a Sentence-BERT variant fine-tuned on Korean NLI and STS datasets):

  • Adjacent sentence similarity: Distribution statistics (adj_sim_mean, adj_sim_std, adj_sim_p10, adj_sim_p50, adj_sim_p90) of cosine similarities between consecutive sentence embeddings, capturing local coherence.
  • Similarity thresholding: adj_sim_frac_high and adj_sim_frac_low measure the fraction of adjacent pairs exceeding or falling below predefined similarity thresholds (0.85 and 0.50 respectively), indicating redundancy vs. topic shifts.
  • Global coherence: topic_path_len computes the cumulative Euclidean distance along the sentence embedding trajectory, reflecting semantic drift. dist_to_centroid_mean and dist_to_centroid_std measure dispersion from the response centroid (mean embedding vector), with coherence_score defined as 1.0 - dist_to_centroid_mean.
  • Segmented coherence/diversity: The response is divided into three equal temporal segments. intra_coh averages within-segment cosine similarities (intra-segment coherence), while inter_div computes 1.0 - mean cross-segment similarity (inter-segment diversity).

The pipeline gracefully handles STT failures by substituting a fallback string ("음성 인식 실패") and propagating default feature values (0.0), ensuring robustness to noisy or unintelligible audio inputs.
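A simplified sketch of the adjacent-similarity and centroid-based features, assuming the sentence-transformers API for the KR-SBERT model (thresholds follow the values above; the real extractor computes the full 13-dimensional set):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("snunlp/KR-SBERT-V40K-klueNLI-augSTS")

def semantic_features(sentences: list[str]) -> dict:
    """Subset of the semantic coherence features; assumes two or more sentences."""
    emb = model.encode(sentences, normalize_embeddings=True)   # (n, d) unit-norm rows
    adj_sim = np.sum(emb[:-1] * emb[1:], axis=1)               # cosine sim of consecutive sentences
    centroid = emb.mean(axis=0)
    dist = np.linalg.norm(emb - centroid, axis=1)              # dispersion from the response centroid
    return {
        "adj_sim_mean": float(adj_sim.mean()),
        "adj_sim_std": float(adj_sim.std()),
        "adj_sim_frac_high": float((adj_sim > 0.85).mean()),   # redundancy indicator
        "adj_sim_frac_low": float((adj_sim < 0.50).mean()),    # topic-shift indicator
        "topic_path_len": float(np.linalg.norm(np.diff(emb, axis=0), axis=1).sum()),
        "dist_to_centroid_mean": float(dist.mean()),
        "coherence_score": float(1.0 - dist.mean()),
    }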

4.2.2 Confidence Scoring via XGBoost Regression

The extracted 25-dimensional feature vector is fed into a pre-trained XGBoost (v3.0.5) regression model to produce a continuous confidence score. The model, stored as backend/submissions/machine/model.joblib, was trained on an annotated dataset of student speech with human-labeled scoring. Details are described in 4.2.3.

Inference Pipeline (backend/submissions/utils/inference.py):

  1. Feature preprocessing: Ratio-based features (repeat_cnt_ratio, filler_words_cnt_ratio, pause_cnt_ratio) are computed on-the-fly by dividing raw counts by word_cnt. Missing or None values are imputed to 0.0 to prevent null propagation.
  2. Feature alignment: The 25 features are arranged into a pandas DataFrame with columns ordered according to FEATURE_COLUMNS (the exact sequence used during training), ensuring input schema consistency.
  3. Regression prediction: XGBoost predicts a continuous score pred_cont ∈ [1, 8], which is rounded to the nearest integer pred_rounded and clipped to the valid range.
  4. Grade mapping: The rounded score is mapped to a letter grade via the rule: A (7-8), B (5-6), C (3-4), D (1-2).
    Invalid or failed STT transcripts (detected via marker strings such as "음성 인식 실패" or empty content) trigger an automatic minimum score of 1 (grade D).
  5. Output: A dictionary of confidence score (pred_cont, pred_rounded, pred_letter) is returned for downstream routing decisions.

The model exploits gradient-boosted decision trees to capture nonlinear interactions among acoustic, linguistic, and semantic features, achieving inference latency < 0.4 seconds (excluding feature extraction). The joblib-serialized model is loaded once per server lifetime and cached in memory for efficiency.
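A condensed sketch of this inference path (the model path, column ordering, and grade cut-offs are taken from the description above; the loading and caching details are illustrative):

import joblib
import numpy as np
import pandas as pd

MODEL_PATH = "backend/submissions/machine/model.joblib"
_model = None  # loaded once per process and reused across requests

def _get_model():
    global _model
    if _model is None:
        _model = joblib.load(MODEL_PATH)
    return _model

def run_inference(features: dict, feature_columns: list[str]) -> dict:
    """Predict the 1-8 confidence score and map it to a letter grade."""
    row = {col: float(features.get(col) or 0.0) for col in feature_columns}  # impute missing -> 0.0
    X = pd.DataFrame([row], columns=feature_columns)   # preserve the training column order
    pred_cont = float(_get_model().predict(X)[0])
    pred_rounded = int(np.clip(round(pred_cont), 1, 8))
    pred_letter = "DDCCBBAA"[pred_rounded - 1]          # 1-2 -> D, 3-4 -> C, 5-6 -> B, 7-8 -> A
    return {"pred_cont": pred_cont, "pred_rounded": pred_rounded, "pred_letter": pred_letter}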

4.2.3 Training the Confidence Scoring Model

The supervised learning pipeline trains an XGBoost gradient boosting regressor to predict numerical grade scores from the 25-dimensional feature vector. The model formulates confidence scoring as a regression problem, mapping feature vectors to a continuous 1–8 scale that is subsequently discretized into letter grades.

Dataset Construction

Training data is collected from annotated presentation recordings stored as JSON label files (*_presentation.json) containing both extracted features and human-assigned grades. The dataset is partitioned into training and validation splits via directory structure (train/ and valid/). Recordings with word_cnt = 0 (which would cause division by zero in ratio computations) and samples with any missing feature values are excluded to ensure data quality. Human evaluators assigned grades on an 8-point scale: { A+: 8, A0: 7, B+: 6, B0: 5, C+: 4, C0: 3, D+: 2, D0: 1 }, which serve as continuous regression targets.
More details about the original dataset can be found in 9.1 Public Speech Dataset.

Model Architecture

We employ XGBoost (XGBRegressor) with histogram-based tree construction (tree_method='hist') for computational efficiency. The regression objective minimizes squared error (reg:squarederror), with RMSE as the evaluation metric. The architecture balances model capacity against overfitting through depth constraints and regularization.

Class Imbalance Handling

Presentation grade distributions are typically skewed toward middle grades (B, C), with fewer samples at extremes (A, D). To prevent the model from biasing toward majority classes, we apply inverse-frequency sample weighting with power-law smoothing:

$$w_c = \left(\frac{\max_k n_k}{n_c}\right)^\alpha$$

where $n_c$ denotes the count of grade $c$, and $\alpha = 0.5$ provides square-root smoothing to prevent over-correction. Weights are normalized to unit mean and clipped to $[0.5, 3.0]$ to bound the influence of extreme minority classes. These per-sample weights are passed to XGBRegressor.fit() via the sample_weight parameter.
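The weighting scheme can be written compactly as follows (a sketch using numpy/pandas; variable names are illustrative and the training data loading is omitted):

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

def grade_sample_weights(y: pd.Series, alpha: float = 0.5) -> np.ndarray:
    """Inverse-frequency weights with power-law smoothing, unit-mean normalization, and clipping."""
    counts = y.value_counts()                   # n_c for each grade value
    w_class = (counts.max() / counts) ** alpha  # (max_k n_k / n_c) ** alpha
    w = y.map(w_class).to_numpy(dtype=float)
    w = w / w.mean()                            # normalize to unit mean
    return np.clip(w, 0.5, 3.0)                 # bound the influence of extreme minority grades

model = XGBRegressor(objective="reg:squarederror", tree_method="hist", eval_metric="rmse")
# X_train holds the 25-dimensional feature vectors, y_train the 1-8 grade targets:
# model.fit(X_train, y_train, sample_weight=grade_sample_weights(y_train))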

Feature Importance

The trained XGBoost model reveals which features contribute most to confidence score prediction. The top 10 features by gain-based importance are:

Top feature importances:
                   voc_speed: 0.2167
                  word_speed: 0.1133
            repeat_cnt_ratio: 0.0863
      filler_words_cnt_ratio: 0.0630
            avg_sentence_len: 0.0351
                   min_f0_hz: 0.0328
                   intra_coh: 0.0308
             pause_cnt_ratio: 0.0306
            adj_sim_frac_low: 0.0287
        dist_to_centroid_std: 0.0275

The feature importance ranking reveals that speech fluency is the dominant predictor of confidence. The top four features—voc_speed, word_speed, repeat_cnt_ratio, and filler_words_cnt_ratio—collectively account for approximately 50% of the model's predictive power, all of which measure how smoothly and continuously a student speaks rather than what they say. This suggests that confident students maintain steady speech pacing with minimal repetitions and filler words, while hesitant or uncertain students exhibit slower, fragmented speech patterns with frequent disfluencies.

Linguistic structure features (avg_sentence_len, pause_cnt_ratio) and the acoustic feature min_f0_hz contribute moderately, indicating that sentence completeness, pause frequency, and vocal pitch stability provide additional discriminative signals. The presence of semantic coherence features (intra_coh, adj_sim_frac_low, dist_to_centroid_std) in the top 10 demonstrates that logical organization and topical consistency also matter—students who stay on-topic and maintain coherent reasoning tend to score higher.

Overall, the model effectively captures the intuition that confident responses are characterized by fluent delivery, well-formed sentences, and coherent content structure, while uncertain responses manifest through speech disfluencies, fragmented phrasing, and disjointed topic flow.

4.3 Report Generation

4.3.1 Achievement Standard Inference via RoBERTa-GPT Hybrid

To align generated questions with the Korean 2022 national curriculum standards (1,071 unique achievement codes spanning subjects and grade levels), the system employs a two-stage filtering pipeline that balances accuracy with computational efficiency.

Problem Formulation: Given a PDF summary and metadata (subject, grade), the goal is to identify a small subset (≈20) of achievement codes that best represent the learning objectives. Direct GPT API calls with the full standard list (50-100 codes per subject/school level) would incur prohibitive token costs and latency.

Stage 1: RoBERTa Pre-filtering (backend/reports/utils/achievement_inference.py):

  • Model architecture: A multi-class classifier based on klue/roberta-large (Korean language understanding model, ~355M parameters) with a custom classification head:
    • RoBERTa encoder produces contextualized token embeddings.
    • CLS token pooling extracts a fixed-size sentence representation.
    • Dropout (p=0.1) for regularization.
    • Linear classifier projects to 1,071 logits (one per achievement code).
  • Inference process:
    1. The PDF summary is tokenized (max length 256 tokens).
    2. The RoBERTa model computes softmax probabilities over all 1,071 codes.
    3. Only codes corresponding to the current subject/school level (pre-filtered from CSV, typically 50-200 candidates) are retained.
    4. Top-k=20 codes with highest probabilities are selected.
  • Caching: The model and tokenizer are loaded once and cached in _model_cache (thread-safe with lock) to amortize loading overhead across requests.
  • Fallback: If RoBERTa returns <3 results or fails, the system falls back to all subject/school-filtered standards (graceful degradation).
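The classification head and the masked top-k selection can be sketched as follows (assuming PyTorch + HuggingFace Transformers; the actual module in achievement_inference.py may differ in detail):

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class AchievementClassifier(nn.Module):
    """klue/roberta-large encoder + dropout + linear head over the 1,071 achievement codes."""

    def __init__(self, num_codes: int = 1071):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("klue/roberta-large")
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_codes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(self.dropout(hidden[:, 0]))  # CLS-token pooling

@torch.no_grad()
def prefilter_codes(model, tokenizer, summary: str, candidate_idx: list[int], k: int = 20):
    """Softmax over all codes, restricted to the subject/school-level candidates, then top-k."""
    enc = tokenizer(summary, truncation=True, max_length=256, return_tensors="pt")
    probs = torch.softmax(model(enc["input_ids"], enc["attention_mask"]), dim=-1)[0]
    masked = torch.full_like(probs, float("-inf"))
    masked[candidate_idx] = probs[candidate_idx]       # keep only the pre-filtered candidates
    topk = torch.topk(masked, k=min(k, len(candidate_idx)))
    return topk.indices.tolist()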

Stage 2: GPT-4o Final Selection (backend/questions/utils/achievement_mapper.py):

  • Input: PDF summary + ≤20 RoBERTa-filtered standards (code, content, grade).
  • Prompt engineering:
    • System message: "Expert in analyzing curriculum achievement standards."
    • User prompt specifies:
      • The summary text.
      • The filtered standards list (formatted as Code: <code>\nContent: <content>\nGrade: <grade>).
      • Selection rules: Return up to max_codes=5 codes that are most directly relevant to the summary. Allow 1-2 codes if only those are clearly relevant. Return empty array [] if none match.
    • Output format: JSON array of achievement codes (e.g., ["2과03-01", "2과03-02"]).
  • Model configuration: GPT-4o with temperature=0.1 for consistent, low-variance selection.
  • Post-processing: The returned codes are validated against the filtered set (to prevent hallucinated codes) and mapped to their content strings for storage.

Performance Improvements:

  • Token reduction: Sending ≤20 standards instead of 50-200 reduces prompt size by 61.3% on average, directly cutting API costs.
  • Latency reduction: Smaller context windows enable faster GPT inference (22% faster per question).
  • Accuracy preservation: RoBERTa's high recall (trained on curriculum text) ensures relevant codes remain in the top-20, while GPT's reasoning refines precision.

Integration: Achievement codes are inferred in QuestionCreateView.post before calling generate_base_quizzes(), and the codes are passed as context to the question generation prompt. The final codes are stored in the Question.achievement_code field for reporting and analytics (enabling teachers to track student performance by curriculum standard).

4.3.2 Training the RoBERTa Achievement Classifier

To infer Korean national curriculum achievement standards from input text (e.g., PDF summaries), we fine-tuned a supervised multiclass classifier based on klue/roberta-large, a Korean-language transformer model. The training dataset was built from AI Hub - Curriculum-Level Subject Dataset.

Dataset Construction

The training dataset was built from a labeled corpus where each input is either sentence- or paragraph-level text aligned with a single achievement standard code (e.g., 9과12-04). Each code is associated with a natural language description (content) derived from the official 2022 Ministry of Education achievement standard document.

  • Input: Summary-level educational text (from teacher materials, textbook paragraphs, etc.)
  • Label: Single achievement standard code (1,071 unique labels across subjects and grade levels)

In total, 1,071 achievement standards, each paired with 80 sample texts, were used for training.
More details about the dataset can be found in Section 9.2: Curriculum-Level Subject Dataset.

Model Architecture & Training Pipeline

  • Base Model: klue/roberta-large (355M parameters)
  • Classification Head: Linear projection over the [CLS] token for 1,071 classes (softmax output)
  • Input Length: Truncated or padded to 256 tokens
  • Loss Function: Cross-entropy loss (standard for multiclass classification)
  • Optimizer: AdamW
  • Training Epochs: 5
  • Learning Rate: 2e-5 with linear warm-up (10%) and decay
  • Batch Size: 32 (gradient accumulation used if GPU memory limited)
  • Early Stopping: Based on validation loss (patience = 2 epochs)
  • Evaluation Metric: Top-k accuracy (k = 1, 5, 10, 20)

Training was performed using PyTorch and HuggingFace Transformers. The final model was serialized with torch.save() and exported alongside its tokenizer and code mappings.
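Under the hyperparameters listed above, the fine-tuning setup can be sketched with the HuggingFace Trainer API roughly as follows (dataset preparation is omitted, and the built-in sequence-classification head stands in for the custom head described above):

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("klue/roberta-large", num_labels=1071)

args = TrainingArguments(
    output_dir="achievement-classifier",
    num_train_epochs=5,
    learning_rate=2e-5,
    warmup_ratio=0.1,                   # 10% linear warm-up followed by linear decay
    lr_scheduler_type="linear",
    per_device_train_batch_size=32,     # use gradient_accumulation_steps if GPU memory is limited
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,             # texts tokenized to max length 256 (preparation omitted)
    eval_dataset=valid_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()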

Overall, the top-20 accuracy reached approximately 96-97%. Most of the remaining misclassifications involved cases that are also difficult for humans to label accurately, such as the example below.

Upon manual inspection, we found that even when the model's prediction differed from the ground truth, it often produced a reasonable and semantically valid result.

Input Text: 한글 맞춤법은 '표준어를 소리대로 적되, 어법에 맞도록 함'을 원칙으로 삼습니다.
True Code: [9국04-06] 한글 맞춤법의 기본 원리와 내용을 이해하고 국어생활에 적용한다.
Pred Code: [10국04-04] 한글 맞춤법의 기본 원리와 내용을 이해한다.

Inference Procedure

During inference, input text is tokenized and passed to the trained classifier to produce softmax probabilities over all 1,071 classes. The top-k predictions are selected and filtered to ensure consistency with codes present in the training set. Each predicted code is mapped back to its curriculum description using a lookup dictionary constructed from the CSV.


5. Testing Plan (Iteration 4+)

  • We use pre-commit to automatically enforce code quality before each commit.

5.1 Unit Testing & Integration Testing

Schedule & Frequency

  • When:
    • Conducted continuously during feature development.
    • Each developer must run unit tests before creating a PR to develop branch.
    • Integration tests are executed every weekend.
  • Frequency:
    • Unit Tests → Daily (local development)
    • Integration Tests → Weekly (managed by PM)

Responsibilities

  • Developers:
    • Write and maintain unit tests for their assigned modules.
    • Ensure tests pass before opening PRs.
  • PM:
    • Reviews PRs and enforces testing compliance.
    • Executes weekly integration tests.

5.1.1 Backend Testing

  • Frameworks: Pytest
  • Architecture: MTV, but templates are not used on the backend (API-only).
  • Coverage goal: > 90% coverage per component.
  • Testing Result
    • Passed 532 tests and achieved 93% total coverage.
    • Per component coverage: Models: 100%, Views: 97.7%
    • Excluded manage.py, migrations, dummy-DB management commands, and test files under tests/.
---------- coverage: platform win32, python 3.13.5-final-0 -----------
Name                                                                     Stmts   Miss  Cover
--------------------------------------------------------------------------------------------
accounts\admin.py                                                           12      0   100%
accounts\apps.py                                                             4      0   100%
accounts\models.py                                                          32      0   100%
accounts\request_serializers.py                                             13      0   100%
accounts\serializers.py                                                     31      0   100%
accounts\urls.py                                                             3      0   100%
accounts\views.py                                                          102     23    77%
assignments\admin.py                                                         4      0   100%
assignments\apps.py                                                          4      0   100%
assignments\models.py                                                       31      0   100%
assignments\request_serializers.py                                          17      0   100%
assignments\serializers.py                                                  31      0   100%
assignments\urls.py                                                          3      0   100%
assignments\views.py                                                       200      0   100%
catalog\admin.py                                                             3      0   100%
catalog\apps.py                                                              4      0   100%
catalog\models.py                                                            6      0   100%
catalog\request_serializers.py                                               3      0   100%
catalog\serializers.py                                                       6      0   100%
catalog\urls.py                                                              4      0   100%
catalog\views.py                                                            15      0   100%
core\admin.py                                                                0      0   100%
core\apps.py                                                                 4      0   100%
core\authentication.py                                                      26      4    85%
core\models.py                                                               0      0   100%
core\urls.py                                                                 4      0   100%
core\views.py                                                               13      0   100%
courses\admin.py                                                             4      0   100%
courses\apps.py                                                              4      0   100%
courses\models.py                                                           25      0   100%
courses\request_serializers.py                                              20      0   100%
courses\serializers.py                                                      64      0   100%
courses\urls.py                                                              3      0   100%
courses\views.py                                                           285     11    96%
questions\admin.py                                                           3      0   100%
questions\apps.py                                                            4      0   100%
questions\factories.py                                                      25      5    80%
questions\models.py                                                         23      0   100%
questions\request_serializers.py                                            11      0   100%
questions\serializers.py                                                    23      0   100%
questions\urls.py                                                            3      0   100%
questions\utils\achievement_mapper.py                                      107      0   100%
questions\utils\base_question_generator.py                                  79      8    90%
questions\utils\pdf_to_text.py                                              35      1    97%
questions\views.py                                                         125      3    98%
reports\admin.py                                                             0      0   100%
reports\apps.py                                                              4      0   100%
reports\models.py                                                            0      0   100%
reports\serializers.py                                                      17      0   100%
reports\urls.py                                                              4      0   100%
reports\utils\achievement_inference.py                                     160     15    91%
reports\utils\analyze_achievement.py                                       199     23    88%
reports\views.py                                                            31      0   100%
submissions\admin.py                                                         4      0   100%
submissions\apps.py                                                          4      0   100%
submissions\models.py                                                       37      0   100%
submissions\serializers.py                                                  37      0   100%
submissions\urls.py                                                          3      0   100%
submissions\utils\feature_extractor\extract_acoustic_features.py           189     22    88%
submissions\utils\feature_extractor\extract_all_features.py                 36      2    94%
submissions\utils\feature_extractor\extract_features_from_script.py        285     67    76%
submissions\utils\feature_extractor\extract_semantic_features.py            76      0   100%
submissions\utils\inference.py                                              36      3    92%
submissions\utils\tail_question_generator\generate_questions_routed.py     153      1    99%
submissions\utils\wave_to_text.py                                           41      4    90%
submissions\views.py                                                       414     40    90%
voicetutor\settings.py                                                      48      0   100%
voicetutor\urls.py                                                           7      0   100%
--------------------------------------------------------------------------------------------
TOTAL                                                                     3203    232    93%

======================== 532 passed, 416 warnings in 217.01s (0:03:37) ========================
  • Backend Integration Test Summary
    • Verified end-to-end workflows for both teacher and student flows.
    • Checked error handling for invalid requests, missing parameters, and unknown resources.
    • Ensured proper linking across modules (auto personal assignment creation, enrollment relations).
    • Confirmed cascade delete rules function correctly.

Backend Testing Framework Details

The backend uses pytest as the primary testing framework with the following key components:

  • pytest (v8.3.4): Main testing framework
  • pytest-django (v4.9.0): Django integration for pytest
  • pytest-cov (v6.0.0): Code coverage reporting
  • pytest-mock (v3.15.1): Mocking utilities
  • factory_boy (v3.3.1): Test data generation
  • coverage (v7.11.0): Coverage analysis tool

Test Organization

Tests are organized within each Django app under a tests/ directory:

  • accounts/tests/
  • assignments/tests/
  • catalog/tests/
  • core/tests/
  • courses/tests/
  • questions/tests/
  • reports/tests/
  • submissions/tests/
  • tests/ (integration tests)

Test Structure

Unit Tests

  • Model Tests: Test model validation, relationships, and business logic
  • Serializer Tests: Test request/response serialization and validation
  • View Tests: Test API endpoints, authentication, authorization, and response formats
  • Utility Tests: Test helper functions and utility modules

Integration Tests

  • Located in backend/tests/test_integration.py
  • Test complete workflows (e.g., teacher workflow: signup → class creation → student enrollment → assignment creation)
  • Use Django's APIClient for end-to-end API testing

Testing Patterns

Fixtures: Tests use pytest fixtures for reusable test data setup:

import pytest
from rest_framework.test import APIClient

from accounts.models import Account        # import paths follow the backend app layout
from courses.models import CourseClass

@pytest.fixture
def api_client():
    return APIClient()

@pytest.fixture
def teacher():
    return Account.objects.create_user(...)

@pytest.fixture
def course_class(teacher, subject):
    return CourseClass.objects.create(...)

Database Access: All tests use pytestmark = pytest.mark.django_db to enable database access. Tests use an in-memory SQLite database by default (configured via Django settings).

Mocking

  • Use unittest.mock.patch for mocking external services (e.g., AWS S3, Google Cloud Speech)
  • Mock external API calls and file operations
  • Use factory_boy for generating test data
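For example, an answer-submission test can patch the STT call so that no external service is hit (the patch target, fixtures such as student_token, and the multipart field names below are hypothetical; the real tests may patch at a different level):

from unittest.mock import patch

import pytest

pytestmark = pytest.mark.django_db

@patch("submissions.utils.wave_to_text.transcribe")             # hypothetical function name
def test_answer_submit_mocks_stt(mock_transcribe, api_client, student_token):
    mock_transcribe.return_value = "광합성은 빛에너지를 화학에너지로 바꾸는 과정입니다."
    with open("tests/fixtures/answer.wav", "rb") as audio:       # hypothetical fixture file
        response = api_client.post(
            "/api/personal_assignments/answer/",
            {"personal_assignment_id": 1, "audio": audio},
            format="multipart",
        )
    assert response.status_code == 200
    assert response.json()["success"] is True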

Test Markers: Pytest markers are used to categorize tests:

  • @pytest.mark.slow: Marks slow-running tests (can be deselected with -m "not slow")
  • @pytest.mark.integration: Marks integration tests

Configuration

Test configuration is defined in backend/pytest.ini:

  • Django settings module: voicetutor.settings
  • Test file patterns: tests.py, test_*.py, *_tests.py
  • Test paths: All app directories containing tests
  • Default options: Short traceback format, strict markers, disabled warnings

Running Tests

# Run all tests
pytest

# Run tests for a specific app
pytest assignments/tests/

# Run specific test file
pytest assignments/tests/test_assignment_apis.py -v

# Run with coverage
pytest --cov=. --cov-report=html

# Skip slow tests
pytest -m "not slow"

5.1.2 Frontend Testing

  • Frameworks: JUnit 4, Mockito (5.11.0), Mockito-Kotlin (5.4.0), Turbine (1.0.0), Compose UI Test, Hilt Testing

  • Architecture: MVVM (ViewModel + Repository pattern)

  • Coverage goal: > 80% coverage per component.

  • Coverage enforcement: JaCoCo with jacocoTestCoverageVerification task

  • Testing Result: frontend-test-iter5-2

  • Test Organization:

    • Unit tests: 1200+ tests covering ViewModels, Repositories, Utils, Managers
    • Instrumentation tests: Split into 5 groups (connectedDebug1-5), total 900+ tests
    • Total test count: 2100+ tests across unit and instrumentation suites
  • Frontend Integration Test Summary

    • Screen workflows: UI, button, text, and functionality checks for every screen
    • Authentication workflows: signup, login, token management
    • Assignment lifecycle: create, edit, delete, submit with file uploads
    • Navigation flows: student/teacher dashboard navigation, deep linking
    • UI state transitions: loading, error, success states with proper error messages
    • Form validation: text input validation, date/time picker interactions
    • File operations: PDF upload, audio recording simulation

Frontend Testing Framework Details

The Android frontend uses a comprehensive multi-layered testing approach with both unit tests and instrumentation tests:

Unit Tests (src/test/)

  • JUnit 4: Core testing framework
  • Mockito (v5.11.0) & Mockito-Kotlin (v5.4.0): Mocking framework
  • Turbine (v1.0.0): Flow testing library for Kotlin Coroutines
  • kotlinx-coroutines-test (v1.8.1): Coroutine testing utilities

Instrumentation Tests (src/androidTest/)

  • AndroidX Test: Android testing framework
  • Espresso: UI interaction testing
  • Compose UI Test: Jetpack Compose UI testing
  • Hilt Testing: Dependency injection testing support
  • MockK (v1.13.10): Mocking library for Kotlin (used in some instrumentation tests)

Test Organization

Unit Tests (src/test/java/com/example/voicetutor/)

  • ViewModels: ui/viewmodel/*Test.kt - Test ViewModel logic, state management, and business logic (AuthViewModel, AssignmentViewModel, ClassViewModel, StudentViewModel, etc.)
  • Repositories: data/repository/*Test.kt - Test data layer, API interactions, error handling
  • Models: data/models/*Test.kt - Test data model validation and transformations
  • Network: data/network/*Test.kt - Test API configuration and network models
  • Utils: utils/*Test.kt - Test utility functions (DateFormatter, StatisticsCalculator, PermissionUtils, ErrorMessageMapper)
  • Managers: Test OfflineManager, ThemeManager, FileManager functionality

Instrumentation Tests (src/androidTest/java/com/example/voicetutor/)

  • Screen Tests: ui/screens/*Test.kt - End-to-end UI tests for complete screens (CreateAssignmentScreen, EditAssignmentScreen, AssignmentDetailedResultsScreen, etc.)
  • Coverage Tests: ui/screens/*CoverageTest.kt - High-coverage tests targeting specific line ranges and edge cases
  • Navigation Tests: ui/navigation/*Test.kt - Navigation flow tests (VoiceTutorNavigation, MainLayout navigation)
  • Component Tests: ui/components/*Test.kt - UI component tests (Button, TextField, Card, Header, etc.)
  • Hilt-based Tests: All instrumentation tests use @HiltAndroidTest with FakeApiService for deterministic API responses

Testing Patterns

ViewModel Testing

@RunWith(MockitoJUnitRunner::class)
class AuthViewModelTest {
    @get:Rule
    val mainDispatcherRule = MainDispatcherRule()

    @Mock
    lateinit var authRepository: AuthRepository

    @Test
    fun signup_success_setsAutoFillAndCurrentUser() = runTest {
        // Given
        val vm = AuthViewModel(authRepository)
        Mockito.`when`(authRepository.signup(...))
            .thenReturn(Result.success(user))

        // When
        vm.signup(...)
        advanceUntilIdle()

        // Then
        vm.currentUser.test { ... }
    }
}

Flow Testing with Turbine

vm.currentUser.test {
    val user = awaitItem()
    assert(user != null)
    cancelAndIgnoreRemainingEvents()
}

Repository Testing

  • Mock ApiService using Mockito
  • Test success and failure scenarios
  • Verify error handling and exception propagation
  • Test network response parsing with Gson

Compose UI Testing (Instrumentation)

  • Use createAndroidComposeRule<MainActivity>() for Compose UI tests
  • Test UI semantics with onNodeWithText, onNodeWithContentDescription, performClick, etc.
  • Use waitUntil and waitForIdle for asynchronous UI updates
  • Test Material3 components (ModalBottomSheet, DatePickerDialog, TimePickerDialog)

Instrumentation Testing with Hilt

@HiltAndroidTest
@RunWith(AndroidJUnit4::class)
class CreateAssignmentScreenHighCoverageTest {
    @get:Rule(order = 0)
    val hiltRule = HiltAndroidRule(this)

    @get:Rule(order = 1)
    val composeRule = createAndroidComposeRule<MainActivity>()

    @Inject
    lateinit var fakeApi: FakeApiService

    @Test
    fun testFileUploadSuccessAndFileListUI() {
        // Test implementation using FakeApiService
        // No need to mock ViewModels - Hilt provides real instances with fake API
    }
}

Configuration

Test configuration is defined in frontend/app/build.gradle.kts:

Test Options

  • isIncludeAndroidResources = true: Include Android resources in unit tests
  • isReturnDefaultValues = true: Return default values for Android framework calls
  • animationsDisabled = true: Disable animations during tests
  • Custom test runner: HiltTestRunner for instrumentation tests
  • Split test execution: Tests organized into 5 groups (connectedDebug1-5) for optimized CI/CD performance

Code Coverage (JaCoCo)

  • JaCoCo (v0.8.11): Code coverage tool
  • Coverage reports generated for both unit tests and Android tests
  • Minimum coverage requirement: 80% overall, 70% per class
  • Coverage reports available in HTML and XML formats
  • Custom exclusions: Hilt-generated classes, R files, BuildConfig

Test Execution

Unit Tests

# Run all unit tests
./gradlew testDebugUnitTest

# Run specific test class
./gradlew testDebugUnitTest --tests "AuthViewModelTest"

# Generate coverage report
./gradlew jacocoTestReport

Instrumentation Tests

# Run all Android instrumentation tests (requires connected device/emulator)
./gradlew connectedDebugAndroidTest

# Alternative
# Run specific test groups (faster execution; be careful not to overwrite the coverage.ec file)
# Manually copy and store coverage.ec files so previous test results are not lost!
./gradlew connectedDebug1  # UI components and basic navigation
./gradlew connectedDebug2  # Screen tests
./gradlew connectedDebug3  # Additional screen tests
./gradlew connectedDebug4  # Complex screen workflows
./gradlew connectedDebug5  # High-coverage tests and navigation

# Generate combined coverage report
./gradlew jacocoTestReport

Coverage Verification

# Verify coverage meets minimum requirements (80%)
./gradlew jacocoTestCoverageVerification

Test Coverage Areas

Unit Tests Cover:

  • ViewModel state management and business logic
  • Repository data access and error handling
  • Data model validation and transformations
  • Utility functions (date formatting, statistics, permissions)
  • Manager classes (OfflineManager, ThemeManager, FileManager)
  • Network error mapping and API response handling

Instrumentation Tests Cover:

  • Complete screen workflows (CreateAssignment, EditAssignment, AssignmentDetail, etc.)
  • UI component interactions (buttons, text fields, dialogs, date/time pickers)
  • Navigation flows (VoiceTutorNavigation, MainLayout)
  • Compose UI semantics and accessibility
  • Real device/emulator behavior with Hilt dependency injection

Testing Best Practices

  1. Isolation: Each test is independent and can run in any order
  2. Mocking: External dependencies are mocked to ensure fast, reliable tests (Mockito for unit tests, FakeApiService for instrumentation tests)
  3. Coroutine Testing: Use runTest and advanceUntilIdle() for testing coroutines
  4. Flow Testing: Use Turbine for testing Kotlin Flows
  5. Test Data: Use FakeApiService with predefined responses for consistent instrumentation tests
  6. Coverage: Maintain high test coverage (target: 80% for frontend, 90% for backend)
  7. CI/CD: Tests split into 5 groups for parallel execution and faster CI/CD pipelines
  8. Pre-commit: Spotless enforces Kotlin code formatting before each commit
  9. Hilt Testing: Use @HiltAndroidTest with custom HiltTestRunner for dependency injection in instrumentation tests

Continuous Integration

Both frontend and backend tests are integrated into development workflows:

  • Pre-commit hooks: Spotless formatting checks run automatically before commits
  • Local testing: Developers run unit tests before creating PRs
  • Coverage verification: JaCoCo enforces > 80% coverage
  • Test groups: Android instrumentation tests split into 5 groups (connectedDebug1-5) for optimized execution
  • CI/CD readiness: GitHub Actions configuration for automated format checks and automated deployment

5.2 Acceptance Testing

Selected User Stories

  1. Teacher creates a new class and enrolls a new student
  • As a registered teacher user,
  • I can create a new class with a title, subject, and description, and enroll students into the class,
  • so that I can organize enrolled students and assignments separately by class.
  2. Teacher generates a quiz from a PDF material
  • As a registered teacher user who manages a class,
  • I can upload a PDF learning material and generate a quiz assignment from it,
  • so that enrolled students can view and take the quiz.
  3. Student participates in a voice quiz session
  • As a registered student user enrolled in the class,
  • I can start a conversational quiz session and submit my spoken answers,
  • so that my submission is recorded and I can receive AI-generated follow-up questions tailored to my weak areas.
  4. Student reviews their assignment
  • As a registered student user enrolled in the class,
  • I can review my completed assignment and view the report,
  • so that I can understand which questions I answered correctly or incorrectly.
  5. Teacher accesses student's statistics and reports
  • As a registered teacher user who manages the class,
  • I can access each enrolled student's statistics and reports based on Korean curriculum standards,
  • so that I can assess their understanding and identify weak areas.

Responsibilities

  • PM:
    • Plan and coordinate acceptance testing sessions.
    • Verify that each selected user story meets acceptance criteria.
  • Developers:
    • Support setup, fix identified bugs, and assist in validation.

Schedule

  • Conducted on 12/4.

Result

  • Passed all 69 test cases.

Feedback

  • The Student Registration and Delete buttons being visible only after scrolling is inconvenient.
    → Updated so the buttons are now fixed at the bottom.

  • When skipping a question, showing “Grading…” feels awkward; it would be better to indicate that a new question is being generated.
    → Changed the message to “Grading & Generating Question”.

  • When expanding a tail question, it’s hard to notice that it actually opened.
    → Adjusted the behavior so clicking “Expand Tail Question” automatically scrolls down slightly to reveal the generated tail question.


6. External Libraries

Frontend:

  • UI Framework: Jetpack Compose (Material3, Navigation Compose, Lifecycle ViewModel Compose)
  • Networking: Retrofit 2 with Gson converter, OkHttp with logging interceptor
  • Dependency Injection: Hilt (Dagger-based DI for Android)
  • Image Loading: Coil Compose
  • Testing: JUnit 4, Mockito (5.11.0), Mockito-Kotlin (5.4.0), Turbine (1.0.0), Compose UI Test, Hilt Testing
  • Code Quality: Spotless (Kotlin formatting), JaCoCo (0.8.11) for code coverage
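
As a rough sketch, the Gradle coordinates behind this list look approximately as follows; a version catalog or different versions may be used in the actual build.gradle.kts, and only the versions stated above are taken from this document.

dependencies {
    implementation("androidx.compose.material3:material3")     // Material3 components
    implementation("androidx.navigation:navigation-compose")   // Navigation Compose
    implementation("com.squareup.retrofit2:retrofit")          // Retrofit 2
    implementation("com.squareup.retrofit2:converter-gson")    // Gson converter
    implementation("com.squareup.okhttp3:logging-interceptor") // OkHttp logging interceptor
    implementation("com.google.dagger:hilt-android")           // Hilt DI
    implementation("io.coil-kt:coil-compose")                  // Coil image loading

    testImplementation("junit:junit:4.13.2")
    testImplementation("org.mockito:mockito-core:5.11.0")
    testImplementation("org.mockito.kotlin:mockito-kotlin:5.4.0")
    testImplementation("app.cash.turbine:turbine:1.0.0")
}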

Backend:

  • Framework: Django REST Framework
  • Database: PostgreSQL 15 with psycopg3
  • AI/ML: LangChain, LangGraph, OpenAI GPT-4o/GPT-4o-mini, Google Cloud Speech-to-Text API
  • ML Models: XGBoost for confidence scoring
  • Testing: Pytest (8.3.4), pytest-django, pytest-cov, pytest-mock, factory_boy, coverage (7.11.0)

Storage & Infrastructure:

  • Cloud Storage: AWS S3 for material PDFs
  • Web Server: Nginx + Gunicorn (15 workers, 4 threads per worker)

7. Risk Management & Mitigation

Implemented Risk Mitigation Strategies:

| Risk | Impact | Likelihood | Mitigation Strategy (Implemented) | Implementation Location |
|---|---|---|---|---|
| Database transaction failures | High | Medium | Django transaction.atomic() ensures atomic operations with automatic rollback on exceptions for question generation operations | backend/questions/views.py (line 141) |
| Network timeout during API calls | High | High | OkHttpClient timeout configuration (60s connect, 120s read, 60s write), user-friendly error messages | frontend/app/src/main/java/com/example/voicetutor/di/NetworkModule.kt (lines 45-47) |
| Partial data saves on creation failures | High | Medium | Try-except blocks catch exceptions before database commits; errors return user-friendly Korean messages | All backend views (e.g., backend/assignments/views.py, backend/courses/views.py) wrap operations in try-except |
| API error response inconsistency | Medium | Medium | Standardized create_api_response() helper function returns a consistent error format with Korean messages | backend/assignments/views.py (line 27), backend/courses/views.py (line 31), backend/submissions/views.py (line 32) |
| Exception message exposure to users | Low | Medium | Backend catches exceptions and returns user-friendly Korean messages (e.g., "과제 목록 조회 중 오류가 발생했습니다", "An error occurred while retrieving the assignment list"); frontend displays server-supplied messages | All backend views wrap operations in try-except blocks |
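
For reference, the timeout mitigation above corresponds to an OkHttpClient setup along the following lines; this is a sketch, and the provider function and logging level are assumed rather than copied from NetworkModule.kt.

@Module
@InstallIn(SingletonComponent::class)
object NetworkModule {
    @Provides
    @Singleton
    fun provideOkHttpClient(): OkHttpClient =
        OkHttpClient.Builder()
            .connectTimeout(60, TimeUnit.SECONDS)  // slow connections to the backend
            .readTimeout(120, TimeUnit.SECONDS)    // long-running AI/STT responses
            .writeTimeout(60, TimeUnit.SECONDS)    // audio and PDF uploads
            .addInterceptor(HttpLoggingInterceptor().apply {
                level = HttpLoggingInterceptor.Level.BASIC
            })
            .build()
}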

Planned Risk Mitigation Strategies:

| Risk | Impact | Likelihood | Mitigation Strategy (Planned) |
|---|---|---|---|
| Offline submission backlog | Medium | High | OfflineManager class implemented with file-based caching and a pending action queue (max 3 retries); integration with ViewModels and production workflows pending |
| AI scoring drift produces biased feedback | High | Medium | Golden-set evaluation, regular human review |
| STT outage or degraded accuracy | High | Medium | Multi-region STT endpoints, fallback provider (Google Cloud Service), pre-cached prompts for manual transcription |
| PDF parsing failure on complex documents | Medium | Medium | Graceful fallback to manual question creation, PDF preprocessing heuristics, validation before publish |
| Data breach of stored audio files | Critical | Low | Signed URL expiry (5 min), access logging, strict S3 policy |

8. Glossary & References

Glossary

  • Tail Question: Follow-up AI question triggered when accuracy/confidence falls below a threshold.
  • Certainty / Confidence Score: Probability (0-1) returned by evaluation model representing confidence in correctness classification.
  • Achievement Code: Identifier aligned with Korean national curriculum competency mapping (e.g., 9과12-04).
  • Guardian Consent: Verified authorization from legal guardian for minors, stored with versioned policy reference.
  • Golden Set: Curated dataset of labeled student responses used to evaluate AI performance.

References

  • Korean Ministry of Education Curriculum Standards (2022 revision)
  • OWASP ASVS v4.0.3
  • WCAG 2.1 AA Guidelines
  • ISO/IEC 27001 Control Mapping for Education Technology

9. Dataset & Research

9.1 Public Speech Dataset

Link: AI Hub — Public Speaking Practice & Assessment Data

Overview

This dataset was constructed to support public speaking practice and assessment.
It contains presentation videos and speech audio, presentation text materials, and evaluation text data.
The goal is to enable research and development in:

  • Public speaking recognition and classification
  • Speaking level evaluation

Speaker Distribution

| Code | Group | Count | Ratio |
|---|---|---|---|
| A00 | Middle school (Grade 9) | 100 | 12.5% |
| A01 | High school students | 200 | 25% |
| A02 | 20s | 200 | 25% |
| A03 | 30s | 100 | 12.5% |
| A04 | 40s | 100 | 12.5% |
| A05 | 50+ | 100 | 12.5% |

Dataset Contents

  • Length: 3~4 min per presentation
  • Speakers: Metadata includes age group, gender, occupation, and audience type.
  • Presentations: Topic, type, location, script text, and difficulty level.
  • Utterances: Speech segments with start/end time, syllable count, word count, sentence count, and STT-based transcription.
  • Metadata: File ID, filename, evaluation date, and data format.

Dataset Schema

| Category | Field | Description | Type |
|---|---|---|---|
| Speaker Info | speaker | Speaker ID | String |
| | age_flag | Age group | String |
| | gender | Gender | String |
| | job | Occupation | String |
| | aud_flag | Audience group | String |
| Presentation Info | presentation | Presentation ID | String |
| | presen_topic | Presentation topic | String |
| | presen_type | Presentation type | String |
| | presen_location | Presentation location | String |
| | presen_script | Original presentation script | String |
| | presen_difficulty | Presentation difficulty | String |
| Utterance Script | script | Utterance ID | String |
| | start_time | Utterance start time | String |
| | end_time | Utterance end time | String |
| | script_stt_txt | Utterance content (ASR/STT result) | String |
| | script_tag_txt | Utterance content (tag-mapped) | String |
| | syllable_cnt | Number of syllables | Number |
| | word_cnt | Number of words (tokens) | Number |
| | audible_word_cnt | Number of words clearly perceived by listener | Number |
| | sent_cnt | Number of sentences | Number |
| Evaluation | evaluations | Evaluation entry ID | String |
| | evaluation.eval_id | Evaluator ID | String |
| | eval_flag | Evaluator type | String |
| | eval_grade | Overall evaluation grade | String |
| Repetition | repeat_cnt | Count of repetitions/self-repairs | Number |
| | repeat_scr | Repetition/self-repair score | Number |
| Filler Words | filler_words_cnt | Count of fillers (um, uh, etc.) | Number |
| | filler_words_scr | Filler word score | Number |
| Pause | pause_cnt | Count of pauses | Number |
| | pause_scr | Pause score | Number |
| Pronunciation | wrong_cnt | Count of pronunciation errors | Number |
| | wrong_scr | Pronunciation score | Number |
| Voice Quality | voc_quality | Voice quality label | String |
| | voc_quality_scr | Voice quality score | Number |
| Voice Speed | voc_speed | Speech rate (words/sec) | Float |
| | voc_speed_sec_scr | Speech rate score | Number |
| Tagging | taglist | Tag list | String |
| | tag_id | Tag ID | String |
| | tag_keyword | Tag keyword | String |
| | tag_type | Tag type | Integer |
| Averages | repeat_scr | Average repetition/self-repair score | Float |
| | filler_words_scr | Average filler word score | Float |
| | pause_scr | Average pause score | Float |
| | wrong_scr | Average pronunciation score | Float |
| | voc_quality_scr | Average voice quality score | Float |
| | voc_speed_sec_scr | Average speech rate score | Float |
| | eval_grade | Average overall evaluation grade | String |
| Meta | info.filename | File name | String |
| | id | File ID | String |
| | date | Evaluation date | String |
| | formats | Data format | String |

Participating Organizations

| Organization | Responsibility |
|---|---|
| HealthCloud Co., Ltd. | Non-verbal data refinement & processing |
| GNUSoft Co., Ltd. | Linguistic AI modeling |
| ANeut Co., Ltd. | Non-verbal processing, AI modeling, authoring tools |

9.2 Curriculum-Level Subject Dataset

  • Source: AI Hub - Curriculum-Level Subject Dataset

  • Overview This dataset is designed to support research in curriculum-aligned natural language understanding and multimodal learning. It was constructed through the systematic collection of textual and visual data from official educational materials, such as textbooks and reference guides, across multiple educational stages. These resources were then rigorously annotated and aligned with the achievement standards defined in the 2022 Revised National Curriculum of Korea, across nine core subject domains. The resulting dataset facilitates a range of educational AI tasks, including curriculum-based content inference, standard-level classification, and subject-specific knowledge modeling.

  • Subjects: Science, Korean, Mathematics, English, Social Studies, Sociology, Ethics, Technology–Home Economics, Information (9 subjects in total)

  • Preprocessing: The dataset is partitioned into training and validation sets; each achievement standard is represented by 80 textual samples to ensure balanced representation across labels.

  • Distribution:
    After preprocessing, we collected a total of 1,071 achievement standards, each paired with 80 sample texts—resulting in 85,680 samples overall.

    | Subject | Number of Standards | Total Samples |
    |---|---|---|
    | Science | 190 | 15,200 |
    | Korean | 209 | 16,720 |
    | Technology and Home Economics | 86 | 6,880 |
    | Ethics | 21 | 1,680 |
    | Social Studies | 173 | 13,840 |
    | Society and Culture | 13 | 1,040 |
    | Math | 241 | 19,280 |
    | English | 84 | 6,720 |
    | Informatics | 54 | 4,320 |
    | Total | 1,071 | 85,680 |
  • Contributors: Media Group Sarangwasup Co., Ltd.


9.3 Research Papers

Recognizing Uncertainty in Speech
  • Published: 2011 (57 citations)

Topic

The study investigates how prosody (intonation, stress, and speech rate) differs when a speaker is confident versus uncertain, and whether these cues appear across the whole utterance or are localized around critical/target words.


Method

  • Participants / utterances

    • 20 speakers: all native English
    • 5 raters
    • 600 utterances total
  • Scripts

    • Boston public transportation fill-in-the-blank sentences
    • Lexical fill-in-the-blank sentences (with unfamiliar words inserted)
  • Procedure

    • Speakers recorded their self-rated confidence (1–5)
    • Raters judged perceived confidence (1–5)
  • Features Used

    • Pitch (f0)

      • minimum, maximum, mean, std, range
      • relative position in utterance of min pitch
      • relative position in utterance of max pitch
      • absolute slope
    • Intensity (RMS)

      • minimum, maximum, mean, std
      • relative position in utterance of min intensity
      • relative position in utterance of max intensity
    • Temporal

      • total silence, percent silence
      • total duration
      • speaking duration (utterance length − pauses)
      • speaking rate
  • Normalization

    • z-score normalization per speaker

      • For each speaker, all utterances were used to compute (x − μ)/σ
      • Each speaker’s mean centered to 0 and std to 1
      • This cancels baseline pitch/loudness/speaking style differences
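
A minimal sketch of this per-speaker z-score normalization (the function and data shape are illustrative, not the project's actual feature pipeline):

// Normalize each speaker's feature values to zero mean and unit variance,
// cancelling baseline pitch/loudness/speaking-style differences.
fun zScorePerSpeaker(valuesBySpeaker: Map<String, List<Double>>): Map<String, List<Double>> =
    valuesBySpeaker.mapValues { (_, values) ->
        val mean = values.average()
        val std = kotlin.math.sqrt(values.sumOf { (it - mean) * (it - mean) } / values.size)
        if (std == 0.0) values.map { 0.0 }     // constant feature: no variation to scale
        else values.map { (it - mean) / std }  // (x - mean) / std
    }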

Results

Significant features

  • Total duration (−0.653)

    • Sentence length was nearly fixed in this setup; not relevant for our service
  • Total silence (−0.644)

  • Speaking duration (−0.515)

    • Same issue as above; discarded
  • Percent silence (−0.459)

  • Absolute slope of f0 (+0.312)

    • Larger values correspond to final rising intonation (questions) or sharp final falls (statements)
    • Likely: final falls (r < 0) = confidence
    • Suggestion: compute f0 slope specifically at the end of the utterance as a feature

Supportive features

  • f0 range

    • Larger range = more variation in pitch → likely uncertainty
  • min f0, max f0

    • Very low minima or very high maxima = uncertainty (similar effect as large range)
  • min RMS

    • Very low intensity suggests uncertainty (quiet speech perceived as low confidence)

Additional findings

  • Mismatch: sometimes speakers reported low confidence, but raters perceived them as confident
  • Indicates that apparent confidence ≠ actual confidence; speakers may project certainty despite inner doubt

Reflections

  • How to adapt speaker-wise normalization when user data is sparse remains an open question
  • No explicit model/weights were suggested for combining features into an ML predictor of uncertainty
  • Study limited to English speakers; Korean prosody may differ
  • Sentence length was artificially fixed, so features tied to total duration are unsuitable for deployment

Design takeaway for tutoring system

  • Correct + Uncertain: “You identified the right idea. Let’s review it once more to solidify your reasoning.”

  • Incorrect + Confident: “This is a common misconception. Let’s carefully compare the key differences.”

Response latency as a predictor of the accuracy of children's reports
  • Published: 2011 (69 citations)

As obvious as it may sound, the longer a student takes to select an answer, the more uncertain they are.

Although the study was conducted in a multiple-choice selection scenario (not directly aligned with our speech-based service), the analogy is straightforward:

Problem presented → student begins recording / begins speaking

The time lag between these two points can reasonably be considered a meaningful feature for uncertainty.
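
A minimal sketch of capturing that latency on the client (the timestamps and function name are hypothetical, not part of the current app):

// Seconds between the question being shown and the student starting to record,
// usable as a simple uncertainty feature alongside the prosodic ones.
fun responseLatencySeconds(questionShownAtMillis: Long, recordingStartedAtMillis: Long): Double =
    (recordingStartedAtMillis - questionShownAtMillis).coerceAtLeast(0L) / 1000.0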

Fluency issues in L2 academic presentations: Linguistic, cognitive and psychological influences on pausing behaviour
  • Published: 2024 (4 citations)

Focus

This study, conducted in an EAP (English for Academic Purposes) class, explored where and why pauses occur in academic presentations by L2 (English as a foreign language) learners, and how they influence fluency. Unlike previous research that mainly emphasized quantitative measures (e.g., number/length of pauses), this work sought to explain the underlying causes—linguistic, cognitive, and psychological.

  • (1) What are the types, positions, and frequencies of pauses?
  • (2) What are the reasons for pauses?

Method

  1. Data

    • 22 EAP students at an Australian university
    • Each gave ~15-minute presentations
    • A 1-minute segment was sampled for each; some 5-minute samples were also analyzed to test representativeness (t-test showed no significant difference)
  2. Acoustic & Transcript Analysis

    • Tools like Praat were used to measure the frequency, duration, and position of silent pauses and filled pauses (um, uh)
    • Position categories: within-clause (MOC) vs. end-of-clause (EOC)
  3. Stimulated Recall Interviews (SRI)

    • Students re-watched their own presentation videos and were asked: “Why did you pause here?”
    • 7 participants took part
    • Researchers compared self-reports with observer interpretations

Results

  • 332 pauses total across 22 speakers

    • 210 filled pauses, 122 silent pauses
    • → Filled pauses more frequent
  • Distribution by position

    • End-of-clause (EOC): 65.6%
    • Mid-clause (MOC): 34.4%
  • Types of pauses

    • Silent pause: complete break in sound, only breathing or silence
    • Filled pause: hesitation sounds (um, uh, er, hmm, “like,” “so,” “you know”)
  • Planned pauses

    • Typically at clause ends or after formulaic phrases (“first of all,” “generally speaking”)
    • Functions: breathing, emphasis, giving processing time to listeners
  • Unplanned pauses

    • Occurred during repetition, self-repair, lexical retrieval, planning, or unclassified cases
    • Lexical retrieval/planning pauses often co-occurred with fillers like um/uh

Causal Model (from SRI)

The study identified overlapping linguistic, cognitive, and psychological causes for pauses:

  1. Linguistic

    • Pauses due to word search, pronunciation difficulty, or L1–L2 translation
    • Self-monitoring and self-repair also triggered pauses
  2. Cognitive

    • Burden of recalling and organizing content (especially interpreting graphs) → more mid-clause pauses and repetitions
    • Observers sometimes assumed “grammar checking,” but SRIs revealed it was often conceptual recall/restructuring instead
  3. Psychological

    • Anxiety/nervousness disrupted language and cognitive processing → increased silent/filled pauses and repetitions
    • Conversely, confident speakers used clause-final pauses strategically (checking audience reaction, emphasis, giving processing time)

A single pause may have multiple overlapping causes (e.g., lexical search + anxiety). The same type of pause may stem from different reasons depending on the speaker. This shows how observer-only analysis can be misleading.


Implications

  • Fluency is not just about speed or number of pauses. Effective presentations require simultaneous conceptual processing (cognitive), language production (linguistic), and psychological management.
  • Clause-final pauses can be strategic and beneficial, serving discourse segmentation, emphasis, or audience processing time.
  • Mid-clause silent pauses generally indicate processing difficulties, though causes vary by individual.

Reflections

  • Differentiating pause types is insightful:

    • Silent pause

      • Short, clause-final → marks discourse boundaries
      • Long, mid-clause → failed recall, conceptual overload
    • Filled pause

      • Short → minor planning strategy, not problematic
      • Long → lexical difficulty, insufficient mastery to express concepts fluently
    • Planned pause

      • Often for emphasis or breathing
    • Unplanned pause

      • Common during recall, self-repair, or at ungrammatical break points

Prosodic Manifestations of Confidence and Uncertainty in Spoken Language — Structured Summary
  • Published: 2008 (Interspeech/ICSLP, Brisbane)
  • Author: Heather Pon-Barry (Harvard SEAS)

Topic

The paper asks where prosodic cues to confidence/uncertainty live in an utterance—are they concentrated on the target word/phrase that triggered the decision, or distributed in the surrounding context? It further contrasts self-reported certainty with perceived (listener-rated) certainty.


Method

Participants & Material

  • Speakers: 20 native English speakers (14F/6M).
  • Annotators: 5 native English raters (perceived certainty).
  • Utterances: 600 total: 200 transit Q&A + 400 vocabulary sentences.

Two elicitation domains

  1. Boston public transit: fill-in-the-blank responses with constrained options (e.g., “Take the red line to the _ and get off at _”).
    • Procedure included viewing the context alone, then context+options; a beep at 1500 ms cued reading aloud; speakers then self-rated certainty (1–5).
  2. Vocabulary: sentences completed by choosing 1 of 4 words from a small pool (incl. rare words to induce uncertainty). Same timing as above.

Annotation

  • Five raters judged perceived certainty (1–5) for all 600 utterances, presented without any textual context. Inter-rater agreement (κ) was modest and in line with prior affect work.

Prosodic features & normalization

  • Pitch (f0): min, max, mean, stdev, range, relative positions (min/max), absolute slope.
  • Intensity (RMS): min, max, mean, stdev, relative positions (min/max).
  • Temporal: total/percent silence, total duration, speaking duration (minus pauses), speaking rate.
  • Normalization: all features z-scored per speaker (centered/standardized within speaker).

Context vs. Target segmentation

To localize cues, authors manually removed the target word region (including any immediately preceding pause) from each recording, producing separate context and target segments for parallel feature extraction.


Results

Perception vs. self-report

  • Perceived certainty exceeded self-reported certainty in 67% of utterances. This gap cautions that listeners often over-estimate a speaker’s confidence relative to the speaker’s own rating.

Whole-utterance correlates (with perceived certainty)

From Table 1 (correlations with mean perceived rating), the strongest effects are temporal:

  • Total duration: −0.653 (longer ⇒ less certain)
  • Total silence: −0.644 (more silence ⇒ less certain)
  • Percent silence: −0.459
  • Speaking duration: −0.515
  • Speaking rate: +0.134 (faster ⇒ slightly more certain)
  • Among f0/RMS features: absolute f0 slope +0.312 (a steeper global slope is associated with higher perceived certainty).

Where do cues live? (Context vs. Target)

  • Percent silence is much stronger in the target region (−0.568) than in context (−0.198): localized hesitations around the decision word flag uncertainty.
  • Range f0 is stronger in context (−0.247) than target (≈0): broader pitch excursions in the surrounding phrase relate to uncertainty.
  • Similar context-dominant patterns appear for min f0, max f0, f0 stdev, min RMS.
  • Some features (e.g., absolute f0 slope, total duration, total silence) are best at the whole-utterance level; splitting doesn’t add predictive value for those.

Interpretation & Takeaways

  1. Temporal rhythm dominates: pauses, silence proportion, and overall length are the most reliable global indicators of perceived certainty.
  2. Local vs. global cues:
    • If you want to locate the uncertainty source, inspect target-region features—especially percent silence.
    • If you’re classifying certainty for the whole utterance, global temporal features and absolute f0 slope are high-value.
  3. Perception ≠ self-report: Systems optimized only for perceived certainty risk missing actual uncertainty that speakers feel but listeners don’t detect.

Limitations (noted by the authors)

  • Read (non-spontaneous) speech was used to tightly control lexical content and repeat target words across certainty levels; generalization to spontaneous speech is future work.
  • Additional planned work includes within-speaker analyses and feature-set comparisons for classification accuracy.

Quick Reference: Most Diagnostic Signals from the Paper

  • ↓ Certainty: more/longer pauses (total/percent silence), longer utterances.
  • ↑ Certainty (weak-moderate): steeper absolute f0 slope, slightly faster rate.
  • Localization: percent silence peaks in target; f0 range patterns dominate in context.