
🔗 Frontend‐Backend Integration – Epic 3: Speaker Diarisation

🧩 Overview

This document outlines the integration between the frontend (React) and backend (Django REST + Celery) for speaker diarisation. When a user uploads a file and selects a speaker-labelled output format, the backend performs diarisation using Faster-Whisper and PyAnnote Audio, and the frontend renders the resulting speaker-labelled text blocks in the transcript viewer.


🖥️ Frontend (React)

  • Trigger: Same as Epic 2 (transpage.js); the user submits a transcription request
  • Format Choice: If output_format = diarised, the backend engages its speaker-separation logic (see the sketch below)
  • Result Handling: The response includes timestamps and speaker IDs (e.g., Speaker 1, Speaker 2), which the result page renders with distinct per-speaker formatting
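
A minimal sketch of how the backend might branch on the output_format flag. The view and task names here (TranscriptionRequestView, transcribe_task, diarise_task) are illustrative placeholders, not the project's actual identifiers:

```python
# Hypothetical DRF view: route diarised requests to the diarisation task.
from rest_framework.views import APIView
from rest_framework.response import Response

# Hypothetical Celery tasks; diarise_task would wrap
# transcribe_with_speaker_fasterWhisper.py.
from .tasks import transcribe_task, diarise_task

class TranscriptionRequestView(APIView):
    def post(self, request):
        output_format = request.data.get("output_format", "text")
        file_id = request.data.get("file_id")

        if output_format == "diarised":
            task = diarise_task.delay(file_id)      # speaker-separation path
        else:
            task = transcribe_task.delay(file_id)   # standard transcription

        # The task id feeds the existing task-tracking pipeline from Epic 2.
        return Response({"task_id": task.id}, status=202)
```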

⚙️ Backend (Django + Celery + PyAnnote)

🔹 Main Logic

File: backend/speaker_identify/transcribe_with_speaker_fasterWhisper.py

🔹 Key Processing Steps:

  1. 🔄 Convert uploaded file to WAV (if needed)
  2. 🔇 Perform noise reduction via noisereduce + librosa
  3. 🧠 Use PyAnnote Audio to segment speakers with a pretrained HuggingFace pipeline
  4. ✍️ Use Faster-Whisper to transcribe each segmented part
  5. 🧩 Merge segments and relabel each portion with the assigned speaker ID (steps 1–5 are sketched together after the snippet below)
  6. 💾 Store the diarised result for frontend display and download

Step 3 loads the pretrained diarisation pipeline from HuggingFace:

```python
from pyannote.audio import Pipeline

# HF_TOKEN is the HuggingFace access token required for the gated pyannote model
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=HF_TOKEN
)

# buf is the preprocessed WAV audio (a file path or file-like object)
diarisation = pipeline(buf)
```
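
Putting steps 1–5 together, a condensed sketch of the diarise-then-transcribe flow is shown below. It is a sketch under assumptions, not the project's exact implementation: the "base" model size, the 16 kHz sample rate, the temporary-file name, and the per-turn slicing strategy are illustrative choices.

```python
import librosa
import noisereduce as nr
import soundfile as sf
import torch
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

def diarise_and_transcribe(input_path, hf_token):
    # Steps 1-2: load as mono 16 kHz (librosa decodes and resamples most
    # formats), then suppress stationary noise.
    audio, sr = librosa.load(input_path, sr=16000, mono=True)
    audio = nr.reduce_noise(y=audio, sr=sr)

    # Write the cleaned audio to WAV, since the pyannote pipeline reads files.
    clean_path = "clean.wav"
    sf.write(clean_path, audio, sr)

    # Step 3: segment the recording by speaker.
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization", use_auth_token=hf_token
    )
    diarisation = pipeline(clean_path)

    # Step 4: transcribe each speaker turn with Faster-Whisper.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = WhisperModel("base", device=device)

    results = []
    for turn, _, speaker in diarisation.itertracks(yield_label=True):
        # Slice the buffer to this turn; faster-whisper accepts 16 kHz
        # float32 numpy arrays directly.
        chunk = audio[int(turn.start * sr):int(turn.end * sr)]
        segments, _ = model.transcribe(chunk)
        text = " ".join(seg.text.strip() for seg in segments)

        # Step 5: relabel the portion with the assigned speaker ID.
        results.append({
            "start": round(turn.start, 2),
            "end": round(turn.end, 2),
            "speaker": speaker,  # raw pyannote label, e.g. "SPEAKER_00"
            "text": text,
        })
    return results
```

In practice the raw pyannote labels (SPEAKER_00, SPEAKER_01, ...) would be remapped to the Speaker 1 / Speaker 2 style that the frontend displays.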

🔹 Output Format

The result is stored in a structured format:

```json
[
  { "start": 0.0, "end": 5.0, "speaker": "Speaker 1", "text": "Hello, welcome..." },
  { "start": 5.0, "end": 12.0, "speaker": "Speaker 2", "text": "Thank you..." },
  ...
]
```

This data is sent back through the existing transcription task-tracking pipeline.

Summary of Flow

  1. User requests diarised format on upload page
  2. Task is routed to diarisation handler on backend
  3. Diarised transcript is processed and stored
  4. Frontend retrieves the result and displays it with speaker annotations

Notes

  • Processing is more resource-intensive than standard transcription because of the size of the PyAnnote model
  • CUDA acceleration is used when available (torch.cuda.is_available()); see the sketch below
  • Diarisation logic is modular and can be bypassed for standard transcription
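
A minimal sketch of the device-selection pattern from the note above. Moving the pipeline with .to() assumes a pyannote.audio release that supports it, and HF_TOKEN is the same HuggingFace token used earlier:

```python
import torch
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

# Pick the GPU when one is visible, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Faster-Whisper takes the device name directly.
model = WhisperModel("base", device=device)

# The diarisation pipeline can be moved explicitly in recent
# pyannote.audio releases.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token=HF_TOKEN
)
pipeline.to(torch.device(device))
```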