🔗 Frontend‐Backend Integration – Epic 3: Speaker Diarisation
🧩 Overview
This document outlines the integration logic between the frontend (React) and backend (Django REST + Celery) for the speaker diarisation workflow. When a user uploads a file and selects a speaker-labelled output format, the backend performs speaker diarisation using Faster-Whisper and PyAnnote Audio. The frontend then displays speaker-labelled text blocks in the transcript viewer.
🖥️ Frontend (React)
- Trigger: Same as Epic 2 (`transpage.js`): the user submits a transcription request
- Format Choice: If `output_format = diarised`, the backend engages the speaker separation logic (an illustrative request is sketched below)
- Result Handling: The response includes timestamps and speaker IDs (e.g., Speaker 1, Speaker 2), which are rendered on the result page with distinct formatting
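For concreteness, the request can be sketched as follows. This is a minimal illustration in Python rather than the actual React code in `transpage.js`; the endpoint path and the `file` field name are assumptions, while `output_format = diarised` comes from the integration described above.

```python
# Minimal sketch of the transcription request (hypothetical endpoint and
# field names; only output_format = "diarised" is documented above).
import requests

with open("meeting.mp3", "rb") as audio_file:
    response = requests.post(
        "http://localhost:8000/api/transcribe/",   # assumed endpoint
        files={"file": audio_file},                # assumed field name
        data={"output_format": "diarised"},        # engages speaker separation
    )
response.raise_for_status()
print(response.json())  # e.g. a task id used to poll for the diarised result
```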
⚙️ Backend (Django + Celery + PyAnnote)
🔹 Main Logic
File: `backend/speaker_identify/transcribe_with_speaker_fasterWhisper.py`
🔹 Key Processing Steps:
- 🔄 Convert uploaded file to WAV (if needed)
- 🔇 Perform noise reduction via `noisereduce` + `librosa` (see the sketch after this list)
- 🧠 Use PyAnnote Audio to segment speakers with a pretrained HuggingFace pipeline
- ✍️ Use Faster-Whisper for transcription on each segmented part
- 🧩 Merge segments and relabel each portion with the assigned speaker ID (a merge sketch follows the pipeline snippet below)
- 💾 Store the diarised result for frontend display and download
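The first two steps can be sketched as below, assuming the standard `librosa`, `noisereduce`, and `soundfile` APIs; the file names and the 16 kHz sample rate are illustrative choices, not values confirmed from the codebase.

```python
# Sketch: decode the upload to a waveform, denoise it, and persist as WAV.
import librosa
import noisereduce as nr
import soundfile as sf

# librosa decodes most common formats (mp3, m4a, ...) to a float waveform,
# which covers the "convert to WAV" step once re-saved with soundfile.
audio, sr = librosa.load("upload.mp3", sr=16000, mono=True)

# Spectral-gating noise reduction over the whole signal.
cleaned = nr.reduce_noise(y=audio, sr=sr)

# Write the cleaned WAV for the diarisation and transcription stages.
sf.write("upload_clean.wav", cleaned, sr)
```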
Speaker segmentation then runs through the pretrained HuggingFace pipeline:

```python
from pyannote.audio import Pipeline

# Load the pretrained diarisation pipeline (requires a HuggingFace token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token=HF_TOKEN,
)

# Run diarisation on the prepared audio buffer/file.
diarisation = pipeline(buf)
```
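The transcription and merge steps can be sketched as follows. The `WhisperModel` usage and pyannote's `itertracks` are real APIs, but the overlap-based speaker assignment shown here is an illustrative reconstruction, not necessarily the exact merge logic in `transcribe_with_speaker_fasterWhisper.py`.

```python
from faster_whisper import WhisperModel

# Transcribe the cleaned audio (model size and device are illustrative).
model = WhisperModel("base", device="cpu")
segments, _info = model.transcribe("upload_clean.wav")

# Collect diarisation turns as (start, end, raw speaker label).
turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarisation.itertracks(yield_label=True)
]

def speaker_for(start: float, end: float) -> str:
    """Pick the speaker whose turn overlaps this segment the most."""
    best, best_overlap = "Unknown", 0.0
    for t_start, t_end, label in turns:
        overlap = min(end, t_end) - max(start, t_start)
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best

# Merge: attach a speaker to every transcribed segment. Raw pyannote labels
# (e.g. "SPEAKER_00") would then be mapped to display names like "Speaker 1".
result = [
    {
        "start": seg.start,
        "end": seg.end,
        "speaker": speaker_for(seg.start, seg.end),
        "text": seg.text.strip(),
    }
    for seg in segments
]
```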
🔹 Output Format
The result is stored in a structured format:

```json
[
  { "start": 0.0, "end": 5.0, "speaker": "Speaker 1", "text": "Hello, welcome..." },
  { "start": 5.0, "end": 12.0, "speaker": "Speaker 2", "text": "Thank you..." },
  ...
]
```
This data is sent back through the existing transcription task tracking pipeline.
🔁 Summary of Flow
- User requests diarised format on upload page
- Task is routed to the diarisation handler on the backend (a routing sketch follows this list)
- Diarised transcript is processed and stored
- Frontend retrieves the result and displays it with speaker annotations
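As a rough illustration of that routing, the Celery task might look like the sketch below. The task and helper names are assumptions made for this example; only the diarised-versus-standard split and the bypassability noted under Notes are documented.

```python
from celery import shared_task

# Hypothetical handlers (names assumed); the diarised path corresponds to
# backend/speaker_identify/transcribe_with_speaker_fasterWhisper.py.
def transcribe_with_speakers(path: str) -> list: ...
def transcribe_plain(path: str) -> list: ...

@shared_task
def process_transcription(file_path: str, output_format: str) -> list:
    """Route the request: a diarised format engages speaker separation,
    anything else bypasses it and takes the standard transcription path."""
    if output_format == "diarised":
        return transcribe_with_speakers(file_path)
    return transcribe_plain(file_path)
```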
📝 Notes
- Processing is more resource-intensive than standard transcription because of the PyAnnote model size
- CUDA acceleration is used when available, via `torch.cuda.is_available()` (see the sketch below)
- The diarisation logic is modular and can be bypassed for standard transcription
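A small sketch of that device selection, assuming pyannote.audio 3.x (where `Pipeline.to()` is available) and the `HF_TOKEN` from the snippet above; the model names are illustrative.

```python
import torch
from pyannote.audio import Pipeline
from faster_whisper import WhisperModel

# Prefer the GPU whenever CUDA is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Pipeline.to() moves the diarisation models onto the chosen device
# (supported in pyannote.audio 3.x).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token=HF_TOKEN
).to(torch.device(device))

# faster-whisper picks its CUDA or CPU kernels from the device argument.
model = WhisperModel("base", device=device)
```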