Speech to text Transcription - pinocchio61/Architecture GitHub Wiki

Pre-Experiment

The experiment goal and the experiment design.

Purpose

For a crowdsourced speech verification platform, speech transcription is a key component: it lets the crowd verify a speech with ease by reading its transcription.

There are various ways to provide speech transcription, and we mainly investigated 3 categories:

Cloud solution
Open-source libraries
Built-in APIs

Description of the Experiment

To select the most suitable speech-to-text transcription solution, we conducted an experiment comparing the candidate solutions against the following 4 criteria:

Accuracy
Cost
Ease of use/development
Task time

We also considered whether the selected tool is compatible with the existing system architecture, and the tradeoffs involved.
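The page does not state how accuracy was scored. One common approach, shown here only as a minimal sketch under that assumption, is word-level accuracy: 1 minus the word error rate, where errors are counted with an edit distance over words. The function names `word_errors` and `word_accuracy` are illustrative and not part of the project.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_accuracy(reference_text, hypothesis_text):
    """Return 1 - WER, clamped at 0, comparing lower-cased whitespace tokens."""
    reference = reference_text.lower().split()
    hypothesis = hypothesis_text.lower().split()
    if not reference:
        raise ValueError("reference transcript must not be empty")
    errors = word_errors(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))
```

With this definition, a transcript that gets one word wrong out of four scores 0.75. Punctuation handling is left out and would need a normalization step before comparing real transcripts.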

Testing Data

We used a section of an Obama speech audio file for the experiment. The chosen audio is a public speech with background noise; it is 29 seconds long, contains 54 words, and is 458 KB in size.

Artifacts Created

Auto-generated transcription based on the sample audio.

Completion Criteria

An open-source transcription solution is proposed and the clients agree on the decision. The clients need a transcription tool that works for a mobile application, with at least 80% accuracy and the lowest possible cost.

Post-Experiment

Experiment results and decision.

Summary of Findings

Cloud Solution: Amazon Transcribe

Results based on the testing data.

Accuracy: 100%
Cost: $0.0004/s * 29s = $0.0116 (for a 2-hour speech, $0.0004/s * 7200s = $2.88, i.e. ~$3)
Ease of use/development: easy
Task time: 62s to transcribe
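The cost figures above follow from a flat per-second rate. A minimal sketch of that arithmetic, assuming the $0.0004/s rate quoted above applies uniformly (no minimum charge or tiering is modelled); `transcription_cost` is an illustrative name:

```python
from decimal import Decimal

# USD per second of audio, as quoted in the results above.
RATE_PER_SECOND = Decimal("0.0004")


def transcription_cost(duration_seconds):
    """Estimated cost in USD for audio of the given length in seconds."""
    # Convert via str so that float inputs do not carry binary rounding noise.
    return RATE_PER_SECOND * Decimal(str(duration_seconds))


sample_cost = transcription_cost(29)             # the 29-second test clip
two_hour_cost = transcription_cost(2 * 60 * 60)  # a 2-hour speech

print(f"29 s sample: ${sample_cost}")
print(f"2 h speech:  ${two_hour_cost}")
```

`Decimal` is used instead of floats so the small per-second rate multiplies out exactly.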

Amazon Transcribe Service has the following limitations:

Languages: English (various accents), Spanish, Portuguese, Italian, French
Formats: mp3, mp4, wav, flac
Maximum audio file length: 4 hours
Maximum audio size: 2 GB
Maximum size of custom vocabularies: 50 KB
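The format, length and size limits listed above could be checked before a file is submitted. The following is a sketch only: the `AudioFile` type, the `limit_violations` helper and the file name `speech.mp3` are hypothetical, and 2 GB is assumed to mean 2 GiB.

```python
from dataclasses import dataclass

# Limits as listed above.
SUPPORTED_FORMATS = {"mp3", "mp4", "wav", "flac"}
MAX_DURATION_SECONDS = 4 * 60 * 60
MAX_SIZE_BYTES = 2 * 1024 ** 3  # assumption: "2 GB" is read as 2 GiB


@dataclass
class AudioFile:
    name: str
    duration_seconds: float
    size_bytes: int


def limit_violations(audio):
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    extension = audio.name.rsplit(".", 1)[-1].lower() if "." in audio.name else ""
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if audio.duration_seconds > MAX_DURATION_SECONDS:
        problems.append("audio longer than 4 hours")
    if audio.size_bytes > MAX_SIZE_BYTES:
        problems.append("file larger than 2 GB")
    return problems


# The test clip from this experiment: 29 seconds, 458 KB (file name is made up).
sample = AudioFile("speech.mp3", 29, 458 * 1024)
print(limit_violations(sample))
```

A clip the size of the test data passes all three checks, so these limits only matter for long recordings.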

Other cloud solutions: Google Cloud Speech-to-Text, IBM Watson Speech to Text, Microsoft Bing Speech to Text

Open-source Solution: CMU Sphinx

Results based on the testing data.

Accuracy: 20%
Cost: free
Ease of use/development: complex
- Requires parameter tuning based on the input speeches
- Requires significant computing power, e.g., a GPU
- Difficult to configure and run
Task time: 11.9s to transcribe
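To compare the two measured task times independently of clip length, one option is the real-time factor (processing time divided by audio length). This normalization is not something the experiment itself reports; it is a small sketch using the measured 62s and 11.9s against the 29-second clip, with `real_time_factor` as an illustrative name.

```python
AUDIO_SECONDS = 29  # length of the test clip


def real_time_factor(processing_seconds, audio_seconds=AUDIO_SECONDS):
    """Processing time per second of audio; above 1 means slower than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return processing_seconds / audio_seconds


# Task times measured in this experiment.
measured = {"Amazon Transcribe": 62.0, "CMU Sphinx": 11.9}

for name, seconds in measured.items():
    print(f"{name}: RTF = {real_time_factor(seconds):.2f}")
```

A single 29-second clip is a small sample, so these ratios should be read as rough indications rather than benchmarks.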

Built-in APIs:

Results based on research only.

Speech Framework on iOS

Accuracy: Not investigated
Cost: high cost in terms of battery drain and network traffic
Ease of use/development: Medium
Task time: No experiment conducted

Limitations:

The number of tasks that can be performed per day is limited by device and network resources
Strict audio length limit of 1 minute
Lock-in to iOS

Speech Recognizer API on Android

Accuracy: Not investigated
Cost: high cost in terms of battery drain and network traffic
Ease of use/development: Medium
Task time: No experiment conducted

Limitations:

Strict audio length limit of 1 minute
Lock-in to Android

Other: Cross-platform app development SDKs (React Native/Flutter) call the native APIs indirectly, so their pros and cons are the same as above.

Engineer's Recommendations

Amazon Transcribe Service is the most suitable solution for the Pinocchio platform given its high accuracy and ease of use. Since a 2-hour speech will only cost about $3, cost is not a major concern in the current development phase.
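The recommendation can be cross-checked against the completion criteria (at least 80% accuracy, lowest possible cost) with a small selection rule. This is a sketch only: the `Candidate` type and `select` function are illustrative, the accuracy and cost values are the ones measured on the 29-second clip, and solutions that were not measured are marked `None` and excluded.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    accuracy: Optional[float]  # fraction of words correct; None if not measured
    cost_usd: Optional[float]  # cost for the test clip; None if not measured


CANDIDATES = [
    Candidate("Amazon Transcribe", 1.00, 0.0116),
    Candidate("CMU Sphinx", 0.20, 0.0),
    # Built-in APIs were assessed by research only, so no figures exist.
    Candidate("Speech Framework on iOS", None, None),
    Candidate("Speech Recognizer API on Android", None, None),
]


def select(candidates, min_accuracy=0.80):
    """Cheapest measured candidate meeting the accuracy threshold, or None."""
    eligible = [
        c for c in candidates
        if c.accuracy is not None
        and c.cost_usd is not None
        and c.accuracy >= min_accuracy
    ]
    return min(eligible, key=lambda c: c.cost_usd) if eligible else None


choice = select(CANDIDATES)
print(choice.name if choice else "no candidate meets the criteria")
```

The rule deliberately ignores ease of use and task time, which the recommendation weighs qualitatively.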