Speech to text Transcription - pinocchio61/Architecture GitHub Wiki
Pre-Experiment
The experiment goal and the experiment design.
Purpose
Pinocchio is a crowdsourced speech verification platform, so speech transcription is a key feature: it lets the crowd verify a speech easily by reading its transcription.
There are various ways to implement speech transcription; we investigated three main categories:
- Cloud solution
- Open-source libraries
- Built-in APIs
Description of the Experiment
To select the most suitable speech-to-text transcription solution, we conducted an experiment comparing the candidate solutions against the following 4 criteria:
- Accuracy
- Cost
- Ease to use/develop
- Task time
We also considered whether the selected tool is compatible with the existing system architecture, and the tradeoffs involved.
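This page does not state how accuracy was scored. A minimal sketch of one common approach, assuming word-level accuracy defined as 1 minus the word error rate against a hand-written reference transcript, is shown below. The function names and the scoring rule are assumptions for illustration, not the method used in the experiment.

```python
def word_errors(reference: list[str], hypothesis: list[str]) -> int:
    """Word-level edit distance: substitutions + insertions + deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, clamped at 0. Case is ignored; punctuation is not stripped."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    errors = word_errors(ref_words, hyp_words)
    return max(0.0, 1.0 - errors / len(ref_words))
```

Under this definition, a transcript with one wrong word out of four scores 0.75, and a perfect transcript scores 1.0.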
Testing Data
We used a section of an Obama speech as the audio file for the experiment. The chosen file is a public speech with background noise; it is 29 seconds long, contains 54 words, and is 458 KB in size.
Artifacts Created
Auto-generated transcription based on the sample audio.
Completion Criteria
An open-source transcription solution is proposed and the clients agree with the decision. The clients need a transcription tool that works for a mobile application, with at least 80% accuracy and the lowest possible cost.
Post-Experiment
Experiment results and decision.
Summary of Findings
Amazon Transcribe
Cloud Solution: Results based on the testing data.
- Accuracy: 100%
- Cost: $0.0004/s * 29s = $0.0116 (for a 2-hour speech, cost = ~$3)
- Ease to use/develop: easy
- Task time: transcription took 62s
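The cost figures above follow from a flat per-second rate. As a quick check of the arithmetic, here is a minimal sketch that assumes the $0.0004/s rate quoted on this page (which may not match current pricing) and uses `Decimal` to avoid floating-point rounding. The function name is hypothetical.

```python
from decimal import Decimal

# Rate quoted on this page; treat it as an assumption, not current pricing.
RATE_PER_SECOND = Decimal("0.0004")  # USD per second of audio


def transcription_cost(duration_seconds: int) -> Decimal:
    """Cost in USD for transcribing audio of the given length at a flat rate."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return RATE_PER_SECOND * duration_seconds


sample_cost = transcription_cost(29)              # 0.0116 for the 29 s sample
two_hour_cost = transcription_cost(2 * 60 * 60)   # 2.88, i.e. roughly $3
```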
Amazon Transcribe Service has the following limitations:
- Languages: English (various accents), Spanish, Portuguese, Italian, French
- Format: mp3, mp4, wav, flac
- Maximum audio file length: 4 hours
- Maximum audio size: 2GB
- Maximum size of custom vocabularies: 50KB
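The format, length, and size limits listed above could be checked on the client before an audio file is uploaded. The sketch below assumes that "2GB" means 2 GiB and that the format can be inferred from the file extension; the function name and the example file names are hypothetical.

```python
from pathlib import Path

# Limits as listed on this page.
SUPPORTED_FORMATS = {"mp3", "mp4", "wav", "flac"}
MAX_DURATION_SECONDS = 4 * 60 * 60   # 4 hours
MAX_SIZE_BYTES = 2 * 1024**3         # assumption: "2GB" read as 2 GiB


def limit_violations(filename: str, duration_seconds: float, size_bytes: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file fits."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("audio longer than 4 hours")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file larger than 2GB")
    return problems
```

For example, `limit_violations("sample.mp3", 29, 458 * 1024)` returns an empty list, while a 5-hour `.ogg` file would be reported for both its format and its length.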
Other cloud solutions: Google Cloud Speech-to-Text, IBM Watson Speech to Text, Microsoft Bing Speech to Text
CMU Sphinx
Open-source Solution: Results based on the testing data.
- Accuracy: 20%
- Cost: free
- Ease to use/develop: complex
  - Requires parameter tuning based on the input speeches
  - Requires substantial computing power, e.g., a GPU
  - Difficult to configure and run
- Task time: transcription took 11.9s
Built-in APIs:
Results based on research only.
- Speech Framework on iOS
  - Accuracy: Not investigated
  - Cost: high cost in terms of battery drain and network traffic
  - Ease to use/develop: Medium
  - Task time: No experiment conducted
  - Limitations:
    - The number of tasks that can be performed per day is limited by device and network resources
    - Strict audio length limit of 1 minute
    - Lock-in on iOS
- Speech Recognizer API on Android
  - Accuracy: Not investigated
  - Cost: high cost in terms of battery drain and network traffic
  - Ease to use/develop: Medium
  - Task time: No experiment conducted
  - Limitations:
    - Strict audio length limit of 1 minute
    - Lock-in on Android
Other: Cross-platform app development SDKs (React Native/Flutter) call the native APIs indirectly, so their pros and cons are the same as above.
Engineer's Recommendations
Amazon Transcribe Service is the most suitable solution for the Pinocchio platform given its high accuracy and ease of use. Since a 2-hour speech costs only about $3, cost is not a major concern in the current development phase.
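The recommendation can be cross-checked against the completion criteria (at least 80% accuracy, then lowest cost). The sketch below is illustrative only: it encodes the two measured solutions from the findings, leaves out the built-in APIs because they were not measured, and ignores ease of use and task time. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

MIN_ACCURACY = 0.80  # completion criterion: at least 80% accuracy


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float     # fraction of words correct on the 29 s sample
    cost_usd: Decimal   # cost to transcribe the 29 s sample


def pick_solution(candidates: list[Candidate]) -> Optional[Candidate]:
    """Keep candidates meeting the accuracy bar, then choose the cheapest one."""
    eligible = [c for c in candidates if c.accuracy >= MIN_ACCURACY]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_usd)


# Figures taken from the Summary of Findings on this page.
measured = [
    Candidate("Amazon Transcribe", 1.00, Decimal("0.0116")),
    Candidate("CMU Sphinx", 0.20, Decimal("0")),
]
choice = pick_solution(measured)  # Amazon Transcribe: CMU Sphinx fails the 80% bar
```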