Speech to text Transcription - pinocchio61/Architecture GitHub Wiki

Pre-Experiment

The experiment goal and the experiment design.

Purpose

For a crowdsourced speech verification platform, speech transcription is a key component: it lets the crowd verify a speech with ease by reading its transcription.

There are various ways to enable speech transcribing, and we mainly investigated 3 categories:

  1. Cloud solution
  2. Open-source libraries
  3. Built-in APIs

Description of the Experiment

To select the most suitable speech-to-text transcription solution, we conducted an experiment comparing different solutions against the following 4 criteria:

  • Accuracy
  • Cost
  • Ease of use/development
  • Task time

We also considered whether the selected tool is compatible with the existing system architecture, and the tradeoffs involved.
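The experiment does not state how accuracy was scored. One common approach, shown here only as an assumption, is word-level accuracy: 1 minus the word error rate, where errors are counted with an edit distance over words. The function names below are hypothetical and the sketch ignores punctuation handling.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_accuracy(reference_text, hypothesis_text):
    """Return 1 - WER, clamped at 0 (assumed scoring, not the wiki's stated method)."""
    reference = reference_text.lower().split()
    hypothesis = hypothesis_text.lower().split()
    if not reference:
        raise ValueError("reference transcript is empty")
    errors = word_edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Toy transcripts for illustration only, not the experiment's audio.
    print(word_accuracy("a b c d e", "a b x d"))
```

With a manually prepared reference transcript of the 54-word sample, the same function could score each tool's output against the 80% threshold.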

Testing Data

We used an excerpt from an Obama speech as the test audio. The chosen file is a public speech with background noise, 29 seconds long, containing 54 words. The file size is 458 KB.

Artifacts Created

Auto-generated transcription based on the sample audio.

Completion Criteria

An open-source transcription solution is proposed and the clients agree on the decision. The clients need a transcription tool that works for a mobile application, with at least 80% accuracy and the lowest possible cost.

Post-Experiment

Experiment results and decision.

Summary of Findings

Cloud Solution: Amazon Transcribe

Results based on the testing data.

  • Accuracy: 100%
  • Cost: $0.0004/s * 29s = $0.0116 (for a 2-hour speech, $0.0004/s * 7200s = $2.88, i.e. ~$3)
  • Ease of use/development: easy
  • Task time: 62s to transcribe the sample
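The cost figures above follow from the per-second rate. A minimal sketch of the arithmetic, assuming the $0.0004/s rate quoted in this experiment (actual pricing may differ); the function name is hypothetical:

```python
from decimal import Decimal

# Per-second rate quoted in the findings; treat it as an assumption.
RATE_USD_PER_SECOND = Decimal("0.0004")


def transcription_cost(duration_seconds):
    """Return the estimated cost in USD for audio of the given length."""
    return RATE_USD_PER_SECOND * Decimal(str(duration_seconds))


if __name__ == "__main__":
    print("29 s sample:", transcription_cost(29))
    print("2-hour speech:", transcription_cost(2 * 60 * 60))
```

Decimal is used instead of float so that the small per-second rate multiplies out without rounding noise.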

The Amazon Transcribe service has the following limitations:

  • Languages: English (different accents), Spanish, Portuguese, Italian, French
  • Format: mp3, mp4, wav, flac
  • Maximum audio file length: 4 hours
  • Maximum audio size: 2GB
  • Maximum size of custom vocabularies: 50KB
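Before submitting audio, a client could check it against the limits listed above. The sketch below is illustrative only: the class and function names are hypothetical, the limits are the ones recorded in this experiment (the service's current limits may differ), and 1GB is assumed to mean 1024^3 bytes.

```python
from dataclasses import dataclass

# Limits as listed in this experiment; verify against current service docs.
SUPPORTED_FORMATS = {"mp3", "mp4", "wav", "flac"}
MAX_DURATION_SECONDS = 4 * 60 * 60      # 4 hours
MAX_SIZE_BYTES = 2 * 1024 ** 3          # 2GB, assuming 1GB = 1024^3 bytes


@dataclass
class AudioFile:
    name: str
    duration_seconds: float
    size_bytes: int


def check_transcribe_limits(audio):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    extension = audio.name.rsplit(".", 1)[-1].lower() if "." in audio.name else ""
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if audio.duration_seconds > MAX_DURATION_SECONDS:
        problems.append("audio is longer than 4 hours")
    if audio.size_bytes > MAX_SIZE_BYTES:
        problems.append("audio is larger than 2GB")
    return problems


if __name__ == "__main__":
    # The test sample from this experiment: 29 s, 458 KB (file name assumed).
    sample = AudioFile("speech.mp3", 29, 458 * 1024)
    print(check_transcribe_limits(sample))
```

Returning a list of problems rather than raising on the first one lets the app show the user everything that needs fixing at once.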

Other cloud solutions: Google Cloud Speech-to-Text, IBM Watson Speech to Text, Microsoft Bing Speech to Text

Open-source Solution: CMU Sphinx

Results based on the testing data.

  • Accuracy: 20%
  • Cost: free
  • Ease of use/development: complex
    • Requires parameter tuning based on the input speech
    • Requires computing power, e.g. a GPU
    • Difficult to configure and run
  • Task time: 11.9s to transcribe the sample

Other open-source solutions

Built-in APIs:

Results based on research only.

  1. Speech Framework on iOS
  • Accuracy: not investigated
  • Cost: high in terms of battery drain and network traffic
  • Ease of use/development: medium
  • Task time: no experiment conducted

Limitations:

  • The number of tasks that can be performed per day is limited by device and network resources
  • Audio length is strictly limited to 1 minute
  • Lock-in to iOS

  2. Speech Recognizer API on Android
  • Accuracy: not investigated
  • Cost: high in terms of battery drain and network traffic
  • Ease of use/development: medium
  • Task time: no experiment conducted

Limitations:

  • Audio length is strictly limited to 1 minute
  • Lock-in to Android

Other: cross-platform app development SDKs (React Native/Flutter) call the native APIs indirectly, so their pros and cons are the same as above.

Engineer's Recommendations

Amazon Transcribe is the most suitable solution for the Pinocchio platform given its high accuracy and ease of use. Since a 2-hour speech costs only about $3, cost is not a major concern in the current development phase.
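The recommendation can be read as applying the clients' requirement (at least 80% accuracy, then lowest cost) to the measured results. A rough sketch of that rule, using only the figures reported on this page; the data layout and function name are hypothetical, and the built-in APIs are left unscored because their accuracy was not measured:

```python
# Figures taken from the findings on this page; None means "not measured".
CANDIDATES = [
    {"name": "Amazon Transcribe", "accuracy": 1.00, "cost_usd": 0.0116},
    {"name": "CMU Sphinx", "accuracy": 0.20, "cost_usd": 0.0},
    {"name": "Speech Framework on iOS", "accuracy": None, "cost_usd": None},
    {"name": "Speech Recognizer API on Android", "accuracy": None, "cost_usd": None},
]


def select_solution(candidates, min_accuracy=0.80):
    """Pick the cheapest candidate that meets the accuracy threshold."""
    eligible = [
        c for c in candidates
        if c["accuracy"] is not None and c["accuracy"] >= min_accuracy
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["cost_usd"])


if __name__ == "__main__":
    choice = select_solution(CANDIDATES)
    print(choice["name"] if choice else "no candidate meets the threshold")
```

This captures only accuracy and monetary cost; ease of use, task time, and architectural fit were judged qualitatively in the experiment and are not modeled here.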