Comparing Speech Recognition Engines - osprey-voice/osprey GitHub Wiki
Google Cloud Speech-to-Text
Pros
- is great for speech mode
- uses context-based inference
- has a large vocabulary
- supports a lot of languages
- is very lightweight on resources since it runs on a remote server
Cons
- proprietary
- isn't free
- uses a remote server
- requires an internet connection
- adds noticeable latency
- isn't great for command mode
  - doesn't allow you to specify a command graph
    - allows you to specify preferred phrases, which helps but isn't good enough
  - uses context-based inference
    - certain keywords are harder to pick up depending on the context
    - gives non-deterministic results based on the context
- gets a little pricey with heavy use
- at the mercy of Google since it's service-based
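
As a rough illustration of the preferred-phrases point above, here is a minimal sketch of a request body for the v1 `speech:recognize` REST method, with phrase hints passed through the `speechContexts` field of the recognition config. The helper name `build_recognize_request` and the phrase list are hypothetical, and nothing is actually sent over the network. The phrases only bias recognition; they don't restrict the output the way a command graph would.

```python
import base64
import json


def build_recognize_request(audio_bytes, phrases, language_code="en-US",
                            sample_rate_hertz=16000):
    """Build a JSON-serializable body for the v1 `speech:recognize` method.

    `phrases` are hints that bias recognition toward those words. They do
    not limit the output to them, so this is not a command graph.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
            "speechContexts": [{"phrases": list(phrases)}],
        },
        # The REST API expects inline audio as base64 text.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


# Hypothetical command vocabulary, for illustration only.
COMMAND_PHRASES = ["scratch that", "select line", "go to definition"]

# 160 samples of 16-bit silence stand in for real microphone audio.
body = build_recognize_request(b"\x00\x00" * 160, COMMAND_PHRASES)
payload = json.dumps(body)
```

The resulting `payload` would be POSTed with credentials to the service. Even with these hints, a phrase outside the list can still come back as the transcript.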
Kaldi
An older but up-to-date speech recognition engine that is a DNN-HMM hybrid.
Pros
- there are a lot of models to choose from
- works great for command mode
- can dynamically set the grammar
- the medium-sized models work pretty well for speech mode
- the medium-sized models only use about 1 GB of memory
Cons
- the tiny models are missing too much vocabulary for speech mode
- the large models take up a lot of memory, roughly 2 to 3.5 GB
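
To make the idea of a command graph that can be swapped at runtime a bit more concrete, here is a toy sketch in plain Python. It is not Kaldi's API (Kaldi compiles grammars into decoding graphs that constrain the recognizer itself); the names `build_graph` and `match` and the example commands are hypothetical. It only shows why a fixed graph gives deterministic results: an utterance either follows a path in the graph or is rejected.

```python
END = object()  # sentinel key marking a complete command; never equals a word


def build_graph(commands):
    """Turn {"phrase words": "action"} into a word-level prefix graph."""
    graph = {}
    for phrase, action in commands.items():
        node = graph
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = action
    return graph


def match(graph, utterance):
    """Return the action for an utterance, or None if it leaves the graph."""
    node = graph
    for word in utterance.split():
        if word not in node:
            return None
        node = node[word]
    return node.get(END)


# Two hypothetical grammars; switching between them stands in for
# setting the grammar dynamically.
editing = build_graph({"select line": "select_line",
                       "select word": "select_word"})
browsing = build_graph({"next tab": "next_tab"})
```

With the `editing` graph active, `match(editing, "select line")` resolves to an action, while `match(editing, "next tab")` is rejected until the `browsing` graph is swapped in.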
wav2letter
A newer DNN speech recognition engine from Facebook.
Pros
Cons
- seems a little unapproachable from an end-user perspective
- seems mainly tailored to researchers
- there doesn't seem to be a nice and simple Python API
- training models is impractical for a regular user since it needs a lot of GPUs
Picovoice Cheetah
Pros
- very lightweight
Cons
- has a weird license
- doesn't give great results
Mozilla DeepSpeech
A newer DNN speech recognition engine.
Pros
Cons
- doesn't give the best results
PocketSphinx
Deprecated in favor of using Kaldi with a lightweight model like one of these.
Dragon Dictation
Pros
Cons
- proprietary
- only works on Windows and macOS
- it's possible to run it in a VM or on another computer and stream the results, but this can be difficult to set up
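
For the streaming setup mentioned above, one possible shape for the transport is newline-delimited JSON over a socket. This is only a sketch under that assumption: the message fields (`text`, `final`) and the helper names are hypothetical, it says nothing about how results are obtained from Dragon itself, and a local socket pair stands in for the real VM-to-host connection.

```python
import json
import socket


def send_result(sock, text, final=True):
    """Send one recognition result as a single JSON line."""
    line = json.dumps({"text": text, "final": final}) + "\n"
    sock.sendall(line.encode("utf-8"))


def read_results(sock):
    """Yield decoded results until the sender closes the connection."""
    with sock.makefile("r", encoding="utf-8") as stream:
        for line in stream:
            yield json.loads(line)


# Local demonstration: a connected socket pair stands in for the
# network connection between the VM and the host.
sender, receiver = socket.socketpair()
send_result(sender, "hello world")
send_result(sender, "scratch that")
sender.close()  # closing the sender ends the receiver's loop
received = list(read_results(receiver))
receiver.close()
```

In a real setup the sender would run next to the recognizer and the receiver next to the application consuming the text, with a TCP connection in place of the socket pair.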