Terminology - osprey-voice/osprey GitHub Wiki

Voice Typing

action: The callback that gets executed when a transcript matches a certain rule.
audio:
- erroneous audio: Non-voice audio that the microphone picks up unintentionally, such as breathing or background noise, and that should be filtered out.
- voice audio: Audio from speaking that should be picked up by the voice typing program and converted into a command.
CCR: Continuous Command Recognition. A feature of a voice typing program that allows users to speak several commands consecutively without having to pause between each.
choice: A grammar element and placeholder that can match one of several predefined words.
command: A pairing of a rule and an action.
dictation: Speaking natural language, such as words, phrases, or sentences, as part of a command.
grammar: A nested tree of grammar elements that a rule is compiled into. Also used to refer to the collection of all available rules that a transcript can be matched against.
grammar complexity: A measure of the complexity of a grammar based on the number of rules and how complex the patterns are.
grammar element: A building block of a grammar. There are different types of grammar elements that allow for building different patterns.
keyword: Any word that appears in a rule, including words in a choice element.
match object: A result passed to an action with information based on the rule that was matched and the transcript.
placeholder: A grammar element that acts as a variable for certain words. The value of a placeholder is added to the match object.
rule: A pattern of words and placeholders that a transcript is matched against. Each rule maps to an action.
voice typing: Using your voice to control your computer with key presses, commands, and dictation.
voice typing program: A desktop program that allows you to do voice typing.
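
The relationship between rule, choice, match object, action, and command can be sketched in a few lines of Python. This is only an illustration of the terms above, not Osprey's actual API: the names `Choice`, `Command`, `match`, and `run` are hypothetical, the match object is assumed to be a plain dict, and a rule is assumed to be a flat list rather than a nested grammar.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Choice:
    """Hypothetical placeholder that matches one of several predefined words."""
    name: str
    options: Dict[str, str]  # spoken word -> value stored in the match object


# A rule element is either a literal keyword or a placeholder.
Element = Union[str, Choice]


@dataclass
class Command:
    """A pairing of a rule (pattern) and an action (callback)."""
    rule: List[Element]
    action: Callable[[Dict[str, str]], str]


def match(rule: List[Element], transcript: str) -> Optional[Dict[str, str]]:
    """Return a match object (a dict here) if the transcript fits the rule."""
    words = transcript.lower().split()
    if len(words) != len(rule):
        return None
    result: Dict[str, str] = {}
    for element, word in zip(rule, words):
        if isinstance(element, Choice):
            if word not in element.options:
                return None
            # The value of a placeholder is added to the match object.
            result[element.name] = element.options[word]
        elif element != word:
            return None
    return result


def run(commands: List[Command], transcript: str) -> Optional[str]:
    """Execute the action of the first command whose rule matches."""
    for command in commands:
        found = match(command.rule, transcript)
        if found is not None:  # an empty match object is still a match
            return command.action(found)
    return None


direction = Choice("direction", {"left": "Left", "right": "Right"})
commands = [Command(["press", direction], lambda m: f"key:{m['direction']}")]
```

With these definitions, `run(commands, "press left")` calls the action with a match object containing `direction`, while a transcript such as `"press sideways"` matches no rule and returns `None`.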

Speech Recognition

ASR: Automatic Speech Recognition. Another term for speech recognition.
decoding/transcribing: The process of converting audio to text.
- online decoding: Streaming audio in chunks to be decoded in real time with multiple intermediate results and one final result.
- offline decoding: Decoding one chunk of audio and getting one result. If using offline decoding to transcribe a microphone stream, you have to use a VAD to segment the audio and then decode each segment.
enrollment/training: When an individual speaker reads text or vocabulary to a speech recognition system to fine-tune the system for that individual.
- speaker dependent: Systems that use enrollment/training.
- speaker independent: Systems that do not use enrollment/training.
model:
- language model:
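RTF: Real Time Factor. The time a speech recognition engine takes to process some audio divided by the duration of that audio. Measures how quickly the engine is able to return results; a value below 1 means decoding runs faster than real time.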
SOTA: State Of The Art
speech recognition engine: Transcribes speech from some given audio based on a given model.
STT: Speech To Text
transcript: A sequence of words that a speech recognition engine generates based on some audio.
VAD: Voice Activity Detection. Detecting which portions of an audio stream contain speech.
vocabulary: The set of words that a speech recognition engine can transcribe with a given model.
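WER: Word Error Rate. The proportion of words that a speech recognition engine transcribes incorrectly with a given model: the number of substituted, deleted, and inserted words divided by the number of words in the reference text. A measurement of accuracy; lower is better.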
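
As a rough illustration of the two metrics in this section, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, and RTF as processing time divided by audio duration. The function names are made up for this example, and real evaluation tools usually normalize case and punctuation first, which is skipped here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,         # deletion (reference word missing)
                current[j - 1] + 1,      # insertion (extra hypothesis word)
                previous[j - 1] + cost,  # substitution or exact match
            )
        previous = current
    return previous[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

For example, comparing the reference `"open the file"` with the transcript `"open a file"` gives one substitution out of three words, and decoding 2 seconds of audio in 0.5 seconds gives an RTF of 0.25. Note that WER can exceed 1 when the transcript contains many inserted words.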

Microphones

cardioid:

Osprey

Osprey script: A Python file that includes user-specified commands and is loaded by Osprey at runtime.