Survey of ML tools 2023 - AudiovisualMetadataPlatform/amp_documentation GitHub Wiki
Survey of ML tools 2023
Tool evaluation categories:
- Video
  - VCLS - Video classification (black/colorbars/noise)
  - OBJ - Object detection
  - OCR - Optical character recognition
  - SCE - Scene/Shot detection
  - FACE - Face detection
- Audio
  - ACLS - Audio classification (speech/noise/music/silence)
  - STT - Speech to Text
  - SPK - Speaker detection (diarization)
- Other
  - NER - Named entity recognition (People, Places, Brands, etc.)
  - TOP - Topical determination
  - SEN - Sentiment detection
  - EMO - Emotion
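The category codes above can be kept in a small lookup structure, which may be convenient when recording per-tool results. The following is only a minimal sketch: the names `CATEGORIES`, `group_of`, and `all_codes` are illustrative assumptions and are not part of AMP or any of the surveyed tools.

```python
# Evaluation categories from the list above, grouped by media type.
# All names here are illustrative; they are not part of AMP.
CATEGORIES = {
    "Video": {
        "VCLS": "Video classification (black/colorbars/noise)",
        "OBJ": "Object detection",
        "OCR": "Optical character recognition",
        "SCE": "Scene/Shot detection",
        "FACE": "Face detection",
    },
    "Audio": {
        "ACLS": "Audio classification (speech/noise/music/silence)",
        "STT": "Speech to Text",
        "SPK": "Speaker detection (diarization)",
    },
    "Other": {
        "NER": "Named entity recognition (People, Places, Brands, etc.)",
        "TOP": "Topical determination",
        "SEN": "Sentiment detection",
        "EMO": "Emotion",
    },
}


def group_of(code):
    """Return the group ("Video", "Audio", or "Other") a category code belongs to."""
    for group, codes in CATEGORIES.items():
        if code in codes:
            return group
    raise KeyError(f"unknown category code: {code}")


def all_codes():
    """Return every category code in the order used by the survey table."""
    return [code for codes in CATEGORIES.values() for code in codes]
```

For instance, `group_of("STT")` returns `"Audio"`, and `all_codes()` yields the twelve capability columns in the same order as the table header.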
\ Video Audio Other
Tool Local GPU VCLS OBJ OCR SCE FACE ACLS STT SPK NER TOP SEN EMO
Azure Video ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Whisper ✓ ✓
INA Speech ✓ ✓
AWS
AWS
MediaPipe ✓ ✓ ✓ ✓ ✓
Azure Video Indexer
Output quality depends heavily on whether the advanced levels are chosen. For example, for ACLS it will only identify silence.
There is no word-level STT. Object and face detection results do not include bounding boxes.
Whisper
INA Speech Segmenter
It supports running on a GPU and actively looks for one, but GPU use fails because of a library dependency.
Document generated by Confluence on Feb 25, 2025 10:39