Survey of ML tools 2023 - AudiovisualMetadataPlatform/amp_documentation GitHub Wiki

Survey of ML tools 2023

Tool evaluation categories:

Video
- VCLS - Video classification (black/colorbars/noise)
- OBJ - Object detection
- OCR - Optical character recognition
- SCE - Scene/Shot detection
- FACE - Face detection
Audio
- ACLS - Audio classification (speech/noise/music/silence)
- STT - Speech to Text
- SPK - Speaker detection (diarization)
Other
- NER - Named entity recognition (People, Places, Brands, etc.)
- TOP - Topical determination
- SEN - Sentiment detection
- EMO - Emotion
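
For scripting against these categories, the codes can be held in a small lookup structure. The following is only a sketch: the codes and descriptions are copied from the list above, while the names `CATEGORIES`, `ALL_CODES`, and `group_of` are hypothetical and not part of AMP.

```python
# Category codes and descriptions as listed above.
# The dict layout and helper names are illustrative, not an AMP API.
CATEGORIES = {
    "Video": {
        "VCLS": "Video classification (black/colorbars/noise)",
        "OBJ": "Object detection",
        "OCR": "Optical character recognition",
        "SCE": "Scene/Shot detection",
        "FACE": "Face detection",
    },
    "Audio": {
        "ACLS": "Audio classification (speech/noise/music/silence)",
        "STT": "Speech to Text",
        "SPK": "Speaker detection (diarization)",
    },
    "Other": {
        "NER": "Named entity recognition (People, Places, Brands, etc.)",
        "TOP": "Topical determination",
        "SEN": "Sentiment detection",
        "EMO": "Emotion",
    },
}

# Flat list of codes in the same order as the table columns.
ALL_CODES = [code for codes in CATEGORIES.values() for code in codes]


def group_of(code):
    """Return the group (Video, Audio, Other) that a category code belongs to."""
    for group, codes in CATEGORIES.items():
        if code in codes:
            return group
    raise KeyError(code)
```

For example, `group_of("STT")` looks up the group for the speech-to-text code, and `ALL_CODES` can serve as the column order when rebuilding the comparison table.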

Column groups: Video (VCLS OBJ OCR SCE FACE), Audio (ACLS STT SPK), Other (NER TOP SEN EMO)

Tool Local GPU VCLS OBJ OCR SCE FACE ACLS STT SPK NER TOP SEN EMO

Azure Video (error) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick)

Whisper (tick) (tick) (tick)

INA Speech (tick) (tick) (tick)

AWS (error)

MediaPipe (tick) (tick) (tick) (tick) (tick) (tick)

Azure Video Indexer

Output quality depends heavily on whether the advanced levels are chosen. For example, without them ACLS will only identify silence.

There's no word-level STT. Object/Face detection doesn't include the bounding box.

Whisper

INA Speech Segmenter

It supports a GPU and actively looks for one, but GPU use fails because of a library dependency.
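
When diagnosing this kind of failure, a quick check of whether the underlying framework can see a GPU at all may help. The sketch below assumes the segmenter runs on TensorFlow (not stated on this page) and uses a hypothetical helper name, `visible_gpus`. It does not identify which library dependency is at fault.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is absent.

    Assumption: the segmenter relies on TensorFlow for GPU access.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return None
    # An empty list means TensorFlow loaded but found no usable GPU,
    # which is often a sign of a missing or mismatched GPU library.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed")
    elif not gpus:
        print("TensorFlow found no GPU")
    else:
        print("GPUs visible to TensorFlow:", gpus)
```

An empty result on a machine that does have a GPU suggests looking at the TensorFlow startup log for libraries it could not load.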

Document generated by Confluence on Feb 25, 2025 10:39