Survey of ML tools 2023 - AudiovisualMetadataPlatform/amp_documentation GitHub Wiki

Survey of ML tools 2023

Tool evaluation categories:

  • Video
    • VCLS - Video classification (black/colorbars/noise)
    • OBJ - Object detection
    • OCR - Optical character recognition
    • SCE - Scene/Shot detection
    • FACE - Face detection
  • Audio
    • ACLS - Audio classification (speech/noise/music/silence)
    • STT - Speech to Text
    • SPK - Speaker detection (diarization)
  • Other
    • NER - Named entity recognition (people, places, brands, etc.)
    • TOP - Topical determination
    • SEN - Sentiment detection
    • EMO - Emotion detection
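The category codes above are used as column headers in the table that follows. As a minimal sketch, the grouping could be captured in a small lookup structure like the one below. The names `CATEGORIES`, `group_of`, and `all_codes` are illustrative only and are not part of AMP.

```python
# Category codes grouped as in the list above.
# CATEGORIES, group_of and all_codes are hypothetical names used for illustration.
CATEGORIES = {
    "Video": {
        "VCLS": "Video classification (black/colorbars/noise)",
        "OBJ": "Object detection",
        "OCR": "Optical character recognition",
        "SCE": "Scene/Shot detection",
        "FACE": "Face detection",
    },
    "Audio": {
        "ACLS": "Audio classification (speech/noise/music/silence)",
        "STT": "Speech to Text",
        "SPK": "Speaker detection (diarization)",
    },
    "Other": {
        "NER": "Named entity recognition",
        "TOP": "Topical determination",
        "SEN": "Sentiment detection",
        "EMO": "Emotion",
    },
}


def group_of(code):
    """Return the group name ("Video", "Audio" or "Other") for a category code."""
    for group, codes in CATEGORIES.items():
        if code in codes:
            return group
    raise KeyError(f"unknown category code: {code}")


def all_codes():
    """Return every category code in list order, matching the table columns."""
    return [code for codes in CATEGORIES.values() for code in codes]
```

For example, `group_of("STT")` resolves to the Audio group, and `all_codes()` yields the twelve codes in the same order as the table header.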

Column groups: Video (VCLS, OBJ, OCR, SCE, FACE), Audio (ACLS, STT, SPK), Other (NER, TOP, SEN, EMO)

Tool Local GPU VCLS OBJ OCR SCE FACE ACLS STT SPK NER TOP SEN EMO

Azure Video (error) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick) (tick)

Whisper (tick) (tick) (tick)

INA Speech (tick) (tick) (tick)

AWS (error)

AWS (error)

MediaPipe (tick) (tick) (tick) (tick) (tick) (tick)


Azure Video Indexer

Output quality depends heavily on whether the advanced analysis levels are selected. For example, without them ACLS identifies only silence.

There is no word-level STT output. Object and face detection results do not include bounding boxes.

Whisper

INA Speech Segmenter

It supports a GPU and actively looks for one, but fails due to a library dependency.
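One possible workaround, sketched below, is to hide the GPU so the tool stays on the CPU. This assumes the failure is triggered while a CUDA-based library initializes the GPU. That assumption has not been verified for this tool, and the page does not name the failing dependency. Setting `CUDA_VISIBLE_DEVICES` to an empty value is a standard CUDA mechanism for hiding all GPUs from a process. The helper names `cpu_only_env` and `run_cpu_only` are hypothetical.

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA GPUs hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty value hides every CUDA device from libraries that honor it.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_cpu_only(command):
    """Run a command (list of strings supplied by the caller) with GPUs hidden.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    return subprocess.run(command, env=cpu_only_env(), check=True)
```

The command passed to `run_cpu_only` would be whatever invocation of the segmenter is already in use. The sketch only changes the environment it runs in.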

Document generated by Confluence on Feb 25, 2025 10:39