202401 multi‐modal LLM - bluekingsong/vision_material GitHub Wiki
===Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs: MLLM defects in visual reasoning, such as orientation (facing left or right), object visibility, object state (open or closed, etc.), object count, color, etc.
Most MLLMs use off-the-shelf CLIP vision encoders to process images, so defects in CLIP carry over into the MLLM.
method: step 1: find "CLIP-blind" image pairs that are close in CLIP embedding space but distant in DINOv2 space. step 2: have humans describe the difference within each pair. step 3: construct questions based on the human descriptions.
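A minimal sketch of step 1, assuming precomputed CLIP and DINOv2 embedding matrices (one row per image); the threshold values are illustrative, not taken from the paper:

```python
import numpy as np

def find_clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    """Return index pairs (i, j) whose CLIP embeddings are nearly identical
    but whose DINOv2 embeddings disagree -- candidate 'CLIP-blind' pairs."""
    def cos_sim(m):
        n = m / np.linalg.norm(m, axis=1, keepdims=True)
        return n @ n.T

    clip_sim = cos_sim(clip_emb)
    dino_sim = cos_sim(dino_emb)
    pairs = []
    for i in range(len(clip_emb)):
        for j in range(i + 1, len(clip_emb)):
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j))
    return pairs
```

The idea: CLIP maps the two images to almost the same point, so any question about their visible difference is invisible to a CLIP-based MLLM, while DINOv2 still tells them apart.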
solution: mix DINOv2 features into the image features alongside the CLIP features, via interleaved concatenation or additive fusion.
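The two mixing variants can be sketched as below, assuming both token sequences have already been projected to the same dimension (function names are mine, not from the paper):

```python
import numpy as np

def additive_mof(clip_tokens, dino_tokens):
    # Additive mixing: element-wise average of the two projected feature maps;
    # the visual token count stays unchanged.
    return 0.5 * (clip_tokens + dino_tokens)

def interleaved_mof(clip_tokens, dino_tokens):
    # Interleaved mixing: alternate CLIP and DINOv2 tokens along the sequence
    # axis, doubling the visual token count while preserving spatial order.
    n, d = clip_tokens.shape
    out = np.empty((2 * n, d), dtype=clip_tokens.dtype)
    out[0::2] = clip_tokens
    out[1::2] = dino_tokens
    return out
```

Additive mixing is cheaper (no extra tokens), while interleaving keeps both feature streams intact for the LLM to attend to separately.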
===V∗: Guided Visual Search as a Core Mechanism in Multimodal LLMs: divide the whole image into sub-images, then identify the target area to answer the question (VQA).
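The divide-then-locate idea can be sketched as a grid split plus a relevance scorer; the scorer here is a stand-in for the model's actual query-conditioned search, which the sketch does not implement:

```python
import numpy as np

def split_into_tiles(img, rows, cols):
    """Split an HxWxC image array into a rows x cols grid of sub-images."""
    h, w = img.shape[:2]
    th, tw = h // rows, w // cols
    return [img[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(rows) for c in range(cols)]

def guided_search(img, score_fn, rows=2, cols=2):
    """Return the sub-image that the (caller-supplied) relevance scorer
    ranks highest; the MLLM then answers the question from that crop."""
    tiles = split_into_tiles(img, rows, cols)
    scores = [score_fn(t) for t in tiles]
    return tiles[int(np.argmax(scores))]
```

Working on the selected crop lets the model resolve small targets that would be lost when the full image is downsampled to the vision encoder's input size.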