
Grounded with SAM2

  • We use Grounded SAM2 to automatically detect and segment objects in an image based on a text prompt (e.g., "monitor. keyboard. mouse.").
  • It combines Grounding DINO (for text-based object detection) with SAM2 (for high-quality segmentation), producing pixel-accurate masks for each object.
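As a rough illustration of how the text prompt drives detection, the sketch below parses a Grounding DINO-style prompt (classes separated by periods, e.g. "monitor. keyboard. mouse.") and maps detected phrases back to prompt classes. `parse_prompt` and `match_phrase` are hypothetical helpers for illustration, not part of the Grounded SAM2 API.

```python
def parse_prompt(prompt: str) -> list:
    """Split a Grounding DINO text prompt ("a. b. c.") into class names."""
    return [p.strip() for p in prompt.split(".") if p.strip()]


def match_phrase(phrase: str, classes: list):
    """Map a detected phrase back to the closest prompt class.

    Grounding DINO can return multi-word phrases, so we match by
    substring in either direction; returns None if nothing matches.
    """
    phrase = phrase.lower()
    for cls in classes:
        if cls in phrase or phrase in cls:
            return cls
    return None
```

In the real pipeline, the matched class name is what gets attached to each box before SAM2 turns the box into a pixel-accurate mask.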

Masks are used to:

  • visualize object boundaries
  • generate per-class binary masks
  • apply inpainting models to replace or modify specific regions
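The per-class binary masks mentioned above can be built by OR-ing together the per-instance masks that SAM2 returns. This is a minimal sketch assuming masks come as a stack of boolean arrays with one label per instance; `per_class_masks` is an illustrative helper, not a library function.

```python
import numpy as np


def per_class_masks(masks: np.ndarray, labels: list) -> dict:
    """Combine per-instance masks (N, H, W) into one binary mask per class.

    Instances of the same class are merged with logical OR, so no
    instance is silently dropped.
    """
    out = {}
    for mask, label in zip(masks, labels):
        mask = mask.astype(bool)
        if label in out:
            out[label] |= mask
        else:
            out[label] = mask
    return out
```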

Example Output

Test1

  • Original image: test

  • Grounded SAM2.1 output: Unknown-3

  • Masks: desk, deskmat, laptop, monitor, monitor_desktop, mouse

Test2

  • Original image: test

  • Grounded SAM2.1 bounding box with mask

  • Masks: desk, deskmat, monitor, monitor_desktop, keyboard, mouse, speaker, speaker_speaker

Test3

  • Original image: test3

  • Grounded SAM2.1 output: Unknown-3

  • Masks: desk, deskmat, monitor, monitor_desktop, laptop, keyboard, mouse, image, speaker, speaker_speaker

Test4

  • Original image: test4

  • Grounded SAM2.1 output: Unknown-4

  • Masks: desk, monitor, desktop, monitor_desktop, laptop_desktop, keyboard, mouse, speaker, speaker_speaker

Issues

  • When saving masks, objects with the same class name overwrite each other, so only the last instance is preserved
  • In some cases (e.g., Test1), unintended objects such as another person's mouse may be detected
  • When the label confidence is low or ambiguous, a single object may be split into multiple segments (e.g., "monitor_desktop" detected as two parts)
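The same-class overwrite issue can be avoided by appending a per-class instance index to each output filename. A minimal sketch, assuming one label per detected instance; `unique_mask_names` is a hypothetical helper, not part of the pipeline code.

```python
from collections import defaultdict


def unique_mask_names(labels: list) -> list:
    """Give each instance a distinct filename (mouse_0.png, mouse_1.png, ...)
    so same-class masks no longer overwrite each other on save."""
    counts = defaultdict(int)
    names = []
    for label in labels:
        names.append(f"{label}_{counts[label]}.png")
        counts[label] += 1
    return names
```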

Next Steps

  • Merge all masks into a single combined mask
  • Apply SDXL inpainting using the original image and the merged mask
  • Fine-tune the SDXL inpainting model for better domain-specific results
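The first two steps above can be sketched as follows. `merge_masks` OR-combines all binary masks into one; `inpaint_with_sdxl` is an unexecuted outline of the SDXL inpainting call using the `diffusers` library, with an assumed checkpoint name and placeholder prompt (the fine-tuned model from the last step would be swapped in here).

```python
import numpy as np


def merge_masks(masks: list) -> np.ndarray:
    """Merge all per-object binary masks into a single combined mask."""
    merged = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        merged |= m.astype(bool)
    return merged


def inpaint_with_sdxl(image_path: str, mask: np.ndarray):
    """Sketch only: apply SDXL inpainting to the masked regions.

    Requires `diffusers`, `torch`, `Pillow`, and a GPU; the checkpoint
    id and prompt below are assumptions, not settings from this project.
    """
    import torch
    from diffusers import StableDiffusionXLInpaintPipeline
    from PIL import Image

    pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
        "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
        torch_dtype=torch.float16,
    ).to("cuda")
    image = Image.open(image_path).convert("RGB")
    mask_image = Image.fromarray(mask.astype(np.uint8) * 255)
    return pipe(prompt="a clean desk", image=image, mask_image=mask_image).images[0]
```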
