
Grounded with SAM2

  • We use Grounded SAM2 to automatically detect and segment objects in an image based on a text prompt (e.g., "monitor. keyboard. mouse.").
  • It combines Grounding DINO (for text-based object detection) with SAM2 (for high-quality segmentation), producing pixel-accurate masks for each object.
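As a rough illustration of how the text prompt drives detection, the sketch below parses a Grounding DINO-style prompt (classes separated by periods, e.g. "monitor. keyboard. mouse.") and maps detected phrases back to prompt classes. `parse_prompt` and `match_phrase` are hypothetical helpers for illustration, not part of the Grounded SAM2 API.

```python
def parse_prompt(prompt: str) -> list:
    """Split a Grounding DINO text prompt ("a. b. c.") into class names."""
    return [p.strip() for p in prompt.split(".") if p.strip()]


def match_phrase(phrase: str, classes: list):
    """Map a detected phrase back to the closest prompt class.

    Grounding DINO can return multi-word phrases, so we match by
    substring in either direction; returns None if nothing matches.
    """
    phrase = phrase.lower()
    for cls in classes:
        if cls in phrase or phrase in cls:
            return cls
    return None
```

In the real pipeline, the matched class name is what gets attached to each box before SAM2 turns the box into a pixel-accurate mask.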

Masks are used to:

  • visualize object boundaries
  • generate per-class binary masks
  • apply inpainting models to replace or modify specific regions
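The per-class binary masks mentioned above can be built by OR-ing together the per-instance masks that SAM2 returns. This is a minimal sketch assuming masks come as a stack of boolean arrays with one label per instance; `per_class_masks` is an illustrative helper, not a library function.

```python
import numpy as np


def per_class_masks(masks: np.ndarray, labels: list) -> dict:
    """Combine per-instance masks (N, H, W) into one binary mask per class.

    Instances of the same class are merged with logical OR, so no
    instance is silently dropped.
    """
    out = {}
    for mask, label in zip(masks, labels):
        mask = mask.astype(bool)
        if label in out:
            out[label] |= mask
        else:
            out[label] = mask
    return out
```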

Example Output

Test1

  • Original image: test

  • Grounded SAM2.1 output: Unknown-3

  • Masks: desk, deskmat, laptop, monitor, monitor_desktop, mouse

Test2

  • Original image: test

  • Grounded SAM2.1 bounding box with mask

  • Masks: desk, deskmat, monitor, monitor_desktop, keyboard, mouse, speaker, speaker_speaker

Test3

  • Original image: test3

  • Grounded SAM2.1 output: Unknown-3

  • Masks: desk, deskmat, monitor, monitor_desktop, laptop, keyboard, mouse, image, speaker, speaker_speaker

Test4

  • Original image: test4

  • Grounded SAM2.1 output: Unknown-4

  • Masks: desk, monitor, desktop, monitor_desktop, laptop_desktop, keyboard, mouse, speaker, speaker_speaker

Issues

  • When saving masks, objects with the same class name overwrite each other, so only the last instance is preserved
  • In some cases (e.g., Test1), unintended objects such as another person's mouse may be detected
  • When the label confidence is low or ambiguous, a single object may be split into multiple segments (e.g., "monitor_desktop" detected as two parts)
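The same-class overwrite issue can be avoided by appending a per-class instance index to each output filename. A minimal sketch, assuming one label per detected instance; `unique_mask_names` is a hypothetical helper, not part of the pipeline code.

```python
from collections import defaultdict


def unique_mask_names(labels: list) -> list:
    """Give each instance a distinct filename (mouse_0.png, mouse_1.png, ...)
    so same-class masks no longer overwrite each other on save."""
    counts = defaultdict(int)
    names = []
    for label in labels:
        names.append(f"{label}_{counts[label]}.png")
        counts[label] += 1
    return names
```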

Next Steps

  • Merge all masks into a single combined mask
  • Apply SDXL inpainting using the original image and the merged mask
  • Fine-tune the SDXL inpainting model for better domain-specific results
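The first two steps above can be sketched as follows. `merge_masks` OR-combines all binary masks into one; `inpaint_with_sdxl` is an unexecuted outline of the SDXL inpainting call using the `diffusers` library, with an assumed checkpoint name and placeholder prompt (the fine-tuned model from the last step would be swapped in here).

```python
import numpy as np


def merge_masks(masks: list) -> np.ndarray:
    """Merge all per-object binary masks into a single combined mask."""
    merged = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        merged |= m.astype(bool)
    return merged


def inpaint_with_sdxl(image_path: str, mask: np.ndarray):
    """Sketch only: apply SDXL inpainting to the masked regions.

    Requires `diffusers`, `torch`, `Pillow`, and a GPU; the checkpoint
    id and prompt below are assumptions, not settings from this project.
    """
    import torch
    from diffusers import StableDiffusionXLInpaintPipeline
    from PIL import Image

    pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
        "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
        torch_dtype=torch.float16,
    ).to("cuda")
    image = Image.open(image_path).convert("RGB")
    mask_image = Image.fromarray(mask.astype(np.uint8) * 255)
    return pipe(prompt="a clean desk", image=image, mask_image=mask_image).images[0]
```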
