Grounded with SAM2
- We use Grounded SAM2 to automatically detect and segment objects in an image based on a text prompt (e.g., "monitor. keyboard. mouse.").
- It combines Grounding DINO (for text-based object detection) with SAM2 (for high-quality segmentation), producing pixel-accurate masks for each object; a minimal pipeline sketch follows the list below.
- The masks are used to:
  - visualize object boundaries
  - generate per-class binary masks
  - apply inpainting models to replace or modify specific regions
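The sketch below shows one way to wire the two stages together, assuming the Hugging Face `transformers` Grounding DINO checkpoint and the `sam2` package from facebookresearch/sam2. The model IDs, thresholds, and input path are illustrative placeholders, not the exact configuration used on this page.

```python
# Minimal Grounded SAM2 sketch: Grounding DINO proposes boxes from the text
# prompt, then SAM2 turns each box into a pixel-accurate mask.
# Assumes `transformers`, `torch`, `pillow`, and the `sam2` package from
# facebookresearch/sam2 are installed; model IDs, thresholds, and the input
# path are illustrative.
import numpy as np
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from sam2.sam2_image_predictor import SAM2ImagePredictor

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Text-prompted object detection with Grounding DINO.
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
detector = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-tiny"
).to(device)

image = Image.open("desk.jpg").convert("RGB")   # placeholder path
prompt = "monitor. keyboard. mouse."            # one class per period

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = detector(**inputs)

detections = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    threshold=0.35,       # box confidence ("box_threshold" in older releases)
    text_threshold=0.25,  # label confidence
    target_sizes=[image.size[::-1]],
)[0]  # dict with "boxes", "scores", and the matched label phrases

# 2) Box-prompted segmentation with SAM2.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(
    box=detections["boxes"].cpu().numpy(),  # (N, 4) boxes in xyxy format
    multimask_output=False,
)
if masks.ndim == 4:          # (N, 1, H, W) -> (N, H, W) when N > 1
    masks = masks.squeeze(1)
# masks[i] is a binary mask aligned with the i-th detected box and label.
```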
Example Output
Test1
- Original Image
- Grounded SAM2.1
- Mask
(per-class mask images: desk, deskmat, laptop, monitor, mouse)
Test2
- Original Image
- Grounded SAM2.1
- Mask
(per-class mask images: desk, deskmat, monitor, keyboard, mouse, speaker)
Test3
- Original Image
- Grounded SAM2.1
- Mask
(per-class mask images: desk, deskmat, monitor, laptop, keyboard, mouse, speaker)
Test4
- Original Image
- Grounded SAM2.1
- Mask
(per-class mask images: desk, monitor, desktop, keyboard, mouse, speaker)
Issues
- When saving masks, objects with the same class name overwrite each other, so only the last instance is preserved (a per-instance naming fix is sketched after this list).
- In some cases (e.g., Test1), unintended objects such as the other person's mouse are detected.
- When the label confidence is low or ambiguous, a single object may be split into multiple segments (e.g., "desktop_monitor" detected as two parts).
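One way to address the overwrite issue is to key each saved file by class name plus a per-class instance index rather than by class name alone. A minimal sketch, assuming `masks` and `labels` come out of a step like the pipeline above; the function name and output directory are illustrative.

```python
# Sketch: avoid overwriting masks that share a class name by appending a
# per-class instance index to the filename (mouse_0.png, mouse_1.png, ...).
# `masks` is an (N, H, W) binary array and `labels` the N matching class
# names; both variable names and the output directory are illustrative.
from collections import defaultdict
from pathlib import Path

import numpy as np
from PIL import Image

def save_instance_masks(masks, labels, out_dir="masks"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    counts = defaultdict(int)  # instances seen so far, per class
    for mask, label in zip(masks, labels):
        idx = counts[label]
        counts[label] += 1
        # Binary mask -> 8-bit grayscale (0 = background, 255 = object).
        Image.fromarray(mask.astype(np.uint8) * 255).save(
            out / f"{label}_{idx}.png"
        )
```

If per-class rather than per-instance masks are actually wanted, the alternative is to OR all masks of the same class together before saving.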
Next Steps
- Merge all masks into a single combined mask.
- Apply SDXL inpainting using the original image and the merged mask (a sketch of these two steps follows this list).
- Fine-tune the SDXL inpainting model for better domain-specific results.
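A sketch of the first two steps, assuming the `diffusers` SDXL inpainting checkpoint: the merged mask is a pixel-wise OR of all instance masks, and the prompt, checkpoint, and paths are placeholders rather than the settings that will be used here.

```python
# Sketch: OR all instance masks into one combined mask, then hand the
# original image plus that mask to an SDXL inpainting pipeline.
# Requires `diffusers`; checkpoint, prompt, and sizes are illustrative.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionXLInpaintPipeline

def merge_and_inpaint(image: Image.Image, masks: np.ndarray, prompt: str):
    # 1) Merge: a pixel is inpainted if any instance mask covers it.
    merged = np.any(masks.astype(bool), axis=0)
    mask_image = Image.fromarray(merged.astype(np.uint8) * 255)

    # 2) SDXL inpainting over the merged region.
    pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
        "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
        torch_dtype=torch.float16,
    ).to("cuda")
    return pipe(
        prompt=prompt,
        image=image.resize((1024, 1024)),      # SDXL's native resolution
        mask_image=mask_image.resize((1024, 1024)),
        strength=0.99,                         # repaint the region almost fully
    ).images[0]

# Usage (placeholder paths and prompt):
# result = merge_and_inpaint(Image.open("desk.jpg").convert("RGB"),
#                            masks, "a clean, empty desk")
# result.save("inpainted.png")
```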