Mask R-CNN

🧠 Semantic & Instance Segmentation with Mask R-CNN

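Mask R-CNN extends Faster R-CNN with a mask branch that predicts a pixel-level mask for each detected object. It is an instance segmentation model; a semantic-style class map can be derived from its output by merging the instance masks, although pixels outside any detection stay unlabeled. The two tasks differ as follows:

Task	What It Does
📌 Semantic Segmentation	Labels each pixel with a class name, without separating objects of the same class
👥 Instance Segmentation	Gives each detected object its own mask, so objects of the same class are distinguished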
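
To make the relationship between the two tasks concrete, here is a minimal NumPy sketch that merges per-instance masks into a single semantic-style class map. The function name `instances_to_semantic` and the toy arrays are illustrative assumptions, not part of torchvision; the arrays mimic the `[N, H, W]` masks, labels and scores that a detector returns.

```python
import numpy as np

def instances_to_semantic(masks, labels, scores, mask_threshold=0.5):
    """Merge per-instance soft masks [N, H, W] into one [H, W] class-ID map."""
    height, width = masks.shape[1:]
    semantic = np.zeros((height, width), dtype=np.int64)  # 0 = background
    # Paint low-scoring instances first so higher-scoring ones win overlaps.
    for idx in np.argsort(scores):
        semantic[masks[idx] >= mask_threshold] = labels[idx]
    return semantic

# Toy data: two instances of class 1 and one instance of class 18.
masks = np.zeros((3, 4, 6), dtype=np.float32)
masks[0, :2, :2] = 0.9
masks[1, 2:, :2] = 0.8
masks[2, :, 4:] = 0.7
labels = np.array([1, 1, 18])
scores = np.array([0.95, 0.90, 0.60])

semantic = instances_to_semantic(masks, labels, scores)
print(semantic)
```

The two class-1 instances collapse into one region in the merged map, which is exactly the information that instance segmentation keeps and semantic segmentation discards.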

🧩 How Mask R-CNN Works

Backbone CNN (e.g. ResNet + FPN) extracts multi-scale features.
Region Proposal Network (RPN) suggests candidate object regions.
Per-Region Heads
- Classification: predicts object class
- Bounding-Box Regression: refines box coordinates
- Mask Prediction: outputs a pixel-wise mask for each instance

✅ Outputs bounding boxes, class labels and pixel masks for every detected object.

🔑 Why Use Mask R-CNN?

Feature	Benefit
🔍 Instance-aware	Separates individual objects of the same class
🌈 Pixel-level	Labels the pixels of each detected object, not just its bounding box
✨ High accuracy	Built on top of Faster R-CNN for precise localization
🐍 PyTorch Built-In	Available via `torchvision.models.detection`

🛠️ Implementation: Mask R-CNN in PyTorch

We use torchvision's built-in maskrcnn_resnet50_fpn, pretrained on the COCO dataset (80 object categories; the model's label IDs run from 0 for background up to 90, with some IDs unused).

✅ Required Packages

pip install torch torchvision matplotlib opencv-python

🐍 Full Code Example

# Make sure you have these installed:
# pip install torch torchvision matplotlib opencv-python

import os
import torch
from torchvision import models, transforms
import cv2
import numpy as np
import matplotlib.pyplot as plt

# COCO category names, indexed by the label IDs the model returns (0 is background).
# torchvision's COCO detection models use the original COCO IDs (1-90), which
# have gaps, so unused IDs are filled with 'N/A' to keep the indices aligned.
COCO_CLASSES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
    'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack',
    'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
    'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
    'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass',
    'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich',
    'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake',
    'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table', 'N/A',
    'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
    'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator',
    'N/A', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair dryer',
    'toothbrush'
]

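# 1) Load the pretrained Mask R-CNN (COCO weights)
# torchvision >= 0.13 uses `weights=`; older releases need pretrained=True instead
model = models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()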

# 2) Path to your image
IMAGE_PATH = "image2.jpg"  # ← replace with your filename
if not os.path.isfile(IMAGE_PATH):
    raise FileNotFoundError(f"Cannot find {IMAGE_PATH}")

# 3) Read & prep the image
img_bgr = cv2.imread(IMAGE_PATH)
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
transform = transforms.Compose([transforms.ToTensor()])
input_tensor = transform(img_rgb).unsqueeze(0)

# 4) Run inference
with torch.no_grad():
    outputs = model(input_tensor)[0]

# 5) Filter predictions by score threshold
threshold = 0.5
boxes  = outputs['boxes'].cpu().numpy()
labels = outputs['labels'].cpu().numpy()
scores = outputs['scores'].cpu().numpy()
masks  = outputs['masks'].cpu().numpy()[:, 0]  # [N, H, W]

keep = scores >= threshold
boxes  = boxes[keep]
labels = labels[keep]
scores = scores[keep]
masks  = masks[keep]

# 6) Overlay masks, boxes & labels (with larger font)
overlay = img_rgb.copy()
for box, lbl, score, mask in zip(boxes, labels, scores, masks):
    x1, y1, x2, y2 = (int(v) for v in box)  # plain Python ints for OpenCV

    # safe lookup of class name
    if 0 <= lbl < len(COCO_CLASSES):
        cls_name = COCO_CLASSES[lbl]
    else:
        cls_name = f"ID {lbl}"

    colour = tuple(int(c) for c in np.random.randint(0, 256, size=3))

    # draw mask: blend the colour into the masked pixels only, so the rest
    # of the image is not darkened once per detected object
    mask_bin = (mask >= 0.5)
    blended = 0.7 * overlay[mask_bin] + 0.3 * np.array(colour)
    overlay[mask_bin] = blended.astype(np.uint8)

    # draw box
    w, h = x2 - x1, y2 - y1
    rect = (x1, y1, w, h)
    cv2.rectangle(overlay, rect, colour, 3)

    # draw label with bigger font
    cv2.putText(
        overlay,
        f"{cls_name}: {score:.2f}",
        (x1, max(y1 - 10, 20)),      # ensure text isn't off-image
        cv2.FONT_HERSHEY_SIMPLEX,
        1.0,      # fontScale
        colour,
        3         # thickness
    )

# 7) Plot into a Matplotlib figure
fig, ax = plt.subplots(figsize=(10, 8))
ax.imshow(overlay)
ax.axis('off')
ax.set_title("Mask R-CNN Instance Segmentation", fontsize=18)

# make room at the top for the title
fig.subplots_adjust(top=0.90)

# 8) Save the figure as a one-page PDF with extra padding
output_pdf = "output.pdf"
fig.savefig(
    output_pdf,
    format='pdf',
    bbox_inches='tight',
    pad_inches=0.3  # adds a border so nothing is clipped
)
plt.close(fig)

print(f"✅ Segmented image (with larger labels) written to {output_pdf}")

📊 Summary of the Above Full Code

Concept	Description
Model	`maskrcnn_resnet50_fpn` (pretrained on COCO)
Framework	PyTorch + TorchVision
Tasks	Bounding boxes, class labels, pixel-wise instance masks
Dataset	COCO (80 object categories such as person, car, and cat; label IDs 0 to 90 with gaps)
Output Format	Image with overlaid masks, bounding boxes, and labels
Use Cases	Medical imaging, autonomous vehicles, robotics, video analysis
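
For downstream use, the per-instance outputs are often easier to handle as records than as an overlay image. The following is a minimal sketch under that assumption: `summarise_detections` is a hypothetical helper, and the toy arrays stand in for the filtered `boxes`, `labels`, `scores` and `masks` arrays that a script like the one above produces.

```python
import numpy as np

def summarise_detections(boxes, labels, scores, masks, mask_threshold=0.5):
    """Return one dict per instance: label ID, score, box area, mask area."""
    rows = []
    for box, label, score, mask in zip(boxes, labels, scores, masks):
        x1, y1, x2, y2 = box
        rows.append({
            "label_id": int(label),
            "score": float(score),
            "box_area": float((x2 - x1) * (y2 - y1)),
            # Number of pixels the binarised mask covers.
            "mask_pixels": int((mask >= mask_threshold).sum()),
        })
    return rows

# Toy stand-ins: one detection whose mask fills its 4 x 3 box.
boxes = np.array([[0.0, 0.0, 4.0, 3.0]])
labels = np.array([1])
scores = np.array([0.98])
masks = np.zeros((1, 5, 5), dtype=np.float32)
masks[0, :3, :4] = 0.9

rows = summarise_detections(boxes, labels, scores, masks)
for row in rows:
    print(row)
```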

🖼️ Output Example

Mask R-CNN Instance Segmentation

The result shows:

🎨 Colored masks for each object
🟦 Bounding boxes drawn in the same random colour as each object's mask
🏷️ Bold class names and confidence scores (e.g., person: 0.98)

💾 Output Saved As

output.pdf

⚠️ Observations & Misclassifications in Mask R-CNN Output

During testing, the pretrained Mask R-CNN model (maskrcnn_resnet50_fpn) was applied to two real-world images using PyTorch and TorchVision. While it successfully generated masks and bounding boxes, it made some notable classification errors.

🧪 Examples from Output

Image	Detected Class	Actual Object	Confidence
`image1.jpg`	horse	dog	0.60+
`image2.jpg`	bottle	girl/person	0.65

📌 Note: The girl was incorrectly labeled as a bottle, and the dog was misclassified as a horse.

❓ Why Did This Happen?

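Errors like these can occur when a pretrained model is used without fine-tuning, but they can also come from the code that turns label IDs into names. Possible causes:

Cause	Explanation
🔢 Label-map mismatch	torchvision's COCO detection models output the original COCO category IDs (1 to 90, with gaps). Indexing a contiguous list of the 80 names with those IDs shifts every class after "fire hydrant", so ID 18 (dog) is displayed as "horse". Rule this out before blaming the model.
🎨 Model is not fine-tuned	The model is trained on a generic dataset (COCO), not on your specific images or domain. Without task-specific training, accuracy suffers.
🤔 Ambiguous context	The dog may resemble a horse in size, colour, or pose. Similarly, background clutter or overlapping bounding boxes may have led the model to label part of the person as a bottle.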
🏷️ Semantic mislabeling	The model may detect a region but assign the wrong class due to lack of context. For example, if the shape of a bottle overlaps with a hand, misclassification can occur.
🍀 Semantic annotation challenge	Training Mask R-CNN properly requires pixel-level ground-truth masks (not just bounding boxes). Each pixel must be labeled with its correct class. Without high-quality annotated data, the model can’t learn precise boundaries.
🖼️ Lighting & posture issues	Shadows, blur, or unusual object poses (e.g. person crouching) can confuse detection and classification networks.
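
One cause worth checking before any of the model-related ones is the ID-to-name lookup itself. torchvision's COCO detection models return the original COCO category IDs, which skip some numbers, so a class list without placeholders drifts out of alignment. The sketch below compares the two lookups; the names `COCO_BY_ID`, `COCO_CONTIGUOUS` and `compare_lookup` are illustrative. Whether this explains the results reported above is an assumption, since it depends on which class list was in use when the images were tested.

```python
# ID-aligned names: 91 slots, with 'N/A' marking COCO IDs that are unused.
COCO_BY_ID = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
    'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack',
    'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
    'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
    'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass',
    'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich',
    'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake',
    'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table', 'N/A',
    'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
    'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator',
    'N/A', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair dryer',
    'toothbrush',
]

# The same names with the gaps removed: background plus 80 categories.
COCO_CONTIGUOUS = [name for name in COCO_BY_ID if name != 'N/A']

def compare_lookup(label_id):
    """Return (name from the ID-aligned list, name from the contiguous list)."""
    return COCO_BY_ID[label_id], COCO_CONTIGUOUS[label_id]

for label_id in (1, 11, 18, 40):
    aligned, shifted = compare_lookup(label_id)
    print(f"ID {label_id}: aligned={aligned!r}, contiguous={shifted!r}")
```

The two lists agree up to ID 11. After the first gap they diverge: ID 18 reads "dog" in the aligned list but "horse" in the contiguous one, and ID 40 reads "baseball glove" versus "bottle".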

✅ What You Can Do to Improve Accuracy

Solution	Benefit
🎨 Fine-tune on your own annotated dataset	Learns task-specific object classes and appearances
🖼️ Add more diverse training samples	Improves robustness to different backgrounds and angles
🏷️ Use semantic segmentation tools	Annotate training images with pixel-wise masks
🔍 Increase confidence threshold	Filters out low-confidence, noisy predictions
🚀 Try a stronger model such as `maskrcnn_resnet50_fpn_v2`	Can improve accuracy through a revised architecture and training recipe

📝 Summary

The pretrained Mask R-CNN model performs well at generating object masks, but classification accuracy can be weak on unseen or unusual images.

“girl” is not a COCO class, but a girl should still be detected as “person”. A “bottle” label therefore points to a failure of context understanding, or of the ID-to-name lookup, rather than to a missing class.

✅ This highlights the need for:

Fine-tuning on task-specific datasets
Pixel-accurate annotations (semantic masks)
Testing and threshold adjustments before deploying in real-world applications
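
Threshold adjustment is easy to test offline before deployment. This minimal sketch counts how many detections survive at several score thresholds; `sweep_thresholds` is a hypothetical helper and the scores are made-up examples, not results from the images discussed above.

```python
import numpy as np

def sweep_thresholds(scores, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Map each threshold to the number of detections with score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    return {t: int((scores >= t).sum()) for t in thresholds}

# Illustrative confidence scores for one image.
example_scores = [0.98, 0.91, 0.65, 0.60, 0.42, 0.31]

counts = sweep_thresholds(example_scores)
for threshold, kept in counts.items():
    print(f"threshold {threshold:.1f}: {kept} detections kept")
```

A higher threshold hides low-confidence predictions but does not correct them, so it complements fine-tuning and label checks and does not replace them.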