Mask R‐CNN - iffatAGheyas/computer-vision-handbook GitHub Wiki
🧠 Semantic & Instance Segmentation with Mask R-CNN
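Mask R-CNN extends Faster R-CNN with a mask branch that predicts a pixel-level mask for every detected object. Its native output is instance segmentation; a semantic map can be derived by merging the instance masks per class. The two tasks differ as follows: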
Task | What It Does |
---|---|
📌 Semantic Segmentation | Labels each pixel with a class name |
👥 Instance Segmentation | Distinguishes individual object instances |
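To make the difference between the two tasks concrete, here is a minimal NumPy sketch that collapses per-instance masks into a single semantic map. The function name `instances_to_semantic` and the toy arrays are assumptions made up for illustration; they are not part of TorchVision.

```python
import numpy as np

def instances_to_semantic(masks, labels, scores):
    """Merge boolean instance masks [N, H, W] into one class-ID map [H, W]."""
    semantic = np.zeros(masks.shape[1:], dtype=np.int64)  # 0 = background
    # Paint low-score instances first so higher-score ones win any overlap.
    for i in np.argsort(scores):
        semantic[masks[i]] = labels[i]
    return semantic

# Two hypothetical 'person' instances (class ID 1) that touch each other.
masks = np.zeros((2, 4, 6), dtype=bool)
masks[0, 1:3, 0:3] = True
masks[1, 1:3, 2:5] = True
labels = np.array([1, 1])
scores = np.array([0.9, 0.6])

semantic = instances_to_semantic(masks, labels, scores)
print(masks.shape[0], "instances ->", np.unique(semantic), "class IDs")
```

The instance view keeps two separate masks, while the semantic map only records that ten pixels belong to class 1. The identity of each individual person is lost in the merge.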
🧩 How Mask R-CNN Works
- Backbone CNN (e.g. ResNet + FPN) extracts multi-scale features.
- Region Proposal Network (RPN) suggests candidate object regions.
- Per-Region Heads
- Classification: predicts object class
- Bounding-Box Regression: refines box coordinates
- Mask Prediction: outputs a pixel-wise mask for each instance
✅ Outputs bounding boxes, class labels and pixel masks for every detected object.
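The mask head predicts a small, fixed-resolution soft mask per region, which is then resized to the box and thresholded. TorchVision does this internally and returns full-size masks, so you normally never write this step yourself. The sketch below only illustrates the idea under simplifying assumptions: `paste_mask` is a hypothetical helper, the 4×4 mask stands in for the real head output, and nearest-neighbour resizing replaces the smoother interpolation a real implementation would use.

```python
import numpy as np

def paste_mask(roi_mask, box, image_hw, threshold=0.5):
    """Resize a small soft mask to its box and place it on a full-size canvas."""
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    # Nearest-neighbour resize: map each output pixel back to a source cell.
    rows = np.arange(h) * roi_mask.shape[0] // h
    cols = np.arange(w) * roi_mask.shape[1] // w
    resized = roi_mask[rows[:, None], cols[None, :]]
    canvas = np.zeros(image_hw, dtype=bool)
    canvas[y1:y2, x1:x2] = resized >= threshold  # binarise at 0.5
    return canvas

# Hypothetical head output: confident only in the left half of the region.
roi_mask = np.zeros((4, 4), dtype=np.float32)
roi_mask[:, :2] = 0.9

full = paste_mask(roi_mask, box=(2, 1, 10, 5), image_hw=(8, 12))
print(full.astype(int))
```

Only the left half of the 8×4 box ends up set in the full-size mask, and everything outside the box stays background.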
🔑 Why Use Mask R-CNN?
Feature | Benefit |
---|---|
🔍 Instance-aware | Separates individual objects of the same class |
🌈 Pixel-level | Labels the pixels of each detected object, not just its bounding box |
✨ High accuracy | Built on top of Faster R-CNN for precise localization |
🐍 PyTorch Built-In | Available via torchvision.models |
🛠️ Implementation: Mask R-CNN in PyTorch
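We use TorchVision’s built-in maskrcnn_resnet50_fpn, pretrained on the COCO dataset (80 object categories; label IDs run from 0 to 90 because the model keeps COCO's original, non-contiguous category IDs).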
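Because the label IDs are non-contiguous, the list used to turn an ID into a name needs placeholder entries for the unused IDs. The sketch below shows the first 20 slots only; the names `NAMES_WITH_GAPS`, `NAMES_CONTIGUOUS` and `lookup` are made up for this illustration.

```python
# First 20 slots of the 91-slot list; 'N/A' marks an ID COCO never assigned.
NAMES_WITH_GAPS = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
    'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
]
# The same names with the placeholder dropped, as in a contiguous class list.
NAMES_CONTIGUOUS = [name for name in NAMES_WITH_GAPS if name != 'N/A']

def lookup(names, label_id):
    """Return the class name for a label ID, or a fallback string."""
    return names[label_id] if 0 <= label_id < len(names) else f"ID {label_id}"

model_label = 18  # the ID the detector returns for a dog
with_gaps = lookup(NAMES_WITH_GAPS, model_label)
contiguous = lookup(NAMES_CONTIGUOUS, model_label)
print(with_gaps, "vs", contiguous)  # dog vs horse
```

Every ID above 11 is shifted by at least one when the placeholders are missing. Recent TorchVision releases (0.13 and later) also expose the full list as `MaskRCNN_ResNet50_FPN_Weights.DEFAULT.meta["categories"]`, which avoids maintaining it by hand.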
✅ Required Packages
pip install torch torchvision matplotlib opencv-python
🐍 Full Code Example
# Make sure you have these installed:
# pip install torch torchvision matplotlib opencv-python
import os
import torch
from torchvision import models, transforms
import cv2
import numpy as np
import matplotlib.pyplot as plt
# COCO category names indexed by the label IDs the model returns
# (0 is background; 'N/A' marks IDs that COCO never assigned)
COCO_CLASSES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
    'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
    'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
    'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack',
    'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
    'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
    'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass',
    'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich',
    'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake',
    'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table', 'N/A',
    'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
    'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator',
    'N/A', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair dryer',
    'toothbrush'
]
# 1) Load the pretrained Mask R-CNN
# weights="DEFAULT" replaces pretrained=True, which is deprecated since torchvision 0.13
model = models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()
# 2) Path to your image
IMAGE_PATH = "image2.jpg" # ← replace with your filename
if not os.path.isfile(IMAGE_PATH):
    raise FileNotFoundError(f"Cannot find {IMAGE_PATH}")
# 3) Read & prep the image
img_bgr = cv2.imread(IMAGE_PATH)
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
transform = transforms.Compose([transforms.ToTensor()])
input_tensor = transform(img_rgb).unsqueeze(0)
# 4) Run inference
with torch.no_grad():
    outputs = model(input_tensor)[0]
# 5) Filter predictions by score threshold
threshold = 0.5
boxes = outputs['boxes'].cpu().numpy()
labels = outputs['labels'].cpu().numpy()
scores = outputs['scores'].cpu().numpy()
masks = outputs['masks'].cpu().numpy()[:, 0] # [N, H, W]
keep = scores >= threshold
boxes = boxes[keep]
labels = labels[keep]
scores = scores[keep]
masks = masks[keep]
# 6) Overlay masks, boxes & labels (with larger font)
overlay = img_rgb.copy()
for box, lbl, score, mask in zip(boxes, labels, scores, masks):
    # plain Python ints are the safest input for OpenCV drawing calls
    x1, y1, x2, y2 = (int(v) for v in box)
    # safe lookup of class name
    if 0 <= lbl < len(COCO_CLASSES):
        cls_name = COCO_CLASSES[lbl]
    else:
        cls_name = f"ID {lbl}"
    colour = tuple(int(c) for c in np.random.randint(0, 256, size=3))
    # draw mask: tint only the pixels inside it, so the rest of the image is not darkened
    mask_bin = (mask >= 0.5)
    tinted = 0.7 * overlay[mask_bin] + 0.3 * np.array(colour)
    overlay[mask_bin] = tinted.astype(np.uint8)
    # draw box
    cv2.rectangle(overlay, (x1, y1), (x2, y2), colour, 3)
    # draw label with bigger font
    cv2.putText(
        overlay,
        f"{cls_name}: {score:.2f}",
        (x1, max(y1 - 10, 20)),  # ensure text isn't off-image
        cv2.FONT_HERSHEY_SIMPLEX,
        1.0,  # fontScale
        colour,
        3  # thickness
    )
# 7) Plot into a Matplotlib figure
fig, ax = plt.subplots(figsize=(10, 8))
ax.imshow(overlay)
ax.axis('off')
ax.set_title("Mask R-CNN Instance Segmentation", fontsize=18)
# make room at the top for the title
fig.subplots_adjust(top=0.90)
# 8) Save the figure as a one-page PDF with extra padding
output_pdf = "output.pdf"
fig.savefig(
    output_pdf,
    format='pdf',
    bbox_inches='tight',
    pad_inches=0.3  # adds a border so nothing is clipped
)
plt.close(fig)
print(f"✅ Segmented image (with larger labels) written to {output_pdf}")
📊 Summary of the Above Full Code
Concept | Description |
---|---|
Model | maskrcnn_resnet50_fpn (pretrained on COCO) |
Framework | PyTorch + TorchVision |
Tasks | Bounding boxes, class labels, pixel-wise instance masks |
Dataset | COCO (80 object categories such as person, car and cat; label IDs span 0 to 90) |
Output Format | Image with overlaid masks, bounding boxes, and labels |
Use Cases | Medical imaging, autonomous vehicles, robotics, video analysis |
🖼️ Output Example
Mask R-CNN Instance Segmentation
The result shows:
- 🎨 Colored masks for each object
- 🟦 Bounding boxes drawn in the same random colour as each object's mask
- 🏷️ Bold class names and confidence scores (e.g., person: 0.98)
💾 Output Saved As
output.pdf
⚠️ Observations & Misclassifications in Mask R-CNN Output
During testing, the pretrained Mask R-CNN model (maskrcnn_resnet50_fpn) was applied to two real-world images using PyTorch and TorchVision. While it successfully generated masks and bounding boxes, it made some notable classification errors.
🧪 Examples from Output
Image | Detected Class | Actual Object | Confidence |
---|---|---|---|
image1.jpg | horse | dog | 0.60+ |
image2.jpg | bottle | girl/person | 0.65 |
📌 Note: The girl was incorrectly labeled as a bottle, and the dog was misclassified as a horse.
❓ Why Did This Happen?
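These errors are common when using pretrained models without fine-tuning. Before blaming the model, though, rule out a label-mapping offset: TorchVision's COCO detectors return label IDs from a 91-slot list that contains unused 'N/A' entries, and looking those IDs up in a contiguous 80-name list shifts every ID above 11, so a correct dog prediction (ID 18) is displayed as horse. Beyond that, here’s why the model itself can go wrong: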
Cause | Explanation |
---|---|
🎨 Model is not fine-tuned | The model is trained on a generic dataset (COCO), not on your specific images or domain. Without task-specific training, accuracy suffers. |
🤔 Ambiguous context | The dog might resemble a horse due to size, colour, or pose. Similarly, the model may have mistaken a person for a bottle because of background clutter or bounding-box overlap. |
🏷️ Semantic mislabeling | The model may detect a region but assign the wrong class due to lack of context. For example, if the shape of a bottle overlaps with a hand, misclassification can occur. |
🍀 Semantic annotation challenge | Training Mask R-CNN properly requires pixel-level ground-truth masks (not just bounding boxes). Each pixel must be labeled with its correct class. Without high-quality annotated data, the model can’t learn precise boundaries. |
🖼️ Lighting & posture issues | Shadows, blur, or unusual object poses (e.g. person crouching) can confuse detection and classification networks. |
✅ What You Can Do to Improve Accuracy
Solution | Benefit |
---|---|
🎨 Fine-tune on your own annotated dataset | Learns task-specific object classes and appearances |
🖼️ Add more diverse training samples | Improves robustness to different backgrounds and angles |
🏷️ Use semantic segmentation tools | Annotate training images with pixel-wise masks |
🔍 Increase confidence threshold | Filters out low-confidence, noisy predictions |
🚀 Try upgraded models like maskrcnn_resnet50_fpn_v2, or a Mask R-CNN with a stronger backbone (e.g. Swin) from a third-party library | Improves accuracy with stronger feature extractors |
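To see what raising the confidence threshold does in practice, here is a small sketch that counts how many detections survive at several cut-offs. The scores are hypothetical and `count_kept` is a helper invented for this example.

```python
def count_kept(scores, thresholds):
    """Return {threshold: number of detections with score >= threshold}."""
    return {t: sum(s >= t for s in scores) for t in thresholds}

# Hypothetical confidence scores for one image.
scores = [0.98, 0.91, 0.65, 0.60, 0.52, 0.31]

kept = count_kept(scores, [0.5, 0.7, 0.9])
for threshold, n in kept.items():
    print(f"threshold {threshold:.1f}: {n} detections kept")
```

With these scores, moving the threshold from 0.5 to 0.7 drops the detections around 0.60 to 0.65, which is the range where the misclassifications above were reported. The trade-off is that correct but low-scoring detections are dropped as well, so it is worth checking the effect on a few representative images first.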
📝 Summary
The pretrained Mask R-CNN model performs well at generating object masks, but classification accuracy can be weak on unseen or unusual images.
Even though “girl” is not a class in COCO, she should be detected as “person”. When the label is instead predicted as “bottle”, it highlights a failure of context understanding—not necessarily a missing class.
✅ This highlights the need for:
- Fine-tuning on task-specific datasets
- Pixel-accurate annotations (semantic masks)
- Testing and threshold adjustments before deploying in real-world applications
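For the testing step, a common way to compare a predicted mask with a pixel-accurate annotation is intersection over union (IoU). The sketch below is one minimal NumPy version; the function name `mask_iou`, the toy masks, and the choice to return 1.0 when both masks are empty are assumptions made for illustration.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over union of two binary masks with the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, target).sum() / union)

# Hypothetical 6x6 example: two 4x4 squares shifted by two columns.
pred = np.zeros((6, 6), dtype=bool)
target = np.zeros((6, 6), dtype=bool)
pred[0:4, 0:4] = True
target[0:4, 2:6] = True

iou = mask_iou(pred, target)
print(f"mask IoU = {iou:.3f}")
```

Here 8 pixels overlap out of a union of 24, so the IoU is one third. Tracking this value per class on a held-out set gives a simple signal for whether fine-tuning or a threshold change actually helped.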