
🧠 Zero-Shot Image Classification with CLIP

Module 8: Deep Learning in Computer Vision


📌 What Is Zero-Shot Classification?

Zero-shot classification refers to the ability of a model to classify new categories it has never seen during training, simply by providing text descriptions of those classes.

💡 Example: Without seeing a single labeled image of a "dog," the model can still classify an image correctly just by being told to match it with "a photo of a dog."

This is made possible by models like CLIP (Contrastive Language-Image Pre-training) from OpenAI, which learns to associate images and text in a shared embedding space.


🤖 How CLIP Works

  • CLIP is trained on 400 million image-text pairs from the internet.
  • It learns to match images to natural language phrases.
  • At inference, we give it:
    • An image
    • A list of text prompts (e.g., "a photo of a dog", "a photo of a person")
  • CLIP then chooses the text prompt that best matches the image, with no further training needed (see the minimal sketch below).
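
The snippet below is a minimal sketch of that matching step using the standard CLIPModel/CLIPProcessor API from Hugging Face transformers; it is the same idea the full project code implements by hand further down. The image path sample.jpg is a placeholder, and downloading the checkpoint requires internet access.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model     = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image   = Image.open("sample.jpg").convert("RGB")   # placeholder test image
prompts = ["a photo of a dog", "a photo of a person"]

# one joint forward pass encodes the image and all text prompts
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# image-to-text similarity scores, turned into probabilities over the prompts
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print({p: round(float(s), 3) for p, s in zip(prompts, probs)})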

🧪 Project: Dog vs Person - Zero-Shot Classification

We built a simple zero-shot classifier using CLIP to distinguish between Dog and Person, without providing a single training image.

  • ✅ Text prompts: "a photo of a dog", "a photo of a person"
  • ✅ Test directory layout (a quick sanity check is sketched just after the tree):
testset/
├── Dog/
└── Person/
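
As a quick sanity check (not part of the original write-up), the snippet below confirms that the two folders exist and counts the test images per class; the folder names and file extensions match those used in the full code further down.

import os

TEST_DIR = "testset"
for cls in ["Dog", "Person"]:
    folder = os.path.join(TEST_DIR, cls)
    images = [f for f in os.listdir(folder)
              if f.lower().endswith((".jpg", ".jpeg", ".png"))]
    print(f"{cls}: {len(images)} test images")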

💻 Full Code

Below is the full code used for this project. It:

  • Loads CLIP from Hugging Face 🤗
  • Encodes image and text
  • Runs classification on all test images
  • Saves output predictions in a PDF report
# ─────────────────────────────────────────────────────────────────────────────
# Zero-Shot Classification with CLIP (robust HF Hub settings)
# ─────────────────────────────────────────────────────────────────────────────

# 1) INSTALL DEPENDENCIES (only need to run once)
!pip install torch torchvision transformers pillow matplotlib --quiet

# 2) TUNE HF-HUB ENV VARS TO AVOID TIMEOUTS / SYMLINK MESSAGES
import os
os.environ["HF_HUB_DOWNLOAD_RETRIES"]    = "10"   # try up to 10 times
os.environ["HF_HUB_REQUEST_TIMEOUT"]     = "60"   # 60s per request
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

# 3) IMPORT EVERYTHING
import os
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# ── CONFIG ───────────────────────────────────────────────────────────────────
TEST_DIR    = "testset"               # contains subfolders "Dog" and "Person"
CLASS_FOLDERS = ["Dog", "Person"]
CLASS_LABELS  = ["dog", "person"]     # text prompts for zero-shot
PDF_OUT     = "results_zeroshot.pdf"  # output multi-page PDF
# ─────────────────────────────────────────────────────────────────────────────

# 4) SETUP DEVICE, MODEL & TOKENIZER
device    = "cuda" if torch.cuda.is_available() else "cpu"
model     = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 5) PRECOMPUTE TEXT FEATURES
text_inputs = processor(
    text=[f"a photo of a {lbl}" for lbl in CLASS_LABELS],
    return_tensors="pt",
    padding=True
).to(device)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# 6) RUN ZERO-SHOT ON EVERY TEST IMAGE AND WRITE PDF
with PdfPages(PDF_OUT) as pdf:
    for actual in CLASS_FOLDERS:
        folder = os.path.join(TEST_DIR, actual)
        for fn in sorted(os.listdir(folder)):
            if not fn.lower().endswith((".jpg", ".jpeg", ".png")):
                continue

            # load & preprocess image
            img_path = os.path.join(folder, fn)
            img_pil  = Image.open(img_path).convert("RGB")
            inputs   = processor(images=img_pil, return_tensors="pt").to(device)

            # encode & normalize image
            with torch.no_grad():
                img_feats = model.get_image_features(**inputs)
                img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)

            # similarity & prediction
            # scale the cosine similarities by CLIP's learned temperature so the
            # softmax yields meaningful probabilities (the argmax is unaffected)
            logits   = (img_feats @ text_feats.T) * model.logit_scale.exp().item()  # (1,2)
            probs    = logits.softmax(dim=-1)[0]
            pred_idx = int(probs.argmax())
            predicted = CLASS_LABELS[pred_idx]

            # plot & save a page
            fig, ax = plt.subplots(figsize=(6,6))
            ax.imshow(img_pil)
            ax.axis("off")
            ax.set_title(f"Actual: {actual}    Predicted: {predicted}", fontsize=14)
            pdf.savefig(fig, bbox_inches="tight")
            plt.close(fig)

print(f"βœ… Done β€” wrote zero-shot results to {PDF_OUT}")

📸 Annotated Predictions

Each test image in the PDF is annotated with:

  • Actual class (based on the folder name)
  • Predicted class (from CLIP)

All annotated pages are saved in results_zeroshot.pdf.


✅ Summary Table

Feature        Description
Model          CLIP (ViT-B/32) via Hugging Face
Task           Zero-shot classification (no training images used)
Prompt Style   "a photo of a [class]"
Classes        Dog, Person
Output         PDF report of all predictions

🧠 Key Takeaways

  • CLIP generalizes from language to vision
  • You don't need to retrain the model; just supply text labels (see the sketch below)
  • This allows building powerful classifiers from natural language, useful in:
    • Low-data environments
    • Fast prototyping
    • Multi-modal systems
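
To illustrate that second point, here is a small self-contained sketch under the same assumptions as the full code above; the "cat" class is a hypothetical addition, not part of the original project. Adding a class means adding one prompt string and recomputing the text features; nothing is retrained.

import torch
from transformers import CLIPModel, CLIPProcessor

device    = "cuda" if torch.cuda.is_available() else "cpu"
model     = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# extending the classifier = editing this list of strings
labels = ["dog", "person", "cat"]   # "cat" is the hypothetical new class

text_inputs = processor(
    text=[f"a photo of a {lbl}" for lbl in labels],
    return_tensors="pt",
    padding=True,
).to(device)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

print(text_feats.shape)   # torch.Size([3, 512]): one embedding per class prompt

The image-scoring loop from step 6 of the full code can then be reused unchanged; every test image is simply scored against three prompts instead of two.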