# Zero-Shot Image Classification with CLIP
Module 8: Deep Learning in Computer Vision
## What Is Zero-Shot Classification?
Zero-shot classification refers to the ability of a model to classify new categories it has never seen during training, simply by being given text descriptions of those classes.
Example: Without seeing a single labeled image of a dog, the model can still classify an image correctly just by being told to match it with "a photo of a dog".
This is made possible by models like CLIP (Contrastive Language-Image Pretraining) from OpenAI, which learns to associate images and text in a shared embedding space.
## How CLIP Works
- CLIP is trained on 400 million image-text pairs from the internet.
- It learns to match images to natural-language phrases.
- At inference time, we give it:
  - An image
  - A list of text prompts (e.g., "a photo of a dog", "a photo of a person")
- CLIP then chooses the text prompt that best matches the image, with no further training needed (see the minimal sketch below).
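
To make this concrete, here is a minimal sketch of the matching step, assuming a single local image file (the name `example.jpg` is just a placeholder); the full project code appears further below.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a dog", "a photo of a person"]
image = Image.open("example.jpg").convert("RGB")   # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax turns them
# into a distribution over the prompts, and argmax picks the best match
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print(prompts[int(probs.argmax())])
```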
## Project: Dog vs Person Zero-Shot Classification
We built a simple zero-shot classifier using CLIP to distinguish between Dog and Person, without providing a single training image.
- Text prompts: `"a photo of a dog"`, `"a photo of a person"`
- Test directory layout:

  ```text
  testset/
  ├── Dog/
  └── Person/
  ```
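
As a quick sanity check (a small sketch, assuming the `testset/` layout shown above), you can count the images that will be picked up per class:

```python
import os

TEST_DIR = "testset"
for cls in ("Dog", "Person"):
    folder = os.path.join(TEST_DIR, cls)
    n = sum(1 for f in os.listdir(folder)
            if f.lower().endswith((".jpg", ".jpeg", ".png")))
    print(f"{cls}: {n} images")
```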
## Full Code
Below is the full code used for this project. It:
- Loads CLIP from Hugging Face
- Encodes the images and text prompts
- Runs classification on all test images
- Saves the output predictions in a PDF report
```python
# ─────────────────────────────────────────────────────────────────────────────
# Zero-Shot Classification with CLIP (robust HF Hub settings)
# ─────────────────────────────────────────────────────────────────────────────

# 1) INSTALL DEPENDENCIES (only needs to run once)
!pip install torch torchvision transformers pillow matplotlib --quiet

# 2) TUNE HF-HUB ENV VARS TO AVOID TIMEOUTS / SYMLINK MESSAGES
import os
os.environ["HF_HUB_DOWNLOAD_RETRIES"] = "10"          # try up to 10 times
os.environ["HF_HUB_REQUEST_TIMEOUT"] = "60"           # 60 s per request
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

# 3) IMPORT EVERYTHING ELSE
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# ── CONFIG ───────────────────────────────────────────────────────────────────
TEST_DIR      = "testset"               # contains subfolders "Dog" and "Person"
CLASS_FOLDERS = ["Dog", "Person"]
CLASS_LABELS  = ["dog", "person"]       # text prompts for zero-shot
PDF_OUT       = "results_zeroshot.pdf"  # output multi-page PDF
# ─────────────────────────────────────────────────────────────────────────────

# 4) SET UP DEVICE, MODEL & PROCESSOR
device    = "cuda" if torch.cuda.is_available() else "cpu"
model     = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 5) PRECOMPUTE TEXT FEATURES (one L2-normalized embedding per prompt)
text_inputs = processor(
    text=[f"a photo of a {lbl}" for lbl in CLASS_LABELS],
    return_tensors="pt",
    padding=True
).to(device)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# 6) RUN ZERO-SHOT ON EVERY TEST IMAGE AND WRITE PDF
with PdfPages(PDF_OUT) as pdf:
    for actual in CLASS_FOLDERS:
        folder = os.path.join(TEST_DIR, actual)
        for fn in sorted(os.listdir(folder)):
            if not fn.lower().endswith((".jpg", ".jpeg", ".png")):
                continue

            # load & preprocess image
            img_path = os.path.join(folder, fn)
            img_pil  = Image.open(img_path).convert("RGB")
            inputs   = processor(images=img_pil, return_tensors="pt").to(device)

            # encode & normalize image
            with torch.no_grad():
                img_feats = model.get_image_features(**inputs)
            img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)

            # cosine similarity & prediction
            # NB: these are raw cosine similarities (CLIP's learned logit scale
            # is omitted), so the softmax values are not calibrated
            # probabilities, but the argmax prediction is unaffected
            logits    = img_feats @ text_feats.T      # shape (1, 2)
            probs     = logits.softmax(dim=-1)[0]
            pred_idx  = int(probs.argmax())
            predicted = CLASS_LABELS[pred_idx]

            # plot & save one PDF page per image
            fig, ax = plt.subplots(figsize=(6, 6))
            ax.imshow(img_pil)
            ax.axis("off")
            ax.set_title(f"Actual: {actual}   Predicted: {predicted}", fontsize=14)
            pdf.savefig(fig, bbox_inches="tight")
            plt.close(fig)

print(f"Done. Wrote zero-shot results to {PDF_OUT}")
```
## Annotated Predictions
Each test image page in the report shows:
- the actual class (based on the folder name)
- the class predicted by CLIP

All pages are saved in `results_zeroshot.pdf`.
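
The report itself does not aggregate the results. If you also want an overall accuracy figure, a small extension along these lines would work (a sketch, not part of the original script; it reuses `model`, `processor`, `device`, `text_feats`, `TEST_DIR`, `CLASS_FOLDERS` and `CLASS_LABELS` from the full code above):

```python
# Sketch: tally zero-shot accuracy per class, reusing objects defined above
import os
import torch
from PIL import Image

correct, total = 0, 0
for actual in CLASS_FOLDERS:
    folder = os.path.join(TEST_DIR, actual)
    for fn in sorted(os.listdir(folder)):
        if not fn.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        img = Image.open(os.path.join(folder, fn)).convert("RGB")
        inputs = processor(images=img, return_tensors="pt").to(device)
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        pred = CLASS_LABELS[int((feats @ text_feats.T).argmax())]
        correct += int(pred.lower() == actual.lower())
        total += 1

print(f"Zero-shot accuracy: {correct}/{total}")
```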
## Summary Table
| Feature      | Description                                         |
|--------------|-----------------------------------------------------|
| Model        | CLIP (ViT-B/32) via Hugging Face                    |
| Task         | Zero-shot classification (no training images used)  |
| Prompt style | `"a photo of a [class]"`                            |
| Classes      | Dog, Person                                         |
| Output       | PDF report of all predictions                       |
## Key Takeaways
- CLIP generalizes from language to vision.
- You don't need to retrain the model; you just supply text labels (see the sketch after this list).
- This lets you build powerful classifiers from natural language alone, which is useful in:
  - Low-data environments
  - Fast prototyping
  - Multi-modal systems
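
For example, re-targeting the classifier to a completely different set of classes only means recomputing the text features; the class names below are purely illustrative:

```python
# Sketch: swap in new classes without touching the model weights
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

new_labels = ["cat", "bicycle", "traffic light"]       # illustrative classes
prompts = [f"a photo of a {lbl}" for lbl in new_labels]

text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# text_feats can now be compared against any image embedding exactly as in
# the full code above; the model itself is untouched.
```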