# Zero-Shot Image Classification with CLIP
Module 8: Deep Learning in Computer Vision
## What Is Zero-Shot Classification?
Zero-shot classification refers to the ability of a model to classify new categories it has never seen during training, simply by being given text descriptions of those classes.
Example: Without seeing a single labeled image of a dog, the model can still classify an image correctly just by being told to match it with "a photo of a dog".
This is made possible by models like CLIP (Contrastive Language-Image Pretraining) from OpenAI, which learns to associate images and text in a shared embedding space.
## How CLIP Works
- CLIP is trained on 400 million image-text pairs from the internet.
- It learns to match images to natural-language phrases.
- At inference time, we give it:
  - An image
  - A list of text prompts (e.g., "a photo of a dog", "a photo of a person")
- CLIP then chooses the text prompt that best matches the image, with no further training needed (see the minimal sketch below).
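
To make this concrete, here is a minimal sketch of the matching step, assuming a single local image file (the name `example.jpg` is just a placeholder); the full project code appears further below.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a dog", "a photo of a person"]
image = Image.open("example.jpg").convert("RGB")   # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax turns them
# into a distribution over the prompts, and argmax picks the best match
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print(prompts[int(probs.argmax())])
```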
## Project: Dog vs Person Zero-Shot Classification
We built a simple zero-shot classifier using CLIP to distinguish between Dog and Person, without providing a single training image.
- Text prompts: `"a photo of a dog"`, `"a photo of a person"`
- Test directory layout:

  ```text
  testset/
  ├── Dog/
  └── Person/
  ```
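
As a quick sanity check (a small sketch, assuming the `testset/` layout shown above), you can count the images that will be picked up per class:

```python
import os

TEST_DIR = "testset"
for cls in ("Dog", "Person"):
    folder = os.path.join(TEST_DIR, cls)
    n = sum(1 for f in os.listdir(folder)
            if f.lower().endswith((".jpg", ".jpeg", ".png")))
    print(f"{cls}: {n} images")
```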
## Full Code
Below is the full code used for this project. It:
- Loads CLIP from Hugging Face
- Encodes the images and text prompts
- Runs classification on all test images
- Saves the output predictions in a PDF report
```python
# ─────────────────────────────────────────────────────────────────────────────
# Zero-Shot Classification with CLIP (robust HF Hub settings)
# ─────────────────────────────────────────────────────────────────────────────

# 1) INSTALL DEPENDENCIES (only needs to run once)
!pip install torch torchvision transformers pillow matplotlib --quiet

# 2) TUNE HF-HUB ENV VARS TO AVOID TIMEOUTS / SYMLINK MESSAGES
import os
os.environ["HF_HUB_DOWNLOAD_RETRIES"] = "10"          # try up to 10 times
os.environ["HF_HUB_REQUEST_TIMEOUT"] = "60"           # 60 s per request
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

# 3) IMPORT EVERYTHING ELSE
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# ── CONFIG ───────────────────────────────────────────────────────────────────
TEST_DIR      = "testset"               # contains subfolders "Dog" and "Person"
CLASS_FOLDERS = ["Dog", "Person"]
CLASS_LABELS  = ["dog", "person"]       # text prompts for zero-shot
PDF_OUT       = "results_zeroshot.pdf"  # output multi-page PDF
# ─────────────────────────────────────────────────────────────────────────────

# 4) SET UP DEVICE, MODEL & PROCESSOR
device    = "cuda" if torch.cuda.is_available() else "cpu"
model     = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 5) PRECOMPUTE TEXT FEATURES (one L2-normalized embedding per prompt)
text_inputs = processor(
    text=[f"a photo of a {lbl}" for lbl in CLASS_LABELS],
    return_tensors="pt",
    padding=True
).to(device)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# 6) RUN ZERO-SHOT ON EVERY TEST IMAGE AND WRITE PDF
with PdfPages(PDF_OUT) as pdf:
    for actual in CLASS_FOLDERS:
        folder = os.path.join(TEST_DIR, actual)
        for fn in sorted(os.listdir(folder)):
            if not fn.lower().endswith((".jpg", ".jpeg", ".png")):
                continue

            # load & preprocess image
            img_path = os.path.join(folder, fn)
            img_pil  = Image.open(img_path).convert("RGB")
            inputs   = processor(images=img_pil, return_tensors="pt").to(device)

            # encode & normalize image
            with torch.no_grad():
                img_feats = model.get_image_features(**inputs)
            img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)

            # cosine similarity & prediction
            # NB: these are raw cosine similarities (CLIP's learned logit scale
            # is omitted), so the softmax values are not calibrated
            # probabilities, but the argmax prediction is unaffected
            logits    = img_feats @ text_feats.T      # shape (1, 2)
            probs     = logits.softmax(dim=-1)[0]
            pred_idx  = int(probs.argmax())
            predicted = CLASS_LABELS[pred_idx]

            # plot & save one PDF page per image
            fig, ax = plt.subplots(figsize=(6, 6))
            ax.imshow(img_pil)
            ax.axis("off")
            ax.set_title(f"Actual: {actual}   Predicted: {predicted}", fontsize=14)
            pdf.savefig(fig, bbox_inches="tight")
            plt.close(fig)

print(f"Done. Wrote zero-shot results to {PDF_OUT}")
```
## Annotated Predictions
Each test image page in the report shows:
- the actual class (based on the folder name)
- the class predicted by CLIP

All pages are saved in `results_zeroshot.pdf`.
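
The report itself does not aggregate the results. If you also want an overall accuracy figure, a small extension along these lines would work (a sketch, not part of the original script; it reuses `model`, `processor`, `device`, `text_feats`, `TEST_DIR`, `CLASS_FOLDERS` and `CLASS_LABELS` from the full code above):

```python
# Sketch: tally zero-shot accuracy per class, reusing objects defined above
import os
import torch
from PIL import Image

correct, total = 0, 0
for actual in CLASS_FOLDERS:
    folder = os.path.join(TEST_DIR, actual)
    for fn in sorted(os.listdir(folder)):
        if not fn.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        img = Image.open(os.path.join(folder, fn)).convert("RGB")
        inputs = processor(images=img, return_tensors="pt").to(device)
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        pred = CLASS_LABELS[int((feats @ text_feats.T).argmax())]
        correct += int(pred.lower() == actual.lower())
        total += 1

print(f"Zero-shot accuracy: {correct}/{total}")
```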
## Summary Table
| Feature      | Description                                         |
|--------------|-----------------------------------------------------|
| Model        | CLIP (ViT-B/32) via Hugging Face                    |
| Task         | Zero-shot classification (no training images used)  |
| Prompt style | `"a photo of a [class]"`                            |
| Classes      | Dog, Person                                         |
| Output       | PDF report of all predictions                       |
## Key Takeaways
- CLIP generalizes from language to vision.
- You don't need to retrain the model; you just supply text labels (see the sketch after this list).
- This lets you build powerful classifiers from natural language alone, which is useful in:
  - Low-data environments
  - Fast prototyping
  - Multi-modal systems
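
For example, re-targeting the classifier to a completely different set of classes only means recomputing the text features; the class names below are purely illustrative:

```python
# Sketch: swap in new classes without touching the model weights
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

new_labels = ["cat", "bicycle", "traffic light"]       # illustrative classes
prompts = [f"a photo of a {lbl}" for lbl in new_labels]

text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_feats = model.get_text_features(**text_inputs)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

# text_feats can now be compared against any image embedding exactly as in
# the full code above; the model itself is untouched.
```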