Computer Vision with Hugging Face Transformers - ua-datalab/Generative-AI GitHub Wiki

Overview of Computer Vision and Hugging Face Transformers

(Credit: Google DeepMind. Unsplash.com)

[!tip] :computer: Please see this slide presentation

1. Introduction to Computer Vision

Computer Vision (CV) is a field of artificial intelligence that enables machines to interpret and make decisions based on visual data, such as images and videos. The goal is to simulate the way humans see and understand the world. CV powers applications like facial recognition, object detection, and image classification.

2. Main Computer Vision Tasks

- Image Classification: Assigning a label or category to a whole image (e.g., deciding whether an image contains a cat or a dog). (See HF Tutorial).
- Object Detection: Identifying and localizing objects within an image, typically with a bounding box and a class label for each detected object. (See HF Tutorial).
- Semantic Segmentation: Classifying each pixel in an image into a category (e.g., road, sky, or person), without separating individual objects of the same category. (See HF Tutorial).
- Instance Segmentation: Identifying individual instances of objects in an image and providing a pixel-level mask for each one. (See HF Tutorial).
- Image Generation: Creating new images, either by sampling from a model trained on an image dataset or conditioned on a given input (e.g., generating high-resolution images from text descriptions). (See HF Tutorial).
- Face Recognition: Identifying or verifying a person based on facial features in an image. (See HF Tutorial).
- Pose Estimation: Predicting the position and orientation of a person or object in an image, often as a set of keypoints such as body joints. (See HF Tutorial).
- Optical Character Recognition (OCR): Converting text that appears in images into machine-encoded text. (See HF Tutorial).
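
Several of the tasks above correspond to task names accepted by the `pipeline` helper in the `transformers` library. The sketch below is a minimal illustration rather than an official recipe: `TASK_NAMES`, `run_task`, and `top_prediction` are hypothetical names introduced here, only three tasks are mapped, and `run_task` downloads a default model the first time it is called.

```python
# Hypothetical mapping from tasks in the list above to `transformers` pipeline task names.
TASK_NAMES = {
    "Image Classification": "image-classification",
    "Object Detection": "object-detection",
    "Semantic Segmentation": "image-segmentation",
}


def run_task(task, image):
    """Run a vision pipeline on an image given as a file path, URL, or PIL image.

    The import is local so the rest of this sketch can be read and used
    without `transformers` installed.
    """
    from transformers import pipeline

    vision_pipeline = pipeline(TASK_NAMES[task])  # picks a default model for the task
    return vision_pipeline(image)


def top_prediction(predictions):
    """Return the highest-scoring entry from a list of dicts with "label" and "score" keys."""
    return max(predictions, key=lambda prediction: prediction["score"])
```

A call such as `top_prediction(run_task("Image Classification", "cat.jpg"))` would return the most likely label, where `cat.jpg` is a placeholder file name. `top_prediction` is meant for classification and detection results; segmentation results also carry a mask for each entry and may not include a usable score.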

3. Advantages of Hugging Face Transformers for Computer Vision

- Vision Transformers (ViT): Hugging Face supports Vision Transformers, which split an image into fixed-size patches and process the patches as a token sequence with the Transformer architecture. They achieve strong results on tasks like image classification and segmentation.
- Pre-trained Models: Access to pre-trained models that can be fine-tuned for specific CV tasks, reducing the need for extensive computational resources and labeled data.
- Multimodal Applications: Integration of vision and language tasks (e.g., image captioning, visual question answering) using multimodal transformers.
- Ease of Use: User-friendly APIs make it simple to apply complex models to CV tasks without building them from scratch.
- Community Support: Extensive documentation and a large community contribute to a rich ecosystem for developers working on CV tasks.
- Flexibility: Models like CLIP (Contrastive Language–Image Pretraining) enable zero-shot image classification, where a model classifies images into categories described in text without any task-specific training data.
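
The zero-shot idea behind CLIP can be sketched without loading a model: an image embedding is compared with the embeddings of text prompts, one per candidate label, and the most similar prompt wins. The code below is a toy illustration under stated assumptions. The three-dimensional embeddings are made up (real CLIP embeddings come from its image and text encoders and are much larger), and `cosine_similarity`, `zero_shot_classify`, and the `scale` value are hypothetical stand-ins for the model's own similarity scoring.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings, scale=100.0):
    """Turn image-to-prompt similarities into probabilities with a softmax.

    `scale` is an assumed value that sharpens the distribution.
    """
    labels = list(label_embeddings)
    logits = [
        scale * cosine_similarity(image_embedding, label_embeddings[label])
        for label in labels
    ]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return {label: value / total for label, value in zip(labels, exps)}


# Made-up embeddings for illustration only.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

probabilities = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probabilities, key=probabilities.get)
print(best_label)  # prints "a photo of a cat"
```

Because the labels are ordinary text prompts, changing the set of categories only requires changing the dictionary keys and their embeddings, not retraining a classifier.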

4. Learning Resources

Books:
- "Deep Learning for Computer Vision" by Rajalingappaa Shanmugamani.
- "Computer Vision: Algorithms and Applications" by Richard Szeliski.
- "Hands-On Computer Vision with TensorFlow 2" by Benjamin Planche and Eliot Andres.
Papers:
- Papers with Code: Computer Vision
Online Courses:
- Deep Learning Specialization by Andrew Ng on Coursera (includes a module on CV).
- Stanford CS231n: Deep Learning for Computer Vision – A comprehensive course on computer vision.
- Hugging Face Course – Specific chapters on Vision Transformers and CV tasks.
Documentation:
- Hugging Face Vision Documentation – Information on using Vision Transformers with Hugging Face.
- OpenCV Documentation – Extensive documentation on the popular OpenCV library for computer vision.
Tutorials and Blogs:
- Deep Learning for Computer Vision. Run.AI
- Deep Learning For Computer Vision: Essential Models and Practical Real-World Applications. Farooq Alvi. OpenCV.
- Hugging Face Blog - Computer Vision – Articles on the latest advancements in CV with Transformers.
- Overview of Vision Language Models. Aman.AI.

5. General References

Dosovitskiy, A., et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020.
He, K., et al. (2016). Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385.

6. Jupyter Notebook Examples

[!note] :notebook_with_decorative_cover: Read and execute the next Jupyter Notebook example in Google Colab.

This workshop will introduce participants to the core concepts of Computer Vision, the various tasks it involves, and how Hugging Face Transformers can be effectively utilized to advance these tasks.

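import torch
from PIL import Image  # preprocess_image below expects PIL images
from torchvision import datasets, transforms
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Basic notions: computer vision, image preprocessing, image classification
# Advantages of Hugging Face Transformers for CV: pre-trained models, transfer learning
# Trends: vision transformers, object detection, image generation

checkpoint = "google/vit-base-patch16-224-in21k"

# Load a pre-trained model and its image processor.
# ViTFeatureExtractor is deprecated in recent transformers releases;
# AutoImageProcessor loads the matching image processor instead.
image_processor = AutoImageProcessor.from_pretrained(checkpoint)

# This checkpoint ships without a fine-tuned classification head, so a new,
# randomly initialized 10-class head is attached for CIFAR-10.
model = AutoModelForImageClassification.from_pretrained(checkpoint, num_labels=10)

# Load a dataset (e.g., CIFAR10)
# The transform resizes the 32x32 images to 224x224 and scales pixels to [-1, 1].
# This is meant to mirror the image processor's defaults for this checkpoint;
# compare with image_processor.image_mean and image_processor.image_std to confirm.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

trainset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)

# Preprocess raw PIL images that did not go through the torchvision transform above.
# Use one path or the other: applying both would normalize the images twice.
def preprocess_image(image):
    inputs = image_processor(images=image, return_tensors="pt")
    return inputs

# Fine-tune the model
# ... (similar to NLP fine-tuning)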

# Activity: Try different image classification datasets and experiment with data augmentation.
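
The fine-tuning step that the example leaves open could look roughly like the loop below. This is a minimal sketch, not the notebook's actual code. `train_one_epoch` is a hypothetical helper, and it assumes that `model` is a Hugging Face image classification model whose forward call accepts `pixel_values` and `labels` and returns an object with a `loss` attribute, that `loader` yields `(images, labels)` batches such as a torchvision CIFAR-10 `DataLoader`, and that `optimizer` is a PyTorch optimizer.

```python
def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one pass over the data and return the mean training loss."""
    model.train()  # enable training-time behavior such as dropout
    total_loss = 0.0
    num_batches = 0

    for images, labels in loader:
        images = images.to(device)
        labels = labels.to(device)

        # Passing labels asks the model to compute the classification loss itself.
        outputs = model(pixel_values=images, labels=labels)
        loss = outputs.loss

        optimizer.zero_grad()  # clear gradients left over from the previous step
        loss.backward()        # backpropagate through the model
        optimizer.step()       # update the weights

        total_loss += loss.item()
        num_batches += 1

    # Guard against an empty loader.
    return total_loss / max(num_batches, 1)
```

With the objects from the example above, a call might be `train_one_epoch(model, trainloader, torch.optim.AdamW(model.parameters(), lr=5e-5))`, where the learning rate is an assumed starting point rather than a tuned value.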

Created: 08/16/2024 (C. Lizárraga); Last update: 09/12/2024 (C. Lizárraga)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.