google-vit-base-patch16-224

Overview

The Vision Transformer (ViT) model, introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al., was pre-trained on ImageNet-21k at a resolution of 224x224. It was then fine-tuned on ImageNet 2012 (1 million images, 1,000 classes), also at a resolution of 224x224. The model was first released in this repository; the weights used here, however, were taken from the timm repository by Ross Wightman, who had already converted them from JAX to PyTorch.

An image is treated as a sequence of fixed-size 16x16 patches and processed by a standard Transformer encoder of the kind used in NLP. The patches are linearly embedded, and a [CLS] token is added at the beginning of the sequence for classification tasks. Absolute position embeddings are also added before the sequence is fed to the Transformer encoder. Through pre-training, the model learns an inner representation of images that can be used to extract features for downstream tasks. For instance, if a dataset of labeled images is available, a linear layer can be placed on top of the pre-trained encoder to train a standard classifier.
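The patch-to-sequence step described above can be sketched with NumPy. This is a minimal illustration, not the model's actual implementation: the projection matrix, [CLS] vector, and position embeddings are random or zero stand-ins for learned parameters, and the embedding width `HIDDEN = 768` is an assumption (the usual ViT-Base value) that the text above does not state.

```python
import numpy as np

PATCH = 16    # patch side length, taken from the model name (patch16)
IMAGE = 224   # input resolution, taken from the model name (224)
HIDDEN = 768  # embedding width: assumed ViT-Base value, not stated in the text


def image_to_patches(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) array."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


def build_sequence(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Build the encoder input: [CLS] + embedded patches + position embeddings."""
    patches = image_to_patches(image)
    # Stand-in for the learned linear patch embedding.
    projection = rng.normal(size=(patches.shape[1], HIDDEN)) * 0.02
    tokens = patches @ projection
    # Stand-in for the learned [CLS] token, placed at the start of the sequence.
    cls_token = np.zeros((1, HIDDEN))
    sequence = np.concatenate([cls_token, tokens], axis=0)
    # Stand-in for the learned absolute position embeddings.
    positions = rng.normal(size=sequence.shape) * 0.02
    return sequence + positions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((IMAGE, IMAGE, 3))
    print(image_to_patches(image).shape)     # (196, 768)
    print(build_sequence(image, rng).shape)  # (197, 768)
```

A 224x224 image tiles into 14 x 14 = 196 patches, so the encoder sees 197 tokens once the [CLS] token is prepended.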

Training Details

Training Data

The ViT model was pre-trained on the ImageNet-21k dataset at a resolution of 224x224 and fine-tuned on ImageNet 2012, which consists of 1 million images and 1,000 classes.

Training Procedure

In the preprocessing step, images are resized to a common resolution of 224x224 and then normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
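The normalization step can be sketched as follows. The sketch assumes the image has already been resized to 224x224 and that pixel values are rescaled from [0, 255] to [0, 1] before the mean and standard deviation are applied. The text does not state the rescaling, so treat it as an assumption.

```python
import numpy as np

# Per-channel statistics quoted in the text above.
MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)


def normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Normalize an (H, W, 3) uint8 RGB image channel-wise.

    Assumption: pixels are first rescaled to [0, 1]; with mean 0.5 and
    std 0.5 the result then lies in [-1, 1].
    """
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - MEAN) / STD


if __name__ == "__main__":
    black = np.zeros((224, 224, 3), dtype=np.uint8)
    white = np.full((224, 224, 3), 255, dtype=np.uint8)
    print(normalize(black).min(), normalize(white).max())  # -1.0 1.0
```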

The model was trained on TPUv3 hardware (8 cores). All models were trained using Adam with β1 = 0.9 and β2 = 0.999, a batch size of 4096, a high weight decay of 0.1, and a learning rate warmup of 10k steps. The authors found it beneficial to additionally apply gradient clipping at global norm 1. The training resolution is 224. For more details on hyperparameters, refer to Table 3 of the original paper.
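Two of the techniques named above, learning rate warmup and gradient clipping at a global norm, can be illustrated in plain Python. This is a sketch under stated assumptions rather than the authors' training code: the warmup is assumed to be linear, the schedule after warmup is not modeled (the rate is simply held constant), and `BASE_LR` is a hypothetical placeholder value.

```python
import math

WARMUP_STEPS = 10_000  # from the text: warmup of 10k steps
CLIP_NORM = 1.0        # from the text: gradient clipping at global norm 1
BASE_LR = 1e-3         # hypothetical peak learning rate, for illustration only


def warmup_lr(step: int, base_lr: float = BASE_LR,
              warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to base_lr (assumed shape); constant afterwards."""
    return base_lr * min(1.0, step / warmup_steps)


def clip_by_global_norm(grads, max_norm: float = CLIP_NORM):
    """Rescale all gradients together so their joint L2 norm is at most max_norm.

    grads is a list of flat lists of floats, one per parameter tensor.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in tensor] for tensor in grads], global_norm


if __name__ == "__main__":
    print(warmup_lr(5_000))  # half of BASE_LR
    clipped, norm = clip_by_global_norm([[3.0], [4.0]])
    print(norm, clipped)     # norm before clipping is 5.0
```

Clipping by the global norm scales every gradient by the same factor, so the update direction is preserved while its magnitude is bounded.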

For more details on supervised pre-training (ImageNet-21k) followed by supervised fine-tuning (ImageNet-1k), refer to Sections 3 and 4 of the original paper.

Evaluation Results

For ViT image classification benchmark results, refer to Table 2 and Table 5 of the original paper.

License

apache-2.0

Inference Samples

Inference type	Python sample (Notebook)	CLI with YAML
Real time	image-classification-online-endpoint.ipynb	image-classification-online-endpoint.sh
Batch	image-classification-batch-endpoint.ipynb	image-classification-batch-endpoint.sh

Finetuning Samples

Task	Use case	Dataset	Python sample (Notebook)	CLI with YAML
Image Multi-class classification	Image Multi-class classification	fridgeObjects	fridgeobjects-multiclass-classification.ipynb	fridgeobjects-multiclass-classification.sh
Image Multi-label classification	Image Multi-label classification	multilabel fridgeObjects	fridgeobjects-multilabel-classification.ipynb	fridgeobjects-multilabel-classification.sh

Evaluation Samples

Task	Use case	Dataset	Python sample (Notebook)
Image Multi-class classification	Image Multi-class classification	fridgeObjects	image-multiclass-classification.ipynb
Image Multi-label classification	Image Multi-label classification	multilabel fridgeObjects	image-multilabel-classification.ipynb

Sample input and output

Sample input

{
  "input_data": ["image1", "image2"]
}

Note: the "image1" and "image2" strings should be base64-encoded images or publicly accessible URLs.
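A request body in the shape shown above could be assembled with the Python standard library, for example as below. The helper names `encode_image_bytes` and `build_request` are hypothetical, and how the payload is sent to an endpoint is left out because the text does not specify it.

```python
import base64
import json


def encode_image_bytes(raw: bytes) -> str:
    """Return the base64 text form of raw image file bytes."""
    return base64.b64encode(raw).decode("utf-8")


def build_request(images) -> str:
    """Build the JSON request body.

    Each item in images is either bytes (raw image file content, which is
    base64-encoded here) or a str (assumed to be a publicly accessible URL
    or an already-encoded base64 string, passed through unchanged).
    """
    items = []
    for image in images:
        if isinstance(image, bytes):
            items.append(encode_image_bytes(image))
        else:
            items.append(image)
    return json.dumps({"input_data": items})


if __name__ == "__main__":
    # Placeholder bytes and a placeholder URL, for illustration only.
    print(build_request([b"abc", "https://example.com/cat.jpg"]))
```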

Sample output

[
  [
    {
      "label" : "can",
      "score" : 0.91
    },
    {
      "label" : "carton",
      "score" : 0.09
    }
  ],
  [
    {
      "label" : "carton",
      "score" : 0.9
    },
    {
      "label" : "can",
      "score" : 0.1
    }
  ]
]
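The response holds one list of label/score pairs per input image. A minimal sketch of picking the highest-scoring label for each image is shown below. The sample values are re-typed from the output above as a compact JSON string, and the function name `top_predictions` is hypothetical.

```python
import json

# Sample response values from this page, as a compact JSON string.
SAMPLE_RESPONSE = (
    '[[{"label": "can", "score": 0.91}, {"label": "carton", "score": 0.09}],'
    ' [{"label": "carton", "score": 0.9}, {"label": "can", "score": 0.1}]]'
)


def top_predictions(response_text: str):
    """Return the highest-scoring {label, score} entry for each input image."""
    results = json.loads(response_text)
    return [max(candidates, key=lambda c: c["score"]) for candidates in results]


if __name__ == "__main__":
    for index, best in enumerate(top_predictions(SAMPLE_RESPONSE)):
        print(index, best["label"], best["score"])
```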

Visualization of inference result for a sample image

Version: 18

Tags

model_specific_defaults: {'apply_deepspeed': 'true', 'apply_ort': 'true'}

training_dataset: imagenet-1k, imagenet-21k

View in Studio: https://ml.azure.com/registries/azureml/models/google-vit-base-patch16-224/version/18

Properties

SharedComputeCapacityEnabled: True

SHA: 2ddc9d4e473d7ba52128f0df4723e478fa14fb80

finetuning-tasks: image-classification

finetune-min-sku-spec: 4|1|28|176

finetune-recommended-sku: Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

evaluation-min-sku-spec: 4|1|28|176

evaluation-recommended-sku: Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2

inference-min-sku-spec: 2|0|14|28

inference-recommended-sku: Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
